Soccer-themed question: null fields?

2006-06-18 Thread JMA

Greetings,
In honor of the World Cup and players who go by a single name,
can someone help me with the following?

1) Is there a way to find a document that has null fields?  
For example, if I have two fields (FIRST_NAME, LAST_NAME) for World Cup players:

FIRST_NAME: Brian    LAST_NAME: McBride
FIRST_NAME: Agustin  LAST_NAME: Delgado
FIRST_NAME: Zinha    LAST_NAME: (null or blank)
FIRST_NAME: Kaka     LAST_NAME: (null or blank)

... and so on

What's the way to find all players that use only their first name?

2) Is there a way to count field terms?  For example, if instead we have one 
field...

NAME: Brian McBride
NAME: Agustin Delgado
NAME: Zinha
NAME: Kaka

Can I answer the same question by finding all documents where the number of
terms in the NAME field is 1 and only 1?  Is there a way to do that?

Thanks in advance,
JMA




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Recency weightage in Lucene

2006-06-18 Thread PrasenjitM
I am thinking of modifying Lucene's current ranking algorithm to include a
recency weight for each document, so that the most recently modified documents
get preference over older ones, which makes sense for news search.

I believe that to do this I have to tinker with the TermScorer.score() method
and calculate the document score in its while (doc < end) {..} loop. This
requires the document's lastModifiedTime to be stored in a field of the doc,
and extracting that value on every iteration over the posting stream could be
quite expensive. One approach could be to store it in a separate file (like
Normalization) to avoid the field lookup.

Any other ideas or suggestions? Has anyone already implemented this?

thanks,
Prasen




[jira] Commented: (LUCENE-605) Make Explanation include information about match/non-match

2006-06-18 Thread paul.elschot (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-605?page=comments#action_12416658 ]

paul.elschot commented on LUCENE-605:
-

I like the Boolean for indicating the match.
The demo-fix.patch applies cleanly on my working copy, and all tests pass
with it.
I'll keep the patch in my working copy for now.

Regards,
Paul Elschot


> Make Explanation include information about match/non-match
> --
>
>  Key: LUCENE-605
>  URL: http://issues.apache.org/jira/browse/LUCENE-605
>  Project: Lucene - Java
> Type: Improvement

>   Components: Search
> Reporter: Hoss Man
> Assignee: Hoss Man
>  Attachments: demo-fix.patch
>
> As discussed, I'm looking into the possibility of improving the Explanation 
> class to include some basic info about the "match" status of the Explanation 
> -- independent of the value...
> http://www.nabble.com/BooleanWeight.normalize%28float%29-doesn%27t-normalize-prohibited-clauses--t1596471.html#a4347644
> This is necessary to deal with things like LUCENE-451

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira





[jira] Commented: (LUCENE-605) Make Explanation include information about match/non-match

2006-06-18 Thread paul.elschot (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-605?page=comments#action_12416660 ]

paul.elschot commented on LUCENE-605:
-

I tried removing the Explanation constructor that is deprecated in the
demo-fix.patch.
One of the uses of this constructor is in the (patched) BooleanQuery, from
line 317 on, and I fixed it like this (under ASL 2):

  sumExpl.setMatch(Boolean.TRUE);
  sumExpl.setValue(sum);

  float coordFactor = similarity.coord(coord, maxCoord);
  if (coordFactor != 1.0f) { // coordination has effect
    sumExpl.setValue(sumExpl.getValue() * coordFactor);
    sumExpl.setDescription(sumExpl.getDescription() + " * " + coordFactor
        + "=coord(" + coord + "/" + maxCoord + ")");
  }
  return sumExpl;

The point is that by adding a match indicator to Explanation, Explanation
becomes less useful for explaining a subformula of a (matching) score value,
in this case the coordination factor.
The fix is to fold the subformula into both the description and the value of
the explanation.
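To make the effect of that fix concrete, here is a minimal standalone sketch
in plain Java. It does not use the Lucene classes: Expl is a hypothetical
stand-in for Explanation, and coord() assumes the default similarity returns
overlap / maxOverlap, which is how I remember it but is worth checking against
your Lucene version.

```java
class CoordExplanationSketch {
    // Hypothetical stand-in for Lucene's Explanation: a value plus a description.
    static final class Expl {
        float value;
        String description;

        Expl(float value, String description) {
            this.value = value;
            this.description = description;
        }
    }

    // Fraction of query clauses that matched (assumed default behaviour).
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    // Fold the coordination factor into both the value and the description,
    // instead of wrapping it in a separate sub-explanation.
    static Expl applyCoord(Expl sumExpl, int coord, int maxCoord) {
        float coordFactor = coord(coord, maxCoord);
        if (coordFactor != 1.0f) { // coordination has effect
            sumExpl.value = sumExpl.value * coordFactor;
            sumExpl.description = sumExpl.description + " * " + coordFactor
                + "=coord(" + coord + "/" + maxCoord + ")";
        }
        return sumExpl;
    }

    public static void main(String[] args) {
        Expl e = applyCoord(new Expl(3.0f, "sum of:"), 2, 4);
        System.out.println(e.value + " = " + e.description);
        // prints: 1.5 = sum of: * 0.5=coord(2/4)
    }
}
```

Note that the value and the description now change together, so the explained
score stays consistent with the text that describes it.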

Btw., in the existing code for the coordination factor, the actual explained
score value was not changed by setValue().
This is probably a bug in BooleanQuery.explain().

There seems to be no test for the explanation descriptions, and I did not look
at the description actually returned by getDescription() on the resulting
Explanation in this case.









Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-18 Thread Vic Bancroft

Robert Engels wrote:

> Do you have any hard numbers to support this? The last time I checked, gcj
> had minimal improvement over JVM 1.5.

In terms of speed, there is not much difference between native code and
classes (see the sample timings). However, the practical availability of a
Java 5 environment for even somewhat _exotic_ platforms is sadly
limited. My current environment is Linux on a dual-core x86_64.

One can only ride a jrocket into 1.5 land and still address 64 bits of
goodness!


more,
l8r,
v

BTW, given a native compile and link,

   [EMAIL PROTECTED] lucene-415145]$ ldd  build/indexFiles
   libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003f0040)
   libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003efec0)
   libgcj.so.7 => /usr/lib64/libgcj.so.7 (0x2aac2000)
   libm.so.6 => /lib64/libm.so.6 (0x003ef910)
   libpthread.so.0 => /lib64/libpthread.so.0 (0x003efa50)
   libz.so.1 => /usr/lib64/libz.so.1 (0x003ef950)
   libdl.so.2 => /lib64/libdl.so.2 (0x003ef930)
   libc.so.6 => /lib64/libc.so.6 (0x003ef8e0)
   /lib64/ld-linux-x86-64.so.2 (0x003ef8c0)

The native indexing,

[EMAIL PROTECTED] lucene-415145]$ time build/indexFiles . 2>&1 > /dev/null

real    0m22.932s
user    0m16.581s
sys     0m6.224s

The virtual machine indexing,

[EMAIL PROTECTED] lucene-415145]$ time java -d64 -Xmx8192m -cp \
    build/lucene-demos-2.0-rc1-dev.jar:build/lucene-core-2.0-rc1-dev.jar \
    org.apache.lucene.demo.IndexFiles . 2>&1 > /dev/null

real    0m23.224s
user    0m33.238s
sys     0m5.184s
 

Side note: JRockit seems to start using both processors about 1/3 of
the way through, whereas gcj doesn't . . .


--
"The future is here. It's just not evenly distributed yet."
-- William Gibson, quoted by Whitfield Diffie





Re: Soccer-themed question: null fields?

2006-06-18 Thread Chuck Williams

JMA wrote on 06/17/2006 10:16 PM:
> 1) Is there a way to find a document that has null fields?  
> For example, if I have two fields (FIRST_NAME, LAST_NAME) for World Cup 
> players:
>
> FIRST_NAME: Brian LAST_NAME: McBride
> FIRST_NAME: Agustin   LAST_NAME: Delgado
> FIRST_NAME: Zinha LAST_NAME: (null or blank)
> FIRST_NAME: Kaka  LAST_NAME: (null or blank)
>
> ... and so on
>
> What's the way to find all players that use only their first name?
>   

By far the best way is to store a special token into null fields and
then just match on this.
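
A minimal sketch of the special-token idea, in plain Java rather than the
Lucene API. The token value _NULL_ and the helper names are hypothetical, and
the loop in firstNameOnly() only stands in for what would be a term query on
LAST_NAME:_NULL_ against a real index.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NullTokenSketch {
    // Hypothetical sentinel; pick a value that can never be a real name.
    static final String NULL_TOKEN = "_NULL_";

    // Build the fields for one player, replacing a missing or blank last
    // name with the sentinel so it can be matched like any other term.
    static Map<String, String> toDocument(String firstName, String lastName) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("FIRST_NAME", firstName);
        boolean missing = lastName == null || lastName.trim().isEmpty();
        doc.put("LAST_NAME", missing ? NULL_TOKEN : lastName);
        return doc;
    }

    // Stand-in for a term query on LAST_NAME:_NULL_.
    static List<String> firstNameOnly(List<Map<String, String>> docs) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (NULL_TOKEN.equals(doc.get("LAST_NAME"))) {
                hits.add(doc.get("FIRST_NAME"));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(toDocument("Brian", "McBride"));
        docs.add(toDocument("Agustin", "Delgado"));
        docs.add(toDocument("Zinha", null));
        docs.add(toDocument("Kaka", ""));
        System.out.println(firstNameOnly(docs)); // prints [Zinha, Kaka]
    }
}
```

The substitution has to happen at index time, so existing documents would
need to be reindexed before the token can be matched.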

One less-performant alternative if you have no control over the index is
to enable prefix wildcard queries and then write a query like this:

FIRST_NAME:* -LAST_NAME:*

To enable prefix wildcard queries, you need to regenerate
QueryParser.java from QueryParser.jj after replacing the wildcard
production (search for OG, as Otis has nicely included the appropriate
production as a comment).

> 2) Is there a way to count field terms?  For example, if instead we have one 
> field...
>
> NAME: Brian McBride
> NAME: Agustin Delgado
> NAME: Zinha
> NAME: Kaka
>
> Can I answer the same question by finding all documents where the number of 
> terms
> in the NAME field is 1 and only 1?  Is there a way to do that?
>   

You would need to write your own Query subclass, and I can't think of
any way to achieve this that would not be very slow.  Not recommended.
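
If you do control indexing, one alternative that avoids a custom Query (a
different technique from the Query subclass mentioned above) is to count the
terms at index time and index the count in a field of its own. A rough sketch
in plain Java follows; the field name NAME_TERM_COUNT is hypothetical, and the
whitespace split is an assumption that only holds if your analyzer tokenizes
on whitespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TermCountSketch {
    // Count whitespace-separated terms; assumes a whitespace-style analyzer.
    static int countTerms(String value) {
        String trimmed = value == null ? "" : value.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Build the fields for one document, adding the hypothetical
    // NAME_TERM_COUNT field next to NAME.
    static Map<String, String> toDocument(String name) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("NAME", name);
        doc.put("NAME_TERM_COUNT", Integer.toString(countTerms(name)));
        return doc;
    }

    public static void main(String[] args) {
        String[] names = {"Brian McBride", "Agustin Delgado", "Zinha", "Kaka"};
        for (String name : names) {
            System.out.println(toDocument(name));
        }
    }
}
```

A plain term query on NAME_TERM_COUNT:1 would then find the single-name
players without any per-document counting at search time.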

Chuck





Re: Recency weightage in Lucene

2006-06-18 Thread Chuck Williams


[EMAIL PROTECTED] wrote on 06/17/2006 10:52 PM:
> I am thinking of modifying lucene's current ranking algorithm to include the 
> document's recency-weightage. So that the latest modified documents gets 
> preference over earlier modified documents, which makes sense for news 
> search. 
>
> (I believe) To do this I have to tinker with TermScorer.score() method, and 
> calculate document-score  in its while (doc < end) {..} loop. The requirement 
> is that document's lastModifiedTime is stored in the doc's field, and 
> extracting this value could be quite expensive for every iteration in its 
> posting stream. One approach could be to store it in a separate file (like 
> Normalization) to avoid field-lookup. 
>
> Any other ideas/suggestions.. Or if anyone has already implemented this ? 
>   

Does recency correlate with the order in which documents are added to
your index?  If so, then perhaps you can use doc-id as a measure of
recency and thereby avoid accessing a stored field.  I'm not certain,
but based on a quick perusal of the relevant code, it appears that both
index opening and segment merging preserve the order of doc-ids.  If you
take this approach, you should verify.

If you end up needing a stored field, then be sure to use the lazy fields
capability (recently committed) to access it.

Chuck





RE: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-18 Thread Robert Engels
Are you sure about the JVM numbers? I would think that user + sys must
always be < real (unless maybe the multiprocessor affects this, i.e. the
processor time used on each core is summed).
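
A quick arithmetic check on the quoted timings, assuming (as is usual for
`time` on Linux) that user and sys are CPU seconds summed over all cores while
real is wall-clock time. The class and method names are just for
illustration.

```java
class CpuVsWallSketch {
    // Ratio of CPU time (user + sys) to wall-clock time (real).
    // A value above 1.0 means more than one core was busy on average.
    static double cpuToWallRatio(double realSec, double userSec, double sysSec) {
        return (userSec + sysSec) / realSec;
    }

    public static void main(String[] args) {
        // Timings quoted in this thread.
        double gcj = cpuToWallRatio(22.932, 16.581, 6.224);
        double jvm = cpuToWallRatio(23.224, 33.238, 5.184);
        System.out.println("gcj cpu/wall ratio: " + gcj);
        System.out.println("jvm cpu/wall ratio: " + jvm);
    }
}
```

The ratios come out at about 0.99 for the gcj binary and about 1.65 for the
JVM run, which is consistent with the JVM keeping the second core busy for
part of the run on the dual-core machine.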

-----Original Message-----
From: Vic Bancroft [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 18, 2006 11:55 AM
To: [EMAIL PROTECTED]
Cc: java-dev@lucene.apache.org
Subject: Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

Robert Engels wrote:

>Do you have any hard numbers to support this? The last time I checked, 
>gcj had minimal improvement over JVM 1.5.
>  
>
In terms of speed, there is not much difference between native code and
classes (see the sample timings). However, the practical availability of a
Java 5 environment for even somewhat _exotic_ platforms is sadly limited. My
current environment is Linux on a dual-core x86_64.

One can only ride a jrocket into 1.5 land and still address 64 bits of
goodness!

more,
l8r,
v

BTW, given a native compile and link,

[EMAIL PROTECTED] lucene-415145]$ ldd  build/indexFiles
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003f0040)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003efec0)
libgcj.so.7 => /usr/lib64/libgcj.so.7 (0x2aac2000)
libm.so.6 => /lib64/libm.so.6 (0x003ef910)
libpthread.so.0 => /lib64/libpthread.so.0 (0x003efa50)
libz.so.1 => /usr/lib64/libz.so.1 (0x003ef950)
libdl.so.2 => /lib64/libdl.so.2 (0x003ef930)
libc.so.6 => /lib64/libc.so.6 (0x003ef8e0)
/lib64/ld-linux-x86-64.so.2 (0x003ef8c0)

The native indexing,

[EMAIL PROTECTED] lucene-415145]$ time build/indexFiles . 2>&1 > /dev/null

real    0m22.932s
user    0m16.581s
sys     0m6.224s

The virtual machine indexing,

[EMAIL PROTECTED] lucene-415145]$ time java -d64 -Xmx8192m -cp \
    build/lucene-demos-2.0-rc1-dev.jar:build/lucene-core-2.0-rc1-dev.jar \
    org.apache.lucene.demo.IndexFiles . 2>&1 > /dev/null

real    0m23.224s
user    0m33.238s
sys     0m5.184s
  

Side note: JRockit seems to start using both processors about 1/3 of the
way through, whereas gcj doesn't . . .

--
"The future is here. It's just not evenly distributed yet."
 -- William Gibson, quoted by Whitfield Diffie





Re: Recency weightage in Lucene

2006-06-18 Thread prasenjitm
Using the doc-id itself as a recency metric is smart thinking. But the weight
is actually a sigmoidal function of the document's age (i.e.
currentTime - documentIndexingTime), so I can't just use the doc-id itself.
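
For illustration, a sigmoidal weight of this kind could look roughly like the
following plain-Java sketch. The logistic form, the midpointMillis and
steepness parameters, and the class name are all assumptions made for the
example, not the actual function in use.

```java
class RecencyWeightSketch {
    // Logistic weight in (0, 1): close to 1 for fresh documents, exactly 0.5
    // when the age equals midpointMillis, approaching 0 for old documents.
    // midpointMillis (> 0) and steepness are hypothetical tuning knobs.
    static double weight(long currentTime, long documentIndexingTime,
                         double midpointMillis, double steepness) {
        double age = (double) (currentTime - documentIndexingTime);
        double x = steepness * (age - midpointMillis) / midpointMillis;
        return 1.0 / (1.0 + Math.exp(x));
    }

    public static void main(String[] args) {
        long dayMillis = 24L * 60 * 60 * 1000;
        long now = 100 * dayMillis;
        double midpoint = 7.0 * dayMillis; // weight drops to 0.5 after a week
        for (int ageDays : new int[] {0, 3, 7, 14, 30}) {
            long indexedAt = now - ageDays * dayMillis;
            System.out.println(ageDays + " days old: "
                + weight(now, indexedAt, midpoint, 4.0));
        }
    }
}
```

Because the weight depends on currentTime, it cannot be baked into the index
once; it has to be computed (or refreshed) at search time.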
 
What is the JIRA bug id for the lazy field capability? I would like to know
more about this feature.
 
thanks for the help,
Prasen
 
 
-----Original Message-----
From: Chuck Williams <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sun, 18 Jun 2006 07:47:40 -1000
Subject: Re: Recency weightage in Lucene




[EMAIL PROTECTED] wrote on 06/17/2006 10:52 PM:
> I am thinking of modifying lucene's current ranking algorithm to include the
> document's recency-weightage. So that the latest modified documents gets
> preference over earlier modified documents, which makes sense for news search.
>
> (I believe) To do this I have to tinker with TermScorer.score() method, and
> calculate document-score  in its while (doc < end) {..} loop. The requirement
> is that document's lastModifiedTime is stored in the doc's field, and
> extracting this value could be quite expensive for every iteration in its
> posting stream. One approach could be to store it in a separate file (like
> Normalization) to avoid field-lookup.
>
> Any other ideas/suggestions.. Or if anyone has already implemented this ? 
>   

Does recency correlate with the order in which documents are added to
your index?  If so, then perhaps you can use doc-id as a measure of
recency and thereby avoid accessing a stored field.  I'm not certain,
but based on a quick perusal of the relevant code, it appears that both
index opening and segment merging preserve the order of doc-ids.  If you
take this approach, you should verify.

If you end up needing a stored field, then be sure to use the lazy fields
capability (recently committed) to access it.

Chuck




RE: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

2006-06-18 Thread Bhoomi Mehta

Is there any specific reason why the PorterStemmer class in
org.apache.lucene.analysis is not public?

Thank you,
Best Regards,

Bhoomi Mehta
Sr. Project Leader
I- Link Infosoft (G) Pvt . Ltd.
Ahmedabad
Email: [EMAIL PROTECTED] 



RE: Lucene.NET Jira Emails?

2006-06-18 Thread Chris Hostetter

: If this is the case, who ever has the karma to fix this, can you take care
: of it?

I think the proper way to deal with this is to file a Jira request with
the Infrastructure Project in the JIRA component, but I'm not 100% sure.

: Also, I can't figure out how to assign, close or even edit a JIRA issue
: opened against Lucene.Net.  For example, take a look at:
: http://issues.apache.org/jira/browse/LUCENENET-6 and I can't see anything
: there to edit this issue.  Yes, I am logged in.

That's the Permission Scheme thing I mentioned -- it seems that members of
the "lucene-developers" Jira Group (the Java Lucene Developers, that
is) are the ones who can modify LUCENENET issues.


: : I don't think this is intentional.  Something is broken in the JIRA setup.
: : I have posted this email on general@incubator.apache.org to see if folks
: : there may know what's the problem and fix it.
:
: It looks like when the LUCENENET Jira project was set up, the "Permission
: Scheme" and "Notification Scheme" were set to "Lucene Permissions" and
: "Lucene Notification Scheme" instead of making new ones specific to
: LUCENENET (perhaps someone assumed the "Lucene *" Schemes were generic for
: all projects, not specific to the Lucene Java project)



-Hoss





Re: Recency weightage in Lucene

2006-06-18 Thread Chris Hostetter

: Subject: Recency weightage in Lucene
:
: I am thinking of modifying lucene's current ranking algorithm to include
: the document's recency-weightage. So that the latest modified documents
: gets preference over earlier modified documents, which makes sense for
: news search.

FunctionQuery is your friend...

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html

..It's part of the Solr project, but it's extremely generic and should be
usable out of the box with any Lucene app.

: requirement is that document's lastModifiedTime is stored in the doc's
: field, and extracting this value could be quite expensive for every
: iteration in its posting stream. One approach could be to store it in a
: separate file (like Normalization) to avoid field-lookup.

If you store it as an indexed field, you can use the FieldCache to access
it, and it's a lot less expensive to look up at scoring time (if you look
at the FunctionQuery support classes, this is what the FieldCacheSource
class does).
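
Conceptually, what the cache buys you is an array of values indexed by doc id
that is loaded once and then read directly while scoring. The plain-Java
sketch below models only that idea; it is not the FieldCache or FunctionQuery
API, and the class name, the half-life decay, and its parameters are
hypothetical.

```java
class DocValueCacheSketch {
    // One value per document, indexed by doc id, loaded once up front.
    private final long[] lastModifiedByDoc;

    DocValueCacheSketch(long[] lastModifiedByDoc) {
        this.lastModifiedByDoc = lastModifiedByDoc.clone();
    }

    // Called once per matching document while scoring: a plain array read,
    // with no stored-field lookup.
    long lastModified(int docId) {
        return lastModifiedByDoc[docId];
    }

    // Hypothetical combination: scale the text score by a recency factor
    // that halves for every halfLifeMillis of age.
    double boostedScore(int docId, double textScore, long nowMillis, long halfLifeMillis) {
        double age = Math.max(0L, nowMillis - lastModified(docId));
        return textScore * Math.pow(0.5, age / halfLifeMillis);
    }

    public static void main(String[] args) {
        long hour = 60L * 60 * 1000;
        long now = 1000 * hour;
        DocValueCacheSketch cache = new DocValueCacheSketch(
            new long[] {now, now - 24 * hour, now - 48 * hour});
        for (int doc = 0; doc < 3; doc++) {
            System.out.println("doc " + doc + ": "
                + cache.boostedScore(doc, 2.0, now, 24 * hour));
        }
    }
}
```

The trade-off is memory: one entry per document in the index is held for the
lifetime of the reader.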




-Hoss





Why is the PorterStemmer class not visible outside the package?

2006-06-18 Thread Rakesh Prajapati
Hi,
 
I want to use the PorterStemmer class, but as it is not visible
outside the package I am unable to use it.
 
Is there any specific reason that PorterStemmer is not public?
Thanks & Regards,
Sr. Software Engineer
I- Link Infosoft (G) Pvt . Ltd. 
[EMAIL PROTECTED]