Soccer-themed question: null fields?
Greetings,

In honor of the World Cup and players that use 1 name only, can someone help me with the following...

1) Is there a way to find a document that has null fields? For example, if I have two fields (FIRST_NAME, LAST_NAME) for World Cup players:

FIRST_NAME: Brian    LAST_NAME: McBride
FIRST_NAME: Agustin  LAST_NAME: Delgado
FIRST_NAME: Zinha    LAST_NAME: (null or blank)
FIRST_NAME: Kaka     LAST_NAME: (null or blank)
... and so on

What's the way to find all players that use only their first name?

2) Is there a way to count field terms? For example, if instead we have one field...

NAME: Brian McBride
NAME: Agustin Delgado
NAME: Zinha
NAME: Kaka

Can I answer the same question by finding all documents where the number of terms in the NAME field is 1 and only 1? Is there a way to do that?

Thanks in advance,
JMA

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Recency weightage in Lucene
I am thinking of modifying Lucene's current ranking algorithm to include a recency weight for each document, so that the most recently modified documents get preference over documents modified earlier, which makes sense for news search.

I believe that to do this I have to tinker with the TermScorer.score() method and calculate the document score in its while (doc < end) {..} loop. The requirement is that the document's lastModifiedTime is stored in a field of the document, and extracting this value could be quite expensive on every iteration over the posting stream. One approach could be to store it in a separate file (like Normalization) to avoid the field lookup.

Any other ideas or suggestions? Or has anyone already implemented this?

thanks,
Prasen
[jira] Commented: (LUCENE-605) Make Explanation include information about match/non-match
[ http://issues.apache.org/jira/browse/LUCENE-605?page=comments#action_12416658 ]

paul.elschot commented on LUCENE-605:

I like the Boolean for indicating the match. The demo-fix.patch applies cleanly on my working copy, and all tests pass with it. I'll keep the patch in my working copy for now.

Regards,
Paul Elschot

> Make Explanation include information about match/non-match
>
>          Key: LUCENE-605
>          URL: http://issues.apache.org/jira/browse/LUCENE-605
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Search
>     Reporter: Hoss Man
>     Assignee: Hoss Man
>  Attachments: demo-fix.patch
>
> As discussed, I'm looking into the possibility of improving the Explanation class to include some basic info about the "match" status of the Explanation -- independent of the value...
> http://www.nabble.com/BooleanWeight.normalize%28float%29-doesn%27t-normalize-prohibited-clauses--t1596471.html#a4347644
> This is necessary to deal with things like LUCENE-451

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (LUCENE-605) Make Explanation include information about match/non-match
[ http://issues.apache.org/jira/browse/LUCENE-605?page=comments#action_12416660 ]

paul.elschot commented on LUCENE-605:

I tried removing the Explanation constructor that is deprecated in the demo-fix.patch. One of the uses of this constructor is in the (patched) BooleanQuery from line 317, and I fixed it like this (under ASL 2):

    sumExpl.setMatch(Boolean.TRUE);
    sumExpl.setValue(sum);
    float coordFactor = similarity.coord(coord, maxCoord);
    if (coordFactor != 1.0f) { // coordination has effect
      sumExpl.setValue(sumExpl.getValue() * coordFactor);
      sumExpl.setDescription(sumExpl.getDescription()
                             + " * " + coordFactor
                             + "=coord(" + coord + "/" + maxCoord + ")");
    }
    return sumExpl;

The point is that by adding a match indicator to Explanation, Explanation becomes less useful for explaining a subformula of a (matching) score value, in this case the coordination factor. The fix is to add the subformula to the description and the value of the explanation.

Btw, the actual explained score value was not changed by setValue() in the existing code for the coordination factor. This is probably a bug in BooleanQuery.explain(). There seems to be no test for the explanation descriptions, and I did not look at the description actually produced by getDescription() on the returned Explanation in this case.
Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Robert Engels wrote:

> Do you have any hard numbers to support this? The last time I checked, gcj had minimal improvement over JVM 1.5.

In terms of speed, there is not much difference between native code and classes (see sample timings). However, the pragmatic availability of a Java 5 environment for even somewhat _exotic_ platforms is sadly limited. My current environment is Linux on a dual core x86_64. One can only ride a jrocket into 1.5 land and still address 64 bits of goodness !

more,
l8r,
v

BTW, given a native compile and link,

    [EMAIL PROTECTED] lucene-415145]$ ldd build/indexFiles
        libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x003f0040)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x003efec0)
        libgcj.so.7 => /usr/lib64/libgcj.so.7 (0x2aac2000)
        libm.so.6 => /lib64/libm.so.6 (0x003ef910)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x003efa50)
        libz.so.1 => /usr/lib64/libz.so.1 (0x003ef950)
        libdl.so.2 => /lib64/libdl.so.2 (0x003ef930)
        libc.so.6 => /lib64/libc.so.6 (0x003ef8e0)
        /lib64/ld-linux-x86-64.so.2 (0x003ef8c0)

The native indexing,

    [EMAIL PROTECTED] lucene-415145]$ time build/indexFiles . 2>&1 > /dev/null

    real    0m22.932s
    user    0m16.581s
    sys     0m6.224s

The virtual machine indexing,

    [EMAIL PROTECTED] lucene-415145]$ time java -d64 -Xmx8192m -cp build/lucene-demos-2.0-rc1-dev.jar:build/lucene-core-2.0-rc1-dev.jar org.apache.lucene.demo.IndexFiles . 2>&1 > /dev/null

    real    0m23.224s
    user    0m33.238s
    sys     0m5.184s

Side note, the jrocket seems to use both processors just about 1/3 of the way through, whereas the gcj doesn't . . .

--
"The future is here. It's just not evenly distributed yet."
-- William Gibson, quoted by Whitfield Diffie
Re: Soccer-themed question: null fields?
JMA wrote on 06/17/2006 10:16 PM:

> 1) Is there a way to find a document that has null fields? For example, if I have two fields (FIRST_NAME, LAST_NAME) for World Cup players:
>
> FIRST_NAME: Brian    LAST_NAME: McBride
> FIRST_NAME: Agustin  LAST_NAME: Delgado
> FIRST_NAME: Zinha    LAST_NAME: (null or blank)
> FIRST_NAME: Kaka     LAST_NAME: (null or blank)
>
> ... and so on
>
> What's the way to find all players that use only their first name?

By far the best way is to store a special token in null fields and then just match on it. One less-performant alternative, if you have no control over the index, is to enable prefix wildcard queries and then write a query like this:

    FIRST_NAME:* -LAST_NAME:*

To enable prefix wildcard queries, you need to regenerate QueryParser.java from QueryParser.jj after replacing the wildcard production (search for OG, as Otis has nicely included the appropriate production as a comment).

> 2) Is there a way to count field terms? For example, if instead we have one field...
>
> NAME: Brian McBride
> NAME: Agustin Delgado
> NAME: Zinha
> NAME: Kaka
>
> Can I answer the same question by finding all documents where the number of terms in the NAME field is 1 and only 1? Is there a way to do that?

You would need to write your own Query subclass, and I can't think of any way to achieve this that would not be very slow. Not recommended.

Chuck
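
The "special token" approach suggested for question 1 can be sketched roughly as follows. This is plain Java standing in for an index, not Lucene API: the class name NullTokenSketch, the sentinel value __NULL__, and both helper methods are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: plain Java standing in for an index; none of this is Lucene API.
class NullTokenSketch {
    // Hypothetical sentinel; pick a value that can never be a real name.
    static final String NULL_TOKEN = "__NULL__";

    /** Value to index for a field: the sentinel when the input is null or blank. */
    static String indexedValue(String raw) {
        return (raw == null || raw.trim().isEmpty()) ? NULL_TOKEN : raw.trim();
    }

    /** First names of players whose indexed last name is the sentinel. */
    static List<String> firstNameOnly(String[][] players) {
        List<String> result = new ArrayList<>();
        for (String[] player : players) {
            // player[0] = FIRST_NAME, player[1] = LAST_NAME
            if (NULL_TOKEN.equals(indexedValue(player[1]))) {
                result.add(player[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] players = {
            {"Brian", "McBride"}, {"Agustin", "Delgado"}, {"Zinha", null}, {"Kaka", ""}
        };
        System.out.println(firstNameOnly(players)); // prints [Zinha, Kaka]
    }
}
```

In a real index the equivalent would presumably be to write the sentinel into LAST_NAME at indexing time, without analysis so that it survives intact, and then search for that exact term.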
Re: Recency weightage in Lucene
[EMAIL PROTECTED] wrote on 06/17/2006 10:52 PM:

> I am thinking of modifying lucene's current ranking algorithm to include the document's recency-weightage. So that the latest modified documents gets preference over earlier modified documents, which makes sense for news search.
>
> (I believe) To do this I have to tinker with TermScorer.score() method, and calculate document-score in its while (doc < end) {..} loop. The requirement is that document's lastModifiedTime is stored in the doc's field, and extracting this value could be quite expensive for every iteration in its posting stream. One approach could be to store it in a separate file (like Normalization) to avoid field-lookup.
>
> Any other ideas/suggestions.. Or if anyone has already implemented this ?

Does recency correlate with the order in which documents are added to your index? If so, then perhaps you can use the doc-id as a measure of recency and thereby avoid accessing a stored field. I'm not certain, but based on a quick perusal of the relevant code, it appears that both index opening and segment merging preserve the order of doc-ids. If you take this approach, you should verify that.

If you end up needing a stored field, then be sure to use the lazy fields capability (recently committed) to access it.

Chuck
RE: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Are you sure about the JVM numbers? I would think that user + sys must always be < real (unless maybe the multiprocessor affects this, i.e. it sums the processor time used on each).

-Original Message-
From: Vic Bancroft [mailto:[EMAIL PROTECTED]
Sent: Sunday, June 18, 2006 11:55 AM
To: [EMAIL PROTECTED]
Cc: java-dev@lucene.apache.org
Subject: Re: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)

> The virtual machine indexing,
>
> real    0m23.224s
> user    0m33.238s
> sys     0m5.184s
>
> Side note, the jrocket seems to use both processors just about 1/3 of the way through, where as the gcj doesn't . . .
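
The arithmetic behind this question can be checked with a small sketch. `time` reports user and sys as CPU time summed over all processors, so on a dual-core machine their sum can exceed real, up to a factor of about two. The class name CpuUtilization and its method are illustrative only; the numbers are the timings quoted in the thread.

```java
// Illustrative only: CpuUtilization is a made-up helper, not part of Lucene.
class CpuUtilization {
    /** Average number of processors kept busy: (user + sys) / real. */
    static double busyCores(double realSec, double userSec, double sysSec) {
        return (userSec + sysSec) / realSec;
    }

    public static void main(String[] args) {
        // Timings quoted in the thread, in seconds.
        double gcj = busyCores(22.932, 16.581, 6.224);
        double jvm = busyCores(23.224, 33.238, 5.184);
        System.out.println("gcj busy cores: " + gcj);
        System.out.println("JVM busy cores: " + jvm);
    }
}
```

The ratio comes out at roughly 0.99 for the gcj binary and roughly 1.65 for the JVM run, which is consistent with the side note that the JVM used both processors for part of the run while the native binary did not.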
Re: Recency weightage in Lucene
Using the doc-id itself as a recency metric is smart thinking. But the weight is actually a sigmoidal function of the document's age (i.e. currentTime - documentIndexingTime), hence I can't use the doc-id by itself.

What is the JIRA issue id for the lazy field capability? I would like to know more about this feature.

thanks for the help,
Prasen

-Original Message-
From: Chuck Williams <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Sun, 18 Jun 2006 07:47:40 -1000
Subject: Re: Recency weightage in Lucene

> Does recency correlate with the order in which documents are added to your index? If so, then perhaps you can use the doc-id as a measure of recency and thereby avoid accessing a stored field.
>
> If you end up needing a stored field, then be sure to use the lazy fields capability (recently committed) to access it.
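
A sigmoidal weight over document age, as described above, might look something like the following. This is a minimal sketch using the standard logistic function; the midpoint and scale parameters and the class name RecencyWeight are assumptions and do not come from the thread.

```java
// Sketch only: the midpoint and scale parameters are assumptions, not from the thread.
class RecencyWeight {
    /**
     * Logistic (sigmoid) weight in [0, 1] that decreases as a document ages.
     * ageMillis      = currentTime - documentIndexingTime
     * midpointMillis = age at which the weight is exactly 0.5
     * scaleMillis    = how gradually the weight falls off around the midpoint
     */
    static double weight(long ageMillis, long midpointMillis, long scaleMillis) {
        double x = (double) (ageMillis - midpointMillis) / scaleMillis;
        return 1.0 / (1.0 + Math.exp(x));
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        for (long ageDays : new long[] {0, 7, 30}) {
            // Midpoint of 7 days and scale of 2 days are arbitrary example values.
            System.out.println(ageDays + " days -> " + weight(ageDays * day, 7 * day, 2 * day));
        }
    }
}
```

Because the weight depends on the current time, it cannot be baked into the index once; it has to be computed at query time from a timestamp that is cheap to look up per document.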
RE: Results (Re: Survey: Lucene and Java 1.4 vs. 1.5)
Is there any specific reason why the PorterStemmer class in org.apache.lucene.analysis is not public?

Thank you,
Best Regards,
Bhoomi Mehta
Sr. Project Leader
I- Link Infosoft (G) Pvt . Ltd.
Ahmedabad
Email: [EMAIL PROTECTED]
RE: Lucene.NET Jira Emails?
: If this is the case, whoever has the karma to fix this, can you take care
: of it?

I think the proper way to deal with this is to file a Jira request with the Infrastructure Project in the JIRA component, but I'm not 100% sure.

: Also, I can't figure out how to assign, close or even edit a JIRA issue
: opened against Lucene.Net. For example, take a look at:
: http://issues.apache.org/jira/browse/LUCENENET-6 and I can't see anything
: there to edit this issue. Yes, I am logged in.

That's the Permission Scheme thing I mentioned -- it seems that members of the "lucene-developers" Jira Group (the Java Lucene Developers, that is) are the ones who can modify LUCENENET issues.

: : I don't think this is intentional. Something is broken in the JIRA setup.
: : I have posted this email on general@incubator.apache.org to see if folks
: : there may know what's the problem and fix it.
:
: It looks like when the LUCENENET Jira project was set up, the "Permission
: Scheme" and "Notification Scheme" were set to "Lucene Permissions" and
: "Lucene Notification Scheme" instead of making new ones specific to
: LUCENENET (perhaps someone assumed the "Lucene *" Schemes were generic for
: all projects, not specific to the Lucene Java project)

-Hoss
Re: Recency weightage in Lucene
: Subject: Recency weightage in Lucene
:
: I am thinking of modifying lucene's current ranking algorithm to include
: the document's recency-weightage. So that the latest modified documents
: gets preference over earlier modified documents, which makes sense for
: news search.

FunctionQuery is your friend...

http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/package-summary.html

...It's part of the Solr project, but it's extremely generic and should be usable out of the box with any Lucene app.

: requirement is that document's lastModifiedTime is stored in the doc's
: field, and extracting this value could be quite expensive for every
: iteration in its posting stream. One approach could be to store it in a
: separate file (like Normalization) to avoid field-lookup.

If you store it as an indexed field, you can use the FieldCache to access it, and it's a lot less expensive to look at at scoring time (if you look at the FunctionQuery support classes, this is what the FieldCacheSource class does).

-Hoss
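
The idea behind the FieldCache suggestion, loading one value per document into memory once and then doing an array lookup at scoring time, can be sketched like this. It is plain Java that only imitates the idea and uses none of Lucene's or Solr's classes; TimestampCacheSketch, its score method, and the linear freshness factor are all hypothetical.

```java
// Sketch only: a plain array standing in for a per-document value cache.
// It imitates the idea behind FieldCache but uses none of Lucene's classes.
class TimestampCacheSketch {
    private final long[] lastModifiedByDoc; // array index = document id

    /** Loaded once (for example when the index is opened), not once per query. */
    TimestampCacheSketch(long[] lastModifiedByDoc) {
        this.lastModifiedByDoc = lastModifiedByDoc.clone();
    }

    /** Hypothetical combination: base score scaled by a linear freshness factor in [0, 1]. */
    double score(int docId, float baseScore, long now, long maxAgeMillis) {
        long age = Math.max(0L, now - lastModifiedByDoc[docId]); // O(1) lookup per hit
        double freshness = Math.max(0.0, 1.0 - (double) age / maxAgeMillis);
        return baseScore * freshness;
    }

    public static void main(String[] args) {
        TimestampCacheSketch cache = new TimestampCacheSketch(new long[] {1000L, 500L, 0L});
        for (int doc = 0; doc < 3; doc++) {
            System.out.println("doc " + doc + " -> " + cache.score(doc, 2.0f, 1000L, 1000L));
        }
    }
}
```

The cost that matters is the per-hit lookup: an array access keyed by document id avoids reading a stored field for every entry in the posting stream.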
Why is the PorterStemmer class not visible outside the package?
Hi,

I want to use the PorterStemmer class, but as it is not visible outside the package I am unable to use it. Is there any specific reason that PorterStemmer is not public?

Thanks & Regards,
Sr. Software Engineer
I- Link Infosoft (G) Pvt . Ltd.
[EMAIL PROTECTED]