[jira] Commented: (LUCENE-868) Making Term Vectors more accessible

2007-07-09 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511267
 ] 

Yonik Seeley commented on LUCENE-868:
-

I haven't really used the term vector APIs, but I like the goal of allowing the 
app to handle things.
What about dropping down a level lower, and not constructing the arrays or 
TermVectorOffsetInfo either?
Perhaps something like:

public interface TermVectorMapper {
  void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions);
  void mapTerm(String term, int frequency);
  void mapTermPos(int startOffset, int endOffset, int position);
}

One could have an implementation of TermVectorMapper that collected the offsets 
and positions into an array as your patch does now.  I'm not sure if there 
would be a noticeable performance impact to a method call per term instance or 
not.

Oh, wait...  I just went and looked at the readTermVector() code, and positions 
and offsets aren't stored interleaved, so one would have to do a sequence of 
mapTermPos() followed by a sequence of mapTermOffset(), which makes less sense 
than what you have now.

Might also consider using an abstract class instead of an interface in case you 
want to make backward-compatible tweaks later.
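
To make the abstract-class suggestion concrete, here is a minimal sketch in modern Java. It assumes a simplified callback with only setExpectations() and mapTerm(); the names SketchTermVectorMapper, FrequencySortedMapper and Entry are hypothetical and are not taken from the patch or from Lucene. The point is only that a base class can gain new non-abstract methods later without breaking existing subclasses, and that an application-side mapper can sort by frequency while collecting.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical abstract base class (an assumption, not the API from the patch).
// Because it is a class rather than an interface, non-abstract methods can be
// added later without breaking existing subclasses.
abstract class SketchTermVectorMapper {
  // Called once per field before any terms are mapped.
  abstract void setExpectations(String field, int numTerms,
                                boolean hasOffsets, boolean hasPositions);

  // Called once per term in the vector.
  abstract void mapTerm(String term, int frequency);
}

// Collects terms and returns them sorted by descending frequency.
class FrequencySortedMapper extends SketchTermVectorMapper {
  static final class Entry {
    final String term;
    final int frequency;

    Entry(String term, int frequency) {
      this.term = term;
      this.frequency = frequency;
    }
  }

  private List<Entry> entries = new ArrayList<>();

  @Override
  void setExpectations(String field, int numTerms,
                       boolean hasOffsets, boolean hasPositions) {
    // Size the list up front from the announced term count.
    entries = new ArrayList<>(numTerms);
  }

  @Override
  void mapTerm(String term, int frequency) {
    entries.add(new Entry(term, frequency));
  }

  // Returns a sorted copy; the collected order is left untouched.
  List<Entry> sortedByFrequency() {
    List<Entry> sorted = new ArrayList<>(entries);
    sorted.sort(Comparator.comparingInt((Entry e) -> e.frequency).reversed());
    return sorted;
  }
}
```

A reader would call setExpectations() once per field and then mapTerm() once per term; the application asks the mapper for the sorted view afterwards, so no parallel arrays are built in between.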

> Making Term Vectors more accessible
> ---
>
> Key: LUCENE-868
> URL: https://issues.apache.org/jira/browse/LUCENE-868
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Store
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: LUCENE-868-v1.patch
>
>
> One of the big issues with term vector usage is that the information is 
> loaded into parallel arrays as it is loaded, which are then often times 
> manipulated again to use in the application (for instance, they are sorted by 
> frequency).
> Adding a callback mechanism that allows the vector loading to be handled by 
> the application would make this a lot more efficient.
> I propose to add to IndexReader:
> abstract public void getTermFreqVector(int docNumber, String field, 
> TermVectorMapper mapper) throws IOException;
> and a similar one for the all fields version
> Where TermVectorMapper is an interface with a single method:
> void map(String term, int frequency, int offset, int position);
> The TermVectorReader will be modified to just call the TermVectorMapper.  The 
> existing getTermFreqVectors will be reimplemented to use an implementation of 
> TermVectorMapper that creates the parallel arrays.  Additionally, some simple 
> implementations that automatically sort vectors will also be created.
> This is my first draft of this API and is subject to change.  I hope to have 
> a patch soon.
> See 
> http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003
>  for related information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Assigned: (LUCENE-944) Remove deprecated methods in BooleanQuery

2007-07-09 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-944:


Assignee: Michael Busch

> Remove deprecated methods in BooleanQuery
> -
>
> Key: LUCENE-944
> URL: https://issues.apache.org/jira/browse/LUCENE-944
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Paul Elschot
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 2.3
>
> Attachments: BooleanQuery20070626.patch
>
>
> Remove deprecated methods setUseScorer14 and getUseScorer14 in BooleanQuery, 
> and adapt javadocs.




[jira] Commented: (LUCENE-951) PATCH MultiLevelSkipListReader NullPointerException

2007-07-09 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511259
 ] 

Michael Busch commented on LUCENE-951:
--

Shame on me, this is a pretty bad typo!
Rich, thank you for finding this. The patch is good. I'll
add a testcase that hits this bug and commit it shortly.

> PATCH MultiLevelSkipListReader NullPointerException
> ---
>
> Key: LUCENE-951
> URL: https://issues.apache.org/jira/browse/LUCENE-951
> Project: Lucene - Java
> Issue Type: Bug
> Components: Index
> Affects Versions: 2.2
> Reporter: Rich Johnson
> Assignee: Michael Busch
> Attachments: MultiLevelSkipListReader.patch
>
>
>  When Reconstructing Document Using Luke Tool, received NullPointerException.
> java.lang.NullPointerException
> at 
> org.apache.lucene.index.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:188)
> at 
> org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:97)
> at 
> org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164)
> at org.getopt.luke.Luke$2.run(Unknown Source)
> Luke version 0.7.1
> I emailed with Luke author Andrzej Bialecki and he suggested the attached 
> patch file which fixed the problem.




[jira] Assigned: (LUCENE-951) PATCH MultiLevelSkipListReader NullPointerException

2007-07-09 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch reassigned LUCENE-951:


Assignee: Michael Busch





[jira] Commented: (LUCENE-868) Making Term Vectors more accessible

2007-07-09 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511256
 ] 

Grant Ingersoll commented on LUCENE-868:


Anyone have any comments on this approach for Term Vectors?

I'm not sure if the patch still applies to trunk, but I will update it and 
commit on Wednesday or Thursday unless I hear other comments.





[jira] Commented: (LUCENE-584) Decouple Filter from BitSet

2007-07-09 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511186
 ] 

Paul Elschot commented on LUCENE-584:
-

With 2.2 out, and LUCENE-730 out of the way, wouldn't this be a good moment for 
some progress with this issue?
The patch still applies cleanly, and I'd like to start working on a skipping 
extension of SortedVIntList, much like the latest index format for document 
lists.


> Decouple Filter from BitSet
> ---
>
> Key: LUCENE-584
> URL: https://issues.apache.org/jira/browse/LUCENE-584
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.0.1
> Reporter: Peter Schäfer
> Priority: Minor
> Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, 
> Filter-20060628.patch, HitCollector-20060628.patch, 
> IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, 
> Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, 
> Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, 
> TestSortedVIntList.java
>
>
> {code}
> package org.apache.lucene.search;
> public abstract class Filter implements java.io.Serializable 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract 
> interface, instead of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's 
> privileges, only a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of 
> memory. It would be desirable to have an alternative BitSet implementation 
> with smaller memory footprint.
> Though it _is_ possible to derive classes from =java.util.BitSet=, it was 
> obviously not designed for that purpose.
> That's why I propose to use an interface instead. The default implementation 
> could still delegate to =java.util.BitSet=.
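
As a rough illustration of the proposal quoted above, the sketch below pairs the proposed AbstractBitSet interface with two implementations: one that delegates to java.util.BitSet and one that keeps only the set document numbers in a sorted int array. The class names JavaUtilBitSetAdapter and SortedIntArrayBitSet are hypothetical, and this is not the code from any of the attached patches.

```java
import java.util.Arrays;
import java.util.BitSet;

// Read-only view, mirroring the interface proposed in the issue.
interface AbstractBitSet {
  boolean get(int index);
}

// Default implementation that simply delegates to java.util.BitSet.
final class JavaUtilBitSetAdapter implements AbstractBitSet {
  private final BitSet bits;

  JavaUtilBitSetAdapter(BitSet bits) {
    this.bits = bits;
  }

  @Override
  public boolean get(int index) {
    return bits.get(index);
  }
}

// Sparse implementation (hypothetical): stores only the set document
// numbers, sorted, and answers get() with a binary search.
final class SortedIntArrayBitSet implements AbstractBitSet {
  private final int[] sortedDocs;

  SortedIntArrayBitSet(int[] docs) {
    sortedDocs = docs.clone();
    Arrays.sort(sortedDocs);
  }

  @Override
  public boolean get(int index) {
    return Arrays.binarySearch(sortedDocs, index) >= 0;
  }
}
```

The trade-off is the usual one: java.util.BitSet needs about one bit per document up to the highest set bit, however few bits are set, while the sorted array needs four bytes per visible document and pays a logarithmic lookup cost.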




[jira] Updated: (LUCENE-954) Toggle score normalization in Hits

2007-07-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Kohlschütter updated LUCENE-954:
--

Attachment: hits-scoreNorm.patch

Adds a switch to enable/disable Hits-based score normalization.


> Toggle score normalization in Hits
> --
>
> Key: LUCENE-954
> URL: https://issues.apache.org/jira/browse/LUCENE-954
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.2
> Environment: any
> Reporter: Christian Kohlschütter
> Fix For: 2.2
>
> Attachments: hits-scoreNorm.patch
>
>
> The current implementation of the "Hits" class sometimes performs score 
> normalization.
> In particular, whenever the top-ranked score is bigger than 1.0, it is 
> normalized to a maximum of 1.0.
> In this case, Hits may return different score results than TopDocs-based 
> methods.
> In my scenario (a federated search system), Hits delivered just plain wrong 
> results.
> I was merging results from several sources, all having homogeneous statistics 
> (similar to MultiSearcher, but over the Internet using HTTP/XML-based 
> protocols).
> Sometimes, some of the sources had a top-score greater than 1, so I ended up 
> with garbled results.
> I suggest adding a switch to enable/disable this score normalization at 
> runtime.
> My patch (attached) has an additional performance benefit, since score 
> normalization now occurs only when Hits#score() is called, not when creating 
> the Hits result list. Whenever scores are not required, you save one 
> multiplication per retrieved hit (i.e., at least 100 multiplications with the 
> current implementation of Hits).




[jira] Created: (LUCENE-954) Toggle score normalization in Hits

2007-07-09 Thread JIRA
Toggle score normalization in Hits
--

 Key: LUCENE-954
 URL: https://issues.apache.org/jira/browse/LUCENE-954
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.2
 Environment: any
Reporter: Christian Kohlschütter
 Fix For: 2.2


The current implementation of the "Hits" class sometimes performs score 
normalization.
In particular, whenever the top-ranked score is bigger than 1.0, it is 
normalized to a maximum of 1.0.

In this case, Hits may return different score results than TopDocs-based 
methods.

In my scenario (a federated search system), Hits delivered just plain wrong 
results.
I was merging results from several sources, all having homogeneous statistics 
(similar to MultiSearcher, but over the Internet using HTTP/XML-based 
protocols).
Sometimes, some of the sources had a top-score greater than 1, so I ended up 
with garbled results.

I suggest adding a switch to enable/disable this score normalization at runtime.
My patch (attached) has an additional performance benefit, since score 
normalization now occurs only when Hits#score() is called, not when creating 
the Hits result list. Whenever scores are not required, you save one 
multiplication per retrieved hit (i.e., at least 100 multiplications with the 
current implementation of Hits).
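
The behaviour described above can be sketched as follows. This is a simplified stand-in, not the attached patch and not the real Hits class: LazyNormalizedScores, setNormalize() and the field names are hypothetical. It assumes normalization means scaling all scores by 1/topScore when the top score exceeds 1.0, and it applies that factor only when a score is requested.

```java
// Hypothetical stand-in for the Hits behaviour described in the issue.
final class LazyNormalizedScores {
  private final float[] rawScores;   // raw scores as collected
  private final float scoreNorm;     // factor that maps the top score to 1.0
  private boolean normalize = true;  // the proposed runtime switch

  LazyNormalizedScores(float[] rawScores) {
    this.rawScores = rawScores.clone();
    float max = 0f;
    for (float s : this.rawScores) {
      max = Math.max(max, s);
    }
    // Only scale down when the top score exceeds 1.0.
    this.scoreNorm = max > 1.0f ? 1.0f / max : 1.0f;
  }

  void setNormalize(boolean normalize) {
    this.normalize = normalize;
  }

  // The multiplication happens here, on demand, and not while the
  // result list is being built.
  float score(int n) {
    return normalize ? rawScores[n] * scoreNorm : rawScores[n];
  }
}
```

With the switch off, a federated merger sees the same raw values a TopDocs-based caller would see, so scores from different sources stay comparable.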





Re: [OT, slightly] Some interesting metrics on Lucene

2007-07-09 Thread Grant Ingersoll
Yep, I didn't really think the money was all that accurate, just
thought it was interesting that someone was trying to quantify it.
Like I said, it also severely sells short the contributions of the
community, putting all credit into the committers (for all projects),
which is far from accurate.


-Grant

On Jul 8, 2007, at 11:56 PM, Ian Holsman wrote:

> Grant Ingersoll wrote:
> > http://www.ohloh.net/projects/3564 has some interesting metrics on
> > Lucene (and Solr and Nutch).  Most interesting is that they
> > estimate it is 34 person years to develop at a cost of
> > approximately $1.8 million dollars (using a salary of $55k)
>
> before you get too excited, it estimates that Apache Labs (which is
> a sandbox where people try things out) is worth $2.5m
> http://www.ohloh.net/projects/6271
>
> FWIW.. I think the brand value of 'lucene' is worth at least 5-10x
> (if not more) what ohloh thinks it is.
> not to mention the amount of unseen development time corporates
> have done around lucene, and the amount of revenue which depends on
> lucene working correctly.
>
> --Ian
