[jira] Commented: (LUCENE-868) Making Term Vectors more accessible
[ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511267 ] Yonik Seeley commented on LUCENE-868: - I haven't really used the term vector APIs, but I like the goal of allowing the app to handle things. What about dropping a level lower, and not constructing the arrays or TermVectorOffsetInfo either? Perhaps something like: public interface TermVectorMapper { void setExpectations(String field, int numTerms, boolean hasOffsets, boolean hasPositions); void mapTerm(String term, int frequency); void mapTermPos(int startOffset, int endOffset, int position); } One could have an implementation of TermVectorMapper that collected the offsets and positions into an array as your patch does now. I'm not sure if there would be a noticeable performance impact from a method call per term instance or not. Oh, wait... I just went and looked at the readTermVector() code, and positions and offsets aren't stored interleaved, so one would have to do a sequence of mapTermPos() followed by a sequence of mapTermOffset(), which makes less sense than what you have now. Might also consider using an abstract class instead of an interface in case you want to make backward-compatible tweaks later. > Making Term Vectors more accessible > --- > > Key: LUCENE-868 > URL: https://issues.apache.org/jira/browse/LUCENE-868 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-868-v1.patch > > > One of the big issues with term vector usage is that the information is > loaded into parallel arrays as it is loaded, which are then often times > manipulated again to use in the application (for instance, they are sorted by > frequency). > Adding a callback mechanism that allows the vector loading to be handled by > the application would make this a lot more efficient. 
> I propose to add to IndexReader: > abstract public void getTermFreqVector(int docNumber, String field, > TermVectorMapper mapper) throws IOException; > and a similar one for the all fields version > Where TermVectorMapper is an interface with a single method: > void map(String term, int frequency, int offset, int position); > The TermVectorReader will be modified to just call the TermVectorMapper. The > existing getTermFreqVectors will be reimplemented to use an implementation of > TermVectorMapper that creates the parallel arrays. Additionally, some simple > implementations that automatically sort vectors will also be created. > This is my first draft of this API and is subject to change. I hope to have > a patch soon. > See > http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003 > for related information. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
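For readers following the abstract-class suggestion in the comment above, here is a minimal, self-contained sketch of why it helps with backward-compatible tweaks. It is not Lucene code and not the attached patch: the method names setExpectations and mapTerm come from the comment, while endField and CollectingMapper are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback type, modelled on the method names in the comment.
// Being an abstract class, it can gain new no-op hooks later without
// breaking subclasses that were written before those hooks existed.
abstract class TermVectorMapper {
    // Called once per field, before any terms are delivered.
    abstract void setExpectations(String field, int numTerms,
                                  boolean hasOffsets, boolean hasPositions);

    // Called once per distinct term in the field.
    abstract void mapTerm(String term, int frequency);

    // Assumed example of a later addition: the default does nothing,
    // so existing subclasses keep compiling unchanged.
    void endField(String field) { }
}

// Assumed example implementation: collects terms and frequencies into lists.
class CollectingMapper extends TermVectorMapper {
    String field;
    final List<String> terms = new ArrayList<>();
    final List<Integer> freqs = new ArrayList<>();

    @Override
    void setExpectations(String field, int numTerms,
                         boolean hasOffsets, boolean hasPositions) {
        this.field = field;
        terms.clear();
        freqs.clear();
    }

    @Override
    void mapTerm(String term, int frequency) {
        terms.add(term);
        freqs.add(frequency);
    }
}
```

With an interface, adding endField would force every existing implementation to change; with the abstract class, only mappers that care about the new hook need to override it.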
[jira] Assigned: (LUCENE-944) Remove deprecated methods in BooleanQuery
[ https://issues.apache.org/jira/browse/LUCENE-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch reassigned LUCENE-944: Assignee: Michael Busch > Remove deprecated methods in BooleanQuery > - > > Key: LUCENE-944 > URL: https://issues.apache.org/jira/browse/LUCENE-944 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Reporter: Paul Elschot >Assignee: Michael Busch >Priority: Minor > Fix For: 2.3 > > Attachments: BooleanQuery20070626.patch > > > Remove deprecated methods setUseScorer14 and getUseScorer14 in BooleanQuery, > and adapt javadocs.
[jira] Commented: (LUCENE-951) PATCH MultiLevelSkipListReader NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511259 ] Michael Busch commented on LUCENE-951: -- Shame on me, this is a pretty bad typo! Rich, thank you for finding this. The patch is good. I'll add a testcase that hits this bug and commit it shortly. > PATCH MultiLevelSkipListReader NullPointerException > --- > > Key: LUCENE-951 > URL: https://issues.apache.org/jira/browse/LUCENE-951 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.2 >Reporter: Rich Johnson >Assignee: Michael Busch > Attachments: MultiLevelSkipListReader.patch > > > When Reconstructing Document Using Luke Tool, received NullPointerException. > java.lang.NullPointerException > at > org.apache.lucene.index.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:188) > at > org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:97) > at > org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164) > at org.getopt.luke.Luke$2.run(Unknown Source) > Luke version 0.7.1 > I emailed with Luke author Andrzej Bialecki and he suggested the attached > patch file which fixed the problem.
[jira] Assigned: (LUCENE-951) PATCH MultiLevelSkipListReader NullPointerException
[ https://issues.apache.org/jira/browse/LUCENE-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Busch reassigned LUCENE-951: Assignee: Michael Busch > PATCH MultiLevelSkipListReader NullPointerException > --- > > Key: LUCENE-951 > URL: https://issues.apache.org/jira/browse/LUCENE-951 > Project: Lucene - Java > Issue Type: Bug > Components: Index >Affects Versions: 2.2 >Reporter: Rich Johnson >Assignee: Michael Busch > Attachments: MultiLevelSkipListReader.patch > > > When Reconstructing Document Using Luke Tool, received NullPointerException. > java.lang.NullPointerException > at > org.apache.lucene.index.MultiLevelSkipListReader.loadSkipLevels(MultiLevelSkipListReader.java:188) > at > org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:97) > at > org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164) > at org.getopt.luke.Luke$2.run(Unknown Source) > Luke version 0.7.1 > I emailed with Luke author Andrzej Bialecki and he suggested the attached > patch file which fixed the problem.
[jira] Commented: (LUCENE-868) Making Term Vectors more accessible
[ https://issues.apache.org/jira/browse/LUCENE-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511256 ] Grant Ingersoll commented on LUCENE-868: Anyone have any comments on this approach for Term Vectors? I'm not sure if the patch still applies to trunk, but I will update it and commit on Wednesday or Thursday unless I hear other comments. > Making Term Vectors more accessible > --- > > Key: LUCENE-868 > URL: https://issues.apache.org/jira/browse/LUCENE-868 > Project: Lucene - Java > Issue Type: New Feature > Components: Store >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll >Priority: Minor > Attachments: LUCENE-868-v1.patch > > > One of the big issues with term vector usage is that the information is > loaded into parallel arrays as it is loaded, which are then often times > manipulated again to use in the application (for instance, they are sorted by > frequency). > Adding a callback mechanism that allows the vector loading to be handled by > the application would make this a lot more efficient. > I propose to add to IndexReader: > abstract public void getTermFreqVector(int docNumber, String field, > TermVectorMapper mapper) throws IOException; > and a similar one for the all fields version > Where TermVectorMapper is an interface with a single method: > void map(String term, int frequency, int offset, int position); > The TermVectorReader will be modified to just call the TermVectorMapper. The > existing getTermFreqVectors will be reimplemented to use an implementation of > TermVectorMapper that creates the parallel arrays. Additionally, some simple > implementations that automatically sort vectors will also be created. > This is my first draft of this API and is subject to change. I hope to have > a patch soon. > See > http://www.gossamer-threads.com/lists/lucene/java-user/48003?search_string=get%20the%20total%20term%20frequency;#48003 > for related information.
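As an illustration of the "implementations that automatically sort vectors" mentioned in the issue description, the following is a rough sketch of a mapper that keeps terms ordered by frequency as they arrive. It does not come from the patch. TermCallback is a hypothetical stand-in that copies the single map(...) signature from the description, FrequencySortedMapper is an invented name, and the sketch assumes the callback fires once per distinct term and simply ignores offset and position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the single-method callback in the description.
interface TermCallback {
    void map(String term, int frequency, int offset, int position);
}

// Assumed example: keeps terms in descending frequency order while loading,
// so the application never has to re-sort parallel arrays afterwards.
class FrequencySortedMapper implements TermCallback {
    private final List<String> terms = new ArrayList<>();
    private final List<Integer> freqs = new ArrayList<>();

    @Override
    public void map(String term, int frequency, int offset, int position) {
        // Insertion step: skip every entry whose frequency is at least as
        // high, which also keeps ties in arrival order.
        int i = 0;
        while (i < freqs.size() && freqs.get(i) >= frequency) {
            i++;
        }
        terms.add(i, term);
        freqs.add(i, frequency);
    }

    // Terms ordered from most to least frequent.
    List<String> termsByFrequency() {
        return terms;
    }
}
```

The linear insertion is only meant to keep the example short; a real implementation would more likely use a sorted collection.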
[jira] Commented: (LUCENE-584) Decouple Filter from BitSet
[ https://issues.apache.org/jira/browse/LUCENE-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511186 ] Paul Elschot commented on LUCENE-584: - With 2.2 out, and LUCENE-730 out of the way, wouldn't this be a good moment for some progress with this issue? The patch still applies cleanly, and I'd like to start working on a skipping extension of SortedVIntList, much like the latest index format for document lists. > Decouple Filter from BitSet > --- > > Key: LUCENE-584 > URL: https://issues.apache.org/jira/browse/LUCENE-584 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.0.1 >Reporter: Peter Schäfer >Priority: Minor > Attachments: bench-diff.txt, bench-diff.txt, BitsMatcher.java, > Filter-20060628.patch, HitCollector-20060628.patch, > IndexSearcher-20060628.patch, MatchCollector.java, Matcher.java, > Matcher20070226.patch, Scorer-20060628.patch, Searchable-20060628.patch, > Searcher-20060628.patch, Some Matchers.zip, SortedVIntList.java, > TestSortedVIntList.java > > > {code} > package org.apache.lucene.search; > public abstract class Filter implements java.io.Serializable > { > public abstract AbstractBitSet bits(IndexReader reader) throws IOException; > } > public interface AbstractBitSet > { > public boolean get(int index); > } > {code} > It would be useful if the method =Filter.bits()= returned an abstract > interface, instead of =java.util.BitSet=. > Use case: there is a very large index, and, depending on the user's > privileges, only a small portion of the index is actually visible. > Sparsely populated =java.util.BitSet=s are not efficient and waste lots of > memory. It would be desirable to have an alternative BitSet implementation > with a smaller memory footprint. > Though it _is_ possible to derive classes from =java.util.BitSet=, it was > obviously not designed for that purpose. > That's why I propose to use an interface instead. 
The default implementation > could still delegate to =java.util.BitSet=.
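To illustrate the proposal in the issue description, here is a small, self-contained sketch of the AbstractBitSet interface with two implementations: one delegating to java.util.BitSet and one that stores only the set document numbers. Only the interface and its get(int) method come from the issue. BitSetAdapter and SortedIntArrayBitSet are hypothetical names, and the sorted-array approach is just one possible sparse representation, not the SortedVIntList attached to the issue.

```java
import java.util.Arrays;
import java.util.BitSet;

// Interface as proposed in the issue description.
interface AbstractBitSet {
    boolean get(int index);
}

// Hypothetical default implementation that delegates to java.util.BitSet.
class BitSetAdapter implements AbstractBitSet {
    private final BitSet bits;

    BitSetAdapter(BitSet bits) {
        this.bits = bits;
    }

    @Override
    public boolean get(int index) {
        return bits.get(index);
    }
}

// Hypothetical sparse alternative: memory grows with the number of set bits,
// not with the highest document number.
class SortedIntArrayBitSet implements AbstractBitSet {
    private final int[] sortedDocs;

    SortedIntArrayBitSet(int[] docs) {
        sortedDocs = docs.clone();
        Arrays.sort(sortedDocs);
    }

    @Override
    public boolean get(int index) {
        // binarySearch returns a non-negative position only when found.
        return Arrays.binarySearch(sortedDocs, index) >= 0;
    }
}
```

A filter that exposes three visible documents out of a million would hold three ints in the sparse version, while a BitSet has to cover every position up to the highest set bit.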
[jira] Updated: (LUCENE-954) Toggle score normalization in Hits
[ https://issues.apache.org/jira/browse/LUCENE-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christian Kohlschütter updated LUCENE-954: -- Attachment: hits-scoreNorm.patch Adds a switch to enable/disable Hits-based score normalization. > Toggle score normalization in Hits > -- > > Key: LUCENE-954 > URL: https://issues.apache.org/jira/browse/LUCENE-954 > Project: Lucene - Java > Issue Type: Improvement > Components: Search >Affects Versions: 2.2 > Environment: any >Reporter: Christian Kohlschütter > Fix For: 2.2 > > Attachments: hits-scoreNorm.patch > > > The current implementation of the "Hits" class sometimes performs score > normalization. > In particular, whenever the top-ranked score is bigger than 1.0, it is > normalized to a maximum of 1.0. > In this case, Hits may return different score results than TopDocs-based > methods. > In my scenario (a federated search system), Hits delivered just plain wrong > results. > I was merging results from several sources, all having homogeneous statistics > (similar to MultiSearcher, but over the Internet using HTTP/XML-based > protocols). > Sometimes, some of the sources had a top-score greater than 1, so I ended up > with garbled results. > I suggest adding a switch to enable/disable this score-normalization at > runtime. > My patch (attached) has an additional performance benefit, since score > normalization now occurs only when Hits#score() is called, not when creating > the Hits result list. Whenever scores are not required, you save one > multiplication per retrieved hit (i.e., at least 100 multiplications with the > current implementation of Hits).
[jira] Created: (LUCENE-954) Toggle score normalization in Hits
Toggle score normalization in Hits -- Key: LUCENE-954 URL: https://issues.apache.org/jira/browse/LUCENE-954 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.2 Environment: any Reporter: Christian Kohlschütter Fix For: 2.2 The current implementation of the "Hits" class sometimes performs score normalization. In particular, whenever the top-ranked score is bigger than 1.0, it is normalized to a maximum of 1.0. In this case, Hits may return different score results than TopDocs-based methods. In my scenario (a federated search system), Hits delivered just plain wrong results. I was merging results from several sources, all having homogeneous statistics (similar to MultiSearcher, but over the Internet using HTTP/XML-based protocols). Sometimes, some of the sources had a top-score greater than 1, so I ended up with garbled results. I suggest adding a switch to enable/disable this score-normalization at runtime. My patch (attached) has an additional performance benefit, since score normalization now occurs only when Hits#score() is called, not when creating the Hits result list. Whenever scores are not required, you save one multiplication per retrieved hit (i.e., at least 100 multiplications with the current implementation of Hits).
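The behaviour described in the issue can be sketched roughly as follows. This is not the attached patch and not Lucene's Hits class: ScoredHits, setNormalize and the field names are hypothetical, and the sketch assumes raw scores arrive sorted with the highest first. It only shows the two ideas from the description, a runtime switch and normalization deferred until a score is actually requested.

```java
// Hypothetical stand-in for a Hits-like result list; not Lucene's Hits class.
class ScoredHits {
    private final float[] rawScores;  // raw scores, highest first (assumed)
    private final float scoreNorm;    // factor applied when normalization is on
    private boolean normalize = true;

    ScoredHits(float[] rawScoresDescending) {
        this.rawScores = rawScoresDescending.clone();
        float top = rawScores.length > 0 ? rawScores[0] : 0f;
        // Scale down only when the top-ranked score exceeds 1.0.
        this.scoreNorm = top > 1.0f ? 1.0f / top : 1.0f;
    }

    // Runtime switch: pass false to get raw scores back.
    void setNormalize(boolean normalize) {
        this.normalize = normalize;
    }

    // Normalization happens lazily here, one multiplication per call,
    // instead of once per hit when the list is built.
    float score(int n) {
        return normalize ? rawScores[n] * scoreNorm : rawScores[n];
    }
}
```

In a federated setup like the one described, each source would be read with normalization switched off, so that a raw 4.0 from one source and a raw 0.8 from another stay comparable when merged.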
Re: [OT, slightly] Some interesting metrics on Lucene
Yep, I didn't really think the money was all that accurate, just thought it was interesting that someone was trying to quantify it. Like I said, it also severely sells short the contributions of the community, putting all credit into the committers (for all projects), which is far from accurate. -Grant On Jul 8, 2007, at 11:56 PM, Ian Holsman wrote: Grant Ingersoll wrote: http://www.ohloh.net/projects/3564 has some interesting metrics on Lucene (and Solr and Nutch). Most interesting is that they estimate it is 34 person years to develop at a cost of approximately $1.8 million (using a salary of $55k) before you get too excited, it estimates that Apache Labs (which is a sandbox where people try things out) is worth $2.5m http://www.ohloh.net/projects/6271 FWIW.. I think the brand value of 'lucene' is worth at least 5-10x (if not more) what ohloh thinks it is. not to mention the amount of unseen development time corporates have done around lucene, and the amount of revenue which depends on lucene working correctly. --Ian