[jira] Issue Comment Edited: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720379#action_12720379 ] Grant Ingersoll edited comment on LUCENE-1567 at 6/16/09 3:16 PM:

I need an MD5/SHA1 hash (http://incubator.apache.org/ip-clearance/ip-clearance-template.html) for the exact code listed in the software grant. Also include the version number of the software used to create the hash. Please also upload that code as a tarball on this issue. No need to worry about the patches for now. See https://issues.apache.org/jira/browse/INCUBATOR-77 for an example.

was (Author: gsingers): I need an MD5/SHA1 hash (http://incubator.apache.org/ip-clearance/ip-clearance-template.html) for the exact code listed in the software grant. Also include the version number of the software used to create the hash. See https://issues.apache.org/jira/browse/INCUBATOR-77 for an example.

New flexible query parser
Key: LUCENE-1567
URL: https://issues.apache.org/jira/browse/LUCENE-1567
Project: Lucene - Java
Issue Type: New Feature
Components: QueryParser
Environment: N/A
Reporter: Luis Alves
Assignee: Grant Ingersoll
Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch, QueryParser_restructure_meetup_june2009_v2.pdf

From the "New flexible query parser" thread by Michael Busch:

In my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time refactoring the code and designing a very generic architecture, so that this query parser can easily be used for different products with varying query syntaxes.

This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and is getting familiar with Lucene now too.

We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions, as they're much more familiar with the code than I am.

The goal was to separate the syntax and semantics of a query. E.g. 'a AND b', '+a +b', and 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax-compatible implementation to make it easy for people who are using Lucene's query parser to switch.

The query parser has three layers, and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b':

   AND
  /   \
 A     B

The three layers are:
1. QueryParser
2. QueryNodeProcessor
3. QueryBuilder

1. The upper layer is the parsing layer, which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use JavaCC.
2. The query node processors do most of the work. This layer is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible, e.g., to do query optimization before the query is executed or to tokenize terms.
3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects.

Furthermore, the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow resource bundles to be attached. This makes it possible to translate messages, which is an important feature of a query parser.

This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in a few days. We now have a 100% compatible Lucene query parser, which means the syntax is identical and all query parser test cases pass on the new one too, using a wrapper.

Recent posts show that there is demand for query syntax improvements, e.g. improved range query syntax or operator precedence. There are already different QP implementations in Lucene+contrib; however, I think we did not keep them all up to date and in sync. This is not too surprising, because usually when fixes and changes are made to the main query parser, people don't make the corresponding changes in the contrib parsers. (I'm guilty here too.) With this new architecture it will be much easier to maintain different query syntaxes, as the actual code for the first layer is not very much. All syntaxes would benefit from patches and improvements we make to the underlying layers, which will make supporting different syntaxes much more manageable.
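Read as pseudocode, the three layers compose into a single parse call: parse the text to a QueryNodeTree, run the processor chain over the tree, then build a Lucene Query. The sketch below is a minimal illustration of that flow; the type names (SyntaxParser, QueryNodeProcessor, QueryNodeBuilder, FlexibleQueryParser) are assumptions for illustration, not the contributed API.

{code}
// Illustrative sketch of the three-layer flow described above; the type
// names here are assumptions, not the contributed API.
import org.apache.lucene.search.Query;

interface QueryNode { /* a node in the QueryNodeTree */ }

interface SyntaxParser {                 // layer 1: text -> QueryNodeTree
  QueryNode parse(CharSequence query, String defaultField);
}

interface QueryNodeProcessor {           // layer 2: tree -> (rewritten) tree
  QueryNode process(QueryNode root);
}

interface QueryNodeBuilder {             // layer 3: tree -> Lucene Query
  Query build(QueryNode root);
}

class FlexibleQueryParser {
  private final SyntaxParser parser;
  private final java.util.List<QueryNodeProcessor> processors;
  private final QueryNodeBuilder builder;

  FlexibleQueryParser(SyntaxParser parser,
                      java.util.List<QueryNodeProcessor> processors,
                      QueryNodeBuilder builder) {
    this.parser = parser;
    this.processors = processors;
    this.builder = builder;
  }

  Query parse(String query, String defaultField) {
    QueryNode root = parser.parse(query, defaultField);   // layer 1
    for (QueryNodeProcessor p : processors) {             // layer 2
      root = p.process(root);
    }
    return builder.build(root);                           // layer 3
  }
}
{code}

A new syntax then only has to supply its own layer-1 parser; the processor and builder chains underneath are reused, which is what makes the Lucene-compatible syntax cheap to write.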
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720386#action_12720386 ] Grant Ingersoll commented on LUCENE-1567:

OK, only outstanding items for clearance are:
1. tarball and hash
2. Vote on Incubator for clearance.
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720417#action_12720417 ] Grant Ingersoll commented on LUCENE-1567:

Commit is separate from IP clearance, and you can't commit until the clearance is accepted. I just need the tarball for the code that was referenced in the software grant, along with a hash of it. In the grant, you have a file directory listing describing the code. Take that file listing, tar it up, and run md5 on it.
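The hash itself can be produced with any standard MD5 tool; for reference, the sketch below is a hypothetical Java equivalent using java.security.MessageDigest. The tarball file name is made up for illustration.

{code}
// Minimal sketch: compute the MD5 digest of a tarball. The file name is
// hypothetical; any tarball path works. Uses only java.security/java.io.
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

public class Md5OfTarball {
  public static void main(String[] args) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    InputStream in = new FileInputStream("flex-queryparser-grant.tar.gz");
    try {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md5.update(buf, 0, n);   // stream the file through the digest
      }
    } finally {
      in.close();
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }
    System.out.println(hex);
  }
}
{code}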
[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720037#action_12720037 ] Grant Ingersoll commented on MAHOUT-65:

Hey Jeff,

Minor request: it seems like you have some sort of reformatting going on that causes the patch to contain all kinds of formatting changes, which makes it a lot harder to see the actual changes.

In thinking about this a little bit more, is there a way to just name a vector and a row in a Matrix? All I really want right now is to be able to track which Vector is associated with which document, and I could do this by setting a unique name on the Vector and having that serialized. The name itself could be stored in the first entry (for SparseVector, it would have to coincide with the sCardinality stuff). I'm fine with all the other label stuff, too.

Also, the patch doesn't apply because of the JSONVectorAdapter.

Add Element Labels to Vectors and Matrices
Key: MAHOUT-65
URL: https://issues.apache.org/jira/browse/MAHOUT-65
Project: Mahout
Issue Type: New Feature
Components: Matrix
Affects Versions: 0.1
Reporter: Jeff Eastman
Assignee: Jeff Eastman
Attachments: MAHOUT-65.patch, MAHOUT-65b.patch, MAHOUT-65c.patch

Many applications can benefit from accessing elements in vectors and matrices using String labels in addition to numeric indices. Investigate adding such a capability.
[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720045#action_12720045 ] Grant Ingersoll commented on MAHOUT-65:

Is the only way to add bindings by setting the map? It seems like {code} set(String label, int index, double value) {code} would be useful, and if bindings is null, it would create a new map. Also, I'll see if I can work up the name thing.
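As a sketch of what that convenience method could look like, with the bindings map created lazily as suggested; the LabeledVector class and its fields are assumptions for illustration, not Mahout's Vector API.

{code}
// Sketch of the convenience method discussed above, on a hypothetical
// labeled-vector class; the field and class names are assumptions.
import java.util.HashMap;
import java.util.Map;

class LabeledVector {
  private final double[] values;
  private Map<String, Integer> bindings;   // label -> index, created lazily

  LabeledVector(int cardinality) {
    this.values = new double[cardinality];
  }

  /** Bind a label to an index and set the value in one call. */
  void set(String label, int index, double value) {
    if (bindings == null) {
      bindings = new HashMap<String, Integer>();  // lazily create the map
    }
    bindings.put(label, index);
    values[index] = value;
  }

  /** Look up a value by its label. */
  double get(String label) {
    if (bindings == null || !bindings.containsKey(label)) {
      throw new IllegalArgumentException("no binding for " + label);
    }
    return values[bindings.get(label)];
  }
}
{code}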
[jira] Resolved: (MAHOUT-134) [PATCH] Cluster decode error handling
[ https://issues.apache.org/jira/browse/MAHOUT-134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-134.

Resolution: Fixed
Fix Version/s: 0.2

Committed revision 785197.

[PATCH] Cluster decode error handling
Key: MAHOUT-134
URL: https://issues.apache.org/jira/browse/MAHOUT-134
Project: Mahout
Issue Type: Improvement
Affects Versions: 0.2
Reporter: Robert Burrell Donkin
Fix For: 0.2
Attachments: mahout-cluster-format-error.patch

ATM the javadocs are unclear as to whether null is an acceptable return value, and callers do not null-check the return value. However, the implementation may return null or throw other runtime exceptions when the format is not correct. This makes it hard to diagnose when there's a problem with the format.
[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720161#action_12720161 ] Grant Ingersoll commented on MAHOUT-65:

OK, I will work up a patch for the name thing on a vector, unless you think that can be handled through the bindings thing. Basically, I think we need a way to name a vector and have it carried through.
[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720199#action_12720199 ] Grant Ingersoll commented on MAHOUT-65:

That works for Matrix. For Vector, I was thinking, probably naively, we simply need to be able to add a name attribute. For MAHOUT-130, I just dumped the column/cell labels out separately. Like you said, I'm not sure we want all of that serialized.
[jira] Updated: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-65:

Attachment: MAHOUT-65-name.patch

Adds a name attribute. Also adds some docs on equals, plus a strict equivalence notion that can be useful if one cares about the implementation.
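A rough sketch of the name idea follows: a vector that carries a String name so a document id can travel with it through serialization, as asked for earlier in the thread. The class name and structure are assumptions for illustration, not the patch itself.

{code}
// Sketch of a vector that carries a String name so a document id can
// travel with it through serialization. Names and structure are assumed.
class NamedVector {
  private final String name;        // e.g. the source document's id
  private final double[] values;

  NamedVector(String name, double[] values) {
    this.name = name;
    this.values = values;
  }

  String getName() { return name; }
  double get(int index) { return values[index]; }
  int size() { return values.length; }

  /** asFormatString-style output with the name up front, so a round trip
   *  can recover which document the vector came from. */
  String asFormatString() {
    StringBuilder out = new StringBuilder("[name:").append(name);
    for (double v : values) {
      out.append(", ").append(v);
    }
    return out.append(']').toString();
  }
}
{code}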
[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720302#action_12720302 ] Grant Ingersoll commented on MAHOUT-65:

Jeff, one comment on the GSON serialization stuff. It can get pretty verbose storing the class name repeatedly, although I do realize it's a drop in the bucket compared to the vector itself. Perhaps we could do like Solr does: if some abbreviated form is present where a class name is required (maybe 'DV' or 'SV'), it could know to use those forms; otherwise it can do the full class lookup. Might save a little bit on the size of a serialized file, which I imagine can add up.
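A sketch of that Solr-style abbreviation idea follows; the 'DV'/'SV' tags and the class names in the map are assumptions for illustration.

{code}
// Sketch of the abbreviation idea above: resolve a short tag to a class
// name before falling back to a full Class.forName lookup. The 'DV'/'SV'
// tags and class names are illustrative assumptions.
import java.util.HashMap;
import java.util.Map;

final class VectorTypeResolver {
  private static final Map<String, String> ABBREVIATIONS =
      new HashMap<String, String>();
  static {
    ABBREVIATIONS.put("DV", "org.apache.mahout.matrix.DenseVector");
    ABBREVIATIONS.put("SV", "org.apache.mahout.matrix.SparseVector");
  }

  /** Resolve an abbreviated or fully qualified type tag to a Class. */
  static Class<?> resolve(String tag) throws ClassNotFoundException {
    String className = ABBREVIATIONS.get(tag);
    return Class.forName(className != null ? className : tag);
  }
}
{code}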
[jira] Updated: (MAHOUT-131) Vector improvements
[ https://issues.apache.org/jira/browse/MAHOUT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-131:

Resolution: Fixed
Status: Resolved (was: Patch Available)

Vector improvements
Key: MAHOUT-131
URL: https://issues.apache.org/jira/browse/MAHOUT-131
Project: Mahout
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 0.2
Attachments: MAHOUT-131.patch, MAHOUT-131.patch

Vector and its implementations could use a few things:
1. DenseVector should implement equals and hashCode similar to SparseVector.
2. The VectorView asFormatString() is not compatible with actually recreating any type of vector.
3. Add tests to VectorTest that assert that decodeFormat/asFormatString can do a round trip.
4. Add a static AbstractVector.equivalent(Vector, Vector) that takes in two vectors and compares them for equality, regardless of their implementation.
[jira] Updated: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-126:

Fix Version/s: 0.2
Affects Version/s: 0.2
Status: Patch Available (was: Open)

Prepare document vectors from the text
Key: MAHOUT-126
URL: https://issues.apache.org/jira/browse/MAHOUT-126
Project: Mahout
Issue Type: New Feature
Affects Versions: 0.2
Reporter: Shashikant Kore
Assignee: Grant Ingersoll
Fix For: 0.2
Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch

Clustering algorithms presently take document vectors as input. Generating these document vectors from text can be broken into two tasks:
1. Create a Lucene index of the input plain-text documents.
2. From the index, generate the (sparse) document vectors with weights as the TF-IDF values of the terms. With a Lucene index, this value can be calculated very easily.
Presently, I have created two separate utilities, which could possibly be invoked from another class.
[jira] Updated: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-126:

Attachment: MAHOUT-126.patch

Here's a version that is brought up to trunk and adds in MAHOUT-65-name.patch to allow for labeling the vectors. Next, I'm going to run the output through some clustering.
[jira] Updated: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-65:

Attachment: MAHOUT-65-name.patch

Implement hashCode better; require equals and hashCode as part of the interface, the same as java.util.List.
[jira] Updated: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-65:

Attachment: MAHOUT-65-name.patch

How about a version where the tests actually pass? Will commit shortly.
[jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
[ https://issues.apache.org/jira/browse/MAHOUT-65?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720345#action_12720345 ] Grant Ingersoll commented on MAHOUT-65:

Committed the name stuff: Committed revision 785386.
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719524#action_12719524 ] Grant Ingersoll commented on MAHOUT-126:

Yeah, still needs the labeling stuff. As for weights, you should be able to pass in a Weight object; see the TFIDF implementation. It likely still needs some work. As for the Lucene error, I thought I had updated the Lucene version to 2.9-dev, which I believe makes this all right.
[jira] Assigned: (MAHOUT-132) [PATCH] Push magic names into public constants
[ https://issues.apache.org/jira/browse/MAHOUT-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned MAHOUT-132:

Assignee: Grant Ingersoll

[PATCH] Push magic names into public constants
Key: MAHOUT-132
URL: https://issues.apache.org/jira/browse/MAHOUT-132
Project: Mahout
Issue Type: Improvement
Components: Clustering
Affects Versions: 0.2
Reporter: Robert Burrell Donkin
Assignee: Grant Ingersoll
Attachments: mahout-constants.patch

ATM the examples (and any similar code) need to hard-code magic strings for directories. This makes the code more fragile and more difficult to understand.
[jira] Resolved: (MAHOUT-132) [PATCH] Push magic names into public constants
[ https://issues.apache.org/jira/browse/MAHOUT-132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-132.

Resolution: Fixed
Fix Version/s: 0.2

Committed revision 784640.
[jira] Updated: (MAHOUT-131) Vector improvements
[ https://issues.apache.org/jira/browse/MAHOUT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-131:

Attachment: MAHOUT-131.patch

Updated patch implements equals/hashCode for all the Vectors and puts in various tests related to these issues. It should now be the case that Vector.equals() acts just like List.equals(), namely that two vectors containing the same elements are equivalent regardless of the implementation.
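The List-style contract being described can be sketched as follows: any two vectors with the same cardinality and element values are equal and hash alike, regardless of the concrete class. This mirrors the idea above, not the committed patch; AbstractVector here is a stand-in name, and a sparse implementation would iterate only its non-zero entries while keeping the same contract.

{code}
// Sketch of implementation-independent equals/hashCode in the spirit of
// java.util.List: equal cardinality and element values imply equality.
abstract class AbstractVector {
  abstract int size();
  abstract double get(int index);

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof AbstractVector)) return false;   // any implementation
    AbstractVector that = (AbstractVector) o;
    if (size() != that.size()) return false;
    for (int i = 0; i < size(); i++) {
      if (get(i) != that.get(i)) return false;
    }
    return true;
  }

  @Override
  public int hashCode() {
    int hash = 1;
    for (int i = 0; i < size(); i++) {                  // List-style rolling hash
      long bits = Double.doubleToLongBits(get(i));
      hash = 31 * hash + (int) (bits ^ (bits >>> 32));
    }
    return hash;
  }
}
{code}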
[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719117#action_12719117 ] Grant Ingersoll commented on MAHOUT-121:

bq. Would it be useful to take a shot at rewriting SparseVector to use this?

You could do that, or an alternate implementation. Is there any case where one wouldn't want this? Also, I wouldn't mind a little better name than FastIntDouble. ;-)

Speed up distance calculations for sparse vectors
Key: MAHOUT-121
URL: https://issues.apache.org/jira/browse/MAHOUT-121
Project: Mahout
Issue Type: Improvement
Components: Matrix
Reporter: Shashikant Kore
Attachments: mahout-121.patch

From my mail to the Mahout mailing list:

I am working on clustering a dataset which has thousands of sparse vectors. The complete dataset has a few tens of thousands of feature items, but each vector has only a couple of hundred feature items. For this, there is an optimization in distance calculation, a link to which I found in the archives of the Mahout mailing list: http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/

I tried out this optimization. The test setup had 2000 document vectors with a few hundred items each. I ran canopy generation with Euclidean distance and t1, t2 values of 250 and 200.

Current canopy generation: 28 min 15 sec.
Canopy generation with distance optimization: 1 min 38 sec.

I know by experience that using Integer and Double objects instead of primitives is computationally expensive. I changed the sparse vector implementation to use primitive collections from Trove [ http://trove4j.sourceforge.net/ ]:

Distance optimization with Trove: 59 sec
Current canopy generation with Trove: 21 min 55 sec

To sum up, these two optimizations reduced cluster generation time by 97%. Currently, I have made the changes for Euclidean distance, Canopy, and KMeans. Licensing of Trove seems to be an issue which needs to be addressed.
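The optimization in the linked post amounts to expanding the squared Euclidean distance, ||x - y||^2 = ||x||^2 + ||y||^2 - 2(x . y), caching each vector's squared norm, and computing the dot product only over non-zero entries. A minimal sketch follows; plain sorted index/value arrays stand in for a Trove-style primitive int-to-double map.

{code}
// Minimal sketch of the optimization behind the linked post: expand
// ||x - y||^2 = ||x||^2 + ||y||^2 - 2*(x . y), cache the squared norms,
// and compute the dot product only over non-zero entries.
final class SparseDistance {

  /** Squared L2 norm over the non-zero values; cache this per vector. */
  static double normSquared(double[] values) {
    double sum = 0.0;
    for (double v : values) {
      sum += v * v;
    }
    return sum;
  }

  /** Sparse dot product over two sorted index arrays (merge-style walk). */
  static double dot(int[] xIdx, double[] xVal, int[] yIdx, double[] yVal) {
    double sum = 0.0;
    int i = 0, j = 0;
    while (i < xIdx.length && j < yIdx.length) {
      if (xIdx[i] == yIdx[j]) {
        sum += xVal[i++] * yVal[j++];
      } else if (xIdx[i] < yIdx[j]) {
        i++;
      } else {
        j++;
      }
    }
    return sum;
  }

  /** Euclidean distance using cached norms; cost is O(nnz), not O(cardinality). */
  static double distance(double xNormSq, int[] xIdx, double[] xVal,
                         double yNormSq, int[] yIdx, double[] yVal) {
    double d2 = xNormSq + yNormSq - 2.0 * dot(xIdx, xVal, yIdx, yVal);
    return Math.sqrt(Math.max(0.0, d2));  // clamp tiny negatives from rounding
  }
}
{code}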
[jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
[ https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719118#action_12719118 ] Grant Ingersoll commented on MAHOUT-121:

Also, it seems like we could split out the original two issues that Shashikant brought up, right?
[jira] Created: (LUCENE-1687) Remove ExtendedFieldCache by rolling functionality into FieldCache
Remove ExtendedFieldCache by rolling functionality into FieldCache
Key: LUCENE-1687
URL: https://issues.apache.org/jira/browse/LUCENE-1687
Project: Lucene - Java
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 2.9

It is silly that we have ExtendedFieldCache. It is a workaround for our supposed back-compatibility problem. This patch will merge the ExtendedFieldCache interface into FieldCache, thereby breaking back compatibility but creating a much simpler API for FieldCache.
[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream
[ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718815#action_12718815 ] Grant Ingersoll commented on LUCENE-1676:

OK, I moved it to contrib/CHANGES. I'm going to commit this today.

New Token filter for adding payloads in-stream
Key: LUCENE-1676
URL: https://issues.apache.org/jira/browse/LUCENE-1676
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/analyzers
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 2.9
Attachments: LUCENE-1676.patch

This TokenFilter is able to split a token based on a delimiter, using one part as the token and the other part as a payload. This allows someone to include payloads inline with tokens (presumably set up by a pipeline ahead of time). An example is apropos. Given a | delimiter, we could have a stream that looks like:

{quote}The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ dogs|NN{quote}

In this case, this would produce the following tokens and payloads (assuming whitespace tokenization): Token: the, Payload: null; Token: quick, Payload: JJ; Token: red, Payload: JJ; and so on. This patch also supports pluggable encoders for the payloads, so it can convert from the character array to byte arrays as appropriate.
[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream
[ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718817#action_12718817 ] Grant Ingersoll commented on LUCENE-1676:

BTW, I'm curious whether people have a better way to convert from char[] to byte[] for encoding the payloads (see FloatEncoder), other than going through Strings.
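For concreteness, the through-String route in question looks roughly like this; the class name is illustrative, not the patch's FloatEncoder itself.

{code}
// Minimal sketch of the through-String route being asked about: parse the
// char[] slice to a float, then emit its 4 IEEE-754 bytes big-endian.
final class SimpleFloatEncoder {
  static byte[] encode(char[] buffer, int offset, int length) {
    // The String round trip this comment thread hopes to avoid:
    float value = Float.parseFloat(new String(buffer, offset, length));
    int bits = Float.floatToIntBits(value);
    return new byte[] {
        (byte) (bits >>> 24),
        (byte) (bits >>> 16),
        (byte) (bits >>> 8),
        (byte) bits
    };
  }
}
{code}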
[jira] Commented: (LUCENE-1687) Remove ExtendedFieldCache by rolling functionality into FieldCache
[ https://issues.apache.org/jira/browse/LUCENE-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718845#action_12718845 ] Grant Ingersoll commented on LUCENE-1687:

True, but you know how we are about adding methods to an interface!
[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream
[ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718943#action_12718943 ] Grant Ingersoll commented on LUCENE-1676:

I grabbed Apache Harmony's Integer.parseInt() code and converted it to take in a char array, which should speed up the IntegerEncoder. However, the Float.parseFloat implementation relies on some constructs that are not available in JDK 1.4, so that one is going to have to stay as it is. The main problem lies in its reliance on HexStringParser (https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/org/apache/harmony/luni/util/HexStringParser.java), which needs some Long-specific attributes that are either newer than JDK 1.4 or are Harmony-specific attributes of Long (I didn't take the time to investigate). At any rate, I added the Integer stuff to ArrayUtils and also added some tests. For reference, see:
https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/org/apache/harmony/luni/util/FloatingPointParser.java
https://svn.apache.org/repos/asf/harmony/enhanced/classlib/archive/java6/modules/luni/src/main/java/java/lang/Integer.java
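The point of the char[] conversion is to skip the new String(...) allocation per token. Below is a simplified, decimal-only sketch of the idea; the Harmony original also handles radix, overflow, and other edge cases.

{code}
// Simplified sketch of a char[]-based parseInt: parse directly from the
// token buffer without a new String(...) allocation. Decimal only.
final class CharArrayInts {
  static int parseInt(char[] buffer, int offset, int length) {
    if (length <= 0) {
      throw new NumberFormatException("empty input");
    }
    int i = offset;
    boolean negative = buffer[i] == '-';
    if (negative) {
      i++;
    }
    int result = 0;
    for (int end = offset + length; i < end; i++) {
      char c = buffer[i];
      if (c < '0' || c > '9') {
        throw new NumberFormatException("bad digit: " + c);
      }
      result = result * 10 + (c - '0');   // no overflow checks in this sketch
    }
    return negative ? -result : result;
  }
}
{code}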
[jira] Resolved: (LUCENE-1676) New Token filter for adding payloads in-stream
[ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved LUCENE-1676.

Resolution: Fixed
Lucene Fields: (was: [New])

Committed revision 784297.
[jira] Commented: (LUCENE-1676) New Token filter for adding payloads in-stream
[ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718447#action_12718447 ] Grant Ingersoll commented on LUCENE-1676:

bq. Shouldn't the CHANGES entry in this patch go into contrib/CHANGES?

It can; I've never quite been sure. I think more people read the top-level CHANGES, thus it is more likely to be noticed, but I'm fine either way.
[jira] Commented: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718450#action_12718450 ] Grant Ingersoll commented on LUCENE-979:

bq. What are the old benchmark utilities?

It's like one class from before Doron's Task-oriented approach. I believe it's called Benchmark.java and was only able to do a few benchmarking tasks.

Remove Deprecated Benchmarking Utilities from contrib/benchmark
Key: LUCENE-979
URL: https://issues.apache.org/jira/browse/LUCENE-979
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Reporter: Grant Ingersoll
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9

The old benchmark utilities in contrib/benchmark have been deprecated and should be removed in Lucene 2.9.
[jira] Commented: (LUCENE-979) Remove Deprecated Benchmarking Utilities from contrib/benchmark
[ https://issues.apache.org/jira/browse/LUCENE-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718463#action_12718463 ] Grant Ingersoll commented on LUCENE-979:

Yes.
[jira] Updated: (MAHOUT-131) Vector improvements
[ https://issues.apache.org/jira/browse/MAHOUT-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-131:

Attachment: MAHOUT-131.patch

Some minor changes to Vector, etc.
[jira] Commented: (SOLR-1209) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718507#action_12718507 ] Grant Ingersoll commented on SOLR-1209:

bq. Just a small doubt - I assume Google shares revenue generated from clicks on the site search page with ASF. Are we sure this is not affecting ASF money-wise?

They don't share the revenue. All the Google box is right now is a Forrest auto-generated, default plugin.

Site search powered by Lucene/Solr
Key: SOLR-1209
URL: https://issues.apache.org/jira/browse/SOLR-1209
Project: Solr
Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Attachments: SOLR-1209.patch

For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Solr content from a single place, including web, wiki, JIRA, and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org

A preview of the site (for Mahout) is available at http://people.apache.org/~gsingers/mahout/site/publish/

Lucid has a fault-tolerant setup with replication and failover, as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds a skin to the Forrest site that enables the Solr site to search Solr-only content using Lucene/Solr. When a search is submitted, it automatically selects the Solr facet so that only Solr content is searched. From there, users can then narrow/broaden their search criteria.

I'm submitting this patch to Solr first, as we'd like to roll out our capabilities to some of the smaller communities first and then broaden to the rest of the Lucene ecosystem. I plan on committing in 3 or 4 days.
[jira] Created: (MAHOUT-131) Vector improvements
Vector improvements
Key: MAHOUT-131
URL: https://issues.apache.org/jira/browse/MAHOUT-131
Project: Mahout
Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 0.2
[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717819#action_12717819 ] Grant Ingersoll commented on LUCENE-1678:

I frankly don't like renaming something like this. This is, once again, a case of back compatibility biting us. If, instead of working around back compat, we had just made Analyzer.tokenStream be reusable, we wouldn't have to do this. Now, instead, we are going to have a convoluted name for something (reusableTS). In my mind, it's better to just make .tokenStream do the right thing and get rid of reusableTokenStream.

Deprecate Analyzer.tokenStream
Key: LUCENE-1678
URL: https://issues.apache.org/jira/browse/LUCENE-1678
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Fix For: 2.9

The addition of reusableTokenStream to the core analyzers unfortunately broke back compat of external subclasses: http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html

On upgrading, such subclasses would silently not be used anymore, since Lucene's indexing invokes reusableTokenStream. I think we should at least deprecate Analyzer.tokenStream, today, so that users see deprecation warnings if their classes override this method. But going forward, when we want to change the API of core classes that are extended, I think we have to introduce entirely new classes to keep back compatibility.
[jira] Commented: (LUCENE-1678) Deprecate Analyzer.tokenStream
[ https://issues.apache.org/jira/browse/LUCENE-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717888#action_12717888 ] Grant Ingersoll commented on LUCENE-1678: - bq. If there are sane/smart ways to change our back compat policy, I think you have seen that no one would object. The sane/smart way is to do it on a case-by-case basis. Here is a specific case. Generalizing it a bit, the places where it should be more easily relaxed are the cases where we know very few people make customizations, as in implementing Fieldable or FieldCache. As for this specific case, the original change was the thing that broke back compat. So, given it is already broken, why not fix it the right way? Deprecate Analyzer.tokenStream -- Key: LUCENE-1678 URL: https://issues.apache.org/jira/browse/LUCENE-1678 Project: Lucene - Java Issue Type: Bug Components: Analysis Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 The addition of reusableTokenStream to the core analyzers unfortunately broke back compat of external subclasses: http://www.nabble.com/Extending-StandardAnalyzer-considered-harmful-td23863822.html On upgrading, such subclasses would silently not be used anymore, since Lucene's indexing invokes reusableTokenStream. I think we should at least deprecate Analyzer.tokenStream, today, so that users see deprecation warnings if their classes override this method. But going forward, when we want to change the API of core classes that are extended, I think we have to introduce entirely new classes to keep back compatibility. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm
[ https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-130. Resolution: Fixed Committed Ted's patch Vector should allow for other normalize powers than the L-2 norm Key: MAHOUT-130 URL: https://issues.apache.org/jira/browse/MAHOUT-130 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-130-both-ways.patch, MAHOUT-130-slight-tweaks.patch, MAHOUT-130.patch, MAHOUT-130.patch Modify Vector to allow other normalize functions for the Vector See http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
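For readers following the patch trail below, a minimal sketch of what a generalized norm(power) can look like, using a plain double[] as a stand-in for the Vector implementations; the infinity and zero-norm cases match the later patch comment on this issue:
{code:java}
// Generalized p-norm sketch (double[] stands in for a Mahout Vector).
static double norm(double[] v, double power) {
  if (power < 0) {
    throw new IllegalArgumentException("power must be non-negative");
  }
  if (Double.isInfinite(power)) { // L-infinity norm: max absolute value
    double max = 0.0;
    for (double x : v) max = Math.max(max, Math.abs(x));
    return max;
  }
  if (power == 0.0) { // "L0 norm": count of non-zero elements
    int nonZero = 0;
    for (double x : v) if (x != 0.0) nonZero++;
    return nonZero;
  }
  double val = 0.0; // accumulator must start at 0 (see the bug note below)
  for (double x : v) val += Math.pow(Math.abs(x), power);
  return Math.pow(val, 1.0 / power);
}
{code}
A normalize(power) method would then divide each element by this value; with power = 2 it reduces to the existing L-2 behavior.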
[jira] Updated: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-126: --- Attachment: MAHOUT-126.patch Here's a first attempt at my thoughts based on the two previous patches, plus some other ideas. The main gist of the idea centers around the VectorIterable interface and is driven by the o.a.mahout.utils.vectors.Driver class. Note, I dropped the Lucene indexing part, as I don't think we need to be in the game of creating Lucene indexes. That is a well-known and well-documented process that is available elsewhere. In fact, for this particular piece, I indexed Wikipedia in Solr and then pointed the Driver class at the Lucene index. See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text for details on usage. Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Reporter: Shashikant Kore Assignee: Grant Ingersoll Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks. 1. Create a Lucene index of the input plain-text documents 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With a Lucene index, this value can be calculated very easily. Presently, I have created two separate utilities, which could possibly be invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-108) Implementation of Association Rules learning by Apriori algorithm
[ https://issues.apache.org/jira/browse/MAHOUT-108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717245#action_12717245 ] Grant Ingersoll commented on MAHOUT-108: http://cwiki.apache.org/MAHOUT/howtocontribute.html Implementation of Association Rules learning by Apriori algorithm - Key: MAHOUT-108 URL: https://issues.apache.org/jira/browse/MAHOUT-108 Project: Mahout Issue Type: Task Environment: Linux, Hadoop-0.17.1 Reporter: chao deng Fix For: 0.2 Original Estimate: 504h Remaining Estimate: 504h Target: Association Rules learning is a popular method for discovering interesting relations between variables in large databases. Here, we would implement the Apriori algorithm using Hadoop MapReduce parallel techniques. Applications: Typically, association rules learning is used to discover regularities between products in large scale transaction data in supermarkets. For example, the rule {onions, potatoes} => beef found in the sales data would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef. Such information can be used as the basis for decisions about marketing activities. In addition to the market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection and bioinformatics. Apriori algorithm: Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
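To make the algorithm concrete, here is an illustrative single-machine sketch of one Apriori level (counting candidate support and keeping the frequent itemsets); the proposed Hadoop MapReduce version would distribute exactly this counting step:
{code:java}
import java.util.*;

public class AprioriPass {
  // One breadth-first level: count each candidate's support over the
  // transactions and keep those meeting minSupport. By downward closure,
  // only these survivors seed the next level's candidate generation.
  static Set<Set<String>> frequent(List<Set<String>> transactions,
                                   Set<Set<String>> candidates,
                                   int minSupport) {
    Map<Set<String>, Integer> counts = new HashMap<Set<String>, Integer>();
    for (Set<String> t : transactions) {
      for (Set<String> c : candidates) {
        if (t.containsAll(c)) { // candidate itemset occurs in this transaction
          Integer n = counts.get(c);
          counts.put(c, n == null ? 1 : n + 1);
        }
      }
    }
    Set<Set<String>> result = new HashSet<Set<String>>();
    for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) {
        result.add(e.getKey());
      }
    }
    return result;
  }
}
{code}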
[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm
[ https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-130: --- Attachment: MAHOUT-130.patch Draft. Not sure if the optimizations make sense or not, but I think they do. Patch applies in the core directory, not the top level. Vector should allow for other normalize powers than the L-2 norm Key: MAHOUT-130 URL: https://issues.apache.org/jira/browse/MAHOUT-130 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-130.patch Modify Vector to allow other normalize functions for the Vector See http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm
[ https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-130: --- Attachment: (was: MAHOUT-130.patch) Vector should allow for other normalize powers than the L-2 norm Key: MAHOUT-130 URL: https://issues.apache.org/jira/browse/MAHOUT-130 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-130.patch Modify Vector to allow other normalize functions for the Vector See http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm
[ https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-130: --- Attachment: MAHOUT-130.patch No reason the power needs to be int only, I suppose. Vector should allow for other normalize powers than the L-2 norm Key: MAHOUT-130 URL: https://issues.apache.org/jira/browse/MAHOUT-130 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-130.patch Modify Vector to allow other normalize functions for the Vector See http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm
[ https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717511#action_12717511 ] Grant Ingersoll commented on MAHOUT-130: D'oh. My bad. I had initialized val = 1 instead of val = 0. All looks good now. Vector should allow for other normalize powers than the L-2 norm Key: MAHOUT-130 URL: https://issues.apache.org/jira/browse/MAHOUT-130 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-130.patch Modify Vector to allow other normalize functions for the Vector See http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-130) Vector should allow for other normalize powers than the L-2 norm
[ https://issues.apache.org/jira/browse/MAHOUT-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-130: --- Attachment: MAHOUT-130.patch Adds the 0 norm and the infinity norm, incorporates Ted and David's suggestions, and adds maxValue and maxValueIndex methods. Vector should allow for other normalize powers than the L-2 norm Key: MAHOUT-130 URL: https://issues.apache.org/jira/browse/MAHOUT-130 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-130-both-ways.patch, MAHOUT-130.patch, MAHOUT-130.patch Modify Vector to allow other normalize functions for the Vector See http://www.lucidimagination.com/search/document/bf3a7a7a004d4191/norm_calculations -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1208) The Default SearchComponents (QueryComponent, etc.) cannot currently support SolrCoreAware or ResourceLoaderAware
The Default SearchComponents (QueryComponent, etc.) cannot currently support SolrCoreAware or ResourceLoaderAware - Key: SOLR-1208 URL: https://issues.apache.org/jira/browse/SOLR-1208 Project: Solr Issue Type: Bug Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 The Default SearchComponents are not instantiated via the SolrResourceLoader and are thus not put in the waiting lists for SolrCoreAware and ResourceLoaderAware. Thus, they are not constructed in the same way that other SearchComponents are. See http://www.lucidimagination.com/search/document/ef69fdc7dfb17428/default_searchcomponents_and_solrcoreaware -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
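For concreteness, a rough sketch of the kind of component affected follows; the signatures are recalled from the Solr 1.4-era API and may be inexact, so treat this as illustration only:
{code:java}
import java.io.IOException;
import org.apache.solr.core.SolrCore;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.util.plugin.SolrCoreAware;

// If a component like this is wired in as a default (i.e. not created via
// the SolrResourceLoader), inform(SolrCore) is never called and core stays null.
public class CoreAwareComponent extends SearchComponent implements SolrCoreAware {
  private SolrCore core;

  public void inform(SolrCore core) {
    this.core = core; // normally the place to grab core-level resources
  }

  public void prepare(ResponseBuilder rb) throws IOException { }
  public void process(ResponseBuilder rb) throws IOException { }
  public String getDescription() { return "sketch"; }
  // remaining SolrInfoMBean boilerplate omitted for brevity
}
{code}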
[jira] Created: (SOLR-1209) Site search powered by Lucene/Solr
Site search powered by Lucene/Solr -- Key: SOLR-1209 URL: https://issues.apache.org/jira/browse/SOLR-1209 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Solr content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org A preview of the site (for Mahout) is available at http://people.apache.org/~gsingers/mahout/site/publish/ Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds a skin to the Forrest site that enables the Solr site to search Solr-only content using Lucene/Solr. When a search is submitted, it automatically selects the Solr facet such that only Solr content is searched. From there, users can then narrow/broaden their search criteria. I'm submitting this patch to Solr first, as we'd like to roll out our capabilities to some of the smaller communities first and then broaden to the rest of the Lucene ecosystem. I plan on committing in 3 or 4 days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1209) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-1209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-1209: -- Attachment: SOLR-1209.patch Patch changing the skin to use Solr-powered search. Site search powered by Lucene/Solr -- Key: SOLR-1209 URL: https://issues.apache.org/jira/browse/SOLR-1209 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: SOLR-1209.patch For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Solr content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org A preview of the site (for Mahout) is available at http://people.apache.org/~gsingers/mahout/site/publish/ Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds a skin to the Forrest site that enables the Solr site to search Solr-only content using Lucene/Solr. When a search is submitted, it automatically selects the Solr facet such that only Solr content is searched. From there, users can then narrow/broaden their search criteria. I'm submitting this patch to Solr first, as we'd like to roll out our capabilities to some of the smaller communities first and then broaden to the rest of the Lucene ecosystem. I plan on committing in 3 or 4 days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716311#action_12716311 ] Grant Ingersoll commented on LUCENE-1567: - The software grant has been received and filed. I will update the paperwork and work to finish this out next week, such that we can then work to commit it. New flexible query parser - Key: LUCENE-1567 URL: https://issues.apache.org/jira/browse/LUCENE-1567 Project: Lucene - Java Issue Type: New Feature Components: QueryParser Environment: N/A Reporter: Luis Alves Assignee: Grant Ingersoll Attachments: lucene_trunk_FlexQueryParser_2009March24.patch, lucene_trunk_FlexQueryParser_2009March26_v3.patch From New flexible query parser thread by Michael Busch in my team at IBM we have used a different query parser than Lucene's in our products for quite a while. Recently we spent a significant amount of time refactoring the code and designing a very generic architecture, so that this query parser can be easily used for different products with varying query syntaxes. This work was originally driven by Andreas Neumann (who, however, left our team); most of the code was written by Luis Alves, who has been a bit active in Lucene in the past, and Adriano Campos, who joined our team at IBM half a year ago. Adriano is an Apache committer and PMC member on the Tuscany project and getting familiar with Lucene now too. We think this code is much more flexible and extensible than the current Lucene query parser, and would therefore like to contribute it to Lucene. I'd like to give a very brief architecture overview here; Adriano and Luis can then answer more detailed questions as they're much more familiar with the code than I am. The goal was to separate syntax and semantics of a query. E.g. 'a AND b', '+a +b', 'AND(a,b)' could be different syntaxes for the same query. We distinguish the semantics of the different query components, e.g. whether and how to tokenize/lemmatize/normalize the different terms or which Query objects to create for the terms. We wanted to be able to write a parser with a new syntax, while reusing the underlying semantics, as quickly as possible. In fact, Adriano is currently working on a 100% Lucene-syntax compatible implementation to make it easy for people who are using Lucene's query parser to switch. The query parser has three layers and its core is what we call the QueryNodeTree. It is a tree that initially represents the syntax of the original query, e.g. for 'a AND b': AND / \ A B The three layers are: 1. QueryParser 2. QueryNodeProcessor 3. QueryBuilder 1. The upper layer is the parsing layer which simply transforms the query text string into a QueryNodeTree. Currently our implementations of this layer use javacc. 2. The query node processors do most of the work. It is in fact a configurable chain of processors. Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible to e.g. do query optimization before the query is executed or to tokenize terms. 3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects. Furthermore the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow attaching resource bundles. This makes it possible to translate messages, which is an important feature of a query parser. This design allows us to develop different query syntaxes very quickly.
Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in a few days. We now have a 100% compatible Lucene query parser, which means the syntax is identical and all query parser test cases pass on the new one too using a wrapper. Recent posts show that there is demand for query syntax improvements, e.g. improved range query syntax or operator precedence. There are already different QP implementations in Lucene+contrib; however, I think we did not keep them all up to date and in sync. This is not too surprising, because usually when fixes and changes are made to the main query parser, people don't make the corresponding changes in the contrib parsers. (I'm guilty here too.) With this new architecture it will be much easier to maintain different query syntaxes, as the actual code for the first layer is quite small. All syntaxes would benefit from patches and improvements we make to the underlying layers, which will make supporting different syntaxes much more manageable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-128) maven parent not included in build
[ https://issues.apache.org/jira/browse/MAHOUT-128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-128. Resolution: Fixed Committed. maven parent not included in build -- Key: MAHOUT-128 URL: https://issues.apache.org/jira/browse/MAHOUT-128 Project: Mahout Issue Type: Bug Reporter: Benson Margulies Attachments: pom.diff The maven parent isn't included as a module, so its pom isn't installed, so building some other project that depends on mahout-core fails. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (LUCENE-1676) New Token filter for adding payloads in-stream
New Token filter for adding payloads in-stream Key: LUCENE-1676 URL: https://issues.apache.org/jira/browse/LUCENE-1676 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 2.9 This TokenFilter is able to split a token based on a delimiter and use one part as the token and the other part as a payload. This allows someone to include payloads inline with tokens (presumably set up by a pipeline ahead of time). An example is apropos. Given a | delimiter, we could have a stream that looks like: {quote}The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ dogs|NN{quote} In this case, this would produce tokens and payloads (assuming whitespace tokenization): Token: the Payload: null Token: quick Payload: JJ Token: red Payload: JJ, and so on. This patch will also support pluggable encoders for the payloads, so it can convert from the character array to byte arrays as appropriate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
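A minimal sketch of the filter described above, written against the Lucene 2.9 attribute API; the class name and internals here are assumptions for illustration, not the committed patch:
{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public class DelimitedPayloadFilterSketch extends TokenFilter {
  private final char delimiter;
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

  protected DelimitedPayloadFilterSketch(TokenStream input, char delimiter) {
    super(input);
    this.delimiter = delimiter;
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    String term = termAtt.term();
    int pos = term.indexOf(delimiter);
    if (pos >= 0) {
      // e.g. "quick|JJ" -> term "quick", payload bytes for "JJ"
      payAtt.setPayload(new Payload(term.substring(pos + 1).getBytes("UTF-8")));
      termAtt.setTermBuffer(term.substring(0, pos));
    } else {
      payAtt.setPayload(null); // e.g. "the" -> no payload
    }
    return true;
  }
}
{code}
A pluggable encoder, as the issue proposes, would replace the hard-coded getBytes("UTF-8") call used here purely for illustration.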
[jira] Updated: (LUCENE-1676) New Token filter for adding payloads in-stream
[ https://issues.apache.org/jira/browse/LUCENE-1676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-1676: Attachment: LUCENE-1676.patch Here's a first draft of this. See the test case for an example. New Token filter for adding payloads in-stream Key: LUCENE-1676 URL: https://issues.apache.org/jira/browse/LUCENE-1676 Project: Lucene - Java Issue Type: New Feature Components: contrib/analyzers Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 2.9 Attachments: LUCENE-1676.patch This TokenFilter is able to split a token based on a delimiter and use one part as the token and the other part as a payload. This allows someone to include payloads inline with tokens (presumably set up by a pipeline ahead of time). An example is apropos. Given a | delimiter, we could have a stream that looks like: {quote}The quick|JJ red|JJ fox|NN jumped|VB over the lazy|JJ brown|JJ dogs|NN{quote} In this case, this would produce tokens and payloads (assuming whitespace tokenization): Token: the Payload: null Token: quick Payload: JJ Token: red Payload: JJ, and so on. This patch will also support pluggable encoders for the payloads, so it can convert from the character array to byte arrays as appropriate. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Assigned: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned MAHOUT-126: -- Assignee: Grant Ingersoll Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Reporter: Shashikant Kore Assignee: Grant Ingersoll Attachments: MAHOUT-126.patch Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks. 1. Create a Lucene index of the input plain-text documents 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With a Lucene index, this value can be calculated very easily. Presently, I have created two separate utilities, which could possibly be invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714414#action_12714414 ] Grant Ingersoll commented on MAHOUT-126: See SOLR-1193. Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Reporter: Shashikant Kore Assignee: Grant Ingersoll Attachments: MAHOUT-126.patch Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks. 1. Create a Lucene index of the input plain-text documents 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With a Lucene index, this value can be calculated very easily. Presently, I have created two separate utilities, which could possibly be invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714509#action_12714509 ] Grant Ingersoll commented on MAHOUT-126: So just kind of brainstorming here, but I think we should create a separate module for this kind of stuff, to keep it out of core and give us some more flexibility with regard to dependencies, etc. Also (and I realize this is just a start patch), I think we should assume a Lucene index exists already instead of maintaining code to actually create an index. There are a lot of ways to do that and people will likely have different fields, etc. For instance, Solr can provide all of the capabilities here and it has distributed support, so it can scale. Moreover, people may have the info in a DB or in other places. I realize we need baby steps, but... I'll try to post a patch this afternoon that takes this effort and melds it with some of my ideas for demo purposes. Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Reporter: Shashikant Kore Assignee: Grant Ingersoll Attachments: mahout-126-benson.patch, MAHOUT-126.patch Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks. 1. Create a Lucene index of the input plain-text documents 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With a Lucene index, this value can be calculated very easily. Presently, I have created two separate utilities, which could possibly be invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-126) Prepare document vectors from the text
[ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714515#action_12714515 ] Grant Ingersoll commented on MAHOUT-126: Shashikant, A couple of comments on the Lucene-specific stuff, though, so that you guys can speed up what you have. First off, have a look at Lucene's support of TermVectorMapper. Much like SAX, it gives you a callback mechanism such that you don't have to construct two different data structures (i.e. many people incorrectly use the DOM to parse XML and then extract out of the DOM into their own data structure when they should use SAX instead). You might have a look at the TermVectorComponent in Solr, as it pretty much does what you are looking to do in this patch and I believe it to be more efficient. Seems like we should be able to avoid caching the whole term list in memory. At a minimum, if you are going to, allTerms should be a Map<String, Integer> that stores the term and its DF (doc freq.), as you are currently doing the DF lookup twice, AFAICT. DF lookup is expensive in Lucene. If you don't cache the whole list, we should at least have an LRU cache for DF (a sketch follows below). Prepare document vectors from the text -- Key: MAHOUT-126 URL: https://issues.apache.org/jira/browse/MAHOUT-126 Project: Mahout Issue Type: New Feature Reporter: Shashikant Kore Assignee: Grant Ingersoll Attachments: mahout-126-benson.patch, MAHOUT-126.patch Clustering algorithms presently take the document vectors as input. Generating these document vectors from the text can be broken into two tasks. 1. Create a Lucene index of the input plain-text documents 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With a Lucene index, this value can be calculated very easily. Presently, I have created two separate utilities, which could possibly be invoked from another class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
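Since an LRU cache for document frequency is suggested above, here is a minimal sketch built on LinkedHashMap's access-order mode; the docFreq lookup that would populate it is assumed, not shown:
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache mapping term text to its document frequency.
public class DfCache extends LinkedHashMap<String, Integer> {
  private final int maxEntries;

  public DfCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
    this.maxEntries = maxEntries;
  }

  protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
    return size() > maxEntries; // evict the least-recently-used entry
  }
}
{code}
On a cache miss, the caller would do the (expensive) IndexReader.docFreq lookup once and put the result in the map.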
[jira] Resolved: (MAHOUT-120) Site search powered by Solr
[ https://issues.apache.org/jira/browse/MAHOUT-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-120. Resolution: Fixed Committed revision 779594. Site search powered by Solr --- Key: MAHOUT-120 URL: https://issues.apache.org/jira/browse/MAHOUT-120 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-120.patch, MAHOUT-120.patch For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Mahout content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org A preview of the site is available at http://people.apache.org/~gsingers/mahout/site/publish/ Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds a skin to the Forrest site that enables the Mahout site to search Mahout-only content using Lucene/Solr. When a search is submitted, it automatically selects the Mahout facet such that only Mahout content is searched. From there, users can then narrow/broaden their search criteria. I'm submitting this patch to Mahout first, as we'd like to roll out our capabilities to some of the smaller communities first and then broaden to the rest of the Lucene ecosystem. I plan on committing in 3 or 4 days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1193) Provide option for TermVectorComponent to provide a way of retrieving TV info around a position instead of the whole vector
Provide option for TermVectorComponent to provide a way of retrieving TV info around a position instead of the whole vector --- Key: SOLR-1193 URL: https://issues.apache.org/jira/browse/SOLR-1193 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor It's often useful to retrieve TermVector information around (within some user-specified window) a specific position or offset. The TermVectorComponent can easily be modified to use a TermVectorMapper that is aware of position/offset information and only returns term info within the window. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
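A sketch of the suggested mapper follows; the setExpectations/map signatures follow the 2.9-era TermVectorMapper API as best recalled, and the collection of matched terms is elided:
{code:java}
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

// Keeps only terms with at least one position inside
// [center - window, center + window].
public class WindowTermVectorMapper extends TermVectorMapper {
  private final int center;
  private final int window;

  public WindowTermVectorMapper(int center, int window) {
    this.center = center;
    this.window = window;
  }

  public void setExpectations(String field, int numTerms,
                              boolean storeOffsets, boolean storePositions) {
    // nothing to pre-allocate in this sketch
  }

  public void map(String term, int frequency,
                  TermVectorOffsetInfo[] offsets, int[] positions) {
    if (positions == null) return; // positions are required for a window
    for (int p : positions) {
      if (Math.abs(p - center) <= window) {
        // term occurs inside the window; record it (storage omitted)
        break;
      }
    }
  }
}
{code}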
[jira] Created: (TIKA-235) Site search powered by Lucene/Solr
Site search powered by Lucene/Solr -- Key: TIKA-235 URL: https://issues.apache.org/jira/browse/TIKA-235 Project: Tika Issue Type: New Feature Reporter: Grant Ingersoll Priority: Minor For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://search.lucidimagination.com/) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Tika content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org A sample of what it _might_ look like is at http://people.apache.org/~gsingers/tika/ Note, however, I am not entirely sure how Tika deploys just yet, so there are a few issues w/ the display. Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds the basics to Tika to support the search, but isn't entirely done yet b/c I'm not sure what look and feel Tika wants. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-235) Site search powered by Lucene/Solr
[ https://issues.apache.org/jira/browse/TIKA-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated TIKA-235: - Attachment: TIKA-235.patch First draft of a patch. See also MAHOUT-120. Site search powered by Lucene/Solr -- Key: TIKA-235 URL: https://issues.apache.org/jira/browse/TIKA-235 Project: Tika Issue Type: New Feature Reporter: Grant Ingersoll Priority: Minor Attachments: TIKA-235.patch For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://search.lucidimagination.com/) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Tika content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org A sample of what it _might_ look like is at http://people.apache.org/~gsingers/tika/ Note, however, I am not entirely sure how Tika deploys just yet, so there are a few issues w/ the display. Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds the basics to Tika to support the search, but isn't entirely done yet b/c I'm not sure what look and feel Tika wants. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (MAHOUT-63) Self Organizing Map
[ https://issues.apache.org/jira/browse/MAHOUT-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved MAHOUT-63. --- Resolution: Duplicate See MAHOUT-64. Self Organizing Map --- Key: MAHOUT-63 URL: https://issues.apache.org/jira/browse/MAHOUT-63 Project: Mahout Issue Type: New Feature Components: Classification Reporter: Farid Bourennani Priority: Minor Original Estimate: 120h Remaining Estimate: 120h Implementation of Kohonen's self-organizing map algorithm. Execution: run the SOMViewer; it takes 300 iterations. - The algorithm is too slow because of: GUI: the current one is temporary and should be replaced by the prefuse library as suggested by Ted. Self-Organizing Maps: the batch algorithm is faster than the sequential one that I am currently using. - Documentation needs to be completed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAHOUT-66) EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not optimized for Sparse Vectors
[ https://issues.apache.org/jira/browse/MAHOUT-66?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12713660#action_12713660 ] Grant Ingersoll commented on MAHOUT-66: --- Can this be closed? EuclideanDistanceMeasure and ManhattanDistanceMeasure classes are not optimized for Sparse Vectors -- Key: MAHOUT-66 URL: https://issues.apache.org/jira/browse/MAHOUT-66 Project: Mahout Issue Type: Improvement Components: Clustering Reporter: Pallavi Palleti Priority: Minor Attachments: MAHOUT-66.patch, MAHOUT-66.patch, MAHOUT-66.patch, MAHOUT-66.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-125) Remove Deprecated Ant builds
Remove Deprecated Ant builds Key: MAHOUT-125 URL: https://issues.apache.org/jira/browse/MAHOUT-125 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Finish transferring functionality from build-deprecated.xml files to Maven. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (SOLR-1188) TermVectorComponent Efficiency improvements
TermVectorComponent Efficiency improvements --- Key: SOLR-1188 URL: https://issues.apache.org/jira/browse/SOLR-1188 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 The TermVectorComponent currently uses a TermVectorMapper that does not indicate to Lucene, by overriding isIgnoringOffsets and isIgnoringPositions, whether positions and offsets are of interest. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
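A sketch of the fix being described, with signatures per the 2.9-era TermVectorMapper API as best recalled: override the two methods so Lucene can skip decoding whatever the request does not ask for.
{code:java}
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;

public class SelectiveMapper extends TermVectorMapper {
  private final boolean wantPositions;
  private final boolean wantOffsets;

  public SelectiveMapper(boolean wantPositions, boolean wantOffsets) {
    this.wantPositions = wantPositions;
    this.wantOffsets = wantOffsets;
  }

  // Telling the reader what to ignore lets it skip that data entirely.
  public boolean isIgnoringPositions() { return !wantPositions; }
  public boolean isIgnoringOffsets() { return !wantOffsets; }

  public void setExpectations(String field, int numTerms,
                              boolean storeOffsets, boolean storePositions) { }

  public void map(String term, int frequency,
                  TermVectorOffsetInfo[] offsets, int[] positions) {
    // offsets/positions arrive as null when the corresponding flag is ignored
  }
}
{code}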
[jira] Resolved: (SOLR-1188) TermVectorComponent Efficiency improvements
[ https://issues.apache.org/jira/browse/SOLR-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved SOLR-1188. --- Resolution: Fixed Committed simple patch to override the two methods. TermVectorComponent Efficiency improvements --- Key: SOLR-1188 URL: https://issues.apache.org/jira/browse/SOLR-1188 Project: Solr Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 The TermVectorComponent currently uses a TermVectorMapper that does not indicate to Lucene whether positions and offsets are of interest by overriding isIgnoringOffsets and isIgnoringPositions. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (SOLR-1177) Distributed TermsComponent
[ https://issues.apache.org/jira/browse/SOLR-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-1177: - Assignee: Grant Ingersoll Distributed TermsComponent -- Key: SOLR-1177 URL: https://issues.apache.org/jira/browse/SOLR-1177 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Matt Weber Assignee: Grant Ingersoll Priority: Minor Fix For: 1.5 Attachments: SOLR-1177.patch, TermsComponent.java, TermsComponent.patch TermsComponent should be distributed -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (LUCENE-1550) Add N-Gram String Matching for Spell Checking
[ https://issues.apache.org/jira/browse/LUCENE-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711150#action_12711150 ] Grant Ingersoll commented on LUCENE-1550: - Committed revision 776704. Thanks Tom! Add N-Gram String Matching for Spell Checking - Key: LUCENE-1550 URL: https://issues.apache.org/jira/browse/LUCENE-1550 Project: Lucene - Java Issue Type: New Feature Components: contrib/spellchecker Affects Versions: 2.9 Reporter: Thomas Morton Assignee: Grant Ingersoll Priority: Minor Fix For: 2.9 Attachments: LUCENE-1550.patch, LUCENE-1550.patch, LUCENE-1550.patch N-Gram version of edit distance based on paper by Grzegorz Kondrak, N-gram similarity and distance. Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126, Buenos Aires, Argentina, November 2005. http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
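For readers unfamiliar with n-gram string matching, the sketch below shows the flavor with a plain bigram Dice coefficient; note this is a simplification, not Kondrak's algorithm from the committed patch, which adds refinements such as affixing and partial n-gram matches:
{code:java}
import java.util.HashSet;
import java.util.Set;

public class BigramSimilarity {
  // Dice coefficient over character bigrams: 2*|shared| / (|A| + |B|).
  public static float similarity(String a, String b) {
    Set<String> aGrams = bigrams(a);
    Set<String> bGrams = bigrams(b);
    if (aGrams.isEmpty() || bGrams.isEmpty()) {
      return a.equals(b) ? 1f : 0f; // degenerate short-string case
    }
    int shared = 0;
    for (String g : aGrams) {
      if (bGrams.contains(g)) shared++;
    }
    return (2f * shared) / (aGrams.size() + bGrams.size());
  }

  private static Set<String> bigrams(String s) {
    Set<String> grams = new HashSet<String>();
    for (int i = 0; i + 2 <= s.length(); i++) {
      grams.add(s.substring(i, i + 2));
    }
    return grams;
  }
}
{code}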
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711137#action_12711137 ] Grant Ingersoll commented on SOLR-769: -- Committed revision 776692. Support Document and Search Result clustering - Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-libs.tar, clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering. The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results. While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711137#action_12711137 ] Grant Ingersoll edited comment on SOLR-769 at 5/20/09 6:42 AM: --- Committed revision 776692. Thanks to everyone who helped out, especially Carrot2 creators Dawid and Stanislaw. was (Author: gsingers): Committed revision 776692. Support Document and Search Result clustering - Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-libs.tar, clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering. The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results. While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-119) Create an uber jar for use on Amazon Elastic M/R, etc.
Create an uber jar for use on Amazon Elastic M/R, etc. -- Key: MAHOUT-119 URL: https://issues.apache.org/jira/browse/MAHOUT-119 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Priority: Minor Some cloud resources have problems loading classes across JARs in the Job jar. See http://www.lucidimagination.com/search/document/3a5680dfe567d812/running_dirichlet_example_on_aemr This can be fixed by adding a new build target that creates a single JAR. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAHOUT-120) Site search powered by Solr
[ https://issues.apache.org/jira/browse/MAHOUT-120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated MAHOUT-120: --- Attachment: MAHOUT-120.patch Patch to change the site skin. This was created by slightly modifying the default Forrest skin. Site search powered by Solr --- Key: MAHOUT-120 URL: https://issues.apache.org/jira/browse/MAHOUT-120 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-120.patch For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Mahout content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org A preview of the site is available at http://people.apache.org/~gsingers/mahout/site/publish/ Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds a skin to the Forrest site that enables the Mahout site to search Mahout-only content using Lucene/Solr. When a search is submitted, it automatically selects the Mahout facet such that only Mahout content is searched. From there, users can then narrow/broaden their search criteria. I'm submitting this patch to Mahout first, as we'd like to roll out our capabilities to some of the smaller communities first and then broaden to the rest of the Lucene ecosystem. I plan on committing in 3 or 4 days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (MAHOUT-120) Site search powered by Solr
Site search powered by Solr --- Key: MAHOUT-120 URL: https://issues.apache.org/jira/browse/MAHOUT-120 Project: Mahout Issue Type: Improvement Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: MAHOUT-120.patch For a number of years now, the Lucene community has been criticized for not eating our own dog food when it comes to search. My company has built and hosts a site search (http://www.lucidimagination.com/search) that is powered by Apache Solr and Lucene, and we'd like to donate its use to the Lucene community. Additionally, it allows one to search all of the Mahout content from a single place, including web, wiki, JIRA and mail archives. See also http://www.lucidimagination.com/search/document/bf22a570bf9385c7/search_on_lucene_apache_org A preview of the site is available at http://people.apache.org/~gsingers/mahout/site/publish/ Lucid has a fault-tolerant setup with replication and failover as well as monitoring services in place. We are committed to maintaining and expanding the search capabilities on the site. The following patch adds a skin to the Forrest site that enables the Mahout site to search Mahout-only content using Lucene/Solr. When a search is submitted, it automatically selects the Mahout facet such that only Mahout content is searched. From there, users can then narrow/broaden their search criteria. I'm submitting this patch to Mahout first, as we'd like to roll out our capabilities to some of the smaller communities first and then broaden to the rest of the Lucene ecosystem. I plan on committing in 3 or 4 days. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-761) Fix Flare license headers
[ https://issues.apache.org/jira/browse/SOLR-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710870#action_12710870 ] Grant Ingersoll commented on SOLR-761: -- FYI: ant rat-sources is helpful in easily identifying which files are missing license headers. Fix Flare license headers - Key: SOLR-761 URL: https://issues.apache.org/jira/browse/SOLR-761 Project: Solr Issue Type: Task Reporter: Erik Hatcher Assignee: Erik Hatcher Fix For: 1.4 Solr Flare has inconsistent use of the Apache Software License header in its files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-769: - Attachment: SOLR-769.patch OK, I think all the ducks are in a row. I intend to commit on Friday. Support Document and Search Result clustering - Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Attachments: clustering-libs.tar, clustering-libs.tar, SOLR-769-lib.zip, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.patch, SOLR-769.tar, SOLR-769.zip Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing search results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering. The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results. While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib. It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708534#action_12708534 ] Grant Ingersoll commented on SOLR-773: -- I think, and correct me if I'm wrong, that one of the things that often happens with geo stuff is that there are a lot of unique values. This often has memory ramifications when used with FunctionQueries since most ValueSources uninvert the field. Incorporate Local Lucene/Solr - Key: SOLR-773 URL: https://issues.apache.org/jira/browse/SOLR-773 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, SOLR-773.patch, spatial-solr.tar.gz Local Lucene has been donated to the Lucene project. It has some Solr components, but we should evaluate how best to incorporate it into Solr. See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Issue Comment Edited: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708534#action_12708534 ] Grant Ingersoll edited comment on SOLR-773 at 5/12/09 11:02 AM: I think, and correct me if I'm wrong, that one of the things that often happens with geo stuff is that there are a lot of unique values. This often has memory ramifications when used with FunctionQueries since most ValueSources uninvert the field. Otherwise, I like the sound of Yonik's proposal as well. was (Author: gsingers): I think, and correct me if I'm wrong, that one of the things that often happens with geo stuff is that there are a lot of unique values. This often has memory ramifications when used with FunctionQueries since most ValueSources uninvert the field. Incorporate Local Lucene/Solr - Key: SOLR-773 URL: https://issues.apache.org/jira/browse/SOLR-773 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: lucene.tar.gz, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773-local-lucene.patch, SOLR-773.patch, SOLR-773.patch, spatial-solr.tar.gz Local Lucene has been donated to the Lucene project. It has some Solr components, but we should evaluate how best to incorporate it into Solr. See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708541#action_12708541 ] Grant Ingersoll commented on SOLR-773: -- Also, how does the TrieRange stuff factor into this?
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708680#action_12708680 ] Grant Ingersoll commented on SOLR-773: -- {quote} 1) What is the goal we want to achieve? * Provide a first iteration of a geographical search entity to SOLR * Bring an external popular plugin, in out of the cold into ASF and SOLR, helps solr users out, increases developers from 1 to many. {quote} Agreed on the first, not 100% certain on the second. On the second, this issue is the gatekeeper. If people reviewing the patch feel there are better ways to do things, then we should work through them before committing. What you are effectively seeing is an increase in the number of developers working on it from 1 to many; it's just not on committed code. {quote} 2) What is the level of commitment, and road map of spatial solutions in lucene and solr? * The primary goal of SOLR is as a text search engine, not GIS search, there are other and better ways to do that without reinventing the wheel and shoe horn-ing it into lucene. (e.g. persistent doc id mappings that can be referenced outside of lucene, so things like postGis and other tools can be used) * We can never fully solve everyone's needs at once, lets start with what we have, and iterate upon it. * I'm happy for any improvements as long as they keep to two goals A. don't make it stupid B. don't make it complex. {quote} On the first point, I don't follow. Aren't LocalLucene and LocalSolr exactly a GIS search capability for Lucene/Solr? I'm not sure if I would categorize it as shoe-horning. There are many things that Lucene/Solr can power; GIS search is one of them. By committing this patch (or some variation), we are saying Solr is going to support GIS search. Of course, there are other ways to do it, but that doesn't preclude it from L/S. The combination of text search plus GIS search is very powerful, as you know. Still, I think Yonik's main point is: why reinvent the wheel on things like distributed search, or require custom indexing code, when they can likely be handled through function queries and field types, so that all of Solr's current functionality would just work? The other capabilities (like sorting by a FunctionQuery) are icing on the cake that help solve other problems as well. Totally agree on the other points. Also very cool to see the benchmarking info.
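To ground the function-query point, Lucene's org.apache.lucene.search.function package already composes a text query with a per-document value source. This is a sketch only, not the LocalLucene code; the "boost" field is hypothetical, and a spatial version would supply a distance-computing ValueSource instead:
{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

// Sketch: combine text relevance with a per-document value pulled from a
// field -- the same general mechanism a distance function query would use.
public class FunctionQuerySketch {
  public static Query textPlusFieldScore() {
    Query text = new TermQuery(new Term("name", "five"));
    // "boost" is a hypothetical numeric field.
    FieldScoreQuery fieldScore =
        new FieldScoreQuery("boost", FieldScoreQuery.Type.FLOAT);
    return new CustomScoreQuery(text, fieldScore); // default: multiplies scores
  }
}
{code}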
[jira] Issue Comment Edited: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708680#action_12708680 ] Grant Ingersoll edited comment on SOLR-773 at 5/12/09 4:21 PM: --- {quote} 1) What is the goal we want to achieve? * Provide a first iteration of a geographical search entity to SOLR * Bring an external popular plugin, in out of the cold into ASF and SOLR, helps solr users out, increases developers from 1 to many. {quote} Agreed on the first, not 100% certain on the second. On the second, this issue is the gatekeeper. If people reviewing the patch feel there are better ways to do things, then we should work through them before committing. What you are effectively seeing is an increase in the number of developers working on it from 1 to many; it's just not on committed code. {quote} 2) What is the level of commitment, and road map of spatial solutions in lucene and solr? * The primary goal of SOLR is as a text search engine, not GIS search, there are other and better ways to do that without reinventing the wheel and shoe horn-ing it into lucene. (e.g. persistent doc id mappings that can be referenced outside of lucene, so things like postGis and other tools can be used) * We can never fully solve everyone's needs at once, lets start with what we have, and iterate upon it. * I'm happy for any improvements as long as they keep to two goals A. don't make it stupid B. don't make it complex. {quote} On the first point, I don't follow. Aren't LocalLucene and LocalSolr exactly a GIS search capability for Lucene/Solr? I'm not sure if I would categorize it as shoe-horning. There are many things that Lucene/Solr can power; GIS search with text is one of them. By committing this patch (or some variation), we are saying Solr is going to support it. Of course, there are other ways to do it, but that doesn't preclude it from L/S. The combination of text search plus GIS search is very powerful, as you know. Still, I think Yonik's main point is: why reinvent the wheel on things like distributed search, or require custom indexing code, when they can likely be handled through function queries and field types, so that all of Solr's current functionality would just work? The other capabilities (like sorting by a FunctionQuery) are icing on the cake that help solve other problems as well. Totally agree on the other points. Also very cool to see the benchmarking info.
[jira] Commented: (LUCENE-1387) Add LocalLucene
[ https://issues.apache.org/jira/browse/LUCENE-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12708178#action_12708178 ] Grant Ingersoll commented on LUCENE-1387: - FWIW, you might find the discussion on SOLR-773 interesting. Add LocalLucene --- Key: LUCENE-1387 URL: https://issues.apache.org/jira/browse/LUCENE-1387 Project: Lucene - Java Issue Type: New Feature Components: contrib/spatial Reporter: Grant Ingersoll Assignee: Ryan McKinley Priority: Minor Fix For: 2.9 Attachments: spatial-lucene.zip, spatial.tar.gz, spatial.zip Local Lucene (Geo-search) has been donated to the Lucene project, per https://issues.apache.org/jira/browse/INCUBATOR-77. This issue is to handle the Lucene portion of integration. See http://lucene.markmail.org/message/orzro22sqdj3wows?q=LocalLucene
[jira] Commented: (SOLR-1138) Query Elevation Component should gracefully handle empty queries
[ https://issues.apache.org/jira/browse/SOLR-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705559#action_12705559 ] Grant Ingersoll commented on SOLR-1138: --- Committed revision 771268. Query Elevation Component should gracefully handle empty queries Key: SOLR-1138 URL: https://issues.apache.org/jira/browse/SOLR-1138 Project: Solr Issue Type: Bug Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: SOLR-1138.patch From http://www.lucidimagination.com/search/document/3b50cd3506952f7 : {quote} In the QueryElevationComponent (QEC) it currently throws an exception if the input Query is null (line 329). Additionally, I've seen cases where it's possible that the Query is not null (q is not set, but q.alt is *:*), but the rb.getQueryString() is null, which causes an NPE on line 300 or so. I'd like to suggest that if the Query is empty/null, the QEC should just go on its merry way as if there is nothing to do. I don't think a lack of query means that the QEC is improperly configured, as the exception message implies: The QueryElevationComponent needs to be registered 'after' the query component We should be making sure the QEC is properly registered during initialization time. Thoughts? -Grant{quote}
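The graceful handling amounts to a guard clause. A minimal sketch follows; it is not the committed patch, and the subclass exists only to show where the check would sit:
{code}
import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.solr.handler.component.QueryElevationComponent;
import org.apache.solr.handler.component.ResponseBuilder;

// Sketch only (not the committed patch): if there is no query to elevate
// against -- e.g. q unset while q.alt=*:* supplied one -- do nothing
// instead of throwing.
public class LenientElevationSketch extends QueryElevationComponent {
  public void prepare(ResponseBuilder rb) throws IOException {
    Query query = rb.getQuery();
    String qstr = rb.getQueryString();
    if (query == null || qstr == null) {
      return; // nothing to elevate; fall through quietly
    }
    super.prepare(rb); // normal elevation path
  }
}
{code}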
[jira] Updated: (SOLR-1138) Query Elevation Component should gracefully handle empty queries
[ https://issues.apache.org/jira/browse/SOLR-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-1138: -- Attachment: SOLR-1138.patch Here's a patch that fixes this. I plan on committing today.
[jira] Created: (SOLR-1138) Query Elevation Component should gracefully handle empty queries
Query Elevation Component should gracefully handle empty queries Key: SOLR-1138 URL: https://issues.apache.org/jira/browse/SOLR-1138 Project: Solr Issue Type: Bug Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor
[jira] Created: (SOLR-1128) Solr Cell Extract Only should also return Metadata too
Solr Cell Extract Only should also return Metadata too -- Key: SOLR-1128 URL: https://issues.apache.org/jira/browse/SOLR-1128 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 Just as the title says. When using extract.only, we should also include the Metadata in the response
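For reference, extract-only mode is driven by a request parameter on the Solr Cell handler. A request shaped like the one below would, with this change, return the Tika Metadata alongside the extracted content; the handler path matches the example solrconfig, while the parameter spelling follows this issue's text and may differ from the committed name:
{code}
curl "http://localhost:8983/solr/update/extract?extract.only=true" -F "file=@document.pdf"
{code}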
[jira] Resolved: (SOLR-1128) Solr Cell Extract Only should also return Metadata too
[ https://issues.apache.org/jira/browse/SOLR-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved SOLR-1128. --- Resolution: Fixed Committed revision 768281.
[jira] Commented: (SOLR-1099) FieldAnalysisRequestHandler
[ https://issues.apache.org/jira/browse/SOLR-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700782#action_12700782 ] Grant Ingersoll commented on SOLR-1099: --- So, why not just fold all of this into the ARH? Seems like all of these features work just as well as input parameters, and there is no need for deprecation, etc. FieldAnalysisRequestHandler --- Key: SOLR-1099 URL: https://issues.apache.org/jira/browse/SOLR-1099 Project: Solr Issue Type: New Feature Components: Analysis Affects Versions: 1.3 Reporter: Uri Boness Assignee: Shalin Shekhar Mangar Fix For: 1.4 Attachments: AnalisysRequestHandler_refactored.patch, analysis_request_handlers_incl_solrj.patch, AnalysisRequestHandler_refactored1.patch, FieldAnalysisRequestHandler_incl_test.patch The FieldAnalysisRequestHandler provides the analysis functionality of the web admin page as a service. This handler accepts a fieldtype/fieldname parameter and a value, and as a response returns a breakdown of the analysis process. It is also possible to send a query value, which will use the configured query analyzer, as well as a showmatch parameter, which will then mark every matched token as a match. If this handler is added to the code base, I also recommend renaming the current AnalysisRequestHandler to DocumentAnalysisRequestHandler and having them both inherit from one AnalysisRequestHandlerBase class which provides the common functionality of the analysis breakdown and its translation to named lists. This will also enhance the current AnalysisRequestHandler, which right now is fairly simplistic.
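For reference, a request to the new handler looks roughly like this; the /analysis/field path and the analysis.* parameter names follow the eventually committed component, so treat them as assumptions at this patch stage:
{code}
http://localhost:8983/solr/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=The%20quick%20fox&analysis.query=fox&analysis.showmatch=true
{code}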
[jira] Issue Comment Edited: (SOLR-1099) FieldAnalysisRequestHandler
[ https://issues.apache.org/jira/browse/SOLR-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700622#action_12700622 ] Grant Ingersoll edited comment on SOLR-1099 at 4/19/09 6:02 PM: Sorry for being a bit late... Am I understanding that the main thing this does is allow you to specify one input and get back analysis for each field you specify? Well, that and the GET, right?
[jira] Commented: (SOLR-1099) FieldAnalysisRequestHandler
[ https://issues.apache.org/jira/browse/SOLR-1099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700622#action_12700622 ] Grant Ingersoll commented on SOLR-1099: --- Sorry for being a bit late... Am I understanding that the main thing this does is allow you to specify one input and get back analysis for each field you specify?
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12700628#action_12700628 ] Grant Ingersoll commented on SOLR-769: -- Where can we download nni.jar from? Seems like if you only need two classes it would be easy enough to replace them with your own code.
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-769: - Comment: was deleted (was: Where can we download nni.jar from? Seems like if you only need two classes it would be easy enough to replace them with your own code.)
[jira] Updated: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-769: - Attachment: SOLR-769.tar SOLR-769.patch OK, I think this is ready to go, except I still need to double-check how it works with release. Since we can't distribute LGPL, this is going to have to be a source-only release artifact and thus can never be in the WAR, unfortunately. The tarball contains the JAR files that one needs, with the exception of the LGPL deps, which are downloaded from the appropriate places.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699908#action_12699908 ] Grant Ingersoll commented on SOLR-769: -- Looks like we need to make the NNI JAR be a download, too, right? It appears to be LGPL. Where does that library come from, anyway? I don't see it on Carrot trunk, but it is in the zip. And a search for it doesn't reveal much. -Grant
[jira] Resolved: (LUCENE-1588) Update Spatial Lucene sort to use FieldComparatorSource
[ https://issues.apache.org/jira/browse/LUCENE-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved LUCENE-1588. - Resolution: Fixed This was committed. Update Spatial Lucene sort to use FieldComparatorSource --- Key: LUCENE-1588 URL: https://issues.apache.org/jira/browse/LUCENE-1588 Project: Lucene - Java Issue Type: Improvement Components: contrib/spatial Affects Versions: 2.9 Reporter: patrick o'leary Assignee: patrick o'leary Priority: Trivial Fix For: 2.9 Attachments: LUCENE-1588.patch Update distance sorting to use FieldComparator sorting as opposed to SortComparator
[jira] Assigned: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned SOLR-773: Assignee: Grant Ingersoll
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699329#action_12699329 ] Grant Ingersoll commented on SOLR-773: -- We should be able to incorporate the GeoHash stuff in Lucene now, right? I'm no spatial expert, but this means we could have an update processor that only uses one field, right?
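A sketch of that one-field idea as an update processor, assuming the GeoHashUtils class from the donated contrib/spatial code; the lat/lng/geohash field names are hypothetical:
{code}
import java.io.IOException;
import org.apache.lucene.spatial.geohash.GeoHashUtils;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

// Sketch: at index time, fold separate lat/lng fields into a single
// geohash field, so spatial search only needs one indexed field.
public class GeoHashProcessorSketch extends UpdateRequestProcessor {
  public GeoHashProcessorSketch(UpdateRequestProcessor next) {
    super(next);
  }

  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Object lat = doc.getFieldValue("lat");   // hypothetical field names
    Object lng = doc.getFieldValue("lng");
    if (lat != null && lng != null) {
      doc.addField("geohash", GeoHashUtils.encode(
          Double.parseDouble(lat.toString()),
          Double.parseDouble(lng.toString())));
    }
    super.processAdd(cmd); // continue the processor chain
  }
}
{code}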
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699364#action_12699364 ] Grant Ingersoll commented on SOLR-773: -- OK, so color me a total geo newbie, but... So, if I index the spatial.xml in the patch I just submitted and I execute: {code} http://localhost:8983/solr/select?q=name:five {code} I get one result, which is expected. If I then do a geo search: {code} http://localhost:8983/solr/select?q=name:five&qt=geo&long=-74.0093994140625&lat=40.75141843299745&radius=100&debugQuery=true {code} I get two results. The second result is the other theater in the spatial.xml file. Yet, it does not contain the value five in the name field, even though it meets the spatial search criteria. Shouldn't there just be one result? What am I not understanding?
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699403#action_12699403 ] Grant Ingersoll commented on SOLR-773: -- OK, I think I understand why it does this, but it seems a little odd to me. The reason is that the geo handler uses the geo QParser, which ignores the query parameter and produces a query based solely on the lat/lon information. Like I said, I'm a newbie to geo search, but it seems like the QParser should delegate the parsing of the q param to some other parser and then only do distance calculations on the docset returned from the QueryComponent. Of course, I guess one could ask what the semantics are of combining a text query with a spatial query, but I would suppose we could combine them with either AND or OR, such that if I OR'd them together, I would get all docs matching the query term OR'd with all docs in the bounding box. Similarly, AND would yield all docs with the term in the bounding box, right? Again, I am likely missing something, so bear with me.
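In Lucene terms, that AND/OR combination is just a BooleanQuery over the two clauses. A sketch, with the spatial query standing in for whatever the geo QParser produces:
{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Sketch: combining a text query with a spatial (bounding box) query.
// MUST/MUST behaves like AND: docs with the term inside the box.
// SHOULD/SHOULD behaves like OR: docs matching the term, plus docs in the box.
public class TextPlusSpatialSketch {
  public static Query combine(Query spatialQuery, boolean and) {
    Query text = new TermQuery(new Term("name", "five"));
    BooleanQuery bq = new BooleanQuery();
    BooleanClause.Occur occur =
        and ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD;
    bq.add(text, occur);
    bq.add(spatialQuery, occur);
    return bq;
  }
}
{code}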
[jira] Commented: (SOLR-773) Incorporate Local Lucene/Solr
[ https://issues.apache.org/jira/browse/SOLR-773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12698429#action_12698429 ] Grant Ingersoll commented on SOLR-773: -- This latest patch doesn't compile b/c it is missing the SpatialParams class.
[jira] Resolved: (SOLR-804) include lucene misc jar in solr distro
[ https://issues.apache.org/jira/browse/SOLR-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll resolved SOLR-804. -- Resolution: Fixed Committed revision 764580. Added lucene-misc-2.9-dev.jar from rev 764281, which should match the Lucene version on trunk. include lucene misc jar in solr distro -- Key: SOLR-804 URL: https://issues.apache.org/jira/browse/SOLR-804 Project: Solr Issue Type: Wish Affects Versions: 1.3 Environment: all Reporter: solrize Assignee: Grant Ingersoll Priority: Minor Fix For: 1.4 It would be useful to have the lucene misc jar file included with solr. My immediate goal is to build several solr indexes in parallel on separate servers, then run the index merge utility at the end to combine them into a single index. Erik H suggested I post an issue requesting including the misc jar with solr. Thanks
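The merge utility in the misc jar boils down to IndexWriter's addIndexes* calls. A minimal sketch of the parallel-build-then-merge workflow, against the Lucene 2.9-era API and with placeholder paths:
{code}
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: merge several independently built indexes into one, which is
// essentially what the merge tool in the lucene misc jar does from the
// command line.
public class MergeIndexesSketch {
  public static void main(String[] args) throws Exception {
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("/path/to/merged")),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    Directory[] parts = {
        FSDirectory.open(new File("/path/to/index1")),
        FSDirectory.open(new File("/path/to/index2"))
    };
    writer.addIndexesNoOptimize(parts); // merge without a full optimize
    writer.optimize();                  // optional: squash into fewer segments
    writer.close();
  }
}
{code}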
[jira] Updated: (SOLR-804) include lucene misc jar in solr distro
[ https://issues.apache.org/jira/browse/SOLR-804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated SOLR-804: - Fix Version/s: (was: 1.5) 1.4
[jira] Assigned: (LUCENE-1567) New flexible query parser
[ https://issues.apache.org/jira/browse/LUCENE-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned LUCENE-1567: --- Assignee: Grant Ingersoll (was: Michael Busch) New flexible query parser - Key: LUCENE-1567 URL: https://issues.apache.org/jira/browse/LUCENE-1567 Each processor can walk the tree and modify nodes or even the tree's structure. That makes it possible, e.g., to do query optimization before the query is executed or to tokenize terms. 3. The third layer is also a configurable chain of builders, which transform the QueryNodeTree into Lucene Query objects. Furthermore, the query parser uses flexible configuration objects, which are based on AttributeSource/Attribute. It also uses message classes that allow resource bundles to be attached, which makes it possible to translate messages, an important feature of a query parser. This design allows us to develop different query syntaxes very quickly. Adriano wrote the Lucene-compatible syntax in a matter of hours, and the underlying processors and builders in a few days. We now have a 100% compatible Lucene query parser, which means the syntax is identical and all query parser test cases pass on the new one too, using a wrapper. Recent posts show that there is demand for query syntax improvements, e.g. improved range query syntax or operator precedence. There are already different QP implementations in Lucene+contrib; however, I think we did not keep them all up to date and in sync. This is not too surprising, because usually when fixes and changes are made to the main query parser, people don't make the corresponding changes in the contrib parsers. (I'm guilty here too.) With this new architecture it will be much easier to maintain different query syntaxes, as the actual code for the first layer is not very much. All syntaxes would benefit from patches and improvements we make to the underlying layers, which will make supporting different syntaxes much more manageable.
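To make the processor/builder split concrete, here is a tiny hypothetical sketch; the interface and class names are invented for illustration and are not the contributed code's actual types:
{code}
import org.apache.lucene.search.Query;

// Hypothetical shapes only -- this just mirrors the three-layer flow
// described above, not the contributed parser's real API.
interface QueryNode { /* tree node: children, term text, operator, ... */ }

interface QueryNodeProcessor {
  // Walks the tree and may rewrite nodes or restructure it
  // (optimization, tokenization, ...).
  QueryNode process(QueryNode root);
}

interface QueryBuilder {
  // Turns the processed tree into a Lucene Query.
  Query build(QueryNode root);
}

class PipelineSketch {
  static Query parse(QueryNode syntaxTree,
                     QueryNodeProcessor[] processors,
                     QueryBuilder builder) {
    for (QueryNodeProcessor p : processors) {
      syntaxTree = p.process(syntaxTree); // layer 2: processor chain
    }
    return builder.build(syntaxTree);     // layer 3: build the Lucene Query
  }
}
{code}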