[jira] Updated: (SOLR-651) A SearchComponent for fetching TF-IDF values
[ https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-651:

Attachment: SOLR-651.patch

Here's a start at making this support distributed search. It still needs testing. I'm not sure I'm doing the distributed part right, but there aren't many docs on it yet, so I'm going by what I see in the other components. In particular, I'm not clear whether I understand the stages correctly.

It would also be handy to have a better way of testing the distributed code. So far, I call directly into the component to invoke distributedProcess, but it would be nice to have a harness that does what TestDistributedSearch does (i.e., set up a couple of Jetty instances and actually run them).

A SearchComponent for fetching TF-IDF values

Key: SOLR-651
URL: https://issues.apache.org/jira/browse/SOLR-651
Project: Solr
Issue Type: New Feature
Affects Versions: 1.3
Reporter: Noble Paul
Assignee: Grant Ingersoll
Priority: Minor
Fix For: 1.4
Attachments: SOLR-651.patch, SOLR-651.patch, SOLR-651.patch

A SearchComponent that can return the TF-IDF vector for any given document in the Solr index.

Query: a document number, or a query identifying a document
Response: a map of term to TF-IDF value for every term in the selected document

Why? Most machine learning algorithms work on a TF-IDF representation of documents, so adding a request handler providing the TF-IDF representation will pave the way for incorporating learning paradigms into the Solr framework.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
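For readers unfamiliar with the representation the issue description asks for, here is a minimal sketch of mapping each term of one document to a TF-IDF weight. It is not taken from the attached patch: the class name TfIdfSketch, the method names, and the smoothed IDF formula 1 + ln(numDocs / (docFreq + 1)) are assumptions for illustration, and the plain maps stand in for a document's term vector and the index's document frequencies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the SOLR-651 implementation.
final class TfIdfSketch {

    // Assumed IDF variant: 1 + ln(numDocs / (docFreq + 1)).
    // Other weightings are equally plausible for the real component.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // termFreqs: term -> occurrences in the selected document.
    // docFreqs:  term -> number of documents in the index containing the term.
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            weights.put(e.getKey(), e.getValue() * idf(df, numDocs));
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        tf.put("solr", 3);
        tf.put("the", 2);
        Map<String, Integer> df = new LinkedHashMap<>();
        df.put("solr", 1);
        df.put("the", 3);
        // Prints one "term weight" pair per line for a 4-document index.
        tfIdf(tf, df, 4).forEach((term, w) -> System.out.println(term + " " + w));
    }
}
```

A rare term ("solr" here) ends up with a higher weight than a common one ("the") at similar raw frequency, which is the property the machine learning use case relies on.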
[jira] Updated: (SOLR-651) A SearchComponent for fetching TF-IDF values
[ https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-651:

Attachment: SOLR-651.patch

Addresses Noble's thoughts.

Attachments: SOLR-651.patch, SOLR-651.patch
[jira] Updated: (SOLR-651) A SearchComponent for fetching TF-IDF values
[ https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-651:

Attachment: SOLR-651.patch

Here's a first crack at this. It still needs more unit tests to exercise the various combinations of options, but I think it is a reasonable first pass at the idea.

Questions to be answered/things still to do:

1. How do people like the output format? It's broken down by doc, then field, then term, then term information. See the unit tests for some samples.
2. It would be good to have a more efficient lookup for IDF. At a minimum, a cache of IDF values would be useful, but its memory use would need to be controlled. Lucene may do some caching under the hood, so that should be investigated further.
3. It relies on the query component doing its thing. That is, you send in a query, start, and rows, and this component just loops over the doc list and fetches. I could see a case for doing things separately, but that seems like duplication. People using this can just send explicit queries designed for this component.
4. It probably needs some error handling for documents that don't have term vectors, but I haven't tested that yet.

Attachments: SOLR-651.patch
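Item 2 above suggests caching IDF values while keeping memory under control. One possible shape, sketched below with only the JDK, is a size-bounded LRU cache built on LinkedHashMap. The class IdfCacheSketch, the docFreqLookup callback (a stand-in for asking the index for a term's document frequency), and the IDF formula are all hypothetical and not part of the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical illustration only; not the SOLR-651 implementation.
final class IdfCacheSketch {
    private final int numDocs;
    private final ToIntFunction<String> docFreqLookup; // assumed stand-in for an index lookup
    private final Map<String, Double> cache;

    IdfCacheSketch(int maxEntries, int numDocs, ToIntFunction<String> docFreqLookup) {
        this.numDocs = numDocs;
        this.docFreqLookup = docFreqLookup;
        // accessOrder=true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU term once the bound is exceeded.
        this.cache = new LinkedHashMap<String, Double>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached IDF for the term, computing and storing it on a miss.
    double idf(String term) {
        Double cached = cache.get(term);
        if (cached != null) {
            return cached;
        }
        int docFreq = docFreqLookup.applyAsInt(term);
        // Assumed IDF variant: 1 + ln(numDocs / (docFreq + 1)).
        double value = 1.0 + Math.log((double) numDocs / (docFreq + 1));
        cache.put(term, value);
        return value;
    }

    int cachedTerms() {
        return cache.size();
    }

    public static void main(String[] args) {
        IdfCacheSketch idfCache = new IdfCacheSketch(1000, 4, term -> 1);
        System.out.println(idfCache.idf("solr"));
        System.out.println(idfCache.cachedTerms());
    }
}
```

Bounding by entry count is only a rough proxy for memory, since terms vary in length. This sketch is also not thread-safe, so a component shared across request threads would need synchronization or a concurrent cache.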
[jira] Updated: (SOLR-651) A SearchComponent for fetching TF-IDF values
[ https://issues.apache.org/jira/browse/SOLR-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-651:

Fix Version/s: 1.4