[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12849225#action_12849225 ]

Andrzej Bialecki commented on SOLR-799:

This issue is closed - please use the mailing lists for discussions.

Add support for hash based exact/near duplicate document handling

Key: SOLR-799
URL: https://issues.apache.org/jira/browse/SOLR-799
Project: Solr
Issue Type: New Feature
Components: update
Reporter: Mark Miller
Assignee: Yonik Seeley
Priority: Minor
Fix For: 1.4
Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, SOLR-799.patch

Hash-based duplicate document detection is efficient and allows for blocking as well as field collapsing. Let's put it into Solr. http://wiki.apache.org/solr/Deduplication

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
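The Deduplication wiki page linked above describes fuzzy signatures that hash only a document's frequent tokens, so near-duplicates collide on the same value. A minimal sketch of that idea in Python - the function name and the quant_rate parameter are illustrative, not Solr's actual API:

```python
import hashlib
import re
from collections import Counter

def fuzzy_signature(text, quant_rate=0.01):
    """Order-insensitive signature over a document's frequent tokens,
    so near-identical documents hash to the same value.

    Rare tokens (below a quantized frequency threshold) are dropped,
    and surviving counts are rounded down, so small edits do not
    change the signature."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    max_freq = max(counts.values(), default=1)
    quant = max(2, round(max_freq * quant_rate)) if max_freq > 1 else 1
    profile = sorted((tok, cnt // quant)
                     for tok, cnt in counts.items() if cnt >= quant)
    return hashlib.md5(repr(profile).encode()).hexdigest()
```

Two documents differing only in rare tokens then get the same signature, which a deduplicating update handler could use to block or collapse the later one.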
[jira] Commented: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)
[ https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12847923#action_12847923 ]

Andrzej Bialecki commented on SOLR-1837:

Re: bugs in Luke that result in missing terms - I recently fixed one such bug, and indeed it was located in the DocReconstructor - if you are aware of others then please report them using the Luke issue tracker. Document reconstruction is a very IO-intensive operation, so I would advise against using it on a production system; it also produces inexact results (because analysis is usually a lossy operation).

Reconstruct a Document (stored fields, indexed fields, payloads)

Key: SOLR-1837
URL: https://issues.apache.org/jira/browse/SOLR-1837
Project: Solr
Issue Type: New Feature
Components: Schema and Analysis, web gui
Affects Versions: 1.5
Environment: All
Reporter: Trey Grainger
Priority: Minor
Fix For: 1.5
Original Estimate: 168h
Remaining Estimate: 168h

One Solr feature I've been sorely in need of is the ability to inspect an index for any particular document. While the analysis page is good when you have specific content and a specific field/type you want to test the analysis process for, once a document is indexed it is not currently possible to easily see what is actually sitting in the index. One can use the Lucene Index Browser (Luke), but this has several limitations (GUI only, doesn't understand the Solr schema, doesn't display many non-text fields in human-readable format, doesn't show payloads, some bugs lead to missing terms, exposes features dangerous to use in a production Solr environment, slow or difficult to check from a remote location, etc.). The document reconstruction feature of Luke provides the base for what can become a much more powerful tool when coupled with Solr's understanding of a schema, however.
[jira] Commented: (SOLR-1536) Support for TokenFilters that may modify input documents
[ https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835660#action_12835660 ]

Andrzej Bialecki commented on SOLR-1536:

Term freq. vectors are not available at this stage, unless you go to the expense of creating a MemoryIndex. I think the solution I proposed is less costly and more generic.

Support for TokenFilters that may modify input documents

Key: SOLR-1536
URL: https://issues.apache.org/jira/browse/SOLR-1536
Project: Solr
Issue Type: New Feature
Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Andrzej Bialecki
Attachments: altering.patch

In some scenarios it's useful to be able to create or modify fields in the input document based on the analysis of other fields of the same document. This need arises e.g. when indexing multilingual documents, or when doing NLP processing such as NER. However, currently this is not possible to do. This issue provides an implementation of this functionality that consists of the following parts:
* DocumentAlteringFilterFactory - abstract superclass that indicates that TokenFilter-s created from this factory may modify fields in a SolrInputDocument.
* TypeAsFieldFilterFactory - example implementation that illustrates this concept, with a JUnit test.
* DocumentBuilder modifications to support this functionality.
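To illustrate the TypeAsFieldFilter concept named above - feeding token types produced during analysis back into the input document as new fields - here is a toy sketch; the function names and the toy analyzer are invented for illustration, while the real patch works on SolrInputDocument and TokenFilter instances:

```python
def type_as_field(doc, field, analyze):
    """Sketch of the document-altering idea: analyzing one field
    feeds new fields back into the input document.

    `analyze` yields (token, type) pairs; each distinct token type
    becomes a new multivalued field holding the tokens of that type."""
    for token, tok_type in analyze(doc[field]):
        doc.setdefault(tok_type, []).append(token)
    return doc

def toy_analyze(text):
    """Stand-in analyzer: tags digit tokens NUM, everything else WORD."""
    for tok in text.split():
        yield tok, ("NUM" if tok.isdigit() else "WORD")
```

For example, analyzing a "body" field of "call 911 now" would add a NUM field containing "911" and a WORD field containing the remaining tokens.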
[jira] Updated: (SOLR-1536) Support for TokenFilters that may modify input documents
[ https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1536:

Attachment: altering.patch

Patch updated to trunk.

Support for TokenFilters that may modify input documents
Key: SOLR-1536
URL: https://issues.apache.org/jira/browse/SOLR-1536
Attachments: altering.patch, altering.patch
[jira] Updated: (SOLR-1535) Pre-analyzed field type
[ https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1535:

Attachment: (was: preanalyzed.patch)

Pre-analyzed field type

Key: SOLR-1535
URL: https://issues.apache.org/jira/browse/SOLR-1535
Project: Solr
Issue Type: New Feature
Affects Versions: 1.5
Reporter: Andrzej Bialecki
Fix For: 1.5
Attachments: preanalyzed.patch

PreAnalyzedFieldType provides functionality to index (and optionally store) content that was already processed and split into tokens using some external processing chain. This implementation defines a serialization format for sending tokens with any currently supported Attributes (e.g. type, posIncr, payload, ...). This data is de-serialized into a regular TokenStream that is returned in Field.tokenStreamValue() and thus added to the index as index terms, and optionally a stored part that is returned in Field.stringValue() and is then added as a stored value of the field. This field type is useful for integrating Solr with existing text-processing pipelines, such as third-party NLP systems.
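The patch defines its own serialization format; to show the shape of the problem, here is an illustrative round-trip using an invented JSON-based encoding (not the patch's format) that carries term, posIncr, type, and payload attributes plus an optional stored value:

```python
import base64
import json

def serialize(tokens, stored=None):
    """One record per token; attributes are omitted when they hold
    the default value, and payload bytes are base64 so the format
    stays text-safe. `stored` is the optional stored field value
    carried alongside the token stream."""
    recs = []
    for t in tokens:
        r = {"t": t["term"]}
        if t.get("posIncr", 1) != 1:
            r["i"] = t["posIncr"]
        if "type" in t:
            r["y"] = t["type"]
        if "payload" in t:
            r["p"] = base64.b64encode(t["payload"]).decode()
        recs.append(r)
    msg = {"tokens": recs}
    if stored is not None:
        msg["stored"] = stored
    return json.dumps(msg)

def deserialize(data):
    """Rebuild the token records (and payload bytes) the indexing
    side would turn into a TokenStream."""
    msg = json.loads(data)
    for r in msg["tokens"]:
        if "p" in r:
            r["p"] = base64.b64decode(r["p"])
    return msg
```

An external NLP pipeline would emit this on its side; the field type's job is then only the deserialization half.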
[jira] Updated: (SOLR-1535) Pre-analyzed field type
[ https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1535:

Attachment: preanalyzed.patch

Sigh ... attaching the correct patch.

Pre-analyzed field type
Key: SOLR-1535
URL: https://issues.apache.org/jira/browse/SOLR-1535
Attachments: preanalyzed.patch, preanalyzed.patch
[jira] Updated: (SOLR-1535) Pre-analyzed field type
[ https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1535:

Attachment: (was: altering.patch)

Pre-analyzed field type
Key: SOLR-1535
URL: https://issues.apache.org/jira/browse/SOLR-1535
Attachments: preanalyzed.patch, preanalyzed.patch
[jira] Updated: (SOLR-1535) Pre-analyzed field type
[ https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1535:

Attachment: altering.patch

Oops ... the previous patch produced NPEs. This one doesn't.

Pre-analyzed field type
Key: SOLR-1535
URL: https://issues.apache.org/jira/browse/SOLR-1535
Attachments: preanalyzed.patch, preanalyzed.patch
[jira] Updated: (SOLR-1536) Support for TokenFilters that may modify input documents
[ https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1536:

Attachment: altering.patch

Updated patch - the previous one produced NPEs.

Support for TokenFilters that may modify input documents
Key: SOLR-1536
URL: https://issues.apache.org/jira/browse/SOLR-1536
Attachments: altering.patch, altering.patch
[jira] Updated: (SOLR-1535) Pre-analyzed field type
[ https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1535:

Attachment: preanalyzed.patch

Patch updated to the current trunk.

Pre-analyzed field type
Key: SOLR-1535
URL: https://issues.apache.org/jira/browse/SOLR-1535
Attachments: preanalyzed.patch, preanalyzed.patch
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12833786#action_12833786 ]

Andrzej Bialecki commented on SOLR-1316:

I would lean towards the latter - complex do-it-all components often suffer from creeping featuritis and insufficient testing/maintenance, because there are few users that use all their features, and few developers that understand how they work. I subscribe to the Unix philosophy - do one thing, and do it right - so I think that if we can implement autosuggest that works well from the technical POV, then it will become a reliable component that you can combine in many creative ways to satisfy different scenarios, of which there are likely many more than what you described ...

Create autosuggest component

Key: SOLR-1316
URL: https://issues.apache.org/jira/browse/SOLR-1316
Project: Solr
Issue Type: New Feature
Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Assignee: Shalin Shekhar Mangar
Priority: Minor
Fix For: 1.5
Attachments: suggest.patch, suggest.patch, suggest.patch, TST.zip
Original Estimate: 96h
Remaining Estimate: 96h

Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib.
* Enable creation of the dictionary from the index or via Solr's RPC mechanism
* What types of parameters and settings are desirable?
* Hopefully in the future we can include user click-through rates to boost those terms/phrases higher
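Since the first implementation is planned around the TernaryTree from Lucene contrib, a compact from-scratch sketch of how a ternary search tree serves prefix lookups may help; this is an illustration of the data structure, not the contrib code:

```python
class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "word")

    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.word = ch, None, None, None, None

class TernarySearchTree:
    """Minimal ternary search tree supporting the prefix lookups an
    autosuggest component needs."""

    def __init__(self):
        self.root = None

    def insert(self, word):
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = TSTNode(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.word = word          # a complete entry ends here
        return node

    def suggest(self, prefix):
        """All stored entries starting with `prefix`, in sorted order."""
        node = self._find(self.root, prefix, 0)
        if node is None:
            return []
        out = [node.word] if node.word else []
        self._collect(node.eq, out)
        return sorted(out)

    def _find(self, node, prefix, i):
        if node is None:
            return None
        ch = prefix[i]
        if ch < node.ch:
            return self._find(node.lo, prefix, i)
        if ch > node.ch:
            return self._find(node.hi, prefix, i)
        if i + 1 == len(prefix):
            return node
        return self._find(node.eq, prefix, i + 1)

    def _collect(self, node, out):
        # in-order walk gathers every completed word below this node
        if node is None:
            return
        self._collect(node.lo, out)
        if node.word:
            out.append(node.word)
        self._collect(node.eq, out)
        self._collect(node.hi, out)
```

The tree can be populated with single terms or with whole phrases; which of the two is the right granularity is exactly the design question debated later in this thread.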
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800746#action_12800746 ]

Andrzej Bialecki commented on SOLR-1301:

bq. I'm curious about the not sending over the network. Have you tried the Streaming Server or even just the regular one?

Hmm, I don't think this would make sense - the whole point of this patch is to distribute the load by indexing into multiple Solr instances that use the same config - and this can be an existing user's config, including the components from ${solr.home}/lib.

bq. How would this work with someone who already has a separate Solr cluster setup?

It wouldn't - partly because there is no canonical Solr cluster setup against which to code this ... Would that be the same cluster (1:1 mapping) as the Hadoop cluster?

bq. Also, I haven't looked closely at the patch, but if I understand correctly, it is writing out the indexes to the local disks on the Hadoop cluster?

HDFS doesn't support enough POSIX to allow writing Lucene indexes directly to HDFS - for this reason indexes are always created on the local storage of each node, and after closing they are copied to HDFS.

Solr + Hadoop

Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
Project: Solr
Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki
Fix For: 1.5
Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java

This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:
* provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS.

SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.

Design
--
Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When a reduce task completes, and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.

The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.

An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.

This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor and approved for release under the Apache License.
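The convert → batch → submit → commit-on-close flow described in the Design section can be sketched independently of Hadoop. In the toy sketch below, `server` stands in for the EmbeddedSolrServer and `convert` for the SolrDocumentConverter; the class and parameter names are illustrative, not the patch's API:

```python
class RecordWriterSketch:
    """Sketch of the SolrRecordWriter flow: convert each (key, value)
    pair to a document, accumulate a batch, flush periodically, and
    commit when the writer is closed."""

    def __init__(self, server, convert, batch_size=100):
        self.server, self.convert, self.batch_size = server, convert, batch_size
        self.batch = []

    def write(self, key, value):
        self.batch.append(self.convert(key, value))
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.batch:
            self.server.add(self.batch)   # one submit per full batch
            self.batch = []

    def close(self):
        self.flush()                      # don't lose the tail of the batch
        self.server.commit()

class FakeServer:
    """Test double recording what a real embedded server would receive."""

    def __init__(self):
        self.added, self.committed = [], False

    def add(self, docs):
        self.added.extend(docs)

    def commit(self):
        self.committed = True
```

Because every reducer owns its own writer (and thus its own embedded server), the batches never cross the network - which is the point made in the comment above.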
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12800758#action_12800758 ]

Andrzej Bialecki commented on SOLR-1301:

Iff we somehow could get a mapping between a mapred task on node X and a particular target Solr server (beyond the two obvious choices, i.e. a single URL for one Solr, or localhost for per-node Solr-s), then sure, why not. And you are right that we wouldn't use the embedded Solr in that case. But this patch solves a different problem, and it solves it within the facilities of the current config ;)

bq. Right, and then copied down from HDFS and installed in Solr, correct? You still have the issue of knowing which Solr instances get which shards off of HDFS, right? Just seems like a little more configuration knowledge could alleviate all that extra copying/installing, etc.

Yes. But that would be a completely different scenario - we could wrap it in a Hadoop OutputFormat as well, but the implementation would be totally different from this patch.

Solr + Hadoop
Key: SOLR-1301
URL: https://issues.apache.org/jira/browse/SOLR-1301
[jira] Commented: (SOLR-1602) Refactor SOLR package structure to include o.a.solr.response and move QueryResponseWriters in there
[ https://issues.apache.org/jira/browse/SOLR-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796291#action_12796291 ]

Andrzej Bialecki commented on SOLR-1602:

I'm in favor of B. This worked well in Hadoop (mapred -> mapreduce), where the list of deprecations was massive and the API changes were not straightforward at all - still, it was done to promote a better design and allow new functionality. Whole deprecated hierarchies live there for at least two major releases, and surely they were visible to thousands of Hadoop devs. The downside was occasional confusion, and of course the porting effort required to use the new API, but the upside was excellent back-compat to keep serious users happy, and a clear message to all to get prepared for the switch. So IMHO having a bunch of deprecated classes for a while is not a big deal, if it gives us the freedom to pursue a better design.

Refactor SOLR package structure to include o.a.solr.response and move QueryResponseWriters in there

Key: SOLR-1602
URL: https://issues.apache.org/jira/browse/SOLR-1602
Project: Solr
Issue Type: Improvement
Components: Response Writers
Affects Versions: 1.2, 1.3, 1.4
Environment: independent of environment (code structure)
Reporter: Chris A. Mattmann
Assignee: Noble Paul
Fix For: 1.5
Attachments: SOLR-1602.Mattmann.112509.patch.txt, SOLR-1602.Mattmann.112509_02.patch.txt, upgrade_solr_config

Currently all o.a.solr.request.QueryResponseWriter implementations are curiously located in the o.a.solr.request package. Not only is this package getting big (30+ classes), a lot of them are misplaced. There should be a first-class o.a.solr.response package, and the response-related classes should be given a home there. Patch forthcoming.
[jira] Updated: (SOLR-1632) Distributed IDF
[ https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1632:

Attachment: distrib-2.patch

Updated patch; it also contains:
* an LRU-based cache that optimizes requests by using cached values of docFreq for known terms
* unit tests

Distributed IDF

Key: SOLR-1632
URL: https://issues.apache.org/jira/browse/SOLR-1632
Project: Solr
Issue Type: New Feature
Components: search
Affects Versions: 1.5
Reporter: Andrzej Bialecki
Attachments: distrib-2.patch, distrib.patch

Distributed IDF is a valuable enhancement for distributed search across non-uniform shards. This issue tracks the proposed implementation of an API to support this functionality in Solr.
[jira] Updated: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki updated SOLR-1316:

Attachment: suggest.patch

Updated patch:
* removed the broken RadixTree,
* changed the Suggester and Lookup API so that they don't join the tokens - instead they will use whatever tokens are produced by the analyzer. For now results are merged into a single SpellingResult.

Create autosuggest component
Key: SOLR-1316
URL: https://issues.apache.org/jira/browse/SOLR-1316
Attachments: suggest.patch, suggest.patch, suggest.patch, TST.zip
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791164#action_12791164 ]

Andrzej Bialecki commented on SOLR-1316:

bq. What about DAWGs? Are we still considering them?

I would be happy to include DAWGs if someone were to implement them ... ;)

Create autosuggest component
Key: SOLR-1316
URL: https://issues.apache.org/jira/browse/SOLR-1316
Attachments: suggest.patch, suggest.patch, suggest.patch, TST.zip
[jira] Commented: (SOLR-1632) Distributed IDF
[ https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789174#action_12789174 ]

Andrzej Bialecki commented on SOLR-1632:

I'm not sure which approach you are referring to. Following the terminology in that thread, this implementation follows the approach where there is a single merged big idf map at the master, and it's sent out to slaves on each query. However, when exactly this merging and sending happens is implementation-specific - in the ExactDFSource it happens on every query, but I hope the API can support other scenarios as well.

Distributed IDF
Key: SOLR-1632
URL: https://issues.apache.org/jira/browse/SOLR-1632
Attachments: distrib.patch
[jira] Commented: (SOLR-1632) Distributed IDF
[ https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789607#action_12789607 ]

Andrzej Bialecki commented on SOLR-1632:

I believe the API that I propose would support such an implementation as well. Please note that it's usually not feasible to compute and distribute the complete IDF table for all terms - you would have to replicate a union of all term dictionaries across the cluster. In practice, you limit the amount of information by various means, e.g. only distributing data related to the current request (this implementation), reducing the frequency of updates (e.g. LRU caching), or approximating global DF with a constant for frequent terms (where the contribution of their IDF to the score would be negligible anyway).

Distributed IDF
Key: SOLR-1632
URL: https://issues.apache.org/jira/browse/SOLR-1632
Attachments: distrib.patch
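The "single merged df map" idea discussed in this thread can be sketched in a few lines: sum per-shard docFreq and numDocs into one global view, then score against the global counts. The function names are invented for illustration, and the formula shown is Lucene's classic idf (1 + ln(N / (df + 1))), not necessarily what the patch uses:

```python
import math

def merge_stats(shard_stats):
    """Merge per-shard statistics into a global view.

    shard_stats: iterable of (numDocs, {term: docFreq}) tuples,
    one per shard. Returns (total_docs, merged_df)."""
    total_docs = 0
    merged_df = {}
    for num_docs, freqs in shard_stats:
        total_docs += num_docs
        for term, df in freqs.items():
            merged_df[term] = merged_df.get(term, 0) + df
    return total_docs, merged_df

def idf(total_docs, merged_df, term):
    # classic Lucene-style idf over the merged (global) counts
    return 1.0 + math.log(total_docs / (merged_df.get(term, 0) + 1))
```

With non-uniform shards this is what restores comparable scores: a term that is common on one small shard but rare globally gets the globally appropriate (higher) idf instead of the local one.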
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788913#action_12788913 ]

Andrzej Bialecki commented on SOLR-1316:

Thanks for the review!

bq. Why do we concatenate all the tokens into one before calling Lookup#lookup? It seems we should be getting suggestions for each token just as SpellCheckComponent does.

Yeah, it's disputable, and we could change it to use single tokens ... My thinking was that the usual scenario is that you submit autosuggest queries soon after the user starts typing the query, and the highest perceived value of such functionality is when it can suggest complete meaningful phrases and not just individual terms. I.e. when you start typing "token sug" it won't suggest "sugar" but instead it will suggest "token suggestions".

bq. Related to #1, the Lookup#lookup method should return something more fine grained rather than a SpellingResult

Such as? What you put there is what you get ;) so the fact that we are getting complete phrases as suggestions is a consequence of the choice above - the trie in this case is populated with phrases. If we populate it with tokens, then we can return per-token suggestions, again losing the added value I mentioned above.

bq. Has anyone done any benchmarking to figure out the data structure we want to go ahead with?

For now I'm sure that we do NOT want to use the impl. of RadixTree in this patch, because it doesn't support our use case - I'll prepare a patch that removes this impl. Other implementations seem comparable with respect to speed, based on casual tests using /usr/share/dict/words, but I didn't run any exact benchmarks yet.

Create autosuggest component
Key: SOLR-1316
URL: https://issues.apache.org/jira/browse/SOLR-1316
Attachments: suggest.patch, suggest.patch, TST.zip
[jira] Created: (SOLR-1632) Distributed IDF
Distributed IDF --- Key: SOLR-1632 URL: https://issues.apache.org/jira/browse/SOLR-1632 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.5 Reporter: Andrzej Bialecki Distributed IDF is a valuable enhancement for distributed search across non-uniform shards. This issue tracks the proposed implementation of an API to support this functionality in Solr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1632) Distributed IDF
[ https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1632: Attachment: distrib.patch Initial implementation. This supports the current global IDF (i.e. none ;) ), and an exact version of global IDF that requires one additional request per query to obtain per-shard stats. The design should already be flexible enough to implement LRU caching of docFreqs, and ultimately to implement other methods for global IDF calculation (e.g. based on estimation or re-ranking). Distributed IDF --- Key: SOLR-1632 URL: https://issues.apache.org/jira/browse/SOLR-1632 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.5 Reporter: Andrzej Bialecki Attachments: distrib.patch Distributed IDF is a valuable enhancement for distributed search across non-uniform shards. This issue tracks the proposed implementation of an API to support this functionality in Solr. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1614) Search in Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784212#action_12784212 ] Andrzej Bialecki commented on SOLR-1614: - If query performance is not a concern, then why not execute it directly on HDFS (using e.g. Nutch FsDirectory to read indexes from HDFS)? Search in Hadoop Key: SOLR-1614 URL: https://issues.apache.org/jira/browse/SOLR-1614 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 What's the use case? Sometimes queries are expensive (such as regex) or one has indexes located in HDFS, that then need to be searched on. By leveraging Hadoop, these non-time sensitive queries may be executed without dynamically deploying the indexes to new Solr servers. We'll download the index out of HDFS (assuming they're zipped), perform the queries in a batch on the index shard, then merge the results either using a Solr query results priority queue, or simply using Hadoop's built in merge sorting. The query file will be encoded in JSON format, (ID, query, numresults,fields). The shards file will simply contain newline delimited paths (HDFS or otherwise). The output can be a Solr encoded results file per query. I'm hoping to add an actual Hadoop unit test. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780530#action_12780530 ] Andrzej Bialecki commented on SOLR-1316: - Re: question 1 - currently this component doesn't support populating the dictionary from a distributed index. Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: suggest.patch, suggest.patch, TST.zip Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778280#action_12778280 ] Andrzej Bialecki commented on SOLR-1316: - Re: the tree creation - well, this is the current limitation of the Dictionary API that provides only an Iterator. So in the general case it's not possible to start from the middle of the iterator so that the tree is well-balanced. Is it possible to re-balance the tree on the fly? Re: svn - it works for me ... Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: suggest.patch, suggest.patch, TST.zip Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778295#action_12778295 ] Andrzej Bialecki commented on SOLR-1316: - Well, this is kind of ugly, because it increases the memory footprint of the build phase - that was the whole point of using Iterator in the Dictionary, so that you don't have to cache all dictionary data in memory - dictionaries could be large, and they are not guaranteed to be sorted or to have unique keys. But if there are no better options for now, then yes, we could do this just in TSTLookup. Is there really no way to rebalance the tree? Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: suggest.patch, suggest.patch, TST.zip Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
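The buffer-then-insert-medians workaround discussed in this exchange can be sketched as follows. This is illustrative code, not the actual TSTLookup: it only computes the median-first insertion order; feeding keys to a TST insert routine in this order yields a balanced tree even though the Dictionary API only hands out a flat Iterator:

```java
import java.util.List;

// Sketch of the workaround: buffer the dictionary keys in memory, sort
// them, then insert the median of each range first so the ternary tree
// comes out balanced.
public class BalancedBuild {

    /**
     * Emits keys from a sorted list in median-first order; inserting the
     * keys into a TST in this order keeps the tree balanced.
     */
    static void insertBalanced(List<String> sortedKeys, int lo, int hi,
                               List<String> insertionOrder) {
        if (lo > hi) return;
        int mid = (lo + hi) >>> 1;
        insertionOrder.add(sortedKeys.get(mid)); // median goes in first
        insertBalanced(sortedKeys, lo, mid - 1, insertionOrder); // left half
        insertBalanced(sortedKeys, mid + 1, hi, insertionOrder); // right half
    }
}
```

The memory cost objected to above is the `sortedKeys` buffer itself, which must hold the whole dictionary before any insertion can happen.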
[jira] Updated: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1316: Attachment: suggest.patch Updated patch that includes the new TST sources. Tests on a 100k-word dictionary yield very similar results for the TST and Jaspell implementations, i.e. the initial build time is around 600ms, and then the lookup time is around 4-7ms for prefixes that yield more than 100 results. To test it, put this in your solrconfig.xml:
{code:xml}
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
    <str name="field">text</str>
    <str name="sourceLocation">american-english</str>
  </lst>
</searchComponent>
...
{code}
And then use e.g. the following parameters:
{noformat}
spellcheck=true&spellcheck.build=true&spellcheck.dictionary=suggest \
  &spellcheck.extendedResults=true&spellcheck.count=100&q=test
{noformat}
Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: suggest.patch, suggest.patch, TST.zip Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12777045#action_12777045 ] Andrzej Bialecki commented on SOLR-1316: - Forgot to add - the RadixTree implementation doesn't work for now - it needs further refactoring to return the completed keys, and not just the values stored in nodes ... Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: suggest.patch, suggest.patch, TST.zip Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1536) Support for TokenFilters that may modify input documents
[ https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12774162#action_12774162 ] Andrzej Bialecki commented on SOLR-1536: - My opinion may be biased, but I'll try to be as objective as I can ;) I think it's better, because it gives you much more flexibility in building analysis/indexing chains without coding. If we went with URProcessor you would have to implement a new one whenever your analysis chain changes ... With the approach in this patch it's just a configuration issue, and not an issue of implementing as many custom update processors as there are possible combinations ... Support for TokenFilters that may modify input documents Key: SOLR-1536 URL: https://issues.apache.org/jira/browse/SOLR-1536 Project: Solr Issue Type: New Feature Components: Analysis Affects Versions: 1.5 Reporter: Andrzej Bialecki Attachments: altering.patch In some scenarios it's useful to be able to create or modify fields in the input document based on analysis of other fields of this document. This need arises e.g. when indexing multilingual documents, or when doing NLP processing such as NER. However, currently this is not possible to do. This issue provides an implementation of this functionality that consists of the following parts: * DocumentAlteringFilterFactory - abstract superclass that indicates that TokenFilter-s created from this factory may modify fields in a SolrInputDocument. * TypeAsFieldFilterFactory - example implementation that illustrates this concept, with a JUnit test. * DocumentBuilder modifications to support this functionality. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-398) Widen return type of FieldType.createField to Fieldable in order to maximize flexibility
[ https://issues.apache.org/jira/browse/SOLR-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-398: --- Attachment: fieldable.patch Patch updated to current trunk. One concrete use case where this is needed is Fieldable implementations that provide different values of tokenStreamValue() and stringValue(), e.g. when using external tools to provide a pre-tokenized value of the field. Widen return type of FieldType.createField to Fieldable in order to maximize flexibility Key: SOLR-398 URL: https://issues.apache.org/jira/browse/SOLR-398 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.2, 1.3 Reporter: Espen Amble Kolstad Priority: Minor Fix For: 1.5 Attachments: 1.2-FieldType-2.patch, fieldable.patch, trunk-FieldType-3.patch, trunk-FieldType-4.patch, trunk-FieldType-5.patch FieldType.createField currently returns Field. In order to maximize flexibility for developers to extend Solr, it should return Fieldable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader
[ https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758381#action_12758381 ] Andrzej Bialecki commented on SOLR-1366: - Looks good to me, +1. UnsupportedOperationException may be thrown when using custom IndexReader - Key: SOLR-1366 URL: https://issues.apache.org/jira/browse/SOLR-1366 Project: Solr Issue Type: Bug Components: replication (java), search Affects Versions: 1.4 Reporter: Andrzej Bialecki Assignee: Shalin Shekhar Mangar Fix For: 1.4 Attachments: searcher.patch, SOLR-1366.patch If a custom IndexReaderFactory is specified in solrconfig.xml, and IndexReader-s that it produces don't support IndexReader.directory() (such as is the case with ParallelReader or MultiReader) then an uncaught UnsupportedOperationException is thrown. This call is used only to retrieve the full path of the directory for informational purposes, so it shouldn't lead to a crash. Instead we could supply other available information about the reader (e.g. from its toString() method). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
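The fix the issue proposes - treat the directory lookup as optional and fall back to the reader's toString() - amounts to a small defensive pattern. The sketch below uses stand-in types (ReaderLike, FakeParallelReader are invented for the example, not Lucene's IndexReader classes) to show the shape of it:

```java
// Sketch of the defensive fix proposed in this issue: an informational
// directory lookup must never crash the caller; fall back to toString().
public class SafeDirInfo {

    interface ReaderLike {
        String directoryPath(); // may throw UnsupportedOperationException
    }

    /** ParallelReader/MultiReader-style stub that has no single directory. */
    static class FakeParallelReader implements ReaderLike {
        public String directoryPath() { throw new UnsupportedOperationException(); }
        public String toString() { return "FakeParallelReader(no directory)"; }
    }

    /** Informational only - returns whatever description is available. */
    static String describe(ReaderLike reader) {
        try {
            return reader.directoryPath();
        } catch (UnsupportedOperationException e) {
            return reader.toString(); // fall back instead of crashing
        }
    }
}
```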
[jira] Updated: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1316: Attachment: suggest.patch This is very much a work in progress, posted to get review before proceeding. Highlights of this patch: * created a set of interfaces in o.a.s.spelling.suggest to hide implementation details of various autocomplete mechanisms. * imported sources of RadixTree, Jaspell TST and Ankul's TST. Wrapped each implementation so that it works with the same interface. (Ankul: I couldn't figure out how to actually retrieve suggested keys from your TST?) * extended HighFrequencyDictionary to return TermFreqIterator, which gives not only words but also their frequencies. Implemented a similar iterator for file-based term-freq lists. Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: suggest.patch, TernarySearchTree.tar.gz Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader
[ https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756652#action_12756652 ] Andrzej Bialecki commented on SOLR-1366: - +1 for adding a big red flag. My application depends on this functionality, and it's working well once I overrode a bunch of additional methods in IndexReader that deal with Directory, IndexCommit, index version, etc. (A few details on this, and why my solution is not applicable in the general case: I'm using ParallelReader, and the other indexes that I add are throwaways, i.e. I recreate them on each index refresh from external shared resources. So I basically short-circuited those methods that deal with directory and commits so that they return information from the main index. This way the file-based replication works as before for the main index.) UnsupportedOperationException may be thrown when using custom IndexReader - Key: SOLR-1366 URL: https://issues.apache.org/jira/browse/SOLR-1366 Project: Solr Issue Type: Bug Components: replication (java), search Affects Versions: 1.4 Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 1.4 Attachments: searcher.patch If a custom IndexReaderFactory is specified in solrconfig.xml, and IndexReader-s that it produces don't support IndexReader.directory() (such as is the case with ParallelReader or MultiReader) then an uncaught UnsupportedOperationException is thrown. This call is used only to retrieve the full path of the directory for informational purposes, so it shouldn't lead to a crash. Instead we could supply other available information about the reader (e.g. from its toString() method). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756819#action_12756819 ] Andrzej Bialecki commented on SOLR-1316: - Yes, it should work for now. In fact I started writing a new component, but it had to replicate most of the spellchecker ;) so I will just add bits to the existing spellchecker. I'm worried though that we abuse the semantics of the API, and it will be more difficult to fit both functions in a single API as the functionality evolves. Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: TernarySearchTree.tar.gz Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756050#action_12756050 ] Andrzej Bialecki commented on SOLR-1316: - bq. These enable suffix compression and create much smaller word graphs. DAWGs are problematic, because they are essentially immutable once created (the cost of insert / delete is very high). So I propose to stick to TSTs for now. Also, I think that populating the TST from the index would have to be discriminative, perhaps based on a threshold (so that it only adds terms with large enough docFreq), and it would be good to adjust the content of the tree based on actual queries that return some results (poor man's auto-learning), gradually removing the least frequent strings to save space. We could also use as a source a field with 1-3 word shingles (no tf, unstored, to save space in the source index, with a similar thresholding mechanism). Ankul, I'm not sure what the behavior of your implementation is when dynamically adding / removing keys - does it still remain balanced? I also found a MIT-licensed impl. of radix tree here: http://code.google.com/p/radixtree, which looks good too, one spelling mistake in the API notwithstanding ;) Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: TernarySearchTree.tar.gz Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? 
* Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756149#action_12756149 ] Andrzej Bialecki commented on SOLR-1316: - bq. Andrej, why would immutability be a problem? Wouldn't we have to re-build the TST if the source index changes? Well, the use case I have in mind is a TST that improves itself over time based on the observed query log. I.e. you would bootstrap a TST from the index (and here indeed you can do this on every searcher refresh), but it's often claimed that real query logs provide a far better source of autocomplete than the index terms. My idea was to start with what you have - in the absence of query logs - and then improve upon it by adding successful queries (and removing least-used terms to keep the tree at a more or less constant size). Alternatively we could provide an option to bootstrap it from real query log data. This use case requires mutability, hence my negative opinion about DAWGs (besides, we are lacking an implementation, aren't we, whereas we already have a few suitable TST implementations). Perhaps this doesn't have to be an either/or, if we come up with a pluggable interface for this type of component? bq. I think the building of the data structure can be done in a way similar to what SpellCheckComponent does. [..] +1 Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: TernarySearchTree.tar.gz Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? 
* Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
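The self-improving dictionary idea above (bootstrap from the index, bump weights for successful queries, evict least-used entries to keep a roughly constant size) can be sketched like this. All names are illustrative, and the linear-scan eviction is a deliberate simplification; a real implementation would feed these weights into the TST:

```java
import java.util.HashMap;
import java.util.Map;

// Rough sketch of a query-log-driven suggestion source with
// least-frequent eviction, as discussed in the comment above.
public class EvictingSuggestSource {
    private final int capacity;
    private final Map<String, Integer> weights = new HashMap<>();

    EvictingSuggestSource(int capacity) { this.capacity = capacity; }

    /** A successful query bumps its phrase; overflow evicts the least-used one. */
    void recordQuery(String phrase) {
        weights.merge(phrase, 1, Integer::sum);
        if (weights.size() > capacity) {
            String least = null;
            for (Map.Entry<String, Integer> e : weights.entrySet()) {
                if (least == null || e.getValue() < weights.get(least)) {
                    least = e.getKey();
                }
            }
            weights.remove(least); // drop the least-frequent entry
        }
    }

    int weight(String phrase) { return weights.getOrDefault(phrase, 0); }

    int size() { return weights.size(); }
}
```

Note that with this policy a brand-new phrase can be the one evicted; a production version would likely age weights instead of comparing raw counts.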
[jira] Commented: (SOLR-1316) Create autosuggest component
[ https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754404#action_12754404 ] Andrzej Bialecki commented on SOLR-1316: - Jason, did you make any progress on this? I'm interested in this functionality... I'm not sure tries are the best choice - unless heavily pruned, they occupy a lot of RAM. I had some moderate success using an ngram-based method (I reused the spellchecker, with slight modifications) - the method is fast and reuses the existing spellchecker index, but precision of lookups is not ideal. Create autosuggest component Key: SOLR-1316 URL: https://issues.apache.org/jira/browse/SOLR-1316 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Original Estimate: 96h Remaining Estimate: 96h Autosuggest is a common search function that can be integrated into Solr as a SearchComponent. Our first implementation will use the TernaryTree found in Lucene contrib. * Enable creation of the dictionary from the index or via Solr's RPC mechanism * What types of parameters and settings are desirable? * Hopefully in the future we can include user click through rates to boost those terms/phrases higher -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753648#action_12753648 ] Andrzej Bialecki commented on SOLR-1321: - This comment refers to the limitation of Lucene's QueryParser - there is only a single flag there to decide whether it accepts leading wildcards or not, regardless of field. Consequently, after checking the schema in SolrQueryParser we turn on this flag if _any_ field type supports leading wildcards. The end effect of this is that parsers for any field, which are created with IndexSchema.getSolrQueryParser(), will return true if any field type supports leading wildcards, not necessarily the one for which the parser was created. I don't see a way to fix this. I can clarify the comment, though, so that it's clear that this is a limitation in Lucene QueryParser. Support for efficient leading wildcards search -- Key: SOLR-1321 URL: https://issues.apache.org/jira/browse/SOLR-1321 Project: Solr Issue Type: Improvement Components: Analysis Affects Versions: 1.4 Reporter: Andrzej Bialecki Assignee: Grant Ingersoll Fix For: 1.4 Attachments: SOLR-1321.patch, wildcards-2.patch, wildcards-3.patch, wildcards.patch This patch is an implementation of the reversed tokens strategy for efficient leading wildcards queries. ReversedWildcardsTokenFilter reverses tokens and returns both the original token (optional) and the reversed token (with positionIncrement == 0). Reversed tokens are prepended with a marker character to avoid collisions between legitimate tokens and the reversed tokens - e.g. "DNA" would become "and", thus colliding with the regular term "and", but with the marker character it becomes "\u0001and". This TokenFilter can be added to the analyzer chain that is used during indexing. SolrQueryParser has been modified to detect the presence of such fields in the current schema, and treat them in a special way. 
First, SolrQueryParser examines the schema and collects a map of fields where these reversed tokens are indexed. If there is at least one such field, it also sets QueryParser.setAllowLeadingWildcards(true). When building a wildcard query (in getWildcardQuery) the term text may be optionally reversed to put wildcards further along the term text. This happens when the field uses the reversing filter during indexing (as detected above), AND if the wildcard characters are either at 0-th or 1-st position in the term. Otherwise the term text is processed as before, i.e. turned into a regular wildcard query. Unit tests are provided to test the TokenFilter and the query parsing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
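The query-side decision described above - reverse the term only when a wildcard sits at position 0 or 1, and prepend the marker character - can be sketched as follows. This is a simplified illustration, not the actual SolrQueryParser code; the class and method names are invented for the example:

```java
// Sketch of the reversed-wildcard query rewrite: a leading wildcard such
// as "*tion" is turned into the trailing-wildcard form "\u0001noit*",
// which the index can match against the marker-prefixed reversed tokens.
public class ReversedWildcard {
    static final char MARKER = '\u0001';

    static boolean isWildcard(char c) { return c == '*' || c == '?'; }

    /** Reverse only when a wildcard is at position 0 or 1; else leave as-is. */
    static String maybeReverse(String term) {
        boolean leading = term.length() > 0
                && (isWildcard(term.charAt(0))
                    || (term.length() > 1 && isWildcard(term.charAt(1))));
        if (!leading) return term; // regular wildcard query, as before
        StringBuilder reversed = new StringBuilder(term).reverse();
        return MARKER + reversed.toString(); // marker avoids term collisions
    }
}
```

A trailing-only wildcard like "test*" is already efficient, so it is left untouched, exactly as the patch description says.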
[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader
[ https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753761#action_12753761 ] Andrzej Bialecki commented on SOLR-1366: - I didn't make myself clear ... I fixed this for my application, where I control the implementation of IndexReader, but I wouldn't recommend this fix for general use. So this was just FYI. UnsupportedOperationException may be thrown when using custom IndexReader - Key: SOLR-1366 URL: https://issues.apache.org/jira/browse/SOLR-1366 Project: Solr Issue Type: Bug Components: replication (java), search Affects Versions: 1.4 Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 1.4 Attachments: searcher.patch If a custom IndexReaderFactory is specified in solrconfig.xml, and IndexReader-s that it produces don't support IndexReader.directory() (such as is the case with ParallelReader or MultiReader) then an uncaught UnsupportedOperationException is thrown. This call is used only to retrieve the full path of the directory for informational purposes, so it shouldn't lead to a crash. Instead we could supply other available information about the reader (e.g. from its toString() method). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader
[ https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753289#action_12753289 ] Andrzej Bialecki commented on SOLR-1366: - FYI, for now I solved this by extending my IndexReader to support this call and return the original directory that lists all index files plus a few resources that I care about. However, this is just glossing over the deeper problem - the replication handler shouldn't assume the directory is file-based. UnsupportedOperationException may be thrown when using custom IndexReader - Key: SOLR-1366 URL: https://issues.apache.org/jira/browse/SOLR-1366 Project: Solr Issue Type: Bug Components: replication (java), search Affects Versions: 1.4 Reporter: Andrzej Bialecki Assignee: Mark Miller Fix For: 1.4 Attachments: searcher.patch If a custom IndexReaderFactory is specified in solrconfig.xml, and IndexReader-s that it produces don't support IndexReader.directory() (such as is the case with ParallelReader or MultiReader) then an uncaught UnsupportedOperationException is thrown. This call is used only to retrieve the full path of the directory for informational purposes, so it shouldn't lead to a crash. Instead we could supply other available information about the reader (e.g. from its toString() method). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1321: Attachment: wildcards-3.patch Updated patch that uses the new TokenAttribute API and uses (as much as possible) the new ReverseStringFilter. Support for efficient leading wildcards search -- Key: SOLR-1321 URL: https://issues.apache.org/jira/browse/SOLR-1321 Project: Solr Issue Type: Improvement Components: Analysis Affects Versions: 1.4 Reporter: Andrzej Bialecki Assignee: Grant Ingersoll Fix For: 1.4 Attachments: wildcards-2.patch, wildcards-3.patch, wildcards.patch This patch is an implementation of the reversed-tokens strategy for efficient leading wildcard queries. ReversedWildcardsTokenFilter reverses tokens and returns both the original token (optional) and the reversed token (with positionIncrement == 0). Reversed tokens are prepended with a marker character to avoid collisions between legitimate tokens and the reversed tokens - e.g. "DNA" would become "and", colliding with the regular term "and", but with the marker character it becomes "\u0001and". This TokenFilter can be added to the analyzer chain that is used during indexing. SolrQueryParser has been modified to detect the presence of such fields in the current schema and treat them in a special way. First, SolrQueryParser examines the schema and collects a map of fields where these reversed tokens are indexed. If there is at least one such field, it also sets QueryParser.setAllowLeadingWildcards(true). When building a wildcard query (in getWildcardQuery) the term text may be optionally reversed to put the wildcards further along the term text. This happens when the field uses the reversing filter during indexing (as detected above), AND the wildcard characters are at the 0-th or 1-st position in the term. Otherwise the term text is processed as before, i.e. turned into a regular wildcard query. Unit tests are provided to test the TokenFilter and the query parsing.
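The index-time reversal described above can be sketched in plain Java. This is illustrative only - the class and method names are hypothetical, and the real ReversedWildcardsTokenFilter in the patch works on the token's char[] buffer through Lucene's TokenAttribute API rather than on Strings:

```java
// Sketch of the reversed-token strategy (hypothetical names; the real filter
// operates on Lucene token attributes, not Strings).
public class ReversedTokenSketch {
    // Marker prepended to reversed tokens so they cannot collide with real terms.
    static final char MARKER = '\u0001';

    // Index-time: produce the reversed form of a token, prefixed with the marker.
    public static String reverseForIndex(String token) {
        StringBuilder sb = new StringBuilder(token.length() + 1).append(MARKER);
        for (int i = token.length() - 1; i >= 0; i--) {
            sb.append(token.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "dna" reversed is "and" - without the marker it would collide with the
        // legitimate term "and"; with the marker the two remain distinct.
        System.out.println(reverseForIndex("dna").equals(MARKER + "and"));
    }
}
```

A query like `*tion` can then be rewritten to the prefix query `\u0001noit*` against these reversed terms, which the term dictionary can enumerate efficiently.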
[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748067#action_12748067 ] Andrzej Bialecki commented on SOLR-1321: - I'll update the patch, assuming the presence of the updated filter in Lucene, but I'd rather leave updating the libs to someone more intimate with Solr internals ...
[jira] Commented: (SOLR-1375) BloomFilter on a field
[ https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746591#action_12746591 ] Andrzej Bialecki commented on SOLR-1375: - See here for a Java impl. of FastBits: http://code.google.com/p/compressedbitset/ . Re: BloomFilters - in BloomIndexComponent you seem to assume that when BloomKeySet.contains(key) returns true then the key exists in the set. Strictly speaking, this is not true. Only when the result is false can you be certain (with probability 1.0) that the key does NOT exist in the set; when the result is true, the answer is correct only with probability (1.0 - eps), i.e. the BloomFilter will return false positive results for non-existent keys with probability eps. You should take this into account when writing client code. BloomFilter on a field -- Key: SOLR-1375 URL: https://issues.apache.org/jira/browse/SOLR-1375 Project: Solr Issue Type: New Feature Components: update Affects Versions: 1.4 Reporter: Jason Rutherglen Priority: Minor Fix For: 1.5 Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch Original Estimate: 120h Remaining Estimate: 120h * A bloom filter is a read-only probabilistic set. It's useful for verifying that a key exists in a set, though it returns false positives. http://en.wikipedia.org/wiki/Bloom_filter * The use case is indexing in Hadoop and checking for duplicates against a Solr cluster (which, when using the term dictionary or a query, is too slow and exceeds the time consumed for indexing). When a match is found, the host, segment, and term are returned. If the same term is found on multiple servers, multiple results are returned by the distributed process. (We'll need to add in the core name, I just realized.) * When new segments are created and commit is called, a new bloom filter is generated from a given field (default: id) by iterating over the term dictionary values. There's a bloom filter file per segment, which is managed on each Solr shard. When segments are merged away, their corresponding .blm files are also removed. In a future version we'll have a central server for the bloom filters so we're not abusing the thread pool of the Solr proxy and the networking of the Solr cluster (this will be done sooner rather than later after testing this version). I held off because the central server requires syncing the Solr servers' files (which is like reverse replication). * The patch uses the BloomFilter from Hadoop 0.20. I want to jar up only the necessary classes so we don't have a giant Hadoop jar in lib. http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html * Distributed code is added and seems to work; I extended TestDistributedSearch to test over multiple HTTP servers. I chose this approach rather than the manual method used by (for example) TermVectorComponent.testDistributed because I'm new to Solr's distributed search and wanted to learn how it works (the stages are confusing). Using this method, I didn't need to set up multiple tomcat servers and manually execute tests. * We need more of the bloom filter options passable via solrconfig * I'll add more test cases
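The asymmetry Andrzej points out - a false answer is definitive, a true answer is only probable - can be illustrated with a minimal Bloom filter sketch in plain Java. The class name and hash scheme here are made up for illustration; the patch itself uses Hadoop's org.apache.hadoop.util.bloom.BloomFilter:

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (hypothetical; the patch uses Hadoop's
// BloomFilter). Shows why mightContain()==false is authoritative while
// mightContain()==true is only probabilistic.
public class BloomSketch {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    // Derive the i-th bit position for a key; a real impl uses better hashing.
    private int position(String key, int i) {
        return Math.floorMod(key.hashCode() * 31 + i * 0x9E3779B9, size);
    }

    public void add(String key) {
        for (int i = 0; i < hashes; i++) bits.set(position(key, i));
    }

    // false => key is definitely absent (probability 1.0);
    // true  => key is only *probably* present (probability 1 - eps).
    public boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(position(key, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch f = new BloomSketch(1024, 3);
        f.add("doc-1");
        System.out.println(f.mightContain("doc-1"));  // true: no false negatives
        // A false here is authoritative; a true could be a false positive,
        // so client code must verify "true" answers against the real index.
        System.out.println(f.mightContain("doc-2") ? "maybe" : "definitely absent");
    }
}
```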
[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader
[ https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744993#action_12744993 ] Andrzej Bialecki commented on SOLR-1366: - Indeed, that complicates the matter ... It looks like using a non-file based IndexReader breaks replication. This is not a regression from 1.3, but the functionality to specify custom IndexReaders will be available in 1.4, so it should be clearly stated in the docs that it prevents replication from working properly, until we rectify this issue.
[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader
[ https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744996#action_12744996 ] Andrzej Bialecki commented on SOLR-1366: - .. I haven't looked into it yet, but perhaps this could be solved by extending the replication handler to support multiple dirs, and for those IndexReaders that don't support directory(), asking for getSubReaders() and using their directory() ...
[jira] Updated: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader
[ https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1366: Attachment: searcher.patch Patch that catches the exception and supplies IndexReader.toString() instead.
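The catch-and-fallback approach the patch takes amounts to a pattern like the following. This is a simplified sketch: the nested Reader interface is a stand-in for Lucene's IndexReader, and describe() is a hypothetical helper, not the actual patched Solr code:

```java
// Simplified illustration of the patch's approach: directory() may throw
// UnsupportedOperationException (e.g. for ParallelReader or MultiReader),
// so catch it and report reader.toString() instead of crashing.
public class ReaderInfoSketch {
    // Stand-in for Lucene's IndexReader.
    public interface Reader {
        String directory();  // may be unsupported by composite readers
    }

    public static String describe(Reader reader) {
        try {
            return reader.directory();
        } catch (UnsupportedOperationException e) {
            // The path is informational only, so degrade gracefully.
            return reader.toString();
        }
    }

    public static void main(String[] args) {
        Reader multi = new Reader() {
            public String directory() { throw new UnsupportedOperationException(); }
            public String toString() { return "MultiReader(segments_2)"; }
        };
        System.out.println(describe(multi));  // prints "MultiReader(segments_2)"
    }
}
```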
[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742559#action_12742559 ] Andrzej Bialecki commented on SOLR-1321: - bq. Since this is a new filter, we might as well use the new incrementToken capability and reusable stuff as well as avoiding other deprecated analysis calls. Indeed, I'll fix this. bq. Also no need to do the string round trip in the reverse method, right? See the ReverseStringFilter in Lucene contrib/analysis. Round trip ... you mean the allocation of a new char[] buffer, or the conversion to String? I assume the latter - the former is needed because we add the marker char in front. Yeah, I can return char[] and convert to String only in the QP. bq. Perhaps we should just patch that and add some config options to it? Then all Solr would need is the QP change and the FilterFactory change, no? Hmm. After adding the marker-related stuff the code in ReverseStringFilter won't be as nice as it is now. I'd keep in mind the specific use case of this filter ...
[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742573#action_12742573 ] Andrzej Bialecki commented on SOLR-1321: - bq. FWIW, it also seemed like the reverse code in ReverseStringFilter was faster than the patch, but I didn't compare quantitatively. It had better be - it can reverse in place, while we have to allocate a new buffer because of the marker char in front. That's what I meant by messy code - we would need both the in-place and the out-of-place method, depending on an option.
[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738347#action_12738347 ] Andrzej Bialecki commented on SOLR-1321: - Exactly, that's the reason to put this logic in an isolated, well-defined place, with some configurable knobs. One parameter would be the max position of the leading wildcard; another would be the relative cost of ? and *, or whether we allow wildcards at any position except the 0-th (pure suffix search).
[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1321: Attachment: wildcards-2.patch Updated patch with more configurable knobs. See javadoc of ReversedWildcardsFilterFactory and unit tests.
[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1321: Attachment: (was: wildcards-2.patch)
[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1321: Attachment: wildcards-2.patch Previous patch mistakenly included other stuff instead of ReversedWildcardFilterFactory.
[jira] Created: (SOLR-1321) Support for efficient leading wildcards search
Support for efficient leading wildcards search -- Key: SOLR-1321 URL: https://issues.apache.org/jira/browse/SOLR-1321 Project: Solr Issue Type: Improvement Components: Analysis Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: 1.4
[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1321: Attachment: wildcards.patch Patch containing the new filter, example schema and unit tests.
[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737596#action_12737596 ] Andrzej Bialecki commented on SOLR-1321: - If you follow the logic in getWildcardQuery, a field has to meet specific requirements for this reversal to occur - namely, it needs to declare in its indexing analysis chain that it uses ReversedWildcardFilter. This filter does very special kind of reversal (prepending the marker) so it's unlikely that anyone would use it for other purpose than to explicitly support leading wildcards. So for now I'd say that users should consciously choose between this method of supporting leading wildcards and the automaton wildcard query. Support for efficient leading wildcards search -- Key: SOLR-1321 URL: https://issues.apache.org/jira/browse/SOLR-1321 Project: Solr Issue Type: Improvement Components: Analysis Affects Versions: 1.4 Reporter: Andrzej Bialecki Fix For: 1.4 Attachments: wildcards.patch This patch is an implementation of the reversed tokens strategy for efficient leading wildcards queries. ReversedWildcardsTokenFilter reverses tokens and returns both the original token (optional) and the reversed token (with positionIncrement == 0). Reversed tokens are prepended with a marker character to avoid collisions between legitimate tokens and the reversed tokens - e.g. DNA would become and, thus colliding with the regular term and, but with the marker character it becomes \u0001and. This TokenFilter can be added to the analyzer chain that it used during indexing. SolrQueryParser has been modified to detect the presence of such fields in the current schema, and treat them in a special way. First, SolrQueryParser examines the schema and collects a map of fields where these reversed tokens are indexed. If there is at least one such field, it also sets QueryParser.setAllowLeadingWildcards(true). 
When building a wildcard query (in getWildcardQuery) the term text may optionally be reversed to push the wildcards further along the term text. This happens when the field uses the reversing filter during indexing (as detected above), AND the wildcard characters are at the 0th or 1st position in the term. Otherwise the term text is processed as before, i.e. turned into a regular wildcard query. Unit tests are provided for the TokenFilter and the query parsing. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
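The reversal scheme described in this patch can be sketched in a few lines. This is a language-agnostic illustration (the real implementation is the Java ReversedWildcardsTokenFilter and SolrQueryParser); the helper names are hypothetical:

```python
# Hypothetical sketch of the reversed-token strategy described above;
# not the actual ReversedWildcardFilter code.
MARKER = "\u0001"

def index_tokens(token):
    # Emit the original token plus a marker-prefixed reversed copy,
    # so "dna" is indexed as both "dna" and "\u0001and" without
    # colliding with the legitimate term "and".
    return [token, MARKER + token[::-1]]

def rewrite_wildcard(term):
    # Reverse the pattern only when a wildcard sits at position 0 or 1,
    # so "*bar" becomes "\u0001rab*" and matches the reversed index terms.
    if term[0] in "*?" or (len(term) > 1 and term[1] in "*?"):
        return MARKER + term[::-1]
    return term

index_tokens("dna")       # ['dna', '\x01and']
rewrite_wildcard("*bar")  # '\x01rab*'
```

With the pattern reversed, the wildcard moves to the end of the term, which lets the term dictionary be scanned with an efficient prefix seek instead of a full enumeration.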
[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search
[ https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737763#action_12737763 ] Andrzej Bialecki commented on SOLR-1321: - I think your example of g?abcde* could be handled if we assigned different costs to expanding ? and *, the latter being more costly. There could also be a rule that prevents the reversal if a trailing costly wildcard is used. This quickly gets more and more complicated, so ultimately we may want to put this logic elsewhere, in a class that knows best how to make such decisions (ReversedWildcardFilter?). I'll try to modify the patch in this direction. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
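The cost-based decision discussed in this comment might look roughly like the following sketch. The weights and the `should_reverse` rule are purely illustrative assumptions, not the logic that ended up in the patch:

```python
# Illustrative cost heuristic for deciding whether to reverse a wildcard
# pattern; '*' is weighted as more expensive than '?', and wildcards near
# the front cost more because more of the term dictionary must be scanned.
COST = {"*": 2, "?": 1}

def wildcard_cost(term):
    # Weight each wildcard by how much of the term still follows it.
    return sum(COST[c] * (len(term) - i)
               for i, c in enumerate(term) if c in COST)

def should_reverse(term):
    # Reverse only if the reversed form is strictly cheaper to expand:
    # "*abc" clearly benefits, while "g?abcde*" with its trailing costly
    # wildcard does not.
    return wildcard_cost(term[::-1]) < wildcard_cost(term)
```

Under these toy weights, `should_reverse("*abc")` holds while `should_reverse("g?abcde*")` does not, matching the intuition in the comment above.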
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737253#action_12737253 ] Andrzej Bialecki commented on SOLR-1301: - This patch is intended to work with Solr as it is now, and the idea is to use Hadoop to build shards (in the Solr sense) so that they can be used by the current Solr distributed search. I have no idea how / whether Katta/Zookeeper fits in this picture - if you want to pursue that integration, I feel it would be best to do it in a separate issue. Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Attachments: hadoop-0.19.1-core.jar, hadoop.patch This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold: * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network. Design -- Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When a reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken. This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-N directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard. An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead. This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue, you should put it in contrib/hadoop/lib. Note: the development of this patch was sponsored by an anonymous contributor and approved for release under the Apache License. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
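The convert-batch-flush-commit flow described in the design section can be sketched as follows. This is a language-agnostic illustration with hypothetical names, not the actual SolrRecordWriter API:

```python
# Minimal sketch of the batching flow: a record writer converts
# (key, value) pairs into documents, buffers them, flushes the batch
# periodically, and commits on close (the patch also calls optimize()).
class BatchingRecordWriter:
    def __init__(self, server, convert, batch_size=100):
        self.server = server          # stands in for EmbeddedSolrServer
        self.convert = convert        # stands in for SolrDocumentConverter
        self.batch = []
        self.batch_size = batch_size

    def write(self, key, value):
        self.batch.append(self.convert(key, value))
        if len(self.batch) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.batch:
            self.server.add(self.batch)
            self.batch = []

    def close(self):
        self.flush()                  # submit any remaining documents
        self.server.commit()          # the real writer also optimizes here
```

Because each reducer runs its own writer against its own embedded server, the indexed data never crosses the network; only the finished shard directory is copied to the output filesystem.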
[jira] Commented: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737306#action_12737306 ] Andrzej Bialecki commented on SOLR-1301: - bq. Are you going to add a way to automatically add an index to a Solr core? This way already exists by (ab)using the forced replication to a slave from a temporary master. bq. Are you planning on adding test cases for this patch? This functionality requires a running Hadoop cluster. I'm not sure how to write functional tests without bringing in more Hadoop dependencies. I could add unit tests that test some aspects of the patch, but they would be trivial. bq. How does one set the maximum size of a generated shard? One doesn't, at the moment ;) The size of each shard (in # of documents) is a function of the total number of records divided by the number of reduce tasks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
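The shard-sizing answer above follows directly from how a partitioner spreads keys over reducers; a toy illustration (not Hadoop's actual Partitioner API):

```python
# Toy illustration of the shard-sizing point: with hash partitioning,
# each shard receives roughly total_records / num_reducers documents,
# and with a single reducer everything lands in one shard.
def partition(key, num_reducers):
    return hash(key) % num_reducers

def shard_counts(keys, num_reducers):
    counts = [0] * num_reducers
    for k in keys:
        counts[partition(k, num_reducers)] += 1
    return counts
```

There is no direct knob for "maximum shard size"; the only control is the number of reduce tasks, which fixes the number of shards and thus their approximate size.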
[jira] Created: (SOLR-1301) Solr + Hadoop
Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (SOLR-1301) Solr + Hadoop
[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-1301: Attachment: hadoop-0.19.1-core.jar hadoop.patch Solr + Hadoop - Key: SOLR-1301 URL: https://issues.apache.org/jira/browse/SOLR-1301 Project: Solr Issue Type: Improvement Affects Versions: 1.4 Reporter: Andrzej Bialecki Attachments: hadoop-0.19.1-core.jar, hadoop.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1298) FunctionQuery results as pseudo-fields
[ https://issues.apache.org/jira/browse/SOLR-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733624#action_12733624 ] Andrzej Bialecki commented on SOLR-1298: - Not sure about not adding it - which fields are returned is selectable, right? And it's not possible to obtain this information otherwise. Some time ago I implemented this for a client - it was before SOLR-243, but I used the same idea, i.e. a subclass of IndexReader that returns documents with added function fields (and score). FunctionQuery results as pseudo-fields -- Key: SOLR-1298 URL: https://issues.apache.org/jira/browse/SOLR-1298 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Priority: Minor Fix For: 1.5 It would be helpful if the results of FunctionQueries could be added as fields to a document. A couple of options here: 1. Run the FunctionQuery as part of the relevance score and add that piece to the document 2. Run the function (not really a query) during Document/Field retrieval -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
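The "subclass of IndexReader that returns documents with added function fields" idea can be sketched abstractly as a decorator over document retrieval. Everything here is hypothetical pseudo-API, not Solr's FunctionQuery/ValueSource machinery:

```python
# Rough sketch of pseudo-fields: wrap document retrieval so each returned
# document gains fields computed by a function at retrieval time. The
# functions dict stands in for FunctionQueries; nothing is stored.
def with_pseudo_fields(docs, functions):
    for doc in docs:
        out = dict(doc)                  # don't mutate the stored document
        for name, fn in functions.items():
            out[name] = fn(doc)          # computed per document, per request
        yield out

docs = [{"id": 1, "price": 10.0}, {"id": 2, "price": 4.0}]
results = list(with_pseudo_fields(
    docs, {"price_with_tax": lambda d: d["price"] * 1.2}))
```

The appeal of doing this at retrieval (option 2 in the description) is that the pseudo-field participates in the response like any other selectable field without touching the index.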
[jira] Updated: (SOLR-243) Create a hook to allow custom code to create custom IndexReaders
[ https://issues.apache.org/jira/browse/SOLR-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-243: --- Attachment: SOLR-243.patch This functionality is useful when using FilterIndexReader or ParallelReader, +1 for adding it to core. I updated the patch to the latest trunk - all tests pass. Create a hook to allow custom code to create custom IndexReaders Key: SOLR-243 URL: https://issues.apache.org/jira/browse/SOLR-243 Project: Solr Issue Type: Improvement Components: search Environment: Solr core Reporter: John Wang Assignee: Hoss Man Fix For: 1.5 Attachments: indexReaderFactory.patch, indexReaderFactory.patch, indexReaderFactory.patch, indexReaderFactory.patch, indexReaderFactory.patch, indexReaderFactory.patch, indexReaderFactory.patch, SOLR-243.patch, SOLR-243.patch, SOLR-243.patch, SOLR-243.patch, SOLR-243.patch, SOLR-243.patch, SOLR-243.patch I have a customized IndexReader and I want to write a Solr plugin to use my derived IndexReader implementation. Currently IndexReader instantiation is hard-coded to be: IndexReader.open(path) It would be really useful if this were done through a pluggable factory that can be configured, e.g.: interface IndexReaderFactory { IndexReader newReader(String name, String path); } The default implementation would just return: IndexReader.open(path) The newSearcher and getSearcher methods in the SolrCore class can then call the current factory implementation to get the IndexReader instance and build the SolrIndexSearcher by passing in the reader. It would be really nice to add this improvement soon (this seems to be a trivial addition) as our project really depends on it. Thanks -John -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
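The factory hook proposed in the issue is a plain strategy pattern; transposed to Python as a sketch (the actual interface is Java's IndexReaderFactory, and the readers here are stand-in strings, not real index readers):

```python
# Hypothetical sketch of the pluggable reader-factory hook: the core
# calls a configured factory instead of hard-coding IndexReader.open().
class DefaultReaderFactory:
    def new_reader(self, name, path):
        return f"IndexReader({path})"       # stands in for IndexReader.open(path)

class ParallelReaderFactory(DefaultReaderFactory):
    def new_reader(self, name, path):
        # A custom factory may return any reader implementation,
        # e.g. a ParallelReader or FilterIndexReader wrapper.
        return f"ParallelReader({path})"

def open_searcher(factory, path):
    # Stands in for SolrCore.newSearcher/getSearcher calling the
    # configured factory rather than IndexReader.open(path) directly.
    return factory.new_reader("core1", path)
```

Configuration then reduces to choosing which factory class to instantiate, which is exactly the plugin style Solr already uses elsewhere.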
[jira] Commented: (SOLR-1116) Add a Binary FieldType
[ https://issues.apache.org/jira/browse/SOLR-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711528#action_12711528 ] Andrzej Bialecki commented on SOLR-1116: - Indeed! Then it's not relevant here. +0 from me for the regular base64. Add a Binary FieldType -- Key: SOLR-1116 URL: https://issues.apache.org/jira/browse/SOLR-1116 Project: Solr Issue Type: New Feature Components: search Affects Versions: 1.3 Reporter: Noble Paul Assignee: Noble Paul Fix For: 1.4 Attachments: SOLR-1116.patch, SOLR-1116.patch Lucene supports binary data for fields, but Solr has no corresponding field type. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-1116) Add a Binary FieldType
[ https://issues.apache.org/jira/browse/SOLR-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711081#action_12711081 ] Andrzej Bialecki commented on SOLR-1116: - bq. No browser accepts the image data as Base64. your front-end will have to read the string and send it out as a byte[]. Please see http://en.wikipedia.org/wiki/Data_URI_scheme - this is the use case I was referring to, and indeed you can send base64-encoded content directly to any modern browser. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
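The data-URI use case referred to above is easy to demonstrate: a small image stored as a binary field can be embedded in HTML without the front-end ever decoding it. A minimal sketch (the byte string is a truncated placeholder, not a real image):

```python
import base64

# Build a data URI from binary field content, as described in the
# comment above; modern browsers render this directly in an <img> tag.
def to_data_uri(image_bytes, mime="image/png"):
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"

to_data_uri(b"\x89PNG...")  # -> 'data:image/png;base64,iVBORy4uLg=='
```

Note this uses the regular base64 alphabet; a URL-safe variant (`base64.urlsafe_b64encode`) only matters when the encoded value must survive inside a URL query string rather than a data URI.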
[jira] Commented: (SOLR-1116) Add a Binary FieldType
[ https://issues.apache.org/jira/browse/SOLR-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710782#action_12710782 ] Andrzej Bialecki commented on SOLR-1116: - One scenario that I have experience with is storing small images as fields, to be displayed on the result list. URL-safe encoding means you can directly embed the returned string without re-encoding it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (SOLR-769) Support Document and Search Result clustering
[ https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638875#action_12638875 ] Andrzej Bialecki commented on SOLR-769: FYI, Carrot2 does support a handful of different clustering algorithms (the ones I know of are Fuzzy Ants, KMeans and Suffix Tree, in addition to Lingo). Support Document and Search Result clustering - Key: SOLR-769 URL: https://issues.apache.org/jira/browse/SOLR-769 Project: Solr Issue Type: New Feature Reporter: Grant Ingersoll Assignee: Grant Ingersoll Priority: Minor Attachments: clustering-libs.tar, SOLR-769.patch, SOLR-769.patch Clustering is a useful tool for working with documents and search results, similar to the notion of dynamic faceting. Carrot2 (http://project.carrot2.org/) is a nice, BSD-licensed library for doing search-results clustering. Mahout (http://lucene.apache.org/mahout) is well suited for whole-corpus clustering. The patch lays out a contrib module that starts off w/ an integration of a SearchComponent for doing clustering and an implementation using Carrot. In search-results mode, it will use the DocList as the input for the cluster. While Carrot2 comes w/ a Solr input component, it is not the same as the SearchComponent that I have, in that the Carrot example actually submits a query to Solr, whereas my SearchComponent is just chained into the Component list and uses the ResponseBuilder to add in the cluster results. While not fully fleshed out yet, the collection-based mode will take in a list of ids or just use the whole collection and will produce clusters. Since this is a longer, typically offline task, there will need to be some type of storage mechanism (and replication??) for the clusters. I _may_ push this off to a separate JIRA issue, but I at least want to present the use case as part of the design of this component/contrib.
It may even make sense that we split this out, such that the building piece is something like an UpdateProcessor and then the SearchComponent just acts as a lookup mechanism. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
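To make the search-result clustering idea concrete, here is a deliberately crude toy: greedy grouping of result snippets by term overlap. This is NOT Lingo, STC, KMeans, or anything Carrot2 ships; it only illustrates the shape of the input (a DocList of snippets) and the output (groups of related results):

```python
# Toy search-result clustering by Jaccard similarity of snippet terms.
# Purely illustrative; real engines use Lingo, STC, k-means, etc.
def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cluster(snippets, threshold=0.3):
    clusters = []
    for s in snippets:
        for c in clusters:
            # Greedily join the first cluster whose seed is similar enough.
            if jaccard(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

In the SearchComponent design above, a routine like this would run over the DocList after the query component and the resulting groups would be attached to the response via the ResponseBuilder.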
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12638217#action_12638217 ] Andrzej Bialecki commented on SOLR-799: +1 on the incremental sig calculation. Re: different types of signatures. Our experience in Nutch is that the signature type is rarely changed, and we assume that this setting is selected once per lifetime of an index, i.e. there are never any mixed cases of documents with incompatible signatures. If we want to be sure that they are comparable, we could prepend a byte or two of a unique signature type id - this way, even if a signature value matches but was calculated using another impl., the documents won't be considered duplicates, which is the way it should work, because different signature algorithms are incomparable. Re: signature as byte[] - I think it's better if we return byte[] from Signature, and until we support binary fields we just turn this into a hex string. Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both sigFields (if defined) and any other document fields (if sigFields is undefined) should first be ordered in a predictable way (lexicographic?). The current patch uses a HashSet, which doesn't guarantee any particular ordering - in fact the ordering may be different if you run the same code under different JVMs, which may introduce a random factor to the sig calculation. Add support for hash based exact/near duplicate document handling - Key: SOLR-799 URL: https://issues.apache.org/jira/browse/SOLR-799 Project: Solr Issue Type: New Feature Components: update Reporter: Mark Miller Priority: Minor Attachments: SOLR-799.patch Hash-based duplicate document detection is efficient and allows for blocking as well as field collapsing. Let's put it into Solr. http://wiki.apache.org/solr/Deduplication -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
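The three suggestions in the comment above (lexicographic field ordering, a type-id prefix, byte[] turned into a hex string) combine naturally into one small routine. A sketch under those assumptions, not the actual Solr Signature API:

```python
import hashlib

# Sketch of a deterministic document signature: fields are ordered
# lexicographically (so HashSet-style iteration order can't perturb the
# result), the digest is prefixed with a signature-type id (so values
# from different algorithms can never be confused), and the byte[] is
# rendered as a hex string until binary fields are supported.
TYPE_ID = {"md5": b"\x00\x01"}   # illustrative id, not a real registry

def signature(doc, sig_fields=None):
    fields = sorted(sig_fields if sig_fields else doc.keys())
    h = hashlib.md5()
    for f in fields:
        # NUL separator keeps ("ab","c") distinct from ("a","bc").
        h.update(f.encode() + b"\x00" + str(doc[f]).encode())
    return (TYPE_ID["md5"] + h.digest()).hex()
```

Because the fields are sorted before hashing, two documents with the same field values always produce the same signature regardless of the order in which the fields were added.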
[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling
[ https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12637649#action_12637649 ] Andrzej Bialecki commented on SOLR-799: Interesting development in light of NUTCH-442 :) Some comments: * in MD5Signature I suggest using the code from org.apache.hadoop.io.MD5Hash.toString() instead of BigInteger. * TextProfileSignature should contain a remark that it's copied from Nutch, since AFAIK the algorithm it implements is currently used only in Nutch. * in Nutch the concept of a page Signature is only a part of the deduplication process. The other part is the algorithm that decides which copy to keep and which one to discard. In your patch the latest update always removes all other documents with the same signature. IMHO this decision should be isolated into a DuplicateDeletePolicy class that gets all duplicates and can decide (based on arbitrary criteria) which one to keep, with a default implementation that simply keeps the latest document. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
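The DuplicateDeletePolicy separation proposed above (the class name comes from the comment and is hypothetical, not an existing Solr API) can be sketched like this:

```python
# Sketch of the proposed policy split: the policy sees every document
# sharing a signature and decides which single one survives; the
# default mirrors the patch's behaviour of keeping the latest update.
class KeepLatestPolicy:
    def keep(self, duplicates):
        return max(duplicates, key=lambda d: d["timestamp"])

def dedupe(docs, policy):
    by_sig = {}
    for d in docs:
        by_sig.setdefault(d["sig"], []).append(d)
    # One survivor per signature group, chosen by the pluggable policy.
    return [policy.keep(group) for group in by_sig.values()]
```

Other policies (keep the highest-boost document, keep the one from the most trusted source) would only need to implement `keep`, leaving the grouping logic untouched.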
[jira] Commented: (SOLR-398) Widen return type of FieldType.createField to Fieldable in order to maximize flexibility
[ https://issues.apache.org/jira/browse/SOLR-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626099#action_12626099 ] Andrzej Bialecki commented on SOLR-398: Great minds think alike :) I was about to submit exactly the same patch when I noticed this JIRA. One thing is missing in your patch - DocumentBuilder:282 should use out.getFieldable() instead of out.getField(). This is required if you provide a custom Fieldable implementation (!instanceof o.a.l.d.Field) in FieldType subclasses, because o.a.l.d.Document.getField tries to cast the result to o.a.l.d.Field, whereas Document.getFieldable is happy with any subclass of Fieldable ;) Widen return type of FieldType.createField to Fieldable in order to maximize flexibility Key: SOLR-398 URL: https://issues.apache.org/jira/browse/SOLR-398 Project: Solr Issue Type: Improvement Components: update Affects Versions: 1.2, 1.3 Reporter: Espen Amble Kolstad Priority: Minor Attachments: 1.2-FieldType-2.patch, trunk-FieldType-3.patch, trunk-FieldType-4.patch FieldType.createField currently returns Field. In order to maximize flexibility for developers to extend Solr, it should return Fieldable. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
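The getField()/getFieldable() distinction noted in the comment is a plain narrowing-cast problem; transposed to Python as a toy (the Java classes are Lucene's, the checks below merely mimic the failing cast):

```python
# Toy illustration: a lookup that insists on the concrete Field class
# rejects custom Fieldable implementations, which is exactly why
# DocumentBuilder must call getFieldable() instead of getField().
class Fieldable: ...
class Field(Fieldable): ...
class MyCustomField(Fieldable): ...   # custom impl, NOT a Field

def get_field(f):
    if not isinstance(f, Field):      # mimics the ClassCastException
        raise TypeError("not a Field")
    return f

def get_fieldable(f):
    if not isinstance(f, Fieldable):  # any Fieldable subclass is fine
        raise TypeError("not a Fieldable")
    return f

get_fieldable(MyCustomField())   # accepted
# get_field(MyCustomField())     # raises TypeError
```

Widening the declared return type to the interface, as the issue proposes, removes the need for the narrowing lookup entirely.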
[jira] Commented: (SOLR-139) Support updateable/modifiable documents
[ https://issues.apache.org/jira/browse/SOLR-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12616725#action_12616725 ] Andrzej Bialecki commented on SOLR-139: It's possible to recover unstored fields if the purpose of such recovery is to make a copy of the document and update other fields. The process is time-consuming, because you need to traverse all postings for all terms, so it might be impractical for larger indexes. Furthermore, such recovered content may be incomplete - tokens may have been changed or skipped/added by analyzers, positionIncrement gaps may have been introduced, etc. Most of this functionality is implemented in Luke's Restore Edit function. Perhaps it's possible to implement a new low-level Lucene API to do it more efficiently. Support updateable/modifiable documents --- Key: SOLR-139 URL: https://issues.apache.org/jira/browse/SOLR-139 Project: Solr Issue Type: New Feature Components: update Reporter: Ryan McKinley Assignee: Ryan McKinley Attachments: Eriks-ModifiableDocument.patch, Eriks-ModifiableDocument.patch, Eriks-ModifiableDocument.patch, Eriks-ModifiableDocument.patch, Eriks-ModifiableDocument.patch, Eriks-ModifiableDocument.patch, getStoredFields.patch, getStoredFields.patch, getStoredFields.patch, getStoredFields.patch, getStoredFields.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, SOLR-139-ModifyInputDocuments.patch, SOLR-139-ModifyInputDocuments.patch, SOLR-139-ModifyInputDocuments.patch, SOLR-139-ModifyInputDocuments.patch, SOLR-139-XmlUpdater.patch, SOLR-269+139-ModifiableDocumentUpdateProcessor.patch It would be nice to be
able to update some fields on a document without having to insert the entire document. Given the way lucene is structured, (for now) one can only modify stored fields. While we are at it, we can support incrementing an existing value - I think this only makes sense for numbers. for background, see: http://www.nabble.com/loading-many-documents-by-ID-tf3145666.html#a8722293 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
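The reconstruction process described in the comment above (traversing every term's postings to rebuild one document's token stream) can be sketched on a toy in-memory inverted index; the real Lucene version works against term enumerations and is far more IO-intensive:

```python
# Toy sketch of recovering an unstored field from the inverted index:
# walk every term's postings, collect the positions that belong to one
# document, and rebuild an approximate token stream from them.
def reconstruct(inverted_index, doc_id):
    # inverted_index: {term: {doc_id: [positions]}}
    slots = {}
    for term, postings in inverted_index.items():
        for pos in postings.get(doc_id, []):
            slots[pos] = term
    # Gaps (stopwords removed by analyzers, positionIncrement gaps)
    # are unrecoverable and show up as placeholders.
    return [slots.get(i, "?") for i in range(max(slots, default=-1) + 1)]

idx = {"quick": {1: [0]}, "fox": {1: [2]}}
reconstruct(idx, 1)   # ['quick', '?', 'fox'] - position 1 was lost
```

The placeholder at position 1 illustrates exactly why the comment calls the result inexact: whatever token the analyzer dropped there cannot be recovered from the postings alone.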
[jira] Updated: (SOLR-84) New Solr logo?
[ https://issues.apache.org/jira/browse/SOLR-84?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated SOLR-84: -- Attachment: solr.svg Some ideas for a black-and-white version of the logo. New Solr logo? -- Key: SOLR-84 URL: https://issues.apache.org/jira/browse/SOLR-84 Project: Solr Issue Type: Improvement Reporter: Bertrand Delacretaz Priority: Minor Attachments: logo-grid.jpg, logo-solr-d.jpg, logo-solr-e.jpg, logo-solr-source-files-take2.zip, solr-84-source-files.zip, solr-f.jpg, solr-logo-20061214.jpg, solr-logo-20061218.JPG, solr-logo-20070124.JPG, solr-nick.gif, solr.jpg, solr.svg, sslogo-solr-flare.jpg, sslogo-solr.jpg, sslogo-solr2-flare.jpg, sslogo-solr2.jpg, sslogo-solr3.jpg Following up on SOLR-76, our trainee Nicolas Barbay (nicolas (put at here) sarraux-dessous.ch) has reworked his logo proposal to be more solar. This can either be the start of a logo contest, or, if people like it, we could adopt it. The gradients can make it a bit hard to integrate, not sure if this is really a problem. WDYT? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.