[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2010-03-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849225#action_12849225
 ] 

Andrzej Bialecki  commented on SOLR-799:


This issue is closed - please use the mailing lists for discussions.

 Add support for hash based exact/near duplicate document handling
 -

 Key: SOLR-799
 URL: https://issues.apache.org/jira/browse/SOLR-799
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Mark Miller
Assignee: Yonik Seeley
Priority: Minor
 Fix For: 1.4

 Attachments: SOLR-799.patch, SOLR-799.patch, SOLR-799.patch, 
 SOLR-799.patch


 Hash-based duplicate document detection is efficient and allows for blocking 
 as well as field collapsing. Let's put it into Solr. 
 http://wiki.apache.org/solr/Deduplication
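
 For background, the approach described on the wiki page computes a 
 per-document signature and uses it to block or collapse duplicates. A minimal 
 sketch of the exact (hash-based) case follows, assuming MD5 over the field 
 values chosen for dedup; the patch's actual Signature classes (e.g. for 
 near-duplicates) differ from this.

{code:java}
import java.security.MessageDigest;

// Minimal sketch of an exact-duplicate signature: concatenate the values
// of the fields chosen for dedup and hash them. Near-duplicate detection
// is fuzzier than this simple exact hash.
public class ExactSignatureSketch {
  public static String signature(String... fieldValues) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    for (String v : fieldValues) {
      md.update(v.getBytes("UTF-8"));
    }
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();            // equal signature => duplicate document
  }
}
{code}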

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1837) Reconstruct a Document (stored fields, indexed fields, payloads)

2010-03-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847923#action_12847923
 ] 

Andrzej Bialecki  commented on SOLR-1837:
-

Re: bugs in Luke that result in missing terms - I recently fixed one such bug, 
and indeed it was located in the DocReconstructor. If you are aware of others, 
please report them via the Luke issue tracker.

Document reconstruction is a very IO-intensive operation, so I would advise 
against using it on a production system; it also produces inexact results 
(because analysis is usually a lossy operation).
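
For context on the cost: reconstruction has to invert the inverted index, 
visiting every term in the index to rebuild one document's fields. A rough 
sketch against the Lucene 3.x TermEnum/TermPositions API of that era (a 
hypothetical helper, not Luke's actual DocReconstructor code):

{code:java}
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermPositions;

public class ReconstructSketch {
  // Rebuild one document's field as a position -> term map. Every term
  // in the entire index has to be visited, which is what makes this so
  // IO-intensive on large indexes.
  public static Map<Integer, String> reconstruct(IndexReader reader,
      int docId, String fieldName) throws Exception {
    Map<Integer, String> tokens = new TreeMap<Integer, String>();
    TermEnum terms = reader.terms();
    TermPositions positions = reader.termPositions();
    while (terms.next()) {                    // walk ALL terms in the index
      Term t = terms.term();
      if (!fieldName.equals(t.field())) continue;
      positions.seek(t);
      if (positions.skipTo(docId) && positions.doc() == docId) {
        for (int i = 0; i < positions.freq(); i++) {
          tokens.put(positions.nextPosition(), t.text());
        }
      }
    }
    return tokens;
  }
}
{code}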

 Reconstruct a Document (stored fields, indexed fields, payloads)
 

 Key: SOLR-1837
 URL: https://issues.apache.org/jira/browse/SOLR-1837
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis, web gui
Affects Versions: 1.5
 Environment: All
Reporter: Trey Grainger
Priority: Minor
 Fix For: 1.5

   Original Estimate: 168h
  Remaining Estimate: 168h

 One Solr feature I've been sorely in need of is the ability to inspect an 
 index for any particular document.  While the analysis page is good when you 
 have specific content and a specific field/type you want to test the 
 analysis process for, once a document is indexed it is not currently possible 
 to easily see what is actually sitting in the index.
 One can use the Lucene Index Browser (Luke), but this has several limitations 
 (GUI only, doesn't understand the Solr schema, doesn't display many non-text 
 fields in human-readable format, doesn't show payloads, some bugs lead to 
 missing terms, exposes features dangerous to use in a production Solr 
 environment, slow or difficult to check from a remote location, etc.).  The 
 document reconstruction feature of Luke, however, provides the base for what 
 can become a much more powerful tool when coupled with Solr's understanding 
 of a schema.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1536) Support for TokenFilters that may modify input documents

2010-02-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835660#action_12835660
 ] 

Andrzej Bialecki  commented on SOLR-1536:
-

Term freq. vectors are not available at this stage, unless you go to the expense 
of creating a MemoryIndex. I think the solution I proposed is less costly and 
more generic.
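
For reference, a minimal sketch of the MemoryIndex alternative mentioned above 
(Lucene contrib's single-document in-memory index; the field name and analyzer 
choice here are arbitrary examples):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.util.Version;

public class TermStatsSketch {
  public static void main(String[] args) {
    // Building a throwaway single-document index just to get term stats.
    // This is the "expense" referred to above: one MemoryIndex per doc.
    MemoryIndex mi = new MemoryIndex();
    mi.addField("body", "some document text",         // hypothetical field
        new StandardAnalyzer(Version.LUCENE_30));
    // Term statistics can then be read back via mi.createSearcher() and
    // its IndexReader - all recomputed from scratch for every document.
  }
}
{code}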

 Support for TokenFilters that may modify input documents
 

 Key: SOLR-1536
 URL: https://issues.apache.org/jira/browse/SOLR-1536
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: altering.patch


 In some scenarios it's useful to be able to create or modify fields in the 
 input document based on analysis of other fields of this document. This need 
 arises e.g. when indexing multilingual documents, or when doing NLP 
 processing such as NER. However, this is currently not possible.
 This issue provides an implementation of this functionality that consists of 
 the following parts:
 * DocumentAlteringFilterFactory - abstract superclass that indicates that 
 TokenFilter-s created from this factory may modify fields in a 
 SolrInputDocument.
 * TypeAsFieldFilterFactory - example implementation that illustrates this 
 concept, with a JUnit test.
 * DocumentBuilder modifications to support this functionality.
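
 As a self-contained sketch of the core idea (invented for illustration, not 
 the patch's actual classes): a TokenFilter that holds a reference to the 
 SolrInputDocument being built and writes token types into an extra field, 
 roughly what TypeAsFieldFilterFactory is described as doing.

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.solr.common.SolrInputDocument;

// Illustrative filter: while one field is being analyzed, it copies each
// token's type into another (multivalued) field of the same document.
public class TypeAsFieldFilterSketch extends TokenFilter {
  private final SolrInputDocument doc;
  private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

  public TypeAsFieldFilterSketch(TokenStream input, SolrInputDocument doc) {
    super(input);
    this.doc = doc;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    doc.addField("token_types", typeAtt.type());  // hypothetical field name
    return true;
  }
}
{code}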

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1536) Support for TokenFilters that may modify input documents

2010-02-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1536:


Attachment: altering.patch

Patch updated to trunk.

 Support for TokenFilters that may modify input documents
 

 Key: SOLR-1536
 URL: https://issues.apache.org/jira/browse/SOLR-1536
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: altering.patch, altering.patch


 In some scenarios it's useful to be able to create or modify fields in the 
 input document based on analysis of other fields of this document. This need 
 arises e.g. when indexing multilingual documents, or when doing NLP 
 processing such as NER. However, this is currently not possible.
 This issue provides an implementation of this functionality that consists of 
 the following parts:
 * DocumentAlteringFilterFactory - abstract superclass that indicates that 
 TokenFilter-s created from this factory may modify fields in a 
 SolrInputDocument.
 * TypeAsFieldFilterFactory - example implementation that illustrates this 
 concept, with a JUnit test.
 * DocumentBuilder modifications to support this functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1535) Pre-analyzed field type

2010-02-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1535:


Attachment: (was: preanalyzed.patch)

 Pre-analyzed field type
 ---

 Key: SOLR-1535
 URL: https://issues.apache.org/jira/browse/SOLR-1535
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: preanalyzed.patch


 PreAnalyzedFieldType provides the functionality to index (and optionally store) 
 content that was already processed and split into tokens using some external 
 processing chain. This implementation defines a serialization format for 
 sending tokens with any currently supported Attributes (e.g. type, posIncr, 
 payload, ...). This data is de-serialized into a regular TokenStream that is 
 returned in Field.tokenStreamValue() and thus added to the index as index 
 terms; optionally, a stored part is returned in Field.stringValue() and is 
 then added as the stored value of the field.
 This field type is useful for integrating Solr with existing text-processing 
 pipelines, such as third-party NLP systems.
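
 To illustrate the consumption side described above (the helper name below is 
 hypothetical; the patch defines the real serialization format and parser):

{code:java}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class PreAnalyzedSketch {
  // "deserialize" stands in for the patch's parser that turns the
  // serialized tokens back into a TokenStream.
  static TokenStream deserialize(String serialized) { return null; /* hypothetical */ }

  public static Document buildDoc(String serialized, String storedValue) {
    Document doc = new Document();
    // indexed part: the token stream reconstructed from the wire format
    // is indexed as-is, bypassing the analyzer
    doc.add(new Field("content", deserialize(serialized)));
    // optional stored part, carried separately from the tokens
    doc.add(new Field("content", storedValue, Field.Store.YES, Field.Index.NO));
    return doc;
  }
}
{code}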

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1535) Pre-analyzed field type

2010-02-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1535:


Attachment: preanalyzed.patch

Sigh ... attaching the correct patch.

 Pre-analyzed field type
 ---

 Key: SOLR-1535
 URL: https://issues.apache.org/jira/browse/SOLR-1535
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: preanalyzed.patch, preanalyzed.patch


 PreAnalyzedFieldType provides the functionality to index (and optionally store) 
 content that was already processed and split into tokens using some external 
 processing chain. This implementation defines a serialization format for 
 sending tokens with any currently supported Attributes (e.g. type, posIncr, 
 payload, ...). This data is de-serialized into a regular TokenStream that is 
 returned in Field.tokenStreamValue() and thus added to the index as index 
 terms; optionally, a stored part is returned in Field.stringValue() and is 
 then added as the stored value of the field.
 This field type is useful for integrating Solr with existing text-processing 
 pipelines, such as third-party NLP systems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1535) Pre-analyzed field type

2010-02-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1535:


Attachment: (was: altering.patch)

 Pre-analyzed field type
 ---

 Key: SOLR-1535
 URL: https://issues.apache.org/jira/browse/SOLR-1535
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: preanalyzed.patch, preanalyzed.patch


 PreAnalyzedFieldType provides the functionality to index (and optionally store) 
 content that was already processed and split into tokens using some external 
 processing chain. This implementation defines a serialization format for 
 sending tokens with any currently supported Attributes (e.g. type, posIncr, 
 payload, ...). This data is de-serialized into a regular TokenStream that is 
 returned in Field.tokenStreamValue() and thus added to the index as index 
 terms; optionally, a stored part is returned in Field.stringValue() and is 
 then added as the stored value of the field.
 This field type is useful for integrating Solr with existing text-processing 
 pipelines, such as third-party NLP systems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1535) Pre-analyzed field type

2010-02-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1535:


Attachment: altering.patch

Oops ... the previous patch produced NPEs. This one doesn't.

 Pre-analyzed field type
 ---

 Key: SOLR-1535
 URL: https://issues.apache.org/jira/browse/SOLR-1535
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: preanalyzed.patch, preanalyzed.patch


 PreAnalyzedFieldType provides the functionality to index (and optionally store) 
 content that was already processed and split into tokens using some external 
 processing chain. This implementation defines a serialization format for 
 sending tokens with any currently supported Attributes (e.g. type, posIncr, 
 payload, ...). This data is de-serialized into a regular TokenStream that is 
 returned in Field.tokenStreamValue() and thus added to the index as index 
 terms; optionally, a stored part is returned in Field.stringValue() and is 
 then added as the stored value of the field.
 This field type is useful for integrating Solr with existing text-processing 
 pipelines, such as third-party NLP systems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1536) Support for TokenFilters that may modify input documents

2010-02-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1536:


Attachment: altering.patch

Updated patch - the previous one produced NPEs.

 Support for TokenFilters that may modify input documents
 

 Key: SOLR-1536
 URL: https://issues.apache.org/jira/browse/SOLR-1536
 Project: Solr
  Issue Type: New Feature
  Components: Schema and Analysis
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: altering.patch, altering.patch


 In some scenarios it's useful to be able to create or modify fields in the 
 input document based on analysis of other fields of this document. This need 
 arises e.g. when indexing multilingual documents, or when doing NLP 
 processing such as NER. However, this is currently not possible.
 This issue provides an implementation of this functionality that consists of 
 the following parts:
 * DocumentAlteringFilterFactory - abstract superclass that indicates that 
 TokenFilter-s created from this factory may modify fields in a 
 SolrInputDocument.
 * TypeAsFieldFilterFactory - example implementation that illustrates this 
 concept, with a JUnit test.
 * DocumentBuilder modifications to support this functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1535) Pre-analyzed field type

2010-02-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1535:


Attachment: preanalyzed.patch

Patch updated to the current trunk.

 Pre-analyzed field type
 ---

 Key: SOLR-1535
 URL: https://issues.apache.org/jira/browse/SOLR-1535
 Project: Solr
  Issue Type: New Feature
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: preanalyzed.patch, preanalyzed.patch


 PreAnalyzedFieldType provides the functionality to index (and optionally store) 
 content that was already processed and split into tokens using some external 
 processing chain. This implementation defines a serialization format for 
 sending tokens with any currently supported Attributes (e.g. type, posIncr, 
 payload, ...). This data is de-serialized into a regular TokenStream that is 
 returned in Field.tokenStreamValue() and thus added to the index as index 
 terms; optionally, a stored part is returned in Field.stringValue() and is 
 then added as the stored value of the field.
 This field type is useful for integrating Solr with existing text-processing 
 pipelines, such as third-party NLP systems.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2010-02-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833786#action_12833786
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

I would lean towards the latter - complex do-it-all components often suffer 
from creeping featuritis and insufficient testing/maintenance, because few 
users use all their features, and few developers understand how they work. I 
subscribe to the Unix philosophy - do one thing, and do it right - so I think 
that if we can implement an autosuggest that works well from the technical 
POV, it will become a reliable component that you can combine in many creative 
ways to satisfy different scenarios, of which there are likely many more than 
what you described ...

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800746#action_12800746
 ] 

Andrzej Bialecki  commented on SOLR-1301:
-

bq. I'm curious about the not sending over the network. Have you tried the 
Streaming Server or even just the regular one?

Hmm, I don't think this would make sense - the whole point of this patch is to 
distribute the load by indexing into multiple Solr instances that use the same 
config - and this can be an existing user's config, including the components 
from ${solr.home}/lib.

bq. How would this work with someone who already has a separate Solr cluster 
setup?

It wouldn't - partly because there is no canonical Solr cluster setup against 
which to code this ... Would that be the same cluster (1:1 mapping) as the 
Hadoop cluster?

bq. Also, I haven't looked closely at the patch, but if I understand correctly, 
it is writing out the indexes to the local disks on the Hadoop cluster?

HDFS doesn't support enough POSIX semantics to allow writing Lucene indexes 
directly to it - for this reason indexes are always created on the local 
storage of each node, and after closing they are copied to HDFS.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) into Solr's EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When the reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.
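
 Based on the Design section above, the converter's job might look like this 
 (a hypothetical sketch of the logic a SolrDocumentConverter implementation 
 would hold; the patch's actual API may differ in signature, and the field 
 names are invented schema examples):

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

public class CsvConverterSketch {
  // Turn a reducer's (key, value) pair into a SolrInputDocument, as in
  // the CSV example: the key is the document id, the value holds the
  // comma-separated field values.
  public static SolrInputDocument convert(Text key, Text value) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());
    String[] cols = value.toString().split(",");
    doc.addField("title", cols[0]);     // hypothetical schema fields
    doc.addField("body", cols[1]);
    return doc;
  }
}
{code}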

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2010-01-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800758#action_12800758
 ] 

Andrzej Bialecki  commented on SOLR-1301:
-

Iff we somehow could get a mapping between a mapred task on node X and a 
particular target Solr server (beyond the two obvious choices, i.e. a single 
URL for one Solr, or localhost for per-node Solr-s) then sure, why not. And you 
are right that we wouldn't use the embedded Solr in that case. But this patch 
solves a different problem, and it solves it within the facilities of the 
current config ;)

bq. Right, and then copied down from HDFS and installed in Solr, correct? You 
still have the issue of knowing which Solr instances get which shards off of 
HDFS, right? Just seems like a little more configuration knowledge could 
alleviate all that extra copying/installing, etc.

Yes. But that would be a completely different scenario - we could wrap it in a 
Hadoop OutputFormat as well, but the implementation would be totally different 
from this patch.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.5

 Attachments: commons-logging-1.0.4.jar, 
 commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
 log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
 SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) into Solr's EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When the reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1602) Refactor SOLR package structure to include o.a.solr.response and move QueryResponseWriters in there

2010-01-04 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796291#action_12796291
 ] 

Andrzej Bialecki  commented on SOLR-1602:
-

I'm in favor of B. This worked well in Hadoop (mapred -> mapreduce), where the 
list of deprecations was massive and the API changes were not straightforward 
at all - still, it was done to promote a better design and allow new 
functionality. Whole deprecated hierarchies lived there for at least two major 
releases, and surely they were visible to thousands of Hadoop devs. The 
downside was occasional confusion, and of course the porting effort required 
to use the new API, but the upside was excellent back-compat to keep serious 
users happy, and a clear message to all to get prepared for the switch.

So IMHO having a bunch of deprecated classes for a while is not a big deal, if 
it gives us freedom to pursue a better design.

 Refactor SOLR package structure to include o.a.solr.response and move 
 QueryResponseWriters in there
 ---

 Key: SOLR-1602
 URL: https://issues.apache.org/jira/browse/SOLR-1602
 Project: Solr
  Issue Type: Improvement
  Components: Response Writers
Affects Versions: 1.2, 1.3, 1.4
 Environment: independent of environment (code structure)
Reporter: Chris A. Mattmann
Assignee: Noble Paul
 Fix For: 1.5

 Attachments: SOLR-1602.Mattmann.112509.patch.txt, 
 SOLR-1602.Mattmann.112509_02.patch.txt, upgrade_solr_config


 Currently all o.a.solr.request.QueryResponseWriter implementations are 
 curiously located in the o.a.solr.request package. Not only is this package 
 getting big (30+ classes), but a lot of the classes in it are misplaced. There 
 should be a first-class o.a.solr.response package, and the response-related 
 classes should be given a home there. Patch forthcoming.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1632) Distributed IDF

2009-12-23 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1632:


Attachment: distrib-2.patch

Updated patch; it also contains:

* LRU-based cache that optimizes requests using cached values of docFreq for 
known terms
* unit tests

 Distributed IDF
 ---

 Key: SOLR-1632
 URL: https://issues.apache.org/jira/browse/SOLR-1632
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: distrib-2.patch, distrib.patch


 Distributed IDF is a valuable enhancement for distributed search across 
 non-uniform shards. This issue tracks the proposed implementation of an API 
 to support this functionality in Solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1316) Create autosuggest component

2009-12-15 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1316:


Attachment: suggest.patch

Updated patch:

 * removed the broken RadixTree,
 * changed Suggester and Lookup API so that they don't join the tokens - 
instead they will use whatever tokens are produced by the analyzer. For now 
results are merged into a single SpellingResult.

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-12-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791164#action_12791164
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

bq. What about DAWGs? Are we still considering them?

I would be happy to include DAWGs if someone were to implement them ... ;)

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1632) Distributed IDF

2009-12-11 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789174#action_12789174
 ] 

Andrzej Bialecki  commented on SOLR-1632:
-

I'm not sure what approach you are referring to. Following the terminology in 
that thread, this implementation follows the approach where there is a single 
merged big idf map at the master, and it's sent out to slaves on each query. 
However, when exactly this merging and sending happens is 
implementation-specific - in the ExactDFSource it happens on every query, but I 
hope the API can support other scenarios as well.

 Distributed IDF
 ---

 Key: SOLR-1632
 URL: https://issues.apache.org/jira/browse/SOLR-1632
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: distrib.patch


 Distributed IDF is a valuable enhancement for distributed search across 
 non-uniform shards. This issue tracks the proposed implementation of an API 
 to support this functionality in Solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1632) Distributed IDF

2009-12-11 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789607#action_12789607
 ] 

Andrzej Bialecki  commented on SOLR-1632:
-

I believe the API that I propose would support such an implementation as well. 
Please note that it's usually not feasible to compute and distribute the 
complete IDF table for all terms - you would have to replicate a union of all 
term dictionaries across the cluster. In practice, you limit the amount of 
information by various means, e.g. only distributing data related to the 
current request (this implementation) or reducing the frequency of updates 
(e.g. LRU caching), or approximating global DF with a constant for frequent 
terms (where the contribution of their IDF to the score would be negligible 
anyway).
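
As an aside, the LRU option mentioned above can be as small as a size-bounded, 
access-ordered map in front of the cross-shard docFreq requests. A minimal 
sketch (assuming a "field:term" string key; not the patch's actual cache):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache for global docFreq values, keyed by field:term.
// Entries are refreshed only when evicted, trading exactness for fewer
// cross-shard requests - one of the trade-offs described above.
public class DocFreqCache extends LinkedHashMap<String, Integer> {
  private final int maxSize;

  public DocFreqCache(int maxSize) {
    super(16, 0.75f, true);            // access-order gives LRU behavior
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
    return size() > maxSize;
  }
}
{code}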

 Distributed IDF
 ---

 Key: SOLR-1632
 URL: https://issues.apache.org/jira/browse/SOLR-1632
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: distrib.patch


 Distributed IDF is a valuable enhancement for distributed search across 
 non-uniform shards. This issue tracks the proposed implementation of an API 
 to support this functionality in Solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-12-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788913#action_12788913
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

Thanks for the review!

bq.  Why do we concatenate all the tokens into one before calling 
Lookup#lookup? It seems we should be getting suggestions for each token just as 
SpellCheckComponent does.

Yeah, it's disputable, and we could change it to use single tokens ... My 
thinking was that the usual scenario is that you submit autosuggest queries 
soon after the user starts typing the query, and the highest perceived value of 
such functionality is when it can suggest complete meaningful phrases and not 
just individual terms. I.e. when you start typing "token sug" it won't suggest 
"token sugar" but instead it will suggest "token suggestions".

bq. Related to #1, the Lookup#lookup method should return something more fine 
grained rather than a SpellingResult

Such as? What you put there is what you get ;) so the fact that we are getting 
complete phrases as suggestions is the consequence of the choice above - the 
trie in this case is populated with phrases. If we populate it with tokens, 
then we can return per-token suggestions, again - losing the added value I 
mentioned above.

bq. Has anyone done any benchmarking to figure out the data structure we want 
to go ahead with?

For now I'm sure that we do NOT want to use the impl. of RadixTree in this 
patch, because it doesn't support our use case - I'll prepare a patch that 
removes this impl. Other implementations seem comparable w.r.t. speed, based 
on casual tests using /usr/share/dict/words, but I haven't run any exact 
benchmarks yet.


 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Assignee: Shalin Shekhar Mangar
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1632) Distributed IDF

2009-12-07 Thread Andrzej Bialecki (JIRA)
Distributed IDF
---

 Key: SOLR-1632
 URL: https://issues.apache.org/jira/browse/SOLR-1632
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.5
Reporter: Andrzej Bialecki 


Distributed IDF is a valuable enhancement for distributed search across 
non-uniform shards. This issue tracks the proposed implementation of an API to 
support this functionality in Solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1632) Distributed IDF

2009-12-07 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1632:


Attachment: distrib.patch

Initial implementation. This supports the current global IDF (i.e. none ;) ), 
and an exact version of global IDF that requires one additional request per 
query to obtain per-shard stats.

The design should already be flexible enough to implement LRU caching of 
docFreqs, and ultimately other methods of global IDF calculation (e.g. based 
on estimation or re-ranking).
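
For illustration, the exact approach boils down to summing per-shard 
statistics for the current query's terms before scoring. A minimal sketch of 
that merge step (method and map shapes are invented here; the patch's 
ExactDFSource wires this into Solr itself):

{code:java}
import java.util.HashMap;
import java.util.Map;

public class GlobalDfSketch {
  // Exact global IDF: for the current query's terms, sum docFreq over
  // all shards and hand the merged map to the scorers.
  public static Map<String, Integer> mergeDocFreqs(
      Iterable<Map<String, Integer>> perShardDf) {
    Map<String, Integer> global = new HashMap<String, Integer>();
    for (Map<String, Integer> shard : perShardDf) {
      for (Map.Entry<String, Integer> e : shard.entrySet()) {
        Integer cur = global.get(e.getKey());
        global.put(e.getKey(), cur == null ? e.getValue() : cur + e.getValue());
      }
    }
    return global;
  }
}
{code}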

 Distributed IDF
 ---

 Key: SOLR-1632
 URL: https://issues.apache.org/jira/browse/SOLR-1632
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: distrib.patch


 Distributed IDF is a valuable enhancement for distributed search across 
 non-uniform shards. This issue tracks the proposed implementation of an API 
 to support this functionality in Solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1614) Search in Hadoop

2009-12-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784212#action_12784212
 ] 

Andrzej Bialecki  commented on SOLR-1614:
-

If query performance is not a concern, then why not execute it directly on HDFS 
(using e.g. Nutch FsDirectory to read indexes from HDFS)?

 Search in Hadoop
 

 Key: SOLR-1614
 URL: https://issues.apache.org/jira/browse/SOLR-1614
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5


 What's the use case? Sometimes queries are expensive (such as
 regex) or one has indexes located in HDFS that then need to be
 searched. By leveraging Hadoop, these non-time-sensitive
 queries may be executed without dynamically deploying the
 indexes to new Solr servers. 
 We'll download the indexes out of HDFS (assuming they're zipped),
 perform the queries in a batch on the index shard, then merge
 the results either using a Solr query results priority queue, or
 simply using Hadoop's built in merge sorting. 
 The query file will be encoded in JSON format, (ID, query,
 numresults,fields). The shards file will simply contain newline
 delimited paths (HDFS or otherwise). The output can be a Solr
 encoded results file per query.
 I'm hoping to add an actual Hadoop unit test.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-11-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780530#action_12780530
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

Re: question 1 - currently this component doesn't support populating the 
dictionary from a distributed index.

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-11-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778280#action_12778280
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

Re: the tree creation - well, this is a current limitation of the Dictionary 
API, which provides only an Iterator. So in the general case it's not possible 
to start from the middle of the iterator so that the tree comes out 
well-balanced. Is it possible to re-balance the tree on the fly?

Re: svn - it works for me ...

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-11-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778295#action_12778295
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

Well, this is kind of ugly, because it increases the memory footprint of the 
build phase - the whole point of using an Iterator in the Dictionary was that 
you don't have to cache all dictionary data in memory; dictionaries can be 
large, and they are not guaranteed to be sorted or to have unique keys.

But if there are no better options for now, then yes, we could do this just in 
TSTLookup. Is there really no way to rebalance the tree?
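
If the keys do end up buffered and sorted in memory, the standard trick is to 
insert them midpoint-first so the TST comes out balanced without any 
on-the-fly rebalancing. A minimal sketch (TernaryTree and its insert method 
are hypothetical stand-ins for the patch's tree):

{code:java}
import java.util.List;

public class BalancedTstLoader {
  // Hypothetical minimal TST contract; stands in for the patch's tree.
  public interface TernaryTree { void insert(String key); }

  // Insert sorted keys midpoint-first: the middle element is inserted
  // before either half, and each half recurses the same way, so the
  // ternary tree stays balanced.
  public static void insertBalanced(TernaryTree tree, List<String> sorted,
      int lo, int hi) {
    if (lo > hi) return;
    int mid = (lo + hi) >>> 1;
    tree.insert(sorted.get(mid));
    insertBalanced(tree, sorted, lo, mid - 1);
    insertBalanced(tree, sorted, mid + 1, hi);
  }
}
{code}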

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1316) Create autosuggest component

2009-11-12 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1316:


Attachment: suggest.patch

Updated patch that includes the new TST sources. Tests on a 100k-word 
dictionary yield very similar results for the TST and Jaspell implementations, 
i.e. the initial build time is around 600ms, and then the lookup time is around 
4-7ms for prefixes that yield more than 100 results.

To test it, put this in your solrconfig.xml:

{code:xml}
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.jaspell.JaspellLookup</str>
      <str name="field">text</str>
      <str name="sourceLocation">american-english</str>
    </lst>
  </searchComponent>

  ...
{code}

And then use e.g. the following parameters:

{noformat}
spellcheck=true&spellcheck.build=true&spellcheck.dictionary=suggest \
  &spellcheck.extendedResults=true&spellcheck.count=100&q=test
{noformat}

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-11-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777045#action_12777045
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

Forgot to add - the RadixTree implementation doesn't work for now - it needs 
further refactoring to return the completed keys, and not just the values 
stored in nodes ...

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, suggest.patch, TST.zip

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1536) Support for TokenFilters that may modify input documents

2009-11-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12774162#action_12774162
 ] 

Andrzej Bialecki  commented on SOLR-1536:
-

My opinion may be biased, but I'll try to be as objective as I can ;) I think 
it's better, because it gives you much more flexibility in building analysis & 
indexing chains without coding. If we went with an UpdateRequestProcessor you 
would have to implement a new one whenever your analysis chain changes ... 
With the approach in this patch it's just a configuration issue, not an issue 
of implementing as many custom update processors as there are possible 
combinations ...

 Support for TokenFilters that may modify input documents
 

 Key: SOLR-1536
 URL: https://issues.apache.org/jira/browse/SOLR-1536
 Project: Solr
  Issue Type: New Feature
  Components: Analysis
Affects Versions: 1.5
Reporter: Andrzej Bialecki 
 Attachments: altering.patch


 In some scenarios it's useful to be able to create or modify fields in the 
 input document based on analysis of other fields of this document. This need 
 arises e.g. when indexing multilingual documents, or when doing NLP 
 processing such as NER. However, this is currently not possible.
 This issue provides an implementation of this functionality that consists of 
 the following parts:
 * DocumentAlteringFilterFactory - abstract superclass that indicates that 
 TokenFilter-s created from this factory may modify fields in a 
 SolrInputDocument.
 * TypeAsFieldFilterFactory - example implementation that illustrates this 
 concept, with a JUnit test.
 * DocumentBuilder modifications to support this functionality.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-398) Widen return type of FieldType.createField to Fieldable in order to maximize flexibility

2009-10-13 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-398:
---

Attachment: fieldable.patch

Patch updated to current trunk. One concrete use case where this is needed is 
Fieldable implementations that provide different values of tokenStreamValue() 
and stringValue(), e.g. when using external tools to provide a pre-tokenized 
value of the field.

 Widen return type of FieldType.createField to Fieldable in order to maximize 
 flexibility
 

 Key: SOLR-398
 URL: https://issues.apache.org/jira/browse/SOLR-398
 Project: Solr
  Issue Type: Improvement
  Components: update
Affects Versions: 1.2, 1.3
Reporter: Espen Amble Kolstad
Priority: Minor
 Fix For: 1.5

 Attachments: 1.2-FieldType-2.patch, fieldable.patch, 
 trunk-FieldType-3.patch, trunk-FieldType-4.patch, trunk-FieldType-5.patch


 FieldType.createField currently returns Field.
 In order to maximize flexibility for developers to extend Solr, it should 
 return Fieldable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader

2009-09-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12758381#action_12758381
 ] 

Andrzej Bialecki  commented on SOLR-1366:
-

Looks good to me, +1.

 UnsupportedOperationException may be thrown when using custom IndexReader
 -

 Key: SOLR-1366
 URL: https://issues.apache.org/jira/browse/SOLR-1366
 Project: Solr
  Issue Type: Bug
  Components: replication (java), search
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Shalin Shekhar Mangar
 Fix For: 1.4

 Attachments: searcher.patch, SOLR-1366.patch


 If a custom IndexReaderFactory is specified in solrconfig.xml, and the 
 IndexReader-s that it produces don't support IndexReader.directory() (as is 
 the case with ParallelReader or MultiReader), then an uncaught 
 UnsupportedOperationException is thrown.
 This call is used only to retrieve the full path of the directory for 
 informational purposes, so it shouldn't lead to a crash. Instead we could 
 supply other available information about the reader (e.g. from its toString() 
 method).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1316) Create autosuggest component

2009-09-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1316:


Attachment: suggest.patch

This is very much a work in progress, posted to get review before proceeding.

Highlights of this patch:

* created a set of interfaces in o.a.s.spelling.suggest to hide implementation 
details of various autocomplete mechanisms.

* imported sources of RadixTree, Jaspell TST and Ankul's TST. Wrapped each 
implementation so that it works with the same interface. (Ankul: I couldn't 
figure out how to actually retrieve suggested keys from your TST?)

* extended HighFrequencyDictionary to return TermFreqIterator, which gives not 
only words but also their frequencies. Implemented a similar iterator for 
file-based term-freq lists.
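
For reference, the shape of that extended iterator, paraphrased from the 
description above (not necessarily the patch's exact definition):

{code:java}
import java.util.Iterator;

// Paraphrase of the TermFreqIterator idea: a word iterator that also
// exposes the frequency of the word it most recently returned, so
// lookup structures can weight suggestions by frequency.
public interface TermFreqIterator extends Iterator<String> {
  float freq();
}
{code}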

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: suggest.patch, TernarySearchTree.tar.gz

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader

2009-09-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756652#action_12756652
 ] 

Andrzej Bialecki  commented on SOLR-1366:
-

+1 for adding a big red flag. My application depends on this functionality, and 
it works well now that I've overridden a bunch of additional methods in 
IndexReader that deal with Directory, IndexCommit, index version, etc.

(A few details on this, and why my solution is not applicable in the general 
case: I'm using ParallelReader, and the other indexes that I add are 
throwaways, i.e. I recreate them on each index refresh from external shared 
resources. So I basically short-circuited those methods that deal with 
directories and commits so that they return information from the main index. 
This way the file-based replication works as before for the main index.)

 UnsupportedOperationException may be thrown when using custom IndexReader
 -

 Key: SOLR-1366
 URL: https://issues.apache.org/jira/browse/SOLR-1366
 Project: Solr
  Issue Type: Bug
  Components: replication (java), search
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: searcher.patch


 If a custom IndexReaderFactory is specified in solrconfig.xml, and the 
 IndexReader-s that it produces don't support IndexReader.directory() (as is 
 the case with ParallelReader or MultiReader), then an uncaught 
 UnsupportedOperationException is thrown.
 This call is used only to retrieve the full path of the directory for 
 informational purposes, so it shouldn't lead to a crash. Instead we could 
 supply other available information about the reader (e.g. from its toString() 
 method).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-09-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12756819#action_12756819
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

Yes, it should work for now. In fact I started writing a new component, but it 
had to replicate most of the spellchecker ;) so I will just add bits to the 
existing spellchecker. I'm worried, though, that we're abusing the semantics of 
the API, and that it will become more difficult to fit both functions in a 
single API as the functionality evolves.

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: TernarySearchTree.tar.gz

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-09-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756050#action_12756050
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

bq. These enable suffix compression and create much smaller word graphs.

DAWGs are problematic, because they are essentially immutable once created (the 
cost of insert / delete is very high). So I propose to stick to TSTs for now.

Also, I think that populating the TST from the index would have to be 
discriminative, perhaps based on a threshold (so that it only adds terms with 
a large enough docFreq), and it would be good to adjust the content of the tree 
based on actual queries that return some results (poor man's auto-learning), 
gradually removing the least frequent strings to save space. We could also use 
as a source a field with 1-3 word shingles (no tf, unstored, to save space in 
the source index, with a similar thresholding mechanism).

Ankul, I'm not sure what the behavior of your implementation is when 
dynamically adding / removing keys - does it still remain balanced?

I also found an MIT-licensed impl. of radix tree here: 
http://code.google.com/p/radixtree, which looks good too, one spelling mistake 
in the API notwithstanding ;)
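
A minimal sketch of the thresholded population idea - assuming Lucene 2.9's 
TermEnum API, and leaving the dictionary structure abstract (a List stands in 
for whatever TST implementation is chosen):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

class SuggestDictionaryBuilder {
  // Collect autosuggest candidates for one field, keeping only terms whose
  // docFreq clears the threshold; the result would then be fed into the TST.
  static List<String> collectCandidates(IndexReader reader, String field,
                                        int minDocFreq) throws IOException {
    List<String> candidates = new ArrayList<String>();
    TermEnum terms = reader.terms(new Term(field, ""));  // seek to field start
    try {
      do {
        Term t = terms.term();
        if (t == null || !t.field().equals(field)) break;  // left the field
        if (terms.docFreq() >= minDocFreq) candidates.add(t.text());
      } while (terms.next());
    } finally {
      terms.close();
    }
    return candidates;
  }
}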


 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: TernarySearchTree.tar.gz

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-09-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756149#action_12756149
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

bq. Andrej, why would immutability be a problem? Wouldn't we have to re-build 
the TST if the source index changes?

Well, the use case I have in mind is a TST that improves itself over time based 
on the observed query log. I.e. you would bootstrap a TST from the index (and 
here indeed you can do this on every searcher refresh), but it's often claimed 
that real query logs provide a far better source of autocomplete than the index 
terms. My idea was to start with what you have - in the absence of query logs - 
and then improve upon it by adding successful queries (and removing least-used 
terms to keep the tree at a more or less constant size).

Alternatively we could provide an option to bootstrap it from a real query log 
data.

This use case requires mutability, hence my negative opinion about DAWGs 
(besides, we are lacking an implementation, aren't we, whereas we already have 
a few suitable TST implementations). Perhaps this doesn't have to be an 
either/or, if we come up with a pluggable interface for this type of component?

bq. I think the building of the data structure can be done in a way similar to 
what SpellCheckComponent does. [..]

+1
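
As a toy illustration of the roughly-constant-size idea - using recency as a 
crude stand-in for "least-used", and a map as a stand-in for the TST itself (a 
real implementation would evict from the tree directly):

import java.util.LinkedHashMap;
import java.util.Map;

// Access-ordered map capped at maxSize: each lookup refreshes an entry, and
// the least recently used suggestion is evicted once the cap is exceeded.
class BoundedSuggestions extends LinkedHashMap<String, Integer> {
  private final int maxSize;

  BoundedSuggestions(int maxSize) {
    super(16, 0.75f, true);  // true = access order, so eldest == LRU
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
    return size() > maxSize;
  }
}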


 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: TernarySearchTree.tar.gz

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1316) Create autosuggest component

2009-09-11 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754404#action_12754404
 ] 

Andrzej Bialecki  commented on SOLR-1316:
-

Jason, did you make any progress on this? I'm interested in this 
functionality. I'm not sure tries are the best choice - unless heavily pruned, 
they occupy a lot of RAM. I had some moderate success using an ngram-based 
method (I reused the spellchecker, with slight modifications) - the method is 
fast and reuses the existing spellchecker index, but the precision of lookups 
is not ideal.

 Create autosuggest component
 

 Key: SOLR-1316
 URL: https://issues.apache.org/jira/browse/SOLR-1316
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

   Original Estimate: 96h
  Remaining Estimate: 96h

 Autosuggest is a common search function that can be integrated
 into Solr as a SearchComponent. Our first implementation will
 use the TernaryTree found in Lucene contrib. 
 * Enable creation of the dictionary from the index or via Solr's
 RPC mechanism
 * What types of parameters and settings are desirable?
 * Hopefully in the future we can include user click through
 rates to boost those terms/phrases higher

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search

2009-09-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753648#action_12753648
 ] 

Andrzej Bialecki  commented on SOLR-1321:
-

This comment refers to the limitation of Lucene's QueryParser - there is only a 
single flag there to decide whether it accepts leading wildcards or not, 
regardless of field. Consequently, after checking the schema in SolrQueryParser 
we turn on this flag if _any_ field type supports leading wildcards. The end 
effect is that parsers created with IndexSchema.getSolrQueryParser() will 
accept leading wildcards for any field as long as at least one field type 
supports them, not necessarily the one for which the parser was created.

I don't see a way to fix this. I can clarify the comment, though, so that it's 
clear that this is a limitation in Lucene QueryParser.
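
For illustration, the schema scan described above amounts to something like the 
following. Hedged: the class and method names follow the Solr 1.4 types 
mentioned in this thread, but the details below are assumed and may differ 
from the actual patch:

import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.solr.analysis.TokenFilterFactory;
import org.apache.solr.analysis.TokenizerChain;
import org.apache.solr.schema.FieldType;
import org.apache.solr.schema.IndexSchema;

class LeadingWildcardCheck {
  // The single QueryParser flag must be switched on if *any* field type's
  // index-time chain contains the reversing filter - hence the per-field
  // imprecision discussed above.
  static boolean anyFieldTypeReverses(IndexSchema schema) {
    for (Map.Entry<String, FieldType> e : schema.getFieldTypes().entrySet()) {
      Analyzer a = e.getValue().getAnalyzer();
      if (!(a instanceof TokenizerChain)) continue;
      for (TokenFilterFactory f :
           ((TokenizerChain) a).getTokenFilterFactories()) {
        if (f.getClass().getSimpleName().startsWith("ReversedWildcard")) {
          return true;
        }
      }
    }
    return false;
  }
}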

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: SOLR-1321.patch, wildcards-2.patch, wildcards-3.patch, 
 wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader

2009-09-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753761#action_12753761
 ] 

Andrzej Bialecki  commented on SOLR-1366:
-

I didn't make myself clear .. I fixed this for my application, where I control 
the implementation of IndexReader, but I wouldn't recommend this fix for 
general use. So this was just FYI.

 UnsupportedOperationException may be thrown when using custom IndexReader
 -

 Key: SOLR-1366
 URL: https://issues.apache.org/jira/browse/SOLR-1366
 Project: Solr
  Issue Type: Bug
  Components: replication (java), search
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: searcher.patch


 If a custom IndexReaderFactory is specified in solrconfig.xml, and 
 IndexReader-s that it produces don't support IndexReader.directory() (such as 
 is the case with ParallelReader or MultiReader) then an uncaught 
 UnsupportedOperationException is thrown.
 This call is used only to retrieve the full path of the directory for 
 informational purposes, so it shouldn't lead to a crash. Instead we could 
 supply other available information about the reader (e.g. from its toString() 
 method).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader

2009-09-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753289#action_12753289
 ] 

Andrzej Bialecki  commented on SOLR-1366:
-

FYI, for now I solved this by extending my IndexReader to support this call and 
return the original directory that lists all index files plus a few resources 
that I care about. However, this is just glossing over the deeper problem - 
replication handler shouldn't assume the directory is file-based.

 UnsupportedOperationException may be thrown when using custom IndexReader
 -

 Key: SOLR-1366
 URL: https://issues.apache.org/jira/browse/SOLR-1366
 Project: Solr
  Issue Type: Bug
  Components: replication (java), search
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: searcher.patch


 If a custom IndexReaderFactory is specified in solrconfig.xml, and 
 IndexReader-s that it produces don't support IndexReader.directory() (such as 
 is the case with ParallelReader or MultiReader) then an uncaught 
 UnsupportedOperationException is thrown.
 This call is used only to retrieve the full path of the directory for 
 informational purposes, so it shouldn't lead to a crash. Instead we could 
 supply other available information about the reader (e.g. from its toString() 
 method).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search

2009-09-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1321:


Attachment: wildcards-3.patch

Updated patch that uses the new TokenAttribute API and uses (as much as 
possible) the new ReverseStringFilter.

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: wildcards-2.patch, wildcards-3.patch, wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search

2009-08-26 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748067#action_12748067
 ] 

Andrzej Bialecki  commented on SOLR-1321:
-

I'll update the patch, assuming the presence of the updated filter in Lucene, 
but I'd rather leave updating the libs to someone more intimate with Solr 
internals ...

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: wildcards-2.patch, wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1375) BloomFilter on a field

2009-08-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746591#action_12746591
 ] 

Andrzej Bialecki  commented on SOLR-1375:
-

See here for a Java impl. of FastBits: 
http://code.google.com/p/compressedbitset/ .

Re: BloomFilters - in BloomIndexComponent you seem to assume that when 
BloomKeySet.contains(key) returns true then the key exists in the set. This is 
not strictly speaking true. Only the negative answer is certain: when the 
result is false, the key does NOT exist in the set with probability 1.0. When 
the result is true, the answer is correct only with probability (1.0 - eps), 
i.e. the BloomFilter will return a false positive result for non-existent 
keys with probability eps. You should take this into account when writing 
client code.
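
To make the asymmetry concrete, here is a small sketch against the Hadoop 0.20 
BloomFilter referenced in this issue (the sizing constants are invented for 
illustration):

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomSemantics {
  public static void main(String[] args) {
    // 2^20-bit vector, 4 hash functions; eps depends on these and on the
    // number of inserted keys
    BloomFilter filter = new BloomFilter(1 << 20, 4, Hash.MURMUR_HASH);
    filter.add(new Key("doc-42".getBytes()));

    // false here is definitive: the key is certainly NOT in the set
    boolean maybePresent = filter.membershipTest(new Key("doc-99".getBytes()));

    // true here is only probabilistic (correct with probability 1 - eps);
    // client code must treat it as "possibly a duplicate" and verify, e.g.
    // with a term lookup, before dropping the document
    boolean probablyPresent = filter.membershipTest(new Key("doc-42".getBytes()));

    System.out.println(maybePresent + " " + probablyPresent);
  }
}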

 BloomFilter on a field
 --

 Key: SOLR-1375
 URL: https://issues.apache.org/jira/browse/SOLR-1375
 Project: Solr
  Issue Type: New Feature
  Components: update
Affects Versions: 1.4
Reporter: Jason Rutherglen
Priority: Minor
 Fix For: 1.5

 Attachments: SOLR-1375.patch, SOLR-1375.patch, SOLR-1375.patch

   Original Estimate: 120h
  Remaining Estimate: 120h

 * A bloom filter is a read-only probabilistic set. It's useful
 for verifying that a key exists in a set, though it can return false
 positives. http://en.wikipedia.org/wiki/Bloom_filter 
 * The use case is indexing in Hadoop and checking for duplicates
 against a Solr cluster (which, when using the term dictionary or a
 query, is too slow and exceeds the time consumed for indexing).
 When a match is found, the host, segment, and term are returned.
 If the same term is found on multiple servers, multiple results
 are returned by the distributed process. (We'll need to add in
 the core name I just realized). 
 * When new segments are created, and commit is called, a new
 bloom filter is generated from a given field (default:id) by
 iterating over the term dictionary values. There's a bloom
 filter file per segment, which is managed on each Solr shard.
 When segments are merged away, their corresponding .blm files are
 also removed. In a future version we'll have a central server
 for the bloom filters so we're not abusing the thread pool of
 the Solr proxy and the networking of the Solr cluster (this will
 be done sooner than later after testing this version). I held
 off because the central server requires syncing the Solr
 servers' files (which is like reverse replication). 
 * The patch uses the BloomFilter from Hadoop 0.20. I want to jar
 up only the necessary classes so we don't have a giant Hadoop
 jar in lib.
 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/bloom/BloomFilter.html
 * Distributed code is added and seems to work, I extended
 TestDistributedSearch to test over multiple HTTP servers. I
 chose this approach rather than the manual method used by (for
 example) TermVectorComponent.testDistributed because I'm new to
 Solr's distributed search and wanted to learn how it works (the
 stages are confusing). Using this method, I didn't need to setup
 multiple tomcat servers and manually execute tests.
 * We need more of the bloom filter options passable via
 solrconfig
 * I'll add more test cases

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader

2009-08-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744993#action_12744993
 ] 

Andrzej Bialecki  commented on SOLR-1366:
-

Indeed, that complicates the matter ... It looks like using a non-file-based 
IndexReader breaks replication. This is not a regression from 1.3, but the 
functionality to specify custom IndexReaders will be available in 1.4, so it 
should be clearly stated in docs that it prevents replication from working 
properly, until we rectify this issue.

 UnsupportedOperationException may be thrown when using custom IndexReader
 -

 Key: SOLR-1366
 URL: https://issues.apache.org/jira/browse/SOLR-1366
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: searcher.patch


 If a custom IndexReaderFactory is specified in solrconfig.xml, and 
 IndexReader-s that it produces don't support IndexReader.directory() (such as 
 is the case with ParallelReader or MultiReader) then an uncaught 
 UnsupportedOperationException is thrown.
 This call is used only to retrieve the full path of the directory for 
 informational purposes, so it shouldn't lead to a crash. Instead we could 
 supply other available information about the reader (e.g. from its toString() 
 method).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader

2009-08-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12744996#action_12744996
 ] 

Andrzej Bialecki  commented on SOLR-1366:
-

.. I haven't looked into it yet, but perhaps this could be solved by extending 
the replication handler to support multiple dirs, and for those IndexReaders 
that don't support directory(), trying getSubReaders() and using their 
directory() ...
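
A rough sketch of that fallback, using getSequentialSubReaders() (the Lucene 
2.9 name for the sub-reader accessor); this is an idea sketch, not the 
eventual fix:

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

class ReaderDirectories {
  // Prefer the top-level directory(); fall back to one Directory per
  // sub-reader when the composite reader does not support the call.
  static List<Directory> directoriesOf(IndexReader reader) {
    List<Directory> dirs = new ArrayList<Directory>();
    try {
      dirs.add(reader.directory());
    } catch (UnsupportedOperationException e) {
      IndexReader[] subs = reader.getSequentialSubReaders();
      if (subs != null) {
        for (IndexReader sub : subs) {
          dirs.add(sub.directory());  // may itself throw for nested composites
        }
      }
    }
    return dirs;
  }
}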

 UnsupportedOperationException may be thrown when using custom IndexReader
 -

 Key: SOLR-1366
 URL: https://issues.apache.org/jira/browse/SOLR-1366
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Mark Miller
 Fix For: 1.4

 Attachments: searcher.patch


 If a custom IndexReaderFactory is specified in solrconfig.xml, and 
 IndexReader-s that it produces don't support IndexReader.directory() (such as 
 is the case with ParallelReader or MultiReader) then an uncaught 
 UnsupportedOperationException is thrown.
 This call is used only to retrieve the full path of the directory for 
 informational purposes, so it shouldn't lead to a crash. Instead we could 
 supply other available information about the reader (e.g. from its toString() 
 method).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1366) UnsupportedOperationException may be thrown when using custom IndexReader

2009-08-18 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1366:


Attachment: searcher.patch

Patch that catches the exception and supplies IndexReader.toString() instead.

 UnsupportedOperationException may be thrown when using custom IndexReader
 -

 Key: SOLR-1366
 URL: https://issues.apache.org/jira/browse/SOLR-1366
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: searcher.patch


 If a custom IndexReaderFactory is specified in solrconfig.xml, and 
 IndexReader-s that it produces don't support IndexReader.directory() (such as 
 is the case with ParallelReader or MultiReader) then an uncaught 
 UnsupportedOperationException is thrown.
 This call is used only to retrieve the full path of the directory for 
 informational purposes, so it shouldn't lead to a crash. Instead we could 
 supply other available information about the reader (e.g. from its toString() 
 method).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search

2009-08-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742559#action_12742559
 ] 

Andrzej Bialecki  commented on SOLR-1321:
-

bq. Since this is a new filter, we might as well use the new incrementToken 
capability and reusable stuff as well as avoiding other deprecated analysis 
calls.

Indeed, I'll fix this.

bq. Also no need to do the string round trip in the reverse method, right? See 
the ReverseStringFilter in Lucene contrib/analysis.

Roundtrip ... you mean the allocation of a new char[] buffer, or the conversion 
to String? I assume the latter - the former is needed because we add the marker 
char in front. Yeah, I can return char[] and convert to String only in the QP.

bq. Perhaps we should just patch that and add some config options to it? Then 
all Solr would need is the QP change and the FilterFactory change, no?

Hmm. After adding the marker-related stuff the code in ReverseStringFilter 
won't be so nice as it is now. I'd keep in mind the specific use case of this 
filter ...

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: wildcards-2.patch, wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search

2009-08-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12742573#action_12742573
 ] 

Andrzej Bialecki  commented on SOLR-1321:
-

bq. FWIW, it also seemed like the reverse code in ReverseStringFilter was 
faster than the patch, but I didn't compare quantitatively.

It better be - it can reverse in-place, while we have to allocate a new buffer 
because of the marker char in front. That's what I meant by messy code - we 
would need both the in-place and the out-of-place method depending on an option.
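
A minimal sketch of the out-of-place variant under discussion, against Lucene 
2.9's attribute API. Simplified: the actual patch also emits the original 
token at the same position, which is omitted here:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class MarkedReversingFilter extends TokenFilter {
  public static final char MARKER = '\u0001';
  private final TermAttribute termAtt;

  public MarkedReversingFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    char[] buf = termAtt.termBuffer();
    int len = termAtt.termLength();
    char[] out = new char[len + 1];    // the marker forces a longer buffer,
    out[0] = MARKER;                   // so in-place reversal is not enough
    for (int i = 0; i < len; i++) {
      out[i + 1] = buf[len - 1 - i];   // reverse while copying
    }
    termAtt.setTermBuffer(out, 0, out.length);  // e.g. "dna" -> "\u0001and"
    return true;
  }
}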

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
Assignee: Grant Ingersoll
 Fix For: 1.4

 Attachments: wildcards-2.patch, wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search

2009-08-03 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12738347#action_12738347
 ] 

Andrzej Bialecki  commented on SOLR-1321:
-

Exactly, that's the reason to put this logic in an isolated, well-defined 
place, with some configurable knobs. One parameter would be the max. position 
of the leading wildcard, another would be the relative cost of ? and *, or 
whether we allow wildcards at any position except the 0-th (pure suffix 
search).

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search

2009-08-03 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1321:


Attachment: wildcards-2.patch

Updated patch with more configurable knobs. See javadoc of 
ReversedWildcardsFilterFactory and unit tests.

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: wildcards-2.patch, wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search

2009-08-03 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1321:


Attachment: (was: wildcards-2.patch)

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search

2009-08-03 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1321:


Attachment: wildcards-2.patch

Previous patch mistakenly included other stuff instead of 
ReversedWildcardFilterFactory.

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: wildcards-2.patch, wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1321) Support for efficient leading wildcards search

2009-07-31 Thread Andrzej Bialecki (JIRA)
Support for efficient leading wildcards search
--

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4


This patch is an implementation of the reversed tokens strategy for efficient 
leading wildcards queries.

ReversedWildcardsTokenFilter reverses tokens and returns both the original 
token (optional) and the reversed token (with positionIncrement == 0). Reversed 
tokens are prepended with a marker character to avoid collisions between 
legitimate tokens and the reversed tokens - e.g. "DNA" would become "and", thus 
colliding with the regular term "and", but with the marker character it becomes 
"\u0001and".

This TokenFilter can be added to the analyzer chain that is used during 
indexing.

SolrQueryParser has been modified to detect the presence of such fields in the 
current schema, and treat them in a special way. First, SolrQueryParser 
examines the schema and collects a map of fields where these reversed tokens 
are indexed. If there is at least one such field, it also sets 
QueryParser.setAllowLeadingWildcards(true). When building a wildcard query (in 
getWildcardQuery) the term text may be optionally reversed to put wildcards 
further along the term text. This happens when the field uses the reversing 
filter during indexing (as detected above), AND if the wildcard characters are 
either at 0-th or 1-st position in the term. Otherwise the term text is 
processed as before, i.e. turned into a regular wildcard query.

Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1321) Support for efficient leading wildcards search

2009-07-31 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1321:


Attachment: wildcards.patch

Patch containing the new filter, example schema and unit tests.

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search

2009-07-31 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737596#action_12737596
 ] 

Andrzej Bialecki  commented on SOLR-1321:
-

If you follow the logic in getWildcardQuery, a field has to meet specific 
requirements for this reversal to occur - namely, it needs to declare in its 
indexing analysis chain that it uses ReversedWildcardFilter. This filter does a 
very special kind of reversal (prepending the marker), so it's unlikely that 
anyone would use it for any other purpose than to explicitly support leading 
wildcards. So for now I'd say that users should consciously choose between this 
method of supporting leading wildcards and the automaton wildcard query.

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1321) Support for efficient leading wildcards search

2009-07-31 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737763#action_12737763
 ] 

Andrzej Bialecki  commented on SOLR-1321:
-

I think your example of g?abcde* could be handled if we assigned different 
costs to expanding ? and *, the latter being more costly. There could also be 
a rule that prevents the reversing if a trailing costly wildcard is used.

This quickly gets more and more complicated, so ultimately we may want to put 
this logic elsewhere, in a class that knows best how to make such decisions 
(ReversedWildcardFilter ?). I'll try to modify the patch in this direction.
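
A hedged sketch of what such a cost rule could look like - the weights below 
are invented for illustration and nothing here is from the patch:

class WildcardCost {
  // Earlier wildcards are pricier, and '*' is assumed twice as costly as '?'.
  static int cost(String term) {
    int cost = 0;
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      int weight = term.length() - i;       // position 0 weighs the most
      if (c == '*') cost += 2 * weight;
      else if (c == '?') cost += weight;
    }
    return cost;
  }

  // Reverse only when that genuinely cheapens expansion; for "g?abcde*" the
  // reversed form puts '*' at position 0 and loses, so it stays as-is.
  static boolean shouldReverse(String term) {
    String reversed = new StringBuilder(term).reverse().toString();
    return cost(reversed) < cost(term);
  }
}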

 Support for efficient leading wildcards search
 --

 Key: SOLR-1321
 URL: https://issues.apache.org/jira/browse/SOLR-1321
 Project: Solr
  Issue Type: Improvement
  Components: Analysis
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Fix For: 1.4

 Attachments: wildcards.patch


 This patch is an implementation of the reversed tokens strategy for 
 efficient leading wildcards queries.
 ReversedWildcardsTokenFilter reverses tokens and returns both the original 
 token (optional) and the reversed token (with positionIncrement == 0). 
 Reversed tokens are prepended with a marker character to avoid collisions 
 between legitimate tokens and the reversed tokens - e.g. "DNA" would become 
 "and", thus colliding with the regular term "and", but with the marker 
 character it becomes "\u0001and".
 This TokenFilter can be added to the analyzer chain that is used during 
 indexing.
 SolrQueryParser has been modified to detect the presence of such fields in 
 the current schema, and treat them in a special way. First, SolrQueryParser 
 examines the schema and collects a map of fields where these reversed tokens 
 are indexed. If there is at least one such field, it also sets 
 QueryParser.setAllowLeadingWildcards(true). When building a wildcard query 
 (in getWildcardQuery) the term text may be optionally reversed to put 
 wildcards further along the term text. This happens when the field uses the 
 reversing filter during indexing (as detected above), AND if the wildcard 
 characters are either at 0-th or 1-st position in the term. Otherwise the 
 term text is processed as before, i.e. turned into a regular wildcard query.
 Unit tests are provided to test the TokenFilter and the query parsing.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2009-07-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737253#action_12737253
 ] 

Andrzej Bialecki  commented on SOLR-1301:
-

This patch is intended to work with Solr as it is now, and the idea is to use 
Hadoop to build shards (in the Solr sense) so that they can be used by the 
current Solr distributed search. I have no idea how / whether Katta/Zookeeper 
fits in this picture - if you want to pursue this integration I feel it would 
be best to do it in a separate issue.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: hadoop-0.19.1-core.jar, hadoop.patch


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1301) Solr + Hadoop

2009-07-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737306#action_12737306
 ] 

Andrzej Bialecki  commented on SOLR-1301:
-

bq. Are you going to add a way to automatically add an index to a Solr core?

This way already exists by (ab)using the forced replication to a slave from a 
temporary master.

bq. Are you planning on adding test cases for this patch?

This functionality requires a running Hadoop cluster. I'm not sure how to write 
functional tests without bringing in more Hadoop dependencies. I could add unit 
tests that cover some aspects of the patch, but they would be trivial.

bq. How does one set the maximum size of a generated shard?

One doesn't, at the moment ;) The size of each shard (in # of documents) is a 
function of the total number of records divided by the number of reduce tasks.

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: hadoop-0.19.1-core.jar, hadoop.patch


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When reduce 
 task completes, and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue, you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-1301) Solr + Hadoop

2009-07-22 Thread Andrzej Bialecki (JIRA)
Solr + Hadoop
-

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 


This patch contains a contrib module that provides distributed indexing (using 
Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:

* provide an API that is familiar to Hadoop developers, i.e. that of 
OutputFormat
* avoid unnecessary export and (de)serialization of data maintained on HDFS. 
SolrOutputFormat consumes data produced by reduce tasks directly, without 
storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, 
the indexing task is split into as many parts as there are reducers, and the 
data to be indexed is not sent over the network.

Design
--

Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which 
in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates 
an EmbeddedSolrServer, and it also instantiates an implementation of 
SolrDocumentConverter, which is responsible for turning Hadoop (key, value) 
into a SolrInputDocument. This data is then added to a batch, which is 
periodically submitted to EmbeddedSolrServer. When the reduce task completes and 
the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on 
the EmbeddedSolrServer.
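
To make the flow concrete, a minimal converter might look like the sketch
below; the SolrDocumentConverter signature shown is an assumption based on the
description above, not a confirmed API from the patch:

  import java.util.Collection;
  import java.util.Collections;
  import org.apache.hadoop.io.Text;
  import org.apache.solr.common.SolrInputDocument;

  // Hypothetical converter: turns one Hadoop (key, value) pair into one
  // SolrInputDocument; the signature is assumed from the description above.
  public class LineConverter implements SolrDocumentConverter<Text, Text> {
    public Collection<SolrInputDocument> convert(Text key, Text value) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", key.toString());
      doc.addField("text", value.toString());
      return Collections.singletonList(doc);
    }
  }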

The API provides facilities to specify an arbitrary existing solr.home 
directory, from which the conf/ and lib/ files will be taken.

This process results in the creation of as many partial Solr home directories 
as there were reduce tasks. The output shards are placed in the output 
directory on the default filesystem (e.g. HDFS). Such part-N directories 
can be used to run N shard servers. Additionally, users can specify the number 
of reduce tasks, in particular 1 reduce task, in which case the output will 
consist of a single shard.

An example application is provided that processes large CSV files and uses this 
API. It uses custom CSV processing to avoid (de)serialization overhead.
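
A rough sketch of the job wiring this implies, using the old mapred API of
Hadoop 0.19; SolrOutputFormat is the class from this patch, but how it is
configured here is assumed, not taken from the patch:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class CsvIndexerJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(CsvIndexerJob.class);
      conf.setJobName("csv-to-solr-shards");
      conf.setNumReduceTasks(4);                    // yields four part-N shards
      conf.setOutputFormat(SolrOutputFormat.class); // from this patch
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }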

This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; 
you should put it in contrib/hadoop/lib.

Note: the development of this patch was sponsored by an anonymous contributor 
and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-1301) Solr + Hadoop

2009-07-22 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-1301:


Attachment: hadoop-0.19.1-core.jar
hadoop.patch

 Solr + Hadoop
 -

 Key: SOLR-1301
 URL: https://issues.apache.org/jira/browse/SOLR-1301
 Project: Solr
  Issue Type: Improvement
Affects Versions: 1.4
Reporter: Andrzej Bialecki 
 Attachments: hadoop-0.19.1-core.jar, hadoop.patch


 This patch contains a contrib module that provides distributed indexing 
 (using Hadoop) to Solr's EmbeddedSolrServer. The idea behind this module is 
 twofold:
 * provide an API that is familiar to Hadoop developers, i.e. that of 
 OutputFormat
 * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
 SolrOutputFormat consumes data produced by reduce tasks directly, without 
 storing it in intermediate files. Furthermore, by using an 
 EmbeddedSolrServer, the indexing task is split into as many parts as there 
 are reducers, and the data to be indexed is not sent over the network.
 Design
 --
 Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
 which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
 instantiates an EmbeddedSolrServer, and it also instantiates an 
 implementation of SolrDocumentConverter, which is responsible for turning 
 Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
 batch, which is periodically submitted to EmbeddedSolrServer. When the reduce 
 task completes and the OutputFormat is closed, SolrRecordWriter calls 
 commit() and optimize() on the EmbeddedSolrServer.
 The API provides facilities to specify an arbitrary existing solr.home 
 directory, from which the conf/ and lib/ files will be taken.
 This process results in the creation of as many partial Solr home directories 
 as there were reduce tasks. The output shards are placed in the output 
 directory on the default filesystem (e.g. HDFS). Such part-N directories 
 can be used to run N shard servers. Additionally, users can specify the 
 number of reduce tasks, in particular 1 reduce task, in which case the output 
 will consist of a single shard.
 An example application is provided that processes large CSV files and uses 
 this API. It uses custom CSV processing to avoid (de)serialization overhead.
 This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
 issue; you should put it in contrib/hadoop/lib.
 Note: the development of this patch was sponsored by an anonymous contributor 
 and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1298) FunctionQuery results as pseudo-fields

2009-07-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733624#action_12733624
 ] 

Andrzej Bialecki  commented on SOLR-1298:
-

Not sure about not adding it - which fields are returned is selectable, right?
And it's not possible to obtain this information otherwise. Some time ago I
implemented this for a client - it was before SOLR-243, but I used the same
idea, i.e. a subclass of IndexReader that returns documents with added
function fields (and score).
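
For illustration, a sketch of that idea against the Lucene 2.x API of the
time; the pseudo-field name and the value computation are hypothetical:

  import java.io.IOException;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.FieldSelector;
  import org.apache.lucene.index.FilterIndexReader;
  import org.apache.lucene.index.IndexReader;

  // Wraps a reader and appends a computed pseudo-field to every document.
  public class FunctionFieldReader extends FilterIndexReader {
    public FunctionFieldReader(IndexReader in) { super(in); }

    public Document document(int n, FieldSelector selector) throws IOException {
      Document doc = in.document(n, selector);
      doc.add(new Field("myfunc", computeValue(n),   // hypothetical field name
                        Field.Store.YES, Field.Index.NO));
      return doc;
    }

    private String computeValue(int docId) {
      return String.valueOf(docId); // stand-in for the real function value
    }
  }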

 FunctionQuery results as pseudo-fields
 --

 Key: SOLR-1298
 URL: https://issues.apache.org/jira/browse/SOLR-1298
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: 1.5


 It would be helpful if the results of FunctionQueries could be added as 
 fields to a document. 
 Couple of options here:
 1. Run FunctionQuery as part of relevance score and add that piece to the 
 document
 2. Run the function (not really a query) during Document/Field retrieval

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-243) Create a hook to allow custom code to create custom IndexReaders

2009-05-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-243:
---

Attachment: SOLR-243.patch

This is useful functionality when using FilterIndexReader or ParallelReader; 
+1 for adding it to core. I updated the patch to the latest trunk - all tests 
pass.
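
For example, following the interface proposed in this issue (quoted below), a
factory that returns a ParallelReader might look like this sketch; the
auxiliary index path is hypothetical:

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.ParallelReader;

  // Sketch against the interface proposed in the issue, not the final API.
  class ParallelReaderFactory implements IndexReaderFactory {
    public IndexReader newReader(String name, String path) {
      try {
        ParallelReader pr = new ParallelReader();
        pr.add(IndexReader.open(path));          // main index
        pr.add(IndexReader.open(path + "-aux")); // hypothetical parallel index
        return pr;
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  }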

 Create a hook to allow custom code to create custom IndexReaders
 

 Key: SOLR-243
 URL: https://issues.apache.org/jira/browse/SOLR-243
 Project: Solr
  Issue Type: Improvement
  Components: search
 Environment: Solr core
Reporter: John Wang
Assignee: Hoss Man
 Fix For: 1.5

 Attachments: indexReaderFactory.patch, indexReaderFactory.patch, 
 indexReaderFactory.patch, indexReaderFactory.patch, indexReaderFactory.patch, 
 indexReaderFactory.patch, indexReaderFactory.patch, SOLR-243.patch, 
 SOLR-243.patch, SOLR-243.patch, SOLR-243.patch, SOLR-243.patch, 
 SOLR-243.patch, SOLR-243.patch


 I have a customized IndexReader and I want to write a Solr plugin to use my 
 derived IndexReader implementation. Currently IndexReader instantiation is 
 hard coded to be: 
 IndexReader.open(path)
 It would be really useful if this were done through a pluggable factory that 
 can be configured, e.g. IndexReaderFactory:
 interface IndexReaderFactory {
   IndexReader newReader(String name, String path);
 }
 The default implementation would just return IndexReader.open(path).
 Then the newSearcher and getSearcher methods in the SolrCore class can call 
 the current factory implementation to get the IndexReader instance and then 
 build the SolrIndexSearcher by passing in the reader.
 It would be really nice to add this improvement soon (this seems to be a 
 trivial addition), as our project really depends on it.
 Thanks
 -John

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1116) Add a Binary FieldType

2009-05-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711528#action_12711528
 ] 

Andrzej Bialecki  commented on SOLR-1116:
-

Indeed! Then it's not relevant here. +0 from me for regular base64.

 Add a Binary FieldType
 --

 Key: SOLR-1116
 URL: https://issues.apache.org/jira/browse/SOLR-1116
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1116.patch, SOLR-1116.patch


 Lucene supports binary data for fields, but Solr has no corresponding field 
 type. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1116) Add a Binary FieldType

2009-05-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12711081#action_12711081
 ] 

Andrzej Bialecki  commented on SOLR-1116:
-

bq. No browser accepts the image data as Base64. your front-end will have to 
read the string and send it out as a byte[].

Please see http://en.wikipedia.org/wiki/Data_URI_scheme - this is the use case 
I was referring to, and indeed you can send base64-encoded content directly to 
any modern browser.
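
A minimal sketch of that use case, assuming the stored field comes back from
Solr already base64-encoded; the image/png MIME type is an assumption:

  // Wrap a base64-encoded stored field in a data URI a browser can render.
  static String toDataUri(String base64FieldValue) {
    return "data:image/png;base64," + base64FieldValue;
  }
  // Used in a result page as: <img src="data:image/png;base64,..."/>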

 Add a Binary FieldType
 --

 Key: SOLR-1116
 URL: https://issues.apache.org/jira/browse/SOLR-1116
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1116.patch, SOLR-1116.patch


 Lucene supports binary data for fields, but Solr has no corresponding field 
 type. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1116) Add a Binary FieldType

2009-05-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710782#action_12710782
 ] 

Andrzej Bialecki  commented on SOLR-1116:
-

One scenario I have experience with is storing small images as fields, to be 
displayed in the result list. URL-safe encoding means you can directly embed 
the returned string without re-encoding it.

 Add a Binary FieldType
 --

 Key: SOLR-1116
 URL: https://issues.apache.org/jira/browse/SOLR-1116
 Project: Solr
  Issue Type: New Feature
  Components: search
Affects Versions: 1.3
Reporter: Noble Paul
Assignee: Noble Paul
 Fix For: 1.4

 Attachments: SOLR-1116.patch, SOLR-1116.patch


 Lucene supports binary data for fields, but Solr has no corresponding field 
 type. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-769) Support Document and Search Result clustering

2008-10-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638875#action_12638875
 ] 

Andrzej Bialecki  commented on SOLR-769:


FYI, Carrot2 does support a handful of different clustering algorithms (the 
ones I know of are Fuzzy Ants, KMeans and Suffix Tree, in addition to Lingo).

 Support Document and Search Result clustering
 -

 Key: SOLR-769
 URL: https://issues.apache.org/jira/browse/SOLR-769
 Project: Solr
  Issue Type: New Feature
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: clustering-libs.tar, SOLR-769.patch, SOLR-769.patch


 Clustering is a useful tool for working with documents and search results, 
 similar to the notion of dynamic faceting.  Carrot2 
 (http://project.carrot2.org/) is a nice, BSD-licensed, library for doing 
 search results clustering.  Mahout (http://lucene.apache.org/mahout) is well 
 suited for whole-corpus clustering.  
 The patch lays out a contrib module that starts off w/ an integration of a 
 SearchComponent for doing clustering and an implementation using Carrot.  In 
 search results mode, it will use the DocList as the input for the cluster.   
 While Carrot2 comes w/ a Solr input component, it is not the same as the 
 SearchComponent that I have in that the Carrot example actually submits a 
 query to Solr, whereas my SearchComponent is just chained into the Component 
 list and uses the ResponseBuilder to add in the cluster results.
 While not fully fleshed out yet, the collection based mode will take in a 
 list of ids or just use the whole collection and will produce clusters.  
 Since this is a longer, typically offline task, there will need to be some 
 type of storage mechanism (and replication??) for the clusters.  I _may_ 
 push this off to a separate JIRA issue, but I at least want to present the 
 use case as part of the design of this component/contrib.  It may even make 
 sense that we split this out, such that the building piece is something like 
 an UpdateProcessor and then the SearchComponent just acts as a lookup 
 mechanism.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12638217#action_12638217
 ] 

Andrzej Bialecki  commented on SOLR-799:


+1 on the incremental sig calculation.

Re: different types of signatures. Our experience in Nutch is that the 
signature type is rarely changed, and we assume that this setting is selected 
once per lifetime of an index, i.e. there are never any mixed cases of 
documents with incompatible signatures. If we want to be sure that they are 
comparable, we could prepend a byte or two of unique signature type id - this 
way, even if a signature value matches but was calculated using another 
implementation, the documents won't be considered duplicates, which is the way 
it should work, because different signature algorithms are incomparable.

Re: signature as byte[] - I think it's better if we return byte[] from 
Signature, and until we support binary fields we just turn this into a hex 
string.

Re: field ordering in DeduplicateUpdateProcessorFactory: I think that both 
sigFields (if defined) and any other document fields (if sigFields is 
undefined) should first be ordered in a predictable way (lexicographic?). The 
current patch uses a HashSet, which doesn't guarantee any particular ordering - 
in fact the ordering may differ when the same code runs under different JVMs, 
which introduces a random factor into the signature calculation.
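
A sketch of the deterministic ordering suggested here, using only the JDK; how
the field values are extracted from the document is schematic:

  import java.security.MessageDigest;
  import java.util.Map;
  import java.util.TreeMap;

  // Hash fields in lexicographic name order so the signature is stable
  // across JVMs, unlike HashSet iteration order.
  static byte[] calcSignature(Map<String, String> fields) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (Map.Entry<String, String> e
        : new TreeMap<String, String>(fields).entrySet()) {
      md5.update(e.getKey().getBytes("UTF-8"));
      md5.update(e.getValue().getBytes("UTF-8"));
    }
    return md5.digest();
  }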

 Add support for hash based exact/near duplicate document handling
 -

 Key: SOLR-799
 URL: https://issues.apache.org/jira/browse/SOLR-799
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Mark Miller
Priority: Minor
 Attachments: SOLR-799.patch


 Hash-based duplicate document detection is efficient and allows for blocking 
 as well as field collapsing. Let's put it into Solr. 
 http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-799) Add support for hash based exact/near duplicate document handling

2008-10-07 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12637649#action_12637649
 ] 

Andrzej Bialecki  commented on SOLR-799:


Interesting development in light of NUTCH-442 :) Some comments:

* in MD5Signature I suggest using the code from 
org.apache.hadoop.io.MD5Hash.toString() instead of BigInteger.

* TextProfileSignature should contain a remark that it's copied from Nutch, 
since AFAIK the algorithm that it implements is currently used only in Nutch.

* in Nutch the concept of a page Signature is only a part of the deduplication 
process. The other part is the algorithm that decides which copy to keep and 
which one to discard. In your patch the latest update always removes all other 
documents with the same signature. IMHO this decision should be isolated into a 
DuplicateDeletePolicy class that gets all duplicates and can decide (based on 
arbitrary criteria) which one to keep, with a default implementation that 
simply keeps the latest document - see the sketch below.
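
A sketch of that isolation; the interface name comes from the comment above,
but its exact shape is hypothetical:

  import java.util.List;
  import org.apache.solr.common.SolrInputDocument;

  // Hypothetical policy: given all documents sharing a signature, pick the
  // one to keep; the update processor would delete the rest.
  interface DuplicateDeletePolicy {
    SolrInputDocument chooseSurvivor(List<SolrInputDocument> duplicates);
  }

  // Default policy mirroring the current behavior: keep the latest document.
  class KeepLatestPolicy implements DuplicateDeletePolicy {
    public SolrInputDocument chooseSurvivor(List<SolrInputDocument> duplicates) {
      return duplicates.get(duplicates.size() - 1);
    }
  }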

 Add support for hash based exact/near duplicate document handling
 -

 Key: SOLR-799
 URL: https://issues.apache.org/jira/browse/SOLR-799
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Mark Miller
Priority: Minor
 Attachments: SOLR-799.patch


 Hash-based duplicate document detection is efficient and allows for blocking 
 as well as field collapsing. Let's put it into Solr. 
 http://wiki.apache.org/solr/Deduplication

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-398) Widen return type of FieldType.createField to Fieldable in order to maximize flexibility

2008-08-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12626099#action_12626099
 ] 

Andrzej Bialecki  commented on SOLR-398:


Great minds think alike :) I was about to submit exactly the same patch when I 
noticed this JIRA.

One thing is missing in your patch - DocumentBuilder:282 should use 
out.getFieldable() instead of out.getField(). This is required if you 
provide a custom Fieldable implementation (!instanceof o.a.l.d.Field) in 
FieldType subclasses, because o.a.l.d.Document.getField tries to cast the 
result to o.a.l.d.Field, whereas Document.getFieldable is happy with any 
subclass of Fieldable ;)
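
To illustrate the difference on the Lucene 2.x Document API (doc here is an
o.a.l.d.Document holding a custom Fieldable named "body"):

  // getFieldable() returns the Fieldable as-is; getField() casts to Field
  // and throws ClassCastException for other Fieldable implementations.
  Fieldable ok = doc.getFieldable("body");
  Field boom = doc.getField("body"); // fails if "body" is not an o.a.l.d.Field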

 Widen return type of FieldType.createField to Fieldable in order to maximize 
 flexibility
 

 Key: SOLR-398
 URL: https://issues.apache.org/jira/browse/SOLR-398
 Project: Solr
  Issue Type: Improvement
  Components: update
Affects Versions: 1.2, 1.3
Reporter: Espen Amble Kolstad
Priority: Minor
 Attachments: 1.2-FieldType-2.patch, trunk-FieldType-3.patch, 
 trunk-FieldType-4.patch


 FieldType.createField currently returns Field.
 In order to maximize flexibility for developers to extend Solr, it should 
 return Fieldable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-139) Support updateable/modifiable documents

2008-07-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12616725#action_12616725
 ] 

Andrzej Bialecki  commented on SOLR-139:


It's possible to recover unstored fields, if the purpose of such recovery is to 
make a copy of the document and update other fields. The process is 
time-consuming, because you need to traverse all postings for all terms, so it 
might be impractical for larger indexes. Furthermore, such recovered content 
may be incomplete - tokens may have been changed or skipped/added by analyzers, 
positionIncrement gaps may have been introduced, etc.

Most of this functionality is implemented in Luke's Restore & Edit function. 
Perhaps it's possible to implement a new low-level Lucene API to do it more 
efficiently.
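
A sketch of that traversal against the Lucene 2.x API of the time: for a
single document and field, it collects the term at each position, which
approximates the original token stream:

  import java.util.SortedMap;
  import java.util.TreeMap;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TermEnum;
  import org.apache.lucene.index.TermPositions;

  // IO-heavy: visits every term of the field across the whole index.
  static SortedMap<Integer, String> reconstruct(IndexReader reader, int docId,
                                                String field) throws Exception {
    SortedMap<Integer, String> tokens = new TreeMap<Integer, String>();
    TermEnum terms = reader.terms(new Term(field, ""));
    TermPositions positions = reader.termPositions();
    do {
      Term t = terms.term();
      if (t == null || !t.field().equals(field)) break;
      positions.seek(t);
      if (positions.skipTo(docId) && positions.doc() == docId) {
        for (int i = 0; i < positions.freq(); i++) {
          tokens.put(positions.nextPosition(), t.text());
        }
      }
    } while (terms.next());
    positions.close();
    terms.close();
    return tokens; // position -> term text; an approximation of the original
  }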

 Support updateable/modifiable documents
 ---

 Key: SOLR-139
 URL: https://issues.apache.org/jira/browse/SOLR-139
 Project: Solr
  Issue Type: New Feature
  Components: update
Reporter: Ryan McKinley
Assignee: Ryan McKinley
 Attachments: Eriks-ModifiableDocument.patch, 
 Eriks-ModifiableDocument.patch, Eriks-ModifiableDocument.patch, 
 Eriks-ModifiableDocument.patch, Eriks-ModifiableDocument.patch, 
 Eriks-ModifiableDocument.patch, getStoredFields.patch, getStoredFields.patch, 
 getStoredFields.patch, getStoredFields.patch, getStoredFields.patch, 
 SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, 
 SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, 
 SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, 
 SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, 
 SOLR-139-IndexDocumentCommand.patch, SOLR-139-IndexDocumentCommand.patch, 
 SOLR-139-IndexDocumentCommand.patch, SOLR-139-ModifyInputDocuments.patch, 
 SOLR-139-ModifyInputDocuments.patch, SOLR-139-ModifyInputDocuments.patch, 
 SOLR-139-ModifyInputDocuments.patch, SOLR-139-XmlUpdater.patch, 
 SOLR-269+139-ModifiableDocumentUpdateProcessor.patch


 It would be nice to be able to update some fields on a document without 
 having to insert the entire document.
 Given the way Lucene is structured, one can (for now) only modify stored 
 fields.
 While we are at it, we can support incrementing an existing value - I think 
 this only makes sense for numbers.
 for background, see:
 http://www.nabble.com/loading-many-documents-by-ID-tf3145666.html#a8722293

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-84) New Solr logo?

2008-07-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-84?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated SOLR-84:
--

Attachment: solr.svg

Some ideas for a black & white version of the logo

 New Solr logo?
 --

 Key: SOLR-84
 URL: https://issues.apache.org/jira/browse/SOLR-84
 Project: Solr
  Issue Type: Improvement
Reporter: Bertrand Delacretaz
Priority: Minor
 Attachments: logo-grid.jpg, logo-solr-d.jpg, logo-solr-e.jpg, 
 logo-solr-source-files-take2.zip, solr-84-source-files.zip, solr-f.jpg, 
 solr-logo-20061214.jpg, solr-logo-20061218.JPG, solr-logo-20070124.JPG, 
 solr-nick.gif, solr.jpg, solr.svg, sslogo-solr-flare.jpg, sslogo-solr.jpg, 
 sslogo-solr2-flare.jpg, sslogo-solr2.jpg, sslogo-solr3.jpg


 Following up on SOLR-76, our trainee Nicolas Barbay (nicolas (put at here) 
 sarraux-dessous.ch) has reworked his logo proposal to be more solar.
 This can either be the start of a logo contest, or if people like it we could 
 adopt it. The gradients can make it a bit hard to integrate, not sure if this 
 is really a problem.
 WDYT?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.