RE: Multiple QParserPlugins, Single RequestHandler
Hi Erik,

Thanks for your reply. My particular use case is this: I have an existing QParserPlugin subclass that does some tagging functionality (a kind of group-alias thing). This is currently registered with the default queryHandler. I want to add another, quite separate plugin that writes an audit of every query request that comes in.

I thought an event handler might be good for auditing (because it could ideally do more than just /select), but the wiki states this doesn't support all operations (like queries). Am I wrong about this? Maybe eventHandlers do more now?

Ideally, I'd like to keep the auditing plugin self-contained, as I think a secure auditing plugin (whether a QParser or something else) would make a good contribution module for Solr. Being able to track what has happened on a Solr instance in a non-repudiated fashion would [hopefully] be useful for others as well (e.g. if you're storing/accessing secure documents and need to know every time someone accesses something). I know there is some logging that tracks requests etc., but log files are difficult to secure in a forensically-legal way. Maybe whatever generates the log entries can be plugged into, so that secure, 'tamper-proof' audit trails can be generated?

This is somewhat tied to some sort of document-level security, since auditing isn't much use without a user to go with it - but that's a different thread... Is there a better way to track Solr activity? It would be great to have one plugin that could audit not just queries, but also [user-initiated] updates, deletes, server restarts and config changes (although these last two might need to be outside of Solr). Can eventHandler do this?

Thanks,
Peter

From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Multiple QParserPlugins, Single RequestHandler
Date: Tue, 30 Mar 2010 14:06:28 -0400

No, not quite like that, but you can nest various query parser plugins.
See http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

Or perhaps write a composite query parser plugin that runs through the chain of others as you wish. I'm curious, what's the use case?

Erik

On Mar 30, 2010, at 10:52 AM, Peter Sturge wrote:

Hi Solr Experts,

Is it possible to 'chain' multiple QParserPlugins from a single RequestHandler? e.g. when a query request comes in for the default standard requestHandler, it sends the query request to:

  <str name="defType">qpluginhandler_1</str>

then:

  <str name="defType">qpluginhandler_2</str>

and finally:

  <str name="defType">qpluginhandler_N</str>

where qpluginhandler_X is some defined QParserPlugin instance. Is this possible?

Many thanks,
Peter
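The nested-parser approach Erik mentions can be exercised from the client side with the `_query_` syntax described in the blog post above. A minimal sketch of building such a request - the parser names (qpluginhandler_1/2) and the sub-query are placeholders from the example, not real registered parsers:

```python
from urllib.parse import urlencode

# Hedged sketch: embed one query parser inside another via the _query_
# syntax, so two parsers participate in a single request.
def embed(parser, subquery):
    # Escape embedded double quotes so the sub-query survives parsing.
    return '_query_:"{!%s}%s"' % (parser, subquery.replace('"', '\\"'))

params = {
    "defType": "qpluginhandler_1",                # outer parser handles q
    "q": embed("qpluginhandler_2", "user:fred"),  # inner parser handles the sub-query
}
print(urlencode(params))
```

A composite plugin, as Erik suggests, would do this chaining server-side instead of in the query string.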
RE: Scaling indexes with high document count
Hi,

Thanks for your reply (and apologies for the original msg being sent multiple times to the list - googlemail problems). I actually meant to put 'maxBufferedDocs'. I admit I'm not that familiar with this parameter, but as I understand it, it is the number of documents that are held in RAM before flushing to disk. I've noticed ramBufferSizeMB is a similar parameter, but using memory as the threshold rather than number of docs. Is it best not to set these too high on indexers?

In my environment, all writes are done via SolrJ, where documents are placed in a SolrDocumentList and commit()ed when the list reaches 1000 (default value), or a configured commit thread interval is reached (default is 20s, whichever comes first). I suppose this is a SolrJ-side version of 'maxBufferedDocs', so maybe I don't need to set maxBufferedDocs in solrconfig? (The SolrJ 'client' is on the same machine as the index.)

For the indexer cores (essentially write-only indexes), I wasn't planning on configuring extra memory for read cache (Lucene field value cache or filter cache), as no queries would/should be received on these. Should I reconsider this? There'll be plenty of RAM available for indexers to use and still leave enough for the OS file system cache to do its thing. Do you have any suggestions as to what would be the best way to use this memory to achieve optimal indexing speed?

The main things I do now to tune for fast indexing are:
* committing lists of docs rather than each one separately
* not optimizing too often
* bumping up the mergeFactor (I use a value of 25)

Many Thanks!
Peter

Date: Thu, 11 Mar 2010 09:19:12 -0800
From: hossman_luc...@fucit.org
To: solr-user@lucene.apache.org
Subject: Re: Scaling indexes with high document count

: I wonder if anyone might have some insight/advice on index scaling for high
: document count vs size deployments...
Your general approach sounds reasonable, although the specifics of how you'll need to tune the caches, and how much hardware you'll need, will largely depend on the specifics of the data and the queries. I'm not sure what you mean by this though...

: As searching would always be performed on replicas - the indexing cores
: wouldn't be tuned with much autowarming/read cache, but have loads of
: 'maxdocs' cache. The searchers would be the other way 'round - lots of

what do you mean by 'maxdocs' cache?

-Hoss
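The SolrJ-side batching Peter describes (flush at 1000 docs or every 20 seconds, whichever comes first) can be sketched language-agnostically. This Python stand-in uses a callback in place of SolrJ's server.add()/commit(); class and parameter names are invented for illustration:

```python
import time

# Rough sketch of client-side batching: buffer documents and flush on
# whichever threshold is hit first - batch size or elapsed time.
class BatchingIndexer:
    def __init__(self, send, max_docs=1000, max_interval=20.0):
        self.send = send                # callback that ships a batch to Solr
        self.max_docs = max_docs
        self.max_interval = max_interval
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, doc):
        self.buffer.append(doc)
        if (len(self.buffer) >= self.max_docs or
                time.monotonic() - self.last_flush >= self.max_interval):
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)      # e.g. server.add(batch); server.commit()
            self.buffer = []
        self.last_flush = time.monotonic()
```

As Peter notes, this client-side buffering largely plays the same role as maxBufferedDocs/ramBufferSizeMB do inside the server.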
Scaling indexes with high document count
Hello,

I wonder if anyone might have some insight/advice on index scaling for high document count vs size deployments...

The nature of the incoming data is a steady stream of, on average, 4GB per day. Importantly, the number of documents inserted during this time is ~7 million (i.e. lots of small entries). The plan is to partition shards on a per-month basis, and hold 6 months of data. On the search side, this would mean 6 shards (as replicas), each holding ~120GB with ~210 million document entries.

It is envisioned to deploy 2 indexing cores, of which one is active at a time. When the active core gets 'full' (e.g. a month has passed), the other core kicks in for live indexing while the first completes its replication to its searchers. It's then cleared, ready for the next time period. Each time there is a 'switch', the next available replica is cleared and told to replicate from the newly active indexing core. After 6 months, the first replica is re-used, and so on...

This type of layout allows indexing to carry on pretty much uninterrupted, and makes it relatively easy to manage replicas separately from the indexers (e.g. add replicas to store, say, 9 months, backup, forward etc.). As searching would always be performed on replicas - the indexing cores wouldn't be tuned with much autowarming/read cache, but have loads of 'maxdocs' cache. The searchers would be the other way 'round - lots of filter/fieldvalue cache. Please correct me if I'm wrong about these. (btw, client searches use faceting in a big way)

The 120GB disk footprint is perfectly reasonable. Searching on potentially 1.3 billion document entries, each with up to 30-80 facets (+ potentially lots of unique values), plus date faceting and range queries, and still keeping search performance up is where I could use some advice. Is this a case of simply throwing enough tin at the problem to handle the caching/faceting/distributed searches? What advice could you give to get the best performance out of such a scenario?
Any experiences/insight etc. is greatly appreciated.

Thanks,
Peter

BTW: Many thanks to Yonik and Lucid for your excellent Mastering Solr webinar - really useful and highly informative!
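The rolling monthly-shard scheme described above can be sketched as a small function that yields the shard list a distributed query would need at any point in time. Host names and the shard-naming scheme here are invented for illustration:

```python
# Sketch of the rolling-shard bookkeeping: one shard per month, six
# months retained, replica slots re-used round-robin once the window
# is exceeded (hypothetical names, not a real Solr convention).
def active_shards(current_month, months_kept=6, host="solr-replica"):
    # current_month is a 1-based month counter since deployment start.
    start = max(1, current_month - months_kept + 1)
    return ["%s-%d:8983/solr/shard_%d" % (host, (m - 1) % months_kept + 1, m)
            for m in range(start, current_month + 1)]

# The comma-joined list is what a client would pass as the shards= param.
print(",".join(active_shards(8)))
```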
RE: Implementing hierarchical facet
Hi Andy,

It sounds like you may want to have a look at tree faceting: https://issues.apache.org/jira/browse/SOLR-792

Date: Mon, 1 Mar 2010 18:23:51 -0800
From: angelf...@yahoo.com
Subject: Implementing hierarchical facet
To: solr-user@lucene.apache.org

I read that a simple way to implement a hierarchical facet is to concatenate strings with a separator. Something like level1>level2>level3, with '>' as the separator.

A problem with this approach is that the number of facet values will greatly increase. For example, I have a facet Location with the hierarchy country>state>city. Using the above approach, every single city will lead to a separate facet value. With tens of thousands of cities in the world, the response from Solr will be huge. And then on the client side I'd have to loop through all the facet values and combine those with the same country into a single value.

Ideally Solr would be aware of the hierarchy structure and send back responses accordingly. So at level 1 Solr will send back facet values based on country (100 or so values). At level 2 the facet values will be based on the states within the selected country (a few dozen values). The next level will be cities within that state, and so on. Is it possible to implement hierarchical facets this way using Solr?
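Besides SOLR-792, a common workaround for the "huge flat facet" problem is to index the full path and drill down one level at a time with facet.prefix, so only the children of the selected node come back. A rough client-side simulation of that drill-down logic (the field name and separator are arbitrary choices, not anything Solr mandates):

```python
# Simulate facet.prefix-style hierarchical drill-down over path-encoded
# facet values like "USA/CA/San Jose".
SEP = "/"

def path_value(*levels):
    return SEP.join(levels)

def children(facet_values, selected_path):
    # Approximates what facet.prefix=<selected_path + SEP> returns:
    # only the next level below the selection, de-duplicated.
    prefix = selected_path + SEP if selected_path else ""
    depth = prefix.count(SEP)
    out = set()
    for v in facet_values:
        if v.startswith(prefix) and v.count(SEP) >= depth:
            out.add(v.split(SEP)[depth])
    return sorted(out)

values = [path_value("USA", "CA", "San Jose"),
          path_value("USA", "CA", "Fresno"),
          path_value("USA", "NY", "Albany"),
          path_value("UK", "Kent", "Dover")]
print(children(values, ""))       # top level: countries only
print(children(values, "USA"))    # states under the selected country
```

This keeps each response small (one level at a time) at the cost of one request per drill-down step.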
Dynamic Solr indexing
Hi,

I wonder if anyone could shed some insight on a dynamic indexing question...?

The basic requirement is this:

Indexing: A process writes to an index, and when it reaches a certain size (say, 1GB), a new index (core) is 'automatically' created/deployed (i.e. the process doesn't know about it) and further indexing now goes into the new core. When that one reaches its threshold size, a new index is deployed, and so on. The process that is writing to the indices doesn't actually know that it is writing to different cores.

Searching: When a search is directed at the above index, the actual search is a distributed shard search across all the shards that have been deployed. Again, the searcher process doesn't know this, but gets back the aggregated results, as if it had specified all the shards in the request URL - but as these are changing dynamically, it of course can't know what they all are at any given time.

This requirement sounds to me perhaps like a Katta thing. I've had a look at SOLR-1395, and there are questions in Lucid that sound similar (e.g. http://www.lucidimagination.com/search/document/4b3d00055413536d/solr_katta_integration#4b3d00055413536d), so I guess (hope) I'm not the only one with this requirement. I couldn't find anything in either Katta or SOLR-1395 that fits both the writing and searching requirements, but I could easily have missed it.

Is Katta/SOLR-1395 the way to go to achieve this? Would such a solution be 'production-ready'? Has anyone deployed this type of thing in a production environment?

Any insight/advice would be greatly appreciated.

Thanks!
Peter
RE: Dynamic Solr indexing
Hi Jan,

Thanks very much for your message. SolrCloud sounds very cool indeed...

So, from the Wiki, am I right in understanding that the only 'external' component is ZooKeeper, and everything else is pure Solr (i.e. replication, distrib queries et al. are all Solr http, as opposed to something like Hadoop ipc)? If so, this makes it a nice tight package, keeping external dependencies to a minimum.

Is SolrCloud 'ready for primetime' production at present? Apologies for all the questions - is SolrCloud marked for inclusion in 1.5?

Many thanks!
Peter

Subject: Re: Dynamic Solr indexing
From: jan@cominvent.com
Date: Tue, 2 Mar 2010 00:48:50 +0100
To: solr-user@lucene.apache.org

Hi,

In the current version you need to handle the cluster layout yourself, both on the indexing and search side, i.e. route documents to shards as you please, and know what shards to search. We try to address how to make this easier in http://wiki.apache.org/solr/SolrCloud - have a look at it. The idea is that there is a component that knows about the layout of the search cluster, and we can then use this to know what shards to index to and search. If we build a component which automatically routes documents to shards, your use case could be implemented as one particular routing strategy, i.e. move to the next shard when the current one is full - ideal for news-type indexes.

--
Jan Høydahl - search architect
Cominvent AS - www.cominvent.com

On 1. mars 2010, at 18.58, Peter S wrote:

Hi,

I wonder if anyone could shed some insight on a dynamic indexing question...?

The basic requirement is this:

Indexing: A process writes to an index, and when it reaches a certain size (say, 1GB), a new index (core) is 'automatically' created/deployed (i.e. the process doesn't know about it) and further indexing now goes into the new core. When that one reaches its threshold size, a new index is deployed, and so on. The process that is writing to the indices doesn't actually know that it is writing to different cores.
Searching: When a search is directed at the above index, the actual search is a distributed shard search across all the shards that have been deployed. Again, the searcher process doesn't know this, but gets back the aggregated results, as if it had specified all the shards in the request URL - but as these are changing dynamically, it of course can't know what they all are at any given time.

This requirement sounds to me perhaps like a Katta thing. I've had a look at SOLR-1395, and there are questions in Lucid that sound similar (e.g. http://www.lucidimagination.com/search/document/4b3d00055413536d/solr_katta_integration#4b3d00055413536d), so I guess (hope) I'm not the only one with this requirement. I couldn't find anything in either Katta or SOLR-1395 that fits both the writing and searching requirements, but I could easily have missed it.

Is Katta/SOLR-1395 the way to go to achieve this? Would such a solution be 'production-ready'? Has anyone deployed this type of thing in a production environment?

Any insight/advice would be greatly appreciated.

Thanks!
Peter
Aggregated facet value counts?
Hi,

I was wondering if anyone had come across this use case, and if this type of faceting is possible: The requirement is to build a query such that an aggregated facet count of common (AND'ed) field values forms the basis of each returned facet count.

For example: Let's say I have a number of documents in an index with, among others, the fields 'host' and 'user':

Doc1 host:machine_1 user:user_1
Doc2 host:machine_1 user:user_2
Doc3 host:machine_1 user:user_1
Doc3 host:machine_1 user:user_1
Doc4 host:machine_2 user:user_1
Doc5 host:machine_2 user:user_1
Doc6 host:machine_2 user:user_4
Doc7 host:machine_1 user:user_4

Is it possible to get facets back that would give the count of documents that have common host AND user values (preferably ordered - i.e. host then user for this example, so as not to create a factorial explosion)? Note that the caller wouldn't know what machine and user values exist, only the field names. I've tried using facet queries in various ways to see if they could work for this, but I believe facet queries work on a different plane than this requirement (narrowing the term count, as opposed to aggregating).

For the example above, the desired result would be:

machine_1/user_1 (3)
machine_1/user_2 (1)
machine_1/user_4 (1)
machine_2/user_1 (2)
machine_2/user_4 (1)

Has anyone had a need for this type of faceting and found a way to achieve it?

Many thanks,
Peter
RE: Aggregated facet value counts?
Hi Erik,

Thanks for your reply. That's an interesting idea, doing it at index-time, and a good idea for known field combinations. The only thing is: how to handle arbitrary field combinations? - i.e. to allow the caller to specify any combination of fields at query-time? So, yes, the data is available at index-time, but the combination isn't (short of creating fields for every possible combination).

Peter

From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 06:30:27 -0500

When faced with this type of situation where the data is entirely available at index-time, simply create an aggregated field that glues the two pieces together, and facet on that.

Erik

On Jan 29, 2010, at 6:16 AM, Peter S wrote:

[...]
RE: Aggregated facet value counts?
Well, it wouldn't be 'every' combination - more 'any' combination at query-time. The 'arbitrary' part of the requirement is because it's not practical to predict every combination a user might ask for, although generally users would tend to search for similar/the same query combinations (but perhaps with different date ranges, for example). If 'predicted aggregate fields' were calculated at index-time on, say, 10 fields (the schema in question actually has 73 fields), that's 3,628,801 new fields. A large percentage of these would likely never be used (which ones would depend on the user, environment etc.).

Perhaps a more 'typical' use case than my network-based example would be a product search web page, where you want to show the number of products that are made by a manufacturer and within a certain price range (e.g. Sony [$600-$800] (15)). To obtain the (15) facet count value, you would have to correlate the number of Sony products (say, (861)) and the products that fall into the [600 TO 800] price range (say, (1226)). The (15) would be the intersection of the Sony hits and the price range hits by 'manufacturer:Sony'. Am I right that filter queries could only do this for document hits if you know the field values ahead of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could then be derived by simply counting the numFound for each result set.

If there were subsearch support in Solr (i.e. take the output of a query and use it as input into another) that included facets [perhaps there is such support?], it might be used to achieve this effect. A custom query parser plugin could work, maybe? I suppose it would need to gather up all the separate facets and correlate them according to the input query (e.g. host and user, or manufacturer and price range). Such a mechanism would be crying out for caching, but perhaps it could leverage the existing field and query caches.
Peter

From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 07:39:44 -0500

Creating values for every possible combination is what you're asking Solr to do at query-time, and as far as I know there isn't really a way to accomplish that like you're asking. Is the need really to be arbitrary here?

Erik

On Jan 29, 2010, at 7:25 AM, Peter S wrote:

[...]
RE: Aggregated facet value counts?
Tree faceting - that sounds very interesting indeed. I'll have a look into that and see how it fits, as well as any improvements for adding facet queries, cross-field aggregation, date ranges etc. There could be some very nice use-cases for such functionality. Just wondering how this would work with distributed shards/multi-core...

Many Thanks!
Peter

From: erik.hatc...@gmail.com
To: solr-user@lucene.apache.org
Subject: Re: Aggregated facet value counts?
Date: Fri, 29 Jan 2010 12:20:07 -0500

Sounds like what you're asking for is tree faceting. A basic implementation is available in SOLR-792, but one that could also take facet.queries, or numeric or date range buckets, to tree on would be a nice improvement. Still, the underlying implementation will simply enumerate all the possible values (SOLR-792 has some short-circuiting when the top level has zero, of course). A client-side application could do this with multiple requests to Solr. Subsearch - sure, just make more requests to Solr, rearranging the parameters. I'd still say that in general, for this type of need, it'll be less arbitrary, and locking some things in during indexing will be the pragmatic way to go for most cases.

Erik

On Jan 29, 2010, at 9:28 AM, Peter S wrote:

[...]
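Erik's "multiple requests" fallback can be sketched client-side: facet on the outer field first, then issue one filtered facet request per outer value. The Solr round-trip is replaced here by a stub over in-memory tuples; a real client would hit /select with fq, facet.field and facet.query parameters instead:

```python
from collections import Counter

# Toy corpus: (manufacturer, price) pairs standing in for indexed docs.
DOCS = [("Sony", 650), ("Sony", 700), ("Canon", 650), ("Sony", 300)]

def facet(field_index, fq=None):
    # Stand-in for a Solr facet request with an optional filter query.
    hits = [d for d in DOCS if fq is None or fq(d)]
    return Counter(d[field_index] for d in hits)

# Pass 1: outer facet (manufacturer). Pass 2: one filtered request per
# outer value, counting docs in the [600 TO 800] price bucket.
tree = {}
for maker in facet(0):
    in_range = facet(1, fq=lambda d, m=maker: d[0] == m and 600 <= d[1] <= 800)
    tree[maker] = sum(in_range.values())
print(tree)
```

This costs one extra request per outer value, which is exactly the enumeration Erik notes a server-side tree facet would also have to do.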
Dedupe of document results at query-time
Hi,

I wonder if someone might be able to shed some insight into this problem: Is it possible, and/or what is the best/accepted way, to achieve deduplication of documents by field at query-time?

For example: Let's say an index contains:

Doc1 host:Host1 time:1 Sept 09 appname:activePDF
Doc2 host:Host1 time:2 Sept 09 appname:activePDF
Doc3 host:Host1 time:3 Sept 09 appname:activePDF

Can a query be constructed that would return only 1 of these documents based on appname (it doesn't really matter which one)? i.e.:

match on host:Host1
ignore time
dedupe on appname:activePDF

Is this possible? Would FunctionQuery be helpful here, maybe? Am I actually talking about field collapsing?

Many thanks,
Peter
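This is essentially what field collapsing does server-side; as a stopgap it can be approximated client-side by keeping the first document seen per value of the collapse field. A minimal sketch over the example docs:

```python
# Client-side dedupe: keep the first document per distinct value of the
# given field (order of preference is whatever sort order Solr returned).
def dedupe(docs, field):
    seen, out = set(), []
    for d in docs:
        if d[field] not in seen:
            seen.add(d[field])
            out.append(d)
    return out

docs = [
    {"id": "Doc1", "host": "Host1", "time": "1 Sept 09", "appname": "activePDF"},
    {"id": "Doc2", "host": "Host1", "time": "2 Sept 09", "appname": "activePDF"},
    {"id": "Doc3", "host": "Host1", "time": "3 Sept 09", "appname": "activePDF"},
]
print([d["id"] for d in dedupe(docs, "appname")])   # one doc per appname
```

The caveat is pagination: deduping after retrieval means fetching more rows than you ultimately display, which is why server-side collapsing is preferable when available.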
RE: Reverse sort facet query [SOLR-1672]
: now i'm totally confused: what are you suggesting this new param would
: do, what does the name mean?

Sorry, I wasn't clear - there isn't a new parameter, except the one added in the patch. What I was suggesting here is to do the work to remove the new parameter I just put in (facet.sortorder), and do it in exactly the way you mentioned - i.e. just extend facet.sort to allow a 'count desc'. By convention, is it ok to use a space in the name? - or would count.desc (and count.asc as an alias for count) be more compliant?

Peter
Non-leading wildcard search
Hello,

There are lots of questions and answers in the forum regarding varying wildcard behaviour, but I haven't been able to find any that address this particular behaviour. Perhaps someone could help?

Problem: I have a fieldType that only goes through a KeywordTokenizer at index time, to ensure it stays 'verbatim' (e.g. it doesn't get split into any tokens - whitespace or otherwise). Let's say there's some data stored in this field like this:

Something
Something Else
Something Else Altogether

When I query: Something or "Something Else" or *thing or *omething*, I get back the expected results. If, however, I query: Some* or S* or s* etc., I get no results (although this type of non-leading wildcard works fine with other fieldType schema elements that don't use KeywordTokenizer). Is this something to do with KeywordTokenizer? Is there a better way to index data (preserving case) without splitting on whitespace or stemming etc. (i.e. no WhitespaceTokenizer or similar)?

My fieldType schema looks like this (I've tried a number of other combinations as well, including using class="solr.TextField"):

<fieldType name="text_verbatim" class="solr.StrField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

<field name="appname" type="text_verbatim" indexed="true" stored="true"/>

I understand that wildcard queries don't go through analyzers, but why is it that 'tokenized' data matches on non-leading wildcard queries, whereas non-tokenized (or, more specifically, Keyword-tokenized) data doesn't? The fieldType schema requires some tokenizer class, and it appears that KeywordTokenizer is the only one that produces a single token (i.e. the whole string).
I'm sure I'm missing something that is probably reasonably obvious, but having tried myriad combinations, I thought it prudent to ask the experts in the forum. Many thanks for any insight you can provide on this.

Peter
RE: Non-leading wildcard search
Hi Yonik,

Thanks for your quick reply. No, the queries themselves aren't in quotes.

Since I sent the initial email, I have managed to get non-leading wildcard queries to work with this, but by unexpected means (for me at least :-). If I add a LowerCaseFilterFactory to the fieldType, queries like s* (or S*) work as expected. So the fieldType schema element now looks like:

<fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

I wasn't expecting this, as I would have thought this would change only the case behaviour, not the wildcard behaviour (or at least not just the non-leading wildcard behaviour). Perhaps I'm just not understanding how the terms (term in this case, as it's not tokenized) are indexed and subsequently matched.

What I've noticed is that with the LowerCaseFilterFactory in place, document queries return results with case intact, but facet queries show the results in lower-case (e.g. the document's appname is 'Something', but the facet value for appname comes back as 'something'). (I kind of expected the document's appname field to be lower-case as well.) Does this sound like correct behaviour to you? If it's correct, that's ok, I'll manage to work 'round it (maybe there's a way to map the facet field back to the document field?), but if it sounds wrong, perhaps it warrants further investigation.

Many thanks,
Peter

Date: Mon, 4 Jan 2010 17:42:30 -0500
Subject: Re: Non-leading wildcard search
From: yo...@lucidimagination.com
To: solr-user@lucene.apache.org

On Mon, Jan 4, 2010 at 5:38 PM, Peter S <pete...@hotmail.com> wrote:

: When I query: Something or "Something Else" or *thing or *omething*, I get back the expected results.
: If, however, I query: Some* or S* or s* etc., I get no results (although this type of non-leading
: wildcard works fine with other fieldType schema elements that don't use KeywordTokenizer).

Is your query string actually in quotes? Wildcards aren't currently supported in quotes. So text_verbatim:Some* should work.

-Yonik
http://www.lucidimagination.com
RE: Non-leading wildcard search
FYI: I have found the root of this behaviour. It comes from a test patch I've been working on to work 'round pre-SOLR-219 behaviour (case-insensitive wildcard searching). With the test patch switched out, everything works as expected, although case-insensitive wildcard search then reverts to pre-SOLR-219 behaviour. I believe I can work 'round this by using a copyField that holds a lower-case copy of the text for wildcarding.

Many thanks, Yonik, for your help.

Peter

From: pete...@hotmail.com
To: solr-user@lucene.apache.org
Subject: RE: Non-leading wildcard search
Date: Mon, 4 Jan 2010 23:29:04 +
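The copyField work-round described above could be sketched roughly as follows. This is a minimal, hypothetical example: the field names appname/appname_lc and the type name text_verbatim_lc are illustrative placeholders, not taken from the original schema.

```xml
<!-- Hypothetical sketch of the copyField work-round: keep the original field
     verbatim (case preserved for display and faceting), and maintain a
     lower-cased copy used only for case-insensitive wildcard matching.
     All names below are illustrative, not from the original schema. -->

<!-- Verbatim field type: whole value as one term, case preserved -->
<fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Lower-cased variant for wildcard queries -->
<fieldType name="text_verbatim_lc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="appname"    type="text_verbatim"    indexed="true" stored="true"/>
<field name="appname_lc" type="text_verbatim_lc" indexed="true" stored="false"/>

<!-- Populate the lower-cased copy automatically at index time -->
<copyField source="appname" dest="appname_lc"/>
```

Wildcard queries would then be issued against appname_lc (lower-casing the query term on the client side, since wildcard terms bypass query-time analysis in this era of Solr), while faceting and display continue to use appname with case intact.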