RE: Multiple QParserPlugins, Single RequestHandler

2010-03-30 Thread Peter S

Hi Erik,

 

Thanks for your reply.

 

My particular use case is this:

 

I have an existing QParserPlugin subclass that does some tagging functionality 
(kind of a group alias thing). This is currently registered with the default 
queryHandler.

I want to add another, quite separate plugin that writes an audit of every 
query request that comes in.
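
As a purely illustrative sketch of what such an auditing parser might look like (assuming Solr 1.4-era APIs; the class name and the audit sink are made up - the plugin just records the request, then delegates the actual parsing to the standard 'lucene' parser):

    import org.apache.solr.common.params.SolrParams;
    import org.apache.solr.common.util.NamedList;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.search.LuceneQParserPlugin;
    import org.apache.solr.search.QParser;
    import org.apache.solr.search.QParserPlugin;

    public class AuditQParserPlugin extends QParserPlugin {
      public void init(NamedList args) {}

      public QParser createParser(String qstr, SolrParams localParams,
                                  SolrParams params, SolrQueryRequest req) {
        // Record who asked what, and when, before anything is parsed.
        audit(System.currentTimeMillis() + " q=" + qstr);
        // Then hand parsing off to the standard 'lucene' parser.
        QParserPlugin delegate =
            req.getCore().getQueryPlugin(LuceneQParserPlugin.NAME);
        return delegate.createParser(qstr, localParams, params, req);
      }

      private void audit(String entry) {
        // Hypothetical sink; a real implementation would write somewhere secure.
        System.out.println("AUDIT: " + entry);
      }
    }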

 

I thought an event handler might be good for auditing (because it could ideally 
do more than just /select), but the wiki states this doesn't support all 
operations (like queries). Am I wrong about this? Maybe eventHandlers do more 
now?

 

Ideally, I'd like to keep the auditing plugin self-contained, as I think a 
secure auditing plugin (whether a QParser or something else) would make a good 
contribution module for Solr. 

Being able to track what has happened on a Solr instance in a non-repudiable 
fashion would [hopefully] be useful for others as well (e.g. if you're 
storing/accessing secure documents and need to know every time someone accesses 
something). I know there is some logging that tracks requests etc., but log 
files are difficult to secure in a forensically-legal way. Maybe whatever 
generates the log entries could be plugged into, so that secure, 'tamper-proof' 
audit trails can be generated?
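
(As a hedged sketch of one common tamper-evidence technique - chaining each 
audit record to the previous one with a hash, so that any after-the-fact edit 
breaks the chain. The class and the storage are hypothetical:)

    import java.security.MessageDigest;

    public class AuditChain {
      private String prevHash = "";  // hash of the previous record

      // Each record's hash covers the previous hash, forming a chain.
      public synchronized String append(String entry) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest((prevHash + entry).getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        prevHash = hex.toString();
        // Write "entry | prevHash" to the (hypothetical) secured store here.
        return prevHash;
      }
    }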

 

This is somewhat tied to some sort of document-level security, since auditing 
isn't much use without a user to go with it - but that's a different thread...

 

Is there a better way to track Solr activity? It would be great to have one 
plugin that could audit not just queries, but also [user-initiated] updates, 
deletes, server restarts and config changes (although these last two might need 
to be outside of Solr). Can eventHandler do this?

 

Thanks,

Peter

 


 
 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Multiple QParserPlugins, Single RequestHandler
 Date: Tue, 30 Mar 2010 14:06:28 -0400
 
 No, not quite like that, but you can nest various query parser 
 plugins. See 
 http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/
 
 Or perhaps write a composite query parser plugin that runs through the 
 chain of others as you wish.
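 
 A hedged illustration of that nesting, using local-param dereferencing 
 ('audit' and 'tagger' are hypothetical parser names):
 
     import org.apache.solr.client.solrj.SolrQuery;
 
     public class NestedParserExample {
       public static void main(String[] args) {
         SolrQuery q = new SolrQuery();
         // The outer parser wraps the request; the real query is
         // dereferenced via $qq and parsed by the inner parser.
         q.set("q", "{!audit v=$qq}");
         q.set("qq", "{!tagger}host:web*");
         System.out.println(q);
       }
     }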
 
 I'm curious, what's the use case?
 
 Erik
 
 
 On Mar 30, 2010, at 10:52 AM, Peter Sturge wrote:
 
  Hi Solr Experts,
 
  Is it possible to 'chain' multiple QParserPlugins from a single
  RequestHandler?
 
  e.g. when a query request comes in for the default standard 
  requestHandler,
  it sends the query request to:
   <str name="defType">qpluginhandler_1</str> then:
   <str name="defType">qpluginhandler_2</str> and finally:
   <str name="defType">qpluginhandler_N</str>
 
  where qpluginhandler_X is some defined QParserPlugin instance.
 
  Is this possible?
 
  Many thanks,
  Peter
 
  
_
Tell us your greatest, weirdest and funniest Hotmail stories
http://clk.atdmt.com/UKM/go/195013117/direct/01/

RE: Scaling indexes with high document count

2010-03-11 Thread Peter S

Hi,

 

Thanks for your reply (and apologies for the original message being sent 
multiple times to the list - googlemail problems).

 

I actually meant to put 'maxBufferedDocs'. I admit I'm not that familiar with 
this parameter, but as I understand it, it is the number of documents that are 
held in RAM before flushing to disk. I've noticed that ramBufferSizeMB is a 
similar parameter, but uses memory as the threshold rather than number of docs.

 

Is it best not to set these too high on indexers?

 

In my environment, all writes are done via SolrJ, where documents are placed in 
a SolrDocumentList and commit()ed when the list reaches 1000 entries (the 
default value) or when a configured commit-thread interval elapses (default 
20s), whichever comes first. I suppose this is a SolrJ-side version of 
'maxBufferedDocs', so maybe I don't need to set maxBufferedDocs in solrconfig? 
(The SolrJ 'client' is on the same machine as the index.)
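
(Roughly, the batching looks like the following - a minimal SolrJ 1.4-style 
sketch; the URL, field names and document counts are placeholders:)

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
      private static final int BATCH_SIZE = 1000;     // flush threshold (docs)
      private static final long MAX_WAIT_MS = 20000;  // flush threshold (time)

      public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> buffer = new ArrayList<SolrInputDocument>();
        long lastFlush = System.currentTimeMillis();

        for (int i = 0; i < 5000; i++) {  // stand-in for the real feed
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          buffer.add(doc);
          boolean full = buffer.size() >= BATCH_SIZE;
          boolean stale = System.currentTimeMillis() - lastFlush >= MAX_WAIT_MS;
          if (full || stale) {  // whichever comes first
            server.add(buffer);
            server.commit();
            buffer.clear();
            lastFlush = System.currentTimeMillis();
          }
        }
        if (!buffer.isEmpty()) { server.add(buffer); server.commit(); }
      }
    }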

 

For the indexer cores (essentially write-only indexes), I wasn't planning on 
configuring extra memory for read caches (Lucene value cache or filter cache), 
as no queries would/should be received on these. Should I reconsider this? 
There'll be plenty of RAM available for the indexers to use while still leaving 
enough for the OS file system cache to do its thing. Do you have any 
suggestions as to the best way to use this memory to achieve optimal indexing 
speed? 

The main things I do now to tune for fast indexing are: 

 * committing lists of docs rather than each one separately

 * not optimizing too often

 * bump up the mergeFactor (I use a value of 25)

 

 

Many Thanks!

Peter

 

 

 
 Date: Thu, 11 Mar 2010 09:19:12 -0800
 From: hossman_luc...@fucit.org
 To: solr-user@lucene.apache.org
 Subject: Re: Scaling indexes with high document count
 
 
 : I wonder if anyone might have some insight/advice on index scaling for high
 : document count vs size deployments...
 
 Your general approach sounds reasonable, although specifics of how you'll 
 need to tune the caches and how much hardware you'll need will largely 
 depend on the specifics of the data and the queries.
 
 I'm not sure what you mean by this though...
 
 
 : As searching would always be performed on replicas - the indexing cores
 : wouldn't be tuned with much autowarming/read cache, but have loads of
 : 'maxdocs' cache. The searchers would be the other way 'round - lots of
 
 what do you mean by 'maxdocs' cache ?
 
 
 
 -Hoss
 
  
_
Tell us your greatest, weirdest and funniest Hotmail stories
http://clk.atdmt.com/UKM/go/195013117/direct/01/

Scaling indexes with high document count

2010-03-10 Thread Peter S

Hello,

I wonder if anyone might have some insight/advice on index scaling for high 
document count vs size deployments...

The nature of the incoming data is a steady stream of, on average, 4GB per day. 
Importantly, the number of documents inserted during this time is ~7million 
(i.e. lots of small entries).
The plan is to partition shards on a per month basis, and hold 6 months of data.

On the search side, this would mean 6 shards (as replicas), each holding ~120GB 
with ~210 million document entries.
The plan is to deploy 2 indexing cores, of which one is active at a time. 
When the active core gets 'full' (e.g. a month has passed), the other core 
kicks in for live indexing while the first completes its replication to its 
searchers. It's then cleared, ready for the next time period. Each time there 
is a 'switch', the next available replica is cleared and told to replicate from 
the newly active indexing core. After 6 months, the first replica is re-used, 
and so on...
This type of layout allows indexing to carry on pretty much uninterrupted, and 
makes it relatively easy to manage replicas separately from the indexers (e.g. 
add replicas to store, say, 9 months, backup, forward etc.).

As searching would always be performed on replicas - the indexing cores 
wouldn't be tuned with much autowarming/read cache, but have loads of 'maxdocs' 
cache. The searchers would be the other way 'round - lots of filter/fieldvalue 
cache. Please correct me if I'm wrong about these. (btw, client searches use 
faceting in a big way)

The 120GB disk footprint is perfectly reasonable. Searching on potentially 
1.3 billion document entries, each with up to 30-80 facets (+ potentially lots 
of unique values), plus date faceting and range queries, while still keeping 
search performance up, is where I could use some advice.
Is this a case of simply throwing enough tin at the problem to handle the 
caching/faceting/distributed searches?

What advice could you give to get the best performance out of such a scenario?
Any experiences/insight etc. is greatly appreciated.

Thanks,
Peter

BTW: Many thanks to Yonik and Lucid for your excellent Mastering Solr webinar - 
really useful and highly informative!

 
  
_
Do you have a story that started on Hotmail? Tell us now
http://clk.atdmt.com/UKM/go/195013117/direct/01/

RE: Implementing hierarchical facet

2010-03-02 Thread Peter S

Hi Andy,

 

It sounds like you may want to have a look at tree faceting:

  https://issues.apache.org/jira/browse/SOLR-792
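
For comparison, the concatenated-path approach can be kept manageable with 
facet.prefix, so only the values under the currently selected level come back. 
A hedged sketch, assuming a 'location_path' field indexed with values like 
US/NY/NYC:

    import org.apache.solr.client.solrj.SolrQuery;

    public class DrillDownExample {
      public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        q.setFacet(true);
        q.addFacetField("location_path");
        // Restrict facet values to the selected country's subtree,
        // instead of returning every city in the world.
        q.set("facet.prefix", "US/");
        System.out.println(q);
      }
    }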

 


 
 Date: Mon, 1 Mar 2010 18:23:51 -0800
 From: angelf...@yahoo.com
 Subject: Implementing hierarchical facet
 To: solr-user@lucene.apache.org
 
  I read that a simple way to implement hierarchical facet is to concatenate 
  strings with a separator. Something like level1>level2>level3 with > as 
  the separator.
 
 A problem with this approach is that the number of facet values will greatly 
 increase.
 
  For example I have a facet Location with the hierarchy country>state>city. 
 Using the above approach every single city will lead to a separate facet 
 value. With tens of thousands of cities in the world the response from Solr 
 will be huge. And then on the client side I'd have to loop through all the 
 facet values and combine those with the same country into a single value.
 
 Ideally Solr would be aware of the hierarchy structure and send back 
 responses accordingly. So at level 1 Solr will send back facet values based 
 on country (100 or so values). Level 2 the facet values will be based on the 
 states within the selected country (a few dozen values). Next level will be 
 cities within that state. and so on.
 
 Is it possible to implement hierarchical facet this way using Solr?
 
 
 
 
  
_
Tell us your greatest, weirdest and funniest Hotmail stories
http://clk.atdmt.com/UKM/go/195013117/direct/01/

Dynamic Solr indexing

2010-03-01 Thread Peter S

Hi,

 

I wonder if anyone could shed some insight on a dynamic indexing question...?

 

The basic requirement is this:

 

Indexing:

A process writes to an index, and when it reaches a certain size (say, 1GB), a 
new index (core) is 'automatically' created/deployed (i.e. the process doesn't 
know about it) and further indexing then goes into the new core. When that one 
reaches its threshold size, a new index is deployed, and so on.

The process that is writing to the indices doesn't actually know that it is 
writing to different cores.
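
Purely as a sketch of the rollover step (assuming the SolrJ CoreAdmin helpers; 
the names, path and size check below are placeholders, not a working design):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class CoreRollover {
      public static void main(String[] args) throws Exception {
        // Talk to the admin endpoint, not a specific core.
        SolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");
        long indexBytes = 1100L * 1024 * 1024;   // stand-in for a real size check
        if (indexBytes > 1024L * 1024 * 1024) {  // threshold: ~1GB
          // Deploy the next core; a routing layer in front would then send
          // new documents to it, so the writer never notices the switch.
          CoreAdminRequest.createCore("core2", "/data/solr/core2", admin);
        }
      }
    }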

 

Searching:

When a search is directed at the above index, the actual search is a 
distributed shard search across all the shards that have been deployed. Again, 
the searcher process doesn't know this, but gets back the aggregated results, 
as if it had specified all the shards in the request URL - but as these are 
changing dynamically, it of course can't know what they all are at any given 
time.

 

This requirement sounds to me perhaps like a Katta thing. I've had a look at 
SOLR-1395, and there are questions on Lucid's site that sound similar (e.g. 
http://www.lucidimagination.com/search/document/4b3d00055413536d/solr_katta_integration#4b3d00055413536d),
so I guess (hope) I'm not the only one with this requirement.

 

I couldn't find anything in either Katta or SOLR-1395 that fit both the writing 
and searching requirement, but I could easily have missed it.

 

Is Katta/Solr-1395 the way to go to achieve this? Would such a solution be 
'production-ready'? Has anyone deployed this type of thing in a production 
environment?

 

Any insight/advice would be greatly appreciated.

 

Thanks!

Peter

 

 
  
_
Do you have a story that started on Hotmail? Tell us now
http://clk.atdmt.com/UKM/go/195013117/direct/01/

RE: Dynamic Solr indexing

2010-03-01 Thread Peter S

Hi Jan,

 

Thanks very much for your message. SolrCloud sounds very cool indeed...

 

So, from the Wiki, am I right in understanding that the only 'external' 
component is ZooKeeper, and everything else is pure Solr (i.e. replication, 
distributed queries et al. are all Solr HTTP, as opposed to something like 
Hadoop IPC)? If so, this makes it a nice tight package, keeping external 
dependencies to a minimum. Is SolrCloud 'ready for primetime' production at 
present?

 

Apologies for all the questions - Is SolrCloud marked for inclusion in 1.5?

 

Many thanks!

Peter

 


 
 Subject: Re: Dynamic Solr indexing
 From: jan@cominvent.com
 Date: Tue, 2 Mar 2010 00:48:50 +0100
 To: solr-user@lucene.apache.org
 
 Hi,
 
 In current version you need to handle the cluster layout yourself, both on 
 indexing and search side, i.e. route documents to shards as you please, and 
 know what shards to search.
 
 We try to address how to make this easier in 
 http://wiki.apache.org/solr/SolrCloud - have a look at it. The idea is that 
 there is a component that knows about the layout of the search cluster, and 
 we can then use this to know what shards to index to and search. If we build 
 a component which automatically routes documents to shards, your use case 
 could be implemented as one particular routing strategy, i.e. move to next 
 shard when the current is full - ideal for news type of indexes.
 
 --
 Jan Høydahl - search architect
 Cominvent AS - www.cominvent.com
 
 On 1. mars 2010, at 18.58, Peter S wrote:
 
  
  Hi,
  
  
  
  I wonder if anyone could shed some insight on a dynamic indexing 
  question...?
  
  
  
  The basic requirement is this:
  
  
  
  Indexing:
  
  A process writes to an index, and when it reaches a certain size (say, 
  1GB), a new index (core) is 'automatically' created/deployed (i.e. the 
  process doesn't know about it) and further indexing now goes into the new 
   core. When that one reaches its threshold size, a new index is deployed, 
  and so on.
  
  The process that is writing to the indices doesn't actually know that it is 
  writing to different cores.
  
  
  
  Searching:
  
  When a search is directed at the above index, the actual search is a 
   distributed shard search across all the shards that have been deployed. 
  Again, the searcher process doesn't know this, but gets back the aggregated 
  results, as if it had specified all the shards in the request URL, but as 
  these are changing dynamically, it of course can't know what they all are 
  at any given time.
  
  
  
  This requirement sounds to me perhaps like a Katta thing. I've had a look 
  at Solr-1395, and there's questions in Lucid that sound similar (e.g. 
  http://www.lucidimagination.com/search/document/4b3d00055413536d/solr_katta_integration#4b3d00055413536d),
   so I guess (hope) I'm not the only one with this requirement.
  
  
  
  I couldn't find anything in either Katta or SOLR-1395 that fit both the 
  writing and searching requirement, but I could easily have missed it.
  
  
  
  Is Katta/Solr-1395 the way to go to achieve this? Would such a solution be 
  'production-ready'? Has anyone deployed this type of thing in a production 
  environment?
  
  
  
  Any insight/advice would be greatly appreciated.
  
  
  
  Thanks!
  
  Peter
  
  
  
  
  
  _
  Do you have a story that started on Hotmail? Tell us now
  http://clk.atdmt.com/UKM/go/195013117/direct/01/
 
  
_
Tell us your greatest, weirdest and funniest Hotmail stories
http://clk.atdmt.com/UKM/go/195013117/direct/01/

Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi,

 

I was wondering if anyone had come across this use case, and if this type of 
faceting is possible:

 

The requirement is to build a query such that an aggregated facet count of 
common (AND'ed) field values forms the basis of each returned facet count.

 

For example:

Let's say I have a number of documents in an index with, among others, the 
fields 'host' and 'user':

 

Doc1  host:machine_1   user:user_1

Doc2  host:machine_1   user:user_2

Doc3  host:machine_1   user:user_1

Doc3  host:machine_1   user:user_1

 

Doc4  host:machine_2   user:user_1

Doc5  host:machine_2   user:user_1

Doc6  host:machine_2   user:user_4

 

Doc7  host:machine_1   user:user_4

 

Is it possible to get facets back that would give the count of documents that 
have common host AND user values (preferably ordered - i.e. host then user for 
this example, so as not to create a factorial explosion)? Note that the caller 
wouldn't know what machine and user values exist, only the field names.

I've tried using facet queries in various ways to see if they could work for 
this, but I believe facet queries work on a different plane than this 
requirement (narrowing the term count, as opposed to aggregating).

 

For the example above, the desired result would be:

 

machine_1/user_1 (3)

machine_1/user_2 (1)

machine_1/user_4 (1)

 

machine_2/user_1 (2)

machine_2/user_4 (1)

 

Has anyone had a need for this type of faceting and found a way to achieve it?

 

Many thanks,

Peter

 

 
  
_
We want to hear all your funny, exciting and crazy Hotmail stories. Tell us now
http://clk.atdmt.com/UKM/go/195013117/direct/01/

RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Hi Erik,

 

Thanks for your reply. That's an interesting idea doing it at index-time, and a 
good idea for known field combinations.

The only thing is

How to handle arbitrary field combinations? - i.e. to allow the caller to 
specify any combination of fields at query-time?

So, yes, the data is available at index-time, but the combination isn't (short 
of creating fields for every possible combination).

 

Peter


 
 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 06:30:27 -0500
 
 When faced with this type of situation where the data is entirely 
 available at index-time, simply create an aggregated field that glues 
 the two pieces together, and facet on that.
 
 Erik
 
 On Jan 29, 2010, at 6:16 AM, Peter S wrote:
 
 
  Hi,
 
 
 
  I was wondering if anyone had come across this use case, and if this 
  type of faceting is possible:
 
 
 
  The requirement is to build a query such that an aggregated facet 
  count of common (and'ed) field values form the basis of each 
  returned facet count.
 
 
 
  For example:
 
  Let's say I have a number of documents in an index with, among 
  others, the fields 'host' and 'user':
 
 
 
  Doc1 host:machine_1 user:user_1
 
  Doc2 host:machine_1 user:user_2
 
  Doc3 host:machine_1 user:user_1
 
  Doc3 host:machine_1 user:user_1
 
 
 
  Doc4 host:machine_2 user:user_1
 
  Doc5 host:machine_2 user:user_1
 
  Doc6 host:machine_2 user:user_4
 
 
 
  Doc7 host:machine_1 user:user_4
 
 
 
  Is it possible to get facets back that would give the count of 
  documents that have common host AND user values (preferably ordered 
  - i.e. host then user for this example, so as not to create a 
  factorial explosion)? Note that the caller wouldn't know what 
  machine and user values exist, only the field names.
 
  I've tried using facet queries in various ways to see if they could 
  work for this, but I believe facet queries work on a different plane 
  than this requirement (narrowing the term count, a.o.t. aggregating).
 
 
 
  For the example above, the desired result would be:
 
 
 
  machine_1/user_1 (3)
 
  machine_1/user_2 (1)
 
  machine_1/user_4 (1)
 
 
 
  machine_2/user_1 (2)
 
  machine_2/user_4 (1)
 
 
 
  Has anyone had a need for this type of faceting and found a way to 
  achieve it?
 
 
 
  Many thanks,
 
  Peter
 
 
 
 
  
  _
  We want to hear all your funny, exciting and crazy Hotmail stories. 
  Tell us now
  http://clk.atdmt.com/UKM/go/195013117/direct/01/
 
  
_
Tell us your greatest, weirdest and funniest Hotmail stories
http://clk.atdmt.com/UKM/go/195013117/direct/01/

RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Well, it wouldn't be 'every' combination - more of 'any' combination at 
query-time.
 
The 'arbitrary' part of the requirement is because it's not practical to 
predict every combination a user might ask for, although generally users would 
tend to search for similar/the same query combinations (but perhaps with 
different date ranges, for example).
 
If 'predicted aggregate fields' were calculated at index-time on, say, 10 
fields (the schema in question actually has 73 fields), that's 3,628,801 new 
fields. A large percentage of these would likely never be used (which ones 
would depend on the user, environment etc.).
 

Perhaps a more 'typical' use case than my network-based example would be a 
product search web page, where you want to show the number of products that are 
made by a manufacturer and fall within a certain price range (e.g. Sony 
[$600-$800] (15) ). To obtain the (15) facet count value, you would have to 
correlate the number of Sony products (say, (861)) with the products that fall 
into the [600 TO 800] price range (say, (1226) ). The (15) would be the 
intersection of the Sony hits and the price-range hits. Am I right that filter 
queries could only do this for document hits if you know the field values ahead 
of time (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could 
then be derived by simply counting the numFound for each result set.
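
(That intersection count is straightforward per known combination - a hedged 
SolrJ sketch, server URL assumed:)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class IntersectionCount {
      public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.addFilterQuery("manufacturer:Sony");
        q.addFilterQuery("price:[600 TO 800]");
        q.setRows(0);  // only the count is needed
        long count = server.query(q).getResults().getNumFound();
        System.out.println("Sony [$600-$800] (" + count + ")");
      }
    }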

 

If there were subsearch support in Solr (i.e. take the output of a query and 
use it as input into another) that included facets [perhaps there is such 
support?], it might be used to achieve this effect.


A custom query parser plugin could work, maybe? I suppose it would need to 
gather up all the separate facets and correlate them according to the input 
query (e.g. host and user, or manufacturer and price range). Such a mechanism 
would be crying out for caching, but perhaps it could leverage the existing 
field and query caches.
 

Peter

 


 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 07:39:44 -0500
 
 Creating values for every possible combination is what you're asking 
 Solr to do at query-time, and as far as I know there isn't really a 
 way to accomplish that like you're asking. Is the need really to be 
 arbitrary here?
 
 Erik
 
 On Jan 29, 2010, at 7:25 AM, Peter S wrote:
 
 
  Hi Erik,
 
 
 
  Thanks for your reply. That's an interesting idea doing it at index- 
  time, and a good idea for known field combinations.
 
  The only thing is
 
  How to handle arbitrary field combinations? - i.e. to allow the 
  caller to specify any combination of fields at query-time?
 
  So, yes, the data is available at index-time, but the combination 
  isn't (short of creating fields for every possible combination).
 
 
 
  Peter
 
 
 
  From: erik.hatc...@gmail.com
  To: solr-user@lucene.apache.org
  Subject: Re: Aggregated facet value counts?
  Date: Fri, 29 Jan 2010 06:30:27 -0500
 
  When faced with this type of situation where the data is entirely
  available at index-time, simply create an aggregated field that glues
  the two pieces together, and facet on that.
 
  Erik
 
  On Jan 29, 2010, at 6:16 AM, Peter S wrote:
 
 
  Hi,
 
 
 
  I was wondering if anyone had come across this use case, and if this
  type of faceting is possible:
 
 
 
  The requirement is to build a query such that an aggregated facet
  count of common (and'ed) field values form the basis of each
  returned facet count.
 
 
 
  For example:
 
  Let's say I have a number of documents in an index with, among
  others, the fields 'host' and 'user':
 
 
 
  Doc1 host:machine_1 user:user_1
 
  Doc2 host:machine_1 user:user_2
 
  Doc3 host:machine_1 user:user_1
 
  Doc3 host:machine_1 user:user_1
 
 
 
  Doc4 host:machine_2 user:user_1
 
  Doc5 host:machine_2 user:user_1
 
  Doc6 host:machine_2 user:user_4
 
 
 
  Doc7 host:machine_1 user:user_4
 
 
 
  Is it possible to get facets back that would give the count of
  documents that have common host AND user values (preferably ordered
  - i.e. host then user for this example, so as not to create a
  factorial explosion)? Note that the caller wouldn't know what
  machine and user values exist, only the field names.
 
  I've tried using facet queries in various ways to see if they could
  work for this, but I believe facet queries work on a different plane
  than this requirement (narrowing the term count, a.o.t. 
  aggregating).
 
 
 
  For the example above, the desired result would be:
 
 
 
  machine_1/user_1 (3)
 
  machine_1/user_2 (1)
 
  machine_1/user_4 (1)
 
 
 
  machine_2/user_1 (2)
 
  machine_2/user_4 (1)
 
 
 
  Has anyone had a need for this type of faceting and found a way to
  achieve it?
 
 
 
  Many thanks,
 
  Peter
 
 
 
 
 
  _
  We want to hear all your funny, exciting and crazy Hotmail stories.
  Tell us now
  http

RE: Aggregated facet value counts?

2010-01-29 Thread Peter S

Tree faceting - that sounds very interesting indeed. I'll have a look into that 
and see how it fits, as well as any improvements for adding facet queries, 
cross-field aggregation, date range etc. There could be some very nice 
use-cases for such functionality. Just wondering how this would work with 
distributed shards/multi-core...


Many Thanks! 

Peter

 

 
 From: erik.hatc...@gmail.com
 To: solr-user@lucene.apache.org
 Subject: Re: Aggregated facet value counts?
 Date: Fri, 29 Jan 2010 12:20:07 -0500
 
 Sounds like what you're asking for is tree faceting. A basic 
 implementation is available in SOLR-792, but one that could also take 
 facet.queries, numeric or date range buckets, to tree on would be a 
 nice improvement.
 
 Still, the underlying implementation will simply enumerate all the 
 possible values (SOLR-792 has some short-circuiting when the top-level 
 has zero, of course). A client-side application could do this with 
 multiple requests to Solr.
 
 Subsearch - sure, just make more requests to Solr, rearranging the 
 parameters.
 
 I'd still say that in general for this type of need that it'll 
 generally be less arbitrary and locking some things in during 
 indexing will be the pragmatic way to go for most cases.
 
 Erik
 
 
 
 On Jan 29, 2010, at 9:28 AM, Peter S wrote:
 
 
  Well, it wouldn't be 'every' combination - more of 'any' combination 
  at query-time.
 
  The 'arbitrary' part of the requirement is because it's not 
  practical to predict every combination a user might ask for, 
  although generally users would tend to search for similar/the same 
  query combinations (but perhaps with different date ranges, for 
  example).
 
  If 'predicted aggregate fields' were calculated at index-time on, 
  say, 10 fields (the schema in question actually as 73 fields), 
  that's 3,628,801 new fields. A large percentage of these would 
  likely never be used (which ones would depend on the user, 
  environment etc.).
 
 
  Perhaps a more 'typical' use case than my network-based example 
  would be a product search web page, where you want to show the 
  number of products that are made by a manufacturer and within a 
  certain price range (e.g. Sony [$600-$800] (15) ). To obtain the 
  (15) facet count value, you would have to correlate the number of 
  Sony products (say, (861)), and the products that fall into the [600 
  TO 800] price range (say, (1226) ). The (15) would be the 
  intersection of the Sony hits and the price range hits by 
  'manufacturer:Sony'. Am I right that filter queries could only do 
  this for document hits if you know the field values ahead of time 
   (e.g. fq=manufacturer:Sony&fq=price:[600 TO 800])? The facets could 
  then be derived by simply counting the numFound for each result set.
 
 
 
  If there were subsearch support in Solr (i.e. take the output of a 
  query and use it as input into another) that included facets 
  [perhaps there is such support?], it might be used to achieve this 
  effect.
 
 
  A custom query parser plugin could work, maybe? I suppose it would 
  need to gather up all the separate facets and correlate them 
  according to the input query (e.g. host and user, or manufacturer 
  and price range). Such a mechanism would be crying out for caching, 
  but perhaps it could leverage the existing field and query caches.
 
 
  Peter
 
 
 
 
  From: erik.hatc...@gmail.com
  To: solr-user@lucene.apache.org
  Subject: Re: Aggregated facet value counts?
  Date: Fri, 29 Jan 2010 07:39:44 -0500
 
  Creating values for every possible combination is what you're asking
  Solr to do at query-time, and as far as I know there isn't really a
  way to accomplish that like you're asking. Is the need really to be
  arbitrary here?
 
  Erik
 
  On Jan 29, 2010, at 7:25 AM, Peter S wrote:
 
 
  Hi Erik,
 
 
 
  Thanks for your reply. That's an interesting idea doing it at index-
  time, and a good idea for known field combinations.
 
  The only thing is
 
  How to handle arbitrary field combinations? - i.e. to allow the
  caller to specify any combination of fields at query-time?
 
  So, yes, the data is available at index-time, but the combination
  isn't (short of creating fields for every possible combination).
 
 
 
  Peter
 
 
 
  From: erik.hatc...@gmail.com
  To: solr-user@lucene.apache.org
  Subject: Re: Aggregated facet value counts?
  Date: Fri, 29 Jan 2010 06:30:27 -0500
 
  When faced with this type of situation where the data is entirely
  available at index-time, simply create an aggregated field that 
  glues
  the two pieces together, and facet on that.
 
  Erik
 
  On Jan 29, 2010, at 6:16 AM, Peter S wrote:
 
 
  Hi,
 
 
 
  I was wondering if anyone had come across this use case, and if 
  this
  type of faceting is possible:
 
 
 
  The requirement is to build a query such that an aggregated facet
  count of common (and'ed) field values form the basis of each
  returned facet count.
 
 
 
  For example:
 
  Let's say I have

Dedupe of document results at query-time

2010-01-23 Thread Peter S

Hi,

 

I wonder if someone might be able to shed some insight into this problem:

 

Is it possible and/or what is the best/accepted way to achieve deduplication of 
documents by field at query-time?

 

For example:

Let's say an index contains:

 

Doc1



host:Host1

time:1 Sept 09

appname:activePDF

 

Doc2



host:Host1

time:2 Sept 09

appname:activePDF

 

Doc3



host:Host1

time:3 Sept 09

appname:activePDF

 

Can a query be constructed that would return only 1 of these Documents based on 
appname (doesn't really matter which one)?

 

i.e.:

   match on host:Host1

   ignore time

   dedupe on appname:activePDF

 

Is this possible? Would FunctionQuery be helpful here, maybe? Am I actually 
talking about field collapsing?
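
(One client-side workaround, just as a hedged sketch: facet on appname to 
enumerate the distinct values, then fetch one document per value - two round 
trips, no patches. The URL is assumed; field names follow the example above:)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;

    public class DedupeByFacet {
      public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Pass 1: enumerate distinct appname values for the host.
        SolrQuery facets = new SolrQuery("host:Host1");
        facets.setRows(0);
        facets.setFacet(true);
        facets.setFacetMinCount(1);
        facets.addFacetField("appname");
        FacetField appnames = server.query(facets).getFacetField("appname");

        // Pass 2: fetch one representative document per appname value.
        for (FacetField.Count value : appnames.getValues()) {
          SolrQuery one =
              new SolrQuery("host:Host1 AND appname:\"" + value.getName() + "\"");
          one.setRows(1);  // any one will do - 'doesn't really matter which'
          System.out.println(server.query(one).getResults().get(0));
        }
      }
    }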

 

Many thanks,

Peter

 
  
_
Got a cool Hotmail story? Tell us now
http://clk.atdmt.com/UKM/go/195013117/direct/01/

RE: Reverse sort facet query [SOLR-1672]

2010-01-08 Thread Peter S

 
 now i'm totally confused: what are you suggesting this new param would 
 do, what does the name mean?
 
Sorry, I wasn't clear - there isn't a new parameter, except the one added in the 
patch. What I was suggesting here is to do the work to remove the new parameter 
I just put in (facet.sortorder), and do it in exactly the way you mentioned - 
i.e. just extend facet.sort to allow a 'count desc'. By convention, is it ok to 
use a space in the name? - or would count.desc (with count.asc as an alias for 
count) be more compliant?

 

Peter
 

  
_
We want to hear all your funny, exciting and crazy Hotmail stories. Tell us now
http://clk.atdmt.com/UKM/go/195013117/direct/01/

Non-leading wildcard search

2010-01-04 Thread Peter S

Hello,
There are lots of questions and answers in the forum regarding varying wildcard 
behaviour, but I haven't been able to find any that address this particular 
behaviour. Perhaps someone could help?
Problem:
I have a fieldType that only goes through a KeywordTokenizer at index time, to 
ensure it stays 'verbatim' (i.e. it doesn't get split into any tokens - 
whitespace or otherwise).
Let's say there's some data stored in this field like this:


Something
Something Else
Something Else Altogether


When I query: "Something" or "Something Else" or "*thing" or "*omething*", I 
get back the expected results.
If, however, I query: "Some*" or "S*" or "s*" etc., I get no results (although 
this type of non-leading wildcard works fine with other fieldType schema 
elements that don't use KeywordTokenizer).
Is this something to do with KeywordTokenizer?
Is there a better way to index data (preserving case) without splitting on 
whitespace or stemming etc. (i.e. no WhitespaceTokenizer or similar)?
My fieldType schema looks like this (I've tried a number of other combinations 
as well, including using class="solr.TextField"):
<fieldType name="text_verbatim" class="solr.StrField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
 
<field name="appname" type="text_verbatim" indexed="true" stored="true"/>

I understand that wildcard queries don't go through analyzers, but why is it 
that 'tokenized' data matches on non-leading wildcard queries, whereas 
non-tokenized (or more specifically Keyword-tokenized) data doesn't?
The fieldType schema requires some tokenizer class, and it appears that 
KeywordTokenizer is the only one that produces a single token (i.e. the whole 
string).
I'm sure I'm missing something that is probably reasonably obvious, but having 
tried myriad combinations, I thought it prudent to ask the experts in the forum.
 
Many thanks for any insight you can provide on this.
 
Peter
 

  
_
Use Hotmail to send and receive mail from your different email accounts
http://clk.atdmt.com/UKM/go/186394592/direct/01/

RE: Non-leading wildcard search

2010-01-04 Thread Peter S

Hi Yonik,

 

Thanks for your quick reply.

No, the queries themselves aren't in quotes.

 

Since I sent the initial email, I have managed to get non-leading wildcard 
queries to work with this, but by unexpected means (for me at least :-).

 

If I add a LowerCaseFilterFactory to the fieldType, queries like s* (or S*) 
work as expected.

 

So the fieldType schema element now looks like:

<fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>

 

I wasn't expecting this, as I would have thought it would change only the 
case behaviour, not the wildcard behaviour (or at least not just the 
non-leading wildcard behaviour). Perhaps I'm just not understanding how the 
term (singular in this case, as it's not tokenized) is indexed and subsequently 
matched.

 

What I've noticed is that with the LowerCaseFilterFactory in place, document 
queries return results with case intact, but facet queries show the results in 
lower-case

(e.g. the document shows appname=Something, while the facet field shows 
appname=something). (I kind of expected the document appname field to be 
lower-case as well.)

 

Does this sound like correct behaviour to you?

If it's correct, that's ok, I'll manage to work 'round it (maybe there's a way 
to map the facet field back to the document field?), but if it sounds wrong, 
perhaps it warrants further investigation.

 

Many thanks,

Peter

 


 
 Date: Mon, 4 Jan 2010 17:42:30 -0500
 Subject: Re: Non-leading wildcard search
 From: yo...@lucidimagination.com
 To: solr-user@lucene.apache.org
 
 On Mon, Jan 4, 2010 at 5:38 PM, Peter S <pete...@hotmail.com> wrote:
  When I query: "Something" or "Something Else" or "*thing" or 
  "*omething*", I get back the expected results.
  If, however, I query: "Some*" or "S*" or "s*" etc., I get no results 
  (although this type of non-leading wildcard works fine with other fieldType 
  schema elements that don't use KeywordTokenizer).
 
 Is your query string actually in quotes? Wildcards aren't currently
 supported in quotes.
 So text_verbatim:Some* should work.
 
 -Yonik
 http://www.lucidimagination.com
  
_
View your other email accounts from your Hotmail inbox. Add them now.
http://clk.atdmt.com/UKM/go/186394592/direct/01/

RE: Non-leading wildcard search

2010-01-04 Thread Peter S

FYI:

 

I have found the root of this behaviour. It has to do with a test patch I've 
been working on to work 'round the absence of SOLR-219 (case-insensitive 
wildcard searching).

With the test patch switched out, it works as expected, although the 
case-insensitive wildcard search then reverts to pre-SOLR-219 behaviour.

 

I believe I can work 'round this by using a copyField that holds the lower-case 
text for wildcarding.

 

Many thanks, Yonik for your help.

 

Peter

 


 
 From: pete...@hotmail.com
 To: solr-user@lucene.apache.org
 Subject: RE: Non-leading wildcard search
 Date: Mon, 4 Jan 2010 23:29:04 +
 
 
 Hi Yonik,
 
 
 
 Thanks for your quick reply.
 
 No, the queries themselves aren't in quotes.
 
 
 
 Since I sent the initial email, I have managed to get non-leading wildcard 
 queries to work with this, but by unexpected means (for me at least :-).
 
 
 
 If I add a LowerCaseFilterFactory to the fieldType, queries like s* (or S*) 
 work as expected.
 
 
 
 So the fieldType schema element now looks like:
 
 <fieldType name="text_verbatim" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
   </analyzer>
 </fieldType>
 
 
 
 I wasn't expecting this, as I would have thought this would change only the 
 case behaviour, not the wildcard behaviour (or at least not just the 
 non-leading wildcard behaviour). Perhaps I'm just not understanding how the 
 terms (term in this case as not tokenized) is indexed and subsequently 
 matched.
 
 
 
 What I've noticed is that with the LowerCaseFilterFactory in place, document 
 queries return results with case intact, but facet queries show the results 
 in lower-case
 
 (e.g. the document shows appname=Something, while the facet field shows 
 appname=something). (I kind of expected the document appname field to be 
 lower-case as well.)
 
 
 
 Does this sound like correct behaviour to you?
 
 If it's correct, that's ok, I'll manage to work 'round it (maybe there's a 
 way to map the facet field back to the document field?), but if it sounds 
 wrong, perhaps it warrants further investigation.
 
 
 
 Many thanks,
 
 Peter
 
 
 
 
 
  Date: Mon, 4 Jan 2010 17:42:30 -0500
  Subject: Re: Non-leading wildcard search
  From: yo...@lucidimagination.com
  To: solr-user@lucene.apache.org
  
  On Mon, Jan 4, 2010 at 5:38 PM, Peter S <pete...@hotmail.com> wrote:
   When I query: "Something" or "Something Else" or "*thing" or 
   "*omething*", I get back the expected results.
   If, however, I query: "Some*" or "S*" or "s*" etc., I get no results 
   (although this type of non-leading wildcard works fine with other 
   fieldType schema elements that don't use KeywordTokenizer).
  
  Is your query string actually in quotes? Wildcards aren't currently
  supported in quotes.
  So text_verbatim:Some* should work.
  
  -Yonik
  http://www.lucidimagination.com
 
 _
 View your other email accounts from your Hotmail inbox. Add them now.
 http://clk.atdmt.com/UKM/go/186394592/direct/01/
  
_
Add your Gmail and Yahoo! Mail email accounts into Hotmail - it's easy
http://clk.atdmt.com/UKM/go/186394592/direct/01/