Hi Roman,

I referred to something I called "server-side named filters". It matches the feature described at http://www.elasticsearch.org/blog/terms-filter-lookup/
Would be a cool addition, IMHO.

Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm

On Tue, Jul 2, 2013 at 1:25 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
> Hello @,
>
> This thread 'kicked' me into finishing some long-past task of
> sending/receiving a large boolean (bitset) filter. We have been using bitsets
> with solr before, but now I sat down and wrote it as a qparser. The use
> cases, as you have discussed, are:
>
> - necessity to send a loooong list of ids as a query (where it is not
>   possible to do it the 'normal' way)
> - or filtering ACLs
>
> It works in the following way:
>
> - external application constructs a bitset and sends it as a query to solr
>   (q or fq, depends on your needs)
> - solr unpacks the bitset (translating bits into lucene ids, if
>   necessary), and wraps this into a query which then has the easy job of
>   'filtering' wanted/unwanted items
>
> Therefore it is good only if you can search against something that is
> indexed as integer (id's often are).
>
> A simple benchmark shows acceptable performance: to send the bitset
> (randomly populated, 10M, with 4M bits set) takes 110ms (25+64+20)
>
> To decode this string (resulting byte size 1.5MB!) takes ~90ms
> (5+14+68ms)
>
> I haven't tested the latency of sending it over the network or the query
> performance, but since the query is very similar to MatchAllDocs, it is
> probably very fast (and I know that sending many MBs to Solr is fast as
> well)
>
> I know this is not exactly a 'standard' solution, and it is probably not
> something you want to see with hundreds of millions of docs, but people
> seem to be doing 'not the right thing' all the time;)
> So if you think this is something useful for the community, please let me
> know. If somebody would be willing to test it, I can file a JIRA ticket.
>
> Thanks!
>
> Roman
>
>
> The code, if no JIRA is needed, can be found here:
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/AdsQParserPlugin.java
> https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/test/org/apache/solr/search/TestBitSetQParserPlugin.java
>
> 839ms. run
> 154ms. Building random bitset indexSize=10000000 fill=0.5 -- Size=15054208,cardinality=3934477 highestBit=9999999
> 25ms. Converting bitset to byte array -- resulting array length=1250000
> 20ms. Encoding byte array into base64 -- resulting array length=1666668 ratio=1.3333344
> 62ms. Compressing byte array with GZIP -- resulting array length=1218602 ratio=0.9748816
> 20ms. Encoding gzipped byte array into base64 -- resulting string length=1624804 ratio=1.2998432
> 5ms. Decoding gzipped byte array from base64
> 14ms. Uncompressing decoded byte array
> 68ms. Converting from byte array to bitset
> 743ms. running
>
>
> On Tue, Jun 18, 2013 at 3:51 PM, Erick Erickson
> <erickerick...@gmail.com> wrote:
>
>> Not necessarily. If the auth tokens are available on some
>> other system (DB, LDAP, whatever), one could get them
>> in the PostFilter and cache them somewhere since,
>> presumably, they wouldn't be changing all that often. Or
>> use a UserCache and get notified whenever a new searcher
>> was opened and regenerate or purge the cache.
>>
>> Of course you're right if the post filter does NOT have
>> access to the source of truth for the user's privileges.
>>
>> FWIW,
>> Erick
>>
>> On Tue, Jun 18, 2013 at 8:54 AM, Otis Gospodnetic
>> <otis.gospodne...@gmail.com> wrote:
>> > Hi,
>> >
>> > The unfortunate thing about this is that you still have to *pass* that
>> > filter from the client to the server every time you want to use that
>> > filter. If that filter is big/long, passing that in all the time has
>> > some price that could be eliminated by using "server-side named
>> > filters".
>> >
>> > Otis
>> > --
>> > Solr & ElasticSearch Support
>> > http://sematext.com/
>> >
>> > On Tue, Jun 18, 2013 at 8:16 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>> >> You might consider "post filters". The idea
>> >> is to write a custom filter that gets applied
>> >> after all other filters etc. One use-case
>> >> here is exactly ACL lists, and can be quite
>> >> helpful if you're not doing *:* type queries.
>> >>
>> >> Best
>> >> Erick
>> >>
>> >> On Mon, Jun 17, 2013 at 5:12 PM, Otis Gospodnetic
>> >> <otis.gospodne...@gmail.com> wrote:
>> >>> Btw. ElasticSearch has a nice feature here. Not sure what it's
>> >>> called, but I call it "named filter".
>> >>>
>> >>> http://www.elasticsearch.org/blog/terms-filter-lookup/
>> >>>
>> >>> Maybe that's what OP was after?
>> >>>
>> >>> Otis
>> >>> --
>> >>> Solr & ElasticSearch Support
>> >>> http://sematext.com/
>> >>>
>> >>> On Mon, Jun 17, 2013 at 4:59 PM, Alexandre Rafalovitch
>> >>> <arafa...@gmail.com> wrote:
>> >>>> On Mon, Jun 17, 2013 at 12:35 PM, Igor Kustov <ivkus...@gmail.com> wrote:
>> >>>>> So I'm using query like
>> >>>>> http://127.0.0.1:8080/solr/select?q=*:*&fq={!mqparser}id:%281%202%203%29
>> >>>>
>> >>>> If the IDs are purely numeric, I wonder if the better way is to send a
>> >>>> bitset. So, bit 1 is on if ID:1 is included, bit 2000 is on if ID:2000
>> >>>> is included. Even using URL-encoding rules, you can fit at least 65
>> >>>> sequential ID flags per character and I am sure there are more
>> >>>> efficient encoding schemes for long empty sequences.
>> >>>>
>> >>>> Regards,
>> >>>> Alex.
>> >>>>
>> >>>> Personal website: http://www.outerthoughts.com/
>> >>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>> >>>> - Time is the quality of nature that keeps events from happening all
>> >>>> at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
>> >>>> book)
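
For anyone who wants to experiment with the bitset idea from this thread, here is a rough Python sketch of the client-side steps that Roman's benchmark lists (bitset to byte array, GZIP, base64) and their reverse. It is not Roman's qparser code: the function names encode_ids and decode_ids, the least-significant-bit-first layout, and the use of URL-safe base64 are all assumptions made for illustration, and whatever decodes the string on the Solr side would have to agree on the same layout.

```python
import base64
import gzip


def encode_ids(ids, index_size):
    """Pack integer ids into a bitset, gzip it, and base64-encode it.

    Hypothetical helper: the bit layout below is an assumption, not
    necessarily what any particular Solr plugin expects.
    """
    bits = bytearray((index_size + 7) // 8)
    for i in ids:
        if not 0 <= i < index_size:
            raise ValueError(f"id {i} outside 0..{index_size - 1}")
        # Assumed layout: bit (i % 8) of byte (i // 8), lowest bit first.
        bits[i >> 3] |= 1 << (i & 7)
    compressed = gzip.compress(bytes(bits))
    # URL-safe alphabet; the '=' padding still needs normal URL-encoding.
    return base64.urlsafe_b64encode(compressed).decode("ascii")


def decode_ids(encoded):
    """Reverse encode_ids: base64 -> gunzip -> sorted list of set bits."""
    bits = gzip.decompress(base64.urlsafe_b64decode(encoded))
    return [
        (byte_index << 3) + bit
        for byte_index, byte in enumerate(bits)
        for bit in range(8)
        if byte & (1 << bit)
    ]


if __name__ == "__main__":
    payload = encode_ids([1, 2, 3, 2000], index_size=10000)
    print(len(payload), decode_ids(payload))
```

Base64 carries 6 bits per character, so the uncompressed form costs roughly one character per 6 ids in the index. How much GZIP helps depends on the data: in the benchmark log quoted above, the randomly populated bitset barely shrank (ratio 0.97), while a mostly empty bitset should compress far better.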