Hi Fred,

Thanks for the pointer! 'cursorMark' is a lot more performant alright, though apparently it doesn't suit our use case.
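A cursorMark walk amounts to a loop like the sketch below. It is written in Python rather than Erlang, purely as an illustration. `fetch_page` and `make_fake_fetch` are hypothetical stand-ins for the HTTP request, not part of any Riak or Solr client. The `*` starting value, the `nextCursorMark` response field and the "same mark returned means done" stop condition are taken from the Solr pagination docs.

```python
from typing import Callable, Dict, Iterator, List


def walk_cursor(fetch_page: Callable[[str], Dict]) -> Iterator[List[dict]]:
    """Yield each page of docs until Solr stops advancing the cursor."""
    cursor = "*"  # Solr's documented starting value for cursorMark
    while True:
        body = fetch_page(cursor)  # hypothetical: one HTTP query per call
        docs = body["response"]["docs"]
        if docs:
            yield docs
        next_cursor = body["nextCursorMark"]
        if next_cursor == cursor:  # same mark handed back: nothing left
            return
        cursor = next_cursor


def make_fake_fetch(ids: List[int], rows: int) -> Callable[[str], Dict]:
    """In-memory stand-in for the search endpoint (assumption, for demo only).

    The fake cursor is just an offset encoded as a string; a real Solr
    cursor mark is an opaque token.
    """
    def fetch(cursor: str) -> Dict:
        start = 0 if cursor == "*" else int(cursor)
        docs = [{"entry_id_s": i} for i in ids[start:start + rows]]
        next_cursor = str(start + len(docs)) if docs else cursor
        return {"response": {"docs": docs}, "nextCursorMark": next_cursor}
    return fetch


pages = list(walk_cursor(make_fake_fetch(list(range(250)), rows=100)))
```

Against a real index the only piece to swap is `fetch_page`. Note that Solr also requires the sort to include the unique key field as a tie-breaker when a cursor is used.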
I've written a loop function using OTP's httpc that reads each page, gets the cursorMark and repeats, and it returns all 147 pages with consistent times in the 40-60ms bracket, which is an excellent improvement!

I would have been asking about the effort involved in making the protocol buffers client support this, but instead our GUI guys insist that they need to request a page number, as sometimes they want to start in the middle of a set of data.

So I'm almost back to square one.
Can you shed any light on the internal workings of Solr that produce the slow-down in my original question?
I'm hoping I can find a way to restructure my index data without having to change the higher-level APIs that I support.

Cheers,
//Sean.

On Mon, Sep 19, 2016 at 10:00 PM, Fred Dushin <fdus...@basho.com> wrote:

> All great questions, Sean.
>
> A few things. First off, for result sets that are that large, you are
> probably going to want to use Solr cursor marks [1], which are supported in
> the current version of Solr we ship. Riak allows queries using cursor
> marks through the HTTP interface. At present, it does not support cursors
> using the protobuf API, due to some internal limitations of the server-side
> protobuf library, but we do hope to fix that in the future.
>
> Secondly, we have found sorting with distributed queries to be far more
> performant using Solr 4.10.4. Currently released versions of Riak use Solr
> 4.7, but as you can see on github [2], Solr 4.10.4 support has been merged
> into the develop-2.2 branch, and is in the pipeline for release. I can't
> say when the next version of Riak is that will ship with this version
> because of indeterminacy around bug triage, but it should not be too long.
>
> I would start to look at using cursor marks and measure their relative
> performance in your scenario. My guess is that you should see some
> improvement there.
>
> -Fred
>
> [1] https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> [2] https://github.com/basho/yokozuna/commit/f64e19cef107d982082f5b95ed598da96fb419b0
>
> On Sep 19, 2016, at 4:48 PM, sean mcevoy <sean.mce...@gmail.com> wrote:
> >
> > Hi All,
> >
> > We have an index with ~548,000 entries, ~14,000 of which match one of our queries.
> > We read these in a paginated search and the first page (of 100 hits) returns quickly in ~70ms.
> > This response time seems to increase exponentially as we walk through the pages:
> > the 4th page takes ~200ms,
> > the 8th page takes ~1200ms
> > the 12th page takes ~2100ms
> > the 16th page takes ~6100ms
> > the 20th page takes ~24000ms
> >
> > And by the time we're searching for the 22nd page it regularly times out at the default 60 seconds.
> >
> > I have a good understanding of riak KV internals but absolutely nothing of Lucene which I think is what's most relevant here. If anyone in the know can point me towards any relevant resource or can explain what's happening I'd be much obliged :-)
> > As I would also be if anyone with experience of using Riak/Lucene can tell me:
> > - Is 500K a crazy number of entries to put into one index?
> > - Is 14K a crazy number of entries to expect to be returned?
> > - Are there any methods we can use to make the search time more constant across the full search?
> > I read one blog post on inlining but it was a bit old & not very obvious how to implement using riakc_pb_socket calls.
> >
> > And out of curiosity, do we not traverse the full range of hits for each page? I naively thought that because I'm sorting the returned values we'd have to get them all first and then sort, but the response times suggest otherwise. Does Lucene store the data sorted by each field just in case a query asks for it? Or what other magic is going on?
> >
> > For the technical details, we use the "_yz_default" schema and all the fields stored are strings:
> > - entry_id_s: unique within the DB, the aim of the query is to gather a list of these
> > - type_s: has one of 2 values
> > - sub_category_id_s: in the query described above all 14K hits will match on this, in the DB of ~500K entries there are ~43K different values for this field, with each category typically having 2-6 sub categories
> > - category_id_s: not matched in this query, in the DB of ~500K entries there are ~13K different values for this field
> > - status_s: has one of 2 values, in the query described above all hits will have the value "active"
> > - user_id_s: unique within the DB but not matched in this query
> > - first_name_s: almost unique within the DB, this query will sort by this field
> > - last_name_s: almost unique within the DB, this query will sort by this field
> >
> > This search query looks like:
> > <<"sub_category_id_s:test_1 AND status_s:active AND type_s:sub_category">>
> >
> > Our options parameter has the sort directive:
> > {sort, <<"first_name_s asc, last_name_s asc">>}
> >
> > The query was run on a 5-node cluster with n_val of 3.
> >
> > Thanks in advance for any pointers!
> > //Sean.
> >
> > _______________________________________________
> > riak-users mailing list
> > riak-users@lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
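
For reference, the page-number style of request in the quoted message maps onto Solr's `start` and `rows` parameters roughly as in the sketch below. It is Python for illustration only, and `page_params` and `sorted_hits_needed` are hypothetical helper names. The query and sort strings are the ones quoted above. The `start + rows` figure reflects the general point in the Solr pagination docs that a request at a deep offset still has to collect every sorted hit up to the end of the requested page. It is not a measured explanation of the timings reported here.

```python
from urllib.parse import urlencode

PAGE_SIZE = 100  # page size used in the thread


def page_params(page: int, rows: int = PAGE_SIZE) -> dict:
    """Build Solr query parameters for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "wt": "json",
        "q": "sub_category_id_s:test_1 AND status_s:active AND type_s:sub_category",
        "sort": "first_name_s asc, last_name_s asc",
        "start": (page - 1) * rows,  # offset of the first hit on this page
        "rows": rows,
    }


def sorted_hits_needed(page: int, rows: int = PAGE_SIZE) -> int:
    """Sorted hits that must be collected to serve the page: start + rows."""
    return (page - 1) * rows + rows


query_string = urlencode(page_params(22))
```

So page 22 is requested with `start=2100`, and serving it means ranking the first 2,200 hits rather than just the 100 that are returned.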