Re: cursorMark and shards? (6.6.2)

Walter Underwood Mon, 10 Feb 2020 23:14:12 -0800

sort="id asc"
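
For readers following the thread, the export loop described in the quoted message below (16 prefix queries from id:0* through id:f*, 1000 rows per request, cursorMark, sorted by id) might look roughly like the sketch here. This is an illustration under assumptions, not the script actually used: the URL in SOLR_SELECT and the helper names solr_fetch, export_prefix, and export_all are hypothetical. The request parameters (q, sort, rows, cursorMark) and the nextCursorMark response field are standard Solr.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical collection URL; substitute your own host and collection.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def solr_fetch(params):
    """Send one /select request and return the parsed JSON response."""
    with urlopen(SOLR_SELECT + "?" + urlencode(params)) as resp:
        return json.load(resp)


def export_prefix(prefix, fetch=solr_fetch, rows=1000):
    """Yield every doc whose id starts with `prefix`, paging with cursorMark."""
    cursor = "*"  # initial cursorMark value
    while True:
        data = fetch({
            "q": "id:" + prefix + "*",
            "sort": "id asc",  # cursorMark needs a sort that includes the uniqueKey
            "rows": rows,
            "cursorMark": cursor,
            "wt": "json",
        })
        yield from data["response"]["docs"]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor


def export_all(fetch=solr_fetch, rows=1000, workers=16):
    """Run one export per hex prefix (0-f) in parallel and collect the docs."""
    prefixes = "0123456789abcdef"
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = pool.map(lambda p: list(export_prefix(p, fetch, rows)), prefixes)
        return [doc for chunk in chunks for doc in chunk]
```

For 17 million documents you would probably write each page to disk as it arrives rather than collect everything in one list; export_all keeps it in memory only to stay short.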

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Feb 10, 2020, at 9:50 PM, Tim Casey <tca...@gmail.com> wrote:
> 
> Walter,
> 
> When you do the query, what is the sort of the results?
> 
> tim
> 
> On Mon, Feb 10, 2020 at 8:44 PM Walter Underwood <wun...@wunderwood.org>
> wrote:
> 
>> I’ll back up a bit, since it is sort of an X/Y problem.
>> 
>> I have an index with four shards and 17 million documents. I want to dump
>> all the docs in JSON, label each one with a classifier, then load them back
>> in with the labels. This is a one-time (or rare) bootstrap of the
>> classified data. This will unblock testing and relevance work while we get
>> the classifier hooked into the indexing pipeline.
>> 
>> Because I’m dumping all the fields, we can’t rely on docValues.
>> 
>> It is OK if it takes a few hours.
>> 
>> Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is
>> 16 threads searching id:0* through id:f*, fetching 1000 rows each time,
>> using cursorMark and distributed search. Median response time is 10 s. CPU
>> usage is about 1%.
>> 
>> It is all pretty grubby and it seems like there could be a better way.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 10, 2020, at 3:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>>> 
>>> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
>>> at individual replicas. Each replica should have multiple UUIDs that start with 0-f.
>>> 
>>> Unless I misunderstand and you’re just firing off, say, 16 threads at the entire
>>> collection rather than individual shards which would work too. But for individual
>>> shards I think you need to look for all possible IDs...
>>> 
>>> Erick
>>> 
>>>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>> 
>>>> 
>>>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>>> 
>>>>> Not sure if range queries work on a UUID field, ...
>>>> 
>>>> A search for id:0* took 260 ms, so it looks like they work just fine. I’ll try separate queries for 0-f.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>> 
>> 
>>
