I’ll back up a bit, since it is sort of an X/Y problem.

I have an index with four shards and 17 million documents. I want to dump all 
the docs in JSON, label each one with a classifier, then load them back in with 
the labels. This is a one-time (or rare) bootstrap of the classified data. This 
will unblock testing and relevance work while we get the classifier hooked into 
the indexing pipeline.

Because I’m dumping all the fields, we can’t rely on docValues.

It is OK if it takes a few hours.

Right now, it is running at about 1.7 million docs/hour, so roughly 10 hours for
the full index. That is with 16 threads, one per prefix from id:0* through id:f*,
each fetching 1000 rows per request using cursorMark and distributed search.
Median response time is 10 s. CPU usage is about 1%.
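
For reference, the setup described above can be sketched roughly as follows. 
This is a minimal illustration, not the actual script: the URL, the collection 
name, and the function names are assumptions, and it presumes the uniqueKey 
field is called id. It relies only on documented cursorMark behavior: start 
with cursorMark=*, sort on the uniqueKey, and stop when nextCursorMark comes 
back unchanged.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint; replace host and collection name with your own.
SOLR_SELECT = "http://localhost:8983/solr/mycollection/select"


def build_params(prefix, cursor, rows=1000):
    """Query parameters for one page of documents whose id starts with prefix."""
    return {
        "q": f"id:{prefix}*",
        "sort": "id asc",       # cursorMark requires a sort that includes the uniqueKey
        "rows": rows,
        "cursorMark": cursor,
        "wt": "json",
    }


def http_fetch(params):
    """Send one request to Solr and return the parsed JSON response."""
    url = SOLR_SELECT + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def dump_prefix(prefix, fetch=http_fetch, rows=1000):
    """Yield every document for one id prefix, paging with cursorMark."""
    cursor = "*"
    while True:
        page = fetch(build_params(prefix, cursor, rows))
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:   # unchanged cursor means no more results
            break
        cursor = next_cursor


def dump_all(fetch=http_fetch, workers=16):
    """Run one worker per hex prefix (0 through f) and yield all documents."""
    prefixes = "0123456789abcdef"
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker buffers its prefix in memory; a real dump of 17 million
        # documents would more likely write each page to a file as it arrives.
        for docs in pool.map(lambda p: list(dump_prefix(p, fetch)), prefixes):
            yield from docs
```

The fetch function is passed in as a parameter so the paging logic can be 
exercised without a running Solr instance.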

It is all pretty grubby and it seems like there could be a better way.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 10, 2020, at 3:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
> 
> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
> at individual replicas. Each replica should have multiple UUIDs that start 
> with 0-f.
> 
> Unless I misunderstand and you’re just firing off, say, 16 threads at the
> entire collection rather than individual shards which would work too. But
> for individual shards I think you need to look for all possible IDs...
> 
> Erick
> 
>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>> 
>> 
>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>> 
>>> Not sure if range queries work on a UUID field, ...
>> 
>> A search for id:0* took 260 ms, so it looks like they work just fine. I’ll 
>> try separate queries for 0-f. 
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
> 
