I’ll back up a bit, since it is sort of an X/Y problem. I have an index with four shards and 17 million documents. I want to dump all the docs in JSON, label each one with a classifier, then load them back in with the labels. This is a one-time (or rare) bootstrap of the classified data. This will unblock testing and relevance work while we get the classifier hooked into the indexing pipeline.
Because I’m dumping all the fields, we can’t rely on docValues. It is OK if it takes a few hours.

Right now, it is running about 1.7 Mdoc/hour, so roughly 10 hours. That is 16 threads searching id:0* through id:f*, fetching 1000 rows each time, using cursorMark and distributed search. Median response time is 10 s. CPU usage is about 1%.

It is all pretty grubby and it seems like there could be a better way.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 10, 2020, at 3:39 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>
> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
> at individual replicas. Each replica should have multiple UUIDs that start
> with 0-f.
>
> Unless I misunderstand and you’re just firing off, say, 16 threads at the
> entire collection rather than individual shards which would work too. But for
> individual shards I think you need to look for all possible IDs...
>
> Erick
>
>> On Feb 10, 2020, at 5:37 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>> On Feb 10, 2020, at 2:24 PM, Walter Underwood <wun...@wunderwood.org> wrote:
>>>
>>> Not sure if range queries work on a UUID field, ...
>>
>> A search for id:0* took 260 ms, so it looks like they work just fine. I’ll
>> try separate queries for 0-f.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
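
For anyone following along, here is a rough sketch of the dump as described above: one cursorMark loop per id prefix, 1000 rows per request, 16 threads. It is not the actual script. The endpoint URL, collection name, output file names, and all function names are made up for illustration; only the id field, the prefix queries, the row count, and the thread count come from this thread.

```python
import json
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical endpoint: replace host and collection name with real ones.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycollection/select"


def solr_fetch(params):
    """Send one select request to Solr and return the parsed JSON response."""
    url = SOLR_SELECT_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def export_prefix(prefix, fetch=solr_fetch, rows=1000):
    """Yield every document whose id starts with `prefix`, paging with cursorMark."""
    cursor = "*"  # initial cursor value for the first request
    while True:
        data = fetch({
            "q": "id:" + prefix + "*",
            "sort": "id asc",  # cursorMark needs a sort that includes the uniqueKey
            "rows": rows,
            "cursorMark": cursor,
            "wt": "json",
        })
        yield from data["response"]["docs"]
        next_cursor = data["nextCursorMark"]
        if next_cursor == cursor:  # cursor did not advance: nothing left to fetch
            break
        cursor = next_cursor


def dump_prefix(prefix):
    """Write one JSON-lines file for a prefix; return the number of docs written."""
    count = 0
    # Hypothetical file naming, one file per prefix so threads never share a handle.
    with open("dump-" + prefix + ".jsonl", "w", encoding="utf-8") as out:
        for doc in export_prefix(prefix):
            out.write(json.dumps(doc) + "\n")
            count += 1
    return count


def dump_all(workers=16):
    """Run one dump per hex prefix (0 through f) in parallel; return the total count."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(dump_prefix, "0123456789abcdef"))
```

Calling dump_all() would leave sixteen files, dump-0.jsonl through dump-f.jsonl, ready to be fed to the classifier. The fetch function is passed in as a parameter so the paging loop can be exercised against a stub without a running Solr.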