The CursorMark machinery has to deal with shards: what happens when more
than one document on different shards has the same sort value, what
happens when all the docs in the response packet have the same sort value,
what happens when you want to return docs by score, and the like.

For your case you can use a sort criterion that avoids all these issues and be
OK. You can think of it as a specialized CursorMark.

You should be able to just sort by
<uniqueKey> and send each query through with a range filter query,
so the first query would look something like this (assuming "id" is your
<uniqueKey>):

q=*:*&sort=id asc&start=0&rows=1000
then the rest would be
q=*:*&sort=id asc&fq={!cache=false}id:[last_id_returned_from_previous_query TO *]&start=0&rows=1000

This avoids the "deep paging" problem that CursorMark solves, and does so
more cheaply, because the <uniqueKey> guarantees that there is one and only
one doc with any given value. Note that the start parameter is always 0.

Or your second query could even be just
q=id:[last_id_returned_from_previous_query TO *]&sort=id asc&start=0&rows=1000
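
Putting that together, here's a rough (untested) sketch of the export loop
in Python using the requests library. The Solr URL, collection name, and
output file are placeholders to adjust, and it assumes the id values don't
contain characters that need escaping in a range query:

import csv
import io
import requests

# Placeholder assumptions: Solr reachable at this URL, "id" is the
# <uniqueKey>, and the whole export goes to a local export.csv file.
SOLR = "http://localhost:8983/solr/mycollection/select"
ROWS = 1000

last_id = None
with open("export.csv", "w", newline="") as out:
    writer = csv.writer(out)
    wrote_header = False
    while True:
        params = {
            "q": "*:*",
            "sort": "id asc",
            "start": 0,          # always 0; the fq does the paging
            "rows": ROWS,
            "wt": "csv",
        }
        if last_id is not None:
            # exclusive lower bound ("{" instead of "[") so the boundary doc
            # from the previous page isn't fetched a second time
            params["fq"] = "{!cache=false}id:{%s TO *]" % last_id
        resp = requests.get(SOLR, params=params)
        resp.raise_for_status()
        parsed = list(csv.reader(io.StringIO(resp.text)))
        if len(parsed) < 2:      # only the CSV header (or nothing) left, so we're done
            break
        header, docs = parsed[0], parsed[1:]
        if not wrote_header:
            writer.writerow(header)   # keep the CSV header only once
            wrote_header = True
        writer.writerows(docs)
        last_id = docs[-1][header.index("id")]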

Best,
Erick

On Mon, Jun 20, 2016 at 12:37 PM, xavi jmlucjav <jmluc...@gmail.com> wrote:
> Hi,
>
> I need to index into a new schema 800M docs that exist in an older Solr.
> As all fields are stored, I thought I was very lucky as I could:
>
> - use wt=csv
> - combined with cursorMark
>
> to easily script out something that would export/index in chunks of 1M docs
> or so. CSV output is very efficient for this sort of thing, I think.
>
> But, sadly, I found that there is no way to get the nextCursorMark after the
> first request, as the CSV writer just outputs plain CSV data for the fields,
> excluding all other info in the response!
>
> This is so unfortunate, as CSV + cursorMark seems like the perfect fit to
> reindex this huge index (it's a one-time thing).
>
> Does anyone see a way to still be able to use this? I would prefer not
> having to write some Java code just to get the nextCursorMark.
>
> So far I thought of:
> - use JSON, but then I need to postprocess the returned JSON to remove the
> response info etc. before reindexing, which is a pain.
> - send two calls for each chunk (sending the same cursorMark both times),
> one with wt=csv to get the data, another with wt=json to get the
> nextCursorMark (and ignore the data, maybe using fl=id only to avoid
> getting much data). I did some tests and it seems this should work.
>
> I guess I will go with the 2nd, but does anyone have a better idea?
> thanks
> xavier
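
For completeness, if you do go with your second option (the same cursorMark
sent twice per chunk, once with wt=csv for the data and once with
wt=json&fl=id just to read nextCursorMark), here's a rough and untested
sketch of that too, again with placeholder URL, collection, and output file:

import requests

# Placeholder assumptions: Solr at this URL, "id" is the <uniqueKey>,
# everything appended to export.csv. cursorMark needs start=0 (the default)
# and a sort that includes the uniqueKey as a tie-breaker.
SOLR = "http://localhost:8983/solr/mycollection/select"
COMMON = {"q": "*:*", "sort": "id asc", "rows": 1000000}

cursor = "*"
with open("export.csv", "w") as out:
    while True:
        # call 1: the actual chunk, written straight to disk as CSV
        data = requests.get(SOLR, params={**COMMON, "wt": "csv", "cursorMark": cursor})
        data.raise_for_status()
        lines = data.text.splitlines(True)
        out.writelines(lines if cursor == "*" else lines[1:])   # keep the CSV header only once
        # call 2: same cursorMark, fl=id only, just to learn the next mark
        meta = requests.get(SOLR, params={**COMMON, "wt": "json", "fl": "id", "cursorMark": cursor})
        meta.raise_for_status()
        next_cursor = meta.json()["nextCursorMark"]
        if next_cursor == cursor:   # the mark stops changing once the result set is exhausted
            break
        cursor = next_cursor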
