On 4/27/2017 9:15 PM, Vijay Kokatnur wrote:
> Hey Shawn, Unfortunately, we can't upgrade the existing cluster. That
> was my first approach as well. Yes, SolrEntityProcessor is used so it
> results in deep paging after certain rows. I have observed that
> instead of importing for a larger period, if data is imported only for
> 4 hours at a time, import process is much faster. Since we are
> importing for several months it would be nice if dataimport can be
> scripted, in bash or python. But I can't find any documentation on
> it. Any pointers? 

Hopefully this won't be too confusing:

If you have a field in the original index that you can do a range query
on, you could use that range query to split the import into pieces, so
each import has a smaller numFound value.
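
For example, if the field is a date (say it's called "timestamp" -- use
whatever your field is actually named), a single 4-hour window might
look like timestamp:[2017-01-01T00:00:00Z TO 2017-01-01T04:00:00Z} and
you'd just slide that window forward for each import.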

I'd probably put something like this in the SolrEntityProcessor:

query="${dih.request.solrquery}"

And then on the full-import command URL, I'd add these URL parameters,
varying the X and Y for each import:

&clean=false&solrquery=field:[X TO Y}

You'd want to be sure that either you URL-encode the parameter value
(especially the brackets and spaces), or that whatever you're using to
execute the URL does the encoding for you automatically -- which is
something a full browser would do.  For the first import, you'd want to
leave the clean parameter off or set it to true.
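
Since you mentioned scripting it, here's a rough idea of what building
that URL could look like in Python 3 (untested, standard library only;
the host, core name, and field name are made up, so adjust them for
your setup):

from urllib.parse import urlencode

base = "http://localhost:8983/solr/mycore/dataimport"
params = {
    "command": "full-import",
    "clean": "false",               # leave this off (or set true) for the first import
    "solrquery": "field:[X TO Y}",  # fill in the real range endpoints
}
# urlencode escapes the brackets and spaces for you
url = base + "?" + urlencode(params)
print(url)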

Doing it this way would require more imports, but the whole process
should go faster.  Unless you want to write something that can detect
when each import is finished, it would be relatively hard to fully
automate the process.
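
If you do want to automate it, one way to detect completion is to poll
the dataimport handler's status command between imports.  A very rough,
untested sketch in Python 3 (the host, core, field name, and the range
windows are all placeholders):

import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8983/solr/mycore/dataimport"

def wait_until_idle(poll_seconds=10):
    # Poll command=status until the handler reports it is no longer busy.
    while True:
        with urlopen(BASE + "?command=status&wt=json") as resp:
            if json.load(resp).get("status") != "busy":
                return
        time.sleep(poll_seconds)

windows = [("X1", "Y1"), ("X2", "Y2")]   # your real range endpoints go here

for i, (lo, hi) in enumerate(windows):
    params = {
        "command": "full-import",
        "clean": "true" if i == 0 else "false",  # clean only on the first pass
        "solrquery": "field:[%s TO %s}" % (lo, hi),
    }
    # full-import returns right away; the actual import runs in the background
    with urlopen(BASE + "?" + urlencode(params)):
        pass
    wait_until_idle()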

Thanks,
Shawn
