Wait, query again how? You've got to have something that keeps you
from getting the same 100 docs back, so you have to be sorting somehow.
Or you have a high-water mark. Or something. Waiting 5 seconds for any
commit also doesn't really make sense to me. I mean, how do you know:

1> that you're going to get a commit at all (did you explicitly send
one from the client?), and
2> that all autowarming will be complete by the time the next query hits?

Let's see the query you fire. There has to be some kind of marker that
you're using to know when you've gotten through the entire set.

And I would use much larger batches; I usually update in batches of
1,000 (except if these are very large docs, of course). I suspect
you're spending a lot more time sleeping than you need to; in fact, I
wouldn't sleep at all. This is one (rare) case where I might consider
committing from the client. If you specify the waitSearcher param
(server.commit(true, true)), the call doesn't return until a new
searcher is completely opened, so your previous updates will be
reflected in your next search.
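
Something like this (untested, and "client"/"docs" are just placeholders
for whatever you already call your CloudSolrClient and your batch):

    // "client" is the existing CloudSolrClient, "docs" is the batch just built
    client.add(docs);
    // waitFlush=true, waitSearcher=true -- blocks until the new searcher is
    // fully opened, so the very next query will see these updates
    client.commit(true, true);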

Actually, what I'd really do is
1> turn off all autocommits
2> go ahead and query/change/update, but drive the query with
cursorMark (rough sketch below)
3> do NOT commit along the way
4> issue a single commit when you're all done.

I bet you'd get through your update a lot faster that way.
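
Roughly like this (untested sketch against SolrJ 5.x; the collection
name, field names, query, and the fixValue() cleanup are placeholders
for whatever you're actually doing):

    import java.util.*;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    CloudSolrClient client = new CloudSolrClient("zkhost:2181");   // placeholder zk address
    client.setDefaultCollection("mycollection");                   // placeholder collection

    SolrQuery q = new SolrQuery("badfield:*accidental*");          // placeholder query
    q.setFields("id", "badfield");
    q.setRows(1000);
    q.setSort(SolrQuery.SortClause.asc("id"));  // cursorMark requires a sort on the uniqueKey

    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
      QueryResponse rsp = client.query(q);

      List<SolrInputDocument> batch = new ArrayList<>();
      for (SolrDocument d : rsp.getResults()) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.setField("id", d.getFieldValue("id"));
        Map<String, Object> atomic = new HashMap<>();
        // fixValue() stands in for whatever string cleanup you need
        atomic.put("set", fixValue((String) d.getFieldValue("badfield")));
        doc.setField("badfield", atomic);                          // atomic "set" update
        batch.add(doc);
      }
      if (!batch.isEmpty()) {
        client.add(batch);                                         // no commit here
      }

      String next = rsp.getNextCursorMark();
      done = cursorMark.equals(next);   // cursor stopped advancing -> we're through the set
      cursorMark = next;
    }
    client.commit(true, true);          // one commit, with waitSearcher, at the very end

Because nothing is committed inside the loop, the result set the cursor
is walking doesn't change out from under you, which is exactly the
problem start/rows plus autocommits gives you.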

Best,
Erick

On Fri, Sep 25, 2015 at 5:07 PM, Ravi Solr <ravis...@gmail.com> wrote:
> Thanks for responding, Erick. I set the "start" to zero and "rows" always to
> 100. I create a CloudSolrClient instance and use it to both query and
> index. But I do sleep for 5 secs just to allow for any autocommits.
>
> So query --> client.add(100 docs) --> wait --> query again
>
> But the weird thing I noticed was that after 8 or 9 batches, i.e. 800/900
> docs, the "query again" returns zero docs, causing my while loop to
> exit...so I was trying to see if I was doing the right thing or if there is
> an alternate way to do heavy indexing.
>
> Thanks
>
> Ravi Kiran Bhaskar
>
>
>
> On Friday, September 25, 2015, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> How are you querying Solr? You say you query for 100 docs,
>> update them, then get the next set. What are you using for a marker?
>> If you're using the start parameter, and somehow a commit is
>> creeping in, things might be weird, especially if you're using any
>> of the internal Lucene doc IDs. If you're absolutely sure no commits
>> are taking place, even that should be OK.
>>
>> The "deep paging" stuff could be helpful here, see:
>>
>> https://lucidworks.com/blog/coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets/
>>
>> Best,
>> Erick
>>
>> On Fri, Sep 25, 2015 at 3:13 PM, Ravi Solr <ravis...@gmail.com> wrote:
>> > No problem Walter, it's all fun. Was just wondering if there was some
>> > other good way that I did not know of, that's all 😀
>> >
>> > Thanks
>> >
>> > Ravi Kiran Bhaskar
>> >
>> > On Friday, September 25, 2015, Walter Underwood <wun...@wunderwood.org>
>> > wrote:
>> >
>> >> Sorry, I did not mean to be rude. The original question did not say that
>> >> you don’t have the docs outside of Solr. Some people jump to the advanced
>> >> features and miss the simple ones.
>> >>
>> >> It might be faster to fetch all the docs from Solr and save them in files.
>> >> Then modify them. Then reload all of them. No guarantee, but it is worth a
>> >> try.
>> >>
>> >> Good luck.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wun...@wunderwood.org
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>
>> >> > On Sep 25, 2015, at 2:59 PM, Ravi Solr <ravis...@gmail.com> wrote:
>> >> >
>> >> > Walter, not in a mood for banter right now.... It's 6:00 pm on a Friday
>> >> > and I am stuck here trying to figure out reindexing issues :-)
>> >> > I don't have the source of the docs, so I have to query Solr, modify, and
>> >> > put it back, and that is turning out to be quite a task in 5.3.0. I did
>> >> > reindex several times with 4.7.2 in a master/slave env without any issue.
>> >> > Since then we have moved to cloud and it has been a pain all day.
>> >> >
>> >> > Thanks
>> >> >
>> >> > Ravi Kiran Bhaskar
>> >> >
>> >> > On Fri, Sep 25, 2015 at 5:25 PM, Walter Underwood <wun...@wunderwood.org>
>> >> > wrote:
>> >> >
>> >> >> Sure.
>> >> >>
>> >> >> 1. Delete all the docs (no commit).
>> >> >> 2. Add all the docs (no commit).
>> >> >> 3. Commit.
>> >> >>
>> >> >> wunder
>> >> >> Walter Underwood
>> >> >> wun...@wunderwood.org
>> >> >> http://observer.wunderwood.org/  (my blog)
>> >> >>
>> >> >>
>> >> >>> On Sep 25, 2015, at 2:17 PM, Ravi Solr <ravis...@gmail.com> wrote:
>> >> >>>
>> >> >>> I have been trying to re-index the docs (about 1.5 million) because one
>> >> >>> of the fields needed part of its string value removed (accidentally
>> >> >>> introduced). I was issuing a query for 100 docs, getting 4 fields, and
>> >> >>> updating the docs (atomic update with "set") via the CloudSolrClient in
>> >> >>> batches. However, from time to time the query returns 0 results, which
>> >> >>> exits the re-indexing program.
>> >> >>>
>> >> >>> I can't understand why the cloud returns 0 results when there are 1.4x
>> >> >>> million docs which have the "accidental" string in them.
>> >> >>>
>> >> >>> Is there another way to do massive bulk updates?
>> >> >>>
>> >> >>> Thanks
>> >> >>>
>> >> >>> Ravi Kiran Bhaskar
>> >> >>
>> >> >>
>> >>
>> >>
>>
