Hello,

A couple of follow-up questions:
* When the optimize command is run, it looks like it creates one big
  segment (forceMerge = 1). Will that segment get split at any point
  later, or will it remain one big segment?
* Is there any way to maintain the number of segments but still merge to
  reclaim the space held by deleted documents? In other words, can I issue
  "forceMerge=20"? If so, what would the command look like? Any examples
  for this?

Thanks
Vinay

On 16 April 2014 07:59, Vinay Pothnis <poth...@gmail.com> wrote:

> Thank you Erick!
> Yes - I am using the expungeDeletes option.
>
> Thanks for the note on disk space for the optimize command. I should have
> enough space for that. What about the heap space requirement? I hope it
> can do the optimize with the memory that is allocated to it.
>
> Thanks
> Vinay
>
>
> On 16 April 2014 04:52, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> The optimize should, indeed, reduce the index size. Be aware that it
>> may consume 2x the disk space. You may also try expungeDeletes; see
>> here: https://wiki.apache.org/solr/UpdateXmlMessages
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 16, 2014 at 12:47 AM, Vinay Pothnis <poth...@gmail.com> wrote:
>> > Another update:
>> >
>> > I removed the replicas - to avoid the replication doing a full copy.
>> > I am now able to delete sizeable chunks of data.
>> > But the overall index size remains the same even after the deletes;
>> > it does not seem to go down.
>> >
>> > I understand that Solr would do this in the background - but I don't
>> > see a decrease in the overall index size even after 1-2 hours.
>> > I can see a bunch of ".del" files in the index directory, but they do
>> > not seem to get cleaned up. Is there any way to monitor/follow the
>> > progress of index compaction?
>> >
>> > Also, does triggering "optimize" from the admin UI help to compact the
>> > index size on disk?
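For reference, a sketch of the two commands in question, assuming the XML update-message syntax described on the UpdateXmlMessages wiki page; host, port, and collection name are placeholders:

```shell
# Merge down to at most 20 segments instead of 1. The merged segments
# are rewritten, so space held by deleted documents is reclaimed while
# the index still keeps multiple segments:
curl -H 'Content-Type: text/xml' \
     --data '<optimize maxSegments="20" waitSearcher="false"/>' \
     'http://host:port/solr/coll-name1/update'

# Alternatively, reclaim deleted-document space at commit time without a
# full optimize (segments carrying deletes get merged):
curl -H 'Content-Type: text/xml' \
     --data '<commit expungeDeletes="true"/>' \
     'http://host:port/solr/coll-name1/update'
```

Note that expungeDeletes, like optimize, can still trigger large merges, so the 2x disk-space caveat mentioned below applies to it as well.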
>> >
>> > Thanks
>> > Vinay
>> >
>> >
>> > On 14 April 2014 12:19, Vinay Pothnis <poth...@gmail.com> wrote:
>> >
>> >> Some update:
>> >>
>> >> I removed the auto-warm configurations for the various caches and
>> >> reduced the cache sizes. I then issued a call to delete a day's worth
>> >> of data (800K documents).
>> >>
>> >> There was no out of memory this time - but some of the nodes went
>> >> into recovery mode. I was able to catch some logs this time around,
>> >> and this is what I see:
>> >>
>> >> ****************
>> >> WARN  [2014-04-14 18:11:00.381] [org.apache.solr.update.PeerSync]
>> >> PeerSync: core=core1_shard1_replica2 url=http://host1:8983/solr
>> >> too many updates received since start - startingUpdates no longer
>> >> overlaps with our currentUpdates
>> >> INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
>> >> PeerSync Recovery was not successful - trying replication.
>> >> core=core1_shard1_replica2
>> >> INFO  [2014-04-14 18:11:00.476] [org.apache.solr.cloud.RecoveryStrategy]
>> >> Starting Replication Recovery. core=core1_shard1_replica2
>> >> INFO  [2014-04-14 18:11:00.535] [org.apache.solr.cloud.RecoveryStrategy]
>> >> Begin buffering updates. core=core1_shard1_replica2
>> >> INFO  [2014-04-14 18:11:00.536] [org.apache.solr.cloud.RecoveryStrategy]
>> >> Attempting to replicate from http://host2:8983/solr/core1_shard1_replica1/.
>> >> core=core1_shard1_replica2
>> >> INFO  [2014-04-14 18:11:00.536]
>> >> [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
>> >> client,
>> >> config:maxConnections=128&maxConnectionsPerHost=32&followRedirects=false
>> >> INFO  [2014-04-14 18:11:01.964]
>> >> [org.apache.solr.client.solrj.impl.HttpClientUtil] Creating new http
>> >> client,
>> >> config:connTimeout=5000&socketTimeout=20000&allowCompression=false&maxConnections=10000&maxConnectionsPerHost=10000
>> >> INFO  [2014-04-14 18:11:01.969] [org.apache.solr.handler.SnapPuller]
>> >> No value set for 'pollInterval'. Timer Task not started.
>> >> INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> >> Master's generation: 1108645
>> >> INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> >> Slave's generation: 1108627
>> >> INFO  [2014-04-14 18:11:01.973] [org.apache.solr.handler.SnapPuller]
>> >> Starting replication process
>> >> INFO  [2014-04-14 18:11:02.007] [org.apache.solr.handler.SnapPuller]
>> >> Number of files in latest index in master: 814
>> >> INFO  [2014-04-14 18:11:02.007]
>> >> [org.apache.solr.core.CachingDirectoryFactory] return new directory for
>> >> /opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
>> >> INFO  [2014-04-14 18:11:02.008] [org.apache.solr.handler.SnapPuller]
>> >> Starting download to
>> >> NRTCachingDirectory(org.apache.lucene.store.MMapDirectory@/opt/data/solr/core1_shard1_replica2/data/index.20140414181102007
>> >> lockFactory=org.apache.lucene.store.NativeFSLockFactory@5f6570fe;
>> >> maxCacheMB=48.0 maxMergeSizeMB=4.0) fullCopy=true
>> >> ****************
>> >>
>> >> So, it looks like the number of updates is too large for regular
>> >> (PeerSync) replication, and it then falls back to a full copy of the
>> >> index. And since our index is very large (350G), this causes the
>> >> cluster to stay in recovery mode forever - trying to copy that huge
>> >> index.
>> >>
>> >> I also read in the thread
>> >> http://lucene.472066.n3.nabble.com/Recovery-too-many-updates-received-since-start-td3935281.html
>> >> that there is a limit of 100 documents.
>> >>
>> >> I wonder if this has been made configurable since that thread. If
>> >> not, the only option I see is to do a "trickle" delete of 100
>> >> documents per second or something like that.
>> >>
>> >> Also - the other suggestion of using "distrib=false" might not help,
>> >> because the issue currently is that the replication falls back to a
>> >> "full copy".
>> >>
>> >> Any thoughts?
>> >>
>> >> Thanks
>> >> Vinay
>> >>
>> >>
>> >> On 14 April 2014 07:54, Vinay Pothnis <poth...@gmail.com> wrote:
>> >>
>> >>> Yes, that is our approach. We did try deleting a day's worth of data
>> >>> at a time, and that resulted in OOM as well.
>> >>>
>> >>> Thanks
>> >>> Vinay
>> >>>
>> >>>
>> >>> On 14 April 2014 00:27, Furkan KAMACI <furkankam...@gmail.com> wrote:
>> >>>
>> >>>> Hi;
>> >>>>
>> >>>> I mean you can divide the range (i.e. one week at each delete
>> >>>> instead of one month) and try to check whether you still get an OOM
>> >>>> or not.
>> >>>>
>> >>>> Thanks;
>> >>>> Furkan KAMACI
>> >>>>
>> >>>>
>> >>>> 2014-04-14 7:09 GMT+03:00 Vinay Pothnis <poth...@gmail.com>:
>> >>>>
>> >>>> > Aman,
>> >>>> > Yes - Will do!
>> >>>> >
>> >>>> > Furkan,
>> >>>> > What do you mean by 'bulk delete'?
>> >>>> >
>> >>>> > -Thanks
>> >>>> > Vinay
>> >>>> >
>> >>>> >
>> >>>> > On 12 April 2014 14:49, Furkan KAMACI <furkankam...@gmail.com> wrote:
>> >>>> >
>> >>>> > > Hi;
>> >>>> > >
>> >>>> > > Do you get any problems when you index your data? On the other
>> >>>> > > hand, deleting in bulks and reducing the size of documents may
>> >>>> > > help you not to hit OOM.
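The "trickle" delete described above could be scripted by walking the date range in small windows with a pause between batches. This is a sketch only - the field name, range, and window size are illustrative placeholders, and it just echoes each payload (a dry run); in production the echo would be replaced by the curl POST to the update handler:

```shell
#!/bin/sh
# Sketch: walk a date range in one-hour windows, deleting each window
# separately so no single delete floods the replicas' update logs.
# Field name, range, and window size are illustrative placeholders.
START=1383955200000        # window start (epoch millis)
END=1383966000000          # small 3-hour range for this dry run
STEP=3600000               # one hour per batch
CUR=$START
while [ "$CUR" -lt "$END" ]; do
  NEXT=$((CUR + STEP))
  PAYLOAD="<delete><query>date_param:[$CUR TO $((NEXT - 1))]</query></delete>"
  # Dry run: just print the payload. In production, POST it instead:
  #   curl -H 'Content-Type: text/xml' --data "$PAYLOAD" \
  #        'http://host:port/solr/coll-name1/update?commit=true'
  echo "$PAYLOAD"
  CUR=$NEXT
  # sleep 1   # throttle so replicas can keep up via PeerSync
done
```

Committing per window (or on a soft autoCommit) keeps each batch small enough that replicas can recover via PeerSync instead of a full index copy.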
>> >>>> > >
>> >>>> > > Thanks;
>> >>>> > > Furkan KAMACI
>> >>>> > >
>> >>>> > >
>> >>>> > > 2014-04-12 8:22 GMT+03:00 Aman Tandon <amantandon...@gmail.com>:
>> >>>> > >
>> >>>> > > > Vinay, please share your experience after trying this solution.
>> >>>> > > >
>> >>>> > > >
>> >>>> > > > On Sat, Apr 12, 2014 at 4:12 AM, Vinay Pothnis <poth...@gmail.com> wrote:
>> >>>> > > >
>> >>>> > > > > The query is something like this:
>> >>>> > > > >
>> >>>> > > > > curl -H 'Content-Type: text/xml' --data
>> >>>> > > > > '<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR
>> >>>> > > > > val4) AND date_param:[1383955200000 TO
>> >>>> > > > > 1385164800000]</query></delete>'
>> >>>> > > > > 'http://host:port/solr/coll-name1/update?commit=true'
>> >>>> > > > >
>> >>>> > > > > I am trying to restrict the number of documents deleted via
>> >>>> > > > > the date parameter.
>> >>>> > > > >
>> >>>> > > > > I had not tried the "distrib=false" option; I could give
>> >>>> > > > > that a try. Thanks for the link! I will check on the cache
>> >>>> > > > > sizes and autowarm values, and will try disabling the caches
>> >>>> > > > > while I am deleting.
>> >>>> > > > >
>> >>>> > > > > Thanks Erick and Shawn for your inputs!
>> >>>> > > > >
>> >>>> > > > > -Vinay
>> >>>> > > > >
>> >>>> > > > >
>> >>>> > > > > On 11 April 2014 15:28, Shawn Heisey <s...@elyograg.org> wrote:
>> >>>> > > > >
>> >>>> > > > > > On 4/10/2014 7:25 PM, Vinay Pothnis wrote:
>> >>>> > > > > >
>> >>>> > > > > >> We tried to delete the data through a query - say one
>> >>>> > > > > >> day's or month's worth of data. But after deleting just
>> >>>> > > > > >> one month's worth of data, the master node goes out of
>> >>>> > > > > >> memory - heap space.
>> >>>> > > > > >>
>> >>>> > > > > >> We are wondering if there is any way to incrementally
>> >>>> > > > > >> delete the data without affecting the cluster adversely.
>> >>>> > > > > >>
>> >>>> > > > > >
>> >>>> > > > > > I'm curious about the actual query being used here. Can
>> >>>> > > > > > you share it, or a redacted version of it? Perhaps there
>> >>>> > > > > > might be a clue there.
>> >>>> > > > > >
>> >>>> > > > > > Is this a fully distributed delete request? One thing you
>> >>>> > > > > > might try, assuming Solr even supports it, is sending the
>> >>>> > > > > > same delete request directly to each shard core with
>> >>>> > > > > > distrib=false.
>> >>>> > > > > >
>> >>>> > > > > > Here's a very incomplete list of ways you can reduce Solr
>> >>>> > > > > > heap requirements:
>> >>>> > > > > >
>> >>>> > > > > > http://wiki.apache.org/solr/SolrPerformanceProblems#Reducing_heap_requirements
>> >>>> > > > > >
>> >>>> > > > > > Thanks,
>> >>>> > > > > > Shawn
>> >>>> > > >
>> >>>> > > > --
>> >>>> > > > With Regards
>> >>>> > > > Aman Tandon
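To round out the thread: the per-shard delete Shawn suggests would look something like the sketch below. Hosts and core names are placeholders for the actual cluster layout, and - as Shawn himself hedges - it is not certain Solr honors distrib=false for a delete-by-query:

```shell
# Placeholders: one replica core per shard of the collection.
for CORE in \
    'http://host1:8983/solr/core1_shard1_replica1' \
    'http://host2:8983/solr/core1_shard2_replica1'
do
  # The same delete-by-query from the thread, sent directly to each
  # shard core, bypassing distributed routing:
  curl -H 'Content-Type: text/xml' \
       --data '<delete><query>param1:(val1 OR val2) AND -param2:(val3 OR val4) AND date_param:[1383955200000 TO 1385164800000]</query></delete>' \
       "$CORE/update?commit=true&distrib=false"
done
```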