The clean way to go is to start from the log and to replay it... but I
actually have no idea how to do that. You might find this (old) piece
interesting:
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
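A concrete starting point for the replay idea: HDFS ships an offline edits viewer, `hdfs oev -i <edits-file> -o edits.xml`, which dumps the namenode edit log as XML. A small driver can then walk the records in transaction order and re-apply each operation on the target cluster, which gives the per-file error handling distcp lacks. A minimal sketch in Python — the XML here is a simplified stand-in for oev's record format (real records carry many more fields such as TXID, permissions, and block info), and the handlers are hypothetical placeholders for real FileSystem calls:

```python
import xml.etree.ElementTree as ET

# Simplified sample of the XML that `hdfs oev` emits (illustrative only;
# real records contain TXID, timestamps, permissions, block lists, etc.).
SAMPLE = """
<EDITS>
  <RECORD><OPCODE>OP_MKDIR</OPCODE><DATA><PATH>/warehouse/t1</PATH></DATA></RECORD>
  <RECORD><OPCODE>OP_DELETE</OPCODE><DATA><PATH>/warehouse/old</PATH></DATA></RECORD>
</EDITS>
"""

def replay(xml_text, handlers):
    """Walk edit-log records in order and dispatch each opcode to a handler.

    Unknown opcodes are skipped; a real replayer would log and retry them,
    which is exactly the per-operation control distcp doesn't give you.
    """
    ops = []
    for rec in ET.fromstring(xml_text).iter("RECORD"):
        opcode = rec.findtext("OPCODE")
        path = rec.findtext("DATA/PATH")
        handler = handlers.get(opcode)
        if handler is None:
            continue
        handler(path)
        ops.append((opcode, path))
    return ops

# Hypothetical handlers; a real driver would call the target cluster's
# FileSystem API (mkdirs, delete, or schedule a copy for OP_ADD/OP_CLOSE).
applied = []
handlers = {
    "OP_MKDIR": lambda p: applied.append(("mkdir", p)),
    "OP_DELETE": lambda p: applied.append(("rm", p)),
}
print(replay(SAMPLE, handlers))
# -> [('OP_MKDIR', '/warehouse/t1'), ('OP_DELETE', '/warehouse/old')]
```

The hard part, as noted below in the thread, is the dev effort: mapping every opcode faithfully (renames, appends, truncates) is substantial work compared with wrapping distcp.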
I'd never have tried to transmit this much data across the network; I would
always have tried to find a way to copy hard disks and physically ship them
to the new location...

Camusensei

On 12 April 2016 at 19:14, cs user <acldstk...@gmail.com> wrote:
> Hi there,
>
> At some point in the near future we are also going to require exactly
> what you describe. We had hoped to use distcp.
>
> You mentioned:
>
>   1. it does not handle data deletes
>
> distcp has a -delete flag which says:
>
>   "Delete the files existing in the dst but not in src"
>
> Does this not help with handling deleted data?
>
> I believe there is an issue if data is removed during a distcp run: at
> the start of the run it captures all the files it needs to sync, and if
> some of those files are deleted during the run, it may lead to errors.
> Is there a way to ignore these errors and have distcp retry on the next
> run?
>
> I'd be interested in how you eventually accomplish the syncing between
> the two clusters, because we need to solve the very same problem :-)
>
> Perhaps others on the mailing list have experience with this?
>
> Thanks!
>
> On Tue, Apr 12, 2016 at 10:44 AM, raymond <rgbbo...@163.com> wrote:
>>
>> Hi,
>>
>> We have a Hadoop cluster with several PB of data, and we need to
>> migrate it to a new cluster across datacenters for larger storage
>> capacity. We estimate that the data copy itself might take nearly a
>> month to finish, so we are looking for a sound solution. The
>> requirements are:
>>
>> 1. We cannot bring down the old cluster for that long (of course);
>>    a couple of hours of downtime is acceptable.
>> 2. We need to mirror the data: not only copy new data, but also
>>    delete from the target any data deleted on the source during the
>>    migration period.
>> 3. We don't have much space left on the old cluster, say 30% free.
>>
>> Regarding distcp: although it might be the easiest way,
>>
>> 1.
>> it does not handle data deletes;
>> 2. it handles newly appended files by comparing file sizes and
>>    overwriting the whole file (which might waste a lot of bandwidth);
>> 3. its per-file error handling is flimsy;
>> 4. load control is difficult (we still have a heavy workload on the
>>    old cluster); you can only split the work manually into pieces
>>    small enough to achieve the flow-control goal.
>>
>> In short, for a long-running mirroring job, it won't do well by itself.
>>
>> Some possible approaches:
>>
>> We could:
>>
>> - Wrap distcp to make it work better (error handling, result checking,
>>   extra code to sync deleted files, etc.).
>> - Use the snapshot mechanism to better identify files that need to be
>>   copied, deleted, or renamed.
>>
>> Or:
>>
>> - Forget about distcp. Use the FSIMAGE and edit log as a change-history
>>   source and write our own code to replay the operations, handling each
>>   file one by one (better per-file error handling could be achieved),
>>   but this might need a lot of dev work.
>>
>> Btw, the closest thing I could find is Facebook migrating a 30 PB Hive
>> warehouse:
>>
>> https://www.facebook.com/notes/facebook-engineering/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920/
>>
>> They modified distcp to do an initial bulk load (to better handle very
>> large and very small files, for load balancing I guess), plus a
>> replication system (not much detail on that part) to mirror the
>> changes.
>>
>> But it is not clear how they handled the distcp shortcomings I
>> mentioned above, or whether they used the snapshot mechanism.
>>
>> So, does anyone have experience with this kind of work? What do you
>> think might be the best approach for our case? Is there any existing
>> work we can reuse? Has any work been done around the snapshot
>> mechanism to ease data migration?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
>> For additional commands, e-mail: user-h...@hadoop.apache.org
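On the snapshot route raised in the thread: newer Hadoop releases (2.7+) let distcp consume snapshot diffs directly. You take `hdfs dfs -createSnapshot /data s1`, run the bulk copy, later take `s2`, and then `hadoop distcp -update -diff s1 s2 hdfs://old-nn/data hdfs://new-nn/data` copies only what changed between the snapshots; `hdfs snapshotDiff /data s1 s2` shows the created/modified/deleted/renamed paths, and `-bandwidth` caps per-map throughput for load control. (If I read the docs right, `-diff` requires the target to be unmodified since `s1`, which is why a short write-freeze cutover window is still needed.) The copy/delete decision behind `-update`/`-delete` can be sketched like this — a minimal illustration with file metadata reduced to `path -> (size, mtime)` dicts, not distcp's actual implementation:

```python
def plan_sync(src_snapshot, dst_state):
    """Compute the copy/delete plan that mirrors src onto dst.

    Both arguments map path -> (size, mtime). This mimics the decision
    distcp makes with -update (copy if missing or changed) and -delete
    (remove target paths that no longer exist on the source).
    """
    to_copy = [p for p, meta in src_snapshot.items()
               if dst_state.get(p) != meta]            # new or changed files
    to_delete = [p for p in dst_state if p not in src_snapshot]
    return sorted(to_copy), sorted(to_delete)

src = {"/data/a": (100, 1), "/data/b": (200, 2)}        # current source state
dst = {"/data/a": (100, 1), "/data/stale": (50, 1)}     # lagging mirror
print(plan_sync(src, dst))
# -> (['/data/b'], ['/data/stale'])
```

Note the caveat raymond raises: deciding by size (and mtime) alone can both miss in-place changes and re-copy whole files after a small append, so repeated incremental runs over appended files still waste bandwidth unless checksums or snapshot diffs are used.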