I think rsync is a good approach. It may take a bit of trial and error, but we can work out the right list of --exclude patterns so that the 0.94 docs, the svn dot-files, and whatever else needs to stay are preserved. Did you explore this?
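
To make the idea concrete, here is a rough sketch of the logic in Python. It is only an illustration: in practice rsync's own --delete and --exclude options would do this work, and a --dry-run pass would show what it plans to remove. The patterns and file names below are guesses, not the real site layout.

```python
from fnmatch import fnmatch

# Hypothetical patterns for paths that must survive even though the
# current site build no longer generates them.
PROTECTED = ["0.94/*", ".svn/*", "*/.svn/*"]


def stale_files(in_svn, generated, protected=PROTECTED):
    """Return paths that are in svn but missing from the fresh build,
    skipping anything that matches a protected pattern."""
    candidates = set(in_svn) - set(generated)
    return sorted(
        path for path in candidates
        if not any(fnmatch(path, pattern) for pattern in protected)
    )


# Made-up example data, just to exercise the function.
in_svn = ["index.html", "old-page.html", "0.94/apidocs/index.html", ".svn/entries"]
generated = ["index.html", "book.html"]
print(stale_files(in_svn, generated))
```

Only old-page.html is reported here, because the 0.94 docs and the .svn entry match a protected pattern. Getting that pattern list right is the part that will take a few iterations.
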
On Wed, Sep 10, 2014 at 8:58 PM, Misty Stanley-Jones <mstanleyjo...@cloudera.com> wrote:

> Hi all,
>
> The way the site has been built for a while poses a problem I'm not sure
> how to solve. I'd like your input.
>
> Currently, the site is stored in a SVN repo. What happens is that we
> generate the site from the git repo sources and then copy the output over
> the top of the svn repo, svn add new files, and svn update.
>
> This causes some problems. The biggest problem is that if files become
> irrelevant (we remove a class or something, or remove a webpage, or
> something like that), there is actually no way to delete it from svn,
> because we don't start over with a fresh copy of the site each time.
>
> At first glance, it seems like an easy thing to fix. You could use an rsync
> job and just delete the ones that are not present in the generated source.
> But there are some things in there that are not generated anymore (such as
> 0.94 API docs) or at least not generated by running the site goal on
> master.
>
> So I need a way to figure out what files are truly stale and need to be
> deleted from svn, and which need to be left there. One strategy I thought
> of trying is to try to crawl the website starting from the front page and
> see all of the files that are reachable from there. The ones that are not,
> probably should be deleted.
>
> To that end, I am currently pulling down the site using wget, and I'll
> compare that to the contents in trunk and see what's different. But I'd
> like advice for what we can do about this in the future, since pulling down
> the site with wget takes ages.
>
> I'll update when I figure out more about it.
>
> Thanks,
> Misty
>
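
On the crawl idea in the quoted message: the same reachability check could probably be run against a local copy of the site rather than over the network with wget. Below is a minimal sketch of that, assuming the pages are available as a mapping from site-relative path to HTML text. The `reachable` function, the `LinkCollector` class, and the sample pages are all hypothetical, and the sketch only follows relative `href` and `src` attributes, so links built by JavaScript would be missed.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collect href/src attribute values from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def reachable(pages, start="index.html"):
    """Return the set of paths reachable from `start` by following links.

    `pages` maps a site-relative path to its HTML text.
    """
    seen, queue = set(), [start]
    while queue:
        path = queue.pop()
        if path in seen or path not in pages:
            continue
        seen.add(path)
        collector = LinkCollector()
        collector.feed(pages[path])
        for link in collector.links:
            link, _fragment = urldefrag(link)
            # Skip same-page anchors and anything pointing off-site.
            if not link or urlparse(link).scheme or link.startswith("//"):
                continue
            base = posixpath.dirname(path)
            queue.append(posixpath.normpath(posixpath.join(base, link)))
    return seen


# Made-up three-page site: orphan.html is linked from nowhere.
pages = {
    "index.html": '<a href="book/intro.html#top">Book</a>'
                  '<a href="http://example.org/">External</a>',
    "book/intro.html": '<a href="../index.html">Home</a>',
    "orphan.html": "<p>Nothing links here.</p>",
}
unreachable = sorted(set(pages) - reachable(pages))
print(unreachable)
```

Anything left in `unreachable` is only a candidate for deletion. Files such as the 0.94 API docs may be linked from outside the site, so the result would still need a manual review.
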