I think rsync is a good approach. It may take a bit, but we can work out
the correct list of --exclude patterns so that the 0.94 docs, the svn
dot-files, and whatever else needs to stay are preserved. Did you explore
this?
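
To make the exclude idea concrete, here is a rough dry-run sketch in Python. It lists what a delete pass would remove without touching anything. The patterns in PROTECTED and the two directory arguments are assumptions for illustration only; the real list would have to be worked out against the actual svn checkout.

```python
import fnmatch
import os

# Hypothetical patterns for paths that must survive even though the site
# build on master no longer produces them. The real list is still to be
# worked out.
PROTECTED = [".svn/*", "*/.svn/*", "0.94/*"]


def relative_files(root):
    """Return the set of file paths under root, relative to root, using '/'."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def is_protected(path, patterns=PROTECTED):
    """True if path matches any protected pattern (fnmatch '*' also spans '/')."""
    return any(fnmatch.fnmatch(path, pattern) for pattern in patterns)


def stale_files(generated_root, svn_root, patterns=PROTECTED):
    """List files present in the svn copy, absent from the generated site,
    and not covered by a protected pattern."""
    generated = relative_files(generated_root)
    published = relative_files(svn_root)
    return sorted(
        path for path in published - generated
        if not is_protected(path, patterns)
    )
```

Reviewing the output of stale_files() before deleting anything would show whether the pattern list is complete.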

On Wed, Sep 10, 2014 at 8:58 PM, Misty Stanley-Jones <
mstanleyjo...@cloudera.com> wrote:

> Hi all,
>
> The way the site has been built for a while poses a problem I'm not sure
> how to solve. I'd like your input.
>
> Currently, the site is stored in a SVN repo. What happens is that we
> generate the site from the git repo sources and then copy the output over
> the top of the svn repo, svn add new files, and svn update.
>
> This causes some problems. The biggest is that when files become
> irrelevant (we remove a class, a webpage, or something like that), there
> is no way to delete them from svn, because we don't start over with a
> fresh copy of the site each time.
>
> At first glance, it seems like an easy thing to fix. You could use an rsync
> job and just delete the ones that are not present in the generated source.
> But there are some things in there that are not generated anymore (such as
> 0.94 API docs) or at least not generated by running the site goal on
> master.
>
> So I need a way to figure out which files are truly stale and need to be
> deleted from svn, and which need to be left there. One strategy I thought
> of is to crawl the website starting from the front page and see which
> files are reachable from there. The ones that are not reachable should
> probably be deleted.
>
> To that end, I am currently pulling down the site using wget, and I'll
> compare that to the contents in trunk and see what's different. But I'd
> like advice for what we can do about this in the future, since pulling down
> the site with wget takes ages.
>
> I'll update when I figure out more about it.
>
> Thanks,
> Misty
>
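
On the crawling strategy in the quoted message: a minimal sketch, assuming the svn working copy is on local disk, that follows links from the front page by reading the files directly rather than fetching them with wget. This is a swapped-in alternative, not what was actually run. The start page name index.html is an assumption, and links built by JavaScript would be missed.

```python
import os
import posixpath
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collect every href and src attribute value seen in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def reachable_files(site_root, start="index.html"):
    """Breadth-first walk of local links; returns site-relative paths."""
    seen = set()
    queue = deque([start])
    while queue:
        rel = queue.popleft()
        path = os.path.join(site_root, *rel.split("/"))
        if os.path.isdir(path):
            # Treat a link to a directory as a link to its index page.
            rel = posixpath.normpath(posixpath.join(rel, "index.html"))
            path = os.path.join(site_root, *rel.split("/"))
        if rel in seen or not os.path.isfile(path):
            continue
        seen.add(rel)
        if not rel.endswith((".html", ".htm")):
            continue  # only HTML files are parsed for further links
        parser = LinkCollector()
        with open(path, encoding="utf-8", errors="replace") as handle:
            parser.feed(handle.read())
        for link in parser.links:
            target, _fragment = urldefrag(link)
            parsed = urlparse(target)
            if parsed.scheme or parsed.netloc or not parsed.path:
                continue  # external link, mailto:, or fragment-only link
            if parsed.path.startswith("/"):
                candidate = parsed.path.lstrip("/")
            else:
                candidate = posixpath.join(posixpath.dirname(rel), parsed.path)
            candidate = posixpath.normpath(candidate)
            if candidate == ".." or candidate.startswith("../"):
                continue  # points outside the site root
            queue.append(candidate)
    return seen
```

Files in the working copy that are not in the returned set would be the candidates to review for deletion.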
