Here's a script that we could run on the old and new sites that should
quickly catch any major issues but not get caught up in formatting minutia.



On Fri, May 8, 2020 at 10:23 AM Robert Bradshaw <rober...@google.com> wrote:

> On Fri, May 8, 2020 at 9:58 AM Aizhamal Nurmamat kyzy <aizha...@apache.org>
> wrote:
>
>> I understand the difficulty, and this certainly comes with lessons
>> learned for future similar projects.
>>
>> To your questions Robert:
>> (1 and 2) I will commit to review the text in the resulting pages. I will
>> try and use some automation to extract visible text from each page and diff
>> it with the current state of the website. I can do this starting next week.
>> From some quick research, there seem to be tools that help with this
>> analysis (
>> https://stackoverflow.com/questions/3286955/compare-two-websites-and-see-if-they-are-equal
>> )
>>
>
> At first glance it looks like these tools would give diffs that are
> *larger* than the 47K one we're struggling to review here.
>
>
>> By remaining in this state, we hold others up from making changes, or we
>> increase the amount of work needed after merging to port over changes that
>> may be missed. If we move forward, new changes can be done on top of the
>> new website.
>>
>
> I agree we don't want to hold others up from making changes. However, the
> amount of work to port changes over seems small in comparison to everything
> else that is being discussed here. (It also provides good incentives to
> reach the bar quickly and has the advantage of falling on the right
> people.) (3) will still take some time.
>
> If we go this route, we're lowering the bar for doc changes, but not
> removing it.
>
>
>> (3) This makes sense. Brian, would you be able to spend some time to look
>> at the automation changes (build files and scripts) to ensure they look
>> fine?
>>
>> I would also like to write a post mortem to extract lessons learned and
>> avoid this situation in the future.
>>
>>
>> On Fri, May 8, 2020 at 9:44 AM Brian Hulette <bhule...@google.com> wrote:
>>
>>> I'm -0 on merging as-is. I have the same concerns as Robert and he's
>>> voiced them very well so I won't waste time re-airing them.
>>>
>>> (2) I spot checked the content, pulled out some common patterns, and
>>>> it mostly looks good, but there were also some issues (e.g. several
>>>> pages were replaced with the contents from entirely different pages).
>>>> I would be more comfortable if, say, a smoke test of comparing the old
>>>> and new sites, with html tags stripped and ignoring whitespace,
>>>> yielded what should be empty diffs.
>>>>
>>>
>>> Can you share any details about this analysis?
>>>
>>
> It was basically paging through the diff, adding things to the sed script,
> and then looking at more diffs.
>
>
>> +1 for verifying the old and new are the same by diffing the output.
>>>
>>>
>>>> (3) It'd be good to have someone give a stamp of approval on the
>>>> infrastructure changes, at least to validate that we're not going to
>>>> be taking on extra tech debt with regard to jenkins stability and
>>>> developer workflow. I see that Brian has at least looked at this some.
>>>
>>>
>>> My involvement so far was just recognizing a problem (creating
>>> root-owned files on jenkins workers) and helping to fix it. If there's
>>> anyone available who's familiar with the website infrastructure it would be
>>> great if they could take a look instead (if not I could probably acquaint
>>> myself enough to review).
>>>
>>> On Thu, May 7, 2020 at 11:57 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> This is a tough situation.
>>>>
>>>> It would have been much better if this transition was structured in
>>>> such a way that the review was more manageable (e.g. the suggestion of
>>>> scripts, not mixing in voluminous unnecessary changes like whitespace,
>>>> and not updating content), and possibly even incrementally (e.g. the
>>>> new site would have been developed over multiple PRs in a subdomain or
>>>> subdirectory while being worked on). But hindsight is 20/20 and no
>>>> one, myself included, thought to bring this up when the original
>>>> migration was proposed, so this is more something to keep in mind for
>>>> the future. I also appreciate the efforts that have been made to clean
>>>> things up (e.g. preserving history) and address feedback.
>>>>
>>>> So, where do we go from here? My first thought is that I really don't
>>>> want to set a precedent that just because a PR "will require a large
>>>> effort" and in a state that if we don't "move forward and merge what
>>>> we have now" then "work done so far will be lost" means that we think
>>>> it's OK to forgo doing a proper review.
>>>>
>>>> On the other hand, there are some mitigating factors with this being
>>>> the website and not the code in that "bugs," though possibly
>>>> embarrassing, won't break production pipelines or data loss, and
>>>> though the source is technically part of the release, when we find
>>>> something to fix we can fix the live website much more quickly than go
>>>> through the whole release process and convince people to upgrade. (I
>>>> recognize accepting this argument is, to some degree at least, saying
>>>> that we don't care about the correctness of docs as much as so-called
>>>> "real" code, if we go there.)
>>>>
>>>> If we decide to go ahead and merge (and I would not object), there are
>>>> some things I would like to see.
>>>>
>>>> (1) I would like to understand what we would do afterwards to "review
>>>> the outcome, and ensure that all the content is there," and why it
>>>> can't be done before merging instead. (Is it because it'd take time
>>>> and we don't want to incorporate changes that are made to the website
>>>> in the meantime? I think that boat has sailed, but maybe we can avoid
>>>> making it worse...)
>>>>
>>>> (2) I spot checked the content, pulled out some common patterns, and
>>>> it mostly looks good, but there were also some issues (e.g. several
>>>> pages were replaced with the contents from entirely different pages).
>>>> I would be more comfortable if, say, a smoke test of comparing the old
>>>> and new sites, with html tags stripped and ignoring whitespace,
>>>> yielded what should be empty diffs.
>>>>
>>>> (3) It'd be good to have someone give a stamp of approval on the
>>>> infrastructure changes, at least to validate that we're not going to
>>>> be taking on extra tech debt with regard to jenkins stability and
>>>> developer workflow. I see that Brian has at least looked at this some.
>>>>
>>>> - Robert
>>>>
>>>>
>>>> On Thu, May 7, 2020 at 12:40 PM Aizhamal Nurmamat kyzy
>>>> <aizha...@apache.org> wrote:
>>>> >
>>>> > Thank you Ahmet.
>>>> >
>>>> > Robert/Brian, what do you think?
>>>> >
>>>> > The website staging and pre commit tests have passed [1]. If nobody
>>>> has objections, we could merge it soon.
>>>> >
>>>> >
>>>> > [1] https://github.com/apache/beam/pull/11554
>>>> >
>>>> > On Thu, May 7, 2020 at 11:38 AM Ahmet Altay <al...@google.com> wrote:
>>>> >>
>>>> >>
>>>> >>
>>>> >> On Thu, May 7, 2020 at 10:50 AM Aizhamal Nurmamat kyzy <
>>>> aizha...@apache.org> wrote:
>>>> >>>
>>>> >>> Thanks for the writeup Ahmet.
>>>> >>>
>>>> >>> My bias is to move forward and merge the PR. After this, we'll
>>>> review the outcome, and ensure that all the content is there. Nam will help
>>>> us with that.
>>>> >>> The reason that I'd like to move forward and merge what we have now
>>>> - is that if we don't do that, the work done so far will be lost.
>>>> >>> We'll make sure to stage the website in its current state, and use
>>>> that as reference/archive to ensure all the content have been moved.
>>>> >>>
>>>> >>> Is this reasonable to everyone?
>>>> >>
>>>> >>
>>>> >> This is reasonable to me. I agree with your reasons.
>>>> >>
>>>> >> What do others think?
>>>> >>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>> On Wed, May 6, 2020 at 7:07 PM Ahmet Altay <al...@google.com>
>>>> wrote:
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>> On Wed, May 6, 2020 at 2:33 PM Aizhamal Nurmamat kyzy <
>>>> aizha...@apache.org> wrote:
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> > 1) Currently, the main blocker for merging is Staging Test
>>>> Failures.
>>>> >>>>>>
>>>> >>>>>> That and finishing the review. (Is someone tracking/coordinating
>>>> this?)
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> I am coordinating the work on the failed tests, but I would need
>>>> other committer's help to perform the review. @Ahmet, could you help us
>>>> prioritize the review for this PR?
>>>> >>>>
>>>> >>>>
>>>> >>>> The problem is there are too many manual changes. Reviewing this
>>>> change in this form will require a large effort. I do not think I can
>>>> interrupt other projects to prioritize reviews on this PR. IMO, we have a
>>>> few options:
>>>> >>>>
>>>> >>>> - PR to be restructured in the format suggested in this thread. A
>>>> commit for infrastructure changes from Jekyll to hugo. A second commit for
>>>> a script that will convert the majority of the content. A third commit for
>>>> the execution of the script. And a fourth commit for the additional manual
>>>> content changes. If Nam can get to this form, people on this thread
>>>> myself/Robert/Pablo/Brian can review the changes.
>>>> >>>> - Another option is, we can accept that we already invested in
>>>> this transition and overall this is a good change, and merge the PR more or
>>>> less in its current form (with tests fixed and open comments addressed)
>>>> even though it has issues. And then overtime fix the issues we encounter.
>>>> There was already some amount of review and visual comparisons, we risk
>>>> losing some recent content changes but I am assuming this will not be much.
>>>> If Nam can commit to compare two sites after a merge, fixing the majority
>>>> of the delta, this might be a viable option.
>>>> >>>>
>>>> >>>> Another thing we can do, we can archive/store a read-only copy of
>>>> the current website in an "archive" url temporarily instead of completely
>>>> deleting it. It will give us a baseline for a while to go back to the old
>>>> content and move any missing data. (And maybe, someone can come up with an
>>>> innovative way to compare the textual content of both sites.) A note on the
>>>> stop world approach, I believe we are already failing on that with merge
>>>> conflicts showing up on the PR. It will be better for us to complete the
>>>> transition as soon as possible. Fixing after the initial merge might be a
>>>> simpler task, especially if we can archive the old site.
>>>> >>>>
>>>> >>>>>
>>>> >>>>>
>>>> >>>>>>
>>>> >>>>>> > Michal showed Nam how to handle the 1st test which was about
>>>> Apache License missing.
>>>> >>>>>> >
>>>> >>>>>> > However, the 2nd and 3rd tests looked like some kind of
>>>> permissions error on the Jenkins worker, not to be configured by code. For
>>>> more details based on Jenkin logs, the 2nd test failed because of
>>>> website/www/site/themes and the 3rd test failed because of
>>>> website/www/node_modules, they are both auto-generated files on build. Can
>>>> someone help Nam to look into this?
>>>> >>>>>> >
>>>> >>>>>> > RAT ("Run RAT PreCommit") — FAILURE
>>>> >>>>>> > Website_Stage_GCS ("Run Website_Stage_GCS PreCommit") — FAILURE
>>>> >>>>>> > Website_Stage_GCS ("Run Website_Stage_GCS PreCommit") — FAILURE
>>>> >>>>>> >
>>>> >>>>>> > 2) Are there any other blockers for merging?
>>>> @Ahmet/Robert/others please share if there are any other blockers.
>>>> >>>>>> >
>>>> >>>>>> >
>>>> >>>>>> > [1] https://github.com/gohugoio/hugo/pull/4494
>>>> >>>>>> >
>>>> >>>>>> >
>>>> >>>>>> > On Wed, May 6, 2020 at 10:19 AM Robert Bradshaw <
>>>> rober...@google.com> wrote:
>>>> >>>>>> >>
>>>> >>>>>> >> On Mon, May 4, 2020 at 7:07 PM Ahmet Altay <al...@google.com>
>>>> wrote:
>>>> >>>>>> >> >
>>>> >>>>>> >> >> On Mon, May 4, 2020 at 6:30 PM Robert Bradshaw <
>>>> rober...@google.com> wrote:
>>>> >>>>>> >> >>>
>>>> >>>>>> >> >>> I took the massive commit and split it up into:
>>>> >>>>>> >> >>>
>>>> >>>>>> >> >>> (1) Infrastructure changes (basically everything outside
>>>> of
>>>> >>>>>> >> >>> (website/www/site/content)
>>>> >>>>>> >> >>> (2) Sed script changes, and
>>>> >>>>>> >> >>> (3) Manual changes (everything not in (1) and (2)).
>>>> >>>>>> >> >
>>>> >>>>>> >> >
>>>> >>>>>> >> > Thank you Robert. This makes it much easier. What is the
>>>> source of the sed script? I am not sure why some of those lines are there.
>>>> It would be much easier for us to comment on the script source if it is
>>>> reviewable somewhere.
>>>> >>>>>> >>
>>>> >>>>>> >> I just gathered up common patterns as I was trying to go
>>>> through and
>>>> >>>>>> >> review the files... Mostly it was an exercise in finding a
>>>> compact
>>>> >>>>>> >> representation for the delta, not trying to be a perfect
>>>> conversion.
>>>> >>>>>> >> (I do think in retrospect, if we do something like this
>>>> again, it
>>>> >>>>>> >> would be preferable to commit a script that does the
>>>> auto-conversion
>>>> >>>>>> >> (maybe even with some patch files for manual changes) both
>>>> for ease of
>>>> >>>>>> >> reviewing and to avoid the stop-the-world situation we're in
>>>> now. (I'm
>>>> >>>>>> >> still worried that some changes will get lost in the shuffle.)
>>>>
>>>
import difflib
import os
import re

from bs4 import BeautifulSoup

OLD_SITE = 'old'
NEW_SITE = 'new'

def cleanup(html):
  # Strip tags.
  text = '\n'.join(s.replace('\n', ' ') for s in BeautifulSoup(html, "html.parser").stripped_strings)
  # Normalize newlines.
  return re.sub(r'\s+', lambda m: '\n' if '\n' in m.group(0) else ' ', text).strip()

for root, dirs, files in os.walk(OLD_SITE):
  for file in files:
    if file.endswith('.html'):
      old_path = os.path.join(root, file)
      old_content = cleanup(open(old_path).read())
      new_path = NEW_SITE + old_path[len(OLD_SITE):]
      new_content = cleanup(open(new_path).read())
      if old_content.replace('\n', ' ') != new_content.replace('\n', ' '):
         print('\n'.join(difflib.unified_diff(
            old_content.split('\n'), new_content.split('\n'), old_path, new_path)))

Reply via email to