On Fri, Aug 3, 2012 at 9:29 AM, Rob Weir <robw...@apache.org> wrote:

> On Fri, Aug 3, 2012 at 12:13 PM, Kay Schenk <kay.sch...@gmail.com> wrote:
> >
> > On 08/02/2012 07:45 AM, Rob Weir wrote:
> >>
> >> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk <kay.sch...@gmail.com> wrote:
> >>>
> >>> On 08/01/2012 04:29 PM, Rob Weir wrote:
> >>>>
> >>>> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk <kay.sch...@gmail.com> wrote:
> >>>>>
> >>>>> Hello all --
> >>>>>
> >>>>> I am exploring the www.openoffice.site using the Google Webmaster
> >>>>> tool that Rob told us about on Jul 19.
> >>>>>
> >>>>> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!)
> >>>>>
> >>>>> Many of these are links to VERY old docs which we no longer have --
> >>>>> like source trees for 1.0.1, 1.0.2 etc. -- or have to do with the
> >>>>> OLD architecture -- servlet references etc.
> >>>>>
> >>>>
> >>>> If I understand this correctly, Google is looking at links on
> >>>> webpages, not just our webpages, but also links from 3rd party
> >>>> websites, and if they point to an openoffice.org page that doesn't
> >>>> exist, it shows up on this list. This could happen for any reason.
> >>>> In some cases the original link might have had a typo.
> >>>
> >>> yes, this is correct, and you are right about this too...some of the
> >>> 404s reference pages we probably NEVER had.
> >>>
> >>>>
> >>>>> Some of these issues could be solved with rather extensive use of
> >>>>> sym links (yes, you can actually use these in svn -- kind of) and
> >>>>> of course some not -- many missing old security bulletins.
> >>>>>
> >>>>
> >>>> For the security bulletins, I wonder if this is actually a
> >>>> redirection error.
> >>>> We have many of them here:
> >>>>
> >>>> http://www.openoffice.org/security/bulletin.html
> >>>
> >>> ah...yes, they are there...the problem is we would need to construct
> >>> a LOT of just "redirect" pages to right some of these since they all
> >>> seem to have the form
> >>>
> >>> "/security/cvs-bulletin-number".html
> >>>
> >>
> >> So let's take a specific example.
> >>
> >> Google is reporting a 404 error for this URL:
> >> http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> It is linked to from at least 10 external web pages, for example
> >> the last link in this table:
> >>
> >> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
> >>
> >> (Whoops, make that at least 12 links, since the Apache and MarkMail
> >> list archives will now link to this)
> >>
> >> There is no file of this name in
> >>
> >> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
> >>
> >> Looking at the svn log I don't see any record of the files ever being
> >> here.
> >>
> >> I searched the complete ooo-site tree and this file
> >> (bulletin-20060629.html) doesn't exist anywhere.
> >>
> >> The Wayback Machine shows the page did exist in 2006:
> >>
> >> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> But it was broken already by 2009:
> >>
> >> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> So this is a pre-existing problem, and nothing we can do about it.
> >>
> >> Ughh. Obviously we cannot do this kind of research for every one of
> >> the 64 thousand links.
> >>
> >> But in other cases we can help. For example this link is giving a 404
> >> error:
> >>
> >> http://www.openoffice.org/licenses/lgpl_license.html
> >>
> >> I think we removed that intentionally, since that is no longer our
> >> license.
> >> However, that link was used by many other websites, including
> >> university course materials looking at open source licenses, etc.:
> >> http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
> >>
> >> So in cases like this we might want to restore the page. Do our part
> >> to help prevent bit rot and entropy from destroying the web.
> >
> > Well this particular one I really AM not in favor of restoring to our
> > site. What we could do on this one, is put in a page with just a
> > redirect to where the actual license lives. (and yes, this is really
> > one of the most critical ones in my opinion)
> >
>
> That would be fine, a page at that URL that says our license has
> changed, and that the LGPL can be found at the Free Software
> Foundation website, and link to that. Everyone's happy then.
>
> >>
> >> But to put it in perspective, although we have 64 thousand 404 errors
> >> on our website, we also have nearly 16 million incoming links that do
> >> not give errors.
> >
> > Well that's a relief eh? :)
> >
> > OK, I will have another look at this. At any rate, we absolutely
> > should put in place a generic "error.html" and have infra reconfigure
> > www.openoffice.org with THAT as our 404. That way we can assist folks
> > in dealing with link problems.
> >
>
> The nice thing about a custom error page is we can also put a Google
> custom search box there, to let the user do a site-wide search to try
> to find their answer that way.
>
> -Rob
>
EXACTLY! And that's just what was done when I've been in other
environments and come up against this.

> >>
> >> -Rob
> >>
> >>>>
> >>>> But we're redirecting security.openoffice.org to
> >>>> http://incubator.apache.org/openofficeorg/security.html
> >>>>
> >>>> So if there are outstanding URL's that are of the form
> >>>> security.openoffice.org/foo.html then they might be broken now.
> >>>
> >>> see above...it's the actual placement of the bulletins within the
> >>> tree that's the problem I think
> >>>
> >>>>
> >>>>> So, to those of you using this tool, I may mark many of these as
> >>>>> "fixed". Of course they are not -- and they may show up again.
> >>>>> Some of them only show up in BZ issues!! (Google is amazingly
> >>>>> thorough).
> >>>>>
> >>>>> I don't know how long it will take for them to "show up" again.
> >>>>> The problem is some of these are very very very old references,
> >>>>> and not likely we can do anything about at this point in time.
> >>>>> If you're not using this tool, you probably don't care about this.
> >>>>> If you are using it, and have another opinion before I start
> >>>>> chunking away at hiding these, please weigh in.
> >>>>>
> >>>>
> >>>> The way I understand it the links at the top of the list are the
> >>>> ones Google considers the most important. I think this is based on
> >>>> the number of links to that page. Maybe they factor in other things
> >>>> as well. So I'd recommend looking more at the top 100 or so broken
> >>>> links, make this a manageable task.
> >>>
> >>> Well the problem is "how" to make it manageable... :(
> >>>
> >>>>
> >>>> Or -- and here is a challenge for the algorithm experts -- maybe
> >>>> there is an easy way to take that entire list of 62,962 links and
> >>>> determine what the top base paths are that are broken.
> >>>
> >>> if only this were so :( They're all over the place.
> >>>
> >>>> In other words, if the links are:
> >>>>
> >>>> foo.openoffice.org/bar/baz1
> >>>> foo.openoffice.org/bar/baz2
> >>>> foo.openoffice.org/bar/baz2
> >>>> foo.openoffice.org/bar2/baz1
> >>>> foo2.openoffice.org/bar1/baz1
> >>>>
> >>>> Then this would tell us that foo.openoffice.org/bar/* was a top
> >>>> source of broken links. This might indicate important patterns of
> >>>> where the most broken links are.
> >>>>
> >>>> It seems like this could be done via a prefix tree (a "trie"):
> >>>> http://en.wikipedia.org/wiki/Trie
> >>>>
> >>>> Maybe other (simpler) ways as well.
> >>>
> >>> I'll look at this article. It's a daunting task any way you look at
> >>> it.
> >>>
> >>>>
> >>>> Regards,
> >>>
> >>> What happens when things get moved a LOT with no regard for the end
> >>> user. Don't get me started on the ways I've had to deal with this in
> >>> the past.
> >>>
> >>>>
> >>>> -Rob
> >>>>
> >>>>> --
> >>>>> ------------------------------------------------------------------------
> >>>>> MzK
> >>>>>
> >>>>> "I'm just a normal jerk who happens to make music.
> >>>>> As long as my brain and fingers work, I'm cool."
> >>>>> -- Eddie Van Halen
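
For what it's worth, the "top base paths" idea quoted above may not need a full trie. Below is a minimal sketch in Python of one of the "simpler ways": it tallies every host/directory prefix with a Counter instead of building the prefix tree itself. It assumes the 404 list is available as plain URL strings (with or without a scheme); the function name prefix_counts and the sample list are made up for illustration and are not part of the Webmaster tool or of anything we have in svn.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls):
    """Count how many broken URLs fall under each host/directory prefix."""
    counts = Counter()
    for url in urls:
        # urlsplit only fills in netloc when a scheme or '//' is present,
        # so bare "host/path" strings get a leading '//' first.
        parts = urlsplit(url if "//" in url else "//" + url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = parts.netloc
        counts[prefix] += 1
        # Every directory above the page itself is a candidate base path.
        for segment in segments[:-1]:
            prefix = prefix + "/" + segment
            counts[prefix] += 1
    return counts


# Hypothetical sample: the five example links from the thread.
broken = [
    "foo.openoffice.org/bar/baz1",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar2/baz1",
    "foo2.openoffice.org/bar1/baz1",
]

# Show the two prefixes with the most broken links under them.
for prefix, n in prefix_counts(broken).most_common(2):
    print(n, prefix)
```

On the sample list this ranks foo.openoffice.org/bar above the other directories, which is the pattern described in the quoted example. Duplicate entries are counted as given, so the real export would probably want de-duplicating first, and a host-level prefix will always score at least as high as any directory beneath it.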