On Fri, Aug 3, 2012 at 12:13 PM, Kay Schenk <kay.sch...@gmail.com> wrote: > > > On 08/02/2012 07:45 AM, Rob Weir wrote: >> >> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk <kay.sch...@gmail.com> wrote: >>> >>> >>> >>> On 08/01/2012 04:29 PM, Rob Weir wrote: >>>> >>>> >>>> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk <kay.sch...@gmail.com> wrote: >>>>> >>>>> >>>>> Hello all -- >>>>> >>>>> I am exploring the www.openoffice.site using the Google Webmaster tool >>>>> that >>>>> Rob told us about on Jul 19. >>>>> >>>>> I am ONLY getting started by looking at the 62,962 404 errors (!!!!!) >>>>> >>>>> Many of these are links to VERY old docs which we no longer have -- >>>>> like >>>>> source trees for 1.0.1, 1.0.2 etc.-- or have to do with the OLD >>>>> architecture -- servlet references etc. >>>>> >>>> >>>> If I understand this correctly, Google is looking at links on >>>> webpages, not just our webpages, but also links from 3rd party >>>> websites, and if they point to an openoffice.org page that doesn't >>>> exist, it shows up on this list. This could happen for any reason. >>>> In some cases the original link might have had a typo. >>> >>> >>> >>> yes, this is correct, and you are right about this too...some of the 404s >>> reference pages we probably NEVER had. >>> >>> >>>> >>>>> Some of this issues could be solved with rather extensive use of sym >>>>> links >>>>> (yes, you can actually use these in svn -- kind of) and of course some >>>>> not >>>>> -- many missing old security bulletins. >>>>> >>>> >>>> For the security bulletins, I wonder if this is actually a redirection >>>> error. We have many of them here: >>>> >>>> http://www.openoffice.org/security/bulletin.html >>> >>> >>> >>> ah...yes, they are there...the problem is we would need to construct a >>> LOT >>> of just "redirect" pages to right some of these since they all seem to >>> have >>> the form >>> >>> "/security/cvs-bulletin-number".html >>> >> >> So let's take a specific example. 
>> >> Google is reporting a 404 error for this URL: >> http://www.openoffice.org/security/bulletin-20060629.html >> >> It is linked to from at least 10 external web pages, for example >> the last link in this table: >> >> >> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html >> >> (Whoops, make that at least 12 links, since the Apache and MarkMail >> list archives will now link to this) >> >> There is no file of this name in >> >> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/ >> >> Looking at the svn log I don't see any record of the files ever being >> here. >> >> I searched the complete ooo-site tree and this file >> (bulletin-20060629.html) doesn't exist anywhere. >> >> The Wayback Machine shows the page did exist in 2006: >> >> >> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html >> >> But it was broken already by 2009: >> >> >> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html >> >> So this is a pre-existing problem, and nothing we can do about it. >> >> Ughh. Obviously we cannot do this kind of research for every one of >> the 64 thousand links. >> >> But in other cases we can help. For example this link is giving a 404 >> error: >> >> http://www.openoffice.org/licenses/lgpl_license.html >> >> I think we removed that intentionally, since that is no longer our >> license. However, that link was used by many other websites, >> including university course materials looking at open source licenses, >> etc.: http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html >> >> So in cases like this we might want to restore the page. Do our part >> to help prevent bit rot and entropy from destroying the web. > > > Well this particular one I really AM not in favor of restoring to our site. > What we could do on this one, is put in a page with just a redirect to where > the actual license lives. 
(and yes, this is really one of the most critical > ones in my opinion) >
That would be fine: a page at that URL that says our license has changed and that the LGPL can be found at the Free Software Foundation website, with a link to it. Everyone's happy then. > >> >> But to put it in perspective, although we have 64 thousand 404 errors >> on our website, we also have nearly 16 million incoming links that do >> not give errors. > > > Well that's a relief eh? :) > > OK, I will have another look at this. At any rate, we absolutely should put > in place a generic "error.html" and have infra reconfigure > www.openoffice.org with THAT as our 404. That way we can assist folks in > dealing with link problems. > The nice thing about a custom error page is that we can also put a Google custom search box there, to let the user do a site-wide search to try to find their answer that way. -Rob > > >> >> -Rob >> >>> >>>> >>>> But we're redirecting security.openoffice.org to >>>> http://incubator.apache.org/openofficeorg/security.html >>>> >>>> So if there are outstanding URL's that are of the form >>>> security.openoffice.org/foo.html then they might be broken now. >>> >>> >>> >>> see above...it's the actual placement of the bulletins within the tree >>> that's the problem I think >>> >>> >>> >>>> >>>>> So, to those of you using this tool, I may mark many of these as >>>>> "fixed". >>>>> Of course they are not -- and they may show up again. Some of them only >>>>> show up in BZ issues!! (Google is amazingly thorough). >>>>> >>>>> I don't know how long it will take for them to "show up" again. The >>>>> problem >>>>> is some of these are very very very old references, and not likely we >>>>> can >>>>> do anything about at this point in time. >>>>> If you're not using this tool, you probably don't care about this. If >>>>> you >>>>> are using it, and have another opinion before I start chunking away at >>>>> hiding these, please weigh in. 
>>>>> >>>> >>>> The way I understand it the links at the top of the list are the ones >>>> Google considers the most important. I think this is based on the >>>> number of links to that page. Maybe they factor in other things as >>>> well. So I'd recommend looking more at the top 100 or so broken >>>> links, make this a manageable task. >>> >>> >>> >>> Well the problem is "how" to make it manageable... :( >>> >>> >>>> >>>> Or -- and here is a challenge for the algorithm experts -- maybe there >>>> is an easy way to take that entire list of 62,962 links and determine >>>> what the top base paths are that are broken. >>> >>> >>> >>> if only this were so :( They're all over the place. >>> >>> >>> In other words, if the >>>> >>>> >>>> links are: >>>> >>>> foo.openoffice.org/bar/baz1 >>>> foo.openoffice.org/bar/baz2 >>>> foo.openoffice.org/bar/baz2 >>>> foo.openoffice.org/bar2/baz1 >>>> foo2.openoffice.org/bar1/baz1 >>>> >>>> Then this would tell us that foo.openoffice.org/bar/* was a top source >>>> of broken links. This might indicate important patterns of where the >>>> most broken links are. >>>> >>>> It seems like this could be done via a prefix tree (a "trie"): >>>> http://en.wikipedia.org/wiki/Trie >>>> >>>> Maybe other (simpler) ways as well. >>> >>> >>> >>> I'll look at this article. It's a daunting task any way you look at it. >>> >>>> >>>> Regards, >>> >>> >>> >>> What happens when things get moved a LOT with no regard for the end user. >>> Don't get me started on the ways I've had to deal with this in the past. >>> >>> >>>> >>>> -Rob >>>> >>>>> >>>>> >>>>> -- >>>>> >>>>> >>>>> ---------------------------------------------------------------------------------------- >>>>> MzK >>>>> >>>>> "I'm just a normal jerk who happens to make music. >>>>> As long as my brain and fingers work, I'm cool." 
>>>>> -- Eddie Van Halen
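
P.S. On the "top base paths" idea quoted earlier in the thread (foo.openoffice.org/bar/* and so on): here is a rough sketch of one of the "simpler ways", not a real trie. It just counts how many broken URLs fall under each host/directory prefix with a flat counter. The function name, the max_depth cutoff, and the assumption that the 404 list has been exported as one URL per entry are all mine, not anything the Webmaster tool provides.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls, max_depth=3):
    """Count broken URLs under each host/directory prefix (hypothetical helper).

    max_depth limits how many leading segments (host included) are tallied.
    """
    counts = Counter()
    for url in urls:
        # urlsplit only recognises the host when a scheme is present,
        # so add one for bare entries like "foo.openoffice.org/bar/baz1".
        parts = urlsplit(url if "://" in url else "http://" + url)
        segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
        # Drop the last segment (the page itself) so only the host and
        # directories are counted; keep the host if that is all there is.
        dirs = segments[:-1] if len(segments) > 1 else segments
        for depth in range(1, min(len(dirs), max_depth) + 1):
            counts["/".join(dirs[:depth])] += 1
    return counts


# The example list from the quoted message (the duplicate is kept as given).
SAMPLE = [
    "foo.openoffice.org/bar/baz1",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar2/baz1",
    "foo2.openoffice.org/bar1/baz1",
]

if __name__ == "__main__":
    for prefix, n in prefix_counts(SAMPLE).most_common():
        print(n, prefix)
```

Sorting the result by count should put foo.openoffice.org/bar near the top for the sample list, which is the pattern the example was meant to surface. For the full 62,962 entries the same approach ought to work, though a real trie might be preferable if memory becomes a concern.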