investigation using Google Webmaster tools

2012-08-01 Thread Kay Schenk
Hello all --

I am exploring the www.openoffice.site using the Google Webmaster tool that
Rob told us about on Jul 19.

I am ONLY getting started by looking at the 62,962 404 errors (!)

Many of these are links to VERY old docs which we no longer have -- like
source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
architecture -- servlet references etc.

Some of these issues could be solved with rather extensive use of sym links
(yes, you can actually use these in svn -- kind of) and of course some not
-- many missing old security bulletins.

So, to those of you using this tool, I may mark many of these as "fixed".
Of course they are not -- and they may show up again. Some of them only
show up in BZ issues!! (Google is amazingly thorough).

I don't know how long it will take for them to "show up" again. The problem
is that some of these are very, very old references, and it's not likely we
can do anything about them at this point.
If you're not using this tool, you probably don't care about this. If you
are using it, and have another opinion before I start chunking away at
hiding these, please weigh in.



-- 

MzK

"I'm just a normal jerk who happens to make music.
 As long as my brain and fingers work, I'm cool."
  -- Eddie Van Halen


Re: investigation using Google Webmaster tools

2012-08-01 Thread Rob Weir
On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk  wrote:
> Hello all --
>
> I am exploring the www.openoffice.site using the Google Webmaster tool that
> Rob told us about on Jul 19.
>
> I am ONLY getting started by looking at the 62,962 404 errors (!)
>
> Many of these are links to VERY old docs which we no longer have -- like
> source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
> architecture -- servlet references etc.
>

If I understand this correctly, Google is looking at links on
webpages, not just our webpages, but also links from 3rd party
websites, and if they point to an openoffice.org page that doesn't
exist, it shows up on this list.   This could happen for any reason.
In some cases the original link might have had a typo.

> Some of these issues could be solved with rather extensive use of sym links
> (yes, you can actually use these in svn -- kind of) and of course some not
> -- many missing old security bulletins.
>

For the security bulletins, I wonder if this is actually a redirection
error.  We have many of them here:

http://www.openoffice.org/security/bulletin.html

But we're redirecting security.openoffice.org to
http://incubator.apache.org/openofficeorg/security.html

So if there are outstanding URLs that are of the form
security.openoffice.org/foo.html then they might be broken now.

> So, to those of you using this tool, I may mark many of these as "fixed".
> Of course they are not -- and they may show up again. Some of them only
> show up in BZ issues!! (Google is amazingly thorough).
>
> I don't know how long it will take for them to "show up" again. The problem
> is some of these are very very very old references, and not likely we can
> do anything about at this point in time.
> If you're not using this tool, you probably don't care about this. If you
> are using it, and have another opinion before I start chunking away at
> hiding these, please weigh in.
>

The way I understand it, the links at the top of the list are the ones
Google considers the most important.  I think this is based on the
number of links to that page.  Maybe they factor in other things as
well.  So I'd recommend looking more at the top 100 or so broken
links, to make this a manageable task.

Or -- and here is a challenge for the algorithm experts -- maybe there
is an easy way to take that entire list of 62,962 links and determine
what the top base paths are that are broken.  In other words, if the
links are:

foo.openoffice.org/bar/baz1
foo.openoffice.org/bar/baz2
foo.openoffice.org/bar/baz2
foo.openoffice.org/bar2/baz1
foo2.openoffice.org/bar1/baz1

Then this would tell us that foo.openoffice.org/bar/* was a top source
of broken links.  This might indicate important patterns of where the
most broken links are.

It seems like this could be done via a prefix tree (a "trie"):
http://en.wikipedia.org/wiki/Trie

Maybe other (simpler) ways as well.
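
One of those simpler ways might be a plain counter keyed by path prefix,
which gives the same per-prefix totals a trie would hold at its nodes.
The sketch below is only an illustration: the function name prefix_counts,
the max_depth cut-off and the "/*" suffix are assumptions, and the input is
the five example links above rather than the real export. Note that the
repeated baz2 entry is counted twice, since the list is not de-duplicated.

```python
from collections import Counter
from urllib.parse import urlsplit


def prefix_counts(urls, max_depth=2):
    """Count how many broken URLs fall under each host/path prefix."""
    counts = Counter()
    for url in urls:
        # urlsplit only fills in the host when a scheme or "//" is present.
        parts = urlsplit(url if "//" in url else "//" + url)
        segments = [s for s in parts.path.split("/") if s]
        prefix = parts.netloc
        counts[prefix + "/*"] += 1
        # Count each directory prefix, leaving out the final path segment.
        for segment in segments[:-1][:max_depth]:
            prefix += "/" + segment
            counts[prefix + "/*"] += 1
    return counts


# The example links from the message above (hypothetical hosts and paths).
broken = [
    "foo.openoffice.org/bar/baz1",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar/baz2",
    "foo.openoffice.org/bar2/baz1",
    "foo2.openoffice.org/bar1/baz1",
]

for prefix, n in prefix_counts(broken).most_common(2):
    print(n, prefix)
```

With the real list, sorting by count and reading the top few dozen prefixes
would show whether a handful of directories account for most of the 404s.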

Regards,

-Rob



Re: investigation using Google Webmaster tools

2012-08-01 Thread Kay Schenk



On 08/01/2012 04:29 PM, Rob Weir wrote:
> On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk  wrote:
>> Hello all --
>>
>> I am exploring the www.openoffice.site using the Google Webmaster tool that
>> Rob told us about on Jul 19.
>>
>> I am ONLY getting started by looking at the 62,962 404 errors (!)
>>
>> Many of these are links to VERY old docs which we no longer have -- like
>> source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
>> architecture -- servlet references etc.
>>
>
> If I understand this correctly, Google is looking at links on
> webpages, not just our webpages, but also links from 3rd party
> websites, and if they point to an openoffice.org page that doesn't
> exist, it shows up on this list.   This could happen for any reason.
> In some cases the original link might have had a typo.

yes, this is correct, and you are right about this too...some of the
404s reference pages we probably NEVER had.

>> Some of these issues could be solved with rather extensive use of sym links
>> (yes, you can actually use these in svn -- kind of) and of course some not
>> -- many missing old security bulletins.
>>
>
> For the security bulletins, I wonder if this is actually a redirection
> error.  We have many of them here:
>
> http://www.openoffice.org/security/bulletin.html

ah...yes, they are there...the problem is we would need to construct a
LOT of just "redirect" pages to right some of these since they all seem
to have the form

"/security/cvs-bulletin-number".html

> But we're redirecting security.openoffice.org to
> http://incubator.apache.org/openofficeorg/security.html
>
> So if there are outstanding URLs that are of the form
> security.openoffice.org/foo.html then they might be broken now.

see above...it's the actual placement of the bulletins within the tree
that's the problem I think

>> So, to those of you using this tool, I may mark many of these as "fixed".
>> Of course they are not -- and they may show up again. Some of them only
>> show up in BZ issues!! (Google is amazingly thorough).
>>
>> I don't know how long it will take for them to "show up" again. The problem
>> is some of these are very very very old references, and not likely we can
>> do anything about at this point in time.
>> If you're not using this tool, you probably don't care about this. If you
>> are using it, and have another opinion before I start chunking away at
>> hiding these, please weigh in.
>>
>
> The way I understand it the links at the top of the list are the ones
> Google considers the most important.  I think this is based on the
> number of links to that page.  Maybe they factor in other things as
> well.  So I'd recommend looking more at the top 100 or so broken
> links, make this a manageable task.

Well the problem is "how" to make it manageable... :(

> Or -- and here is a challenge for the algorithm experts -- maybe there
> is an easy way to take that entire list of 62,962 links and determine
> what the top base paths are that are broken.

if only this were so :( They're all over the place.

> In other words, if the
> links are:
>
> foo.openoffice.org/bar/baz1
> foo.openoffice.org/bar/baz2
> foo.openoffice.org/bar/baz2
> foo.openoffice.org/bar2/baz1
> foo2.openoffice.org/bar1/baz1
>
> Then this would tell us that foo.openoffice.org/bar/* was a top source
> of broken links.  This might indicate important patterns of where the
> most broken links are.
>
> It seems like this could be done via a prefix tree (a "trie"):
> http://en.wikipedia.org/wiki/Trie
>
> Maybe other (simpler) ways as well.

I'll look at this article. It's a daunting task any way you look at it.

> Regards,

What happens when things get moved a LOT with no regard for the end
user. Don't get me started on the ways I've had to deal with this in the
past.

> -Rob





Re: investigation using Google Webmaster tools

2012-08-01 Thread Dave Fisher
Sorry to top post, but this week I am at my work HQ and am busy.

I think that we should create a 404 page and then ask infra to point to that.



Re: investigation using Google Webmaster tools

2012-08-02 Thread Kay Schenk
On Wed, Aug 1, 2012 at 5:03 PM, Dave Fisher  wrote:

> Sorry to top post, but this week I am at my work HQ and am busy.
>
> I think that we should create a 404 page and then ask infra to point to
> that.
>
> Sent from my iPhone
>

@Dave

Well this would help in being nicer about "missing" items. We could get
them to do a search or something else -- give them a few hints I guess.

Right now, except for the CVE bulletins, where all the "bad" links point to
the wrong location, and the "old" servlet business, I'm not seeing any easy
patterns that would help with the 62,000+ bad links.



Re: investigation using Google Webmaster tools

2012-08-03 Thread Rob Weir
On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk  wrote:
> ah...yes, they are there...the problem is we would need to construct a LOT
> of just "redirect" pages to right some of these since they all seem to have
> the form
>
> "/security/cvs-bulletin-number".html
>

So let's take a specific example.

Google is reporting a 404 error for this URL:
http://www.openoffice.org/security/bulletin-20060629.html

It is linked to from at least 10 external web pages, for example
the last link in this table:

http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html

(Whoops, make that at least 12 links, since the Apache and MarkMail
list archives will now link to this)

There is no file of this name in
https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/

Looking at the svn log I don't see any record of the files ever being here.

I searched the complete ooo-site tree and this file
(bulletin-20060629.html) doesn't exist anywhere.

The Wayback Machine shows the page did exist in 2006:

http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html

But it was broken already by 2009:

http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html

So this is a pre-existing problem, and nothing we can do about it.

Ughh.   Obviously we cannot do this kind of research for every one of
the 64 thousand links.
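
That kind of lookup could perhaps be scripted rather than done by hand.
The sketch below assumes the Internet Archive's public availability
endpoint (archive.org/wayback/available) and the JSON shape it is
generally documented to return, with an "archived_snapshots" object
holding a "closest" entry; both should be checked against the Archive's
own documentation before relying on them. The helper names are
hypothetical. A snapshot being available does not mean the page was good
at that date, as the 2009 capture above shows.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability service.
API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL for one broken link (timestamp like '2006')."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the closest snapshot URL from a decoded reply, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url, timestamp=None):
    """Ask the service about one URL (needs network access)."""
    with urlopen(availability_query(url, timestamp), timeout=30) as reply:
        return closest_snapshot(json.load(reply))
```

Running lookup() over only the most-linked broken URLs, with a pause
between requests, would keep the load on the Archive small.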

But in other cases we can help.  For example this link is giving a 404 error:

http://www.openoffice.org/licenses/lgpl_license.html

I think we removed that intentionally, since that is no longer our
license.  However, that link was used by many other websites,
including university course materials looking at open source licenses,
etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html

So in cases like this we might want to restore the page.  Do our part
to help prevent bit rot and entropy from destroying the web.

But to put it in perspective, although we have 64 thousand 404 errors
on our website, we also have nearly 16 million incoming links that do
not give errors.

-Rob


Re: investigation using Google Webmaster tools

2012-08-03 Thread Roberto Galoppini
On Thu, Aug 2, 2012 at 4:45 PM, Rob Weir  wrote:
> So let's take a specific example.
>
> Google is reporting a 404 error for this URL:
> http://www.openoffice.org/security/bulletin-20060629.html
>
> It is linked to from at least 10 external web pages, for example
> the last link in this table:
>
> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
>
> (Whoops, make that at least 12 links, since the Apache and MarkMail
> list archives will now link to this)
>
> There is no file of this name in
> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
>
> Looking at the svn log I don't see any record of the files ever being here.
>
> I searched the complete ooo-site tree and this file
> (bulletin-20060629.html) doesn't exist anywhere.
>
> The Wayback Machine shows the page did exist in 2006:
>
> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
>
> But it was broken already by 2009:
>
> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
>
> So this is a pre-existing problem, and nothing we can do about it.
>
> Ughh.   Obviously we cannot do this kind of research for every one of
> the 64 thousand links.
>
> But in other cases we can help.  For example this link is giving a 404 error:
>
> http://www.openoffice.org/licenses/lgpl_license.html
>
> I think we removed that intentionally, since that is no longer our
> license.  However, that link was used by many other websites,
> including university course materials looking at open source licenses,
> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>
> So in cases like this we might want to restore the page.  Do our part
> to help prevent bit rot and entropy from destroying the web.
>
> But to put it in perspective, although we have 64 thousand 404 errors
> on our website, we also have nearly 16 million incoming links that do
> not give errors.

Given our rank, I'd assume that those 64k 404 errors don't affect our
site's popularity, because of the 16 M working links. That said, we
might consider restoring pages like that one, adding a note about
the license change.

Roberto


Re: investigation using Google Webmaster tools

2012-08-03 Thread Kay Schenk



On 08/02/2012 07:45 AM, Rob Weir wrote:

> So let's take a specific example.
>
> Google is reporting a 404 error for this URL:
> http://www.openoffice.org/security/bulletin-20060629.html
>
> It is linked to from at least 10 external web pages, for example
> the last link in this table:
>
> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
>
> (Whoops, make that at least 12 links, since the Apache and MarkMail
> list archives will now link to this)
>
> There is no file of this name in
> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
>
> Looking at the svn log I don't see any record of the files ever being here.
>
> I searched the complete ooo-site tree and this file
> (bulletin-20060629.html) doesn't exist anywhere.
>
> The Wayback Machine shows the page did exist in 2006:
>
> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
>
> But it was broken already by 2009:
>
> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
>
> So this is a pre-existing problem, and nothing we can do about it.
>
> Ughh.   Obviously we cannot do this kind of research for every one of
> the 64 thousand links.
>
> But in other cases we can help.  For example this link is giving a 404 error:
>
> http://www.openoffice.org/licenses/lgpl_license.html
>
> I think we removed that intentionally, since that is no longer our
> license.  However, that link was used by many other websites,
> including university course materials looking at open source licenses,
> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>
> So in cases like this we might want to restore the page.  Do our part
> to help prevent bit rot and entropy from destroying the web.

Well this particular one I really AM not in favor of restoring to our
site. What we could do on this one, is put in a page with just a
redirect to where the actual license lives. (and yes, this is really one
of the most critical ones in my opinion)

> But to put it in perspective, although we have 64 thousand 404 errors
> on our website, we also have nearly 16 million incoming links that do
> not give errors.

Well that's a relief eh? :)

OK, I will have another look at this. At any rate, we absolutely should 
put in place a generic "error.html" and have infra reconfigure 
www.openoffice.org with THAT as our 404. That way we can assist folks in 
dealing with link problems.





-Rob





But we're redirecting security.openoffice.org to
http://incubator.apache.org/openofficeorg/security.html

So if there are outstanding URLs of the form
security.openoffice.org/foo.html, then they might be broken now.



see above...it's the actual placement of the bulletins within the tree
that's the problem I think






So, to those of you using this tool, I may mark many of these as "fixed".
Of course they are not -- and they may show up again. Some of them only
show up in BZ issues!! (Google is amazingly thorough).

I don't know how long it will take for them to "show up" again. The
problem
is some of these are very very very old references, and not likely we can
do anything about at this point in time.
If you're not using this tool, you probably don't care about this. If you
are using it, and have another opinion before I start chunking away at
hiding these, please weigh in.



The way I understand it the links at the top of the list are the ones
Google considers the most important.  I think this is based on t

Re: investigation using Google Webmaster tools

2012-08-03 Thread Rob Weir
On Fri, Aug 3, 2012 at 12:13 PM, Kay Schenk  wrote:
>
>
> On 08/02/2012 07:45 AM, Rob Weir wrote:
>>
>> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk  wrote:
>>>
>>>
>>>
>>> On 08/01/2012 04:29 PM, Rob Weir wrote:


 On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk  wrote:
>
>
> Hello all --
>
> I am exploring the www.openoffice.site using the Google Webmaster tool
> that
> Rob told us about on Jul 19.
>
> I am ONLY getting started by looking at the 62,962 404 errors (!)
>
> Many of these are links to VERY old docs which we no longer have --
> like
> source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
> architecture -- servlet references etc.
>

 If I understand this correctly, Google is looking at links on
 webpages, not just our webpages, but also links from 3rd party
 websites, and if they point to an openoffice.org page that doesn't
 exist, it shows up on this list.   This could happen for any reason.
 In some cases the original link might have had a typo.
>>>
>>>
>>>
>>> yes, this is correct, and you are right about this too...some of the 404s
>>> reference pages we probably NEVER had.
>>>
>>>

> Some of these issues could be solved with rather extensive use of sym
> links
> (yes, you can actually use these in svn -- kind of) and of course some
> not
> -- many missing old security bulletins.
>

 For the security bulletins, I wonder if this is actually a redirection
 error.  We have many of them here:

 http://www.openoffice.org/security/bulletin.html
>>>
>>>
>>>
>>> ah...yes, they are there...the problem is we would need to construct a
>>> LOT
>>> of just "redirect" pages to right some of these since they all seem to
>>> have
>>> the form
>>>
>>> "/security/cvs-bulletin-number".html
>>>
>>
>> So let's take a specific example.
>>
>> Google is reporting a 404 error for this URL:
>> http://www.openoffice.org/security/bulletin-20060629.html
>>
>> It is linked to from at least 10 external web pages, for example
>> the last link in this table:
>>
>>
>> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
>>
>> (Whoops, make that at least 12 links, since the Apache and MarkMail
>> list archives will now link to this)
>>
>> There is no file of this name in
>>
>> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
>>
>> Looking at the svn log I don't see any record of the files ever being
>> here.
>>
>> I searched the complete ooo-site tree and this file
>> (bulletin-20060629.html) doesn't exist anywhere.
>>
>> The Wayback Machine shows the page did exist in 2006:
>>
>>
>> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
>>
>> But it was broken already by 2009:
>>
>>
>> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
>>
>> So this is a pre-existing problem, and nothing we can do about it.
>>
>> Ughh.   Obviously we cannot do this kind of research for every one of
>> the 64 thousand links.
>>
>> But in other cases we can help.  For example this link is giving 404
>> error:
>>
>> http://www.openoffice.org/licenses/lgpl_license.html
>>
>> I think we removed that intentionally, since that is no longer our
>> license.  However, that link was used by many other websites,
>> including university course materials looking at open source licenses,
>> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>>
>> So in cases like this we might want to restore the page.  Do our part
>> to help prevent bit rot and entropy from destroying the web.
>
>
> Well this particular one I really AM not in favor of restoring to our site.
> What we could do on this one, is put in a page with just a redirect to where
> the actual license lives. (and yes, this is really one of the most critical
> ones in my opinion)
>

That would be fine: a page at that URL that says our license has
changed and that the LGPL can be found at the Free Software
Foundation website, with a link to it.  Everyone's happy then.
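
A stub page of the kind described above could be generated with a few lines of Python. This is only a sketch: the function name, the wording of the notice, and the target URL on gnu.org are assumptions for illustration, not the page that was actually published.

```python
from html import escape


def build_stub_page(title: str, notice: str, target_url: str, link_text: str) -> str:
    """Return a minimal HTML page that explains a change and links elsewhere."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"<p>{escape(notice)}</p>\n"
        # quote=True also escapes quotation marks so the href attribute stays intact.
        f'<p><a href="{escape(target_url, quote=True)}">{escape(link_text)}</a></p>\n'
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # All four values below are placeholders chosen for this example.
    print(
        build_stub_page(
            title="License change",
            notice="Our license has changed. The LGPL text is kept by the "
                   "Free Software Foundation.",
            target_url="https://www.gnu.org/licenses/lgpl.html",
            link_text="GNU Lesser General Public License",
        )
    )
```

The output would be saved at the old path (licenses/lgpl_license.html) so that existing links land on an explanation instead of a 404.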

>
>>
>> But to put it in perspective, although we have 64 thousand 404 errors
>> on our website, we also have nearly 16 million incoming links that do
>> not give errors.
>
>
> Well that's a relief eh? :)
>
> OK, I will have another look at this. At any rate, we absolutely should put
> in place a generic "error.html" and have infra reconfigure
> www.openoffice.org with THAT as our 404. That way we can assist folks in
> dealing with link problems.
>

The nice thing about a custom error page is that we can also put a Google
custom search box there, to let the user do a site-wide search and try
to find their answer that way.
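
As a rough illustration of the search idea, the error page could offer a ready-made search link derived from the path that failed. The Python sketch below uses an ordinary Google query with the site: operator rather than the custom search box itself. The function names and the way the path is split into words are assumptions made for this example.

```python
import re
from urllib.parse import urlencode


def search_terms_from_path(path: str) -> str:
    """Turn a missing path into plain search words.

    Drops a trailing file extension, then splits on anything that is
    not a letter or digit.
    """
    stem = re.sub(r"\.[A-Za-z0-9]+$", "", path)
    words = [w for w in re.split(r"[^A-Za-z0-9]+", stem) if w]
    return " ".join(words)


def site_search_url(site: str, path: str) -> str:
    """Build a Google search link restricted to one site."""
    query = f"site:{site} {search_terms_from_path(path)}"
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(site_search_url("www.openoffice.org", "/security/bulletin-20060629.html"))
```

For the bulletin example from earlier in the thread, the search words come out as "security bulletin 20060629".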

-Rob

>
>
>>
>> -Rob
>>
>>>

 But we're redirecting security.openoffice.org to
 http://incubator.apache.org/openofficeorg/security.html

 So if there are outstanding URL's that are o

Re: investigation using Google Webmaster tools

2012-08-03 Thread Pedro Giffuni
Hi;

>
> From: Rob Weir 
...
>
>Ughh.   Obviously we cannot do this kind of research for every one of
>the 64 thousand links.
>
>But in other cases we can help.  For example this link is giving 404 error:
>
>http://www.openoffice.org/licenses/lgpl_license.html
>
>I think we removed that intentionally, since that is no longer our


Raises hand: one of the truly good uses I've made of my axe.


>license.  However, that link was used by many other websites,
>including university course materials looking at open source licenses,
>etc.:  http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>

The text in that link says "OpenOffice License": redirecting to the
Apache License would have been more appropriate in this context, but
in general would not have been acceptable, so I think the 404 error is a
reasonable solution. In this case, breaking the link was actually a feature.


>So in cases like this we might want to restore the page.  Do our part
>to help prevent bit rot and entropy from destroying the web.
>

We cannot do maintenance for broken links here. Perhaps we should
remind users to report broken links on the originating websites.

Pedro.


Re: investigation using Google Webmaster tools

2012-08-03 Thread Rob Weir
On Fri, Aug 3, 2012 at 2:54 PM, Pedro Giffuni  wrote:
> Hi;
>
>>
>> From: Rob Weir 
> ...
>>
>>Ughh.   Obviously we cannot do this kind of research for every one of
>>the 64 thousand links.
>>
>>But in other cases we can help.  For example this link is giving 404 error:
>>
>>http://www.openoffice.org/licenses/lgpl_license.html
>>
>>I think we removed that intentionally, since that is no longer our
>
>
> Raises hand: one of the truly good uses I've made of my axe.
>
>
>>license.  However, that link was used by many other websites,
>>including university course materials looking at open source licenses,
>>etc.:  http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
>>
>
> The text in that link says "OpenOffice License":  redirecting to the
> Apache License would've been more appropriate in this context but
> in general would've not been acceptable, so I think the 404 error is a
> reasonable solution. In this case breaking the link was actually a feature.
>
>
>>So in cases like this we might want to restore the page.  Do our part
>>to help prevent bit rot and entropy from destroying the web.
>>
>
> We cannot do maintenance for broken links here. Perhaps we should
> remind users to report broken links on the originating websites.
>

Certainly we can fix this on our end.  In fact I just did:

http://www.openoffice.org/licenses/lgpl_license.html

This isn't rocket science.  It is just a minor courtesy to website
visitors not to break common links for no good reason.

-Rob

> Pedro.


Re: investigation using Google Webmaster tools

2012-08-03 Thread Andrea Pescetti

Rob Weir wrote:

On Fri, Aug 3, 2012 at 2:54 PM, Pedro Giffuni wrote:

We cannot do maintenance for broken links here. Perhaps we should
remind users to report broken links on the originating websites.

Certainly we can fix this on our end.  In fact I just did:
http://www.openoffice.org/licenses/lgpl_license.html


I agree: in almost all cases we should replace pages this way (and 
indeed, if a site links to that page with the text "OpenOffice license" 
this is the best way to inform visitors about the license change).


Regards,
  Andrea.


Re: investigation using Google Webmaster tools

2012-08-03 Thread Kay Schenk
On Fri, Aug 3, 2012 at 9:29 AM, Rob Weir  wrote:

> On Fri, Aug 3, 2012 at 12:13 PM, Kay Schenk  wrote:
> >
> >
> > On 08/02/2012 07:45 AM, Rob Weir wrote:
> >>
> >> On Wed, Aug 1, 2012 at 7:45 PM, Kay Schenk 
> wrote:
> >>>
> >>>
> >>>
> >>> On 08/01/2012 04:29 PM, Rob Weir wrote:
> 
> 
>  On Wed, Aug 1, 2012 at 7:06 PM, Kay Schenk 
> wrote:
> >
> >
> > Hello all --
> >
> > I am exploring the www.openoffice.site using the Google Webmaster
> tool
> > that
> > Rob told us about on Jul 19.
> >
> > I am ONLY getting started by looking at the 62,962 404 errors (!)
> >
> > Many of these are links to VERY old docs which we no longer have --
> > like
> > source trees for 1.0.1, 1.0.2 etc.--  or have to do with the OLD
> > architecture -- servlet references etc.
> >
> 
>  If I understand this correctly, Google is looking at links on
>  webpages, not just our webpages, but also links from 3rd party
>  websites, and if they point to an openoffice.org page that doesn't
>  exist, it shows up on this list.   This could happen for any reason.
>  In some cases the original link might have had a typo.
> >>>
> >>>
> >>>
> >>> yes, this is correct, and you are right about this too...some of the
> 404s
> >>> reference pages we probably NEVER had.
> >>>
> >>>
> 
> > Some of these issues could be solved with rather extensive use of sym
> > links
> > (yes, you can actually use these in svn -- kind of) and of course
> some
> > not
> > -- many missing old security bulletins.
> >
> 
>  For the security bulletins, I wonder if this is actually a redirection
>  error.  We have many of them here:
> 
>  http://www.openoffice.org/security/bulletin.html
> >>>
> >>>
> >>>
> >>> ah...yes, they are there...the problem is we would need to construct a
> >>> LOT
> >>> of just "redirect" pages to right some of these since they all seem to
> >>> have
> >>> the form
> >>>
> >>> "/security/cvs-bulletin-number".html
> >>>
> >>
> >> So let's take a specific example.
> >>
> >> Google is reporting a 404 error for this URL:
> >> http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> It is linked to from at least 10 external web pages, for example
> >> the last link in this table:
> >>
> >>
> >>
> http://www.ccip.govt.nz/vulnerability-alerts/archives/2006/AlertArchive0607.html
> >>
> >> (Whoops, make that at least 12 links, since the Apache and MarkMail
> >> list archives will now link to this)
> >>
> >> There is no file of this name in
> >>
> >>
> https://svn.apache.org/repos/asf/incubator/ooo/ooo-site/trunk/content/security/
> >>
> >> Looking at the svn log I don't see any record of the files ever being
> >> here.
> >>
> >> I searched the complete ooo-site tree and this file
> >> (bulletin-20060629.html) doesn't exist anywhere.
> >>
> >> The Wayback Machine shows the page did exist in 2006:
> >>
> >>
> >>
> http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> But it was broken already by 2009:
> >>
> >>
> >>
> http://web.archive.org/web/20091006090657/http://www.openoffice.org/security/bulletin-20060629.html
> >>
> >> So this is a pre-existing problem, and nothing we can do about it.
> >>
> >> Ughh.   Obviously we cannot do this kind of research for every one of
> >> the 64 thousand links.
> >>
> >> But in other cases we can help.  For example this link is giving 404
> >> error:
> >>
> >> http://www.openoffice.org/licenses/lgpl_license.html
> >>
> >> I think we removed that intentionally, since that is no longer our
> >> license.  However, that link was used by many other websites,
> >> including university course materials looking at open source licenses,
> >> etc.:   http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
> >>
> >> So in cases like this we might want to restore the page.  Do our part
> >> to help prevent bit rot and entropy from destroying the web.
> >
> >
> > Well this particular one I really AM not in favor of restoring to our
> site.
> > What we could do on this one, is put in a page with just a redirect to
> where
> > the actual license lives. (and yes, this is really one of the most
> critical
> > ones in my opinion)
> >
>
> That would be fine, a page at that URL that says our license has
> changed, and that the LGPL can be found at the Free Software
> Foundation website, and link to that.  Everyone's happy then.
>
> >
> >>
> >> But to put it in perspective, although we have 64 thousand 404 errors
> >> on our website, we also have nearly 16 million incoming links that do
> >> not give errors.
> >
> >
> > Well that's a relief eh? :)
> >
> > OK, I will have another look at this. At any rate, we absolutely should
> put
> > in place a generic "error.html" and have infra reconfigure
> > www.openoffice.org with THAT as our 404. That way we can assist folks in
> > dealing with link problems.
> >
>
> The nice thing

Re: investigation using Google Webmaster tools

2012-08-03 Thread Andrea Pescetti

Rob Weir wrote:

Google is reporting a 404 error for this URL:
http://www.openoffice.org/security/bulletin-20060629.html
...
The Wayback Machine shows the page did exist in 2006:
http://web.archive.org/web/20060703040511/http://www.openoffice.org/security/bulletin-20060629.html
But it was broken already by 2009


All the broken links under http://www.openoffice.org/security can 
probably be redirected to
http://www.openoffice.org/security/bulletin.html which is probably a 
condensed version of all bulletins we used to have.


I don't know what the best way to do it is (redirects, dummy pages...) or
what the best way to store this in SVN is (I'm not familiar with the
"symlinks in SVN" approach Kay described).
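
Whichever mechanism is chosen, the first step could be to work out the redirects from the list of broken paths that Google reports. Below is a rough Python sketch, assuming the broken paths are available as plain strings. The names are hypothetical, and how the resulting mapping would become server configuration or dummy pages is left open.

```python
from typing import Dict, Iterable

SECURITY_PREFIX = "/security/"
SECURITY_INDEX = "/security/bulletin.html"


def plan_redirects(broken_paths: Iterable[str]) -> Dict[str, str]:
    """Map each broken path under /security/ to the bulletin index.

    Paths outside /security/ are left out, and the index itself is never
    redirected to avoid a loop.
    """
    plan: Dict[str, str] = {}
    for path in broken_paths:
        if path.startswith(SECURITY_PREFIX) and path != SECURITY_INDEX:
            plan[path] = SECURITY_INDEX
    return plan


if __name__ == "__main__":
    reported = [
        "/security/bulletin-20060629.html",
        "/licenses/lgpl_license.html",
    ]
    for old, new in plan_redirects(reported).items():
        print(f"{old} -> {new}")
```

Only the bulletin path is picked up here; the license link needs its own page, as discussed elsewhere in the thread.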


Regards,
  Andrea.


Re: investigation using Google Webmaster tools

2012-08-03 Thread Kay Schenk
On Fri, Aug 3, 2012 at 11:54 AM, Pedro Giffuni  wrote:

> Hi;
>
> >
> > From: Rob Weir 
> ...
> >
> >Ughh.   Obviously we cannot do this kind of research for every one of
> >the 64 thousand links.
> >
> >But in other cases we can help.  For example this link is giving 404
> error:
> >
> >http://www.openoffice.org/licenses/lgpl_license.html
> >
> >I think we removed that intentionally, since that is no longer our
>
>
> Raises hand: one of the truly good uses I've made of my axe.
>
>
> >license.  However, that link was used by many other websites,
> >including university course materials looking at open source licenses,
> >etc.:  http://www.cs.utsa.edu/~bylander/cs1023/chapter8links.html
> >
>
> The text in that link says "OpenOffice License":  redirecting to the
> Apache License would've been more appropriate in this context but
> in general would've not been acceptable, so I think the 404 error is a
> reasonable solution. In this case breaking the link was actually a feature.
>

Hi Pedro--

Well, not exactly. There are actually "items" on our site that do indeed
use LGPL and so we need to point folks to the real home of this license.

I think we've got areas covered that use ALv2.


>
> >So in cases like this we might want to restore the page.  Do our part
> >to help prevent bit rot and entropy from destroying the web.
> >
>
> We cannot do maintenance for broken links here. Perhaps we should
> remind users to report broken links on the originating websites.
>
> Pedro.
>





Re: investigation using Google Webmaster tools

2012-08-03 Thread Pedro Giffuni
Hi Rob;


- Original Message -
...
>> 
>>  We cannot do maintenance for broken links here. Perhaps we should
>>  remind users to report broken links on the originating websites.
>> 
> 
> Certainly we can fix this on our end.  In fact I just did:
> 
> http://www.openoffice.org/licenses/lgpl_license.html
> 
> This isn't rocket science.  It is just a minor courtesy to website
> visitors not to break common links for no good reason.
> 

I will give you credit. That page does look like a fine replacement.

63999 broken links to go ;).

Pedro.


Re: investigation using Google Webmaster tools

2012-08-03 Thread Kay Schenk
On Fri, Aug 3, 2012 at 2:05 PM, Pedro Giffuni  wrote:

> Hi Rob;
>
>
> - Original Message -
> ...
> >>
> >>  We cannot do maintenance for broken links here. Perhaps we should
> >>  remind users to report broken links on the originating websites.
> >>
> >
> > Certainly we can fix this on our end.  In fact I just did:
> >
> > http://www.openoffice.org/licenses/lgpl_license.html
> >
> > This isn't rocket science.  It is just a minor courtesy to website
> > visitors not to break common links for no good reason.
> >
>
> I will give you credit. That page does look like a fine replacement.
>
> 63999 broken links to go ;).
>
> Pedro.
>

yes, it's lovely...
