Hi,

Here's an attempt to clear up what the two alternatives are and what 
they do. BOTH gwc and geosearch contain code that can generate 
sitemaps. The reason it's a bit rushed, even though we've been working 
on KML for ages, is that we only caught onto the sitemap stuff about 3 
weeks ago. It took a week before we could confirm it worked, then I 
was busy with other things and finally flat-out sick for a week.

The goal is to have Google index the KML we serve using GeoServer and/or 
GeoWebCache. There are a couple of premises:
1) Googlebot may not crawl or index all the pages / placemarks, hence we 
must hand them to the bot in order of importance. We currently do that 
by using the KML hierarchies we have built for Google Earth.
2) Googlebot cannot follow links, so we need to create exhaustive 
sitemaps that link to every feature we want it to see. That means we 
must either precalculate the hierarchy or generate it iteratively.
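
For reference, what we hand googlebot is a plain sitemaps.org sitemap. 
The hostname, layer and URL path below are made up, but the shape is 
roughly this, with <priority> being one way to express the order of 
importance from premise 1:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
      <!-- illustrative URL only -->
      <loc>http://example.org/geoserver/gwc/service/kml/topp:states.kml</loc>
      <priority>0.8</priority>
    </url>
    <!-- ...one <url> entry per tile/document we want indexed... -->
  </urlset>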

On request I backported the sitemap feature to the geosearch module, 
but I sense the importance of this may have been overestimated, since 
gwc appears to handle it fine in many cases. Because the two systems 
have different sources of information, they must take different 
approaches, so the code and the issues are not the same.

The gwc module does the following:
It traverses the entire KML hierarchy (tiles that have not been 
explored previously are generated on the spot) and adds every tile on 
disk to the sitemap. A rough sketch of the idea follows after the 
lists below.
Advantages:
1) We get every single placemark, for sure
2) Features stay permanently in the same tiles
3) We can tell googlebot when a tile was last regenerated (though 
this is not currently implemented)
4) Serving the tiles is dirt cheap
Disadvantages:
1) Essentially seeds the entire cache when first asked for the sitemap. 
This may require a lot of space
2) Still fairly I/O intensive after the cache has been seeded, at 
least for layers with more than 100,000 features
3) GeoServer has to be reloaded and caches cleared if you change the 
regionating attribute or number of features per tile
4) Does not honor the "indexable" checkmark you have in GeoServer, 
but you can still decide whether you want to submit all sitemaps or 
pick them individually.
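
To make the gwc approach concrete, here is a minimal, self-contained 
sketch of the idea. It is NOT the actual gwc code: the TileStore 
interface, its method names and the single top-level tile are all 
assumptions made up for illustration.

  import java.io.PrintWriter;

  public class GwcSitemapSketch {
      /** Hypothetical stand-in for the gwc tile cache. */
      interface TileStore {
          boolean existsOnDisk(int z, long x, long y);
          void generate(int z, long x, long y);       // seed a missing tile
          boolean hasChildren(int z, long x, long y); // false once regionating stops
          String kmlUrl(int z, long x, long y);       // assumed XML-escaped already
      }

      static void writeSitemap(TileStore store, int maxZoom, PrintWriter out) {
          out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
          out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
          visit(store, 0, 0, 0, maxZoom, out); // start at the top of the hierarchy
          out.println("</urlset>");
      }

      static void visit(TileStore store, int z, long x, long y, int maxZoom,
              PrintWriter out) {
          if (!store.existsOnDisk(z, x, y)) {
              // this is why the first sitemap request effectively seeds the cache
              store.generate(z, x, y);
          }
          out.println("  <url><loc>" + store.kmlUrl(z, x, y) + "</loc></url>");
          if (z < maxZoom && store.hasChildren(z, x, y)) {
              // recurse into the four children of this tile in the quadtree
              for (long cx = 2 * x; cx <= 2 * x + 1; cx++) {
                  for (long cy = 2 * y; cy <= 2 * y + 1; cy++) {
                      visit(store, z + 1, cx, cy, maxZoom, out);
                  }
              }
          }
      }
  }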

The geosearch module does the following:
It looks into the H2 database and creates a sitemap with URLs that go 
to the WMS service, the same way that gwc fetches its tiles. A rough 
sketch follows after the lists below. In addition, to enable the bot
Advantages:
1) The H2 query is fairly expensive because of a "group by x,y,z", but 
I'm guessing still a lot cheaper than scanning the disk
2) Any changes to the configuration are picked up instantly (but 
changes to the data are not, unless you kill the H2 database)
3) It's fairly easy to seed the H2 database by launching a client like 
Google Earth and panning around for a bit
Disadvantages:
1) If googlebot refreshes the sitemap after picking up only half the 
links, the full tree may not have been generated yet
2) We will in some cases, depending on how the data is distributed, link 
to a lot of empty tiles. This may make googlebot mad.
3) We currently have no automated way of seeding the H2 database 
(though we have a ticket for 1.7.2)
4) Creating a tile on the fly and serving it is moderately expensive
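
Again purely to illustrate, here is a rough sketch of what the 
geosearch path boils down to. The table and column names (tile_index, 
layer, x, y, z) and the URL shape are guesses, not the real schema; 
the point is the "group by x,y,z" collapsing the per-feature index 
into one sitemap entry per tile.

  import java.io.PrintWriter;
  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLException;

  public class GeosearchSitemapSketch {
      static void writeSitemap(Connection conn, String layer, PrintWriter out)
              throws SQLException {
          out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
          out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
          // Hypothetical schema: one row per (feature, tile), columns x, y, z.
          String sql = "SELECT x, y, z FROM tile_index WHERE layer = ? "
                  + "GROUP BY x, y, z";
          PreparedStatement ps = conn.prepareStatement(sql);
          ps.setString(1, layer);
          ResultSet rs = ps.executeQuery();
          while (rs.next()) {
              // Illustrative URL only: point it at whatever KML/WMS endpoint
              // serves the tile for these coordinates.
              String loc = "http://example.org/geoserver/wms/kml?layers=" + layer
                      + "&x=" + rs.getLong("x") + "&y=" + rs.getLong("y")
                      + "&z=" + rs.getInt("z");
              // sitemap XML requires entity-escaped URLs
              out.println("  <url><loc>" + loc.replace("&", "&amp;")
                      + "</loc></url>");
          }
          rs.close();
          ps.close();
          out.println("</urlset>");
      }
  }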


The data that I'm having Google index right now exclusively uses the 
gwc solution. Hence we CAN ship without geosearch and still have 
sitemaps.

-- 

Arne Kepp
OpenGeo - http://opengeo.org
Expert service straight from the developers

