Very cool, and thanks. I downloaded the tool (http://code.google.com/p/gbif-providertoolkit/) and got it working on one of my test machines.
Is there a plan to move this to the new Darwin Core?

Thanks!

Pete

On Mon, May 11, 2009 at 2:48 AM, Tim Robertson <[email protected]> wrote:
> Hi Peter,
>
> Just to expand on what Donald has written here:
>
> > My current thinking is that we should offer this as a service which
> > can both be executed during harvesting and also as a stand-alone
> > service for which users can submit a batch of Darwin Core-style
> > records (probably tab-delimited) and get back a report for whichever
> > set of tests or value-add operations they choose. This could help
> > providers with data cleaning even before they share their data (and
> > also could help them to make sure there are no known sensitivity
> > issues around their data). Such a service could be extended more or
> > less indefinitely to report more and more aspects of interest. One
> > of the major options could be to cross-reference records to accepted
> > taxonomic authorities (via LSIDs or other identifiers).
>
> GBIF recently launched an early release of a biodiversity data
> publishing tool (http://code.google.com/p/gbif-providertoolkit/) which
> allows for serving of occurrence and species oriented data, in a "star
> schema" format with Darwin Core as the core of the star. This tool
> has an embedded database, which allows for serving of text files (csv,
> tab delimited etc) and also the ability to sit in front of an existing
> database to offer DwC through a complete archive, TAPIR, and WFS and
> WMS services. As you publish data through this tool, it currently does
> very basic type checking of input data, and creates "annotations" on
> the records that have issues (e.g.
> http://ipt.gbif.org/annotations.html?resource_id=11).
> As the tool matures in the coming months, we plan to open up an API
> so that data providers can call external services and have them push
> back annotations - e.g. check my coordinates, check my names with IPNI
> etc.
> By publishing the complete dataset as an "archive" (a zipped
> dump with an XML file describing the columns,
> http://rs.tdwg.org/dwc/terms/guides/text/index.htm
> as Donald mentions) the technical threshold is reduced to a minimum
> for the data transfer to implement such a quality service, while also
> ensuring decent harvesting performance. It is in the current GBIF
> workplan to register such quality services in the GBIF registry which
> is undergoing development now, so that they may be discovered and used
> by all, including the GBIF publishing toolkit, and portals. By doing
> this, the roles of checking data, or implementing quality services are
> not centralised in a GBIF portal, but can be used by the data owner
> before sharing with GBIF or other networks.
>
> Additionally, by allowing for remote annotations, we can aim to
> ultimately push back all feedback from the GBIF portal (or others)
> into the publishing tools as opposed to through email as is the
> current feedback mechanism - this is related to other topics such as
> uniquely identifying resources as they are shared through various
> networks for example. It would then be trivial to have (for example)
> a Google map with a clickable point which opens the details holding a
> link "this record has bad coordinates", or a form to fill in.
> Feedback could take the form of free text or perhaps even better, as
> "structured annotations" where possible (this record would be correct
> if the isoCountryCode was "DE") which could then be automatically
> removed should the source be updated to meet the annotation criteria.
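
A minimal sketch of the "structured annotation" idea in the paragraph above, written in Python. Nothing here comes from the GBIF toolkit: the `Annotation` class, its field names, and the `open_annotations` helper are all hypothetical, and a record is simply assumed to be a mapping from Darwin Core-style term names to values. It only illustrates how an annotation of the form "this record would be correct if isoCountryCode was DE" could drop away once the source is updated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical structured annotation: 'record is correct if term == expected'."""
    record_id: str
    term: str
    expected: str
    note: str = ""

    def is_satisfied_by(self, record: dict) -> bool:
        # The criterion is met once the source record carries the expected value.
        return record.get(self.term) == self.expected


def open_annotations(annotations, records_by_id):
    """Return only the annotations whose criteria the current records do not meet."""
    remaining = []
    for ann in annotations:
        record = records_by_id.get(ann.record_id)
        # Keep the annotation if the record is missing or still disagrees with it.
        if record is None or not ann.is_satisfied_by(record):
            remaining.append(ann)
    return remaining


annotations = [Annotation("rec-1", "isoCountryCode", "DE", "coordinates fall in Germany")]

# Before the provider fixes the source, the annotation stays attached.
before = {"rec-1": {"isoCountryCode": "FR"}}
still_open = open_annotations(annotations, before)

# After the source is updated, the annotation is dropped automatically.
after = {"rec-1": {"isoCountryCode": "DE"}}
resolved = open_annotations(annotations, after)
```

Free-text feedback could not be cleared this way, which is presumably why the structured form is preferred where it is possible.
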
>
> Best wishes,
>
> Tim
>
>
> >
> > Best wishes,
> >
> > Donald
> >
> >
> > Donald Hobern, Director, Atlas of Living Australia
> > CSIRO Entomology, GPO Box 1700, Canberra, ACT 2601
> > Phone: (02) 62464352 Mobile: 0437990208
> > Email: [email protected]
> > Web: http://www.ala.org.au/
> >
> >
> > -----Original Message-----
> > Date: Fri, 8 May 2009 19:23:32 -0500
> > From: Peter DeVries <[email protected]>
> > Subject: [tdwg] Ideas on having Harvesters like GBIF clean, flag
> > inconsistencies, and add additional value to the data
> > To: [email protected]
> > Message-ID:
> > <[email protected]>
> > Content-Type: text/plain; charset="iso-8859-1"
> >
> > Arthur Chapman sent me some good comments regarding Datums etc.
> > The discussion made me realize that there may be a need for two
> > types of formats. One for the providers and a second one that is
> > output by the harvesting service.
> >
> > This is because the needs and abilities of the data providers are
> > different than the needs and abilities of those who would like to
> > consume the data.
> >
> > Consumers, who analyze and map the data, would like something that
> > is easy to process, standardized and as error free as possible.
> >
> > It could work in the following way.
> >
> > Data harvesters, like GBIF, collect the records and run them through
> > cleaning algorithms that check attributes, including that the lat
> > and long actually match the location described.
> >
> > These harvesters would then expose this cleaned data via XML and RDF
> > with tags that flag possible inconsistencies. The harvesters would
> > also add a field for the lat and long in WGS84 if the original
> > record contains a valid Datum. Those records without a Datum would
> > still be exposed but the added geo:latitude and geo:longitude fields
> > would be empty.
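
A minimal Python sketch of the harvester step proposed above, under stated assumptions. The function name `add_wgs84_fields`, the `flags` key, and the input keys `decimalLatitude`, `decimalLongitude` and `geodeticDatum` are illustrative choices, not something the thread specifies. No real datum shift is implemented: converting from another datum would need a geodesy library, so the sketch only accepts an optional, caller-supplied table of transform functions.

```python
def add_wgs84_fields(record, transforms=None):
    """Return a copy of `record` with geo:latitude, geo:longitude and a flags list.

    `transforms` maps a datum name to a function (lat, lon) -> (lat, lon) in WGS84.
    It is a placeholder for a real geodesy library; none is bundled here.
    """
    transforms = transforms or {}
    out = dict(record)
    # The added fields stay empty unless a usable datum is present.
    out["geo:latitude"] = ""
    out["geo:longitude"] = ""
    flags = []

    datum = (record.get("geodeticDatum") or "").strip().upper()
    try:
        lat = float(record.get("decimalLatitude", ""))
        lon = float(record.get("decimalLongitude", ""))
    except (TypeError, ValueError):
        flags.append("coordinates missing or not numeric")
        out["flags"] = flags
        return out

    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("coordinates out of range")
    elif not datum:
        # Record is still exposed, but without WGS84 coordinates.
        flags.append("no datum supplied")
    elif datum == "WGS84":
        out["geo:latitude"], out["geo:longitude"] = lat, lon
    elif datum in transforms:
        out["geo:latitude"], out["geo:longitude"] = transforms[datum](lat, lon)
    else:
        flags.append(f"unsupported datum: {datum}")

    out["flags"] = flags
    return out


with_datum = add_wgs84_fields(
    {"decimalLatitude": "43.07", "decimalLongitude": "-89.41", "geodeticDatum": "WGS84"}
)
no_datum = add_wgs84_fields({"decimalLatitude": "43.07", "decimalLongitude": "-89.41"})
```

In this sketch the original values are never overwritten, so a consumer can always see what the provider supplied alongside the derived fields and the flags.
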
> >
> > I can imagine that the data uploaded to GBIF and other harvester
> > services will be replete with typos and inconsistencies that will
> > frustrate people trying to analyze or simply map the data. The
> > harvester services could add value by minimizing these frustrations.
> >
> > Originally, it seemed that a global service should standardize on a
> > global Datum like WGS84. After all, we have standardized on meters?
> > However, after discussing this with Arthur, I realize that this is
> > not possible for a number of reasons. That said, I think the data
> > would be much more valuable and less likely to be misinterpreted if
> > a version of it was available in WGS84. This solution would
> > eventually encourage data providers to understand what a Datum is
> > and include it in their data. It would also help solve a number of
> > other data integration problems.
> >
> > Respectfully,
> >
> > Pete
> > _______________________________________________
> > tdwg mailing list
> > [email protected]
> > http://lists.tdwg.org/mailman/listinfo/tdwg
> >
>

--
---------------------------------------------------------------
Pete DeVries
Department of Entomology
University of Wisconsin - Madison
445 Russell Laboratories
1630 Linden Drive
Madison, WI 53706
------------------------------------------------------------
