On Mar 15, 2006, at 12:49 PM, Brian Osborne wrote:
Eric et al.,
Working on writing up some use cases. ChemBank is a nice compound
database for demonstration purposes, since it associates some fraction
of its compounds with MeSH Diseases terms
(http://chembank.broad.harvard.edu/chemistry/search/input/ontology.htm);
it refers to this ontology as Therapeutic Indication. They also use GO
Biological Process.
A year or so ago you could access its pages by GET; now it looks like
it's doing a POST. Is this a problem for our programmers? There is no
description of any API, as far as I can see.
POST-only access and no API certainly make it more difficult to reuse
any of this data :(
Regarding when to use GET vs POST, I've found the following resource
useful...
[[
An important principle of Web architecture is that all important
resources be identifiable by URI. The finding discusses the
relationship between the URI addressability of a resource and the
choice between HTTP GET and POST methods with HTTP URIs. HTTP GET
promotes URI addressability so, designers should adopt it for safe
operations such as simple queries. POST is appropriate for other
types of applications where a user request has the potential to
change the state of the resource (or of related resources). The
finding explains how to choose between HTTP GET and POST for an
application taking into account architectural, security, and
practical considerations.
]]
-- http://www.w3.org/2001/tag/doc/whenToUseGet.html
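To make the distinction concrete, here is a minimal Python sketch
(standard library only) that builds, without sending, a GET and a POST
request for the same query. The endpoint and the `q` parameter are
hypothetical, chosen only for illustration; nothing here reflects
ChemBank's actual interface.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical search endpoint and parameter name, for illustration only.
BASE = "http://example.org/search"
params = {"q": "sulfide"}

# GET: the query is part of the URI, so the result can be linked,
# bookmarked, and fetched again by anyone who has the URI.
get_req = Request(BASE + "?" + urlencode(params), method="GET")

# POST: the query travels in the request body; the URI alone does not
# identify the result.
post_req = Request(BASE, data=urlencode(params).encode("ascii"), method="POST")

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

With GET the whole query is visible in the URI; with POST the URI stays
the same whatever is searched for, which is what makes POST-only results
hard to link to or reuse.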
A bit of browsing suggests there are at least some GETable resources,
so there might be some data one could glean, e.g.
http://chembank.broad.harvard.edu/chemistry/search/input/moleculeName.htm
Search on '*sulfide*' and then hit 'search' to add Substructure. This
yields, for example, the following search result:
disulfiram / ChemBankID: 2038
- http://chembank.broad.harvard.edu/chemistry/viewMolecule.htm?cbid=2038
which points to "find similar molecules"
- http://chembank.broad.harvard.edu/chemistry/findSimilarMolecules.htm?cbid=2038
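Since both pages are addressed by a `cbid` query parameter, the URLs can
be built and parsed mechanically. A small Python sketch follows; the base
URL and page names are taken from the links above, while the helper names
are hypothetical, and it is an assumption that other ChemBank IDs follow
the same pattern.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base taken from the URLs quoted above; helper names are hypothetical.
CHEMBANK = "http://chembank.broad.harvard.edu/chemistry/"

def molecule_url(cbid):
    """URL of the molecule page for a ChemBank ID."""
    return CHEMBANK + "viewMolecule.htm?" + urlencode({"cbid": cbid})

def similar_url(cbid):
    """URL of the 'find similar molecules' page for a ChemBank ID."""
    return CHEMBANK + "findSimilarMolecules.htm?" + urlencode({"cbid": cbid})

def cbid_from_url(url):
    """Recover the ChemBank ID from either kind of URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("cbid")
    return int(values[0]) if values else None

print(molecule_url(2038))
print(similar_url(2038))
```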
The system seems session based, but at least parts of the data seem
scrapeable.
As you seem to be exploring the Piggy Bank scraper idea further (per
the SIMILE general list), the Open World cat scraper [1] is an example
of a session-based, multi-page scraper that could be adapted to at
least parts of the data on this site.
[1] http://potlach.org/2005/10/scrapers/
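For comparison, the general shape of a session-based, multi-page scraper
can be sketched in a few lines of Python (standard library only). This is
not the scraper from [1], and it rests on assumptions I have not
verified: that the site keeps its session in cookies, and that result
pages link to molecules with `viewMolecule.htm?cbid=...` URLs like the
one above. All function names are hypothetical.

```python
import re
from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

# Matches molecule links of the form seen above, e.g. viewMolecule.htm?cbid=2038
CBID_LINK = re.compile(r"viewMolecule\.htm\?cbid=(\d+)")

def make_session():
    """Opener that stores and resends cookies, so later requests share a session."""
    return build_opener(HTTPCookieProcessor(CookieJar()))

def extract_cbids(html):
    """Return the distinct ChemBank IDs linked from a page, in order of appearance."""
    seen = []
    for match in CBID_LINK.finditer(html):
        cbid = int(match.group(1))
        if cbid not in seen:
            seen.append(cbid)
    return seen

def scrape(session, start_url, page_url_for):
    """Fetch start_url, then one follow-up page per ID found on it."""
    with session.open(start_url) as resp:
        first = resp.read().decode("utf-8", errors="replace")
    pages = {}
    for cbid in extract_cbids(first):
        with session.open(page_url_for(cbid)) as resp:
            pages[cbid] = resp.read().decode("utf-8", errors="replace")
    return pages
```

Here `page_url_for` is any function mapping an ID to a URL, and the
`session` argument only needs an `open(url)` method, so the logic can be
exercised against canned pages before pointing it at the live site.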
--
eric miller http://www.w3.org/people/em/
semantic web activity lead http://www.w3.org/2001/sw/
w3c world wide web consortium http://www.w3.org/