> In article <[EMAIL PROTECTED]>,
> Christiaan Hofman <[EMAIL PROTECTED]> wrote:
>
> It also needs to return parseable content, since it's fed directly
> through one of the parsers. This makes them low maintenance and
> reliable, whereas all of the screen scrapers are liable to break
> without warning. IIRC there are 3 protocols supported at present:
> PubMed; z39.50; ISI is supported using SOAP web services. These are
> all formally documented, so they shouldn't break without warning. For
> instance, when I wrote the PubMed searching stuff, I used the API
> specified at
> http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html to
> figure out query options and syntax. If DBLP supplied web service
> access, that would be ideal, but it looks like they have nothing
> that's robust enough for a search group.
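To make the "formally documented" point concrete, here is a minimal sketch of what building a PubMed search request against the E-utilities interface might look like. It is only an illustration, not BibDesk's actual code: the helper name `build_pubmed_search_url` is hypothetical, and the endpoint and the `db`, `term` and `retmax` parameters are taken from the public E-utilities documentation as I understand it.

```python
from urllib.parse import urlencode

# Base URL of the ESearch utility (assumption: the current public endpoint,
# not necessarily the one in use when the quoted message was written).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, retmax=20):
    """Return an ESearch URL for a PubMed query (hypothetical helper)."""
    params = {
        "db": "pubmed",    # database to search
        "term": term,      # query string in Entrez syntax
        "retmax": retmax,  # maximum number of IDs to return
    }
    # urlencode percent-escapes brackets and other special characters.
    return ESEARCH_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_pubmed_search_url("asthma[mesh]", retmax=5))
```

The point is that every piece of the request is a named, documented parameter, so nothing here depends on the layout of an HTML page.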
I am aware that parsing a web page is a hack and may not be robust. However, I can tell you that DBLP hasn't changed for YEARS. Also, DBLP is mainly maintained by one person. I could get in contact with him to find out the best way to access the database.

Incidentally, there is a mirrored database (with some enhancements) called DBLP++. This link might be relevant: http://dblp.l3s.de/dblp++.php and particularly http://dblp.l3s.de/d2r/

Unfortunately, the second link doesn't mean much to me; maybe you're better informed than I am about the technologies it mentions. However, I feel it could be useful. The first link offers an SQL dump, but it's huge, of course.

> I think you'd have to download that file to do queries on it; even
> at 76 MB compressed, it's huge (and would grow stale quickly). You'd
> have to memory map the file and index it, since loading it into
> NSXMLDocument would kill the program.
>
> On a side note, keeping this in a single flat file seems crazy
> (hopefully that's not what they actually use!). Likewise, this quote
>
> "The encoding used for the XML file is plain ASCII. To represent
> characters outside of the 7-bit range we use symbolic or numeric
> entities. All symbolic entities are defined in the DTD. At the moment
> most parts of DBLP are restricted to ISO-8859-1 (Latin-1) characters,
> i.e. the first 255 Unicode characters. Only inside the <note>-element
> you may find characters outside of this range, for example some
> Chinese names in their original spelling."
>
> ...makes me a bit nervous. Have they not heard of UTF-8?

I guess the XML file is only intended as raw data for building your own database. However, I suspect the XML is not up to date with the actual database (and of course they're not actually using the flat file themselves).

A
_______________________________________________
Bibdesk-develop mailing list
Bibdesk-develop@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bibdesk-develop