> In article <[EMAIL PROTECTED]>,
> Christiaan Hofman <[EMAIL PROTECTED]> wrote:
>
> It also needs to return parseable content, since it's fed directly
> through one of the parsers.  This makes them low maintenance and
> reliable, whereas all of the screen scrapers are liable to break
> without warning.  IIRC there are 3 protocols supported at present:
> PubMed; z39.50; ISI is supported using SOAP web services.  These are
> all formally documented, so they shouldn't break without warning.  For
> instance, when I wrote the PubMed searching stuff, I used the API
> specified at
> http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html to
> figure out query options and syntax.  If DBLP supplied web service
> access, that would be ideal, but it looks like they have nothing
> that's robust enough for a search group.
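
For reference, a documented API like the PubMed one mentioned above boils down to building a query URL from a few named parameters. Here is a minimal sketch in Python (not the actual BibDesk code). It assumes the current E-utilities ESearch endpoint and its `db`, `term` and `retmax` parameters; the helper name `pubmed_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed location of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, retmax=20):
    """Build an ESearch URL for a PubMed query (hypothetical helper)."""
    # db selects the database, term is the query, retmax caps the hit count.
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)

# Only builds the URL; fetching and parsing the XML reply is left out.
print(pubmed_search_url("bibliography software"))
```

Nothing comparable seems to be documented for DBLP, which is why I was thinking of parsing the web page.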

I am aware that parsing a web page is a hack and may not be robust.
However, I can tell you that DBLP hasn't changed for YEARS. Also, DBLP is
mainly maintained by one person. I could get in contact with him to
find out the best way to access the database.

Incidentally, there exists a mirrored database (with some  
enhancements) called DBLP++. This link might be of relevance:

http://dblp.l3s.de/dblp++.php

and particularly

http://dblp.l3s.de/d2r/

Unfortunately, this second link doesn't mean much to me; maybe you're
more familiar than I am with the technologies it mentions. However, I
feel it could be useful. The first link offers an SQL dump, but
it's huge, of course.

>
> I think you'd have to download that file to do queries on it; even at
> 76 MB compressed, it's huge (and would grow stale quickly).  You'd have
> to memory map the file and index it, since loading it into
> NSXMLDocument would kill the program.
>
> On a side note, keeping this in a single flat file seems crazy
> (hopefully that's not what they actually use!).  Likewise, this quote
>
> "The encoding used for the XML file is plain ASCII. To represent
> characters outside of the 7-bit range we use symbolic or numeric
> entities. All symbolic entities are defined in the DTD. At the moment
> most parts of DBLP are restricted to ISO-8859-1 (Latin-1) characters,
> i.e. the first 255 Unicode characters. Only inside the <note>-element
> you may find characters outside of this range, for example some
> Chinese names in their original spelling."
>
> ...makes me a bit nervous.  Have they not heard of UTF-8?
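
To make the "memory map the file and index it" idea concrete, here is a rough sketch in Python rather than Cocoa. It assumes each record starts with a tag such as `<article ` (other record types would need the same treatment); the function name `index_records` is hypothetical. It only collects byte offsets, so the file is never loaded as a whole.

```python
import mmap

def index_records(path, tag=b"<article "):
    """Return the byte offsets of every occurrence of tag in the file."""
    offsets = []
    with open(path, "rb") as f:
        # Map the file read-only; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(tag)
            while pos != -1:
                offsets.append(pos)
                pos = mm.find(tag, pos + 1)
    return offsets
```

With such an index, a single record could be read and parsed on its own by seeking to its offset.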

I guess the XML file is only intended as raw data for building your own
database. However, I suspect the XML is not up to date with the actual
database (of course, they're not actually using this file themselves).
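
As for the ASCII-plus-entities encoding quoted above, turning it back into Unicode should not be hard. A small sketch in Python, assuming the symbolic entities in the DBLP DTD use the same names as the standard HTML Latin-1 entities (I have not checked every one); `decode_entities` is a hypothetical helper name.

```python
import html

def decode_entities(text):
    """Turn symbolic and numeric character entities into Unicode characters."""
    # html.unescape knows the HTML entity names and numeric references.
    # It also unescapes &amp; and &lt;, so apply it to field text only,
    # not to raw XML markup.
    return html.unescape(text)

print(decode_entities("J&uuml;rgen M&#252;ller"))
```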

A

_______________________________________________
Bibdesk-develop mailing list
Bibdesk-develop@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bibdesk-develop
