CVS access to the XML is just what I needed.
Even though the XML may not be fully trusted anymore, it's certain to be
more stable in its general format than the HTML, so I'll work on something
to parse that instead.
Ben Dilts
"Philip Olson" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
On 3 Jun 2008, at 11:31, Ben Dilts wrote:
I maintain a PHP IDE, and scrape php.net's documentation periodically
for information on built-in functions, classes, constants, etc. using
regular expressions. The problem is, the actual HTML syntax changes
periodically.
Is there any way for me to access the source data that is used to
produce those manual pages? My results would be better, my development
time would go down, and it would save php.net a crawl's worth of
bandwidth weekly.
Hello Ben,
This comes up from time to time, and although I don't remember the specifics
of what we discussed, here are a few words:
Current situation:
- We have various generated .xml and .txt files in CVS
- But we no longer generate them, nor do we trust how they were generated
- They are generated from PHP internal sources and not from the manual
- They don't really have a home, except through CVS
- Unfortunately people tend to scrape the manual instead, both over HTTP
and from the downloadable HTML
Likely future situation:
- We'll use PhD to generate a friendly format for this
- They'll be hosted/offered outside of CVS
- We need to discuss this format
- We could also add a list of keywords like constants, predefined
variables, etc.
Other considerations:
- PECL: Most PECL extensions are "lightly used" so may be seen as
unnecessary information
- Not everything is documented, so generating from the manual misses
whatever is undocumented
- Whether or not to also scrape php-src (or only the php manual sources)
But the PHP Manual XML sources are all available in CVS so feel free to
check them out and parse. Read:
http://php.net/about.generate
And to download via CVS, run this command:
cvs -d :pserver:[EMAIL PROTECTED]:/repository co phpdoc
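Once checked out, the function reference pages can be read with any XML
library instead of regular expressions. Here is a minimal sketch in Python,
assuming DocBook-style markup where a methodsynopsis element holds the
return type, function name and parameters. The sample document is made up
for illustration, and the element names are assumptions about the layout
rather than a guarantee; the real files may also rely on entities or
namespaces that need resolving before a plain parser will accept them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on DocBook-style markup. The actual phpdoc
# files may differ (entities, namespaces), so treat this as an assumption.
SAMPLE = """\
<refentry id="function.strlen">
 <refnamediv>
  <refname>strlen</refname>
  <refpurpose>Get string length</refpurpose>
 </refnamediv>
 <refsect1>
  <methodsynopsis>
   <type>int</type><methodname>strlen</methodname>
   <methodparam><type>string</type><parameter>string</parameter></methodparam>
  </methodsynopsis>
 </refsect1>
</refentry>
"""


def parse_function(xml_text):
    """Extract name, return type, purpose and parameters from one entry."""
    root = ET.fromstring(xml_text)
    synopsis = root.find(".//methodsynopsis")
    # Each methodparam is assumed to carry one <type> and one <parameter>.
    params = [
        (p.findtext("type"), p.findtext("parameter"))
        for p in synopsis.findall("methodparam")
    ]
    return {
        "name": synopsis.findtext("methodname"),
        # Direct child <type> of the synopsis is taken as the return type.
        "returns": synopsis.findtext("type"),
        "purpose": root.findtext(".//refpurpose"),
        "params": params,
    }


info = parse_function(SAMPLE)
print(info["name"], info["returns"], info["params"])
```

Walking the element tree like this should survive cosmetic changes that
would break a regular expression, as long as the element names stay the same.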
Regards,
Philip