CVS access to the XML is just what I needed.

Even though the XML may no longer be fully trusted, its general format is certain to be more stable than the HTML, so I'll work on something to parse that instead.


Ben Dilts

"Philip Olson" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

On 3 Jun 2008, at 11:31, Ben Dilts wrote:

I maintain a PHP IDE, and scrape php.net's documentation periodically for information on built-in functions, classes, constants, etc. using regular expressions. The problem is, the actual HTML syntax changes periodically.

Is there any way for me to access the source data that is used to produce those manual pages? My results would be better, my development time would go down, and it would save php.net a crawl's worth of bandwidth weekly.

Hello Ben,

This comes up from time to time and although I don't remember specifics on what we discussed... here are a few words:

Current situation:
- We have various generated .xml and .txt files in CVS
- But we no longer generate them, nor do we trust how they were generated
- They are generated from PHP internal sources and not from the manual
- They don't really have a home, except through CVS
- Unfortunately people tend to instead scrape the manual, both http and downloadable html

Likely future situation:
- We'll use PhD to generate a friendly format for this
- They'll be hosted/offered outside of CVS
- We need to discuss this format
- We could also add a list of keywords like constants, predefined variables, etc.

Other considerations:
- PECL: Most PECL extensions are "lightly used" so may be seen as unnecessary information
- Not everything is documented, so generating from the manual misses those
- Whether or not to also scrape php-src (or only the php manual sources)

But the PHP Manual XML sources are all available in CVS so feel free to check them out and parse. Read:

  http://php.net/about.generate

And to download via CVS, run this command:

  cvs -d :pserver:[EMAIL PROTECTED]:/repository co phpdoc
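
Once the sources are checked out, pulling function signatures out of them could look roughly like the sketch below. It is only an illustration and makes some assumptions: that signatures are marked up as DocBook <methodsynopsis> elements, that the files may or may not carry an XML namespace, and that they reference entities (such as &reftitle.description;) which a plain parser cannot resolve without the DTD. The names extract_signatures, local, ENTITY_RE and SAMPLE are hypothetical, not part of any PHP documentation tooling.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the sources use custom entities that are undefined without the
# DTD. This sketch simply drops them (keeping the five predefined XML ones)
# so that the standard-library parser accepts the file.
ENTITY_RE = re.compile(r"&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][\w.-]*;")


def local(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_signatures(xml_text):
    """Return a list of (return_type, name, [(param_type, param_name), ...])."""
    root = ET.fromstring(ENTITY_RE.sub("", xml_text))
    signatures = []
    for elem in root.iter():
        if local(elem.tag) != "methodsynopsis":
            continue
        return_type, name, params = None, None, []
        for child in elem:
            kind = local(child.tag)
            if kind == "type":
                return_type = (child.text or "").strip()
            elif kind == "methodname":
                name = (child.text or "").strip()
            elif kind == "methodparam":
                # Collect the <type> and <parameter> children of one parameter.
                fields = {local(c.tag): (c.text or "").strip() for c in child}
                params.append((fields.get("type"), fields.get("parameter")))
        signatures.append((return_type, name, params))
    return signatures


# Hand-written sample for illustration only, not copied from the repository.
SAMPLE = """<refentry xmlns="http://docbook.org/ns/docbook">
 <refsect1>
  &reftitle.description;
  <methodsynopsis>
   <type>int</type><methodname>strlen</methodname>
   <methodparam><type>string</type><parameter>string</parameter></methodparam>
  </methodsynopsis>
 </refsect1>
</refentry>"""

if __name__ == "__main__":
    print(extract_signatures(SAMPLE))
```

To cover the whole checkout, the same function could be applied to every .xml file found by walking the phpdoc directory (for example with os.walk). Files that still fail to parse are probably best skipped and logged rather than patched up with regular expressions.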

Regards,
Philip

