On Dec 26, 8:53 pm, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> > Without an additional parser, I was getting the following error
> > message:
> [...]
> > xml.parsers.expat.ExpatError: undefined entity é: line 401, column 11
>
> To understand that problem better, it would have been helpful to see
> what line 401, column 11 of the input file actually says. AFAICT,
> it must have been something like "&é;" which would be really puzzling
> to have in an XML file (usually, people restrict themselves to ASCII
> for entity names).

No, that one was &eacute; (testing with my own name that appeared in a
file).

>
> > for entity in ent:
> >     if entity not in parser.entity:
> >         parser.entity[entity] = ent[entity]
>
> This looks fine to me.
>
> > The output was "wrong". For example, one of the tests I used was to
> > process a copy of the main dict of htmlentitydefs.py inside an html
> > page. A few of the characters came out ok, but I got things like:
> >
> > 'Α': 0x0391, # greek capital letter alpha, U+0391
>
> Why do you think this is wrong?

Sorry, that was just cut-and-pasted from the browser (not the source);
in the source of the processed html page, it is

'&amp;#913;': 0x0391, # greek capital letter alpha, U+0391

i.e. the "&" was transformed into "&amp;" in a number of places (all
places above ascii 127 I believe). Here are a few more lines extracted
from the html file that was processed:

=============
'Â': 0x00c2, # latin capital letter A with circumflex, U+00C2 ISOlat1
'À': 0x00c0, # latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1
'&amp;#913;': 0x0391, # greek capital letter alpha, U+0391
'Å': 0x00c5, # latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1
'Ã': 0x00c3, # latin capital letter A with tilde, U+00C3 ISOlat1
'Ä': 0x00c4, # latin capital letter A with diaeresis, U+00C4 ISOlat1
'&amp;#914;': 0x0392, # greek capital letter beta, U+0392
'Ç': 0x00c7, # latin capital letter C with cedilla, U+00C7 ISOlat1
'&amp;#935;': 0x03a7, # greek capital letter chi, U+03A7
'&amp;#8225;': 0x2021, # double dagger, U+2021 ISOpub
'&amp;#916;': 0x0394, # greek capital letter delta, U+0394 ISOgrk3
============

>
> > When using my modified version, I got the following (which may not be
> > transmitted properly by email...)
> > 'Α': 0x0391, # greek capital letter alpha, U+0391
> >
> It does look like a Greek capital letter alpha here.
>
> Sure, however, your first version ALSO has the Greek capital letter
> alpha there; it is just spelled as &#913; (which *is* a valid spelling
> for that letter in XML).

Agreed that it would be... However, that was not how it was transformed,
see above; sorry if I was not clear about what was happening (I should
not have cut-and-pasted from the browser window).

>
> > I hope the above is of some help.
>
> Thanks; I now think that htmlentitydefs is just as fine as it always
> was - I don't see any problem here.
>

You may well be right in that the problem may lie elsewhere. But as
making the change I mentioned "fixed" the problem at my end, I figured
this was where the problem was located - and thought I should at least
report it here.

Regards,
André

> Regards,
> Martin

--
http://mail.python.org/mailman/listinfo/python-list
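
For anyone following the thread, here is a minimal sketch of the three behaviours being discussed. It does not reproduce the parser setup from the messages above. It assumes Python 3, where htmlentitydefs is available as html.entities, and the sample strings are made up for illustration.

```python
import html
from html.entities import name2codepoint  # Python 3 name for htmlentitydefs
import xml.etree.ElementTree as ET

# 1. A numeric character reference is a valid spelling in any XML document.
alpha = ET.fromstring("<p>&#913;</p>").text
same_as_table = alpha == chr(name2codepoint["Alpha"])  # both are U+0391

# 2. A named HTML entity is not declared in plain XML, so expat rejects it.
undefined_entity = False
try:
    ET.fromstring("<p>Andr&eacute;</p>")  # hypothetical sample input
except ET.ParseError:
    undefined_entity = True

# 3. Escaping text that already holds a reference escapes the "&" again,
#    which gives the "&amp;#913;" form seen in the processed page.
twice = html.escape("&#913;")
once_back = html.unescape(twice)  # literal "&#913;" text, not a Greek alpha

print(same_as_table, undefined_entity, twice, once_back)
```

If the processed page shows "&amp;#913;" in its source, the third case suggests the reference was escaped a second time somewhere after the entity table was consulted, which would fit the idea that the problem may lie outside htmlentitydefs.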