On Tuesday, 13 May 2014 at 22:26:51 UTC+2, MRAB wrote:
> On 2014-05-13 20:01, scottca...@gmail.com wrote:
> > On Tuesday, May 13, 2014 9:49:12 AM UTC-4, Steven D'Aprano wrote:
> > >
> > > You may have missed my follow up post, where I said I had not noticed you
> > > were operating on a binary .doc file.
> > >
> > > If you're not willing or able to use a full-blown doc parser, say by
> > > controlling Word or LibreOffice, the other alternative is to do something
> > > quick and dirty that might work most of the time. Open a doc file, or
> > > multiple doc files, in a hex editor and *hopefully* you will be able to
> > > see chunks of human-readable text where you can identify how en-dashes
> > > and similar are stored.
> >
> > I created a .doc file and opened it with UltraEdit in binary (Hex) mode.
> > What I see is that there are two characters, one for ndash and one for
> > mdash, each a single byte long. 0x96 and 0x97.
> >
> > So I tried this: fStr = re.sub(b'\0x96',b'-',fStr)
> >
> > that did nothing in my file. So I tried this: fStr = re.sub(b'0x97',b'-',fStr)
> >
> > which also did nothing.
> >
> > So, for fun I also tried to just put these wildcards in my re.findall so
> > I added |Part \0x96|Part \0x97 to no avail.
> >
> > Obviously 0x96 and 0x97 are NOT being interpreted in a re.findall or
> > re.sub as hex byte values of 96 and 97 hexadecimal using my current syntax.
> >
> > So here's my question...if I want to replace all ndash or mdash values
> > with regular '-' symbols using re.sub, what is the proper syntax to do so?
> >
> > Thanks!
> >
> 0x96 is a hexadecimal literal for an int. Within a string you need \x96
> (it's \x for 2 hex digits, \u for 4 hex digits, \U for 8 hex digits).
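
For reference, a minimal sketch of the substitution MRAB's advice leads to, assuming the file contents were read in binary mode. The sample bytes in data are made up for illustration and do not come from the original poster's file:

```python
import re

# Hypothetical sample standing in for bytes read from the .doc file.
data = b"Part 1 \x96 intro, Part 2 \x97 details"

# b'\x96' is one byte. b'0x96' is the four ASCII bytes '0', 'x', '9', '6',
# and b'\0x96' is a NUL byte followed by the three bytes 'x96'.
assert len(b"\x96") == 1
assert len(b"0x96") == 4
assert b"\0x96" == b"\x00x96"

# Replace either dash byte with an ASCII hyphen using a character class.
cleaned = re.sub(b"[\x96\x97]", b"-", data)
print(cleaned)  # b'Part 1 - intro, Part 2 - details'
```

The two earlier attempts did nothing because neither pattern contains the byte 0x96 or 0x97 at all; they only match the literal text shown in the comments above.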
----------------

>>> b'0x61' == b'0x61'
True
>>> b'0x96' == b'\x96'
False

- Python and the coding of characters is an unbelievable mess.
- Unicode is a joke.
- I can make Python fail with any valid sequence of chars I wish.
- There is a difference between "look, my code works with my chars"
  and "this code works safely with any chars".

jmf
-- 
https://mail.python.org/mailman/listinfo/python-list