On Jan 10, 8:56 am, webcomm <rya...@gmail.com> wrote:
> On Jan 9, 4:12 pm, "Chris Mellon" <arka...@gmail.com> wrote:
>
> > It would really help if you could post a sample file somewhere.
>
> Here's a sample with some dummy data from the web
> service:http://webcomm.webfactional.com/htdocs/data.zip
>
> That's the zip created in this line of my code...
> f = open('data.zip', 'wb')

Your original problem is identical to that already reported by Chris
Mellon (gratuitous \0 bytes appended to the real archive contents).
Here's the output of the diagnostic gadget that I posted a few minutes
ago:

..........................................................
C:\downloads>python zip_susser_v2.py data.zip
archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
signature : 'PK\x05\x06'
this_disk_num : 0
central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
central_dir_overall_num_entries : 1
central_dir_size : 50
central_dir_offset : 844
comment_size : 0
expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes): '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'
...................................

> If I open the file it contains as unicode in my text editor (EditPlus)
> on Windows XP, there is ostensibly nothing wrong with it. It looks
> like valid XML.

Yup, it looks like it's encoded in utf_16_le, i.e.
no BOM as God^H^H^HGates intended:

>>> buff = open('data', 'rb').read()
>>> buff[:100]
'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x0
0a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x0
0<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x0
0a\x00t\x00'
>>> buff[:100].decode('utf_16_le')
u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
>>>

> But if I return it to my browser with python+django,
> there are bad characters every other character

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

> If I unzip it like this...
> popen("unzip data.zip")
> ...then the bad characters are 'FFFD' characters as described and
> pictured
> here...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

Yup, you've somehow pushed your utf_16_le-encoded data through some
decoder that doesn't like '\x00' and is replacing it with U+FFFD whose
name is (funnily enough) REPLACEMENT CHARACTER and whose meaning is
"big fat Unicode version of the question mark".

> If I unzip it like this...
> getzip('data.zip', ignoreable=30000)
> ...using the function
> at...http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
> ...then the bad characters are \x00 characters.

Hmmm ... shouldn't make a difference how you extracted 'data' from
'data.zip'.

Please consider reading the Unicode HOWTO at
http://docs.python.org/howto/unicode.html

Cheers,
John
--
http://mail.python.org/mailman/listinfo/python-list
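
For anyone wanting to try the two fixes discussed above (dropping the
gratuitous \0 bytes after the End of Central Directory record, then
decoding the member as utf_16_le), here is a minimal sketch. It is not
the zip_susser gadget and not the getzip function from the thread. The
names strip_trailing_junk and read_member_as_text are made up for
illustration, it is written for Python 3, and it assumes the trailing
junk does not itself contain the 'PK\x05\x06' signature.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # End of Central Directory signature
EOCD_SIZE = 22             # size of the fixed part of that record


def strip_trailing_junk(raw):
    """Return raw cut off right after the End of Central Directory record.

    Hypothetical helper: assumes the last occurrence of the signature
    is the real record (true when the junk is all NUL bytes).
    """
    pos = raw.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("no End of Central Directory record found")
    # comment_size is the final 2-byte little-endian field of the record
    (comment_size,) = struct.unpack_from("<H", raw, pos + 20)
    return raw[: pos + EOCD_SIZE + comment_size]


def read_member_as_text(raw, name, encoding="utf_16_le"):
    """Extract one member from the cleaned archive and decode it to text."""
    with zipfile.ZipFile(io.BytesIO(strip_trailing_junk(raw))) as zf:
        return zf.read(name).decode(encoding)
```

Usage would look something like
text = read_member_as_text(open('data.zip', 'rb').read(), 'data'),
after which text is a real Unicode string that can be re-encoded (as
UTF-8, say) before it goes anywhere near a browser.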