removing BOM prepended by codecs?
So it is just a random sequence of junk. It will be a matter of finding the real start of the record (in this case a %) and throwing the junk away. I was misled by the note in the codecs class that BOMs were being prepended. Should have looked more carefully. Mea culpa. -- https://mail.python.org/mailman/listinfo/python-list
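The clean-up described above (find the real start of the record and throw away what precedes it) could look something like the following. This is only a sketch, written for Python 3: the function name `strip_leading_junk` is made up for illustration, the sample bytes are taken from the od output quoted elsewhere in this thread, and it assumes the junk itself never contains a `%` byte.

```python
def strip_leading_junk(data: bytes, marker: bytes = b"%") -> bytes:
    """Return data starting at the first occurrence of marker.

    Hypothetical helper: if the marker is not found, the data is
    returned unchanged rather than discarded.
    """
    start = data.find(marker)
    return data if start == -1 else data[start:]


# Three junk bytes (octal 206 255 373) followed by the start of a record.
sample = b"\x86\xad\xfb%R 10C0203z-621\n%A Fran\xc3\xa7ois-Xavier Le_Bourdonnec\n"

cleaned = strip_leading_junk(sample)
print(cleaned.decode("utf-8"))
```

Working on bytes rather than decoded text matters here, since the junk bytes are not valid UTF-8 and would raise an error under `errors='strict'` if decoded first.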
removing BOM prepended by codecs?
I'm having trouble with the BOM that is now prepended to files written with codecs. The files have to be read by Java servlets, which expect a clean file without any BOM. Is there a way to stop the BOM being written? It is seriously messing up my work, as the servlets do not expect it to be there. I could delete it, but that means another delay in retrieving the data. My work is a bibliographic system and I'm writing a new search engine in Python to replace an ancient one in C. I'm working on Linux with a locale of en_GB.UTF8 -- Dr Janet Bagg CSAC, Dept of Anthropology, University of Kent, UK
removing BOM prepended by codecs?
I'm using: outputfile = codecs.open(fn, 'w+', 'utf-8', errors='strict') to write, as I know that the files are Unicode compliant. I run the raw files that are delivered through a Python script that checks the Unicode and reports problem characters, which are then edited. The files use a whole variety of languages, from Sanskrit to Cyrillic and more obscure ones too. I'll probably have to remove it in the servlet, as we have standardised on utf-8. This was done some years ago when utf-16 was rare (apart from Macs). J
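For what it is worth, a quick way to see what a given codec name puts at the start of a file is to write a short string and read back the first bytes. The sketch below is for Python 3; the helper name `leading_bytes` and the test string are invented for illustration. As far as the standard codecs go, 'utf-8' writes no BOM, while 'utf-8-sig' writes the three-byte UTF-8 signature.

```python
import codecs
import os
import tempfile


def leading_bytes(encoding: str, text: str = "%R test\n") -> bytes:
    """Write text with codecs.open and return the first three bytes on disk.

    Hypothetical helper for comparing codec names; it uses a temporary
    file that is removed afterwards.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with codecs.open(path, "w+", encoding, errors="strict") as f:
            f.write(text)
        with open(path, "rb") as f:
            return f.read(3)
    finally:
        os.remove(path)


print(leading_bytes("utf-8"))      # starts directly with the record text
print(leading_bytes("utf-8-sig"))  # starts with the UTF-8 BOM, EF BB BF
```

If the first call shows the record text straight away, the codec is not the source of whatever bytes appear at the start of the output file.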
removing BOM prepended by codecs?
I've checked the original files using od and they don't have BOMs. I'll remove them in the servlet. The overhead is probably small enough unless somebody is doing a massive search. We have a limit anyway to prevent somebody stealing the entire set of data. I started writing the Python search because the ancient C search had started putting out BOMs. I'm actually mystified, because our home Linux box does not add BOMs even though it runs Python 2.7, but my work one does even though it has the same version. The only difference is Fedora 18 v Fedora 17. The BOMs are certainly there:

86 ADFB%R 10C0203z-621
%A François-Xavier Le_Bourdonnec

000 206 255 373 % R 1 0 C 0 2 0 3 z -

J
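One way to check whether leading bytes like these are a recognised BOM is to compare them with the constants in the codecs module. The following is a sketch for Python 3; the function name `identify_bom` and the table of labels are made up for illustration, and the observed bytes are the octal values 206 255 373 from the od line above. The UTF-8 BOM is EF BB BF (octal 357 273 277), so the bytes shown do not match it.

```python
import codecs

# UTF-32 is listed before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
KNOWN_BOMS = [
    ("UTF-32 LE", codecs.BOM_UTF32_LE),
    ("UTF-32 BE", codecs.BOM_UTF32_BE),
    ("UTF-8", codecs.BOM_UTF8),
    ("UTF-16 LE", codecs.BOM_UTF16_LE),
    ("UTF-16 BE", codecs.BOM_UTF16_BE),
]


def identify_bom(data: bytes):
    """Return a label for the BOM that data starts with, or None.

    Hypothetical helper, not part of the codecs module.
    """
    for name, bom in KNOWN_BOMS:
        if data.startswith(bom):
            return name
    return None


# Octal 206 255 373 from the od output, i.e. hex 86 AD FB.
observed = bytes([0o206, 0o255, 0o373])

print(identify_bom(observed + b"%R 10C0203z-"))
print(identify_bom(codecs.BOM_UTF8 + b"%R 10C0203z-"))
```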
removing BOM prepended by codecs?
My editor is JEdit. I use it on a Win 7 machine but have everything set up for *nix files, as that is the machine I'm normally working on. The files are mailed to me as updates. The library where the indexers work does use MS computers, but this is restricted to EndNote with an exporter into the old Bib-Refer format which we use. I then run them through a Python program that checks the Unicode for new characters, creates an ASCII transliteration of the main fields and checks for errors. The problem is occurring at the search stage. This stage creates a script with directives to search particular years and then puts the results into a file in /tmp. The process is left over from an old CGI version but is efficient and so has been kept. This has been done with a very old C program that a colleague wrote back in the 90s, with more recent updates. I'm in the process of updating this to Python as it is getting too difficult to maintain. J