removing BOM prepended by codecs?

2013-09-25 Thread J. Bagg

So it is just a random sequence of junk.

It will be a matter of finding the real start of the record (in this 
case a %) and throwing the junk away. I was misled by the note in the 
codecs class that BOMs were being prepended. Should have looked more 
carefully.


Mea culpa.
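
The clean-up described above (find the real start of the record and throw the junk away) could be sketched roughly as follows. This is only an illustration: the helper name strip_leading_junk and the sample bytes are assumptions made for the example, not code from the search program.

```python
def strip_leading_junk(data, marker=b"%"):
    """Return data starting at the first marker byte.

    If the marker is absent, or already at position 0, the data is
    returned unchanged so that nothing is lost silently.
    """
    start = data.find(marker)
    if start <= 0:
        return data
    return data[start:]


# Hypothetical sample: three junk bytes followed by a record line.
sample = b"\x86\xad\xfb%R 10C0203z-621\n"
cleaned = strip_leading_junk(sample)
print(repr(cleaned))
```

Working on bytes rather than decoded text avoids a decode error on the junk itself, since those bytes need not be valid UTF-8.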

--
https://mail.python.org/mailman/listinfo/python-list


removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg
I'm having trouble with the BOM that is now prepended to codecs files. 
The files have to be read by java servlets which expect a clean file 
without any BOM.


Is there a way to stop the BOM being written?

It is seriously messing up my work as the servlets do not expect it to 
be there. I could delete it but that means another delay in retrieving 
the data. My work is a bibliographic system and I'm writing a new search 
engine in Python to replace an ancient one in C.


I'm working on Linux with a locale of en_GB.UTF8

--
Dr Janet Bagg
CSAC, Dept of Anthropology,
University of Kent, UK


removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg

I'm using:

outputfile = codecs.open (fn, 'w+', 'utf-8', errors='strict')

to write as I know that the files are unicode compliant. I run the raw 
files that are delivered through a Python script to check the unicode 
and report problem characters which are then edited. The files use a 
whole variety of languages from Sanskrit to Cyrillic and more obscure 
ones too.
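
For comparison, a small check of what codecs.open() actually puts at the start of a file. It is a sketch under assumptions: the helper name leading_bytes and the sample record text are made up for the example. As far as I know the plain 'utf-8' codec writes no BOM and only 'utf-8-sig' prepends one.

```python
import codecs
import os
import tempfile


def leading_bytes(encoding, text=u"%R 10C0203z-621\n", count=3):
    """Write text via codecs.open() and return the first bytes on disk."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with codecs.open(path, "w", encoding, errors="strict") as out:
            out.write(text)
        # Re-open in binary mode so that no decoding hides a BOM.
        with open(path, "rb") as raw:
            return raw.read(count)
    finally:
        os.remove(path)


print(repr(leading_bytes("utf-8")))      # first bytes of the record itself
print(repr(leading_bytes("utf-8-sig")))  # the UTF-8 BOM, codecs.BOM_UTF8
```

If the first call already shows the record text, the extra bytes are coming from somewhere other than the codec.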


I'll probably have to remove it in the servlet as we have standardised 
on utf-8. This was done some years ago when utf-16 was rare (apart from 
Macs).


J





removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg

I've checked the original files using od and they don't have BOMs.

I'll remove them in the servlet. The overhead is probably small enough 
unless somebody is doing a massive search. We have a limit anyway to 
prevent somebody stealing the entire set of data.


I started writing the Python search because the ancient C search had 
started putting out BOMs. I'm actually mystified because our home Linux 
box does not add BOMs even though it runs 2.7 but my work one does even 
though it has the same version. The only difference is Fedora 18 v 
Fedora 17.


The BOMs are certainly there:

86 ADFB%R 10C0203z-621
%A François-Xavier Le_Bourdonnec

000 206 255 373   %   R   1   0   C   0   2   0   3   z   -
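
A quick way to test whether leading bytes like these are a UTF-8 BOM, assuming the three values before the % in the dump are octal byte values as printed by od. The function name starts_with_utf8_bom is made up for the example.

```python
import codecs

# The three leading bytes from the dump, written as octal literals.
leading = bytes([0o206, 0o255, 0o373])


def starts_with_utf8_bom(data):
    """True if data begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(leading.hex())                  # hex form of the dumped bytes
print(starts_with_utf8_bom(leading))  # compare against codecs.BOM_UTF8
```

A UTF-8 BOM would show up in od as 357 273 277, so a mismatch here points to stray bytes rather than a BOM.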

J



removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg
My editor is JEdit. I use it on a Win 7 machine but have everything set 
up for *nix files as that is the machine I'm normally working on.


The files are mailed to me as updates. The library where the indexers 
work does use MS computers but this is restricted to EndNote with an 
exporter into the old Bib-Refer format which we use. I then run them 
through a Python program that checks the unicode for new characters, 
creates an ascii transliteration of the main fields and checks for 
errors.
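
A very rough sketch of what such a check and transliteration step might look like. This is not the actual program: the function names, the idea of a "known" character set, and the use of NFKD decomposition are all assumptions made for illustration. Decomposition only handles accented Latin letters; scripts such as Cyrillic would need a real mapping table.

```python
import unicodedata


def find_new_characters(text, known):
    """Return the sorted characters in text that are not in the known set."""
    return sorted(set(text) - set(known))


def ascii_fold(text):
    """Crude ascii transliteration: decompose, then drop non-ascii parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical inputs for the example.
known = set(u"%ARabcdefghijklmnopqrstuvwxyzFXLB-_ ")
line = u"%A François-Xavier Le_Bourdonnec"
print(find_new_characters(line, known))
print(ascii_fold(line))
```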


The problem is occurring at the search stage. This stage creates a script 
with directives to search particular years and then puts the results 
into a file in /tmp. The process is left over from an old CGI version 
but is efficient and so has been kept. This has been done with a very 
old C program that a colleague wrote back in the 90s with more recent 
updates. I'm in the process of updating this to Python as it is getting 
too difficult to maintain.


J