removing BOM prepended by codecs?

2013-09-25 Thread J. Bagg

So it is just a random sequence of junk.

It will be a matter of finding the real start of the record (in this 
case a %) and throwing the junk away. I was misled by the note in the 
codecs class that BOMs were being prepended. Should have looked more 
carefully.


Mea culpa.

--
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-25 Thread Dave Angel
On 25/9/2013 06:38, J. Bagg wrote:

 So it is just a random sequence of junk.

 It will be a matter of finding the real start of the record (in this 
 case a %) and throwing the junk away.

Please join the list.  Your present habit of starting a new thread for
each of your messages is getting old.


You still need to find the source of the junk if you want anything
approaching reliability.

The open() call you showed in one of the other four threads had an
append mode.  Is it possible that you're creating a file without
deleting pre-existing junk?


-- 
DaveA


-- 
https://mail.python.org/mailman/listinfo/python-list


removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg
I'm having trouble with the BOM that is now prepended to codecs files. 
The files have to be read by java servlets which expect a clean file 
without any BOM.


Is there a way to stop the BOM being written?

It is seriously messing up my work as the servlets do not expect it to 
be there. I could delete it but that means another delay in retrieving 
the data. My work is a bibliographic system and I'm writing a new search 
engine in Python to replace an ancient one in C.


I'm working on Linux with a locale of en_GB.UTF8

--
Dr Janet Bagg
CSAC, Dept of Anthropology,
University of Kent, UK
--
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread Steven D'Aprano
On Tue, 24 Sep 2013 10:42:22 +0100, J. Bagg wrote:

 I'm having trouble with the BOM that is now prepended to codecs files.
 The files have to be read by java servlets which expect a clean file
 without any BOM.
 
 Is there a way to stop the BOM being written?

Of course there is :-) but first we need to know how you are writing it 
in the first place.

If you are dealing with existing files, which already contain a BOM, you 
may need to open the files and re-save them without the BOM.

If you are dealing with temporary files you're creating programmatically, 
it depends how you're creating them. My guess is that you're doing 
something like this:

f = open("some file", "w", encoding="UTF-16")  # or UTF-32
f.write(data)
f.close()

or similar. Both the UTF-16 and UTF-32 codecs write BOMs. To avoid that, 
you should use UTF-16-BE or UTF-16-LE (Big Endian or Little Endian), as 
appropriate to your platform.

If you're getting a UTF-8 BOM, that's seriously weird. The standard UTF-8 
codec doesn't write a BOM. (Strictly speaking, it's not a Byte Order 
Mark, but a Signature.) Unless you're using encoding='UTF-8-sig', I can't 
guess how you're getting a UTF-8 BOM.
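
For illustration, here is a minimal sketch (Python 3, not taken from your 
code) that shows which of the standard codecs prepend a BOM or signature:

# Which standard codecs prepend a BOM/signature? (Python 3 sketch)
for name in ("utf-8", "utf-8-sig", "utf-16", "utf-16-le", "utf-16-be"):
    data = "abc".encode(name)
    print(name.ljust(10), " ".join("%02x" % byte for byte in data))

# Typical output on a little-endian machine:
# utf-8      61 62 63
# utf-8-sig  ef bb bf 61 62 63
# utf-16     ff fe 61 00 62 00 63 00
# utf-16-le  61 00 62 00 63 00
# utf-16-be  00 61 00 62 00 63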

If you're doing something else, well, you'll have to explain what you're 
doing before we can tell you how to stop doing it :-)


 I'm working on Linux with a locale of en_GB.UTF8

The locale only sets the default encoding used by the OS, not that used 
by Python. Python 2 defaults to ASCII; Python 3 defaults to UTF-8.


-- 
Steven
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread Peter Otten
J. Bagg wrote:

 I'm having trouble with the BOM that is now prepended to codecs files.
 The files have to be read by java servlets which expect a clean file
 without any BOM.
 
 Is there a way to stop the BOM being written?

I think if you specify the byte order explicitly with UTF-16-LE or 
UTF-16-BE no BOM is written.


-- 
https://mail.python.org/mailman/listinfo/python-list


removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg

I'm using:

outputfile = codecs.open (fn, 'w+', 'utf-8', errors='strict')

to write as I know that the files are unicode compliant. I run the raw 
files that are delivered through a Python script to check the unicode 
and report problem characters which are then edited. The files use a 
whole variety of languages from Sanskrit to Cyrillic and more obscure 
ones too.


I'll probably have to remove it in the servlet as we have standardised 
on utf-8. This was done some years ago when utf-16 was rare (apart from 
Macs).


J



--
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread Tim Golden
On 24/09/2013 14:01, J. Bagg wrote:
 I'm using:
 
 outputfile = codecs.open (fn, 'w+', 'utf-8', errors='strict')

Well for the life of me I can't make that produce a BOM on 2.7 or 3.4.
In other words:

<code>
import codecs
with codecs.open("temp.txt", "w+", "utf-8", errors="strict") as f:
  f.write("abc")

with open("temp.txt", "rb") as f:
  assert f.read()[:3] == b"abc"

</code>

works without any assertion failures on 2.7 and 3.4, both running on
Win7 and on 2.7 and 3.3 running on Linux.

Have I misunderstood your situation?

TJG
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread Dave Angel
On 24/9/2013 09:01, J. Bagg wrote:

Why would you start a new thread?  just do a Reply-List (or Reply-All
and remove the extra names) to the appropriate message on the existing
thread.

 I'm using:

 outputfile = codecs.open (fn, 'w+', 'utf-8', errors='strict')

That won't be adding a BOM.  It appends to an existing file, which
may already have a BOM in it.  Or conceivably you have a BOM in the
unicode string that you're passing to the write() method.
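
If it is the latter, a minimal sketch along these lines would drop a
stray U+FEFF before it ever reaches the file (record and outputfile are
assumed names, not from your code):

BOM = u"\ufeff"
if record.startswith(BOM):
    # the data itself starts with U+FEFF; drop it before writing
    record = record[len(BOM):]
outputfile.write(record)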


 to write as I know that the files are unicode compliant. I run the raw 
 files that are delivered through a Python script to check the unicode 
 and report problem characters which are then edited. The files use a 
 whole variety of languages from Sanskrit to Cyrillic and more obscure 
 ones too.

It'd be much nicer to remove it when writing the file.
-- 
DaveA


-- 
https://mail.python.org/mailman/listinfo/python-list


removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg

I've checked the original files using od and they don't have BOMs.

I'll remove them in the servlet. The overhead is probably small enough 
unless somebody is doing a massive search. We have a limit anyway to 
prevent somebody stealing the entire set of data.


I started writing the Python search because the ancient C search had 
started putting out BOMs. I'm actually mystified because our home Linux 
box does not add BOMs even though it runs 2.7, but my work one does even 
though it has the same version. The only difference is Fedora 18 vs 
Fedora 17.


The BOMs are certainly there:

86 ADFB%R 10C0203z-621
%A François-Xavier Le_Bourdonnec

000 206 255 373   %   R   1   0   C   0   2   0   3   z   -

J

--
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread Peter Otten
J. Bagg wrote:

 I've checked the original files using od and they don't have BOMs.
 
 I'll remove them in the servlet. The overhead is probably small enough
 unless somebody is doing a massive search. We have a limit anyway to
 prevent somebody stealing the entire set of data.
 
 I started writing the Python search because the ancient C search had
 started putting out BOMs. I'm actually mystified because our home Linux
 box does not add BOMs even though it runs 2.7 but my work one does even
 though it has the same version. The only difference is Fedora 18 v
 Fedora 17.
 
 The BOMs are certainly there:
 
 86 ADFB%R 10C0203z-621
 %A François-Xavier Le_Bourdonnec
 
 000 206 255 373   %   R   1   0   C   0   2   0   3   z   -
 
 J
 

Were these files edited with Notepad? According to

http://docs.python.org/2/library/codecs.html#encodings-and-unicode


To increase the reliability with which a UTF-8 encoding can be detected, 
Microsoft invented a variant of UTF-8 (that Python 2.5 calls utf-8-sig) 
for its Notepad program: Before any of the Unicode characters is written to 
the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: 
0xef, 0xbb, 0xbf) is written.


To strip off such a UTF-8 encoded BOM you can open the source file with 
"utf-8-sig" and write the output to a (different!) file with "utf-8":

import codecs, shutil

with codecs.open(source, "r", encoding="utf-8-sig") as instream:
    with codecs.open(dest, "w", encoding="utf-8") as outstream:
        shutil.copyfileobj(instream, outstream)
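
(If some of the files turn up without a signature, "utf-8-sig" still reads 
them correctly: the decoder only strips the marker when it is actually 
present.)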

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread wxjmfauth
On Tuesday 24 September 2013 11:42:22 UTC+2, J. Bagg wrote:
 I'm having trouble with the BOM that is now prepended to codecs files.
 The files have to be read by java servlets which expect a clean file
 without any BOM.

 Is there a way to stop the BOM being written?

 It is seriously messing up my work as the servlets do not expect it to
 be there. I could delete it but that means another delay in retrieving
 the data. My work is a bibliographic system and I'm writing a new search
 engine in Python to replace an ancient one in C.

 I'm working on Linux with a locale of en_GB.UTF8

 --
 Dr Janet Bagg
 CSAC, Dept of Anthropology,
 University of Kent, UK

-

Some points.

- The encoding of a text file does not matter. What counts
is knowledge of the encoding.

- The *mark* (at one time the Unicode.org terminology in the FAQ) indicating
a unicode encoded raw text file is neither a byte order mark
nor a signature; it is an encoded code point, the encoded
U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note that a
non-breaking space at the start of a text makes no sense.)

- When such a mark exists, it is always possible to work
100% safely. No possible error.

- When such a mark does not exist, in many cases
guessing is the only valid solution.

These are facts.


Now to the question, should I use (put) such a mark,
esp. in utf-8? I would say the following:

It seems to me that one sees more and more marked utf-8 files.
(Windows is probably a reason.)

More importantly, more and more tools and software are
handling this utf-8 mark, or have been corrected to support it,
so it basically does not hurt too much. E.g. Python, golang 1.1
(not the case in 1.0), LibreOffice, TeXWorks (supports it
now; was once not the case), the unicode TeX engines, ...

If I had to work in archiving, I would seriously think
twice.

PS Unicode encodes characters on a per *script* (alphabet)
basis, not per *language*.

jmf

-- 
https://mail.python.org/mailman/listinfo/python-list


removing BOM prepended by codecs?

2013-09-24 Thread J. Bagg
My editor is JEdit. I use it on a Win 7 machine but have everything set 
up for *nix files, as that is the system I normally work on.


The files are mailed to me as updates. The library where the indexers 
work does use MS computers, but this is restricted to EndNote with an 
exporter into the old Bib-Refer format which we use. I then run them 
through a Python program that checks the unicode for new characters, 
creates an ascii transliteration of the main fields and checks for 
errors.


The problem is occurring at the search stage. This stage creates a script 
with directives to search particular years and then puts the results 
into a file in /tmp. The process is left over from an old CGI version 
but is efficient and so has been kept. This has been done with a very 
old C program that a colleague wrote back in the 90s, with more recent 
updates. I'm in the process of updating this to Python as it is getting 
too difficult to maintain.


J
--
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread Chris Angelico
On Wed, Sep 25, 2013 at 4:43 AM,  wxjmfa...@gmail.com wrote:
 - The *mark* (once the Unicode.org terminology in FAQ) indicating
 a unicode encoded raw text file is neither a byte order mark,
 nor a signature, it is an encoded code point, the encoded
 U+FEFF, 'ZERO WIDTH NO-BREAK SPACE', code point. (Note, a
 non breaking space at the start of a text is a non sense.)

 - When such a mark exists, it is always possible to work
 100% safely. No possible error.

I have a file encoded in Latin-1 which begins with LATIN SMALL LETTER
Y WITH DIAERESIS followed by LATIN SMALL LETTER THORN. I also have a
file encoded in EBCDIC (okay, I don't really, but let's pretend) that
begins with the same bytes. But of course, when such a mark exists,
there is no possible error - of that there is no manner of doubt, no
possible, probable shadow of doubt, no possible doubt whatever.

(No possible doubt whatever.)
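
A small sketch of that ambiguity (the four bytes are invented for the
example):

data = b"\xff\xfeab"              # four made-up bytes at the start of a file
print(data.decode("latin-1"))     # 'ÿþab' -- perfectly legal Latin-1
print(data.decode("utf-16"))      # '扡'   -- BOM consumed, 'ab' read as one UTF-16 code unit

Both calls succeed without error, so the leading bytes alone cannot tell
you which reading was intended.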

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: removing BOM prepended by codecs?

2013-09-24 Thread Piet van Oostrum
J. Bagg j.b...@kent.ac.uk writes:

 I've checked the original files using od and they don't have BOMs.

 I'll remove them in the servlet. The overhead is probably small enough
 unless somebody is doing a massive search. We have a limit anyway to
 prevent somebody stealing the entire set of data.

 I started writing the Python search because the ancient C search had
 started putting out BOMs. I'm actually mystified because our home Linux
 box does not add BOMs even though it runs 2.7 but my work one does even
 though it has the same version. The only difference is Fedora 18 v
 Fedora 17.

 The BOMs are certainly there:

 86 ADFB%R 10C0203z-621
 %A François-Xavier Le_Bourdonnec

 000 206 255 373   %   R   1   0   C   0   2   0   3   z   -

That is not a BOM or SIG. It isn't even valid utf-8.
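
For reference, a minimal sketch (with a made-up file name) that compares a
file's first bytes against the BOM constants defined in the codecs module:

import codecs

with open("record.txt", "rb") as f:      # hypothetical file name
    head = f.read(4)

for name in ("BOM_UTF8", "BOM_UTF16_LE", "BOM_UTF16_BE"):
    print(name, head.startswith(getattr(codecs, name)))

A UTF-8 signature would show up in od's default octal output as 357 273
277; the 206 255 373 above is something else entirely.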
-- 
Piet van Oostrum p...@vanoostrum.org
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
-- 
https://mail.python.org/mailman/listinfo/python-list