Re: [sqlite] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

Eric Grange Thu, 29 Jun 2017 00:02:40 -0700

> The sender, however, could be lying, and this needs to be considered

This is an orthogonal problem: if the sender is sending you data that is
not what it should be, then he could just as well be sending you
well-encoded and well-formed but invalid data, or malware, or
confidential/personal data you are not legally allowed to store, or, or,
or... the list never ends.


And generally speaking, if your code tries too hard to find a possible
interpretation for invalid or malformed input, then you are far more likely
to just end up with processed garbage, which will make it even harder to
figure out down the road where the garbage in your database originated from
(incorrect input? bug in the heuristics? etc.)
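
To illustrate the "don't try too hard" approach, here is a minimal Python
sketch, not anything SQLite's CSV import actually does. The function name
decode_utf8_strict is hypothetical. It drops a leading utf-8 BOM if present,
treats the rest as utf-8, and raises an error on invalid bytes instead of
guessing another encoding:

```python
import codecs


def decode_utf8_strict(raw: bytes) -> str:
    """Decode bytes as UTF-8, ignoring a leading BOM and rejecting bad input.

    Hypothetical helper for illustration only: no fallback encodings and no
    heuristics, so invalid data fails loudly at the point of import.
    """
    # A UTF-8 BOM (EF BB BF) carries no content; skip it if present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # errors="strict" raises UnicodeDecodeError on any malformed sequence.
    return raw.decode("utf-8", errors="strict")
```

With this, a Latin-1 encoded b"caf\xe9" is rejected with a
UnicodeDecodeError at import time, so the question of where the bad input
came from gets asked right away rather than after it is in the database.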




On Wed, Jun 28, 2017 at 10:40 PM, Tim Streater <[email protected]> wrote:

> On 28 Jun 2017 at 14:20, Rowan Worth <[email protected]> wrote:
>
> > On 27 June 2017 at 18:42, Eric Grange <[email protected]> wrote:
> >
> >> So while in theory all the scenarios you describe are interesting, in
> >> practice seeing an utf-8 BOM provides an extremely high likeliness
> >> that a file will indeed be utf-8. Not always, but a memory chip could
> >> also be hit by a cosmic ray.
> >>
> >> Conversely the absence of an utf-8 BOM means a high probability of
> >> "something undetermined": ANSI or BOMless utf-8, or something more
> >> oddball (in which I lump utf-16 btw)... and the need for heuristics
> >> to kick in.
> >>
> >
> > I think we are largely in agreement here (esp. wrt utf-16 being an
> > oddball interchange format).
> >
> > It doesn't answer my question though, ie. what advantage the BOM tag
> > provides compared to assuming utf-8 from the outset. Yes if you see a
> > utf-8 BOM you have immediate confidence that the data is utf-8 encoded,
> > but what have you lost if you start with [fake] confidence and treat
> > the data as utf-8 until proven otherwise?
>
> 1) Whether the data contained in a file is to be considered UTF-8 or not
> is an item of metadata about the file. As such, it has no business being
> part of the file itself. BOMs should therefore be deprecated.
>
> 2) I may receive data as part of an email, with a header such as:
>
>            Content-type: text/plain; charset="utf-8"
>            Content-Transfer-Encoding:  base64
>
> then I interpret that to mean that the attendant data, after decoding from
> base64, is to be expected to be utf-8. The sender, however, could be
> lying, and this needs to be considered. Just because a header, or file
> metadata, or indeed a BOM, says some data or other is legal utf-8, this
> does not mean that it actually is.
>
>
> --
> Cheers  --  Tim
> _______________________________________________
> sqlite-users mailing list
> [email protected]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
