Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread John McKown
On Tue, Jun 27, 2017 at 4:02 PM, Keith Medcalf  wrote:

>
> > If an implementation "uses" 8 bits for ASCII text (as opposed to
> > hardware storage which is never less than 8 bits for a single C char,
> > AFAIK), then it is not a valid ASCII implementation, i.e. does not
> > interpret ASCII according to its definition. The whole point of
> > specifying a format as 7 bits is that the 8th bit is ignored, or
> > perhaps used in an implementation-defined manner, regardless of whether
> > the 8th bit in a char is available or not.
>
> ASCII was designed back in the days of low reliability serial
> communications -- you know, back when data was sent using 7 bit data + 1
> parity bit + 2 stop bits -- to increase the reliability of the
> communications.  A "byte" was also 9 bits.  8 bits of data and a parity bit.
>
> Nowadays we use 8 bits for data with no parity, no error correction, and
> no timing bits.  Cuz when things screw up we want them to REALLY screw up
> ... and remain undetectable.
>

Actually, most _enterprise_ level storage & transmission facilities have
error detection and correction codes which are "transparent" to the
programmer. Almost everybody knows about RAID arrays, which (other than
JBOD) either use "parity" (RAID5 is an example) or are "mirrored" (RAID1).
Most have also heard of ECC RAM. But I'll bet that few have heard
of RAIM, which is used on the IBM z series of computers: Redundant
Array of Independent Memory. This is basically "RAID 5" memory. In addition
to the RAID-ness, it still uses ECC as well. Also, unlike with an Intel
machine, if an IBM z suffers a "memory failure", there is usually the
ability for the _hardware_ to recover all the data in the memory module
("block") and transparently copy it to a "phantom" block of memory, which
then takes the place of the block which contains the error. All without
host software intervention.
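
As a rough illustration of the parity idea behind RAID5 (and, by analogy, RAIM), here is a minimal Python sketch. It only shows XOR parity over equal-sized blocks; the helper name xor_blocks and the sample data are made up for the example, and this is not how any particular controller or the IBM z hardware is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"SQLi", b"te3 ", b"CSV!"]      # three "data disks", one stripe each
parity = xor_blocks(data)               # what the parity disk would hold

# Lose the second block, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Because XOR is its own inverse, any single missing block can be recomputed from the others, which is why the recovery can stay transparent to the programmer.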

https://www.ibm.com/developerworks/community/blogs/e0c474f8-3aad-4f01-8bca-f2c12b576ac9/entry/IBM_zEnterprise_redundant_array_of_independent_memory_subsystem


-- 
Veni, Vidi, VISA: I came, I saw, I did a little shopping.

Maranatha! <><
John McKown
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] CSV Extension Typo ...

2017-06-27 Thread Keith Medcalf

sqlite3x.c:206636:18: warning: implicit declaration of function
'csv_read_one_field'; did you mean 'csv_read_one_field_ext'?
[-Wimplicit-function-declaration]
          return csv_read_one_field(p);
                 ^~~~~~~~~~~~~~~~~~
                 csv_read_one_field_ext
sqlite3x.c:206636:18: warning: return makes pointer from integer without a cast
[-Wint-conversion]
          return csv_read_one_field(p);
                 ^

C:\Users\KMedcalf\AppData\Local\Temp\ccCaQVe8.ltrans26.ltrans.o::(.text+0x86fd):
 undefined reference to `csv_read_one_field'
collect2.exe: error: ld returned 1 exit status


Should probably be

Index: csv.c
==
--- csv.c
+++ csv.c
@@ -262,11 +262,11 @@
         csv_append(p, c);
         c = csv_getc(p);
         if( (c&0xff)==0xbf ){
           p->bNotFirst = 1;
           p->n = 0;
-          return csv_read_one_field(p);
+          return csv_read_one_field_ext(p);
         }
       }
     }
     while( c>',' || (c!=EOF && c!=',' && c!='\n') ){
       if( csv_append(p, (char)c) ) return 0;



---
Life should not be a journey to the grave with the intention of arriving safely 
in a pretty and well preserved body, but rather to skid in broadside in a cloud 
of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! 
What a Ride!"
 -- Hunter S. Thompson







Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Keith Medcalf
 
> If an implementation "uses" 8 bits for ASCII text (as opposed to
> hardware storage which is never less than 8 bits for a single C char,
> AFAIK), then it is not a valid ASCII implementation, i.e. does not
> interpret ASCII according to its definition. The whole point of
> specifying a format as 7 bits is that the 8th bit is ignored, or
> perhaps used in an implementation-defined manner, regardless of whether
> the 8th bit in a char is available or not.

ASCII was designed back in the days of low reliability serial communications -- 
you know, back when data was sent using 7 bit data + 1 parity bit + 2 stop 
bits -- to increase the reliability of the communications.  A "byte" was also 9 
bits.  8 bits of data and a parity bit.

Nowadays we use 8 bits for data with no parity, no error correction, and no 
timing bits.  Cuz when things screw up we want them to REALLY screw up ... and 
remain undetectable.
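
To make the "7 data bits + 1 parity bit" framing concrete, here is a minimal Python sketch of even parity. The function names are hypothetical and real serial hardware does this in the UART, not in software; the point is only that a single flipped bit becomes detectable.

```python
def add_even_parity(code: int) -> int:
    """Return an 8-bit value: a 7-bit ASCII code plus an even-parity bit in bit 7."""
    if not 0 <= code < 0x80:
        raise ValueError("not a 7-bit ASCII code")
    parity = bin(code).count("1") & 1   # 1 if the number of set bits is odd
    return code | (parity << 7)


def check_even_parity(byte: int) -> bool:
    """True if the 8-bit value has an even number of set bits."""
    return bin(byte).count("1") % 2 == 0


framed = add_even_parity(ord("C"))       # 'C' is 0x43 = 1000011, three set bits
assert framed == 0xC3
assert check_even_parity(framed)
assert not check_even_parity(framed ^ 0x04)   # a single flipped bit is detected
```

Two flipped bits cancel out and pass the check, which is one reason parity alone was never a complete answer.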







Re: [sqlite] INSERT ... VALUES / want to "skip" default values

2017-06-27 Thread David Raymond
If you have to provide 4 values then the way you can use null to do that is to 
add in a trigger to set the default, since NULL _is_ a value and _is_ legal for 
that field.

CREATE TRIGGER test_populate_b
  AFTER INSERT ON test
  WHEN new.b IS NULL
  BEGIN
    UPDATE test
    SET b = '-'
    WHERE rowid = new.rowid;
  END;

INSERT INTO test VALUES ('field a', NULL, 'field c', 'field d');

a        b   c        d
-------  --  -------  -------
field a  -   field c  field d
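
The trigger approach can be tried end to end with a small Python sketch using the standard sqlite3 module. It assumes an in-memory database and the table definition from the original question (with the default written as '-'); note that the trigger repeats the default value, so the two have to be kept in sync by hand.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE test(a, b DEFAULT '-', c, d);

    CREATE TRIGGER test_populate_b
      AFTER INSERT ON test
      WHEN new.b IS NULL
      BEGIN
        UPDATE test SET b = '-' WHERE rowid = new.rowid;
      END;
""")

# Four values are supplied; None is bound as SQL NULL for column b.
con.execute("INSERT INTO test VALUES (?, ?, ?, ?)",
            ("field a", None, "field c", "field d"))
row = con.execute("SELECT a, b, c, d FROM test").fetchone()
assert row == ("field a", "-", "field c", "field d")
```

The side effect is that column b can no longer hold NULL after an insert at all, which may or may not be acceptable.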

-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Simon Slavin
Sent: Tuesday, June 27, 2017 4:08 PM
To: SQLite mailing list
Subject: Re: [sqlite] INSERT ... VALUES / want to "skip" default values



On 27 Jun 2017, at 8:13pm, Robert M. Münch  wrote:

> CREATE TABLE test(a, b DEFAULT "-", c, d)
> 
> Now I would like to use
> 
> INSERT VALUES(a,?,c,d)
> 
> Where ? is something that the default value is used and not the provided 
> value. Is this possible at all?

You provide the text "NULL" (not in any quotes) for that value:

INSERT INTO test VALUES(12, NULL, 84, 'endomorph')

If you’ve set up a statement with parameters …

INSERT INTO test VALUES(?1, ?2, ?3, ?4)

… you can leave that parameter unbound (all parameters are bound to NULL by 
default) or you can explicitly bind it to NULL using sqlite3_bind_null() .

Do not confuse NULL, which is the NULL value, with 'NULL' in those quotes, 
which is a four character string.

Simon.



Re: [sqlite] INSERT ... VALUES / want to "skip" default values

2017-06-27 Thread David Raymond
Single quotes should be used for strings, so DEFAULT '-'

Not quite sure what you're asking. Do you mean how to insert defaults in 
general?

INSERT INTO test (a, c, d) VALUES ('field a', 'field c', 'field d');
will get you
a        b   c        d
-------  --  -------  -------
field a  -   field c  field d

You need to explicitly state which fields you are providing values for, and any 
field you don't specify/provide will get the default value, either what you 
defined or null. When using VALUES or bindings there is no way to explicitly 
say "use the default for this field", you have to not provide anything and 
exclude the field from the insert.

So there is no method to do something like...

INSERT INTO test VALUES ('field a', DEFAULT, 'field c', 'field d');

There is also no way to give it 4 values and have it use only 3 of them.

Hopefully that answers what you were looking for.

PS: Simon: Specifying NULL will just put a NULL value in there, it won't use 
the default.
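
If the application always has four values in hand, one way to get the "leave the column out" behaviour is to build the column list at run time. The following Python sketch assumes a hypothetical sentinel USE_DEFAULT and a hypothetical helper insert_row; neither is part of SQLite or of the sqlite3 module, and the table and column names are pasted into the SQL text, so they must come from trusted code.

```python
import sqlite3

USE_DEFAULT = object()   # hypothetical sentinel meaning "omit this column"


def insert_row(con, table, columns, values):
    """Insert values, leaving out any column whose value is USE_DEFAULT."""
    kept = [(c, v) for c, v in zip(columns, values) if v is not USE_DEFAULT]
    if kept:
        cols = ", ".join(f'"{c}"' for c, _ in kept)
        marks = ", ".join("?" for _ in kept)
        sql = f'INSERT INTO "{table}" ({cols}) VALUES ({marks})'
    else:
        # Every column was skipped, so let SQLite fill in all defaults.
        sql = f'INSERT INTO "{table}" DEFAULT VALUES'
    con.execute(sql, [v for _, v in kept])


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test(a, b DEFAULT '-', c, d)")
cols = ("a", "b", "c", "d")
insert_row(con, "test", cols, ("a1", USE_DEFAULT, "c1", "d1"))  # b gets '-'
insert_row(con, "test", cols, ("a2", None, "c2", "d2"))         # b stays NULL
rows = con.execute("SELECT a, b FROM test ORDER BY a").fetchall()
assert rows == [("a1", "-"), ("a2", None)]
```

The second insert shows the point of the PS: an explicit NULL is stored as NULL and does not trigger the default.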


-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Robert M. Münch
Sent: Tuesday, June 27, 2017 3:13 PM
To: SQLite mailing list
Subject: [sqlite] INSERT ... VALUES / want to "skip" default values

Hi, I have a table like:

CREATE TABLE test(a, b DEFAULT "-", c, d)

Now I would like to use

INSERT VALUES(a,?,c,d)

Where ? is something that the default value is used and not the provided value. 
Is this possible at all?

Viele Grüsse.

-- 

Robert M. Münch, CEO
M: +41 79 65 11 49 6

Saphirion AG
smarter | better | faster

http://www.saphirion.com
http://www.nlpp.ch


Re: [sqlite] INSERT ... VALUES / want to "skip" default values

2017-06-27 Thread Simon Slavin


On 27 Jun 2017, at 8:13pm, Robert M. Münch  wrote:

> CREATE TABLE test(a, b DEFAULT "-", c, d)
> 
> Now I would like to use
> 
> INSERT VALUES(a,?,c,d)
> 
> Where ? is something that the default value is used and not the provided 
> value. Is this possible at all?

You provide the text "NULL" (not in any quotes) for that value:

INSERT INTO test VALUES(12, NULL, 84, 'endomorph')

If you’ve set up a statement with parameters …

INSERT INTO test VALUES(?1, ?2, ?3, ?4)

… you can leave that parameter unbound (all parameters are bound to NULL by 
default) or you can explicitly bind it to NULL using sqlite3_bind_null() .

Do not confuse NULL, which is the NULL value, with 'NULL' in those quotes, 
which is a four character string.
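
The difference between the NULL value and the string 'NULL' is easy to check from a host language. This is a small Python sketch using the standard sqlite3 module, where None plays the role of sqlite3_bind_null(); it only illustrates how the two values are bound and says nothing about column defaults.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Python's None is bound as SQL NULL; the str "NULL" is bound as text.
null_type, text_type, text_len = con.execute(
    "SELECT typeof(?), typeof(?), length(?)", (None, "NULL", "NULL")
).fetchone()

assert null_type == "null"
assert text_type == "text"
assert text_len == 4
```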

Simon.



Re: [sqlite] UTF8-BOM not disregarded in CSV import

2017-06-27 Thread Scott Robison
On Tue, Jun 27, 2017 at 4:18 AM, Richard Hipp  wrote:
> The CSV import feature of the SQLite command-line shell expects to
> find UTF-8.  It does not understand other encodings, and I have no
> plans to add converters for alternative encodings any time soon.
>
> The latest version of trunk skips over a UTF-8 BOM at the beginning of
> the input file.

A little late, but it occurred to me how to make this "work" with
older versions of sqlite3 that support readfile / writefile. Say I
have a UTF-8 file that starts with a BOM. I can trim the BOM from within
SQLite, then import the trimmed version:

sqlite> select writefile('temp.csv', substr(readfile('utf8.csv'), 4));

sqlite> .import temp.csv temp
sqlite> .import utf8.csv utf8
sqlite> .schema
CREATE TABLE temp(
  "a" TEXT,
  "b" TEXT,
  "c" TEXT,
  "d" TEXT
);
CREATE TABLE utf8(
  "?a" TEXT,
  "b" TEXT,
  "c" TEXT,
  "d" TEXT
);

Alternatively, without readfile / writefile support:

sqlite> pragma writable_schema = 1;
sqlite> update sqlite_master set sql = replace(sql, char(0xFEFF), '')
where name = 'utf8';
sqlite> pragma writable_schema = 0;
sqlite> vacuum;
sqlite> .schema
CREATE TABLE temp(
  "a" TEXT,
  "b" TEXT,
  "c" TEXT,
  "d" TEXT
);
CREATE TABLE utf8(
  "a" TEXT,
  "b" TEXT,
  "c" TEXT,
  "d" TEXT
);

Still, not nearly as friendly as sqlite shell doing it for you.
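
For completeness, the same trimming can be done outside the shell before the import. This is a minimal Python sketch, assuming the whole file fits in memory (as it must for readfile too); the helper name strip_utf8_bom is made up for the example. Unlike substr(..., 4), which drops the first three bytes unconditionally, it only removes them when they really are EF BB BF.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(src, dst) -> bool:
    """Copy src to dst without a leading UTF-8 BOM; return True if one was removed."""
    data = Path(src).read_bytes()
    had_bom = data.startswith(codecs.BOM_UTF8)      # b'\xef\xbb\xbf'
    Path(dst).write_bytes(data[len(codecs.BOM_UTF8):] if had_bom else data)
    return had_bom


# Usage with throwaway files (the file names mirror the session above).
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "utf8.csv"), Path(tmp, "temp.csv")
    src.write_bytes(codecs.BOM_UTF8 + b"a,b,c,d\n1,2,3,4\n")
    removed = strip_utf8_bom(src, dst)
    header = dst.read_bytes().splitlines()[0]

assert removed
assert header == b"a,b,c,d"
```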

-- 
Scott Robison


[sqlite] INSERT ... VALUES / want to "skip" default values

2017-06-27 Thread Robert M. Münch
Hi, I have a table like:

CREATE TABLE test(a, b DEFAULT "-", c, d)

Now I would like to use

INSERT VALUES(a,?,c,d)

Where ? is something that the default value is used and not the provided value. 
Is this possible at all?

Viele Grüsse.

-- 

Robert M. Münch, CEO
M: +41 79 65 11 49 6

Saphirion AG
smarter | better | faster

http://www.saphirion.com
http://www.nlpp.ch




Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 16:38 +0200, Eric Grange wrote:
> > 
> > ASCII / ANSI is a 7-bit format.
> ASCII is a 7 bit encoding, but uses 8 bits in just about any
> implementation
> out there. I do not think there is any 7 bit implementation still
> alive
> outside of legacy mode for low-level wire protocols (RS232 etc.). I
> personally have never encountered a 7 bit ASCII file (as in
> bitpacked), I
> am curious if any exists?

If an implementation "uses" 8 bits for ASCII text (as opposed to
hardware storage which is never less than 8 bits for a single C char,
AFAIK), then it is not a valid ASCII implementation, i.e. does not
interpret ASCII according to its definition. The whole point of
specifying a format as 7 bits is that the 8th bit is ignored, or
perhaps used in an implementation-defined manner, regardless of whether
the 8th bit in a char is available or not.

Once an encoding embraces 8 bits, it will be something like CP1252,
ISO-8859-x, KOI8-R, etc. Just not ASCII.
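
A short Python sketch of that distinction, under the assumption that "7-bit" simply means no byte has its high bit set. The helper name is made up (Python 3.7+ also offers bytes.isascii() for the same check), and the two codecs are just examples of 8-bit encodings that disagree about the same byte.

```python
def is_7bit_ascii(data: bytes) -> bool:
    """True if no byte has its 8th (high) bit set."""
    return all(b < 0x80 for b in data)


sample = b"caf\xe9"                       # 0xE9 has the high bit set
assert is_7bit_ascii(b"cafe")
assert not is_7bit_ascii(sample)

# The same byte is a different character in each 8-bit encoding.
assert sample.decode("cp1252") == "café"
assert sample.decode("cp1252") != sample.decode("koi8_r")
```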




Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Eric Grange
> ASCII / ANSI is a 7-bit format.

ASCII is a 7 bit encoding, but uses 8 bits in just about any implementation
out there. I do not think there is any 7 bit implementation still alive
outside of legacy mode for low-level wire protocols (RS232 etc.). I
personally have never encountered a 7 bit ASCII file (as in bitpacked), I
am curious if any exists?

ANSI has no precise definition; it's used to lump together all the <= 8 bit
legacy encodings (cf. https://en.wikipedia.org/wiki/ANSI_character_set)

On Tue, Jun 27, 2017 at 1:53 PM, Simon Slavin  wrote:

>
>
> On 27 Jun 2017, at 7:12am, Rowan Worth  wrote:
>
> > In fact using this assumption we could dispense with the BOM entirely for
> > UTF-8 and drop case 5 from the list.
>
> If you do that, you will try to process the BOM at the beginning of a
> UTF-8 stream as if it is characters.
>
> > So my question is, what advantage does
> > a BOM offer for UTF-8? What other cases can we identify with the
> > information it provides?
>
> Suppose your software processes only UTF-8 files, but someone feeds it a
> file which begins with FE FF.  Your software should recognise this and
> reject the file, telling the user/programmer that it can’t process it
> because it’s in the wrong encoding.
>
> Processing BOMs is part of the work you have to do to make your software
> Unicode-aware.  Without it, your documentation should state that your
> software handles the one flavour of Unicode it handles, not Unicode in
> general.  There’s nothing wrong with this, if it’s all the programmer/user
> needs, as long as it’s correctly documented.
>
> Simon.


Re: [sqlite] UTF8-BOM not disregarded in CSV import

2017-06-27 Thread Mahmoud Al-Qudsi
Thank you.


From: sqlite-users  on behalf of 
Richard Hipp 
Sent: Tuesday, June 27, 2017 5:18:51 AM
To: SQLite mailing list
Subject: Re: [sqlite] UTF8-BOM not disregarded in CSV import

The CSV import feature of the SQLite command-line shell expects to
find UTF-8.  It does not understand other encodings, and I have no
plans to add converters for alternative encodings any time soon.

The latest version of trunk skips over a UTF-8 BOM at the beginning of
the input file.
--
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Simon Slavin


On 27 Jun 2017, at 7:12am, Rowan Worth  wrote:

> In fact using this assumption we could dispense with the BOM entirely for
> UTF-8 and drop case 5 from the list.

If you do that, you will try to process the BOM at the beginning of a UTF-8 
stream as if it is characters.

> So my question is, what advantage does
> a BOM offer for UTF-8? What other cases can we identify with the
> information it provides?

Suppose your software processes only UTF-8 files, but someone feeds it a file 
which begins with FE FF.  Your software should recognise this and reject the 
file, telling the user/programmer that it can’t process it because it’s in the 
wrong encoding.

Processing BOMs is part of the work you have to do to make your software 
Unicode-aware.  Without it, your documentation should state that your software 
handles the one flavour of Unicode it handles, not Unicode in general.  There’s 
nothing wrong with this, if it’s all the programmer/user needs, as long as it’s 
correctly documented.
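
A minimal Python sketch of that kind of check, assuming the program only needs to look at the first few bytes of the input. The names sniff_bom and require_utf8 are made up for the example; the BOM constants come from the standard codecs module. One known ambiguity is left alone: FF FE 00 00 is reported as UTF-32-LE even though it could be UTF-16-LE text starting with U+0000.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE one.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(head: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def require_utf8(head: bytes) -> None:
    """Raise ValueError if the data starts with a non-UTF-8 BOM."""
    found = sniff_bom(head)
    if found not in (None, "utf-8"):
        raise ValueError(f"input looks like {found}, but only UTF-8 is supported")


assert sniff_bom(b"\xfe\xff\x00a") == "utf-16-be"   # the FE FF case above
assert sniff_bom(b"\xef\xbb\xbfa,b") == "utf-8"
assert sniff_bom(b"a,b") is None
```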

Simon.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 12:42 +0200, Eric Grange wrote:
> In the real world, text files are heavily skewed towards 8 bit
> formats,
> meaning just three cases dominate the debate:
> - ASCII / ANSI
> - utf-8 with BOM
> - utf-8 without BOM

ASCII / ANSI is a 7-bit format.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Eric Grange
> In case 7 we have little choice but to invoke heuristics or defer to the
> user, yes?

Yes in theory, but "no" in the real world, or rather "not in any way that
matters"

In the real world, text files are heavily skewed towards 8 bit formats,
meaning just three cases dominate the debate:
- ASCII / ANSI
- utf-8 with BOM
- utf-8 without BOM

And further, the overwhelming majority of text content is likely to
involve ASCII at the beginning (from various markups,
think html, xml, json, source code... even csv, because of explicit
separator specification or 1st column name).

So while in theory all the scenarios you describe are interesting, in
practice seeing a utf-8 BOM makes it extremely
likely that a file will indeed be utf-8. Not always, but a memory
chip could also be hit by a cosmic ray.

Conversely the absence of a utf-8 BOM means a high probability of
"something undetermined": ANSI or BOMless utf-8,
or something more oddball (in which I lump utf-16 btw)... and the need for
heuristics to kick in.

Outside of source code and Linux config files, BOMless utf-8 files are certainly
not the most frequent text files; ANSI and
other various encodings dominate, because most non-ASCII text files were
(are) produced under DOS or Windows,
where notepad and friends use ANSI by default, for instance.

That may not be a desirable or happy situation, but that is the situation
we have to deal with.

It is also the reason why 20 years later the utf-8 BOM is still in use: it
is explicit and has a practical success rate higher
than any of the heuristics, while collisions of the BOM with the start of
actual ANSI (or other) text are unheard of.
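
The decision order described here (trust the BOM if present, otherwise fall back to heuristics) could be sketched in Python roughly as follows. The function name decode_text and the choice of cp1252 as the "ANSI" fallback are assumptions made for the example, and strict UTF-8 decoding stands in for the heuristics.

```python
import codecs


def decode_text(data: bytes, fallback: str = "cp1252"):
    """Return (text, encoding_used) for a byte string of unknown encoding."""
    if data.startswith(codecs.BOM_UTF8):
        # Explicit signature: decode as UTF-8 and drop the BOM.
        return data.decode("utf-8-sig"), "utf-8-sig"
    try:
        # No BOM: strict UTF-8 decoding doubles as a cheap heuristic.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; undecodable bytes become U+FFFD.
        return data.decode(fallback, errors="replace"), fallback


assert decode_text(b"\xef\xbb\xbfa,b") == ("a,b", "utf-8-sig")
assert decode_text("é".encode("utf-8")) == ("é", "utf-8")
assert decode_text(b"caf\xe9") == ("café", "cp1252")
```

A file that carries the BOM but is not valid UTF-8 afterwards still raises UnicodeDecodeError in this sketch, which matches the "not always" caveat.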


On Tue, Jun 27, 2017 at 10:34 AM, Robert Hairgrove 
wrote:

> On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> > The original issue was two of the largest companies in the world
> > output the
> > Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of
> > UTF-8
> > encoded text streams, and it would be friendly for the SQLite3 shell
> > to
> > skip it or use it for encoding identification in at least some cases.
>
> I would suggest adding a command-line argument to the shell indicating
> whether to ignore a BOM or not, possibly requiring specification of a
> certain encoding or list of encodings to consider.
>
> Certainly this should not be a requirement for the library per se, but
> a responsibility of the client to provide data in the proper encoding.


Re: [sqlite] UTF8-BOM not disregarded in CSV import

2017-06-27 Thread Richard Hipp
The CSV import feature of the SQLite command-line shell expects to
find UTF-8.  It does not understand other encodings, and I have no
plans to add converters for alternative encodings any time soon.

The latest version of trunk skips over a UTF-8 BOM at the beginning of
the input file.
-- 
D. Richard Hipp
d...@sqlite.org


Re: [sqlite] UTF8-BOM not disregarded in CSV import

2017-06-27 Thread Cezary H. Noweta

Hello,

On 2017-06-26 17:26, Scott Robison wrote:


+1



FAQ quote:



Q: When a BOM is used, is it only in 16-bit Unicode text?



A: No, a BOM can be used as a signature no matter how the Unicode
text is transformed: UTF-16, UTF-8, or UTF-32.

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

4. Where the precise type of the data stream is known (e.g. Unicode
big-endian or Unicode little-endian), the BOM should not be used. In
particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE,
UTF-32BE or UTF-32LE a BOM must not be used. (See also Q: What is the
difference between UCS-2 and UTF-16?.)

:-)

-- best regards

Cezary H. Noweta


Re: [sqlite] UTF8-BOM not disregarded in CSV import

2017-06-27 Thread Cezary H. Noweta

On 2017-06-26 15:01, jose isaias cabrera wrote:


> I have made a decision to always include the BOM in all my text files
> whether they are UTF8, UTF16 or UTF32 little or big endian. I think
> all of us should also.


I'm sorry if I introduced ambiguity, but I was describing only the
behavior of SQLite and the SQLite shell. I am not asking anyone to kill
all UTF-8 BOMs in the universe.


> Just because the "Unicode Gurus" didn't think so, does not mean they
> are right.  I had a doctor give me the wrong diagnosis. There were
> just too many symptoms that looked alike and they chose one and went
> with it.  The same thing happened with the Unicode Gurus: they never
> thought about the problems they would be causing today.  Some
> applications do not place a BOM on UTF8, UTF16 files, and then you have
> to go and find which one it is, and decode the file correctly.


The problem you describe was neither introduced nor created by the
``Unicode Gurus''. AFAIR, finding the correct encoding/codepage (of
files of unknown origin) was already a problem in the olden days, long
before Unicode. UTF-8 is far easier to recognize than the others.


> This can all be prevented by having a BOM.


This would have helped if the only candidates had been UTF-8 and some
single-byte code page.


> Yes, I know I am saying what everybody else is, but what I am also
> saying is to let us all use the BOM, and also have every application
> we write welcome the BOM.


What if I want to place 0xFEFF at the beginning of UTF-8 text? A second EF
BB BF as the BOM? OK - but the standard says ``there is no BOM''. This is
what the standard is for. I agree with you -- where the character set is
unmarked, a UTF-8 BOM is useful as an encoding signature. However,
where SQLite accepts only UTF-8, I expect EF BB BF placed at the
beginning to be interpreted as a codepoint -- not a BOM -- until
SQLite says explicitly: ``My interpretation of EF BB BF is BOM''.
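
The two interpretations of a leading EF BB BF can be seen side by side in Python, whose standard codecs happen to offer both: "utf-8" keeps the bytes as the character U+FEFF, while "utf-8-sig" treats them as a signature. This is only an illustration of the distinction, not a statement about what SQLite does.

```python
raw = b"\xef\xbb\xbfa,b,c,d"

as_codepoint = raw.decode("utf-8")       # EF BB BF kept as the character U+FEFF
as_signature = raw.decode("utf-8-sig")   # EF BB BF treated as a BOM and dropped

assert as_codepoint == "\ufeffa,b,c,d"
assert as_signature == "a,b,c,d"

# With two leading EF BB BF sequences, only the first is taken as a signature.
doubled = (b"\xef\xbb\xbf" * 2 + b"x").decode("utf-8-sig")
assert doubled == "\ufeffx"
```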

I am not going to say whether none or all of my UTF-8 files have a BOM,
nor whether to use/kill/welcome all UTF-8 BOMs -- it does not
matter. However, in the case of SQLite, Clemens' arguments are very
compelling -- I hope the SQLite shell's behavior will not change, for the
sake of standard conformance, and thus predictability and determinism.

-- best regards

Cezary H. Noweta


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> The original issue was two of the largest companies in the world
> output the
> Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of
> UTF-8
> encoded text streams, and it would be friendly for the SQLite3 shell
> to
> skip it or use it for encoding identification in at least some cases.

I would suggest adding a command-line argument to the shell indicating
whether to ignore a BOM or not, possibly requiring specification of a
certain encoding or list of encodings to consider.

Certainly this should not be a requirement for the library per se, but
a responsibility of the client to provide data in the proper encoding.


Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Robert Hairgrove
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> On Jun 27, 2017 12:13 AM, "Rowan Worth"  wrote:
> 
> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?

Thanks, Scott -- nothing crucial, it is already quite good enough for
99% of use cases.

The Wikipedia page on "Byte Order Marks" appears to be quite
comprehensive and lists about a dozen possible BOM sequences:

https://en.wikipedia.org/wiki/Byte_order_mark

Lacking a BOM, I would certainly try to rule out UTF-8 right away by
searching for invalid UTF-8 byte sequences within a reasonably large
portion of the input (maybe 100-300KB?) before then looking for any
NUL bytes (which are also invalid UTF-8 except as a delimiter) or
other random control characters.

As to having the user specify an encoding when dealing with something
which should be text (CSV files, for example) and processing files
which the user has specified, there is always the possibility that the
encoding is different than what the user says, mainly because they
probably clicked on a spreadsheet file with a similar name instead of
the desired text file. If the user specifies an 8-bit encoding aside
from Unicode, it gets very difficult to trap wrong input unless you
write routines to search for invalid characters (e.g. distinguishing
between true ISO-8859-x and CP1252).
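
The first step of that approach (rule out UTF-8 on a leading sample, then look for NUL bytes) might be sketched in Python like this. The function name and the sample size are assumptions for the example; the incremental decoder is used so that a multi-byte sequence cut off at the end of the sample is not mistaken for an error.

```python
import codecs


def sample_could_be_utf8(sample: bytes) -> bool:
    """Check a leading chunk of a file for invalid UTF-8 and stray NUL bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates a multi-byte sequence cut off at the chunk end.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return b"\x00" not in sample


assert sample_could_be_utf8("naïve,é\n".encode("utf-8"))
assert sample_could_be_utf8("é".encode("utf-8")[:1])        # truncated, not invalid
assert not sample_could_be_utf8(b"caf\xe9,x\n")             # CP1252-style byte
assert not sample_could_be_utf8("a".encode("utf-16-le"))    # contains a NUL byte
```

In use, the sample would be something like the first 256 KB read from the file in binary mode; a True result only means UTF-8 has not been ruled out.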


[sqlite] Any reason for sqlite3changeset_concat not using const parameters ?

2017-06-27 Thread Domingo Alvarez Duarte

Hello !

I'm trying to use the sqlite3 session extension with C++ and I'm getting 
errors because sqlite3changeset_concat isn't using const parameters for 
its input parameters like sq_sqlite3_session_invert does; looking at the 
code, it doesn't seem to modify its input parameters.


Cheers !



Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)

2017-06-27 Thread Scott Robison
On Jun 27, 2017 12:13 AM, "Rowan Worth"  wrote:

> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?


The original issue was two of the largest companies in the world output the
Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8
encoded text streams, and it would be friendly for the SQLite3 shell to
skip it or use it for encoding identification in at least some cases.