Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On Tue, Jun 27, 2017 at 4:02 PM, Keith Medcalf wrote:

> > If an implementation "uses" 8 bits for ASCII text (as opposed to
> > hardware storage which is never less than 8 bits for a single C char,
> > AFAIK), then it is not a valid ASCII implementation, i.e. does not
> > interpret ASCII according to its definition. The whole point of
> > specifying a format as 7 bits is that the 8th bit is ignored, or
> > perhaps used in an implementation-defined manner, regardless of whether
> > the 8th bit in a char is available or not.
>
> ASCII was designed back in the days of low reliability serial
> communications -- you know, back when data was sent using 7 bit data + 1
> parity bit + 2 stop bits -- to increase the reliability of the
> communications. A "byte" was also 9 bits. 8 bits of data and a parity bit.
>
> Nowadays we use 8 bits for data with no parity, no error correction, and
> no timing bits. Cuz when things screw up we want them to REALLY screw up
> ... and remain undetectable.

Actually, most _enterprise_ level storage & transmission facilities have error detection and correction codes which are "transparent" to the programmer. Almost everybody knows about RAID arrays, which (other than JBOD) either have "parity" (RAID5 is an example) or are "mirrored" (RAID1). Most have also heard of ECC RAM. But I'll bet that few have heard of RAIM memory, which is used on the IBM z series of computers: Redundant Array of Independent Memory. This is basically "RAID 5" memory. In addition to the RAID-ness, it still uses ECC as well.

Also, unlike with an Intel machine, if an IBM z suffers a "memory failure", there is usually the ability for the _hardware_ to recover all the data in the memory module ("block") and transparently copy it to a "phantom" block of memory, which then takes the place of the block which contains the error. All without host software intervention.

https://www.ibm.com/developerworks/community/blogs/e0c474f8-3aad-4f01-8bca-f2c12b576ac9/entry/IBM_zEnterprise_redundant_array_of_independent_memory_subsystem

--
Veni, Vidi, VISA: I came, I saw, I did a little shopping.
Maranatha! <><
John McKown
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] CSV Extension Typo ...
sqlite3x.c:206636:18: warning: implicit declaration of function 'csv_read_one_field'; did you mean 'csv_read_one_field_ext'? [-Wimplicit-function-declaration]
   return csv_read_one_field(p);
          ^~
          csv_read_one_field_ext
sqlite3x.c:206636:18: warning: return makes pointer from integer without a cast [-Wint-conversion]
   return csv_read_one_field(p);
          ^
C:\Users\KMedcalf\AppData\Local\Temp\ccCaQVe8.ltrans26.ltrans.o::(.text+0x86fd): undefined reference to `csv_read_one_field'
collect2.exe: error: ld returned 1 exit status

Should probably be

Index: csv.c
==
--- csv.c
+++ csv.c
@@ -262,11 +262,11 @@
         csv_append(p, c);
         c = csv_getc(p);
         if( (c&0xff)==0xbf ){
           p->bNotFirst = 1;
           p->n = 0;
-          return csv_read_one_field(p);
+          return csv_read_one_field_ext(p);
         }
       }
     }
     while( c>',' || (c!=EOF && c!=',' && c!='\n') ){
       if( csv_append(p, (char)c) ) return 0;

---
Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming "Wow! What a Ride!" -- Hunter S. Thompson
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
> If an implementation "uses" 8 bits for ASCII text (as opposed to
> hardware storage which is never less than 8 bits for a single C char,
> AFAIK), then it is not a valid ASCII implementation, i.e. does not
> interpret ASCII according to its definition. The whole point of
> specifying a format as 7 bits is that the 8th bit is ignored, or
> perhaps used in an implementation-defined manner, regardless of whether
> the 8th bit in a char is available or not.

ASCII was designed back in the days of low reliability serial communications -- you know, back when data was sent using 7 bit data + 1 parity bit + 2 stop bits -- to increase the reliability of the communications. A "byte" was also 9 bits. 8 bits of data and a parity bit.

Nowadays we use 8 bits for data with no parity, no error correction, and no timing bits. Cuz when things screw up we want them to REALLY screw up ... and remain undetectable.
Re: [sqlite] INSERT ... VALUES / want to "skip" default values
If you have to provide 4 values, then the way to use NULL for that is to add a trigger that sets the default, since NULL _is_ a value and _is_ legal for that field.

CREATE TRIGGER test_populate_b
AFTER INSERT ON test
WHEN new.b IS NULL
BEGIN
  UPDATE test SET b = '-' WHERE rowid = new.rowid;
END;

INSERT INTO test VALUES ('field a', NULL, 'field c', 'field d');

a        b   c        d
-------  --  -------  -------
field a  -   field c  field d

-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On Behalf Of Simon Slavin
Sent: Tuesday, June 27, 2017 4:08 PM
To: SQLite mailing list
Subject: Re: [sqlite] INSERT ... VALUES / want to "skip" default values

On 27 Jun 2017, at 8:13pm, Robert M. Münch wrote:

> CREATE TABLE test(a, b DEFAULT "-", c, d)
>
> Now I would like to use
>
> INSERT VALUES(a,?,c,d)
>
> Where ? is something that the default value is used and not the provided
> value. Is this possible at all?

You provide the text "NULL" (not in any quotes) for that value:

INSERT INTO test VALUES(12, NULL, 84, 'endomorph')

If you’ve set up a statement with parameters …

INSERT INTO test VALUES(?1, ?2, ?3, ?4)

… you can leave that parameter unbound (all parameters are bound to NULL by default) or you can explicitly bind it to NULL using sqlite3_bind_null().

Do not confuse NULL, which is the NULL value, with 'NULL' in those quotes, which is a four character string.

Simon.
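The trigger approach above can be tried out with a small sketch. It assumes Python's standard `sqlite3` module and an in-memory database; the function name `insert_with_trigger` is purely illustrative, and the table is created with the single-quoted default `'-'`.

```python
import sqlite3


def insert_with_trigger():
    """Insert a row with b = NULL and let the trigger fill in '-'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE test(a, b DEFAULT '-', c, d);
        CREATE TRIGGER test_populate_b AFTER INSERT ON test
        WHEN new.b IS NULL
        BEGIN
            UPDATE test SET b = '-' WHERE rowid = new.rowid;
        END;
        """
    )
    # Python's None is bound as SQL NULL for the second parameter.
    conn.execute(
        "INSERT INTO test VALUES (?, ?, ?, ?)",
        ("field a", None, "field c", "field d"),
    )
    rows = conn.execute("SELECT a, b, c, d FROM test").fetchall()
    conn.close()
    return rows


print(insert_with_trigger())
```

One consequence of this design is that a NULL can no longer be stored in b by an INSERT at all, which may or may not be what is wanted.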
Re: [sqlite] INSERT ... VALUES / want to "skip" default values
Single quotes should be used for strings, so DEFAULT '-'

Not quite sure what you're asking. Do you mean how to insert defaults in general?

INSERT INTO test (a, c, d) VALUES ('field a', 'field c', 'field d');

will get you

a        b   c        d
-------  --  -------  -------
field a  -   field c  field d

You need to explicitly state which fields you are providing values for, and any field you don't specify/provide will get the default value, either what you defined or null.

When using VALUES or bindings there is no way to explicitly say "use the default for this field", you have to not provide anything and exclude the field from the insert. So there is no method to do something like...

INSERT INTO test VALUES ('field a', DEFAULT, 'field c', 'field d');

There is also no way to give it 4 values and have it use only 3 of them.

Hopefully that answers what you were looking for.

PS: Simon: Specifying NULL will just put a NULL value in there, it won't use the default.

-----Original Message-----
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On Behalf Of Robert M. Münch
Sent: Tuesday, June 27, 2017 3:13 PM
To: SQLite mailing list
Subject: [sqlite] INSERT ... VALUES / want to "skip" default values

Hi, I have a table like:

CREATE TABLE test(a, b DEFAULT "-", c, d)

Now I would like to use

INSERT VALUES(a,?,c,d)

Where ? is something that the default value is used and not the provided value. Is this possible at all?

Viele Grüsse.

--
Robert M. Münch, CEO
M: +41 79 65 11 49 6
Saphirion AG
smarter | better | faster
http://www.saphirion.com
http://www.nlpp.ch
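The difference between omitting a column and supplying an explicit NULL can be checked with a short sketch, again assuming Python's standard `sqlite3` module and an in-memory database. The function name `demo_defaults` and the row values are made up for illustration.

```python
import sqlite3


def demo_defaults():
    """Compare omitting column b with binding an explicit NULL to it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test(a, b DEFAULT '-', c, d)")
    # Column b is left out of the column list, so its DEFAULT applies.
    conn.execute(
        "INSERT INTO test (a, c, d) VALUES (?, ?, ?)", ("row 1", "c1", "d1")
    )
    # An explicit NULL (None in Python) is stored as NULL; the default is not used.
    conn.execute(
        "INSERT INTO test VALUES (?, ?, ?, ?)", ("row 2", None, "c2", "d2")
    )
    rows = conn.execute("SELECT a, b FROM test ORDER BY rowid").fetchall()
    conn.close()
    return rows


print(demo_defaults())
```

The first row comes back with b = '-', the second with b = None, which matches the PS above.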
Re: [sqlite] INSERT ... VALUES / want to "skip" default values
On 27 Jun 2017, at 8:13pm, Robert M. Münch wrote:

> CREATE TABLE test(a, b DEFAULT "-", c, d)
>
> Now I would like to use
>
> INSERT VALUES(a,?,c,d)
>
> Where ? is something that the default value is used and not the provided
> value. Is this possible at all?

You provide the text "NULL" (not in any quotes) for that value:

INSERT INTO test VALUES(12, NULL, 84, 'endomorph')

If you’ve set up a statement with parameters …

INSERT INTO test VALUES(?1, ?2, ?3, ?4)

… you can leave that parameter unbound (all parameters are bound to NULL by default) or you can explicitly bind it to NULL using sqlite3_bind_null().

Do not confuse NULL, which is the NULL value, with 'NULL' in those quotes, which is a four character string.

Simon.
Re: [sqlite] UTF8-BOM not disregarded in CSV import
On Tue, Jun 27, 2017 at 4:18 AM, Richard Hipp wrote:

> The CSV import feature of the SQLite command-line shell expects to
> find UTF-8. It does not understand other encodings, and I have no
> plans to add converters for alternative encodings any time soon.
>
> The latest version of trunk skips over a UTF-8 BOM at the beginning of
> the input file.

A little late, but it occurred to me how to make this "work" with older versions of sqlite3 that support readfile / writefile. Say I have a UTF8 BOM encoded file. I can trim it from SQLite then import the trimmed version:

sqlite> select writefile('temp.csv', substr(readfile('utf8.csv'), 4));
sqlite> .import temp.csv temp
sqlite> .import utf8.csv utf8
sqlite> .schema
CREATE TABLE temp( "a" TEXT, "b" TEXT, "c" TEXT, "d" TEXT );
CREATE TABLE utf8( "?a" TEXT, "b" TEXT, "c" TEXT, "d" TEXT );

Alternatively, without readfile / writefile support:

sqlite> pragma writable_schema = 1;
sqlite> update sqlite_master set sql = replace(sql, char(0xFEFF), '') where name = 'utf8';
sqlite> pragma writable_schema = 0;
sqlite> vacuum;
sqlite> .schema
CREATE TABLE temp( "a" TEXT, "b" TEXT, "c" TEXT, "d" TEXT );
CREATE TABLE utf8( "a" TEXT, "b" TEXT, "c" TEXT, "d" TEXT );

Still, not nearly as friendly as sqlite shell doing it for you.

--
Scott Robison
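For anyone preprocessing the file outside the shell, the same trimming can be sketched in a few lines. This is a minimal sketch, not what the shell does internally: it assumes Python's standard `codecs` and `csv` modules, and the helper name `strip_utf8_bom` and the sample bytes are invented for illustration.

```python
import codecs
import csv
import io


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Sample input: a BOM followed by a tiny CSV file (made-up content).
raw = codecs.BOM_UTF8 + b"a,b,c,d\n1,2,3,4\n"

# Without trimming, the BOM ends up glued to the first column name.
header_with_bom = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# With trimming, the header is clean.
header = next(csv.reader(io.StringIO(strip_utf8_bom(raw).decode("utf-8"))))

print(header_with_bom, header)
```

Decoding with Python's "utf-8-sig" codec would have the same effect; the explicit helper just makes the three skipped bytes visible, mirroring the `substr(..., 4)` call above.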
[sqlite] INSERT ... VALUES / want to "skip" default values
Hi, I have a table like:

CREATE TABLE test(a, b DEFAULT "-", c, d)

Now I would like to use

INSERT VALUES(a,?,c,d)

Where ? is something that the default value is used and not the provided value. Is this possible at all?

Viele Grüsse.

--
Robert M. Münch, CEO
M: +41 79 65 11 49 6
Saphirion AG
smarter | better | faster
http://www.saphirion.com
http://www.nlpp.ch
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On Tue, 2017-06-27 at 16:38 +0200, Eric Grange wrote:

> > ASCII / ANSI is a 7-bit format.
>
> ASCII is a 7 bit encoding, but uses 8 bits in just about any
> implementation out there. I do not think there is any 7 bit
> implementation still alive outside of legacy mode for low-level wire
> protocols (RS232 etc.). I personally have never encountered a 7 bit
> ASCII file (as in bitpacked), I am curious if any exists?

If an implementation "uses" 8 bits for ASCII text (as opposed to hardware storage which is never less than 8 bits for a single C char, AFAIK), then it is not a valid ASCII implementation, i.e. does not interpret ASCII according to its definition. The whole point of specifying a format as 7 bits is that the 8th bit is ignored, or perhaps used in an implementation-defined manner, regardless of whether the 8th bit in a char is available or not.

Once an encoding embraces 8 bits, it will be something like CP1252, ISO-8859-x, KOI8-R, etc. Just not ASCII.
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
> ASCII / ANSI is a 7-bit format.

ASCII is a 7 bit encoding, but uses 8 bits in just about any implementation out there. I do not think there is any 7 bit implementation still alive outside of legacy mode for low-level wire protocols (RS232 etc.). I personally have never encountered a 7 bit ASCII file (as in bitpacked), I am curious if any exists?

ANSI has no precise definition, it's used to lump together all the <= 8 bit legacy encodings (cf. https://en.wikipedia.org/wiki/ANSI_character_set)

On Tue, Jun 27, 2017 at 1:53 PM, Simon Slavin wrote:

> On 27 Jun 2017, at 7:12am, Rowan Worth wrote:
>
> > In fact using this assumption we could dispense with the BOM entirely for
> > UTF-8 and drop case 5 from the list.
>
> If you do that, you will try to process the BOM at the beginning of a
> UTF-8 stream as if it is characters.
>
> > So my question is, what advantage does
> > a BOM offer for UTF-8? What other cases can we identify with the
> > information it provides?
>
> Suppose your software processes only UTF-8 files, but someone feeds it a
> file which begins with FE FF. Your software should recognise this and
> reject the file, telling the user/programmer that it can’t process it
> because it’s in the wrong encoding.
>
> Processing BOMs is part of the work you have to do to make your software
> Unicode-aware. Without it, your documentation should state that your
> software handles the one flavour of Unicode it handles, not Unicode in
> general. There’s nothing wrong with this, if it’s all the programmer/user
> needs, as long as it’s correctly documented.
>
> Simon.
Re: [sqlite] UTF8-BOM not disregarded in CSV import
Thank you.

From: sqlite-users on behalf of Richard Hipp
Sent: Tuesday, June 27, 2017 5:18:51 AM
To: SQLite mailing list
Subject: Re: [sqlite] UTF8-BOM not disregarded in CSV import

The CSV import feature of the SQLite command-line shell expects to find UTF-8. It does not understand other encodings, and I have no plans to add converters for alternative encodings any time soon.

The latest version of trunk skips over a UTF-8 BOM at the beginning of the input file.

--
D. Richard Hipp
d...@sqlite.org
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On 27 Jun 2017, at 7:12am, Rowan Worth wrote:

> In fact using this assumption we could dispense with the BOM entirely for
> UTF-8 and drop case 5 from the list.

If you do that, you will try to process the BOM at the beginning of a UTF-8 stream as if it is characters.

> So my question is, what advantage does
> a BOM offer for UTF-8? What other cases can we identify with the
> information it provides?

Suppose your software processes only UTF-8 files, but someone feeds it a file which begins with FE FF. Your software should recognise this and reject the file, telling the user/programmer that it can’t process it because it’s in the wrong encoding.

Processing BOMs is part of the work you have to do to make your software Unicode-aware. Without it, your documentation should state that your software handles the one flavour of Unicode it handles, not Unicode in general. There’s nothing wrong with this, if it’s all the programmer/user needs, as long as it’s correctly documented.

Simon.
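The "recognise and reject" behaviour described above could look roughly like the following. This is only a sketch under stated assumptions: it uses the BOM constants from Python's standard `codecs` module, and the names `sniff_bom` and `read_utf8_only` are hypothetical, not part of SQLite or any library.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer signatures are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def read_utf8_only(data: bytes) -> str:
    """Decode UTF-8 input, skipping a UTF-8 BOM and rejecting any other BOM."""
    name, skip = sniff_bom(data)
    if name not in (None, "utf-8"):
        raise ValueError(f"input has a {name} BOM; only UTF-8 is supported")
    return data[skip:].decode("utf-8")


print(sniff_bom(b"\xfe\xffab"))
print(read_utf8_only(b"\xef\xbb\xbfhi"))
```

A file starting with FE FF is reported as UTF-16 big-endian and refused by `read_utf8_only`, while input with a UTF-8 BOM or no BOM at all is decoded as usual.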
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On Tue, 2017-06-27 at 12:42 +0200, Eric Grange wrote:

> In the real world, text files are heavily skewed towards 8 bit formats,
> meaning just three cases dominate the debate:
> - ASCII / ANSI
> - utf-8 with BOM
> - utf-8 without BOM

ASCII / ANSI is a 7-bit format.
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
> In case 7 we have little choice but to invoke heuristics or defer to the
> user, yes?

Yes in theory, but "no" in the real world, or rather "not in any way that matters".

In the real world, text files are heavily skewed towards 8 bit formats, meaning just three cases dominate the debate:
- ASCII / ANSI
- utf-8 with BOM
- utf-8 without BOM

And further, the overwhelming majority of text content is likely to involve ASCII at the beginning (from various markups, think html, xml, json, source code... even csv, because of explicit separator specification or 1st column name).

So while in theory all the scenarios you describe are interesting, in practice seeing a utf-8 BOM provides an extremely high likelihood that a file will indeed be utf-8. Not always, but a memory chip could also be hit by a cosmic ray.

Conversely the absence of a utf-8 BOM means a high probability of "something undetermined": ANSI or BOMless utf-8, or something more oddball (in which I lump utf-16 btw)... and the need for heuristics to kick in.

Outside of source code and Linux config files, BOMless utf-8 files are certainly not the most frequent text files. ANSI and other various encodings dominate, because most non-ASCII text files were (are) produced under DOS or Windows, where notepad and friends use ANSI by default, for instance.

That may not be a desirable or happy situation, but that is the situation we have to deal with.

It is also the reason why 20 years later the utf-8 BOM is still in use: it is explicit and has a practical success rate higher than any of the heuristics, while collisions of the BOM with the start of actual ANSI (or other) text are unheard of.

On Tue, Jun 27, 2017 at 10:34 AM, Robert Hairgrove wrote:

> On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:
> > The original issue was two of the largest companies in the world
> > output the Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the
> > beginning of UTF-8 encoded text streams, and it would be friendly for
> > the SQLite3 shell to skip it or use it for encoding identification in
> > at least some cases.
>
> I would suggest adding a command-line argument to the shell indicating
> whether to ignore a BOM or not, possibly requiring specification of a
> certain encoding or list of encodings to consider.
>
> Certainly this should not be a requirement for the library per se, but
> a responsibility of the client to provide data in the proper encoding.
Re: [sqlite] UTF8-BOM not disregarded in CSV import
The CSV import feature of the SQLite command-line shell expects to find UTF-8. It does not understand other encodings, and I have no plans to add converters for alternative encodings any time soon.

The latest version of trunk skips over a UTF-8 BOM at the beginning of the input file.

--
D. Richard Hipp
d...@sqlite.org
Re: [sqlite] UTF8-BOM not disregarded in CSV import
Hello,

On 2017-06-26 17:26, Scott Robison wrote:

+1

FAQ quote:

Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, or UTF-32.

Q: How I should deal with BOMs?
A: Here are some guidelines to follow:
4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. (See also Q: What is the difference between UCS-2 and UTF-16?.)

:-)

-- best regards
Cezary H. Noweta
Re: [sqlite] UTF8-BOM not disregarded in CSV import
On 2017-06-26 15:01, jose isaias cabrera wrote:

> I have made a decision to always include the BOM in all my text files,
> whether they are UTF8, UTF16 or UTF32, little or big endian. I think all
> of us should also.

I'm sorry if I introduced ambiguity, but I was describing only the behavior of SQLite and the SQLite shell. I'm not calling for all UTF-8 BOMs in the universe to be killed.

> Just because the "Unicode Gurus" didn't think so does not mean they are
> right. I had a doctor give me the wrong diagnosis. There were just too
> many symptoms that looked alike, and they chose one and went with it. The
> same thing happened with the Unicode Gurus: they never thought about the
> problems they would be causing today. Some applications do not place a
> BOM on UTF8 or UTF16 files, and then you have to go and find out which
> one it is and decode the file correctly.

The problem you described was neither introduced nor created by the ``Unicode Gurus''. AFAIR, finding the correct encoding/codepage of files of unknown origin was already a problem in the olden days, long before Unicode. UTF-8 is far easier to recognize than the others.

> This can all be prevented by having a BOM.

This would have helped if there had been only UTF-8 and one single-byte code page.

> Yes, I know I am saying what everybody is saying, but what I am also
> saying is to let us all use the BOM, and also have every application we
> write welcome the BOM.

What if I want to place 0xFEFF at the beginning of a UTF-8 text? Then a second EF BB BF as the BOM? OK, but the standard says ``there is no BOM''. This is what the standard is for.

I agree with you: where a character set is unmarked, the UTF-8 BOM is useful as an encoding signature. However, where SQLite accepts only UTF-8, I expect EF BB BF at the beginning to be interpreted as a code point, not as a BOM, unless SQLite explicitly says: ``My interpretation of EF BB BF is BOM''.

I am not going to say whether none or all of my UTF-8 files have a BOM, nor whether to use/kill/welcome all UTF-8 BOMs; it does not matter. However, in the case of SQLite, Clemens' arguments are very stringent. I hope the SQLite shell's behavior will not change, for the sake of standard conformance and thus predictability and determinism.

-- best regards
Cezary H. Noweta
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:

> The original issue was two of the largest companies in the world output
> the Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of
> UTF-8 encoded text streams, and it would be friendly for the SQLite3
> shell to skip it or use it for encoding identification in at least some
> cases.

I would suggest adding a command-line argument to the shell indicating whether to ignore a BOM or not, possibly requiring specification of a certain encoding or list of encodings to consider.

Certainly this should not be a requirement for the library per se, but a responsibility of the client to provide data in the proper encoding.
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On Tue, 2017-06-27 at 01:14 -0600, Scott Robison wrote:

> On Jun 27, 2017 12:13 AM, "Rowan Worth" wrote:
>
> > I'm sure I've simplified things with this description - have I missed
> > something crucial? Is the BOM argument about future proofing? Are we
> > worried about EBCDIC? Is my perspective too anglo-centric?

Thanks, Scott -- nothing crucial, it is already quite good enough for 99% of use cases.

The Wikipedia page on "Byte Order Marks" appears to be quite comprehensive and lists about a dozen possible BOM sequences:

https://en.wikipedia.org/wiki/Byte_order_mark

Lacking a BOM, I would certainly try to rule out UTF-8 right away by searching for invalid UTF-8 characters within a reasonably large portion of the input (maybe 100-300KB?) before then looking for any NULL bytes (which are also invalid UTF-8 except as a delimiter) or other random control characters.

As to having the user specify an encoding when dealing with something which should be text (CSV files, for example) and processing files which the user has specified, there is always the possibility that the encoding is different than what the user says, mainly because they probably clicked on a spreadsheet file with a similar name instead of the desired text file.

If the user specifies an 8-bit encoding aside from Unicode, it gets very difficult to trap wrong input unless you write routines to search for invalid characters (e.g. distinguishing between true ISO-8859-x and CP1252).
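The "rule out UTF-8 first" step suggested above can be sketched as a strict decode of a leading sample. This is one possible heuristic, not an established routine: it assumes Python's standard `codecs` incremental decoder, and the function name `looks_like_utf8` and the 256 KB default sample size are arbitrary choices for illustration.

```python
import codecs


def looks_like_utf8(data: bytes, sample_size: int = 256 * 1024) -> bool:
    """Return False if the leading sample contains bytes that are not valid UTF-8."""
    sample = data[:sample_size]
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        # final=False tolerates a multi-byte sequence that was cut off by
        # the sample boundary; final=True is used when the sample is the
        # whole input, so a truncated trailing sequence counts as invalid.
        decoder.decode(sample, final=(len(data) <= sample_size))
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("Viele Grüsse".encode("utf-8")))
print(looks_like_utf8("Viele Grüsse".encode("cp1252")))
```

A True result only means the sample is well-formed UTF-8; pure ASCII passes too. A NUL-byte check could be layered on top, bearing in mind that 0x00 is the well-formed UTF-8 encoding of U+0000, so its presence is a hint that the input is not plain text rather than a decoding error.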
[sqlite] Any reason for sqlite3changeset_concat not using const parameters ?
Hello!

I'm trying to use the sqlite3 session extension with C++ and I'm getting errors because sqlite3changeset_concat doesn't declare its input parameters const the way sq_sqlite3_session_invert does. Looking at the code, it doesn't seem to modify its input parameters.

Cheers!
Re: [sqlite] [OT] UTF8-BOM and text encoding detection (was: UTF8-BOM not disregarded in CSV import)
On Jun 27, 2017 12:13 AM, "Rowan Worth" wrote:

> I'm sure I've simplified things with this description - have I missed
> something crucial? Is the BOM argument about future proofing? Are we
> worried about EBCDIC? Is my perspective too anglo-centric?

The original issue was two of the largest companies in the world output the Byte Encoding Mark(TM)(Patent Pending) (or BOM) at the beginning of UTF-8 encoded text streams, and it would be friendly for the SQLite3 shell to skip it or use it for encoding identification in at least some cases.