Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-09 Thread Stefan Karpinski
So that's basically a useless API then? I guess we should use the W version
everywhere.
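
As a rough sketch (not actual ODBC.jl code), "use the W version everywhere"
could look something like the wrapper below, modeled on the SQLExecDirect
snippet quoted later in this thread: the odbc_dm handle and the
@windows_only/@unix_only macros are taken from that snippet, SQL_NTS is the
standard ODBC "null-terminated string" length sentinel, and Julia 0.4's
UTF16String supplies the NUL-terminated 16-bit buffer.

const SQL_NTS = -3   # ODBC length sentinel: "the string is NUL-terminated"

function SQLExecDirectW(stmt::Ptr{Void}, query::AbstractString)
    wquery = utf16(query)   # UTF16String; .data is a NUL-terminated Vector{UInt16}
    @windows_only ret = ccall((:SQLExecDirectW, odbc_dm), stdcall,
        Int16, (Ptr{Void}, Ptr{UInt16}, Int),
        stmt, wquery.data, SQL_NTS)
    @unix_only ret = ccall((:SQLExecDirectW, odbc_dm),
        Int16, (Ptr{Void}, Ptr{UInt16}, Int),
        stmt, wquery.data, SQL_NTS)
    return ret
end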

On Tue, Feb 9, 2016 at 12:38 AM, Jameson  wrote:

> Calling the ANSI version doesn't preclude the possibility of getting UTF16
> data back. In particular, that would be code page 1200 (utf16le) or 1201
> (utf16be) for Microsoft Windows. MSDN is inconsistent in their usage of A
> and whether it means ANSI, OEM, localized-locale, or application-dependent
> (aka other) and generally makes no statement about how the bytes may need
> to be handled or interpreted.
>
>
> On Friday, February 5, 2016 at 1:44:20 PM UTC-5, Stefan Karpinski wrote:
>>
>> It does, but that's not what we're seeing – at least with some ODBC
>> drivers.
>>
>> On Fri, Feb 5, 2016 at 1:18 PM, David Anthoff 
>> wrote:
>>
>>>
>>> https://msdn.microsoft.com/en-us/library/ms716246%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
>>>
>>>
>>>
>>> suggests that if you call the version without the A or W suffix you get
>>> the ANSI version.
>>>
>>>
>>>
>>> *From:* julia-users@googlegroups.com [mailto:
>>> julia-users@googlegroups.com] *On Behalf Of *Scott Jones
>>> *Sent:* Thursday, February 4, 2016 1:55 PM
>>> *To:* julia-users 
>>> *Subject:* Re: [julia-users] load a Julia dataframe from Microsoft SQL
>>> Server table
>>>
>>>
>>>
>>>
>>>
>>> On Thursday, February 4, 2016 at 4:29:46 PM UTC-5, Stefan Karpinski
>>> wrote:
>>>
>>> On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones 
>>> wrote:
>>>
>>>
>>>
>>> This still doesn't explain why some drivers are accepting UCS-2/UTF-16
>>> when called with the non-Unicode API.
>>>
>>>
>>>
>>> When you do so, are you actually calling the functions with the A, or
>>> just the macro without either A or W?
>>>
>>> The macro will compile to either the A or the W form, depending on how
>>> your application is built.
>>>
>>>
>>>
>>> This is a better page in MSDN:
>>> https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing
>>> what is going on.
>>>
>>>
>>>
>>> The ODBC package calls the functions without A or W. What it's calling
>>> can't be a macro since macros aren't callable via ccall. But changing ODBC
>>> to call the W version of everything may be the fix here.
>>>
>>>
>>>
>>> That very well may be the solution: looking for example at libiodbc on
>>> the Mac, it has 3 different versions of all those functions, and I'm not
>>> sure just what behavior you get when using the form without the A or W.
>>> I've always used ODBC with the C headers, unlike the direct linking that
>>> Julia is doing, so that it always gets the W version since I compile as a
>>> Unicode build.
>>>
>>
>>


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-08 Thread Jameson
Calling the ANSI version doesn't preclude the possibility of getting UTF16 
data back. In particular, that would be code page 1200 (utf16le) or 1201 
(utf16be) for Microsoft Windows. MSDN is inconsistent in their usage of A 
and whether it means ANSI, OEM, localized-locale, or application-dependent 
(aka other) and generally makes no statement about how the bytes may need 
to be handled or interpreted.


On Friday, February 5, 2016 at 1:44:20 PM UTC-5, Stefan Karpinski wrote:
>
> It does, but that's not what we're seeing – at least with some ODBC 
> drivers.
>
> On Fri, Feb 5, 2016 at 1:18 PM, David Anthoff  
> wrote:
>
>>
>> https://msdn.microsoft.com/en-us/library/ms716246%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
>>
>>  
>>
>> suggests that if you call the version without the A or W suffix you get 
>> the ANSI version. 
>>
>>  
>>
>> *From:* julia-users@googlegroups.com [mailto:julia-users@googlegroups.com] 
>> *On Behalf Of *Scott Jones
>> *Sent:* Thursday, February 4, 2016 1:55 PM
>> *To:* julia-users 
>> *Subject:* Re: [julia-users] load a Julia dataframe from Microsoft SQL 
>> Server table
>>
>>  
>>
>>
>>
>> On Thursday, February 4, 2016 at 4:29:46 PM UTC-5, Stefan Karpinski wrote:
>>
>> On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones  
>> wrote:
>>
>>  
>>
>> This still doesn't explain why some drivers are accepting UCS-2/UTF-16 
>> when called with the non-Unicode API.
>>
>>  
>>
>> When you do so, are you actually calling the functions with the A, or 
>> just the macro without either A or W?
>>
>> The macro will compile to either the A or the W form, depending on how 
>> your application is built.
>>
>>  
>>
>> This is a better page in MSDN: 
>> https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing 
>> what is going on.
>>
>>  
>>
>> The ODBC package calls the functions without A or W. What it's calling 
>> can't be a macro since macros aren't callable via ccall. But changing ODBC 
>> to call the W version of everything may be the fix here.
>>
>>  
>>
>> That very well may be the solution: looking for example at libiodbc on 
>> the Mac, it has 3 different versions of all those functions, and I'm not 
>> sure just what behavior you get when using the form without the A or W.  
>> I've always used ODBC with the C headers, unlike the direct linking that 
>> Julia is doing, so that it always gets the W version since I compile as a 
>> Unicode build. 
>>
>
>

Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-05 Thread Stefan Karpinski
It does, but that's not what we're seeing – at least with some ODBC drivers.

On Fri, Feb 5, 2016 at 1:18 PM, David Anthoff  wrote:

>
> https://msdn.microsoft.com/en-us/library/ms716246%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396
>
>
>
> suggests that if you call the version without the A or W suffix you get
> the ANSI version.
>
>
>
> *From:* julia-users@googlegroups.com [mailto:julia-users@googlegroups.com]
> *On Behalf Of *Scott Jones
> *Sent:* Thursday, February 4, 2016 1:55 PM
> *To:* julia-users 
> *Subject:* Re: [julia-users] load a Julia dataframe from Microsoft SQL
> Server table
>
>
>
>
>
> On Thursday, February 4, 2016 at 4:29:46 PM UTC-5, Stefan Karpinski wrote:
>
> On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones  wrote:
>
>
>
> This still doesn't explain why some drivers are accepting UCS-2/UTF-16
> when called with the non-Unicode API.
>
>
>
> When you do so, are you actually calling the functions with the A, or just
> the macro without either A or W?
>
> The macro will compile to either the A or the W form, depending on how
> your application is built.
>
>
>
> This is a better page in MSDN:
> https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing
> what is going on.
>
>
>
> The ODBC package calls the functions without A or W. What it's calling
> can't be a macro since macros aren't callable via ccall. But changing ODBC
> to call the W version of everything may be the fix here.
>
>
>
> That very well may be the solution: looking for example at libiodbc on the
> Mac, it has 3 different versions of all those functions, and I'm not sure
> just what behavior you get when using the form without the A or W.  I've
> always used ODBC with the C headers, unlike the direct linking that Julia
> is doing, so that it always gets the W version since I compile as a Unicode
> build.
>


RE: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-05 Thread David Anthoff
https://msdn.microsoft.com/en-us/library/ms716246%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396

 

suggests that if you call the version without the A or W suffix you get the 
ANSI version. 

 

From: julia-users@googlegroups.com [mailto:julia-users@googlegroups.com] On 
Behalf Of Scott Jones
Sent: Thursday, February 4, 2016 1:55 PM
To: julia-users 
Subject: Re: [julia-users] load a Julia dataframe from Microsoft SQL Server 
table

 



On Thursday, February 4, 2016 at 4:29:46 PM UTC-5, Stefan Karpinski wrote:

On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones  > wrote:

 

This still doesn't explain why some drivers are accepting UCS-2/UTF-16 when 
called with the non-Unicode API.

 

When you do so, are you actually calling the functions with the A, or just the 
macro without either A or W?

The macro will compile to either the A or the W form, depending on how your 
application is built.

 

This is a better page in MSDN: 
https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing what 
is going on.

 

The ODBC package calls the functions without A or W. What it's calling can't be 
a macro since macros aren't callable via ccall. But changing ODBC to call the W 
version of everything may be the fix here.

 

That very well may be the solution: looking for example at libiodbc on the Mac, 
it has 3 different versions of all those functions, and I'm not sure just what 
behavior you get when using the form without the A or W.  I've always used ODBC 
with the C headers, unlike the direct linking that Julia is doing, so that it 
always gets the W version since I compile as a Unicode build. 



Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Jacob Quinn
That's a big part of the "remodel" I've been working on to always call the
"W" version of the functions and use UTF8 consistently (see here:
https://github.com/JuliaDB/ODBC.jl/blob/jq/remodel/src/API.jl). I would
certainly welcome those willing to test various configurations/setups.

-Jacob

On Thu, Feb 4, 2016 at 2:28 PM, Stefan Karpinski 
wrote:

> On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones 
> wrote:
>
>>
>> This still doesn't explain why some drivers are accepting UCS-2/UTF-16
>>> when called with the non-Unicode API.
>>>
>>
>> When you do so, are you actually calling the functions with the A, or
>> just the macro without either A or W?
>> The macro will compile to either the A or the W form, depending on how
>> your application is built.
>>
>> This is a better page in MSDN:
>> https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing
>> what is going on.
>>
>
> The ODBC package calls the functions without A or W. What it's calling
> can't be a macro since macros aren't callable via ccall. But changing ODBC
> to call the W version of everything may be the fix here.
>


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Scott Jones


On Thursday, February 4, 2016 at 4:29:46 PM UTC-5, Stefan Karpinski wrote:
>
> On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones  > wrote:
>
>>
>> This still doesn't explain why some drivers are accepting UCS-2/UTF-16 
>>> when called with the non-Unicode API.
>>>
>>
>> When you do so, are you actually calling the functions with the A, or 
>> just the macro without either A or W?
>> The macro will compile to either the A or the W form, depending on how 
>> your application is built.
>>
>> This is a better page in MSDN: 
>> https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing 
>> what is going on.
>>
>
> The ODBC package calls the functions without A or W. What it's calling 
> can't be a macro since macros aren't callable via ccall. But changing ODBC 
> to call the W version of everything may be the fix here.
>

That very well may be the solution: looking for example at libiodbc on the 
Mac, it has 3 different versions of all those functions, and I'm not sure 
just what behavior you get when using the form without the A or W.  I've 
always used ODBC with the C headers, unlike the direct linking that Julia 
is doing, so that it always gets the W version since I compile as a Unicode 
build. 


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Stefan Karpinski
On Thu, Feb 4, 2016 at 1:50 PM, Scott Jones 
wrote:

>
> This still doesn't explain why some drivers are accepting UCS-2/UTF-16
>> when called with the non-Unicode API.
>>
>
> When you do so, are you actually calling the functions with the A, or just
> the macro without either A or W?
> The macro will compile to either the A or the W form, depending on how
> your application is built.
>
> This is a better page in MSDN:
> https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing
> what is going on.
>

The ODBC package calls the functions without A or W. What it's calling
can't be a macro since macros aren't callable via ccall. But changing ODBC
to call the W version of everything may be the fix here.


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Scott Jones


On Thursday, February 4, 2016 at 1:33:33 PM UTC-5, Stefan Karpinski wrote:
>
> Not a model of clarity (ANSI and Unicode are not encodings), but this page 
> seems to be the best resource on this:
>
> https://msdn.microsoft.com/en-us/library/ms709439(v=vs.85).aspx
>
> It seems that there's a parallel "Unicode" API for ODBC drivers that 
> support it. Moreover:
>
> Currently, the only Unicode encoding that ODBC supports is UCS-2, which 
>> uses a 16-bit integer (fixed length) to represent a character. Unicode 
>> allows applications to work in different languages.
>
>
> So using Klingon is off the table. Although the design of UTF-16 is such 
> that sending UTF-16 to an application that expects UCS-2 will probably work 
> reasonably well, as long as it treats it as "just data".
>

That comment is probably years (decades?) out of date.  ODBC is no longer 
limited to just UCS-2.
 

> This still doesn't explain why some drivers are accepting UCS-2/UTF-16 
> when called with the non-Unicode API.
>

When you do so, are you actually calling the functions with the A, or just 
the macro without either A or W?
The macro will compile to either the A or the W form, depending on how your 
application is built.

This is a better page in MSDN: 
https://msdn.microsoft.com/en-us/library/ms712612(v=vs.85).aspx describing 
what is going on.

>
>

Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Scott Jones
My reply wasn't meant to be condescending at all, just trying to explain 
the issue.
UTF-16LE & UTF-16BE *are* encodings of the 16-bit UTF-16 encoding of 
Unicode onto 8-bit code units.
If the server is sending UTF-16 or UTF-32, you should simply use the *W 
API, period, because in some places the 8-bit API can have
problems with the embedded 0x00 bytes.
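
To make the embedded-0x00 problem concrete, here is a small plain-Julia 
illustration (0.4-era utf16()/UTF16String, nothing driver-specific): even a 
pure-ASCII query contains NUL bytes once it is encoded as UTF-16, so any code 
path that treats the buffer as a NUL-terminated 8-bit string truncates it 
immediately.

q = "SELECT 1"
b = reinterpret(UInt8, utf16(q).data)   # UTF-16LE bytes on a little-endian machine
# b == [0x53, 0x00, 0x45, 0x00, 0x4c, 0x00, ...] -- a 0x00 follows every ASCII
# character (and .data carries a trailing NUL code unit), so C-style
# strlen()/NUL-terminated handling stops right after "S".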

On Thursday, February 4, 2016 at 1:13:12 PM UTC-5, Stefan Karpinski wrote:
>
> The real issue is this:
>
> SQLCHAR is for encodings with 8-bit code units.
>
>
> Condescending lecture on encodings notwithstanding, UTF-16 is not such an 
> encoding, yet UTF-16 is what the ODBC package is currently sending 
> to SQLExecDirect for an argument of type SQLCHAR * – and somehow it seems 
> to be working for many drivers, which still makes no sense to me. I can 
> only conclude that some ODBC drivers are treating this as a void * argument 
> and they expect pointers to data in whatever encoding they prefer, not 
> specifically in encodings with 8-bit code units.
>
> Querying the database about what encoding it expects is a good idea, but 
> how does one do that? The SQLGetInfo function seems like a good candidate 
> but this page doesn't include 
> "encoding" or "utf" anywhere.
>
> On Thu, Feb 4, 2016 at 7:53 AM, Milan Bouchet-Valat  > wrote:
>
>> Le mercredi 03 février 2016 à 11:44 -0800, Terry Seaward a écrit :
>> > From R, it seems like the encoding is based on the connection (as
>> > opposed to being hard coded). See `enc <- attr(channel, "encoding")`
>> > below:
>> >
>> > ```
>> > [...]
>> >
>> > Digging down `odbcConnect` is just a wrapper for `odbcDriverConnect`
>> > which has the following parameter `DBMSencoding = ""`. This calls the
>> > `C` function `C_RODBCDriverConnect` (available here:RODBC_1.3-
>> > 12.tar.gz), which has no reference to encodings. So `attr(channel,
>> > "encoding")` is simply `DBMSencoding`, i.e. `""`.
>> >
>> > It seems to come down to `iconv(..., to = "")` which, from the R
>> > source code, uses `win_iconv.c` attached. I can't seem to find how
>> > `""` is handled, i.e. is there some default value based on the
>> > system?
>> "" refers to the encoding of the current system locale. This is a
>> reasonable guess, but it will probably be wrong in many cases (else, R
>> wouldn't have provided this option at all).
>>
>>
>> Regards
>>
>
>

Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Stefan Karpinski
Not a model of clarity (ANSI and Unicode are not encodings), but this page
seems to be the best resource on this:

https://msdn.microsoft.com/en-us/library/ms709439(v=vs.85).aspx

It seems that there's a parallel "Unicode" API for ODBC drivers that
support it. Moreover:

Currently, the only Unicode encoding that ODBC supports is UCS-2, which
> uses a 16-bit integer (fixed length) to represent a character. Unicode
> allows applications to work in different languages.


So using Klingon is off the table. Although the design of UTF-16 is such
that sending UTF-16 to an application that expects UCS-2 will probably work
reasonably well, as long as it treats it as "just data".

This still doesn't explain why some drivers are accepting UCS-2/UTF-16 when
called with the non-Unicode API.

On Thu, Feb 4, 2016 at 1:12 PM, Stefan Karpinski 
wrote:

> The real issue is this:
>
> SQLCHAR is for encodings with 8-bit code units.
>
>
> Condescending lecture on encodings notwithstanding, UTF-16 is not such an
> encoding, yet UTF-16 is what the ODBC package is currently sending
> to SQLExecDirect for an argument of type SQLCHAR * – and somehow it seems
> to be working for many drivers, which still makes no sense to me. I can
> only conclude that some ODBC drivers are treating this as a void * argument
> and they expect pointers to data in whatever encoding they prefer, not
> specifically in encodings with 8-bit code units.
>
> Querying the database about what encoding it expects is a good idea, but
> how does one do that? The SQLGetInfo function seems like a good candidate
> but this page doesn't include
> "encoding" or "utf" anywhere.
>
> On Thu, Feb 4, 2016 at 7:53 AM, Milan Bouchet-Valat 
> wrote:
>
>> Le mercredi 03 février 2016 à 11:44 -0800, Terry Seaward a écrit :
>> > From R, it seems like the encoding is based on the connection (as
>> > opposed to being hard coded). See `enc <- attr(channel, "encoding")`
>> > below:
>> >
>> > ```
>> > [...]
>> >
>> > Digging down `odbcConnect` is just a wrapper for `odbcDriverConnect`
>> > which has the following parameter `DBMSencoding = ""`. This calls the
>> > `C` function `C_RODBCDriverConnect` (available here:RODBC_1.3-
>> > 12.tar.gz), which has no reference to encodings. So `attr(channel,
>> > "encoding")` is simply `DBMSencoding`, i.e. `""`.
>> >
>> > It seems to come down to `iconv(..., to = "")` which, from the R
>> > source code, uses `win_iconv.c` attached. I can't seem to find how
>> > `""` is handled, i.e. is there some default value based on the
>> > system?
>> "" refers to the encoding of the current system locale. This is a
>> reasonable guess, but it will probably be wrong in many cases (else, R
>> wouldn't have provided this option at all).
>>
>>
>> Regards
>>
>
>


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Stefan Karpinski
The real issue is this:

SQLCHAR is for encodings with 8-bit code units.


Condescending lecture on encodings notwithstanding, UTF-16 is not such an
encoding, yet UTF-16 is what the ODBC package is currently sending
to SQLExecDirect for an argument of type SQLCHAR * – and somehow it seems
to be working for many drivers, which still makes no sense to me. I can
only conclude that some ODBC drivers are treating this as a void * argument
and they expect pointers to data in whatever encoding they prefer, not
specifically in encodings with 8-bit code units.

Querying the database about what encoding it expects is a good idea, but
how does one do that? The SQLGetInfo function
seems like a good candidate but this page doesn't include "encoding" or
"utf" anywhere.

On Thu, Feb 4, 2016 at 7:53 AM, Milan Bouchet-Valat 
wrote:

> Le mercredi 03 février 2016 à 11:44 -0800, Terry Seaward a écrit :
> > From R, it seems like the encoding is based on the connection (as
> > opposed to being hard coded). See `enc <- attr(channel, "encoding")`
> > below:
> >
> > ```
> > [...]
> >
> > Digging down `odbcConnect` is just a wrapper for `odbcDriverConnect`
> > which has the following parameter `DBMSencoding = ""`. This calls the
> > `C` function `C_RODBCDriverConnect` (available here:RODBC_1.3-
> > 12.tar.gz), which has no reference to encodings. So `attr(channel,
> > "encoding")` is simply `DBMSencoding`, i.e. `""`.
> >
> > It seems to come down to `iconv(..., to = "")` which, from the R
> > source code, uses `win_iconv.c` attached. I can't seem to find how
> > `""` is handled, i.e. is there some default value based on the
> > system?
> "" refers to the encoding of the current system locale. This is a
> reasonable guess, but it will probably be wrong in many cases (else, R
> wouldn't have provided this option at all).
>
>
> Regards
>


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Milan Bouchet-Valat
Le mercredi 03 février 2016 à 11:44 -0800, Terry Seaward a écrit :
> From R, it seems like the encoding is based on the connection (as
> opposed to being hard coded). See `enc <- attr(channel, "encoding")`
> below:
> 
> ```
> [...]
> 
> Digging down `odbcConnect` is just a wrapper for `odbcDriverConnect`
> which has the following parameter `DBMSencoding = ""`. This calls the
> `C` function `C_RODBCDriverConnect` (available here:RODBC_1.3-
> 12.tar.gz), which has no reference to encodings. So `attr(channel,
> "encoding")` is simply `DBMSencoding`, i.e. `""`.
> 
> It seems to come down to `iconv(..., to = "")` which, from the R
> source code, uses `win_iconv.c` attached. I can't seem to find how
> `""` is handled, i.e. is there some default value based on the
> system?
"" refers to the encoding of the current system locale. This is a
reasonable guess, but it will probably be wrong in many cases (else, R
wouldn't have provided this option at all).


Regards


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Scott Jones
I should have been clearer - it is just a superset of the printable 
characters, but, as it reuses assigned (even though pretty much never used) 
control character positions, it is not truly a superset.  ASCII is a 7-bit 
subset of ANSI Latin-1, which is an 8-bit subset of UCS-2, which is a 
16-bit subset that can represent only the BMP, which is a subset of the 
Unicode code points (which need 21 bits).
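
To make the "reused control-character positions" point concrete, here is a 
small plain-Julia illustration (the byte-to-character mappings are from the 
published Windows-1252 code chart; the helper names are made up for this 
example):

# Byte 0x80 is the C1 control character U+0080 in ISO-8859-1, but the Euro
# sign U+20AC in Windows-1252; the curly quotes work the same way.
cp1252_extras = Dict(0x80 => '\u20ac',   # Euro sign
                     0x91 => '\u2018',   # left single quotation mark
                     0x92 => '\u2019',   # right single quotation mark
                     0x93 => '\u201c',   # left double quotation mark
                     0x94 => '\u201d')   # right double quotation mark

decode_latin1(b::UInt8) = Char(b)                         # ISO-8859-1 maps bytes directly
decode_cp1252(b::UInt8) = get(cp1252_extras, b, Char(b))  # CP1252 overrides some C1 slots

decode_latin1(0x80)   # -> '\u80' (a control character)
decode_cp1252(0x80)   # -> '€'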

On Thursday, February 4, 2016 at 4:52:33 AM UTC-5, Páll Haraldsson wrote:
>
> On Thursday, February 4, 2016 at 5:33:35 AM UTC, Scott Jones wrote:
>>
>> SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII 
>> or UTF-8 (probably one of the more common character sets used with that is 
>> actually Microsoft's CP1252, which is often mistakenly described as ANSI 
>> Latin-1 - of which it is a superset).
>>
>
> When I read that, I thought, that must not be true.. You can't have a 
> superset (add letters) without dropping others (implying a superset of a 
> subset), so I looked up:
>
> https://en.wikipedia.org/wiki/Windows-1252
> "differs from the IANA's ISO-8859-1 by using displayable characters rather 
> than control characters in the 80 to 9F (hex) range. Notable additional 
> characters are curly quotation marks, the Euro sign, and all the printable 
> characters that are in ISO 8859-15.
> [..]
> This is now standard behavior in the HTML 5 specification, which requires 
> that documents advertised as ISO-8859-1 actually be parsed with the 
> Windows-1252 encoding.[1] In January 2016 1.0% of all web sites use 
> Windows-1252."
>
> Still, despite this 1.0% I think we should support this encoding (in a 
> way, if not its own 8-bit-only type (I'm not sure we need to support any 
> other 8-bit one); it's no longer just some Microsoft thing as I assumed..), 
> as it is ideal for most of Europe (and even the US/world because of "curly 
> quotation"). I've been thinking of doing a sting-type, that does the same 
> as Python, encodes in 8-bit when possible, possibly 7-bit (then it can 
> still say it's UTF-8 and fast indexing is known, note the strings are 
> immutable).
>
> It wouldn't surprise me that "UTF-8" would sometimes, incorrectly, include 
> this as the "Latin-1" subset..
>
> I wonder if this screws up sorting.. It's not like the exact position of 
> the Euro sign is too important in alphabetical sorting. I could argue it be 
> sorted with E e but I assume just after A-Z a-z is ok for most..
>
> I had never heard of "control characters in the 80 to 9F (hex) range", 
> assuming then it's a very obscure/ancient thing that can be assumed to be 
> never used anymore..
>
>
> Even when something says it is UTF-8, it frequently is not *really* valid 
>> UTF-8, for example, there are two common variations of UTF-8, CESU-8, used 
>> by MySQL and others, which encodes any non-BMP code point using the two 
>> UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8 
>> sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus 
>> embedded \0s are encoded in a "long" form (0xc0 0x80)
>>
>
> Not only those..
>
> I thought the WTF variant (important for us, because of 
> Windows-filenames?) of UTF-8 was a joke/vandalism at Wikipedia until I read 
> more closely on this I just saw:
>
> https://en.wikipedia.org/wiki/UTF-8#WTF-8
>

I hadn't ever run across that in my work, but my work was in databases, 
usually Unix (AIX, Solaris, etc) or Linux, not so much on Windows any 
longer, and I added Unicode support before surrogates even existed (they 
were added in Unicode 2.0, but not actually used until Unicode 3.0).
I'm not sure what you'd want to do to convert that for use in Julia. (btw, 
I think the initials of the "format" say it all!)



Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-04 Thread Páll Haraldsson
On Thursday, February 4, 2016 at 5:33:35 AM UTC, Scott Jones wrote:
>
> SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII or 
> UTF-8 (probably one of the more common character sets used with that is 
> actually Microsoft's CP1252, which is often mistakenly described as ANSI 
> Latin-1 - of which it is a superset).
>

When I read that, I thought, that must not be true.. You can't have a 
superset (add letters) without dropping others (implying a superset of a 
subset), so I looked up:

https://en.wikipedia.org/wiki/Windows-1252
"differs from the IANA's ISO-8859-1 by using displayable characters rather 
than control characters in the 80 to 9F (hex) range. Notable additional 
characters are curly quotation marks, the Euro sign, and all the printable 
characters that are in ISO 8859-15.
[..]
This is now standard behavior in the HTML 5 specification, which requires that 
documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 
encoding.[1] In January 2016 1.0% of all web sites use Windows-1252."

Still, despite this 1.0% I think we should support this encoding (in a way, 
if not its own 8-bit-only type (I'm not sure we need to support any other 
8-bit one); it's no longer just some Microsoft thing as I assumed..), as it 
is ideal for most of Europe (and even the US/world because of "curly 
quotation"). I've been thinking of doing a sting-type, that does the same 
as Python, encodes in 8-bit when possible, possibly 7-bit (then it can 
still say it's UTF-8 and fast indexing is known, note the strings are 
immutable).

It wouldn't surprise me that "UTF-8" would sometimes, incorrectly, include 
this as the "Latin-1" subset..

I wonder if this screws up sorting.. It's not like the exact position of 
the Euro sign is too important in alphabetical sorting. I could argue it be 
sorted with E e but I assume just after A-Z a-z is ok for most..

I had never heard of "control characters in the 80 to 9F (hex) range", 
assuming then it's a very obscure/ancient thing that can be assumed to be 
never used anymore..


Even when something says it is UTF-8, it frequently is not *really* valid 
> UTF-8, for example, there are two common variations of UTF-8, CESU-8, used 
> by MySQL and others, which encodes any non-BMP code point using the two 
> UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8 
> sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus 
> embedded \0s are encoded in a "long" form (0xc0 0x80)
>

Not only those..

I thought the WTF variant (important for us, because of Windows-filenames?) 
of UTF-8 was a joke/vandalism at Wikipedia until I read more closely on 
this I just saw:

https://en.wikipedia.org/wiki/UTF-8#WTF-8

-- 
Palli.



Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-03 Thread Scott Jones


On Wednesday, February 3, 2016 at 9:48:24 AM UTC-5, Stefan Karpinski wrote:
>
> I've been working through encoding issues with this as well. To connect to 
> Microsoft SQL Server, this patch made things work:
>
> diff --git a/src/backend.jl b/src/backend.jl
> index b5f24af..bf4ee11 100644
> --- a/src/backend.jl
> +++ b/src/backend.jl
> @@ -40,7 +40,7 @@ end
>
>  # Send query to DMBS
>  function ODBCQueryExecute(stmt::Ptr{Void}, querystring::AbstractString)
> -if @FAILED SQLExecDirect(stmt, utf16(querystring))
> +if @FAILED SQLExecDirect(stmt, utf8(querystring))
>  ODBCError(SQL_HANDLE_STMT,stmt)
>  error("[ODBC]: SQLExecDirect failed; Return Code: $ret")
>  end
>
>
> The query string gets passed through to SQLExecDirect:
>
> #SQLExecDirect
> #
> http://msdn.microsoft.com/en-us/library/windows/desktop/ms713611(v=vs.85).aspx
> #Description: executes a preparable statement
> #Status:
> function SQLExecDirect(stmt::Ptr{Void}, query::AbstractString)
>     @windows_only ret = ccall( (:SQLExecDirect, odbc_dm), stdcall,
>                                Int16, (Ptr{Void}, Ptr{UInt8}, Int),
>                                stmt, query, sizeof(query))
>     @unix_only ret = ccall( (:SQLExecDirect, odbc_dm),
>                             Int16, (Ptr{Void}, Ptr{UInt8}, Int),
>                             stmt, query, sizeof(query))
>     return ret
> end
>
>  
> This function just converts whatever its argument is to a pointer. Looking 
> at the docs for this function, the signature is this:
>
> SQLRETURN SQLExecDirect(
>  SQLHSTMT StatementHandle,
>  SQLCHAR *StatementText,
>  SQLINTEGER   TextLength);
>
> and SQLCHAR is defined as 
> unsigned char. So this would seem to be a non-wide character string – i.e. 
> ASCII or UTF-8. And indeed, that's what the Microsoft SQL driver seems to 
> be expecting
>
> The question I have is this: how the heck is this working for other ODBC 
> drivers? How are they getting pointers to UTF-16 data and interpreting it 
> correctly? The correct fix would seem to be to make this always send UTF-8 
> strings. But when I made a PR that did that, it seemed to break other setups.
>

SQLCHAR is for encodings with 8-bit code units.  It doesn't imply ASCII or 
UTF-8 (probably one of the more common character sets used with that is 
actually Microsoft's CP1252, which is often mistakenly described as ANSI 
Latin-1 - of which it is a superset).
Even when something says it is UTF-8, it frequently is not *really* valid 
UTF-8, for example, there are two common variations of UTF-8, CESU-8, used 
by MySQL and others, which encodes any non-BMP code point using the two 
UTF-16 surrogate pairs, i.e. to 6 bytes instead of the correct 4-byte UTF-8 
sequence, and Java's Modified UTF-8, which is the same as CESU-8, plus 
embedded \0s are encoded in a "long" form (0xc0 0x80) so that you can still 
use old-fashioned C \0 termination for strings while allowing embedded \0s.
(note: handling those formats, common in databases, was part of the reason 
I did the work to rewrite the `convert` functions for strings for Julia).
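
For concreteness, the same non-BMP character in the two forms described above 
(the byte values follow from the UTF-8/UTF-16 encoding rules and are shown 
purely as an illustration):

# U+1F600 as standard UTF-8 is a single 4-byte sequence; CESU-8 instead
# encodes each UTF-16 surrogate (D83D, DE00) as its own 3-byte sequence.
utf8_bytes  = [0xf0, 0x9f, 0x98, 0x80]                # correct UTF-8 (4 bytes)
cesu8_bytes = [0xed, 0xa0, 0xbd, 0xed, 0xb8, 0x80]    # CESU-8 (6 bytes)
# Java's Modified UTF-8 is CESU-8 plus a two-byte "long form" for embedded NULs:
modified_utf8_nul = [0xc0, 0x80]                      # encodes U+0000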

The reason your PR didn't work is that you need to find out from the 
database which character set it is using, and then do the conversions 
necessary to go back and forth between that character set and whatever 
Unicode string type you are using in Julia (UTF8String, UTF16String, or UTF32String).
One thing that could help is the following package that @nalimilan has been 
working on: https://github.com/nalimilan/StringEncodings.jl
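
A minimal sketch of that round trip using StringEncodings.jl's decode/encode
(the byte values are just an example of what a CP1252 driver might hand back):

using StringEncodings   # https://github.com/nalimilan/StringEncodings.jl

raw = UInt8[0x6e, 0x61, 0xef, 0x76, 0x65]   # "naïve" as Windows-1252 bytes
s   = decode(raw, "WINDOWS-1252")           # convert to a Julia string
out = encode(s, "UTF-16LE")                 # bytes suitable for a *W entry point
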
You should also be able to use the SQLWCHAR API, but that can have its own 
issues; see the following from Oracle:

> NOTE: The mapping of *SQLWCHAR* type is somewhat complicated and it can
> create hidden pitfalls for programmers porting their code from Windows to
> Linux. Usually a *SQLWCHAR* character is a 16-bit unit and we will not
> consider the exotic cases when *SQLWCHAR* is different. Windows uses
> *UTF-16* and maps *SQLWCHAR* to 16-bit wchar_t type. However, many Linux
> versions such as Ubuntu Linux use *UTF-32* as an internal character set and
> therefore their 32-bit wchar_t is not compatible with *SQLWCHAR*, which is
> always 16-bit.

 
Your working theory 
https://github.com/JuliaDB/ODBC.jl/pull/71#issuecomment-175095266 is 
incorrect: it is an issue of the character encodings used, not of platforms.  You 
see platform differences because the databases you are connecting to have 
different default character encodings on different platforms.


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-03 Thread Tony Kelman
Try to avoid posting snippets of GPL-licensed code if at all possible. That 
license is viral and it's best not to run any risk of using code with that 
license in more-permissively-licensed Julia projects. Links are fine, as 
long as the license status of what you're linking to is clear.


On Wednesday, February 3, 2016 at 11:44:13 AM UTC-8, Terry Seaward wrote:
>
> From R, it seems like the encoding is based on the connection (as opposed 
> to being hard coded). See `enc <- attr(channel, "encoding")` below:
>
>
>

Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-03 Thread Terry Seaward
PS, thanks for the patch.



Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-03 Thread Terry Seaward
From R, it seems like the encoding is based on the connection (as opposed 
to being hard coded). See `enc <- attr(channel, "encoding")` below:

```
> RODBC::odbcQuery
function (channel, query, rows_at_time = attr(channel, "rows_at_time")) 
{
    if (!odbcValidChannel(channel)) 
        stop("first argument is not an open RODBC channel")
    if (nchar(enc <- attr(channel, "encoding"))) 
        query <- iconv(query, to = enc)
    .Call(C_RODBCQuery, attr(channel, "handle_ptr"), as.character(query), 
        as.integer(rows_at_time))
}

```

Digging down `odbcConnect` is just a wrapper for `odbcDriverConnect` which 
has the following parameter `DBMSencoding = ""`. This calls the `C` 
function `C_RODBCDriverConnect` (available here: RODBC_1.3-12.tar.gz), which has no 
reference to encodings. So `attr(channel, "encoding")` is simply 
`DBMSencoding`, i.e. `""`.

It seems to come down to `iconv(..., to = "")` which, from the R source 
code, uses `win_iconv.c` attached. I can't seem to find how `""` is 
handled, i.e. is there some default value based on the system?


```
> RODBC::odbcDriverConnect
function (connection = "", case = "nochange", believeNRows = TRUE, 
colQuote, tabQuote = colQuote, interpretDot = TRUE, DBMSencoding = "", 
rows_at_time = 100, readOnlyOptimize = FALSE) 
{
id <- as.integer(1 + runif(1, 0, 1e+05))
stat <- .Call(C_RODBCDriverConnect, as.character(connection), 
id, as.integer(believeNRows), as.logical(readOnlyOptimize))
if (stat < 0L) {
warning("ODBC connection failed")
return(stat)
}
Call <- match.call()
res <- .Call(C_RODBCGetInfo, attr(stat, "handle_ptr"))
isMySQL <- res[1L] == "MySQL"
if (missing(colQuote)) 
colQuote <- ifelse(isMySQL, "`", "\"")
if (missing(case)) 
case <- switch(res[1L], MySQL = "mysql", PostgreSQL = "postgresql", 
"nochange")
switch(case, toupper = case <- 1L, tolower = case <- 2L, 
postgresql = case <- 2L, nochange = case <- 0L, msaccess = case <- 
0L, 
mysql = case <- ifelse(.Platform$OS.type == "windows", 
2L, 0L), stop("Invalid case parameter: nochange | toupper | 
tolower | common db names"))
case <- switch(case + 1L, "nochange", "toupper", "tolower")
rows_at_time <- max(1, min(1024, rows_at_time))
cs <- attr(stat, "connection.string")
if (grepl("PWD=", cs)) {
attr(stat, "connection.string") <- sub("PWD=[^;]+($|;)", 
"PWD=**;", cs)
Call$connection <- sub("PWD=[^;]+($|;)", "PWD=**;", 
connection)
}
structure(stat, class = "RODBC", case = case, id = id, believeNRows = 
believeNRows, 
colQuote = colQuote, tabQuote = tabQuote, interpretDot = 
interpretDot, 
encoding = DBMSencoding, rows_at_time = rows_at_time, 
isMySQL = isMySQL, call = Call)
}

```

/*
 * iconv library implemented with Win32 API.
 *
 * This file is placed in the public domain.
 *
 * Maintainer: Yukihiro Nakadaira 
 *
 * Win32 API does not support strict encoding conversion for some
 * codepage.  And MLang function drop or replace invalid bytes and does
 * not return useful error status as iconv.  This implementation cannot
 * be used for encoding validation purpose.
 */

/* Primary source was
   http://code.google.com/p/win-iconv/source/checkout
   but now (Aug 2015)
   https://raw.githubusercontent.com/win-iconv/win-iconv/master/win_iconv.c

   Original R version from http://www.gtk.org/download-windows.html

   R changes:

   - add some missing encoding names, remove duplicate names.
   - add iconvlist()
   - set errno on error
   - XP-compatibility for WC_NO_BEST_FIT_CHARS -- use only for ASCII

A reasonably complete list is at
http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx

 */

/* for WC_NO_BEST_FIT_CHARS */
#ifndef WINVER
# define WINVER 0x0500
#endif

#include 
#include 
#include 
#include 

#define BUILDING_LIBICONV
#include 

#define MB_CHAR_MAX 16

#define UNICODE_MODE_BOM_DONE   1
#define UNICODE_MODE_SWAPPED    2

#define FLAG_USE_BOM            1
#define FLAG_TRANSLIT           2 /* //TRANSLIT */
#define FLAG_IGNORE             4 /* //IGNORE */

typedef unsigned char uchar;
typedef unsigned short ushort;
typedef unsigned int uint;

/*
typedef void* iconv_t;

iconv_t iconv_open(const char *tocode, const char *fromcode);
int iconv_close(iconv_t cd);
size_t iconv(iconv_t cd, const char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
*/


typedef struct compat_t compat_t;
typedef struct csconv_t csconv_t;
typedef struct rec_iconv_t rec_iconv_t;

typedef iconv_t (*f_iconv_open)(const char *tocode, const char *fromcode);
typedef int (*f_iconv_close)(iconv_t cd);
typedef size_t (*f_iconv)(iconv_t cd, const char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
typedef int* (*f_errno)(void);
typedef int (*f_mbtowc)(csconv_t *cv, const ucha

Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-03 Thread Stefan Karpinski
I've been working through encoding issues with this as well. To connect to
Microsoft SQL Server, this patch made things work:

diff --git a/src/backend.jl b/src/backend.jl
index b5f24af..bf4ee11 100644
--- a/src/backend.jl
+++ b/src/backend.jl
@@ -40,7 +40,7 @@ end

 # Send query to DMBS
 function ODBCQueryExecute(stmt::Ptr{Void}, querystring::AbstractString)
-if @FAILED SQLExecDirect(stmt, utf16(querystring))
+if @FAILED SQLExecDirect(stmt, utf8(querystring))
 ODBCError(SQL_HANDLE_STMT,stmt)
 error("[ODBC]: SQLExecDirect failed; Return Code: $ret")
 end


The query string gets passed through to SQLExecDirect:

#SQLExecDirect
#
http://msdn.microsoft.com/en-us/library/windows/desktop/ms713611(v=vs.85).aspx
#Description: executes a preparable statement
#Status:
function SQLExecDirect(stmt::Ptr{Void}, query::AbstractString)
    @windows_only ret = ccall( (:SQLExecDirect, odbc_dm), stdcall,
                               Int16, (Ptr{Void}, Ptr{UInt8}, Int),
                               stmt, query, sizeof(query))
    @unix_only ret = ccall( (:SQLExecDirect, odbc_dm),
                            Int16, (Ptr{Void}, Ptr{UInt8}, Int),
                            stmt, query, sizeof(query))
    return ret
end


This function just converts whatever its argument is to a pointer. Looking
at the docs for this function, the signature is this:

SQLRETURN SQLExecDirect(
 SQLHSTMT StatementHandle,
 SQLCHAR *StatementText,
 SQLINTEGER   TextLength);

and SQLCHAR is defined as
unsigned char. So this would seem to be a non-wide character string – i.e.
ASCII or UTF-8. And indeed, that's what the Microsoft SQL driver seems to
be expecting.

The question I have is this: how the heck is this working for other ODBC
drivers? How are they getting pointers to UTF-16 data and interpreting it
correctly? The correct fix would seem to be to make this always send UTF-8
strings. But when I made a PR 
that did that, it seemed to break other setups.


On Wed, Feb 3, 2016 at 2:46 AM, Terry Seaward 
wrote:

> Is there a DBI compliant version of this? Also I have endless string
> conversion issues with this package.
>
> On Tuesday, 17 March 2015 21:22:50 UTC+2, Jacob Quinn wrote:
>>
>> Check out the https://github.com/quinnj/ODBC.jl package for connecting
>> to DSN defined in your ODBC manager.
>>
>> On Tue, Mar 17, 2015 at 1:17 PM, Charles Brauer 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm considering diving into Julia. However, all of my data is in a
>>> Microsoft SQL Server database.
>>> I would really appreciate a Julia code example on how to load a Julia
>>> dataframe from SQL Server table.
>>>
>>> Thanks
>>> Charles
>>>
>>
>>


Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2016-02-02 Thread Terry Seaward
Is there a DBI compliant version of this? Also I have endless string 
conversion issues with this package.

On Tuesday, 17 March 2015 21:22:50 UTC+2, Jacob Quinn wrote:
>
> Check out the https://github.com/quinnj/ODBC.jl package for connecting to 
> DSN defined in your ODBC manager.
>
> On Tue, Mar 17, 2015 at 1:17 PM, Charles Brauer  > wrote:
>
>> Hi,
>>
>> I'm considering diving into Julia. However, all of my data is in a 
>> Microsoft SQL Server database.
>> I would really appreciate a Julia code example on how to load a Julia 
>> dataframe from SQL Server table.
>>
>> Thanks
>> Charles
>>
>
>

Re: [julia-users] load a Julia dataframe from Microsoft SQL Server table

2015-03-17 Thread Jacob Quinn
Check out the https://github.com/quinnj/ODBC.jl package for connecting to
DSN defined in your ODBC manager.
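
For anyone arriving at this thread with the original question, a rough sketch
of the basic flow, assuming the DSN/query API that later ODBC.jl releases
settled on (ODBC.DSN and ODBC.query; the exact names have changed across
package versions, so check the README above):

using ODBC, DataFrames

# Hypothetical DSN name and credentials; substitute your own.
dsn = ODBC.DSN("MySQLServerDSN", "username", "password")
df  = ODBC.query(dsn, "SELECT TOP 10 * FROM my_table")   # returns a DataFrame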

On Tue, Mar 17, 2015 at 1:17 PM, Charles Brauer 
wrote:

> Hi,
>
> I'm considering diving into Julia. However, all of my data is in a
> Microsoft SQL Server database.
> I would really appreciate a Julia code example on how to load a Julia
> dataframe from SQL Server table.
>
> Thanks
> Charles
>