alamb opened a new issue, #33: URL: https://github.com/apache/datafusion-benchmarks/issues/33
I spent a large amount of time confused about this and wanted to file an issue. I found it in the context of https://github.com/clflushopt/tpchgen-rs/pull/253/changes

The Dockerfile in this repo implies that the DSGen data generator for TPC-DS has incorrectly encoded input:

https://github.com/apache/datafusion-benchmarks/blob/cb12c981e6608e0f2dcf919956ada8f1f1622d72/tpcds/Dockerfile#L17-L19

However, what that command actually does is corrupt the file.

For example, the original data has a `CÔTE` in it (`\x43 \xc3 \x94 \x54 \x45` in UTF-8):

    $ grep -n "IVOIRE" tpcds.dst | head -1 | xxd | head -3
    00000000: 3634 393a 6164 6420 2822 43c3 9454 4520  649:add ("C..TE
    00000010: 4427 4956 4f49 5245 223a 3129 3b0a       D'IVOIRE":1);.

The command

```
iconv -f ISO-8859-14 -t UTF-8
```

tells `iconv` to treat the input as ISO-8859-14 single-byte data. But the file was already UTF-8, so `iconv` took each existing UTF-8 byte (e.g. `C3` and `94`) as if it were a separate Latin character and re-encoded each one into UTF-8.

This results in:

tpcds.dst (original), already correct UTF-8:
- `C3 94` = the valid UTF-8 encoding for Ô (U+00D4 LATIN CAPITAL LETTER O WITH CIRCUMFLEX)

tpcds.dst2 (after iconv), corrupted (double-encoded):
- `C3 83 C2 94` = two characters: Ã (U+00C3) + a control character (U+0094)

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
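The double-encoding described in this issue can be reproduced in a few lines of Python. This is a minimal sketch rather than anything from the repository: the helper name `iconv_like` is hypothetical, and it assumes Python's `iso8859_14` codec behaves like `iconv -f ISO-8859-14` for these particular bytes.

```python
# Bytes of "CÔTE" as they appear in the original tpcds.dst (already valid UTF-8).
original = bytes([0x43, 0xC3, 0x94, 0x54, 0x45])


def iconv_like(data: bytes, src: str, dst: str) -> bytes:
    """Hypothetical stand-in for `iconv -f src -t dst`: decode, then re-encode."""
    return data.decode(src).encode(dst)


# Misreading the UTF-8 bytes as ISO-8859-14 turns every byte into its own
# character, and each of those characters is then encoded to UTF-8 again.
double_encoded = iconv_like(original, "iso8859_14", "utf-8")

print(original.hex(" "))        # 43 c3 94 54 45
print(double_encoded.hex(" "))  # 43 c3 83 c2 94 54 45

# For this sample, reversing the conversion restores the original bytes.
repaired = iconv_like(double_encoded, "utf-8", "iso8859_14")
print(repaired == original)     # True
```

The `c3 83 c2 94` sequence in the second line of output matches the corrupted bytes reported for tpcds.dst2, which suggests the input needs no conversion at all if it is already UTF-8.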
