[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user jingyimei closed the pull request at: https://github.com/apache/madlib/pull/274 ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/274#discussion_r193525716 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None): if not array: return "'{{ }}'::{0}".format(array_type) else: -array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'" -return (array_str + "::{1}").format(','.join(map(str, array)), array_type) +# For non-character array types: +if long_format: +array_str = "ARRAY[ {0} ]"; +return (array_str + "::{1}").format( +','.join(map(str, array)), array_type) +else: +# For character array types, we have to deal with special characters in +# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} : +# We firstly use ",,," to join elements in python list and then call +# postgres string_to_array with a delimiter ",,," to form the final +# psql array, because this sequence of characters will be very +# unlikely to show up in a real world use case and some special +# case (such as "M,M") will be handled. +array_str = "string_to_array($${0}$$, ',,,')" +return (array_str + "::{1}").format( +',,,'.join(map(str, array)), array_type) --- End diff -- Agree ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/274#discussion_r193525463 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None): if not array: return "'{{ }}'::{0}".format(array_type) else: -array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'" -return (array_str + "::{1}").format(','.join(map(str, array)), array_type) +# For non-character array types: +if long_format: +array_str = "ARRAY[ {0} ]"; +return (array_str + "::{1}").format( +','.join(map(str, array)), array_type) +else: +# For character array types, we have to deal with special characters in +# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} : +# We firstly use ",,," to join elements in python list and then call +# postgres string_to_array with a delimiter ",,," to form the final +# psql array, because this sequence of characters will be very +# unlikely to show up in a real world use case and some special +# case (such as "M,M") will be handled. +array_str = "string_to_array($${0}$$, ',,,')" +return (array_str + "::{1}").format( +',,,'.join(map(str, array)), array_type) --- End diff -- True, but I feel having a single `,` at the end of a string is something that can occur easily. In other words, having a single `,` at the end of a string is probably more likely to happen than the test cases we cover in the corresponding install check files in this PR. We don't need a super obscure delimiter, but `,,,` seems too simple. ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user jingyimei commented on a diff in the pull request: https://github.com/apache/madlib/pull/274#discussion_r193503140 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None): if not array: return "'{{ }}'::{0}".format(array_type) else: -array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'" -return (array_str + "::{1}").format(','.join(map(str, array)), array_type) +# For non-character array types: +if long_format: +array_str = "ARRAY[ {0} ]"; +return (array_str + "::{1}").format( +','.join(map(str, array)), array_type) +else: +# For character array types, we have to deal with special characters in +# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} : +# We firstly use ",,," to join elements in python list and then call +# postgres string_to_array with a delimiter ",,," to form the final +# psql array, because this sequence of characters will be very +# unlikely to show up in a real world use case and some special +# case (such as "M,M") will be handled. +array_str = "string_to_array($${0}$$, ',,,')" +return (array_str + "::{1}").format( +',,,'.join(map(str, array)), array_type) --- End diff -- This sounds like a more obscure, but there is always some corner cases we can't cover given this approach. ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/274#discussion_r193489504 --- Diff: src/ports/postgres/modules/convex/test/mlp.sql_in --- @@ -1096,3 +1096,58 @@ FROM JOIN pg_catalog.pg_namespace n ON n.oid=c.relnamespace WHERE c.relname = 'lin_housing_wi_batch_standardization' AND c.relkind='r' AND nspname=current_schema(); + +-- Test special characters both in column name and column values +DROP TABLE IF EXISTS iris_data_special_char; +CREATE TABLE iris_data_special_char( +id serial, +"rinÐ!#'gs" numeric[], +"se''x" varchar, +class integer, +state varchar +); +INSERT INTO iris_data_special_char VALUES +(1,ARRAY[5.0,3.2,1.2,0.2],'M''M',1,'Alaska'), --- End diff -- If you change `'M''M'` -> `'M''M,'`, the test will fail. This is because the delimiter used internally in the code is `,,,`, and any string that ends with a `,` will break it. ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
Github user njayaram2 commented on a diff in the pull request: https://github.com/apache/madlib/pull/274#discussion_r193490031 --- Diff: src/ports/postgres/modules/utilities/utilities.py_in --- @@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, long_format=None): if not array: return "'{{ }}'::{0}".format(array_type) else: -array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'" -return (array_str + "::{1}").format(','.join(map(str, array)), array_type) +# For non-character array types: +if long_format: +array_str = "ARRAY[ {0} ]"; +return (array_str + "::{1}").format( +','.join(map(str, array)), array_type) +else: +# For character array types, we have to deal with special characters in +# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} : +# We firstly use ",,," to join elements in python list and then call +# postgres string_to_array with a delimiter ",,," to form the final +# psql array, because this sequence of characters will be very +# unlikely to show up in a real world use case and some special +# case (such as "M,M") will be handled. +array_str = "string_to_array($${0}$$, ',,,')" +return (array_str + "::{1}").format( +',,,'.join(map(str, array)), array_type) --- End diff -- Can we have a more obscure delimiter? As mentioned in a comment above, any string that ends with a `,` will break this code. Just a thought, but would something like `,:;^` be an acceptable delimiter? ---
[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...
GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/274 Handling special characters in MLP and Encode Categorical Variables This PR handles special characters and unicode in column name and column values in MLP and Encode Categorical Variables modules. Also updated install check test cases to cover it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/madlib/madlib one_hot_encoding_fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/274.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #274 commit 1f97cd5a9c9d118e9024049c466e0e6cf44dcdd2 Author: Arvind Sridhar Date: 2018-05-24T00:02:43Z Encode categorical variables: handling special characters JIRA: MADLIB-1238 JIRA: MADLIB-1243 This commit deals with special characters in column name and column values. Also adds install check test cases to cover these scenarios. commit 7d70ac24fbd679c0e5d58ac09bd536e6cc887790 Author: Jingyi Mei Date: 2018-05-30T21:16:13Z MLP: handling special characters JIRA: MADLIB-1238 This commit deals with special characters in column names and column values within tables passed into the multilayer perceptron (MLP). We also added install check cases to cover these scenarios. ---