[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-13 Thread jingyimei
Github user jingyimei closed the pull request at:

https://github.com/apache/madlib/pull/274


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-06 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/274#discussion_r193525716
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, 
long_format=None):
 if not array:
 return "'{{ }}'::{0}".format(array_type)
 else:
-array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
-return (array_str + "::{1}").format(','.join(map(str, array)), 
array_type)
+# For non-character array types:
+if long_format:
+array_str = "ARRAY[ {0} ]";
+return (array_str + "::{1}").format(
+','.join(map(str, array)), array_type)
+else:
+# For character array types, we have to deal with special 
characters in
+# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
+# We firstly use ",,," to join elements in python list and then 
call
+# postgres string_to_array with a delimiter ",,," to form the final
+# psql array, because this sequence of characters will be very
+# unlikely to show up in a real world use case and some special
+# case (such as "M,M") will be handled.
+array_str = "string_to_array($${0}$$, ',,,')"
+return (array_str + "::{1}").format(
+',,,'.join(map(str, array)), array_type)
--- End diff --

Agree


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-06 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/274#discussion_r193525463
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, 
long_format=None):
 if not array:
 return "'{{ }}'::{0}".format(array_type)
 else:
-array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
-return (array_str + "::{1}").format(','.join(map(str, array)), 
array_type)
+# For non-character array types:
+if long_format:
+array_str = "ARRAY[ {0} ]";
+return (array_str + "::{1}").format(
+','.join(map(str, array)), array_type)
+else:
+# For character array types, we have to deal with special 
characters in
+# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
+# We firstly use ",,," to join elements in python list and then 
call
+# postgres string_to_array with a delimiter ",,," to form the final
+# psql array, because this sequence of characters will be very
+# unlikely to show up in a real world use case and some special
+# case (such as "M,M") will be handled.
+array_str = "string_to_array($${0}$$, ',,,')"
+return (array_str + "::{1}").format(
+',,,'.join(map(str, array)), array_type)
--- End diff --

True, but I feel having a single `,` at the end of a string is something 
that can occur easily.  In other words, having a single `,` at the end of a 
string is probably more likely to happen than the test cases we cover in the 
corresponding install check files in this PR.

We don't need a super obscure delimiter, but `,,,` seems too simple.


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-06 Thread jingyimei
Github user jingyimei commented on a diff in the pull request:

https://github.com/apache/madlib/pull/274#discussion_r193503140
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, 
long_format=None):
 if not array:
 return "'{{ }}'::{0}".format(array_type)
 else:
-array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
-return (array_str + "::{1}").format(','.join(map(str, array)), 
array_type)
+# For non-character array types:
+if long_format:
+array_str = "ARRAY[ {0} ]";
+return (array_str + "::{1}").format(
+','.join(map(str, array)), array_type)
+else:
+# For character array types, we have to deal with special 
characters in
+# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
+# We firstly use ",,," to join elements in python list and then 
call
+# postgres string_to_array with a delimiter ",,," to form the final
+# psql array, because this sequence of characters will be very
+# unlikely to show up in a real world use case and some special
+# case (such as "M,M") will be handled.
+array_str = "string_to_array($${0}$$, ',,,')"
+return (array_str + "::{1}").format(
+',,,'.join(map(str, array)), array_type)
--- End diff --

This sounds like a more obscure, but there is always some corner cases we 
can't cover given this approach.


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-06 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/274#discussion_r193489504
  
--- Diff: src/ports/postgres/modules/convex/test/mlp.sql_in ---
@@ -1096,3 +1096,58 @@ FROM
 JOIN pg_catalog.pg_namespace n
 ON n.oid=c.relnamespace
 WHERE c.relname = 'lin_housing_wi_batch_standardization' AND c.relkind='r' 
AND nspname=current_schema();
+
+-- Test special characters both in column name and column values
+DROP TABLE IF EXISTS iris_data_special_char;
+CREATE TABLE iris_data_special_char(
+id serial,
+"rinЖ!#'gs" numeric[],
+"se''x" varchar,
+class integer,
+state varchar
+);
+INSERT INTO iris_data_special_char VALUES
+(1,ARRAY[5.0,3.2,1.2,0.2],'M''M',1,'Alaska'),
--- End diff --

If you change `'M''M'` -> `'M''M,'`, the test will fail. This is because 
the delimiter used internally in the code is `,,,`, and any string that ends 
with a `,` will break it.


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-06-06 Thread njayaram2
Github user njayaram2 commented on a diff in the pull request:

https://github.com/apache/madlib/pull/274#discussion_r193490031
  
--- Diff: src/ports/postgres/modules/utilities/utilities.py_in ---
@@ -296,11 +305,24 @@ def py_list_to_sql_string(array, array_type=None, 
long_format=None):
 if not array:
 return "'{{ }}'::{0}".format(array_type)
 else:
-array_str = "ARRAY[ {0} ]" if long_format else "'{{ {0} }}'"
-return (array_str + "::{1}").format(','.join(map(str, array)), 
array_type)
+# For non-character array types:
+if long_format:
+array_str = "ARRAY[ {0} ]";
+return (array_str + "::{1}").format(
+','.join(map(str, array)), array_type)
+else:
+# For character array types, we have to deal with special 
characters in
+# elements an array, i.e, {'ele''one', "ele,two", "ele$three"} :
+# We firstly use ",,," to join elements in python list and then 
call
+# postgres string_to_array with a delimiter ",,," to form the final
+# psql array, because this sequence of characters will be very
+# unlikely to show up in a real world use case and some special
+# case (such as "M,M") will be handled.
+array_str = "string_to_array($${0}$$, ',,,')"
+return (array_str + "::{1}").format(
+',,,'.join(map(str, array)), array_type)
--- End diff --

Can we have a more obscure delimiter? As mentioned in a comment above, any 
string that ends with a `,` will break this code.
Just a thought, but would something like `,:;^` be an acceptable delimiter?


---


[GitHub] madlib pull request #274: Handling special characters in MLP and Encode Cate...

2018-05-31 Thread jingyimei
GitHub user jingyimei opened a pull request:

https://github.com/apache/madlib/pull/274

Handling special characters in MLP and Encode Categorical Variables

This PR handles special characters and unicode in column name and column 
values in MLP and Encode Categorical Variables modules. Also updated install 
check test cases to cover it.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/madlib/madlib one_hot_encoding_fix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/madlib/pull/274.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #274


commit 1f97cd5a9c9d118e9024049c466e0e6cf44dcdd2
Author: Arvind Sridhar 
Date:   2018-05-24T00:02:43Z

Encode categorical variables: handling special characters

JIRA: MADLIB-1238
JIRA: MADLIB-1243

This commit deals with special characters in column name and column
values. Also adds install check test cases to cover these scenarios.

commit 7d70ac24fbd679c0e5d58ac09bd536e6cc887790
Author: Jingyi Mei 
Date:   2018-05-30T21:16:13Z

MLP: handling special characters

JIRA: MADLIB-1238

This commit deals with special characters in column names and column
values within tables passed into the multilayer perceptron (MLP). We
also added install check cases to cover these scenarios.




---