[GitHub] [madlib] khannaekta commented on a diff in pull request #607: Various: Reduce SERIAL usage

via GitHub Thu, 17 Aug 2023 15:15:11 -0700


khannaekta commented on code in PR #607:
URL: https://github.com/apache/madlib/pull/607#discussion_r1297800705



##########
src/ports/postgres/modules/assoc_rules/assoc_rules.py_in:
##########
@@ -380,29 +380,33 @@ def assoc_rules(madlib_schema, support, confidence, 
tid_col,
         # The iterp1 check ensures that the new itemset is of a certain size.
         # At every iteration, we increase the target itemset size by one.
 
-        plpy.execute("ALTER SEQUENCE {assoc_loop_aux}_id_seq RESTART WITH 
1".format(**locals()));
         plpy.execute("TRUNCATE TABLE {assoc_loop_aux}".format(**locals()));
-        plpy.execute("""
-           INSERT INTO {assoc_loop_aux}(set_list, support, tids)
-           SELECT DISTINCT ON({madlib_schema}.svec_to_string(set_list)) 
set_list,
-                   {madlib_schema}.svec_l1norm(tids)::FLOAT8 / {num_tranx},
-                   tids
-           FROM (
-             SELECT
-                {madlib_schema}.svec_minus(
-                    {madlib_schema}.svec_plus(t1.set_list, t3.set_list),
-                    {madlib_schema}.svec_mult(t1.set_list, t3.set_list)
-                ) as set_list,
-                {madlib_schema}.svec_mult(t1.tids, t3.tids) as tids
-             FROM {assoc_rule_sets_loop_ordered} t1,
-                  {assoc_rule_sets_loop_ordered} t3
-             WHERE t1.newrownum < t3.newrownum AND
-                   t3.newrownum-t1.newrownum <= {num_products_threshold}
-           ) t
-           WHERE {madlib_schema}.svec_l1norm(set_list)::INT = {iterp1} AND
-                 {madlib_schema}.svec_l1norm(tids)::FLOAT8 >= {min_supp_tranx}
+
+        sql = """
+            INSERT INTO {assoc_loop_aux}(id, set_list, support, tids)
+            SELECT row_number() OVER (), *

Review Comment:
   I understand the reason for adding `row_number()` here but the only concern 
I have is that it would generate a plan that will gather data on the 
coordinator which may have performance slow down with growing data.
   we can probably rewrite this query to add rownumbers without using this 
window function(similar to how we do it for DL's minibatch preprocessor commit: 
85eba2968b5651b6242a398df569b1f3f6412703) but that would make this too 
complicated. Would like to know if using serial datatype and setting the cache 
to 1 seems better over the performance.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@madlib.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [madlib] khannaekta commented on a diff in pull request #607: Various: Reduce SERIAL usage

Reply via email to