khannaekta commented on code in PR #607: URL: https://github.com/apache/madlib/pull/607#discussion_r1297800705
########## src/ports/postgres/modules/assoc_rules/assoc_rules.py_in: ########## @@ -380,29 +380,33 @@ def assoc_rules(madlib_schema, support, confidence, tid_col, # The iterp1 check ensures that the new itemset is of a certain size. # At every iteration, we increase the target itemset size by one. - plpy.execute("ALTER SEQUENCE {assoc_loop_aux}_id_seq RESTART WITH 1".format(**locals())); plpy.execute("TRUNCATE TABLE {assoc_loop_aux}".format(**locals())); - plpy.execute(""" - INSERT INTO {assoc_loop_aux}(set_list, support, tids) - SELECT DISTINCT ON({madlib_schema}.svec_to_string(set_list)) set_list, - {madlib_schema}.svec_l1norm(tids)::FLOAT8 / {num_tranx}, - tids - FROM ( - SELECT - {madlib_schema}.svec_minus( - {madlib_schema}.svec_plus(t1.set_list, t3.set_list), - {madlib_schema}.svec_mult(t1.set_list, t3.set_list) - ) as set_list, - {madlib_schema}.svec_mult(t1.tids, t3.tids) as tids - FROM {assoc_rule_sets_loop_ordered} t1, - {assoc_rule_sets_loop_ordered} t3 - WHERE t1.newrownum < t3.newrownum AND - t3.newrownum-t1.newrownum <= {num_products_threshold} - ) t - WHERE {madlib_schema}.svec_l1norm(set_list)::INT = {iterp1} AND - {madlib_schema}.svec_l1norm(tids)::FLOAT8 >= {min_supp_tranx} + + sql = """ + INSERT INTO {assoc_loop_aux}(id, set_list, support, tids) + SELECT row_number() OVER (), * Review Comment: I understand the reason for adding `row_number()` here but the only concern I have is that it would generate a plan that will gather data on the coordinator which may have performance slow down with growing data. we can probably rewrite this query to add rownumbers without using this window function(similar to how we do it for DL's minibatch preprocessor commit: 85eba2968b5651b6242a398df569b1f3f6412703) but that would make this too complicated. Would like to know if using serial datatype and setting the cache to 1 seems better over the performance. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@madlib.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org