gilgenbergg commented on a change in pull request #881: spark wip for review
URL: https://github.com/apache/systemml/pull/881#discussion_r407368006
##########
File path: scripts/staging/slicing/sparked/sparked_slicer.py
##########
@@ -0,0 +1,88 @@
+from pyspark import SparkContext
+
+from slicing.base.SparkedNode import SparkedNode
+from slicing.base.slicer import union, opt_fun
+from slicing.base.top_k import Topk
+from slicing.sparked import sparked_utils
+from slicing.sparked.sparked_utils import update_top_k
+
+
+def join_enum_fun(node_a, list_b, predictions, f_l2, debug, alpha, w, loss_type, cur_lvl, top_k):
+    x_size = len(predictions)
+    nodes = []
+    for node_i in range(len(list_b)):
+        flag = sparked_utils.slice_join_nonsense(node_i, node_a, cur_lvl)
+        if not flag:
+            new_node = SparkedNode(predictions, f_l2)
+            parents_set = set(new_node.parents)
+            parents_set.add(node_i)
+            parents_set.add(node_a)
+            new_node.parents = list(parents_set)
+            parent1_attr = node_a.attributes
+            parent2_attr = list_b[node_i].attributes
+            new_node_attr = union(parent1_attr, parent2_attr)
+            new_node.attributes = new_node_attr
+            new_node.name = new_node.make_name()
+            new_node.calc_bounds(cur_lvl, w)
+            # check whether concrete data should be extracted: only for nodes whose
+            # score upper bound is promising and whose subset is large enough
+            to_slice = new_node.check_bounds(top_k, x_size, alpha)
+            if to_slice:
+                new_node.process_slice(loss_type)
+                new_node.score = opt_fun(new_node.loss, new_node.size, f_l2, x_size, w)
+                # decide, based on its score, whether to keep the node among the current
+                # level's nodes so it can form new combinations on the next level
+                if new_node.check_constraint(top_k, x_size, alpha) and new_node.key not in top_k.keys:
+                    nodes.append(new_node)
+                    top_k.add_new_top_slice(new_node)
+            elif new_node.check_bounds(top_k, x_size, alpha):
+                nodes.append(new_node)

Review comment:
Thank you for the comment, @mboehm7, and for noticing these useless additions (which are currently also present in the base implementation). The algorithm is as follows:
1) Depending on the lower and upper bounds, we decide whether or not to extract the slice's concrete data.
2a) If the data was extracted, we can compute the slice's concrete size, error, and score. If its score is high enough, we add the slice to the next enumeration level, additionally checking whether it can be included in the list of top-k items.
2b) If the data was not extracted, we decide whether to include the slice in the next enumeration level based only on its bounds.
Will be fixed in the next commit!
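To make the discussed control flow easier to follow, here is a minimal, self-contained sketch of steps 1), 2a), and 2b), including the redundant `elif`. Everything in it (`StubNode`, the thresholds, the faked statistics) is invented for illustration and only mimics the roles of `SparkedNode.check_bounds`, `process_slice`, and `check_constraint` from the diff; it is not the actual SystemML implementation.

```python
# Hypothetical sketch of the pruning decision described in the review comment.
# StubNode and all numeric details are made up; the real SparkedNode computes
# its bounds in calc_bounds and its exact statistics in process_slice.

class StubNode:
    def __init__(self, name, score_upper_bound, size_lower_bound):
        self.name = name
        self.score_upper_bound = score_upper_bound  # optimistic score estimate
        self.size_lower_bound = size_lower_bound    # optimistic size estimate
        self.score = None
        self.size = None

    def check_bounds(self, top_k_threshold, min_size):
        # 1) the bounds decide whether the slice's data is extracted at all
        return (self.score_upper_bound > top_k_threshold
                and self.size_lower_bound >= min_size)

    def process_slice(self):
        # 2a) stand-in for real data extraction: fake the exact statistics
        # that extraction would produce
        self.size = self.size_lower_bound
        self.score = 0.9 * self.score_upper_bound

    def check_constraint(self, top_k_threshold, min_size):
        # the exact score and size must still clear the thresholds
        return self.score > top_k_threshold and self.size >= min_size


def enumerate_level(candidates, top_k_threshold, min_size):
    next_level = []
    for node in candidates:
        if node.check_bounds(top_k_threshold, min_size):
            node.process_slice()
            if node.check_constraint(top_k_threshold, min_size):
                next_level.append(node)  # 2a) kept for next-level joins
                # ...the real code would also offer the node to the top-k list
        elif node.check_bounds(top_k_threshold, min_size):
            # 2b) as written in the diff, this branch can never fire: the
            # identical bounds check just returned False, which is exactly
            # the redundancy @mboehm7 pointed out
            next_level.append(node)
    return next_level


if __name__ == "__main__":
    nodes = [StubNode("a", 1.0, 20), StubNode("b", 0.1, 20)]
    print([n.name for n in enumerate_level(nodes, top_k_threshold=0.5, min_size=10)])
```

Because the `elif` re-evaluates the very check that just failed, slices whose data was not extracted are always dropped; the fix announced above would presumably test a separate, weaker enumeration bound in that branch so that step 2b) actually takes effect.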