gilgenbergg commented on a change in pull request #881: spark wip for review
URL: https://github.com/apache/systemml/pull/881#discussion_r407368006
##########
File path: scripts/staging/slicing/sparked/sparked_slicer.py
##########
@@ -0,0 +1,88 @@
+from pyspark import SparkContext
+
+from slicing.base.SparkedNode import SparkedNode
+from slicing.base.slicer import union, opt_fun
+from slicing.base.top_k import Topk
+from slicing.sparked import sparked_utils
+from slicing.sparked.sparked_utils import update_top_k
+
+
+def join_enum_fun(node_a, list_b, predictions, f_l2, debug, alpha, w, loss_type, cur_lvl, top_k):
+    x_size = len(predictions)
+    nodes = []
+    for node_i in range(len(list_b)):
+        flag = sparked_utils.slice_join_nonsense(node_i, node_a, cur_lvl)
+        if not flag:
+            new_node = SparkedNode(predictions, f_l2)
+            parents_set = set(new_node.parents)
+            parents_set.add(node_i)
+            parents_set.add(node_a)
+            new_node.parents = list(parents_set)
+            parent1_attr = node_a.attributes
+            parent2_attr = list_b[node_i].attributes
+            new_node_attr = union(parent1_attr, parent2_attr)
+            new_node.attributes = new_node_attr
+            new_node.name = new_node.make_name()
+            new_node.calc_bounds(cur_lvl, w)
+            # check whether concrete data should be extracted: only for nodes whose
+            # score upper bound is promising and whose subset is large enough
+            to_slice = new_node.check_bounds(top_k, x_size, alpha)
+            if to_slice:
+                new_node.process_slice(loss_type)
+                new_node.score = opt_fun(new_node.loss, new_node.size, f_l2, x_size, w)
+                # decide, based on its score, whether to keep the node among the current
+                # level's nodes so it can form new combinations on the next level
+                if new_node.check_constraint(top_k, x_size, alpha) and new_node.key not in top_k.keys:
+                    nodes.append(new_node)
+                    top_k.add_new_top_slice(new_node)
+            elif new_node.check_bounds(top_k, x_size, alpha):
+                nodes.append(new_node)

Review comment:
Thank you for the comment, @mboehm7, and for noticing these useless additions (which are currently also present in the base implementation). The algorithm is as follows:
1) Depending on the lower and upper bounds, we decide whether or not to extract the slice's concrete data.
2a) If the data was extracted, we can compute the slice's concrete size, error, and score. If its score is high enough, we add the slice to the next enumeration level, additionally checking whether it can be included in the list of top-k items.
2b) If the data was not extracted, we decide whether to include the slice in the next enumeration level based only on its bounds.
Will be fixed in the next commit!
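To make the discussed control flow easier to follow, here is a minimal, self-contained sketch of steps 1), 2a), and 2b), including the redundant `elif`. Everything in it (`StubNode`, the thresholds, the faked statistics) is invented for illustration and only mimics the roles of `SparkedNode.check_bounds`, `process_slice`, and `check_constraint` from the diff; it is not the actual SystemML implementation.

```python
# Hypothetical sketch of the pruning decision described in the review comment.
# StubNode and all numeric details are made up; the real SparkedNode computes
# its bounds in calc_bounds and its exact statistics in process_slice.

class StubNode:
    def __init__(self, name, score_upper_bound, size_lower_bound):
        self.name = name
        self.score_upper_bound = score_upper_bound  # optimistic score estimate
        self.size_lower_bound = size_lower_bound    # optimistic size estimate
        self.score = None
        self.size = None

    def check_bounds(self, top_k_threshold, min_size):
        # 1) the bounds decide whether the slice's data is extracted at all
        return (self.score_upper_bound > top_k_threshold
                and self.size_lower_bound >= min_size)

    def process_slice(self):
        # 2a) stand-in for real data extraction: fake the exact statistics
        # that extraction would produce
        self.size = self.size_lower_bound
        self.score = 0.9 * self.score_upper_bound

    def check_constraint(self, top_k_threshold, min_size):
        # the exact score and size must still clear the thresholds
        return self.score > top_k_threshold and self.size >= min_size


def enumerate_level(candidates, top_k_threshold, min_size):
    next_level = []
    for node in candidates:
        if node.check_bounds(top_k_threshold, min_size):
            node.process_slice()
            if node.check_constraint(top_k_threshold, min_size):
                next_level.append(node)  # 2a) kept for next-level joins
                # ...the real code would also offer the node to the top-k list
        elif node.check_bounds(top_k_threshold, min_size):
            # 2b) as written in the diff, this branch can never fire: the
            # identical bounds check just returned False, which is exactly
            # the redundancy @mboehm7 pointed out
            next_level.append(node)
    return next_level


if __name__ == "__main__":
    nodes = [StubNode("a", 1.0, 20), StubNode("b", 0.1, 20)]
    print([n.name for n in enumerate_level(nodes, top_k_threshold=0.5, min_size=10)])
```

Because the `elif` re-evaluates the very check that just failed, slices whose data was not extracted are always dropped; the fix announced above would presumably test a separate, weaker enumeration bound in that branch so that step 2b) actually takes effect.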