Leonidas Fegaras created MRQL-56:
------------------------------------
Summary: Improve total aggregations and repetitions with shared
results in Flink mode
Key: MRQL-56
URL: https://issues.apache.org/jira/browse/MRQL-56
Project: MRQL
Issue Type: Improvement
Components: Run-Time/Flink
Affects Versions: 0.9.4
Reporter: Leonidas Fegaras
Assignee: Leonidas Fegaras
Attachments: MRQL-56.patch
The following patch improves the Flink evaluation mode in two cases:
1. Improves total aggregations by allowing MapAggregateReduce and
MapAggregateReduce2 operations to perform the aggregation at the groupBy stage,
thus generating one aggregation result per node. Previously, the total
aggregation was performed after the groupBy, which was inefficient. This
problem was reported by Eldon Carman.
2. Re-implements repetitions whose result must be shared by all nodes, such as
the centroids in the kmeans algorithm. Previously, it used a loop that evaluated
the repetition body as a detached query. Now it uses Flink iterations, as
described in the KMeans.java example in the Flink codebase (broadcasting the
shared results to the nodes). The kmeans query is now 7 times faster than before
and about 2.5 times faster than Spark. Unfortunately, due to a Flink bug, this loop
ignores the stopping condition.
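
To illustrate the idea behind item 1, here is a minimal plain-Java sketch of a two-stage total aggregation: each node folds its own partition into one partial value, and only those per-node partials are merged into the total. It does not use the Flink or MRQL APIs and is not the code in the patch. The class name TotalAggregationSketch and its methods are hypothetical, and the sketch assumes the accumulator is associative with "zero" as its identity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.LongBinaryOperator;

/** Plain-Java illustration only; hypothetical names, no Flink or MRQL calls. */
final class TotalAggregationSketch {

    /** Stage 1: a "node" folds its own partition into a single partial value. */
    static long aggregatePartition(List<Long> partition, long zero, LongBinaryOperator acc) {
        long partial = zero;
        for (long value : partition) {
            partial = acc.applyAsLong(partial, value);
        }
        return partial;
    }

    /** Stage 2: merge the partial results (one per node) into the total. */
    static long totalAggregate(List<List<Long>> partitions, long zero, LongBinaryOperator acc) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> partition : partitions) {
            partials.add(aggregatePartition(partition, zero, acc));
        }
        long total = zero;
        for (long partial : partials) {
            total = acc.applyAsLong(total, partial);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three partitions stand in for three nodes.
        List<List<Long>> partitions = Arrays.asList(
            Arrays.asList(1L, 2L, 3L),
            Arrays.asList(4L, 5L),
            Arrays.asList(6L));
        System.out.println(totalAggregate(partitions, 0L, Long::sum)); // prints 21
    }
}
```

Only one value per partition reaches the final merge, which is the property item 1 describes as "one aggregation result per node".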
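
The data flow behind item 2 can be sketched in plain Java as follows: in every step, all partitions read the same current centroids (the shared result), return per-centroid partial sums and counts, and the merged partials become the centroids for the next step. This only mimics the pattern and is not the Flink iteration or broadcast API, nor the code in the patch. The class SharedCentroidIterationSketch, the 1-D points, and the fixed step count are assumptions made for brevity.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only; hypothetical names, no Flink or MRQL calls. */
final class SharedCentroidIterationSketch {

    /** Index of the centroid closest to point x (1-D points for brevity). */
    static int nearest(double x, double[] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (Math.abs(x - centroids[i]) < Math.abs(x - centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    /** One step: every partition sees the same centroids and contributes partial sums. */
    static double[] step(List<double[]> partitions, double[] centroids) {
        int k = centroids.length;
        double[] sum = new double[k];
        long[] count = new long[k];
        for (double[] partition : partitions) {
            // Work done locally on one "node" using the shared centroids.
            double[] localSum = new double[k];
            long[] localCount = new long[k];
            for (double x : partition) {
                int c = nearest(x, centroids);
                localSum[c] += x;
                localCount[c]++;
            }
            // Merge this node's partials into the global result.
            for (int i = 0; i < k; i++) {
                sum[i] += localSum[i];
                count[i] += localCount[i];
            }
        }
        double[] next = new double[k];
        for (int i = 0; i < k; i++) {
            // Keep the old centroid if no point was assigned to it.
            next[i] = count[i] == 0 ? centroids[i] : sum[i] / count[i];
        }
        return next;
    }

    /** Runs a fixed number of steps (an assumption of this sketch). */
    static double[] iterate(List<double[]> partitions, double[] initial, int steps) {
        double[] centroids = initial.clone();
        for (int s = 0; s < steps; s++) {
            centroids = step(partitions, centroids);
        }
        return centroids;
    }

    public static void main(String[] args) {
        List<double[]> partitions = Arrays.asList(
            new double[] {1.0, 2.0, 9.0},
            new double[] {3.0, 10.0, 11.0});
        double[] result = iterate(partitions, new double[] {1.0, 9.0}, 5);
        System.out.println(Arrays.toString(result)); // prints [2.0, 10.0]
    }
}
```

The sketch stops after a fixed number of steps for simplicity; it does not model a convergence-based stopping condition.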