GitHub user chenlica closed a discussion: Join Operator (from old wiki)
From page https://github.com/apache/texera/wiki/Join-Operator (may be dangling)

====

Author: [Sripad Kowshik Subramanyam](https://www.github.com/sripadks)

## Synopsis

Implement an operator that takes two operators as input and joins their tuples based on constraints specified using a predicate.

## Status

As of 9/25/2016: **COMPLETED**

## Modules

```java
edu.uci.ics.texera.dataflow.common
edu.uci.ics.texera.dataflow.join
```

## Related Issues

https://github.com/Texera/texera/issues/111

## Description

The Join Operator joins the results of two other operators on a given field, based on constraints specified in a join predicate. The field to join on and the constraints to be satisfied are specified using `JoinPredicate`. The `getNextTuple()` method is used to get the next result of the operator.

Currently supported predicates are:

* `JoinDistancePredicate`: Takes in an attribute that specifies the ID, the attribute of the field to perform the join on, and a distance threshold. If the distance between two spans of the field of the results to be joined is within the threshold, the join is performed.

## Example

Given below is a setting and corresponding examples to use `JoinDistancePredicate` (consider the two tuples to be from two different operators).

|        | id | author      | review                                                                                         | spanList                             |
|--------|----|-------------|------------------------------------------------------------------------------------------------|--------------------------------------|
| tuple1 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "book":<6,11>                        |
| tuple2 | 58 | Bruce Wayne | This book gives us a peek into the life of Bruce Wayne when he is not fighting crime as Batman | "gives":<12, 18>, <br> "us":<19, 22> |

Where `<spanStartIndex, spanEndIndex>` represents a span.

If we want to join over the **review** attribute with the condition **within 10 character distance**, we can write:

`JoinDistancePredicate joinPredicate = new JoinDistancePredicate(idAttr, reviewAttr, 10);`

Since both tuples have the same ID, we can perform the join on the two span lists. The span distance is computed as:

`|(span 1 spanStartIndex) - (span 2 spanStartIndex)| OR |(span 1 spanEndIndex) - (span 2 spanEndIndex)|`

Performing the join on the above two tuples gives:

1. The span `"book":<6,11>` from tuple1 and the span `"gives":<12, 18>` from tuple2 satisfy the condition _distance <= threshold_. Therefore, the join combines the two spans into a new span `"book_gives":<6, 18>`.
2. The span `"book":<6,11>` from tuple1 and the span `"us":<19, 22>` from tuple2 don't satisfy the condition, so they are not joined.

## TODOs

* Implement sorting of the spans of the results in order to improve the performance of the operator.
* Implement other kinds of predicates to increase the robustness and utility of the operator.

GitHub link: https://github.com/apache/texera/discussions/3974
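
For readers who want to trace the Example section by hand, the following is a minimal, self-contained sketch of the span-distance check and merge. It is not Texera's implementation: `SpanJoinSketch`, `Span`, `withinDistance`, `join`, and `joinAll` are hypothetical names introduced here. It assumes the "OR" in the distance formula means a pair qualifies when either the start offsets or the end offsets are within the threshold, that the merged span runs from the smaller start to the larger end, and that the two tuples have already been matched on ID.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only. Span and SpanJoinSketch are hypothetical names,
// not classes from the Texera code base.
final class SpanJoinSketch {

    /** A labelled character span, mirroring "key":<start, end> in the table. */
    static final class Span {
        final String key;
        final int start;
        final int end;

        Span(String key, int start, int end) {
            this.key = key;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + key + "\":<" + start + ", " + end + ">";
        }
    }

    /** Assumed reading of the formula: start OR end offsets within threshold. */
    static boolean withinDistance(Span a, Span b, int threshold) {
        return Math.abs(a.start - b.start) <= threshold
                || Math.abs(a.end - b.end) <= threshold;
    }

    /** Merges two qualifying spans; assumes the result covers both inputs. */
    static Optional<Span> join(Span a, Span b, int threshold) {
        if (!withinDistance(a, b, threshold)) {
            return Optional.empty();
        }
        return Optional.of(new Span(a.key + "_" + b.key,
                Math.min(a.start, b.start), Math.max(a.end, b.end)));
    }

    /** Nested-loop join over two span lists from tuples that share an ID. */
    static List<Span> joinAll(List<Span> left, List<Span> right, int threshold) {
        List<Span> result = new ArrayList<>();
        for (Span l : left) {
            for (Span r : right) {
                join(l, r, threshold).ifPresent(result::add);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Span> tuple1 = List.of(new Span("book", 6, 11));
        List<Span> tuple2 = List.of(new Span("gives", 12, 18), new Span("us", 19, 22));
        // prints ["book_gives":<6, 18>]
        System.out.println(joinAll(tuple1, tuple2, 10));
    }
}
```

With a threshold of 10, "book" and "gives" differ by 6 at the start and 7 at the end, so they merge; "book" and "us" differ by 13 and 11, so they do not. The nested loop compares every pair of spans, which is the cost the first TODO item (sorting the spans) aims to reduce.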
