[jira] [Work logged] (BEAM-7545) Row Count Estimation for CSV TextTable

ASF GitHub Bot (JIRA) Mon, 15 Jul 2019 22:46:10 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-7545?focusedWorklogId=277230&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-277230
 ]


ASF GitHub Bot logged work on BEAM-7545:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Jul/19 05:45
            Start Date: 16/Jul/19 05:45
    Worklog Time Spent: 10m 
      Work Description: amaliujia commented on pull request #9040: [BEAM-7545] 
Reordering Beam Joins
URL: https://github.com/apache/beam/pull/9040#discussion_r303734439
 
 

 ##########
 File path: 
sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java
 ##########
 @@ -103,6 +106,10 @@
 
           // join rules
           JoinPushExpressionsRule.INSTANCE,
+          JoinCommuteRule.INSTANCE,
 
 Review comment:
   Because this PR implements join reordering, a useful test would be one in 
which a three-way join is reordered. As join reordering is already measured in 
https://docs.google.com/document/d/1DM_bcfFbIoc_vEoqQxhC7AvHBUDVCAwToC8TYGukkII/edit#,
 wouldn't it be straightforward to have a similar test? Without such a test, 
how do we know that join reordering is working?
   
   In terms of checking output plans, Flink has been doing many tests on 
Calcite optimization rules (see 
[here](https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-blink/src/test/scala/org/apache/flink/table/plan/rules/logical/RewriteMultiJoinConditionRuleTest.scala)
 and 
[here](https://github.com/apache/flink/tree/master/flink-table/flink-table-planner-blink/src/test/resources/org/apache/flink/table/plan)).
 Flink's practice has shown that verifying the output plan is deterministic 
and stable.
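   
   As a rough illustration of what plan verification can look like, the 
sketch below compares a plan's text against a stored expected plan. The class 
`PlanVerifier` and its methods are hypothetical names chosen for this example; 
they are not part of Beam, Calcite, or Flink, and the normalization shown 
(dropping blank lines and trailing whitespace) is only an assumption about 
which differences a test should ignore.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of Beam, Calcite, or Flink.
final class PlanVerifier {
  private PlanVerifier() {}

  // Strips trailing whitespace from each line and drops blank lines.
  // Leading indentation is kept because it encodes the plan's tree structure.
  static String normalize(String plan) {
    return Arrays.stream(plan.split("\\R"))
        .map(line -> line.replaceAll("\\s+$", ""))
        .filter(line -> !line.isEmpty())
        .collect(Collectors.joining("\n"));
  }

  // Throws AssertionError when the two plans differ after normalization.
  static void verifyPlan(String actualPlan, String expectedPlan) {
    String actual = normalize(actualPlan);
    String expected = normalize(expectedPlan);
    if (!actual.equals(expected)) {
      throw new AssertionError(
          "Plan mismatch.\nExpected:\n" + expected + "\nActual:\n" + actual);
    }
  }

  public static void main(String[] args) {
    // The plan text here is made up; a real test would load it from a file.
    String expected = "Join\n  Scan(a)\n  Scan(b)";
    String actual = "Join\n  Scan(a)  \n\n  Scan(b)\n";
    verifyPlan(actual, expected);
    System.out.println("plans match");
  }
}
```

   Because the comparison is purely textual, a test built this way fails as 
soon as the optimizer produces a different tree, including a different join 
order.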
   
   The basic idea is that if you want to test an optimization, you enable only 
the relevant rules in the test case (so you know which rules are hit). With 
Flink's approach, you can test rules as long as they can be enabled and 
disabled independently:
   
   In 
[BeamRuleSets.java](https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamRuleSets.java)
   ```
   // Rules that together enable join reordering.
   static final RuleSet BeamJoinReorderingRelSet =
       RuleSets.ofList(
           JoinCommuteRule.INSTANCE,
           JoinAssociateRule.INSTANCE);
   ```
   
   In JoinReorderingTest.java
   ```
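   // Pseudocode: createTestConfig, PlanLoader, verifyPlan, and getPlan are
   // placeholders for test helpers, not existing APIs.
   FrameworkConfig testConfig = createTestConfig(BeamJoinReorderingRelSet);
   Planner testPlanner = Frameworks.getPlanner(testConfig);
   
   // set up input tables
   
   String expectedPlan = PlanLoader.load(testcase.class);
   verifyPlan(testPlanner.getPlan(sql), expectedPlan);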
   ```
   
   By doing so, if a relevant rule is disabled (e.g. 
`JoinCommuteRule.INSTANCE`), the existing join reordering tests will break, 
which guards join reordering for us. It also demonstrates that this PR 
actually performs join reordering.
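   
   The guarding effect can be sketched with a toy model. The code below does 
not use Calcite; `ToyJoinRules`, `Node`, and `reachablePlans` are hypothetical 
names, and the two lambdas only imitate what commute and associate rewrites do 
to a join tree. It enumerates every plan reachable from `((A JOIN B) JOIN C)` 
under a given set of rules, which is a simplification of how a real planner 
explores alternatives.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

// Toy model only: hypothetical classes that do not use Calcite.
final class ToyJoinRules {
  private ToyJoinRules() {}

  /** A plan node: a table scan (table != null) or a join of two inputs. */
  static final class Node {
    final String table;
    final Node left;
    final Node right;

    Node(String table) { this.table = table; this.left = null; this.right = null; }
    Node(Node left, Node right) { this.table = null; this.left = left; this.right = right; }

    boolean isJoin() { return table == null; }

    @Override public String toString() {
      return isJoin() ? "(" + left + " JOIN " + right + ")" : table;
    }
  }

  // A rule returns the rewritten node, or null when it does not match.
  // Commute: (l JOIN r) becomes (r JOIN l).
  static final UnaryOperator<Node> COMMUTE =
      n -> n.isJoin() ? new Node(n.right, n.left) : null;

  // Associate: ((a JOIN b) JOIN c) becomes (a JOIN (b JOIN c)).
  static final UnaryOperator<Node> ASSOCIATE =
      n -> n.isJoin() && n.left.isJoin()
          ? new Node(n.left.left, new Node(n.left.right, n.right))
          : null;

  // Returns every plan obtained by applying the rule once at any node.
  static List<Node> applyAnywhere(Node n, UnaryOperator<Node> rule) {
    List<Node> out = new ArrayList<>();
    Node atRoot = rule.apply(n);
    if (atRoot != null) {
      out.add(atRoot);
    }
    if (n.isJoin()) {
      for (Node l : applyAnywhere(n.left, rule)) {
        out.add(new Node(l, n.right));
      }
      for (Node r : applyAnywhere(n.right, rule)) {
        out.add(new Node(n.left, r));
      }
    }
    return out;
  }

  // Breadth-first search over all plans reachable with the enabled rules.
  @SafeVarargs
  static Set<String> reachablePlans(Node start, UnaryOperator<Node>... rules) {
    Set<String> seen = new TreeSet<>();
    Deque<Node> queue = new ArrayDeque<>();
    seen.add(start.toString());
    queue.add(start);
    while (!queue.isEmpty()) {
      Node current = queue.poll();
      for (UnaryOperator<Node> rule : rules) {
        for (Node next : applyAnywhere(current, rule)) {
          if (seen.add(next.toString())) {
            queue.add(next);
          }
        }
      }
    }
    return seen;
  }

  public static void main(String[] args) {
    Node start = new Node(new Node(new Node("A"), new Node("B")), new Node("C"));
    System.out.println(reachablePlans(start, COMMUTE));
    System.out.println(reachablePlans(start, COMMUTE, ASSOCIATE));
  }
}
```

   In this toy, commute alone reaches only the four plans that still join A 
with B first, while commute plus associate reaches all twelve orderings, 
including `(A JOIN (B JOIN C))`. A test that expects one of the latter plans 
would therefore fail if the associate rewrite were dropped from the rule set.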
   
   Because we are starting the effort to add more optimization rules to 
BeamSQL, Flink's testing practice is a great example that we can learn from 
and apply to BeamSQL to keep our codebase healthy.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 277230)
    Time Spent: 7h 50m  (was: 7h 40m)

> Row Count Estimation for CSV TextTable
> --------------------------------------
>
>                 Key: BEAM-7545
>                 URL: https://issues.apache.org/jira/browse/BEAM-7545
>             Project: Beam
>          Issue Type: New Feature
>          Components: dsl-sql
>            Reporter: Alireza Samadianzakaria
>            Assignee: Alireza Samadianzakaria
>            Priority: Major
>             Fix For: Not applicable
>
>          Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Implementing Row Count Estimation for CSV Tables by reading the first few 
> lines of the file and estimating the number of records based on the length of 
> these lines and the total length of the file.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
