[jira] [Updated] (PHOENIX-5362) Mappers should use the queryPlan from the driver rather than regenerating the plan

Chinmay Kulkarni (Jira) Thu, 19 Sep 2019 17:56:09 -0700


     [ 
https://issues.apache.org/jira/browse/PHOENIX-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chinmay Kulkarni updated PHOENIX-5362:
--------------------------------------
    Description: 
Currently, PhoenixInputFormat#getQueryPlan already generates a queryPlan and we 
use this plan to get the scans and splits for the MR job. In 
PhoenixInputFormat#createRecordReader, which is called inside each mapper, we 
create a queryPlan again and pass it to the PhoenixRecordReader instance.

There are multiple problems with this approach:

# The mappers already have information about the scans from the driver code. We 
potentially just need to wrap these scans in an iterator and create a 
subsequent ResultSet.
# The mappers don't need most of the information embedded within a queryPlan, 
so they shouldn't need to regenerate the plan.
# There are corner cases that can occur if we replan the query in each 
mapper. For example, an index could be created, or metadata could otherwise 
change, between when the MR job was created and when the mappers actually 
launch. In this case, the mappers have the scans created for the first 
queryPlan, but they will use iterators created for the second queryPlan. The 
issued scans would then not match the queryPlan embedded in the mappers' 
iterators/ResultSet. We could potentially miss some scans, or look for more 
than we actually require, since we check the original scans for this size. The 
resolved table would be as per the new queryPlan, and there could be a mismatch 
here as well (considering the index creation case). There are potentially other 
repercussions of intermediate metadata changes as well.
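
As a rough illustration of the first point (wrapping the scans a mapper already 
has in an iterator), here is a minimal sketch. It is not Phoenix code: 
ScanChainIterator, the type parameter S (standing in for a scan received via the 
split) and the opener function (standing in for whatever turns one scan into 
rows) are all hypothetical names chosen for this example.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch: S stands in for a scan handed to the mapper through its
// split, R for a row type. Neither is a real Phoenix or HBase class.
final class ScanChainIterator<S, R> implements Iterator<R> {
    private final Iterator<S> scans;
    private final Function<S, Iterator<R>> opener;
    private Iterator<R> current = null;

    ScanChainIterator(List<S> scans, Function<S, Iterator<R>> opener) {
        this.scans = scans.iterator();
        this.opener = opener;
    }

    @Override
    public boolean hasNext() {
        // Advance to the next scan that still has rows, opening scans lazily.
        while (current == null || !current.hasNext()) {
            if (!scans.hasNext()) {
                return false;
            }
            current = opener.apply(scans.next());
        }
        return true;
    }

    @Override
    public R next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

The point of the sketch is only that the mapper iterates over exactly the scans 
the driver produced, so nothing on the mapper side depends on a second plan.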

Serializing a subset of the information (like the projector, which iterator to 
use, etc.) of a QueryPlan and passing it from the driver to the mappers without 
having them regenerate the plans seems like the best way forward.
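
To make the proposal concrete, here is a minimal sketch of what passing a subset 
of plan information from the driver to the mappers could look like. Everything 
in it is an assumption for illustration: PlanSummary and its fields (tableName, 
projectedColumns, iteratorKind) are hypothetical and do not exist in Phoenix, 
and the Base64 string stands in for a value the driver might place in the job 
configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical holder for the subset of plan state a mapper might need.
// None of these names are Phoenix APIs; they are illustrative only.
final class PlanSummary {
    final String tableName;
    final List<String> projectedColumns;
    final String iteratorKind;

    PlanSummary(String tableName, List<String> projectedColumns, String iteratorKind) {
        this.tableName = tableName;
        this.projectedColumns = new ArrayList<>(projectedColumns);
        this.iteratorKind = iteratorKind;
    }

    // Driver side: encode once into a string suitable for a configuration value.
    String encode() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(tableName);
            out.writeInt(projectedColumns.size());
            for (String column : projectedColumns) {
                out.writeUTF(column);
            }
            out.writeUTF(iteratorKind);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Mapper side: rebuild the summary from the string, with no replanning.
    static PlanSummary decode(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw))) {
            String table = in.readUTF();
            int count = in.readInt();
            List<String> columns = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                columns.add(in.readUTF());
            }
            String kind = in.readUTF();
            return new PlanSummary(table, columns, kind);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the mapper only decodes what the driver wrote, a metadata change after 
job submission cannot alter the state the mapper works from. Which fields 
actually belong in such a summary is an open question for the real change.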

  was:
Currently, PhoenixInputFormat#getQueryPlan already generates a queryPlan and we 
use this plan to get the scans and splits for the MR job. In 
PhoenixInputFormat#createRecordReader which is called inside each mapper, we 
again create a queryPlan and pass this to the PhoenixRecordReader instance.

There are multiple problems with this approach:

# The mappers already have information about the scans from the driver code. We 
potentially just need to wrap these scans in an iterator and create a 
subsequent ResultSet.
# The mappers don't need most of the information embedded within a queryPlan, 
so they shouldn't need to regenerate the plan.
# There are weird corner cases that can occur if we replan the query in each 
mapper. For ex: If there is an index creation or metadata change in between 
when the MR job was created, and when the Mappers actually launch. In this 
case, the mappers have the scans created for the first queryPlan, but the 
mappers will use iterators created for the second queryPlan. In such cases, the 
issued scans would not match the queryPlan embedded in the mappers' 
iterators/ResultSet. We could potentially miss some scans/be looking for more 
than we actually require since we check the original scans for this size. The 
resolved table would be as per the new queryPlan, and there could be a mismatch 
here as well (considering the index creation case you mentioned). There are 
potentially other repercussions in case of intermediary metadata changes as 
well.

Serializing a subset of the information (like the projector, which iterator to 
use, etc.) of a QueryPlan and passing it from the driver to the mappers without 
having them regenerate the plans seems like the best way forward.


> Mappers should use the queryPlan from the driver rather than regenerating the 
> plan
> ----------------------------------------------------------------------------------
>
>                 Key: PHOENIX-5362
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5362
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Chinmay Kulkarni
>            Priority: Major
>             Fix For: 4.15.1, 5.1.1
>
>
> Currently, PhoenixInputFormat#getQueryPlan already generates a queryPlan and 
> we use this plan to get the scans and splits for the MR job. In 
> PhoenixInputFormat#createRecordReader, which is called inside each mapper, we 
> create a queryPlan again and pass it to the PhoenixRecordReader instance.
> There are multiple problems with this approach:
> # The mappers already have information about the scans from the driver code. 
> We potentially just need to wrap these scans in an iterator and create a 
> subsequent ResultSet.
> # The mappers don't need most of the information embedded within a queryPlan, 
> so they shouldn't need to regenerate the plan.
> # There are corner cases that can occur if we replan the query in each 
> mapper. For example, an index could be created, or metadata could otherwise 
> change, between when the MR job was created and when the mappers actually 
> launch. In this case, the mappers have the scans created for the first 
> queryPlan, but they will use iterators created for the second queryPlan. The 
> issued scans would then not match the queryPlan embedded in the mappers' 
> iterators/ResultSet. We could potentially miss some scans, or look for more 
> than we actually require, since we check the original scans for this size. The 
> resolved table would be as per the new queryPlan, and there could be a 
> mismatch here as well (considering the index creation case). There are 
> potentially other repercussions of intermediate metadata changes as well.
> Serializing a subset of the information (like the projector, which iterator 
> to use, etc.) of a QueryPlan and passing it from the driver to the mappers 
> without having them regenerate the plans seems like the best way forward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
