Re: mapPartitions

Julian Hyde Thu, 04 Mar 2021 11:41:56 -0800

SQL has equivalents of many functional programming idioms:

  * map is SELECT 
  * filter is WHERE
  * flatMap is similar to CROSS APPLY
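
To make the correspondence concrete, here is a rough sketch in plain Python
(not Spark or Calcite). The tweets table, its rows, and the split_words table
function are made up for illustration; the SQL in the comments is only the
approximate equivalent of each line.

```python
# Rows of a hypothetical table tweets(id, text); the data is invented.
tweets = [
    {"id": 1, "text": "calcite is neat"},
    {"id": 2, "text": "sql"},
    {"id": 3, "text": "map filter flatMap"},
]

# map ~ SELECT id, UPPER(text) AS text FROM tweets
# One output row per input row, with new column values.
mapped = [{"id": t["id"], "text": t["text"].upper()} for t in tweets]

# filter ~ SELECT * FROM tweets WHERE id > 1
# Same columns, a subset of the rows.
filtered = [t for t in tweets if t["id"] > 1]

# flatMap ~ SELECT t.id, w.word
#           FROM tweets t CROSS APPLY split_words(t.text) AS w
# split_words is a hypothetical table function: each input row
# produces zero or more output rows.
flat_mapped = [(t["id"], word) for t in tweets for word in t["text"].split()]
```

In each case the SQL says what rows come out, not how the input is laid out
or traversed.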


That said, SQL’s strength is that its operations are logical: they are not 
tied to any particular physical organization of data (e.g. sorted or 
partitioned data). mapPartitions is the opposite kind of operation, because it 
is defined in terms of how the data is partitioned. Of course a physical 
implementation of one of SQL’s logical operators might use mapPartitions.
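
As a rough illustration of that last point, the sketch below (plain Python, no
Spark) uses a hand-written map_partitions helper as the physical
implementation of a logical projection, roughly
SELECT id, score(text) FROM tweets. The helper, load_model, and the
word-count "model" are hypothetical stand-ins, not any real API; the point is
only that the logical result does not depend on how the rows are partitioned.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Row = Tuple[int, str]      # (id, text)
Scored = Tuple[int, int]   # (id, score)

def map_partitions(
    partitions: List[List[Row]],
    fn: Callable[[Iterator[Row]], Iterable[Scored]],
) -> List[List[Scored]]:
    """Call fn once per partition, passing it an iterator over that partition."""
    return [list(fn(iter(part))) for part in partitions]

model_loads = 0  # counts how often the expensive setup runs

def load_model() -> Callable[[str], int]:
    """Hypothetical stand-in for a heavyweight initialization step."""
    global model_loads
    model_loads += 1
    return lambda text: len(text.split())  # toy "model": word count

def score_partition(rows: Iterator[Row]) -> Iterator[Scored]:
    model = load_model()  # paid once per partition, not once per row
    for row_id, text in rows:
        yield (row_id, model(text))

def project_scores(partitions: List[List[Row]]) -> List[Scored]:
    """Physical plan for the logical query: SELECT id, score(text) FROM tweets."""
    scored = map_partitions(partitions, score_partition)
    # Sort so the comparison below ignores partition order.
    return sorted(row for part in scored for row in part)

rows = [(1, "a b c"), (2, "d"), (3, "e f")]
single = project_scores([rows])               # one partition
split = project_scores([rows[:1], rows[1:]])  # two partitions
```

Both calls return the same rows; only the number of load_model calls differs,
which is the kind of detail a planner can decide without it appearing in the
query.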

Julian





> On Mar 4, 2021, at 10:44 AM, Debajyoti Roy <newroy...@gmail.com> wrote:
> 
> Thanks for the responses, adding some more color below.
> 
> Spark's API adopted concepts from the functional programming paradigm (map,
> filter, flatMap,...) into data processing. Spark did add several relational
> operators like join, union, select, etc. However, there are certain APIs
> that are really hard to model in terms of standard relational operators.
> Let me take one example of mapPartitions.
> 
> mapPartitions( T -> U ):
> w columns and h rows can turn into totally different w' != w columns and h'
> != h rows. Since processing happens per partition, this API is a great
> choice for vectorized operations with a heavyweight initialization cost,
> e.g. batch inferencing.
> 
> In terms of relational models, mapPartitions can be modeled just like a
> function inside an expression operator. However, there can be interesting
> cases, e.g. h > h', where mapPartitions starts to feel like a filter. Could
> there be other challenges and opportunities for the planner and optimizer,
> given that mapPartitions is definitely NOT like any other function inside
> an expression, such as the one shown below?
> 
> SELECT name, address, mapPartitions(id, tweet, '{threshold: 0.5}',
> 'sentiment_analysis_4', 10000) FROM my_twitter_data...
> 
> So what is a better usage example for mapPartitions expressed as SQL? I am
> really struggling with that part and I agree with Julian that it is the key.
> 
> Regards,
> Debajyoti Roy
> 
> On Thu, Mar 4, 2021 at 12:01 AM Julian Hyde <jhyde.apa...@gmail.com> wrote:
> 
>> I searched for mapPartitions and flatMapGroupsWithState, and it looks as
>> if you are talking about Apache Spark operations. Can you give some
>> examples of typical queries that would use these operations?
>> 
>> It’s possible that these operations accomplish things that are not
>> possible in the relational model; but I think it’s more likely that these
>> are algorithms that can implement relational operations such as windowed
>> aggregate functions. If you give some examples, we can see whether they can
>> be expressed in SQL or relational algebra.
>> 
>> Julian
>> 
>> 
>>> On Mar 3, 2021, at 10:54 PM, Rui Wang <amaliu...@apache.org> wrote:
>>> 
>>> Well I think the expected approach is to translate other operations to
>>> relational operators by yourself ;-)
>>> 
>>> I think Calcite won't be the place to have extensions for such
>>> translation. The main concern is that those non relational operations
>>> are "non-standard".
>>> 
>>> -Rui
>>> 
>>> On Wed, Mar 3, 2021 at 10:12 PM Debajyoti Roy <newroy...@gmail.com> wrote:
>>> 
>>>> Hi All,
>>>> 
>>>> For operators like filter, join, union, aggregate, and window, the
>>>> logical RelNode choices are obvious within the
>>>> org.apache.calcite.rel.logical package.
>>>> 
>>>> However, for more complex applications that require operations like
>>>> mapPartitions, flatMapGroupsWithState, etc. what would be some choices
>>>> for logical rel node and relational expression classes in Apache
>>>> Calcite (independent of engine)?
>>>> 
>>>> What is a good way to model operators that are not traditionally
>>>> relational and hence do not exist in Calcite (looking for hooks /
>>>> extension points / roadmaps)?
>>>> 
>>>> Thanks in advance for any pointers, (disclaimer: I am new to Calcite)
>>>> Debajyoti Roy
>>>> 
>> 
>>
