Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

huaxin gao Tue, 31 Jan 2023 18:23:59 -0800

+1

On Tue, Jan 31, 2023 at 6:10 PM DB Tsai <dbt...@dbtsai.com> wrote:


> +1
>
> On Jan 31, 2023, at 4:16 PM, Yuming Wang <wgy...@gmail.com> wrote:
>
> 
> +1.
>
> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
> <ktanim...@apple.com.invalid> wrote:
>
>> Great! Much appreciated, Mich!
>>
>> Kazu
>>
>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Thanks, Kazu.
>>
>> I followed that template link and, as you pointed out, it is indeed a
>> common template. If it works, then it is what it is.
>>
>> I will be going through your design proposal, and hopefully we can review
>> it.
>>
>> Regards,
>>
>> Mich
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <ktanim...@apple.com>
>> wrote:
>>
>>> Thank you, Mich. I followed the instructions at
>>> https://spark.apache.org/improvement-proposals.html and used its
>>> template.
>>> While we are open to revising our design doc, it seems you are
>>> proposing that the community change the instructions themselves?
>>>
>>> Kazu
>>>
>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for these proposals. Good suggestions. Is this style of breaking
>>> down your approach standard?
>>>
>>> My view would be that perhaps it makes more sense to follow the
>>> industry-established approach of breaking down your technical proposal into:
>>>
>>>
>>>    1. Background
>>>    2. Objective
>>>    3. Scope
>>>    4. Constraints
>>>    5. Assumptions
>>>    6. Reporting
>>>    7. Deliverables
>>>    8. Timelines
>>>    9. Appendix
>>>
>>> Your current approach, using the questions below,
>>>
>>> Q1. What are you trying to do? Articulate your objectives using
>>> absolutely no jargon. What are you trying to achieve?
>>> Q2. What problem is this proposal NOT designed to solve? What issues
>>> is the suggested proposal not going to address?
>>> Q3. How is it done today, and what are the limits of current practice?
>>> Q4. What is new in your approach and why do you think it will be
>>> successful?
>>> Q5. Who cares? If you are successful, what difference will it make? If
>>> your proposal succeeds, what tangible benefits will it add?
>>> Q6. What are the risks?
>>> Q7. How long will it take?
>>> Q8. What are the midterm and final “exams” to check for success?
>>>
>>>
>>> may not do justice to your proposal.
>>>
>>> HTH
>>>
>>> Mich
>>>
>>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>>> ktanim...@apple.com.invalid> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>>> Read Performance Improvement”.
>>>>
>>>> Chao and I propose a Parquet reader with lazy materialization. For
>>>> Spark SQL filter operations, evaluating the filters first and lazily
>>>> materializing only the values that are used can avoid wasted computation
>>>> and improve read performance.
>>>> The current implementation of Spark requires the read values to be
>>>> materialized (i.e., decompressed, decoded, etc.) in memory before
>>>> applying the filters, even though the filters may eventually discard many
>>>> of those values.
>>>>
>>>> Our design doc is available at the links below.
>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>>> SPIP Doc:
>>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>
>>>> Liang-Chi was kind enough to shepherd this effort.
>>>>
>>>> Thank you
>>>> Kazu
>>>>
>>>
>>>
>>
