Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

DB Tsai Tue, 31 Jan 2023 18:09:45 -0800


On Jan 31, 2023, at 4:16 PM, Yuming Wang <wgy...@gmail.com> wrote:

+1.

On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura <ktanim...@apple.com.invalid> wrote:
Great! Much appreciated, Mich!

Kazu

On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Thanks, Kazu.

I followed that template link and indeed, as you pointed out, it is a common template. If it works, then it is what it is.

I will be going through your design proposals, and hopefully we can review them.

Regards,

Mich

   view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <ktanim...@apple.com> wrote:
Thank you, Mich. I followed the instructions at https://spark.apache.org/improvement-proposals.html and used its template.
While we are open to revising our design doc, it seems you are proposing that the community change the instructions themselves?

Kazu

On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Hi,

Thanks for these proposals. Good suggestions. Is this style of breaking down your approach standard?

My view is that it perhaps makes more sense to follow the industry-established approach of breaking down your technical proposal into:

Background
Objective
Scope
Constraints
Assumptions
Reporting
Deliverables
Timelines
Appendix
Your current approach, using the questions below,

Q1. ~~What are you trying to do? Articulate your objectives using absolutely no jargon.~~ What are you trying to achieve?
Q2. ~~What problem is this proposal NOT designed to solve?~~ What issues is the suggested proposal not going to address?
Q3. How is it done today, and what are the limits of current practice?
Q4. What is new in your ~~approach~~ approach and why do you think it will ~~be successful~~ succeed?
Q5. ~~Who cares? If you are successful, what difference will it make?~~ If your proposal succeeds, what tangible benefits will it add?
Q6. What are the risks?
Q7. How long will it take?
Q8. What are the midterm and final “exams” to check for success?

may not do justice to your proposal.

HTH

Mich


On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <ktanim...@apple.com.invalid> wrote:
Hi everyone,

I would like to start a discussion on “Lazy Materialization for Parquet Read Performance Improvement”.

Chao and I propose a Parquet reader with lazy materialization. For Spark SQL filter operations, evaluating the filters first and lazily materializing only the values that are actually used can avoid wasted computation and improve read performance.
The current implementation of Spark requires the read values to be materialized (i.e., decompressed, decoded, etc.) into memory before the filters are applied, even though the filters may eventually throw away many of those values.
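
As a rough illustration of the idea only (not the design in the SPIP doc and not Spark's reader code), the difference between the two read strategies can be sketched in Python. Everything here is hypothetical: `EncodedColumn` stands in for an encoded Parquet column, its `decode` method stands in for the expensive decompress/decode step, and `eager_read` / `lazy_read` are made-up names for the current and proposed behavior.

```python
import struct


class EncodedColumn:
    """Hypothetical stand-in for an encoded column chunk.

    decode() plays the role of the costly materialization step and
    counts how many values were actually decoded.
    """

    def __init__(self, values):
        self._raw = [struct.pack("<q", v) for v in values]
        self.decode_calls = 0

    def __len__(self):
        return len(self._raw)

    def decode(self, i):
        self.decode_calls += 1
        return struct.unpack("<q", self._raw[i])[0]


def eager_read(filter_col, payload_col, predicate):
    # Materialize every value of both columns first, then apply the filter.
    keys = [filter_col.decode(i) for i in range(len(filter_col))]
    payload = [payload_col.decode(i) for i in range(len(payload_col))]
    return [(k, p) for k, p in zip(keys, payload) if predicate(k)]


def lazy_read(filter_col, payload_col, predicate):
    # Decode only the filter column and record which rows survive.
    selected = []
    for i in range(len(filter_col)):
        k = filter_col.decode(i)
        if predicate(k):
            selected.append((i, k))
    # Materialize the other column only for the surviving rows.
    return [(k, payload_col.decode(i)) for i, k in selected]
```

With a selective predicate, both functions return the same rows, but `lazy_read` calls `decode` on the second column only for rows that pass the filter, which is the saving described above.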

Our SPIP Jira ticket and design doc are linked below.
SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
SPIP Doc: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

Liang-Chi was kind enough to shepherd this effort.

Thank you
Kazu
