Hi Antoine, Gang,

I fully agree with both of you that these shortcomings make development on parquet-mr somewhat awkward.

As for the duration of the full test suite, we won't really be able to decrease that. Instead, if you are only changing one or two modules, it may suffice to run tests just for the module(s) you modified. Besides running the relevant test fixtures in IntelliJ or any other IDE, this can also be done through Maven; for the parquet-hadoop module, for example:

`mvn -pl :parquet-hadoop -am install -DskipTests && mvn -pl :parquet-hadoop test`

To run this for multiple modules, pass a comma-delimited list of modules to the `-pl` option, i.e. `:parquet-hadoop,:parquet-thrift` instead of `:parquet-hadoop` to cover both parquet-hadoop and parquet-thrift. If your changes end up breaking something in other modules for whatever unforeseen reason, the remote CI/CD will catch it before the PR gets merged anyway.

As for the issues around temp files and console logs, I do think it might be worthwhile to look into fixing them. I have myself run into disk partition size problems in the past because of the huge amount of data the Maven lifecycle dumps into temp, and the amount of logging is just unnecessary overhead.
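To make that workflow concrete, something like the small wrapper below could work. The module list, the script layout, and the temp-dir redirection are purely illustrative assumptions on my part, not project conventions; `-pl`, `-am`, `-DskipTests`, and `java.io.tmpdir` are standard Maven/JVM options, but adjust everything else to your checkout:

```shell
# Sketch of a module-scoped test run for parquet-mr.
# MODULES is a comma-delimited list of module ids for Maven's -pl option.
MODULES=":parquet-hadoop,:parquet-thrift"
# Illustrative: keep test scratch data out of /tmp.
TMPDIR_OVERRIDE="$PWD/target/test-tmp"

# First install the selected modules plus the modules they depend on (-am),
# skipping tests; then run only the selected modules' tests.
BUILD_CMD="mvn -pl $MODULES -am install -DskipTests"
TEST_CMD="mvn -pl $MODULES test -Djava.io.tmpdir=$TMPDIR_OVERRIDE"

echo "$BUILD_CMD"
echo "$TEST_CMD"

# Uncomment to actually run (requires Maven and a parquet-mr checkout).
# Note: Surefire forks a separate test JVM, so the tmpdir override may need
# to go through the surefire argLine rather than the mvn command line.
# mkdir -p "$TMPDIR_OVERRIDE" && $BUILD_CMD && $TEST_CMD
```

The two-step split matters: the install pass builds dependency modules once, so repeated test runs afterwards only pay for the modules you are iterating on.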
All the best,
Atour

________________________________
From: Gang Wu <[email protected]>
Sent: Friday, January 12, 2024 3:06 AM
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Guidelines for working on parquet-mr?

Hi Antoine,

I agree; I have suffered from the same thing while developing on parquet-mr. Usually I don't run the full build and test suite except for the release process. It is much easier to use IntelliJ IDEA and run selected tests.

Best,
Gang

On Fri, Jan 12, 2024 at 1:56 AM Antoine Pitrou <[email protected]> wrote:
>
> Update: I finally Ctrl-C'ed the tests; they had left around 14 GB of
> data in /tmp.
>
> Regards
>
> Antoine.
>
>
> On Thu, 11 Jan 2024 18:48:20 +0100
> Antoine Pitrou <[email protected]> wrote:
>
> > Hello,
> >
> > I'm trying to build parquet-mr and I'm unsure how to make the
> > experience smooth enough for development. This is what I observe:
> >
> > 1) running the tests is extremely long (they have been running for 10
> > minutes already, with no sign of nearing completion)
> >
> > 2) the output logs are a true firehose; there's a ton of extremely
> > detailed (and probably superfluous) information being output, such as:
> >
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > 2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file:
> > file:/tmp/test12306662267168473656/test.parquet
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader
> > initialized will read a total of 100000 records.
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0.
> > reading next block
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in
> > memory in 1 ms. row count = 100
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and
> > processed 100 records from 6 columns in 0 ms: Infinity rec/ms,
> > Infinity cell/ms
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so
> > far 100% reading (1 ms) and 0% processing (0 ms)
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100.
> > reading next block
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in
> > memory in 0 ms. row count = 100
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and
> > processed 200 records from 6 columns in 1 ms: 200.0 rec/ms,
> > 1200.0 cell/ms
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so
> > far 50% reading (1 ms) and 50% processing (1 ms)
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200.
> > reading next block
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in
> > memory in 0 ms. row count = 100
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and
> > processed 300 records from 6 columns in 1 ms: 300.0 rec/ms,
> > 1800.0 cell/ms
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so
> > far 50% reading (1 ms) and 50% processing (1 ms)
> >
> > [etc.]
> >
> >
> > 3) it seems the tests are leaving a lot of generated data files behind
> > in /tmp/test..., though of course they might ultimately clean up at the
> > end?
> >
> >
> > How do people typically develop on parquet-mr? Do they have dedicated
> > shell scripts that only build and test parts of the project? Do they
> > use an IDE and select specific options there?
> >
> > Regards
> >
> > Antoine.
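On the logging firehose Antoine describes above: since the noisy lines come from classes under `org.apache.parquet`, one local workaround could be raising the log level for that package in a test-resources logging config. The file location and log4j 1.x-style syntax below are assumptions on my part; check which slf4j backend the module you're testing actually binds before relying on this:

```shell
# Sketch: write a logging config that silences parquet's per-record INFO
# chatter during local test runs. The path is illustrative; the real file
# would go under the module's src/test/resources if log4j 1.x is the binding.
PROPS="target/demo-log4j/log4j.properties"
mkdir -p "$(dirname "$PROPS")"
cat > "$PROPS" <<'EOF'
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d %-5p %c{1} - %m%n
# Quiet the chatty readers/codecs (CodecPool, InternalParquetRecordReader, ...)
log4j.logger.org.apache.parquet=WARN
EOF
echo "wrote $PROPS"
```

This only masks the symptom locally, of course; the cleaner fix Atour alludes to would be demoting those per-block messages to DEBUG in the code itself.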
