Hi Claire, all,

Thanks for trying to pick this up over at Hadoop; it seems like a reasonable 
change, so I hope it gains some traction. In the meantime, I propose we limit 
the scope of logging in the test suite. Info-level logs aren't terribly 
interesting in this case, and IMO bumping the default level to warn or error 
should suffice for our purposes, greatly reducing the overhead. As for the 
temp files, I'll look into setting up some teardown routines sometime next 
week.
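
To make this a bit more concrete, here is roughly what I have in mind. For
the log level, assuming the test modules pick up a Log4j properties file
from src/test/resources (adjust for whichever SLF4J backend the build
actually wires in), a minimal sketch would be:

    # hypothetical src/test/resources/log4j.properties for the test runs
    # raise the default threshold so only warnings and errors hit the console
    log4j.rootLogger=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1} - %m%n

For the temp files, the kind of teardown I have in mind is along the lines
of JUnit 4's TemporaryFolder rule, so test output lands in a directory that
is deleted after each test (the class and method names below are purely
illustrative):

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TemporaryFolder;

    public class ExampleWriteTest {
      // JUnit deletes everything under this folder once the test finishes,
      // so nothing accumulates in /tmp between runs.
      @Rule
      public final TemporaryFolder tmp = new TemporaryFolder();

      @Test
      public void writesIntoManagedTempDir() throws Exception {
        java.io.File out = tmp.newFile("test.parquet");
        // ... write to and read back 'out' the way the existing tests do ...
      }
    }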

All the best,
Atour
________________________________
From: Claire McGinty <[email protected]>
Sent: Friday, January 12, 2024 7:36 PM
To: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Guidelines for working on parquet-mr?

Related to the noisy console logs, a few months ago I opened a ticket
<https://issues.apache.org/jira/browse/HADOOP-18717> for Hadoop to move
those CodecPool log statements from INFO to DEBUG, as they're a
significant contributor to log size and (IMO) don't add a ton of value for
the end user. It hasn't gotten traction so far, but I can try to move
forward with opening a pull request for it.

Best,
Claire

On Fri, Jan 12, 2024 at 7:36 AM Atour Mousavi Gourabi <[email protected]>
wrote:

> Hi Antoine, Gang,
>
> I fully agree with both of you that these shortcomings make development on
> parquet-mr somewhat awkward. As for how long the full test suite takes to
> run, we won't really be able to decrease that. Instead, if you are only
> changing one or two modules, it might suffice to just run the tests for the
> module(s) you modified. Besides running the relevant test fixtures in
> IntelliJ or any other IDE, this can also be done through Maven; for the
> parquet-hadoop module, for example: `mvn -pl :parquet-hadoop -am install
> -DskipTests && mvn -pl :parquet-hadoop test`. To run this for multiple
> modules, pass a comma-delimited list of modules to the `-pl` option, e.g.
> `:parquet-hadoop,:parquet-thrift` instead of `:parquet-hadoop` for both
> `parquet-hadoop` and `parquet-thrift`. If your changes end up breaking
> something in other modules for whatever unforeseen reason, the remote
> CI/CD will catch it before the PR gets merged anyway.
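>
> For example, to build and then run only the tests for both of those
> modules:
>
>     mvn -pl :parquet-hadoop,:parquet-thrift -am install -DskipTests
>     mvn -pl :parquet-hadoop,:parquet-thrift test
>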
> As for the issues around the temp files and the console logs, I do think it
> might be worthwhile to look into fixing them. I have had problems with disk
> partition sizes in the past because of the huge amount of data the Maven
> lifecycle dumps in temp, and the amount of logging is just unnecessary
> overhead.
>
> All the best,
> Atour
> ________________________________
> From: Gang Wu <[email protected]>
> Sent: Friday, January 12, 2024 3:06 AM
> To: [email protected] <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: Re: Guidelines for working on parquet-mr?
>
> Hi Antoine,
>
> I agree; I have run into the same thing while developing on parquet-mr.
> Usually I don't do a full build and test unless it's for the release
> process.
> It would be much easier to use IntelliJ IDEA and run selected tests.
>
> Best,
> Gang
>
> On Fri, Jan 12, 2024 at 1:56 AM Antoine Pitrou <[email protected]> wrote:
>
> >
> > Update: I finally Ctrl-C'ed the tests; they had left around 14 GB of
> > data in /tmp.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 11 Jan 2024 18:48:20 +0100
> > Antoine Pitrou <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > I'm trying to build parquet-mr and I'm unsure how to make the
> > > experience smooth enough for development. This is what I observe:
> > >
> > > 1) running the tests is extremely long (they have been running for 10
> > > minutes already, with no sign of nearing completion)
> > >
> > > 2) the output logs are a true firehose; there's a ton of extremely
> > > detailed (and probably superfluous) information being output, such as:
> > >
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > 2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 100000 records.
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> > >
> > > [etc.]
> > >
> > >
> > > 3) it seems the tests are leaving a lot of generated data files behind
> > > in /tmp/test..., though of course they might ultimately clean up at the
> > > end?
> > >
> > >
> > > How do people typically develop on parquet-mr? Do they have dedicated
> > > shell scripts that only build and test parts of the project? Do they
> > > use an IDE and select specific options there?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> >
> >
> >
> >
>
