Related to the noisy console logs, a few months ago I opened a ticket <https://issues.apache.org/jira/browse/HADOOP-18717> for Hadoop to move those CodecPool log statements from INFO to DEBUG, as they're a significant contributor to log size and (IMO) don't add a ton of value for the end user. It hasn't gotten traction so far, but I can try to move forward with opening a pull request for it.
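Until a change like HADOOP-18717 lands, one local workaround is to raise the log threshold for that one logger in whatever log4j configuration the test classpath picks up. A hypothetical sketch, assuming a log4j 1.x style `log4j.properties` (the actual file name and backend in your setup may differ):

```properties
# Hypothetical snippet: silence Hadoop's CodecPool chatter during test runs.
# Assumes the test classpath uses a log4j 1.x style log4j.properties.
log4j.logger.org.apache.hadoop.io.compress.CodecPool=WARN
```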
Best,
Claire

On Fri, Jan 12, 2024 at 7:36 AM Atour Mousavi Gourabi <at...@live.com> wrote:
> Hi Antoine, Gang,
>
> I fully agree with both of you that these shortcomings make development on
> parquet-mr somewhat awkward. As for the duration the full test suite runs
> for, we won't really be able to decrease that. Instead, if you are only
> changing one or two modules, it might suffice to just run tests for the
> module(s) you modified. Besides running the relevant test fixtures in
> IntelliJ or any other IDE, this can also be done through Maven, e.g. for
> the parquet-hadoop module:
> `mvn -pl :parquet-hadoop -am install -DskipTests && mvn -pl :parquet-hadoop test`.
> If you want to run this command for multiple modules, pass a
> comma-delimited list of modules to the `-pl` option, i.e.
> `:parquet-hadoop,:parquet-thrift` instead of `:parquet-hadoop` for both
> `parquet-hadoop` and `parquet-thrift`. If your changes for whatever
> unforeseen reason end up breaking stuff in other modules, the CI/CD on the
> remote will catch it before the PR gets merged anyway.
>
> As for the issues around temp files and the console logs, I do think it
> might be worthwhile to look into fixing them. I myself have had problems
> with disk partition sizes in the past because of the huge amount of data
> the Maven lifecycle dumps in temp, and the amount of logging is just
> unnecessary overhead.
>
> All the best,
> Atour
> ________________________________
> From: Gang Wu <ust...@gmail.com>
> Sent: Friday, January 12, 2024 3:06 AM
> To: dev@parquet.apache.org <dev@parquet.apache.org>
> Cc: d...@parquet.incubator.apache.org <d...@parquet.incubator.apache.org>
> Subject: Re: Guidelines for working on parquet-mr?
>
> Hi Antoine,
>
> I have suffered from the same thing while developing on parquet-mr.
> Usually I don't run the full build and test except for the release
> process.
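Atour's per-module recipe can be captured in a small shell sketch. The module list below is just the example pair from the email; substitute whichever modules you touched:

```shell
#!/bin/sh
# Sketch of the two-step Maven invocation described above, for a
# comma-delimited list of modules (here parquet-hadoop and parquet-thrift).
MODULES=":parquet-hadoop,:parquet-thrift"

# Step 1: install the modules and the modules they depend on (-am),
# skipping tests so the build itself stays fast.
BUILD_CMD="mvn -pl ${MODULES} -am install -DskipTests"

# Step 2: run tests only for the listed modules.
TEST_CMD="mvn -pl ${MODULES} test"

echo "${BUILD_CMD} && ${TEST_CMD}"
```

The `-am` ("also make") flag in step 1 matters: it builds upstream dependencies of the listed modules, so step 2 tests against fresh artifacts rather than stale ones from a previous `install`.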
> It would be much easier to use IntelliJ IDEA and run selected tests.
>
> Best,
> Gang
>
> On Fri, Jan 12, 2024 at 1:56 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > Update: I finally Ctrl-C'ed the tests; they had left around 14 GB of
> > data in /tmp.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Thu, 11 Jan 2024 18:48:20 +0100
> > Antoine Pitrou <anto...@python.org> wrote:
> >
> > > Hello,
> > >
> > > I'm trying to build parquet-mr and I'm unsure how to make the
> > > experience smooth enough for development. This is what I observe:
> > >
> > > 1) running the tests is extremely long (they have been running for 10
> > > minutes already, with no sign of nearing completion)
> > >
> > > 2) the output logs are a true firehose; there's a ton of extremely
> > > detailed (and probably superfluous) information being output, such as:
> > >
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > 2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 100000 records.
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
> > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
> > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> > >
> > > [etc.]
> > >
> > >
> > > 3) it seems the tests are leaving a lot of generated data files behind
> > > in /tmp/test..., though of course they might ultimately clean up at the
> > > end?
> > >
> > >
> > > How do people typically develop on parquet-mr? Do they have dedicated
> > > shell scripts that only build and test parts of the project? Do they
> > > use an IDE and select specific options there?
> > >
> > > Regards
> > >
> > > Antoine.
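The leftover `/tmp/test...` directories Antoine mentions can at least be cleared out between runs. A hypothetical cleanup helper (the `test*` pattern matches the directory names seen in the logs above; the function name and argument are this sketch's own, and `-maxdepth` assumes GNU/BSD find):

```shell
#!/bin/sh
# Hypothetical helper: remove leftover test scratch directories (the
# /tmp/test... dirs the test suite leaves behind). Takes the base
# directory as an argument rather than hard-coding /tmp, so it can be
# exercised safely against a sandbox directory.
cleanup_test_dirs() {
    base="$1"
    # Only look one level deep and only at directories named test*.
    find "$base" -maxdepth 1 -type d -name 'test*' -exec rm -rf {} +
}

# Example: cleanup_test_dirs /tmp
```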