adamreeve commented on code in PR #47621:
URL: https://github.com/apache/arrow/pull/47621#discussion_r2370860370
##########
cpp/src/parquet/arrow/generate_fuzz_corpus.cc:
##########
@@ -41,47 +44,162 @@ namespace arrow {
using ::arrow::internal::CreateDir;
using ::arrow::internal::PlatformFilename;
+using ::arrow::util::Float16;
+using ::parquet::ArrowWriterProperties;
using ::parquet::WriterProperties;
static constexpr int32_t kBatchSize = 1000;
+// This will emit several row groups
static constexpr int32_t kChunkSize = kBatchSize * 3 / 8;
-std::shared_ptr<WriterProperties> GetWriterProperties() {
- WriterProperties::Builder builder{};
- builder.disable_dictionary("no_dict");
- builder.compression("compressed", Compression::BROTLI);
- return builder.build();
+struct WriteConfig {
+ std::shared_ptr<WriterProperties> writer_properties;
+ std::shared_ptr<ArrowWriterProperties> arrow_writer_properties;
+};
+
+struct Column {
+ std::string name;
+ std::shared_ptr<Array> array;
+
+ static std::function<std::string()> NameGenerator() {
+ struct Gen {
+ int num_col = 1;
+
+ std::string operator()() {
+ std::stringstream ss;
+ ss << "col_" << num_col++;
+ return std::move(ss).str();
+ }
+ };
+ return Gen{};
+ }
+};
+
+std::vector<WriteConfig> GetWriteConfigurations() {
+ // clang-format off
+ auto w_brotli = WriterProperties::Builder()
+ .disable_dictionary("no_dict")
+ ->compression("compressed", Compression::BROTLI)
+ // Override current default of 1MB
+ ->data_pagesize(20'000)
+ // Reduce max dictionary page size so that less columns are
+ // dict-encoded (XXX: this does not seem to have an effect?)
Review Comment:
Actually, I don't think this comment makes sense. Hitting the dictionary
pagesize limit means that the column will start a new page and switch to plain
encoding, not that dictionary encoding won't be used at all.
I tested bumping the dictionary pagesize limit to 1MB, and compared the
resulting output file. I can see that some columns switch from having a
dictionary page, a dictionary encoded page, and a plain encoded page, to only
having the dictionary page and dictionary encoded page (eg. `col_24` in row
group 0).
So it looks like this property is working correctly.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]