[I] Support for writing Parquet files from the Iceberg Java API without the Hadoop Configuration class [iceberg]

via GitHub Thu, 18 Apr 2024 19:30:42 -0700


ms1111 opened a new issue, #10180:
URL: https://github.com/apache/iceberg/issues/10180


   ### Feature Request / Improvement
   
   If the hadoop-common library is not present, trying to write a Parquet file:
   ```java
   DataWriter<Record> dataWriter =
           Parquet.writeData(file)
                   .schema(schema)
                   .createWriterFunc(GenericParquetWriter::buildWriter)
                   .overwrite()
                   .withSpec(partitionSpec)
                   .build();
   ```
   ... will fail with:
   ```
   Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
           at org.apache.iceberg.parquet.Parquet$WriteBuilder.<init>(Parquet.java:164)
           at org.apache.iceberg.parquet.Parquet$WriteBuilder.<init>(Parquet.java:143)
           at org.apache.iceberg.parquet.Parquet.write(Parquet.java:129)
           at org.apache.iceberg.parquet.Parquet$DataWriteBuilder.<init>(Parquet.java:646)
           at org.apache.iceberg.parquet.Parquet$DataWriteBuilder.<init>(Parquet.java:637)
           at org.apache.iceberg.parquet.Parquet.writeData(Parquet.java:623)
   ```
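
To check whether a given runtime classpath is affected, a small probe along these lines may help. This is only a sketch: `HadoopClasspathProbe` and `isPresent` are hypothetical names, not part of Iceberg, and the probe relies solely on the standard `Class.forName` call.

```java
// Hypothetical helper, not part of Iceberg or Hadoop.
final class HadoopClasspathProbe {

    // Returns true if the named class can be located by this class's loader.
    // The class is looked up without being initialized (second argument is false).
    static boolean isPresent(String className) {
        try {
            Class.forName(className, false, HadoopClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class name is taken from the NoClassDefFoundError above.
        System.out.println("Hadoop Configuration on classpath: "
                + isPresent("org.apache.hadoop.conf.Configuration"));
    }
}
```

If this reports `false`, the `Parquet.writeData(...)` call above would be expected to hit the same error.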
   
   In org.apache.iceberg.parquet.Parquet, the WriteBuilder constructor creates an empty Configuration whenever the file is not a HadoopOutputFile:
   ```java
       private WriteBuilder(OutputFile file) {
         this.file = file;
         if (file instanceof HadoopOutputFile) {
           this.conf = new Configuration(((HadoopOutputFile) file).getConf());
         } else {
           this.conf = new Configuration();
         }
       }
   ```
   
   ParquetWriter eventually passes this Configuration to ParquetIO.file(), which ignores it if the file is not a HadoopOutputFile. For non-Hadoop files, the Configuration is therefore created but never used.
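
One possible direction, sketched below, is to keep a configuration only when the file is Hadoop-backed and leave it unset otherwise. Every type in this sketch (`OutputFile`, `PlainOutputFile`, `FakeHadoopOutputFile`, `FakeConfiguration`, `WriteBuilder`) is a hypothetical stand-in so the example compiles without Iceberg or Hadoop; it illustrates the pattern only and is not the actual Iceberg code or a proposed patch.

```java
// Sketch only: every type here is a hypothetical stand-in, not an Iceberg or Hadoop class.
final class LazyConfSketch {

    /** Stand-in for Iceberg's OutputFile interface. */
    interface OutputFile {
        String location();
    }

    /** Stand-in for Hadoop's Configuration; counts how many instances are constructed. */
    static final class FakeConfiguration {
        static int created = 0;

        FakeConfiguration() {
            created++;
        }
    }

    /** Stand-in for an OutputFile that has nothing to do with Hadoop. */
    static final class PlainOutputFile implements OutputFile {
        private final String location;

        PlainOutputFile(String location) {
            this.location = location;
        }

        @Override
        public String location() {
            return location;
        }
    }

    /** Stand-in for HadoopOutputFile, the only file type that carries a configuration. */
    static final class FakeHadoopOutputFile implements OutputFile {
        private final String location;
        private final FakeConfiguration conf;

        FakeHadoopOutputFile(String location, FakeConfiguration conf) {
            this.location = location;
            this.conf = conf;
        }

        @Override
        public String location() {
            return location;
        }

        FakeConfiguration getConf() {
            return conf;
        }
    }

    /** Builder that never constructs a configuration for non-Hadoop files. */
    static final class WriteBuilder {
        private final OutputFile file;
        private final FakeConfiguration conf; // null when the file is not Hadoop-backed

        WriteBuilder(OutputFile file) {
            this.file = file;
            if (file instanceof FakeHadoopOutputFile) {
                this.conf = ((FakeHadoopOutputFile) file).getConf();
            } else {
                this.conf = null; // the original code calls new Configuration() here
            }
        }

        boolean hasConf() {
            return conf != null;
        }

        String location() {
            return file.location();
        }
    }

    public static void main(String[] args) {
        WriteBuilder plain = new WriteBuilder(new PlainOutputFile("/tmp/data.parquet"));
        System.out.println(plain.location() + " has conf: " + plain.hasConf());
        System.out.println("configurations created: " + FakeConfiguration.created);
    }
}
```

Whether a change like this alone would remove the hadoop-common requirement in Iceberg is an open question, since other code on the Parquet write path may also reference Hadoop classes.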
   
   hadoop-common is a heavy dependency with many transitive dependencies, so it would be nice to avoid it.
   
   This is similar to the Iceberg Flink issues #3117 and #4183.
   
   ### Query engine
   
   None


