You'll probably need to resolve "s3a://<bucket-name>/*.parquet" into a concrete, non-glob file path before inspecting it this way. Presumably any individual shard will do. The match and open methods from https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileSystems.html may be useful.
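Along those lines, a rough sketch of what I mean (untested; the class name, the s3://-to-s3a:// rewrite for the Hadoop path, and the placeholder bucket are mine, and it assumes the AWS filesystem module is on the classpath so FileSystems can expand s3 specs):

```java
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class SchemaFromFirstShard {
  public static Schema readSchema(Configuration conf) throws IOException {
    // Register filesystems with Beam (normally done during pipeline creation).
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

    // Expand the glob; each Metadata entry is one concrete object in the bucket.
    MatchResult match = FileSystems.match("s3://<bucket-name>/*.parquet");
    List<MatchResult.Metadata> files = match.metadata();
    if (files.isEmpty()) {
      throw new IOException("No Parquet files matched the pattern");
    }

    // Any single shard should carry the same schema; use the first one.
    String firstShard = files.get(0).resourceId().toString();

    // Read just that one file's footer with the same Hadoop/Parquet classes as
    // in your snippet (note the s3a:// scheme expected by Hadoop's S3A FS).
    Path hadoopPath = new Path(firstShard.replaceFirst("^s3://", "s3a://"));
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(hadoopPath, conf))) {
      MessageType messageType = reader.getFooter().getFileMetaData().getSchema();
      return new AvroSchemaConverter().convert(messageType);
    }
  }
}
```

The same Configuration you already build (credentials, proxy, etc.) would be passed in as conf, so only the path-resolution step changes.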
On Wed, Oct 11, 2023 at 10:29 AM Ramya Prasad via dev <dev@beam.apache.org> wrote:

> Hello,
>
> I am a developer trying to use Apache Beam in my Java application, and I'm running into an issue with reading multiple Parquet files from a directory in S3. I'm able to successfully run this line of code, where tempPath = "s3://<bucket-name>/*.parquet":
>
> PCollection<GenericRecord> records = pipeline.apply("Read parquet file in as Generic Records", ParquetIO.read(schema).from(tempPath));
>
> My problem is reading the schema beforehand. At runtime, I only have the name of the S3 bucket, which has all the Parquet files I need underneath it. However, I am unable to use that same tempPath above to retrieve my schema. Because the path is not pointing to a single Parquet file, the ParquetFileReader class from Apache Hadoop throws an error: No such file or directory: s3a://<bucket-name>/*.parquet.
>
> To read my schema, I'm using this chunk of code:
>
> Configuration configuration = new Configuration();
> configuration.set("fs.s3a.access.key", "<access_key>");
> configuration.set("fs.s3a.secret.key", "<secret_key>");
> configuration.set("fs.s3a.session.token", "<session_token>");
> configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
> configuration.set("fs.s3a.server-side-encryption-algorithm", "<algorithm>");
> configuration.set("fs.s3a.proxy.host", "<proxy_host>");
> configuration.set("fs.s3a.proxy.port", "<proxy_port>");
> configuration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
>
> Path hadoopFilePath = new Path("s3a://<bucket-name>/*.parquet");
> ParquetFileReader r = ParquetFileReader.open(HadoopInputFile.fromPath(hadoopFilePath, configuration));
> MessageType messageType = r.getFooter().getFileMetaData().getSchema();
> AvroSchemaConverter converter = new AvroSchemaConverter();
> Schema schema = converter.convert(messageType);
>
> The ParquetFileReader.open(...) call is where the code is failing. Is there maybe a Hadoop Configuration I can set to force Hadoop to read recursively?
>
> I realize this is kind of a Beam-adjacent problem, but I've been struggling with this for a while, so any help would be appreciated!
>
> Thanks and sincerely,
> Ramya