You'll probably need to resolve "s3a://<bucket-name>/*.parquet" into a concrete, non-glob file path before inspecting it this way. Presumably any individual shard will do. The match and open methods from https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/FileSystems.html may be useful.
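Along those lines, a rough sketch of what I mean (untested; the class name, the s3://-to-s3a:// rewrite for the Hadoop path, and the placeholder bucket are mine, and it assumes the AWS filesystem module is on the classpath so FileSystems can expand s3 specs):

```java
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroSchemaConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class SchemaFromFirstShard {
  public static Schema readSchema(Configuration conf) throws IOException {
    // Register filesystems with Beam (normally done during pipeline creation).
    FileSystems.setDefaultPipelineOptions(PipelineOptionsFactory.create());

    // Expand the glob; each Metadata entry is one concrete object in the bucket.
    MatchResult match = FileSystems.match("s3://<bucket-name>/*.parquet");
    List<MatchResult.Metadata> files = match.metadata();
    if (files.isEmpty()) {
      throw new IOException("No Parquet files matched the pattern");
    }

    // Any single shard should carry the same schema; use the first one.
    String firstShard = files.get(0).resourceId().toString();

    // Read just that one file's footer with the same Hadoop/Parquet classes as
    // in your snippet (note the s3a:// scheme expected by Hadoop's S3A FS).
    Path hadoopPath = new Path(firstShard.replaceFirst("^s3://", "s3a://"));
    try (ParquetFileReader reader =
        ParquetFileReader.open(HadoopInputFile.fromPath(hadoopPath, conf))) {
      MessageType messageType = reader.getFooter().getFileMetaData().getSchema();
      return new AvroSchemaConverter().convert(messageType);
    }
  }
}
```

The same Configuration you already build (credentials, proxy, etc.) would be passed in as conf, so only the path-resolution step changes.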
On Wed, Oct 11, 2023 at 10:29 AM Ramya Prasad via dev <dev@beam.apache.org> wrote:

> Hello,
>
> I am a developer trying to use Apache Beam in my Java application, and I'm running into an issue with reading multiple Parquet files from a directory in S3. I'm able to successfully run this line of code, where tempPath = "s3://<bucket-name>/*.parquet":
>
> PCollection<GenericRecord> records = pipeline.apply("Read parquet file in as Generic Records", ParquetIO.read(schema).from(tempPath));
>
> My problem is reading the schema beforehand. At runtime, I only have the name of the S3 bucket, which has all the Parquet files I need underneath it. However, I am unable to use that same tempPath above to retrieve my schema. Because the path is not pointing to a single Parquet file, the ParquetFileReader class from Apache Hadoop throws an error: No such file or directory: s3a://<bucket-name>/*.parquet.
>
> To read my schema, I'm using this chunk of code:
>
> Configuration configuration = new Configuration();
> configuration.set("fs.s3a.access.key", "<access_key>");
> configuration.set("fs.s3a.secret.key", "<secret_key>");
> configuration.set("fs.s3a.session.token", "<session_token>");
> configuration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
> configuration.set("fs.s3a.server-side-encryption-algorithm", "<algorithm>");
> configuration.set("fs.s3a.proxy.host", "<proxy_host>");
> configuration.set("fs.s3a.proxy.port", "<proxy_port>");
> configuration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider");
>
> Path hadoopFilePath = new Path("s3a://<bucket-name>/*.parquet");
> ParquetFileReader r = ParquetFileReader.open(HadoopInputFile.fromPath(hadoopFilePath, configuration));
> MessageType messageType = r.getFooter().getFileMetaData().getSchema();
> AvroSchemaConverter converter = new AvroSchemaConverter();
> Schema schema = converter.convert(messageType);
>
> The ParquetFileReader.open(...) call is where the code is failing. Is there maybe a Hadoop Configuration I can set to force Hadoop to read recursively?
>
> I realize this is kind of a Beam-adjacent problem, but I've been struggling with this for a while, so any help would be appreciated!
>
> Thanks and sincerely,
> Ramya