The writer code currently uses a Hadoop class, FSDataOutputStream, so it does need to go through Hadoop right now. But I don't see a reason why we couldn't replace that with another OutputStream, as long as that stream keeps track of its position. Parquet uses FSDataOutputStream#getPos to get the offsets where data is written in the file.
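To make the position-tracking requirement concrete, here is a minimal sketch of a wrapper stream that counts bytes as they are written, which is the only behavior getPos needs. The class name and structure are hypothetical, not part of the Parquet or Hadoop API:

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: an OutputStream that tracks its write position,
// mirroring the one behavior Parquet relies on from
// FSDataOutputStream#getPos (recording row-group/column-chunk offsets).
public class PositionTrackingOutputStream extends OutputStream {
    private final OutputStream delegate;
    private long pos = 0;

    public PositionTrackingOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
        pos++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        delegate.write(b, off, len);
        pos += len;
    }

    // The current byte offset in the stream, analogous to getPos().
    public long getPos() {
        return pos;
    }
}
```

The delegate could be anything (an in-memory buffer, an S3 multipart upload, etc.), which is what makes the disk-free Lambda scenario below plausible in principle.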
rb

On Fri, Jan 6, 2017 at 2:39 PM, Daniel Harper <[email protected]> wrote:

> We've done exactly this (used Lambda to create Parquet) with success.
>
> Can concur with Ryan, the dependencies we needed to get it to work were
>
> * hadoop-common
> * parquet-avro (we chose avro for our record format)
> * avro
>
> Once you have your schema set up, it's pretty simple to write records into
> a parquet file.
>
> public ParquetWriter<GenericRecord> create(Path outputPath, Schema avroSchema) throws IOException {
>     return AvroParquetWriter.<GenericRecord>builder(
>             new org.apache.hadoop.fs.Path(outputPath.toString()))
>         .withSchema(avroSchema)
>         .withCompressionCodec(SNAPPY)
>         .build();
> }
>
> The thing you need to be aware of is that, from what I can tell (please correct
> me if I'm wrong), you need to write your parquet file to disk using an
> instance of ParquetWriter, meaning you are limited to the 500 MB of scratch
> space AWS provides you for each invocation.
>
> Ideally something backed by an OutputStream would be better so you don't
> need to touch disk at all - but I'm not sure if that's possible.
>
> On Fri, 6 Jan 2017 at 22:22 Ryan Blue <[email protected]> wrote:
>
> > Marcos,
> >
> > Parquet currently depends on Hadoop for IO operations and compression. You
> > just need to include hadoop-client in your classpath and it should work
> > fine.
> >
> > rb
> >
> > On Fri, Jan 6, 2017 at 10:27 AM, marcos rebelo <[email protected]> wrote:
> >
> > > Hi all
> > >
> > > I'm receiving a set of csv/json files on S3 and I would like to transform
> > > them to parquet. Considering the restrictions of Lambda functions, I would
> > > like to create some code that can generate the parquet file. I didn't find
> > > how to do it without Hadoop. Considering the simplicity of the task (file
> > > conversion), I can't believe that it is hard to do something similar.
> > >
> > > Note: I'm a Scala developer, but I can code/adapt any Java code.
> > >
> > > Can someone give me one hand on this task?
> > >
> > > Best Regards
> > > Marcos Rebelo
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
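For completeness, here is a hedged sketch of how records might be written using Daniel's create(...) helper. It assumes parquet-avro, avro, and hadoop-common on the classpath (as listed in the thread); the schema, field name, output path, and class name are made up for illustration:

```java
import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteExample {
    // The helper from Daniel's message, reproduced here as a static method.
    public static ParquetWriter<GenericRecord> create(Path outputPath, Schema avroSchema)
            throws IOException {
        return AvroParquetWriter.<GenericRecord>builder(
                new org.apache.hadoop.fs.Path(outputPath.toString()))
            .withSchema(avroSchema)
            .withCompressionCodec(SNAPPY)
            .build();
    }

    public static void main(String[] args) throws IOException {
        // Made-up single-field record schema, purely for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        // In a Lambda, /tmp is the 500 MB scratch space Daniel mentions.
        try (ParquetWriter<GenericRecord> writer =
                 create(Paths.get("/tmp/events.parquet"), schema)) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1L);
            writer.write(record);
        }
    }
}
```

The try-with-resources is important: ParquetWriter buffers row groups in memory and only writes the footer (with the getPos offsets) on close().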
