We've done exactly this (used Lambda to create Parquet) with success

I can concur with Ryan. The dependencies we needed to get it to work were:

* hadoop-common
* parquet-avro (we chose Avro for our record format)
* avro
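As a rough sketch, the Maven coordinates look something like the below - the version numbers are just placeholders, use whatever matches your setup:

```xml
<!-- versions are illustrative, not prescriptive -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.7.3</version>
</dependency>
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.8.1</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.8.1</version>
</dependency>
```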

Once you have your schema set up, it's pretty simple to write records into
a Parquet file:

    // SNAPPY is statically imported from
    // org.apache.parquet.hadoop.metadata.CompressionCodecName
    public ParquetWriter<GenericRecord> create(Path outputPath, Schema avroSchema)
            throws IOException {
        return AvroParquetWriter.<GenericRecord>builder(
                        new org.apache.hadoop.fs.Path(outputPath.toString()))
                .withSchema(avroSchema)
                .withCompressionCodec(SNAPPY)
                .build();
    }
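For completeness, here's roughly how you'd use that factory method. The schema, field names, and output path are made up purely for illustration:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class WriterExample {

    // Same factory as above, inlined here so the example is self-contained.
    static ParquetWriter<GenericRecord> create(Path outputPath, Schema avroSchema)
            throws IOException {
        return AvroParquetWriter.<GenericRecord>builder(
                        new org.apache.hadoop.fs.Path(outputPath.toString()))
                .withSchema(avroSchema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical two-field record schema, just for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"body\",\"type\":\"string\"}]}");

        Path out = Paths.get("/tmp/events.parquet");
        try (ParquetWriter<GenericRecord> writer = create(out, schema)) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("id", 1L);
            rec.put("body", "hello");
            writer.write(rec);
        }
    }
}
```

The try-with-resources block matters: ParquetWriter buffers row groups in memory and only flushes the footer on close().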

One thing to be aware of: as far as I can tell (please correct me if I'm
wrong), you need to write your Parquet file to disk using an instance of
ParquetWriter, which means you are limited to the 500 MB of scratch space
AWS provides for each invocation.

Something backed by an OutputStream would be better, so you don't need to
touch disk at all - but I'm not sure if that's possible.
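For what it's worth, newer parquet-mr releases expose an org.apache.parquet.io.OutputFile interface that the AvroParquetWriter builder can accept instead of a Hadoop Path, so in principle you could back the writer with memory. A rough sketch (untested against Lambda, and the whole file obviously has to fit in the heap):

```java
import java.io.ByteArrayOutputStream;

import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

// Sketch of an in-memory OutputFile; pass an instance to
// AvroParquetWriter.<GenericRecord>builder(outputFile) in place of a Path.
public class InMemoryOutputFile implements OutputFile {

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    @Override
    public PositionOutputStream create(long blockSizeHint) {
        return new PositionOutputStream() {
            @Override public long getPos() { return buffer.size(); }
            @Override public void write(int b) { buffer.write(b); }
        };
    }

    @Override
    public PositionOutputStream createOrOverwrite(long blockSizeHint) {
        return create(blockSizeHint);
    }

    @Override public boolean supportsBlockSize() { return false; }
    @Override public long defaultBlockSize() { return 0; }

    // Hand the finished bytes straight to e.g. an S3 PutObject call.
    public byte[] toByteArray() { return buffer.toByteArray(); }
}
```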

On Fri, 6 Jan 2017 at 22:22 Ryan Blue <[email protected]> wrote:

> Marcos,
>
> Parquet currently depends on Hadoop for IO operations and compression. You
> just need to include hadoop-client in your classpath and it should work
> fine.
>
> rb
>
> On Fri, Jan 6, 2017 at 10:27 AM, marcos rebelo <[email protected]> wrote:
>
> > Hi all
> >
> > I'm receiving a set of csv/json files on S3 and I would like to transform
> > them to parquet. Considering the restriction of Lambda Function I would
> > like to create some code that can generate the parquet file. I couldn't
> > find how to do it without Hadoop. Considering the simplicity of the task (file
> > conversion), I can't believe that is hard to do something similar to it.
> >
> > Note: I'm a Scala developer, but I can code/adapt any java code.
> >
> > Can someone give me one hand on this task?
> >
> > Best Regards
> > Marcos Rebelo
> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
