The writer code currently uses a Hadoop class, FSDataOutputStream, so it does need to go through Hadoop right now. But I don't see a reason why we couldn't replace that with another OutputStream, as long as that stream keeps track of its position. Parquet uses FSDataOutputStream#getPos to get the offsets where data is written in the file.
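To make the position-tracking requirement concrete, here is a minimal sketch of a wrapper stream that counts bytes as they are written, which is the only behavior getPos needs. The class name and structure are hypothetical, not part of the Parquet or Hadoop API:

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: an OutputStream that tracks its write position,
// mirroring the one behavior Parquet relies on from
// FSDataOutputStream#getPos (recording row-group/column-chunk offsets).
public class PositionTrackingOutputStream extends OutputStream {
    private final OutputStream delegate;
    private long pos = 0;

    public PositionTrackingOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(int b) throws IOException {
        delegate.write(b);
        pos++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        delegate.write(b, off, len);
        pos += len;
    }

    // The current byte offset in the stream, analogous to getPos().
    public long getPos() {
        return pos;
    }
}
```

The delegate could be anything (an in-memory buffer, an S3 multipart upload, etc.), which is what makes the disk-free Lambda scenario below plausible in principle.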
rb

On Fri, Jan 6, 2017 at 2:39 PM, Daniel Harper <[email protected]> wrote:

> We've done exactly this (used Lambda to create Parquet) with success.
>
> Can concur with Ryan, the dependencies we needed to get it to work were
>
> * hadoop-common
> * parquet-avro (we chose avro for our record format)
> * avro
>
> Once you have your schema set up, it's pretty simple to write records into
> a parquet file.
>
> public ParquetWriter<GenericRecord> create(Path outputPath, Schema avroSchema) throws IOException {
>     return AvroParquetWriter.<GenericRecord>builder(
>             new org.apache.hadoop.fs.Path(outputPath.toString()))
>         .withSchema(avroSchema)
>         .withCompressionCodec(SNAPPY)
>         .build();
> }
>
> The thing you need to be aware of is that, from what I can tell (please correct
> me if I'm wrong), you need to write your parquet file to disk using an
> instance of ParquetWriter, meaning you are limited to the 500 MB of scratch
> space AWS provides you for each invocation.
>
> Ideally something backed by an OutputStream would be better so you don't
> need to touch disk at all - but I'm not sure if that's possible.
>
> On Fri, 6 Jan 2017 at 22:22 Ryan Blue <[email protected]> wrote:
>
> > Marcos,
> >
> > Parquet currently depends on Hadoop for IO operations and compression. You
> > just need to include hadoop-client in your classpath and it should work
> > fine.
> >
> > rb
> >
> > On Fri, Jan 6, 2017 at 10:27 AM, marcos rebelo <[email protected]> wrote:
> >
> > > Hi all
> > >
> > > I'm receiving a set of csv/json files on S3 and I would like to transform
> > > them to parquet. Considering the restrictions of Lambda functions, I would
> > > like to create some code that can generate the parquet file. I didn't find
> > > how to do it without Hadoop. Considering the simplicity of the task (file
> > > conversion), I can't believe that it is hard to do something similar.
> > >
> > > Note: I'm a Scala developer, but I can code/adapt any Java code.
> > >
> > > Can someone give me one hand on this task?
> > >
> > > Best Regards
> > > Marcos Rebelo
> >
> > --
> > Ryan Blue
> > Software Engineer
> > Netflix

--
Ryan Blue
Software Engineer
Netflix
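For completeness, here is a hedged sketch of how records might be written using Daniel's create(...) helper. It assumes parquet-avro, avro, and hadoop-common on the classpath (as listed in the thread); the schema, field name, output path, and class name are made up for illustration:

```java
import static org.apache.parquet.hadoop.metadata.CompressionCodecName.SNAPPY;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteExample {
    // The helper from Daniel's message, reproduced here as a static method.
    public static ParquetWriter<GenericRecord> create(Path outputPath, Schema avroSchema)
            throws IOException {
        return AvroParquetWriter.<GenericRecord>builder(
                new org.apache.hadoop.fs.Path(outputPath.toString()))
            .withSchema(avroSchema)
            .withCompressionCodec(SNAPPY)
            .build();
    }

    public static void main(String[] args) throws IOException {
        // Made-up single-field record schema, purely for illustration.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\","
            + "\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}");

        // In a Lambda, /tmp is the 500 MB scratch space Daniel mentions.
        try (ParquetWriter<GenericRecord> writer =
                 create(Paths.get("/tmp/events.parquet"), schema)) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1L);
            writer.write(record);
        }
    }
}
```

The try-with-resources is important: ParquetWriter buffers row groups in memory and only writes the footer (with the getPos offsets) on close().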
