Hi,

You are right, I will add an option to use your own compiled class or a dynamic message.
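Roughly, what I have in mind looks like this (just a sketch; the
configuration key "parquet.proto.class" and the AddressBook class are
placeholders, the final names may differ):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import parquet.proto.ProtoParquetInputFormat;

    public class ProtoReadJob {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical option: name the compiled protobuf class to decode
        // into. When unset, the class keeps being resolved from the file
        // footer as it is today; a DynamicMessage fallback could cover
        // files whose footer class is not on the classpath.
        conf.set("parquet.proto.class",
                 "com.example.AddressBookProtos$AddressBook");

        Job job = Job.getInstance(conf, "read-parquet-as-protobuf");
        job.setJarByClass(ProtoReadJob.class);
        job.setInputFormatClass(ProtoParquetInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Mapper/reducer setup elided; the mapper would receive the
        // protobuf message (or its builder) as the input value.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }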
Lukas

On Sun, Oct 26, 2014 at 8:27 PM, Chen Song <[email protected]> wrote:
> Hi,
>
> I am new to Parquet and we have a complicated use case in which we want
> to adopt Parquet as our storage format.
>
> Current:
>
> - The data is stored in Sequence files as Protobuf.
> - We have map reduce jobs to write the data. Hive tables were created
>   with the Protobuf Serde from elephant-bird so people can query the
>   data via Hive.
> - We enhanced elephant-bird with our own serializer so one can also
>   write data into a table via Hive, with the data stored in Sequence
>   files as Protobuf.
>
> Future:
> We want to use Parquet as the underlying storage format without losing
> the Protobuf abstraction at the application layer. After a bit of
> research and practice, I have a few questions.
>
> - Say a Hive table is created as a Parquet table, and data is written
>   via Hive.
>   - If I want to read the data in map reduce jobs as Protobuf records,
>     can I use ProtoParquetInputFormat in
>     https://github.com/Parquet/parquet-mr/blob/master/parquet-protobuf/src/main/java/parquet/proto/ProtoParquetInputFormat.java?
>     After looking at the API, it doesn't seem possible to specify the
>     Protobuf class for the input path. Instead, ProtoParquetInputFormat
>     derives the class from the footer of the underlying data. Is it
>     fair to say ProtoParquetInputFormat will only read data written by
>     ProtoParquetOutputFormat? Is there a way to work around this?
>   - If not, is there an out-of-the-box Hive output format I can use to
>     piggyback on ProtoParquetOutputFormat?
> - If data is written by a map reduce job with ProtoParquetOutputFormat,
>   will read queries in Hive work automatically?
>
> Thanks a lot in advance. Any suggestions would be appreciated.
>
> --
> Chen Song
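On the write-side question above: a driver using ProtoParquetOutputFormat
looks roughly like the sketch below (AddressBook again stands in for
your own generated class; please check the method names against the
parquet-mr version you build against):

    import com.example.AddressBookProtos.AddressBook; // placeholder generated class
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import parquet.proto.ProtoParquetOutputFormat;

    public class ProtoWriteJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(),
                                  "write-protobuf-as-parquet");
        job.setJarByClass(ProtoWriteJob.class);
        job.setOutputFormatClass(ProtoParquetOutputFormat.class);
        // Registers the message class: its descriptor is converted into
        // the Parquet schema and written to the file footer, which is
        // where ProtoParquetInputFormat later picks the class up from.
        ProtoParquetOutputFormat.setProtobufClass(job, AddressBook.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        // Mapper emitting AddressBook values elided.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

And on your last question: the files this writes are plain Parquet, so
as far as I know a Hive table declared over matching columns with the
Parquet SerDe should read them; Hive goes by the Parquet schema in the
footer, not by the protobuf class.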
