Most users will have a batch process converting the JSON short-term output into ORC or Parquet files, often adding them to Hive tables at the same time. I usually do this with a Spark job run every hour, or even every 15 minutes or less in some high-throughput environments. Anecdotally, I've found ORC compresses slightly better than Parquet for most Metron data, but the difference is marginal.
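As a rough illustration, a minimal sketch of that kind of conversion job might look like the following. The input path, database, and table names here are hypothetical, not Metron defaults, and the hourly scheduling would be handled externally (e.g. cron or Oozie):

```scala
// Hypothetical sketch: rewrite Metron's JSON HDFS output as ORC
// and register it as a Hive table. Paths and names are assumptions.
import org.apache.spark.sql.SparkSession

object JsonToOrcBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("metron-json-to-orc")
      .enableHiveSupport()
      .getOrCreate()

    // Read the JSON files written by the Metron HDFS writer for one sensor.
    val raw = spark.read.json("hdfs:///apps/metron/indexing/indexed/bro")

    // Append as ORC into a Hive-managed table; switch format("orc")
    // to format("parquet") if Parquet is preferred.
    raw.write
      .mode("append")
      .format("orc")
      .saveAsTable("metron.bro_orc")

    spark.stop()
  }
}
```

Compaction of the many small hourly files into larger ones (e.g. via `coalesce` before the write) is usually worth adding in practice, since HDFS and Hive both perform better with fewer, larger files.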
The reason for this is that the HDFS writer was built with the goal of getting data persisted in HDFS as soon as possible, so writing a columnar format would introduce latency into the streaming process. I suspect that a dev list discussion on schema management and alternative output formats will be forthcoming. Handling that with a sensible approach to schema migration is not trivial, but certainly desirable.

Simon

> On 15 Jul 2019, at 13:25, <stephane.d...@orange.com> wrote:
>
> Hello all,
>
> I have a question regarding batch indexing. As far as I can see, data are
> stored in JSON format in HDFS. Nevertheless, this uses a lot of storage
> because of JSON verbosity, enrichment, etc. Is there any way to use Parquet,
> for example? I guess it's possible to do it the day after, I mean you read
> the JSON and with Spark you save it as another format, but is it possible to
> choose the format at the batch indexing configuration level?
>
> Thanks a lot
>
> Stéphane