Most users will have a batch process converting the short-term JSON output into 
ORC or Parquet files, often adding them to Hive tables at the same time. I 
usually do this with a Spark job run every hour, or even every 15 minutes or 
less in high-throughput environments. Anecdotally, I’ve found ORC compresses 
slightly better than Parquet for most Metron data, but the difference is 
marginal.
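
For reference, a minimal sketch of such a job (not Metron tooling; the HDFS 
source path, the epoch-millisecond "timestamp" field used for the date 
partition, and the metron.<sensor>_events Hive table are all assumptions to 
adjust for your deployment) might look like this in Spark/Scala:

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, from_unixtime, to_date}

object MetronJsonToOrc {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("metron-json-to-orc")
      .enableHiveSupport()                                   // write straight into Hive
      .getOrCreate()

    val sensor = "bro"                                       // hypothetical sensor name
    val source = s"/apps/metron/indexing/indexed/$sensor"    // assumed HDFS indexing output path

    // Read the line-delimited JSON the HDFS indexing writer produced.
    val events = spark.read.json(source)

    // Derive a date partition from the (assumed) epoch-millis timestamp field,
    // then rewrite as ORC, appending to a Hive table in the same pass.
    events
      .withColumn("dt", to_date(from_unixtime((col("timestamp") / 1000).cast("long"))))
      .write
      .mode(SaveMode.Append)
      .partitionBy("dt")
      .format("orc")
      .saveAsTable(s"metron.${sensor}_events")               // hypothetical Hive db.table

    spark.stop()
  }
}

Swapping .format("orc") for .format("parquet") is the only change needed to 
compare the two formats on your own data.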

The reason for this is that the HDFS writer was built with the goal of getting 
data persisted in HDFS as soon as possible, so writing a columnar format would 
introduce latency into the streaming process. I suspect a dev list discussion 
on schema management and alternative output formats will be forthcoming. 
Handling that with a sensible approach to schema migration is not trivial, but 
it is certainly desirable.

Simon

> On 15 Jul 2019, at 13:25, <stephane.d...@orange.com> wrote:
> 
> Hello all,
>  
> I have a question regarding batch indexing. As far as I can see, data are stored 
> in JSON format in HDFS. Nevertheless, this uses a lot of storage because of 
> JSON verbosity, enrichment, etc. Is there any way to use Parquet, for example? I 
> guess it’s possible to do it the day after, I mean you read the JSON and with 
> Spark save it in another format, but is it possible to choose the format at 
> the batch indexing configuration level?
>  
> Thanks a lot
>  
> Stéphane
>  
>  
