Thanks Simon, saving as a Hive table is also what I had in mind; it's easy to
do with Spark.
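
For example, from spark-shell (where spark is the predefined SparkSession,
assuming it was started with Hive support; the input path and table name below
are just placeholders):

  // Read Metron's JSON output and append it to a Hive table stored as ORC.
  spark.read.json("/apps/metron/indexing/indexed/bro")
    .write.format("orc").mode("append").saveAsTable("metron.bro_events")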

 

Stéphane

 

From: Simon Elliston Ball [mailto:si...@simonellistonball.com] 
Sent: Monday, July 15, 2019 17:43
To: user@metron.apache.org
Subject: Re: batch indexing in JSON format

 

Most users will have a batch process converting the short-term JSON output into 
ORC or Parquet files, often adding them to Hive tables at the same time. I 
usually do this with a Spark job run every hour, or even every 15 minutes or 
less in some high-throughput environments. Anecdotally, I've found ORC 
compresses slightly better than Parquet for most Metron data, but the 
difference is marginal. 
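
A minimal sketch of such a job is below. The sensor path, database, and table
names are hypothetical placeholders; it assumes Metron's HDFS writer has landed
newline-delimited JSON under the indexing output directory, and that the
messages carry Metron's usual epoch-milliseconds timestamp field. Paste into
spark-shell or wrap in a scheduled job:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, from_unixtime, to_date}

  val spark = SparkSession.builder()
    .appName("metron-json-to-orc")
    .enableHiveSupport()
    .getOrCreate()

  // Metron's HDFS writer emits newline-delimited JSON per sensor.
  val events = spark.read.json("/apps/metron/indexing/indexed/bro")

  // The timestamp field is epoch millis; derive a daily partition column.
  val withDay = events.withColumn(
    "event_date",
    to_date(from_unixtime((col("timestamp") / 1000).cast("long"))))

  // Append into a Hive table stored as ORC, partitioned by day.
  withDay.write
    .format("orc")
    .mode("append")
    .partitionBy("event_date")
    .saveAsTable("metron.bro_events")

  spark.stop()

Partitioning by day keeps each run's output in its own directory and makes it
cheap to query or expire old data.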

 

The reason for this is that the HDFS writer was built with the goal of getting 
data persisted to HDFS as soon as possible, so writing a columnar format would 
introduce latency into the streaming process. I suspect that a dev list 
discussion on schema management and alternative output formats will be 
forthcoming. Handling that with a sensible approach to schema migration is not 
trivial, but it is certainly desirable.

 

Simon


On 15 Jul 2019, at 13:25, <stephane.d...@orange.com> wrote:

Hello all,

 

I have a question regarding batch indexing. As far as I can see, data is stored 
in JSON format in HDFS. However, this uses a lot of storage because of JSON's 
verbosity, the enrichments, etc. Is there any way to use Parquet, for example? 
I guess it's possible to do it the day after, i.e. read the JSON with Spark and 
save it in another format, but is it possible to choose the format at the batch 
indexing configuration level?

 

Thanks a lot

 

Stéphane

 

 
