Hello all,

 

Thanks for your useful answers, it all makes sense to me now. We will 
probably go with post-processing file conversion.

 

Have a good day,

 

Stéphane

 

From: Otto Fowler [mailto:[email protected]] 
Sent: Monday, July 15, 2019 16:19
To: [email protected]
Subject: Re: batch indexing in JSON format

 

We could do something like have some other topology or job that kicks off when 
an HDFS file is closed.

So before we start a new file, we “queue” a message for some conversion 
topology or job, something like that.
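
As a rough sketch only (not from the original thread), one way to approximate this 
is a small poller over the indexing output directory that hands every newly rotated 
file to a conversion job. The directory, the poll interval, and the 
convert_to_parquet.py script are all assumptions (the Spark sketch further down in 
this thread is one possible body for that script); HDFS also has an inotify-style 
event stream on the Java side, if I recall correctly, if you want a true push model.

#!/usr/bin/env python3
# Rough sketch, not part of Metron: poll an HDFS directory and submit a
# conversion job for every file we have not seen before. Paths and the
# convert_to_parquet.py script are assumptions, not Metron defaults.
import subprocess
import time

HDFS_DIR = "/apps/metron/indexing/indexed/bro"   # assumed indexing output dir
seen = set()

def list_hdfs_files(path):
    result = subprocess.run(["hdfs", "dfs", "-ls", path],
                            capture_output=True, text=True, check=True)
    # the last column of each "hdfs dfs -ls" row is the full file path;
    # plain files start with "-", directories with "d"
    return [line.split()[-1] for line in result.stdout.splitlines()
            if line.startswith("-")]

while True:
    for path in list_hdfs_files(HDFS_DIR):
        if path not in seen:
            seen.add(path)
            # hypothetical conversion job, e.g. the Spark snippet further down
            subprocess.run(["spark-submit", "convert_to_parquet.py", path])
    time.sleep(300)   # in practice, only pick up files the topology has rotated
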

 

 

 

On July 15, 2019 at 10:04:08, Michael Miklavcic ([email protected]) 
wrote:

Adding to what Ryan said (and I agree), there are a couple additional 
consequences: 

1.      There are questions around just how optimal an ORC file written in 
real time can actually be. To get columns of data striped effectively, you need 
a sizable number of rows, on the order of thousands, which is unlikely to 
accumulate in real time. Some of these storage systems do run "engines" that 
manage compactions (like HBase does), but I haven't checked on this in a while. 
I think Kudu may do this, actually, but again that's a whole new storage engine, 
not just a format.

2.      More importantly, potential loss of data - HDFS is the source of truth, 
and we guarantee at-least-once processing. To achieve the kind of efficient 
columnar storage that makes a columnar format worthwhile, we'd likely have to 
use larger batches in indexing. That introduces lag in the system and means 
we'd have to worry more about Storm failures than we do currently. With the 
current HDFS writing, partial files still get written even if there's a failure 
in the topology or elsewhere. It does take up more disk space, but we felt this 
was a reasonable architectural tradeoff for something that should be writable 
ad hoc.

That being said, you could certainly write conversion jobs that lag the 
real-time processing just enough to keep the benefits of real time while still 
doing a decent job of getting your data into a more efficient storage format, 
if you choose.
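
As an illustration only (not from the thread), a minimal PySpark sketch of such a 
conversion job, assuming the indexed JSON sits under a per-sensor directory and 
Parquet is the target format; both paths below are hypothetical.

#!/usr/bin/env python3
# Minimal offline conversion sketch: read the JSON the HDFS indexer wrote
# and rewrite it as Parquet. Paths are illustrative, not Metron defaults.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

# assumed locations - adjust to your indexing output and warehouse layout
src = "hdfs:///apps/metron/indexing/indexed/bro/*.json"
dst = "hdfs:///apps/metron/warehouse/bro_parquet"

df = spark.read.json(src)                 # one row per indexed message
df.write.mode("append").parquet(dst)      # or .orc(dst) for ORC

spark.stop()

Run it on whatever scheduler you already use (cron, Oozie, etc.), a little behind 
the indexing topology, so the files it reads are already closed.
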

 

Cheers,

Mike

 

 

On Mon, Jul 15, 2019 at 7:00 AM Ryan Merriman <[email protected]> wrote:

The short answer is no.  Offline conversion to other formats (as you describe) 
is a better approach anyway.  Writing to a Parquet/ORC file is more compute 
intensive than just writing JSON data directly to HDFS, and it's not something 
you need to do in real time since you have the same data available in ES/Solr.  
It would slow down the batch indexing topology for no real gain.
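
For reference, and going from memory of the Metron docs (so treat the field names 
as an assumption to verify against your version): the per-sensor indexing config 
only lets you enable or disable each writer and tune things like batch size and 
index name; there is no output-format setting, which is why the short answer is 
no. Roughly:

{
  "hdfs": {
    "index": "bro",
    "batchSize": 5,
    "enabled": true
  },
  "elasticsearch": {
    "index": "bro",
    "batchSize": 5,
    "enabled": true
  }
}
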


On Jul 15, 2019, at 7:25 AM, <[email protected]> wrote:

Hello all,

 

I have a question regarding batch indexing. As far as I can see, data are stored 
in JSON format in HDFS. However, this uses a lot of storage because of JSON 
verbosity, enrichments, etc. Is there any way to use Parquet, for example? I 
guess it's possible to do it the day after, i.e. read the JSON and save it in 
another format with Spark, but is it possible to choose the format at the batch 
indexing configuration level?

 

Thanks a lot

 

Stéphane

 

 
