> It is not that simple. The average Hadoop user has 6-7 years of data. They do 
> not have a "magic" convert everything button. They also have legacy processes 
> that don't/can't be converted. 
…
> They do not want the "fastest format"; they want "the fastest Hive for their 
> data".

I've yet to run into that sort of luddite - maybe engineers can hold onto an 
opinion like that in isolation, but businesses are generally cost-sensitive 
when it comes to storage & compute. 

The cynic in me says that if there are a few more down rounds, ORC adoption 
will suddenly skyrocket in companies which hoard data.

ORC has massive compression advantages over Text, especially for 
attribute+metric SQL data. A closer look at this is warranted.

Some of this stuff literally blows my mind - customer_demographics in TPC-DS is 
a great example of doing the impossible.

tpcds_bin_partitioned_orc_1000.customer_demographics [numFiles=1, 
numRows=1920800, totalSize=46194, rawDataSize=726062400]

which makes it 0.19 bits per row (not bytes, *BITS*).

Compare to Parquet (which is still far better than text)

tpcds_bin_partitioned_parq_1000.customer_demographics  [numFiles=1, 
numRows=1920800, totalSize=16813614, rawDataSize=17287200]

which uses 70 bits per row.
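
A quick sanity check in Python, straight from the totalSize/numRows fields 
above (totalSize being bytes on disk):

  orc_bits_per_row  = 46194 * 8 / 1920800       # -> ~0.19 bits per row
  parq_bits_per_row = 16813614 * 8 / 1920800    # -> ~70 bits per row
  print(orc_bits_per_row, parq_bits_per_row)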

So as companies' data ages over the years, they tend to be very receptive to 
the idea of switching their years-old data to ORC (and then using tiered HDFS 
etc).
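
For reference, the rewrite itself is usually just a CTAS. A minimal sketch, 
assuming a HiveServer2 endpoint reachable via PyHive (the host and the 
database/table names are made up):

  from pyhive import hive

  conn = hive.connect(host='hs2.example.com', port=10000)
  cur = conn.cursor()
  # Rewrite an old text-format table as ORC; tiered HDFS policies can then
  # move the resulting files to colder storage.
  cur.execute("""
      CREATE TABLE warehouse.events_2011_orc
      STORED AS ORC
      TBLPROPERTIES ('orc.compress'='ZLIB')
      AS SELECT * FROM warehouse.events_2011_text
  """)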

Still no magic button, but apparently money is a strong incentive to solve hard 
problems.

> They get data dumps from potentially non-sophisticated partners, maybe using 
> S3 and CSV, because maybe their partner uses Vertica or Redshift. I think 
> you understand this.

That is something I'm painfully aware of - after the first few months, the 
second request is "Can you do Change-Data-Capture, so that we can reload every 
30 mins? Can we do every 5 minutes?".

And that's why Hive ACID has SQL MERGE statements, so that you can grab a 
ChangeLog and apply it with an UPSERT/UPDATE LATEST. And unlike the old 
lock manager, writes don't lock out any readers.
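
To make that concrete, the changelog apply is a single statement. A hedged 
sketch (again via PyHive; the table/column names and the 'op' flag convention 
are made up for illustration):

  from pyhive import hive

  conn = hive.connect(host='hs2.example.com', port=10000)
  cur = conn.cursor()
  # Apply one changelog batch onto an ACID ORC table.
  cur.execute("""
      MERGE INTO dw.orders AS t
      USING staging.orders_changelog AS c
        ON t.order_id = c.order_id
      WHEN MATCHED AND c.op = 'D' THEN DELETE
      WHEN MATCHED THEN UPDATE SET amount = c.amount, updated_ts = c.updated_ts
      WHEN NOT MATCHED THEN INSERT VALUES (c.order_id, c.amount, c.updated_ts)
  """)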

Then as the warehouse gets bigger, "can you prevent the UPSERT from thrashing 
my cache & IO? Because the more data I have in the warehouse, the longer the 
update takes."

And that's what the min-max SemiJoin reduction in Tez does (i.e. the min/max 
from the changelog gets pushed into the ORC index on the target table scan, so 
that only the intersection is loaded into cache). We gather a runtime range 
from the updates and push it to the ACID base, so that we don't have to read 
data into memory that doesn't have any updates.
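
Conceptually (a toy model in Python, not the actual Tez/ORC code), the pruning 
looks like this:

  # Take the key range seen in the changelog and skip every ORC stripe whose
  # index range can't overlap it.
  changelog_keys = [10_000_017, 10_000_042, 10_000_318]
  lo, hi = min(changelog_keys), max(changelog_keys)

  # (stripe_id, key_min, key_max) as recorded in the stripe-level index
  stripe_index = [(0, 1, 4_999_999),
                  (1, 5_000_000, 9_999_999),
                  (2, 10_000_000, 14_999_999)]

  to_read = [sid for sid, kmin, kmax in stripe_index
             if kmin <= hi and kmax >= lo]
  print(to_read)   # -> [2]; stripes 0 and 1 are never read into cache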

Also, if you have a sequential primary key on the OLTP side, this works out to 
a ~100x speed-up for such a use-case … because ACID ORC has 
transaction-consistent indexes built-in.

> Suppose you have 100 GB of text data in an S3 bucket, and say querying it 
> takes, let's just say, "50 seconds for a group by type query".
… 
> Now that second copy... Maybe I can do the same group by in 30 seconds. 

You're off by a couple of orders of magnitude - in fact, that was my Hadoop 
Summit demo last year: 10 terabytes of Text on S3, converted to ORC + LLAP.

http://people.apache.org/~gopalv/LLAP-S3.gif (GIANT 38Mb GIF)

That's doing nearly a billion rows a second across 9 nodes, through a join + 
group-by - a year ago. You can probably hit 720M rows/sec with plain Text using 
the latest LLAP on the same cluster today.

And with LLAP, adding S3 SSE (encrypted data on S3) adds a ~4% overhead for 
ORC, which is another neat trick. And with S3Guard, we have the potential to 
get the consistency needed for ACID.

The format improvements are foundational to the cost-effectiveness on the cloud 
- you can see the impact of the format on the IO costs when you use a non-Hive 
engine like AWS Athena with ORC and Parquet [1].
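
The scan-cost arithmetic is simple enough to sketch. Assuming Athena-style 
pay-per-TB-scanned pricing (around $5/TB when [1] was written) and a 
hypothetical 10x shrink from text to ORC - the real ratio is data-dependent, 
and column pruning cuts the scanned bytes further:

  price_per_tb = 5.00
  text_tb = 10.0              # e.g. the 10 TB text dataset from the demo
  orc_tb = text_tb / 10       # assumed compression ratio, for illustration
  print(f"full scan over text: ${text_tb * price_per_tb:.2f}")
  print(f"full scan over ORC:  ${orc_tb * price_per_tb:.2f}")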

> 1) io bound 
> 2) have 10 seconds of startup time anyway. 

LLAP is memory-bandwidth bound today, because the IO costs are so low & hidden 
by async IO - the slowest part of LLAP for a BI query is the amount of time it 
takes to convert the cache structures into vectorized rows.

Part of it is due to the fact that ORC is really tightly compressed and the 
decode loops need to get more instructions-per-clock. Some parts of ORC 
decompression can be faster than a raw memcpy of the original data, because of 
cache access patterns (rather, writing sequentially to the same buffers again 
is faster than writing to a new location). If the focus on sequential writes 
makes you think of disks, this is why memory is treated as the new disk.

The startup time for something like Tableau is actually closer to 240ms (& that 
can come down to 90ms if we disable failure tolerance).

We've got sub-second SQL execution, sub-second compiles, sub-second submissions 
… with all of it adding up to single or double digit seconds over a billion 
rows of data.

Cheers,
Gopal
[1] - http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html
