> It is not that simple. The average Hadoop user has 6-7 years of data. They do
> not have a "magic" convert everything button. They also have legacy processes
> that don't/can't be converted.
> …
> They do not want the "fastest format", they want "the fastest Hive for their
> data".
I've yet to run into that sort of luddite - maybe engineers can hold onto an
opinion like that in isolation, but businesses are in general cost-sensitive
when it comes to storage & compute. The cynic in me says that if there are a
few more down rounds, ORC adoption will suddenly skyrocket in companies which
hoard data.

ORC has massive compression advantages over Text, especially for
attribute+metric SQL data. A closer look at this is warranted. Some of this
stuff literally blows my mind - customer_demographics in TPC-DS is a great
example of doing the impossible.

tpcds_bin_partitioned_orc_1000.customer_demographics
[numFiles=1, numRows=1920800, totalSize=46194, rawDataSize=726062400]

which makes it 0.19 bits per row (46194 bytes * 8 / 1920800 rows - not bytes,
*BITS*). Compare to Parquet (which is still far better than Text)

tpcds_bin_partitioned_parq_1000.customer_demographics
[numFiles=1, numRows=1920800, totalSize=16813614, rawDataSize=17287200]

which uses 70 bits per row.

So as companies' data ages over the years, they tend to be very receptive to
the idea of switching their years-old data to ORC (and then using tiered HDFS
etc) - there's a rough sketch of that conversion below. Still no magic button,
but apparently money is a strong incentive to solve hard problems.

> They get data dumps from potentially non-sophisticated partners, maybe using
> S3 and csv, because maybe their partner uses Vertica or Redshift. I think
> you understand this.

That is something I'm painfully aware of - after the first few months, the
second request is "Can you do Change-Data-Capture, so that we can reload every
30 mins? Can we do every 5 minutes?". And that's why Hive ACID has got SQL
MERGE statements, so that you can grab a changelog and apply it with an
UPSERT/UPDATE LATEST (sketch below). And unlike the old lock manager, writes
don't lock out any readers.

Then, as the warehouse gets bigger, "can you prevent the UPSERT from thrashing
my cache & IO? Because the more data I have in the warehouse, the longer the
update takes." That's what the min-max SemiJoin reduction in Tez does (i.e.
the min/max from the changelog gets pushed into the ORC index on the target
table scan, so that only the intersection is loaded into cache). We gather a
runtime range from the updates and push it to the ACID base, so that we don't
have to read data into memory that doesn't have any updates. Also, if you have
a sequential primary key on the OLTP side, this comes in as a ~100x speed-up
for such a use-case … because ACID ORC has transaction-consistent indexes
built in.

> Suppose you have 100 GB of text data in an S3 bucket, and say querying it
> takes, let's just say, "50 seconds for a group by type query". …
> Now that second copy... Maybe I can do the same group by in 30 seconds.

You're off by a couple of orders of magnitude - in fact, that was my last
year's Hadoop Summit demo: 10 terabytes of Text on S3, converted to ORC +
LLAP.

http://people.apache.org/~gopalv/LLAP-S3.gif (GIANT 38 MB GIF)

That's doing nearly a billion rows a second across 9 nodes, through a join +
group-by - a year ago. You can probably hit 720M rows/sec with plain Text with
the latest LLAP on the same cluster today. And with LLAP, adding S3 SSE
(encrypted data on S3) adds a ~4% overhead for ORC, which is another neat
trick. And with S3Guard, we have the potential to get the consistency needed
for ACID.

The format improvements are foundational to the cost-effectiveness on the
cloud - you can see the impact of the format on the IO costs when you use a
non-Hive engine like AWS Athena with ORC and Parquet [1].
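Since the "switch the old data to ORC" part is the bit that sounds scary, here
is a minimal sketch of the conversion - the table names (web_logs_text,
web_logs_orc) are invented for illustration, and a real migration would
normally go partition by partition rather than in one giant statement:

    -- Hypothetical tables: web_logs_text is the existing text-backed table,
    -- web_logs_orc is the ORC copy that downstream queries get pointed at.
    CREATE TABLE web_logs_orc
    STORED AS ORC
    TBLPROPERTIES ('orc.compress'='ZLIB')
    AS
    SELECT * FROM web_logs_text;

    -- For partitioned tables you'd typically INSERT OVERWRITE one partition
    -- (or one year) at a time, so the rewrite can be scheduled and resumed.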
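And to make the CDC part concrete, here's roughly what the MERGE-based upsert
looks like - again with made-up table/column names, the only real requirement
being that the target is an ACID table (transactional ORC):

    -- Hypothetical ACID target table, keyed on customer_id.
    CREATE TABLE customers (
      customer_id BIGINT,
      email       STRING,
      updated_at  TIMESTAMP
    )
    CLUSTERED BY (customer_id) INTO 16 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional'='true');

    -- Apply a changelog pulled from the OLTP side every N minutes:
    -- rows flagged 'D' are deletes, everything else is upsert-latest.
    MERGE INTO customers AS t
    USING customer_changelog AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'D' THEN DELETE
    WHEN MATCHED THEN UPDATE SET email = s.email, updated_at = s.updated_at
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.email, s.updated_at);

The ON key is also where the min-max reduction described above gets its range
from - the min/max of customer_id in the changelog is what ends up being
checked against the ORC indexes on the target scan.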
> 1) io bound
> 2) have 10 seconds of startup time anyway.

LLAP is memory-bandwidth bound today, because the IO costs are so low & hidden
by async IO - the slowest part of LLAP for a BI query is the amount of time it
takes to convert the cache structures into vectorized rows. Part of that is
due to the fact that ORC is really tightly compressed, and the decode loops
need to get more instructions-per-clock. Some parts of ORC decompression can
be faster than a raw memcpy of the original data, because of cache access
patterns (rather, writing sequentially to the same buffers again is faster
than writing to a new location). If the focus on sequential writes makes you
think of disks, this is why memory is treated as the new disk.

The startup time for something like Tableau is actually closer to 240ms (& that
can come down to 90ms if we disable failure tolerance). We've got sub-second
SQL execution, sub-second compiles, sub-second submissions … with all of it
adding up to single or double digit seconds over a billion rows of data.

Cheers,
Gopal

[1] - http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html