Hive supports a file format named RCFile, which can give better performance, but in my project it only ties with the plain-text format.
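If you want to try RCFile yourself, a minimal sketch looks like the following. The table and column names (logs_text, logs_rc, dt, user_id, url) are made up for illustration, and I am assuming the plain-text table already exists with the same columns.

```sql
-- Hypothetical table names and columns; adjust to your own schema.
-- Create a copy of the table that is stored in RCFile format.
CREATE TABLE logs_rc (
  dt      STRING,
  user_id STRING,
  url     STRING
)
STORED AS RCFILE;

-- Rewrite the plain-text data into the RCFile table.
INSERT OVERWRITE TABLE logs_rc
SELECT dt, user_id, url
FROM logs_text;
```

Running the same query against both tables is the simplest way to see whether the format helps on your data; as noted above, it did not make a difference for me.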
Hive also has an index feature, but it is not very convenient or practical. I think the best way to optimize is still to reuse the same source tables, avoid sub-queries, and merge as many HiveQL statements as possible.

On Fri, Sep 21, 2012 at 10:30 AM, Mapred Learn <[email protected]> wrote:
> Hi,
> We have datasets which are about 10-15 TB in size.
>
> We want to run hive queries on top of this input data.
>
> What are ways to reduce stress on our cluster for running many such big
> queries( include joins too) in parallel ?
> How to enable compression etc for intermediate hive output ?
> How to make job cache does not go to high etc ?
> In short , best practices for huge queries on hive ?
>
> Any inputs are really appreciated !
>
> Thanks,
> JJ
>
> Sent from my iPhone
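On the intermediate-compression question and on merging statements, here is a rough sketch rather than a tuned configuration. The table names (page_views, daily_counts, user_counts) are hypothetical, and the Snappy codec is an assumption: it only works if the codec is installed on your cluster, so substitute whichever codec you actually have.

```sql
-- Compress the data passed between the MapReduce stages of a query.
SET hive.exec.compress.intermediate=true;
-- Assumption: Snappy is available on the cluster; use another codec if not.
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Multi-insert: the source table is named once in the FROM clause
-- and feeds both aggregations, instead of running two separate queries.
FROM page_views pv
INSERT OVERWRITE TABLE daily_counts
  SELECT pv.dt, COUNT(*)
  GROUP BY pv.dt
INSERT OVERWRITE TABLE user_counts
  SELECT pv.user_id, COUNT(*)
  GROUP BY pv.user_id;
```

The multi-insert form is what I mean by reusing the same source table: two result tables are filled from one statement over page_views.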
