Both the reads and the writes happen in parallel. When the data and the
number of nodes double together, each node still reads and writes about
the same amount of data, so, at least in this case, the total time
stays roughly the same.
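
As a rough illustration of that reasoning (a back-of-the-envelope
model, not something measured on this cluster), the sketch below
assumes a hypothetical fixed per-node throughput and a perfectly even
split of the data. It ignores job startup cost, skew and any shuffle.
The function name and all numbers are made up for the example.

```python
def load_time_seconds(data_mb, nodes, per_node_mb_per_s=100):
    """Idealized load time when every node reads and writes only its own share.

    per_node_mb_per_s is an assumed, constant per-node throughput.
    """
    per_node_mb = data_mb / nodes          # share of the data handled by each node
    return per_node_mb / per_node_mb_per_s


# Scale-up: data size and node count double together.
for scale in (1, 2, 4):
    data_mb = 100_000 * scale
    nodes = 4 * scale
    print(f"{data_mb} MB on {nodes} nodes: {load_time_seconds(data_mb, nodes):.0f} s")

# For contrast: doubling the data while keeping the same 4 nodes.
print(f"200000 MB on 4 nodes: {load_time_seconds(200_000, 4):.0f} s")
```

In this model the time depends only on the data per node, so it stays
flat in the scale-up case and doubles only when the data grows without
adding nodes.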
Alan.
James Pirz <mailto:james.p...@gmail.com>
November 16, 2015 at 21:23
Hi,
I am using Hive 1.2 with ORC tables on Hadoop 2.6 on a cluster.
I load data into an ORC table by reading it from an external
table over raw text files, using an INSERT statement:
INSERT INTO TABLE myorctab SELECT * FROM mytxttab;
I ran a simple scale-up test to see how the loading time changes
as I double both the size of the data and the number of nodes. I found
that the total time remains more or less the same (it scales properly).
I am just wondering why this is happening. Naively, I would think that
if I double both the number of partitions and the size of the data, the
time should also roughly double, since the system needs to partition
twice the amount of data as before among twice as many partitions. Am I
missing something here?
Thanks