Reads and writes both happen in parallel, so as more nodes become available to read and write, the total time stays roughly the same, at least in this case.
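As a rough illustration of that reasoning, here is a minimal sketch of an idealized scale-up model. It assumes the work divides evenly across nodes and that every node adds the same independent read/write throughput. The function name and all the numbers are hypothetical, not measurements from this cluster, and the model ignores job startup overhead, data skew, and any shared bottleneck.

```python
def load_time_seconds(data_mb: float, nodes: int, mb_per_sec_per_node: float) -> float:
    """Idealized wall-clock load time when nodes work fully in parallel.

    Assumption: total throughput is simply nodes * per-node throughput.
    """
    return data_mb / (nodes * mb_per_sec_per_node)


# Hypothetical baseline: 100,000 MB on 4 nodes at 50 MB/s per node.
base = load_time_seconds(100_000, 4, 50)

# Scale-up test: double the data AND double the nodes.
scaled_up = load_time_seconds(200_000, 8, 50)

# For contrast: double only the nodes, keep the data fixed.
more_nodes_only = load_time_seconds(100_000, 8, 50)

print(base, scaled_up, more_nodes_only)
```

In this model the amount of data each node handles is unchanged when data and nodes double together, so the elapsed time is unchanged too. The time would double only if the data doubled while the node count stayed fixed.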

Alan.

James Pirz <mailto:james.p...@gmail.com>
November 16, 2015 at 21:23
Hi,

I am using Hive 1.2 with ORC tables on Hadoop 2.6 on a cluster.
I load data into an ORC table by reading it from an external table over raw text files, using an INSERT statement:

INSERT INTO TABLE myorctab SELECT * FROM mytxttab;

I ran a simple scale-up test to find out how the loading time increases as I double both the size of the data and the number of nodes. I found that the total time remains more or less the same (it scales properly).

I am just wondering why this happens. Naively, I would expect that if I double both the number of partitions and the size of the data, the time should also roughly double, since the system has to partition twice the amount of data among twice the number of partitions. Am I missing something here?

Thanks
