Loading data into different partitions in parallel is fine, because it is equivalent to writing to different files on HDFS (each partition lives in its own directory).
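As a rough sketch of how the parallel loads could be driven, the script below builds one LOAD DATA statement per day and runs several at once. It assumes the `hive` command-line client is on the PATH and that each day's files sit under `/logs/processed/<date>` as in the question. The function names (`build_statement`, `run_hive`, `load_days`) and the worker count of 4 are made up for illustration, not anything Hive prescribes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def build_statement(day):
    """Return the LOAD DATA statement for one day's partition."""
    d = day.isoformat()
    return (
        f"LOAD DATA INPATH '/logs/processed/{d}' OVERWRITE INTO TABLE "
        f"processedlogs PARTITION(logdate='{d}')"
    )


def run_hive(statement):
    """Run one statement through the Hive CLI and return its exit code.

    Assumes a `hive` executable is available on the PATH.
    """
    return subprocess.run(["hive", "-e", statement]).returncode


def load_days(first_day, num_days, runner=run_hive, max_workers=4):
    """Load each day's partition concurrently; return {day: exit code}.

    Every task targets a different partition, so no two tasks write to
    the same partition directory.
    """
    days = [first_day + timedelta(days=i) for i in range(num_days)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        codes = list(pool.map(lambda day: runner(build_statement(day)), days))
    return dict(zip(days, codes))
```

Calling `load_days(date(2013, 4, 1), 30)` would cover the month in the question. Any day whose exit code is non-zero can be rerun on its own, since each statement touches only its own partition.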

2013/5/3 selva <selvai...@gmail.com>

> Hi All,
>
> I need to load a month worth of processed data into a hive table. Table
> have 10 partitions. Each day have many files to load and each file is
> taking two seconds(constantly) and i have ~3000 files). So it will take
> days to complete for 30 days worth of data.
>
> I planned to load every day data parallel into respective partition so
> that i can complete it short time.
>
> But i need clarrification before proceeding it.
>
> Question:
>
> 1. Will it cause data loss/corruption by loading parallel in different
> partition of same hive table ?
>
> For example, Assume i am doing like below,
>
> Table : processedlogs
> Partition : logdate
>
> Running below commands parallel,
> LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE
> processedlogs PARTITION(logdate='2013-04-01');
> LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
> processedlogs PARTITION(logdate='2013-04-02');
> LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
> processedlogs PARTITION(logdate='2013-04-03');
> LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
> processedlogs PARTITION(logdate='2013-04-04');
> .....
> LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE
> processedlogs PARTITION(logdate='2013-04-30');
>
> Thanks
> Selva