Install CDH4 using tarball with MRv1, not the YARN version
Hi folks,

I am trying to install CDH4 from tarballs with MRv1, not the YARN version (MRv2). I downloaded the two tarballs (mr1-0.20.2+ and hadoop-2.0.0+) from http://archive.cloudera.com/cdh4/cdh/4/ as per the Cloudera instruction I found: "If you install CDH4 from a tarball, you will install YARN. To install MRv1 as well, install the separate MRv1 tarball (mr1-0.20.2+) alongside the YARN one (hadoop-2.0.0+)." (at the bottom of http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_2.html )

But I could not find steps for installing from these two tarballs, since Cloudera tailored the steps to the package installation. I am totally confused about whether to start DFS from the hadoop-2.0.0+ tree and start MapReduce from the mr1-0.20.2+ tree, or something else. Kindly help me with setting this up.

Thanks
Selva
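To make my guess concrete, this is the split I am imagining (only a sketch of my assumption, not something I found documented; the unpack paths are placeholders and the script names are the usual Apache layouts, which I have not verified against the CDH tarballs):

# Assumed layout: both tarballs unpacked side by side (paths are placeholders)
#   /opt/hadoop-2.0.0-cdh4   <- YARN/HDFS tarball (hadoop-2.0.0+)
#   /opt/hadoop-mr1-cdh4     <- MRv1 tarball (mr1-0.20.2+)

# HDFS daemons started from the hadoop-2.0.0+ tree:
cd /opt/hadoop-2.0.0-cdh4
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode

# MRv1 daemons started from the mr1-0.20.2+ tree, pointing at the same HDFS:
cd /opt/hadoop-mr1-cdh4
bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker

Is this the intended split? And should both trees share one set of *-site.xml configs, or does each keep its own?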
Re: Parallel Load Data into Two partitions of a Hive Table
Thanks Yanbo. My doubt is clarified now.

On Fri, May 3, 2013 at 2:38 PM, Yanbo Liang wrote: > Loading data into different partitions in parallel is OK, because it is equivalent to > writing to different files on HDFS > > > 2013/5/3 selva > >> Hi All, >> >> I need to load a month worth of processed data into a hive table. Table >> have 10 partitions. Each day have many files to load and each file is >> taking two seconds(constantly) and i have ~3000 files). So it will take >> days to complete for 30 days worth of data. >> >> I planned to load every day data parallel into respective partition so >> that i can complete it short time. >> >> But i need clarrification before proceeding it. >> >> Question: >> >> 1. Will it cause data loss/corruption by loading parallel in different >> partition of same hive table ? >> >> For example, Assume i am doing like below, >> >> Table : processedlogs >> Partition : logdate >> >> Running below commands parallel, >> LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE >> processedlogs PARTITION(logdate='2013-04-01'); >> LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE >> processedlogs PARTITION(logdate='2013-04-02'); >> LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE >> processedlogs PARTITION(logdate='2013-04-03'); >> LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE >> processedlogs PARTITION(logdate='2013-04-04'); >> . >> LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE >> processedlogs PARTITION(logdate='2013-04-30'); >> >> Thanks >> Selva

-- selva
Parallel Load Data into Two partitions of a Hive Table
Hi All,

I need to load a month's worth of processed data into a Hive table. The table has 10 partitions. Each day has many files to load, each file takes about two seconds (consistently), and I have ~3000 files, so it will take days to complete 30 days' worth of data.

I plan to load each day's data in parallel into its respective partition so that I can finish in a shorter time, but I need clarification before proceeding.

Question:

1. Will loading in parallel into different partitions of the same Hive table cause data loss or corruption?

For example, assume I am doing the following:

Table : processedlogs
Partition : logdate

Running the commands below in parallel:

LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-01');
LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-02');
LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-03');
LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-04');
.
LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-30');

Thanks
Selva
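Concretely, I am thinking of launching the loads like this (just a sketch of how I would script it with the plain hive CLI; I have not run it yet):

# Launch one LOAD per day in the background, then wait for all of them
for day in $(seq -w 1 30); do
  hive -e "LOAD DATA INPATH '/logs/processed/2013-04-${day}' OVERWRITE INTO TABLE processedlogs PARTITION(logdate='2013-04-${day}')" &
done
wait   # block until all 30 loads have finished

If 30 concurrent hive sessions are too hard on the metastore, I could also run them in smaller batches.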
Re: High IO Usage in Datanodes due to Replication
Hi Harsh,

You are right, our Hadoop version is "0.20.2-cdh3u1", which lacks HDFS-2379. As you suggested, I have doubled the DN heap size (the exact change is sketched after the quoted thread below); now I will monitor the block scanning speed. The second idea is good, but I cannot merge the small files (~1 MB each) since they all sit in Hive table partitions.

-Selva

On Wed, May 1, 2013 at 2:25 PM, Harsh J wrote: > Hi, > > Neither block reports nor block scanning should affect general DN I/O, > although the former may affect DN liveliness in older versions, if > they lack HDFS-2379 in them. Brahma is partially right in having > mentioned the block reports, hence. > > Your solution, if the # of blocks per DN is too high (counts available > on Live Nodes page in NN UI), say > 1m or so blocks, is to simply > raise the DN heap by another GB to fix issues immediately, and then > start working on merging small files together for more efficient > processing and reducing overall block count to lower memory pressure > at the DNs. > > > > On Wed, May 1, 2013 at 12:02 PM, selva wrote: > > Thanks a lot Harsh. Your input is really valuable for me. > > > > As you mentioned above, we have overload of many small files in our > cluster. > > > > Also when i load data huge data to hive tables, It throws an exception > like > > "replicated to to 0 nodes instead of 1". When i google it i found one of > the > > reason matches my case "Data Node is Busy with block report and block > > scanning" @ http://bit.ly/ZToyNi > > > > Is increasing the Block scanning and scanning all inefficient small files > > will fix my problem ? > > > > Thanks > > Selva > > > > > > On Wed, May 1, 2013 at 11:37 AM, Harsh J wrote: > >> > >> The block scanner is a simple, independent operation of the DN that > >> runs periodically and does work in small phases, to ensure that no > >> blocks exist that aren't matching their checksums (its an automatic > >> data validator) - such that it may report corrupt/rotting blocks and > >> keep the cluster healthy. > >> > >> Its runtime shouldn't cause any issues, unless your DN has a lot of > >> blocks (more than normal due to overload of small, inefficient files) > >> but too little heap size to perform retention plus block scanning. > >> > >> > 1. Is data node will not allow to write the data during > >> > DataBlockScanning process ? > >> > >> No such thing. As I said, its independent and mostly lock free. Writes > >> or reads are not hampered. > >> > >> > 2. Is data node will come normal only when "Not yet verified" come to > >> > zero in data node blockScannerReport ? > >> > >> Yes, but note that this runs over and over again (once every 3 weeks > >> IIRC). > >> > >> On Wed, May 1, 2013 at 11:33 AM, selva wrote: > >> > Thanks Harsh & Manoj for the inputs. > >> > > >> > Now i found that the data node is busy with block scanning. I have TBs > >> > data > >> > attached with each data node. So its taking days to complete the data > >> > block > >> > scanning. I have two questions. > >> > > >> > 1. Is data node will not allow to write the data during > >> > DataBlockScanning > >> > process ? > >> > > >> > 2. Is data node will come normal only when "Not yet verified" come to > >> > zero > >> > in data node blockScannerReport ? 
> >> > > >> > # Data node logs > >> > > >> > 2013-05-01 05:53:50,639 INFO > >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification > >> > succeeded for blk_-7605405041820244736_20626608 > >> > 2013-05-01 05:53:50,664 INFO > >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification > >> > succeeded for blk_-1425088964531225881_20391711 > >> > 2013-05-01 05:53:50,692 INFO > >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification > >> > succeeded for blk_2259194263704433881_10277076 > >> > 2013-05-01 05:53:50,740 INFO > >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification > >> > succeeded for blk_2653195657740262633_18315696 > >> > 2013-05-01 05:53:50,818 INFO > >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification > >> > succeeded for blk_-5124560783595402637_20821252 > >> > 2013-05-01 05:53:50,866 IN
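As mentioned above, this is roughly the heap change I made on each DataNode (a sketch with example values; 1024m/2048m are illustrative, and on this 0.20.x-era install the daemon heap could also be raised globally via HADOOP_HEAPSIZE instead):

# In $HADOOP_CONF_DIR/hadoop-env.sh on every DataNode
# (the trailing -Xmx overrides the default heap for the DN process only):
export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -Xmx2048m"   # doubled from an assumed 1024m

# Restart the DataNode so the new heap takes effect:
$HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/bin/hadoop-daemon.sh start datanode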
Re: High IO Usage in Datanodes due to Replication
Thanks Harsh & Manoj for the inputs.

Now I have found that the data node is busy with block scanning. I have TBs of data attached to each data node, so it is taking days to complete the block scanning. I have two questions. (A rough sketch of Manoj's OS-level suggestions is at the end of this message.)

1. Will the data node refuse to write data while the DataBlockScanning process is running?

2. Will the data node return to normal only when "Not yet verified" reaches zero in the data node blockScannerReport?

# Data node logs

2013-05-01 05:53:50,639 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-7605405041820244736_20626608
2013-05-01 05:53:50,664 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-1425088964531225881_20391711
2013-05-01 05:53:50,692 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_2259194263704433881_10277076
2013-05-01 05:53:50,740 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_2653195657740262633_18315696
2013-05-01 05:53:50,818 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-5124560783595402637_20821252
2013-05-01 05:53:50,866 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_6596021414426970798_19649117
2013-05-01 05:53:50,931 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_7026400040099637841_20741138
2013-05-01 05:53:50,992 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_8535358360851622516_20694185
2013-05-01 05:53:51,057 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_7959856580255809601_20559830

# One of my data node block scanning reports (http://:15075/blockScannerReport)

Total Blocks : 2037907
Verified in last hour : 4819
Verified in last day : 107355
Verified in last week : 686873
Verified in last four weeks : 1589964
Verified in SCAN_PERIOD : 1474221
Not yet verified : 447943
Verified since restart : 318433
Scans since restart : 318058
Scan errors since restart : 0
Transient scan errors : 0
Current scan rate limit KBps : 3205
Progress this period : 101%
Time left in cur period : 86.02%

Thanks
Selva

-Original Message-
From: "S, Manoj"
Subject: RE: High IO Usage in Datanodes due to Replication
Date: Mon, 29 Apr 2013 06:41:31 GMT

Adding to Harsh's comments:

You can also tweak a few OS-level parameters to improve the I/O performance.
1) Mount the filesystem with the "noatime" option.
2) Check if changing the I/O scheduling algorithm will improve the cluster's performance. (Check this file: /sys/block//queue/scheduler)
3) If there are lots of I/O requests and your cluster hangs because of that, you can increase the queue length by raising the value in /sys/block//queue/nr_requests.

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Sunday, April 28, 2013 12:03 AM
To:
Subject: Re: High IO Usage in Datanodes due to Replication

They seem to be transferring blocks between one another. This may most likely be due to under-replication and the NN UI will have numbers on work left to perform. The inter-DN transfer is controlled by the balancing bandwidth though, so you can lower that down if you want to, to cripple it - but you'll lose out on time for a perfectly replicated state again.

On Sat, Apr 27, 2013 at 11:33 PM, selva wrote: > Hi All, > > I have lost amazon instances of my hadoop cluster. But i had all the > data in aws EBS volumes. 
So i launched new instances and attached volumes. > > But all of the datanode logs keep on print the below lines it cauased > to high IO rate. Due to IO usage i am not able to run any jobs. > > Can anyone help me to understand what it is doing? Thanks in advance. > > 2013-04-27 17:51:40,197 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeRegistration(10.157.10.242:10013, > storageID=DS-407656544-10.28.217.27-10013-1353165843727, > infoPort=15075, > ipcPort=10014) Starting thread to transfer block > blk_2440813767266473910_11564425 to 10.168.18.178:10013 > 2013-04-27 17:51:40,230 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: > DatanodeRegistration(10.157.10.242:10013, > storageID=DS-407656544-10.28.217.27-10013-1353165843727, > infoPort=15075, ipcPort=10014):Transmitted block > blk_2440813767266473910_11564425 to > /10.168.18.178:10013 > 2013-04-27 17:51:40,433 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block > blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest: > /10.157.10.242:10013 > 2013-04-27 17:51:40,450 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode: Received block > blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest: > /10.157.10.242:10013 of size 25431 > > Thanks > Selva > > > > > > -- Harsh J
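The OS-level changes Manoj suggested would look roughly like this on each DataNode (a sketch only; sdb and the /data/dfs mount point are placeholders, since the original message left the device name out of the /sys/block paths):

# 1) Remount the data volume with noatime to skip access-time updates
#    (make it permanent later in /etc/fstab):
mount -o remount,noatime /data/dfs

# 2) Check and, if needed, switch the I/O scheduler for the data disk
#    (current choice is shown in brackets):
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler

# 3) Allow more outstanding I/O requests to queue up (default is typically 128):
echo 256 > /sys/block/sdb/queue/nr_requests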
High IO Usage in Datanodes due to Replication
Hi All,

I have lost the Amazon instances of my Hadoop cluster, but I had all the data on AWS EBS volumes, so I launched new instances and attached the volumes.

Now all of the datanode logs keep printing the lines below, and it is causing a high I/O rate. Due to the I/O usage I am not able to run any jobs.

Can anyone help me understand what it is doing? Thanks in advance.

2013-04-27 17:51:40,197 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.157.10.242:10013, storageID=DS-407656544-10.28.217.27-10013-1353165843727, infoPort=15075, ipcPort=10014) Starting thread to transfer block blk_2440813767266473910_11564425 to 10.168.18.178:10013
2013-04-27 17:51:40,230 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.157.10.242:10013, storageID=DS-407656544-10.28.217.27-10013-1353165843727, infoPort=15075, ipcPort=10014):Transmitted block blk_2440813767266473910_11564425 to /10.168.18.178:10013
2013-04-27 17:51:40,433 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest: /10.157.10.242:10013
2013-04-27 17:51:40,450 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest: /10.157.10.242:10013 of size 25431

Thanks
Selva
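In case it helps, these are the commands I am using to watch the replication state while this is going on (standard HDFS commands on my install; nothing cluster-specific):

hadoop dfsadmin -report            # live nodes plus used/remaining capacity per DN
hadoop fsck / | grep -i replica    # counts of under-replicated and mis-replicated blocks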