Install CDH4 using tarball with MRv1, not the YARN version

2013-06-12 Thread selva
Hi folks,

I am trying to install CDH4 from the tarballs with MRv1, not the YARN
version (MRv2).

I downloaded two tarballs (mr1-0.20.2+ and hadoop-2.0.0+) from this
location http://archive.cloudera.com/cdh4/cdh/4/

As per the Cloudera instructions, I found:

 "If you install CDH4 from a tarball, you will install YARN. To install
MRv1 as well, install the separate MRv1 tarball (mr1-0.20.2+) alongside
the YARN one (hadoop-2.0.0+)."
(at the bottom of
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_4_2.html)

But I could not find steps for installing from these two tarballs, since the
Cloudera instructions are tailored to the package installation.

I am totally confused about whether to start DFS from the hadoop-2.0.0+
version and start MapReduce from mr1-0.20.2+, or something else.
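
For example, my (unverified) guess is something along these lines, with
placeholder paths and tarball names, using the scripts that ship inside each
tarball:

  # unpack both tarballs side by side (names and paths are placeholders)
  tar xzf hadoop-2.0.0+<build>.tar.gz -C /opt/cdh4
  tar xzf mr1-0.20.2+<build>.tar.gz   -C /opt/cdh4

  # HDFS daemons (NameNode/DataNodes) from the hadoop-2.0.0+ (YARN) tarball
  cd /opt/cdh4/hadoop-2.0.0+<build>
  sbin/start-dfs.sh

  # MRv1 JobTracker/TaskTrackers from the mr1-0.20.2+ tarball
  cd /opt/cdh4/mr1-0.20.2+<build>
  bin/start-mapred.sh

Is that the intended layout, or am I missing something?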

Kindly help me with the setup.

Thanks
Selva


Re: Parallel Load Data into Two partitions of a Hive Table

2013-05-03 Thread selva
Thanks Yanbo. My doubt is clarified now.
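
For anyone else reading: as I understand Yanbo's point, each partition is
simply its own directory under the table's location on HDFS, so parallel
loads write to disjoint paths. A rough illustration (the default warehouse
location here is an assumption for my setup):

  # each partition of the table is a separate HDFS directory
  hadoop fs -ls /user/hive/warehouse/processedlogs/
  #   .../processedlogs/logdate=2013-04-01
  #   .../processedlogs/logdate=2013-04-02
  #   ...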


On Fri, May 3, 2013 at 2:38 PM, Yanbo Liang  wrote:

> loading data into different partitions in parallel is OK, because it is
> equivalent to writing to different files on HDFS
>
>
> 2013/5/3 selva 
>
>> Hi All,
>>
>> I need to load a month's worth of processed data into a Hive table. The
>> table has 10 partitions. Each day has many files to load, each file takes
>> about two seconds (consistently), and I have ~3000 files. So it will take
>> days to complete 30 days' worth of data.
>>
>> I planned to load each day's data in parallel into its respective partition
>> so that I can complete it in a shorter time.
>>
>> But I need clarification before proceeding.
>>
>> Question:
>>
>> 1. Will loading in parallel into different partitions of the same Hive
>> table cause data loss or corruption?
>>
>> For example, assume I am doing the following:
>>
>> Table : processedlogs
>> Partition : logdate
>>
>> Running the below commands in parallel:
>> LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-01');
>> LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-02');
>> LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-03');
>> LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-04');
>> .
>> LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE
>> processedlogs PARTITION(logdate='2013-04-30');
>>
>> Thanks
>> Selva
>>
>


-- 
-- selva


Parallel Load Data into Two partitions of a Hive Table

2013-05-02 Thread selva
Hi All,

I need to load a month's worth of processed data into a Hive table. The table
has 10 partitions. Each day has many files to load, each file takes about two
seconds (consistently), and I have ~3000 files. So it will take days to
complete 30 days' worth of data.

I planned to load each day's data in parallel into its respective partition so
that I can complete it in a shorter time.

But I need clarification before proceeding.

Question:

1. Will loading in parallel into different partitions of the same Hive table
cause data loss or corruption?

For example, assume I am doing the following:

Table : processedlogs
Partition : logdate

Running the below commands in parallel (a sketch of how I would launch them
concurrently from the shell follows the list):
LOAD DATA INPATH '/logs/processed/2013-04-01' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-01');
LOAD DATA INPATH '/logs/processed/2013-04-02' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-02');
LOAD DATA INPATH '/logs/processed/2013-04-03' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-03');
LOAD DATA INPATH '/logs/processed/2013-04-04' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-04');
.
LOAD DATA INPATH '/logs/processed/2013-04-30' OVERWRITE INTO TABLE
processedlogs PARTITION(logdate='2013-04-30');
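
The sketch mentioned above: I would kick off one load per day from the shell
via the hive CLI and wait for all of them (paths and statements as above; the
loop bounds only cover this month):

  # fire one load per day in the background, then wait for all of them
  for day in $(seq -w 1 30); do
    hive -e "LOAD DATA INPATH '/logs/processed/2013-04-${day}' \
             OVERWRITE INTO TABLE processedlogs \
             PARTITION(logdate='2013-04-${day}')" &
  done
  wait   # block until every background load has finished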

Thanks
Selva


Re: High IO Usage in Datanodes due to Replication

2013-05-01 Thread selva
Hi Harsh,

You are right, our Hadoop version is "0.20.2-cdh3u1", which lacks
HDFS-2379.

As you suggested, I have doubled the DN heap size. Now I will monitor the
block scanning speed.
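
(For the record, I made the change roughly as below, in conf/hadoop-env.sh on
each datanode followed by a datanode restart; the -Xmx value is only an
example, not necessarily the right size for every node:)

  # give the DataNode JVM a larger heap (example value)
  export HADOOP_DATANODE_OPTS="-Xmx2048m $HADOOP_DATANODE_OPTS"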

The second idea is good, but I cannot merge the small files (~1 MB each)
since they are all in Hive table partitions.

-Selva


On Wed, May 1, 2013 at 2:25 PM, Harsh J  wrote:

> Hi,
>
> Neither block reports nor block scanning should affect general DN I/O,
> although the former may affect DN liveliness in older versions, if
> they lack HDFS-2379 in them. Brahma is partially right in having
> mentioned the block reports, hence.
>
> Your solution, if the # of blocks per DN is too high (counts available
> on Live Nodes page in NN UI), say > 1m or so blocks, is to simply
> raise the DN heap by another GB to fix issues immediately, and then
> start working on merging small files together for more efficient
> processing and reducing overall block count to lower memory pressure
> at the DNs.
>
>
>
> On Wed, May 1, 2013 at 12:02 PM, selva  wrote:
> > Thanks a lot Harsh. Your input is really valuable for me.
> >
> > As you mentioned above, we have an overload of many small files in our
> > cluster.
> >
> > Also, when I load huge data into Hive tables, it throws an exception like
> > "replicated to 0 nodes, instead of 1". When I googled it, I found that one
> > of the reasons matches my case: "Data Node is Busy with block report and
> > block scanning" @ http://bit.ly/ZToyNi
> >
> > Will increasing the block scanning speed and scanning all the inefficient
> > small files fix my problem?
> >
> > Thanks
> > Selva
> >
> >
> > On Wed, May 1, 2013 at 11:37 AM, Harsh J  wrote:
> >>
> >> The block scanner is a simple, independent operation of the DN that
> >> runs periodically and does work in small phases, to ensure that no
> >> blocks exist that aren't matching their checksums (it's an automatic
> >> data validator) - such that it may report corrupt/rotting blocks and
> >> keep the cluster healthy.
> >>
> >> Its runtime shouldn't cause any issues, unless your DN has a lot of
> >> blocks (more than normal due to overload of small, inefficient files)
> >> but too little heap size to perform retention plus block scanning.
> >>
> >> > 1. Will the data node not allow writing data during the
> >> > DataBlockScanning process?
> >>
> >> No such thing. As I said, it's independent and mostly lock free. Writes
> >> or reads are not hampered.
> >>
> >> > 2. Will the data node come back to normal only when "Not yet verified"
> >> > reaches zero in the data node blockScannerReport?
> >>
> >> Yes, but note that this runs over and over again (once every 3 weeks
> >> IIRC).
> >>
> >> On Wed, May 1, 2013 at 11:33 AM, selva  wrote:
> >> > Thanks Harsh & Manoj for the inputs.
> >> >
> >> > Now I found that the data node is busy with block scanning. I have TBs
> >> > of data attached to each data node, so it is taking days to complete the
> >> > data block scanning. I have two questions.
> >> >
> >> > 1. Will the data node not allow writing data during the
> >> > DataBlockScanning process?
> >> >
> >> > 2. Will the data node come back to normal only when "Not yet verified"
> >> > reaches zero in the data node blockScannerReport?
> >> >
> >> > # Data node logs
> >> >
> >> > 2013-05-01 05:53:50,639 INFO
> >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> >> > succeeded for blk_-7605405041820244736_20626608
> >> > 2013-05-01 05:53:50,664 INFO
> >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> >> > succeeded for blk_-1425088964531225881_20391711
> >> > 2013-05-01 05:53:50,692 INFO
> >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> >> > succeeded for blk_2259194263704433881_10277076
> >> > 2013-05-01 05:53:50,740 INFO
> >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> >> > succeeded for blk_2653195657740262633_18315696
> >> > 2013-05-01 05:53:50,818 INFO
> >> > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
> >> > succeeded for blk_-5124560783595402637_20821252
> >> > 2013-05-01 05:53:50,866 IN

Re: High IO Usage in Datanodes due to Replication

2013-04-30 Thread selva
Thanks Harsh & Manoj for the inputs.

Now I found that the data node is busy with block scanning. I have TBs of data
attached to each data node, so it is taking days to complete the data block
scanning. I have two questions.

1. Will the data node not allow writing data during the DataBlockScanning
process?

2. Will the data node come back to normal only when "Not yet verified" reaches
zero in the data node blockScannerReport?

# Data node logs

2013-05-01 05:53:50,639 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_-7605405041820244736_20626608
2013-05-01 05:53:50,664 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_-1425088964531225881_20391711
2013-05-01 05:53:50,692 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_2259194263704433881_10277076
2013-05-01 05:53:50,740 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_2653195657740262633_18315696
2013-05-01 05:53:50,818 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_-5124560783595402637_20821252
2013-05-01 05:53:50,866 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_6596021414426970798_19649117
2013-05-01 05:53:50,931 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_7026400040099637841_20741138
2013-05-01 05:53:50,992 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_8535358360851622516_20694185
2013-05-01 05:53:51,057 INFO
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
succeeded for blk_7959856580255809601_20559830

# One of my data node block scanning reports

http://:15075/blockScannerReport

Total Blocks : 2037907
Verified in last hour:   4819
Verified in last day : 107355
Verified in last week: 686873
Verified in last four weeks  : 1589964
Verified in SCAN_PERIOD  : 1474221
Not yet verified : 447943
Verified since restart   : 318433
Scans since restart  : 318058
Scan errors since restart:  0
Transient scan errors:  0
Current scan rate limit KBps :   3205
Progress this period :101%
Time left in cur period  :  86.02%

Thanks
Selva


-Original Message-
>From "S, Manoj" 
Subject RE: High IO Usage in Datanodes due to Replication
Date Mon, 29 Apr 2013 06:41:31 GMT
Adding to Harsh's comments:

You can also tweak a few OS-level parameters to improve the I/O performance
(example commands follow this list):
1) Mount the filesystem with the "noatime" option.
2) Check if changing the I/O scheduling algorithm will improve the cluster's
performance. (Check the file /sys/block/<device>/queue/scheduler.)
3) If there are lots of I/O requests and your cluster hangs because of that,
you can increase the queue length by raising the value in
/sys/block/<device>/queue/nr_requests.
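
For example (sdX, the mount point, and the values are placeholders; verify
them against your own devices and workload before applying anything):

  # 1) remount the HDFS data filesystem without atime updates
  mount -o remount,noatime /data

  # 2) inspect and switch the I/O scheduler for the data disk
  cat /sys/block/sdX/queue/scheduler             # current choice is in [brackets]
  echo deadline > /sys/block/sdX/queue/scheduler

  # 3) deepen the request queue (example value)
  echo 512 > /sys/block/sdX/queue/nr_requests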

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Sunday, April 28, 2013 12:03 AM
To: 
Subject: Re: High IO Usage in Datanodes due to Replication

They seem to be transferring blocks between one another. This is most likely
due to under-replication, and the NN UI will have numbers on the work left to
perform. The inter-DN transfer is controlled by the balancing bandwidth
though, so you can lower that if you want to, to cripple it - but you'll lose
out on time to reach a perfectly replicated state again.
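
(A rough sketch of what that looks like in practice; the property name is what
I believe applies to this 0.20.x line, so treat it as an assumption to verify:)

  # the remaining re-replication work shows up in the fsck summary
  # (the same counts the NN web UI reports)
  hadoop fsck / | grep -i 'under-replicated'

  # the inter-DN copy rate is capped by dfs.balance.bandwidthPerSec (bytes/sec)
  # in hdfs-site.xml, e.g. 1048576 = 1 MB/s per DN; datanodes need a restart
  # to pick up a changed value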

On Sat, Apr 27, 2013 at 11:33 PM, selva  wrote:
> Hi All,
>
> I have lost the Amazon instances of my Hadoop cluster, but I had all the
> data in AWS EBS volumes. So I launched new instances and attached the
> volumes.
>
> But all of the datanode logs keep printing the lines below, which has caused
> a high IO rate. Due to the IO usage I am not able to run any jobs.
>
> Can anyone help me understand what it is doing? Thanks in advance.
>
> 2013-04-27 17:51:40,197 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.157.10.242:10013,
> storageID=DS-407656544-10.28.217.27-10013-1353165843727,
> infoPort=15075,
> ipcPort=10014) Starting thread to transfer block
> blk_2440813767266473910_11564425 to 10.168.18.178:10013
> 2013-04-27 17:51:40,230 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(10.157.10.242:10013,
> storageID=DS-407656544-10.28.217.27-10013-1353165843727,
> infoPort=15075, ipcPort=10014):Transmitted block
> blk_2440813767266473910_11564425 to
> /10.168.18.178:10013
> 2013-04-27 17:51:40,433 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest:
> /10.157.10.242:10013
> 2013-04-27 17:51:40,450 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
> blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest:
> /10.157.10.242:10013 of size 25431
>
> Thanks
> Selva
>



--
Harsh J


High IO Usage in Datanodes due to Replication

2013-04-27 Thread selva
Hi All,

I have lost the Amazon instances of my Hadoop cluster, but I had all the data
in AWS EBS volumes. So I launched new instances and attached the volumes.

But all of the datanode logs keep printing the lines below, which has caused a
high IO rate. Due to the IO usage I am not able to run any jobs.

Can anyone help me understand what it is doing? Thanks in advance.

2013-04-27 17:51:40,197 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.157.10.242:10013,
storageID=DS-407656544-10.28.217.27-10013-1353165843727, infoPort=15075,
ipcPort=10014) Starting thread to transfer block
blk_2440813767266473910_11564425 to 10.168.18.178:10013
2013-04-27 17:51:40,230 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.157.10.242:10013,
storageID=DS-407656544-10.28.217.27-10013-1353165843727, infoPort=15075,
ipcPort=10014):Transmitted block blk_2440813767266473910_11564425 to
/10.168.18.178:10013
2013-04-27 17:51:40,433 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest:
/10.157.10.242:10013
2013-04-27 17:51:40,450 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Received block
blk_2442656050740605335_10906493 src: /10.171.11.11:60744 dest:
/10.157.10.242:10013 of size 25431

Thanks
Selva