issue about hadoop data migrate between IDC
Hi, mailing list: my company has signed with a new IDC, and I must move our Hadoop data (about 30 TB) to the new IDC. Any good suggestions?
Eclipse Luna 4.4.0 and Hadoop 2.4.1
Hello, I was wondering how I can integrate Hadoop 2.4.1 and Eclipse Luna 4.4.0. I want to run some unit tests and play around a bit. Also, what is the best way to get familiar with Hadoop? Thanks, Thejas
Re: High performance Count Distinct - NO Error
A simple and parallel way to do this is to break the data into ranges or hashes and then do the distinct counting on those. Hive should do something like this automatically. This is a rather naive sketch:

SELECT column FROM source_table_0 WHERE row_key % 10 = 0;
SELECT column FROM source_table_1 WHERE row_key % 10 = 1;

CREATE TABLE all_counts AS
SELECT COUNT(DISTINCT column) AS cnt FROM source_table_0
UNION ALL
SELECT COUNT(DISTINCT column) AS cnt FROM source_table_1;

SELECT SUM(cnt) FROM all_counts;

On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev wrote:
> Why do you think the default implementation of COUNT DISTINCT is slow? As
> far as I understand, the best-known way to find the number of distinct
> elements is to sort them and then scan all sorted items sequentially,
> excluding duplicate elements. The asymptotic complexity of this algorithm is
> O(n log n), and I think there is no way to do this faster in the general
> case. I think Hive should use the map-reduce sort stage to get the items
> sorted, but probably in your case we have only one reduce task because we
> need to aggregate the result on a single instance.
> On 6 Aug 2014 at 12:54, "Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore)" wrote:
> >
> > Hi
> >
> > I am looking for a high-performance count distinct solution for a Hive query.
> >
> > A regular count distinct is very slow, but if I use a probabilistic count
> > distinct it has a higher error percentage (when the number of records is small).
> >
> > Is there any solution that gives an exact count distinct but uses low
> > memory and has no error?
> >
> > Thanks and Regards
> > Prabakaran.N
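The bucketing idea above can be sketched outside Hive as well. Note one subtlety the sketch relies on: for per-bucket distinct counts to sum to the exact global distinct count, the data must be partitioned by a hash of the counted value itself, so every copy of a value lands in the same bucket. The function and data below are illustrative, not Hive internals.

```python
# Sketch: exact distinct count via hash partitioning (toy, single-process).
# Each bucket could be processed by a separate reducer in a real cluster.

NUM_BUCKETS = 10

def bucketed_distinct_count(values, num_buckets=NUM_BUCKETS):
    buckets = [set() for _ in range(num_buckets)]
    for v in values:
        # Partition on hash(value): duplicates always hit the same bucket,
        # so the per-bucket counts are disjoint and can simply be summed.
        buckets[hash(v) % num_buckets].add(v)
    return sum(len(b) for b in buckets)

data = [1, 2, 2, 3, 3, 3, 4]
print(bucketed_distinct_count(data))  # 4
```

Partitioning on an unrelated key (such as a row key) would not be exact, since the same value could be counted once per bucket.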
Re: Datanode disk considerations
The Hadoop balancer doesn’t balance data across the local drives; it balances data between datanodes on the grid, so running the balancer won’t balance data on the local datanode. The datanode process round-robins between data directories on local disk, so it’s not unexpected to see the smaller drives fill faster. Typically people run the same size drives within each compute node to prevent this from happening. You could partition the 2TB drive into four 500GB partitions. This isn’t optimal, as you’ll have 4 write threads pointing at a single disk, but it is fairly simple to implement. Otherwise you’ll want to physically rebuild your 4 nodes so each node has equal amounts of storage. I’d also like to suggest, while you’re restructuring your local filesystem, that the tasktracker/nodemanager be given its own partition for writes. If the tasktracker/nodemanager and datanode processes share a partition, then when the mappers spill to disk the HDFS space will shrink and grow, as the datanode reports back how much free space it has for its partitions. Good luck.

On Aug 6, 2014, at 1:51 PM, Felix Chern wrote:
> Run the “hadoop balancer” command on the namenode. It is used for balancing
> skewed data.
> http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer
>
> On Aug 6, 2014, at 1:45 PM, Brian C. Huffman wrote:
>> All,
>>
>> We currently have a Hadoop 2.2.0 cluster with the following characteristics:
>> - 4 nodes
>> - Each node is a datanode
>> - Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB.
>> - HDFS replication factor of 3
>>
>> It appears that our 500GB disks are filling up first (the alternative would
>> be to put 4 times the number of blocks on the 2TB disk per node). I'm
>> concerned that once the 500GB disks fill, our performance will slow down
>> (fewer spindles being read/written at the same time per node). Is this
>> correct? Is there anything we can do to change this behavior?
>>
>> Thanks,
>> Brian
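The round-robin behavior described above can be illustrated with a small simulation (the block size and capacities below are assumptions matching the thread, not HDFS code): because the datanode cycles through data directories without regard to free space, the two 500GB drives reach capacity long before the 2TB drive.

```python
# Toy simulation of a datanode round-robining 128 MB block writes across
# local data directories of unequal size.

BLOCK_MB = 128

def fill_order(capacities_gb):
    """Return disk indices in the order they become full under round-robin writes."""
    free = [c * 1024 for c in capacities_gb]  # free space per disk, in MB
    full_order = []
    while len(full_order) < len(free):
        for i in range(len(free)):
            if i in full_order:
                continue  # skip disks that are already full
            free[i] -= BLOCK_MB
            if free[i] < BLOCK_MB:
                full_order.append(i)
    return full_order

# 2 x 500 GB disks plus 1 x 2 TB disk, as in the cluster described above.
print(fill_order([500, 500, 2048]))  # [0, 1, 2]: the 500 GB disks fill first
```

Once the small disks are full, all new blocks land on the single remaining spindle, which is the performance concern raised in the original question.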
Re: Datanode disk considerations
Run the “hadoop balancer” command on the namenode. It is used for balancing skewed data. http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer

On Aug 6, 2014, at 1:45 PM, Brian C. Huffman wrote:
> All,
>
> We currently have a Hadoop 2.2.0 cluster with the following characteristics:
> - 4 nodes
> - Each node is a datanode
> - Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB.
> - HDFS replication factor of 3
>
> It appears that our 500GB disks are filling up first (the alternative would
> be to put 4 times the number of blocks on the 2TB disk per node). I'm
> concerned that once the 500GB disks fill, our performance will slow down
> (fewer spindles being read/written at the same time per node). Is this
> correct? Is there anything we can do to change this behavior?
>
> Thanks,
> Brian
Datanode disk considerations
All,

We currently have a Hadoop 2.2.0 cluster with the following characteristics:
- 4 nodes
- Each node is a datanode
- Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB.
- HDFS replication factor of 3

It appears that our 500GB disks are filling up first (the alternative would be to put 4 times the number of blocks on the 2TB disk per node). I'm concerned that once the 500GB disks fill, our performance will slow down (fewer spindles being read/written at the same time per node). Is this correct? Is there anything we can do to change this behavior?

Thanks,
Brian
Re: High performance Count Distinct - NO Error
Why do you think the default implementation of COUNT DISTINCT is slow? As far as I understand, the best-known way to find the number of distinct elements is to sort them and then scan all sorted items sequentially, excluding duplicate elements. The asymptotic complexity of this algorithm is O(n log n), and I think there is no way to do this faster in the general case. I think Hive should use the map-reduce sort stage to get the items sorted, but probably in your case we have only one reduce task because we need to aggregate the result on a single instance.

On 6 Aug 2014 at 12:54, "Natarajan, Prabakaran 1. (NSN - IN/Bangalore)" wrote:
>
> Hi
>
> I am looking for a high-performance count distinct solution for a Hive query.
>
> A regular count distinct is very slow, but if I use a probabilistic count
> distinct it has a higher error percentage (when the number of records is small).
>
> Is there any solution that gives an exact count distinct but uses low
> memory and has no error?
>
> Thanks and Regards
> Prabakaran.N
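The sort-and-scan approach described above can be sketched as follows. This is a toy single-process illustration of the algorithm, not Hive's actual execution plan:

```python
# Sketch: count distinct by sorting, then scanning the sorted items and
# skipping consecutive duplicates -- O(n log n) overall, dominated by the sort.

def sorted_distinct_count(values):
    ordered = sorted(values)   # O(n log n) sort; in Hive this would be the
                               # map-reduce shuffle/sort stage
    count = 0
    prev = object()            # sentinel that compares unequal to any input
    for v in ordered:          # O(n) scan
        if v != prev:          # a new value: duplicates are now adjacent
            count += 1
            prev = v
    return count

print(sorted_distinct_count([3, 1, 2, 3, 1, 1]))  # 3
```

The final scan works only because sorting brings all duplicates next to each other, which is why the sort stage cannot simply be skipped.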
Re: YARN Application Master Question
Thanks very much, Mirko! -Ana

On 6 August 2014 12:00, Mirko Kämpf wrote:
> In this case you would have 3 AMs for the MR jobs and 2 more AMs, one for
> each Giraph job. That makes a total of 5 AMs.
>
> Cheers,
> Mirko
>
> 2014-08-06 11:57 GMT+01:00 Ana Gillan:
>> Hi,
>>
>> In the documentation and papers about Apache YARN, they say that an
>> Application Master is launched for every application, and it is this that
>> manages resources and scheduling for applications on Hadoop.
>>
>> For example, if you submit 3 MapReduce jobs and 2 Giraph jobs to your
>> cluster, does the Resource Manager spawn 2 AMs (one to manage all MapReduce
>> jobs and one to manage all Giraph jobs) or does it spawn 5 AMs (one for
>> each job)?
>>
>> Thanks!
>> Ana
Re: YARN Application Master Question
In this case you would have 3 AMs for the MR jobs and 2 more AMs, one for each Giraph job. That makes a total of 5 AMs.

Cheers,
Mirko

2014-08-06 11:57 GMT+01:00 Ana Gillan:
> Hi,
>
> In the documentation and papers about Apache YARN, they say that an
> Application Master is launched for every application, and it is this that
> manages resources and scheduling for applications on Hadoop.
>
> For example, if you submit 3 MapReduce jobs and 2 Giraph jobs to your
> cluster, does the Resource Manager spawn 2 AMs (one to manage all MapReduce
> jobs and one to manage all Giraph jobs) or does it spawn 5 AMs (one for
> each job)?
>
> Thanks!
> Ana
YARN Application Master Question
Hi, In the documentation and papers about Apache YARN, they say that an Application Master is launched for every application and it is this that manages resources and scheduling for applications on Hadoop. For example, if you submit 3 MapReduce jobs, and 2 Giraph jobs to your cluster, does the Resource Manager spawn 2 AMs (one to manage all MapReduce jobs and one to manage all Giraph jobs) or does it spawn 5 AMs (one for each job)? Thanks! Ana
High performance Count Distinct - NO Error
Hi

I am looking for a high-performance count distinct solution for a Hive query.

A regular count distinct is very slow, but if I use a probabilistic count distinct it has a higher error percentage (when the number of records is small).

Is there any solution that gives an exact count distinct but uses low memory and has no error?

Thanks and Regards
Prabakaran.N