Issue about Hadoop data migration between IDCs

2014-08-06 Thread ch huang
Hi, mailing list:
   My company has signed with a new IDC, so I must move our Hadoop
data (about 30 TB) to the new IDC. Any good suggestions?


Eclipse Luna 4.4.0 and Hadoop 2.4.1

2014-08-06 Thread thejas prasad
Hello,

I was wondering how I can integrate Hadoop 2.4.1 with Eclipse Luna 4.4.0. I
want to run some unit tests and play around a bit.

Also, what is the best way to get familiar with Hadoop?
Thanks,
Thejas


Re: High performance Count Distinct - NO Error

2014-08-06 Thread Edward Capriolo
A simple and parallel way to do this is to break the data into ranges or
hashes and then do the distinct counting on each piece. Hive should do
something like this automatically.

This is a rather naive sketch. Note that the split has to be on the counted
column itself (e.g. a hash of the value) rather than the row key, so that
each bucket holds a disjoint set of values and nothing is counted twice:

create table bucket_counts as
select * from (
  select count(distinct col) as cnt
  from source_table where pmod(hash(col), 10) = 0
  union all
  select count(distinct col) as cnt
  from source_table where pmod(hash(col), 10) = 1
  -- ... union all the remaining eight buckets ...
) b;

select sum(cnt) from bucket_counts;
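
A related trick that avoids the manual bucketing (a sketch, assuming a
single counted column) is to run the distinct as its own stage, which Hive
can spread across many reducers, leaving only the final count on a single
reducer:

select count(*) from (
  select distinct col from source_table
) t;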




On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev wrote:

> Why do you think the default implementation of COUNT DISTINCT is slow? As
> far as I understand, the best-known way to find the number of distinct
> elements is to sort them and then scan the sorted items sequentially,
> skipping duplicates. The asymptotic complexity of this algorithm is
> O(n log n), and I think there is no way to do better in the general case.
> I think Hive should use the map-reduce sort stage to get the items sorted,
> but probably in your case there is only one reduce task because the result
> has to be aggregated on a single instance.
> On 6 Aug 2014 at 12:54, "Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore)" wrote:
> >
> > Hi
> >
> > I am looking for a high-performance count distinct solution for Hive queries.
> >
> > Regular count distinct is very slow, but if I use probabilistic count
> > distinct the error percentage is higher (when the number of records is small).
> >
> >
> > Is there any solution that gives an exact distinct count while using low
> > memory and with no error?
> >
> > Thanks and Regards
> > Prabakaran.N
> >
> >
> >
>


Re: Datanode disk considerations

2014-08-06 Thread Adam Faris
The Hadoop balancer doesn’t balance data across the local drives; it balances 
data between datanodes on the grid, so running the balancer won’t even out the 
disks within a single datanode.  

The datanode process round-robins between data directories on local disk, so 
it’s not unexpected to see the smaller drives fill faster.  Typically people run 
the same size drives within each compute node to prevent this from happening.  

You could partition the 2TB drive into four 500GB partitions.  This isn’t 
optimal, as you’ll have 4 write threads pointing at a single disk, but it is 
fairly simple to implement.  Otherwise you’ll want to physically rebuild your 4 
nodes so each node has equal amounts of storage.  

I’d also like to suggest that, while restructuring your local filesystem, you 
give the tasktracker/nodemanager its own partition for writes.  If the 
tasktracker/nodemanager and datanode processes share a partition, mapper spills 
to disk will cause the available HDFS space to shrink and grow, since the 
datanode reports back how much free space it has on its partitions.
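
For reference, a minimal hdfs-site.xml sketch of that layout (the mount
points below are hypothetical), with one entry per local partition so the
four 500GB partitions on the 2TB drive each become a separate storage
location:

<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/disk1,/data/disk2,/data/disk3p1,/data/disk3p2,/data/disk3p3,/data/disk3p4</value>
</property>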

Good luck.

On Aug 6, 2014, at 1:51 PM, Felix Chern wrote:

> Run the “hadoop balancer” command on the namenode. It is used for balancing 
> skewed data.
> http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer
> 
> 
> On Aug 6, 2014, at 1:45 PM, Brian C. Huffman wrote:
> 
>> All,
>> 
>> We currently have a Hadoop 2.2.0 cluster with the following characteristics:
>> - 4 nodes
>> - Each node is a datanode
>> - Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB disk.
>> - HDFS replication factor of 3
>> 
>> It appears that our 500GB disks are filling up first (the alternative would 
>> be to put 4 times the number of blocks on the 2TB disks per node).  I'm 
>> concerned that once the 500GB disks fill, our performance will slow down 
>> (fewer spindles being read / written at the same time per node).  Is this 
>> correct?  Is there anything we can do to change this behavior?
>> 
>> Thanks,
>> Brian
>> 
>> 
> 



Re: Datanode disk considerations

2014-08-06 Thread Felix Chern
Run the “hadoop balancer” command on the namenode. It is used for balancing 
skewed data.
http://hadoop.apache.org/docs/r1.0.4/commands_manual.html#balancer
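
A threshold can also be passed: the allowed per-node deviation from the
cluster's average disk utilization, in percent (the 10 below is just an
example value):

hadoop balancer -threshold 10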


On Aug 6, 2014, at 1:45 PM, Brian C. Huffman wrote:

> All,
> 
> We currently have a Hadoop 2.2.0 cluster with the following characteristics:
> - 4 nodes
> - Each node is a datanode
> - Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB disk.
> - HDFS replication factor of 3
> 
> It appears that our 500GB disks are filling up first (the alternative would 
> be to put 4 times the number of blocks on the 2TB disks per node).  I'm 
> concerned that once the 500GB disks fill, our performance will slow down 
> (fewer spindles being read / written at the same time per node).  Is this 
> correct?  Is there anything we can do to change this behavior?
> 
> Thanks,
> Brian
> 
> 



Datanode disk considerations

2014-08-06 Thread Brian C. Huffman

All,

We currently have a Hadoop 2.2.0 cluster with the following characteristics:
- 4 nodes
- Each node is a datanode
- Each node has 3 physical disks for data: 2 x 500GB and 1 x 2TB disk.
- HDFS replication factor of 3

It appears that our 500GB disks are filling up first (the alternative 
would be to put 4 times the number of blocks on the 2TB disks per 
node).  I'm concerned that once the 500GB disks fill, our performance 
will slow down (fewer spindles being read / written at the same time per 
node).  Is this correct?  Is there anything we can do to change this 
behavior?


Thanks,
Brian




Re: High performance Count Distinct - NO Error

2014-08-06 Thread Sergey Murylev
Why do you think the default implementation of COUNT DISTINCT is slow? As
far as I understand, the best-known way to find the number of distinct
elements is to sort them and then scan the sorted items sequentially,
skipping duplicates. The asymptotic complexity of this algorithm is
O(n log n), and I think there is no way to do better in the general case. I
think Hive should use the map-reduce sort stage to get the items sorted,
but probably in your case there is only one reduce task because the result
has to be aggregated on a single instance.
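
As a small illustration of the sort-then-scan idea (a sketch of the
algorithm only, not of how Hive actually implements it):

import java.util.Arrays;

public class DistinctCount {
    // Sort in O(n log n), then make one linear pass counting the
    // boundaries between runs of equal values.
    static long countDistinct(long[] values) {
        if (values.length == 0) return 0;
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        long distinct = 1; // the first element always starts a run
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i] != sorted[i - 1]) distinct++;
        }
        return distinct;
    }

    public static void main(String[] args) {
        // {3, 1, 3, 2, 1} contains three distinct values
        System.out.println(countDistinct(new long[] {3, 1, 3, 2, 1}));
    }
}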
On 6 Aug 2014 at 12:54, "Natarajan, Prabakaran 1. (NSN -
IN/Bangalore)" wrote:
>
> Hi
>
> I am looking for a high-performance count distinct solution for Hive queries.
>
> Regular count distinct is very slow, but if I use probabilistic count
> distinct the error percentage is higher (when the number of records is small).
>
>
> Is there any solution that gives an exact distinct count while using low
> memory and with no error?
>
> Thanks and Regards
> Prabakaran.N
>
>
>


Re: YARN Application Master Question

2014-08-06 Thread Ana Gillan
Thanks very much, Mirko!

-Ana


On 6 August 2014 12:00, Mirko Kämpf wrote:

> In this case you would have 3 AMs for the MR jobs and 2 more AMs, one for
> each Giraph job.
> That makes a total of 5 AMs.
>
> Cheers,
> Mirko
>
>
>
> 2014-08-06 11:57 GMT+01:00 Ana Gillan:
>
> Hi,
>>
>> In the documentation and papers about Apache YARN, they say that an
>> Application Master is launched for every application, and that the AM
>> manages resources and scheduling for applications on Hadoop.
>>
>> For example, if you submit 3 MapReduce jobs, and 2 Giraph jobs to your
>> cluster, does the Resource Manager spawn 2 AMs (one to manage all MapReduce
>> jobs and one to manage all Giraph jobs) or does it spawn 5 AMs (one for
>> each job)?
>>
>> Thanks!
>> Ana
>>
>
>


Re: YARN Application Master Question

2014-08-06 Thread Mirko Kämpf
In this case you would have 3 AMs for the MR jobs and 2 more AMs, one for
each Giraph job.
That makes a total of 5 AMs.

Cheers,
Mirko



2014-08-06 11:57 GMT+01:00 Ana Gillan:

> Hi,
>
> In the documentation and papers about Apache YARN, they say that an
> Application Master is launched for every application, and that the AM
> manages resources and scheduling for applications on Hadoop.
>
> For example, if you submit 3 MapReduce jobs, and 2 Giraph jobs to your
> cluster, does the Resource Manager spawn 2 AMs (one to manage all MapReduce
> jobs and one to manage all Giraph jobs) or does it spawn 5 AMs (one for
> each job)?
>
> Thanks!
> Ana
>


YARN Application Master Question

2014-08-06 Thread Ana Gillan
Hi,

In the documentation and papers about Apache YARN, they say that an
Application Master is launched for every application, and that the AM
manages resources and scheduling for applications on Hadoop.

For example, if you submit 3 MapReduce jobs, and 2 Giraph jobs to your
cluster, does the Resource Manager spawn 2 AMs (one to manage all MapReduce
jobs and one to manage all Giraph jobs) or does it spawn 5 AMs (one for
each job)?

Thanks!
Ana


High performance Count Distinct - NO Error

2014-08-06 Thread Natarajan, Prabakaran 1. (NSN - IN/Bangalore)
Hi

I am looking for a high-performance count distinct solution for Hive queries.

Regular count distinct is very slow, but if I use probabilistic count distinct 
the error percentage is higher (when the number of records is small).


Is there any solution that gives an exact distinct count while using low 
memory and with no error?

Thanks and Regards
Prabakaran.N