Re: How to mapreduce in the scenario

2012-05-30 Thread samir das mohapatra
Yes. Hadoop is only for huge dataset computation.
It may not be good for small datasets.







RE: How to mapreduce in the scenario

2012-05-30 Thread Wilson Wayne - wwilso
If I may, I'd like to ask about that statement a little more.

I think most of us agree that Hadoop handles very large datasets (10s of TB and up)
exceptionally well, for several reasons. And I've heard multiple times that Hadoop
does not handle small datasets well and that traditional tools like RDBMS and ETL are
better suited for small datasets. But what if I have a mixture of data? I work with
datasets that range from 1GB to 10TB in size, and the work requires all that data to
be grouped and aggregated. I would think that in such an environment, where you have
vast differences in the size of datasets, it would be better to keep them all in
Hadoop and do all the work there, versus moving the small datasets out of Hadoop to do
some processing on them, then loading them back into Hadoop to group with the larger
datasets, and then possibly taking them back out to do more processing and then back
in again. I just don't see where the run times for jobs on small files in Hadoop would
be so long that it wouldn't be offset by moving things back and forth. Or is the
performance on small files in Hadoop really that bad? Thoughts?








Re: How to mapreduce in the scenario

2012-05-29 Thread Michel Segel
Hive?
Sure, assuming you mean that the id is a foreign key common amongst the tables...

Sent from a remote device. Please excuse any typos...

Mike Segel



Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
Hive is one approach (similar to a conventional database, though not exactly the same).

If you are looking at a MapReduce program, then use MultipleInputs:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
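As a rough illustration of the driver side of that, here is a minimal sketch using the
new-API MultipleInputs from the link above. JoinDriver, AMapper, BMapper and
TagJoinReducer are hypothetical names, not classes from the thread (possible shapes for
the mappers and reducer are sketched after samir's message further down):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: routes a.txt and b.txt to their own mappers and joins
// the tagged records in a common reducer.
public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "join a.txt with b.txt");
        job.setJarByClass(JoinDriver.class);

        // Each input file gets its own mapper; both emit (id, tagged record).
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, AMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, BMapper.class);

        job.setReducerClass(TagJoinReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}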







-- 
Nitin Pawar


RE: How to mapreduce in the scenario

2012-05-29 Thread Devaraj k
Hi Gump,

   MapReduce fits well for solving this type of problem (joins).

I hope this will help you solve the described problem:

1. Map output key and value classes: write a map output key class (Text.class) and a
value class (CombinedValue.class). The value class should be able to hold the values
from both files (a.txt and b.txt), as shown below.

class CombinedValue implements Writable   // a map output value only needs Writable
{
   String name;
   int age;
   String address;
   boolean isLeft; // flag to identify which file the record came from
   // write() and readFields() are also required; see the fuller sketch below
}
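For reference, here is a minimal, complete sketch of such a value class. The setter and
getter methods are illustrative additions, not part of Devaraj's outline, and plain
Writable is sufficient because the class is used only as a map output value:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class CombinedValue implements Writable {
    private String name = "";
    private int age;
    private String address = "";
    private boolean isLeft;                 // true if the record came from a.txt

    public void setLeft(String name, int age) {
        this.name = name; this.age = age; this.address = ""; this.isLeft = true;
    }
    public void setRight(String address) {
        this.address = address; this.name = ""; this.age = 0; this.isLeft = false;
    }
    public boolean isLeft()    { return isLeft; }
    public String getName()    { return name; }
    public int getAge()        { return age; }
    public String getAddress() { return address; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeBoolean(isLeft);
        out.writeUTF(name);
        out.writeInt(age);
        out.writeUTF(address);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        isLeft = in.readBoolean();
        name = in.readUTF();
        age = in.readInt();
        address = in.readUTF();
    }
}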

2. Mapper: write a map() function that can parse records from both files (a.txt,
b.txt) and produce the common output key and value classes.

3. Partitioner: write the partitioner so that it sends all (key, value) pairs that
have the same key to the same reducer.

4. Reducer: in the reduce() function you will receive the records for each key from
both files, and you can combine them easily. A minimal sketch of steps 2-4 follows
below.
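The sketch below is one possible, hedged reading of steps 2-4 built on the
CombinedValue class above. The class names JoinMapper, JoinPartitioner and JoinReducer
are illustrative, and it assumes both a.txt and b.txt are read by the same mapper with
TextInputFormat, so the mapper can tell the files apart by file name:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Step 2: one mapper parses lines from both files and tags them via isLeft.
public class JoinMapper extends Mapper<LongWritable, Text, Text, CombinedValue> {
    private final Text outKey = new Text();
    private final CombinedValue outVal = new CombinedValue();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        String file = ((FileSplit) context.getInputSplit()).getPath().getName();
        outKey.set(fields[0]);                          // id is the join key
        if (file.startsWith("a")) {                     // a.txt: id,name,age,...
            outVal.setLeft(fields[1], Integer.parseInt(fields[2]));
        } else {                                        // b.txt: id,address,...
            outVal.setRight(fields[1]);
        }
        context.write(outKey, outVal);
    }
}

// Step 3: an explicit partitioner that routes equal keys to the same reducer.
// (The default HashPartitioner already behaves this way.)
class JoinPartitioner extends Partitioner<Text, CombinedValue> {
    @Override
    public int getPartition(Text key, CombinedValue value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Step 4: the reducer receives all records for one id and stitches them together.
class JoinReducer extends Reducer<Text, CombinedValue, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<CombinedValue> values, Context context)
            throws IOException, InterruptedException {
        String name = null, address = null;
        int age = 0;
        for (CombinedValue v : values) {
            if (v.isLeft()) { name = v.getName(); age = v.getAge(); }
            else            { address = v.getAddress(); }
        }
        if (name != null && address != null) {          // emit only ids present in both files
            context.write(key, new Text(name + "," + age + "," + address));
        }
    }
}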


Thanks
Devaraj



From: liuzhg [liu...@cernet.com]
Sent: Tuesday, May 29, 2012 3:45 PM
To: common-user@hadoop.apache.org
Subject: How to mapreduce in the scenario

Hi,

I wonder whether Hadoop can effectively solve the following problem:

==
input file: a.txt, b.txt
result: c.txt

a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...

b.txt:
id1,address1,...
id2,address2,...
id3,address3,...

c.txt
id1,name1,age1,address1,...
id2,name2,age2,address2,...


I know that it can be done well by a database.
But I want to handle it with Hadoop if possible.
Can Hadoop meet the requirement?

Any suggestion would help me. Thank you very much!

Best Regards,

Gump

Re: How to mapreduce in the scenario

2012-05-29 Thread Soumya Banerjee
Hi,

You can also try the Hadoop reduce-side join functionality.
Look into contrib/datajoin/hadoop-datajoin-*.jar for the base map and reduce
classes to do the same.

Regards,
Soumya.




Re: How to mapreduce in the scenario

2012-05-29 Thread samir das mohapatra
Yes, it is possible by using MultipleInputs to feed multiple mappers
(basically 2 different mappers, one per input file).

Step 1:

MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class,
Mapper1.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class,
Mapper2.class);

While defining the two mappers' output values, put an identifier in the value
(output.collect(new Text(key), new Text(identifier + "~" + value));)
that marks whether the record came from a.txt or b.txt, so that the two files'
mapper output can easily be distinguished in the reducer.


Step 2:
Put b.txt in the DistributedCache and compare the reducer values against the
b.txt list:

String currValue = values.next().toString();
String[] valueSplitted = currValue.split("~");
if (valueSplitted[0].equals("A")) {        // "A": identifier from the a.txt mapper
    // process the a.txt record here
} else if (valueSplitted[0].equals("B")) { // "B": identifier from the b.txt mapper
    // process the b.txt record here
}

output.collect(new Text(key),
    new Text(/* the joined value, formatted the way you want to display it */));



Decide on the key according to the result you want to produce.

After that, use a reducer to perform the join and write the output.

thanks
samir
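Put together, samir's two steps amount to something like the sketch below. It is
recast in the newer org.apache.hadoop.mapreduce API (his snippets use the older
mapred API with output.collect), the class names AMapper, BMapper and TagJoinReducer
are illustrative, and it pairs with the MultipleInputs driver sketched earlier in the
thread:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for a.txt (id,name,age,...): tag each record with "A".
public class AMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",", 2);   // f[0] = id, f[1] = rest of record
        if (f.length < 2) return;                      // skip malformed lines
        context.write(new Text(f[0]), new Text("A~" + f[1]));
    }
}

// Mapper for b.txt (id,address,...): tag each record with "B".
class BMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",", 2);
        if (f.length < 2) return;
        context.write(new Text(f[0]), new Text("B~" + f[1]));
    }
}

// Reducer: for each id, separate the "A" and "B" sides and emit the joined row
// (the id is the output key; the remaining fields form the value).
class TagJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String left = null, right = null;
        for (Text v : values) {
            String[] parts = v.toString().split("~", 2);   // parts[0] = tag, parts[1] = payload
            if ("A".equals(parts[0])) left = parts[1];
            else right = parts[1];
        }
        if (left != null && right != null) {               // id present in both files
            context.write(key, new Text(left + "," + right));
        }
    }
}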






Re: How to mapreduce in the scenario

2012-05-29 Thread Robert Evans
Yes, you can do it. In Pig you would write something like:

A = LOAD 'a.txt' AS (id, name, age, ...);
B = LOAD 'b.txt' AS (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C INTO 'c.txt';

Hive can do it similarly too. Or you could write your own directly in
map/reduce or using the data_join jar.

--Bobby Evans





Re: How to mapreduce in the scenario

2012-05-29 Thread liuzhg
Hi,

Mike, Nitin, Devaraj, Soumya, samir, Robert 

Thank you all for your suggestions.

Actually, I want to know whether Hadoop has any performance advantage over a
conventional database for solving this kind of problem (joining data).

 

Best Regards,

Gump

 

 




Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
If you have a huge dataset (huge meaning around terabytes, or at the very least a
few GB), then yes, Hadoop has the advantage of a distributed system and is much
faster.

But on a smaller set of records it is not as good as an RDBMS.







-- 
Nitin Pawar