Hadoop and Eclipse integration

2012-05-29 Thread Nick Katsipoulakis

Hello everybody,
I attempted to use the Eclipse IDE for Hadoop development, and I followed 
the instructions shown here:


http://wiki.apache.org/hadoop/EclipseEnvironment

Everything goes well until I start importing projects into Eclipse, 
particularly HDFS. When I follow the instructions for the HDFS import I 
get the following error from Eclipse:


Project 'hadoop-hdfs' is missing required library: 
'/home/nick/.m2/repository/org/aspectj/aspectjtools/1.6.5/aspectjtools-1.6.5.jar'


I should mention that the hadoop-common directory into which I checked out 
Hadoop is located at:


/home/nick/hadoop-common

and I am using Ubuntu 10.04.

Similar errors appear when I attempt to import the MapReduceTools:

Project 'MapReduceTools' is missing required library: 'classes'
Project 'MapReduceTools' is missing required library: 'lib/hadoop-core.jar'

How can I resolve these issues?  Once they are resolved, how can I execute 
a simple WordCount job from Eclipse? Thank you.


How to mapreduce in the scenario

2012-05-29 Thread liuzhg
Hi,
 
I wonder whether Hadoop can solve the following problem effectively:
 
==
input file: a.txt, b.txt
result: c.txt
 
a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...
 
b.txt: 
id1,address1,...
id2,address2,...
id3,address3,...

c.txt
id1,name1,age1,address1,...
id2,name2,age2,address2,...

 
I know that it can be done well by a database. 
But I want to handle it with Hadoop if possible.
Can hadoop meet the requirement?
 
Any suggestion can help me. Thank you very much!
 
Best Regards,
 
Gump




Re: How to mapreduce in the scenario

2012-05-29 Thread Michel Segel
Hive? 
Sure, assuming you mean that the id is a FK common amongst the tables...

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 29, 2012, at 5:29 AM, liuzhg liu...@cernet.com wrote:

 Hi,
 
 I wonder that if Hadoop can solve effectively the question as following:
 
 ==
 input file: a.txt, b.txt
 result: c.txt
 
 a.txt:
 id1,name1,age1,...
 id2,name2,age2,...
 id3,name3,age3,...
 id4,name4,age4,...
 
 b.txt: 
 id1,address1,...
 id2,address2,...
 id3,address3,...
 
 c.txt
 id1,name1,age1,address1,...
 id2,name2,age2,address2,...
 
 
 I know that it can be done well by database. 
 But I want to handle it with hadoop if possible.
 Can hadoop meet the requirement?
 
 Any suggestion can help me. Thank you very much!
 
 Best Regards,
 
 Gump
 
 
 


Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
Hive is one approach (similar to conventional databases, but not exactly the same).

If you are looking at writing a MapReduce program, then use MultipleInputs:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
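
(For concreteness, a minimal sketch of that MultipleInputs wiring, using the old mapred API to match the code that appears later in this thread; the link above documents the equivalent class in the new mapreduce API. The class names are illustrative, each mapper just tags its records with the file they came from, and a reducer that actually merges the two sides per id is sketched further down in the replies.)

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class JoinDriver {

    // tags every a.txt line with "A" so the reducer can tell the two sides apart
    public static class AMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String[] parts = line.toString().split(",", 2);   // id, rest of record
            out.collect(new Text(parts[0]), new Text("A~" + parts[1]));
        }
    }

    // tags every b.txt line with "B"
    public static class BMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
            String[] parts = line.toString().split(",", 2);
            out.collect(new Text(parts[0]), new Text("B~" + parts[1]));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinDriver.class);
        conf.setJobName("join a.txt and b.txt");

        // one mapper per input file
        MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, AMapper.class);
        MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, BMapper.class);

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setReducerClass(IdentityReducer.class);   // replace with a join reducer (see later replies)
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));

        JobClient.runJob(conf);
    }
}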



On Tue, May 29, 2012 at 4:02 PM, Michel Segel michael_se...@hotmail.comwrote:

 Hive?
 Sure Assuming you mean that the id is a FK common amongst the tables...

 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On May 29, 2012, at 5:29 AM, liuzhg liu...@cernet.com wrote:

  Hi,
 
  I wonder that if Hadoop can solve effectively the question as following:
 
  ==
  input file: a.txt, b.txt
  result: c.txt
 
  a.txt:
  id1,name1,age1,...
  id2,name2,age2,...
  id3,name3,age3,...
  id4,name4,age4,...
 
  b.txt:
  id1,address1,...
  id2,address2,...
  id3,address3,...
 
  c.txt
  id1,name1,age1,address1,...
  id2,name2,age2,address2,...
  
 
  I know that it can be done well by database.
  But I want to handle it with hadoop if possible.
  Can hadoop meet the requirement?
 
  Any suggestion can help me. Thank you very much!
 
  Best Regards,
 
  Gump
 
 
 




-- 
Nitin Pawar


Re: How to Integrate LDAP in Hadoop ?

2012-05-29 Thread Michel Segel
Which release? Version?
I believe there are variables in the *-site.xml that allow LDAP integration ...



Sent from a remote device. Please excuse any typos...

Mike Segel

On May 26, 2012, at 7:40 AM, samir das mohapatra samir.help...@gmail.com 
wrote:

 Hi All,
 
   Did anyone work on Hadoop with LDAP integration?
   Please help me with the same.
 
 Thanks
  samir


RE: How to mapreduce in the scenario

2012-05-29 Thread Devaraj k
Hi Gump,

   MapReduce fits well for solving these types of problems (joins).

I hope this will help you to solve the described problem:

1. Map output key and value classes: Write a map output key class (Text.class) and 
value class (CombinedValue.class). Here the value class should be able to hold the 
values from both files (a.txt and b.txt), as shown below.

class CombinedValue implements Writable
{
   String name;
   int age;
   String address;
   boolean isLeft; // flag to identify which file the record came from
}

2. Mapper: Write a map() function which can parse records from both files (a.txt, 
b.txt) and produce the common output key and value classes.

3. Partitioner: Write the partitioner in such a way that it sends all (key, value) 
pairs that have the same key to the same reducer.

4. Reducer: In the reduce() function, you will receive the records from both 
files and you can combine them easily.


Thanks
Devaraj
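
(A minimal sketch of the CombinedValue class from step 1, assuming the fields listed above; a map output value class only needs to implement Writable, and write()/readFields() must serialize the fields in the same order.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class CombinedValue implements Writable {
    String name;
    int age;
    String address;
    boolean isLeft;   // flag to identify which file the record came from

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeBoolean(isLeft);
        out.writeUTF(name == null ? "" : name);
        out.writeInt(age);
        out.writeUTF(address == null ? "" : address);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // must read exactly what write() wrote, in the same order
        isLeft = in.readBoolean();
        name = in.readUTF();
        age = in.readInt();
        address = in.readUTF();
    }
}

In the reducer, the values with isLeft set carry name/age from a.txt and the rest carry address from b.txt, so a single pass over the values per key is enough to emit the joined c.txt line.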



From: liuzhg [liu...@cernet.com]
Sent: Tuesday, May 29, 2012 3:45 PM
To: common-user@hadoop.apache.org
Subject: How to mapreduce in the scenario

Hi,

I wonder that if Hadoop can solve effectively the question as following:

==
input file: a.txt, b.txt
result: c.txt

a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...

b.txt:
id1,address1,...
id2,address2,...
id3,address3,...

c.txt
id1,name1,age1,address1,...
id2,name2,age2,address2,...


I know that it can be done well by database.
But I want to handle it with hadoop if possible.
Can hadoop meet the requirement?

Any suggestion can help me. Thank you very much!

Best Regards,

Gump

Re: How to mapreduce in the scenario

2012-05-29 Thread Soumya Banerjee
Hi,

You can also try to use the Hadoop Reduce Side Join functionality.
Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base MAP and
Reduce classes to do the same.

Regards,
Soumya.

On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote:

 Hi Gump,

   Mapreduce fits well for solving these types(joins) of problem.

 I hope this will help you to solve the described problem..

 1. Mapoutput key and value classes : Write a map out put key
 class(Text.class), value class(CombinedValue.class). Here value class
 should be able to hold the values from both the files(a.txt and b.txt) as
 shown below.

 class CombinedValue implements WritableComparator
 {
   String name;
   int age;
   String address;
   boolean isLeft; // flag to identify from which file
 }

 2. Mapper : Write a map() function which can parse from both the
 files(a.txt, b.txt) and produces common output key and value class.

 3. Partitioner : Write the partitioner in such a way that it will Send all
 the (key, value) pairs to same reducer which are having same key.

 4. Reducer : In the reduce() function, you will receive the records from
 both the files and you can combine those easily.


 Thanks
 Devaraj


 
 From: liuzhg [liu...@cernet.com]
 Sent: Tuesday, May 29, 2012 3:45 PM
 To: common-user@hadoop.apache.org
 Subject: How to mapreduce in the scenario

 Hi,

 I wonder that if Hadoop can solve effectively the question as following:

 ==
 input file: a.txt, b.txt
 result: c.txt

 a.txt:
 id1,name1,age1,...
 id2,name2,age2,...
 id3,name3,age3,...
 id4,name4,age4,...

 b.txt:
 id1,address1,...
 id2,address2,...
 id3,address3,...

 c.txt
 id1,name1,age1,address1,...
 id2,name2,age2,address2,...
 

 I know that it can be done well by database.
 But I want to handle it with hadoop if possible.
 Can hadoop meet the requirement?

 Any suggestion can help me. Thank you very much!

 Best Regards,

 Gump



Re: How to Integrate LDAP in Hadoop ?

2012-05-29 Thread samir das mohapatra
It is Cloudera version 0.20.

On Tue, May 29, 2012 at 4:14 PM, Michel Segel michael_se...@hotmail.comwrote:

 Which release? Version?
 I believe there are variables in the *-site.xml that allow LDAP
 integration ...



 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On May 26, 2012, at 7:40 AM, samir das mohapatra samir.help...@gmail.com
 wrote:

  Hi All,
 
Did any one work on hadoop with LDAP integration.
Please help me for same.
 
  Thanks
   samir



Re: How to mapreduce in the scenario

2012-05-29 Thread samir das mohapatra
Yes, it is possible by using MultipleInputs to route each input to its own mapper
(basically 2 different mappers).

Step 1:

MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, Mapper2.class);

While emitting each mapper's value, put in an identifier
(output.collect(new Text(key), new Text(identifier + "~" + value));)
related to a.txt or b.txt, so that it is easy to distinguish the two files' mapper
output within the reducer.


Step 2:
  Put b.txt in the distributed cache and compare the reducer values against the
b.txt list:

String currValue = values.next().toString();
String valueSplitted[] = currValue.split("~");
if (valueSplitted[0].equals("A")) {    // "A": identifier from the A mapper
    // here process the A file record
} else if (valueSplitted[0].equals("B")) {    // "B": identifier from the B mapper
    // here process the B file record
}

output.collect(new Text(key), new Text("formatted value as you like to display"));



Decide on the key according to the result you want to produce.

After that you have to use one reducer to produce the output.

thanks
samir
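
(A hedged sketch of what the reduce side of this tagging approach could look like with the old mapred API; the "A"/"B" identifiers and the "~" separator follow the description above, while the class and variable names are made up for illustration.)

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class JoinReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String nameAndAge = null;   // "name,age,..." tagged "A" by the a.txt mapper
        String address = null;      // "address,..." tagged "B" by the b.txt mapper

        while (values.hasNext()) {
            String[] tagged = values.next().toString().split("~", 2);
            if (tagged[0].equals("A")) {
                nameAndAge = tagged[1];
            } else if (tagged[0].equals("B")) {
                address = tagged[1];
            }
        }

        // emit a c.txt line only for ids present in both files
        if (nameAndAge != null && address != null) {
            output.collect(key, new Text(nameAndAge + "," + address));
        }
    }
}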

On Tue, May 29, 2012 at 3:45 PM, liuzhg liu...@cernet.com wrote:

 Hi,

 I wonder that if Hadoop can solve effectively the question as following:

 ==
 input file: a.txt, b.txt
 result: c.txt

 a.txt:
 id1,name1,age1,...
 id2,name2,age2,...
 id3,name3,age3,...
 id4,name4,age4,...

 b.txt:
 id1,address1,...
 id2,address2,...
 id3,address3,...

 c.txt
 id1,name1,age1,address1,...
 id2,name2,age2,address2,...
 

 I know that it can be done well by database.
 But I want to handle it with hadoop if possible.
 Can hadoop meet the requirement?

 Any suggestion can help me. Thank you very much!

 Best Regards,

 Gump





distributed cache symlink

2012-05-29 Thread Alan Miller
I'm trying to use the DistributedCache but having an issue resolving the 
symlinks to my files.

My Driver class writes some hashmaps to files in the DC like this:
Path tPath = new Path("/data/cache/fd", UUID.randomUUID().toString());
os = new ObjectOutputStream(fs.create(tPath));
os.writeObject(myHashMap);
os.close();
URI uri = new URI(tPath.toString() + "#" + "q_map");
DistributedCache.addCacheFile(uri, config);
DistributedCache.createSymlink(config);

But what Path() do I need to access to read the symlinks? 
I tried variations of "q_map" and "work/q_map" but neither works.

The files are definitely there because I can set a config var to the path and 
read the files in my reducer. For example, in my Driver class I set a variable 
via
 config.set("q_map", tPath.toString());

And then in my Reducer's setup() I do something like
Path q_map_path = new Path(config.get("q_map"));
if (fs.exists(q_map_path)) {
    HashMap<String,String> qMap = loadmap(conf, q_map_path);
}

I tried to resolve the path to the symlinks via ${mapred.local.dir}/work but 
that doesn't work either. 
In the STDOUT of my mapper attempt I see:

  2012-05-29 03:59:54,369 - INFO  [main:TaskRunner@759] - 
   Creating symlink: 
/tmp/hadoop-mapred/mapred/local/taskTracker/distcache/-3168904771265144450_-884848596_406879224/varuna010/data/cache/fd/6dc9d5c0-98be-4105-bd59-b344924dd989
 
  - 
/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/attempt_201205250826_0020_m_00_0/work/q_map

Which says it's creating the symlinks, BUT I also see this output: 

mapred.local.dir: 
/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/attempt_201205250826_0020_m_00_0
   job.local.dir: 
/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/work
  mapred.task.id: attempt_201205250826_0020_m_00_0
Path [work/q_map] does not exist
Path 
[/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826_0020/attempt_201205250826_0020_m_00_0/work/q_map]
 does not exist

Which is from this code in my mapper's setup() method:
try {
    System.out.printf("mapred.local.dir: %s\n", conf.get("mapred.local.dir"));
    System.out.printf("   job.local.dir: %s\n", conf.get("job.local.dir"));
    System.out.printf("  mapred.task.id: %s\n", conf.get("mapred.task.id"));
    fs = FileSystem.get(conf);
    Path symlink = new Path("work/q_map");
    Path fullpath = new Path(conf.get("mapred.local.dir") + "/work/q_map");
    System.out.printf("Path [%s] ", symlink.toString());
    if (fs.exists(symlink)) {
        System.out.println("exists");
    } else {
        System.out.println("does not exist");
    }
    System.out.printf("Path [%s] ", fullpath.toString());
    if (fs.exists(fullpath)) {
        System.out.println("exists");
    } else {
        System.out.println("does not exist");
    }
} catch (IOException e1) {
    e1.printStackTrace();
}

Regards,
Alan


Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......

2012-05-29 Thread waqas latif
So my question is: do Hadoop 0.20 and 1.0.3 differ in their support for
writing or reading SequenceFiles? The same code works fine with Hadoop 0.20, but
the problem occurs when I run it under Hadoop 1.0.3.

On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote:

 But the thing is, it works with hadoop 0.20. even with 100 x100(and even
 bigger matrices)  but when it comes to hadoop 1.0.3 then even there is a
 problem with 3x3 matrix.


 On Sun, May 27, 2012 at 12:00 PM, Prashant Kommireddi prash1...@gmail.com
  wrote:

 I have seen this issue with large file writes using the SequenceFile writer.
 I have not found the same issue when testing with writing fairly small files
 (< 1GB).

 On Fri, May 25, 2012 at 10:33 PM, Kasi Subrahmanyam
 kasisubbu...@gmail.comwrote:

  Hi,
  If you are using a custom writable object while passing data from the
  mapper to the reducer, make sure that readFields() and write() handle the
  same number of variables. It might be possible that you wrote data to a
  file using the custom writable but later modified the custom writable (like
  adding a new attribute to the writable) which the old data doesn't have.
 
  It might be a possibility, so please check once.
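
  (To illustrate the point about keeping write() and readFields() in sync, whether or not that is the cause here: a small hypothetical custom Writable. If a field is added to one method but not the other, or old SequenceFiles are read after a new field is added, the reader runs past the end of the data and an EOFException is a typical symptom.)

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  public class MatrixCellWritable implements Writable {
      int row;
      int col;
      double value;

      @Override
      public void write(DataOutput out) throws IOException {
          out.writeInt(row);
          out.writeInt(col);
          out.writeDouble(value);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
          // same fields, same order, same types as write()
          row = in.readInt();
          col = in.readInt();
          value = in.readDouble();
      }
  }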
 
  On Friday, May 25, 2012, waqas latif wrote:
 
   Hi Experts,
  
   I am fairly new to hadoop MapR and I was trying to run a matrix
   multiplication example presented by Mr. Norstadt under following link
   http://www.norstad.org/matrix-multiply/index.html. I can run it
   successfully with hadoop 0.20.2 but I tried to run it with hadoop
 1.0.3
  but
   I am getting following error. Is it the problem with my hadoop
   configuration or it is compatibility problem in the code which was
  written
   in hadoop 0.20 by author.Also please guide me that how can I fix this
  error
   in either case. Here is the error I am getting.
  
   in thread main java.io.EOFException
  at java.io.DataInputStream.readFully(DataInputStream.java:180)
  at java.io.DataInputStream.readFully(DataInputStream.java:152)
  at
   org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
  at
  
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
  at
  
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
  at
  
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
  at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60)
  at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87)
  at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112)
  at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150)
  at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278)
  at TestMatrixMultiply.main(TestMatrixMultiply.java:308)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
  
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
  
   Thanks in advance
  
   Regards,
   waqas
  
 





Re: How to Integrate LDAP in Hadoop ?

2012-05-29 Thread Michael Segel
I believe that their CDH3u3 or later has this... parameter. 
(Possibly even earlier.)

On May 29, 2012, at 6:12 AM, samir das mohapatra wrote:

 It is cloudera version .20
 
 On Tue, May 29, 2012 at 4:14 PM, Michel Segel 
 michael_se...@hotmail.comwrote:
 
 Which release? Version?
 I believe there are variables in the *-site.xml that allow LDAP
 integration ...
 
 
 
 Sent from a remote device. Please excuse any typos...
 
 Mike Segel
 
 On May 26, 2012, at 7:40 AM, samir das mohapatra samir.help...@gmail.com
 wrote:
 
 Hi All,
 
  Did any one work on hadoop with LDAP integration.
  Please help me for same.
 
 Thanks
 samir
 



Re: Multiple fs.FSInputChecker: Found checksum error .. because of load ?

2012-05-29 Thread Akshay Singh
Found the problem. 
Shifting the VMs from VirtualBox to KVM worked for me; all other VM configurations 
were kept the same.


So, the checksum errors were indeed pointing to a hardware problem, though a virtual 
one in this case.


-Akshay



 From: Akshay Singh akshay_i...@yahoo.com
To: common-user@hadoop.apache.org common-user@hadoop.apache.org 
Sent: Wednesday, 23 May 2012 4:38 PM
Subject: Multiple fs.FSInputChecker: Found checksum error .. because of load ?
 

Hi,

I am trying to run a few benchmarks on a small Hadoop cluster of 4 VMs (2 on each of 2 
physical hosts, each VM having 1 CPU core, 2GB RAM, an individual disk and Gbps 
bridged connectivity). I am using VirtualBox as the VMM.


This workload reads a good number of random small files (64MB each) concurrently 
from all the HDFS datanodes, through clients running on the same set of VMs. I am 
using FsShell cat to read the files, and I see these checksum errors:

12/05/22 10:10:12 INFO fs.FSInputChecker: Found checksum error: b[3072, 
3584]=cb93678dc0259c978731af408f2cb493b510c948b45039a4853688fd21c2a070fc03ff7b807f33d20100080027
cf09e308002761d4480800450005dc2af04000400633ca816169cf816169d0c35a87c1b090973e78aa5ef880100e24446b0101080a020fcf7b020fcea7d85a506ff1eaea5383eea539137745249aebc25e86d0feac89
c4e0c9b91bc09ee146af7e9bd103c8269486a8c748091cfc42e178f461d9127f6c9676f47fa6863bb19f2e51142725ae643ffdfbe7027798e1f11314d9aa877db99a86db25f2f6d18d5b86062de737147b918e829fb178cf
bbb57e932ab082197b1f4fa4315eae67210018c3c034b3f52481c4cebc53d1e2fd5ad4b67d87823f5e0923fa1ff579de88768f79a6df5f86a8a7eb3a68b3366063408b7292eef8f909580e3866676838ba8417bb810d9a9e
8d12c49de4522214e1c6a22b64394a1e60e020b12d5803d2b6a53fe64d00b85dc63c67a8a94758f71a7a06a786e168ea234030806026ffed07770ba6d407437a4a83b96c2b3a3c767d834a19c438a0d6f56ca6fc9099d375
ae1f95839c62f36a466818eb816d4d3ef6f3951ce3a19a3364a827bac8fd70833587c89084b847e4ceeae48df9256ef629c6325f67872478838777885f930710b71c02256b0cc66242d4974fbfb0ebcf85ef6cf4b67656dc
6918bc57083dc8868e34662c98e183163a9fc82a42fddc
org.apache.hadoop.fs.ChecksumException: Checksum error: 
/blk_2250776182612718654:of:/user/hduser/15-3/part-00197 at 52284416
    at 
org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
    at
 org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at 
org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1457)
    at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:2172)
    at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:2224)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
    at
 org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
    at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
    at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
    at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:349)
    at 
org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1913)
    at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)
    at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1557)
    at
 org.apache.hadoop.fs.FsShell.run(FsShell.java:1776)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
12/05/22 10:10:13 WARN hdfs.DFSClient: Found Checksum error for 
blk_2250776182612718654_6078 from XX.XX.XX.207:50010 at 52284416
12/05/22 10:10:13 INFO hdfs.DFSClient: Could not obtain block 
blk_2250776182612718654_6078 from any node: java.io.IOException: No live nodes 
contain current block. Will get new
 block locations from namenode and retry...
cat: Checksum error: /blk_2250776182612718654:of:/user/hduser/15-3/part-00197 
at 52284416
cat: Checksum error: /blk_-5591790629390980895:of:/user/hduser/15-1/part-00192 
at 30324736

Hadoop fsck does not report any corrupt blocks after writing the data, but after 
every iteration of reading the data I see new corrupt blocks (with output as 
above). Interestingly, the higher the load (concurrent sequential reads) I put on 
the DFS cluster, the higher the chance of blocks getting corrupted. I (mostly) do not see 
any corruption happening when there is little or no contention at the DFS servers for 
reads. 

I see few other people on web also faced the same problem :

http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/508
http://tinyurl.com/7rsckwo

It has been suggested on these threads that faulty hardware may be causing this 
issue, and these checksum errors are likely to indicate as much. So, I diagnosed my RAM 
(non 

Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Michael Segel
Hi,
That's not a backup strategy. 
You could still have joe luser take out a key file or directory. What do you do 
then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

 Hi,
 
 We are about to build a 10 machine cluster with 40Tb of storage, obviously
 as this gets full actually trying to create an offsite backup becomes a
 problem unless we build another 10 machine cluster (too expensive right
 now).  Not sure if it will help but we have planned the cabinet into an
 upper and lower half with separate redundant power, then we plan to put
 half of the cluster in the top, half in the bottom, effectively 2 racks, so
 in theory we could lose half the cluster and still have the copies of all
 the blocks with a replication factor of 3?  Apart form the data centre
 burning down or some other disaster that would render the machines totally
 unrecoverable, is this approach good enough?
 
 I realise this is a very open question and everyone's circumstances are
 different, but I'm wondering what other peoples experiences/opinions are
 for backing up cluster data?
 
 Thanks
 Darrell.



Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Robert Evans
Yes you will have redundancy, so no single point of hardware failure can wipe 
out your data, short of a major catastrophe.  But you can still have an errant 
or malicious hadoop fs -rm -rf shut you down.  If you still have the original 
source of your data somewhere else you may be able to recover, by reprocessing 
the data, but if this cluster is your single repository for all your data you 
may have a problem.

--Bobby Evans

On 5/29/12 11:40 AM, Michael Segel michael_se...@hotmail.com wrote:

Hi,
That's not a back up strategy.
You could still have joe luser take out a key file or directory. What do you do 
then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

 Hi,

 We are about to build a 10 machine cluster with 40Tb of storage, obviously
 as this gets full actually trying to create an offsite backup becomes a
 problem unless we build another 10 machine cluster (too expensive right
 now).  Not sure if it will help but we have planned the cabinet into an
 upper and lower half with separate redundant power, then we plan to put
 half of the cluster in the top, half in the bottom, effectively 2 racks, so
 in theory we could lose half the cluster and still have the copies of all
 the blocks with a replication factor of 3?  Apart form the data centre
 burning down or some other disaster that would render the machines totally
 unrecoverable, is this approach good enough?

 I realise this is a very open question and everyone's circumstances are
 different, but I'm wondering what other peoples experiences/opinions are
 for backing up cluster data?

 Thanks
 Darrell.




Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)

2012-05-29 Thread Rohit Pandey
Hello Hadoop community,

I have been trying to set up a double node Hadoop cluster (following
the instructions in -
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/)
and am very close to running it apart from one small glitch - when I
start the dfs (using start-dfs.sh), it says:

10.63.88.53: starting datanode, logging to
/usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out
10.63.88.109: starting datanode, logging to
/usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out
10.63.88.109: starting secondarynamenode, logging to
/usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out
starting jobtracker, logging to
/usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out
10.63.88.109: starting tasktracker, logging to
/usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out
10.63.88.53: starting tasktracker, logging to
/usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out

which looks like it's been successful in starting all the nodes.
However, when I check them out by running 'jps', this is what I see:
27531 SecondaryNameNode
27879 Jps

As you can see, there is no datanode and no namenode. I have been
racking my brains at this for quite a while now. I have checked all the
inputs and everything else. Anyone know what the problem might be?

-- 

Thanks in advance,

Rohit


about hadoop webapps

2012-05-29 Thread 孙亮亮
I have another question.
I want to use Hadoop's classes and XML messages to get information about Hadoop's
NameNode, DataNodes, Jobs, etc. and monitor them in my application, so I want to deploy
a web application (Struts 2.0) in Hadoop's webapps directory. I'm reading through
Hadoop's source, but I couldn't find a good way to do it. Do you have any good
suggestions, or is there a user community for this?


Help with DFSClient Exception.

2012-05-29 Thread Bharadia, Akshay
Hi,

We are frequently observing the exception
java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not 
complete file 
/output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2.
  Giving up.
on our cluster. The exception occurs while writing a file. We are using 
Hadoop 0.20.2. It's a ~250-node cluster and on average 1 box goes down every 3 
days.

Detailed stack trace :
12/05/27 23:26:54 INFO mapred.JobClient: Task Id : 
attempt_201205232329_28133_r_02_0, Status : FAILED
java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not 
complete file 
/output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2.
  Giving up.
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
at 
org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.Child.main(Child.java:170)


Our investigation:
We have the min replication factor set to 2. As mentioned here 
(http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html), "A call to 
complete() will not return true until all the file's blocks have been 
replicated the minimum number of times. Thus, DataNode failures may cause a 
client to call complete() several times before succeeding", so we should retry 
complete() several times.
The org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() method calls 
the complete() function and retries it 20 times. But in spite of that, the file 
blocks are not replicated the minimum number of times. The retry count is not 
configurable. Changing the min replication factor to 1 is also not a good idea 
since there are jobs continuously running on our cluster.

Do we have any solution / workaround for this problem?
What min replication factor is generally used in industry?

Let me know if any further inputs required.

Thanks,
-Akshay





How to mapreduce in the scenario

2012-05-29 Thread lzg
Hi,
 
I wonder whether Hadoop can solve the following problem effectively:
 
==
input file: a.txt, b.txt
result: c.txt
 
a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...
 
b.txt: 
id1,address1,...
id2,address2,...
id3,address3,...

c.txt
id1,name1,age1,address1,...
id2,name2,age2,address2,...

 
I know that it can be done well by a database.
But I want to handle it with Hadoop if possible.
Can hadoop meet the requirement?
 
Any suggestion can help me. Thank you very much!
 
Best Regards,
 
Gump
 

Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)

2012-05-29 Thread sandeep
Can you check the logs for the NN and DN?

Sent from my iPhone

On May 27, 2012, at 1:21 PM, Rohit Pandey rohitpandey...@gmail.com wrote:

 Hello Hadoop community,
 
 I have been trying to set up a double node Hadoop cluster (following
 the instructions in -
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/)
 and am very close to running it apart from one small glitch - when I
 start the dfs (using start-dfs.sh), it says:
 
 10.63.88.53: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out
 10.63.88.109: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out
 10.63.88.109: starting secondarynamenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out
 starting jobtracker, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out
 10.63.88.109: starting tasktracker, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out
 10.63.88.53: starting tasktracker, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out
 
 which looks like it's been successful in starting all the nodes.
 However, when I check them out by running 'jps', this is what I see:
 27531 SecondaryNameNode
 27879 Jps
 
 As you can see, there is no datanode and name node. I have been
 racking my brains at this for quite a while now. Checked all the
 inputs and every thing. Any one know what the problem might be?
 
 -- 
 
 Thanks in advance,
 
 Rohit


Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)

2012-05-29 Thread Harsh J
Rohit,

The SNN may start and run indefinitely without doing any work. The NN
and DN have probably not started because the NN has an issue (perhaps the NN
name directory isn't formatted) and the DN can't find the NN (or has
data directory issues as well).

So this isn't a glitch but a real issue you'll have to take a look at
your logs for.

On Sun, May 27, 2012 at 10:51 PM, Rohit Pandey rohitpandey...@gmail.com wrote:
 Hello Hadoop community,

 I have been trying to set up a double node Hadoop cluster (following
 the instructions in -
 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/)
 and am very close to running it apart from one small glitch - when I
 start the dfs (using start-dfs.sh), it says:

 10.63.88.53: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-ubuntu.out
 10.63.88.109: starting datanode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-datanode-pandro51-OptiPlex-960.out
 10.63.88.109: starting secondarynamenode, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-secondarynamenode-pandro51-OptiPlex-960.out
 starting jobtracker, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-jobtracker-pandro51-OptiPlex-960.out
 10.63.88.109: starting tasktracker, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-pandro51-OptiPlex-960.out
 10.63.88.53: starting tasktracker, logging to
 /usr/local/hadoop/bin/../logs/hadoop-pandro51-tasktracker-ubuntu.out

 which looks like it's been successful in starting all the nodes.
 However, when I check them out by running 'jps', this is what I see:
 27531 SecondaryNameNode
 27879 Jps

 As you can see, there is no datanode and name node. I have been
 racking my brains at this for quite a while now. Checked all the
 inputs and every thing. Any one know what the problem might be?

 --

 Thanks in advance,

 Rohit



-- 
Harsh J


Best Practices for Upgrading Hadoop Version?

2012-05-29 Thread Eli Finkelshteyn

Hi,
I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 
1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not 
really being used by too many people at the moment, so I'm OK if things 
get dirty or it goes offline for a bit. I was looking at the tutorial at 
http://wiki.apache.org/hadoop/Hadoop_Upgrade, but it 
seems either outdated or missing information. Namely, from what I've 
noticed so far, it doesn't specify what user any of the commands should 
be run as. Since I'm sure this is something a lot of people have needed 
to do, is there a better tutorial somewhere for upgrading the Hadoop version 
in general?


Eli


Re: distributed cache symlink

2012-05-29 Thread Koji Noguchi
It should be "./q_map".

Koji
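
(Koji's "./q_map" refers to the task's local working directory, where the symlink is created, so it should be read with java.io.File rather than checked with fs.exists() on the HDFS FileSystem as in the code above. A minimal sketch, assuming the driver wrote the map with ObjectOutputStream as in the original post; the class and method names here are just for illustration.)

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.HashMap;

public class QMapLoader {
    // Reads the object written by the driver through the DistributedCache symlink.
    // "q_map" is the fragment attached to the cache URI (tPath + "#" + "q_map"),
    // and the symlink lives in the task's current working directory.
    @SuppressWarnings("unchecked")
    public static HashMap<String, String> load() throws IOException, ClassNotFoundException {
        File cached = new File("./q_map");
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(cached));
        try {
            return (HashMap<String, String>) in.readObject();
        } finally {
            in.close();
        }
    }
}

A mapper's or reducer's setup()/configure() can simply call QMapLoader.load() instead of resolving the path through the HDFS FileSystem.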

On 5/29/12 7:38 AM, Alan Miller alan.mil...@synopsys.com wrote:

I'm trying to use the DistributedCache but having an issue resolving the
symlinks to my files.

My Driver class writes some hashmaps to files in the DC like this:
   Path tPath = new Path(/data/cache/fd, UUID.randomUUID().toString());
   os = new ObjectOutputStream(fs.create(tPath));
   os.writeObject(myHashMap);
   os.close();
URI uri = new URI(tPath.toString() + # + q_map);
   DistributedCache.addCacheFile(uri, config);
   DistributedCache.createSymlink(config);

But what Path() do I need to access to read the symlinks?
I tried variations of q_map,  work/q_map but neither works.

The files are definitely there because I can set a config var to the path
and 
read the files in my reducer. For example, in my Driver class I set a
variable via
 config.set(q_map, tPath.toString());

And then in my Reducer's setup() I do something like
Path q_map_path = new Path(config.get(q_map_path));
   if (fs.exists(q_map_path)) {
   HashMapString,String qMap = loadmap(conf,q_map_path);
   }

I tried to resolve the path to the symlinks via ${mapred.local.dir}/work
but that doesn't work either.
In the STDOUT of my mapper attempt I see:

  2012-05-29 03:59:54,369 - INFO  [main:TaskRunner@759] -
   Creating symlink:
/tmp/hadoop-mapred/mapred/local/taskTracker/distcache/-3168904771265144450
_-884848596_406879224/varuna010/data/cache/fd/6dc9d5c0-98be-4105-bd59-b344
924dd989 
  - 
/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826
_0020/attempt_201205250826_0020_m_00_0/work/q_map

Which says it's creating the symlinks, BUT I also see this output:

mapred.local.dir: 
/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826
_0020/attempt_201205250826_0020_m_00_0
   job.local.dir: 
/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_201205250826
_0020/work
  mapred.task.id: attempt_201205250826_0020_m_00_0
Path [work/q_map] does not exist
Path 
[/tmp/hadoop-mapred/mapred/local/taskTracker/root/jobcache/job_20120525082
6_0020/attempt_201205250826_0020_m_00_0/work/q_map] does not exist

Which is from this code in my mapper's setup() method:
try {
   System.out.printf(mapred.local.dir: %s\n,
conf.get(mapred.local.dir));
   System.out.printf(   job.local.dir: %s\n, conf.get(job.local.dir));
   System.out.printf(  mapred.task.id: %s\n, conf.get(mapred.task.id));
   fs = FileSystem.get(conf);
   Path symlink = new Path(work/q_map);
   Path fullpath = new Path(conf.get(mapred.local.dir) + /work/q_map);
   System.out.printf(Path [%s] ,symlink.toString());
   if (fs.exists(symlink)) {
   System.out.println(exists);
   } else {
   System.out.println(does not exist);
   }   
   System.out.printf(Path [%s] ,fullpath.toString());
   if (fs.exists(fullpath)) {
   System.out.println(exists);
   } else {
   System.out.println(does not exist);
   }   
} catch (IOException e1) {
   e1.printStackTrace();
}

Regards,
Alan



Re: How to mapreduce in the scenario

2012-05-29 Thread Robert Evans
Yes you can do it.  In Pig you would write something like

A = load 'a.txt' as (id, name, age, ...);
B = load 'b.txt' as (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C into 'c.txt';

Hive can do it similarly too.  Or you could write your own directly in 
map/reduce or use the data_join jar.

--Bobby Evans

On 5/29/12 4:08 AM, lzg lzg_...@163.com wrote:

Hi,

I wonder that if Hadoop can solve effectively the question as following:

==
input file: a.txt, b.txt
result: c.txt

a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...

b.txt:
id1,address1,...
id2,address2,...
id3,address3,...

c.txt
id1,name1,age1,address1,...
id2,name2,age2,address2,...


I know that it can be done well by database.
But I want to handle it with hadoop if possible.
Can hadoop meet the requirement?

Any suggestion can help me. Thank you very much!

Best Regards,

Gump




Re: different input/output formats

2012-05-29 Thread samir das mohapatra
Hi Mark,

  public void map(LongWritable offset, Text val,
                  OutputCollector<FloatWritable, Text> output, Reporter reporter)
      throws IOException {
      output.collect(new FloatWritable(1), val); // change 1 to 1.0f, then it will work
  }

Let me know the status after the change.


On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote:

 Hi guys, this is a very simple  program, trying to use TextInputFormat and
 SequenceFileoutputFormat. Should be easy but I get the same error.

 Here is my configurations:

conf.setMapperClass(myMapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);

TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));


 myMapper class is:

 public class myMapper extends MapReduceBase implements
 Mapper<LongWritable,Text,FloatWritable,Text> {

    public void map(LongWritable offset, Text val,
                    OutputCollector<FloatWritable,Text> output, Reporter reporter)
        throws IOException {
        output.collect(new FloatWritable(1), val);
    }
 }

 But I get the following error:

 12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
 attempt_201205260045_0032_m_00_0, Status : FAILED
 java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
 not class org.apache.hadoop.io.FloatWritable
at
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at

 org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at

 org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at

 org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

 Where is the writing of LongWritable coming from ??

 Thank you,
 Mark



Re: different input/output formats

2012-05-29 Thread Mark question
Thanks for the reply, but I already tried this option, and this is the error:

java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
not class org.apache.hadoop.io.FloatWritable
at
org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at
org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at
org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at
filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:60)
at
filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

Mark

On Tue, May 29, 2012 at 1:05 PM, samir das mohapatra 
samir.help...@gmail.com wrote:

 Hi  Mark

  public void map(LongWritable offset, Text
 val,OutputCollector
 FloatWritable,Text output, Reporter reporter)
   throws IOException {
output.collect(new FloatWritable(*1*), val); *//chanage 1 to 1.0f
 then it will work.*
}

 let me know the status after the change


 On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com
 wrote:

  Hi guys, this is a very simple  program, trying to use TextInputFormat
 and
  SequenceFileoutputFormat. Should be easy but I get the same error.
 
  Here is my configurations:
 
 conf.setMapperClass(myMapper.class);
 conf.setMapOutputKeyClass(FloatWritable.class);
 conf.setMapOutputValueClass(Text.class);
 conf.setNumReduceTasks(0);
 conf.setOutputKeyClass(FloatWritable.class);
 conf.setOutputValueClass(Text.class);
 
 conf.setInputFormat(TextInputFormat.class);
 conf.setOutputFormat(SequenceFileOutputFormat.class);
 
 TextInputFormat.addInputPath(conf, new Path(args[0]));
 SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
 
 
  myMapper class is:
 
  public class myMapper extends MapReduceBase implements
  MapperLongWritable,Text,FloatWritable,Text {
 
 public void map(LongWritable offset, Text
  val,OutputCollectorFloatWritable,Text output, Reporter reporter)
 throws IOException {
 output.collect(new FloatWritable(1), val);
  }
  }
 
  But I get the following error:
 
  12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
  attempt_201205260045_0032_m_00_0, Status : FAILED
  java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable
 is
  not class org.apache.hadoop.io.FloatWritable
 at
  org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
 at
 
 
 org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
 at
 
 
 org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
 at
 
 
 org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
 at
 
 
 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
 at
 
 
 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at org.apache.hadoop.security.Use
 
  Where is the writing of LongWritable coming from ??
 
  Thank you,
  Mark
 



Re: different input/output formats

2012-05-29 Thread samir das mohapatra
Hi Mark,
   See the output for that same application.
   I am not getting any error.


On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote:

 Hi guys, this is a very simple  program, trying to use TextInputFormat and
 SequenceFileoutputFormat. Should be easy but I get the same error.

 Here is my configurations:

conf.setMapperClass(myMapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);

TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));


 myMapper class is:

 public class myMapper extends MapReduceBase implements
 MapperLongWritable,Text,FloatWritable,Text {

public void map(LongWritable offset, Text
 val,OutputCollectorFloatWritable,Text output, Reporter reporter)
throws IOException {
output.collect(new FloatWritable(1), val);
 }
 }

 But I get the following error:

 12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
 attempt_201205260045_0032_m_00_0, Status : FAILED
 java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
 not class org.apache.hadoop.io.FloatWritable
at
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at

 org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at

 org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at

 org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

 Where is the writing of LongWritable coming from ??

 Thank you,
 Mark



Re: different input/output formats

2012-05-29 Thread Mark question
Hi Samir, can you email me your main class, or check mine? It
is as follows:

public class SortByNorm1 extends Configured implements Tool {

    @Override public int run(String[] args) throws Exception {

        if (args.length != 2) {
            System.err.printf("Usage: bin/hadoop jar norm1.jar inputDir outputDir\n");
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }
        JobConf conf = new JobConf(new Configuration(), SortByNorm1.class);
        conf.setJobName("SortDocByNorm1");
        conf.setMapperClass(Norm1Mapper.class);
        conf.setMapOutputKeyClass(FloatWritable.class);
        conf.setMapOutputValueClass(Text.class);
        conf.setNumReduceTasks(0);
        conf.setReducerClass(Norm1Reducer.class);
        conf.setOutputKeyClass(FloatWritable.class);
        conf.setOutputValueClass(Text.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        TextInputFormat.addInputPath(conf, new Path(args[0]));
        SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new SortByNorm1(), args);
        System.exit(exitCode);
    }
}


On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra 
samir.help...@gmail.com wrote:

 Hi Mark
See the out put for that same  Application .
I am  not getting any error.


 On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.comwrote:

 Hi guys, this is a very simple  program, trying to use TextInputFormat and
 SequenceFileoutputFormat. Should be easy but I get the same error.

 Here is my configurations:

conf.setMapperClass(myMapper.class);
conf.setMapOutputKeyClass(FloatWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setNumReduceTasks(0);
conf.setOutputKeyClass(FloatWritable.class);
conf.setOutputValueClass(Text.class);

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);

TextInputFormat.addInputPath(conf, new Path(args[0]));
SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1]));


 myMapper class is:

 public class myMapper extends MapReduceBase implements
 MapperLongWritable,Text,FloatWritable,Text {

public void map(LongWritable offset, Text
 val,OutputCollectorFloatWritable,Text output, Reporter reporter)
throws IOException {
output.collect(new FloatWritable(1), val);
 }
 }

 But I get the following error:

 12/05/29 12:54:31 INFO mapreduce.Job: Task Id :
 attempt_201205260045_0032_m_00_0, Status : FAILED
 java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is
 not class org.apache.hadoop.io.FloatWritable
at
 org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
at

 org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75)
at

 org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705)
at

 org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59)
at

 filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.Use

 Where is the writing of LongWritable coming from ??

 Thank you,
 Mark





Re: How to mapreduce in the scenario

2012-05-29 Thread liuzhg
Hi,

Mike, Nitin, Devaraj, Soumya, samir, Robert 

Thank you all for your suggestions.

Actually, I want to know whether Hadoop has any performance advantage over a
conventional database for solving this kind of problem (joining data).

 

Best Regards,

Gump

 

 

On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee
soumya.sbaner...@gmail.com wrote:

Hi,

You can also try to use the Hadoop Reduce Side Join functionality.
Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base MAP and
Reduce classes to do the same.

Regards,
Soumya.


On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote:

 Hi Gump,

   Mapreduce fits well for solving these types(joins) of problem.

 I hope this will help you to solve the described problem..

 1. Mapoutput key and value classes : Write a map out put key
 class(Text.class), value class(CombinedValue.class). Here value class
 should be able to hold the values from both the files(a.txt and b.txt) as
 shown below.

 class CombinedValue implements WritableComparator
 {
   String name;
   int age;
   String address;
   boolean isLeft; // flag to identify from which file
 }

 2. Mapper : Write a map() function which can parse from both the
 files(a.txt, b.txt) and produces common output key and value class.

 3. Partitioner : Write the partitioner in such a way that it will Send all
 the (key, value) pairs to same reducer which are having same key.

 4. Reducer : In the reduce() function, you will receive the records from
 both the files and you can combine those easily.


 Thanks
 Devaraj


 
 From: liuzhg [liu...@cernet.com]
 Sent: Tuesday, May 29, 2012 3:45 PM
 To: common-user@hadoop.apache.org
 Subject: How to mapreduce in the scenario

 Hi,

 I wonder that if Hadoop can solve effectively the question as following:

 ==
 input file: a.txt, b.txt
 result: c.txt

 a.txt:
 id1,name1,age1,...
 id2,name2,age2,...
 id3,name3,age3,...
 id4,name4,age4,...

 b.txt:
 id1,address1,...
 id2,address2,...
 id3,address3,...

 c.txt
 id1,name1,age1,address1,...
 id2,name2,age2,address2,...
 

 I know that it can be done well by database.
 But I want to handle it with hadoop if possible.
 Can hadoop meet the requirement?

 Any suggestion can help me. Thank you very much!

 Best Regards,

 Gump


 



about rebalance

2012-05-29 Thread yingnan.ma

Hi,

I added 5 new datanodes and I want to do a rebalance. I started the 
rebalance on the namenode, and it gave me the notice:

starting balancer, logging to /hadoop/logs/hadoop-hdfs-balancer-hadoop220.out 

and today I checked the log file and the detail is:


Another balancer is running. Exiting...
Balancing took 5.0203 minutes


1) I am not sure whether I should start the rebalance on the namenode or 
on each new datanode.
2) Should I set the bandwidth on each datanode, or only on the namenode?
3) Once the rebalance has started, will the data on the other nodes be decreased?

4) Does the log detail mean the balancer was killed by another one?


If you have any suggestions, please let me know. Thank you.


Best Regards

Malone
2012-05-30 

  
Yingnan.Ma
Eyingnan...@ipinyou.com
MSN:  mayingnan_b...@hotmail.com
QQ: 230624226
北京市朝阳区八里庄西里100号东区 住邦2000,1号楼A座2101室,100025
  北京・上海・硅谷
http://www.ipinyou.com


Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
If you have a huge dataset (huge meaning around terabytes, or at the
least a few GBs), then yes, Hadoop has the advantage of distributed systems
and is much faster.

But on a smaller set of records it is not as good as an RDBMS.

On Wed, May 30, 2012 at 6:53 AM, liuzhg liu...@cernet.com wrote:

 Hi,

 Mike, Nitin, Devaraj, Soumya, samir, Robert

 Thank you all for your suggestions.

 Actually, I want to know if hadoop has any advantage than routine database
 in performance for solving this kind of problem ( join data ).



 Best Regards,

 Gump





 On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee
 soumya.sbaner...@gmail.com wrote:

 Hi,

 You can also try to use the Hadoop Reduce Side Join functionality.
 Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base MAP and
 Reduce classes to do the same.

 Regards,
 Soumya.


 On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote:

  Hi Gump,
 
Mapreduce fits well for solving these types(joins) of problem.
 
  I hope this will help you to solve the described problem..
 
  1. Mapoutput key and value classes : Write a map out put key
  class(Text.class), value class(CombinedValue.class). Here value class
  should be able to hold the values from both the files(a.txt and b.txt) as
  shown below.
 
  class CombinedValue implements WritableComparator
  {
String name;
int age;
String address;
boolean isLeft; // flag to identify from which file
  }
 
  2. Mapper : Write a map() function which can parse from both the
  files(a.txt, b.txt) and produces common output key and value class.
 
  3. Partitioner : Write the partitioner in such a way that it will Send
 all
  the (key, value) pairs to same reducer which are having same key.
 
  4. Reducer : In the reduce() function, you will receive the records from
  both the files and you can combine those easily.
 
 
  Thanks
  Devaraj
 
 
  
  From: liuzhg [liu...@cernet.com]
  Sent: Tuesday, May 29, 2012 3:45 PM
  To: common-user@hadoop.apache.org
  Subject: How to mapreduce in the scenario
 
  Hi,
 
  I wonder that if Hadoop can solve effectively the question as following:
 
  ==
  input file: a.txt, b.txt
  result: c.txt
 
  a.txt:
  id1,name1,age1,...
  id2,name2,age2,...
  id3,name3,age3,...
  id4,name4,age4,...
 
  b.txt:
  id1,address1,...
  id2,address2,...
  id3,address3,...
 
  c.txt
  id1,name1,age1,address1,...
  id2,name2,age2,address2,...
  
 
  I know that it can be done well by database.
  But I want to handle it with hadoop if possible.
  Can hadoop meet the requirement?
 
  Any suggestion can help me. Thank you very much!
 
  Best Regards,
 
  Gump
 






-- 
Nitin Pawar


RE: about rebalance

2012-05-29 Thread Devaraj k
1) I am not sure whether I should start the rebalance on the namenode or 
on each new datanode.
You can run the balancer on any node. It is not suggested to run it on the namenode; 
it would be better to run it on a node which has less load.

2) Should I set the bandwidth on each datanode, or only on the namenode?
Each datanode has a limited bandwidth for rebalancing. The default value for the
bandwidth is 5MB/s.

3) Once the rebalance has started, will the data on the other nodes be decreased?
Yes, after the balancer runs, data will be moved from over-utilized nodes to 
under-utilized nodes.

4) Does the log detail mean the balancer was killed by another one?
We cannot run multiple balancers at a time. Only one balancer is allowed to run 
in the cluster at any time, to avoid data corruption.


You can refer to the document below for more details.
https://issues.apache.org/jira/secure/attachment/12368261/RebalanceDesign6.pdf

Thanks
Devaraj


From: yingnan.ma [yingnan...@ipinyou.com]
Sent: Wednesday, May 30, 2012 7:06 AM
To: common-user
Subject: about rebalance

Hi,

I add 5 new datanode and I want to do the rebalance, and I started the 
rebalance on the namenode, and it gave me the notice that

starting balancer, logging to /hadoop/logs/hadoop-hdfs-balancer-hadoop220.out 

and today I check the log file and the detail is that


Another balancer is running. Exiting...
Balancing took 5.0203 minutes


1) I am not sure that whether I should start the rebalance on the namenode or 
on each new datanode.
2) should I set the bandwidth on each datanode or just only on the namenode
3) If the rebalance started, whether the data on others' would be decreased

4)whether the log details means the balancer was killed by another one.


If you have some suggestion, please give me some notice , thank you


Best Regards

Malone
2012-05-30


Yingnan.Ma
Eyingnan...@ipinyou.com
MSN:  mayingnan_b...@hotmail.com
QQ: 230624226
北京市朝阳区八里庄西里100号东区 住邦2000,1号楼A座2101室,100025
  北京・上海・硅谷
http://www.ipinyou.com