Hadoop and Eclipse integration

2012-05-29 Thread Nick Katsipoulakis
Hello everybody, I attempted to use the Eclipse IDE for Hadoop development, following the instructions here: http://wiki.apache.org/hadoop/EclipseEnvironment Everything goes well until I start importing projects into Eclipse, particularly HDFS. When I follow the

How to mapreduce in the scenario

2012-05-29 Thread liuzhg
Hi, I wonder whether Hadoop can solve the following problem effectively:
==
input files: a.txt, b.txt
result: c.txt
a.txt:
id1,name1,age1,...
id2,name2,age2,...
id3,name3,age3,...
id4,name4,age4,...
b.txt:
id1,address1,...
id2,address2,...

Re: How to mapreduce in the scenario

2012-05-29 Thread Michel Segel
Hive? Sure, assuming you mean that the id is a FK common to the tables... Sent from a remote device. Please excuse any typos... Mike Segel On May 29, 2012, at 5:29 AM, liuzhg liu...@cernet.com wrote: Hi, I wonder whether Hadoop can solve the following problem effectively:

Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
Hive is one approach (similar to a conventional database, but not exactly the same). If you are looking at a MapReduce program, then use MultipleInputs: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html On Tue, May 29, 2012 at 4:02 PM,

Re: How to Integrate LDAP in Hadoop ?

2012-05-29 Thread Michel Segel
Which release? Version? I believe there are variables in the *-site.xml that allow LDAP integration... Sent from a remote device. Please excuse any typos... Mike Segel On May 26, 2012, at 7:40 AM, samir das mohapatra samir.help...@gmail.com wrote: Hi All, Did anyone work on Hadoop

RE: How to mapreduce in the scenario

2012-05-29 Thread Devaraj k
Hi Gump, MapReduce fits this type of problem (joins) well. I hope this will help you solve the described problem. 1. Map output key and value classes: write a map output key class (Text.class) and a value class (CombinedValue.class). Here the value class should be able to hold the
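A minimal sketch of what such a value class could look like, assuming a simple tag field distinguishes records from a.txt and b.txt (the CombinedValue name comes from the mail above; the fields are illustrative):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Value wrapper for a reduce-side join: a tag marks the source file so the
    // reducer can pair name/age records with address records sharing an id.
    public class CombinedValue implements Writable {
        private Text tag = new Text();    // e.g. "a" for a.txt, "b" for b.txt
        private Text record = new Text(); // remainder of the input line

        public void set(String t, String r) { tag.set(t); record.set(r); }
        public Text getTag() { return tag; }
        public Text getRecord() { return record; }

        @Override
        public void write(DataOutput out) throws IOException {
            tag.write(out);
            record.write(out);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            tag.readFields(in);
            record.readFields(in);
        }
    }

In the reducer, all values for one id arrive together; buffering the "a"-tagged records and concatenating each with the "b"-tagged ones yields the joined c.txt lines.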

Re: How to mapreduce in the scenario

2012-05-29 Thread Soumya Banerjee
Hi, You can also try the Hadoop reduce-side join functionality. Look into contrib/datajoin/hadoop-datajoin-*.jar for the base Map and Reduce classes that do the same. Regards, Soumya. On Tue, May 29, 2012 at 4:10 PM, Devaraj k devara...@huawei.com wrote: Hi Gump, MapReduce fits

Re: How to Integrate LDAP in Hadoop ?

2012-05-29 Thread samir das mohapatra
It is the Cloudera version of 0.20. On Tue, May 29, 2012 at 4:14 PM, Michel Segel michael_se...@hotmail.com wrote: Which release? Version? I believe there are variables in the *-site.xml that allow LDAP integration... Sent from a remote device. Please excuse any typos... Mike Segel On May 26,

Re: How to mapreduce in the scenario

2012-05-29 Thread samir das mohapatra
Yes, it is possible by using MultipleInputs to send input to multiple mappers (basically 2 different mappers). Step 1: MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, Mapper1.class); MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, Mapper2.class);
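A minimal driver sketch around those two calls (old org.apache.hadoop.mapred API, matching the snippet; the class names and the tagging scheme are assumptions, not the poster's actual code):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class JoinDriver {

        // Keys each a.txt line by id (first CSV field) and tags it "a".
        public static class Mapper1 extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable off, Text line,
                    OutputCollector<Text, Text> out, Reporter rep) throws IOException {
                String[] f = line.toString().split(",", 2);
                out.collect(new Text(f[0]), new Text("a\t" + f[1]));
            }
        }

        // Same for b.txt, tagged "b".
        public static class Mapper2 extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable off, Text line,
                    OutputCollector<Text, Text> out, Reporter rep) throws IOException {
                String[] f = line.toString().split(",", 2);
                out.collect(new Text(f[0]), new Text("b\t" + f[1]));
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JoinDriver.class);
            conf.setJobName("join a.txt and b.txt by id");

            // One mapper per input file; both emit (id, tagged record).
            MultipleInputs.addInputPath(conf, new Path(args[0]),
                    TextInputFormat.class, Mapper1.class);
            MultipleInputs.addInputPath(conf, new Path(args[1]),
                    TextInputFormat.class, Mapper2.class);

            conf.setReducerClass(IdentityReducer.class); // replace with a real join reducer
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            FileOutputFormat.setOutputPath(conf, new Path(args[2]));
            JobClient.runJob(conf);
        }
    }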

distributed cache symlink

2012-05-29 Thread Alan Miller
I'm trying to use the DistributedCache but am having an issue resolving the symlinks to my files. My Driver class writes some hashmaps to files in the DC like this: Path tPath = new Path("/data/cache/fd", UUID.randomUUID().toString()); os = new ObjectOutputStream(fs.create(tPath));
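For reference, a driver-side sketch of how such a file is usually registered so tasks can find it by a fixed name (the q_map fragment name is taken from the reply below; symlink creation must be enabled explicitly on this API):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheSetup {
        // Register tPath in the DistributedCache under the fragment "q_map",
        // so each task sees a symlink ./q_map in its working directory.
        public static void addHashmapFile(Configuration conf, Path tPath) throws Exception {
            DistributedCache.addCacheFile(new URI(tPath.toUri() + "#q_map"), conf);
            DistributedCache.createSymlink(conf); // symlinks are off by default
        }
    }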

Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......

2012-05-29 Thread waqas latif
So my question is: do Hadoop 0.20 and 1.0.3 differ in their support for writing or reading SequenceFiles? The same code works fine with Hadoop 0.20, but the problem occurs when I run it under Hadoop 1.0.3. On Sun, May 27, 2012 at 6:15 PM, waqas latif waqas...@gmail.com wrote: But the thing is, it works

Re: How to Integrate LDAP in Hadoop ?

2012-05-29 Thread Michael Segel
I believe that their CDH3u3 or later has this... parameter. (Possibly even earlier.) On May 29, 2012, at 6:12 AM, samir das mohapatra wrote: It is the Cloudera version of 0.20. On Tue, May 29, 2012 at 4:14 PM, Michel Segel michael_se...@hotmail.com wrote: Which release? Version? I believe there
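For reference, later Apache releases (2.x onward) expose LDAP group mapping directly in core-site.xml via org.apache.hadoop.security.LdapGroupsMapping; a sketch, which may not apply to a CDH3/0.20 install:

    <!-- core-site.xml; property names from LdapGroupsMapping, values illustrative -->
    <property>
      <name>hadoop.security.group.mapping</name>
      <value>org.apache.hadoop.security.LdapGroupsMapping</value>
    </property>
    <property>
      <name>hadoop.security.group.mapping.ldap.url</name>
      <value>ldap://ldap.example.com:389</value>
    </property>
    <property>
      <name>hadoop.security.group.mapping.ldap.base</name>
      <value>dc=example,dc=com</value>
    </property>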

Re: Multiple fs.FSInputChecker: Found checksum error .. because of load ?

2012-05-29 Thread Akshay Singh
Found the problem. Shifting the VMs from VirtualBox to KVM worked for me; all other VM configurations were kept the same. So the checksum errors were indeed pointing to a hardware problem... though virtual hardware in this case. -Akshay From: Akshay Singh

Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Michael Segel
Hi, That's not a backup strategy. You could still have joe luser take out a key file or directory. What do you do then? On May 29, 2012, at 11:19 AM, Darrell Taylor wrote: Hi, We are about to build a 10-machine cluster with 40TB of storage; obviously, as this gets full, actually trying to

Re: Pragmatic cluster backup strategies?

2012-05-29 Thread Robert Evans
Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe. But you can still have an errant or malicious hadoop fs -rm -rf shut you down. If you still have the original source of your data somewhere else you may be able to

Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)

2012-05-29 Thread Rohit Pandey
Hello Hadoop community, I have been trying to set up a two-node Hadoop cluster (following the instructions in http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/) and am very close to running it, apart from one small glitch - when I start the dfs (using

about hadoop webapps

2012-05-29 Thread 孙亮亮
I have another question. I want to use Hadoop's classes and XML messages to get information about Hadoop's NameNode, DataNode, jobs, etc. into my application and monitor it, so I want to deploy a web application (Struts 2.0) in Hadoop's webapps. I have been reading through Hadoop's source, but I couldn't find a good function

Help with DFSClient Exception.

2012-05-29 Thread Bharadia, Akshay
Hi, We are frequently observing the exception java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2. Giving up. on our cluster. The exception occurs while writing a file.

Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)

2012-05-29 Thread sandeep
Can you check the logs for the NN and DN? Sent from my iPhone On May 27, 2012, at 1:21 PM, Rohit Pandey rohitpandey...@gmail.com wrote: Hello Hadoop community, I have been trying to set up a two-node Hadoop cluster (following the instructions in -

Re: Small glitch with setting up two node cluster...only secondary node starts (datanode and namenode don't show up in jps)

2012-05-29 Thread Harsh J
Rohit, The SNN may start and run indefinitely without doing any work. The NN and DN have probably not started because the NN has an issue (perhaps the NN name directory isn't formatted) and the DN can't find the NN (or has data directory issues as well). So this isn't a glitch but a real issue you'll
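A sketch of the checks being suggested, assuming a typical 1.x tarball layout (note the format step wipes HDFS metadata, so it is only for a fresh cluster):

    bin/hadoop namenode -format              # fresh cluster only: destroys HDFS metadata
    bin/start-dfs.sh
    tail -100 logs/hadoop-*-namenode-*.log   # the real startup error will be here
    tail -100 logs/hadoop-*-datanode-*.log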

Best Practices for Upgrading Hadoop Version?

2012-05-29 Thread Eli Finkelshteyn
Hi, I'd like to upgrade my Hadoop cluster from version 0.20.2-CDH3B4 to 1.0.3. I'm running a pretty small cluster of just 4 nodes, and it's not really being used by too many people at the moment, so I'm OK if things get dirty or it goes offline for a bit. I was looking at the tutorial at

Re: distributed cache symlink

2012-05-29 Thread Koji Noguchi
Should be ./q_map. Koji On 5/29/12 7:38 AM, Alan Miller alan.mil...@synopsys.com wrote: I'm trying to use the DistributedCache but am having an issue resolving the symlinks to my files. My Driver class writes some hashmaps to files in the DC like this: Path tPath = new
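A task-side sketch matching the driver in the original mail, assuming the file was registered with a #q_map fragment and holds a HashMap<String, String> written with ObjectOutputStream (the payload type is an assumption):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.util.HashMap;

    public class CacheReader {
        // Open the symlinked cache file relative to the task working directory.
        @SuppressWarnings("unchecked")
        public static HashMap<String, String> loadQMap()
                throws IOException, ClassNotFoundException {
            ObjectInputStream in = new ObjectInputStream(new FileInputStream("./q_map"));
            try {
                return (HashMap<String, String>) in.readObject();
            } finally {
                in.close();
            }
        }
    }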

Re: How to mapreduce in the scenario

2012-05-29 Thread Robert Evans
Yes, you can do it. In Pig you would write something like: A = load 'a.txt' as (id, name, age, ...); B = load 'b.txt' as (id, address, ...); C = JOIN A BY id, B BY id; STORE C into 'c.txt'; Hive can do it similarly too. Or you could write your own directly in map/reduce or using the data_join jar.

Re: different input/output formats

2012-05-29 Thread samir das mohapatra
Hi Mark public void map(LongWritable offset, Text val, OutputCollector<FloatWritable, Text> output, Reporter reporter) throws IOException { output.collect(new FloatWritable(1), val); // change 1 to 1.0f and then it will work } Let me know the status after the change. On Wed, May

Re: different input/output formats

2012-05-29 Thread Mark question
Thanks for the reply, but I already tried this option, and this is the error: java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at

Re: different input/output formats

2012-05-29 Thread samir das mohapatra
Hi Mark See the output for that same application. I am not getting any error. On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileOutputFormat. Should be easy but I get the

Re: different input/output formats

2012-05-29 Thread Mark question
Hi Samir, can you email me your main class? Or, if you can, check mine; it is as follows: public class SortByNorm1 extends Configured implements Tool { @Override public int run(String[] args) throws Exception { if (args.length != 2) { System.err.printf("Usage: bin/hadoop
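One common cause of the "wrong key class" error quoted earlier is that the job's configured output key class (which typically defaults to LongWritable in the old API) doesn't match what map() actually emits. A minimal driver sketch; the class names are assumed and the mapper body is adapted from samir's snippet above:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.FloatWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class SortByNorm1Driver {

        public static class Norm1Mapper extends MapReduceBase
                implements Mapper<LongWritable, Text, FloatWritable, Text> {
            public void map(LongWritable offset, Text val,
                    OutputCollector<FloatWritable, Text> output,
                    Reporter reporter) throws IOException {
                output.collect(new FloatWritable(1.0f), val);
            }
        }

        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SortByNorm1Driver.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(SequenceFileOutputFormat.class);
            conf.setMapperClass(Norm1Mapper.class);

            // Must match what map() collects; leaving the default (LongWritable)
            // would produce exactly the "wrong key class" IOException above.
            conf.setOutputKeyClass(FloatWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
        }
    }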

Re: How to mapreduce in the scenario

2012-05-29 Thread liuzhg
Hi, Mike, Nitin, Devaraj, Soumya, samir, Robert: Thank you all for your suggestions. Actually, I want to know whether Hadoop has any performance advantage over a conventional database for solving this kind of problem (joining data). Best Regards, Gump On Tue, May 29, 2012 at 6:53 PM, Soumya

about rebalance

2012-05-29 Thread yingnan.ma
Hi, I added 5 new datanodes and I want to do a rebalance. I started the rebalance on the namenode, and it gave me the notice starting balancer, logging to /hadoop/logs/hadoop-hdfs-balancer-hadoop220.out and today I checked the log file and the detail is that Another balancer is
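The "Another balancer is running" message usually points at the balancer's lock file in HDFS; if a previous run died, the stale lock can block new runs. A sketch of the usual check, assuming default paths (verify no balancer is actually running before removing it):

    hadoop fs -ls /system/balancer.id     # lock file held by a running balancer
    hadoop fs -rm /system/balancer.id     # only if the old balancer is truly dead
    bin/start-balancer.sh -threshold 10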

Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
If you have a huge dataset (huge meaning around terabytes, or at the very least a few GBs), then yes, Hadoop has the advantage of distributed processing and is much faster. But on a smaller set of records it is not as good as an RDBMS. On Wed, May 30, 2012 at 6:53 AM, liuzhg liu...@cernet.com wrote: Hi,

RE: about rebalance

2012-05-29 Thread Devaraj k
1) I am not sure whether I should start the rebalance on the namenode or on each new datanode. You can run the balancer on any node. It is not suggested to run it on the namenode; it would be better to run it on a node which has less load. 2) Should I set the bandwidth on each datanode or just only
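For question 2: in 1.x-era releases the bandwidth cap is a per-datanode setting, so it goes into hdfs-site.xml on each datanode, followed by a datanode restart; a sketch (the value here is illustrative):

    <property>
      <name>dfs.balance.bandwidthPerSec</name>
      <!-- bytes per second; default is 1048576 (1 MB/s) -->
      <value>10485760</value>
    </property>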