Re: metric type
There is an @Metric MutableCounterLong bytesWritten attribute in DataNodeMetrics; is that what is used for IO/sec statistics?

2013/8/31 Jitendra Yadav jeetuyadav200...@gmail.com: Hi, For IO/sec statistics I think MutableCounterLongRate and MutableCounterLong are more useful than the others, and for the xceiver thread number I'm not quite sure right now. Thanks Jitendra

On Fri, Aug 30, 2013 at 1:40 PM, lei liu liulei...@gmail.com wrote: Hi Jitendra, If I want to collect statistics on the number of bytes read per second and display the result in Ganglia, should I use MutableCounterLong or MutableGaugeLong? If I want to display the current xceiver thread count in the datanode in Ganglia, should I use MutableCounterLong or MutableGaugeLong? Thanks, LiuLei

2013/8/30 Jitendra Yadav jeetuyadav200...@gmail.com: Hi, The link below contains the answer to your question. http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/metrics2/package-summary.html Regards Jitendra

On Fri, Aug 30, 2013 at 11:35 AM, lei liu liulei...@gmail.com wrote: I use metrics v2; there are COUNTER and GAUGE metric types in metrics v2. What is the difference between the two? Thanks, LiuLei
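For reference, a minimal sketch of how the two types are typically declared, assuming the Hadoop 2.x metrics2 annotations; the class and field names here are illustrative only. A counter only ever increases and sinks such as Ganglia derive a per-second rate from it, while a gauge reports a level that can rise and fall, which is what the xceiver thread count is.

import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

@Metrics(about = "Example datanode-style metrics", context = "dfs")
public class ExampleMetrics {
  // COUNTER: monotonically increasing total; the rate (bytes/sec) is derived downstream
  @Metric("Total bytes read") MutableCounterLong bytesRead;
  // GAUGE: a current level that can go up and down; set it directly
  @Metric("Current xceiver thread count") MutableGaugeLong xceiverCount;

  // note: the annotated fields are populated when the object is registered
  // with the metrics system (e.g. DefaultMetricsSystem.instance().register(...))
  public void incrBytesRead(long delta) { bytesRead.incr(delta); }
  public void setXceiverCount(long count) { xceiverCount.set(count); }
}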
sqoop oracle connection error
Hi, I am trying to import table from oracle hdfs. i am getting the following error ERROR manager.SqlManager: Error executing statement: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection any work around this. the query is: sqoop import --connect jdbc:oracle:thin:@//ramesh.ops.cloudwick.com/cloud--username ramesh --password password --table cloud.test -m 1 the output is as follows; [root@ramesh ram]# sqoop import --connect jdbc:oracle:thin:@// ramesh.ops.cloudwick.com/cloud --username ramesh --password password --table cloud.test -m 1 Warning: /usr/lib/hbase does not exist! HBase imports will fail. Please set $HBASE_HOME to the root of your HBase installation. 13/08/31 12:27:27 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead. 13/08/31 12:27:27 INFO manager.SqlManager: Using default fetchSize of 1000 13/08/31 12:27:27 INFO tool.CodeGenTool: Beginning code generation 13/08/31 12:27:27 ERROR manager.SqlManager: Error executing statement: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:458) at oracle.jdbc.driver.PhysicalConnection.init(PhysicalConnection.java:546) at oracle.jdbc.driver.T4CConnection.init(T4CConnection.java:236) at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32) at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521) at java.sql.DriverManager.getConnection(DriverManager.java:571) at java.sql.DriverManager.getConnection(DriverManager.java:215) at org.apache.sqoop.manager.OracleManager.makeConnection(OracleManager.java:313) at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52) at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:605) at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:628) at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:235) at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:219) at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:347) at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1255) at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1072) at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82) at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390) at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476) at org.apache.sqoop.Sqoop.run(Sqoop.java:145) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229) at org.apache.sqoop.Sqoop.main(Sqoop.java:238) Caused by: oracle.net.ns.NetException: The Network Adapter could not establish the connection at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:392) at oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:434) at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:687) at oracle.net.ns.NSProtocol.connect(NSProtocol.java:247) at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1102) at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:320) ... 
24 more Caused by: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:150) at oracle.net.nt.ConnOption.connect(ConnOption.java:133) at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:370) ... 29 more 13/08/31 12:27:27 ERROR manager.OracleManager: Failed to rollback transaction java.lang.NullPointerException at org.apache.sqoop.manager.OracleManager.getColumnNames(OracleManager.java:744) at org.apache.sqoop.orm.ClassWriter.getColumnNames(ClassWriter.java:1222) at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1074) at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82) at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390) at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476) at org.apache.sqoop.Sqoop.run(Sqoop.java:145) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229) at
Re: WritableComparable.compareTo vs RawComparator.compareTo
Also, if both are defined, the framework will use the RawComparator. I hope you have registered the comparator in a static block as follows:

static {
  WritableComparator.define(PairOfInts.class, new Comparator());
}

Regards Ravi Magham

On Sat, Aug 31, 2013 at 1:23 PM, Ravi Kiran ravikiranmag...@gmail.com wrote: Hi Adeel, The RawComparator is the faster of the two, as you avoid the need to convert the byte stream to Writable objects for comparison. Regards Ravi Magham

On Fri, Aug 30, 2013 at 11:16 PM, Adeel Qureshi adeelmahm...@gmail.com wrote: For secondary sort I am implementing a RawComparator and providing that as sortComparator .. is that the faster way, or using a WritableComparable as mapper output and defining a compareTo method on the key itself? Also, what happens if both are defined, is one ignored?
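To make the registration concrete, here is a hedged sketch of what the Comparator being registered might look like, assuming the PairOfInts key serializes its two ints back-to-back (8 bytes total) and the class is declared inside PairOfInts. The point is that compare() works directly on the serialized bytes, so no PairOfInts objects are instantiated during the sort.

import org.apache.hadoop.io.WritableComparator;

public static class Comparator extends WritableComparator {
  public Comparator() {
    super(PairOfInts.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // compare the first int of each key, then the second, straight from the bytes
    int cmp = compareInts(readInt(b1, s1), readInt(b2, s2));
    return cmp != 0 ? cmp : compareInts(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
  }

  private static int compareInts(int a, int b) {
    return a < b ? -1 : (a == b ? 0 : 1);
  }
}

static {
  // tells Hadoop to use this raw comparator whenever PairOfInts is the key
  WritableComparator.define(PairOfInts.class, new Comparator());
}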
Re: sqoop oracle connection error
Hi, Can you check if you are able to ping or telnet to the IP address and port of the Oracle database from your machine? I have a hunch that the Oracle listener is stopped. If so, start it. The commands to check the status and to start the listener if it isn't running:

$ lsnrctl status
$ lsnrctl start

Regards Ravi Magham

On Sat, Aug 31, 2013 at 2:05 PM, Krishnan Narayanan krishnan.sm...@gmail.com wrote: Hi Ram, I get the same error. If you find an answer pls do fwd it to me. I will do the same. Thx Krish

On Sat, Aug 31, 2013 at 12:00 AM, Ram pramesh...@gmail.com wrote: Hi, I am trying to import table from oracle hdfs. i am getting the following error ERROR manager.SqlManager: Error executing statement: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection ...
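Beyond lsnrctl, a standalone JDBC check can confirm whether the listener is reachable independently of Sqoop. This is a hedged sketch: the listener port 1521 is the Oracle default and is assumed here, as are the host, service name and credentials taken from the original command.

import java.sql.Connection;
import java.sql.DriverManager;

public class OracleConnCheck {
  public static void main(String[] args) throws Exception {
    // same thin-driver URL shape Sqoop is given: jdbc:oracle:thin:@//host:port/service
    // run with the same ojdbc jar Sqoop uses on the classpath
    String url = "jdbc:oracle:thin:@//ramesh.ops.cloudwick.com:1521/cloud";
    try (Connection conn = DriverManager.getConnection(url, "ramesh", "password")) {
      System.out.println("Connected to: " + conn.getMetaData().getDatabaseProductVersion());
    }
  }
}

If this small program fails with the same "Connection refused", the problem is the listener or the host/port, not Sqoop.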
Re: Job config before read fields
Thank you for your help Shahab. I guess I wasn't being too clear. My logic is that I use a custom type as the key, and in order to deserialize it on the compute nodes, I need an extra piece of information (also a custom type).

To use an analogy, a Text is serialized by writing the length of the string as a number and then the bytes that compose the actual string. When it is deserialized, the number informs the reader when to stop reading the string. This number varies from string to string and it is compact, so it makes sense to serialize it with the string.

My use case is similar to it. I have a complex type (let's call this data), and in order to deserialize it, I need another complex type (let's call this second type metadata). The metadata is not closely tied to the data (i.e. if the data value changes, the metadata does not) and the metadata size is quite large.

I ruled out a couple of options, but please let me know if you think I did so for the wrong reasons:
1. I could serialize each data value with its own metadata value, but since the data value count is in the tens of millions or more and the metadata distinct value count can be up to one hundred, it would waste resources in the system.
2. I could serialize the metadata and then the data as a collection property of the metadata. This would be an elegant solution code-wise, but then all the data would have to be read and kept in memory as a massive object before any reduce operations can happen. I wasn't able to find any info on this online, so this is just a guess from peeking at the hadoop code.

My solution was to serialize the data with a hash of the metadata, and separately serialize the metadata and its hash in the job configuration (as key/value pairs). For this to work, I would need to be able to deserialize the metadata on the reduce node before the data is deserialized in the readFields() method. I think that for that to happen I need to hook into the code somewhere where a context or job configuration is used (before readFields()), but I'm stumped as to where that is. Cheers, Adi

On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus shahab.yu...@gmail.com wrote: What I meant was that you might have to split or redesign your logic or your usecase (which we don't know about)? Regards, Shahab

On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER chivas314...@gmail.com wrote: But how would the comparator have access to the job config?

On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus shahab.yu...@gmail.com wrote: I think you have to override/extend the Comparator to achieve that, something like what is done in Secondary Sort? Regards, Shahab

On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER chivas314...@gmail.com wrote: Howdy, I apologise for the lack of code in this message, but the code is fairly convoluted and it would obscure my problem. That being said, I can put together some sample code if really needed. I am trying to pass some metadata between the map reduce steps. This metadata is read and generated in the map step and stored in the job config. It also needs to be recreated on the reduce node before the key/value fields can be read in the readFields function. I had assumed that I would be able to override the Reducer.setup() function and that would be it, but apparently the readFields function is called before the Reducer.setup() function. My question is: what is any (the best) place on the reduce node where I can access the job configuration/context before the readFields function is called?
This is the stack trace: at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:) at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70) at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149) at org.apache.hadoop.mapred.Child.main(Child.java:249)
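One possible hook, offered as a hedged sketch rather than a confirmed answer: when a sort or grouping comparator class is set on the job, MapReduce instantiates it through ReflectionUtils with the job configuration, so a comparator that implements Configurable receives the Configuration via setConf() before any keys are compared. That is early enough to rebuild the metadata lookup that the custom readFields() needs. MyKey and MetadataRegistry below are hypothetical names standing in for the poster's types.

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.WritableComparator;

public class MetadataAwareComparator extends WritableComparator implements Configurable {
  private Configuration conf;

  public MetadataAwareComparator() {
    super(MyKey.class, true);   // true: deserialize keys, so readFields() is exercised
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // hypothetical helper: rebuild the hash -> metadata map from the job config
    // before any compare() call; MyKey.readFields() would read from the same registry
    MetadataRegistry.loadFrom(conf);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}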
Re: InvalidProtocolBufferException while submitting crunch job to cluster
: java.net.UnknownHostException: bdatadev edit your /etc/hosts file Regards, Som Shekhar Sharma +91-8197243810 On Sat, Aug 31, 2013 at 2:05 AM, Narlin M hpn...@gmail.com wrote: Looks like I was pointing to incorrect ports. After correcting the port numbers, conf.set(fs.defaultFS, hdfs://server_address:8020); conf.set(mapred.job.tracker, server_address:8021); I am now getting the following exception: 2880 [Thread-15] INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob - java.lang.IllegalArgumentException: java.net.UnknownHostException: bdatadev at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:389) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:356) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:124) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2218) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2252) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2234) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:103) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:902) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896) at org.apache.hadoop.mapreduce.Job.submit(Job.java:531) at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:305) at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:180) at org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:209) at org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:100) at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:51) at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:75) at java.lang.Thread.run(Thread.java:680) Caused by: java.net.UnknownHostException: bdatadev ... 27 more However nowhere in my code a host named bdatadev is mentioned, and I cannot ping this host. Thanks for the help. 
On Fri, Aug 30, 2013 at 3:04 PM, Narlin M hpn...@gmail.com wrote: I am getting following exception while trying to submit a crunch pipeline job to a remote hadoop cluster: Exception in thread main java.lang.RuntimeException: Cannot create job output directory /tmp/crunch-324987940 at org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:344) at org.apache.crunch.impl.mr.MRPipeline.init(MRPipeline.java:125) at test.CrunchTest.setup(CrunchTest.java:98) at test.CrunchTest.main(CrunchTest.java:367) Caused by: java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: NARLIN/127.0.0.1; destination host is: server_address:50070; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759) at org.apache.hadoop.ipc.Client.call(Client.java:1164) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202) at com.sun.proxy.$Proxy11.mkdirs(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at com.sun.proxy.$Proxy11.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:425) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1943) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:523) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1799) at org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:342) ... 3 more Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected
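The end-group tag error above is the usual symptom of pointing the Hadoop RPC client at a non-RPC endpoint: 50070 is the NameNode web UI port, not its RPC port. As the follow-up message in this thread shows, switching to the RPC ports made that particular error go away; a hedged sketch of the corrected client settings, with server_address the same placeholder used above:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// NameNode RPC endpoint (default 8020), not the 50070 HTTP port
conf.set("fs.defaultFS", "hdfs://server_address:8020");
// JobTracker RPC endpoint (MRv1)
conf.set("mapred.job.tracker", "server_address:8021");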
Re: secondary sort - number of reducers
Adeel, To add to Yong's points:
a) Consider tuning the number of copier threads in the reduce tasks and the task tracker process (mapred.reduce.parallel.copies).
b) See if the map output can be compressed to ensure there is less IO.
c) Increase io.sort.factor to ensure the framework merges a larger number of files in each merge sort at the reducer.
d) Check the Reduce Shuffle Bytes counter of each reducer to see any skew of data at a few reducers. Try for an even distribution of load through better partitioner code.
Regards Ravi Magham

On Fri, Aug 30, 2013 at 9:28 PM, java8964 java8964 java8...@hotmail.com wrote: Well, the reducers normally will take much longer than the mapper stage, because the copy/shuffle/sort all happen at this time, and they are the hard part. But before we simply say it is part of life, you need to dig into your MR jobs more to find out if you can make them faster.

You are the person most familiar with your data, and you wrote the code to group/partition it and send it to the reducers. Even if you set up 255 reducers, the question is, does each of them get its fair share? You need to read the COUNTER information of each reducer and find out how many reduce groups each reducer gets, how many input bytes it gets, etc. Simple example: if you send 200G of data and group it by DATE, and all the data belongs to 2 days, with one of them containing 90% of the data, then giving it 255 reducers won't help, as only 2 reducers will consume data, and one of them will consume 90% of the data and will finish in a very long time, which WILL delay the whole MR job, while the rest of the reducers finish within seconds. In this case, maybe you need to rethink what your key should be, and make sure each reducer gets its fair share of the volume of data.

After the above fix (in fact, normally it will fix 90% of reducer performance problems, especially as you have 255 reducer tasks available, so each one on average will only get about 1G of data, good for a huge cluster that only needs to process 256G of data :-), if you want to make it even faster, then check your code. Do you have to use String.compareTo()? Is it slow? Google hadoop rawcomparator to see if you can do something here. After that, if you still think the reducer stage is slow, check your cluster system. Does the reducer spend most of its time in the copy stage, the sort, or in your reducer class? Find out where the time is spent, then identify the solution. Yong

Date: Fri, 30 Aug 2013 11:02:05 -0400 Subject: Re: secondary sort - number of reducers From: adeelmahm...@gmail.com To: user@hadoop.apache.org

My secondary sort on multiple keys seems to work fine with smaller data sets, but with bigger data sets (like 256 gig and 800M+ records) the mapper phase gets done pretty quick (about 15 mins) but then the reducer phase seems to take forever. I am using 255 reducers. The basic idea is that my composite key has both group and sort keys in it, which I parse in the appropriate comparator classes to perform grouping and sorting. My thinking is that the mappers are where most of the work is done:
1. the mapper itself (create composite key and value)
2. records sorting
3. partitioner
If all this gets done in 15 mins, then the reducer has the simple task of
1. the grouping comparator
2. the reducer itself (simply output records)
and should take less time than the mappers .. instead it essentially gets stuck in the reduce phase ..
I'm gonna paste my code here to see if anything stands out as a fundamental design issue.

// PARTITIONER
public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
  // extract the group key from the composite key
  String groupKey = key.toString().split("\\|")[0];
  return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

// GROUP COMPARATOR
public int compare(WritableComparable a, WritableComparable b) {
  // extract and compare the group portions of the two text keys
  String thisGroupKey = ((Text) a).toString().split("\\|")[0];
  String otherGroupKey = ((Text) b).toString().split("\\|")[0];
  return thisGroupKey.compareTo(otherGroupKey);
}

The SORT COMPARATOR is similar to the group comparator, runs in the map phase and gets done quick.

// REDUCER
public void reduce(Text key, Iterable<HCatRecord> records, Context context)
    throws IOException, InterruptedException {
  log.info("in reducer for key " + key.toString());
  Iterator<HCatRecord> recordsIter = records.iterator();
  // we are only interested in the first record after sorting and grouping
  if (recordsIter.hasNext()) {
    HCatRecord rec = recordsIter.next();
    context.write(nw, rec);
    log.info("returned record " + rec.toString());
  }
}

On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi adeelmahm...@gmail.com wrote: Yup it was negative, and by doing this it now seems to be working fine.

On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma shekhar2...@gmail.com wrote: Is the hash code of that key negative? Do something
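For the tuning knobs Ravi lists at the top of this thread, here is a hedged sketch of where they would be set for an MRv1 job; the property names are the old-style ones the thread uses, the values are illustrative only, and Snappy is assumed to be available on the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

Configuration conf = new Configuration();
conf.setInt("mapred.reduce.parallel.copies", 20);      // (a) more parallel shuffle fetchers per reducer
conf.setBoolean("mapred.compress.map.output", true);   // (b) compress map output to cut shuffle IO
conf.setClass("mapred.map.output.compression.codec", SnappyCodec.class, CompressionCodec.class);
conf.setInt("io.sort.factor", 50);                     // (c) merge more spill segments per merge pass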
How to change default ports of datanodes in a cluster
Hello Hadoopers, The default port for a datanode is 50075. I am able to change the namenode default port by changing dfs.namenode.http-address.ns1 and dfs.namenode.http-address.ns2 in the hdfs-site.xml of my 2 namenodes. How do I change the default port address of my multiple datanodes?
Re: Multidata center support
lets say that you have some machines in europe and some in US I think you just need the ips and configure them in your cluster set up it will work... On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote: Hi, Although you can set datacenter layer on your network topology, it is never enabled in hadoop as lacking of replica placement and task scheduling support. There are some work to add layers other than rack and node under HADOOP-8848 but may not suit for your case. Agree with Adam that a cluster spanning multiple data centers seems not make sense even for DR case. Do you have other cases to do such a deployment? Thanks, Junping -- *From: *Adam Muise amu...@hortonworks.com *To: *user@hadoop.apache.org *Sent: *Friday, August 30, 2013 6:26:54 PM *Subject: *Re: Multidata center support Nothing has changed. DR best practice is still one (or more) clusters per site and replication is handled via distributed copy or some variation of it. A cluster spanning multiple data centers is a poor idea right now. On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: My take on this. Why hadoop has to know about data center thing. I think it can be installed across multiple data centers , however topology configuration would be required to tell which node belongs to which data center and switch for block placement. Thanks, Rahul On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu baskar.duraika...@outlook.com wrote: We have a need to setup hadoop across data centers. Does hadoop support multi data center configuration? I searched through archives and have found that hadoop did not support multi data center configuration some time back. Just wanted to see whether situation has changed. Please help. -- * * * * *Adam Muise* Solution Engineer *Hortonworks* amu...@hortonworks.com 416-417-4037 Hortonworks - Develops, Distributes and Supports Enterprise Apache Hadoop.http://hortonworks.com/ Hortonworks Virtual Sandbox http://hortonworks.com/sandbox Hadoop: Disruptive Possibilities by Jeff Needhamhttp://hortonworks.com/resources/?did=72cat=1 CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: InvalidProtocolBufferException while submitting crunch job to cluster
I would, but bdatadev is not one of my servers, it seems like a random host name. I can't figure out how or where this name got generated. That's what's puzzling me.

On 8/31/13 5:43 AM, Shekhar Sharma shekhar2...@gmail.com wrote: : java.net.UnknownHostException: bdatadev edit your /etc/hosts file Regards, Som Shekhar Sharma +91-8197243810
Re: InvalidProtocolBufferException while submitting crunch job to cluster
The server_address that was mentioned in my original post is not pointing to bdatadev. I should have mentioned this in my original post, sorry I missed that.

On 8/31/13 8:32 AM, Narlin M hpn...@gmail.com wrote: I would, but bdatadev is not one of my servers, it seems like a random host name. I can't figure out how or where this name got generated. That's what's puzzling me.
Re: InvalidProtocolBufferException while submitting crunch job to cluster
Can you please check whether you are able to access HDFS using the Java API, and also whether you are able to run an MR job. Regards, Som Shekhar Sharma +91-8197243810

On Sat, Aug 31, 2013 at 7:08 PM, Narlin M hpn...@gmail.com wrote: The server_address that was mentioned in my original post is not pointing to bdatadev. I should have mentioned this in my original post, sorry I missed that.
Re: Multidata center support
The only problem, I guess, is that Hadoop won't be able to duplicate data from one data center to another, but I guess I can identify datanodes or namenodes from another data center. Correct me if I am wrong.

On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak visioner.sa...@gmail.com wrote: lets say that you have some machines in europe and some in US I think you just need the ips and configure them in your cluster set up it will work...
Re: InvalidProtocolBufferException while submitting crunch job to cluster
Your cluster is using HDFS HA, and therefore requires a little more configs than just fs.defaultFS/etc.. You need to use the right set of cluster client configs. If you don't have them at /etc/hadoop/conf and /etc/hbase/conf on your cluster edge node to pull from, try asking your cluster administrator for a configuration set, and place their parent directories on your application's classpath. The first error deals with perhaps you also including a guava dependency in your project, which is different than the one transitively pulled in by hadoop-client via crunch. You should be able to use guava libs without needing an explicit dependency, and it would be the right needed version. The second error deals with your MR submission failing, cause the JT is using a staging directory over a HDFS HA, which uses a logical name of bdatadev. A logical HA name needs other configs (typically in the hdfs-site.xml) that tell it which are the actual physical NNs under it - configs that you're missing here. On Sat, Aug 31, 2013 at 1:34 AM, Narlin M hpn...@gmail.com wrote: I am getting following exception while trying to submit a crunch pipeline job to a remote hadoop cluster: Exception in thread main java.lang.RuntimeException: Cannot create job output directory /tmp/crunch-324987940 at org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:344) at org.apache.crunch.impl.mr.MRPipeline.init(MRPipeline.java:125) at test.CrunchTest.setup(CrunchTest.java:98) at test.CrunchTest.main(CrunchTest.java:367) Caused by: java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: NARLIN/127.0.0.1; destination host is: server_address:50070; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759) at org.apache.hadoop.ipc.Client.call(Client.java:1164) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202) at com.sun.proxy.$Proxy11.mkdirs(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83) at com.sun.proxy.$Proxy11.mkdirs(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:425) at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1943) at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:523) at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1799) at org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:342) ... 3 more Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag. 
at com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:73) at com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:124) at com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:213) at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:746) at com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:238) at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:282) at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760) at com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288) at com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:752) at org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcPayloadHeaderProtos.java:985) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:882) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813) 0[Thread-3] WARN org.apache.hadoop.util.ShutdownHookManager - ShutdownHook 'ClientFinalizer' failed, java.lang.NoSuchMethodError: com.google.common.collect.LinkedListMultimap.values()Ljava/util/List; java.lang.NoSuchMethodError: com.google.common.collect.LinkedListMultimap.values()Ljava/util/List; at org.apache.hadoop.hdfs.SocketCache.clear(SocketCache.java:135) at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:672) at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:539) at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2308) at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2324) at
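Following Harsh's point about HA client configs, a hedged sketch of the usual approach: rather than hand-setting fs.defaultFS, load the cluster's own client configuration files so the logical nameservice (bdatadev here) resolves to the real NameNodes. The /etc/hadoop/conf paths below are the conventional location and may differ on your edge node.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// pulled from the cluster edge node; these carry dfs.nameservices, the
// dfs.namenode.rpc-address.* entries behind it, and the MR submission settings
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));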
Re: How to change default ports of datanodes in a cluster
Looking at the hdfs-default.xml should help with such questions: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml The property you need is dfs.datanode.http.address On Sat, Aug 31, 2013 at 6:47 PM, Visioner Sadak visioner.sa...@gmail.com wrote: Hello Hadoopers, Default port for datanode is 50075 i am able to change namenode default port by changing dfs.namenode.http-address.ns1 dfs.namenode.http-address.ns2 in my hdfs-site.xml of my 2 namenodes how to change default port address of my multiple datanodes -- Harsh J
Re: How to change default ports of datanodes in a cluster
Thanks Harsh. For a cluster, should I enter multiple IP addresses under the dfs.datanode.http.address tag, as I have 4 datanodes?

On Sat, Aug 31, 2013 at 9:44 PM, Harsh J ha...@cloudera.com wrote: Looking at the hdfs-default.xml should help with such questions: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml The property you need is dfs.datanode.http.address -- Harsh J
Re: WritableComparable.compareTo vs RawComparator.compareTo
Thanks for the information. So the reason the raw comparator is faster is that we can use the bytes to do the comparison .. so if I use the compare signature in my raw comparator that receives two WritableComparable objects, public int compare(WritableComparable a, WritableComparable b), instead of the bytes one .. does it then end up slower and more comparable to the compareTo method defined on the WritableComparable object itself?

Secondly, if I do use the bytes signature, I have seen implementations where you can use util methods like readInt and readString to read ints and strings from those bytes, but what if I have a complex object inside my WritableComparable, such as a Text or a List .. how can I read those from the bytes? Thanks

On Aug 31, 2013 3:58 AM, Ravi Kiran ravikiranmag...@gmail.com wrote: Also, if both are defined, the framework will use the RawComparator. I hope you have registered the comparator in a static block ...
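For the Text case, a hedged sketch of the usual pattern, written as the compare() body of a WritableComparator subclass like the one registered earlier: Text serializes a vint length followed by the UTF-8 bytes, so the comparator can size the field from that prefix and compare the raw bytes, mirroring what Text.Comparator does internally. This assumes the Text field sits at the start of each serialized key; a List or other variable-length field would need to be walked the same way, or you fall back to deserializing.

import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

@Override
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
  try {
    // length of the vint prefix plus the string bytes it announces
    int n1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
    int n2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
    // byte-level comparison of the leading Text field of each key
    return compareBytes(b1, s1, n1, b2, s2, n2);
  } catch (IOException e) {
    throw new IllegalArgumentException(e);
  }
}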
Re: How to change default ports of datanodes in a cluster
You can maintain per-DN configs if you wish to restrict the HTTP server to only the public IP, but otherwise use a wildcard 0.0.0.0:PORT, if you were only just looking to change the port.

On Sat, Aug 31, 2013 at 9:49 PM, Visioner Sadak visioner.sa...@gmail.com wrote: Thanks Harsh. For a cluster, should I enter multiple IP addresses under the dfs.datanode.http.address tag, as I have 4 datanodes?

-- Harsh J
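A hedged example of the corresponding hdfs-site.xml entry on each datanode; 50080 is just an arbitrary illustrative port, and 0.0.0.0 is the wildcard address described above:

<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50080</value>
</property>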
Re: How to change default ports of datanodes in a cluster
cool thanks a ton harsh!!!

On Sat, Aug 31, 2013 at 9:53 PM, Harsh J ha...@cloudera.com wrote: You can maintain per-DN configs if you wish to restrict the HTTP server to only the public IP, but otherwise use a wildcard 0.0.0.0:PORT, if you were only just looking to change the port. -- Harsh J
Re: Multidata center support
What do you think, friends? I think hadoop clusters can run on multiple data centers using FEDERATION On Sat, Aug 31, 2013 at 8:39 PM, Visioner Sadak visioner.sa...@gmail.com wrote: The only problem, I guess, is that hadoop won't be able to duplicate data from one data center to another, but I guess I can identify data nodes or namenodes from another data center; correct me if I am wrong On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak visioner.sa...@gmail.com wrote: let's say that you have some machines in Europe and some in the US; I think you just need the IPs and configure them in your cluster setup, and it will work... On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote: Hi, Although you can add a datacenter layer to your network topology, it was never enabled in hadoop because replica placement and task scheduling lack support for it. There is some work to add layers other than rack and node under HADOOP-8848, but it may not suit your case. Agree with Adam that a cluster spanning multiple data centers does not seem to make sense, even for the DR case. Do you have other cases that call for such a deployment? Thanks, Junping -- From: Adam Muise amu...@hortonworks.com To: user@hadoop.apache.org Sent: Friday, August 30, 2013 6:26:54 PM Subject: Re: Multidata center support Nothing has changed. DR best practice is still one (or more) clusters per site, and replication is handled via distributed copy or some variation of it. A cluster spanning multiple data centers is a poor idea right now. On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: My take on this: why does hadoop have to know about data centers at all? I think it can be installed across multiple data centers; however, topology configuration would be required to tell which node belongs to which data center and switch, for block placement. Thanks, Rahul On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu baskar.duraika...@outlook.com wrote: We have a need to set up hadoop across data centers. Does hadoop support a multi data center configuration? I searched through the archives and found that hadoop did not support multi data center configuration some time back. Just wanted to see whether the situation has changed. Please help. -- Adam Muise Solution Engineer Hortonworks amu...@hortonworks.com 416-417-4037 Hortonworks - Develops, Distributes and Supports Enterprise Apache Hadoop. http://hortonworks.com/ Hortonworks Virtual Sandbox http://hortonworks.com/sandbox Hadoop: Disruptive Possibilities by Jeff Needham http://hortonworks.com/resources/?did=72cat=1
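As a rough illustration of the one-cluster-per-data-center approach described above, cross-datacenter replication is usually a scheduled DistCp run between the two clusters; the namenode hostnames and paths below are placeholders:

    hadoop distcp hdfs://nn-dc1.example.com:8020/data/events hdfs://nn-dc2.example.com:8020/data/events

DistCp runs as a MapReduce job on one of the clusters, so it copies in parallel, but it is still an asynchronous, batch-style form of replication rather than synchronous cross-site block placement.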
Re: bad interpreter: Text file busy and other errors in Hadoop 2.1.0-beta
Hi John, This exception should indicate an error from the container process. If the container process exits with a non-zero exit code, it will be logged. In case of such errors, you'd better look at the per-container log to see what's happening there. Jian On Fri, Aug 30, 2013 at 10:03 AM, Jian Fang jian.fang.subscr...@gmail.com wrote: Hi, I upgraded to Hadoop 2.1.0-beta and suddenly I started to see error messages as follows. Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: bash: /var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1377823589199_0002/container_1377823589199_0002_01_000214/default_container_executor.sh: /bin/bash: bad interpreter: Text file busy at org.apache.hadoop.util.Shell.runCommand(Shell.java:458) at org.apache.hadoop.util.Shell.run(Shell.java:373) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:258) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:74) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) cleanup failed for container container_1377823589199_0002_01_000214 : org.apache.hadoop.yarn.exceptions.YarnException: Container container_1377823589199_0002_01_000214 is not handled by this NodeManager at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:210) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Anything wrong here? Thanks, John
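To pull the per-container logs that Jian mentions, one option (assuming log aggregation is enabled via yarn.log-aggregation-enable) is the yarn logs command, using the application id embedded in the failing container's name:

    yarn logs -applicationId application_1377823589199_0002

If aggregation is not enabled, the same logs sit on the node that ran the container, under the local directories configured by yarn.nodemanager.log-dirs.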
Re: metric type
Yes, MutableCounterLong helps to gather DataNode read/write statics. There is more option available within this metric Regards Jitendra On 8/31/13, lei liu liulei...@gmail.com wrote: There is @Metric MutableCounterLong bytesWritten attribute in DataNodeMetrics, which is used to IO/sec statistics? 2013/8/31 Jitendra Yadav jeetuyadav200...@gmail.com Hi, For IO/sec statistics I think MutableCounterLongRate and MutableCounterLong more useful than others and for xceiver thread number I'm not bit sure right now. Thanks Jiitendra On Fri, Aug 30, 2013 at 1:40 PM, lei liu liulei...@gmail.com wrote: Hi Jitendra, If I want to statistics number of bytes read per second,and display the result into ganglia, should I use MutableCounterLong or MutableGaugeLong? If I want to display current xceiver thread number in datanode into ganglia, should I use MutableCounterLong or MutableGaugeLong? Thanks, LiuLei 2013/8/30 Jitendra Yadav jeetuyadav200...@gmail.com Hi, Below link contains the answer for your question. http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/metrics2/package-summary.html Regards Jitendra On Fri, Aug 30, 2013 at 11:35 AM, lei liu liulei...@gmail.com wrote: I use the metrics v2, there are COUNTER and GAUGE metric type in metrics v2. What is the difference between the two? Thanks, LiuLei
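For illustration, here is a rough sketch of how a counter and a gauge can be declared with the metrics2 annotations; the class, metric names and descriptions are made up for the example and are not the actual DataNodeMetrics fields:

    import org.apache.hadoop.metrics2.annotation.Metric;
    import org.apache.hadoop.metrics2.annotation.Metrics;
    import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
    import org.apache.hadoop.metrics2.lib.MutableCounterLong;
    import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

    // Hypothetical metrics source: the annotated fields are instantiated by the
    // metrics system when the source is registered.
    @Metrics(name = "ExampleDataNodeStats", about = "Example DN stats", context = "dfs")
    public class ExampleDataNodeStats {

      @Metric("Total bytes read")          // COUNTER: only ever increments
      MutableCounterLong bytesRead;

      @Metric("Current xceiver threads")   // GAUGE: point-in-time value, can go up or down
      MutableGaugeInt xceiverCount;

      public static ExampleDataNodeStats create() {
        // Register with the metrics system so sinks (e.g. Ganglia) can poll it.
        return DefaultMetricsSystem.instance()
            .register("ExampleDataNodeStats", "Example DN stats", new ExampleDataNodeStats());
      }

      public void onRead(long numBytes) {
        bytesRead.incr(numBytes);   // accumulate; a per-second rate is derived downstream
      }

      public void setXceiverCount(int n) {
        xceiverCount.set(n);        // replace the current value
      }
    }

The practical difference is that a counter never decreases, so the sink computes a per-second rate from successive samples (which is what you want for bytes read or written per second), whereas a gauge is simply set to the current value and can move in both directions, which suits something like the current xceiver thread count.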
custom writablecomparable with complex fields
I want to write a custom WritableComparable object with two List objects within it .. public class CompositeKey implements WritableComparable { private List<JsonKey> groupBy; private List<JsonKey> sortBy; ... } what I am not sure about is how to write the readFields and write methods for this object. Any help would be appreciated. Thanks Adeel
Re: Subscribe
Please send email to: user-subscr...@hadoop.apache.org On Sat, Aug 31, 2013 at 12:36 PM, Surendra , Manchikanti surendra.manchika...@gmail.com wrote: -- Surendra Manchikanti
Re: custom writablecomparable with complex fields
The idea behind write(…) and readFields(…) is simply that of ordering. You need to write your custom objects (i.e. a representation of them) out in order, and read them back in the same order. An example way of serializing a list would be to first serialize its length (so you know how many items you'll need to read back), and then serialize each item appropriately, using delimiters or length-prefixes just as for the list itself. Mainly, you're required to tackle the serialization/deserialization on your own. This is one of the reasons I highly recommend using a library like Apache Avro instead. It's more powerful, faster, and yet simple to use: http://avro.apache.org/docs/current/gettingstartedjava.html and http://avro.apache.org/docs/current/mr.html. It is also popular and carries first-class support in several other hadoop-ecosystem projects, such as Flume and Crunch. On Sun, Sep 1, 2013 at 1:23 AM, Adeel Qureshi adeelmahm...@gmail.com wrote: I want to write a custom WritableComparable object with two List objects within it .. public class CompositeKey implements WritableComparable { private List<JsonKey> groupBy; private List<JsonKey> sortBy; ... } what I am not sure about is how to write the readFields and write methods for this object. Any help would be appreciated. Thanks Adeel -- Harsh J
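To make the length-prefix idea concrete for the CompositeKey above, here is a minimal sketch; it assumes JsonKey itself implements Writable (with a no-arg constructor) so each element can serialize itself, and the compareTo body is left as a placeholder since the actual ordering depends on what the keys mean:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.WritableComparable;

    public class CompositeKey implements WritableComparable<CompositeKey> {

      private List<JsonKey> groupBy = new ArrayList<JsonKey>();
      private List<JsonKey> sortBy = new ArrayList<JsonKey>();

      @Override
      public void write(DataOutput out) throws IOException {
        writeList(out, groupBy);   // order matters: groupBy first...
        writeList(out, sortBy);    // ...then sortBy, always in the same order
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        groupBy = readList(in);    // read back in exactly the order written
        sortBy = readList(in);
      }

      private static void writeList(DataOutput out, List<JsonKey> list) throws IOException {
        out.writeInt(list.size());          // length prefix
        for (JsonKey k : list) {
          k.write(out);                     // each element serializes itself
        }
      }

      private static List<JsonKey> readList(DataInput in) throws IOException {
        int n = in.readInt();
        List<JsonKey> list = new ArrayList<JsonKey>(n);
        for (int i = 0; i < n; i++) {
          JsonKey k = new JsonKey();        // assumes a no-arg constructor
          k.readFields(in);
          list.add(k);
        }
        return list;
      }

      @Override
      public int compareTo(CompositeKey other) {
        return 0; // placeholder: compare groupBy (and then sortBy) element by element
      }
    }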
Re: Job config before read fields
Personally, I don't know a way to access job configuration parameters in a custom implementation of Writables (at least not an elegant and appropriate one; of course, hacks of various kinds can be done). Maybe experts can chime in? One idea that I thought about was to use MapWritable (if you have not explored it already). You can encode the 'custom metadata' for your 'data' as one-byte symbols and move your data in the M/R flow as a map. Then, during deserialization, you will have the type (or your 'custom metadata') in the key part of the map and the value would be your actual data. This aligns with the efficient approach that is used natively in Hadoop for Strings/Text, i.e. compact metadata (though I agree that you are not taking advantage of the other aspect of non-dependence between metadata and the data it defines.) Take a look at that: Page 96 of the Definitive Guide: http://books.google.com/books?id=Nff49D7vnJcCpg=PA96lpg=PA96dq=mapwritable+in+hadoopsource=blots=IiixYu7vXusig=4V6H7cY-MrNT7Rzs3WlODsDOoP4hl=ensa=Xei=aX4iUp2YGoaosASs_YCACQsqi=2ved=0CFUQ6AEwBA#v=onepageq=mapwritable%20in%20hadoopf=false and then this: http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html and add your own custom types here (note that you are restricted by the size of a byte): http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html Regards, Shahab On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER chivas314...@gmail.com wrote: Thank you for your help Shahab. I guess I wasn't being too clear. My logic is that I use a custom type as the key, and in order to deserialize it on the compute nodes, I need an extra piece of information (also a custom type). To use an analogy, a Text is serialized by writing the length of the string as a number and then the bytes that compose the actual string. When it is deserialized, the number informs the reader when to stop reading the string. This number varies from string to string and it is compact, so it makes sense to serialize it with the string. My use case is similar to it. I have a complex type (let's call this data), and in order to deserialize it, I need another complex type (let's call this second type metadata). The metadata is not closely tied to the data (i.e. if the data value changes, the metadata does not) and the metadata size is quite large. I ruled out a couple of options, but please let me know if you think I did so for the wrong reasons: 1. I could serialize each data value with its own metadata value, but since the data value count is in the tens of millions and the metadata value distinct count can be up to one hundred, it would waste resources in the system. 2. I could serialize the metadata and then the data as a collection property of the metadata. This would be an elegant solution code-wise, but then all the data would have to be read and kept in memory as a massive object before any reduce operations can happen. I wasn't able to find any info on this online, so this is just a guess from peeking at the hadoop code. My solution was to serialize the data with a hash of the metadata and separately serialize the metadata and its hash in the job configuration (as key/value pairs). For this to work, I would need to be able to deserialize the metadata on the reduce node before the data is deserialized in the readFields() method.
I think that for that to happen I need to hook into the code somewhere where a context or job configuration is used (before readFields()), but I'm stumped as to where that is. Cheers, Adi On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus shahab.yu...@gmail.com wrote: What I meant was that you might have to split or redesign your logic or your use case (which we don't know about)? Regards, Shahab On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER chivas314...@gmail.com wrote: But how would the comparator have access to the job config? On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus shahab.yu...@gmail.com wrote: I think you have to override/extend the Comparator to achieve that, something like what is done in Secondary Sort? Regards, Shahab On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER chivas314...@gmail.com wrote: Howdy, I apologise for the lack of code in this message, but the code is fairly convoluted and it would obscure my problem. That being said, I can put together some sample code if really needed. I am trying to pass some metadata between the map and reduce steps. This metadata is read and generated in the map step and stored in the job config. It also needs to be recreated on the reduce node before the key/value fields can be read in the readFields function. I had assumed that I would be able to override the Reducer.setup() function and that would be it, but apparently the readFields function is called before the Reducer.setup() function. My question is what is any (the best) place on the
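For what it's worth, here is a small sketch of the MapWritable idea Shahab suggests above; the one-byte field codes and the metadata-hash scheme are illustrative only, not something prescribed in the thread:

    import org.apache.hadoop.io.ByteWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;

    // Carry a record plus a compact pointer to its metadata as a MapWritable,
    // using one-byte keys to keep the per-record overhead small.
    public class MetadataMapExample {

      // Hypothetical one-byte field codes.
      private static final ByteWritable METADATA_HASH = new ByteWritable((byte) 0);
      private static final ByteWritable PAYLOAD       = new ByteWritable((byte) 1);

      public static MapWritable encode(long metadataHash, String payload) {
        MapWritable record = new MapWritable();
        record.put(METADATA_HASH, new LongWritable(metadataHash)); // which metadata applies
        record.put(PAYLOAD, new Text(payload));                    // the actual data
        return record;
      }

      public static String decode(MapWritable record) {
        long hash = ((LongWritable) record.get(METADATA_HASH)).get();
        // Look the full metadata up elsewhere (e.g. values stashed in the job
        // configuration) by this hash, then interpret the payload accordingly.
        String payload = record.get(PAYLOAD).toString();
        return hash + ":" + payload;
      }
    }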