Re: metric type

2013-08-31 Thread lei liu
There is an @Metric MutableCounterLong bytesWritten attribute in
DataNodeMetrics. Is that the attribute to use for IO/sec statistics?
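
(For reference, a minimal metrics2 sketch of the two kinds of metric being discussed; the source class and metric names below are hypothetical examples, not the actual DataNodeMetrics fields, and it assumes the Hadoop 2.x org.apache.hadoop.metrics2.annotation package:)

import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

@Metrics(about = "Example datanode stats", context = "dfs")
public class ExampleDataNodeStats {
  // COUNTER: a monotonically increasing total; Ganglia derives bytes/sec
  // from the slope between two samples.
  @Metric("Total bytes read") MutableCounterLong bytesRead;
  // GAUGE: a point-in-time value that can go up and down, e.g. the number
  // of xceiver threads currently active.
  @Metric("Active xceiver threads") MutableGaugeLong xceiverCount;

  public static ExampleDataNodeStats create() {
    // the annotated fields are instantiated when the source is registered
    return DefaultMetricsSystem.instance()
        .register("ExampleDataNodeStats", null, new ExampleDataNodeStats());
  }

  void onBlockRead(long bytes) { bytesRead.incr(bytes); }
  void onXceiverStart()        { xceiverCount.incr(); }
  void onXceiverStop()         { xceiverCount.decr(); }
}

So a counter is the natural fit for bytes read per second (the graphing layer computes the rate from the counter), while the current xceiver thread count is a gauge.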


2013/8/31 Jitendra Yadav jeetuyadav200...@gmail.com

 Hi,

 For IO/sec statistics I think MutableCounterLongRate and
 MutableCounterLong are more useful than the others; for the xceiver thread
 number I'm not quite sure right now.

 Thanks
 Jitendra
 On Fri, Aug 30, 2013 at 1:40 PM, lei liu liulei...@gmail.com wrote:
 
  Hi  Jitendra,
  If I want to compute the number of bytes read per second and display the
 result in Ganglia, should I use MutableCounterLong or MutableGaugeLong?
 
  If I want to display the current xceiver thread count of a datanode in
 Ganglia, should I use MutableCounterLong or MutableGaugeLong?
 
  Thanks,
  LiuLei
 
 
  2013/8/30 Jitendra Yadav jeetuyadav200...@gmail.com
 
  Hi,
 
  The link below contains the answer to your question.
 
 
 http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/metrics2/package-summary.html
 
  Regards
  Jitendra
 
  On Fri, Aug 30, 2013 at 11:35 AM, lei liu liulei...@gmail.com wrote:
 
  I am using metrics v2; there are COUNTER and GAUGE metric types in
 metrics v2.
  What is the difference between the two?
 
  Thanks,
  LiuLei
 
 
 



sqoop oracle connection error

2013-08-31 Thread Ram
Hi,
   I am trying to import a table from Oracle to HDFS. I am getting the following
error:

ERROR manager.SqlManager: Error executing statement:
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
establish the connection
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
establish the connection

Is there any workaround for this?

The command is:

sqoop import --connect jdbc:oracle:thin:@//ramesh.ops.cloudwick.com/cloud --username ramesh --password password --table cloud.test -m 1

The output is as follows:

[root@ramesh ram]# sqoop import --connect jdbc:oracle:thin:@//
ramesh.ops.cloudwick.com/cloud --username ramesh --password password
--table cloud.test -m 1
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
13/08/31 12:27:27 WARN tool.BaseSqoopTool: Setting your password on the
command-line is insecure. Consider using -P instead.
13/08/31 12:27:27 INFO manager.SqlManager: Using default fetchSize of 1000
13/08/31 12:27:27 INFO tool.CodeGenTool: Beginning code generation
13/08/31 12:27:27 ERROR manager.SqlManager: Error executing statement:
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
establish the connection
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
establish the connection
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:458)
at oracle.jdbc.driver.PhysicalConnection.init(PhysicalConnection.java:546)
at oracle.jdbc.driver.T4CConnection.init(T4CConnection.java:236)
at
oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
at
org.apache.sqoop.manager.OracleManager.makeConnection(OracleManager.java:313)
at
org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:605)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:628)
at
org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:235)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:219)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:347)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1255)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1072)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
Caused by: oracle.net.ns.NetException: The Network Adapter could not
establish the connection
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:392)
at
oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:434)
at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:687)
at oracle.net.ns.NSProtocol.connect(NSProtocol.java:247)
at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1102)
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:320)
... 24 more
Caused by: java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:150)
at oracle.net.nt.ConnOption.connect(ConnOption.java:133)
at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:370)
... 29 more
13/08/31 12:27:27 ERROR manager.OracleManager: Failed to rollback
transaction
java.lang.NullPointerException
at
org.apache.sqoop.manager.OracleManager.getColumnNames(OracleManager.java:744)
at org.apache.sqoop.orm.ClassWriter.getColumnNames(ClassWriter.java:1222)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1074)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at 

Re: WritableComparable.compareTo vs RawComparator.compareTo

2013-08-31 Thread Ravi Kiran
Also, if both are defined, the framework will use the RawComparator. I hope
you have registered the comparator in a static block as follows:

static
{
WritableComparator.define(PairOfInts.class, new Comparator());
}
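
(For completeness, a minimal byte-level sketch of the comparator being registered above, assuming the hypothetical PairOfInts writes exactly two 4-byte ints with out.writeInt():)

public static class Comparator extends WritableComparator {
  public Comparator() {
    super(PairOfInts.class);
  }
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // compare the first int straight from the serialized bytes
    int left1 = readInt(b1, s1);
    int left2 = readInt(b2, s2);
    if (left1 != left2) {
      return left1 < left2 ? -1 : 1;
    }
    // tie-break on the second int; no key objects are ever deserialized
    int right1 = readInt(b1, s1 + 4);
    int right2 = readInt(b2, s2 + 4);
    return right1 < right2 ? -1 : (right1 == right2 ? 0 : 1);
  }
}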

Regards
Ravi Magham


On Sat, Aug 31, 2013 at 1:23 PM, Ravi Kiran ravikiranmag...@gmail.comwrote:

 Hi Adeel,

 The RawComparator is the faster of the two, as you avoid the need
 to deserialize the byte stream into Writable objects for comparison.

 Regards
 Ravi Magham


 On Fri, Aug 30, 2013 at 11:16 PM, Adeel Qureshi adeelmahm...@gmail.comwrote:

 For secondary sort I am implementing a RawComparator and providing it
 as the sort comparator. Is that faster than using a WritableComparable
 as the mapper output key and defining a compareTo method on the key itself?

 Also, what happens if both are defined? Is one ignored?





Re: sqoop oracle connection error

2013-08-31 Thread Ravi Kiran
Hi,
   Can you check whether you are able to ping or telnet to the IP address and
port of the Oracle database from your machine? I have a hunch that the Oracle
listener is stopped. If so, start it.
The commands to check the listener status and to start it if it isn't running:

$ lsnrctl status
$ lsnrctl start
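
(And a quick connectivity check from the Sqoop client machine, assuming the listener is on Oracle's default port 1521:)

$ telnet ramesh.ops.cloudwick.com 1521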

Regards

Ravi Magham


On Sat, Aug 31, 2013 at 2:05 PM, Krishnan Narayanan 
krishnan.sm...@gmail.com wrote:

 Hi Ram,

 I get the same error. If you find an answer, please do forward it to me; I will
 do the same.

 Thx
 Krish


 On Sat, Aug 31, 2013 at 12:00 AM, Ram pramesh...@gmail.com wrote:


 Hi,
I am trying to import a table from Oracle to HDFS. I am getting the
 following error:

 ERROR manager.SqlManager: Error executing statement:
 java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
 establish the connection
 java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
 establish the connection

 Is there any workaround for this?

 The command is:

 sqoop import --connect jdbc:oracle:thin:@//ramesh.ops.cloudwick.com/cloud --username ramesh --password password --table cloud.test -m 1

 The output is as follows:

 [root@ramesh ram]# sqoop import --connect jdbc:oracle:thin:@//
 ramesh.ops.cloudwick.com/cloud --username ramesh --password password
 --table cloud.test -m 1
 Warning: /usr/lib/hbase does not exist! HBase imports will fail.
 Please set $HBASE_HOME to the root of your HBase installation.
 13/08/31 12:27:27 WARN tool.BaseSqoopTool: Setting your password on the
 command-line is insecure. Consider using -P instead.
 13/08/31 12:27:27 INFO manager.SqlManager: Using default fetchSize of 1000
 13/08/31 12:27:27 INFO tool.CodeGenTool: Beginning code generation
 13/08/31 12:27:27 ERROR manager.SqlManager: Error executing statement:
 java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
 establish the connection
 java.sql.SQLRecoverableException: IO Error: The Network Adapter could not
 establish the connection
 at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:458)
  at
 oracle.jdbc.driver.PhysicalConnection.init(PhysicalConnection.java:546)
 at oracle.jdbc.driver.T4CConnection.init(T4CConnection.java:236)
  at
 oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:32)
 at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:521)
  at java.sql.DriverManager.getConnection(DriverManager.java:571)
 at java.sql.DriverManager.getConnection(DriverManager.java:215)
  at
 org.apache.sqoop.manager.OracleManager.makeConnection(OracleManager.java:313)
 at
 org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
  at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:605)
 at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:628)
  at
 org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:235)
 at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:219)
  at
 org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:347)
 at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1255)
  at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1072)
 at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
  at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:390)
 at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:476)
  at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
  at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
 Caused by: oracle.net.ns.NetException: The Network Adapter could not
 establish the connection
 at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:392)
  at
 oracle.net.resolver.AddrResolution.resolveAndExecute(AddrResolution.java:434)
 at oracle.net.ns.NSProtocol.establishConnection(NSProtocol.java:687)
  at oracle.net.ns.NSProtocol.connect(NSProtocol.java:247)
 at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1102)
  at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:320)
 ... 24 more
 Caused by: java.net.ConnectException: Connection refused
  at java.net.PlainSocketImpl.socketConnect(Native Method)
 at
 java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
  at
 java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
 at
 java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
 at java.net.Socket.connect(Socket.java:579)
  at oracle.net.nt.TcpNTAdapter.connect(TcpNTAdapter.java:150)
 at oracle.net.nt.ConnOption.connect(ConnOption.java:133)
  at oracle.net.nt.ConnStrategy.execute(ConnStrategy.java:370)
 ... 29 more
 13/08/31 12:27:27 ERROR manager.OracleManager: Failed to rollback
 transaction
 

Re: Job config before read fields

2013-08-31 Thread Adrian CAPDEFIER
Thank you for your help Shahab.

I guess I wasn't being too clear. My logic is that I use a custom type as
key and in order to deserialize it on the compute nodes, I need an extra
piece of information (also a custom type).

To use an analogy, a Text is serialized by writing the length of the string
as a number and then the bytes that compose the actual string. When it is
deserialized, the number informs the reader when to stop reading the
string. This number varies from string to string, and it is compact, so it
makes sense to serialize it with the string.

My use case is similar to it. I have a complex type (let's call this data),
and in order to deserialize it, I need another complex type (let's call
this second type metadata). The metadata is not closely tied to the data
(i.e. if the data value changes, the metadata does not) and the metadata
size is quite large.

I ruled out a couple of options, but please let me know if you think I did
so for the wrong reasons:
1. I could serialize each data value with its own metadata value, but
since the data value count is in the tens of millions or more and the metadata
value distinct count can be up to one hundred, it would waste resources in
the system.
2. I could serialize the metadata and then the data as a collection
property of the metadata. This would be an elegant solution code-wise, but
then all the data would have to be read and kept in memory as a massive
object before any reduce operations can happen. I wasn't able to find any
info on this online so this is just a guess from peeking at the hadoop code.

My solution was to serialize the data with a hash of the metadata and
separately serialize the metadata and its hash in the job configuration (as
key/value pairs). For this to work, I would need to be able to deserialize
the metadata on the reduce node before the data is deserialized in the
readFields() method.

I think that for that to happen I need to hook into the code somewhere
where a context or job configuration is used (before readFields()), but I'm
stumped as to where that is.

Cheers,
Adi
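
(One possible hook, sketched below under the assumption that the sort comparator is created through ReflectionUtils.newInstance: a comparator registered with job.setSortComparatorClass() that implements Configurable gets the job Configuration via setConf() before any compare() call, so the metadata could be rebuilt there and made available while the keys are deserialized. MyKey and MetadataRegistry are hypothetical names, not part of any Hadoop API.)

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.WritableComparator;

public class MetadataAwareComparator extends WritableComparator
    implements Configurable {
  private Configuration conf;

  public MetadataAwareComparator() {
    // true = let WritableComparator create key instances and call readFields()
    super(MyKey.class, true);
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // rebuild the metadata from the key/value pairs stored in the job config,
    // e.g. MetadataRegistry.load(conf), before any key bytes are compared
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}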


On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus shahab.yu...@gmail.comwrote:

 What I meant was that you might have to split or redesign your logic for
 your use case (which we don't know about).

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER chivas314...@gmail.com
  wrote:

 But how would the comparator have access to the job config?


 On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus shahab.yu...@gmail.comwrote:

 I think you have to override/extend the Comparator to achieve that,
 something like what is done in Secondary Sort?

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 I apologise for the lack of code in this message, but the code is
 fairly convoluted and it would obscure my problem. That being said, I can
 put together some sample code if really needed.

 I am trying to pass some metadata between the map and reduce steps. This
 metadata is read and generated in the map step and stored in the job
 config. It also needs to be recreated on the reduce node before the key/
 value fields can be read in the readFields function.

 I had assumed that I would be able to override the Reducer.setup()
 function and that would be it, but apparently the readFields function is
 called before the Reducer.setup() function.

 My question is what is any (the best) place on the reduce node where I
 can access the job configuration/ context before the readFields function is
 called?

 This is the stack trace:

 at
 org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:103)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:)
 at
 org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:70)
 at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1298)
 at
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
 at
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:766)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)







Re: InvalidProtocolBufferException while submitting crunch job to cluster

2013-08-31 Thread Shekhar Sharma
: java.net.UnknownHostException: bdatadev


edit your /etc/hosts file
Regards,
Som Shekhar Sharma
+91-8197243810


On Sat, Aug 31, 2013 at 2:05 AM, Narlin M hpn...@gmail.com wrote:
 Looks like I was pointing to incorrect ports. After correcting the port
 numbers,

 conf.set("fs.defaultFS", "hdfs://server_address:8020");
 conf.set("mapred.job.tracker", "server_address:8021");

 I am now getting the following exception:

 2880 [Thread-15] INFO
 org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob  -
 java.lang.IllegalArgumentException: java.net.UnknownHostException: bdatadev
 at
 org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
 at
 org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
 at
 org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:389)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:356)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:124)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2218)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2252)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2234)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
 at
 org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:103)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:902)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
 at
 org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.submit(CrunchControlledJob.java:305)
 at
 org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startReadyJobs(CrunchJobControl.java:180)
 at
 org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJobStatusAndStartNewOnes(CrunchJobControl.java:209)
 at
 org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:100)
 at org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:51)
 at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:75)
 at java.lang.Thread.run(Thread.java:680)
 Caused by: java.net.UnknownHostException: bdatadev
 ... 27 more

 However, nowhere in my code is a host named bdatadev mentioned, and I
 cannot ping this host.

 Thanks for the help.


 On Fri, Aug 30, 2013 at 3:04 PM, Narlin M hpn...@gmail.com wrote:

 I am getting following exception while trying to submit a crunch pipeline
 job to a remote hadoop cluster:

 Exception in thread main java.lang.RuntimeException: Cannot create job
 output directory /tmp/crunch-324987940
 at
 org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:344)
 at org.apache.crunch.impl.mr.MRPipeline.init(MRPipeline.java:125)
 at test.CrunchTest.setup(CrunchTest.java:98)
 at test.CrunchTest.main(CrunchTest.java:367)
 Caused by: java.io.IOException: Failed on local exception:
 com.google.protobuf.InvalidProtocolBufferException: Protocol message
 end-group tag did not match expected tag.; Host Details : local host is:
 NARLIN/127.0.0.1; destination host is: server_address:50070;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
 at org.apache.hadoop.ipc.Client.call(Client.java:1164)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:425)
 at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1943)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:523)
 at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1799)
 at
 org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:342)
 ... 3 more
 Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
 message end-group tag did not match expected 

Re: secondary sort - number of reducers

2013-08-31 Thread Ravi Kiran
Adeel,
   To add to Yong's points:
a)   Consider tuning the number of threads in the reduce tasks and the task
tracker process (mapred.reduce.parallel.copies).
b)   See if the map output can be compressed to ensure there is less IO.
c)   Increase io.sort.factor to ensure the framework merges a larger
number of files in each merge sort at the reducer.
d)   Check the Reduce Shuffle Bytes counter of each reducer to see any
skew of data at a few reducers. Aim for an even distribution of load through
better partitioner code.
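
(A sketch of the corresponding MR1 property names in mapred-site.xml / the job configuration; the values below are illustrative only, not recommendations:)

<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>10</value>  <!-- parallel shuffle copier threads per reduce task (default 5) -->
</property>
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>  <!-- compress intermediate map output to cut shuffle IO -->
</property>
<property>
  <name>io.sort.factor</name>
  <value>50</value>  <!-- streams merged at once during sorts (default 10) -->
</property>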

Regards
Ravi Magham


On Fri, Aug 30, 2013 at 9:28 PM, java8964 java8964 java8...@hotmail.comwrote:

 Well, The reducers normally will take much longer than the mappers stage,
 because the copy/shuffle/sort all happened at this time, and they are the
 hard part.

 But before we simply say it is part of life, you need to dig into more of
 your MR jobs to find out if you can make it faster.

 You are the person most familiar with your data, and you wrote the code to
 group/partition them and send them to the reducers. Even if you set up 255
 reducers, the question is: does each of them get its fair share?
 You need to read the COUNTER information of each reducer, and find out
 how many reduce groups each reducer gets, how many input bytes it gets,
 etc.

 Simple example, if you send 200G data, and group them by DATE, if all the
 data belongs to 2 days, and one of them contains 90% of data, then in this
 case, giving 255 reducers won't help, as only 2 reducers will consume data,
 and one of them will consume 90% of data, and will finish in a very long
 time, which WILL delay the whole MR job, while the rest of the reducers will
 finish within seconds. In this case, maybe you need to rethink what your key
 should be, and make sure each reducer gets its fair share of the volume of
 data.

 After the above fix (in fact, it normally fixes 90% of reducer
 performance problems, especially since you have 255 reducer tasks available,
 so each one will on average only get 1G of data, which is good for a huge
 cluster that only needs to process 256G of data :-), if you want to make it
 even faster, then check your code. Do you have to use String.compareTo()? Is
 it slow? Google "hadoop rawcomparator" to see if you can do something here.

 After that, if you still think the reducer stage is slow, check your cluster.
 Does the reducer spend most of its time in the copy stage, the sort, or in
 your reducer class? Find out where the time is spent, then identify the
 solution.

 Yong

 --
 Date: Fri, 30 Aug 2013 11:02:05 -0400

 Subject: Re: secondary sort - number of reducers
 From: adeelmahm...@gmail.com
 To: user@hadoop.apache.org



 my secondary sort on multiple keys seems to work fine with smaller data
 sets but with bigger data sets (like 256 gig and 800M+ records) the mapper
 phase gets done pretty quickly (about 15 mins) but then the reducer phase
 seems to take forever. I am using 255 reducers.

 basic idea is that my composite key has both group and sort keys in it,
 which I parse in the appropriate comparator classes to perform grouping and
 sorting .. my thinking is that the map side is where most of the work is done
 1. mapper itself (create composite key and value)
 2. record sorting
 3. partitioner

 if all this gets done in 15 mins then the reducer has the simple task of
 1. grouping comparator
 2. reducer itself (simply output records)

 and should take less time than the mappers .. instead it essentially gets
 stuck in the reduce phase .. I'm going to paste my code here to see if
 anything stands out as a fundamental design issue

 //PARTITIONER
 public int getPartition(Text key, HCatRecord record, int numReduceTasks) {
 //extract the group key from the composite key
  String groupKey = key.toString().split("\\|")[0];
 return (groupKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }


 GROUP COMPARATOR
 public int compare(WritableComparable a, WritableComparable b) {
 //extract and compare the group keys of the two Text objects
  String thisGroupKey = ((Text) a).toString().split("\\|")[0];
 String otherGroupKey = ((Text) b).toString().split("\\|")[0];
 return thisGroupKey.compareTo(otherGroupKey);
 }


 SORT COMPARATOR
 is similar to group comparator and is in map phase and gets done quick



 //REDUCER
 public void reduce(Text key, Iterable<HCatRecord> records, Context
 context) throws IOException, InterruptedException {
 log.info("in reducer for key " + key.toString());
  Iterator<HCatRecord> recordsIter = records.iterator();
 //we are only interested in the first record after sorting and grouping
  if(recordsIter.hasNext()){
 HCatRecord rec = recordsIter.next();
 context.write(nw, rec);
  log.info("returned record " + rec.toString());
 }
 }


 On Fri, Aug 30, 2013 at 9:24 AM, Adeel Qureshi adeelmahm...@gmail.comwrote:

 yup it was negative and by doing this now it seems to be working fine


 On Fri, Aug 30, 2013 at 3:09 AM, Shekhar Sharma shekhar2...@gmail.comwrote:

 Is the hash code of that key negative?
 Do something 

How to change default ports of datanodes in a cluster

2013-08-31 Thread Visioner Sadak
Hello Hadoopers,

The default HTTP port for a datanode is 50075. I am able to change the namenode
default port by changing

dfs.namenode.http-address.ns1 & dfs.namenode.http-address.ns2 in the
hdfs-site.xml of my 2 namenodes.

How do I change the default port address of my multiple datanodes?


Re: Multidata center support

2013-08-31 Thread Visioner Sadak
lets say that

you have some machines in Europe and some in the US. I think you just need the
IPs and to configure them in your cluster setup,
and it will work...


On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote:

 Hi,
 Although you can set a datacenter layer in your network topology, it is
 never enabled in hadoop, as it lacks replica placement and task scheduling
 support. There is some work to add layers other than rack and node under
 HADOOP-8848, but it may not suit your case. I agree with Adam that a cluster
 spanning multiple data centers does not seem to make sense even for the DR
 case. Do you have other cases that call for such a deployment?

 Thanks,

 Junping

 --
 *From: *Adam Muise amu...@hortonworks.com
 *To: *user@hadoop.apache.org
 *Sent: *Friday, August 30, 2013 6:26:54 PM
 *Subject: *Re: Multidata center support


 Nothing has changed. DR best practice is still one (or more) clusters per
 site and replication is handled via distributed copy or some variation of
 it. A cluster spanning multiple data centers is a poor idea right now.




 On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 My take on this.

 Why does hadoop have to know about the data center at all? I think it can be
 installed across multiple data centers; however, topology configuration
 would be required to tell which node belongs to which data center and
 switch for block placement.

 Thanks,
 Rahul


 On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu 
 baskar.duraika...@outlook.com wrote:

 We have a need to set up hadoop across data centers. Does hadoop support
 multi data center configuration? I searched through archives and have found
 that hadoop did not support multi data center configuration some time back.
 Just wanted to see whether the situation has changed.

 Please help.





 --
 *Adam Muise*
 Solution Engineer
 *Hortonworks*
 amu...@hortonworks.com
 416-417-4037

 Hortonworks - Develops, Distributes and Supports Enterprise Apache 
 Hadoop.http://hortonworks.com/

 Hortonworks Virtual Sandbox http://hortonworks.com/sandbox

 Hadoop: Disruptive Possibilities by Jeff 
 Needhamhttp://hortonworks.com/resources/?did=72cat=1





Re: InvalidProtocolBufferException while submitting crunch job to cluster

2013-08-31 Thread Narlin M
I would, but bdatadev is not one of my servers, it seems like a random
host name. I can't figure out how or where this name got generated. That's
what's puzzling me.

On 8/31/13 5:43 AM, Shekhar Sharma shekhar2...@gmail.com wrote:

: java.net.UnknownHostException: bdatadev


edit your /etc/hosts file
Regards,
Som Shekhar Sharma
+91-8197243810


On Sat, Aug 31, 2013 at 2:05 AM, Narlin M hpn...@gmail.com wrote:
 Looks like I was pointing to incorrect ports. After correcting the port
 numbers,

 conf.set("fs.defaultFS", "hdfs://server_address:8020");
 conf.set("mapred.job.tracker", "server_address:8021");

 I am now getting the following exception:

 2880 [Thread-15] INFO
 org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob  -
 java.lang.IllegalArgumentException: java.net.UnknownHostException:
bdatadev
 at
 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.ja
va:414)
 at
 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.j
ava:164)
 at
 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:1
29)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:389)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:356)
 at
 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSy
stem.java:124)
 at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2218)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
 at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2252)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2234)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
 at
 
org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissio
nFiles.java:103)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:902)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
.java:1332)
 at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
 at
 
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.sub
mit(CrunchControlledJob.java:305)
 at
 
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.startR
eadyJobs(CrunchJobControl.java:180)
 at
 
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJo
bStatusAndStartNewOnes(CrunchJobControl.java:209)
 at
 
org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:100
)
 at 
org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:51)
 at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:75)
 at java.lang.Thread.run(Thread.java:680)
 Caused by: java.net.UnknownHostException: bdatadev
 ... 27 more

 However nowhere in my code a host named bdatadev is mentioned, and I
 cannot ping this host.

 Thanks for the help.


 On Fri, Aug 30, 2013 at 3:04 PM, Narlin M hpn...@gmail.com wrote:

 I am getting following exception while trying to submit a crunch
pipeline
 job to a remote hadoop cluster:

 Exception in thread main java.lang.RuntimeException: Cannot create
job
 output directory /tmp/crunch-324987940
 at
 
org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java
:344)
 at org.apache.crunch.impl.mr.MRPipeline.init(MRPipeline.java:125)
 at test.CrunchTest.setup(CrunchTest.java:98)
 at test.CrunchTest.main(CrunchTest.java:367)
 Caused by: java.io.IOException: Failed on local exception:
 com.google.protobuf.InvalidProtocolBufferException: Protocol message
 end-group tag did not match expected tag.; Host Details : local host
is:
 NARLIN/127.0.0.1; destination host is: server_address:50070;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
 at org.apache.hadoop.ipc.Client.call(Client.java:1164)
 at
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine
.java:202)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav
a:39)
 at
 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
Impl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvo
cationHandler.java:164)
 at
 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocation
Handler.java:83)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at
 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkd
irs(ClientNamenodeProtocolTranslatorPB.java:425)
 at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1943)
 at
 
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSyste
m.java:523)
 at 

Re: InvalidProtocolBufferException while submitting crunch job to cluster

2013-08-31 Thread Narlin M
The server_address that was mentioned in my original post is not
pointing to bdatadev. I should have mentioned this in my original post,
sorry I missed that.

On 8/31/13 8:32 AM, Narlin M hpn...@gmail.com wrote:

I would, but bdatadev is not one of my servers, it seems like a random
host name. I can't figure out how or where this name got generated. That's
what puzzling me.

On 8/31/13 5:43 AM, Shekhar Sharma shekhar2...@gmail.com wrote:

: java.net.UnknownHostException: bdatadev


edit your /etc/hosts file
Regards,
Som Shekhar Sharma
+91-8197243810


On Sat, Aug 31, 2013 at 2:05 AM, Narlin M hpn...@gmail.com wrote:
 Looks like I was pointing to incorrect ports. After correcting the port
 numbers,

 conf.set("fs.defaultFS", "hdfs://server_address:8020");
 conf.set("mapred.job.tracker", "server_address:8021");

 I am now getting the following exception:

 2880 [Thread-15] INFO
 org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob
-
 java.lang.IllegalArgumentException: java.net.UnknownHostException:
bdatadev
 at
 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.j
a
va:414)
 at
 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.
j
ava:164)
 at
 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:
1
29)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:389)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:356)
 at
 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileS
y
stem.java:124)
 at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2218)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
 at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2252)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2234)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
 at
 
org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissi
o
nFiles.java:103)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:902)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio
n
.java:1332)
 at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
 at
 
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.su
b
mit(CrunchControlledJob.java:305)
 at
 
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.start
R
eadyJobs(CrunchJobControl.java:180)
 at
 
org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJ
o
bStatusAndStartNewOnes(CrunchJobControl.java:209)
 at
 
org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:10
0
)
 at 
org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:51)
 at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:75)
 at java.lang.Thread.run(Thread.java:680)
 Caused by: java.net.UnknownHostException: bdatadev
 ... 27 more

 However nowhere in my code a host named bdatadev is mentioned, and I
 cannot ping this host.

 Thanks for the help.


 On Fri, Aug 30, 2013 at 3:04 PM, Narlin M hpn...@gmail.com wrote:

 I am getting following exception while trying to submit a crunch
pipeline
 job to a remote hadoop cluster:

 Exception in thread main java.lang.RuntimeException: Cannot create
job
 output directory /tmp/crunch-324987940
 at
 
org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.jav
a
:344)
 at org.apache.crunch.impl.mr.MRPipeline.init(MRPipeline.java:125)
 at test.CrunchTest.setup(CrunchTest.java:98)
 at test.CrunchTest.main(CrunchTest.java:367)
 Caused by: java.io.IOException: Failed on local exception:
 com.google.protobuf.InvalidProtocolBufferException: Protocol message
 end-group tag did not match expected tag.; Host Details : local host
is:
 NARLIN/127.0.0.1; destination host is: server_address:50070;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
 at org.apache.hadoop.ipc.Client.call(Client.java:1164)
 at
 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngin
e
.java:202)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.ja
v
a:39)
 at
 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccesso
r
Impl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInv
o
cationHandler.java:164)
 at
 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocatio
n
Handler.java:83)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at
 

Re: InvalidProtocolBufferException while submitting crunch job to cluster

2013-08-31 Thread Shekhar Sharma
Can you please check whether you are able to access HDFS using the Java
API, and also whether you are able to run an MR job?
Regards,
Som Shekhar Sharma
+91-8197243810


On Sat, Aug 31, 2013 at 7:08 PM, Narlin M hpn...@gmail.com wrote:
 The server_address that was mentioned in my original post is not
 pointing to bdatadev. I should have mentioned this in my original post,
 sorry I missed that.

 On 8/31/13 8:32 AM, Narlin M hpn...@gmail.com wrote:

I would, but bdatadev is not one of my servers, it seems like a random
host name. I can't figure out how or where this name got generated. That's
what puzzling me.

On 8/31/13 5:43 AM, Shekhar Sharma shekhar2...@gmail.com wrote:

: java.net.UnknownHostException: bdatadev


edit your /etc/hosts file
Regards,
Som Shekhar Sharma
+91-8197243810


On Sat, Aug 31, 2013 at 2:05 AM, Narlin M hpn...@gmail.com wrote:
 Looks like I was pointing to incorrect ports. After correcting the port
 numbers,

 conf.set("fs.defaultFS", "hdfs://server_address:8020");
 conf.set("mapred.job.tracker", "server_address:8021");

 I am now getting the following exception:

 2880 [Thread-15] INFO
 org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob
-
 java.lang.IllegalArgumentException: java.net.UnknownHostException:
bdatadev
 at

org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.j
a
va:414)
 at

org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.
j
ava:164)
 at

org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:
1
29)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:389)
 at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:356)
 at

org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileS
y
stem.java:124)
 at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2218)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
 at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2252)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2234)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:300)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
 at

org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissi
o
nFiles.java:103)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:902)
 at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:896)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at

org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio
n
.java:1332)
 at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:896)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:531)
 at

org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob.su
b
mit(CrunchControlledJob.java:305)
 at

org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.start
R
eadyJobs(CrunchJobControl.java:180)
 at

org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchJobControl.pollJ
o
bStatusAndStartNewOnes(CrunchJobControl.java:209)
 at

org.apache.crunch.impl.mr.exec.MRExecutor.monitorLoop(MRExecutor.java:10
0
)
 at
org.apache.crunch.impl.mr.exec.MRExecutor.access$000(MRExecutor.java:51)
 at org.apache.crunch.impl.mr.exec.MRExecutor$1.run(MRExecutor.java:75)
 at java.lang.Thread.run(Thread.java:680)
 Caused by: java.net.UnknownHostException: bdatadev
 ... 27 more

 However nowhere in my code a host named bdatadev is mentioned, and I
 cannot ping this host.

 Thanks for the help.


 On Fri, Aug 30, 2013 at 3:04 PM, Narlin M hpn...@gmail.com wrote:

 I am getting following exception while trying to submit a crunch
pipeline
 job to a remote hadoop cluster:

 Exception in thread main java.lang.RuntimeException: Cannot create
job
 output directory /tmp/crunch-324987940
 at

org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.jav
a
:344)
 at org.apache.crunch.impl.mr.MRPipeline.init(MRPipeline.java:125)
 at test.CrunchTest.setup(CrunchTest.java:98)
 at test.CrunchTest.main(CrunchTest.java:367)
 Caused by: java.io.IOException: Failed on local exception:
 com.google.protobuf.InvalidProtocolBufferException: Protocol message
 end-group tag did not match expected tag.; Host Details : local host
is:
 NARLIN/127.0.0.1; destination host is: server_address:50070;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
 at org.apache.hadoop.ipc.Client.call(Client.java:1164)
 at

org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngin
e
.java:202)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.ja
v
a:39)
 at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccesso
r
Impl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at

org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInv
o
cationHandler.java:164)
 at


Re: Multidata center support

2013-08-31 Thread Visioner Sadak
The only problem, I guess, is that hadoop won't be able to replicate data from
one data center to another, but I guess I can identify datanodes or namenodes
from another data center. Correct me if I am wrong.


On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak visioner.sa...@gmail.comwrote:

 lets say that

 you have some machines in europe and some  in US I think you just need the
 ips and configure them in your cluster set up
 it will work...


 On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote:

 Hi,
 Although you can set datacenter layer on your network topology, it is
 never enabled in hadoop as lacking of replica placement and task scheduling
 support. There are some work to add layers other than rack and node under
 HADOOP-8848 but may not suit for your case. Agree with Adam that a cluster
 spanning multiple data centers seems not make sense even for DR case. Do
 you have other cases to do such a deployment?

 Thanks,

 Junping

 --
 *From: *Adam Muise amu...@hortonworks.com
 *To: *user@hadoop.apache.org
 *Sent: *Friday, August 30, 2013 6:26:54 PM
 *Subject: *Re: Multidata center support


 Nothing has changed. DR best practice is still one (or more) clusters per
 site and replication is handled via distributed copy or some variation of
 it. A cluster spanning multiple data centers is a poor idea right now.




 On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 My take on this.

 Why hadoop has to know about data center thing. I think it can be
 installed across multiple data centers , however topology configuration
 would be required to tell which node belongs to which data center and
 switch for block placement.

 Thanks,
 Rahul


 On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu 
 baskar.duraika...@outlook.com wrote:

 We have a need to setup hadoop across data centers.  Does hadoop
 support multi data center configuration? I searched through archives and
 have found that hadoop did not support multi data center configuration some
 time back. Just wanted to see whether situation has changed.

 Please help.





 --
 *Adam Muise*
 Solution Engineer
 *Hortonworks*
 amu...@hortonworks.com
 416-417-4037

 Hortonworks - Develops, Distributes and Supports Enterprise Apache 
 Hadoop.http://hortonworks.com/

 Hortonworks Virtual Sandbox http://hortonworks.com/sandbox

 Hadoop: Disruptive Possibilities by Jeff 
 Needhamhttp://hortonworks.com/resources/?did=72cat=1






Re: InvalidProtocolBufferException while submitting crunch job to cluster

2013-08-31 Thread Harsh J
Your cluster is using HDFS HA, and therefore requires a few more
configs than just fs.defaultFS, etc.

You need to use the right set of cluster client configs. If you don't
have them at /etc/hadoop/conf and /etc/hbase/conf on your cluster edge
node to pull from, try asking your cluster administrator for a
configuration set, and place their parent directories on your
application's classpath.

The first error suggests that you are also including a guava
dependency in your project which is different from the one
transitively pulled in by hadoop-client via crunch. You should be able
to use the guava libs without needing an explicit dependency, and then
it would be the right version.

The second error deals with your MR submission failing, because the JT
is using a staging directory on an HA HDFS, which uses the logical
name "bdatadev". A logical HA name needs other configs (typically
in hdfs-site.xml) that tell the client which actual physical NNs are
under it - configs that you're missing here.
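
(For illustration, the usual shape of those client-side HA settings in hdfs-site.xml, assuming "bdatadev" is the nameservice name; the NN hostnames nn1host/nn2host are placeholders:)

<property>
  <name>dfs.nameservices</name>
  <value>bdatadev</value>
</property>
<property>
  <name>dfs.ha.namenodes.bdatadev</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.bdatadev.nn1</name>
  <value>nn1host:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.bdatadev.nn2</name>
  <value>nn2host:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.bdatadev</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>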

On Sat, Aug 31, 2013 at 1:34 AM, Narlin M hpn...@gmail.com wrote:
 I am getting following exception while trying to submit a crunch pipeline
 job to a remote hadoop cluster:

 Exception in thread main java.lang.RuntimeException: Cannot create job
 output directory /tmp/crunch-324987940
 at
 org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:344)
 at org.apache.crunch.impl.mr.MRPipeline.init(MRPipeline.java:125)
 at test.CrunchTest.setup(CrunchTest.java:98)
 at test.CrunchTest.main(CrunchTest.java:367)
 Caused by: java.io.IOException: Failed on local exception:
 com.google.protobuf.InvalidProtocolBufferException: Protocol message
 end-group tag did not match expected tag.; Host Details : local host is:
 NARLIN/127.0.0.1; destination host is: server_address:50070;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:759)
 at org.apache.hadoop.ipc.Client.call(Client.java:1164)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
 at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
 at com.sun.proxy.$Proxy11.mkdirs(Unknown Source)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:425)
 at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:1943)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:523)
 at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1799)
 at
 org.apache.crunch.impl.mr.MRPipeline.createTempDirectory(MRPipeline.java:342)
 ... 3 more
 Caused by: com.google.protobuf.InvalidProtocolBufferException: Protocol
 message end-group tag did not match expected tag.
 at
 com.google.protobuf.InvalidProtocolBufferException.invalidEndTag(InvalidProtocolBufferException.java:73)
 at
 com.google.protobuf.CodedInputStream.checkLastTagWas(CodedInputStream.java:124)
 at
 com.google.protobuf.AbstractMessageLite$Builder.mergeFrom(AbstractMessageLite.java:213)
 at
 com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:746)
 at
 com.google.protobuf.AbstractMessage$Builder.mergeFrom(AbstractMessage.java:238)
 at
 com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:282)
 at
 com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:760)
 at
 com.google.protobuf.AbstractMessageLite$Builder.mergeDelimitedFrom(AbstractMessageLite.java:288)
 at
 com.google.protobuf.AbstractMessage$Builder.mergeDelimitedFrom(AbstractMessage.java:752)
 at
 org.apache.hadoop.ipc.protobuf.RpcPayloadHeaderProtos$RpcResponseHeaderProto.parseDelimitedFrom(RpcPayloadHeaderProtos.java:985)
 at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:882)
 at org.apache.hadoop.ipc.Client$Connection.run(Client.java:813)
 0[Thread-3] WARN  org.apache.hadoop.util.ShutdownHookManager  -
 ShutdownHook 'ClientFinalizer' failed, java.lang.NoSuchMethodError:
 com.google.common.collect.LinkedListMultimap.values()Ljava/util/List;
 java.lang.NoSuchMethodError:
 com.google.common.collect.LinkedListMultimap.values()Ljava/util/List;
 at org.apache.hadoop.hdfs.SocketCache.clear(SocketCache.java:135)
 at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:672)
 at
 org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:539)
 at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2308)
 at
 org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2324)
 at
 

Re: How to change default ports of datanodes in a cluster

2013-08-31 Thread Harsh J
Looking at the hdfs-default.xml should help with such questions:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

The property you need is dfs.datanode.http.address
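
(For example, in the hdfs-site.xml of each datanode, moving the HTTP port from the default 50075 to, say, 50175:)

<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50175</value>
</property>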

On Sat, Aug 31, 2013 at 6:47 PM, Visioner Sadak
visioner.sa...@gmail.com wrote:
 Hello Hadoopers,

 Default port for datanode is 50075 i am able to change namenode default port
 by changing

 dfs.namenode.http-address.ns1  dfs.namenode.http-address.ns2 in my
 hdfs-site.xml of my 2 namenodes

 how to change default port address of my multiple datanodes





-- 
Harsh J


Re: How to change default ports of datanodes in a cluster

2013-08-31 Thread Visioner Sadak
Thanks Harsh. For a cluster, should I enter multiple IP addresses under
the dfs.datanode.http.address tag,
as I have 4 data nodes?


On Sat, Aug 31, 2013 at 9:44 PM, Harsh J ha...@cloudera.com wrote:

 Looking at the hdfs-default.xml should help with such questions:

 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

 The property you need is dfs.datanode.http.address

 On Sat, Aug 31, 2013 at 6:47 PM, Visioner Sadak
 visioner.sa...@gmail.com wrote:
  Hello Hadoopers,
 
  Default port for datanode is 50075 i am able to change namenode default
 port
  by changing
 
  dfs.namenode.http-address.ns1  dfs.namenode.http-address.ns2 in my
  hdfs-site.xml of my 2 namenodes
 
  how to change default port address of my multiple datanodes
 
 



 --
 Harsh J



Re: WritableComparable.compareTo vs RawComparator.compareTo

2013-08-31 Thread Adeel Qureshi
Thanks for the information. So the reason that makes the raw comparator
faster is that we can use the bytes to do the comparison .. so if I use
the object signature of compare in my raw comparator that receives two
WritableComparable objects,

public int compare(WritableComparable a, WritableComparable b)

instead of the bytes one .. then does it end up slower and more comparable
to the compareTo method defined on the WritableComparable object itself?

Secondly, if I do use the bytes signature - I have seen implementations
where you can use util methods like readInt and readString to read ints and
strings from those bytes - what if I have a complex object inside my
WritableComparable, such as a Text or a List .. how can I read those from the
bytes?

Thanks
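
(A minimal sketch for a leading Text field, assuming the key writes its Text first via Text.write(), i.e. a variable-length vint length followed by the UTF-8 bytes - essentially what org.apache.hadoop.io.Text.Comparator does; MyCompositeKey is a hypothetical key class:)

import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

public class TextFirstComparator extends WritableComparator {
  public TextFirstComparator() {
    super(MyCompositeKey.class); // hypothetical key whose first field is a Text
  }
  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      // how many bytes the vint length prefix occupies
      int n1 = WritableUtils.decodeVIntSize(b1[s1]);
      int n2 = WritableUtils.decodeVIntSize(b2[s2]);
      // the Text length itself, so we know where the field ends
      int len1 = readVInt(b1, s1);
      int len2 = readVInt(b2, s2);
      // lexicographically compare just the UTF-8 bytes of the Text field
      return compareBytes(b1, s1 + n1, len1, b2, s2 + n2, len2);
    } catch (java.io.IOException e) {
      throw new IllegalArgumentException(e);
    }
  }
}

A nested List would be handled the same way: walk its known on-disk encoding (for example a count followed by each element) field by field within the byte arrays.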


On Aug 31, 2013 3:58 AM, Ravi Kiran ravikiranmag...@gmail.com wrote:

 Also, if both are defined , the framework will use RawComparator . I hope
 you have registered the comparator in a static block as follows

 static
 {
 WritableComparator.define(PairOfInts.class, new Comparator());
  }

 Regards
 Ravi Magham


 On Sat, Aug 31, 2013 at 1:23 PM, Ravi Kiran ravikiranmag...@gmail.comwrote:

 Hi Adeel,

 The RawComparator is the fastest between the two as you avoid the
 need to convert the byte stream to Writable objects for comparison .

 Regards
 Ravi Magham


 On Fri, Aug 30, 2013 at 11:16 PM, Adeel Qureshi 
 adeelmahm...@gmail.comwrote:

 For secondary sort I am implementing a RawComparator and providing that
 as sortComparator .. is that the faster way or using a WritableComparable
 as mapper output and defining a compareTo method on the key itself

 also what happens if both are defined, is one ignored






Re: How to change default ports of datanodes in a cluster

2013-08-31 Thread Harsh J
You can maintain per-DN configs if you wish to restrict the HTTP
server to only the public IP, but otherwise use a wildcard
0.0.0.0:PORT, if you were only just looking to change the port.
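
As an illustration only (the port value 50080 is arbitrary; pick any free
port), the per-DataNode hdfs-site.xml entry would look something like:

<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50080</value>
</property>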

On Sat, Aug 31, 2013 at 9:49 PM, Visioner Sadak
visioner.sa...@gmail.com wrote:
 thanks harsh for a cluster should i enter multiple ip address under tag
 dfs.datanode.http.address as i have 4 data nodes


 On Sat, Aug 31, 2013 at 9:44 PM, Harsh J ha...@cloudera.com wrote:

 Looking at the hdfs-default.xml should help with such questions:

 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

 The property you need is dfs.datanode.http.address

 On Sat, Aug 31, 2013 at 6:47 PM, Visioner Sadak
 visioner.sa...@gmail.com wrote:
  Hello Hadoopers,
 
  Default port for datanode is 50075 i am able to change namenode default
  port
  by changing
 
  dfs.namenode.http-address.ns1  dfs.namenode.http-address.ns2 in my
  hdfs-site.xml of my 2 namenodes
 
  how to change default port address of my multiple datanodes
 
 



 --
 Harsh J





-- 
Harsh J


Re: How to change default ports of datanodes in a cluster

2013-08-31 Thread Visioner Sadak
cool thanks a ton harsh!!!


On Sat, Aug 31, 2013 at 9:53 PM, Harsh J ha...@cloudera.com wrote:

 You can maintain per-DN configs if you wish to restrict the HTTP
 server to only the public IP, but otherwise use a wildcard
 0.0.0.0:PORT, if you were only just looking to change the port.

 On Sat, Aug 31, 2013 at 9:49 PM, Visioner Sadak
 visioner.sa...@gmail.com wrote:
  thanks harsh for a cluster should i enter multiple ip address under tag
  dfs.datanode.http.address as i have 4 data nodes
 
 
  On Sat, Aug 31, 2013 at 9:44 PM, Harsh J ha...@cloudera.com wrote:
 
  Looking at the hdfs-default.xml should help with such questions:
 
 
 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
 
  The property you need is dfs.datanode.http.address
 
  On Sat, Aug 31, 2013 at 6:47 PM, Visioner Sadak
  visioner.sa...@gmail.com wrote:
   Hello Hadoopers,
  
   Default port for datanode is 50075 i am able to change namenode
 default
   port
   by changing
  
   dfs.namenode.http-address.ns1  dfs.namenode.http-address.ns2 in my
   hdfs-site.xml of my 2 namenodes
  
   how to change default port address of my multiple datanodes
  
  
 
 
 
  --
  Harsh J
 
 



 --
 Harsh J



Re: Multidata center support

2013-08-31 Thread Visioner Sadak
What do you think, friends? I think Hadoop clusters can run on multiple data
centers using FEDERATION.


On Sat, Aug 31, 2013 at 8:39 PM, Visioner Sadak visioner.sa...@gmail.com wrote:

 The only problem i guess hadoop wont be able to duplicate data from one
 data center to another but i guess i can identify data nodes or namenodes
 from another data center correct me if i am wrong


 On Sat, Aug 31, 2013 at 7:00 PM, Visioner Sadak 
 visioner.sa...@gmail.com wrote:

 lets say that

 you have some machines in europe and some  in US I think you just need
 the ips and configure them in your cluster set up
 it will work...


 On Sat, Aug 31, 2013 at 7:52 AM, Jun Ping Du j...@vmware.com wrote:

 Hi,
 Although you can set datacenter layer on your network topology, it
 is never enabled in hadoop as lacking of replica placement and task
 scheduling support. There are some work to add layers other than rack and
 node under HADOOP-8848 but may not suit for your case. Agree with Adam that
 a cluster spanning multiple data centers seems not make sense even for DR
 case. Do you have other cases to do such a deployment?

 Thanks,

 Junping

 --
 From: Adam Muise amu...@hortonworks.com
 To: user@hadoop.apache.org
 Sent: Friday, August 30, 2013 6:26:54 PM
 Subject: Re: Multidata center support


 Nothing has changed. DR best practice is still one (or more) clusters
 per site and replication is handled via distributed copy or some variation
 of it. A cluster spanning multiple data centers is a poor idea right now.




 On Fri, Aug 30, 2013 at 12:35 AM, Rahul Bhattacharjee 
 rahul.rec@gmail.com wrote:

 My take on this.

 Why hadoop has to know about data center thing. I think it can be
 installed across multiple data centers , however topology configuration
 would be required to tell which node belongs to which data center and
 switch for block placement.

 Thanks,
 Rahul


 On Fri, Aug 30, 2013 at 12:42 AM, Baskar Duraikannu 
 baskar.duraika...@outlook.com wrote:

 We have a need to setup hadoop across data centers.  Does hadoop
 support multi data center configuration? I searched through archives and
 have found that hadoop did not support multi data center configuration 
 some
 time back. Just wanted to see whether situation has changed.

 Please help.





 --
 Adam Muise
 Solution Engineer
 Hortonworks
 amu...@hortonworks.com
 416-417-4037

 Hortonworks - Develops, Distributes and Supports Enterprise Apache
 Hadoop. http://hortonworks.com/

 Hortonworks Virtual Sandbox http://hortonworks.com/sandbox

 Hadoop: Disruptive Possibilities by Jeff Needham
 http://hortonworks.com/resources/?did=72&cat=1







Re: bad interpreter: Text file busy and other errors in Hadoop 2.1.0-beta

2013-08-31 Thread Jian He
Hi John

This exception should indicate an error from the container process. If the
container process exits with a non-zero exit code, it will be logged.
In case of such errors, you should look at the per-container log to see
what's happening there.
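
If it helps in locating them: the per-container logs normally sit under the
directories configured by yarn.nodemanager.log-dirs on the node that ran the
container, and if log aggregation is enabled they can also be fetched after
the application finishes with something like

yarn logs -applicationId <application id>

(exact behaviour may vary slightly on 2.1.0-beta).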

Jian


On Fri, Aug 30, 2013 at 10:03 AM, Jian Fang
jian.fang.subscr...@gmail.com wrote:

 Hi,

 I upgraded to Hadoop 2.1.0-beta and suddenly I started to see error
 messages as follows.

 Exception from container-launch:
 org.apache.hadoop.util.Shell$ExitCodeException: bash:
 /var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1377823589199_0002/container_1377823589199_0002_01_000214/default_container_executor.sh:
 /bin/bash: bad interpreter: Text file busy

 at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
 at org.apache.hadoop.util.Shell.run(Shell.java:373)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
 at
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:258)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:74)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)



 cleanup failed for container container_1377823589199_0002_01_000214 :
 org.apache.hadoop.yarn.exceptions.YarnException: Container
 container_1377823589199_0002_01_000214 is not handled by this NodeManager
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
 at
 org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
 at
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:210)
 at
 org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
 at java.lang.Thread.run(Thread.java:662)

 Any thing wrong here?

 Thanks,

 John




Re: metric type

2013-08-31 Thread Jitendra Yadav
Yes, MutableCounterLong helps to gather DataNode read/write statistics.
There are more options available within this metrics class.
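
As a rough illustration (the class and metric names here are made up, and it
assumes the metrics system instantiates the annotated mutable fields when the
source is registered, as it does for DataNodeMetrics), a metrics2 source that
combines a counter for byte totals with a gauge for a current value could
look like:

import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableGaugeInt;

// Illustrative source only, not part of DataNodeMetrics itself.
@Metrics(about = "Example IO metrics", context = "dfs")
public class ExampleIoMetrics {

    // Monotonically increasing counter; Ganglia can derive bytes/sec from
    // successive snapshots of it.
    @Metric("Total bytes read") MutableCounterLong bytesRead;

    // Gauge for a value that moves up and down, e.g. current xceiver threads.
    @Metric("Current xceiver count") MutableGaugeInt xceiverCount;

    public static ExampleIoMetrics create() {
        MetricsSystem ms = DefaultMetricsSystem.instance();
        return ms.register("ExampleIoMetrics", "Example IO metrics",
            new ExampleIoMetrics());
    }

    public void incrBytesRead(long delta) { bytesRead.incr(delta); }
    public void xceiverStarted() { xceiverCount.incr(); }
    public void xceiverFinished() { xceiverCount.decr(); }
}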

Regards
Jitendra
On 8/31/13, lei liu liulei...@gmail.com wrote:
 There is @Metric MutableCounterLong bytesWritten attribute in
 DataNodeMetrics, which is used to IO/sec statistics?


 2013/8/31 Jitendra Yadav jeetuyadav200...@gmail.com

 Hi,

 For IO/sec statistics I think MutableCounterLongRate  and
 MutableCounterLong more useful than others and for xceiver thread
 number I'm not bit sure right now.

 Thanks
 Jiitendra
 On Fri, Aug 30, 2013 at 1:40 PM, lei liu liulei...@gmail.com wrote:
 
  Hi  Jitendra,
  If I want to statistics number of bytes read per second,and display the
 result into ganglia, should I use MutableCounterLong or MutableGaugeLong?
 
  If I want to display current xceiver thread number in datanode into
 ganglia, should I use MutableCounterLong or MutableGaugeLong?
 
  Thanks,
  LiuLei
 
 
  2013/8/30 Jitendra Yadav jeetuyadav200...@gmail.com
 
  Hi,
 
  Below link contains the answer for your question.
 
 
 http://hadoop.apache.org/docs/r1.2.0/api/org/apache/hadoop/metrics2/package-summary.html
 
  Regards
  Jitendra
 
  On Fri, Aug 30, 2013 at 11:35 AM, lei liu liulei...@gmail.com wrote:
 
  I use the metrics v2, there are COUNTER and GAUGE metric type in
 metrics v2.
  What is the difference between the two?
 
  Thanks,
  LiuLei
 
 
 




custom writablecomparable with complex fields

2013-08-31 Thread Adeel Qureshi
I want to write a custom writablecomparable object with two List objects
within it ..

public class CompositeKey implements WritableComparable {

    private List<JsonKey> groupBy;
    private List<JsonKey> sortBy;
    ...
}

What I am not sure about is how to write the readFields and write methods for
this object. Any help would be appreciated.

Thanks
Adeel


Re: Subscribe

2013-08-31 Thread Ted Yu
Please send email to:
user-subscr...@hadoop.apache.org


On Sat, Aug 31, 2013 at 12:36 PM, Surendra , Manchikanti 
surendra.manchika...@gmail.com wrote:


 -- Surendra Manchikanti



Re: custom writablecomparable with complex fields

2013-08-31 Thread Harsh J
The idea behind write(…) and readFields(…) is simply that of ordering.
You need to write your custom objects (i.e. a representation of them)
in order, and read them back in the same order.

An example way of serializing a list would be to first serialize the
length (so you know how many items you'll need to read back), and then
serialize each item appropriately, using delimiters or length-prefixes,
just as for the list itself.

Mainly, you're required to tackle the serialization/deserialization on your own.
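
As a minimal sketch of that length-prefix approach, reusing the
CompositeKey/JsonKey names from the question and assuming JsonKey itself
implements Writable and has a no-argument constructor (neither of which is
shown in the thread); compareTo is left as a stub:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.WritableComparable;

// Sketch only: assumes JsonKey implements Writable with a no-arg constructor.
public class CompositeKey implements WritableComparable<CompositeKey> {

    private List<JsonKey> groupBy = new ArrayList<JsonKey>();
    private List<JsonKey> sortBy = new ArrayList<JsonKey>();

    @Override
    public void write(DataOutput out) throws IOException {
        writeList(out, groupBy);
        writeList(out, sortBy);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        groupBy = readList(in);
        sortBy = readList(in);
    }

    private static void writeList(DataOutput out, List<JsonKey> list)
            throws IOException {
        out.writeInt(list.size());          // length prefix first
        for (JsonKey k : list) {
            k.write(out);                   // each element serializes itself
        }
    }

    private static List<JsonKey> readList(DataInput in) throws IOException {
        int n = in.readInt();               // read the length prefix back
        List<JsonKey> list = new ArrayList<JsonKey>(n);
        for (int i = 0; i < n; i++) {
            JsonKey k = new JsonKey();      // assumes a no-arg constructor
            k.readFields(in);
            list.add(k);
        }
        return list;
    }

    @Override
    public int compareTo(CompositeKey other) {
        // Comparison is application-specific; left as a stub here.
        return 0;
    }
}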

This is one of the reasons I highly recommend using a library like
Apache Avro instead. It's more powerful, faster, and yet simple to use:
http://avro.apache.org/docs/current/gettingstartedjava.html and
http://avro.apache.org/docs/current/mr.html. It is also popular and
carries first-grade support on several other hadoop-ecosystem
projects, such as Flume and Crunch.

On Sun, Sep 1, 2013 at 1:23 AM, Adeel Qureshi adeelmahm...@gmail.com wrote:
 I want to write a custom writablecomparable object with two List objects
 within it ..

 public class CompositeKey implements WritableComparable {

 private List<JsonKey> groupBy;
 private List<JsonKey> sortBy;
 ...
 }

 what I am not sure about is how to write

 readFields and write methods for this object. Any help would be appreciated.

 Thanks
 Adeel



-- 
Harsh J


Re: Job config before read fields

2013-08-31 Thread Shahab Yunus
Personally, I don't know a way to access job configuration parameters in a
custom implementation of Writables (at least not an elegant and
appropriate one; of course, hacks of various kinds can be done). Maybe
experts can chime in?

One idea that I thought about was to use MapWritable (if you have not
explored it already). You can encode the 'custom metadata' for your 'data'
as one-byte symbols and move your data through the M/R flow as a map. Then
during deserialization you will have the type (or your 'custom metadata') in
the key part of the map, and the value would be your actual data. This aligns
with the efficient approach that is used natively in Hadoop for
Strings/Text, i.e. compact metadata (though I agree that you are not taking
advantage of the other aspect, the non-dependence between the metadata and
the data it defines).

Take a look at that:
Page 96 of the Definitive Guide:
http://books.google.com/books?id=Nff49D7vnJcCpg=PA96lpg=PA96dq=mapwritable+in+hadoopsource=blots=IiixYu7vXusig=4V6H7cY-MrNT7Rzs3WlODsDOoP4hl=ensa=Xei=aX4iUp2YGoaosASs_YCACQsqi=2ved=0CFUQ6AEwBA#v=onepageq=mapwritable%20in%20hadoopf=false

and then this:
http://www.chrisstucchio.com/blog/2011/mapwritable_sometimes_a_performance_hog.html

and add your own custom types here (note that you are restricted by size of
byte):
http://hadoop.sourcearchive.com/documentation/0.20.2plus-pdfsg1-1/AbstractMapWritable_8java-source.html
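
A minimal sketch of that MapWritable idea, with made-up one-byte tags and
Text values purely for illustration:

import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

// Illustrative only: one-byte tags identifying what each entry means.
public class TaggedRecord {

    private static final ByteWritable METADATA_ID = new ByteWritable((byte) 1);
    private static final ByteWritable PAYLOAD     = new ByteWritable((byte) 2);

    // Pack the metadata identifier and the actual data into one Writable map.
    public static MapWritable wrap(Text metadataId, Text payload) {
        MapWritable map = new MapWritable();
        map.put(METADATA_ID, metadataId);
        map.put(PAYLOAD, payload);
        return map;
    }

    // On the other side of the shuffle, look the entries back up by tag.
    public static Text metadataIdOf(MapWritable map) {
        return (Text) map.get(METADATA_ID);
    }

    public static Text payloadOf(MapWritable map) {
        return (Text) map.get(PAYLOAD);
    }
}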

Regards,
Shahab


On Sat, Aug 31, 2013 at 5:38 AM, Adrian CAPDEFIER chivas314...@gmail.com wrote:

 Thank you for your help Shahab.

 I guess I wasn't being too clear. My logic is that I use a custom type as
 key and in order to deserialize it on the compute nodes, I need an extra
 piece of information (also a custom type).

 To use an analogy, a Text is serialized by writing the length of the
 string as a number and then the bytes that compose the actual string. When
 it is deserialized, the number informs the reader when to stop reading the
 string. This number varies from string to string and it is compact, so it
 makes sense to serialize it with the string.

 My use case is similar to it. I have a complex type (let's call this
 data), and in order to deserialize it, I need another complex type (let's
 call this second type metadata). The metadata is not closely tied to the
 data (i.e. if the data value changes, the metadata does not) and the
 metadata size is quite large.

 I ruled out a couple of options, but please let me know if you think I did
 so for the wrong reasons:
 1. I could serialize each data value with its own metadata value, but
 since the data value count is in the +tens of millions and the metadata
 value distinct count can be up to one hundred, it would waste resources in
 the system.
 2. I could serialize the metadata and then the data as a collection
 property of the metadata. This would be an elegant solution code-wise, but
 then all the data would have to be read and kept in memory as a massive
 object before any reduce operations can happen. I wasn't able to find any
 info on this online so this is just a guess from peeking at the hadoop code.

 My solution was to serialize the data with a hash of the metadata and
 separately serialize the metadata and its hash in the job configuration (as
 key/value pairs). For this to work, I would need to be able to deserialize
 the metadata on the reduce node before the data is deserialized in the
 readFields() method.

 I think that for that to happen I need to hook into the code somewhere
 where a context or job configuration is used (before readFields()), but I'm
 stumped as to where that is.
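
 One avenue that may be worth checking (not something confirmed in this
 thread): with the default WritableSerialization, the framework creates key
 instances via ReflectionUtils.newInstance(keyClass, conf), which calls
 setConf() on anything that implements Configurable, so a key class that
 implements both WritableComparable and Configurable may see the job
 configuration before readFields() runs on the reduce side. A sketch of that
 shape, with hypothetical field and configuration-key names, to be verified
 against your Hadoop version:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.WritableComparable;

// Sketch only: whether setConf() runs before readFields() depends on how
// the framework creates the instance, so verify on your Hadoop version.
public class MetadataAwareKey
        implements WritableComparable<MetadataAwareKey>, Configurable {

    private Configuration conf;
    private String metadataHash;    // hypothetical field

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;           // job configuration injected here, if at all
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(metadataHash);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        metadataHash = in.readUTF();
        if (conf != null) {
            // Hypothetical lookup: metadata previously serialized into the
            // job configuration under a key derived from the hash.
            String serialized = conf.get("metadata." + metadataHash);
            // ... rebuild the metadata object from 'serialized' ...
        }
    }

    @Override
    public int compareTo(MetadataAwareKey other) {
        return metadataHash.compareTo(other.metadataHash);
    }
}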

 Cheers,
 Adi


 On Sat, Aug 31, 2013 at 3:42 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 What I meant was that you might have to split or redesign your logic or
 your usecase (which we don't know about)?

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 10:31 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 But how would the comparator have access to the job config?


 On Sat, Aug 31, 2013 at 2:38 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 I think you have to override/extend the Comparator to achieve that,
 something like what is done in Secondary Sort?

 Regards,
 Shahab


 On Fri, Aug 30, 2013 at 9:01 PM, Adrian CAPDEFIER 
 chivas314...@gmail.com wrote:

 Howdy,

 I apologise for the lack of code in this message, but the code is
 fairly convoluted and it would obscure my problem. That being said, I can
 put together some sample code if really needed.

 I am trying to pass some metadata between the map  reduce steps. This
 metadata is read and generated in the map step and stored in the job
 config. It also needs to be recreated on the reduce node before the key/
 value fields can be read in the readFields function.

 I had assumed that I would be able to override the Reducer.setup()
 function and that would be it, but apparently the readFields function is
 called before the Reducer.setup() function.

 My question is what is any (the best) place on the