Re: Missing records from HDFS
Thanks for your response, Azuryy.

My Hadoop version: 2.0.0-cdh4.3.0. InputFormat: a custom class that extends FileInputFormat (a CSV input format). The files are under the same directory, in different files. My input path is configured through Oozie via the property mapred.input.dir. The same code and input running on Hadoop 2.0.0-cdh4.2.1 works fine and does not discard any record.

Thanks.

From: Azuryy Yu <azury...@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Thursday, 21 November 2013 07:31
To: user@hadoop.apache.org
Subject: Re: Missing records from HDFS

What's your Hadoop version, and which InputFormat are you using? Are these files under one directory, or are there lots of subdirectories? How did you configure the input path in your main?

On Thu, Nov 21, 2013 at 12:25 AM, ZORAIDA HIDALGO SANCHEZ <zora...@tid.es> wrote:

Hi all,

my job is not reading all the input records. In the input directory I have a set of files containing a total of 6,000,000 records, but only 5,997,000 are processed; the Map Input Records counter says 5,997,000. I have tried downloading the files with a getmerge to check how many records would be returned, and the correct number (6,000,000) is returned. Do you have any suggestions?

Thanks.
RE: Any reference for upgrade hadoop from 1.x to 2.2
Hi All,

I am also looking into migrating/upgrading from Apache Hadoop 1.x to Apache Hadoop 2.x. I didn't find any docs, guides, or blog posts for this. There are guides for the CDH and HDP migrations from Hadoop 1.x to Hadoop 2.x; would referring to those be of some use? I am looking for similar guides for plain Apache Hadoop 1.x to Apache Hadoop 2.x. I found something on SlideShare, though I am not sure how useful it is going to be; I still need to verify that:

http://www.slideshare.net/mikejf12/an-example-apache-hadoop-yarn-upgrade

Any suggestions/comments will be of great help.

Thanks,
-Nirmal

From: Jilal Oussama [mailto:jilal.ouss...@gmail.com]
Sent: Friday, November 08, 2013 9:13 PM
To: user@hadoop.apache.org
Subject: Re: Any reference for upgrade hadoop from 1.x to 2.2

I am looking for the same thing; if anyone can point us in a good direction, please do. Thank you. (Currently running Hadoop 1.2.1)

2013/11/1 YouPeng Yang <yypvsxf19870...@gmail.com>

Hi users,

Are there any reference docs that introduce how to upgrade Hadoop from 1.x to 2.2?

Regards
Re: Any reference for upgrade hadoop from 1.x to 2.2
For MapReduce and YARN, we recently published a couple of blog posts on migrating:

http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-users/
http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-operators/

Hope that helps,
Sandy

On Fri, Nov 22, 2013 at 3:03 AM, Nirmal Kumar <nirmal.ku...@impetus.co.in> wrote: [...]
Re: Difference between clustering and classification in hadoop
Thank you, Mirko.

On Fri, Nov 22, 2013 at 2:11 PM, Mirko Kämpf <mirko.kae...@gmail.com> wrote:

... it depends on the implementation. ;-) Mahout offers both; see Mahout in Action: http://manning.com/owen/

And for more:
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/Statistical_classification

Good luck!
Mirko

2013/11/22 unmesha sreeveni <unmeshab...@gmail.com>

What are the differences between classification algorithms and clustering algorithms in Hadoop?

--
Thanks & Regards
Unmesha Sreeveni U.B
Junior Developer
Re: HDFS upgrade problem of fsImage
Yes, realized that, and I see your point :-) However, it seems some fs inconsistency is present; did you attempt rollback/finalizeUpgrade and check? For that error, FSImage.java finds a previous fs state:

    // Upgrade is allowed only if there are
    // no previous fs states in any of the directories
    for (Iterator<StorageDirectory> it = storage.dirIterator(); it.hasNext();) {
      StorageDirectory sd = it.next();
      if (sd.getPreviousDir().exists())
        throw new InconsistentFSStateException(sd.getRoot(),
            "previous fs state should not exist during upgrade. "
            + "Finalize or rollback first.");
    }

Thanks,
Rekha

From: Azuryy Yu <azury...@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Thursday 21 November 2013 5:19 PM
To: user@hadoop.apache.org
Cc: hdfs-...@hadoop.apache.org
Subject: Re: HDFS upgrade problem of fsImage

I insist on a hot upgrade on the test cluster because I want a hot upgrade on the prod cluster.

On 2013-11-21 7:23 PM, Joshi, Rekha <rekha_jo...@intuit.com> wrote:

Hi Azuryy,

This error occurs when FSImage finds a previous fs state, and as the log states, you need to either finalizeUpgrade or rollback to proceed:

    bin/hadoop dfsadmin -finalizeUpgrade
    hadoop dfsadmin -rollback

On a side note, for a small test cluster on which one might suspect you are the only user, why would you insist on a hot upgrade? :-)

Thanks,
Rekha

Some helpful guidelines for upgrade here:
http://wiki.apache.org/hadoop/Hadoop_Upgrade
https://twiki.grid.iu.edu/bin/view/Storage/HadoopUpgrade
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/Federation.html#Upgrading_from_older_release_to_0.23_and_configuring_federation

From: Azuryy Yu <azury...@gmail.com>
Reply-To: user@hadoop.apache.org
Date: Thursday 21 November 2013 9:48 AM
To: hdfs-...@hadoop.apache.org, user@hadoop.apache.org
Subject: HDFS upgrade problem of fsImage

Hi Dear,

I have a small test cluster with hadoop-2.0.x and HA configured, and I want to upgrade it to hadoop-2.2. I don't want to stop the cluster during the upgrade, so my steps are:

1) on the standby NN: hadoop-daemon.sh stop namenode
2) remove the HA configuration in the conf
3) hadoop-daemon.sh start namenode -upgrade -clusterID test-cluster

But there is an exception in the NN log, so how do I upgrade without stopping the whole cluster? Thanks.

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /hdfs/name is in an inconsistent state: previous fs state should not exist during upgrade. Finalize or rollback first.
    at org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:323)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:248)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:858)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:620)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:445)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:494)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:692)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:677)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1345)
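For reference, a minimal sketch of the sequence Rekha suggests, using only commands that already appear in this thread (the cluster ID is the one from Azuryy's mail; adjust for your setup):

    # Clear the previous fs state left by an earlier, unfinalized upgrade
    # (run against the running NameNode), or roll back instead:
    bin/hadoop dfsadmin -finalizeUpgrade
    # then retry the upgrade start:
    hadoop-daemon.sh stop namenode
    hadoop-daemon.sh start namenode -upgrade -clusterID test-cluster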
Re: HDFS upgrade problem of fsImage
Thanks Joshi,

Maybe I pasted the wrong log messages. Please look here for the real story: https://issues.apache.org/jira/browse/HDFS-5550

On Fri, Nov 22, 2013 at 6:25 PM, Joshi, Rekha <rekha_jo...@intuit.com> wrote: [...]
Re: Missing records from HDFS
One more thing: if we split the files, then all the records are processed. The files are 70.5 MB each.

Thanks,
Zoraida.-

From: zoraida <zora...@tid.es>
Date: Friday, 22 November 2013 08:59
To: user@hadoop.apache.org
Subject: Re: Missing records from HDFS

[...]
Re: Missing records from HDFS
I do think this is because of your RecordReader. Can you paste your code here, and give a piece of example data? Please use pastebin if you want.

On Fri, Nov 22, 2013 at 7:16 PM, ZORAIDA HIDALGO SANCHEZ <zora...@tid.es> wrote: [...]
Re: Missing records from HDFS
Sure, our FileInputFormat implementation:

    public class CVSInputFormat
        extends FileInputFormat<FileValidatorDescriptor, Text> {

      /*
       * (non-Javadoc)
       *
       * @see org.apache.hadoop.mapreduce.InputFormat#createRecordReader(
       * org.apache.hadoop.mapreduce.InputSplit,
       * org.apache.hadoop.mapreduce.TaskAttemptContext)
       */
      @Override
      public RecordReader<FileValidatorDescriptor, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        String delimiter = context.getConfiguration().get(
            "textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
          recordDelimiterBytes = delimiter.getBytes();
        return new CVSLineRecordReader(recordDelimiterBytes);
      }

      /*
       * (non-Javadoc)
       *
       * @see org.apache.hadoop.mapreduce.lib.input.FileInputFormat#isSplitable(
       * org.apache.hadoop.mapreduce.JobContext, org.apache.hadoop.fs.Path)
       */
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec = new CompressionCodecFactory(
            context.getConfiguration()).getCodec(file);
        return codec == null;
      }
    }

The RecordReader:

    public class CVSLineRecordReader
        extends RecordReader<FileValidatorDescriptor, Text> {

      private static final Log LOG =
          LogFactory.getLog(CVSLineRecordReader.class);

      public static final String CVS_FIRST_LINE = "file.first.line";

      private long start;
      private long pos;
      private long end;
      private LineReader in;
      private int maxLineLength;
      private FileValidatorDescriptor key = null;
      private Text value = null;
      private Text data = null;
      private byte[] recordDelimiterBytes;

      public CVSLineRecordReader(byte[] recordDelimiter) {
        this.recordDelimiterBytes = recordDelimiter;
      }

      @Override
      public void initialize(InputSplit genericSplit, TaskAttemptContext context)
          throws IOException {
        Properties properties = new Properties();
        Configuration configuration = context.getConfiguration();
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
            .getConfiguration());
        for (Path cacheFile : cacheFiles) {
          if (cacheFile.toString().endsWith(
              context.getConfiguration().get(VALIDATOR_CONF_PATH))) {
            properties.load(new FileReader(cacheFile.toString()));
          }
        }
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
            Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        pos = start;
        final Path file = split.getPath();
        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());
        this.in = generateReader(fileIn, job);
        // if CVS_FIRST_LINE does not exist in conf then the csv file first
        // line is the header
        if (properties.containsKey(CVS_FIRST_LINE)) {
          configuration.set(CVS_FIRST_LINE, properties.get(CVS_FIRST_LINE)
              .toString());
        } else {
          readData();
          configuration.set(CVS_FIRST_LINE, data.toString());
          if (start != 0) {
            fileIn.seek(start);
            in = generateReader(fileIn, job);
            pos = start;
          }
        }
        key = new FileValidatorDescriptor();
        key.setFileName(split.getPath().getName());
        context.getConfiguration().set("file.name", key.getFileName());
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        int newSize = readData();
        if (newSize == 0) {
          key = null;
          value = null;
          return false;
        } else {
          key.setOffset(this.pos);
          value = data;
          return true;
        }
      }

      private LineReader generateReader(FSDataInputStream fileIn,
          Configuration job) throws IOException {
        if (null == this.recordDelimiterBytes) {
          return new LineReader(fileIn, job);
        } else {
          return new LineReader(fileIn, job, this.recordDelimiterBytes);
        }
      }

      private int readData() throws IOException {
        if (data == null) {
          data = new Text();
        }
        int newSize = 0;
        while (pos < end) {
          newSize = in.readLine(data, maxLineLength,
              Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                  maxLineLength));
          if (newSize == 0) {
Re: Problem sending metrics to multiple targets
We investigated the problem and found the root cause. The Metrics2 framework uses a different config parser from the first Metrics version (Metrics2 uses Apache Commons; Metrics used Hadoop's own). org.apache.hadoop.metrics2.sink.ganglia.AbstractGangliaSink treats commas as separators by default, so when we provide a comma-separated list of servers the parser returns everything up to the first separator, i.e. only the first server in the list.

But we were able to find a workaround. The class that parses the server list (org.apache.hadoop.metrics2.util.Servers) handles both commas and spaces. That means if we provide a space-separated list of servers instead of a comma-separated one, the new parser reads the whole list. After that, all servers are registered as metrics receivers and metrics are sent to all of them.

On Thu, Jan 17, 2013 at 7:17 PM, Ivan Tretyakov <itretya...@griddynamics.com> wrote:

Hi!

We have the following problem. There are three target hosts to send metrics to: 192.168.1.111:8649, 192.168.1.113:8649, 192.168.1.115:8649 (node01, node03, node05). But, for example, the datanode (using org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31) sends some metrics to the first target host and others to the second and third, so some metrics are missing on the second and third nodes. When gmetad collects metrics from one of those, we cannot see certain metrics in Ganglia.

E.g., on node07 only one process sends metrics to Ganglia, the datanode process, and we can see the following using tcpdump. Dumping traffic for about three minutes:

    $ sudo -i tcpdump dst port 8649 and src host node07 | tee tcpdump.out
    ...
    $ head -n1 tcpdump.out
    12:18:05.559719 IP node07.dom.local.43350 > node01.dom.local.8649: UDP, length 180
    $ tail -n1 tcpdump.out
    12:20:59.575144 IP node

Then count packets and bytes sent to each target:

    $ grep node01 tcpdump.out | wc -l
    5972
    $ grep node03 tcpdump.out | wc -l
    3812
    $ grep node05 tcpdump.out | wc -l
    3811
    $ grep node01 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
    1048272
    $ grep node03 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
    731604
    $ grep node05 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
    731532

Also, we can ask the gmond daemons which metrics they have:

    $ nc node01 8649 | grep ProcessName_DataNode | head -n1
    <METRIC NAME="jvm.JvmMetrics.ProcessName_DataNode.LogFatal" VAL="0" TYPE="float" UNITS="" TN="0" TMAX="60" DMAX="0" SLOPE="positive"/>
    $ nc node03 8649 | grep ProcessName_DataNode | head -n1
    $ nc node05 8649 | grep ProcessName_DataNode | head -n1
    $ nc node01 8649 | grep ProcessName_DataNode | wc -l
    100
    $ nc node03 8649 | grep ProcessName_DataNode | wc -l
    0
    $ nc node05 8649 | grep ProcessName_DataNode | wc -l
    0

We can see that only the first collector node from the list has certain metrics.

Hadoop version we use:
- MapReduce 2.0.0-mr1-cdh4.1.1
- HDFS 2.0.0-cdh4.1.1

hadoop-metrics2.properties content:

    datanode.period=20
    datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
    datanode.sink.ganglia.servers=192.168.1.111:8649,192.168.1.113:8649,192.168.1.115:8649
    datanode.sink.ganglia.tagsForPrefix.jvm=*
    datanode.sink.ganglia.tagsForPrefix.dfs=*
    datanode.sink.ganglia.tagsForPrefix.rpc=*
    datanode.sink.ganglia.tagsForPrefix.rpcdetailed=*
    datanode.sink.ganglia.tagsForPrefix.metricssystem=*

--
Best Regards
Ivan Tretyakov
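Following the workaround described above, the only change needed in hadoop-metrics2.properties is the separator: a space-separated server list instead of the comma-separated one (the addresses are the example ones from this thread):

    # Space-separated, so org.apache.hadoop.metrics2.util.Servers parses every entry
    datanode.sink.ganglia.servers=192.168.1.111:8649 192.168.1.113:8649 192.168.1.115:8649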
Re: Any reference for upgrade hadoop from 1.x to 2.2
Thanks Sandy! These seem helpful!

"MapReduce cluster configuration options have been split into YARN configuration options, which go in yarn-site.xml; and MapReduce configuration options, which go in mapred-site.xml. Many have been given new names to reflect the shift. ... *We'll follow up with a full translation table in a future post.*"

This type of translation table, mapping old configuration options to new ones, would be *very* useful!

- Robert

On Fri, Nov 22, 2013 at 2:15 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote: [...]
Windows - Separating etc (config) from bin
It would be nice if HADOOP_CONF_DIR could be set in the environment like YARN_CONF_DIR. This could be done in libexec\hadoop-config.cmd by setting HADOOP_CONF_DIR conditionally:

    if not defined HADOOP_CONF_DIR (
      set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop
    )

A similar change might be made in hadoop-config.sh. Thus bin could live under Program Files, and Program Files could be locked down from modification, while the configuration files remain in a separate directory. (--config didn't seem to work for namenode.)
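For the hadoop-config.sh side, a hypothetical equivalent of the .cmd snippet above could look like this (an untested sketch, not the actual script contents):

    # Only default HADOOP_CONF_DIR when the environment has not already set it.
    if [ -z "$HADOOP_CONF_DIR" ]; then
      export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
    fi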
Heterogeneous Cluster
Has anyone set up a heterogeneous cluster, with some Windows nodes and some Linux nodes?
Re: Difference between clustering and classification in hadoop
Thanks Devin :) That was a nice explanation.

On Fri, Nov 22, 2013 at 6:20 PM, Devin Suiter RDX <dsui...@rdx.com> wrote:

They are both for machine learning. Classification is known as supervised learning, where you feed the engine data of known patterns and instruct it as to which are the key features. Clustering is unsupervised learning, where you let the algorithm guess at what is significant in the correlations it picks up. Spam filtering is a popular example of classification, and image indexing is a popular example of clustering. They are mainly run on Hadoop because, when it comes to machine learning, the more data that passes through the algorithm the more accurate it should be, and Hadoop can handle large data better than anything else around at the moment.

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Fri, Nov 22, 2013 at 2:54 AM, unmesha sreeveni <unmeshab...@gmail.com> wrote: [...]
Re: Difference between clustering and classification in hadoop
When I went through different repos for spam data, I only found files of a few MB. To check in Hadoop we need a large file, right? I need to test my Hadoop SVM implementation. I went through http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/, but that dataset is only around 700 KB. I need a similar but larger dataset.

On Sat, Nov 23, 2013 at 8:35 AM, unmesha sreeveni <unmeshab...@gmail.com> wrote: [...]

--
Thanks & Regards
Unmesha Sreeveni U.B
Junior Developer
Re: Missing records from HDFS
There is a problem in 'initialize'. In general, you cannot treat split.getStart() as the real start, because a FileSplit does not split exactly at the end of a line, so you need to adjust the start in 'initialize' to the start of a line whenever start is not equal to 0. Likewise, end = start + split.getLength() is not the real end, because it may not land on the end of a line. So the reader MUST adjust the real start and end in 'initialize'; otherwise it may miss some records (see the sketch after the quoted code below).

On Fri, Nov 22, 2013 at 8:31 PM, ZORAIDA HIDALGO SANCHEZ <zora...@tid.es> wrote:

Sure, our FileInputFormat implementation: [...]
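To make the adjustment concrete, here is a minimal sketch of the boundary handling described above, modeled on Hadoop's own LineRecordReader. The field names (start, pos, end, in, maxLineLength, value) match the CVSLineRecordReader quoted earlier in this thread, but this is an illustrative sketch, not the poster's code:

    // Sketch of split-boundary handling, modeled on Hadoop's LineRecordReader.
    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException {
      FileSplit split = (FileSplit) genericSplit;
      Configuration job = context.getConfiguration();
      start = split.getStart();
      end = start + split.getLength();
      FileSystem fs = split.getPath().getFileSystem(job);
      FSDataInputStream fileIn = fs.open(split.getPath());
      fileIn.seek(start);
      in = new LineReader(fileIn, job);
      if (start != 0) {
        // This split starts mid-line: the partial first line belongs to the
        // previous split, so skip it and advance 'start' past it.
        start += in.readLine(new Text(), 0, Integer.MAX_VALUE);
      }
      pos = start;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      if (value == null) {
        value = new Text();
      }
      // Read while the current position is at or before 'end': the line that
      // begins before the split boundary is consumed in full, even if it ends
      // past 'end'. A plain 'pos < end' test drops that last record.
      if (pos > end) {
        return false;
      }
      int newSize = in.readLine(value, maxLineLength, Integer.MAX_VALUE);
      if (newSize == 0) {
        return false; // end of file
      }
      pos += newSize;
      return true;
    }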
Decision Tree - Help
Can we implement a decision tree as a MapReduce job? Which algorithms can be converted into MapReduce jobs?

Thanks,
Unmesha