Re: Missing records from HDFS

2013-11-22 Thread ZORAIDA HIDALGO SANCHEZ
Thanks for your response Azuryy.

My hadoop version: 2.0.0-cdh4.3.0
InputFormat: a custom class that extends FileInputFormat (CSV input format)
The files are all under the same directory (separate files).
My input path is configured through Oozie via the property 
mapred.input.dir.


The same code and input run fine on Hadoop 2.0.0-cdh4.2.1; no records are 
discarded.

Thanks.

From: Azuryy Yu azury...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Thursday, 21 November 2013 07:31
To: user@hadoop.apache.org
Subject: Re: Missing records from HDFS

What's your Hadoop version, and which InputFormat are you using?

Are these files under one directory, or are there lots of subdirectories? How did you 
configure the input path in your main?



On Thu, Nov 21, 2013 at 12:25 AM, ZORAIDA HIDALGO SANCHEZ 
zora...@tid.es wrote:
Hi all,

my job is not reading all the input records. In the input directory I have a 
set of files containing a total of 600 records, but only 5997000 are 
processed. The Map Input Records counter says 5997000.
I have tried downloading the files with a getmerge to check how many records 
would be returned, and the correct number (600) is returned.
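
A check along those lines can be run from the shell; the path below is a placeholder for the actual input directory:

  hadoop fs -getmerge /path/to/input merged.csv   # merge the HDFS input files into one local file
  wc -l merged.csv                                # count the records, assuming one record per line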

Do you have any suggestion?

Thanks.





RE: Any reference for upgrade hadoop from 1.x to 2.2

2013-11-22 Thread Nirmal Kumar
Hi All,

I am also looking into migrating/upgrading from Apache Hadoop 1.x to Apache 
Hadoop 2.x.
I didn't find any docs/guides/blogs for this.
There are guides/docs for the CDH and HDP migrations/upgrades from 
Hadoop 1.x to Hadoop 2.x; would referring to those be of some use?

I am looking for similar guides/docs for Apache Hadoop 1.x to Apache Hadoop 2.x.

I found something on SlideShare, though I am not sure how useful it is going 
to be; I still need to verify that.
http://www.slideshare.net/mikejf12/an-example-apache-hadoop-yarn-upgrade

Any suggestions/comments will be of great help.

Thanks,
-Nirmal

From: Jilal Oussama [mailto:jilal.ouss...@gmail.com]
Sent: Friday, November 08, 2013 9:13 PM
To: user@hadoop.apache.org
Subject: Re: Any reference for upgrade hadoop from 1.x to 2.2

I am looking for the same thing; if anyone can point us in a good direction, 
please do.
Thank you.

(Currently running Hadoop 1.2.1)

2013/11/1 YouPeng Yang 
yypvsxf19870...@gmail.com
Hi users

   Are there any reference docs that introduce how to upgrade Hadoop from 1.x to 
2.2?

Regards











Re: Any reference for upgrade hadoop from 1.x to 2.2

2013-11-22 Thread Sandy Ryza
For MapReduce and YARN, we recently published a couple blog posts on
migrating:
http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-users/
http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-operators/

hope that helps,
Sandy


On Fri, Nov 22, 2013 at 3:03 AM, Nirmal Kumar nirmal.ku...@impetus.co.in wrote:

  [quoted message trimmed]



Unsubscribe

2013-11-22 Thread Thomas Bailet




Re: Difference between clustering and classification in hadoop

2013-11-22 Thread unmesha sreeveni
Thank you Mirko


On Fri, Nov 22, 2013 at 2:11 PM, Mirko Kämpf mirko.kae...@gmail.com wrote:

 ... it depends on the implementation. ;-)

 Mahout offers both; see Mahout in Action: http://manning.com/owen/

 And there is more:

   http://en.wikipedia.org/wiki/Cluster_analysis
   http://en.wikipedia.org/wiki/Statistical_classification

 Good luck!
 Mirko




 2013/11/22 unmesha sreeveni unmeshab...@gmail.com

 What are the differences between classification algorithms and clustering
 algorithms in Hadoop?


 --
 *Thanks & Regards*

 Unmesha Sreeveni U.B

 *Junior Developer*






-- 
*Thanks & Regards*

Unmesha Sreeveni U.B

*Junior Developer*


Re: HDFS upgrade problem of fsImage

2013-11-22 Thread Joshi, Rekha
Yes, realized that, and I see your point :-) However, it seems like some fs 
inconsistency is present; did you attempt a rollback/finalizeUpgrade and check?

For that error, the FSImage.java code finds a previous fs state:

// Upgrade is allowed only if there are
// no previous fs states in any of the directories
for (Iterator<StorageDirectory> it = storage.dirIterator(); it.hasNext();) {
  StorageDirectory sd = it.next();
  if (sd.getPreviousDir().exists())
    throw new InconsistentFSStateException(sd.getRoot(),
        "previous fs state should not exist during upgrade. "
        + "Finalize or rollback first.");
}


Thanks

Rekha


From: Azuryy Yu azury...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Thursday 21 November 2013 5:19 PM
To: user@hadoop.apache.org
Cc: hdfs-...@hadoop.apache.org
Subject: Re: HDFS upgrade problem of fsImage
Subject: Re: HDFS upgrade problem of fsImage


I insist on a hot upgrade on the test cluster because I want a hot upgrade on the prod 
cluster.

On 2013-11-21 7:23 PM, Joshi, Rekha 
rekha_jo...@intuit.com wrote:

Hi Azuryy,

This error occurs when FSImage finds a previous fs state, and as the log states, you 
would need to either finalizeUpgrade or rollback to proceed. See below:

bin/hadoop dfsadmin -finalizeUpgrade
hadoop dfsadmin -rollback

On a side note, for a small test cluster on which one might suspect you are the 
only user, why insist on a hot upgrade? :-)

Thanks
Rekha


Some helpful guidelines for upgrade here -

http://wiki.apache.org/hadoop/Hadoop_Upgrade

https://twiki.grid.iu.edu/bin/view/Storage/HadoopUpgrade

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/Federation.html#Upgrading_from_older_release_to_0.23_and_configuring_federation


From: Azuryy Yu azury...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Thursday 21 November 2013 9:48 AM
To: hdfs-...@hadoop.apache.org, user@hadoop.apache.org
Subject: HDFS upgrade problem of fsImage
Subject: HDFS upgrade problem of fsImage

Hi Dear,

I have a small test cluster with hadoop-2.0.x and HA configured, but I want 
to upgrade to hadoop-2.2.

I don't want to stop the cluster during the upgrade, so my steps are:

1)  on the standby NN: hadoop-daemon.sh stop namenode
2)  remove the HA configuration from the conf
3)  hadoop-daemon.sh start namenode -upgrade -clusterID test-cluster

but there is an Exception in the NN log, so how can I upgrade without stopping the whole 
cluster?
Thanks.


org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory 
/hdfs/name is in an inconsistent state: previous fs state should not exist 
during upgrade. Finalize or rollback first.
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.doUpgrade(FSImage.java:323)
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:248)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:858)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:620)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:445)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:494)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:692)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:677)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1345)


Re: HDFS upgrade problem of fsImage

2013-11-22 Thread Azuryy Yu
Thanks Joshi,

Maybe I pasted the wrong log messages.

Please look here for the real story:

https://issues.apache.org/jira/browse/HDFS-5550




On Fri, Nov 22, 2013 at 6:25 PM, Joshi, Rekha rekha_jo...@intuit.com wrote:

  [quoted message trimmed]




Re: Missing records from HDFS

2013-11-22 Thread ZORAIDA HIDALGO SANCHEZ
One more thing,

if we split the files, then all the records are processed. The files are 70.5 MB.

Thanks,

Zoraida.-

From: zoraida zora...@tid.es
Date: Friday, 22 November 2013 08:59
To: user@hadoop.apache.org
Subject: Re: Missing records from HDFS

 [earlier messages in the thread trimmed]


Re: Missing records from HDFS

2013-11-22 Thread Azuryy Yu
I do think this is because of your RecordReader. Can you paste your code
here and give a small piece of example data?

Please use pastebin if you want.


On Fri, Nov 22, 2013 at 7:16 PM, ZORAIDA HIDALGO SANCHEZ zora...@tid.es wrote:

  [quoted messages trimmed]



Re: Missing records from HDFS

2013-11-22 Thread ZORAIDA HIDALGO SANCHEZ
Sure,

our FileInputFormat implementation:


public class CVSInputFormat extends
        FileInputFormat<FileValidatorDescriptor, Text> {

    /*
     * (non-Javadoc)
     *
     * @see
     * org.apache.hadoop.mapreduce.InputFormat#createRecordReader(org.apache
     * .hadoop.mapreduce.InputSplit,
     * org.apache.hadoop.mapreduce.TaskAttemptContext)
     */
    @Override
    public RecordReader<FileValidatorDescriptor, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        String delimiter = context.getConfiguration().get(
                "textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter)
            recordDelimiterBytes = delimiter.getBytes();
        return new CVSLineRecordReader(recordDelimiterBytes);
    }

    /*
     * (non-Javadoc)
     *
     * @see
     * org.apache.hadoop.mapreduce.lib.input.FileInputFormat#isSplitable(org
     * .apache.hadoop.mapreduce.JobContext, org.apache.hadoop.fs.Path)
     */
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec = new CompressionCodecFactory(
                context.getConfiguration()).getCodec(file);
        return codec == null;
    }
}


the recordReader:



public class CVSLineRecordReader extends
        RecordReader<FileValidatorDescriptor, Text> {

    private static final Log LOG = LogFactory.getLog(CVSLineRecordReader.class);

    public static final String CVS_FIRST_LINE = "file.first.line";

    private long start;
    private long pos;
    private long end;
    private LineReader in;
    private int maxLineLength;
    private FileValidatorDescriptor key = null;
    private Text value = null;
    private Text data = null;
    private byte[] recordDelimiterBytes;

    public CVSLineRecordReader(byte[] recordDelimiter) {
        this.recordDelimiterBytes = recordDelimiter;
    }

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
        Properties properties = new Properties();
        Configuration configuration = context.getConfiguration();

        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());
        for (Path cacheFile : cacheFiles) {
            if (cacheFile.toString().endsWith(
                    context.getConfiguration().get(VALIDATOR_CONF_PATH))) {
                properties.load(new FileReader(cacheFile.toString()));
            }
        }

        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength",
                Integer.MAX_VALUE);
        start = split.getStart();
        end = start + split.getLength();
        pos = start;
        final Path file = split.getPath();

        // open the file and seek to the start of the split
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

        this.in = generateReader(fileIn, job);

        // if CVS_FIRST_LINE does not exist in conf then the csv file first line
        // is the header
        if (properties.containsKey(CVS_FIRST_LINE)) {
            configuration.set(CVS_FIRST_LINE, properties.get(CVS_FIRST_LINE)
                    .toString());
        } else {
            readData();
            configuration.set(CVS_FIRST_LINE, data.toString());
            if (start != 0) {
                fileIn.seek(start);
                in = generateReader(fileIn, job);
                pos = start;
            }
        }

        key = new FileValidatorDescriptor();
        key.setFileName(split.getPath().getName());
        context.getConfiguration().set("file.name", key.getFileName());
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        int newSize = readData();
        if (newSize == 0) {
            key = null;
            value = null;
            return false;
        } else {
            key.setOffset(this.pos);
            value = data;
            return true;
        }
    }

    private LineReader generateReader(FSDataInputStream fileIn,
            Configuration job) throws IOException {
        if (null == this.recordDelimiterBytes) {
            return new LineReader(fileIn, job);
        } else {
            return new LineReader(fileIn, job, this.recordDelimiterBytes);
        }
    }

    private int readData() throws IOException {
        if (data == null) {
            data = new Text();
        }
        int newSize = 0;
        while (pos < end) {
            newSize = in.readLine(data, maxLineLength,
                    Math.max((int) Math.min(Integer.MAX_VALUE, end - pos),
                            maxLineLength));
            if (newSize == 0) {

Re: Problem sending metrics to multiple targets

2013-11-22 Thread Ivan Tretyakov
We investigated the problem and found the root cause. The Metrics2 framework uses
a different config parser than the first version (Metrics2 uses apache-commons,
Metrics uses Hadoop's own), and
org.apache.hadoop.metrics2.sink.ganglia.AbstractGangliaSink uses
commas as separators by default. So when we provide a list of servers, the parser
returns everything up to the first separator, i.e. only the first server from the
list.
But we were able to find a workaround. The class that parses the servers
list (org.apache.hadoop.metrics2.util.Servers) handles commas and
spaces. That means if we provide a space-separated list of servers instead
of a comma-separated one, the parser reads the whole servers list.
After that, all servers are registered as metrics receivers and metrics
are sent to all of them.
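
For reference, the workaround in hadoop-metrics2.properties looks like the sink
configuration quoted below, with the servers list separated by spaces rather than
commas (addresses taken from the original report):

datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
# space-separated, so the commons parser does not cut the list at the first comma
datanode.sink.ganglia.servers=192.168.1.111:8649 192.168.1.113:8649 192.168.1.115:8649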


On Thu, Jan 17, 2013 at 7:17 PM, Ivan Tretyakov itretya...@griddynamics.com
 wrote:

 Hi!

 We have the following problem.

 There are three target hosts to send metrics: 192.168.1.111:8649,
 192.168.1.113:8649,192.168.1.115:8649 (node01, node03, node05).
 But, for example, the datanode (using
 org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31) sends some metrics to
 the first target host and others to the second and third.
 So some metrics are missing on the second and third nodes. When gmetad collects
 metrics from one of those, we cannot see certain metrics in Ganglia.

 E.g. node07 runs only one process that sends metrics to Ganglia (the
 datanode process), and we can see the following using tcpdump.

 Dumping traffic for about three minutes:
 $ sudo -i tcpdump dst port 8649 and src host node07 | tee tcpdump.out
 ...
 $ head -n1 tcpdump.out
 12:18:05.559719 IP node07.dom.local.43350 > node01.dom.local.8649: UDP,
 length 180
 $ tail -n1 tcpdump.out
 12:20:59.575144 IP node

 Then count packets and bytes sent to each target:
 $ grep node01 tcpdump.out | wc -l
 5972
 $ grep node03 tcpdump.out | wc -l
 3812
 $ grep node05 tcpdump.out | wc -l
 3811
 $ grep node01 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
 1048272
 $ grep node03 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
 731604
 $ grep node05 tcpdump.out | awk 'BEGIN{sum=0}{sum=sum+$8}END{print sum}'
 731532

 Also, we could ask the gmond daemons which metrics they have:

 $ nc node01 8649 | grep ProcessName_DataNode | head -n1
 <METRIC NAME="jvm.JvmMetrics.ProcessName_DataNode.LogFatal" VAL="0"
 TYPE="float" UNITS="" TN="0" TMAX="60" DMAX="0" SLOPE="positive">
 $ nc node03 8649 | grep ProcessName_DataNode | head -n1
 $ nc node05 8649 | grep ProcessName_DataNode | head -n1
 $ nc node01 8649 | grep ProcessName_DataNode | wc -l
 100
 $ nc node03 8649 | grep ProcessName_DataNode | wc -l
 0
 $ nc node05 8649 | grep ProcessName_DataNode | wc -l
 0

 We could see that only the first collector node from the list has certain
 metrics.

 Hadoop version we use:
 - MapReduce 2.0.0-mr1-cdh4.1.1
 - HDFS 2.0.0-cdh4.1.1

 hadoop-metrics2.properties content:

 datanode.period=20

 datanode.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
 datanode.sink.ganglia.servers=192.168.1.111:8649,192.168.1.113:8649,
 192.168.1.115:8649
  datanode.sink.ganglia.tagsForPrefix.jvm=*
 datanode.sink.ganglia.tagsForPrefix.dfs=*
 datanode.sink.ganglia.tagsForPrefix.rpc=*
 datanode.sink.ganglia.tagsForPrefix.rpcdetailed=*
 datanode.sink.ganglia.tagsForPrefix.metricssystem=*

 --
 Best Regards
 Ivan Tretyakov




-- 
Best Regards
Ivan Tretyakov


Re: Any reference for upgrade hadoop from 1.x to 2.2

2013-11-22 Thread Robert Dyer
Thanks Sandy! These seem helpful!

MapReduce cluster configuration options have been split into YARN
configuration options, which go in yarn-site.xml; and MapReduce
configuration options, which go in mapred-site.xml. Many have been given
new names to reflect the shift. ... *We’ll follow up with a full
translation table in a future post.*

This type of translation table mapping old configuration to new would be
*very* useful!

- Robert

On Fri, Nov 22, 2013 at 2:15 AM, Sandy Ryza sandy.r...@cloudera.com wrote:

  [quoted messages trimmed]




Windows - Separating etc (config) from bin

2013-11-22 Thread Ian Jackson
It would be nice if HADOOP_CONF_DIR could be set in the environment like 
YARN_CONF_DIR. This could be done in libexec\hadoop-config.cmd by setting 
HADOOP_CONF_DIR conditionally:
if not defined HADOOP_CONF_DIR (
  set HADOOP_CONF_DIR=%HADOOP_HOME%\etc\hadoop
)

A similar change might be done in hadoop-config.sh
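
A minimal sketch of the equivalent guard for hadoop-config.sh, assuming the same etc/hadoop default layout on the Unix side, could be:

if [ -z "$HADOOP_CONF_DIR" ]; then
  # fall back to the bundled config dir only when the user has not set one
  export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
fi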

Thus bin could be under Program Files, and Program Files could be locked down from 
modification, while the configuration files would still be in a separate directory.

(--config didn't seem to work for namenode)


Heterogeneous Cluster

2013-11-22 Thread Ian Jackson
Has anyone set up a heterogeneous cluster, with some Windows nodes and some Linux nodes?


Re: Difference between clustering and classification in hadoop

2013-11-22 Thread unmesha sreeveni
Thanks Devin :) That was a nice explanation.


On Fri, Nov 22, 2013 at 6:20 PM, Devin Suiter RDX dsui...@rdx.com wrote:

 They are both for machine learning. Classification is known as supervised
 learning, where you feed the engine data with known patterns and instruct it
 as to what the key nodes are. Clustering is unsupervised learning, where you
 allow the algorithm to guess at what is significant in the correlations it
 picks up. Spam filtering is a popular example of
 classification, and image indexing is a popular example of clustering. It
 is mainly done on Hadoop because, when it comes to machine learning, the
 more data that passes through the algorithm the more accurate it should be,
 and Hadoop can handle large data better than anything else around at the
 moment.

 *Devin Suiter*
 Jr. Data Solutions Software Engineer
 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
 Google Voice: 412-256-8556 | www.rdx.com


 On Fri, Nov 22, 2013 at 2:54 AM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 What are the differences between classification algorithms and clustering
 algorithms in Hadoop?


 --
 *Thanks & Regards*

 Unmesha Sreeveni U.B

 *Junior Developer*






-- 
*Thanks & Regards*

Unmesha Sreeveni U.B

*Junior Developer*


Re: Difference between clustering and classification in hadoop

2013-11-22 Thread unmesha sreeveni
When I went through different repos for spam data, I only found files of a few
MB. To test in Hadoop we need a large file, right?
I need to test my Hadoop SVM implementation. I went through
http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/, but that
dataset is only 700 KB or so. I need a similar but larger dataset.


On Sat, Nov 23, 2013 at 8:35 AM, unmesha sreeveni unmeshab...@gmail.com wrote:

  [quoted messages trimmed]





-- 
*Thanks & Regards*

Unmesha Sreeveni U.B

*Junior Developer*


Re: Missing records from HDFS

2013-11-22 Thread Azuryy Yu
There is a problem in 'initialize'. In general, we cannot treat split.start
as the real start, because a FileSplit boundary does not fall accurately on the
end of a line, so you need to adjust the start in 'initialize'
to the start of the next line whenever start is not equal to 0.

Also, end = start + split.length is not the real end, because it may
not land on the end of a line.

So the reader MUST adjust to the real start and end in 'initialize';
otherwise it may miss some records.
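
A minimal sketch of that adjustment, modelled on Hadoop's own LineRecordReader and reusing the field and helper names from the CVSLineRecordReader above (illustrative, not a drop-in patch):

// in initialize(): if the split does not start at byte 0, the first (partial)
// line belongs to the previous split, so skip it and move the position to the
// start of the next full line
if (start != 0) {
    fileIn.seek(start);
    in = generateReader(fileIn, job);
    start += in.readLine(new Text(), 0,
            (int) Math.min((long) Integer.MAX_VALUE, end - start));
}
pos = start;

// in readData(): keep reading while pos <= end (not pos < end), so a record
// that merely starts before the split boundary is still read in full even
// though its bytes run past 'end'; the next split will skip that line again
while (pos <= end) {
    int newSize = in.readLine(data, maxLineLength,
            Math.max((int) Math.min(Integer.MAX_VALUE, end - pos), maxLineLength));
    if (newSize == 0) {
        break;          // end of file reached
    }
    pos += newSize;
    // ... hand the record to nextKeyValue() as before
}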

  [quoted code from the previous message trimmed]

Decision Tree - Help

2013-11-22 Thread unmesha sreeveni
Can we implement a decision tree as a MapReduce job?
Which algorithms can be converted into MapReduce jobs?

Thanks
Unmesha