Neo4j and Hadoop

2017-01-10 Thread unmesha sreeveni
​Hi,

 I have my input file in HDFS. How can I store that data into a Neo4j database? Is there
any way to do the same?
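
A minimal sketch of one possible approach, assuming the Neo4j Java (Bolt) driver is on
the classpath and the input is CSV-like lines; the endpoint, credentials, path and Cypher
below are placeholders, and older 1.x drivers use the org.neo4j.driver.v1 package instead:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class HdfsToNeo4j {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/user/hadoop/input.csv"); // placeholder HDFS path
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                 AuthTokens.basic("neo4j", "password"));          // placeholder credentials
             Session session = driver.session();
             BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(input)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");
                // one node per HDFS line; adjust the Cypher and properties to your schema
                session.run("CREATE (n:Record {id: $id, value: $value})",
                    Values.parameters("id", fields[0], "value", fields[1]));
            }
        }
    }
}

For large volumes this single-threaded loader would be slow; batching the CREATEs or
running the writes from mappers would be the next step.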

-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


.doc Custom Format for Hadoop

2016-01-04 Thread unmesha sreeveni
Is there a custom .doc InputFormat for Hadoop that is already built?
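
I am not aware of a stock one; a rough sketch of rolling your own with Apache POI,
treating each .doc file as one unsplittable record (class names here are made up for
illustration, not an existing library class):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class DocInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // a binary .doc cannot be split at arbitrary byte offsets
    }

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new DocRecordReader();
    }

    public static class DocRecordReader extends RecordReader<Text, Text> {
        private final Text key = new Text();
        private final Text value = new Text();
        private boolean processed = false;
        private FileSplit fileSplit;
        private TaskAttemptContext context;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.fileSplit = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) return false;
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(file)) {
                WordExtractor extractor = new WordExtractor(in); // POI reads the whole .doc
                key.set(file.getName());
                value.set(extractor.getText());                  // emit filename -> full text
            }
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public Text getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}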


Re: Doubt in DoubleWritable

2015-11-23 Thread unmesha sreeveni
Please try this:

for (DoubleArrayWritable avalue : values) {
    Writable[] value = avalue.get();
    // DoubleWritable[] value = new DoubleWritable[6];
    // for (int k = 0; k < 6; k++) {
    //     value[k] = new DoubleWritable(wvalue[k]);
    // }
    // parse accordingly
    if (Double.parseDouble(value[1].toString()) != 0) {
        total_records_Temp = total_records_Temp + 1;
        sumvalueTemp = sumvalueTemp + Double.parseDouble(value[0].toString());
    }
    if (Double.parseDouble(value[3].toString()) != 0) {
        total_records_Dewpoint = total_records_Dewpoint + 1;
        sumvalueDewpoint = sumvalueDewpoint + Double.parseDouble(value[2].toString());
    }
    if (Double.parseDouble(value[5].toString()) != 0) {
        total_records_Windspeed = total_records_Windspeed + 1;
        sumvalueWindspeed = sumvalueWindspeed + Double.parseDouble(value[4].toString());
    }
}

Attaching the code.


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
//cc MaxTemperature Application to find the maximum temperature in the weather dataset
//vv MaxTemperature
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.conf.Configuration;

public class MapReduce {

	public static void main(String[] args) throws Exception {
		if (args.length != 2) {
			System.err
	.println("Usage: MaxTemperature  ");
			System.exit(-1);
		}

		/*
		 * Job job = new Job(); job.setJarByClass(MaxTemperature.class);
		 * job.setJobName("Max temperature");
		 */

		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Job job = Job.getInstance(conf, "AverageTempValues");
		job.setJarByClass(MapReduce.class);

		/*
		 * Delete the output directory so the same dir can be reused
		 */
		Path dest = new Path(args[1]);
		if(fs.exists(dest)){
			fs.delete(dest, true);
		}
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		job.setNumReduceTasks(2);

		job.setMapperClass(NewMapper.class);
		job.setReducerClass(NewReducer.class);

		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(DoubleArrayWritable.class);

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
// ^^ MaxTemperature
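The DoubleArrayWritable used above is not shown anywhere in this thread; presumably it is
the usual ArrayWritable subclass that fixes the element type so it can be deserialized,
something like:

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;

// Conventional pattern: ArrayWritable must be subclassed with a concrete value class.
public class DoubleArrayWritable extends ArrayWritable {
    public DoubleArrayWritable() {
        super(DoubleWritable.class);
    }
    public DoubleArrayWritable(DoubleWritable[] values) {
        super(DoubleWritable.class, values);
    }
}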
// cc MaxTemperatureMapper Mapper for maximum temperature example
// vv MaxTemperatureMapper
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewMapper extends
		Mapper<LongWritable, Text, Text, DoubleArrayWritable> {

	public void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		String Str = value.toString();
		String[] Mylist = new String[1000];
		int i = 0;

		for (String retval : Str.split("\\s+")) {
			System.out.println(retval);
			Mylist[i++] = retval;

		}
		String Val = Mylist[2];
		String Year = Val.substring(0, 4);
		String Month = Val.substring(5, 6);
		String[] Section = Val.split("_");

		String section_string = "0";
		if (Section[1].matches("^(0|1|2|3|4|5)$")) {
			section_string = "4";
		} else if (Section[1].matches("^(6|7|8|9|10|11)$")) {
			section_string = "1";
		} else if (Section[1].matches("^(12|13|14|15|16|17)$")) {
			section_string = "2";
		} else if (Section[1].matches("^(18|19|20|21|22|23)$")) {
			section_string = "3";
		}

		DoubleWritable[] array = new DoubleWritable[6];
		for (int j = 0; j < 6; j++) {
			array[j] = new DoubleWritable(); // initialise each element before calling set()
		}
		DoubleArrayWritable output = new DoubleArrayWritable();
		array[0].set(Double.parseDouble(Mylist[3]));
		array[2].set(Double.parseDouble(Mylist[4]));
		array[4].set(Double.parseDouble(Mylist[12]));
		for (int j = 0; j < 6; j = j + 2) {
			if (999.9 == array[j].get()) {
array[j + 1].set(0);
			} else {
array[j + 1].set(1);
			}
		}
		output.set(array);
		context.write(new Text(Year + section_string + Month), output);
	}
}
//cc MaxTemperatureReducer Reducer for maximum temperature example
//vv MaxTemperatureReducer
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NewReducer extends
		Reducer<Text, DoubleArrayWritable, Text, DoubleArrayWritable> {

	@Override
	public void reduce(Text key, Iterable<DoubleArrayWritable> values,
			Context context) throws IOException, InterruptedException {
		double sumvalueTemp = 0;
		double sumvalueDewpoint = 0;
		double sumvalueWindspeed = 0;
		double total_records_Temp = 0;
		double total_records_Dewpoint = 0;
		double total_records_Windspeed = 0;
		double average_Temp 

Permutations and combination in mapreduce

2015-11-04 Thread unmesha sreeveni
Hi

 Has anyone implemented permutations and combinations in MapReduce?

-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Not able to copy one HDFS data to another HDFS location using distcp

2015-09-07 Thread unmesha sreeveni
I am trying to copy data from one HDFS location to another HDFS location.
I am able to achieve this using the "distcp" command:
hadoop distcp hdfs://mySrcip:8020/copyDev/* hdfs://myDestip:8020/copyTest

But I want to do the same using the Java API.
After a long search I found the code below and executed it, but it didn't copy my
src file to the destination.


import java.util.ArrayList;
import java.util.Arrays;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;

public class TouchFile {

	/**
	 * @param args
	 * @throws Exception
	 */
	public static void main(String[] args) throws Exception {
		// create configuration object
		Configuration config = new Configuration();
		config.set("fs.defaultFS", "hdfs://mySrcip:8020/");
		config.set("hadoop.job.ugi", "hdfs");
		/*
		 * Distcp
		 */
		String sourceNameNode = "hdfs://mySrcip:8020/copyDev";
		String destNameNode = "hdfs://myDestip:8020/copyTest";
		String fileList = "myfile.txt";
		distFileCopy(config, sourceNameNode, destNameNode, fileList);
	}

	/**
	 * Copies files from one cloud to another using Hadoop's distributed copy features.
	 * Uses input to build DISTCP configuration settings.
	 *
	 * param config Hadoop configuration
	 * param sourceNameNode full HDFS path to parent source directory
	 * param destNameNode full HDFS path to parent destination directory
	 * param fileList comma-separated string of file names in sourceNameNode to be copied to destNameNode
	 * returns elapsed time in milliseconds to copy files
	 */
	public static long distFileCopy(Configuration config, String sourceNameNode,
			String destNameNode, String fileList) throws Exception {
		System.out.println("In dist copy");

		StringTokenizer tokenizer = new StringTokenizer(fileList, ",");
		ArrayList<String> list = new ArrayList<>();

		while (tokenizer.hasMoreTokens()) {
			String file = sourceNameNode + "/" + tokenizer.nextToken();
			list.add(file);
		}

		String[] args = new String[list.size() + 1];
		int count = 0;
		for (String filename : list) {
			args[count++] = filename;
		}

		args[count] = destNameNode;

		System.out.println("args-->" + Arrays.toString(args));
		long st = System.currentTimeMillis();
		DistCp distCp = new DistCp(config, null);
		distCp.run(args);
		return System.currentTimeMillis() - st;
	}

}



Am I doing anything wrong?
Please suggest.
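
If it is only a handful of files, a non-DistCp sketch using FileUtil.copy between the two
FileSystems can also help narrow down whether the problem is DistCp itself or basic
connectivity between the clusters (hostnames and paths below are placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SimpleHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem srcFs = FileSystem.get(URI.create("hdfs://mySrcip:8020"), conf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs://myDestip:8020"), conf);
        // copies one file; deleteSource=false keeps the original in place
        boolean ok = FileUtil.copy(srcFs, new Path("/copyDev/myfile.txt"),
                                   dstFs, new Path("/copyTest/myfile.txt"),
                                   false, conf);
        System.out.println("copy " + (ok ? "succeeded" : "failed"));
    }
}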

-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
http://www.unmeshasreeveni.blogspot.in/


Copy Data From HDFS to FTP

2015-08-23 Thread unmesha sreeveni
Hi

How can I copy my HDFS data to an FTP server?

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Copy Data From HDFS to FTP

2015-08-23 Thread unmesha sreeveni
It is showing:

-cp: The value of property fs.ftp.password.MYIP must not be null
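
That error usually means the FTP credentials were never set. A sketch of setting them
before the copy; the property names follow the fs.ftp.user.<host> / fs.ftp.password.<host>
pattern from the message, and the host, paths and credentials below are placeholders:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsToFtp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String ftpHost = "MYIP"; // placeholder FTP host or IP
        conf.set("fs.ftp.host", ftpHost);
        conf.set("fs.ftp.user." + ftpHost, "ftpuser");
        conf.set("fs.ftp.password." + ftpHost, "ftppassword");

        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FileSystem ftp = FileSystem.get(URI.create("ftp://" + ftpHost), conf);
        // copy a single HDFS file onto the FTP server, keeping the source
        FileUtil.copy(hdfs, new Path("/user/data/file.txt"),
                      ftp, new Path("/upload/file.txt"),
                      false, conf);
    }
}

The same properties can be passed on the command line with -D when using hadoop fs -cp.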

On Mon, Aug 24, 2015 at 10:52 AM, Chinnappan Chandrasekaran 
chiranchan...@jos.com.sg wrote:



 hadoop fs -cp ftp://userid@youipaddress/directory





 *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
 *Sent:* Monday, 24 August, 2015 12:45 PM
 *To:* User Hadoop
 *Subject:* Copy Data From HDFS to FTP



 Hi



 How can I copy my HDFS data to an FTP server?






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Build Failure - SciHadoop

2015-05-06 Thread unmesha sreeveni
Hi
 I was trying to check out SciHadoop.
I came across
https://github.com/four2five/SIDR/tree/sc13_experiments_improved
and while doing the 4th step - mvn install -
the build failed.

[ERROR] COMPILATION ERROR :
[INFO] -
[ERROR] Failure executing javac, but could not parse the error:
javac: directory not found: /installSCID/git/thredds/udunits/target/classes
Usage: javac options source files
use -help for a list of possible options

[INFO] 1 error
[INFO] -
[INFO]

[INFO] Reactor Summary:
[INFO] BUILD FAILURE
[INFO]

[INFO] Total time: 20.797s
[INFO] Finished at: Wed May 06 15:15:30 IST 2015
[INFO] Final Memory: 9M/109M
[INFO]

[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile
(default-compile) on project udunits: Compilation failure
[ERROR] Failure executing javac, but could not parse the error:
[ERROR] javac: directory not found:
/installSCID/git/thredds/udunits/target/classes
[ERROR] Usage: javac options source files
[ERROR] use -help for a list of possible options
[ERROR] - [Help 1]
[ERROR]



Has anyone come across the same? I suspect I am going wrong somewhere.


-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Connect c language with HDFS

2015-05-04 Thread unmesha sreeveni
Thanks Alex.
  I have gone through the same, but when I checked my Cloudera distribution
I was not able to find those folders. That's why I posted here. I don't know if
I made any mistake.

On Mon, May 4, 2015 at 2:40 PM, Alexander Alten-Lorenz wget.n...@gmail.com
wrote:

 Google:

 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html

 --
 Alexander Alten-Lorenz
 m: wget.n...@gmail.com
 b: mapredit.blogspot.com

 On May 4, 2015, at 10:57 AM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 Hi
   Can we connect c with HDFS using cloudera hadoop distribution.

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Connect c language with HDFS

2015-05-04 Thread unmesha sreeveni
Thanks
Did it.
http://unmeshasreeveni.blogspot.in/2015/05/hadoop-word-count-using-c-hadoop.html

On Mon, May 4, 2015 at 3:43 PM, Alexander Alten-Lorenz wget.n...@gmail.com
wrote:

 That depends on the installation source (rpm, tgz or parcels). Usually,
 when you use parcels, libhdfs.so* should be within /opt/cloudera/parcels/
 CDH/lib64/ (or similar). Or just use linux' locate (locate libhdfs.so*)
 to find the library.




 --
 Alexander Alten-Lorenz
 m: wget.n...@gmail.com
 b: mapredit.blogspot.com

 On May 4, 2015, at 11:39 AM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 thanks alex
   I have gone through the same. but once I checked my cloudera
 distribution I am not able to get those folders ..Thats y I posted here. I
 dont know if I made any mistake.

 On Mon, May 4, 2015 at 2:40 PM, Alexander Alten-Lorenz 
 wget.n...@gmail.com wrote:

 Google:

 http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html

 --
 Alexander Alten-Lorenz
 m: wget.n...@gmail.com
 b: mapredit.blogspot.com

 On May 4, 2015, at 10:57 AM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 Hi
   Can we connect c with HDFS using cloudera hadoop distribution.

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: How to stop a mapreduce job from terminal running on Hadoop Cluster?

2015-04-12 Thread unmesha sreeveni
you can use
$ hadoop job -kill jobid


On Mon, Apr 13, 2015 at 10:20 AM, Rohith Sharma K S 
rohithsharm...@huawei.com wrote:

  In addition to below options, in the Hadoop-2.7(yet to release in couple
 of weeks) the user friendly option provided for killing the applications
 from Web UI.



 In the application block , *‘Kill Application’* button has been provided
 for killing applications.



 Thanks  Regards

 Rohith Sharma K S

 *From:* Pradeep Gollakota [mailto:pradeep...@gmail.com]
 *Sent:* 12 April 2015 23:41
 *To:* user@hadoop.apache.org
 *Subject:* Re: How to stop a mapreduce job from terminal running on
 Hadoop Cluster?



 Also, mapred job -kill job_id



 On Sun, Apr 12, 2015 at 11:07 AM, Shahab Yunus shahab.yu...@gmail.com
 wrote:

 You can kill t by using the following yarn command



 yarn application -kill application id


 https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html



 Or use old hadoop job command

 http://stackoverflow.com/questions/11458519/how-to-kill-hadoop-jobs



 Regards,

 Shahab



 On Sun, Apr 12, 2015 at 2:03 PM, Answer Agrawal yrsna.tse...@gmail.com
 wrote:

 To run a job we use the command
 $ hadoop jar example.jar inputpath outputpath
 If job is so time taken and we want to stop it in middle then which
 command is used? Or is there any other way to do that?

 Thanks,












-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


cleanup() in hadoop results in aggregation of whole file/not

2015-02-27 Thread unmesha sreeveni
​I am having an input file, which contains last column as class label
7.4 0.29 0.5 1.8 0.042 35 127 0.9937 3.45 0.5 10.2 7 1
10 0.41 0.45 6.2 0.071 6 14 0.99702 3.21 0.49 11.8 7 -1
7.8 0.26 0.27 1.9 0.051 52 195 0.9928 3.23 0.5 10.9 6 1
6.9 0.32 0.3 1.8 0.036 28 117 0.99269 3.24 0.48 11 6 1
...
I am trying to get the unique class labels of the whole file. In order to get
the same I am doing the code below.

public class MyMapper extends Mapper<LongWritable, Text, IntWritable, FourvalueWritable> {

    Set<String> uniqueLabel = new HashSet<>();

    public void map(LongWritable key, Text value, Context context) {
        // Last column of input is the class label.
        Vector<String> cls = CustomParam.customLabel(line, delimiter, classindex);
        uniqueLabel.add(cls.get(0));
    }

    public void cleanup(Context context) throws IOException {
        // find min and max label
        context.getCounter(UpdateCost.MINLABEL).setValue(Long.valueOf(minLabel));
        context.getCounter(UpdateCost.MAXLABEL).setValue(Long.valueOf(maxLabel));
    }
}
cleanup() is only executed once.

And after each map() call, does the Set uniqueLabel get updated? I hope the set
gets updated for each map() call,
and that I am able to get the unique labels of the whole file in cleanup().
Please suggest if I am wrong.

Thanks in advance.
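
A sketch of the pattern in question (class and field names here are assumptions, not the
code above): the Set is per map task, i.e. per input split, so cleanup() only sees that
split's labels; if the file spans several splits, emit the per-task set in cleanup() and
merge in a single reducer to get the file-wide unique labels.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LabelMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // accumulates across all map() calls of ONE task, i.e. one input split
    private final Set<String> uniqueLabel = new HashSet<>();

    @Override
    public void map(LongWritable key, Text value, Context context) {
        String[] fields = value.toString().trim().split("\\s+");
        uniqueLabel.add(fields[fields.length - 1]); // last column is the class label
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // runs once per task; a single reducer then sees the union over all splits
        for (String label : uniqueLabel) {
            context.write(new Text(label), NullWritable.get());
        }
    }
}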


Re: Get method in Writable

2015-02-22 Thread unmesha sreeveni
Thanks Drake.
That was the point. It was my mistake.

On Mon, Feb 23, 2015 at 6:34 AM, Drake민영근 drake@nexr.com wrote:

 Hi, unmesha.

 I think this is a gson problem. you mentioned like this:

  But parsing canot be done in *MR2*.
 * TreeInfoWritable info = gson.toJson(setupData,
 TreeInfoWritable.class);*

 I think just use gson.fromJson, not toJson(setupData is already json
 string, i think).

 Is this right ?

 Drake 민영근 Ph.D
 kt NexR

 On Sat, Feb 21, 2015 at 4:55 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 Am I able to get the values from writable of a previous job.
 ie I have 2 MR jobs
 *MR 1:*
  I need to pass 3 element as values from reducer and the key is
 NullWritable. So I created a custom writable class to achieve this.
 * public class TreeInfoWritable implements Writable{*

 * DoubleWritable entropy;*
 * IntWritable sum;*
 * IntWritable clsCount;*
 * ..*
 *}*
 *MR 2:*
  I need to access MR 1 result in MR2 mapper setup function. And I
 accessed it as distributed cache (small file).
  Is there a way to get those values using get method.
  *while ((setupData = bf.readLine()) != null) {*
 * System.out.println(Setup Line +setupData);*
 * TreeInfoWritable info = //something i can pass to TreeInfoWritable and
 get values*
 * DoubleWritable entropy = info.getEntropy();*
 * System.out.println(entropy: +entropy);*
 *}*

 Tried to convert writable to gson format.
 *MR 1*
 *Gson gson = new Gson();*
 *String emitVal = gson.toJson(valEmit);*
 *context.write(out, new Text(emitVal));*

  But parsing canot be done in *MR2*.
 *TreeInfoWritable info = gson.toJson(setupData, TreeInfoWritable.class);*

 *Error: Type mismatch: cannot convert from String to TreeInfoWritable*
 Once it is changed to string we cannot get values.

 Am I able to get a workaround for the same. or to use just POJO classes
 instaed of Writable. I'm afraid if that becomes slower as we are depending
 on Java instaed of hadoop 's serializable classes








-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Get method in Writable

2015-02-20 Thread unmesha sreeveni
Am I able to get the values from a Writable of a previous job?
i.e. I have 2 MR jobs.
MR 1:
 I need to pass 3 elements as values from the reducer and the key is
NullWritable. So I created a custom Writable class to achieve this.

public class TreeInfoWritable implements Writable {
    DoubleWritable entropy;
    IntWritable sum;
    IntWritable clsCount;
    ..
}

MR 2:
 I need to access the MR 1 result in the MR 2 mapper setup function, and I
accessed it as a distributed cache (small file).
 Is there a way to get those values using get methods?

while ((setupData = bf.readLine()) != null) {
    System.out.println("Setup Line " + setupData);
    TreeInfoWritable info = // something I can pass to TreeInfoWritable and get values
    DoubleWritable entropy = info.getEntropy();
    System.out.println("entropy: " + entropy);
}

I tried to convert the Writable to GSON format.
MR 1:

Gson gson = new Gson();
String emitVal = gson.toJson(valEmit);
context.write(out, new Text(emitVal));

 But parsing cannot be done in MR 2:

TreeInfoWritable info = gson.toJson(setupData, TreeInfoWritable.class);

Error: Type mismatch: cannot convert from String to TreeInfoWritable

Once it is changed to a string we cannot get the values.

Am I able to get a workaround for the same, or should I just use POJO classes
instead of Writable? I'm afraid that becomes slower as we would be depending
on Java serialization instead of Hadoop's serializable classes.


Re: writing mappers and reducers question

2015-02-19 Thread unmesha sreeveni
You can also write MapReduce jobs in Eclipse for testing purposes. Once that is
done you can create a jar and run it on your single-node or multi-node cluster.
But please note that while running inside such IDEs using the Hadoop dependencies,
there will be no input splits, no multiple mappers, etc.


Re: How to get Hadoop's Generic Options value

2015-02-19 Thread unmesha sreeveni
Try implementing your program as:

public class YourDriver extends Configured implements Tool {
    // main() calls ToolRunner.run(); run() builds and submits the Job
}

Then supply your file using the -D option.
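
A minimal runnable sketch of that pattern (the property name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class YourDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // Generic options (-D key=value, -files, -libjars ...) have already been
        // parsed by ToolRunner and folded into this Configuration.
        Configuration conf = getConf();
        String myValue = conf.get("my.custom.property", "default");
        System.out.println("my.custom.property = " + myValue);
        // ... build and submit the Job here using conf ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new YourDriver(), args));
    }
}

Run it as: hadoop jar yourapp.jar YourDriver -D my.custom.property=foo <other args>;
ToolRunner strips the generic options before run() sees args.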

Thanks
Unmesha Biju


Delete output folder automatically in CRUNCH (FlumeJava)

2015-02-17 Thread unmesha sreeveni
Hi
I am new to FlumeJava. I ran wordcount in it. But how can I
automatically delete the output folder from within the code, instead of going
back and deleting the folder by hand?
Thanks in advance.
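
One low-tech approach (not Crunch-specific) is to delete the output path with the plain
HDFS API before the pipeline writes to it; a sketch, assuming the output path is known
up front:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {
    // Call this before pipeline.writeTextFile(...) / pipeline.done()
    public static void deleteIfExists(Configuration conf, String output) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(output);
        if (fs.exists(out)) {
            fs.delete(out, true); // recursive delete of the old output folder
        }
    }
}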

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Neural Network in hadoop

2015-02-12 Thread unmesha sreeveni
I am trying to implement a Neural Network in MapReduce. Apache Mahout refers to
this paper:
http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

Neural Network (NN) We focus on backpropagation By defining a network
structure (we use a three layer network with two output neurons classifying
the data into two categories), each mapper propagates its set of data
through the network. For each training example, the error is back
propagated to calculate the partial gradient for each of the weights in the
network. The reducer then sums the partial gradient from each mapper and
does a batch gradient descent to update the weights of the network.
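
As a sketch of the reducer half of that description (assumptions: the partial gradients
arrive as a DoubleArrayWritable, i.e. an ArrayWritable of DoubleWritable as sketched
earlier in this archive, and the learning rate is a made-up hyper-parameter; the driver
would feed the emitted deltas back into the next iteration's weights):

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

public class GradientSumReducer
        extends Reducer<Text, DoubleArrayWritable, Text, Text> {

    private static final double LEARNING_RATE = 0.1; // assumed hyper-parameter

    @Override
    public void reduce(Text weightKey, Iterable<DoubleArrayWritable> partials, Context context)
            throws IOException, InterruptedException {
        double[] sum = null;
        for (DoubleArrayWritable partial : partials) {
            Writable[] grad = partial.get();
            if (sum == null) {
                sum = new double[grad.length];
            }
            for (int i = 0; i < grad.length; i++) {
                sum[i] += ((DoubleWritable) grad[i]).get(); // accumulate per-mapper partial gradients
            }
        }
        if (sum == null) {
            return; // no gradients arrived for this key
        }
        // one batch gradient-descent step: delta_w = -learningRate * summed gradient
        StringBuilder deltas = new StringBuilder();
        for (int i = 0; i < sum.length; i++) {
            if (i > 0) deltas.append(',');
            deltas.append(-LEARNING_RATE * sum[i]);
        }
        context.write(weightKey, new Text(deltas.toString()));
    }
}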

Here http://homepages.gold.ac.uk/nikolaev/311sperc.htm is the worked out
example for gradient descent algorithm.

Gradient Descent Learning Algorithm for Sigmoidal Perceptrons
http://pastebin.com/6gAQv5vb

   1. Which is the better way to parallelize a neural network algorithm when
   looking at it from a MapReduce perspective? In the mapper: each record owns a partial
   weight(from above example: w0,w1,w2),I doubt if w0 is bias. A random weight
   will be assigned initially and initial record calculates the output(o) and
   weight get updated , second record also find the output and deltaW is got
   updated with the previous deltaW value. While coming into reducer the sum
   of gradient is calculated. ie if we have 3 mappers,we will be able to get 3
   w0,w1,w2.These are summed and using batch gradient descent we will be
   updating the weights of the network.
   2. In the above method how can we ensure that which previous weight is
   taken while considering more than 1 map task.Each map task has its own
   weight updated.How can it be accurate? [image: enter image description
   here]
   3. Where can I find backward propagation in the above mentioned gradient
   descent neural network algorithm? Or is it fine with this implementation?
   4. What is the termination condition mentioned in the algorithm?

Please help me with some pointers.

Thanks in advance.

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Neural Network in hadoop

2015-02-12 Thread unmesha sreeveni
On Thu, Feb 12, 2015 at 4:13 PM, Alpha Bagus Sunggono bagusa...@gmail.com
wrote:

 In my opinion,
 - This is just for 1 iteration. Then, batch gradient means find all delta,
 then updates all weight. So , I think its improperly if each have weight
 updated. Weight updated should be after Reduced.
 - Backpropagation can be found after Reduced.
 - This iteration should be repeat and repeat again.

I doubt whether the iteration is for each record. Say for example we have just 5
records - will the number of iterations then be 5, or is it some other concept?
i.e. from the above example,
∆w0, ∆w1, ∆w2 will be the delta error.

So here let's say we have a threshold value. For each record we will be checking
whether ∆w0, ∆w1, ∆w2 is less than or equal to the threshold value, else continue
the iteration. Is it like that? Am I wrong?

Sorry, I am not that clear on the iteration part.


 Termination condition should be measured by delta error of sigmoid output
 in the end of mapper.
 ​
 Iteration process can be terminated after we get suitable  small value
 enough of the delta error.


Is there any criterion for updating the delta weights?
After calculating the output of the perceptron, let's find the error:
(oj * (1 - oj) * (tj - oj))
Check whether the error is less than the threshold; if so, the delta weight is not
updated, else update the delta weight.
Is it like that?



 On Thu, Feb 12, 2015 at 5:14 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 I am trying to implement Neural Network in MapReduce. Apache mahout is
 reffering this paper
 http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

 Neural Network (NN) We focus on backpropagation By defining a network
 structure (we use a three layer network with two output neurons classifying
 the data into two categories), each mapper propagates its set of data
 through the network. For each training example, the error is back
 propagated to calculate the partial gradient for each of the weights in the
 network. The reducer then sums the partial gradient from each mapper and
 does a batch gradient descent to update the weights of the network.

 Here http://homepages.gold.ac.uk/nikolaev/311sperc.htm is the worked
 out example for gradient descent algorithm.

 Gradient Descent Learning Algorithm for Sigmoidal Perceptrons
 http://pastebin.com/6gAQv5vb

1. Which is the better way to parallize neural network algorithm
While looking in MapReduce perspective? In mapper: Each Record owns a
partial weight(from above example: w0,w1,w2),I doubt if w0 is bias. A
random weight will be assigned initially and initial record calculates the
output(o) and weight get updated , second record also find the output and
deltaW is got updated with the previous deltaW value. While coming into
reducer the sum of gradient is calculated. ie if we have 3 mappers,we will
be able to get 3 w0,w1,w2.These are summed and using batch gradient 
 descent
we will be updating the weights of the network.
2. In the above method how can we ensure that which previous weight
is taken while considering more than 1 map task.Each map task has its own
weight updated.How can it be accurate? [image: enter image
description here]
3. Where can I find backward propogation in the above mentioned
gradient descent neural network algorithm?Or is it fine with this
implementation?
4. what is the termination condition mensioned in the algorithm?

 Please help me with some pointers.

 Thanks in advance.

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





 --
 Alpha Bagus Sunggono
 http://www.dyavacs.com




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-20 Thread unmesha sreeveni
I have 4 nodes and the replication factor is set to 3

On Wed, Jan 21, 2015 at 11:15 AM, Drake민영근 drake@nexr.com wrote:

 Yes, almost same. I assume the most time spending part was copying model
 data from datanode which has model data to actual process node(tasktracker
 or nodemanager).

 How about the model data's replication factor? How many nodes do you have?
 If you have 4 or more nodes, you can increase replication with following
 command. I suggest the number equal to your datanodes, but first you should
 confirm the enough space in HDFS.


- hdfs dfs -setrep -w 6 /user/model/data




 Drake 민영근 Ph.D

 On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 Yes I tried the same Drake.

 I dont know if I understood your answer.

  Instead of loading them into setup() through cache I read them directly
 from HDFS in map section. and for each incoming record .I found the
 distance between all the records in HDFS.
 ie if R ans S are my dataset, R is the model data stored in HDFs
 and when S taken for processing
 S1-R(finding distance with whole R set)
 S2-R

 But it is taking a long time as it needs to compute the distance.

 On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 drake@nexr.com wrote:

 In my suggestion, map or reduce tasks do not use distributed cache. They
 use file directly from HDFS with short circuit local read. Like a shared
 storage method, but almost every node has the data with high-replication
 factor.

 Drake 민영근 Ph.D

 On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni unmeshab...@gmail.com
  wrote:

 But stil if the model is very large enough, how can we load them inti
 Distributed cache or some thing like that.
 Here is one source :
 http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
 But it is confusing me

 On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 drake@nexr.com wrote:

 Hi,

 How about this ? The large model data stay in HDFS but with many
 replications and MapReduce program read the model from HDFS. In theory, 
 the
 replication factor of model data equals with number of data nodes and with
 the Short Circuit Local Reads function of HDFS datanode, the map or reduce
 tasks read the model data in their own disks.

 In this way, maybe use too many usage of HDFS, but the annoying
 partition problem will be gone.

 Thanks

 Drake 민영근 Ph.D

 On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 Is there any way..
 Waiting for a reply.I have posted the question every where..but none
 is responding back.
 I feel like this is the right place to ask doubts. As some of u may
 came across the same issue and get stuck.

 On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 Yes, One of my friend is implemeting the same. I know global sharing
 of Data is not possible across Hadoop MapReduce. But I need to check if
 that can be done somehow in hadoop Mapreduce also. Because I found some
 papers in KNN hadoop also.
 And I trying to compare the performance too.

 Hope some pointers can help me.


 On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com
  wrote:


 have you considered implementing using something like spark?  That
 could be much easier than raw map-reduce

 On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 In KNN like algorithm we need to load model Data into cache for
 predicting the records.

 Here is the example for KNN.


 [image: Inline image 1]

 So if the model will be a large file say1 or 2 GB we will be able
 to load them into Distributed cache.

 The one way is to split/partition the model Result into some files
 and perform the distance calculation for all records in that file and 
 then
 find the min ditance and max occurance of classlabel and predict the
 outcome.

 How can we parttion the file and perform the operation on these
 partition ?

 ie  1 record Distance parttition1,partition2,
  2nd record Distance parttition1,partition2,...

 This is what came to my thought.

 Is there any further way.

 Any pointers would help me.

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-20 Thread unmesha sreeveni
Yes, I tried the same, Drake.

I don't know if I understood your answer.

 Instead of loading them into setup() through the cache, I read them directly
from HDFS in the map section, and for each incoming record I found the
distance against all the records in HDFS.
i.e. if R and S are my datasets, R is the model data stored in HDFS,
and when S is taken for processing:
S1-R (finding distance with the whole R set)
S2-R

But it is taking a long time as it needs to compute the distances.

On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 drake@nexr.com wrote:

 In my suggestion, map or reduce tasks do not use distributed cache. They
 use file directly from HDFS with short circuit local read. Like a shared
 storage method, but almost every node has the data with high-replication
 factor.

 Drake 민영근 Ph.D

 On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 But stil if the model is very large enough, how can we load them inti
 Distributed cache or some thing like that.
 Here is one source :
 http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf
 But it is confusing me

 On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 drake@nexr.com wrote:

 Hi,

 How about this ? The large model data stay in HDFS but with many
 replications and MapReduce program read the model from HDFS. In theory, the
 replication factor of model data equals with number of data nodes and with
 the Short Circuit Local Reads function of HDFS datanode, the map or reduce
 tasks read the model data in their own disks.

 In this way, maybe use too many usage of HDFS, but the annoying
 partition problem will be gone.

 Thanks

 Drake 민영근 Ph.D

 On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni unmeshab...@gmail.com
  wrote:

 Is there any way..
 Waiting for a reply.I have posted the question every where..but none is
 responding back.
 I feel like this is the right place to ask doubts. As some of u may
 came across the same issue and get stuck.

 On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 Yes, One of my friend is implemeting the same. I know global sharing
 of Data is not possible across Hadoop MapReduce. But I need to check if
 that can be done somehow in hadoop Mapreduce also. Because I found some
 papers in KNN hadoop also.
 And I trying to compare the performance too.

 Hope some pointers can help me.


 On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:


 have you considered implementing using something like spark?  That
 could be much easier than raw map-reduce

 On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 In KNN like algorithm we need to load model Data into cache for
 predicting the records.

 Here is the example for KNN.


 [image: Inline image 1]

 So if the model will be a large file say1 or 2 GB we will be able to
 load them into Distributed cache.

 The one way is to split/partition the model Result into some files
 and perform the distance calculation for all records in that file and 
 then
 find the min ditance and max occurance of classlabel and predict the
 outcome.

 How can we parttion the file and perform the operation on these
 partition ?

 ie  1 record Distance parttition1,partition2,
  2nd record Distance parttition1,partition2,...

 This is what came to my thought.

 Is there any further way.

 Any pointers would help me.

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-15 Thread unmesha sreeveni
Is there any way?
Waiting for a reply. I have posted the question everywhere, but no one is
responding.
I feel like this is the right place to ask doubts, as some of you may have come
across the same issue and got stuck.

On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni unmeshab...@gmail.com
wrote:

 Yes, One of my friend is implemeting the same. I know global sharing of
 Data is not possible across Hadoop MapReduce. But I need to check if that
 can be done somehow in hadoop Mapreduce also. Because I found some papers
 in KNN hadoop also.
 And I trying to compare the performance too.

 Hope some pointers can help me.


 On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:


 have you considered implementing using something like spark?  That could
 be much easier than raw map-reduce

 On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com
  wrote:

 In KNN like algorithm we need to load model Data into cache for
 predicting the records.

 Here is the example for KNN.


 [image: Inline image 1]

 So if the model will be a large file say1 or 2 GB we will be able to
 load them into Distributed cache.

 The one way is to split/partition the model Result into some files and
 perform the distance calculation for all records in that file and then find
 the min ditance and max occurance of classlabel and predict the outcome.

 How can we parttion the file and perform the operation on these
 partition ?

 ie  1 record Distance parttition1,partition2,
  2nd record Distance parttition1,partition2,...

 This is what came to my thought.

 Is there any further way.

 Any pointers would help me.

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-14 Thread unmesha sreeveni
In a KNN-like algorithm we need to load the model data into a cache for
predicting the records.

Here is the example for KNN.


[image: Inline image 1]

So if the model is a large file, say 1 or 2 GB, will we be able to load it into
the distributed cache?

One way is to split/partition the model result into some files, perform the
distance calculation for all records in each partition, and then find the
minimum distance and the most frequent class label to predict the outcome.

How can we partition the file and perform the operation on these partitions?

i.e.  1st record: distance against partition1, partition2, ...
      2nd record: distance against partition1, partition2, ...

This is what came to my thought.

Is there any further way?

Any pointers would help me.
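
One way to make the per-partition idea concrete (a sketch, with an assumed k and an
assumed "distance,label" record format, not an existing implementation): each mapper
holds one chunk of the model, scores every test record against its chunk and emits
(testRecordId, "distance,label"); a reducer then merges the chunks' candidates, keeps
the k globally nearest, and takes a majority vote.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KnnMergeReducer extends Reducer<Text, Text, Text, Text> {

    private static final int K = 5; // assumed k

    private static double dist(String candidate) {
        return Double.parseDouble(candidate.split(",")[0]);
    }

    @Override
    public void reduce(Text testRecordId, Iterable<Text> candidates, Context context)
            throws IOException, InterruptedException {
        // max-heap on distance: the farthest of the current K is evicted first
        PriorityQueue<String> nearest =
            new PriorityQueue<>((a, b) -> Double.compare(dist(b), dist(a)));
        for (Text candidate : candidates) {
            nearest.add(candidate.toString());
            if (nearest.size() > K) {
                nearest.poll(); // drop the farthest candidate
            }
        }
        // majority vote over the K nearest neighbours
        Map<String, Integer> votes = new HashMap<>();
        for (String candidate : nearest) {
            votes.merge(candidate.split(",")[1], 1, Integer::sum);
        }
        String predicted = null;
        int best = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) { best = e.getValue(); predicted = e.getKey(); }
        }
        context.write(testRecordId, new Text(predicted));
    }
}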

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-14 Thread unmesha sreeveni
Yes, one of my friends is implementing the same. I know global sharing of
data is not possible across Hadoop MapReduce. But I need to check if that
can be done somehow in Hadoop MapReduce as well, because I found some papers
on KNN in Hadoop too.
And I am trying to compare the performance as well.

Hope some pointers can help me.


On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com wrote:


 have you considered implementing using something like spark?  That could
 be much easier than raw map-reduce

 On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 In KNN like algorithm we need to load model Data into cache for
 predicting the records.

 Here is the example for KNN.


 [image: Inline image 1]

 So if the model will be a large file say1 or 2 GB we will be able to load
 them into Distributed cache.

 The one way is to split/partition the model Result into some files and
 perform the distance calculation for all records in that file and then find
 the min ditance and max occurance of classlabel and predict the outcome.

 How can we parttion the file and perform the operation on these partition
 ?

 ie  1 record Distance parttition1,partition2,
  2nd record Distance parttition1,partition2,...

 This is what came to my thought.

 Is there any further way.

 Any pointers would help me.

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: How to run a mapreduce program not on the node of hadoop cluster?

2015-01-14 Thread unmesha sreeveni
If the job falls back to the local job runner (i.e. there is no cluster
configuration on the classpath), your data won't get split, so your program runs
as a single mapper and a single reducer, and your intermediate data is not
shuffled and sorted across nodes. But you can use this for debugging.
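
If the machine has (or is given) the cluster's client configuration, the job is still
submitted to the cluster and split normally. A sketch of pointing the client at a remote
cluster by hand (hostnames and ports are placeholders; putting the cluster's *-site.xml
files on the classpath achieves the same thing):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmit {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "rm-host:8032");

        Job job = Job.getInstance(conf, "my_prj job");
        job.setJarByClass(RemoteSubmit.class); // this jar is shipped to the cluster
        // job.setMapperClass(...) / job.setReducerClass(...) as in any other driver;
        // with none set, the identity mapper/reducer run, matching the types below.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}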
On Jan 14, 2015 2:04 PM, Cao Yi iridium...@gmail.com wrote:

 Hi,

 I write some mapreduce code in my project *my_prj*. *my_prj *will be
 deployed on the machine which is not a node of the cluster.
 how does *my_prj* to run a mapreduce job in this case?

 thank you!

 Best Regards,
 Iridium



Re: Write and Read file through map reduce

2015-01-05 Thread unmesha sreeveni
Hi Hitarth,

If your file1 and file2 are smaller you can move on with the distributed cache,
mentioned here [1].

Or you can move on with MultipleInputs, mentioned here [2]; see also the sketch
after the links below.

[1]
http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html
[2]
http://unmeshasreeveni.blogspot.in/2014/12/joining-two-files-using-multipleinput.html
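
A small sketch of the MultipleInputs route (the tab-separated record layout and class
names are assumptions for illustration): the third job reads file1 and file2 with their
own mappers and tags each record with its source so a reducer can combine them per key.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinTwoOutputs {

    public static class File1Mapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split("\t", 2);   // assumed key<TAB>rest layout
            context.write(new Text(f[0]), new Text("F1\t" + (f.length > 1 ? f[1] : "")));
        }
    }

    public static class File2Mapper extends Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] f = value.toString().split("\t", 2);
            context.write(new Text(f[0]), new Text("F2\t" + (f.length > 1 ? f[1] : "")));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine file1 and file2");
        job.setJarByClass(JoinTwoOutputs.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, File1Mapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, File2Mapper.class);
        // a reducer class would combine the tagged records per key for the computation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}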

On Tue, Jan 6, 2015 at 8:53 AM, Ted Yu yuzhih...@gmail.com wrote:

 Hitarth:
 You can also consider MultiFileInputFormat (and its concrete
 implementations).

 Cheers

 On Mon, Jan 5, 2015 at 6:14 PM, Corey Nolet cjno...@gmail.com wrote:

 Hitarth,

 I don't know how much direction you are looking for with regards to the
 formats of the times but you can certainly read both files into the third
 mapreduce job using the FileInputFormat by comma-separating the paths to
 the files. The blocks for both files will essentially be unioned together
 and the mappers scheduled across your cluster.

 On Mon, Jan 5, 2015 at 3:55 PM, hitarth trivedi t.hita...@gmail.com
 wrote:

 Hi,

 I have 6 node cluster, and the scenario is as follows :-

 I have one map reduce job which will write file1 in HDFS.
 I have another map reduce job which will write file2 in  HDFS.
 In the third map reduce job I need to use file1 and file2 to do some
 computation and output the value.

 What is the best way to store file1 and file2 in HDFS so that they could
 be used in third map reduce job.

 Thanks,
 Hitarth






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


[Blog] My experience on Hadoop Certification

2015-01-02 Thread unmesha sreeveni
http://unmeshasreeveni.blogspot.in/2015/01/cloudera-certified-hadoop-developer-ccd.html

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: FileNotFoundException in distributed mode

2014-12-22 Thread unmesha sreeveni
Driver

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path cachefile = new Path("path/to/file");
FileStatus[] list = fs.globStatus(cachefile);
for (FileStatus status : list) {
 DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}

In setup

public void setup(Context context) throws IOException{
 Configuration conf = context.getConfiguration();
 FileSystem fs = FileSystem.get(conf);
 URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
 Path getPath = new Path(cacheFiles[0].getPath());
 BufferedReader bf = new BufferedReader(new
InputStreamReader(fs.open(getPath)));
 String setupData = null;
 while ((setupData = bf.readLine()) != null) {
   System.out.println("Setup Line in reducer " + setupData);
 }
}


Hope this link helps:
http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html

On Mon, Dec 22, 2014 at 2:58 PM, Marko Dinic marko.di...@nissatech.com
wrote:

 Hello Hadoopers,

 I'm getting this exception in Hadoop while trying to read file that was
 added to distributed cache, and the strange thing is that the file exists
 on the given location

 java.io.FileNotFoundException: File does not exist:
 /tmp/hadoop-pera/mapred/local/taskTracker/distcache/-1517670662102870873_-
 1918892372_1898431787/localhost/work/output/temporalcentroids/centroids-
 iteration0-noOfClusters2/part-r-0

 I'm adding the file in before starting my job using

 DistributedCache.addCacheFile(URI.create(args[2]),
 job.getConfiguration());

 And I'm trying to read from the file from setup metod in my mapper using

 DistributedCache.getLocalCacheFiles(conf);

 As I said, I can confirm that the file is on the local system, but the
 exception is thrown.

 I'm running the job in pseudo-distributed mode, on one computer.

 Any ideas?

 Thanks




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Run a c++ program using opencv libraries in hadoop

2014-12-17 Thread unmesha sreeveni
Hi

 How can I run c++ programs using opencv libraries in hadoop?

So far I have done MapReduce jobs in Java only..and there we can supply
external jars using command line itself.
And even tried using python language also..to run them we use hadoop
streaming API.
But I am confused how to run C++ programs using opencv libraries.

Thanks in Advance.

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Split files into 80% and 20% for building model and prediction

2014-12-12 Thread unmesha sreeveni
I am trying to divide my HDFS file into 2 parts/files of
80% and 20% for a classification algorithm (80% for modelling and 20% for
prediction).
Please provide suggestions for the same.
To take the 80% and 20% into 2 separate files we need to know the exact number of
records in the data set,
and that is only known if we go through the data set once.
So we would need to write one MapReduce job just for counting the number of records,
and
a 2nd MapReduce job for separating the 80% and 20% into 2 files using multiple
outputs.


Am I on the right track, or is there an alternative for the same?
But again a small confusion: how do I check whether the reducer got filled with 80%
of the data?


-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread unmesha sreeveni
Hi Mikael,
 So you would not write an MR job for counting the number of records in the
file to find the 80% and 20%?
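
A minimal map-only sketch of the per-row sampling Mikael describes below (assumptions:
MultipleOutputs with named outputs "train" and "test" registered on the Job via
MultipleOutputs.addNamedOutput, and a fixed 0.8 threshold; the split is approximate,
not an exact 80/20):

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class TrainTestSplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;
    private final Random random = new Random();

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // each row lands in "train" with probability 0.8, otherwise in "test"
        String target = random.nextDouble() < 0.8 ? "train" : "test";
        mos.write(target, NullWritable.get(), value);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush the named outputs
    }
}

No counting pass is needed, and no reducer has to "fill up" with 80% of the data.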

On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk mikael.sit...@gmail.com
wrote:

 I would use a different approach. For each row in the mapper I would have
 invoked random.Next() then if the number generated by random is below 0.8
 then the row would go to key for training otherwise go to key for the test.
 Mikael.s
 --
 From: Susheel Kumar Gadalay skgada...@gmail.com
 Sent: ‎12/‎12/‎2014 12:00
 To: user@hadoop.apache.org
 Subject: Re: Split files into 80% and 20% for building model and
 prediction

 Simple solution..

 Copy the HDFS file to local and use OS commands to count no of lines

 cat file1 | wc -l

 and cut it based on line number.


 On 12/12/14, unmesha sreeveni unmeshab...@gmail.com wrote:
  I am trying to divide my HDFS file into 2 parts/files
  80% and 20% for classification algorithm(80% for modelling and 20% for
  prediction)
  Please provide suggestion for the same.
  To take 80% and 20% to 2 seperate files we need to know the exact number
 of
  record in the data set
  And it is only known if we go through the data set once.
  so we need to write 1 MapReduce Job for just counting the number of
 records
  and
  2 nd Mapreduce Job for separating 80% and 20% into 2 files using Multiple
  Inputs.
 
 
  Am I in the right track or there is any alternative for the same.
  But again a small confusion how to check if the reducer get filled with
 80%
  data.
 
 
  --
  *Thanks  Regards *
 
 
  *Unmesha Sreeveni U.B*
  *Hadoop, Bigdata Developer*
  *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
  http://www.unmeshasreeveni.blogspot.in/
 



-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: DistributedCache

2014-12-11 Thread unmesha sreeveni
On Fri, Dec 12, 2014 at 9:55 AM, Shahab Yunus shahab.yu...@gmail.com
wrote:

 job.addCacheFiles


​Yes you can use job.addCacheFiles to cache the file.

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path cachefile = new Path("path/to/file");
FileStatus[] list = fs.globStatus(cachefile);
for (FileStatus status : list) {
 DistributedCache.addCacheFile(status.getPath().toUri(), conf);

}

Hope this link helps
[1]
http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html


-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Detailing on how UPDATE is performed in Hive

2014-11-27 Thread unmesha sreeveni
Hi friends
  Where can I find details on how UPDATE is performed in Hive?

1. When an update is performed, will HDFS write that block elsewhere with the
new value?
2. Is the old block unallocated and allowed for further writes?
3. Does this process create fragmentation?
4. While creating a partitioned table, when an update is performed, is the
partition deleted and updated with the new value, or is the entire block
deleted and written once again?

Where would be a good place to gather this knowledge?

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


[blog] How to do Update operation in hive-0.14.0

2014-11-25 Thread unmesha sreeveni
Hi

Hope this link helps for those who are trying to do practise ACID
properties in hive 0.14.

http://unmeshasreeveni.blogspot.in/2014/11/updatedeleteinsert-in-hive-0140.html

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Fwd: Values getting duplicated in Hive table(Partitioned)

2014-11-18 Thread unmesha sreeveni
Thanks it worked.
On Nov 17, 2014 3:32 PM, unmesha sreeveni unmeshab...@gmail.com wrote:


 -- Forwarded message --
 From: unmesha sreeveni unmeshab...@gmail.com
 Date: Mon, Nov 17, 2014 at 10:49 AM
 Subject: Re: Values getting duplicated in Hive table(Partitioned)
 To: User - Hive u...@hive.apache.org


 In non partitioned table I am getting the correct values.

 Is my update query wrong?

1. INSERT OVERWRITE TABLE Unm_Parti_Trail PARTITION (Department = 'A')
SELECT employeeid,firstname,designation, CASE WHEN employeeid=19 THEN
'5 ELSE salary END AS salary FROM Unm_Parti_Trail;

 What I tried to include in the query is , In partion with department = A,
 update employeeid =19 's salary with 5

 Is that query statement wrong? and the replication is not affected to dept
 B and C.
 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





[Blog] Hive Partitioning

2014-11-18 Thread unmesha sreeveni
Hi,

This is a blog on Hive partitioning.

http://unmeshasreeveni.blogspot.in/2014/11/hive-partitioning.html

Hope it helps someone.

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


[Blog] Updating Partition Table using INSERT In HIVE

2014-11-18 Thread unmesha sreeveni
Hi

This is a blog on Hive updating for older version (hive -0.12.0)

http://unmeshasreeveni.blogspot.in/2014/11/updating-partition-table-using-insert.html

Hope it helps someone.

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Showing INFO ipc.Client: Retrying connect to server once hadoop is upgraded to cdh5.2.0

2014-11-17 Thread unmesha sreeveni
Upgraded my Hadoop cluster (CDH) to cdh5.2.0

But once I run my Job with iteration, It is showing after 1 st iterative
Job.


14/11/17 09:29:44 INFO ipc.Client: Retrying connect to server:
/xx.xx.xx.xx:. Already tried 0 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000
MILLISECONDS)
14/11/17 09:29:45 INFO ipc.Client: Retrying connect to server:
/xx.xx.xx.xx:. Already tried 1 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000
MILLISECONDS)
14/11/17 09:29:46 INFO ipc.Client: Retrying connect to server:
/xx.xx.xx.xx:. Already tried 2 time(s); retry policy is
RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000
MILLISECONDS)


I have some calculations in the driver class. It works fine for a small
dataset. When I tried to run 1 GB of data it shows the above error.
It seems that after my first job some calculation is done in the driver class,
and after that calculation the next job starts. But I think it is not
waiting for the time spent on the calculation in the driver class (as it is a 1 GB
file, the driver calculation takes a long time) and throws the above
error.

It worked fine in the previous version.


Did I miss anything during installation?
Why is it so? Please advise.

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Values getting duplicated in Hive table(Partitioned)

2014-11-16 Thread unmesha sreeveni
I created a Hive table with a *partition* and inserted data into the partitioned
Hive table.

Referred site:
https://blog.safaribooksonline.com/2012/12/03/tip-partitioning-data-in-hive/

   1.

   *Initially I created one non-partitioned table and then, using a select query,
   loaded the data into the partitioned table. Is there an alternate way?*
   2.

   *By following the above link my partitioned table contains duplicate values.
   Below are the steps.*

This is my sample employee dataset: link1 http://pastebin.com/tVh16Yxt

I tried the following queries: link2 http://pastebin.com/U2yykWpy

But after updating a value in the Hive table, the values are getting duplicated:

7   Nirmal  Tech12000   A
7   Nirmal  Tech12000   B

Nirmal is placed in department *A* only,
but it is duplicated to department *B*.

And once I update a column value in the middle I am getting NULL values
displayed, while updating the last column is fine.

Am I doing anything wrong?
Please suggest.--
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Fwd: Values getting duplicated in Hive table(Partitioned)

2014-11-16 Thread unmesha sreeveni
In non partitioned table I am getting the correct values.

Is my update query wrong?

INSERT OVERWRITE TABLE Unm_Parti_Trail PARTITION (Department = 'A') SELECT
employeeid,firstname,designation, CASE WHEN employeeid=19 THEN 5 ELSE
salary END AS salary FROM Unm_Parti_Trail;


What I tried to express in the query is: in the partition with department = A,
update the salary of employeeid 19 to 5.

Is that query statement wrong? And the duplication should not affect
departments B and C, should it?


-- Forwarded message --
From: hadoop hive hadooph...@gmail.com
Date: Mon, Nov 17, 2014 at 10:08 AM
Subject: Re: Values getting duplicated in Hive table(Partitioned)
To: u...@hive.apache.org


Can you check your select query to run on non partitioned tables. Check if
it's giving correct values.

Same as for dept. B
 On Nov 17, 2014 10:03 AM, unmesha sreeveni unmeshab...@gmail.com wrote:

 ***I created a Hive table with *non*- *partitioned* and using select
 query I inserted data into *Partioned* Hive table.

 On Mon, Nov 17, 2014 at 10:00 AM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 I created a Hive table with *partition* and inserted data into Partioned
 Hive table.

 Refered site
 https://blog.safaribooksonline.com/2012/12/03/tip-partitioning-data-in-hive/

1.

*Initially created one Non -partioned table and then using select
query and loaded data into partioned table. Is there an alternate way?*
2.

*By following above link my partioned table contains duplicate
values. Below are the setps*

 This is my Sample employee dataset:link1 http://pastebin.com/tVh16Yxt

 I tried the following queries: link2 http://pastebin.com/U2yykWpy

 But after updating a value in Hive table,the values are getting
 duplicated.

 7   Nirmal  Tech12000   A
 7   Nirmal  Tech12000   B

 Nirmal is placed in Department *A* only
 ​,​
 but it is duplicated to department *B*.

 And Once I update a column value in middle I am getting NULL values
 displayed,while updating last column it is fine.

 Am I doing any thing wrong.
 Please suggest.--


 --
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Can add a regular check in DataNode on free disk space?

2014-10-19 Thread unmesha sreeveni
1. Stop all Hadoop daemons
2. Remove all files from
  /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
3. Format namenode
4. Start all Hadoop daemons.

On Mon, Oct 20, 2014 at 8:26 AM, sam liu samliuhad...@gmail.com wrote:

 Hi Experts and Developers,

 At present, if a DataNode does not has free disk space, we can not get
 this bad situation from anywhere, including DataNode log. At the same time,
 under this situation, the hdfs writing operation will fail and return error
 msg as below. However, from the error msg, user could not know the root
 cause is that the only datanode runs out of disk space, and he also could
 not get any useful hint in datanode log. So I believe it will be better if
 we could add a regular check in DataNode on free disk space, and it will
 add WARNING or ERROR msg in datanode log if that datanode runs out of
 space. What's your opinion?

 Error Msg:
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
 /user/hadoop/PiEstimator_TMP_3_141592654/in/part0 could only be replicated
 to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running
 and no node(s) are excluded in this operation.
 at
 org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1441)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2702)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)


 Thanks!




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


[Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread unmesha sreeveni
http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread unmesha sreeveni
Hi

5 th question can it be SQOOP?

On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com
wrote:

 Yes

 On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar skumar.bigd...@hotmail.com
 wrote:

 Are you preparing g for Cloudera certification exam?





 Thanks and Regards,

 Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh

 (510) 936-2650

 Sr Data Consultant - BigData Implementations.

 [image: View my profile on LinkedIn]
 http://www.linkedin.com/in/sinhasantosh







 *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
 *Sent:* Monday, October 06, 2014 12:45 AM
 *To:* User - Hive; User Hadoop; User Pig
 *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects




 http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html



 --

 *Thanks  Regards *



 *Unmesha Sreeveni U.B*

 *Hadoop, Bigdata Developer*

 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*

 http://www.unmeshasreeveni.blogspot.in/








 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread unmesha sreeveni
what about the last one? The answer is correct. Pig. Is nt it?

On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam 
adarsh.deshrat...@gmail.com wrote:

 For question 3 answer should be B and for question 4 answer should be D.

 Thanks,
 Adarsh D

 Consultant - BigData and Cloud

 [image: View my profile on LinkedIn]
 http://in.linkedin.com/in/adarshdeshratnam

 On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 Hi

 5 th question can it be SQOOP?

 On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 Yes

 On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar 
 skumar.bigd...@hotmail.com wrote:

 Are you preparing g for Cloudera certification exam?





 Thanks and Regards,

 Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh

 (510) 936-2650

 Sr Data Consultant - BigData Implementations.

 [image: View my profile on LinkedIn]
 http://www.linkedin.com/in/sinhasantosh







 *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
 *Sent:* Monday, October 06, 2014 12:45 AM
 *To:* User - Hive; User Hadoop; User Pig
 *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects




 http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html



 --

 *Thanks  Regards *



 *Unmesha Sreeveni U.B*

 *Hadoop, Bigdata Developer*

 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*

 http://www.unmeshasreeveni.blogspot.in/








 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread unmesha sreeveni
What I feel is:

For question 5 it says the weblogs are already in HDFS (so there is no need
to import anything). Also, these are log files, NOT database files with a
specific schema. So I think Pig is the best way to access and process this
data.

On Tue, Oct 7, 2014 at 4:10 AM, Pradeep Gollakota pradeep...@gmail.com
wrote:

 I agree with the answers suggested above.

 3. B
 4. D
 5. C

 On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote:

  Hi

 No, Pig is a data manipulation language for data already in Hadoop.
 The question is about importing data from OLTP DB (eg Oracle, MySQL...)
 to Hadoop, this is what Sqoop is for (SQL to Hadoop)

 I'm not certain certification guys are happy with their exam questions
 ending up on blogs and mailing lists :-)

 Ulul

  Le 06/10/2014 13:54, unmesha sreeveni a écrit :

  what about the last one? The answer is correct. Pig. Is nt it?

 On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam 
 adarsh.deshrat...@gmail.com wrote:

 For question 3 answer should be B and for question 4 answer should be D.

  Thanks,
 Adarsh D

 Consultant - BigData and Cloud

 [image: View my profile on LinkedIn]
 http://in.linkedin.com/in/adarshdeshratnam

 On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

  Hi

  5 th question can it be SQOOP?

 On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com
  wrote:

  Yes

 On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar 
 skumar.bigd...@hotmail.com wrote:

  Are you preparing g for Cloudera certification exam?





 Thanks and Regards,

 Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh

 (510) 936-2650

 Sr Data Consultant - BigData Implementations.

 [image: View my profile on LinkedIn]
 http://www.linkedin.com/in/sinhasantosh







 *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
 *Sent:* Monday, October 06, 2014 12:45 AM
 *To:* User - Hive; User Hadoop; User Pig
 *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem
 Projects




 http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html



 --

 *Thanks  Regards *



 *Unmesha Sreeveni U.B*

 *Hadoop, Bigdata Developer*

 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*

 http://www.unmeshasreeveni.blogspot.in/








  --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B *
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
  http://www.unmeshasreeveni.blogspot.in/





  --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B *
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
  http://www.unmeshasreeveni.blogspot.in/






  --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B *
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
  http://www.unmeshasreeveni.blogspot.in/







-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects

2014-10-06 Thread unmesha sreeveni
Hi Pradeep
  You are right. Updated the right answers in the blog.

This may help anyone thinking about investing in that particular test
package.

On Tue, Oct 7, 2014 at 9:25 AM, Pradeep Gollakota pradeep...@gmail.com
wrote:

 That's not exactly what the question is asking for... It's saying that you
 have a bunch of weblogs in HDFS that you want to join with user profile
 data that is stored in your OLTP database, how do you do the join? First,
 you export your OLTP database into HDFS using Sqoop. Then you can use
 Pig/Hive/MR/Cascading/whatever to work with both the datasets and perform
 the join.

 On Mon, Oct 6, 2014 at 8:49 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 What I feel like is

 For  question
 ​ 5​
 it says, the weblogs are already in HDFS (so no need to import
 anything).Also these are log files, NOT database files with a specific
 schema. So
 ​ I think​
 Pig is the best way to access and process this data.

 On Tue, Oct 7, 2014 at 4:10 AM, Pradeep Gollakota pradeep...@gmail.com
 wrote:

 I agree with the answers suggested above.

 3. B
 4. D
 5. C

 On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote:

  Hi

 No, Pig is a data manipulation language for data already in Hadoop.
 The question is about importing data from OLTP DB (eg Oracle, MySQL...)
 to Hadoop, this is what Sqoop is for (SQL to Hadoop)

 I'm not certain certification guys are happy with their exam questions
 ending up on blogs and mailing lists :-)

 Ulul

  Le 06/10/2014 13:54, unmesha sreeveni a écrit :

  what about the last one? The answer is correct. Pig. Is nt it?

 On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam 
 adarsh.deshrat...@gmail.com wrote:

 For question 3 answer should be B and for question 4 answer should be
 D.

  Thanks,
 Adarsh D

 Consultant - BigData and Cloud

 [image: View my profile on LinkedIn]
 http://in.linkedin.com/in/adarshdeshratnam

 On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

  Hi

  5 th question can it be SQOOP?

 On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

  Yes

 On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar 
 skumar.bigd...@hotmail.com wrote:

  Are you preparing g for Cloudera certification exam?





 Thanks and Regards,

 Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh

 (510) 936-2650

 Sr Data Consultant - BigData Implementations.

 [image: View my profile on LinkedIn]
 http://www.linkedin.com/in/sinhasantosh







 *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
 *Sent:* Monday, October 06, 2014 12:45 AM
 *To:* User - Hive; User Hadoop; User Pig
 *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem
 Projects




 http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html



 --
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: toolrunner issue

2014-09-01 Thread unmesha sreeveni
public class MyClass extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int res = ToolRunner.run(conf, new MyClass(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    // build the job from getConf() so the parsed command-line options are kept
    Job job = new Job(getConf(), "");
    job.setJarByClass(MyClass.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(TwovalueWritable.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(TwovalueWritable.class);
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    ...
  }

This pattern works for me without any errors. Please make sure that you are
following the same structure as the code above.
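For comparison, here is a minimal sketch of the quoted code below reworked
the same way but keeping the old mapred API (MyMapper, CFNInputFormat and
the paths come from that mail; the job name is a placeholder). The key point
is that run() builds its JobConf from getConf(), i.e. from the Configuration
that ToolRunner/GenericOptionsParser already filled in; creating a fresh
JobConf(MyClass.class) there is what usually triggers the warning you mention.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyClass extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
    // ToolRunner parses -D, -files, -libjars ... into the Configuration
    System.exit(ToolRunner.run(new Configuration(), new MyClass(), args));
  }

  @Override
  public int run(String[] args) throws Exception {
    // Build the JobConf from getConf(); a fresh JobConf(MyClass.class)
    // discards what GenericOptionsParser parsed and triggers the warning.
    JobConf conf = new JobConf(getConf(), MyClass.class);
    conf.setJobName("my-job");
    FileInputFormat.addInputPath(conf, new Path("/smallInput"));
    FileOutputFormat.setOutputPath(conf, new Path("/TEST"));
    conf.setMapperClass(MyMapper.class);      // mapper from the quoted mail
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    JobClient.runJob(conf);
    return 0;
  }
}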


On Mon, Sep 1, 2014 at 4:18 PM, rab ra rab...@gmail.com wrote:

 Hello

 I m having an issue in running one simple map reduce job.
 The portion of the code is below. It gives a warning that Hadoop command
 line parsing was not peformed.
 This occurs despite the class implements Tool interface. Any clue?

 public static void main(String[] args) throws Exception {

 try{

 int exitcode = ToolRunner.run(new Configuration(), new
 MyClass(), args);

 System.exit(exitcode);

 }

 catch(Exception e)

 {

 e.printStackTrace();

 }

 }



 @Override

 public int run(String[] args) throws Exception {

 JobConf conf = new JobConf(MyClass.class);

 System.out.println(args);

 FileInputFormat.addInputPath(conf, new Path(/smallInput));

 conf.setInputFormat(CFNInputFormat.class);

 conf.setMapperClass(MyMapper.class);

 conf.setMapOutputKeyClass(Text.class);

 conf.setMapOutputValueClass(Text.class);

 FileOutputFormat.setOutputPath(conf, new Path(/TEST));

 JobClient.runJob(conf);

 return 0;




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Hadoop on Safe Mode because Resources are low on NameNode

2014-08-26 Thread unmesha sreeveni
You can leave safe mode:
Namenode in safe mode how to leave:
http://www.unmeshasreeveni.blogspot.in/2014/04/name-node-is-in-safe-mode-how-to-leave.html



On Wed, Aug 27, 2014 at 9:38 AM, Stanley Shi s...@pivotal.io wrote:

 You can force the namenode to get out of safe mode: hadoop dfsadmin
 -safemode leave


 On Tue, Aug 26, 2014 at 11:05 PM, Vincent Emonet vincent.emo...@gmail.com
  wrote:

 Hello,

 We have a 11 nodes Hadoop cluster installed from Hortonworks RPM doc:

 http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1.html

 The cluster was working fine since it went on Safe Mode during the
 execution of a job with this message on the NameNode interface:



 *Safe mode is ON. Resources are low on NN. Please add or free up more
 resources then turn off safe mode manually. NOTE: If you turn off safe mode
 before adding resources, the NN will immediately return to safe mode. Use
 hdfs dfsadmin -safemode leave to turn safe mode off.*
 The error displayed in the job log is:
 2014-08-22 08:51:35,446 WARN namenode.NameNodeResourceChecker
 (NameNodeResourceChecker.java:isResourceAvailable(89)) - Space available on
 volume 'null' is 100720640, which is below the configured reserved amount
 104857600 2014-08-22 08:51:35,446 WARN namenode.FSNamesystem
 (FSNamesystem.java:run(4042)) - NameNode low on available disk space.
 Already in safe mode.

 On each node we have 5 hdd used for Hadoop
 And we checked the 5 hdd on the namenode are all full (between 95 and
 100%) when the HDFS as still 50% of its capacity available : on the other
 nodes the 5 hdd are at 30/40%

 So I think this is the cause of the error.

 On the NameNode we had some Non HDFS data on 1 hdd, so I deleted them to
 have 50% of this hdd available (the 4 others are still between 95 and 100%)
 But this didn't resolve the problem
 I have also followed the advices found here :
 https://issues.apache.org/jira/browse/HDFS-4425
 And added the following property to the hdfs-site.xml of the NameNode
 (multiplying the default value by 2)
   <property>
     <name>dfs.namenode.resource.du.reserved</name>
     <value>209715200</value>
   </property>

 Still impossible to get out of the safe mode and as log as we are in safe
 mode we can't delete anything in the HDFS.


 Is anyone having a tip about this issue?


 Thankfully,

 Vincent.





 --
 Regards,
 *Stanley Shi,*




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Create HDFS directory fails

2014-07-29 Thread unmesha sreeveni
On Tue, Jul 29, 2014 at 1:13 PM, R J rj201...@yahoo.com wrote:

 java.io.IOException: Mkdirs failed to create


Check if you have permissions to mkdir this directory (try it from the
command line).
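If the command line works, you can also probe the same thing from Java with
a small check like this (a rough sketch; the path argument is whatever
directory the job is trying to create):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MkdirProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path(args[0]);      // the directory you are trying to create
    Path parent = dir.getParent();
    if (fs.exists(parent)) {
      FileStatus st = fs.getFileStatus(parent);
      // shows the owner/group/permission you would be writing under
      System.out.println(parent + " " + st.getPermission() + " " + st.getOwner());
    }
    System.out.println("mkdirs returned: " + fs.mkdirs(dir));
  }
}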
​​



-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Sqoop command syntax

2014-07-29 Thread unmesha sreeveni
Try this (dropping the explicit --driver option should let Sqoop pick its
Oracle-specific connector; with the generic JDBC driver it generates
SELECT t.* FROM mytab1 AS t, and Oracle does not accept the AS keyword for
table aliases, which is what produces the ORA-00933):

sqoop import \
  --connect jdbc:oracle:thin:@myoracleserver.com:1521/mysid \
  --table mytab1 \
  --username scott \
  --password tiger



On Tue, Jul 29, 2014 at 2:11 PM, R J rj201...@yahoo.com wrote:

 Hi All,

 Could anyone please help me with the Sqoop command syntax? I tried the
 following command:

 /home/logger/scoop/sqoop-1.4.4.bin__hadoop-0.20/bin/sqoop import --driver
 oracle.jdbc.driver.OracleDriver --connect
 jdbc:oracle:thin:@myoracleserver.com:1521/mysid --username scott
 --password tiger --table mytab1

 I get the error:
 14/07/29 08:26:23 INFO manager.SqlManager: Executing SQL statement: SELECT
 t.* FROM mytab1 AS t WHERE 1=0
 14/07/29 08:26:24 ERROR manager.SqlManager: Error executing statement:
 java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended

 java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended

 at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447)
 at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396)
 at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951)
 at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513)
 at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)
 at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531)
 at
 oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208)
 at
 oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:886)
 at
 oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1175)
 at
 oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1296)
 at
 oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3613)
 at
 oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3657)
 at
 oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1495)
 at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:674)
 at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:683)
 at
 org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:240)
 at
 org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:223)
 at
 org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:347)
 at
 org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1277)
 at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1089)
 at
 org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
 at
 org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396)
 at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
 at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
 at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
 14/07/29 08:26:24 ERROR tool.ImportTool: Encountered IOException running
 import job: java.io.IOException: No columns to generate for ClassWriter
 at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095)
 at
 org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
 at
 org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396)
 at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502)
 at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
 at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
 at org.apache.sqoop.Sqoop.main(Sqoop.java:238)





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


ListWritable In Hadoop

2014-07-10 Thread unmesha sreeveni
hi
 Do we have a ListWritable in hadoop ?

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*


Re: hadoop directory can't add and remove

2014-07-02 Thread unmesha sreeveni
http://www.unmeshasreeveni.blogspot.in/2014/04/name-node-is-in-safe-mode-how-to-leave.html


hadoop fs -rm -r QuasiMonteCarlo_1404262305436_855154103



On Wed, Jul 2, 2014 at 11:56 AM, EdwardKing zhan...@neusoft.com wrote:

  I want to remove hadoop directory,so I use hadoop fs -rmr,but it can't
 remove,why?

 [hdfs@localhost hadoop-2.2.0]$ hadoop fs -ls
 Found 2 items
 drwxr-xr-x   - hdfs supergroup  0 2014-07-01 17:52
 QuasiMonteCarlo_1404262305436_855154103
 drwxr-xr-x   - hdfs supergroup  0 2014-07-01 18:17
 QuasiMonteCarlo_1404263830233_477013424

 [hdfs@localhost hadoop-2.2.0]$ hadoop fs -rmr *.*
 rmr: DEPRECATED: Please use 'rm -r' instead.
 rmr: `LICENSE.txt': No such file or directory
 rmr: `NOTICE.txt': No such file or directory
 rmr: `README.txt': No such file or directory

 [hdfs@localhost hadoop-2.2.0]$ hadoop fs -la
 Found 2 items
 drwxr-xr-x   - hdfs supergroup  0 2014-07-01 17:52
 QuasiMonteCarlo_1404262305436_855154103
 drwxr-xr-x   - hdfs supergroup  0 2014-07-01 18:17
 QuasiMonteCarlo_1404263830233_477013424

 And I try mkdir a directory,but I still fail
 [hdfs@localhost hadoop-2.2.0]$ hadoop fs -mkdir abc
 mkdir: Cannot create directory /user/hdfs/abc. Name node is in safe mode.

 How can I do it?  Thanks.


 ---
 Confidentiality Notice: The information contained in this e-mail and any
 accompanying attachment(s)
 is intended only for the use of the intended recipient and may be
 confidential and/or privileged of
 Neusoft Corporation, its subsidiaries and/or its affiliates. If any reader
 of this communication is
 not the intended recipient, unauthorized use, forwarding, printing,
 storing, disclosure or copying
 is strictly prohibited, and may be unlawful.If you have received this
 communication in error,please
 immediately notify the sender by return e-mail, and delete the original
 message and all copies from
 your system. Thank you.

 ---




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: WholeFileInputFormat in hadoop

2014-06-29 Thread unmesha sreeveni
But how is it different from normal execution and parallel MR?
MapReduce is a parallel execution framework where each map call receives a
single input record.

If WholeFileInputFormat gave just an entire input split instead of the entire
input file, it would be useful, right?
If it is the whole file, it can hit a heap space error.

Please correct me if I am wrong.

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: WholeFileInputFormat in hadoop

2014-06-29 Thread unmesha sreeveni
I am trying to implement the DBSCAN algorithm. I referred to the algorithm in
Data Mining - Concepts and Techniques (3rd Ed), chapter 10, page 474.
In this algorithm we need to find the distance between each pair of points.
Say my sample input is
5,6
8,2
4,5
4,6

So in DBSCAN we have to pick one element and then find its distance to all
the others.

While implementing this I will not be able to get the whole file in the map
in order to find the distances.
I tried some approaches:
1. Used WholeFileInputFormat and did the entire algorithm in the map itself -
I don't think this is a good one (and it ended up with a heap space error).
2. This one is not implemented, as I thought it is not feasible:
  - Read one line of the input data set in the driver and write it to a new
file (say centroid).
 - This centroid can be read in setup(), the distance calculated in the map,
and the points that satisfy the DBSCAN condition emitted as
map(id, epsilonNeighbour); in the reducer we can aggregate all the epsilon
neighbours of (5,6) coming from different maps and then find the neighbours
of each epsilon neighbour.
 - The next iteration should do the same again: read the input file and find
a node which is not yet visited.
If the input is a 1 GB file, the MR job would execute as many times as there
are records.


Can anyone suggest a better way to do this?

Hope the use case is understandable; otherwise please tell me and I will
explain further.
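For approach 2, a minimal sketch of the mapper, assuming comma-separated 2-D
points, a one-line side file holding the query point (the "centroid" file
mentioned above), and made-up property names dbscan.epsilon and
dbscan.centroid.path set in the driver:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits every input point that lies within epsilon of the query point
// written by the driver into the side file (approach 2 above).
public class EpsilonNeighbourMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private double qx, qy;       // query point, read once in setup()
  private double epsilon;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    epsilon = conf.getFloat("dbscan.epsilon", 1.0f);
    Path centroid = new Path(conf.get("dbscan.centroid.path", "centroid"));
    FileSystem fs = FileSystem.get(conf);
    BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(centroid)));
    try {
      String[] parts = br.readLine().trim().split(",");
      qx = Double.parseDouble(parts[0]);
      qy = Double.parseDouble(parts[1]);
    } finally {
      br.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().trim().split(",");
    if (parts.length < 2) {
      return;                                    // skip malformed lines
    }
    double x = Double.parseDouble(parts[0]);
    double y = Double.parseDouble(parts[1]);
    double dist = Math.sqrt((x - qx) * (x - qx) + (y - qy) * (y - qy));
    if (dist <= epsilon) {
      // key the neighbour by the query point so one reduce call sees the
      // whole epsilon-neighbourhood of (qx,qy)
      context.write(new Text(qx + "," + qy), value);
    }
  }
}

The reducer then only has to collect the values for that single key; the
outer loop over not-yet-visited points still has to live in the driver,
which is the part that makes plain MapReduce awkward for DBSCAN.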


-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


WholeFileInputFormat in hadoop

2014-06-28 Thread unmesha sreeveni
Hi

  A small clarification:

 Does WholeFileInputFormat take the entire input file as a single input, or
each record (input split) as a whole?

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Finding maximum value in reducer

2014-06-24 Thread unmesha sreeveni
I have a scenario.

Output from previous job1 is http://pastebin.com/ADa8fTGB.

In next job2 I need to get/find i key having maximum value.

eg i=3, 3 keys having maximum value.
(i will be a custom parameter)

How to approach this.

Should we calculated max() in job2 mapper as there will be unique keys(as
the output is coming from previous reducer)

or

find max in second jobs reducer.But again how to find i keys?

I tried in this way
Instead of emiting value as value in reducer.I emitted value as key so I
can get the values in ascending order. And I wrote the next MR job.where
mapper simply emits the key/value.

Reducer finds the max of key But again I am stuck that cannot be done as we
try to get the id , because id is only unique,Values are not uniqe

How to solve this.
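One common pattern for this kind of top-N problem (a sketch, not the only
way; the property name top.n and the tab-separated "id value" line format
are assumptions) is to keep a small sorted map of the best i entries in each
mapper, emit them from cleanup(), and repeat the same logic in a single
reducer. Note that a TreeMap keyed only by the value collapses ties; for
duplicate values you would key it by a value+id composite instead.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Keeps only the local top-N (value, id) pairs per mapper and emits them in
// cleanup(); a single reducer applies the same logic to get the global top-N.
public class TopNMapper extends Mapper<LongWritable, Text, DoubleWritable, Text> {

  private int n;
  private TreeMap<Double, String> topN = new TreeMap<Double, String>();

  @Override
  protected void setup(Context context) {
    n = context.getConfiguration().getInt("top.n", 3);
  }

  @Override
  protected void map(LongWritable key, Text line, Context context) {
    // expected line format from job1: "id <tab> value"
    String[] parts = line.toString().split("\t");
    if (parts.length < 2) {
      return;
    }
    topN.put(Double.parseDouble(parts[1]), parts[0]);
    if (topN.size() > n) {
      topN.remove(topN.firstKey());     // drop the current smallest value
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<Double, String> e : topN.entrySet()) {
      context.write(new DoubleWritable(e.getKey()), new Text(e.getValue()));
    }
  }
}

A single reducer then does the same thing over all the mappers' outputs and
writes the final i records.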

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Map reduce query

2014-06-20 Thread unmesha sreeveni
Hi

You can directly use this right?

FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

Or do you need an extra input file to feed into the mapper?


On Fri, Jun 20, 2014 at 11:57 AM, Shrivastava, Himnshu (GE Global Research,
Non-GE) himnshu.shrivast...@ge.com wrote:

  How can I give Input to a mapper from the command line? –D option can be
 used but what are the corresponding changes required in the mapper and the
 driver program ?



 Regards,




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Map reduce query

2014-06-20 Thread unmesha sreeveni
You can try it using DistributedCache.

In the driver:
FileStatus[] list = fs.globStatus(extrafile);   // extrafile is the Path of the side file
for (FileStatus status : list) {
  DistributedCache.addCacheFile(status.getPath().toUri(), conf);
}

In the mapper (e.g. in setup(), with conf = context.getConfiguration()):
URI[] cacheFiles = DistributedCache.getCacheFiles(conf);
Path getPath = new Path(cacheFiles[0].getPath());
BufferedReader bf = new BufferedReader(new
InputStreamReader(fs.open(getPath)));
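If the extra input is just a single value rather than a whole file, the -D
route from the original question is even simpler. A sketch (my.param is a
placeholder property name), assuming the driver goes through ToolRunner so
that -D options are parsed into the Configuration:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Run the job as:
//   hadoop jar myjar.jar MyDriver -D my.param=somevalue <input> <output>
// ToolRunner/GenericOptionsParser puts my.param into the job Configuration
// with no extra driver code; the mapper just reads it back in setup().
public class ParamMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String param;

  @Override
  protected void setup(Context context) {
    param = context.getConfiguration().get("my.param", "default");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(param), value);   // use the parameter as needed
  }
}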


On Fri, Jun 20, 2014 at 1:12 PM, unmesha sreeveni unmeshab...@gmail.com
wrote:

 Hi

 You can directly use this right?

 FileInputFormat.setInputPaths(job,new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));

 Or you need extra input file to feed into mapper?


 On Fri, Jun 20, 2014 at 11:57 AM, Shrivastava, Himnshu (GE Global
 Research, Non-GE) himnshu.shrivast...@ge.com wrote:

  How can I give Input to a mapper from the command line? –D option can
 be used but what are the corresponding changes required in the mapper and
 the driver program ?



 Regards,




 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Counters in MapReduce

2014-06-12 Thread unmesha sreeveni
Yes, I rectified that error.
But after the 1st iteration, when it enters the second iteration, it is
showing

 java.io.FileNotFoundException: for  *Path out1 = new Path("CL");*
*Why is that?*
*Normally the only constraint should be that the output folder does not already exist.*

* //other configuration*
* job1.setMapperClass(ID3ClsLabelMapper.class);*
* job1.setReducerClass(ID3ClsLabelReducer.class);*
* Path in = new Path(args[0]);*
* Path out1 = new Path("CL");*
*//delete the file if it exists*
* if(counter == 0){*
*FileInputFormat.addInputPath(job1, in);*
* }*
* else{*
*FileInputFormat.addInputPath(job1, out5);*
* }*
*  FileOutputFormat.setOutputPath(job1,out1);*
* job1.waitForCompletion(true);*


On Thu, Jun 12, 2014 at 10:29 AM, unmesha sreeveni unmeshab...@gmail.com
wrote:

 I tried out by setting an enum to count no. of lines in output file from
 job3.

 But I am getting
 14/06/12 10:12:30 INFO mapred.JobClient: Total committed heap usage
 (bytes)=1238630400
 conf3
 Exception in thread main java.lang.IllegalStateException: Job in state
 DEFINE instead of RUNNING
 at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:116)
  at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:491)


 Below is my current code

 *static enum UpdateCounter {*
 *INCOMING_ATTR*
 *}*

 *public static void main(String[] args) throws Exception {*
 *Configuration conf = new Configuration();*
 *int res = ToolRunner.run(conf, new Driver(), args);*
 *System.exit(res);*
 *}*


 *@Override*
 *public int run(String[] args) throws Exception {*
 *while(counter >= 0){*

 *  Configuration conf = getConf();*
 * /**
 * * Job 1: *
 * */*
 * Job job1 = new Job(conf, "");*
 * //other configuration*
 * job1.setMapperClass(ID3ClsLabelMapper.class);*
 * job1.setReducerClass(ID3ClsLabelReducer.class);*
 * Path in = new Path(args[0]);*
 * Path out1 = new Path("CL");*
 * if(counter == 0){*
 *FileInputFormat.addInputPath(job1, in);*
 * }*
 * else{*
 *FileInputFormat.addInputPath(job1, out5);   *
 * }*
 * FileInputFormat.addInputPath(job1, in);*
 * FileOutputFormat.setOutputPath(job1,out1);*
 * job1.waitForCompletion(true);*
 */**
 * * Job 2: *
 * *  *
 * */*
 *Configuration conf2 = getConf();*
 *Job job2 = new Job(conf2, "");*
 *Path out2 = new Path("ANC");*
 *FileInputFormat.addInputPath(job2, in);*
 *FileOutputFormat.setOutputPath(job2,out2);*
 *   job2.waitForCompletion(true);*

  *   /**
 * * Job3*
 **/*
 *Configuration conf3 = getConf();*
 *Job job3 = new Job(conf3, "");*
 *System.out.println(conf3);*
 *Path out5 = new Path(args[1]);*
 *if(fs.exists(out5)){*
 *fs.delete(out5, true);*
 *}*
 *FileInputFormat.addInputPath(job3,out2);*
 *FileOutputFormat.setOutputPath(job3,out5);*
 *job3.waitForCompletion(true);*
 *FileInputFormat.addInputPath(job3,new Path(args[0]));*
 *FileOutputFormat.setOutputPath(job3,out5);*
 *job3.waitForCompletion(true);*
 *counter =
 job3.getCounters().findCounter(UpdateCounter.INCOMING_ATTR).getValue();*
 *  }*
 * return 0;*

  Am I doing anything wrong?


 On Mon, Jun 9, 2014 at 4:37 PM, Krishna Kumar kku...@nanigans.com wrote:

 You should use FileStatus to  decide what files you want to include in
 the InputPath, and use the FileSystem class to delete or process the
 intermediate / final paths. Moving each job in your iteration logic into
 different methods would help keep things simple.



 From: unmesha sreeveni unmeshab...@gmail.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Monday, June 9, 2014 at 6:02 AM
 To: User Hadoop user@hadoop.apache.org
 Subject: Re: Counters in MapReduce

 Ok I will check out with counters.
 And after I st iteration the input file to job1 will be the output file
 of job 3.How to give that.
 *Inorder to satisfy 2 conditions*
 First iteration : users input file
 after first iteration :job 3 's output file as job 1 s input.



 --
 *Thanks  Regards*


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/



   --
 *Kai Voigt* Am Germaniahafen 1 k...@123.org
  24143 Kiel +49 160 96683050
  Germany @KaiVoigt




 --
 *Thanks  Regards*


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Counters in MapReduce

2014-06-09 Thread unmesha sreeveni
I am trying to do iteration with MapReduce. I have 3 jobs running in sequence.

*//job1 configuration*
*FileInputFormat.addInputPath(job1,new Path(args[0]));*
*FileOutputFormat.setOutputPath(job1,out1);*
*job1.waitForCompletion(true);*

*job2 configuration*
*FileInputFormat.addInputPath(job2,out1);*
*FileOutputFormat.setOutputPath(job2,out2);*
*job2.waitForCompletion(true);*

 *job3 configuration*
*FileInputFormat.addInputPath(job3,out2);*
*FileOutputFormat.setOutputPath(job3,new Path(args[1]);*
*boolean success = job3.waitForCompletion(true);*
*return(success ? 0 : 1);*

After job3 I should continue with another iteration - job3's output should
be the input for job1, and the iteration should continue until the input
file is empty. How do I accomplish this?

Will counters do the work?
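A counter-driven loop would look roughly like this (a condensed sketch of
only the loop; job setup details and the first two jobs are omitted, and the
counter and path names are placeholders): the last job of each pass
increments a counter, for example once per record it writes, and the driver
reads it after waitForCompletion() and stops when it reaches zero.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {

  // incremented in job3's reducer with
  // context.getCounter(IterationCounter.RECORDS_WRITTEN).increment(1)
  public static enum IterationCounter { RECORDS_WRITTEN }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);
    long remaining = 1;
    int iteration = 0;

    while (remaining > 0) {
      Path output = new Path(args[1] + "_iter" + iteration);
      if (fs.exists(output)) {
        fs.delete(output, true);                 // reuse the name safely
      }
      // ... run job1 and job2 here the same way ...
      Job job3 = Job.getInstance(conf, "job3-iteration-" + iteration);
      // ... set jar, mapper, reducer and key/value classes for job3 ...
      FileInputFormat.addInputPath(job3, input);
      FileOutputFormat.setOutputPath(job3, output);
      job3.waitForCompletion(true);

      remaining = job3.getCounters()
                      .findCounter(IterationCounter.RECORDS_WRITTEN).getValue();
      input = output;                            // feed job3's output back to job1
      iteration++;
    }
  }
}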
​

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Need advice for Implementing classification algorithms in MapReduce

2014-05-26 Thread unmesha sreeveni
In order to learn MapReduce algorithms, I usually try them from scratch.
What I follow for classification algorithms is -

*First I build a model *
*hadoop jar myjar.jar edu.ModelDriver Modelinput Modeloutput*

*secondly I will write a prediction class in MR*
*hadoop jar myjar.jar edu.PredictDriver Testinput TestOutput
Modeloutput*

  *Modeloutput is supplied as an argument to get the model results for
prediction.*

Is this a better way ?

or Should I follow the below way -

*hadoop jar myjar.jar edu.Driver train=traininput.txt
test=testinput.txt output=outputlocation*


Which is the standard way?
Please suggest

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Issue with conf.set and conf.get method

2014-05-21 Thread unmesha sreeveni
Hi,

I am having an issue with conf.set and conf.get method
Driver
Configuration conf=new Configuration();
conf.set("delimiter", args[2]);   //File delimiter as user argument

Map/Reduce
Configuration conf = context.getConfiguration();
String delim = conf.get("delimiter");

Everything works fine with this. I am able to get the delimiter (, ;
.) and process accordingly, except for TAB.

If I give
1. \t as an argument, none of the operations work
   e.g. I will not be able to do
 1. StringTokenizer st = new StringTokenizer(value.toString(), delim)
  but it works for split
  String[] parts = value.toString().split(delim);
 2. String classLabel =
value.toString().substring(value.toString().lastIndexOf(delim)+1);

2. \t as an argument also won't work
3. \\t and \\t also won't work
4.   this WORKS FINE as an argument.

Has anybody come across this issue?
If so, can anyone tell me a workaround?
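One workaround for the TAB case (a sketch; it assumes the shell delivers the
two characters backslash and t rather than a real tab) is to translate the
escape sequence in the driver before calling conf.set():

public class DelimiterUtil {
  // Translates the escape sequence a shell argument can carry into the
  // real character the mapper should split on.
  public static String normalize(String arg) {
    if ("\\t".equals(arg)) {     // the shell passes backslash + t, not a tab
      return "\t";
    }
    return arg;
  }
}

In the driver: conf.set("delimiter", DelimiterUtil.normalize(args[2]));
after that, conf.get("delimiter") in the mapper returns a real tab, so both
StringTokenizer and String.split() behave the same way.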

Regards
Unmesha

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Are mapper classes re-instantiated for each record?

2014-05-16 Thread unmesha sreeveni
setup() is called once per map task before any of its map() calls, and
cleanup() is called once after the last map() call of that task.


On Tue, May 6, 2014 at 1:17 PM, Raj K Singh rajkrrsi...@gmail.com wrote:

 point 2 is right,The framework first calls setup() followed by map() for
 each key/value pair in the InputSplit. Finally cleanup() is called
 irrespective of no of records in the input split.

 
 Raj K Singh
 http://in.linkedin.com/in/rajkrrsingh
 http://www.rajkrrsingh.blogspot.com
 Mobile  Tel: +91 (0)9899821370


 On Tue, May 6, 2014 at 11:21 AM, Sergey Murylev 
 sergeymury...@gmail.comwrote:

  Hi Jeremy,

 According to official 
 documentationhttp://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.htmlsetup
  and cleanup calls performed for each InputSplit. In this case you
 variant 2 is more correct. But actually single mapper can be used for
 processing multiple InputSplits. In you case if you have 5 files with 1
 record each it can call setup/cleanup 5 times. But if your records are in
 single file I think that setup/cleanup should be called once.

 --
 Thanks,
 Sergey


 On 06/05/14 02:49, jeremy p wrote:

 Let's say I have TaskTracker that receives 5 records to process for a
 single job.  When the TaskTracker processses the first record, it will
 instantiate my Mapper class and execute my setup() function.  It will then
 run the map() method on that record.  My question is this : what happens
 when the map() method has finished processing the first record?  I'm
 guessing it will do one of two things :

  1) My cleanup() function will execute.  After the cleanup() method has
 finished, this instance of the Mapper object will be destroyed.  When it is
 time to process the next record, a new Mapper object will be instantiated.
  Then my setup() method will execute, the map() method will execute, the
 cleanup() method will execute, and then the Mapper instance will be
 destroyed.  When it is time to process the next record, a new Mapper object
 will be instantiated.  This process will repeat itself until all 5 records
 have been processed.  In other words, my setup() and cleanup() methods will
 have been executed 5 times each.

  or

  2) When the map() method has finished processing my first record, the
 Mapper instance will NOT be destroyed.  It will be reused for all 5
 records.  When the map() method has finished processing the last record, my
 cleanup() method will execute.  In other words, my setup() and cleanup()
 methods will only execute 1 time each.

  Thanks for the help!






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: writing multiple files on hdfs

2014-05-11 Thread unmesha sreeveni
Yes, you can do so. Hadoop is a distributed computing framework, and you are
able to do multiple writes as well. The only thing we cannot do is update a
file's content in place, but you can work around that by deleting the file
and then writing the whole thing again.
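For instance, a minimal sketch with the FileSystem API (paths are
placeholders): each file gets its own output stream, the streams are written
independently, and all of them are closed at the end. Note that each HDFS
file still has a single writer; "parallel" here means several open streams
from one client, not concurrent writers to the same file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MultiFileWriter {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // open several files; each stream can be written independently
    FSDataOutputStream[] outs = new FSDataOutputStream[3];
    for (int i = 0; i < outs.length; i++) {
      outs[i] = fs.create(new Path("/tmp/multi/out-" + i), true); // overwrite if present
    }
    for (int i = 0; i < outs.length; i++) {
      outs[i].writeBytes("data for file " + i + "\n");
    }
    for (FSDataOutputStream out : outs) {
      out.close();       // close all streams at the end
    }
  }
}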



On Mon, May 12, 2014 at 8:06 AM, Stanley Shi s...@gopivotal.com wrote:

 Yes, why not?

 Regards,
 *Stanley Shi,*



 On Sun, May 11, 2014 at 9:57 PM, Karim Awara karim.aw...@kaust.edu.sawrote:

 Hi,

 Can I open multiple files on hdfs and write data to them in parallel and
 then close them at the end?

 --
 Best Regards,
 Karim Ahmed Awara

 --
 This message and its contents, including attachments are intended solely
 for the original recipient. If you are not the intended recipient or have
 received this message in error, please notify me immediately and delete
 this message from your computer system. Any unauthorized use or
 distribution is prohibited. Please consider the environment before printing
 this email.





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


[Blog] Map-only jobs in Hadoop for beginers

2014-05-05 Thread unmesha sreeveni
​Hi

http://www.unmeshasreeveni.blogspot.in/2014/05/map-only-jobs-in-hadoop.html

This is a post on Map-only jobs in Hadoop for beginners.

​

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


[Blog] Chaining Jobs In Hadoop for beginners.

2014-05-03 Thread unmesha sreeveni
Hi


http://www.unmeshasreeveni.blogspot.in/2014/04/chaining-jobs-in-hadoop-mapreduce.html

   This is the sample code for chaining Jobs In Hadoop for beginners.

Please post your comments in blog page.

Let me know your thoughts, so that I can improve my blog post.


-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Which database should be used

2014-05-02 Thread unmesha sreeveni
On Fri, May 2, 2014 at 1:51 PM, Alex Lee eliy...@hotmail.com wrote:

 hive


HBase is better.



-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Wordcount file cannot be located

2014-05-01 Thread unmesha sreeveni
Try this along with your MapReduce source code

Configuration config = new Configuration();
config.set("fs.defaultFS", "hdfs://IP:port/");
FileSystem dfs = FileSystem.get(config);
Path path = new Path("/tmp/in");

Let me know your thoughts.


-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


[Blog] Code For Deleting Output Folder If Exist In Hadoop MapReduce Jobs

2014-05-01 Thread unmesha sreeveni
Hi

   This is sample code for deleting the output folder (if it exists) in Hadoop
MapReduce jobs, for beginners; it can be included along with our MapReduce
code so that the same output folder can be reused while debugging.

Please post your comments in blog page.

Let me know your thoughts

Thanks
unmesha

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: [Blog] Code For Deleting Output Folder If Exist In Hadoop MapReduce Jobs

2014-05-01 Thread unmesha sreeveni
Please see this link:
http://www.unmeshasreeveni.blogspot.in/2014/04/code-for-deleting-existing-output.html


On Fri, May 2, 2014 at 8:52 AM, unmesha sreeveni unmeshab...@gmail.comwrote:

 Hi

This is the sample code for Deleting Output Folder(If Exist) In Hadoop
 MapReduce Jobs for beginners that can be included along with our MapReduce
 Code to work on with same output folder for debugging.

 Please post your comments in blog page.

 Let me know your thoughts

 Thanks
 unmesha

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: hadoop.tmp.dir directory size

2014-04-30 Thread unmesha sreeveni
​​
Try

*hadoop fs -rmr /tmp*

-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: What configuration parameters cause a Hadoop 2.x job to run on the cluster

2014-04-30 Thread unmesha sreeveni
   config.set("fs.defaultFS", "hdfs://port/");
  config.set("hadoop.job.ugi", "hdfs");


On Fri, Apr 25, 2014 at 10:46 PM, Oleg Zhurakousky 
oleg.zhurakou...@gmail.com wrote:

 Yes, it will be copied since it goes to each job's namesapce



 On Fri, Apr 25, 2014 at 1:14 PM, Steve Lewis lordjoe2...@gmail.comwrote:

 I am using MR and know the job.setJar command - I can add all
 dependencies to the jar in the lib directory but I was wondering if Hadoop
 would copy a jar from my local machine to the cluster - also is I ran
 multiple jobs with the same jar whether the jar would be copied N times (I
 typically chain 5 map-reduce jobs


 On Fri, Apr 25, 2014 at 10:08 AM, Oleg Zhurakousky 
 oleg.zhurakou...@gmail.com wrote:

 Are you talking about MR or plain YARN application?
 In MR you typically use one of the job.setJar* methods. That aside you
 may have more then your app JAR (dependencies). So you can copy the
 dependencies to all hadoop nodes classpath (e.g., shared dir)

 Oleg


 On Fri, Apr 25, 2014 at 1:02 PM, Steve Lewis lordjoe2...@gmail.comwrote:

 so if I create a Hadoop jar file with referenced libraries in the lib
 directory do I need to move it to hdfs or can it sit on my local machine?
 if I move it to hdfs where does it live - which is to say how do I specify
 the path?


 On Fri, Apr 25, 2014 at 9:52 AM, Oleg Zhurakousky 
 oleg.zhurakou...@gmail.com wrote:

 Yes, if you are running MR


 On Fri, Apr 25, 2014 at 12:48 PM, Steve Lewis 
 lordjoe2...@gmail.comwrote:

 Thank you for your answer

 1) I am using YARN
 2) So presumably dropping  core-site.xml, yarn-site into user.dir
 works do I need mapred-site.xml as well?



 On Fri, Apr 25, 2014 at 9:00 AM, Oleg Zhurakousky 
 oleg.zhurakou...@gmail.com wrote:

 What version of Hadoop you are using? (YARN or no YARN)
 To answer your question; Yes its possible and simple. All you need
 to to is to have Hadoop JARs on the classpath with relevant 
 configuration
 files on the same  classpath pointing to the Hadoop cluster. Most often
 people simply copy core-site.xml, yarn-site.xml etc from the actual 
 cluster
 to the application classpath and then you can run it straight from IDE.

 Not a windows user so not sure about that second part of the
 question.

 Cheers
 Oleg


 On Fri, Apr 25, 2014 at 11:46 AM, Steve Lewis lordjoe2...@gmail.com
  wrote:

 Assume I have a machine on the same network as a hadoop 2 cluster
 but separate from it.

 My understanding is that by setting certain elements of the config
 file or local xml files to point to the cluster I can launch a job 
 without
 having to log into the cluster, move my jar to hdfs and start the job 
 from
 the cluster's hadoop machine.

 Does this work?
 What Parameters need I sat?
 Where is the jar file?
 What issues would I see if the machine is running Windows with
 cygwin installed?





 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com





 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com





 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: How do I get started with hadoop

2014-04-30 Thread unmesha sreeveni
check this link:
http://www.unmeshasreeveni.blogspot.in/2014/04/hadoop-installation-for-beginners.html
and
http://www.unmeshasreeveni.blogspot.in/2014/04/hadoop-wordcount-example-in-detail.html


On Fri, Apr 25, 2014 at 5:29 PM, Shahab Yunus shahab.yu...@gmail.comwrote:

 Assuming, you are talking about basic stuff...

 Michael Noll has some good Hadoop (pre-Yarn) tutorials

 http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

 Then definitely go through the book Hadoop- The Definitive Guide by Tom
 White.
 http://shop.oreilly.com/product/0636920021773.do

 Then, if you download the free distributions by a non-Apache vendors (e.g.
 Cloudera, MapR, Horton etc.), their docs are helpful too as well.

 Lastly, Apache it self has quite good docs for starter/basic stuff:

 e.g. this is for Hadoop 2.3.0 version:
 http://hadoop.apache.org/docs/r2.3.0/
 You can find similar for almost all versions.


 Regards,
 Shahab


 On Fri, Apr 25, 2014 at 2:26 AM, 破千 997626...@qq.com wrote:

 Hi,
 I'm new in hadoop, can I get some useful links about hadoop so I can get
 started with it step by step.
 Thank you very much!





-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: How do I get started with hadoop on windows system

2014-04-30 Thread unmesha sreeveni
In order to get started with Hadoop you need to install Cygwin (it provides
a look and feel close to Linux),
or else you can run Ubuntu in a VM (e.g. VMware Player).
Once you have done this,
you can download Hadoop directly from Apache or from other vendors
and follow these steps:
http://www.unmeshasreeveni.blogspot.in/2014/04/hadoop-installation-for-beginners.html




On Fri, Apr 25, 2014 at 11:47 AM, 破千 997626...@qq.com wrote:

 Hi everyone,
 I have subscribed hadoop mail list this morning. How do I get started with
 hadoop on my windows 7 PC.
 Thanks!




-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: Using Eclipse for Hadoop code

2014-04-30 Thread unmesha sreeveni
Are you asking about standalone mode, where we run Hadoop using the local fs?


-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: hadoop.tmp.dir directory size

2014-04-30 Thread unmesha sreeveni
Can you just try this and see?

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.deleteOnExit(new Path("path/to/tmp"));


On Thu, May 1, 2014 at 12:10 AM, S.L simpleliving...@gmail.com wrote:

 Can I do this while the job is still running ? I know I cant delete the
 directory , but I just want confirmation that the data Hadoop writes into
 /tmp/hadoop-df/nm-local-dir (df being my user name) can be discarded while
 the job is being executed.


 On Wed, Apr 30, 2014 at 6:40 AM, unmesha sreeveni 
 unmeshab...@gmail.comwrote:

 ​​
 Try

 * ​​hadoop fs -rmr /tmp*

 --
 *Thanks  Regards *


 *Unmesha Sreeveni U.B*
 *Hadoop, Bigdata Developer*
 *Center for Cyber Security | Amrita Vishwa Vidyapeetham*
 http://www.unmeshasreeveni.blogspot.in/






-- 
*Thanks  Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/


Re: how to customize hadoop configuration for a job?

2014-04-01 Thread unmesha sreeveni
Hi Libo,
 You can implement your driver code using ToolRunner, so that you can pass
extra configuration through the command line instead of editing your code
all the time.

Driver code

public class WordCount extends Configured implements Tool {

  public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new Configuration(), new
WordCount(), args);
System.exit(exitCode);
  }
  public int run(String[] args) throws Exception {
if (args.length != 2) {
System.out.printf("Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
 return -1;
   }
 Job job = new Job(getConf());
 job.setJarByClass(WordCount.class);
 job.setJobName("Word Count");
 FileInputFormat.setInputPaths(job, new Path(args[0]));
 FileOutputFormat.setOutputPath(job, new Path(args[1]));
 job.setMapperClass(WordMapper.class);
 job.setReducerClass(SumReducer.class);
 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(IntWritable.class);
 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(IntWritable.class);
 boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
}

command line
---
$ hadoop jar myjar.jar MyDriver -D mapred.reduce.tasks=10 myinputdir
myoutputdir

This is a better practise.


Happy Hadooping.


-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Hadoop, Bigdata Developer
Center for Cyber Security | Amrita Vishwa Vidyapeetham

http://www.unmeshasreeveni.blogspot.in/


Re: error in hadoop hdfs while building the code.

2014-03-12 Thread unmesha sreeveni
I think it is a Hadoop problem, not a Java one:
https://issues.apache.org/jira/browse/HADOOP-5396


On Wed, Mar 12, 2014 at 11:37 AM, Avinash Kujur avin...@gmail.com wrote:

 hi,
  i am getting error like RefreshCallQueueProtocol can not be resolved.
 it is a java problem.

 help me out.

 Regards,
 Avinash




-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: GC overhead limit exceeded

2014-03-10 Thread unmesha sreeveni
/application_1394160253524_0003/container_1394160253524_0003_01_03

 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
 org.apache.hadoop.mapred.YarnChild 10.239.44.34 46837
 attempt_1394160253524_0003_m_01_0 3

 Container killed on request. Exit code is 143

 at last, the task failed.
 Thanks for any help!






-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Binning for numerical dataset

2014-02-04 Thread unmesha sreeveni
I am able to normalize a given data say
100,1:2:3
101,2:3:4

into
100 1
100 2
100 3
101 2
101 3
101 4

How do I do binning for a numerical data set, say iris.csv?

I worked out the maths behind it
Iris DataSet:  http://archive.ics.uci.edu/ml/datasets/Iris
1. find out the minimum and maximum values of each attribute
in the data set.

 Sepal Length Sepal Width Petal Length Petal Width
Min4.32.0 1.00.1
Max7.9   4.4 6.92.5

Then, we should divide the data values of each attributes into 'n' buckets .
Say, n=5.
Bucket Width= (Max - Min) /n


E.g. Sepal Length:
(7.9 - 4.3) / 5 = 0.72
So the intervals will be as follows:
4.3  - 5.02
5.02 - 5.74
5.74 - 6.46
6.46 - 7.18
7.18 - 7.9
Continue likewise for all attributes.
How do I do the same in MapReduce?



-- 
*Thanks  Regards*

Unmesha Sreeveni U.B


Re: Binning for numerical dataset

2014-02-04 Thread unmesha sreeveni
To do binning in MapReduce, do we need to find the min and max in the mapper,
have the mapper pass the min/max values to the reducer, and then have the
reducer calculate the buckets?
Is that the best way?
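
For what it is worth, a minimal, hypothetical sketch of the binning pass, assuming the
min and max of the chosen column were already computed (for example by a first job) and
are handed in through the Configuration; the property names binning.min, binning.max and
binning.buckets are assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: assign each record to a bucket based on one numeric column
public class BinningMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

  private double min;
  private double width;
  private int buckets;
  private final IntWritable bucket = new IntWritable();

  @Override
  protected void setup(Context context) {
    Configuration conf = context.getConfiguration();
    // Defaults match the Sepal Length example above; real values come from -D or conf.set()
    min = Double.parseDouble(conf.get("binning.min", "4.3"));
    double max = Double.parseDouble(conf.get("binning.max", "7.9"));
    buckets = conf.getInt("binning.buckets", 5);
    width = (max - min) / buckets;
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");
    double sepalLength = Double.parseDouble(cols[0]); // assumed: bin on the first column
    int idx = (int) ((sepalLength - min) / width);
    if (idx == buckets) {
      idx--; // the maximum value falls into the last bucket
    }
    bucket.set(idx);
    context.write(bucket, value); // the reducer then sees one group per bucket
  }
}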




-- 
*Thanks  Regards*

Unmesha Sreeveni U.B


Pre-processing in hadoop

2014-01-29 Thread unmesha sreeveni
Are we able to do preprocessing such as
1. Binning
2. Discretization
in Hadoop?
Some of the reviews say it is difficult.
Is that right?
Please share some links; that would help me a lot.

-- 
*Thanks  Regards*

Unmesha Sreeveni U.B


Re: HIVE+MAPREDUCE

2014-01-21 Thread unmesha sreeveni
The Programming Hive textbook covers what you want; see Chapter 4.
Hope that will help you.


On Tue, Jan 21, 2014 at 1:51 PM, Ranjini Rathinam ranjinibe...@gmail.com wrote:

 Hi,

 Need to load the data into hive table using mapreduce, using java.

 Please suggest the code related to hive +mapreduce.



 Thanks in advance

 Ranjini R






-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: Sorting a csv file

2014-01-17 Thread unmesha sreeveni
Are we able to sort by multiple columns dynamically, as the user requests?
I.e. the user first asks to sort by col1 and col2,
then the user asks to sort by 3 columns.
I am not able to find any of this through googling.


On Thu, Jan 16, 2014 at 4:03 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

 Yes, I did.
 But how do I make it run in descending order?

 My current code runs in ascending order:

 public class SortingCsv {
   public static class Map extends Mapper<LongWritable, Text, Text, Text> {
     private Text word = new Text();
     private Text one = new Text();

     public void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
       System.out.println("in mapper");
       /*
        * sort
        */
       ArrayList<String> ar = new ArrayList<String>();
       String line = value.toString();
       String[] tokens = null;
       ar.add(line);
       System.out.println("list: " + ar);
       for (int i = 0; i < ar.size(); i++) {
         tokens = ar.get(i).split(",");
         System.out.println("ele: " + ar.get(i));
         System.out.println("token: " + tokens[1]); // change according to user input
         word.set(tokens[1]);
         one.set(ar.get(i));
         context.write(word, one);
       }
     }
   }

   public static void main(String[] args) throws Exception {
     System.out.println("in main");
     Configuration conf = new Configuration();

     Job job = new Job(conf, "wordcount");
     job.setJarByClass(SortingCsv.class);
     // Path intermediateInfo = new Path("out");
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(Text.class);

     job.setMapperClass(Map.class);
     FileSystem fs = FileSystem.get(conf);

     /* Delete the files if any in the output path */
     if (fs.exists(new Path(args[1])))
       fs.delete(new Path(args[1]), true);

     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);

     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));

     job.waitForCompletion(true);
   }
 }



 On Thu, Jan 16, 2014 at 10:26 AM, unmesha sreeveni 
 unmeshab...@gmail.com wrote:

 Thanks for your reply, Ramya.
 OK :) So do I need to transpose the entire .csv file in order to get
 the whole of column 2 as the data?


 On Thu, Jan 16, 2014 at 10:11 AM, Ramya S ram...@suntecgroup.com wrote:

 Try to keep the col2 value as the map output key and the whole record
 (e.g. "b,a,v") as the map output value.



 Regards...
 Ramya.S



 

 From: unmesha sreeveni [mailto:unmeshab...@gmail.com]
 Sent: Thu 1/16/2014 9:29 AM
 To: User Hadoop
 Subject: Re: Sorting a csv file


 Thanks Ramya.s
 I was trying it to do with NULLWRITABLE..

 Thanks alot Ramya.

 And do u have any idea how to sort a given col.
 Say if user is giving col2 to sort the i want to get as
 b,a,v
 a,c,p
 d,a,z
 q,z,a
 r,a,b

 b,a,v
 d,a,z
 r,a,b

 a,c,p

 q,z,a

 How do i approch to that.

 I my current implementation i am getting
 result as
 a,c,p
 b,a,v
 d,a,z
 q,z,a
 r,a,b


 using the above code.


 On Wed, Jan 15, 2014 at 5:09 PM, Ramya S ram...@suntecgroup.com wrote:


 All you need is to change the map output value class to TEXT
 format.
 Set this accordingly in the main.

 Eg:

 public static class Map extends MapperLongWritable, Text, Text,
 Text {
private Text one = new Text();

private Text word = new Text();

public void map(LongWritable key, Text value, Context
 context) throws IOException, InterruptedException {
 System.out.println(in mapper);
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
System.out.println(sort: +word);
}
}
 }


 Regards...?
 Ramya.S


 

 From: unmesha sreeveni [mailto:unmeshab...@gmail.com]
 Sent: Wed 1/15/2014 4:11 PM
 To: User Hadoop
 Subject: Re: Sorting a csv file



 I did a map only job for sorting a txt file by editing wordcount
 program.
 I only need the key .
 How to set value to null.


 public class SortingCsv {
 public static class Map extends MapperLongWritable, Text, Text,
 IntWritable {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context
 context) throws IOException, InterruptedException {
 System.out.println(in mapper);
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer

Merge files

2014-01-17 Thread unmesha sreeveni
How do I merge two files using MapReduce code?
I am aware of the -getmerge and cat commands.

Thanks in advance.
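
For the record, a rough map-only sketch of one possible MapReduce route (the pass-through
mapper and argument layout are assumptions, not code from this thread); with zero reducers
the lines of both inputs are copied, unsorted, into a single output directory:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: map-only job that writes every line of both inputs to one output dir
public class MergeFiles {

  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge");
    job.setJarByClass(MergeFiles.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0); // no shuffle: lines are copied as-is
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // first input file
    FileInputFormat.addInputPath(job, new Path(args[1])); // second input file
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}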

-- 
*Thanks  Regards*

Unmesha Sreeveni U.B


Re: Sorting a csv file

2014-01-16 Thread unmesha sreeveni
Yes, I did.
But how do I make it run in descending order?

My current code runs in ascending order:

public class SortingCsv {
  public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private Text one = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      System.out.println("in mapper");
      /*
       * sort
       */
      ArrayList<String> ar = new ArrayList<String>();
      String line = value.toString();
      String[] tokens = null;
      ar.add(line);
      System.out.println("list: " + ar);
      for (int i = 0; i < ar.size(); i++) {
        tokens = ar.get(i).split(",");
        System.out.println("ele: " + ar.get(i));
        System.out.println("token: " + tokens[1]); // change according to user input
        word.set(tokens[1]);
        one.set(ar.get(i));
        context.write(word, one);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("in main");
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setJarByClass(SortingCsv.class);
    // Path intermediateInfo = new Path("out");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    job.setMapperClass(Map.class);
    FileSystem fs = FileSystem.get(conf);

    /* Delete the files if any in the output path */
    if (fs.exists(new Path(args[1])))
      fs.delete(new Path(args[1]), true);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}



On Thu, Jan 16, 2014 at 10:26 AM, unmesha sreeveni unmeshab...@gmail.com wrote:

 Thanks for your reply, Ramya.
 OK :) So do I need to transpose the entire .csv file in order to get
 the whole of column 2 as the data?


 On Thu, Jan 16, 2014 at 10:11 AM, Ramya S ram...@suntecgroup.com wrote:

 Try to keep the col2 value as the map output key and the whole record
 (e.g. "b,a,v") as the map output value.



 Regards...
 Ramya.S



 

 From: unmesha sreeveni [mailto:unmeshab...@gmail.com]
 Sent: Thu 1/16/2014 9:29 AM
 To: User Hadoop
 Subject: Re: Sorting a csv file


 Thanks Ramya.s
 I was trying it to do with NULLWRITABLE..

 Thanks alot Ramya.

 And do u have any idea how to sort a given col.
 Say if user is giving col2 to sort the i want to get as
 b,a,v
 a,c,p
 d,a,z
 q,z,a
 r,a,b

 b,a,v
 d,a,z
 r,a,b

 a,c,p

 q,z,a

 How do i approch to that.

 I my current implementation i am getting
 result as
 a,c,p
 b,a,v
 d,a,z
 q,z,a
 r,a,b


 using the above code.


 On Wed, Jan 15, 2014 at 5:09 PM, Ramya S ram...@suntecgroup.com wrote:


 All you need is to change the map output value class to TEXT
 format.
 Set this accordingly in the main.

 Eg:

 public static class Map extends MapperLongWritable, Text, Text,
 Text {
private Text one = new Text();

private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
 throws IOException, InterruptedException {
 System.out.println(in mapper);
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
System.out.println(sort: +word);
}
}
 }


 Regards...?
 Ramya.S


 

 From: unmesha sreeveni [mailto:unmeshab...@gmail.com]
 Sent: Wed 1/15/2014 4:11 PM
 To: User Hadoop
 Subject: Re: Sorting a csv file



 I did a map only job for sorting a txt file by editing wordcount
 program.
 I only need the key .
 How to set value to null.


 public class SortingCsv {
 public static class Map extends MapperLongWritable, Text, Text,
 IntWritable {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
 throws IOException, InterruptedException {
 System.out.println(in mapper);
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
System.out.println(sort: +word);
}
}
 }
 public static void main(String[] args) throws Exception

Sorting in descending order

2014-01-16 Thread unmesha sreeveni
Are we able to sort a csv file in descending order?
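
One hedged sketch (not from this thread): plug a reversing comparator into the shuffle
with job.setSortComparatorClass, so the normal ascending order of the Text map output
keys is simply flipped:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical sketch: reverse the natural ordering of Text map output keys
public class DescendingTextComparator extends WritableComparator {

  public DescendingTextComparator() {
    super(Text.class, true); // true = instantiate keys so compare() can be used
  }

  @Override
  @SuppressWarnings("rawtypes")
  public int compare(WritableComparable a, WritableComparable b) {
    return -super.compare(a, b); // negate the result to sort in descending order
  }
}

// In the driver: job.setSortComparatorClass(DescendingTextComparator.class);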


-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Combination of MapReduce and Hive

2014-01-16 Thread unmesha sreeveni
Hi,
   Can we use a combination of Hive and MapReduce?
Say I have a csv file. I need to find the mean of a column and replace
the null data with that mean.
So can we write a Hive query in the driver (to find the mean) and then
write a MapReduce block to replace the nulls with the mean?

Which is the better way:
1. writing only MapReduce code, or
2. using a combination of Hive and MapReduce?
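
Whichever way the mean is computed, the replacement step itself is a small map-only pass;
a hypothetical sketch, assuming the driver has already obtained the mean (from Hive or a
first MapReduce job) and stored it in the Configuration under the assumed property names
mean.value and mean.column:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: fill empty cells of one csv column with a precomputed mean
public class MeanImputationMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private String mean;
  private int column;
  private final Text out = new Text();

  @Override
  protected void setup(Context context) {
    mean = context.getConfiguration().get("mean.value", "0");      // assumed property name
    column = context.getConfiguration().getInt("mean.column", 0);  // assumed property name
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split(",", -1); // -1 keeps trailing empty cells
    if (cols[column].trim().isEmpty()) {
      cols[column] = mean; // replace the null/empty cell with the mean
    }
    out.set(String.join(",", cols)); // Java 8; use a StringBuilder on older JDKs
    context.write(out, NullWritable.get());
  }
}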

-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Sorting a csv file

2014-01-15 Thread unmesha sreeveni
How do I sort a csv file?
I know that between map and reduce, the shuffle and sort take place.
But how do I sort by each column in a csv file?

-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: Sorting a csv file

2014-01-15 Thread unmesha sreeveni
I did a map-only job for sorting a txt file by editing the wordcount program.
I only need the key.
How do I set the value to null? (See the NullWritable sketch after the code below.)


public class SortingCsv {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      System.out.println("in mapper");
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
        System.out.println("sort: " + word);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("in main");
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setJarByClass(SortingCsv.class);
    // Path intermediateInfo = new Path("out");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    FileSystem fs = FileSystem.get(conf);

    /* Delete the files if any in the output path */
    if (fs.exists(new Path(args[1])))
      fs.delete(new Path(args[1]), true);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
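
As an aside, one hedged way to emit no value at all is NullWritable; the mapper below is
an assumed sketch, not this thread's code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: key-only map output, letting the shuffle sort the keys
public class KeyOnlyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    word.set(value.toString());
    // NullWritable.get() is a singleton "no value" placeholder
    context.write(word, NullWritable.get());
  }
}

// In the driver: job.setOutputValueClass(NullWritable.class);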


On Wed, Jan 15, 2014 at 2:50 PM, unmesha sreeveni unmeshab...@gmail.comwrote:

 How to sort a csv file
 I know , between map and reduce shuffle and sort is taking place.
 But how do i sort each column in a csv file?

 --
 *Thanks  Regards*

 Unmesha Sreeveni U.B
 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: Sorting a csv file

2014-01-15 Thread unmesha sreeveni
Thanks, Ramya S.
I was trying to do it with NullWritable.

Thanks a lot, Ramya.

And do you have any idea how to sort by a given column?
Say the user asks to sort by col2; then from
b,a,v
a,c,p
d,a,z
q,z,a
r,a,b

I want to get
b,a,v
d,a,z
r,a,b
a,c,p
q,z,a
How do I approach that?

With my current implementation I am getting the result as
a,c,p
b,a,v
d,a,z
q,z,a
r,a,b

using the above code.


On Wed, Jan 15, 2014 at 5:09 PM, Ramya S ram...@suntecgroup.com wrote:

 All you need is to change the map output value class to Text.
 Set this accordingly in the main method.

 Eg:

 public static class Map extends Mapper<LongWritable, Text, Text, Text> {
   private Text one = new Text();
   private Text word = new Text();

   public void map(LongWritable key, Text value, Context context)
       throws IOException, InterruptedException {
     System.out.println("in mapper");
     String line = value.toString();
     StringTokenizer tokenizer = new StringTokenizer(line);
     while (tokenizer.hasMoreTokens()) {
       word.set(tokenizer.nextToken());
       context.write(word, one);
       System.out.println("sort: " + word);
     }
   }
 }

 Regards...?
 Ramya.S


 

 From: unmesha sreeveni [mailto:unmeshab...@gmail.com]
 Sent: Wed 1/15/2014 4:11 PM
 To: User Hadoop
 Subject: Re: Sorting a csv file


 I did a map only job for sorting a txt file by editing wordcount program.
 I only need the key .
 How to set value to null.


 public class SortingCsv {
   public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException {
       System.out.println("in mapper");
       String line = value.toString();
       StringTokenizer tokenizer = new StringTokenizer(line);
       while (tokenizer.hasMoreTokens()) {
         word.set(tokenizer.nextToken());
         context.write(word, one);
         System.out.println("sort: " + word);
       }
     }
   }

   public static void main(String[] args) throws Exception {
     System.out.println("in main");
     Configuration conf = new Configuration();

     Job job = new Job(conf, "wordcount");
     job.setJarByClass(SortingCsv.class);
     // Path intermediateInfo = new Path("out");
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);

     job.setMapperClass(Map.class);
     FileSystem fs = FileSystem.get(conf);

     /* Delete the files if any in the output path */
     if (fs.exists(new Path(args[1])))
       fs.delete(new Path(args[1]), true);

     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);

     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));

     job.waitForCompletion(true);
   }
 }


 On Wed, Jan 15, 2014 at 2:50 PM, unmesha sreeveni unmeshab...@gmail.com
 wrote:


 How to sort a csv file
 I know , between map and reduce shuffle and sort is taking place.
 But how do i sort each column in a csv file?


 --

 Thanks  Regards


 Unmesha Sreeveni U.B

 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/








 --

 Thanks  Regards


 Unmesha Sreeveni U.B

 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/







-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Adding file to HDFs

2014-01-14 Thread unmesha sreeveni
How do I add a *csv* file to HDFS using MapReduce code?
Using hadoop fs -copyFromLocal /local/path /hdfs/location I am able to do it,
but I would like to write MapReduce code.

-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: Adding file to HDFs

2014-01-14 Thread unmesha sreeveni
Thank you sudhakar


On Tue, Jan 14, 2014 at 2:51 PM, sudhakara st sudhakara...@gmail.com wrote:


 Read the file from the local file system and write to a file in HDFS using
 *FSDataOutputStream*:

 FSDataOutputStream outStream = fs.create(new Path("demo.csv"));
 outStream.writeUTF(stream);
 outStream.close();
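
 For a whole local file, a small self-contained sketch along the same lines (the paths
 are placeholders), streaming the bytes with IOUtils instead of writeUTF:

 import java.io.BufferedInputStream;
 import java.io.FileInputStream;
 import java.io.InputStream;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IOUtils;

 // Hypothetical sketch: copy a local csv into HDFS through the FileSystem API
 public class CopyToHdfs {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     FileSystem fs = FileSystem.get(conf);

     InputStream in = new BufferedInputStream(new FileInputStream("/local/path/demo.csv"));
     FSDataOutputStream out = fs.create(new Path("/hdfs/location/demo.csv"));

     // 4 KB copy buffer; the final 'true' closes both streams when done
     IOUtils.copyBytes(in, out, 4096, true);
   }
 }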


 On Tue, Jan 14, 2014 at 2:04 PM, unmesha sreeveni 
 unmeshab...@gmail.comwrote:

 How to add a *csv* file to hdfs using Mapreduce Code
 Using hadoop fs -copyFromLocal /local/path /hdfs/location i am able to do
 .
 BUt i would like to write mapreduce code.

 --
 *Thanks  Regards*

 Unmesha Sreeveni U.B
 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/





 --

 Regards,
 ...Sudhakara.st





-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: Adding file to HDFs

2014-01-14 Thread unmesha sreeveni
I tried to copy a 2.5 GB file to HDFS; it took 3-4 minutes.
Are we able to reduce that time?


On Tue, Jan 14, 2014 at 3:07 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

 Thank you sudhakar


 On Tue, Jan 14, 2014 at 2:51 PM, sudhakara st sudhakara...@gmail.comwrote:


 Read file from local file system and write to file in HDFS using
 *FSDataOutputStream*

 FSDataOutputStream outStream = fs.create(new
 Path(demo.csv););
 outStream.writeUTF(stream);
 outStream.close();


 On Tue, Jan 14, 2014 at 2:04 PM, unmesha sreeveni 
 unmeshab...@gmail.comwrote:

 How to add a *csv* file to hdfs using Mapreduce Code
 Using hadoop fs -copyFromLocal /local/path /hdfs/location i am able to
 do .
 BUt i would like to write mapreduce code.

 --
 *Thanks  Regards*

 Unmesha Sreeveni U.B
 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/





 --

 Regards,
 ...Sudhakara.st





 --
 *Thanks  Regards*

 Unmesha Sreeveni U.B
 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Segregation in MapReduce

2014-01-12 Thread unmesha sreeveni
Can we do segregation in MapReduce?
Say we have an employee dataset which contains
emp id, emp name, emp type.

Are we able to group the employees based on their different types,
say emp type A in one file and emp type B in another file?
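
As a hedged sketch of one common approach (not from this thread), MultipleOutputs lets a
reducer write each employee type to its own output file; the class below is an assumption,
and the mapper is assumed to emit (emp type, full record) pairs:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Hypothetical sketch: one output file per employee type
public class SegregationReducer extends Reducer<Text, Text, Text, NullWritable> {

  private MultipleOutputs<Text, NullWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, NullWritable>(context);
  }

  @Override
  public void reduce(Text empType, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    for (Text record : records) {
      // Writes to files such as typeA-r-00000, typeB-r-00000 in the output directory
      mos.write(record, NullWritable.get(), "type" + empType.toString());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }
}

// Driver: job.setOutputKeyClass(Text.class); job.setOutputValueClass(NullWritable.class);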


-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: Find max and min of a column in a csvfile

2014-01-11 Thread unmesha sreeveni
Thanks Jiayu and John Hancock.
You gave me a very nice hint.

John, that was a really good link you provided, but I don't know Pig; I am
using Java.
Is there any Java-related document,
like:
http://packtlib.packtpub.com/library/hadoop-mapreduce-cookbook/ch06lvl1sec66#
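
In the meantime, a minimal Java sketch in the spirit of Jiayu's hint (single numeric
column, key fixed to the column name, one reduce group); the column index and class
names are assumptions:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical sketch: mapper emits ("col0", value); reducer keeps the running min and max
public class MinMax {

  public static class MinMaxMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private final Text colName = new Text("col0");
    private final DoubleWritable number = new DoubleWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] cols = value.toString().split(",");
      number.set(Double.parseDouble(cols[0])); // assumed: min/max of the first column
      context.write(colName, number);
    }
  }

  public static class MinMaxReducer extends Reducer<Text, DoubleWritable, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double min = Double.MAX_VALUE;
      double max = -Double.MAX_VALUE;
      for (DoubleWritable v : values) {
        min = Math.min(min, v.get());
        max = Math.max(max, v.get());
      }
      context.write(key, new Text("min=" + min + " max=" + max));
    }
  }
}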


On Sat, Jan 11, 2014 at 6:14 PM, John Hancock jhancock1...@gmail.comwrote:

 Unmesha,

 You may want to write your own mapper and reducer for the purpose of
 learning more about map-reduce programming techniques.

 However, the Pig documentation also discusses aggregate functions such as
 max() which may save you some time:

 http://pig.apache.org/docs/r0.12.0/udf.html


 -John


 On Fri, Jan 10, 2014 at 12:23 PM, Jiayu Ji jiayu...@gmail.com wrote:

 If you are dealing with only one column, then I think the key/value pair
 could be Null and the number (element). If you are doing more than one
 column, then column name and numbers.


 On Fri, Jan 10, 2014 at 12:36 AM, unmesha sreeveni unmeshab...@gmail.com
  wrote:


 Need help:
 how do I find the maximum and minimum elements of a column in a csv file?
 What will be the mapper output?

 --
 *Thanks  Regards*

 Unmesha Sreeveni U.B
 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/





 --
 Jiayu (James) Ji,

 Cell: (312)823-7393





-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: what all can be done using MR

2014-01-11 Thread unmesha sreeveni
What about sorting?
Actually it is done by MapReduce itself.
But if we give a csv file as input and try to sort by one or multiple
columns, do the corresponding columns also get carried along?

E.g. foo.csv:
B,2,3
A,4,6

When we apply sorting to the first column, will the result be
A,4,6
B,2,3
i.e. A stays mapped to its correct values, right?
If so, what should the mapper's context.write() be?


On Wed, Jan 8, 2014 at 8:18 PM, Chris Mawata chris.maw...@gmail.com wrote:

  Yes.
 Check out, for example,
 http://packtlib.packtpub.com/library/hadoop-mapreduce-cookbook/ch06lvl1sec66#



 On 1/8/2014 2:41 AM, unmesha sreeveni wrote:

  Can we do aggregation with in Hadoop MR
 like find min,max,sum,avg of a column in a csv file.

  --
 *Thanks  Regards*

  Unmesha Sreeveni U.B
  Junior Developer

  http://www.unmeshasreeveni.blogspot.in/






-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Re: what all can be done using MR

2014-01-11 Thread unmesha sreeveni
For that, do we have to write a custom class for the value in order to pass all
the remaining columns as the value
(i.e. in the example, two values), or just concatenate them and emit that?
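
A custom value class is not strictly needed for this; a hypothetical sketch of the
concatenation route, keeping everything after the sort column as one Text value:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: sort csv rows by the first column, carrying the other columns along
public class SortByFirstColumnMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Text sortKey = new Text();
  private final Text rest = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // "B,2,3" -> key "B", value "2,3"; the shuffle sorts by key, the values stay attached
    String[] parts = value.toString().split(",", 2);
    sortKey.set(parts[0]);
    rest.set(parts.length > 1 ? parts[1] : "");
    context.write(sortKey, rest);
  }
}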


On Sat, Jan 11, 2014 at 9:46 PM, Chris Mawata chris.maw...@gmail.com wrote:

 Results will be sorted by key so make A the key and put the rest in the
 value
 Chris
 On Jan 11, 2014 10:11 AM, unmesha sreeveni unmeshab...@gmail.com
 wrote:

 What about sorting .
 Acutually it is done by MapReduce itself.
  But if we are giving a csv file as input and trying to sort one/multiple
 column...Whether the corresponting columns also get reflectted??

 eg: foo.csv
  B,2,3
 A,4,6

 When we apply sorting to first column:whether the resultent will be
 A,4,6
 B,2,3
 A will be mapped to its correct values right?
 If so what will be context.write() of mapper?


 On Wed, Jan 8, 2014 at 8:18 PM, Chris Mawata chris.maw...@gmail.comwrote:

  Yes.
 Check out, for example,
 http://packtlib.packtpub.com/library/hadoop-mapreduce-cookbook/ch06lvl1sec66#



 On 1/8/2014 2:41 AM, unmesha sreeveni wrote:

  Can we do aggregation with in Hadoop MR
 like find min,max,sum,avg of a column in a csv file.

  --
 *Thanks  Regards*

  Unmesha Sreeveni U.B
  Junior Developer

  http://www.unmeshasreeveni.blogspot.in/






 --
 *Thanks  Regards*

 Unmesha Sreeveni U.B
 Junior Developer

 http://www.unmeshasreeveni.blogspot.in/





-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


Expressions in MapReduce

2014-01-11 Thread unmesha sreeveni
Are we able to evaluate expressions in MapReduce?
Say I have a csv file which has 2 columns.
The user gives an expression
col1 + col2 = col3
Are we able to do this?
And when the user later wants col1 - col2 = col4,
can we do it with the same MapReduce job (dynamic change of expressions)?
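
A hedged sketch of one way to keep the job generic: pass the operator in through the
Configuration (the property name expr.op and the two-column layout are assumptions) and
let the mapper append the derived column:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: compute col1 <op> col2 and append it as a new column
public class ExpressionMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private String op;
  private final Text out = new Text();

  @Override
  protected void setup(Context context) {
    op = context.getConfiguration().get("expr.op", "+"); // e.g. run with -D expr.op=-
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");
    double a = Double.parseDouble(cols[0]);
    double b = Double.parseDouble(cols[1]);
    double result = "-".equals(op) ? a - b : a + b; // extend for *, / in the same way
    out.set(value.toString() + "," + result);
    context.write(out, NullWritable.get());
  }
}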

-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/


FAILED EMFILE: Too many open files

2014-01-07 Thread unmesha sreeveni
While I am trying to run an MR job I am getting
 FAILED EMFILE: Too many open files
 at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172)
 at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:310)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:383)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)

Why is it so?

-- 
*Thanks  Regards*

Unmesha Sreeveni U.B
Junior Developer

http://www.unmeshasreeveni.blogspot.in/

