Neo4j and Hadoop
Hi, I have my input file in HDFS. How can I store that data in a Neo4j database? Is there a way to do this? -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
.doc Custom Format for Hadoop
Is there a custom .doc input format for Hadoop that is already built?
Re: Doubt in DoubleWritable
Please try this:

    for (DoubleArrayWritable avalue : values) {
        Writable[] value = avalue.get();
        // parse accordingly
        if (Double.parseDouble(value[1].toString()) != 0) {
            total_records_Temp = total_records_Temp + 1;
            sumvalueTemp = sumvalueTemp + Double.parseDouble(value[0].toString());
        }
        if (Double.parseDouble(value[3].toString()) != 0) {
            total_records_Dewpoint = total_records_Dewpoint + 1;
            sumvalueDewpoint = sumvalueDewpoint + Double.parseDouble(value[2].toString());
        }
        if (Double.parseDouble(value[5].toString()) != 0) {
            total_records_Windspeed = total_records_Windspeed + 1;
            sumvalueWindspeed = sumvalueWindspeed + Double.parseDouble(value[4].toString());
        }
    }

Attaching the code.

    // Driver: computes average values from the weather dataset
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapReduce {
        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: MapReduce <input path> <output path>");
                System.exit(-1);
            }
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Job job = Job.getInstance(conf, "AverageTempValues");
            job.setJarByClass(MapReduce.class);

            // Delete the output directory so the same directory can be reused
            Path dest = new Path(args[1]);
            if (fs.exists(dest)) {
                fs.delete(dest, true);
            }

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.setNumReduceTasks(2);
            job.setMapperClass(NewMapper.class);
            job.setReducerClass(NewReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleArrayWritable.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    // Mapper for the average temperature example
    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class NewMapper extends Mapper<LongWritable, Text, Text, DoubleArrayWritable> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] myList = value.toString().split("\\s+");
            String val = myList[2];
            String year = val.substring(0, 4);
            String month = val.substring(5, 6);
            String[] section = val.split("_");
            String sectionString = "0";
            if (section[1].matches("^(0|1|2|3|4|5)$")) {
                sectionString = "4";
            } else if (section[1].matches("^(6|7|8|9|10|11)$")) {
                sectionString = "1";
            } else if (section[1].matches("^(12|13|14|15|16|17)$")) {
                sectionString = "2";
            } else if (section[1].matches("^(18|19|20|21|22|23)$")) {
                sectionString = "3";
            }
            DoubleWritable[] array = new DoubleWritable[6];
            for (int j = 0; j < 6; j++) {
                array[j] = new DoubleWritable(); // initialize each slot, or set() throws NullPointerException
            }
            array[0].set(Double.parseDouble(myList[3]));
            array[2].set(Double.parseDouble(myList[4]));
            array[4].set(Double.parseDouble(myList[12]));
            for (int j = 0; j < 6; j = j + 2) {
                // 999.9 is the dataset's missing-value marker; the next slot flags validity
                array[j + 1].set(999.9 == array[j].get() ? 0 : 1);
            }
            DoubleArrayWritable output = new DoubleArrayWritable();
            output.set(array);
            context.write(new Text(year + sectionString + month), output);
        }
    }

    // Reducer for the average temperature example
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NewReducer extends Reducer<Text, DoubleArrayWritable, Text, DoubleArrayWritable> {
        @Override
        public void reduce(Text key, Iterable<DoubleArrayWritable> values, Context context)
                throws IOException, InterruptedException {
            double sumvalueTemp = 0;
            double sumvalueDewpoint = 0;
            double sumvalueWindspeed = 0;
            double total_records_Temp = 0;
            double total_records_Dewpoint = 0;
            double total_records_Windspeed = 0;
            double average_Temp
Permutations and combination in mapreduce
Hi, has anyone implemented permutations and combinations in MapReduce?
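There is no built-in MapReduce primitive for this; a common approach is to enumerate the combinations for each key inside a mapper or reducer. A plain-Java sketch of just the enumeration step (illustrative names, not a Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

public class Combinations {
    // Recursively collect all k-element combinations of items, preserving order
    static void combine(String[] items, int start, List<String> current,
                        int k, List<List<String>> out) {
        if (current.size() == k) {
            out.add(new ArrayList<>(current));
            return;
        }
        // stop early when the remaining items cannot fill the combination
        for (int i = start; i <= items.length - (k - current.size()); i++) {
            current.add(items[i]);
            combine(items, i + 1, current, k, out);
            current.remove(current.size() - 1);
        }
    }

    public static List<List<String>> combinations(String[] items, int k) {
        List<List<String>> out = new ArrayList<>();
        combine(items, 0, new ArrayList<String>(), k, out);
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> pairs = combinations(new String[]{"a", "b", "c", "d"}, 2);
        System.out.println(pairs.size()); // C(4,2) = 6
        System.out.println(pairs);
    }
}
```

Inside a mapper, the same routine could run once per record group; permutations would differ only in that the recursion does not enforce ascending index order.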
Not able to copy one HDFS data to another HDFS location using distcp
I am trying to copy data from one HDFS location to another. I am able to do this with the "distcp" command:

    hadoop distcp hdfs://mySrcip:8020/copyDev/* hdfs://myDestip:8020/copyTest

But I want to do the same using the Java API. After a long search I found some code and executed it, but it didn't copy my source file to the destination.

    public class TouchFile {

        public static void main(String[] args) throws Exception {
            // create configuration object
            Configuration config = new Configuration();
            config.set("fs.defaultFS", "hdfs://mySrcip:8020/");
            config.set("hadoop.job.ugi", "hdfs");

            // Distcp
            String sourceNameNode = "hdfs://mySrcip:8020/copyDev";
            String destNameNode = "hdfs://myDestip:8020/copyTest";
            String fileList = "myfile.txt";
            distFileCopy(config, sourceNameNode, destNameNode, fileList);
        }

        /**
         * Copies files from one cluster to another using Hadoop's distributed
         * copy features. Uses the input to build the DistCp argument list.
         *
         * @param config Hadoop configuration
         * @param sourceNameNode full HDFS path to the parent source directory
         * @param destNameNode full HDFS path to the parent destination directory
         * @param fileList comma-separated names of files in sourceNameNode to be copied to destNameNode
         * @return elapsed time in milliseconds to copy the files
         */
        public static long distFileCopy(Configuration config, String sourceNameNode,
                String destNameNode, String fileList) throws Exception {
            System.out.println("In dist copy");
            StringTokenizer tokenizer = new StringTokenizer(fileList, ",");
            ArrayList<String> list = new ArrayList<>();
            while (tokenizer.hasMoreTokens()) {
                list.add(sourceNameNode + "/" + tokenizer.nextToken());
            }
            String[] args = new String[list.size() + 1];
            int count = 0;
            for (String filename : list) {
                args[count++] = filename;
            }
            args[count] = destNameNode;
            System.out.println("args-->" + Arrays.toString(args));
            long st = System.currentTimeMillis();
            DistCp distCp = new DistCp(config, null);
            distCp.run(args);
            return System.currentTimeMillis() - st;
        }
    }

Am I doing anything wrong? Please suggest.
Copy Data From HDFS to FTP
Hi, how can I copy my HDFS data to an FTP server?
Re: Copy Data From HDFS to FTP
It is showing: -cp: The value of property fs.ftp.password.MYIP must not be null

On Mon, Aug 24, 2015 at 10:52 AM, Chinnappan Chandrasekaran chiranchan...@jos.com.sg wrote:

    hadoop fs -cp ftp://userid@youripaddress/directory

*From:* unmesha sreeveni [mailto:unmeshab...@gmail.com]
*Sent:* Monday, 24 August, 2015 12:45 PM
*To:* User Hadoop
*Subject:* Copy Data From HDFS to FTP

Hi, how can I copy my HDFS data to an FTP server?
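A hedged sketch of one way around that error: Hadoop's FTPFileSystem looks up per-host credentials in the properties fs.ftp.user.&lt;host&gt; and fs.ftp.password.&lt;host&gt;, so they can be supplied on the command line with -D. MYIP, ftpuser, ftppass, and the paths below are placeholders, not values from the thread:

```shell
# MYIP = the FTP host used in the URI; ftpuser/ftppass = your credentials
hadoop fs \
  -D fs.ftp.user.MYIP=ftpuser \
  -D fs.ftp.password.MYIP=ftppass \
  -cp /user/hadoop/data ftp://MYIP/upload/
```

Alternatively, credentials can be embedded in the URI itself (ftp://user:pass@host/path), at the cost of exposing the password in shell history.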
Build Failure - SciHadoop
Hi, I was trying to check out SciHadoop. I came across https://github.com/four2five/SIDR/tree/sc13_experiments_improved and while doing the 4th step (mvn install) the build failed:

    [ERROR] COMPILATION ERROR :
    [ERROR] Failure executing javac, but could not parse the error:
    javac: directory not found: /installSCID/git/thredds/udunits/target/classes
    Usage: javac <options> <source files>
    use -help for a list of possible options
    [INFO] 1 error
    [INFO] Reactor Summary:
    [INFO] BUILD FAILURE
    [INFO] Total time: 20.797s
    [INFO] Finished at: Wed May 06 15:15:30 IST 2015
    [INFO] Final Memory: 9M/109M
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project udunits: Compilation failure
    [ERROR] Failure executing javac, but could not parse the error:
    [ERROR] javac: directory not found: /installSCID/git/thredds/udunits/target/classes
    [ERROR] Usage: javac <options> <source files>
    [ERROR] use -help for a list of possible options
    [ERROR] -> [Help 1]

Has anyone come across the same? I wonder if I went wrong somewhere.
Re: Connect c language with HDFS
Thanks, Alex. I have gone through that, but when I checked my Cloudera distribution I was not able to find those folders. That's why I posted here. I don't know if I made a mistake.

On Mon, May 4, 2015 at 2:40 PM, Alexander Alten-Lorenz wget.n...@gmail.com wrote:

Google: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/LibHdfs.html

-- Alexander Alten-Lorenz m: wget.n...@gmail.com b: mapredit.blogspot.com

On May 4, 2015, at 10:57 AM, unmesha sreeveni unmeshab...@gmail.com wrote:

Hi, can we connect C with HDFS using the Cloudera Hadoop distribution?
Re: Connect c language with HDFS
Thanks, did it: http://unmeshasreeveni.blogspot.in/2015/05/hadoop-word-count-using-c-hadoop.html

On Mon, May 4, 2015 at 3:43 PM, Alexander Alten-Lorenz wget.n...@gmail.com wrote:

That depends on the installation source (rpm, tgz or parcels). Usually, when you use parcels, libhdfs.so* should be within /opt/cloudera/parcels/CDH/lib64/ (or similar). Or just use Linux's locate (locate libhdfs.so*) to find the library.

On May 4, 2015, at 11:39 AM, unmesha sreeveni unmeshab...@gmail.com wrote:

Thanks, Alex. I have gone through that, but when I checked my Cloudera distribution I was not able to find those folders. That's why I posted here. I don't know if I made a mistake.
Re: How to stop a mapreduce job from terminal running on Hadoop Cluster?
You can use:

    $ hadoop job -kill <jobid>

On Mon, Apr 13, 2015 at 10:20 AM, Rohith Sharma K S rohithsharm...@huawei.com wrote:

In addition to the options below, Hadoop 2.7 (yet to be released in a couple of weeks) provides a user-friendly option for killing applications from the Web UI: in the application block, a 'Kill Application' button has been provided.

Thanks & Regards, Rohith Sharma K S

*From:* Pradeep Gollakota [mailto:pradeep...@gmail.com]
*Sent:* 12 April 2015 23:41
*To:* user@hadoop.apache.org
*Subject:* Re: How to stop a mapreduce job from terminal running on Hadoop Cluster?

Also:

    mapred job -kill <job_id>

On Sun, Apr 12, 2015 at 11:07 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

You can kill it with the following YARN command:

    yarn application -kill <application id>

https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html

Or use the old hadoop job command: http://stackoverflow.com/questions/11458519/how-to-kill-hadoop-jobs

Regards, Shahab

On Sun, Apr 12, 2015 at 2:03 PM, Answer Agrawal yrsna.tse...@gmail.com wrote:

To run a job we use the command $ hadoop jar example.jar inputpath outputpath. If the job takes a long time and we want to stop it midway, which command is used? Or is there any other way to do that? Thanks,
cleanup() in hadoop results in aggregation of whole file/not
I have an input file whose last column is the class label:

    7.4 0.29 0.5 1.8 0.042 35 127 0.9937 3.45 0.5 10.2 7 1
    10 0.41 0.45 6.2 0.071 6 14 0.99702 3.21 0.49 11.8 7 -1
    7.8 0.26 0.27 1.9 0.051 52 195 0.9928 3.23 0.5 10.9 6 1
    6.9 0.32 0.3 1.8 0.036 28 117 0.99269 3.24 0.48 11 6 1
    ...

I am trying to get the unique class labels of the whole file. In order to do that I wrote the code below:

    public class MyMapper extends Mapper<LongWritable, Text, IntWritable, FourvalueWritable> {
        Set<String> uniqueLabel = new HashSet<>();

        public void map(LongWritable key, Text value, Context context) {
            // Last column of the input is the class label.
            Vector<String> cls = CustomParam.customLabel(line, delimiter, classindex);
            uniqueLabel.add(cls.get(0));
        }

        public void cleanup(Context context) throws IOException {
            // find min and max label
            context.getCounter(UpdateCost.MINLABEL).setValue(Long.valueOf(minLabel));
            context.getCounter(UpdateCost.MAXLABEL).setValue(Long.valueOf(maxLabel));
        }
    }

cleanup() is executed only once. Is the set updated after each map() call, even though it is declared as a field initialized with new HashSet()? I hope the set accumulates across map() calls, so that I can get the unique labels of the whole file in cleanup(). Please suggest if I am wrong. Thanks in advance.
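For reference, the lifecycle this question hinges on can be mocked in plain Java (these are not Hadoop classes; the names merely mirror the Mapper contract): one mapper object serves every record in its split, so a field Set accumulates across map() calls, and cleanup() fires once at the end of the task. With several splits each task still sees only its own labels, so they must be merged afterwards (e.g. via counters, as above, or a reducer) to cover the whole file.

```java
import java.util.HashSet;
import java.util.Set;

// Plain-Java mock of the Mapper lifecycle: one instance per task,
// map() called once per record, cleanup() called once at the end.
public class LabelCollector {
    private final Set<String> uniqueLabels = new HashSet<>();

    // stands in for map(): called once per input record
    public void map(String record) {
        String[] cols = record.trim().split("\\s+");
        uniqueLabels.add(cols[cols.length - 1]); // last column is the class label
    }

    // stands in for cleanup(): called once, after all records in the split
    public Set<String> cleanup() {
        return uniqueLabels;
    }

    public static void main(String[] args) {
        LabelCollector task = new LabelCollector();
        task.map("7.4 0.29 0.5 1");
        task.map("10 0.41 0.45 -1");
        task.map("7.8 0.26 0.27 1");
        System.out.println(task.cleanup().size()); // 2 distinct labels: 1 and -1
    }
}
```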
Re: Get method in Writable
Thanks, Drake. That was the point; it was my mistake.

On Mon, Feb 23, 2015 at 6:34 AM, Drake민영근 drake@nexr.com wrote:

Hi, unmesha. I think this is a Gson problem. You mentioned:

But parsing cannot be done in *MR2*: TreeInfoWritable info = gson.toJson(setupData, TreeInfoWritable.class);

I think you should just use gson.fromJson, not toJson (setupData is already a JSON string, I think). Is this right?

Drake 민영근 Ph.D
kt NexR

On Sat, Feb 21, 2015 at 4:55 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

Can I get the values from a Writable of a previous job? I have 2 MR jobs.

*MR 1:* I need to pass 3 elements as values from the reducer, and the key is NullWritable, so I created a custom Writable class:

    public class TreeInfoWritable implements Writable {
        DoubleWritable entropy;
        IntWritable sum;
        IntWritable clsCount;
        ...
    }

*MR 2:* I need to access the MR 1 result in the MR 2 mapper's setup() function, and I accessed it as a distributed cache (small file). Is there a way to get those values using get methods?

    while ((setupData = bf.readLine()) != null) {
        System.out.println("Setup Line " + setupData);
        TreeInfoWritable info = // something I can pass to TreeInfoWritable to get values
        DoubleWritable entropy = info.getEntropy();
        System.out.println("entropy: " + entropy);
    }

I tried to convert the Writable to Gson format in MR 1:

    Gson gson = new Gson();
    String emitVal = gson.toJson(valEmit);
    context.write(out, new Text(emitVal));

But parsing cannot be done in MR 2:

    TreeInfoWritable info = gson.toJson(setupData, TreeInfoWritable.class);
    // Error: Type mismatch: cannot convert from String to TreeInfoWritable

Once it is changed to a String we cannot get the values. Is there a workaround, or should I just use POJO classes instead of Writables? I'm afraid that becomes slower, as we would then depend on Java serialization instead of Hadoop's serializable classes.
Get method in Writable
Can I get the values from a Writable of a previous job? I have 2 MR jobs.

*MR 1:* I need to pass 3 elements as values from the reducer, and the key is NullWritable, so I created a custom Writable class:

    public class TreeInfoWritable implements Writable {
        DoubleWritable entropy;
        IntWritable sum;
        IntWritable clsCount;
        ...
    }

*MR 2:* I need to access the MR 1 result in the MR 2 mapper's setup() function, and I accessed it as a distributed cache (small file). Is there a way to get those values using get methods?

    while ((setupData = bf.readLine()) != null) {
        System.out.println("Setup Line " + setupData);
        TreeInfoWritable info = // something I can pass to TreeInfoWritable to get values
        DoubleWritable entropy = info.getEntropy();
        System.out.println("entropy: " + entropy);
    }

I tried to convert the Writable to Gson format in MR 1:

    Gson gson = new Gson();
    String emitVal = gson.toJson(valEmit);
    context.write(out, new Text(emitVal));

But parsing cannot be done in MR 2:

    TreeInfoWritable info = gson.toJson(setupData, TreeInfoWritable.class);
    // Error: Type mismatch: cannot convert from String to TreeInfoWritable

Once it is changed to a String we cannot get the values. Is there a workaround, or should I just use POJO classes instead of Writables? I'm afraid that becomes slower, as we would then depend on Java serialization instead of Hadoop's serializable classes.
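For reference, the Writable-style round trip itself can be sketched in plain java.io, with no Hadoop or Gson dependency (the class name mirrors the one in the question, but double/int stand in for DoubleWritable/IntWritable): write() serializes the three fields and readFields() restores them, which is essentially what Hadoop's serialization contract does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Writable-style round trip: write() serializes the fields in a fixed
// order, readFields() reads them back in the same order.
public class TreeInfo {
    double entropy;
    int sum;
    int clsCount;

    public void write(DataOutput out) throws IOException {
        out.writeDouble(entropy);
        out.writeInt(sum);
        out.writeInt(clsCount);
    }

    public void readFields(DataInput in) throws IOException {
        entropy = in.readDouble();
        sum = in.readInt();
        clsCount = in.readInt();
    }

    public double getEntropy() { return entropy; }

    public static void main(String[] args) throws IOException {
        TreeInfo outInfo = new TreeInfo();
        outInfo.entropy = 0.94;
        outInfo.sum = 14;
        outInfo.clsCount = 2;

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        outInfo.write(new DataOutputStream(buf));

        TreeInfo inInfo = new TreeInfo();
        inInfo.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(inInfo.getEntropy()); // 0.94
    }
}
```

For the Gson route, the direction of the call is the whole fix, as Drake points out in the reply above: toJson turns an object into a String, fromJson turns the String back into the object.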
Re: writing mappers and reducers question
You can write MapReduce jobs in Eclipse as well, for testing purposes. Once that is done you can create a jar and run it on your single-node or multi-node cluster. But please note that when running inside such an IDE against the Hadoop dependencies, there will be no input splits, no separate mappers, etc.
Re: How to get Hadoop's Generic Options value
Try implementing your program as:

    public class YourDriver extends Configured implements Tool {
        main()
        run()
    }

Then supply your file using the -D option, e.g.:

    $ hadoop jar yourjar.jar YourDriver -D your.property=value <input> <output>

Thanks
Unmesha Biju
Delete output folder automatically in CRUNCH (FlumeJava)
Hi, I am new to FlumeJava (Crunch). I ran wordcount in it, but how can I delete the output folder automatically in the code, instead of going back and deleting the folder by hand? Thanks in advance.
Neural Network in hadoop
I am trying to implement a Neural Network in MapReduce. Apache Mahout refers to this paper: http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

"Neural Network (NN): We focus on backpropagation. By defining a network structure (we use a three layer network with two output neurons classifying the data into two categories), each mapper propagates its set of data through the network. For each training example, the error is back propagated to calculate the partial gradient for each of the weights in the network. The reducer then sums the partial gradient from each mapper and does a batch gradient descent to update the weights of the network."

Here http://homepages.gold.ac.uk/nikolaev/311sperc.htm is a worked-out example of the gradient descent algorithm (Gradient Descent Learning Algorithm for Sigmoidal Perceptrons, http://pastebin.com/6gAQv5vb).

1. Which is the better way to parallelize the neural network algorithm from a MapReduce perspective? In the mapper, each record owns a partial weight (from the above example: w0, w1, w2; I am not sure whether w0 is the bias). A random weight is assigned initially; the first record calculates the output (o) and the weights get updated, the second record also finds the output, and deltaW is updated using the previous deltaW value. In the reducer the sum of the gradients is calculated, i.e. if we have 3 mappers we will get 3 sets of w0, w1, w2. These are summed, and using batch gradient descent we update the weights of the network.

2. In the above method, how can we ensure which previous weight is taken when there is more than 1 map task? Each map task has its own updated weights. How can that be accurate?

3. Where is backward propagation in the above-mentioned gradient descent neural network algorithm? Or is it fine with this implementation?

4. What is the termination condition mentioned in the algorithm?

Please help me with some pointers. Thanks in advance.
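The mapper/reducer split the paper describes can be sketched in plain Java (no Hadoop; the toy data, learning rate, and names are illustrative): each record contributes a partial gradient, the partial gradients are summed, and a single batch update is applied afterwards, which is the reducer's role.

```java
public class BatchPerceptron {
    public static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // One epoch of batch gradient descent for a sigmoidal perceptron:
    // accumulate partial gradients over all records (what each mapper does
    // for its split), then apply a single weight update (the reducer's job).
    // Returns the summed squared error over the epoch.
    public static double epoch(double[][] x, double[] t, double[] w, double lr) {
        double[] grad = new double[w.length];
        double sse = 0;
        for (int r = 0; r < x.length; r++) {
            double net = w[0];                        // w[0] is the bias weight
            for (int j = 0; j < x[r].length; j++) net += w[j + 1] * x[r][j];
            double o = sigmoid(net);
            double delta = o * (1 - o) * (t[r] - o);  // local gradient for this record
            grad[0] += delta;
            for (int j = 0; j < x[r].length; j++) grad[j + 1] += delta * x[r][j];
            sse += (t[r] - o) * (t[r] - o);
        }
        for (int j = 0; j < w.length; j++) w[j] += lr * grad[j]; // one batch update
        return sse;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {1, 0}, {1, 1}}; // toy AND-style data
        double[] t = {0, 0, 0, 1};
        double[] w = {0.1, -0.2, 0.3};
        double before = epoch(x, t, w, 0.5);
        double after = 0;
        for (int i = 0; i < 2000; i++) after = epoch(x, t, w, 0.5);
        System.out.println(after < before); // the error shrinks across epochs
    }
}
```

In this formulation an iteration is a whole pass over the data (in MapReduce, the driver re-submits the job with the updated weights), not a per-record step, and the usual termination test is that the change in summed error falls below a threshold or a maximum number of epochs is reached.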
Re: Neural Network in hadoop
On Thu, Feb 12, 2015 at 4:13 PM, Alpha Bagus Sunggono bagusa...@gmail.com wrote:

In my opinion:
- This is just for 1 iteration. Batch gradient means: find all deltas, then update all weights. So I think it is improper for each record to have its weight updated; weights should be updated after the reduce.
- Backpropagation happens after the reduce.
- This iteration should repeat again and again.

I doubt whether the iteration is per record, i.e. say we have just 5 records; will there be 5 iterations, or is it some other concept? From the above example, Δw0, Δw1, Δw2 will be the delta error. So let's say we have a threshold value: for each record we check whether Δw0, Δw1, Δw2 are less than or equal to the threshold, else continue the iteration. Is it like that? Am I wrong? Sorry, I am not clear on the iteration part.

The termination condition should be measured by the delta error of the sigmoid output at the end of the mapper. The iteration process can be terminated once the delta error is small enough.

Is there any criterion for updating the delta weights? After calculating the output of the perceptron we find the error oj(1-oj)(tj-oj) and check whether it is less than the threshold; if so the delta weight is not updated, else it is updated. Is it like that?

On Thu, Feb 12, 2015 at 5:14 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

I am trying to implement a Neural Network in MapReduce. Apache Mahout refers to this paper: http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf

"Neural Network (NN): We focus on backpropagation. By defining a network structure (we use a three layer network with two output neurons classifying the data into two categories), each mapper propagates its set of data through the network. For each training example, the error is back propagated to calculate the partial gradient for each of the weights in the network. The reducer then sums the partial gradient from each mapper and does a batch gradient descent to update the weights of the network."

Here http://homepages.gold.ac.uk/nikolaev/311sperc.htm is a worked-out example of the gradient descent algorithm (Gradient Descent Learning Algorithm for Sigmoidal Perceptrons, http://pastebin.com/6gAQv5vb).

1. Which is the better way to parallelize the neural network algorithm from a MapReduce perspective?
2. In the above method, how can we ensure which previous weight is taken when there is more than 1 map task?
3. Where is backward propagation in the above-mentioned gradient descent neural network algorithm?
4. What is the termination condition mentioned in the algorithm?

Please help me with some pointers. Thanks in advance.

-- Alpha Bagus Sunggono http://www.dyavacs.com
Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
I have 4 nodes and the replication factor is set to 3.

On Wed, Jan 21, 2015 at 11:15 AM, Drake민영근 drake@nexr.com wrote:

Yes, almost the same. I assume the most time-consuming part was copying the model data from the datanode which holds it to the actual processing node (tasktracker or nodemanager). What about the model data's replication factor? How many nodes do you have? If you have 4 or more nodes, you can increase the replication with the following command. I suggest a number equal to your number of datanodes, but first you should confirm there is enough space in HDFS.

    hdfs dfs -setrep -w 6 /user/model/data

Drake 민영근 Ph.D

On Wed, Jan 21, 2015 at 2:12 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

Yes, I tried the same, Drake. I don't know if I understood your answer. Instead of loading them into setup() through the cache, I read them directly from HDFS in the map section, and for each incoming record I found the distance to all the records in HDFS. That is, if R and S are my datasets, R is the model data stored in HDFS, and when S is taken for processing: S1-R (finding the distance to the whole R set), S2-R, and so on. But it is taking a long time, as it needs to compute the distances.

On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 drake@nexr.com wrote:

In my suggestion, map or reduce tasks do not use the distributed cache. They use the file directly from HDFS, with short-circuit local reads. Like a shared-storage method, but almost every node has the data, thanks to the high replication factor.

On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

But still, if the model is large enough, how can we load it into the distributed cache or something like that? Here is one source: http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf but it is confusing me.

On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 drake@nexr.com wrote:

Hi, how about this? The large model data stays in HDFS, but with many replications, and the MapReduce program reads the model from HDFS. In theory, the replication factor of the model data equals the number of data nodes, and with the Short Circuit Local Reads feature of the HDFS datanode, the map or reduce tasks read the model data from their own disks. This way may use a lot of HDFS space, but the annoying partition problem is gone.

On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

Is there any way? Waiting for a reply; I have posted the question everywhere, but no one is responding. I feel like this is the right place to ask doubts, as some of you may have come across the same issue and got stuck.

On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

Yes, one of my friends is implementing the same. I know global sharing of data is not possible across Hadoop MapReduce, but I need to check whether it can be done somehow in Hadoop MapReduce too, because I found some papers on KNN in Hadoop as well, and I am trying to compare the performance. I hope some pointers can help me.

On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Have you considered implementing this using something like Spark? That could be much easier than raw map-reduce.

On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com wrote:

In a KNN-like algorithm we need to load the model data into the cache for predicting the records. Here is the example for KNN. So if the model is a large file, say 1 or 2 GB, will we be able to load it into the distributed cache? One way is to split/partition the model result into some files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label and predict the outcome. How can we partition the file and perform the operation on these partitions? That is:

    1st record: distance to partition1, partition2, ...
    2nd record: distance to partition1, partition2, ...

This is what came to my thoughts. Is there any further way? Any pointers would help me.
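Whatever partitioning scheme is chosen, the per-record work each task performs is the same distance-plus-majority-vote step. A plain-Java sketch with toy data (illustrative only, not the poster's code or a Hadoop API):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class Knn {
    // Euclidean distance between two feature vectors of equal length
    public static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Predict by majority label among the k nearest model records.
    public static String predict(double[][] model, String[] labels, double[] q, int k) {
        Integer[] idx = new Integer[model.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort record indices by distance to the query point
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(model[i], q)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] model = {{1, 1}, {1, 2}, {8, 8}, {9, 8}};
        String[] labels = {"A", "A", "B", "B"};
        System.out.println(predict(model, labels, new double[]{1.5, 1.5}, 3)); // A
    }
}
```

With a partitioned model, each partition would compute its own k nearest candidates for a test record and the global k nearest (and the final vote) can be selected from the candidates in a reduce step.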
Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
Yes I tried the same Drake. I dont know if I understood your answer. Instead of loading them into setup() through cache I read them directly from HDFS in map section. and for each incoming record .I found the distance between all the records in HDFS. ie if R ans S are my dataset, R is the model data stored in HDFs and when S taken for processing S1-R(finding distance with whole R set) S2-R But it is taking a long time as it needs to compute the distance. On Wed, Jan 21, 2015 at 10:31 AM, Drake민영근 drake@nexr.com wrote: In my suggestion, map or reduce tasks do not use distributed cache. They use file directly from HDFS with short circuit local read. Like a shared storage method, but almost every node has the data with high-replication factor. Drake 민영근 Ph.D On Wed, Jan 21, 2015 at 1:49 PM, unmesha sreeveni unmeshab...@gmail.com wrote: But stil if the model is very large enough, how can we load them inti Distributed cache or some thing like that. Here is one source : http://www.cs.utah.edu/~lifeifei/papers/knnslides.pdf But it is confusing me On Wed, Jan 21, 2015 at 7:30 AM, Drake민영근 drake@nexr.com wrote: Hi, How about this ? The large model data stay in HDFS but with many replications and MapReduce program read the model from HDFS. In theory, the replication factor of model data equals with number of data nodes and with the Short Circuit Local Reads function of HDFS datanode, the map or reduce tasks read the model data in their own disks. In this way, maybe use too many usage of HDFS, but the annoying partition problem will be gone. Thanks Drake 민영근 Ph.D On Thu, Jan 15, 2015 at 6:05 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Is there any way.. Waiting for a reply.I have posted the question every where..but none is responding back. I feel like this is the right place to ask doubts. As some of u may came across the same issue and get stuck. 
On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes, one of my friends is implementing the same. I know global sharing of data is not possible across Hadoop MapReduce. But I need to check whether it can be done somehow in Hadoop MapReduce, because I found some papers on KNN in Hadoop too, and I am trying to compare the performance. Hope some pointers can help me. On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com wrote: Have you considered implementing this using something like Spark? That could be much easier than raw map-reduce. On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com wrote: In a KNN-like algorithm we need to load the model data into the cache for predicting the records. Here is the example for KNN. [image: Inline image 1] So if the model is a large file, say 1 or 2 GB, we will not be able to load it into the distributed cache. One way is to split/partition the model result into some files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome. How can we partition the file and perform the operation on these partitions? i.e., 1st record: distance against partition1, partition2, ...; 2nd record: distance against partition1, partition2, ... This is what came to my mind. Is there any further way? Any pointers would help me. 
-- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
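To make the per-record step discussed in this thread concrete, here is a Hadoop-free sketch of the distance-plus-vote logic a KNN mapper would run for one test record against the model set. The class, method, and variable names are made up for illustration; in the real job the model rows would come from HDFS or the distributed cache rather than an in-memory array.

```java
import java.util.*;

// Hypothetical, standalone sketch of the KNN step: for one test record,
// compute the distance to every model record, keep the k nearest, and
// take a majority vote on the class label.
public class KnnSketch {
    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Classify 'query' against the model rows; labels[i] is the class of model[i].
    static String classify(double[][] model, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[model.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort model indices by distance to the query record.
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(model[i], query)));
        // Majority vote among the k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++)
            votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[][] model = {{0, 0}, {0, 1}, {5, 5}, {6, 5}};
        String[] labels = {"A", "A", "B", "B"};
        // Two of the three nearest neighbours carry label "A".
        System.out.println(classify(model, labels, new double[]{0.2, 0.3}, 3)); // prints A
    }
}
```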
Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
Is there any way? Waiting for a reply. I have posted the question everywhere, but no one is responding. I feel like this is the right place to ask doubts, as some of you may have come across the same issue and got stuck. On Thu, Jan 15, 2015 at 12:34 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes, one of my friends is implementing the same. I know global sharing of data is not possible across Hadoop MapReduce. But I need to check whether it can be done somehow in Hadoop MapReduce, because I found some papers on KNN in Hadoop too, and I am trying to compare the performance. Hope some pointers can help me. On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com wrote: Have you considered implementing this using something like Spark? That could be much easier than raw map-reduce. On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com wrote: In a KNN-like algorithm we need to load the model data into the cache for predicting the records. Here is the example for KNN. [image: Inline image 1] So if the model is a large file, say 1 or 2 GB, we will not be able to load it into the distributed cache. One way is to split/partition the model result into some files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome. How can we partition the file and perform the operation on these partitions? i.e., 1st record: distance against partition1, partition2, ...; 2nd record: distance against partition1, partition2, ... This is what came to my mind. Is there any further way? Any pointers would help me. 
-- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
How to partition a file to smaller size for performing KNN in hadoop mapreduce
In a KNN-like algorithm we need to load the model data into the cache for predicting the records. Here is the example for KNN. [image: Inline image 1] So if the model is a large file, say 1 or 2 GB, we will not be able to load it into the distributed cache. One way is to split/partition the model result into some files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome. How can we partition the file and perform the operation on these partitions? i.e., 1st record: distance against partition1, partition2, ...; 2nd record: distance against partition1, partition2, ... This is what came to my mind. Is there any further way? Any pointers would help me. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce
Yes, one of my friends is implementing the same. I know global sharing of data is not possible across Hadoop MapReduce. But I need to check whether it can be done somehow in Hadoop MapReduce, because I found some papers on KNN in Hadoop too, and I am trying to compare the performance. Hope some pointers can help me. On Thu, Jan 15, 2015 at 12:17 PM, Ted Dunning ted.dunn...@gmail.com wrote: Have you considered implementing this using something like Spark? That could be much easier than raw map-reduce. On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com wrote: In a KNN-like algorithm we need to load the model data into the cache for predicting the records. Here is the example for KNN. [image: Inline image 1] So if the model is a large file, say 1 or 2 GB, we will not be able to load it into the distributed cache. One way is to split/partition the model result into some files, perform the distance calculation for all records in each file, and then find the minimum distance and the most frequent class label to predict the outcome. How can we partition the file and perform the operation on these partitions? i.e., 1st record: distance against partition1, partition2, ...; 2nd record: distance against partition1, partition2, ... This is what came to my mind. Is there any further way? Any pointers would help me. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: How to run a mapreduce program not on the node of hadoop cluster?
Your data won't get split, so your program runs as a single mapper and a single reducer, and your intermediate data is not shuffled and sorted. But you can use this for debugging. On Jan 14, 2015 2:04 PM, Cao Yi iridium...@gmail.com wrote: Hi, I write some mapreduce code in my project *my_prj*. *my_prj* will be deployed on a machine which is not a node of the cluster. How does *my_prj* run a mapreduce job in this case? Thank you! Best Regards, Iridium
Re: Write and Read file through map reduce
Hi Hitarth, If your file1 and file2 are small you can go with the distributed cache, as mentioned here http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html . Or you can go with MultipleInputs, as mentioned here http://unmeshasreeveni.blogspot.in/2014/12/joining-two-files-using-multipleinput.html . [1] http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html [2] http://unmeshasreeveni.blogspot.in/2014/12/joining-two-files-using-multipleinput.html On Tue, Jan 6, 2015 at 8:53 AM, Ted Yu yuzhih...@gmail.com wrote: Hitarth: You can also consider MultiFileInputFormat (and its concrete implementations). Cheers On Mon, Jan 5, 2015 at 6:14 PM, Corey Nolet cjno...@gmail.com wrote: Hitarth, I don't know how much direction you are looking for with regard to the formats of the files, but you can certainly read both files into the third mapreduce job using FileInputFormat by comma-separating the paths to the files. The blocks for both files will essentially be unioned together and the mappers scheduled across your cluster. On Mon, Jan 5, 2015 at 3:55 PM, hitarth trivedi t.hita...@gmail.com wrote: Hi, I have a 6-node cluster, and the scenario is as follows :- I have one map reduce job which will write file1 to HDFS. I have another map reduce job which will write file2 to HDFS. In the third map reduce job I need to use file1 and file2 to do some computation and output the value. What is the best way to store file1 and file2 in HDFS so that they can be used in the third map reduce job? Thanks, Hitarth -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
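For what it's worth, the core of the join that the third job would perform can be sketched without Hadoop. In a real reduce-side join the shuffle does the grouping by key; here two in-memory maps stand in for the records of file1 and file2 (all names and values are hypothetical):

```java
import java.util.*;

// Standalone sketch of what the third job's join step does: records from
// two sources, keyed by the same id, are paired up. In a real MapReduce
// job the mappers would tag each record with its source and the reducer
// would see both tagged records for the same key.
public class JoinSketch {
    // Inner join of two maps keyed by the same id; the joined value is
    // "left|right" for every key present in both inputs.
    static Map<String, String> join(Map<String, String> left, Map<String, String> right) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : left.entrySet()) {
            String r = right.get(e.getKey());
            if (r != null) out.put(e.getKey(), e.getValue() + "|" + r);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> file1 = Map.of("1", "alice", "2", "bob");
        Map<String, String> file2 = Map.of("2", "dev", "3", "qa");
        System.out.println(join(file1, file2)); // prints {2=bob|dev}
    }
}
```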
[Blog] My experience on Hadoop Certification
http://unmeshasreeveni.blogspot.in/2015/01/cloudera-certified-hadoop-developer-ccd.html -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: FileNotFoundException in distributed mode
Driver: Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); Path cachefile = new Path("path/to/file"); FileStatus[] list = fs.globStatus(cachefile); for (FileStatus status : list) { DistributedCache.addCacheFile(status.getPath().toUri(), conf); } In setup: public void setup(Context context) throws IOException { Configuration conf = context.getConfiguration(); FileSystem fs = FileSystem.get(conf); URI[] cacheFiles = DistributedCache.getCacheFiles(conf); Path getPath = new Path(cacheFiles[0].getPath()); BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath))); String setupData = null; while ((setupData = bf.readLine()) != null) { System.out.println("Setup line in reducer: " + setupData); } } Hope this link helps: http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html On Mon, Dec 22, 2014 at 2:58 PM, Marko Dinic marko.di...@nissatech.com wrote: Hello Hadoopers, I'm getting this exception in Hadoop while trying to read a file that was added to the distributed cache, and the strange thing is that the file exists at the given location: java.io.FileNotFoundException: File does not exist: /tmp/hadoop-pera/mapred/local/taskTracker/distcache/-1517670662102870873_- 1918892372_1898431787/localhost/work/output/temporalcentroids/centroids- iteration0-noOfClusters2/part-r-0 I'm adding the file before starting my job using DistributedCache.addCacheFile(URI.create(args[2]), job.getConfiguration()); and I'm trying to read from the file in the setup method of my mapper using DistributedCache.getLocalCacheFiles(conf); As I said, I can confirm that the file is on the local system, but the exception is thrown. I'm running the job in pseudo-distributed mode, on one computer. Any ideas? Thanks -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Run a c++ program using opencv libraries in hadoop
Hi, How can I run C++ programs that use OpenCV libraries in Hadoop? So far I have done MapReduce jobs only in Java, where we can supply external jars on the command line itself. I have also tried the Python route; to run those we use the Hadoop Streaming API. But I am confused about how to run C++ programs using OpenCV libraries. Thanks in advance. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Split files into 80% and 20% for building model and prediction
I am trying to divide my HDFS file into two parts/files of 80% and 20% for a classification algorithm (80% for modelling and 20% for prediction). Please provide suggestions for the same. To put 80% and 20% into two separate files we need to know the exact number of records in the data set, and that is only known if we go through the data set once. So we would need to write one MapReduce job just for counting the number of records, and a second MapReduce job for separating the 80% and 20% into two files using MultipleOutputs. Am I on the right track, or is there an alternative? But again, a small confusion: how do we check whether the reducer has received 80% of the data? -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Split files into 80% and 20% for building model and prediction
Hi Mikael, so you won't need an MR job for counting the number of records in the file to find the 80% and 20%? On Fri, Dec 12, 2014 at 3:54 PM, Mikael Sitruk mikael.sit...@gmail.com wrote: I would use a different approach. For each row in the mapper I would invoke random.next(); if the number generated is below 0.8 then the row goes to the key for training, otherwise to the key for the test set. Mikael.s -- From: Susheel Kumar Gadalay skgada...@gmail.com Sent: 12/12/2014 12:00 To: user@hadoop.apache.org Subject: Re: Split files into 80% and 20% for building model and prediction Simple solution: copy the HDFS file to local, use OS commands to count the number of lines (cat file1 | wc -l), and cut it based on line number. On 12/12/14, unmesha sreeveni unmeshab...@gmail.com wrote: I am trying to divide my HDFS file into two parts/files of 80% and 20% for a classification algorithm (80% for modelling and 20% for prediction). Please provide suggestions for the same. To put 80% and 20% into two separate files we need to know the exact number of records in the data set, and that is only known if we go through the data set once. So we would need to write one MapReduce job just for counting the number of records, and a second MapReduce job for separating the 80% and 20% into two files using MultipleOutputs. Am I on the right track, or is there an alternative? But again, a small confusion: how do we check whether the reducer has received 80% of the data? -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
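Mikael's random-routing idea can be checked standalone: no counting pass is needed, because each record is routed independently with probability 0.8/0.2. In a real job the two branches would write through MultipleOutputs instead of two lists; the class and method names below are made up for illustration.

```java
import java.util.*;

// Standalone sketch of the probabilistic 80/20 split suggested in this
// thread: each record goes to the training or test bucket based on a
// random draw, so no record count is needed in advance.
public class RandomSplit {
    // Returns {trainingRecords, testRecords}; 'seed' makes runs repeatable.
    static List<String>[] split(List<String> records, double trainFraction, long seed) {
        Random rnd = new Random(seed);
        @SuppressWarnings("unchecked")
        List<String>[] out = new List[]{new ArrayList<String>(), new ArrayList<String>()};
        for (String rec : records) {
            // Draw below the threshold -> training set, otherwise -> test set.
            out[rnd.nextDouble() < trainFraction ? 0 : 1].add(rec);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10000; i++) records.add("record-" + i);
        List<String>[] parts = split(records, 0.8, 42L);
        // The split is approximate: close to 8000/2000, not exact.
        System.out.println(parts[0].size() + " train / " + parts[1].size() + " test");
    }
}
```

The trade-off versus the two-job counting approach: the proportions are only approximate (they converge to 80/20 as the file grows), but the whole split happens in a single map-only pass.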
Re: DistributedCache
On Fri, Dec 12, 2014 at 9:55 AM, Shahab Yunus shahab.yu...@gmail.com wrote: job.addCacheFiles Yes, you can use job.addCacheFiles to cache the file. Configuration conf = new Configuration(); FileSystem fs = FileSystem.get(conf); Path cachefile = new Path("path/to/file"); FileStatus[] list = fs.globStatus(cachefile); for (FileStatus status : list) { DistributedCache.addCacheFile(status.getPath().toUri(), conf); } Hope this link helps: [1] http://unmeshasreeveni.blogspot.in/2014/10/how-to-load-file-in-distributedcache-in.html -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Detailing on how UPDATE is performed in Hive
Hi friends, Where can I find details on how UPDATE is performed in Hive? 1. When an update is performed, will HDFS write that block elsewhere with the new value? 2. Is the old block deallocated and made available for further writes? 3. Does this process create fragmentation? 4. When an update is performed on a partitioned table, is the partition deleted and rewritten with the new value, or is the entire block deleted and written once again? Where would be a good place to gather this knowledge? -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
[blog] How to do Update operation in hive-0.14.0
Hi, Hope this link helps those who are trying to practise ACID operations in Hive 0.14: http://unmeshasreeveni.blogspot.in/2014/11/updatedeleteinsert-in-hive-0140.html -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Fwd: Values getting duplicated in Hive table(Partitioned)
Thanks, it worked. On Nov 17, 2014 3:32 PM, unmesha sreeveni unmeshab...@gmail.com wrote: -- Forwarded message -- From: unmesha sreeveni unmeshab...@gmail.com Date: Mon, Nov 17, 2014 at 10:49 AM Subject: Re: Values getting duplicated in Hive table(Partitioned) To: User - Hive u...@hive.apache.org In the non-partitioned table I am getting the correct values. Is my update query wrong? 1. INSERT OVERWRITE TABLE Unm_Parti_Trail PARTITION (Department = 'A') SELECT employeeid, firstname, designation, CASE WHEN employeeid = 19 THEN '5' ELSE salary END AS salary FROM Unm_Parti_Trail; What I tried to express in the query is: in the partition with department = A, update employeeid 19's salary to 5. Is that query statement wrong? And the change should not be duplicated to departments B and C. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
[Blog] Hive Partitioning
Hi, This is a blog on Hive partitioning. http://unmeshasreeveni.blogspot.in/2014/11/hive-partitioning.html Hope it helps someone. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
[Blog] Updating Partition Table using INSERT In HIVE
Hi, This is a blog on Hive updates for an older version (hive-0.12.0): http://unmeshasreeveni.blogspot.in/2014/11/updating-partition-table-using-insert.html Hope it helps someone. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Showing INFO ipc.Client: Retrying connect to server once hadoop is upgraded to cdh5.2.0
I upgraded my Hadoop cluster (CDH) to cdh5.2.0, but once I run my job with iterations, it shows the following after the 1st iterative job: 14/11/17 09:29:44 INFO ipc.Client: Retrying connect to server: /xx.xx.xx.xx:. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) 14/11/17 09:29:45 INFO ipc.Client: Retrying connect to server: /xx.xx.xx.xx:. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) 14/11/17 09:29:46 INFO ipc.Client: Retrying connect to server: /xx.xx.xx.xx:. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS) I have some calculations in the Driver class. It works fine for a small dataset. When I tried to run 1 GB of data it showed the above error. It seems that after my first job some calculation is done in the Driver class, and after the calculation the next job starts. But I think it is not waiting for the time spent on the calculation in the Driver class (as it is a 1 GB file, the driver calculation takes a long time) and is throwing the above error. It worked fine in the previous version. Did I miss anything during installation? Why is it so? Please advise. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Values getting duplicated in Hive table(Partitioned)
I created a Hive table with a *partition* and inserted data into the partitioned Hive table. Referenced site: https://blog.safaribooksonline.com/2012/12/03/tip-partitioning-data-in-hive/ 1. *Initially I created one non-partitioned table and then loaded data into the partitioned table using a select query. Is there an alternate way?* 2. *By following the above link my partitioned table contains duplicate values. Below are the steps.* This is my sample employee dataset: link1 http://pastebin.com/tVh16Yxt I tried the following queries: link2 http://pastebin.com/U2yykWpy But after updating a value in the Hive table, the values are getting duplicated. 7 Nirmal Tech 12000 A 7 Nirmal Tech 12000 B Nirmal is placed in department *A* only, but he is duplicated into department *B*. And once I update a column value in the middle, NULL values are displayed, while updating the last column is fine. Am I doing anything wrong? Please suggest. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Fwd: Values getting duplicated in Hive table(Partitioned)
In the non-partitioned table I am getting the correct values. Is my update query wrong? INSERT OVERWRITE TABLE Unm_Parti_Trail PARTITION (Department = 'A') SELECT employeeid, firstname, designation, CASE WHEN employeeid = 19 THEN '5' ELSE salary END AS salary FROM Unm_Parti_Trail; What I tried to express in the query is: in the partition with department = A, update employeeid 19's salary to 5. Is that query statement wrong? And the change should not be duplicated to departments B and C. -- Forwarded message -- From: hadoop hive hadooph...@gmail.com Date: Mon, Nov 17, 2014 at 10:08 AM Subject: Re: Values getting duplicated in Hive table(Partitioned) To: u...@hive.apache.org Can you run your select query on the non-partitioned tables and check if it's giving correct values? Same for dept. B. On Nov 17, 2014 10:03 AM, unmesha sreeveni unmeshab...@gmail.com wrote: ***I created a *non*-*partitioned* Hive table and, using a select query, inserted data into the *partitioned* Hive table. On Mon, Nov 17, 2014 at 10:00 AM, unmesha sreeveni unmeshab...@gmail.com wrote: I created a Hive table with a *partition* and inserted data into the partitioned Hive table. Referenced site: https://blog.safaribooksonline.com/2012/12/03/tip-partitioning-data-in-hive/ 1. *Initially I created one non-partitioned table and then loaded data into the partitioned table using a select query. Is there an alternate way?* 2. *By following the above link my partitioned table contains duplicate values. Below are the steps.* This is my sample employee dataset: link1 http://pastebin.com/tVh16Yxt I tried the following queries: link2 http://pastebin.com/U2yykWpy But after updating a value in the Hive table, the values are getting duplicated. 7 Nirmal Tech 12000 A 7 Nirmal Tech 12000 B Nirmal is placed in department *A* only, but he is duplicated into department *B*. And once I update a column value in the middle, NULL values are displayed, while updating the last column is fine. Am I doing anything wrong? 
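For reference, here is a cleaned-up sketch of the query pattern in this thread (the pre-ACID, pre-0.14 way to "update" a row by rewriting its partition). Table and column names are the ones used in the thread; the added WHERE clause is my guess at the fix for the duplication described, since without it rows from every department get rewritten into partition A:

```sql
-- Rewrite only partition A, changing employee 19's salary via CASE.
-- The WHERE clause restricts the source rows to that same partition;
-- omitting it writes rows from departments B and C into partition A,
-- which would produce exactly the duplicates reported above.
INSERT OVERWRITE TABLE Unm_Parti_Trail PARTITION (Department = 'A')
SELECT employeeid,
       firstname,
       designation,
       CASE WHEN employeeid = 19 THEN 5 ELSE salary END AS salary
FROM Unm_Parti_Trail
WHERE Department = 'A';
```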
Please suggest. -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Centre for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Can add a regular check in DataNode on free disk space?
1. Stop all Hadoop daemons. 2. Remove all files from /var/lib/hadoop-hdfs/cache/hdfs/dfs/name. 3. Format the namenode. 4. Start all Hadoop daemons. On Mon, Oct 20, 2014 at 8:26 AM, sam liu samliuhad...@gmail.com wrote: Hi Experts and Developers, At present, if a DataNode has no free disk space, we cannot see this bad situation anywhere, including the DataNode log. At the same time, under this situation, the HDFS write operation will fail and return the error message below. However, from the error message the user cannot tell that the root cause is the only datanode running out of disk space, and he also cannot find any useful hint in the datanode log. So I believe it would be better if we could add a regular check in the DataNode on free disk space, which would add a WARNING or ERROR message to the datanode log if that datanode runs out of space. What's your opinion? Error Msg: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hadoop/PiEstimator_TMP_3_141592654/in/part0 could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation. at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1441) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2702) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440) Thanks! -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
[Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects
http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects
Hi, Can the 5th question's answer be SQOOP? On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar skumar.bigd...@hotmail.com wrote: Are you preparing for the Cloudera certification exam? Thanks and Regards, Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh (510) 936-2650 Sr Data Consultant - BigData Implementations. [image: View my profile on LinkedIn] http://www.linkedin.com/in/sinhasantosh *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com] *Sent:* Monday, October 06, 2014 12:45 AM *To:* User - Hive; User Hadoop; User Pig *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects
What about the last one? The answer is correct: Pig. Isn't it? On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam adarsh.deshrat...@gmail.com wrote: For question 3 the answer should be B, and for question 4 the answer should be D. Thanks, Adarsh D Consultant - BigData and Cloud [image: View my profile on LinkedIn] http://in.linkedin.com/in/adarshdeshratnam On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Hi, Can the 5th question's answer be SQOOP? On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar skumar.bigd...@hotmail.com wrote: Are you preparing for the Cloudera certification exam? Thanks and Regards, Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh (510) 936-2650 Sr Data Consultant - BigData Implementations. *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com] *Sent:* Monday, October 06, 2014 12:45 AM *To:* User - Hive; User Hadoop; User Pig *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects
What I feel is: for question 5 it says the weblogs are already in HDFS (so no need to import anything). Also, these are log files, NOT database files with a specific schema. So I think Pig is the best way to access and process this data. On Tue, Oct 7, 2014 at 4:10 AM, Pradeep Gollakota pradeep...@gmail.com wrote: I agree with the answers suggested above. 3. B 4. D 5. C On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote: Hi No, Pig is a data manipulation language for data already in Hadoop. The question is about importing data from an OLTP DB (e.g. Oracle, MySQL...) to Hadoop, which is what Sqoop is for (SQL to Hadoop). I'm not certain the certification guys are happy with their exam questions ending up on blogs and mailing lists :-) Ulul On 06/10/2014 13:54, unmesha sreeveni wrote: What about the last one? The answer is correct: Pig. Isn't it? On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam adarsh.deshrat...@gmail.com wrote: For question 3 the answer should be B, and for question 4 the answer should be D. Thanks, Adarsh D Consultant - BigData and Cloud [image: View my profile on LinkedIn] http://in.linkedin.com/in/adarshdeshratnam On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Hi, Can the 5th question's answer be SQOOP? On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar skumar.bigd...@hotmail.com wrote: Are you preparing for the Cloudera certification exam? Thanks and Regards, Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh (510) 936-2650 Sr Data Consultant - BigData Implementations. 
[image: View my profile on LinkedIn] http://www.linkedin.com/in/sinhasantosh *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com] *Sent:* Monday, October 06, 2014 12:45 AM *To:* User - Hive; User Hadoop; User Pig *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html -- *Thanks & Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects
Hi Pradeep You are right. Updated the right answers in the blog. This may help anyone thinking about investing in that particular test package. On Tue, Oct 7, 2014 at 9:25 AM, Pradeep Gollakota pradeep...@gmail.com wrote: That's not exactly what the question is asking for... It's saying that you have a bunch of weblogs in HDFS that you want to join with user profile data that is stored in your OLTP database, how do you do the join? First, you export your OLTP database into HDFS using Sqoop. Then you can use Pig/Hive/MR/Cascading/whatever to work with both the datasets and perform the join. On Mon, Oct 6, 2014 at 8:49 PM, unmesha sreeveni unmeshab...@gmail.com wrote: What I feel like is For question 5 it says, the weblogs are already in HDFS (so no need to import anything).Also these are log files, NOT database files with a specific schema. So I think Pig is the best way to access and process this data. On Tue, Oct 7, 2014 at 4:10 AM, Pradeep Gollakota pradeep...@gmail.com wrote: I agree with the answers suggested above. 3. B 4. D 5. C On Mon, Oct 6, 2014 at 2:58 PM, Ulul had...@ulul.org wrote: Hi No, Pig is a data manipulation language for data already in Hadoop. The question is about importing data from OLTP DB (eg Oracle, MySQL...) to Hadoop, this is what Sqoop is for (SQL to Hadoop) I'm not certain certification guys are happy with their exam questions ending up on blogs and mailing lists :-) Ulul Le 06/10/2014 13:54, unmesha sreeveni a écrit : what about the last one? The answer is correct. Pig. Is nt it? On Mon, Oct 6, 2014 at 4:29 PM, adarsh deshratnam adarsh.deshrat...@gmail.com wrote: For question 3 answer should be B and for question 4 answer should be D. Thanks, Adarsh D Consultant - BigData and Cloud [image: View my profile on LinkedIn] http://in.linkedin.com/in/adarshdeshratnam On Mon, Oct 6, 2014 at 2:25 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Hi 5 th question can it be SQOOP? 
On Mon, Oct 6, 2014 at 1:24 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes On Mon, Oct 6, 2014 at 1:22 PM, Santosh Kumar skumar.bigd...@hotmail.com wrote: Are you preparing for the Cloudera certification exam? Thanks and Regards, Santosh Kumar SINHA http://www.linkedin.com/in/sinhasantosh (510) 936-2650 Sr Data Consultant - BigData Implementations. *From:* unmesha sreeveni [mailto:unmeshab...@gmail.com] *Sent:* Monday, October 06, 2014 12:45 AM *To:* User - Hive; User Hadoop; User Pig *Subject:* [Blog] Doubts On CCD-410 Sample Dumps on Ecosystem Projects http://www.unmeshasreeveni.blogspot.in/2014/09/what-do-you-think-of-these-three.html -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: toolrunner issue
public class MyClass extends Configured implements Tool { public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); int res = ToolRunner.run(conf, new MyClass(), args); System.exit(res); } @Override public int run(String[] args) throws Exception { Job job = new Job(getConf(), ""); job.setJarByClass(MyClass.class); job.setMapOutputKeyClass(IntWritable.class); job.setMapOutputValueClass(TwovalueWritable.class); job.setOutputKeyClass(IntWritable.class); job.setOutputValueClass(TwovalueWritable.class); job.setMapperClass(Mapper.class); job.setReducerClass(Reducer.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); ... } I am able to run this without any errors. Please make sure that your code follows the same pattern as above; note that run() builds the Job from getConf(), the configuration that ToolRunner has already populated from the command line, rather than creating a fresh one. On Mon, Sep 1, 2014 at 4:18 PM, rab ra rab...@gmail.com wrote: Hello I'm having an issue in running one simple map reduce job. The portion of the code is below. It gives a warning that Hadoop command line parsing was not performed. This occurs despite the class implementing the Tool interface. Any clue? public static void main(String[] args) throws Exception { try { int exitcode = ToolRunner.run(new Configuration(), new MyClass(), args); System.exit(exitcode); } catch(Exception e) { e.printStackTrace(); } } @Override public int run(String[] args) throws Exception { JobConf conf = new JobConf(MyClass.class); System.out.println(args); FileInputFormat.addInputPath(conf, new Path("/smallInput")); conf.setInputFormat(CFNInputFormat.class); conf.setMapperClass(MyMapper.class); conf.setMapOutputKeyClass(Text.class); conf.setMapOutputValueClass(Text.class); FileOutputFormat.setOutputPath(conf, new Path("/TEST")); JobClient.runJob(conf); return 0; } -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Hadoop on Safe Mode because Resources are low on NameNode
You can leave safe mode: Namenode in safe mode, how to leave: http://www.unmeshasreeveni.blogspot.in/2014/04/name-node-is-in-safe-mode-how-to-leave.html On Wed, Aug 27, 2014 at 9:38 AM, Stanley Shi s...@pivotal.io wrote: You can force the namenode to get out of safe mode: hadoop dfsadmin -safemode leave On Tue, Aug 26, 2014 at 11:05 PM, Vincent Emonet vincent.emo...@gmail.com wrote: Hello, We have an 11-node Hadoop cluster installed from the Hortonworks RPM doc: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.9.1/bk_installing_manually_book/content/rpm-chap1.html The cluster was working fine until it went into Safe Mode during the execution of a job, with this message on the NameNode interface: *Safe mode is ON. Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use hdfs dfsadmin -safemode leave to turn safe mode off.* The error displayed in the job log is: 2014-08-22 08:51:35,446 WARN namenode.NameNodeResourceChecker (NameNodeResourceChecker.java:isResourceAvailable(89)) - Space available on volume 'null' is 100720640, which is below the configured reserved amount 104857600 2014-08-22 08:51:35,446 WARN namenode.FSNamesystem (FSNamesystem.java:run(4042)) - NameNode low on available disk space. Already in safe mode. On each node we have 5 HDDs used for Hadoop, and we checked that the 5 HDDs on the NameNode are all full (between 95 and 100%) while HDFS still has 50% of its capacity available: on the other nodes the 5 HDDs are at 30-40%. So I think this is the cause of the error.
On the NameNode we had some non-HDFS data on 1 HDD, so I deleted it to have 50% of this HDD available (the 4 others are still between 95 and 100%). But this didn't resolve the problem. I have also followed the advice found here: https://issues.apache.org/jira/browse/HDFS-4425 And added the following property to the hdfs-site.xml of the NameNode (multiplying the default value by 2): <property> <name>dfs.namenode.resource.du.reserved</name> <value>209715200</value> </property> Still impossible to get out of safe mode, and as long as we are in safe mode we can't delete anything in HDFS. Does anyone have a tip about this issue? Thankfully, Vincent. -- Regards, *Stanley Shi,* -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Create HDFS directory fails
On Tue, Jul 29, 2014 at 1:13 PM, R J rj201...@yahoo.com wrote: java.io.IOException: Mkdirs failed to create Check if you have permissions to mkdir this directory (try it from the command line) -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Sqoop command syntax
Try this sqoop import \ --connect jdbc:oracle:thin:@myoracleserver.com:1521/mysid \ --table mytab1 \ --username scott --password tiger On Tue, Jul 29, 2014 at 2:11 PM, R J rj201...@yahoo.com wrote: Hi All, Could anyone please help me with the Sqoop command syntax? I tried the following command: /home/logger/scoop/sqoop-1.4.4.bin__hadoop-0.20/bin/sqoop import --driver oracle.jdbc.driver.OracleDriver --connect jdbc:oracle:thin:@myoracleserver.com:1521/mysid --username scott --password tiger --table mytab1 I get the error: 14/07/29 08:26:23 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM mytab1 AS t WHERE 1=0 14/07/29 08:26:24 ERROR manager.SqlManager: Error executing statement: java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended java.sql.SQLSyntaxErrorException: ORA-00933: SQL command not properly ended at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:447) at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:396) at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:951) at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:513) at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227) at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531) at oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:208) at oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:886) at oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:1175) at oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1296) at oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:3613) at oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3657) at oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:1495) at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:674) at 
org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:683) at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:240) at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:223) at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:347) at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1277) at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1089) at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96) at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396) at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502) at org.apache.sqoop.Sqoop.run(Sqoop.java:145) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229) at org.apache.sqoop.Sqoop.main(Sqoop.java:238) 14/07/29 08:26:24 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: No columns to generate for ClassWriter at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1095) at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96) at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:396) at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:502) at org.apache.sqoop.Sqoop.run(Sqoop.java:145) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220) at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229) at org.apache.sqoop.Sqoop.main(Sqoop.java:238) -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
ListWritable In Hadoop
Hi, do we have a ListWritable in Hadoop? -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer*
Re: hadoop directory can't add and remove
http://www.unmeshasreeveni.blogspot.in/2014/04/name-node-is-in-safe-mode-how-to-leave.html hadoop fs -rm -r QuasiMonteCarlo_1404262305436_855154103 On Wed, Jul 2, 2014 at 11:56 AM, EdwardKing zhan...@neusoft.com wrote: I want to remove a hadoop directory, so I use hadoop fs -rmr, but it can't remove it. Why? [hdfs@localhost hadoop-2.2.0]$ hadoop fs -ls Found 2 items drwxr-xr-x - hdfs supergroup 0 2014-07-01 17:52 QuasiMonteCarlo_1404262305436_855154103 drwxr-xr-x - hdfs supergroup 0 2014-07-01 18:17 QuasiMonteCarlo_1404263830233_477013424 [hdfs@localhost hadoop-2.2.0]$ hadoop fs -rmr *.* rmr: DEPRECATED: Please use 'rm -r' instead. rmr: `LICENSE.txt': No such file or directory rmr: `NOTICE.txt': No such file or directory rmr: `README.txt': No such file or directory [hdfs@localhost hadoop-2.2.0]$ hadoop fs -la Found 2 items drwxr-xr-x - hdfs supergroup 0 2014-07-01 17:52 QuasiMonteCarlo_1404262305436_855154103 drwxr-xr-x - hdfs supergroup 0 2014-07-01 18:17 QuasiMonteCarlo_1404263830233_477013424 And I tried to mkdir a directory, but I still fail: [hdfs@localhost hadoop-2.2.0]$ hadoop fs -mkdir abc mkdir: Cannot create directory /user/hdfs/abc. Name node is in safe mode. How can I do it? Thanks. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: WholeFileInputFormat in hadoop
But how is that different from normal parallel MR execution? MapReduce is a parallel execution framework where the input to each map() call is a single record. If WholeFileInputFormat took just an entire input split instead of the entire input file, it would be more useful, right? If it takes the whole file, it can cause a heap space error. Please correct me if I am wrong. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: WholeFileInputFormat in hadoop
I am trying to implement the DBSCAN algorithm. I referred to the algorithm in Data Mining - Concepts and Techniques (3rd Ed), chapter 10, page 474. In this algorithm we need to find the distance between each pair of points. Say my sample input is 5,6 8,2 4,5 4,6 So in DBSCAN we have to pick 1 element and then find the distance to all the others. While implementing this, I will not be able to get the whole file in the map in order to find the distances. I tried some approaches: 1. Used WholeFileInput and did the entire algorithm in the map itself - I don't think this is a good one (and it ended up with a heap space error). 2. And this one is not implemented, as I thought it is not feasible - Read 1 line of the input data set in the driver and write it to a new file (say centroid) - this centroid can be read in setup(), and the distance calculated in map(), which emits the data that satisfies the DBSCAN condition as (id, epsilonneighbour) - in the reducer we will be able to aggregate all the epsilon neighbours of (5,6) which come from different maps, and then find the neighbours of each epsilon neighbour. - The next iteration should also be done the same way: again read the input file and find a node which is not visited. If the input is a 1GB file, the MR job executes as many times as there are records. Can anyone suggest a better way to do this? Hope the use case is understandable; else please tell me and I will explain further. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
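To make the epsilon-neighbourhood step concrete, here is a plain-Java sketch of the distance check on the sample points above (no Hadoop involved; the class and method names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class EpsilonNeighbors {
    // Euclidean distance between two 2-D points
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // All points within eps of the chosen point p (excluding p itself)
    static List<double[]> neighbors(double[] p, List<double[]> all, double eps) {
        List<double[]> result = new ArrayList<>();
        for (double[] q : all) {
            if (q != p && distance(p, q) <= eps) {
                result.add(q);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[]{5, 6});
        points.add(new double[]{8, 2});
        points.add(new double[]{4, 5});
        points.add(new double[]{4, 6});
        // Neighbors of (5,6) within eps = 1.5 are (4,5) and (4,6)
        System.out.println(neighbors(points.get(0), points, 1.5).size()); // prints 2
    }
}
```

In the MR designs discussed above, this is the computation each map() call would perform between its input record and the cached seed point.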
WholeFileInputFormat in hadoop
Hi A small clarification: does WholeFileInputFormat take the entire input file as input, or each input split as a whole? -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Finding maximum value in reducer
I have a scenario. The output from previous job1 is http://pastebin.com/ADa8fTGB. In the next job2 I need to find the i keys having the maximum values, e.g. i=3: the 3 keys having the maximum values (i will be a custom parameter). How to approach this? Should we calculate max() in job2's mapper, as there will be unique keys (since the output is coming from the previous reducer), or find the max in the second job's reducer? But again, how to find i keys? I tried it this way: instead of emitting value as value in the reducer, I emitted value as key, so I can get the values in ascending order. Then I wrote the next MR job, where the mapper simply emits the key/value and the reducer finds the max of the keys. But again I am stuck: that cannot be done when we try to get the id, because only the id is unique; the values are not unique. How to solve this? -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
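One standard pattern for "top i keys by value" is a fixed-size min-heap: each (id, value) pair is offered to the heap, the smallest entry is evicted once the heap holds more than i entries, and the survivors are emitted at the end (in a reducer, that would be in cleanup()). Duplicate values are no problem because the id travels with its value. A plain-Java sketch with made-up data:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.PriorityQueue;

public class TopKKeys {
    // Returns the k entries with the largest values, using a size-k min-heap
    static PriorityQueue<Map.Entry<String, Integer>> topK(Map<String, Integer> data, int k) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.getValue(), b.getValue()));
        for (Map.Entry<String, Integer> e : data.entrySet()) {
            heap.offer(new SimpleEntry<>(e.getKey(), e.getValue()));
            if (heap.size() > k) {
                heap.poll(); // evict the current minimum
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        // Tied values are fine: the key (id) stays attached to its value
        Map<String, Integer> counts = Map.of("a", 10, "b", 42, "c", 7, "d", 42);
        for (Map.Entry<String, Integer> e : topK(counts, 3)) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```

Because the heap holds at most i entries, this avoids sorting the whole output of job1 just to pick the top few.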
Re: Map reduce query
Hi You can directly use this, right? FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); Or do you need an extra input file to feed into the mapper? On Fri, Jun 20, 2014 at 11:57 AM, Shrivastava, Himnshu (GE Global Research, Non-GE) himnshu.shrivast...@ge.com wrote: How can I give input to a mapper from the command line? The –D option can be used, but what are the corresponding changes required in the mapper and the driver program? Regards, -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Map reduce query
You can try it using the DistributedCache. In the Driver: FileStatus[] list = fs.globStatus(extrafile); for (FileStatus status : list) { DistributedCache.addCacheFile(status.getPath().toUri(), conf); } In the Map: URI[] cacheFiles = DistributedCache.getCacheFiles(conf); Path getPath = new Path(cacheFiles[0].getPath()); BufferedReader bf = new BufferedReader(new InputStreamReader(fs.open(getPath))); On Fri, Jun 20, 2014 at 1:12 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Hi You can directly use this, right? FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); Or do you need an extra input file to feed into the mapper? On Fri, Jun 20, 2014 at 11:57 AM, Shrivastava, Himnshu (GE Global Research, Non-GE) himnshu.shrivast...@ge.com wrote: How can I give input to a mapper from the command line? The –D option can be used, but what are the corresponding changes required in the mapper and the driver program? Regards, -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/ -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Counters in MapReduce
yes rectified that error But after 1 st iteration when it enters to second iteration showing java.io.FileNotFoundException: for *Path out1 = new Path(CL);* *Why is it so .* *Normally that should be in this way only the o/p folder should not exist* * //other configuration* * job1.setMapperClass(ID3ClsLabelMapper.class);* * job1.setReducerClass(ID3ClsLabelReducer.class);* * Path in = new Path(args[0]);* * Path out1 = new Path(CL);* *//delete the file if exist* * if(counter == 0){* *FileInputFormat.addInputPath(job1, in);* * }* * else{* *FileInputFormat.addInputPath(job1, out5); * * }* * FileOutputFormat.setOutputPath(job1,out1);* * job1.waitForCompletion(true);* On Thu, Jun 12, 2014 at 10:29 AM, unmesha sreeveni unmeshab...@gmail.com wrote: I tried out by setting an enum to count no. of lines in output file from job3. But I am getting 14/06/12 10:12:30 INFO mapred.JobClient: Total committed heap usage (bytes)=1238630400 conf3 Exception in thread main java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:116) at org.apache.hadoop.mapreduce.Job.getCounters(Job.java:491) Below is my current code *static enum UpdateCounter {* *INCOMING_ATTR* *}* *public static void main(String[] args) throws Exception {* *Configuration conf = new Configuration();* *int res = ToolRunner.run(conf, new Driver(), args);* *System.exit(res);* *}* *@Override* *public int run(String[] args) throws Exception {* *while(counter = 0){* * Configuration conf = getConf();* * /** * * Job 1: * * */* * Job job1 = new Job(conf, );* * //other configuration* * job1.setMapperClass(ID3ClsLabelMapper.class);* * job1.setReducerClass(ID3ClsLabelReducer.class);* * Path in = new Path(args[0]);* * Path out1 = new Path(CL);* * if(counter == 0){* *FileInputFormat.addInputPath(job1, in);* * }* * else{* *FileInputFormat.addInputPath(job1, out5); * * }* * FileInputFormat.addInputPath(job1, in);* * FileOutputFormat.setOutputPath(job1,out1);* * 
job1.waitForCompletion(true);* */** * * Job 2: * * * * * */* *Configuration conf2 = getConf();* *Job job2 = new Job(conf2, );* *Path out2 = new Path(ANC);* *FileInputFormat.addInputPath(job2, in);* *FileOutputFormat.setOutputPath(job2,out2);* * job2.waitForCompletion(true);* * /** * * Job3* **/* *Configuration conf3 = getConf();* *Job job3 = new Job(conf3, );* *System.out.println(conf3);* *Path out5 = new Path(args[1]);* *if(fs.exists(out5)){* *fs.delete(out5, true);* *}* *FileInputFormat.addInputPath(job3,out2);* *FileOutputFormat.setOutputPath(job3,out5);* *job3.waitForCompletion(true);* *FileInputFormat.addInputPath(job3,new Path(args[0]));* *FileOutputFormat.setOutputPath(job3,out5);* *job3.waitForCompletion(true);* *counter = job3.getCounters().findCounter(UpdateCounter.INCOMING_ATTR).getValue();* * }* * return 0;* Am I doing anything wrong? On Mon, Jun 9, 2014 at 4:37 PM, Krishna Kumar kku...@nanigans.com wrote: You should use FileStatus to decide what files you want to include in the InputPath, and use the FileSystem class to delete or process the intermediate / final paths. Moving each job in your iteration logic into different methods would help keep things simple. From: unmesha sreeveni unmeshab...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Monday, June 9, 2014 at 6:02 AM To: User Hadoop user@hadoop.apache.org Subject: Re: Counters in MapReduce Ok I will check out with counters. And after I st iteration the input file to job1 will be the output file of job 3.How to give that. *Inorder to satisfy 2 conditions* First iteration : users input file after first iteration :job 3 's output file as job 1 s input. 
-- *Thanks Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/ -- *Kai Voigt* Am Germaniahafen 1 k...@123.org 24143 Kiel +49 160 96683050 Germany @KaiVoigt -- *Thanks Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/ -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/ -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Counters in MapReduce
I am trying to do iteration with MapReduce. I have 3 jobs running in sequence: //job1 configuration FileInputFormat.addInputPath(job1, new Path(args[0])); FileOutputFormat.setOutputPath(job1, out1); job1.waitForCompletion(true); //job2 configuration FileInputFormat.addInputPath(job2, out1); FileOutputFormat.setOutputPath(job2, out2); job2.waitForCompletion(true); //job3 configuration FileInputFormat.addInputPath(job3, out2); FileOutputFormat.setOutputPath(job3, new Path(args[1])); boolean success = job3.waitForCompletion(true); return(success ? 0 : 1); After job3 I should continue an iteration - job3's output should be the input for job1. And the iteration should continue until the input file is empty. How can I accomplish this? Will counters do the job? -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
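Counters can do it: have job3 increment a counter for each record it writes, read that counter in the driver after job3 completes, and loop again (feeding job3's output path in as job1's input) while it is non-zero. Stripped of the Hadoop API, the driver's control flow is just the loop below; processOnce is a made-up stand-in for one pass of the three chained jobs, and the record count plays the role of the counter value:

```java
import java.util.ArrayList;
import java.util.List;

public class IterativeDriver {
    // Stand-in for one pass of job1 -> job2 -> job3: here it just keeps the
    // first half of the records and returns the "output" of the pass.
    static List<String> processOnce(List<String> input) {
        return new ArrayList<>(input.subList(0, input.size() / 2));
    }

    static int iterate(List<String> input) {
        int iterations = 0;
        // Equivalent of looping while
        // job3.getCounters().findCounter(...).getValue() > 0
        while (!input.isEmpty()) {
            input = processOnce(input); // job3's output becomes job1's input
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        List<String> records = List.of("r1", "r2", "r3", "r4", "r5");
        System.out.println(iterate(records)); // prints 3 (5 -> 2 -> 1 -> 0 records)
    }
}
```

In the real driver you would also delete or rename the intermediate output directories between passes, since FileOutputFormat refuses to write into an existing path.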
Need advice for Implementing classification algorithms in MapReduce
In order to learn MapReduce algorithms, I usually implement them from scratch. What I follow for classification algorithms: First I build a model: hadoop jar myjar.jar edu.ModelDriver Modelinput Modeloutput Secondly, I write a prediction class in MR: hadoop jar myjar.jar edu.PredictDriver Testinput TestOutput Modeloutput Modeloutput is supplied as an argument to get the model results for prediction. Is this a good way? Or should I follow the way below? hadoop jar myjar.jar edu.Driver train=traininput.txt test=testinput.txt output=outputlocation Which is the standard way? Please advise. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Issue with conf.set and conf.get method
Hi, I am having an issue with the conf.set and conf.get methods. Driver: Configuration conf = new Configuration(); conf.set("delimiter", args[2]); // file delimiter as a user argument Map/Reduce: Configuration conf = context.getConfiguration(); String delim = conf.get("delimiter"); Everything works fine with this: I am able to get the delimiter (, ; .) and process accordingly, except for TAB. If I give: 1. \t as the argument, some operations won't work, e.g. StringTokenizer st = new StringTokenizer(value.toString(), delim) does not work, but split does: String[] parts = value.toString().split(delim); and String classLabel = value.toString().substring(value.toString().lastIndexOf(delim)+1); does not work either. 2. "\t" as the argument also won't work. 3. \\t and "\\t" also won't work. 4. this WORKS FINE as an argument. Has anybody come across this issue? If so, can anyone suggest a workaround? Regards Unmesha -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
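A probable explanation: the shell passes the two characters backslash and t, not a real tab, so conf.get returns the literal string \t. String.split() still works because it treats its argument as a regular expression, and the regex \t matches a tab; StringTokenizer and lastIndexOf take the string literally and therefore fail. One workaround (sketched below; the unescape helper is hypothetical, not part of Hadoop) is to translate the escape sequence yourself after conf.get:

```java
public class DelimiterArg {
    // Translate a user-supplied "\t" (two characters, backslash + t)
    // into a real tab. Hypothetical helper, not part of Hadoop.
    static String unescape(String delim) {
        return "\\t".equals(delim) ? "\t" : delim;
    }

    public static void main(String[] args) {
        // What conf.get("delimiter") returns when the user typed \t on the shell
        String arg = "\\t";
        String delim = unescape(arg);
        // Pattern.quote makes the delimiter safe for split() even if it is
        // a regex metacharacter like "." or "|"
        String[] parts = "a\tb".split(java.util.regex.Pattern.quote(delim));
        System.out.println(parts.length); // prints 2
    }
}
```

With this in place, StringTokenizer, substring/lastIndexOf, and split all see the same single tab character.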
Re: Are mapper classes re-instantiated for each record?
setup() is called once before the map() calls of each task, and cleanup() is called once after them. On Tue, May 6, 2014 at 1:17 PM, Raj K Singh rajkrrsi...@gmail.com wrote: Point 2 is right. The framework first calls setup(), followed by map() for each key/value pair in the InputSplit. Finally cleanup() is called, irrespective of the number of records in the input split. Raj K Singh http://in.linkedin.com/in/rajkrrsingh http://www.rajkrrsingh.blogspot.com Mobile Tel: +91 (0)9899821370 On Tue, May 6, 2014 at 11:21 AM, Sergey Murylev sergeymury...@gmail.com wrote: Hi Jeremy, According to the official documentation http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.html setup and cleanup calls are performed for each InputSplit. In this case your variant 2 is more correct. But actually a single mapper class can be used for processing multiple InputSplits. In your case, if you have 5 files with 1 record each, it can call setup/cleanup 5 times. But if your records are in a single file, I think that setup/cleanup should be called once. -- Thanks, Sergey On 06/05/14 02:49, jeremy p wrote: Let's say I have a TaskTracker that receives 5 records to process for a single job. When the TaskTracker processes the first record, it will instantiate my Mapper class and execute my setup() function. It will then run the map() method on that record. My question is this: what happens when the map() method has finished processing the first record? I'm guessing it will do one of two things: 1) My cleanup() function will execute. After the cleanup() method has finished, this instance of the Mapper object will be destroyed. When it is time to process the next record, a new Mapper object will be instantiated. Then my setup() method will execute, the map() method will execute, the cleanup() method will execute, and then the Mapper instance will be destroyed. When it is time to process the next record, a new Mapper object will be instantiated.
This process will repeat itself until all 5 records have been processed. In other words, my setup() and cleanup() methods will have been executed 5 times each. or 2) When the map() method has finished processing my first record, the Mapper instance will NOT be destroyed. It will be reused for all 5 records. When the map() method has finished processing the last record, my cleanup() method will execute. In other words, my setup() and cleanup() methods will only execute 1 time each. Thanks for the help! -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
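Variant 2 matches the structure of Mapper.run(): per task attempt, setup() runs once, map() runs once per record, and cleanup() runs once at the end. A plain-Java mock of that loop (not Hadoop code; the names are invented) makes the call counts explicit:

```java
import java.util.List;

public class MapperLifecycle {
    int setupCalls = 0, mapCalls = 0, cleanupCalls = 0;

    // Mirrors the shape of Mapper.run(): setup, map per record, cleanup
    void run(List<String> records) {
        setupCalls++;            // setup() once per task
        for (String r : records) {
            mapCalls++;          // map() once per record
        }
        cleanupCalls++;          // cleanup() once per task
    }

    public static void main(String[] args) {
        MapperLifecycle task = new MapperLifecycle();
        task.run(List.of("r1", "r2", "r3", "r4", "r5"));
        // 5 records in one split: setup and cleanup run once, map runs 5 times
        System.out.println(task.setupCalls + " " + task.mapCalls + " " + task.cleanupCalls); // prints 1 5 1
    }
}
```

If the same 5 records were spread across 5 splits, the framework would run 5 such task instances, giving 5 setup/cleanup calls in total, as Sergey notes above.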
Re: writing multiple files on hdfs
Yes, you can do so. Hadoop is a distributed computing framework, and you are able to do multiple writes in parallel. The only thing we cannot do is update a file's content in place, but you can achieve that by deleting the file and then writing the whole file again. On Mon, May 12, 2014 at 8:06 AM, Stanley Shi s...@gopivotal.com wrote: Yes, why not? Regards, *Stanley Shi,* On Sun, May 11, 2014 at 9:57 PM, Karim Awara karim.aw...@kaust.edu.sa wrote: Hi, Can I open multiple files on hdfs and write data to them in parallel and then close them at the end? -- Best Regards, Karim Ahmed Awara -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
[Blog] Map-only jobs in Hadoop for beginners
Hi http://www.unmeshasreeveni.blogspot.in/2014/05/map-only-jobs-in-hadoop.html This is a post on Map-only jobs in Hadoop for beginners. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
[Blog] Chaining Jobs In Hadoop for beginners.
Hi http://www.unmeshasreeveni.blogspot.in/2014/04/chaining-jobs-in-hadoop-mapreduce.html This is the sample code for chaining Jobs In Hadoop for beginners. Please post your comments in blog page. Let me know your thoughts, so that I can improve my blog post. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Which database should be used
On Fri, May 2, 2014 at 1:51 PM, Alex Lee eliy...@hotmail.com wrote: hive HBase is better. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Wordcount file cannot be located
Try this along with your MapReduce source code: Configuration config = new Configuration(); config.set("fs.defaultFS", "hdfs://IP:port/"); FileSystem dfs = FileSystem.get(config); Path path = new Path("/tmp/in"); Let me know your thoughts. -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
[Blog] Code For Deleting Output Folder If Exist In Hadoop MapReduce Jobs
Hi This is sample code for deleting the output folder (if it exists) in Hadoop MapReduce jobs, for beginners; it can be included along with your MapReduce code so that you can reuse the same output folder while debugging. Please post your comments on the blog page. Let me know your thoughts. Thanks unmesha -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: [Blog] Code For Deleting Output Folder If Exist In Hadoop MapReduce Jobs
Please see this link: http://www.unmeshasreeveni.blogspot.in/2014/04/code-for-deleting-existing-output.html On Fri, May 2, 2014 at 8:52 AM, unmesha sreeveni unmeshab...@gmail.comwrote: Hi This is the sample code for Deleting Output Folder(If Exist) In Hadoop MapReduce Jobs for beginners that can be included along with our MapReduce Code to work on with same output folder for debugging. Please post your comments in blog page. Let me know your thoughts Thanks unmesha -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/ -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: hadoop.tmp.dir directory size
Try hadoop fs -rmr /tmp -- *Thanks Regards * *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: What configuration parameters cause a Hadoop 2.x job to run on the cluster
config.set(fs.defaultFS, hdfs://port/); config.set(hadoop.job.ugi, hdfs); On Fri, Apr 25, 2014 at 10:46 PM, Oleg Zhurakousky oleg.zhurakou...@gmail.com wrote: Yes, it will be copied since it goes to each job's namesapce On Fri, Apr 25, 2014 at 1:14 PM, Steve Lewis lordjoe2...@gmail.comwrote: I am using MR and know the job.setJar command - I can add all dependencies to the jar in the lib directory but I was wondering if Hadoop would copy a jar from my local machine to the cluster - also is I ran multiple jobs with the same jar whether the jar would be copied N times (I typically chain 5 map-reduce jobs On Fri, Apr 25, 2014 at 10:08 AM, Oleg Zhurakousky oleg.zhurakou...@gmail.com wrote: Are you talking about MR or plain YARN application? In MR you typically use one of the job.setJar* methods. That aside you may have more then your app JAR (dependencies). So you can copy the dependencies to all hadoop nodes classpath (e.g., shared dir) Oleg On Fri, Apr 25, 2014 at 1:02 PM, Steve Lewis lordjoe2...@gmail.comwrote: so if I create a Hadoop jar file with referenced libraries in the lib directory do I need to move it to hdfs or can it sit on my local machine? if I move it to hdfs where does it live - which is to say how do I specify the path? On Fri, Apr 25, 2014 at 9:52 AM, Oleg Zhurakousky oleg.zhurakou...@gmail.com wrote: Yes, if you are running MR On Fri, Apr 25, 2014 at 12:48 PM, Steve Lewis lordjoe2...@gmail.comwrote: Thank you for your answer 1) I am using YARN 2) So presumably dropping core-site.xml, yarn-site into user.dir works do I need mapred-site.xml as well? On Fri, Apr 25, 2014 at 9:00 AM, Oleg Zhurakousky oleg.zhurakou...@gmail.com wrote: What version of Hadoop you are using? (YARN or no YARN) To answer your question; Yes its possible and simple. All you need to to is to have Hadoop JARs on the classpath with relevant configuration files on the same classpath pointing to the Hadoop cluster. 
Most often people simply copy core-site.xml, yarn-site.xml etc. from the actual cluster to the application classpath, and then you can run it straight from the IDE. Not a Windows user, so not sure about that second part of the question. Cheers Oleg On Fri, Apr 25, 2014 at 11:46 AM, Steve Lewis lordjoe2...@gmail.com wrote: Assume I have a machine on the same network as a Hadoop 2 cluster but separate from it. My understanding is that by setting certain elements of the config file or local xml files to point to the cluster, I can launch a job without having to log into the cluster, move my jar to HDFS, and start the job from the cluster's Hadoop machine. Does this work? What parameters need I set? Where does the jar file live? What issues would I see if the machine is running Windows with Cygwin installed? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com -- *Thanks & Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: How do I get started with hadoop
check these links: http://www.unmeshasreeveni.blogspot.in/2014/04/hadoop-installation-for-beginners.html and http://www.unmeshasreeveni.blogspot.in/2014/04/hadoop-wordcount-example-in-detail.html On Fri, Apr 25, 2014 at 5:29 PM, Shahab Yunus shahab.yu...@gmail.com wrote: Assuming you are talking about basic stuff... Michael Noll has some good Hadoop (pre-YARN) tutorials http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ Then definitely go through the book Hadoop: The Definitive Guide by Tom White. http://shop.oreilly.com/product/0636920021773.do Then, if you download the free distributions from non-Apache vendors (e.g. Cloudera, MapR, Hortonworks, etc.), their docs are helpful as well. Lastly, Apache itself has quite good docs for starter/basic stuff; e.g. this is for Hadoop version 2.3.0: http://hadoop.apache.org/docs/r2.3.0/ You can find similar docs for almost all versions. Regards, Shahab On Fri, Apr 25, 2014 at 2:26 AM, 破千 997626...@qq.com wrote: Hi, I'm new to Hadoop. Can I get some useful links about Hadoop so I can get started with it step by step? Thank you very much! -- *Thanks & Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: How do I get started with hadoop on windows system
In order to get started with Hadoop on Windows you need to install Cygwin (which provides a Linux-like environment), or else you can run Ubuntu in VMware Player. Once you have done this, you can download Hadoop directly from Apache or from other vendors, and follow these steps: http://www.unmeshasreeveni.blogspot.in/2014/04/hadoop-installation-for-beginners.html On Fri, Apr 25, 2014 at 11:47 AM, 破千 997626...@qq.com wrote: Hi everyone, I subscribed to the Hadoop mailing list this morning. How do I get started with Hadoop on my Windows 7 PC? Thanks! -- *Thanks & Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: Using Eclipse for Hadoop code
Are you asking about standalone mode, where Hadoop runs against the local filesystem? -- *Thanks & Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: hadoop.tmp.dir directory size
You could just try this and see:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
fs.deleteOnExit(new Path("/path/to/tmp"));
On Thu, May 1, 2014 at 12:10 AM, S.L simpleliving...@gmail.com wrote: Can I do this while the job is still running? I know I can't delete the directory, but I just want confirmation that the data Hadoop writes into /tmp/hadoop-df/nm-local-dir (df being my user name) can be discarded while the job is being executed. On Wed, Apr 30, 2014 at 6:40 AM, unmesha sreeveni unmeshab...@gmail.com wrote: Try *hadoop fs -rmr /tmp* -- *Thanks & Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/ -- *Thanks & Regards* *Unmesha Sreeveni U.B* *Hadoop, Bigdata Developer* *Center for Cyber Security | Amrita Vishwa Vidyapeetham* http://www.unmeshasreeveni.blogspot.in/
Re: how to customize hadoop configuration for a job?
Hi Libo, You can implement your driver code using ToolRunner, so that you can pass extra configuration through the command line instead of editing your code all the time.
Driver code:
public class WordCount extends Configured implements Tool {
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new WordCount(), args);
    System.exit(exitCode);
  }
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Usage: %s [generic options] <input dir> <output dir>\n", getClass().getSimpleName());
      return -1;
    }
    Job job = new Job(getConf());
    job.setJarByClass(WordCount.class);
    job.setJobName("Word Count");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }
}
Command line:
$ hadoop jar myjar.jar MyDriver -D mapred.reduce.tasks=10 myinputdir myoutputdir
This is the better practice. Happy Hadooping. -- *Thanks & Regards* Unmesha Sreeveni U.B Hadoop, Bigdata Developer Center for Cyber Security | Amrita Vishwa Vidyapeetham http://www.unmeshasreeveni.blogspot.in/
Re: error in hadoop hdfs while building the code.
I think it is a Hadoop problem, not a Java problem: https://issues.apache.org/jira/browse/HADOOP-5396 On Wed, Mar 12, 2014 at 11:37 AM, Avinash Kujur avin...@gmail.com wrote: hi, I am getting an error like "RefreshCallQueueProtocol cannot be resolved". It is a Java problem. Help me out. Regards, Avinash -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Re: GC overhead limit exceeded
/application_1394160253524_0003/container_1394160253524_0003_01_03 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 10.239.44.34 46837 attempt_1394160253524_0003_m_01_0 3 Container killed on request. Exit code is 143 at last, the task failed. Thanks for any help! -- *Thanks Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Binning for numerical dataset
I am able to normalize given data, say
100,1:2:3
101,2:3:4
into
100 1
100 2
100 3
101 2
101 3
101 4
How do I do binning for numerical data, say iris.csv? I worked out the maths behind it. Iris dataset: http://archive.ics.uci.edu/ml/datasets/Iris
1. Find the minimum and maximum values of each attribute in the data set:
     Sepal Length  Sepal Width  Petal Length  Petal Width
Min  4.3           2.0          1.0           0.1
Max  7.9           4.4          6.9           2.5
2. Then divide the data values of each attribute into n buckets. Say n=5. Bucket width = (Max - Min)/n. E.g. Sepal Length: (7.9 - 4.3)/5 = 0.72, so the intervals will be: 4.3-5.02, 5.02-5.74, 5.74-6.46, 6.46-7.18, 7.18-7.9. Continue likewise for all attributes.
How to do the same in MapReduce? -- *Thanks & Regards* Unmesha Sreeveni U.B
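The bucket arithmetic above can be sketched in plain Java before wiring it into a mapper or reducer. The Sepal Length min/max and n=5 come from the worked example; the class and method names are only illustrative:

```java
import java.util.Arrays;

public class BinBoundaries {
    // Compute the n+1 boundary points for equal-width binning of [min, max].
    static double[] boundaries(double min, double max, int n) {
        double width = (max - min) / n;
        double[] b = new double[n + 1];
        for (int i = 0; i <= n; i++) {
            b[i] = min + i * width;
        }
        return b;
    }

    // Return the bucket index (0..n-1) that a value falls into.
    static int bucketOf(double value, double min, double max, int n) {
        double width = (max - min) / n;
        int idx = (int) ((value - min) / width);
        return Math.min(idx, n - 1); // clamp the maximum value into the last bucket
    }

    public static void main(String[] args) {
        // Sepal Length from the Iris example: min 4.3, max 7.9, n = 5 buckets
        // width = (7.9 - 4.3) / 5 = 0.72
        System.out.println(Arrays.toString(boundaries(4.3, 7.9, 5)));
        System.out.println(bucketOf(5.0, 4.3, 7.9, 5));
    }
}
```

In a MapReduce version, a first job (or a mapper-side pass) would compute the per-attribute min/max, and a second pass would apply bucketOf to each value.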
Re: Binning for numerical dataset
To do binning in MapReduce we need to find the min and max in the mapper, have the mapper pass the min and max values to the reducer, and then have the reducer calculate the buckets. Is that the best way? -- *Thanks & Regards* Unmesha Sreeveni U.B
Pre-processing in hadoop
Are we able to do preprocessing such as 1. Binning 2. Discretization in Hadoop? Some of the reviews say it is difficult. Is that right? Please share some links; that would help me a lot. -- *Thanks & Regards* Unmesha Sreeveni U.B
Re: HIVE+MAPREDUCE
The book Programming Hive contains what you want; see Chapter 4. Hope that will help you. On Tue, Jan 21, 2014 at 1:51 PM, Ranjini Rathinam ranjinibe...@gmail.com wrote: Hi, I need to load data into a Hive table using MapReduce, in Java. Please suggest code related to Hive + MapReduce. Thanks in advance, Ranjini R -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Re: Sorting a csv file
Are we able to sort multiple columns dynamically, as the user suggests? I.e. the user requests to sort col1 and col2, then the user requests to sort 3 cols. I am not able to find any of this through googling. On Thu, Jan 16, 2014 at 4:03 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Yes, I did. But how do I make it descending? My current code runs in ascending order:
public class SortingCsv {
  public static class Map extends Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private Text one = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      System.out.println("in mapper");
      /* sort */
      ArrayList<String> ar = new ArrayList<String>();
      String line = value.toString();
      String[] tokens = null;
      ar.add(line);
      System.out.println("list: " + ar);
      for (int i = 0; i < ar.size(); i++) {
        tokens = (ar.get(i)).split(",");
        System.out.println("ele: " + ar.get(i));
        System.out.println("token: " + tokens[1]); // change according to user input
        word.set(tokens[1]);
        one.set(ar.get(i));
        context.write(word, one);
      }
    }
  }
  public static void main(String[] args) throws Exception {
    System.out.println("in main");
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(SortingCsv.class);
    //Path intermediateInfo = new Path("out");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class);
    FileSystem fs = FileSystem.get(conf);
    /* Delete the files if any in the output path */
    if (fs.exists(new Path(args[1])))
      fs.delete(new Path(args[1]), true);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
On Thu, Jan 16, 2014 at 10:26 AM, unmesha sreeveni unmeshab...@gmail.com wrote:
Thanks for your reply, Ramya. OK :) So should I transpose the entire .csv file in order to get the entire col2 data? On Thu, Jan 16, 2014 at 10:11 AM, Ramya S ram...@suntecgroup.com wrote: Try to keep the col2 values as the map output key and the map output value as the total values b,a,v Regards... Ramya.S From: unmesha sreeveni [mailto:unmeshab...@gmail.com] Sent: Thu 1/16/2014 9:29 AM To: User Hadoop Subject: Re: Sorting a csv file Thanks, Ramya. I was trying to do it with NullWritable. Thanks a lot, Ramya. And do you have any idea how to sort a given column? Say the user gives col2 to sort; then from
b,a,v
a,c,p
d,a,z
q,z,a
r,a,b
I want to get
b,a,v
d,a,z
r,a,b
a,c,p
q,z,a
How do I approach that? In my current implementation I am getting the result
a,c,p
b,a,v
d,a,z
q,z,a
r,a,b
using the above code. On Wed, Jan 15, 2014 at 5:09 PM, Ramya S ram...@suntecgroup.com wrote: All you need is to change the map output value class to Text. Set this accordingly in main. Eg:
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
  private Text one = new Text();
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    System.out.println("in mapper");
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
      System.out.println("sort: " + word);
    }
  }
}
Regards... Ramya.S From: unmesha sreeveni [mailto:unmeshab...@gmail.com] Sent: Wed 1/15/2014 4:11 PM To: User Hadoop Subject: Re: Sorting a csv file I did a map-only job for sorting a txt file by editing the wordcount program. I only need the key. How do I set the value to null?
public class SortingCsv {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      System.out.println("in mapper");
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer
Merge files
How to merge two files using Map-Reduce code . I am aware of -getmerge and cat command.\ Thanks in advance. -- *Thanks Regards* Unmesha Sreeveni U.B
Sorting in decending order
Are we able to sort a csv file in descending order? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
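Descending order is just an inverted comparator; in a MapReduce job the usual route is a decreasing key comparator, but the idea can be sketched in plain Java first. The column index and sample rows below are only illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DescendingCsvSort {
    // Sort CSV lines by the given column, descending. The whole line is
    // carried along as the value, so the other columns never get separated.
    static List<String> sortByColumnDesc(List<String> lines, int col) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparing((String line) -> line.split(",")[col]).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("B,2,3", "A,4,6", "C,1,9");
        System.out.println(sortByColumnDesc(rows, 0)); // C row first, A row last
    }
}
```

In a real job, the equivalent of .reversed() is plugging a comparator that inverts the key order into the sort phase (job.setSortComparatorClass in the mapreduce API).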
Combination of MapReduce and Hive
Hi, can we use a combination of Hive and MapReduce? Say I have a csv file. I need to find the mean of a column and replace the null data with the mean (replace null with mean). So can we write a Hive query in the driver (to find the mean) and then a MapReduce block to replace the nulls with the mean? Which is the better way: 1. writing only MapReduce code, or 2. using a combination of Hive and MapReduce? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
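Either route works; the core computation is two passes (mean of the non-null values, then substitution), which is small enough to sketch in plain Java. The column index and the empty-cell-as-null convention below are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class MeanImputation {
    // Pass 1: mean of the column, ignoring null (empty) cells.
    static double columnMean(List<String> rows, int col) {
        double sum = 0;
        int count = 0;
        for (String row : rows) {
            String cell = row.split(",", -1)[col]; // -1 keeps trailing empty cells
            if (!cell.isEmpty()) {
                sum += Double.parseDouble(cell);
                count++;
            }
        }
        return sum / count;
    }

    // Pass 2: replace empty cells in the column with the mean.
    static List<String> impute(List<String> rows, int col, double mean) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String[] cells = row.split(",", -1);
            if (cells[col].isEmpty()) {
                cells[col] = String.valueOf(mean);
            }
            out.add(String.join(",", cells));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("1,10", "2,", "3,20");
        double mean = columnMean(rows, 1); // (10 + 20) / 2 = 15.0
        System.out.println(impute(rows, 1, mean)); // [1,10, 2,15.0, 3,20]
    }
}
```

Pass 1 maps naturally onto either a Hive AVG() query or a small aggregation job, and pass 2 onto a map-only job that reads the mean from the Configuration.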
Sorting a csv file
How do I sort a csv file? I know that shuffle and sort take place between map and reduce. But how do I sort by each column in a csv file? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Re: Sorting a csv file
I did a map-only job for sorting a txt file by editing the wordcount program. I only need the key. How do I set the value to null?
public class SortingCsv {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      System.out.println("in mapper");
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
        System.out.println("sort: " + word);
      }
    }
  }
  public static void main(String[] args) throws Exception {
    System.out.println("in main");
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(SortingCsv.class);
    //Path intermediateInfo = new Path("out");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    FileSystem fs = FileSystem.get(conf);
    /* Delete the files if any in the output path */
    if (fs.exists(new Path(args[1])))
      fs.delete(new Path(args[1]), true);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
On Wed, Jan 15, 2014 at 2:50 PM, unmesha sreeveni unmeshab...@gmail.com wrote: How do I sort a csv file? I know that shuffle and sort take place between map and reduce. But how do I sort by each column in a csv file? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Re: Sorting a csv file
Thanks, Ramya. I was trying to do it with NullWritable. Thanks a lot, Ramya. And do you have any idea how to sort a given column? Say the user gives col2 to sort; then from
b,a,v
a,c,p
d,a,z
q,z,a
r,a,b
I want to get
b,a,v
d,a,z
r,a,b
a,c,p
q,z,a
How do I approach that? In my current implementation I am getting the result
a,c,p
b,a,v
d,a,z
q,z,a
r,a,b
using the above code. On Wed, Jan 15, 2014 at 5:09 PM, Ramya S ram...@suntecgroup.com wrote: All you need is to change the map output value class to Text. Set this accordingly in main. Eg:
public static class Map extends Mapper<LongWritable, Text, Text, Text> {
  private Text one = new Text();
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    System.out.println("in mapper");
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
      System.out.println("sort: " + word);
    }
  }
}
Regards... Ramya.S From: unmesha sreeveni [mailto:unmeshab...@gmail.com] Sent: Wed 1/15/2014 4:11 PM To: User Hadoop Subject: Re: Sorting a csv file I did a map-only job for sorting a txt file by editing the wordcount program. I only need the key. How do I set the value to null?
public class SortingCsv {
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      System.out.println("in mapper");
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
        System.out.println("sort: " + word);
      }
    }
  }
  public static void main(String[] args) throws Exception {
    System.out.println("in main");
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(SortingCsv.class);
    //Path intermediateInfo = new Path("out");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    FileSystem fs = FileSystem.get(conf);
    /* Delete the files if any in the output path */
    if (fs.exists(new Path(args[1])))
      fs.delete(new Path(args[1]), true);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}
On Wed, Jan 15, 2014 at 2:50 PM, unmesha sreeveni unmeshab...@gmail.com wrote: How do I sort a csv file? I know that shuffle and sort take place between map and reduce. But how do I sort by each column in a csv file? -- Thanks & Regards Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- Thanks & Regards Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Adding file to HDFs
How do I add a csv file to HDFS using MapReduce code? Using hadoop fs -copyFromLocal /local/path /hdfs/location I am able to do it, but I would like to write MapReduce code. -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Re: Adding file to HDFs
Thank you, Sudhakar. On Tue, Jan 14, 2014 at 2:51 PM, sudhakara st sudhakara...@gmail.com wrote: Read the file from the local file system and write to a file in HDFS using FSDataOutputStream:
FSDataOutputStream outStream = fs.create(new Path("demo.csv"));
outStream.writeUTF(stream);
outStream.close();
On Tue, Jan 14, 2014 at 2:04 PM, unmesha sreeveni unmeshab...@gmail.com wrote: How do I add a csv file to HDFS using MapReduce code? Using hadoop fs -copyFromLocal /local/path /hdfs/location I am able to do it, but I would like to write MapReduce code. -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- Regards, ...Sudhakara.st -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Re: Adding file to HDFs
I tried to copy a 2.5 GB file to HDFS; it took 3-4 minutes. Are we able to reduce that time? On Tue, Jan 14, 2014 at 3:07 PM, unmesha sreeveni unmeshab...@gmail.com wrote: Thank you, Sudhakar. On Tue, Jan 14, 2014 at 2:51 PM, sudhakara st sudhakara...@gmail.com wrote: Read the file from the local file system and write to a file in HDFS using FSDataOutputStream:
FSDataOutputStream outStream = fs.create(new Path("demo.csv"));
outStream.writeUTF(stream);
outStream.close();
On Tue, Jan 14, 2014 at 2:04 PM, unmesha sreeveni unmeshab...@gmail.com wrote: How do I add a csv file to HDFS using MapReduce code? Using hadoop fs -copyFromLocal /local/path /hdfs/location I am able to do it, but I would like to write MapReduce code. -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- Regards, ...Sudhakara.st -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Seggregation in MapReduce
Can we do segregation in MapReduce? If we have an employee dataset containing emp id, emp name, and emp type, are we able to group the employees based on the different types - say, emp type A in one file and emp type B in another file? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
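Yes; in MapReduce the usual tool for one-file-per-type output is MultipleOutputs, with the type column as the grouping key. The grouping step itself can be sketched in plain Java; the id,name,type column layout follows the question, and the class name is illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EmployeeSegregation {
    // Group CSV records of the form id,name,type by the type column.
    // In a real job, each group would be written to its own named output
    // via MultipleOutputs instead of collected into a map.
    static Map<String, List<String>> byType(List<String> records) {
        return records.stream()
                .collect(Collectors.groupingBy(r -> r.split(",")[2]));
    }

    public static void main(String[] args) {
        List<String> emps = List.of("1,Anu,A", "2,Ben,B", "3,Cid,A");
        // Two groups: type A has two records, type B has one.
        System.out.println(byType(emps));
    }
}
```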
Re: Find max and min of a column in a csvfile
Thanks Jiayu and John Hancock - you both gave me a very nice hint. John, that was a really good link you provided, but I don't know Pig; I am using Java. Is there any Java-related document, like: http://packtlib.packtpub.com/library/hadoop-mapreduce-cookbook/ch06lvl1sec66# On Sat, Jan 11, 2014 at 6:14 PM, John Hancock jhancock1...@gmail.com wrote: Unmesha, You may want to write your own mapper and reducer for the purpose of learning more about map-reduce programming techniques. However, the Pig documentation also discusses aggregate functions such as max(), which may save you some time: http://pig.apache.org/docs/r0.12.0/udf.html -John On Fri, Jan 10, 2014 at 12:23 PM, Jiayu Ji jiayu...@gmail.com wrote: If you are dealing with only one column, then I think the key/value pair could be null and the number elements. If you are dealing with more than one column, then column name and numbers. On Fri, Jan 10, 2014 at 12:36 AM, unmesha sreeveni unmeshab...@gmail.com wrote: I need help: how do I find the maximum and minimum element of a column in a csv file? What will be the mapper output? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- Jiayu (James) Ji, Cell: (312)823-7393 -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
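The reduction Jiayu describes - mapper emits a (column, value) pair, reducer folds the values into a min and a max - can be sketched in plain Java. The column index and class name are illustrative:

```java
import java.util.List;

public class ColumnMinMax {
    // Fold a CSV column into {min, max}: the same associative reduction a
    // combiner/reducer would perform over the values sharing one key.
    static double[] minMax(List<String> rows, int col) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (String row : rows) {
            double v = Double.parseDouble(row.split(",")[col]);
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new double[] { min, max };
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a,4.3", "b,7.9", "c,5.1");
        double[] mm = minMax(rows, 1);
        System.out.println(mm[0] + " " + mm[1]); // 4.3 7.9
    }
}
```

Because min and max are associative, the same fold works as a combiner, which keeps the amount of intermediate data per mapper down to one pair per column.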
Re: what all can be done using MR
What about sorting? Actually it is done by MapReduce itself. But if we give a csv file as input and try to sort one/multiple columns, will the corresponding columns also get reflected? E.g. foo.csv:
B,2,3
A,4,6
When we apply sorting to the first column, will the result be
A,4,6
B,2,3
A will be mapped to its correct values, right? If so, what will be the context.write() of the mapper? On Wed, Jan 8, 2014 at 8:18 PM, Chris Mawata chris.maw...@gmail.com wrote: Yes. Check out, for example, http://packtlib.packtpub.com/library/hadoop-mapreduce-cookbook/ch06lvl1sec66# On 1/8/2014 2:41 AM, unmesha sreeveni wrote: Can we do aggregation within Hadoop MR, like finding the min, max, sum, or avg of a column in a csv file? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
Re: what all can be done using MR
For that, do we have to write a custom class for the value in order to pass all the columns as the value (in the example, 2 values)? Or just do a concatenation and emit the values? On Sat, Jan 11, 2014 at 9:46 PM, Chris Mawata chris.maw...@gmail.com wrote: Results will be sorted by key, so make A the key and put the rest in the value. Chris On Jan 11, 2014 10:11 AM, unmesha sreeveni unmeshab...@gmail.com wrote: What about sorting? Actually it is done by MapReduce itself. But if we give a csv file as input and try to sort one/multiple columns, will the corresponding columns also get reflected? E.g. foo.csv:
B,2,3
A,4,6
When we apply sorting to the first column, will the result be
A,4,6
B,2,3
A will be mapped to its correct values, right? If so, what will be the context.write() of the mapper? On Wed, Jan 8, 2014 at 8:18 PM, Chris Mawata chris.maw...@gmail.com wrote: Yes. Check out, for example, http://packtlib.packtpub.com/library/hadoop-mapreduce-cookbook/ch06lvl1sec66# On 1/8/2014 2:41 AM, unmesha sreeveni wrote: Can we do aggregation within Hadoop MR, like finding the min, max, sum, or avg of a column in a csv file? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/ -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
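A custom Writable is not required for Chris's suggestion: joining the remaining columns back into one string (or simply keeping the whole line) as a Text value is enough, since the framework sorts only the key. A plain-Java sketch of that map step, with the column index an illustrative choice:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class KeyRestSplit {
    // Emit (sortColumn, restOfRow): the value is just the remaining columns
    // joined back together, so no custom value class is needed.
    static Map.Entry<String, String> toPair(String line, int keyCol) {
        String[] cells = line.split(",");
        StringBuilder rest = new StringBuilder();
        for (int i = 0; i < cells.length; i++) {
            if (i == keyCol) continue;
            if (rest.length() > 0) rest.append(",");
            rest.append(cells[i]);
        }
        return new SimpleEntry<>(cells[keyCol], rest.toString());
    }

    public static void main(String[] args) {
        System.out.println(toPair("A,4,6", 0)); // A=4,6
    }
}
```

The reducer (or an identity reduce) then writes key and value back side by side, so the row comes out whole and sorted.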
Expressions in MapReduce
Are we able to do expressions in MapReduce? Say I have a csv file which has 2 columns. The user gives an expression col1 + col2 = col3. Are we able to do this? And when the user again wants col1 - col2 = col4, can we do it in the same MapReduce (dynamic change of expressions)? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
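Since the expression changes per run, one option is passing the operator and column indices as job configuration (e.g. with -D through ToolRunner) and interpreting them in the mapper. The interpretation step, sketched in plain Java with an assumed two-operator grammar:

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnExpression {
    // Append col1 <op> col2 as a new last column. Here op is "+" or "-";
    // in a real job these parameters would arrive via the Configuration.
    static List<String> apply(List<String> rows, int c1, int c2, String op) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            String[] cells = row.split(",");
            double a = Double.parseDouble(cells[c1]);
            double b = Double.parseDouble(cells[c2]);
            double r = op.equals("+") ? a + b : a - b;
            out.add(row + "," + r);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(apply(List.of("2,3"), 0, 1, "+")); // [2,3,5.0]
    }
}
```

Running the same jar twice with different -D settings gives the dynamic col1+col2=col3 / col1-col2=col4 behaviour without recompiling anything.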
FAILED EMFILE: Too many open files
While I am trying to run an MR job I am getting:
FAILED EMFILE: Too many open files
 at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
 at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172)
 at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:310)
 at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:383)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)
Why is that? -- *Thanks & Regards* Unmesha Sreeveni U.B Junior Developer http://www.unmeshasreeveni.blogspot.in/
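EMFILE is the OS errno for exceeding the per-process open-file-descriptor limit, so this usually points at the nofile ulimit of the user running the tasks rather than at the job code itself. A quick check; the 65536 figure below is only a commonly used value, not a requirement:

```shell
# Show the current per-process open-file limit for this shell
ulimit -n

# Raising it permanently for the hadoop user is typically done in
# /etc/security/limits.conf, e.g.:
#   hadoop  soft  nofile  65536
#   hadoop  hard  nofile  65536
```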