Re: Reading fields from a Text line
That is not really a bug. Only if you use @Override will you really be asserting that you've overridden the right method (since the new API uses inheritance instead of interfaces). Without that kind of check, it's easy to make mistakes and add methods that won't be considered by the framework (and hence the default IdentityMapper comes into play). Always use @Override annotations when inheriting and overriding methods.

On Fri, Aug 3, 2012 at 4:41 AM, Bejoy Ks bejoy.had...@gmail.com wrote:
Hi Tariq
On further analysis I noticed an odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the key type in the mapper as IntWritable instead of LongWritable, the framework is supposed to throw a class cast exception. Such an exception is thrown only if the key types at the class level and the method level are the same (IntWritable) in the Mapper. But if we provide the input key type as IntWritable at the class level and LongWritable at the method level (map method), instead of a compile-time error the code compiles fine. In addition, on execution the framework triggers the identity mapper instead of the custom mapper provided with the configuration. This seems like a bug to me. Filed a JIRA to track this issue: https://issues.apache.org/jira/browse/MAPREDUCE-4507
Regards
Bejoy KS

--
Harsh J
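The pitfall Harsh describes can be illustrated outside Hadoop. The sketch below uses hypothetical classes (not the real Hadoop API): a base class provides a default identity-style method, and a subclass accidentally declares a method with a different parameter type. Without @Override the compiler accepts it as an overload, so the base method still runs:

```java
// Minimal sketch with hypothetical classes: the base class provides a
// default method, mirroring how the new-API Mapper behaves.
class BaseMapper {
    String map(long key, String value) {   // the "framework" calls this signature
        return value;                      // identity behaviour by default
    }
}

class CustomMapper extends BaseMapper {
    // Adding @Override above this method would make the compiler reject it,
    // because map(int, String) does NOT override map(long, String) --
    // it silently overloads it instead.
    String map(int key, String value) {
        return value.substring(0, 2);
    }
}

public class OverrideDemo {
    public static void main(String[] args) {
        BaseMapper m = new CustomMapper();
        // Dispatch goes to the long-keyed signature, so the base class's
        // identity method runs, not the custom overload.
        System.out.println(m.map(1L, "TT12345"));  // prints: TT12345
    }
}
```

With @Override on the subclass method, this mistake becomes a compile-time error instead of a silent fallback.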
Re: Reading fields from a Text line
That is a good pointer Harsh. Thanks a lot. But if the IdentityMapper is being used, shouldn't the job.xml reflect that? Job.xml always shows the mapper as our CustomMapper.
Regards
Bejoy KS
Sent from handheld, please excuse typos.

-----Original Message-----
From: Harsh J ha...@cloudera.com
Date: Fri, 3 Aug 2012 13:02:32
To: mapreduce-user@hadoop.apache.org
Reply-To: mapreduce-user@hadoop.apache.org
Cc: Mohammad Tariq donta...@gmail.com
Subject: Re: Reading fields from a Text line

That is not really a bug. Only if you use @Override will you really be asserting that you've overridden the right method (since the new API uses inheritance instead of interfaces). Without that kind of check, it's easy to make mistakes and add methods that won't be considered by the framework (and hence the default IdentityMapper comes into play). Always use @Override annotations when inheriting and overriding methods. ...

--
Harsh J
Re: Reading fields from a Text line
Bejoy,

In the new API, the default map() function, if not properly overridden, is the identity map function. There is no IdentityMapper class in the new API; the Mapper class itself is identity by default.

On Fri, Aug 3, 2012 at 1:07 PM, Bejoy KS bejoy.had...@gmail.com wrote:
That is a good pointer Harsh. Thanks a lot. But if the IdentityMapper is being used, shouldn't the job.xml reflect that? Job.xml always shows the mapper as our CustomMapper.
Regards
Bejoy KS

--
Harsh J
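Harsh's point, that the new-API base class is itself the identity, can be sketched in plain Java with a hypothetical generic class (not the actual Hadoop Mapper): with no subclass and no override, records pass through unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch (hypothetical class, not the real Hadoop Mapper) of why
// the new API needs no IdentityMapper: the base map() is already identity.
class SketchMapper<KIN, VIN, KOUT, VOUT> {
    // Default behaviour: write the input pair through unchanged.
    void map(KIN key, VIN value, List<Object[]> out) {
        out.add(new Object[]{key, value});   // identity by default
    }
}

public class IdentityDefaultDemo {
    public static void main(String[] args) {
        // No subclass, no override: the base class alone behaves the way
        // the old API's IdentityMapper did.
        SketchMapper<Long, String, Long, String> m = new SketchMapper<>();
        List<Object[]> out = new ArrayList<>();
        m.map(1L, "hello", out);
        System.out.println(out.get(0)[1]);   // prints: hello
    }
}
```

This also explains the job.xml observation: the configuration still names the custom mapper class; it is only the unmatched map() signature that causes the inherited identity behaviour to run.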
Re: Reading fields from a Text line
Ok, got it now. That is a good piece of information. Thank you :)
Regards
Bejoy KS
Sent from handheld, please excuse typos.

-----Original Message-----
From: Harsh J ha...@cloudera.com
Date: Fri, 3 Aug 2012 16:28:27
To: mapreduce-user@hadoop.apache.org; bejoy.had...@gmail.com
Cc: Mohammad Tariq donta...@gmail.com
Subject: Re: Reading fields from a Text line

Bejoy,

In the new API, the default map() function, if not properly overridden, is the identity map function. There is no IdentityMapper class in the new API; the Mapper class itself is identity by default. ...

--
Harsh J
Re: Reading fields from a Text line
Thanks for the response Harsh and Sri. Actually, I was trying to prepare a template for my application with which I read one line at a time, extract the first field from it, and emit that extracted value from the mapper. I have these few lines of code for that:

    public static class XPTMapper extends Mapper<IntWritable, Text, LongWritable, Text> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Text word = new Text();
            String line = value.toString();
            if (!line.startsWith("TT")) {
                context.setStatus("INVALID LINE.. SKIPPING");
            } else {
                String stdid = line.substring(0, 7);
                word.set(stdid);
                context.write(key, word);
            }
        }
    }

But the output file contains all the rows of the input file, including the lines I was expecting to get skipped. Also, I was expecting only the fields I am emitting, but the file contains entire lines. Could you guys please point out the mistake I might have made? (Pardon my ignorance, as I am not very good at MapReduce.) Many thanks.
Regards,
Mohammad Tariq

On Thu, Aug 2, 2012 at 10:58 AM, Sriram Ramachandrasekaran sri.ram...@gmail.com wrote:
Wouldn't it be better if you could skip those unwanted lines upfront (preprocess) and have a file which is ready to be processed by the MR system? In any case, more details are needed.

On Thu, Aug 2, 2012 at 8:23 AM, Harsh J ha...@cloudera.com wrote:
Mohammad,
What do you mean by "But it seems I am not doing things in correct way. Need some guidance."? What is your written code exactly expected to do, and what is it not doing? Perhaps, since you ask a code question here, can you share it with us (pastebin or gists, etc.)? For skipping 8 lines, if you are using splits, you need to detect within the mapper or your record reader whether the map task filesplit has an offset of 0, and skip 8 line reads if so (because it is the first split of some file). ...

--
Harsh J

--
It's just about how deep your longing is!
Re: Reading fields from a Text line
Hi Tariq,
Is your file splittable? If it's not, a single mapper will process the entire file in one go!
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#isSplitable%28org.apache.hadoop.mapreduce.JobContext,%20org.apache.hadoop.fs.Path%29
How many mappers are being created? See if that helps.
Regards,
Alok

On Thu, Aug 2, 2012 at 3:48 PM, Mohammad Tariq donta...@gmail.com wrote:
Thanks for the response Harsh and Sri. Actually, I was trying to prepare a template for my application with which I read one line at a time, extract the first field from it, and emit that extracted value from the mapper. ...

--
Alok Kumar
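For the opposite case, deliberately keeping a file in one split so a single mapper sees all its lines in order, the isSplitable() hook Alok links to can be overridden. A sketch, assuming the new-API TextInputFormat and the Hadoop libraries on the classpath (shown only as an illustration, not tied to this job):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch only (requires Hadoop on the classpath): a non-splittable text
// input format, so one mapper reads each whole file start to finish.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // never split: one map task per file
    }
}
```

Configured via job.setInputFormatClass(WholeFileTextInputFormat.class), this trades parallelism for ordered, whole-file reads, which can simplify header skipping.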
Re: Reading fields from a Text line
Hi Tariq
I assume the mapper being used is the IdentityMapper instead of your XPTMapper class. Can you share your main class? If you are using TextInputFormat and reading from a file in HDFS, it should have LongWritable keys as input, and your code has IntWritable as the input key type. Have a check on that as well.
Regards
Bejoy KS
Sent from handheld, please excuse typos.

-----Original Message-----
From: Mohammad Tariq donta...@gmail.com
Date: Thu, 2 Aug 2012 15:48:42
To: mapreduce-user@hadoop.apache.org
Reply-To: mapreduce-user@hadoop.apache.org
Subject: Re: Reading fields from a Text line

Thanks for the response Harsh and Sri. Actually, I was trying to prepare a template for my application with which I read one line at a time, extract the first field from it, and emit that extracted value from the mapper. ...
Re: Reading fields from a Text line
Thank you everyone. Here is the code from the driver:

    Configuration conf = new Configuration();
    conf.addResource(new Path("/home/cluster/hadoop-1.0.3/conf/core-site.xml"));
    conf.addResource(new Path("/home/cluster/hadoop-1.0.3/conf/hdfs-site.xml"));
    Job job = new Job(conf, "XPTReader");
    job.setJarByClass(XPTReader.class);
    job.setMapperClass(XPTMapper.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(TextInputFormat.class);
    Path inPath = new Path("/mapin/TX.xpt");
    FileInputFormat.addInputPath(job, inPath);
    FileOutputFormat.setOutputPath(job, new Path("/mapout/"
            + inPath.toString().split("/")[4]
            + new java.util.Random().nextInt()));
    System.exit(job.waitForCompletion(true) ? 0 : 1);

Bejoy: I have observed one strange thing. When I am using IntWritable, the output file contains the entire content of the input file, but if I am using LongWritable, the output file is empty.
Sri: the code is working outside MR.
Regards,
Mohammad Tariq

On Thu, Aug 2, 2012 at 4:38 PM, Bejoy KS bejoy.had...@gmail.com wrote:
Hi Tariq
I assume the mapper being used is the IdentityMapper instead of your XPTMapper class. Can you share your main class? If you are using TextInputFormat and reading from a file in HDFS, it should have LongWritable keys as input, and your code has IntWritable as the input key type. Have a check on that as well. ...

--
Harsh J

--
It's just about how deep your longing is!
Re: Reading fields from a Text line
Hi Tariq
Again, I strongly suspect the IdentityMapper is in play here. The reasoning why I suspect so: when you get the whole data in the output file, it should be the IdentityMapper. Due to the mismatch between the input key type at the class level and at the method level, the framework is falling back to the IdentityMapper. I have noticed this fallback while using the new mapreduce API.

    public static class XPTMapper extends Mapper<IntWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

When you change the input key type to LongWritable at the class level, it is your custom mapper (XPTMapper) being called. Because of some exceptional cases it is just going into the if branch, where you are not writing anything out of the mapper, hence an empty output file.

    public static class XPTMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

To cross-check this, try enabling some logging in your code to see exactly what is happening. By the way, are you getting the output of this line in your logs when you change the input key type to LongWritable?

    context.setStatus("INVALID LINE.. SKIPPING");

If so, that confirms my assumption. :) Try adding more logs to trace the flow and see what is going wrong. Or you can use MRUnit to unit test your code as the first step. Hope it helps!
Regards
Bejoy KS
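Putting Bejoy's two fixes together, matching class-level and method-level key types and guarding the signature with @Override, the mapper fragment would look like this. This is a sketch assuming Hadoop's new API (org.apache.hadoop.mapreduce), not a tested replacement for the original class:

```java
// Sketch (Hadoop new API assumed): class-level and method-level input key
// types now agree (both LongWritable), and @Override makes the compiler
// verify the signature instead of silently accepting an overload.
public static class XPTMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (!line.startsWith("TT")) {
            context.setStatus("INVALID LINE.. SKIPPING");
            return;                          // skip invalid lines explicitly
        }
        Text word = new Text(line.substring(0, 7));
        context.write(key, word);
    }
}
```

With this shape, an empty output points at the if branch (no line starting with "TT"), which is exactly what the suggested logging would confirm.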
Re: Reading fields from a Text line
Hi Tariq
On further analysis I noticed an odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the key type in the mapper as IntWritable instead of LongWritable, the framework is supposed to throw a class cast exception. Such an exception is thrown only if the key types at the class level and the method level are the same (IntWritable) in the Mapper. But if we provide the input key type as IntWritable at the class level and LongWritable at the method level (map method), instead of a compile-time error the code compiles fine. In addition, on execution the framework triggers the identity mapper instead of the custom mapper provided with the configuration. This seems like a bug to me. Filed a JIRA to track this issue: https://issues.apache.org/jira/browse/MAPREDUCE-4507
Regards
Bejoy KS
Reading fields from a Text line
Hello list,
I have a flat file in which data is stored as lines of 107 bytes each. I need to skip the first 8 lines (as they don't contain any valuable info). Thereafter, I have to read each line and extract the information from it, but not the line as a whole. Each line is composed of several fields without any delimiter between them. For example, the first field is 8 bytes, the second 2 bytes, and so on. I was trying to read each line as a Text value, convert it into a String, and use the String.substring() method to extract the value of each field. But it seems I am not doing things the correct way. Need some guidance. Many thanks.
Regards,
Mohammad Tariq
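The delimiter-free layout described above is a fixed-width record, and substring() is indeed the natural tool. A plain-Java sketch (no Hadoop needed), using the example widths from the message (8-byte first field, 2-byte second); real widths would come from the file's actual record layout:

```java
// Plain-Java sketch of fixed-width field extraction. The widths array
// (8 bytes, then 2 bytes) follows the example in the message and is a
// placeholder for the file's real record layout.
public class FixedWidthDemo {
    static final int[] WIDTHS = {8, 2};

    static String[] parse(String line) {
        String[] fields = new String[WIDTHS.length];
        int pos = 0;
        for (int i = 0; i < WIDTHS.length; i++) {
            fields[i] = line.substring(pos, pos + WIDTHS[i]);  // cut one field
            pos += WIDTHS[i];                                  // advance offset
        }
        return fields;
    }

    public static void main(String[] args) {
        // A made-up record: 8-byte field, 2-byte field, then the rest.
        String record = "TT123456XY-rest-of-the-107-byte-record";
        String[] f = parse(record);
        System.out.println(f[0] + "|" + f[1]);   // prints: TT123456|XY
    }
}
```

Inside a mapper this same parse would run on value.toString(), with only the extracted fields (not the whole line) written to the context.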
Re: Reading fields from a Text line
Mohammad,
What do you mean by "But it seems I am not doing things in correct way. Need some guidance."? What is your written code exactly expected to do, and what is it not doing? Perhaps, since you ask a code question here, can you share it with us (pastebin or gists, etc.)? For skipping 8 lines, if you are using splits, you need to detect within the mapper or your record reader whether the map task filesplit has an offset of 0, and skip 8 line reads if so (because it is the first split of some file).

On Thu, Aug 2, 2012 at 1:54 AM, Mohammad Tariq donta...@gmail.com wrote:
Hello list,
I have a flat file in which data is stored as lines of 107 bytes each. I need to skip the first 8 lines (as they don't contain any valuable info). ...

--
Harsh J
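Harsh's header-skipping rule, discard the first 8 lines only in the reader whose split starts at byte offset 0, can be sketched outside Hadoop with a hypothetical helper (plain Java, a StringReader standing in for a file split):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the split-offset rule: only the reader handling the
// split at byte offset 0 of the file discards the header lines.
public class SkipHeaderDemo {
    static final int HEADER_LINES = 8;

    static List<String> readLines(BufferedReader in, long splitOffset)
            throws IOException {
        List<String> out = new ArrayList<>();
        if (splitOffset == 0) {                      // first split of the file
            for (int i = 0; i < HEADER_LINES; i++) {
                in.readLine();                       // discard one header line
            }
        }
        for (String line; (line = in.readLine()) != null; ) {
            out.add(line);                           // keep every data line
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 10; i++) sb.append("line").append(i).append('\n');
        List<String> data =
            readLines(new BufferedReader(new StringReader(sb.toString())), 0L);
        System.out.println(data.size() + " " + data.get(0));  // prints: 2 line9
    }
}
```

In a real job the same check would live in the mapper's setup() or a custom RecordReader, with the offset taken from the FileSplit.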
Re: Reading fields from a Text line
Wouldn't it be better if you could skip those unwanted lines upfront (preprocess) and have a file which is ready to be processed by the MR system? In any case, more details are needed.

On Thu, Aug 2, 2012 at 8:23 AM, Harsh J ha...@cloudera.com wrote:
Mohammad,
What do you mean by "But it seems I am not doing things in correct way. Need some guidance."? What is your written code exactly expected to do, and what is it not doing? ...

--
Harsh J

--
It's just about how deep your longing is!