Hi all,

With TextInputFormat, the input key is the byte offset of each line in the 
file, and this works well. But when I switch to LzoTextInputFormat to read 
LZO files, the key no longer looks like a per-line file position. Does 
LzoTextInputFormat support file positions as keys? 
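
To make "file position" concrete, here is a minimal plain-Java sketch (no 
Hadoop involved; the class and method names are made up for illustration) of 
the keys I would expect: the byte offset at which each line starts, assuming 
'\n' line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {

    // Hypothetical helper: returns the byte offset at which each line starts,
    // assuming lines are terminated by '\n'.
    static List<Long> lineStartOffsets(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                offsets.add(start);
                start = i + 1;
            }
        }
        if (start < data.length) {
            offsets.add(start); // last line has no trailing newline
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = "first\nsecond line\nthird\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(lineStartOffsets(data)); // prints [0, 6, 18]
    }
}
```

Every line gets its own, strictly increasing offset. That is the behavior I 
see with TextInputFormat on uncompressed files.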

Here is a job that prints out the input key (the file position) and the first 
64 characters of each line.

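import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Assumption: LzoTextInputFormat is the hadoop-lzo class; adjust the package
// if yours comes from a different library.
import com.hadoop.mapreduce.LzoTextInputFormat;

public class Test {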

    public static class Map extends Mapper<LongWritable, Text, LongWritable, Text> {

        private Text outputValue = new Text();

        /*
         *  Outputs key,value pair.
         *    key = input key (expected to be the line's file offset)
         *    value = first 64 characters of the line
         */
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String s = value.toString();
            if (s.length() > 64) {
                s = s.substring(0, 64);
            }
            this.outputValue.set(s);
            context.write(key, this.outputValue);
        }

    }

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();

        Job j = new Job(c, "Test");

        j.setJarByClass(Test.class);

        FileInputFormat.addInputPath(j, new Path(args[0]));
        FileOutputFormat.setOutputPath(j, new Path(args[1]));

        j.setMapperClass(Map.class);

        j.setInputFormatClass(LzoTextInputFormat.class);
        j.setOutputFormatClass(TextOutputFormat.class);

        j.setMapOutputKeyClass(LongWritable.class);
        j.setMapOutputValueClass(Text.class);

        j.setOutputKeyClass(LongWritable.class);
        j.setOutputValueClass(Text.class);

        if (!j.waitForCompletion(true)) {
            System.exit(1);
        }
    }

}


The output is:

0       [WEB.WWW.WARNING.30000][Mon 2012/01/09 14:00:00:933 PST][com.wm.
101200  =DynamicItem to String MethodDynamicItem{id=15762417, timestamp=
101200  {
101200  2012-01-09 14:16:19:195 - TP-Processor2, 29718094 -> L2 STRAND B
101200  2012-01-09 14:16:19:192 - TP-Processor2, 29718094 -> hostName=ed
101200  2012-01-09 14:16:19:186 - pool-113-thread-2, 11661605 -> hostNam
101200  SESSION FILTER BENCH: pre-process 0 millis <SessionID: 000000086
101200  TOMCAT REQ: /ip/Archangels-Chessmen/17703726 Mon Jan 09 14:16:19
101200  TIMESTAMP: Mon Jan 9 14:16:11 PST 2012
101200  TOMCAT BENCH: /verify.gsp?novisitor=true&noses=true 3 elapsed Mo
101200  
101200  [WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:11:778 PST][com.
101200  
101200  [WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:11:778 PST][com.
101200  TOMCAT REQ: /verify.gsp?novisitor=true&noses=true Mon Jan 09 14:
101200  TOMCAT BENCH: /verify.gsp?novisitor=true&noses=true 3 elapsed Mo
101200  
101200  [WEB.WWW.WARNING.PLATFORM][Mon 2012/01/09 14:16:03:767 PST][com.
...

The key does change (0, then 101200 above), but many consecutive lines share 
the same value, so it does not look like a per-line offset to me. Is there any 
way to get the file position of each line so I can print out that line later?
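
For context, this is roughly what I want to do with the offset afterwards. It 
is only a sketch in plain Java against a local, uncompressed file (the class 
and method names are hypothetical); I do not know what the equivalent would be 
for an LZO file, which is part of my question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadLineAt {

    // Hypothetical helper: seeks to a byte offset in an uncompressed file and
    // returns the line that starts there (null if the offset is at end of file).
    static String readLineAt(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            String raw = raf.readLine(); // readLine maps each byte to one char
            if (raw == null) {
                return null;
            }
            // Recover the original bytes and decode them as UTF-8.
            return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("offsets", ".log");
        Files.write(tmp, "first\nsecond line\nthird\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(readLineAt(tmp, 6)); // prints second line
        Files.delete(tmp);
    }
}
```

This only works if the key recorded by the job is the byte offset of the 
start of the line, which is what I get from TextInputFormat.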

Any help would be appreciated!

Thanks!
