Re: Can jobs be configured to be sequential

2008-10-18 Thread Ravion

Hi Paco,

Thanks - this is exactly what I was looking for.

Regards,
Ravi
- Original Message - 
From: "Paco NATHAN" <[EMAIL PROTECTED]>

To: 
Sent: Saturday, October 18, 2008 9:46 AM
Subject: Re: Can jobs be configured to be sequential



Hi Ravion,

The problem you are describing sounds like a workflow where you must
be careful to verify certain conditions before proceeding to the next
step.

We have similar kinds of use cases for Hadoop apps at work, which are
essentially ETL.  I recommend that you look at http://cascading.org as
an abstraction layer for managing these kinds of workflows. We've
found it quite useful.
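
Even without an extra framework, a bare-bones driver that submits each
group in order and stops at the first failure is possible with the plain
JobClient API. Here is a rough, untested sketch: the per-group JobConf
setup is omitted, the group names are placeholders, and each group is
shown as a single MapReduce job for brevity (in practice you would loop
over the jobs inside a group the same way).

import java.io.IOException;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SequentialGroups {

  // Placeholder: build the JobConf for one group of ETL jobs.
  // Input/output paths, mapper/reducer classes, etc. go here.
  private static JobConf confForGroup(String name) {
    JobConf conf = new JobConf(SequentialGroups.class);
    conf.setJobName(name);
    // ... per-group configuration ...
    return conf;
  }

  public static void main(String[] args) {
    String[] groups = { "G1", "G2", "G3" };   // placeholder group names
    try {
      for (String group : groups) {
        // runJob() blocks until the job completes and throws an
        // IOException if it fails, so the next group is only
        // submitted after the previous one has succeeded.
        JobClient.runJob(confForGroup(group));
      }
    } catch (IOException failed) {
      System.err.println("Stopping: a group failed: " + failed.getMessage());
      System.exit(1);
    }
  }
}

If I remember right, the org.apache.hadoop.mapred.jobcontrol package
(Job / JobControl) can also express these dependencies declaratively,
but Cascading gives you a much higher-level view of the whole workflow.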

Best,
Paco


On Fri, Oct 17, 2008 at 8:29 PM, Ravion <[EMAIL PROTECTED]> 
wrote:

Dear all,

We have about 600 ETL (Extract Transform Load) jobs in our Data Warehouse
System that create an interim data model. Some jobs are dependent on the
completion of others.


Assume that I group interdependent jobs by a group id. Say group G1
contains 100 jobs, G2 contains another 200 jobs that depend on the
completion of group G1, and so on.


Can we leverage Hadoop so that it executes G1 first and, on failure, does
not execute G2, but otherwise continues with G2 and so on?


Or do I need to configure "N" (where N = total number of groups) Hadoop
jobs independently and handle the sequencing ourselves?


Please share your thoughts, thanks

Warmest regards,
Ravion 




Re: supporting WordCount example for multiple level directories

2008-10-18 Thread Latha
Apologies for pasting a wrong command. Please find below the correct
command I used.


 branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount inputdir
outdir
08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process
: 3
java.io.IOException: Not a file:
hdfs://localhost:5310/user/username/inputdir/dir1
...
...


And inputdir has two subdirectories, "dir1" and "dir2", and a file "file1".

My requirement is to run wordcount over all the files in all
subdirectories. Please suggest an idea.
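
One workaround I am thinking of is to collect the leaf files myself in the
driver, before submitting the job, so that FileInputFormat never sees a
directory entry. Below is a rough, untested sketch. It assumes the static
addInputPath helper on FileInputFormat (older branches have the equivalent
on JobConf), and it uses FileSystem.listPaths, the listing call in
branch-0.17 (later releases replace it with listStatus):

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class RecursiveInputs {

  // Walk the tree under 'p' and register every plain file, at any depth,
  // as an input path. Directories themselves are never added, so the
  // "Not a file" check in getSplits() is never triggered.
  public static void addFilesRecursively(JobConf conf, FileSystem fs, Path p)
      throws IOException {
    if (fs.isDirectory(p)) {
      Path[] children = fs.listPaths(p);
      if (children != null) {
        for (Path child : children) {
          addFilesRecursively(conf, fs, child);
        }
      }
    } else {
      FileInputFormat.addInputPath(conf, p);
    }
  }
}

The driver would then call
addFilesRecursively(conf, FileSystem.get(conf), new Path("inputdir"))
instead of setting "inputdir" itself as the input path. Would this be a
reasonable approach?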

Regards,
Srilatha


On Sat, Oct 18, 2008 at 6:38 PM, Latha <[EMAIL PROTECTED]> wrote:

> Hi All
>
> Greetings
> The wordcount at
> http://hadoop.apache.org/core/docs/current/mapred_tutorial.html works fine
> for following directory structure.
>
> inputdir -> file1
>          -> file2
>          -> file3
>
> And it does not work for
> inputdir -> dir1 -> innerfile1
>          -> file1
>          -> file2
>          -> dir2
> For this second scenario we get an error like:
> 
>  branch-0.17]$ bin/hadoop jar wordcount.jar org.myorg.WordCount toplevel
> outlevel
> 08/10/18 05:58:14 INFO mapred.FileInputFormat: Total input paths to process
> : 3
> java.io.IOException: Not a file:
> hdfs://localhost:5310/user/username/inputdir/dir1
> 
>
>
> So, when it encounters an entry that is not a file, it exits after
> throwing an IOException.
> In FileInputFormat.java, I would like to call a recursive procedure in the
> following piece of code, so that all the files at the leaf level of the
> entire directory structure are included in the paths to be searched. If
> anyone has already done this, please help me achieve the same.
>
> --
> public InputSplit[] getSplits(JobConf job, int numSplits)
>     throws IOException {
>   Path[] files = listPaths(job);
>   long totalSize = 0;                        // compute total size
>   for (int i = 0; i < files.length; i++) {   // check we have valid files
>     Path file = files[i];
>     FileSystem fs = file.getFileSystem(job);
>     if (fs.isDirectory(file) || !fs.exists(file)) {
>       throw new IOException("Not a file: " + files[i]);
>     }
>     totalSize += fs.getLength(files[i]);
>   }
> 
> --
>
> Should we reset "mapred.input.dir" to the inner directory and call
> getInputPaths recursively?
> Please help me to get all the file paths, irrespective of their depth
> level.
>
> Thankyou
> Srilatha
>


Re: How to modify hadoop-wordcount example to display File-wise results.

2008-10-18 Thread Latha
Hi,

Inside the map method, I made the following change to Example: WordCount
v1.0 at
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
--
String filename = new String();
...
filename = ((FileSplit) reporter.getInputSplit()).getPath().toString();
while (tokenizer.hasMoreTokens()) {
  word.set(tokenizer.nextToken() + " " + filename);


Worked great!! Thanks to everyone!
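
By the way, I believe the type-mismatch error from my earlier Text-valued
attempt (quoted below) was only because the JobConf still declared
IntWritable as the map output value type. An untested driver sketch along
these lines should clear that up; the driver class name is a placeholder,
Map and Reduce are assumed to be the inner classes of org.myorg.WordCount
from the quoted post, and the static FileInputFormat/FileOutputFormat path
helpers are assumed to exist in this branch:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

import org.myorg.WordCount;   // the class from the quoted post

public class WordFileIndexDriver {

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordFileIndexDriver.class);
    conf.setJobName("word-file-index");

    // Map and Reduce are the inner classes from the quoted WordCount post.
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    // The important part: both map and reduce now emit Text values, so
    // the value classes must be Text instead of the tutorial's IntWritable.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapOutputValueClass(Text.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}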

Regards,
Srilatha


On Sat, Oct 18, 2008 at 6:24 PM, Latha <[EMAIL PROTECTED]> wrote:

> Hi All,
>
> Thank you for your valuable inputs suggesting possible solutions for
> creating an index file with the following format:
> word1 filename count
> word2 filename count
>
> However, the following is not working for me. Please help me resolve the
> same.
>
> --
>  public static class Map extends MapReduceBase implements
> Mapper<LongWritable, Text, Text, Text> {
>    private Text word = new Text();
>    private Text filename = new Text();
>
>    public void map(LongWritable key, Text value,
>        OutputCollector<Text, Text> output, Reporter reporter)
>        throws IOException {
>      filename.set(((FileSplit) reporter.getInputSplit()).getPath().toString());
>      String line = value.toString();
>      StringTokenizer tokenizer = new StringTokenizer(line);
>      while (tokenizer.hasMoreTokens()) {
>        word.set(tokenizer.nextToken());
>        output.collect(word, filename);
>      }
>    }
>  }
>
>  public static class Reduce extends MapReduceBase
>      implements Reducer<Text, Text, Text, Text> {
>    public void reduce(Text key, Iterator<Text> values,
>        OutputCollector<Text, Text> output, Reporter reporter)
>        throws IOException {
>      int sum = 0;
>      Text filename = new Text();   // must be initialized before calling set()
>      while (values.hasNext()) {
>        sum++;
>        filename.set(values.next().toString());
>      }
>      String file = filename.toString() + " " + new IntWritable(sum).toString();
>      filename = new Text(file);
>      output.collect(key, filename);
>    }
>  }
>
> --
> 08/10/18 05:38:25 INFO mapred.JobClient: Task Id :
> task_200810170342_0010_m_00_2, Status : FAILED
> java.io.IOException: Type mismatch in value from map: expected
> org.apache.hadoop.io.IntWritable, recieved org.apache.hadoop.io.Text
> at
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:427)
> at org.myorg.WordCount$Map.map(WordCount.java:23)
> at org.myorg.WordCount$Map.map(WordCount.java:13)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
> at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2122)
>
>
> Thanks
> Srilatha
>
>
>
> On Mon, Oct 6, 2008 at 11:38 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
>> On Sun, Oct 5, 2008 at 12:46 PM, Ted Dunning <[EMAIL PROTECTED]>
>> wrote:
>>
>> > What you need to do is snag access to the filename in the configure
>> method
>> > of the mapper.
>>
>>
>> You can also do it in the map method with:
>>
>> ((FileSplit) reporter.getInputSplit()).getPath()
>>
>>
>> Then instead of outputting just the word as the key, output a pair
>> > containing the word and the file name as the key.  Everything downstream
>> > should remain the same.
>>
>>
>> If you want to have each file handled by a single reduce, I'd suggest:
>>
>> class FileWordPair implements Writable {
>>  private Text fileName;
>>  private Text word;
>>  ...
>>  public int hashCode() {
>> return fileName.hashCode();
>>  }
>> }
>>
>> so that the HashPartitioner will send the records for file Foo to a single
>> reducer. It would make sense to use this as an example for when to use
>> grouping comparators (for getting a single call to reduce for each file)
>> too...
>>
>> -- Owen
>>
>
>
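
A fuller, untested take on Owen's FileWordPair outline might look like the
following. I have made it a WritableComparable, since map output keys have
to be comparable, and the method bodies are only my guess at the elided
pieces, not his code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Composite key holding (file name, word). hashCode() uses only the file
// name so that the default HashPartitioner sends all records of one file
// to a single reducer.
public class FileWordPair implements WritableComparable {
  private Text fileName = new Text();
  private Text word = new Text();

  public FileWordPair() {
  }

  public FileWordPair(String fileName, String word) {
    this.fileName.set(fileName);
    this.word.set(word);
  }

  public Text getFileName() { return fileName; }
  public Text getWord() { return word; }

  public void write(DataOutput out) throws IOException {
    fileName.write(out);
    word.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    fileName.readFields(in);
    word.readFields(in);
  }

  public int compareTo(Object other) {
    FileWordPair o = (FileWordPair) other;
    int cmp = fileName.compareTo(o.fileName);
    return (cmp != 0) ? cmp : word.compareTo(o.word);
  }

  public int hashCode() {
    return fileName.hashCode();
  }

  public boolean equals(Object other) {
    if (!(other instanceof FileWordPair)) {
      return false;
    }
    FileWordPair o = (FileWordPair) other;
    return fileName.equals(o.fileName) && word.equals(o.word);
  }
}

A grouping comparator that compares only fileName would then give a single
reduce call per file, as Owen suggests.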

