partitioning the inputs to the mapper

2008-07-27 Thread Shirley Cohen
How do I partition the inputs to the mapper, such that a mapper  
processes an entire file or files? What is happening now is that each  
mapper receives only portions of a file and I want them to receive an  
entire file. Is there a way to do that within the scope of the  
framework?


Thanks,

Shirley 


Re: partitioning the inputs to the mapper

2008-07-27 Thread lohit


>How do I partition the inputs to the mapper, such that a mapper  
>processes an entire file or files? What is happening now is that each  
>mapper receives only portions of a file and I want them to receive an  
>entire file. Is there a way to do that within the scope of the  
>framework?

http://wiki.apache.org/hadoop/FAQ#10

-Lohit
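
(For reference, a minimal sketch of what that FAQ entry suggests, against the old
org.apache.hadoop.mapred API; the class name here is just an example:)

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Report every input file as non-splittable, so each map task
    // receives exactly one whole file as its split.
    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
      }
    }

The job then selects it with conf.setInputFormat(WholeFileTextInputFormat.class).
Note the mapper still sees the file record by record (line by line here); the
guarantee is only that a single file is never split across map tasks.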



Re: Bean Scripting Framework?

2008-07-27 Thread Andreas Kostyrka
On Saturday 26 July 2008 00:53:48 Joydeep Sen Sarma wrote:
> Just as an aside - there is probably a general perception that streaming
> is really slow (at least I had it).
>
> The last time I did some profiling (in 0.15) - the primary overheads from
> streaming came from the scripting language (python is slow). For

Beg pardon, it's a question of good design. And yes, in some cases, a 
microscopically tiny amount of C code helps (in our case Pyrex code to parse log lines).

E.g. our driver script that parses the log lines is 4 times slower than cat, 
which, IMHO, considering the amount of work it does to parse a custom CLF 
logfile, is a good value.

C/C++ is not that cool, nor that fast, when you need to deal with arbitrarily 
long lines. That might not be the case for CLF files, but you wouldn't 
believe what kind of garbage I've seen ;)

In practice, especially on the long-line processing job, Python easily beats 
Hadoop's Java code into submission. Not that I'm happy about that, because it 
makes some things quite a bit more complicated for me.

> an insanely fast script (bin/cat), I saw significant overheads in the java
> function/data path that drowned out the streaming overheads by a huge margin
> (a lot of those overheads have been fixed in recent versions - thanks to
> the hadoop team).
>
> Writing a c/c++ streaming program is a pretty good way of getting good
> performance (and some performance-sensitive apps in our environment
> ended up doing just that).

Well, in our environment the architectural issues are of much higher relevance: 
e.g. how to access data from the database that is needed to process the input, 
but which is too huge to just copy to each node or to somehow include in the 
input. Refactoring the application code to hide more of the latency of accessing 
that data can easily speed up the application by an order of magnitude.

Andreas




Re: Name node heap space problem

2008-07-27 Thread Gert Pfeifer

There I have:
   export HADOOP_HEAPSIZE=8000
which should be enough (actually, in this case I don't know).

Running fsck on the directory, it turned out that there are 1785959 
files in this dir... I have no clue how I can get the data out of there.
Can I somehow calculate how much heap a namenode would need to do an ls 
on this dir?
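
(A rough back-of-the-envelope sketch, assuming the commonly quoted rule of thumb
of roughly 150-200 bytes of namenode heap per file, directory, or block object:

    1788874 files/dirs + 1465394 blocks = 3254268 objects
    3254268 objects x ~200 bytes        ~= 650 MB of heap for the namespace alone

so an 8 GB heap should cover the namespace itself; presumably the ls times out
because listing this directory has to materialize ~1.8 million entries in a
single response.)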


Gert


Taeho Kang wrote:

Check how much memory is allocated for the JVM running the namenode.

In a file HADOOP_INSTALL/conf/hadoop-env.sh
you should change a line that starts with "export HADOOP_HEAPSIZE=1000"

It's set to 1GB by default.


On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer <[EMAIL PROTECTED]>
wrote:


Update on this one...

I put some more memory in the machine running the name node. Now fsck is
running. Unfortunately ls fails with a time-out.

I identified one directory that causes the trouble. I can run fsck on it
but not ls.

What could be the problem?

Gert

Gert Pfeifer wrote:

Hi,

I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
and one secondary name node.

I have 1788874 files and directories, 1465394 blocks = 3254268 total.
Heap Size max is 3.47 GB.

My problem is that I produce many small files. Therefore I have a cron
job which just runs daily across the new files and copies them into
bigger files and deletes the small files.

Apart from this program, even an fsck kills the cluster.

The problem is that, as soon as I start this program, the heap space of
the name node reaches 100 %.

What could be the problem? There are not many small files right now, and
still it doesn't work. I guess we have had this problem since the upgrade to
0.17.

Here is some additional data about the DFS:
Capacity      :  2 TB
DFS Remaining :  1.19 TB
DFS Used      :  719.35 GB
DFS Used%     :  35.16 %

Thanks for hints,
Gert









JobTracker History data+analysis

2008-07-27 Thread Paco NATHAN
We have a need to access data found in the JobTracker History link.
Specifically in the "Analyse This Job" analysis. Must be run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in "src/webapps/job"...  Is there a
better way to get at this data, *not* from the web UI perspective but
from the code?

Tried to find any applicable patterns in JobTracker, ClusterStatus,
JobClient, etc., but no joy.

Thanks,
Paco


Help: specifying different input/output class for combiner and reducer

2008-07-27 Thread Qin Gao
Hi all,

I am trying to specify different key/value classes for the combiner and reducer
in my task. For example, I want the
mapper to output an integer==>(integer,float) pair, and then the combiner
outputs integer==>some structure. Finally,
the reducer takes in integer==>some structure and outputs null==>integer.
However, I got the following exception:

java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:597)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:579)
    at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
    at org.apache.hadoop.io.IntWritable.write(IntWritable.java:42)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:437)
    at edu.cmu.cs.lti.mapred.tasks.model1align.ModelOneTraining$MyMapper.map(ModelOneTraining.java:254)
    at edu.cmu.cs.lti.mapred.tasks.model1align.ModelOneTraining$MyMapper.map(ModelOneTraining.java:1)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:157)
Caused by: java.io.IOException: wrong value class: edu.cmu.cs.lti.mapred.io.TTableColumnWritable is not class edu.cmu.cs.lti.mapred.io.IntFloatPairWritable
    at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998)
    at org.apache.hadoop.mapred.MapTask$CombineOutputCollector.collect(MapTask.java:1083)
    at edu.cmu.cs.lti.mapred.tasks.model1align.ModelOneTraining$BinCombiner.reduce(ModelOneTraining.java:433)
    at edu.cmu.cs.lti.mapred.tasks.model1align.ModelOneTraining$BinCombiner.reduce(ModelOneTraining.java:1)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:876)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:782)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:272)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:707)

I found nowhere to specify the combiner's input and output classes, and I
wonder whether it is possible to do so.
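
(For what it's worth, the constraint behind that "wrong value class" error: the
combiner consumes map output and may be applied zero or more times before the
reduce, so its input and output key/value classes must both match the map output
classes; only the reducer's final output classes may differ. A minimal old-API
sketch of the job setup, reusing the class names from the trace plus a
hypothetical MyReducer:

    // inside the driver, e.g. ModelOneTraining's main/run method
    // (IntWritable and NullWritable come from org.apache.hadoop.io)
    JobConf conf = new JobConf(ModelOneTraining.class);

    // map output == combiner input == combiner output == reduce input
    conf.setMapOutputKeyClass(IntWritable.class);
    conf.setMapOutputValueClass(IntFloatPairWritable.class);

    // only the final (reducer) output may use different classes
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MyMapper.class);
    conf.setCombinerClass(BinCombiner.class);  // must emit <IntWritable, IntFloatPairWritable>
    conf.setReducerClass(MyReducer.class);     // hypothetical: <IntWritable, IntFloatPairWritable> -> <NullWritable, IntWritable>

If the combiner really needs to emit a different structure, the usual workarounds
are to make the mapper emit that structure directly, or to drop the combiner and
do that aggregation in the reducer.)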

Best,
Qin


Re: JobTracker History data+analysis

2008-07-27 Thread Amareshwari Sriramadasu
Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if 
it makes sense?


Thanks
Amareshwari

Paco NATHAN wrote:

We have a need to access data found in the JobTracker History link.
Specifically in the "Analyse This Job" analysis. Must be run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in "src/webapps/job"...  Is there a
better way to get at this data, *not* from the web UI perspective but
from the code?

Tried to find any applicable patterns in JobTracker, ClusterStatus,
JobClient, etc., but no joy.

Thanks,
Paco
  




Re: JobTracker History data+analysis

2008-07-27 Thread Paco NATHAN
Thank you, Amareshwari -

That helps.  Hadn't noticed HistoryViewer before. It has no JavaDoc.

What is a typical usage? In other words, what would be the
"outputDir" value in the context of ToolRunner, JobClient, etc.?

Paco


On Sun, Jul 27, 2008 at 11:48 PM, Amareshwari Sriramadasu
<[EMAIL PROTECTED]> wrote:
> Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it
> makes sense?
>
> Thanks
> Amareshwari
>
> Paco NATHAN wrote:
>>
>> We have a need to access data found in the JobTracker History link.
>> Specifically in the "Analyse This Job" analysis. Must be run in Java,
>> between jobs, in the same code which calls ToolRunner and JobClient.
>> In essence, we need to collect descriptive statistics about task
>> counts and times for map, shuffle, reduce.
>>
>> After tracing the flow of the JSP in "src/webapps/job"...  Is there a
>> better way to get at this data, *not* from the web UI perspective but
>> from the code?
>>
>> Tried to find any applicable patterns in JobTracker, ClusterStatus,
>> JobClient, etc., but no joy.
>>
>> Thanks,
>> Paco
>>
>
>


Re: JobTracker History data+analysis

2008-07-27 Thread Amareshwari Sriramadasu
HistoryViewer is used by JobClient to view the history files in the 
directory provided on the command line. The command is

   $ bin/hadoop job -history <output-dir>   # by default, history is stored in the output dir

outputDir in the constructor of HistoryViewer is the directory passed on 
the command line.


You can specify a location to store the history files of a particular 
job using "hadoop.job.history.user.location". If nothing is specified, 
the logs are stored in the job's output directory, i.e. "mapred.output.dir". 
The files are stored under "_logs/history/" inside that directory.

Thanks
Amareshwari

Paco NATHAN wrote:

Thank you, Amareshwari -

That helps.  Hadn't noticed HistoryViewer before. It has no JavaDoc.

What is a typical usage? In other words, what would be the
"outputDir" value in the context of ToolRunner, JobClient, etc.?

Paco


On Sun, Jul 27, 2008 at 11:48 PM, Amareshwari Sriramadasu
<[EMAIL PROTECTED]> wrote:
  

Can you have a look at org.apache.hadoop.mapred.HistoryViewer and see if it
makes sense?

Thanks
Amareshwari

Paco NATHAN wrote:


We have a need to access data found in the JobTracker History link.
Specifically in the "Analyse This Job" analysis. Must be run in Java,
between jobs, in the same code which calls ToolRunner and JobClient.
In essence, we need to collect descriptive statistics about task
counts and times for map, shuffle, reduce.

After tracing the flow of the JSP in "src/webapps/job"...  Is there a
better way to get at this data, *not* from the web UI perspective but
from the code?

Tried to find any applicable patterns in JobTracker, ClusterStatus,
JobClient, etc., but no joy.

Thanks,
Paco