Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hello, harsh.

Do you mean I need to read the XML files and then parse them to set the values in my app?


Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 03:32 PM, Harsh J wrote:

It is best if your application gets
the right configuration files on its classpath itself, so that the
right values are read (how else would it know your values!).


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Harsh J
Hello again,

Finals won't help with the logic you need performed in the
front-end/Driver code. If you're using fs.default.name inside a Task
somehow, final will help there. It is best if your application gets
the right configuration files on its classpath itself, so that the
right values are read (how else would it know your values!).

Alternatively, you can use GenericOptionsParser to parse -fs and -jt
arguments when the Driver is launched from the command line.
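
For what it's worth, a rough sketch of such a Driver (class names, paths and
ports below are illustrative, not taken from this thread); ToolRunner runs
GenericOptionsParser over the arguments before run() is called, so -fs and
-jt are honoured:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already carries any -fs/-jt/-D overrides parsed by ToolRunner
    Job job = new Job(getConf(), "my-job");
    job.setJarByClass(MyDriver.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner applies GenericOptionsParser before calling run()
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}

Launched as, say, "hadoop jar your.jar MyDriver -fs hdfs://namenode:8020 -jt
jobtracker:8021 in out" (host/port values are placeholders), the -fs/-jt
values override whatever the defaults would have been.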

On Fri, Feb 25, 2011 at 11:46 AM, Jun Young Kim  wrote:
> Hi, Harsh.
>
> I've already tried using the <final> tag to make it unmodifiable,
> but the result is no different.
>
> *core-site.xml:*
> <configuration>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://localhost</value>
>     <final>true</final>
>   </property>
> </configuration>
>
> The other *-site.xml files are modified in the same way.
>
> thanks.
>
> Junyoung Kim (juneng...@gmail.com)
>
>
> On 02/25/2011 02:50 PM, Harsh J wrote:
>>
>> Hi,
>>
>> On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim
>>  wrote:
>>>
>>> hi,
>>>
>>> I found the reason for my problem.
>>>
>>> When submitting a job via the shell,
>>>
>>> conf.get("fs.default.name") is "hdfs://localhost"
>>>
>>> When submitting a job from a Java application directly,
>>>
>>> conf.get("fs.default.name") is "file://localhost"
>>> so I couldn't read any files from HDFS.
>>>
>>> I think my Java app couldn't read the *-site.xml
>>> configurations properly.
>>
>> Have a look at this Q:
>>
>> http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F
>>
>



-- 
Harsh J
www.harshj.com


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

Hi, Harsh.

I've already tried using the <final> tag to make it unmodifiable,
but the result is no different.

*core-site.xml:*
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
    <final>true</final>
  </property>
</configuration>

The other *-site.xml files are modified in the same way.

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 02:50 PM, Harsh J wrote:

Hi,

On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim  wrote:

hi,

I found the reason for my problem.

When submitting a job via the shell,

conf.get("fs.default.name") is "hdfs://localhost"

When submitting a job from a Java application directly,

conf.get("fs.default.name") is "file://localhost"
so I couldn't read any files from HDFS.

I think my Java app couldn't read the *-site.xml configurations
properly.


Have a look at this Q:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Harsh J
Hi,

On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim  wrote:
> hi,
>
> I found the reason for my problem.
>
> When submitting a job via the shell,
>
> conf.get("fs.default.name") is "hdfs://localhost"
>
> When submitting a job from a Java application directly,
>
> conf.get("fs.default.name") is "file://localhost"
> so I couldn't read any files from HDFS.
>
> I think my Java app couldn't read the *-site.xml configurations
> properly.


Have a look at this Q:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F
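
The short version of that FAQ entry: make sure the cluster's conf directory
is on your application's classpath, or add the files explicitly. A minimal
sketch, assuming the configuration lives under /etc/hadoop/conf (adjust for
your install):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterConfCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed location of the cluster configuration files
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));

    // Should now print hdfs://... rather than the local default file:///
    System.out.println(conf.get("fs.default.name"));
    System.out.println(FileSystem.get(conf).getUri());
  }
}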

-- 
Harsh J
www.harshj.com


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hi,

I found the reason for my problem.

When submitting a job via the shell,

conf.get("fs.default.name") is "hdfs://localhost"

When submitting a job from a Java application directly,

conf.get("fs.default.name") is "file://localhost"
so I couldn't read any files from HDFS.

I think my Java app couldn't read the *-site.xml
configurations properly.


Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:

Hey,

On Thu, Feb 24, 2011 at 2:36 PM, Jun Young Kim  wrote:

How am I going to do this?

In the new API, the 'Job' class too has Job.submit() and
Job.waitForCompletion(boolean) methods. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html



Re: Current available Memory

2011-02-24 Thread Yang Xiaoliang
Thanks a lot!

Yang Xiaoliang

2011/2/25 maha 

> Hi Yang,
>
>  The problem could be solved using the following link:
> http://www.roseindia.net/java/java-get-example/get-memory-usage.shtml
> >  You need to bring in the garbage collector and finalization as well to
> > measure memory more accurately.
>
>  Good Luck,
>   Maha
>
> On Feb 23, 2011, at 10:11 PM, Yang Xiaoliang wrote:
>
> > I had also encountered the same problem a few days ago.
> >
> > Does anyone have another method?
> >
> > 2011/2/24 maha 
> >
> >> Based on the Java function documentation, it gives approximately the
> >> available memory, so I need to tweak it with other functions.
> >> So it's a Java issue not Hadoop.
> >>
> >> Thanks anyways,
> >> Maha
> >>
> >> On Feb 23, 2011, at 6:31 PM, maha wrote:
> >>
> >>> Hello Everyone,
> >>>
> >>> I'm using "Runtime.getRuntime().freeMemory()" to see current memory
> >> available before and after creation of an object, but this doesn't seem
> to
> >> work well with Hadoop?
> >>>
> >>> Why? and is there another alternative?
> >>>
> >>> Thank you,
> >>>
> >>> Maha
> >>>
> >>
> >>
>
>


Re: setJarByClass question

2011-02-24 Thread Stanley Xu
The jar on the command line might only be the jar used to submit the map-reduce
job, rather than the jar that contains the Mapper and Reducer, which will be
shipped to the different nodes.

What "hadoop jar your-jar" really does is set up the classpath and related
environment, and run the main method in your-jar. You might have a
different map-reduce jar on the classpath which contains the real mapper and
reducer used to do the job.
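
For illustration, a small sketch (the class below just stands in for whatever
driver or mapper class lives in the jar you want shipped):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JarByClassDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "jar-by-class-demo");

    // The framework locates the jar on the classpath that contains this class
    // and ships that jar to the task nodes; "hadoop jar" only decides which
    // main() runs on the client.
    job.setJarByClass(JarByClassDemo.class);

    // Prints the jar the tasks will actually receive (null if run from classes)
    System.out.println(job.getJar());
  }
}

The job in the staging-file thread elsewhere in this digest does much the same
thing by hand with conf.set("mapred.jar", localJarFile.toString()) before
building the Job.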

Best wishes,
Stanley Xu



On Fri, Feb 25, 2011 at 7:23 AM, Mark Kerzner  wrote:

> Hi, this call,
>
> job.setJarByClass
>
> tells Hadoop which jar to use. But we also tell Hadoop which jar to use on
> the command line,
>
> hadoop jar your-jar parameters
>
> Why do we need this in both places?
>
> Thank you,
> Mark
>


setJarByClass question

2011-02-24 Thread Mark Kerzner
Hi, this call,

job.setJarByClass

tells Hadoop which jar to use. But we also tell Hadoop which jar to use on
the command line,

hadoop jar your-jar parameters

Why do we need this in both places?

Thank you,
Mark


hadoop file format query

2011-02-24 Thread Mapred Learn
hi,
I have a use case to upload gzipped text files of sizes ranging from 10-30
GB to HDFS.
We have decided on the sequence file format as the format on HDFS.
I have some doubts/questions regarding it:

i) What should be the optimal size for a sequence file, considering the input
text files range from 10-30 GB in size? Can a sequence file be the same size
as the text file?

ii) Is there some tool that could be used to convert a gzipped text file to a
sequence file?

iii) What would be good metadata management for the files? Currently, we
have about 30-40 different types of schema for these text files. We thought
of 2 options:
-  uploading metadata as a text file on HDFS along with the data, so users
can view it using hadoop fs -cat .
-  adding metadata in the sequence file header. In this case, we could not
find how to fetch the metadata from the sequence file, and we need to provide
our downstream users a way to see the metadata of the data they are reading
(see the sketch below).

thanks a lot !
-JJ


Re: File size shown in HDFS using "-lsr"

2011-02-24 Thread maha
It's because of the HDFS_BYTES_READ counter.

So, my question now is: what, other than compression, can make
HDFS_BYTES_READ differ from Map input bytes?

In my case, the input file is 67K but is stored in HDFS as 83K, and this doesn't
happen all the time; sometimes they're the same and other times they're
different (nothing else was changed).

Any explanation is appreciated!

Thank you,
Maha

On Feb 24, 2011, at 11:00 AM, maha wrote:

> Silly question..
> 
> 
> bin/hadoop dfs -lsr /
> 
> -rw-r--r--   1 Hadoop supergroup         83 2011-02-24 10:52
> /tmp/File-Size-4k
> 
> 
> 
> Why does my 4KB file show a size of 83 bytes?
> 
> 
> Thanks,
> Maha
> 



Slides and videos from Feb 2011 Bay Area HUG posted

2011-02-24 Thread Owen O'Malley
The February 2011 Bay Area HUG had a record turnout, with 336 people
signed up. We had two great talks:


* The next generation of Hadoop MapReduce by Arun Murthy
* The next generation of Hadoop Operations at Facebook by Andrew Ryan

The videos and slides are posted on Yahoo's blog:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/

-- Owen

File size shown in HDFS using "-lsr"

2011-02-24 Thread maha
Silly question..


bin/hadoop dfs -lsr /

-rw-r--r--   1 Hadoop supergroup         83 2011-02-24 10:52
/tmp/File-Size-4k



Why does my 4KB file show a size of 83 bytes?


Thanks,
Maha



Re: Current available Memory

2011-02-24 Thread maha
Hi Yang,

 The problem could be solved using the following link: 
http://www.roseindia.net/java/java-get-example/get-memory-usage.shtml
  You need to bring in the garbage collector and finalization as well to
measure memory more accurately.
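
Roughly something like this (plain Java, nothing Hadoop-specific; the numbers
stay approximate because the JVM is free to ignore the gc hint):

public class MemoryProbe {
  private static long usedHeap() {
    Runtime rt = Runtime.getRuntime();
    // Encourage the collector and finalizers to settle before measuring
    for (int i = 0; i < 3; i++) {
      System.gc();
      System.runFinalization();
      try { Thread.sleep(100); } catch (InterruptedException ignored) { }
    }
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) {
    long before = usedHeap();
    byte[] payload = new byte[10 * 1024 * 1024];   // the object under test
    long after = usedHeap();
    System.out.println("approx. bytes used: " + (after - before));
    System.out.println(payload.length);            // keep the reference alive
  }
}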

  Good Luck,
  Maha

On Feb 23, 2011, at 10:11 PM, Yang Xiaoliang wrote:

> I had also encountered the same problem a few days ago.
>
> Does anyone have another method?
> 
> 2011/2/24 maha 
> 
>> Based on the Java function documentation, it gives approximately the
>> available memory, so I need to tweak it with other functions.
>> So it's a Java issue not Hadoop.
>> 
>> Thanks anyways,
>> Maha
>> 
>> On Feb 23, 2011, at 6:31 PM, maha wrote:
>> 
>>> Hello Everyone,
>>> 
>>> I'm using "Runtime.getRuntime().freeMemory()" to see current memory
>> available before and after creation of an object, but this doesn't seem to
>> work well with Hadoop?
>>> 
>>> Why? and is there another alternative?
>>> 
>>> Thank you,
>>> 
>>> Maha
>>> 
>> 
>> 



Re: java.io.FileNotFoundException: File /var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/

2011-02-24 Thread Todd Lipcon
Hi Job,

This seems CDH-specific, so I've moved the thread over to the cdh-users
mailing list (BCC common-user)

Thanks
-Todd

On Thu, Feb 24, 2011 at 2:52 AM, Job  wrote:

> Hi all,
>
> This issue could very well be related to the Cloudera distribution
> (CDH3b4) I use, but maybe someone knows the solution:
>
> I configured a Job, something like this:
>
>Configuration conf = getConf();
>// ... set configuration
>conf.set("mapred.jar", localJarFile.toString())
>// tracker, zookeeper, hbase etc.
>
>
>Job job = new Job(conf);
>// map:
>job.setMapperClass(DataImportMap.class);
>job.setMapOutputKeyClass(LongWritable.class);
>job.setMapOutputValueClass(Put.class);
>// reduce:
>
>TableMapReduceUtil.initTableReducerJob("MyTable",
> DataImportReduce.class, job);
>FileInputFormat.addInputPath(job, new Path(inputData));
>
>// execute:
>job.waitForCompletion(true);
>
> Now the server throws a strange exception; see the stacktrace below.
>
> When I take a look at the HDFS file system - through hdfs fuse - the file
> is there; it really is the jar that contains my mapred classes.
>
> Any clue what goes wrong here?
>
> Thanks,
> Job
>
>
> -
> java.io.FileNotFoundException:
> File
> /var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/.staging/job_201102241026_0002/job.jar
> does not exist.
>at
>
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
>at
>
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
>at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207)
>at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
>at
>
> org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
>at
> org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1303)
>at
>
> org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:273)
>at
>
> org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:381)
>at
>
> org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:371)
>at
>
> org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:198)
>at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1154)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>at
> org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1129)
>at
> org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1055)
>at
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2212)
>at org.apache.hadoop.mapred.TaskTracker
> $TaskLauncher.run(TaskTracker.java:2176)
>
>
> --
> Drs. Job Tiel Groenestege
> GridLine - Intranet en Zoeken
>
> GridLine
> Keizersgracht 520
> 1017 EK Amsterdam
>
> www: http://www.gridline.nl
> mail: j...@gridline.nl
> tel: +31 20 616 2050
> fax: +31 20 616 2051
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Check lzo is working on intermediate data

2011-02-24 Thread Da Zheng
I use the first one, and it seems to work, because I see that the size of the data
output from the mappers is much smaller.
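
For what it's worth, the same check can be made from the driver once the job
has finished; a sketch (the counter group/name strings are assumptions for
0.20-era Hadoop and may differ in other versions):

import org.apache.hadoop.mapreduce.Job;

public class CompressionCheck {
  // Call with the completed Job; compare these numbers between a run
  // with and a run without map-output compression.
  static void printSpillCounters(Job completedJob) throws Exception {
    long mapOutputBytes = completedJob.getCounters()
        .getGroup("org.apache.hadoop.mapred.Task$Counter")
        .findCounter("MAP_OUTPUT_BYTES").getValue();
    long fileBytesWritten = completedJob.getCounters()
        .getGroup("FileSystemCounters")
        .findCounter("FILE_BYTES_WRITTEN").getValue();
    System.out.println("map output bytes:   " + mapOutputBytes);
    System.out.println("file bytes written: " + fileBytesWritten);
  }
}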

Da

On 2/24/11 10:12 AM, Marc Sturlese wrote:
> 
> Hey there,
> I am using hadoop 0.20.2. I've successfully installed LZO compression
> following these steps:
> https://github.com/kevinweil/hadoop-lzo
> 
> I have some MR jobs written with the new API and I want to compress
> intermediate data.
> Not sure if my mapred-site.xml should have the properties:
>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>com.hadoop.compression.lzo.LzoCodec</value>
>   </property>
> 
> or:
>
>   <property>
>     <name>mapreduce.map.output.compress</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapreduce.map.output.compress.codec</name>
>     <value>com.hadoop.compression.lzo.LzoCodec</value>
>   </property>
> 
> How can I check that the compression is being applied?
> 
> Thanks in advance
> 



Re: Check lzo is working on intermediate data

2011-02-24 Thread James Seigel
Run a standard job before the change and look at the summary data.

Run the job again after the changes and look at the summary.

You should see fewer file system bytes written from the map stage.
Sorry, it might be most obvious in the shuffle bytes.

I don't have a terminal in front of me right now.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-24, at 8:22 AM, Marc Sturlese  wrote:

>
> Hey there,
> I am using hadoop 0.20.2. I've successfully installed LZO compression
> following these steps:
> https://github.com/kevinweil/hadoop-lzo
>
> I have some MR jobs written with the new API and I want to compress
> intermediate data.
> Not sure if my mapred-site.xml should have the properties:
>
>   <property>
>     <name>mapred.compress.map.output</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapred.map.output.compression.codec</name>
>     <value>com.hadoop.compression.lzo.LzoCodec</value>
>   </property>
>
> or:
>
>   <property>
>     <name>mapreduce.map.output.compress</name>
>     <value>true</value>
>   </property>
>   <property>
>     <name>mapreduce.map.output.compress.codec</name>
>     <value>com.hadoop.compression.lzo.LzoCodec</value>
>   </property>
>
> How can I check that the compression is being applied?
>
> Thanks in advance
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Check-lzo-is-working-on-intermediate-data-tp2567704p2567704.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Check lzo is working on intermediate data

2011-02-24 Thread Marc Sturlese

Hey there,
I am using hadoop 0.20.2. I've successfully installed LZO compression
following these steps:
https://github.com/kevinweil/hadoop-lzo

I have some MR jobs written with the new API and I want to compress
intermediate data.
Not sure if my mapred-site.xml should have the properties:

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

or:

  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

How can I check that the compression is being applied?

Thanks in advance

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Check-lzo-is-working-on-intermediate-data-tp2567704p2567704.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Trouble in installing Hbase

2011-02-24 Thread James Seigel
You should probably ask on the Cloudera support forums, as Cloudera has
for some reason changed the users that things run under.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-24, at 8:00 AM, JAGANADH G  wrote:

> Hi All
>
> I was trying to install CDH3 HBase on Fedora 14.
> It gives the following error. Is there any solution to resolve this?
> Transaction Test Succeeded
> Running Transaction
> Error in PREIN scriptlet in rpm package hadoop-hbase-0.90.1+8-1.noarch
> /usr/bin/install: invalid user `hbase'
> /usr/bin/install: invalid user `hbase'
> error: %pre(hadoop-hbase-0.90.1+8-1.noarch) scriptlet failed, exit status 1
> error:   install: %pre scriptlet failed (2), skipping
> hadoop-hbase-0.90.1+8-1
>
> Failed:
>  hadoop-hbase.noarch
> 0:0.90.1+8-1
>
>
> Complete!
> [root@linguist hexp]#
>
> --
> **
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
> *ILUGCBE*
> http://ilugcbe.techstud.org


Trouble in installing Hbase

2011-02-24 Thread JAGANADH G
Hi All

I was trying to install CDH3 HBase on Fedora 14.
It gives the following error. Is there any solution to resolve this?
Transaction Test Succeeded
Running Transaction
Error in PREIN scriptlet in rpm package hadoop-hbase-0.90.1+8-1.noarch
/usr/bin/install: invalid user `hbase'
/usr/bin/install: invalid user `hbase'
error: %pre(hadoop-hbase-0.90.1+8-1.noarch) scriptlet failed, exit status 1
error:   install: %pre scriptlet failed (2), skipping
hadoop-hbase-0.90.1+8-1

Failed:
  hadoop-hbase.noarch
0:0.90.1+8-1


Complete!
[root@linguist hexp]#

-- 
**
JAGANADH G
http://jaganadhg.freeflux.net/blog
*ILUGCBE*
http://ilugcbe.techstud.org


Re: About MapTask.java

2011-02-24 Thread Harsh J
Hey,

On Thu, Feb 24, 2011 at 6:26 PM, Dongwon Kim  wrote:
> I've been trying to read "MapTask.java" after reading some references such
> as "Hadoop definitive guide" and
> "http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html";, but
> it's quite tough to directly read the code without detailed comments.

Perhaps you can add some after getting things cleared ;-)

> Q2)
>
> Is it efficient to partition data first and then sort records inside each
> partition?
>
> Does it happen to avoid comparing expensive pair-wise key comparisons?

Typically you would only want sorting done inside a partitioned set,
since all of the different partitions are sent off to different
reducers. Total-order partitioning may be an exception here, perhaps.

> Q3)
>
> Are there any documents containing explanations about how such internal
> classes are implemented?

There's a very good presentation you may want to see, on the
spill/shuffle/sort framework portions your doubts are about:
http://www.slideshare.net/hadoopusergroup/ordered-record-collection

HTH :)

-- 
Harsh J
www.harshj.com


About MapTask.java

2011-02-24 Thread Dongwon Kim
Hi,

 

I want to know how "MapTask.java" is implemented, especially
"MapOutputBuffer" class defined in "MapTask.java".

I've been trying to read "MapTask.java" after reading some references such
as "Hadoop definitive guide" and
"http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html";, but
it's quite tough to directly read the code without detailed comments.

 

As far as I know, when each intermediate (key, value) pair is generated by the
user-defined map function, the pair is written by the "MapOutputBuffer" class
defined in "MapTask.java" when MapOutputBuffer.collect() is invoked.

However, I can't understand what each variable defined in "MapOutputBuffer"
means.

What I've understood is as follows (* please correct any misunderstanding): 

- The byte buffer "kvbuffer" is where each actual (partition, key, value)
triple is written.

- An integer array "kvindices" is called "accounting buffer", every three
elements of which save indices to the corresponding triple in "kvbuffer".

- Another integer array "kvoffsets" contains indices of triples in
"kvindices".

- "kvstart", "kvend", "kvindex" are used to point "kvindex"

- "bufstart", "bufend", "bufvoid", "bufindex", "bufmark" are used to point
"kvbuffer"

 

What I can't understand is the comments beside variable definitions.

== definitions of some variables ==

private volatile int kvstart = 0;  // marks beginning of *spill*

private volatile int kvend = 0;// marks beginning of *collectable*

private int kvindex = 0;   // marks end of *collected*

private final int[] kvoffsets; // indices into kvindices

private final int[] kvindices; // partition, k/v offsets into kvbuffer

private volatile int bufstart = 0; // marks beginning of *spill*

private volatile int bufend = 0;   // marks beginning of *collectable*

private volatile int bufvoid = 0;  // marks the point where we should stop
                                   // reading at the end of the buffer

private int bufindex = 0;  // marks end of *collected*

private int bufmark = 0;   // marks end of *record*

private byte[] kvbuffer;   // main output buffer


==

 

Q1)

What do the terms "spill", "collectable", and "collected" mean?

I guess, because map outputs continue to be written to the buffer while the
spill takes place, there must be at least two pointers: from where to write
map outputs and to where to spill data; but I don't know what those "spill",
"collectable", and "collected" mean exactly.

 

Q2)

Is it efficient to partition data first and then sort records inside each
partition?

Does it happen to avoid comparing expensive pair-wise key comparisons?

 

Q3)

Are there any documents containing explanations about how such internal
classes are implemented? 

 

Thanks,



eastcirclek

 

 



Re: Benchmarking pipelined MapReduce jobs

2011-02-24 Thread David Saile
Thanks for your help! 

I had a look at the gridmix_config.xml file in the gridmix2 directory. However,
I'm having difficulty mapping the descriptions of the simulated jobs in the
README file
1) Three stage map/reduce job
2) Large sort of variable key/value size
3) Reference select
4) API text sort (java, streaming)
5) Jobs with combiner (word count jobs)

to the job names in gridmix_config.xml:
-streamSort
-javaSort
-combiner
-monsterQuery
-webdataScan
-webdataSort

I would really appreciate any help getting the right configuration! Which job
do I have to enable to simulate a pipelined execution as described in "1) Three 
stage map/reduce job"?

Thanks
David 

Am 23.02.2011 um 04:01 schrieb Shrinivas Joshi:

> I am not sure about this but you might want to take a look at the GridMix 
> config file. FWIU, it lets you define the # of jobs for different workloads 
> and categories.
> 
> HTH,
> -Shrinivas
> 
> On Tue, Feb 22, 2011 at 10:46 AM, David Saile  wrote:
> Hello everybody,
> 
> I am trying to benchmark a Hadoop-cluster with regards to throughput of 
> pipelined MapReduce jobs.
> Looking for benchmarks, I found the "Gridmix" benchmark that is supplied with 
> Hadoop. In its README-file it says that part of this benchmark is a "Three 
> stage map/reduce job".
> 
> As this seems to match my needs, I was wondering if it is possible to configure
> "Gridmix", in order to only run this job (without the rest of the "Gridmix" 
> benchmark)?
> Or do I have to build my own benchmark? If this is the case, which classes 
> are used by this "Three stage map/reduce job"?
> 
> Thanks for any help!
> 
> David
> 
>  
> 



java.io.FileNotFoundException: File /var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/

2011-02-24 Thread Job
Hi all,

This issue could very well be related to the Cloudera distribution
(CDH3b4) I use, but maybe someone knows the solution:

I configured a Job, something like this:

Configuration conf = getConf();
// ... set configuration 
conf.set("mapred.jar", localJarFile.toString())
// tracker, zookeeper, hbase etc.


Job job = new Job(conf);
// map:
job.setMapperClass(DataImportMap.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Put.class);
// reduce:

TableMapReduceUtil.initTableReducerJob("MyTable",
DataImportReduce.class, job);
FileInputFormat.addInputPath(job, new Path(inputData));

// execute:
job.waitForCompletion(true);

Now the server throws a strange exception; see the stacktrace below.

When I take a look at the HDFS file system - through hdfs fuse - the file
is there; it really is the jar that contains my mapred classes.

Any clue what goes wrong here?

Thanks,
Job


-
java.io.FileNotFoundException:
File 
/var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/.staging/job_201102241026_0002/job.jar
 does not exist.
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
at
org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1303)
at
org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:273)
at
org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:381)
at
org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:371)
at
org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:198)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1154)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at
org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1129)
at
org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1055)
at
org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2212)
at org.apache.hadoop.mapred.TaskTracker
$TaskLauncher.run(TaskTracker.java:2176)


-- 
Drs. Job Tiel Groenestege
GridLine - Intranet en Zoeken

GridLine
Keizersgracht 520
1017 EK Amsterdam

www: http://www.gridline.nl
mail: j...@gridline.nl
tel: +31 20 616 2050
fax: +31 20 616 2051




Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

Now I am using the Job.waitForCompletion(boolean) method to submit my job,
but my jar cannot open HDFS files.
Also, after submitting my job, I couldn't see the job history on the admin
pages (jobtracker.jsp) even though my job succeeded.


For example:
I set the input path to "hdfs:/user/juneng/1.input",
but I get this error:

Wrong FS: hdfs:/user/juneng/1.input, expected: file:///

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:


In the new API, the 'Job' class too has Job.submit() and
Job.waitForCompletion(boolean) methods. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Harsh J
Hey,

On Thu, Feb 24, 2011 at 2:36 PM, Jun Young Kim  wrote:
> How am I going to do this?

In the new API, the 'Job' class too has Job.submit() and
Job.waitForCompletion(boolean) methods. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html

-- 
Harsh J
www.harshj.com


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hello, harsh.

To use the MultipleOutputs class,
I need to pass a Job instance as the first argument to configure
my Hadoop job.


addNamedOutput(Job job, String namedOutput,
    Class<? extends OutputFormat> outputFormatClass,
    Class<?> keyClass, Class<?> valueClass)

  Adds a named output for the job.

As you know, the Job class is deprecated in 0.21.0.

I want to submit my job to the cluster the way runJob() does.

How am I going to do this?

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 04:12 PM, Harsh J wrote:

Hello,

On Thu, Feb 24, 2011 at 12:25 PM, Jun Young Kim  wrote:

Hi,
I executed my cluster this way:

calling a command in the shell directly.

What are you doing within your testCluster.jar? If you are simply
submitting a job, you can use a Driver method and get rid of all these
hassles. JobClient and Job classes both support submitting jobs from
Java API itself.

Please read the tutorial on submitting application code via code
itself: http://developer.yahoo.com/hadoop/tutorial/module4.html#driver
Notice the last line in the code presented there, which submits a job
itself. Using runJob() also prints your progress/counters etc.

The way you've implemented this looks unnecessary when your Jar itself
can be made runnable with a Driver!
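
A rough sketch of what such a Driver could look like with the new API
(mapper/reducer and the output path are placeholders; only the input path
comes from this thread, and MultipleOutputs is included because it started
the discussion):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TestClusterDriver {
  public static void main(String[] args) throws Exception {
    // Picks up *-site.xml from the classpath, so fs.default.name is the cluster's
    Configuration conf = new Configuration();
    Job job = new Job(conf, "test-cluster-job");
    job.setJarByClass(TestClusterDriver.class);

    // Identity mapper/reducer as placeholders for the real classes
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setOutputKeyClass(LongWritable.class);   // matches what the identity mapper emits
    job.setOutputValueClass(Text.class);

    // Register a named output, as discussed earlier in the thread
    MultipleOutputs.addNamedOutput(job, "side", TextOutputFormat.class,
        Text.class, Text.class);

    FileInputFormat.addInputPath(job, new Path("hdfs:/user/juneng/1.input"));
    FileOutputFormat.setOutputPath(job, new Path("hdfs:/user/juneng/1.output")); // placeholder

    // Submits the job and prints progress/counters, much like runJob() did
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}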