Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hello, harsh.

To use the MultipleOutputs class,
I need a Job instance to pass as the first argument when configuring
named outputs for my hadoop job:


addNamedOutput(Job job, String namedOutput,
    Class<? extends OutputFormat> outputFormatClass,
    Class<?> keyClass, Class<?> valueClass)
(http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html)

  Adds a named output for the job.

As you know, the Job class is deprecated in 0.21.0.

I want to submit my job to the cluster the way runJob() does.

How am I going to do this?

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 04:12 PM, Harsh J wrote:

Hello,

On Thu, Feb 24, 2011 at 12:25 PM, Jun Young Kimjuneng...@gmail.com  wrote:

Hi,
I executed my cluster by this way.

call a command in shell directly.

What are you doing within your testCluster.jar? If you are simply
submitting a job, you can use a Driver method and get rid of all these
hassles. The JobClient and Job classes both support submitting jobs from
the Java API itself.

Please read the tutorial on submitting application code via code
itself: http://developer.yahoo.com/hadoop/tutorial/module4.html#driver
Notice the last line in the code presented there, which submits a job
itself. Using runJob() also prints your progress/counters etc.

The way you've implemented this looks unnecessary when your Jar itself
can be made runnable with a Driver!
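
As a rough sketch (the class name and the identity mapper/reducer are just
placeholders), an old-API driver can be as small as this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ExampleDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ExampleDriver.class); // also picks the jar containing this class
    conf.setJobName("example-pass-through");
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);       // swap in your real mapper
    conf.setReducerClass(IdentityReducer.class);     // swap in your real reducer
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);  // submits the job and prints progress/counters as it runs
  }
}

Packaged into your jar, 'hadoop jar testCluster.jar ExampleDriver <input>
<output>' then submits the job without any extra shell plumbing.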



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Harsh J
Hey,

On Thu, Feb 24, 2011 at 2:36 PM, Jun Young Kim juneng...@gmail.com wrote:
 How am I going to do this?

In the new API, the 'Job' class also has Job.submit() and
Job.waitForCompletion(boolean) methods. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html
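
For instance, a bare-bones new-API driver might look like the sketch below
(names are placeholders; the base Mapper class acts as an identity mapper, and
MultipleOutputs is the org.apache.hadoop.mapreduce.lib.output version whose
javadoc you quoted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NewApiDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "example");
    job.setJarByClass(NewApiDriver.class);
    job.setMapperClass(Mapper.class);                // identity; replace with your own
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Named output is configured on the Job object; the mapper/reducer then
    // writes to it through a MultipleOutputs instance created in setup().
    MultipleOutputs.addNamedOutput(job, "side", TextOutputFormat.class,
        LongWritable.class, Text.class);
    // Submits to the cluster and waits, printing progress - the runJob() equivalent.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}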

-- 
Harsh J
www.harshj.com


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

Now I am using the Job.waitForCompletion(boolean) method to submit my job,

but my jar cannot open HDFS files.
Also, after submitting my job, I can't see the job history on the admin
page (jobtracker.jsp) even though the job succeeded.


For example:
I set the input path to hdfs:/user/juneng/1.input.

But look at this error:

Wrong FS: hdfs:/user/juneng/1.input, expected: file:///

Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:


In the new API, the 'Job' class also has Job.submit() and
Job.waitForCompletion(boolean) methods. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html


java.io.FileNotFoundException: File /var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/

2011-02-24 Thread Job
Hi all,

This issue could very well be related to the Cloudera distribution
(CDH3b4) I use, but maybe someone knows the solution:

I configured a Job, something like this:

Configuration conf = getConf();
// ... set configuration 
conf.set("mapred.jar", localJarFile.toString());
// tracker, zookeeper, hbase etc.


Job job = new Job(conf);
// map:
job.setMapperClass(DataImportMap.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Put.class);
// reduce:

TableMapReduceUtil.initTableReducerJob("MyTable",
DataImportReduce.class, job);
FileInputFormat.addInputPath(job, new Path(inputData));

// execute:
job.waitForCompletion(true);

Now the server throws a strange exception; see the stack trace below.

When I take a look at the HDFS file system - through HDFS FUSE - the file
is there, and it really is the jar that contains my mapred classes.

Any clue what goes wrong here?

Thanks,
Job


-
java.io.FileNotFoundException: File
/var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/.staging/job_201102241026_0002/job.jar
does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
        at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1303)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:273)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:381)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:371)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:198)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1154)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1129)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1055)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2212)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2176)


-- 
Drs. Job Tiel Groenestege
GridLine - Intranet en Zoeken

GridLine
Keizersgracht 520
1017 EK Amsterdam

www: http://www.gridline.nl
mail: j...@gridline.nl
tel: +31 20 616 2050
fax: +31 20 616 2051




Re: Benchmarking pipelined MapReduce jobs

2011-02-24 Thread David Saile
Thanks for your help! 

I had a look at the gridmix_config.xml file in the gridmix2 directory. However,
I'm having difficulty mapping the descriptions of the simulated jobs from the
README file
1) Three stage map/reduce job
2) Large sort of variable key/value size
3) Reference select
4) API text sort (java, streaming)
5) Jobs with combiner (word count jobs)

to the jobs names in gridmix_config.xml: 
-streamSort
-javaSort
-combiner
-monsterQuery
-webdataScan
-webdataSort

I would really appreciate any help getting the right configuration! Which job
do I have to enable to simulate a pipelined execution as described in 1) Three 
stage map/reduce job?

Thanks
David 

Am 23.02.2011 um 04:01 schrieb Shrinivas Joshi:

 I am not sure about this but you might want to take a look at the GridMix 
 config file. FWIU, it lets you define the # of jobs for different workloads 
 and categories.
 
 HTH,
 -Shrinivas
 
 On Tue, Feb 22, 2011 at 10:46 AM, David Saile da...@uni-koblenz.de wrote:
 Hello everybody,
 
 I am trying to benchmark a Hadoop-cluster with regards to throughput of 
 pipelined MapReduce jobs.
 Looking for benchmarks, I found the Gridmix benchmark that is supplied with 
 Hadoop. In its README-file it says that part of this benchmark is a Three 
 stage map/reduce job.
 
 As this seems to match my needs, I was wondering if it is possible to configure
 Gridmix to run only this job (without the rest of the Gridmix
 benchmark)?
 Or do I have to build my own benchmark? If this is the case, which classes 
 are used by this Three stage map/reduce job?
 
 Thanks for any help!
 
 David
 
  
 



About MapTask.java

2011-02-24 Thread Dongwon Kim
Hi,

 

I want to know how MapTask.java is implemented, especially
MapOutputBuffer class defined in MapTask.java.

I've been trying to read MapTask.java after reading some references such
as "Hadoop: The Definitive Guide" and
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html, but
it's quite tough to directly read the code without detailed comments.

 

As far as I know, when each intermediate (key, value) pair is generated by the
user-defined map function, the pair is written by the MapOutputBuffer class
defined in MapTask.java when MapOutputBuffer.collect() is invoked.

However, I can't understand what each variable defined in MapOutputBuffer
means.

What I've understood is as follows (* please correct any misunderstanding): 

- The byte buffer kvbuffer is where each actual (partition, key, value)
triple is written.

- An integer array kvindices is called the accounting buffer; every three
elements of it store the indices of the corresponding triple in kvbuffer.

- Another integer array kvoffsets contains indices of triples in
kvindices.

- kvstart, kvend, and kvindex are used as pointers into kvoffsets.

- bufstart, bufend, bufvoid, bufindex, and bufmark are used as pointers into
kvbuffer.

 

What I can't understand is the comments beside variable definitions.

= definitions of some variables =

private volatile int kvstart = 0;  // marks beginning of *spill*
private volatile int kvend = 0;    // marks beginning of *collectable*
private int kvindex = 0;           // marks end of *collected*
private final int[] kvoffsets;     // indices into kvindices
private final int[] kvindices;     // partition, k/v offsets into kvbuffer
private volatile int bufstart = 0; // marks beginning of *spill*
private volatile int bufend = 0;   // marks beginning of *collectable*
private volatile int bufvoid = 0;  // marks the point where we should stop
                                   // reading at the end of the buffer
private int bufindex = 0;          // marks end of *collected*
private int bufmark = 0;           // marks end of *record*
private byte[] kvbuffer;           // main output buffer

==============================================================

 

Q1)

What do the terms spill, collectable, and collected mean?

I guess, because map outputs continue to be written to the buffer while the
spill takes place, there must be at least two pointers: one marking where to write
map outputs and one marking where to spill data from; but I don't know what spill,
collectable, and collected mean exactly.

 

Q2)

Is it efficient to partition data first and then sort records inside each
partition?

Does it happen in order to avoid expensive pair-wise key comparisons?

 

Q3)

Are there any documents containing explanations about how such internal
classes are implemented? 

 

Thanks,



eastcirclek

 

 



Re: About MapTask.java

2011-02-24 Thread Harsh J
Hey,

On Thu, Feb 24, 2011 at 6:26 PM, Dongwon Kim eastcirc...@postech.ac.kr wrote:
 I've been trying to read MapTask.java after reading some references such
 as "Hadoop: The Definitive Guide" and
 http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html, but
 it's quite tough to directly read the code without detailed comments.

Perhaps you can add some after getting things cleared ;-)

 Q2)

 Is it efficient to partition data first and then sort records inside each
 partition?

 Does it happen in order to avoid expensive pair-wise key comparisons?

Typically you would only want sorting done inside a partitioned set,
since all of the different partitions are sent off to different
reducers. Total-order partitioning may be an exception here, perhaps.
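
As a toy illustration (not the actual MapOutputBuffer code), sorting the
index records by partition id first means keys bound for different reducers
are never compared against each other at all:

import java.util.Arrays;
import java.util.Comparator;

public class PartitionThenKeySort {
  static class IndexRecord {
    final int partition;
    final String key;            // the real buffer compares raw serialized key bytes
    IndexRecord(int partition, String key) { this.partition = partition; this.key = key; }
  }

  public static void main(String[] args) {
    IndexRecord[] records = {
        new IndexRecord(1, "banana"),
        new IndexRecord(0, "zebra"),
        new IndexRecord(0, "apple"),
    };
    Arrays.sort(records, new Comparator<IndexRecord>() {
      public int compare(IndexRecord a, IndexRecord b) {
        if (a.partition != b.partition) {
          return a.partition - b.partition;   // cheap int comparison settles most orderings
        }
        return a.key.compareTo(b.key);        // key comparison only within one partition
      }
    });
    for (IndexRecord r : records) {
      System.out.println(r.partition + "\t" + r.key);
    }
  }
}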

 Q3)

 Are there any documents containing explanations about how such internal
 classes are implemented?

There's a very good presentation you may want to see, on the
spill/shuffle/sort framework portions your doubts are about:
http://www.slideshare.net/hadoopusergroup/ordered-record-collection

HTH :)

-- 
Harsh J
www.harshj.com


Trouble in installing Hbase

2011-02-24 Thread JAGANADH G
Hi All

I was trying to install CDH3 HBase on Fedora 14.
It gives the following error. Is there any solution to resolve this?
Transaction Test Succeeded
Running Transaction
Error in PREIN scriptlet in rpm package hadoop-hbase-0.90.1+8-1.noarch
/usr/bin/install: invalid user `hbase'
/usr/bin/install: invalid user `hbase'
error: %pre(hadoop-hbase-0.90.1+8-1.noarch) scriptlet failed, exit status 1
error:   install: %pre scriptlet failed (2), skipping
hadoop-hbase-0.90.1+8-1

Failed:
  hadoop-hbase.noarch
0:0.90.1+8-1


Complete!
[root@linguist hexp]#

-- 
**
JAGANADH G
http://jaganadhg.freeflux.net/blog
*ILUGCBE*
http://ilugcbe.techstud.org


Re: Trouble in installing Hbase

2011-02-24 Thread James Seigel
You should probably ask on the Cloudera support forums, as Cloudera has
for some reason changed the users that things run under.

James

Sent from my mobile. Please excuse the typos.

On 2011-02-24, at 8:00 AM, JAGANADH G jagana...@gmail.com wrote:

 Hi All

 I was trying to install CDH3 HBase on Fedora 14.
 It gives the following error. Is there any solution to resolve this?
 Transaction Test Succeeded
 Running Transaction
 Error in PREIN scriptlet in rpm package hadoop-hbase-0.90.1+8-1.noarch
 /usr/bin/install: invalid user `hbase'
 /usr/bin/install: invalid user `hbase'
 error: %pre(hadoop-hbase-0.90.1+8-1.noarch) scriptlet failed, exit status 1
 error:   install: %pre scriptlet failed (2), skipping
 hadoop-hbase-0.90.1+8-1

 Failed:
  hadoop-hbase.noarch
 0:0.90.1+8-1


 Complete!
 [root@linguist hexp]#

 --
 **
 JAGANADH G
 http://jaganadhg.freeflux.net/blog
 *ILUGCBE*
 http://ilugcbe.techstud.org


Check lzo is working on intermediate data

2011-02-24 Thread Marc Sturlese

Hey there,
I am using hadoop 0.20.2. I've successfully installed LZO compression
following these steps:
https://github.com/kevinweil/hadoop-lzo

I have some MR jobs written with the new API and I want to compress
intermediate data.
Not sure if my mapred-site.xml should have the properties:

  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
  </property>
  <property>
    <name>mapred.map.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

or:

  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>

How can I check that the compression is being applied?

Thanks in advance



Re: Check lzo is working on intermediate data

2011-02-24 Thread James Seigel
Run a standard job before the change and look at the summary data.

Run the job again after the change and look at the summary.

You should see fewer file system bytes written from the map stage.
Sorry - it might be most obvious in the shuffle bytes.

I don't have a terminal in front of me right now.
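
From memory, something along these lines in the driver would print the
relevant counters once the job finishes - the group/counter name strings are
how the 0.20 web UI labels them, so please double-check them on your cluster:

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CompressionCheck {
  // Call after job.waitForCompletion(true); compares the raw map output size
  // with what actually hit the local disks during spill/shuffle.
  static void printCompressionCounters(Job job) throws Exception {
    Counters c = job.getCounters();
    long mapOutputBytes = c.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_BYTES").getValue();
    long fileBytesWritten = c.findCounter(
        "FileSystemCounters", "FILE_BYTES_WRITTEN").getValue();
    // With map-output compression on, FILE_BYTES_WRITTEN (and the reduce-side
    // shuffle bytes) should be noticeably smaller than the raw map output bytes.
    System.out.println("Map output bytes   = " + mapOutputBytes);
    System.out.println("File bytes written = " + fileBytesWritten);
  }
}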

James

Sent from my mobile. Please excuse the typos.

On 2011-02-24, at 8:22 AM, Marc Sturlese marc.sturl...@gmail.com wrote:


 Hey there,
 I am using hadoop 0.20.2. I've successfully installed LZO compression
 following these steps:
 https://github.com/kevinweil/hadoop-lzo

 I have some MR jobs written with the new API and I want to compress
 intermediate data.
 Not sure if my mapred-site.xml should have the properties:

   <property>
     <name>mapred.compress.map.output</name>
     <value>true</value>
   </property>
   <property>
     <name>mapred.map.output.compression.codec</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
   </property>

 or:

   <property>
     <name>mapreduce.map.output.compress</name>
     <value>true</value>
   </property>
   <property>
     <name>mapreduce.map.output.compress.codec</name>
     <value>com.hadoop.compression.lzo.LzoCodec</value>
   </property>

 How can I check that the compression is being applied?

 Thanks in advance



Re: Check lzo is working on intermediate data

2011-02-24 Thread Da Zheng
I use the first one, and it seems to work because I see the size of data output
from mappers is much smaller.

Da

On 2/24/11 10:12 AM, Marc Sturlese wrote:
 
 Hey there,
 I am using hadoop 0.20.2. I've successfully installed LZO compression
 following these steps:
 https://github.com/kevinweil/hadoop-lzo
 
 I have some MR jobs written with the new API and I want to compress
 intermediate data.
 Not sure if my mapred-site.xml should have the properties:
 
    <property>
      <name>mapred.compress.map.output</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.map.output.compression.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
  
  or:
  
    <property>
      <name>mapreduce.map.output.compress</name>
      <value>true</value>
    </property>
    <property>
      <name>mapreduce.map.output.compress.codec</name>
      <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
 
  How can I check that the compression is being applied?
 
 Thanks in advance
 



Re: java.io.FileNotFoundException: File /var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/

2011-02-24 Thread Todd Lipcon
Hi Job,

This seems CDH-specific, so I've moved the thread over to the cdh-users
mailing list (BCC common-user)

Thanks
-Todd

On Thu, Feb 24, 2011 at 2:52 AM, Job j...@gridline.nl wrote:

 Hi all,

 This issue could very well be related to the Cloudera distribution
 (CDH3b4) I use, but maybe someone knows the solution:

 I configured a Job, something like this:

Configuration conf = getConf();
// ... set configuration
conf.set("mapred.jar", localJarFile.toString());
// tracker, zookeeper, hbase etc.


Job job = new Job(conf);
// map:
job.setMapperClass(DataImportMap.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Put.class);
// reduce:

TableMapReduceUtil.initTableReducerJob("MyTable",
 DataImportReduce.class, job);
FileInputFormat.addInputPath(job, new Path(inputData));

// execute:
job.waitForCompletion(true);

 Now the server throws a strange exception; see the stack trace below.

 When I take a look at the HDFS file system - through HDFS FUSE - the file
 is there, and it really is the jar that contains my mapred classes.

 Any clue what goes wrong here?

 Thanks,
 Job


 -
 java.io.FileNotFoundException: File
 /var/lib/hadoop-0.20/cache/mapred/mapred/staging/job/.staging/job_201102241026_0002/job.jar
 does not exist.
        at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:383)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:207)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:157)
        at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1303)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobJarFile(JobLocalizer.java:273)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:381)
        at org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(JobLocalizer.java:371)
        at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:198)
        at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1154)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
        at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1129)
        at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1055)
        at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2212)
        at org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2176)


 --
 Drs. Job Tiel Groenestege
 GridLine - Intranet en Zoeken

 GridLine
 Keizersgracht 520
 1017 EK Amsterdam

 www: http://www.gridline.nl
 mail: j...@gridline.nl
 tel: +31 20 616 2050
 fax: +31 20 616 2051





-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Current available Memory

2011-02-24 Thread maha
Hi Yang,

 The problem can be worked around using the approach in the following link:
http://www.roseindia.net/java/java-get-example/get-memory-usage.shtml
  You need to bring in the garbage collector (and finalization) to measure
memory more accurately.
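
Roughly, something like this (still only approximate, but more stable after
forcing a couple of GC passes first):

public class MemorySnapshot {
  static long usedMemory() {
    Runtime rt = Runtime.getRuntime();
    // Encourage the JVM to collect garbage and run finalizers before sampling,
    // so freeMemory() reflects live objects a bit more closely.
    for (int i = 0; i < 2; i++) {
      System.gc();
      System.runFinalization();
    }
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) {
    long before = usedMemory();
    byte[] payload = new byte[4 * 1024 * 1024];      // the object we want to size up
    long after = usedMemory();
    System.out.println("Approximate footprint: " + (after - before) + " bytes");
    System.out.println("(still holding " + payload.length + " bytes live)");
  }
}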

  Good Luck,
  Maha

On Feb 23, 2011, at 10:11 PM, Yang Xiaoliang wrote:

 I had also encountered the same problem a few days ago.
 
 any one has another method?
 
 2011/2/24 maha m...@umail.ucsb.edu
 
 Based on the Java function documentation, it gives approximately the
 available memory, so I need to tweak it with other functions.
 So it's a Java issue not Hadoop.
 
 Thanks anyways,
 Maha
 
 On Feb 23, 2011, at 6:31 PM, maha wrote:
 
 Hello Everyone,
 
 I'm using   Runtime.getRuntime().freeMemory() to see current memory
 available before and after creation of an object, but this doesn't seem to
 work well with Hadoop?
 
 Why? and is there another alternative?
 
 Thank you,
 
 Maha
 
 
 



File size shown in HDFS using -lsr

2011-02-24 Thread maha
Silly question..


bin/hadoop dfs -lsr /

-rw-r--r--   1 Hadoop supergroup         83 2011-02-24 10:52 /tmp/File-Size-4k



Why does my 4KB file show a size of 83 bytes?


Thanks,
Maha



Slides and videos from Feb 2011 Bay Area HUG posted

2011-02-24 Thread Owen O'Malley
The February 2011 Bay Area HUG had a record turnout, with 336 people
signed up. We had two great talks:


* The next generation of Hadoop MapReduce by Arun Murthy
* The next generation of Hadoop Operations at Facebook by Andrew Ryan

The videos and slides are posted on Yahoo's blog:
http://developer.yahoo.com/blogs/hadoop/posts/2011/02/hug-feb-2011-recap/

-- Owen

Re: File size shown in HDFS using -lsr

2011-02-24 Thread maha
It's because of HDFS_BYTES_READ.

So, my question now is: what, other than compression, can make
HDFS_BYTES_READ differ from Map input bytes?

In my case, the input file is 67K but is stored in HDFS as 83K, and this doesn't
happen all the time; sometimes they're the same and other times they're
different (nothing else was changed).

Please any explanation is appreciated !

Thank you,
Maha

On Feb 24, 2011, at 11:00 AM, maha wrote:

 Silly question..
 
 
 bin/hadoop dfs -lsr /
 
 -rw-r--r--   1 Hadoop supergroup         83 2011-02-24 10:52 /tmp/File-Size-4k
 
 
 
 Why does my 4KB file show a size of 83 bytes?
 
 
 Thanks,
 Maha
 



hadoop file format query

2011-02-24 Thread Mapred Learn
hi,
I have a use case to upload gzipped text files with sizes ranging from 10-30
GB to HDFS.
We have decided on the sequence file format as the format on HDFS.
I have some doubts/questions regarding it:

i) What should be the optimal size for a sequence file, considering the input
text files range from 10-30 GB in size? Can a sequence file be the same
size as the text file?

ii) Is there some tool that could be used to convert a gzipped text file to
a sequence file?

iii) What would be a good metadata management approach for the files? Currently, we
have about 30-40 different schemas for these text files. We thought
of 2 options:
-  uploading the metadata as a text file on HDFS along with the data, so users
can view it using hadoop fs -cat.
-  adding the metadata to the sequence file header. In this case, we could not
find out how to fetch the metadata back from a sequence file, and we need to give
our downstream users a way to see the metadata of the data they are reading
(see the sketch below).
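
Something like the following is what we had in mind for readers to dump the
header metadata; it assumes the metadata was attached at write time via the
SequenceFile.createWriter overload that takes a SequenceFile.Metadata argument.
Is this the right approach?

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpSeqFileMetadata {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);                      // sequence file on HDFS
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    try {
      // The header metadata is a simple Text -> Text map stored in the file.
      SequenceFile.Metadata meta = reader.getMetadata();
      for (Map.Entry<Text, Text> e : meta.getMetadata().entrySet()) {
        System.out.println(e.getKey() + " = " + e.getValue());
      }
    } finally {
      reader.close();
    }
  }
}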

thanks a lot !
-JJ


setJarByClass question

2011-02-24 Thread Mark Kerzner
Hi, this call,

job.setJarByClass

tells Hadoop which jar to use. But we also tell Hadoop which jar to use on
the command line,

hadoop jar your-jar parameters

Why do we need this in both places?

Thank you,
Mark


Re: setJarByClass question

2011-02-24 Thread Stanley Xu
The jar on the command line might only be the jar used to submit the map-reduce
job, rather than the jar that contains the Mapper and Reducer, which will be
transferred to the different nodes.

What hadoop jar your-jar really does is set the classpath and other
related environment, and run the main method in your-jar. You might have a
different map-reduce jar on the classpath which contains the real mapper and
reducer used to do the job.

Best wishes,
Stanley Xu



On Fri, Feb 25, 2011 at 7:23 AM, Mark Kerzner markkerz...@gmail.com wrote:

 Hi, this call,

 job.setJarByClass

 tells Hadoop which jar to use. But we also tell Hadoop which jar to use on
 the command line,

 hadoop jar your-jar parameters

 Why do we need this in both places?

 Thank you,
 Mark



Re: Current available Memory

2011-02-24 Thread Yang Xiaoliang
Thanks a lot!

Yang Xiaoliang

2011/2/25 maha m...@umail.ucsb.edu

 Hi Yang,

  The problem can be worked around using the approach in the following link:
  http://www.roseindia.net/java/java-get-example/get-memory-usage.shtml
  You need to bring in the garbage collector (and finalization) to measure
  memory more accurately.

  Good Luck,
   Maha

 On Feb 23, 2011, at 10:11 PM, Yang Xiaoliang wrote:

   I had also encountered the same problem a few days ago.
 
  any one has another method?
 
  2011/2/24 maha m...@umail.ucsb.edu
 
  Based on the Java function documentation, it gives approximately the
  available memory, so I need to tweak it with other functions.
  So it's a Java issue not Hadoop.
 
  Thanks anyways,
  Maha
 
  On Feb 23, 2011, at 6:31 PM, maha wrote:
 
  Hello Everyone,
 
  I'm using   Runtime.getRuntime().freeMemory() to see current memory
  available before and after creation of an object, but this doesn't seem
 to
  work well with Hadoop?
 
  Why? and is there another alternative?
 
  Thank you,
 
  Maha
 
 
 




Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hi,

I got the reason of my problem.

in case of submitting a job via the shell,

conf.get("fs.default.name") is hdfs://localhost

in case of submitting a job from a java application directly,

conf.get("fs.default.name") is file://localhost,
so I couldn't read any files from hdfs.

I think the execution of my java app couldn't read the *-site.xml
configurations properly.


Junyoung Kim (juneng...@gmail.com)


On 02/24/2011 06:41 PM, Harsh J wrote:

Hey,

On Thu, Feb 24, 2011 at 2:36 PM, Jun Young Kimjuneng...@gmail.com  wrote:

 How am I going to do this?

In the new API, the 'Job' class also has Job.submit() and
Job.waitForCompletion(boolean) methods. Please see the API here:
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Harsh J
Hi,

On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kim juneng...@gmail.com wrote:
 hi,

 I got the reason of my problem.

 in case of submitting a job by shell,

 conf.get("fs.default.name") is hdfs://localhost

 in case of submitting a job by a java application directly,

 conf.get("fs.default.name") is file://localhost
 so I couldn't read any files from hdfs.

 I think the execution of my java app couldn't read *-site.xml configurations
 properly.


Have a look at this Q:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F

-- 
Harsh J
www.harshj.com


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

Hi, Harsh.

I've already tried using the final tag to make it unmodifiable,
but the result is no different.

*core-site.xml:*
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost</value>
    <final>true</final>
  </property>
</configuration>

other *-site.xml files are also modified by this rule.

thanks.

Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 02:50 PM, Harsh J wrote:

Hi,

On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kimjuneng...@gmail.com  wrote:

hi,

I got the reason of my problem.

in case of submitting a job by shell,

conf.get("fs.default.name") is hdfs://localhost

in case of submitting a job by a java application directly,

conf.get("fs.default.name") is file://localhost
so I couldn't read any files from hdfs.

I think the execution of my java app couldn't read *-site.xml configurations
properly.


Have a look at this Q:
http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F



Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Harsh J
Hello again,

Finals won't help with the logic you need performed in the
front-end/Driver code. If you're using fs.default.name inside a Task
somehow, final will help there. It is best if your application gets
the right configuration files on its classpath itself, so that the
right values are read (how else would it know your values!).

Alternatively, you can use GenericOptionsParser to parse -fs and -jt
arguments when the Driver is launched from the command line.
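
For example, a minimal Tool-based driver (a sketch; the class name and
arguments are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already has any -fs/-jt/-conf/-D options folded in by
    // ToolRunner's GenericOptionsParser, plus whatever *-site.xml files
    // were found on the classpath.
    Job job = new Job(getConf(), "example");
    job.setJarByClass(MyTool.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new MyTool(), args));
  }
}

Launching it with something like 'hadoop jar yourapp.jar MyTool -fs
hdfs://localhost <in> <out>' then overrides fs.default.name for that run.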

On Fri, Feb 25, 2011 at 11:46 AM, Jun Young Kim juneng...@gmail.com wrote:
 Hi, Harsh.

 I've already tried using the final tag to make it unmodifiable,
 but the result is no different.

 *core-site.xml:*
 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://localhost</value>
     <final>true</final>
   </property>
 </configuration>

 other *-site.xml files are also modified by this rule.

 thanks.

 Junyoung Kim (juneng...@gmail.com)


 On 02/25/2011 02:50 PM, Harsh J wrote:

 Hi,

 On Fri, Feb 25, 2011 at 10:17 AM, Jun Young Kimjuneng...@gmail.com
  wrote:

 hi,

 I got the reason of my problem.

 in case of submitting a job by shell,

 conf.get("fs.default.name") is hdfs://localhost

 in case of submitting a job by a java application directly,

 conf.get("fs.default.name") is file://localhost
 so I couldn't read any files from hdfs.

 I think the execution of my java app couldn't read *-site.xml
 configurations
 properly.

 Have a look at this Q:

 http://wiki.apache.org/hadoop/FAQ#How_do_I_get_my_MapReduce_Java_Program_to_read_the_Cluster.27s_set_configuration_and_not_just_defaults.3F





-- 
Harsh J
www.harshj.com


Re: is there more smarter way to execute a hadoop cluster?

2011-02-24 Thread Jun Young Kim

hello, harsh.

do you mean I need to read the xml files and then parse them to set the values in my app?


Junyoung Kim (juneng...@gmail.com)


On 02/25/2011 03:32 PM, Harsh J wrote:

It is best if your application gets
the right configuration files on its classpath itself, so that the
right values are read (how else would it know your values!).