How to enumerate files in the directories?

2010-08-25 Thread Denim Live
Hello, how can one determine the names of the files in a particular hadoop 
directory, programmatically?



  

Re: Ganglia 3.1 on Hadoop 0.20.2 ...

2010-08-25 Thread Gautam
Brian,

Works for me now. One should point the servers param to the multicast
address that gmond writes to and listens on, and not at the Ganglia server.
It started working once I did this.
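
For reference, a sketch of the hadoop-metrics.properties entries that work in
that setup, assuming the stock gmond multicast channel 239.2.11.71:8649
(substitute whatever address/port your gmond udp_send_channel actually uses):

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=239.2.11.71:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=239.2.11.71:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=239.2.11.71:8649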

thanks for your inputs,
-G.

On Tue, Aug 24, 2010 at 7:12 PM, Brian Bockelman bbock...@cse.unl.edu wrote:


 On Aug 24, 2010, at 8:27 AM, Gautam wrote:

  I was trying to get Ganglia 3.1 to work with the stable hadoop-0.20.2
  version from Apache. I patched this release from HADOOP-4675 using
  HADOOP-4675-v7.patch, as suggested by the CDH3 release notes [1]. I am
  unable to see any hadoop metrics on the Ganglia monitoring UI. The other
  metrics that gmond spews (system CPU/memory etc.) seem to work.
 
  When I switch to FileContext the metrics are written properly to the log
  file. Once I moved to GangliaContext31 it doesn't show anything. I tried
  pointing the servers param to localhost:8649 while listening on that port
  using netcat on that machine... nothing comes up on netcat. Has anyone
  faced this issue?

 This is possibly misleading - netcat won't work if Hadoop is using UDP.

 My advice is to do:

 telnet $Ganglia_Server 9988

 and see if it spits out a bunch of XML.  In the typical Ganglia
 configuration, it is set up to listen on UDP and write on TCP of the same
 port.

 A third thing to test is to switch the hadoop-metrics back to the file
 output, and make sure something gets written to the log file.  The issue
 might be upstream.
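
 For example, a minimal sketch of the file-output fallback in
 hadoop-metrics.properties, assuming the standard FileContext class and a
 writable /tmp path:

 dfs.class=org.apache.hadoop.metrics.file.FileContext
 dfs.period=10
 dfs.fileName=/tmp/dfsmetrics.log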

 Brian

 
  This is what most of my hadoop-metrics looks like:
 
  dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
  dfs.period=10
  dfs.fileName=/tmp/dfsmetrics.log
  dfs.servers=$Ganglia_Server:9988
 
  # Configuration of the mapred context for null
  mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
  mapred.period=10
  mapred.fileName=/tmp/mrmetrics.log
  mapred.servers=$Ganglia_Server:9988
 
  # Configuration of the jvm context for null
  jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
  jvm.period=10
  jvm.fileName=/tmp/jvmmetrics.log
  jvm.servers=$GANGLIA_SERVER:9988
 
  -G.
 
  [1] - http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228.CHANGES.txt




-- 
If you really want something in this life, you have to work for it. Now,
quiet! They're about to announce the lottery numbers...


Re: Ganglia 3.1 on Hadoop 0.20.2 ...

2010-08-25 Thread Brian Bockelman
Hi Gautam,

Yup - that's one possible way to configure Ganglia and is common at many sites. 
 That's why I usually recommend the telnet trick to determine what IP address 
your configuration is using.

Brian

On Aug 25, 2010, at 5:53 AM, Gautam wrote:

 Brian,
 
 Works for me now. One should point the servers param to the multicast
 address that gmond writes to and listens on, and not at the Ganglia server.
 It started working once I did this.
 
 thanks for your inputs,
 -G.
 
 On Tue, Aug 24, 2010 at 7:12 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
 
 
 On Aug 24, 2010, at 8:27 AM, Gautam wrote:
 
 I was trying to get Ganglia 3.1 to work with the stable hadoop-0.20.2
 version from Apache. I patched this release from HADOOP-4675 using
 HADOOP-4675-v7.patch, as suggested by the CDH3 release notes [1]. I am
 unable to see any hadoop metrics on the Ganglia monitoring UI. The other
 metrics that gmond spews (system CPU/memory etc.) seem to work.
 
 When I switch to FileContext the metrics are written properly to the log
 file. Once I moved to GangliaContext31 it doesn't show anything. I tried
 pointing the servers param to localhost:8649 while listening on that port
 using netcat on that machine... nothing comes up on netcat. Has anyone
 faced this issue?
 
 This is possibly misleading - netcat won't work if Hadoop is using UDP.
 
 My advice is to do:
 
 telnet $Ganglia_Server 9988
 
 and see if it spits out a bunch of XML.  In the typical Ganglia
 configuration, it is set up to listen on UDP and write on TCP of the same
 port.
 
 A third thing to test is to switch the hadoop-metrics back to the file
 output, and make sure something gets written to the log file.  The issue
 might be upstream.
 
 Brian
 
 
 This is what most of my hadoop-metrics looks like:
 
 dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 dfs.period=10
 dfs.fileName=/tmp/dfsmetrics.log
 dfs.servers=$Ganglia_Server:9988
 
 # Configuration of the mapred context for null
 mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 mapred.period=10
 mapred.fileName=/tmp/mrmetrics.log
 mapred.servers=$Ganglia_Server:9988
 
 # Configuration of the jvm context for null
 jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
 jvm.period=10
 jvm.fileName=/tmp/jvmmetrics.log
 jvm.servers=$GANGLIA_SERVER:9988
 
 -G.
 
 [1] - http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228.CHANGES.txt
 
 
 
 
 -- 
 If you really want something in this life, you have to work for it. Now,
 quiet! They're about to announce the lottery numbers...





Re: How to enumerate files in the directories?

2010-08-25 Thread Steve Lewis
@Override
public HDFSFile[] getFiles(String directory) {
    // Shells out to "hadoop fs -ls" and builds one HDFSFile per output line.
    String result = executeCommand("hadoop fs -ls " + directory);
    String[] items = result.split("\n");
    List<HDFSFile> holder = new ArrayList<HDFSFile>();
    for (int i = 1; i < items.length; i++) {   // skip the "Found N items" header line
        String item = items[i];
        if (item.length() > MIN__FILE_LENGTH) {
            try {
                holder.add(new HDFSFile(item));
            }
            catch (Exception e) {
            }
        }
    }
    HDFSFile[] ret = new HDFSFile[holder.size()];
    holder.toArray(ret);
    return ret;
}

On Wed, Aug 25, 2010 at 12:36 AM, Denim Live denim.l...@yahoo.com wrote:

 Hello, how can one determine the names of the files in a particular hadoop
 directory, programmatically?








-- 
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA


Searching more Hadoop-Common content

2010-08-25 Thread Alex Baranau
Hello guys,

Over at http://search-hadoop.com we index Hadoop-Common subprojects mailing
lists, wiki, web site,
source code, javadoc, jira...

Would the community be interested in a patch that replaces the
Google-powered
search with that from search-hadoop.com, set to search only Hadoop-Common
project by
default?

We are looking into adding this search service for all of Hadoop's sub-projects.

Assuming people are for this, any suggestions for how the search should
function by default or any specific instructions for how the search box
should
be modified would be great!

Thank you,
Alex Baranau.

P.S. The HBase community has already accepted our proposal (please refer to
https://issues.apache.org/jira/browse/HBASE-2886) and the new version (0.90)
will include the new search box. A patch is also available for TIKA (we are in
the process of discussing some details now):
https://issues.apache.org/jira/browse/TIKA-488. Hadoop-Common's site looks
much like Avro's, for which we also created a patch recently (
https://issues.apache.org/jira/browse/AVRO-626).


Custom partitioner for hadoop

2010-08-25 Thread Mithila Nagendra
I came across the tutorial on creating a custom partitioner on Hadoop (
http://philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/).
I am trying to create my own partitioner on Hadoop, and the above blog has
given me a good starting point.

I had a question on the partitioner. In the code given in the blog they
have:

if( nbOccurences < 3 )
   return 0;
else
   return 1;

I want to do something similar, but I need the key to be in a range, like the
following:

if( nbOccurences > lbrange0 && nbOccurences < ubrange0 )
   return 0;
if( nbOccurences > lbrange1 && nbOccurences < ubrange1 )
   return 1;

The range boundaries lbrange0, lbrange1, ubrange0, ubrange1 are calculated
by reading a histogram that is stored on HDFS. I initially thought I could
read the histogram from the custom partitioner class and calculate the range
boundaries, but in that case the ranges get recalculated for every (K,V)
pair emitted by the mapper. In order to avoid this I was thinking of
passing the range boundaries to the partitioner. How would I do that? Is
there an alternative? Any suggestion would prove useful.

Thank you,

Mithila
Ph.D. Candidate, C.S., Arizona State University


Re: Hadoop startup problem - directory name required

2010-08-25 Thread Hemanth Yamijala
Hmm. Without the / in the property tag, isn't the file malformed XML ?
I am pretty sure Hadoop complains in such cases  ?

On Wed, Aug 25, 2010 at 4:44 AM, cliff palmer palmercl...@gmail.com wrote:
 Thanks Allen - that has resolved the problem.  Good catch!
 Cliff

 On Tue, Aug 24, 2010 at 3:05 PM, Allen Wittenauer
  awittena...@linkedin.com wrote:


 On Aug 23, 2010, at 6:49 AM, cliff palmer wrote:

   Thanks Harsh, but I am still not sure I understand what is going on.
   The directory specified in the dfs.name.dir property,
   /var/lib/hadoop-0.20/dfsname, does exist, and rights to that directory
   have been granted to the OS user that is running the Hadoop startup script.
   The directory mentioned in the error message is
   /var/lib/hadoop-0.20/cache/hadoop/dfs/name.
   I can create this directory and that would (I assume) remove the error,
   but I want to understand how the name is derived.

 From here:

        <property>
                <name>hadoop.tmp.dir</name>
                <value>/var/lib/hadoop-0.20/cache/hadoop</value>
        </property>
  </configuration>

 because:

         <property>
                 <name>dfs.name.dir</name>
                 <value>/DFS/dfsname,/var/lib/hadoop-0.20/dfsname</value>
         <property>

 is missing a / in the closing property tag.








Re: Hadoop startup problem - directory name required

2010-08-25 Thread cliff palmer
No complaints from hadoop, other than the error for the mangled directory
name.

On Wed, Aug 25, 2010 at 2:04 PM, Hemanth Yamijala yhema...@gmail.com wrote:

 Hmm. Without the / in the property tag, isn't the file malformed XML ?
 I am pretty sure Hadoop complains in such cases  ?

 On Wed, Aug 25, 2010 at 4:44 AM, cliff palmer palmercl...@gmail.com
 wrote:
  Thanks Allen - that has resolved the problem.  Good catch!
  Cliff
 
  On Tue, Aug 24, 2010 at 3:05 PM, Allen Wittenauer
   awittena...@linkedin.com wrote:
 
 
  On Aug 23, 2010, at 6:49 AM, cliff palmer wrote:
 
    Thanks Harsh, but I am still not sure I understand what is going on.
    The directory specified in the dfs.name.dir property,
    /var/lib/hadoop-0.20/dfsname, does exist, and rights to that directory
    have been granted to the OS user that is running the Hadoop startup script.
    The directory mentioned in the error message is
    /var/lib/hadoop-0.20/cache/hadoop/dfs/name.
    I can create this directory and that would (I assume) remove the error,
    but I want to understand how the name is derived.
 
  From here:
 
  <property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop-0.20/cache/hadoop</value>
  </property>
    </configuration>
 
  because:
 
  <property>
  <name>dfs.name.dir</name>
  <value>/DFS/dfsname,/var/lib/hadoop-0.20/dfsname</value>
  <property>
 
  is missing a / in the closing property tag.
 
 
 
 
 
 



Re: How to enumerate files in the directories?

2010-08-25 Thread Sudhir Vallamkondu
You should use the FileStatus API to access file metadata. See the example
below.

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileStatus.html

Configuration conf = new Configuration(); // picks up the default configuration
FileSystem fs = FileSystem.get(conf);
Path dir = new Path("/dir");
FileStatus[] stats = fs.listStatus(dir);
for (FileStatus stat : stats)
{
    stat.getPath().toUri().getPath(); // path of the file or directory
    stat.getModificationTime();
    stat.getReplication();
    stat.getBlockSize();
    stat.getOwner();
    stat.getGroup();
    stat.getPermission().toString();
}
  


 From: Denim Live denim.l...@yahoo.com
 Date: Wed, 25 Aug 2010 07:36:11 + (GMT)
 To: common-user@hadoop.apache.org
 Subject: How to enumerate files in the directories?
 
 Hello, how can one determine the names of the files in a particular hadoop
 directory, programmatically?






Re: Custom partitioner for hadoop

2010-08-25 Thread David Rosenstrauch

On 08/25/2010 12:40 PM, Mithila Nagendra wrote:

In order to avoid this I was thinking of
passing the range boundaries to the partitioner. How would I do that? Is
there an alternative? Any suggestion would prove useful.


We use a custom partitioner, for which we pass in configuration data 
that gets used in the partitioning calculations.


We do it by making the Partitioner implement Configurable, and then grab 
the needed config data from the configuration object that we're given. 
(We set the needed config data on the config object when we submit the 
job).  i.e., like so:


import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

public class OurPartitioner extends Partitioner<BytesWritable, Writable>
implements Configurable {

...

    public int getPartition(BytesWritable key, Writable value, int numPartitions) {
...
    }

    public Configuration getConf() {
        return conf;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;

        configure();
    }

    @SuppressWarnings("unchecked")
    private void configure() {
        String parmValue = conf.get(parmKey);
        if (parmValue == null) {
            throw new RuntimeException("...");
        }
    }

    private Configuration conf;
}

HTH,

DR


Re: Custom partitioner for hadoop

2010-08-25 Thread Mithila Nagendra
In which of the three functions would I have to set the ranges? In the
configure function? Would the configure be called once for every mapper?
Thank you!

On Wed, Aug 25, 2010 at 12:50 PM, David Rosenstrauch dar...@darose.net wrote:

 On 08/25/2010 12:40 PM, Mithila Nagendra wrote:

 In order to avoid this I was thinking of
 passing the range boundaries to the partitioner. How would I do that? Is
 there an alternative? Any suggestion would prove useful.


 We use a custom partitioner, for which we pass in configuration data that
 gets used in the partitioning calculations.

 We do it by making the Partitioner implement Configurable, and then grab
 the needed config data from the configuration object that we're given. (We
 set the needed config data on the config object when we submit the job).
  i.e., like so:

 import org.apache.hadoop.mapreduce.Partitioner;
 import org.apache.hadoop.conf.Configurable;
 import org.apache.hadoop.conf.Configuration;

 public class OurPartitioner extends Partitioner<BytesWritable, Writable>
 implements Configurable {
 ...

     public int getPartition(BytesWritable key, Writable value, int numPartitions) {
 ...
     }

     public Configuration getConf() {
         return conf;
     }

     public void setConf(Configuration conf) {
         this.conf = conf;

         configure();
     }

     @SuppressWarnings("unchecked")
     private void configure() {
         String parmValue = conf.get(parmKey);
         if (parmValue == null) {
             throw new RuntimeException("...");
         }
     }

     private Configuration conf;
 }

 HTH,

 DR



Re: Custom partitioner for hadoop

2010-08-25 Thread David Rosenstrauch
If you define a Hadoop object as implementing Configurable, then its 
setConf() method will be called once, right after it gets instantiated. 
 So each partitioner that gets instantiated will have its setConf() 
method called right afterwards.


I'm taking advantage of that fact by calling my own (private) 
configure() method when the Partitioner gets its configuration.  So in 
that configure method, you would grab the ranges from out of the 
configuration object.


The flip side of this is that your ranges won't just magically appear in 
the configuration object.  You'll have to set them on the configuration 
object used in the Job that you're submitting.


A copy of the job's config object will then get passed to each task in 
your job, which you can then use to configure that task.
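
For example, a rough sketch of the submission side, using hypothetical
property names for the range boundaries from the earlier mail (lbrange0,
ubrange0, and so on); this is just a sketch, not the exact code we run:

// At submission time (job is your org.apache.hadoop.mapreduce.Job):
Configuration conf = job.getConfiguration();
conf.setLong("partition.lbrange0", lbrange0);   // hypothetical property names;
conf.setLong("partition.ubrange0", ubrange0);   // lbrange*/ubrange* are the values
conf.setLong("partition.lbrange1", lbrange1);   // computed from the histogram
conf.setLong("partition.ubrange1", ubrange1);
job.setPartitionerClass(OurPartitioner.class);

// And inside the partitioner's configure(), read them back:
long lbrange0 = conf.getLong("partition.lbrange0", 0L);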


HTH,

DR

On 08/25/2010 04:23 PM, Mithila Nagendra wrote:

In which of the three functions would I have to set the ranges? In the
configure function? Would the configure be called once for every mapper?
Thank you!

On Wed, Aug 25, 2010 at 12:50 PM, David Rosenstrauch dar...@darose.net wrote:


On 08/25/2010 12:40 PM, Mithila Nagendra wrote:


In order to avoid this I was thinking of
passing the range boundaries to the partitioner. How would I do that? Is
there an alternative? Any suggestion would prove useful.



We use a custom partitioner, for which we pass in configuration data that
gets used in the partitioning calculations.

We do it by making the Partitioner implement Configurable, and then grab
the needed config data from the configuration object that we're given. (We
set the needed config data on the config object when we submit the job).


Re: How to enumerate files in the directories?

2010-08-25 Thread Raj V
I would use the FileSystem API.

Here is a quick-and-dirty example:

import java.io.*;
import java.util.*;
import java.lang.*;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileStatus;


public class dirc {
    public static void main(String args[]) {
        try {
            String dirname = args[0];
            Configuration conf = new Configuration(true);
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(dirname);
            FileStatus fstatus[] = fs.listStatus(path);
            for (FileStatus f : fstatus) {
                System.out.println(f.getPath().toUri().getPath());
            }
        } catch (IOException e) {
            System.out.println("Usage: dirc <directory>");
            return;
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Usage: dirc <directory>");
            return;
        }
    }
}
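
To run it, compile against the Hadoop core jar and launch it through the
hadoop script; for example (the jar name and paths below are placeholders for
whatever your installation uses):

javac -classpath hadoop-0.20.2-core.jar dirc.java
jar cf dirc.jar dirc.class
hadoop jar dirc.jar dirc /user/someuser/somedir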








From: Steve Lewis lordjoe2...@gmail.com
To: common-user@hadoop.apache.org
Sent: Wed, August 25, 2010 9:04:41 AM
Subject: Re: How to enumerate files in the directories?



@Override
public HDFSFile[] getFiles(String directory) {
    // Shells out to "hadoop fs -ls" and builds one HDFSFile per output line.
    String result = executeCommand("hadoop fs -ls " + directory);
    String[] items = result.split("\n");
    List<HDFSFile> holder = new ArrayList<HDFSFile>();
    for (int i = 1; i < items.length; i++) {   // skip the "Found N items" header line
        String item = items[i];
        if (item.length() > MIN__FILE_LENGTH) {
            try {
                holder.add(new HDFSFile(item));
            }
            catch (Exception e) {
            }
        }
    }
    HDFSFile[] ret = new HDFSFile[holder.size()];
    holder.toArray(ret);
    return ret;
}

On Wed, Aug 25, 2010 at 12:36 AM, Denim Live denim.l...@yahoo.com wrote:

Hello, how can one determine the names of the files in a particular hadoop
directory, programmatically?



 


-- 
Steven M. Lewis PhD
Institute for Systems Biology
Seattle WA


Hadoop Benchmark Results

2010-08-25 Thread Burak ISIKLI
Hi,
I'm reading the Hadoop: The Definitive Guide book and have tried to benchmark my
Hadoop cluster. I got some results, but when I compared them I was surprised,
because there were several interesting differences. I don't understand whether
they are good or bad. Please help me.
Regards...

Writing Test
 hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
10/08/25 23:49:40 INFO mapred.FileInputFormat: - TestDFSIO - : write
10/08/25 23:49:40 INFO mapred.FileInputFormat: Date & time: Wed Aug 25 23:49:40 EEST 2010
10/08/25 23:49:40 INFO mapred.FileInputFormat: Number of files: 10
10/08/25 23:49:40 INFO mapred.FileInputFormat: Total MBytes processed: 1
10/08/25 23:49:40 INFO mapred.FileInputFormat: Throughput mb/sec: 36.09482833299645
10/08/25 23:49:40 INFO mapred.FileInputFormat: Average IO rate mb/sec: 49.026153564453125
10/08/25 23:49:40 INFO mapred.FileInputFormat: IO rate std deviation: 22.15250292439401
10/08/25 23:49:40 INFO mapred.FileInputFormat: Test exec time sec: 175.537

Reading Test
hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
10/08/25 23:54:11 INFO mapred.FileInputFormat: - TestDFSIO - : read
10/08/25 23:54:11 INFO mapred.FileInputFormat: Date & time: Wed Aug 25 23:54:11 EEST 2010
10/08/25 23:54:11 INFO mapred.FileInputFormat: Number of files: 10
10/08/25 23:54:11 INFO mapred.FileInputFormat: Total MBytes processed: 1
10/08/25 23:54:11 INFO mapred.FileInputFormat: Throughput mb/sec: 152.87948510189418
10/08/25 23:54:11 INFO mapred.FileInputFormat: Average IO rate mb/sec: 152.8846893310547
10/08/25 23:54:11 INFO mapred.FileInputFormat: IO rate std deviation: 0.8895501092647955
10/08/25 23:54:11 INFO mapred.FileInputFormat: Test exec time sec: 61.618


  

svn/git revisions for 0.20.2

2010-08-25 Thread Johannes Zillmann
Hey folks,

can somebody tell me how to get the source versions from git/svn for
hadoop-hdfs and hadoop-mapreduce?
In hadoop-common there are branches and tags for the release, but how do I get
the corresponding versions of the other 2 projects?

any help would be appreciated!
Johannes

quota?

2010-08-25 Thread jiang licht
Is it possible to tell hadoop to restrict space usage of a specific dfs folder 
in the cluster, e.g. a user home directory (/user/accountA in dfs)?

And is there a way to restrict the size of map/reduce output that can be saved 
to dfs? E.g. if a job creates over-limit data, it won't be allowed to save the 
result to the dfs.

Thanks,

Michael


  

Re: quota?

2010-08-25 Thread Ted Yu
Refer to
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_quota_admin_guide.html#Space+Quotas
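
In short, per that guide (the 1t figure below is just a placeholder):

hadoop dfsadmin -setSpaceQuota 1t /user/accountA
hadoop fs -count -q /user/accountA       # shows the quota and remaining space
hadoop dfsadmin -clrSpaceQuota /user/accountA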

On Wed, Aug 25, 2010 at 3:43 PM, jiang licht licht_ji...@yahoo.com wrote:

 Is it possible to tell hadoop to restrict space usage of a specific dfs
 folder in the cluster, e.g. a user home directory (/user/accountA in dfs)?

 And is there a way to restrict the size of map/reduce output that can be
 saved to dfs? E.g. if a job creates over-limit data, it won't be allowed to
 save the result to the dfs.

 Thanks,

 Michael





Re: quota?

2010-08-25 Thread jiang licht
Thanks, Ted.


Michael

--- On Wed, 8/25/10, Ted Yu yuzhih...@gmail.com wrote:

From: Ted Yu yuzhih...@gmail.com
Subject: Re: quota?
To: common-user@hadoop.apache.org
Date: Wednesday, August 25, 2010, 5:47 PM

Refer to
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_quota_admin_guide.html#Space+Quotas

On Wed, Aug 25, 2010 at 3:43 PM, jiang licht licht_ji...@yahoo.com wrote:

 Is it possible to tell hadoop to restrict space usage of a specific dfs
 folder in the cluster, e.g. a user home directory (/user/accountA in dfs)?

 And is there a way to restrict the size of map/reduce output that can be
 saved to dfs? E.g. if a job creates over-limit data, it won't be allowed to
 save the result to the dfs.

 Thanks,

 Michael






  

Re: svn/git revisions for 0.20.2

2010-08-25 Thread Owen O'Malley


On Aug 25, 2010, at 3:20 PM, Johannes Zillmann wrote:


Hey folks,

can somebody tell me how to get the source versions from git/svn for  
hadoop-hdfs and hadoop-mapreduce ?
In hadoop-common there are branches and tags for the release. But  
how to get the corresponding version of the other 2 projects ?


0.20 was pre-project split, so common included hdfs and mapreduce.

-- Owen


data node disk usage per volume?

2010-08-25 Thread jiang licht
dfs.datanode.du.reserved can limit the size of space per volume that can be 
used by a data node. I have 2 related questions.

1. How volume is defined here?
Say I have 2 folders listed for dfs.data.dir and each resides on a different 
disk. By setting dfs.datanode.du.reserved to N, does it mean 2N bytes 
reserved for non-dfs use, with N bytes on each disk?

2. Is it possible to reserve different amounts of space for non-dfs use on
different disks?
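
(For reference, a sketch of how the property is declared in hdfs-site.xml; the
value is in bytes, and the 10 GB figure below is just a placeholder:)

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>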

Thanks,

Michael


  

Basic question

2010-08-25 Thread Mark

 job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Does this mean the input to the reducer should be Text/IntWritable or 
the output of the reducer is Text/IntWritable?


What is the inverse of this.. setInputKeyClass/setInputValueClass? Is 
this inferred by the JobInputFormatClass? Would someone mind briefly 
explaining?


Thanks


Re: Basic question

2010-08-25 Thread James Seigel
The output of the reducer is Text/IntWritable. 

To set the input to the reducer you set the mapper output classes. 
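
Roughly, assuming the new org.apache.hadoop.mapreduce.Job API:

job.setMapOutputKeyClass(Text.class);       // reducer input key type (what the mapper emits)
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);          // final key type the reducer writes out
job.setOutputValueClass(IntWritable.class);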

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2010-08-25, at 8:13 PM, Mark static.void@gmail.com wrote:

  job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(IntWritable.class);
 
 Does this mean the input to the reducer should be Text/IntWritable or 
 the output of the reducer is Text/IntWritable?
 
 What is the inverse of this.. setInputKeyClass/setInputValueClass? Is 
 this inferred by the JobInputFormatClass? Would someone mind briefly 
 explaining?
 
 Thanks


data in compression format affect mapreduce speed

2010-08-25 Thread shangan
Will data stored in a compressed format affect MapReduce job speed? Will it
increase or decrease, or is the relationship between the two more complex? Can
anybody give a detailed explanation?

2010-08-26 



shangan 


Re: How can I run the other test cases except those defined in CoreTestDriver?

2010-08-25 Thread Min Zhou
Anyone can help me with this?

On Sun, Aug 22, 2010 at 1:18 PM, Min Zhou coderp...@gmail.com wrote:

 Hi all,

 When I run the command
  bin/hadoop jar hadoop-common-test-*.jar org.apache.hadoop.io.file.tfile.TestTFileSeqFileComparison
  a prompt like the one below is shown:
 Valid program names are:
   testarrayfile: A test for flat files of binary key/value pairs.
   testipc: A test for ipc.
   testrpc: A test for rpc.
   testsetfile: A test for flat files of binary key/value pairs.

  AFAIK, this script finds the Main-Class of a jar file essentially through
  RunJar, which first looks up the manifest to decide whether the jar has
  defined a main class. If there is no definition, the first remaining
  command-line argument is taken as the main class's name. But
  org/apache/hadoop/test/CoreTestDriver has been set as the main class of the
  Hadoop test jar in build.xml. Therefore, I can only run 4 test cases
  (testarrayfile, testipc, testrpc, testsetfile). How can I run the other test
  cases except those defined in that class? By running ant test -Dxxx?


 Thanks,
 Min

 --
 My research interests are distributed systems, parallel computing and
 bytecode based virtual machine.

 My profile:
 http://www.linkedin.com/in/coderplay
 My blog:
 http://coderplay.javaeye.com




-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com


Re: data in compression format affect mapreduce speed

2010-08-25 Thread Ted Yu
Compressed data would increase processing time in mapper/reducer but
decrease the amount of data transferred between tasktracker nodes.
Normally you should consider applying some form of compression.
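
For example, a sketch of the 0.20-era settings for compressing intermediate
(map output) and final job output; property names as I recall them, so please
verify against your mapred-default.xml:

mapred.compress.map.output=true
mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec
mapred.output.compress=true
mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec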

On Wed, Aug 25, 2010 at 7:32 PM, shangan shan...@corp.kaixin001.com wrote:

 will data stored in  compression format affect mapreduce job speed?
 increase or decrease? or more complex relationship between these two ?  can
 anybody give some explanation in detail?

 2010-08-26



 shangan



JIRA down

2010-08-25 Thread Bill Graham
JIRA seems to be down FYI. Database errors are being returned:

*Cause: *
org.apache.commons.lang.exception.NestableRuntimeException:
com.atlassian.jira.exception.DataAccessException:
org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a
connection with the database. (FATAL: database is not accepting commands to
avoid wraparound data loss in database postgres)

*Stack Trace: * [hide]

org.apache.commons.lang.exception.NestableRuntimeException:
com.atlassian.jira.exception.DataAccessException:
org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a
connection with the database. (FATAL: database is not accepting
commands to avoid wraparound data loss in database postgres)
at 
com.atlassian.jira.web.component.TableLayoutFactory.getUserColumns(TableLayoutFactory.java:239)
at 
com.atlassian.jira.web.component.TableLayoutFactory.getStandardLayout(TableLayoutFactory.java:42)
at 
org.apache.jsp.includes.navigator.table_jsp._jspService(table_jsp.java:178)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:374)
at 
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)


Re: JIRA down

2010-08-25 Thread Ted Yu
In case you need to access JIRA tonight, google the JIRA number and click on
the Cached link.

You would see something like:
http://webcache.googleusercontent.com/search?q=cache:Tgi71phHrUoJ:https://issues.apache.org/jira/browse/HBASE-2893+hbase+metadata+layer&cd=4&hl=en&ct=clnk&gl=us&client=firefox-a

On Wed, Aug 25, 2010 at 8:47 PM, Bill Graham billgra...@gmail.com wrote:

 JIRA seems to be down FYI. Database errors are being returned:

 *Cause: *
 org.apache.commons.lang.exception.NestableRuntimeException:
 com.atlassian.jira.exception.DataAccessException:
 org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a
 connection with the database. (FATAL: database is not accepting commands to
 avoid wraparound data loss in database postgres)

 *Stack Trace: * [hide]

 org.apache.commons.lang.exception.NestableRuntimeException:
 com.atlassian.jira.exception.DataAccessException:
 org.ofbiz.core.entity.GenericDataSourceException: Unable to esablish a
 connection with the database. (FATAL: database is not accepting
 commands to avoid wraparound data loss in database postgres)
at
 com.atlassian.jira.web.component.TableLayoutFactory.getUserColumns(TableLayoutFactory.java:239)
at
 com.atlassian.jira.web.component.TableLayoutFactory.getStandardLayout(TableLayoutFactory.java:42)
at
 org.apache.jsp.includes.navigator.table_jsp._jspService(table_jsp.java:178)
at
 org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at
 org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:374)
at
 org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:342)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:267)



Re: data in compression format affect mapreduce speed

2010-08-25 Thread Harsh J
Logically it 'should' increase time, as it's an extra step beyond the
Mapper/Reducer. But while your processing time would slightly (very
very slightly) increase, your IO and network transfer time would
decrease by a large margin -- giving you a clear impression that your
total job time has decreased overall. The difference being in writing
out say 10 GB before, and writing out 5-7 GB this time (a crude
example).

With the fast CPUs available these days, compressing and decompressing
should hardly take a noticeable amount of extra time. It's almost
negligible when using gzip, lzo or plain deflate.

On Thu, Aug 26, 2010 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:
 Compressed data would increase processing time in mapper/reducer but
 decrease the amount of data transferred between tasktracker nodes.
 Normally you should consider applying some form of compression.

 On Wed, Aug 25, 2010 at 7:32 PM, shangan shan...@corp.kaixin001.com wrote:

 will data stored in  compression format affect mapreduce job speed?
 increase or decrease? or more complex relationship between these two ?  can
 anybody give some explanation in detail?

 2010-08-26



 shangan





-- 
Harsh J
www.harshj.com


Re: Re: data in compression format affect mapreduce speed

2010-08-25 Thread shangan
I agree with you for the most part, but I have some other questions. Mappers
work on the local machine, so there are no network transfers during that phase;
if the original data stored in HDFS is compressed, it will only decrease the IO
time. My main question is whether a mapper can work on only part of the data
when the data is compressed, since it seems a compressed file can't be split.
I tried a select sum() in Hive and traced the job: the .tar.gz data could only
be worked on by one single machine and was stuck there for quite a long time
(it seemed to be waiting for other parts of the data to be copied from other
machines), while the uncompressed data was processed on different machines in
parallel. Do you know something about this?

2010-08-26 



shangan 



From: Harsh J
Sent: 2010-08-26 12:15:49
To: common-user
Cc:
Subject: Re: data in compression format affect mapreduce speed
 
Logically it 'should' increase time as its an extra step beyond the
Mapper/Reducer. But while your processing time would slightly (very
very slightly) increase, your IO and Network Transfers time would
decrease by a large margin -- giving you a clear impression that your
total job time has decreased overall. The difference being in writing
out say 10 GB before, and writing out 5-7 GB this time (a crude
example).
With the fast CPUs available these days, compressing and decompressing
should hardly take a noticeable amount of extra time. Its almost
negligible in case of using gzip, lzo or plain deflate.
On Thu, Aug 26, 2010 at 9:13 AM, Ted Yu yuzhih...@gmail.com wrote:
 Compressed data would increase processing time in mapper/reducer but
 decrease the amount of data transferred between tasktracker nodes.
 Normally you should consider applying some form of compression.

 On Wed, Aug 25, 2010 at 7:32 PM, shangan shan...@corp.kaixin001.com wrote:

 will data stored in  compression format affect mapreduce job speed?
 increase or decrease? or more complex relationship between these two ?  can
 anybody give some explanation in detail?

 2010-08-26



 shangan


-- 
Harsh J
www.harshj.com