Problem with LibHDFS

2008-02-21 Thread Raghavendra K
Hi,
  I am able to get Hadoop running and am also able to compile libhdfs.
But when I run the hdfs_test program it gives a segmentation fault.
Just a small program like this
#include <hdfs.h>
int main() {
return(0);
}
and compiled using the command
gcc -ggdb -m32 -I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include
-I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include/ hdfs_test.c
-L/garl/garl-alpha1/home1/raghu/Desktop/hadoop-0.15.3/libhdfs -lhdfs
-L/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/lib/i386/server -ljvm
-shared -m32 -Wl,-x -o hdfs_test
Running hdfs_test gives a segmentation fault.
Please tell me how to fix it.



-- 
Regards,
Raghavendra K


Re: Problem with LibHDFS

2008-02-21 Thread Miles Osborne
Since you are compiling a C(++) program, why not add the -g switch and run
it within gdb:  that will tell people which line it crashes at (etc etc)

Miles

On 21/02/2008, Raghavendra K [EMAIL PROTECTED] wrote:

 Hi,
   I am able to get Hadoop running and also able to compile the libhdfs.
 But when I run the hdfs_test program it is giving Segmentation Fault.
 Just a small program like this
 #include <hdfs.h>
 int main() {
 return(0);
 }
 and compiled using the command
 gcc -ggdb -m32 -I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include
 -I/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/include/ hdfs_test.c
 -L/garl/garl-alpha1/home1/raghu/Desktop/hadoop-0.15.3/libhdfs -lhdfs
 -L/garl/garl-alpha1/home1/raghu/Desktop/jre1.5.0_14/lib/i386/server -ljvm
 -shared -m32 -Wl,-x -o hdfs_test
 running hdfs_test gives segmentation fault.
 please tell me as to how to fix it.



 --
 Regards,

 Raghavendra K




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Re: Add your project or company to the powered by page?

2008-02-21 Thread Derek Gottfrid
The New York Times / nytimes.com
-large scale image conversions
-http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

On Thu, Feb 21, 2008 at 1:26 AM, Eric Baldeschwieler
[EMAIL PROTECTED] wrote:
 Hi Folks,

  Let's get the word out that Hadoop is being used and is useful in
  your organizations, ok?  Please add yourselves to the Hadoop powered
  by page, or reply to this email with what details you would like to
  add and I'll do it.

  http://wiki.apache.org/hadoop/PoweredBy

  Thanks!

  E14

  ---
  eric14 a.k.a. Eric Baldeschwieler
  senior director, grid computing
  Yahoo!  Inc.





java error

2008-02-21 Thread Jaya Ghosh
Hello,

 

As per my earlier mails, I could not deploy Nutch on Linux. Now I am
attempting the same using Cygwin, as per the tutorial by Peter Wang. Can
someone on the list help me resolve the attached error? At least on Linux I
could run the crawl.

 

java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.util.PlatformName
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/crawl/Crawl
Caused by: java.lang.ClassNotFoundException: org.apache.nutch.crawl.Crawl
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
Exception in thread "main"

 

 

P.S. I have sent this mail to nutch-users as well, but so far no response from
their end. I am a writer and not technical enough to debug these errors.

Regards,

Jaya 

 



Question on metrics via ganglia

2008-02-21 Thread Jason Venner
We have modified the metrics file, distributed it and restarted our 
cluster. We have gmond running on the nodes, and a machine on the vlan 
with gmetad running.
We have statistics for the machines in the web ui, and our statistics 
reported by the gmetric program are present. We don't see any hadoop 
reporting.


Clearly we have something basic wrong in our understanding of how to set 
this up.


# Configuration of the dfs context for null
# dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the dfs context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the dfs context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649


# Configuration of the mapred context for null
# mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the mapred context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the mapred context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649


# Configuration of the jvm context for null
# jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the jvm context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the jvm context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=localhost:8649


--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested



Re: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread John Heidemann
On Wed, 20 Feb 2008 12:10:09 PST, Ajay Anand wrote: 
The registration page for the Hadoop summit is now up:
http://developer.yahoo.com/hadoop/summit/
...
Agenda:

Ajay, when we talked about the summit on the phone, you were considering
having a poster session.  I don't see that listed.  Should I assume it's
no longer planned?

Thanks,
   -John


Re: Questions about namenode and JobTracker configuration.

2008-02-21 Thread Amar Kamat

Zhang, jian wrote:

Hi, All

 


I have a small question about configuration.

 

In Hadoop Documentation page, it says 


 Typically you choose one machine in the cluster to act as the NameNode
and one machine as to act as the JobTracker, exclusively. The rest of
the machines act as both a DataNode and TaskTracker and are referred to
as slaves.

 


Does that mean the JobTracker is not a slave as NameNode ?

 
  
The JobTracker and NameNode are daemons on a machine (frequently called the 
master). The master node can also act as a slave node. The JobTracker and 
NameNode basically do the book-keeping/scheduling work. On a large 
cluster the load on the JobTracker/NameNode is usually high, hence it is 
recommended to run these daemons on a separate machine, but this is not 
mandatory.

 NameNode and DataNode form the HDFS. Since the JobTracker needs to
 interact with TaskTracker which resides in HDFS, to make the
 communication easier, I think it should be at least part of the HDFS.

The TaskTracker and DataNode are processes on the slave nodes. The 
TaskTracker communicates with the JobTracker, while the DataNode 
communicates with the NameNode. The DFS is designed so that it can function 
without MapReduce, purely as distributed storage. The TaskTracker never 
communicates with the NameNode; it is the JobTracker that does. Mostly the 
TaskTracker concentrates on doing the work locally, i.e. spawning JVMs to 
run the maps.

Amar

 


Best Regards

 


Jian Zhang

 



  




Re: Add your project or company to the powered by page?

2008-02-21 Thread Dennis Kubes

 * [http://alpha.search.wikia.com Search Wikia]
  * A project to help develop open source social search tools.  We run 
a 125 node hadoop cluster.


Derek Gottfrid wrote:

The New York Times / nytimes.com
-large scale image conversions
-http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/

On Thu, Feb 21, 2008 at 1:26 AM, Eric Baldeschwieler
[EMAIL PROTECTED] wrote:

Hi Folks,

 Let's get the word out that Hadoop is being used and is useful in
 your organizations, ok?  Please add yourselves to the Hadoop powered
 by page, or reply to this email with what details you would like to
 add and I'll do it.

 http://wiki.apache.org/hadoop/PoweredBy

 Thanks!

 E14

 ---
 eric14 a.k.a. Eric Baldeschwieler
 senior director, grid computing
 Yahoo!  Inc.





Re: Question on metrics via ganglia

2008-02-21 Thread Jason Venner
Well, with the metrics file changed to perform file based logging, 
metrics do appear.
On digging into the GangliaContext source, it looks like it is using udp 
for reporting, and we modified the gmond.conf to receive via udp as well 
as tcp. netstat -a -p shows gmond monitoring 8649 for both tcp and udp.
Still nothing visible via the ganglia ui and no rrd file for anything 
hadoop related.


Jason Venner wrote:
We have modified my metrics file, distributed it and restarted our 
cluster. We have gmond running on the nodes, and a machine on the vlan 
with gmetad running.
We have statistics for the machines in the web ui, and our statistics 
reported by the gmetric program are present. We don't see any hadoop 
reporting.


Clearly we have something basic wrong in our understanding of how to 
set this up.


# Configuration of the dfs context for null
# dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the dfs context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the dfs context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649


# Configuration of the mapred context for null
# mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the mapred context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the mapred context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649


# Configuration of the jvm context for null
# jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the jvm context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the jvm context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=localhost:8649




Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

2008-02-21 Thread Amar Kamat
The output of every MapReduce job in Hadoop gets stored in the DFS, i.e. it is 
made visible. You can run back-to-back jobs (i.e. job chaining), but the output 
won't be temporary. Look at Grep.java, as Hairong suggested, for more 
details on job chaining. As of now there is no built-in support for job 
chaining in Hadoop. Pig [http://incubator.apache.org/pig/], on the other hand, 
implicitly does job pipelining. But for smaller and simpler pipelines you 
could do manual chaining. It depends on the kind of pipelining one requires.
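
For reference, here is a minimal sketch of the manual chaining described above, 
using the 0.15-era mapred API. The mapper/reducer classes (FirstMapper, 
FirstReducer, SecondMapper, SecondReducer) and the paths are hypothetical 
placeholders; the first job's output directory is simply reused as the second 
job's input.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainExample {
      public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);  // output of job 1, input of job 2
        Path output = new Path(args[2]);

        // First job: reads the original input, writes to the intermediate path.
        JobConf job1 = new JobConf(ChainExample.class);
        job1.setJobName("first");
        job1.setMapperClass(FirstMapper.class);    // hypothetical classes
        job1.setReducerClass(FirstReducer.class);
        job1.setInputPath(input);
        job1.setOutputPath(intermediate);
        JobClient.runJob(job1);                    // blocks until job 1 completes

        // Second job: consumes the first job's output as its input.
        JobConf job2 = new JobConf(ChainExample.class);
        job2.setJobName("second");
        job2.setMapperClass(SecondMapper.class);   // hypothetical classes
        job2.setReducerClass(SecondReducer.class);
        job2.setInputPath(intermediate);
        job2.setOutputPath(output);
        JobClient.runJob(job2);

        // The intermediate directory can be deleted afterwards if it is not needed.
      }
    }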

Amar
ma qiang wrote:

Hi all:
 Here I have two MapReduce programs. I need to use the result of the
first MapReduce program to compute other values that are generated in
the second MapReduce program, and this intermediate result does not
need to be saved. So I want to run the second MapReduce program
automatically, using the output of the first MapReduce program as the
input of the second. Who can tell me how?
 Thanks!
 Best Wishes!

Qiang
  




Re: changes to compression interfaces in 0.15?

2008-02-21 Thread Arun C Murthy

Joydeep,

On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote:


Hi developers,

In migrating to 0.15 - i am noticing that the compression interfaces
have changed:

-  compression type for sequencefile outputs used to be set  
by:

SequenceFile.setCompressionType()

-  now it seems to be set using:
sequenceFileOutputFormat.setOutputCompressionType()




Yes, we added SequenceFileOutputFormat.setOutputCompressionType and  
deprecated the old api. (HADOOP-1851)
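
For anyone migrating, a minimal sketch of the new call next to the old one it 
replaces, assuming an existing JobConf for a job that writes SequenceFile output:

    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class CompressionSetupExample {
      // Configure block-compressed SequenceFile output for the given job.
      public static void configure(JobConf conf) {
        conf.setOutputFormat(SequenceFileOutputFormat.class);

        // Pre-0.15 style, now superseded:
        //   SequenceFile.setCompressionType(conf, SequenceFile.CompressionType.BLOCK);

        // Style introduced with HADOOP-1851:
        SequenceFileOutputFormat.setOutputCompressionType(conf,
            SequenceFile.CompressionType.BLOCK);
      }
    }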




The change is for the better - but would it be possible to:

-  remove old/dead interfaces. That would have been a
straightforward hint for applications to look for new interfaces.
(hadoop-default.xml also still has setting for old conf variable:
io.seqfile.compression.type)



To maintain backward compat, we cannot remove old apis - the standard  
procedure is to deprecate them for the next release and remove them  
in subsequent releases.



-  if possible - document changed interfaces in the release
notes (there's no way we can find this out by looking at the long list
of Jiras).



Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt,  
HADOOP-1851 is listed there. Admittedly we can do better, but that is  
a good place to look for when upgrading to newer releases.


i am not sure how updated the wiki is on the compression stuff (my
responsibility to update it) - but please do consider the impact of


Please use the forrest-based docs (on the hadoop website - e.g.  
mapred_tutorial.html) rather than the wiki as the gold-standard. The  
reason we moved away from the wiki is precisely this - harder to  
maintain docs per release etc.



changing interfaces on existing applications. (maybe we should have a
JIRA tag to mark out bugs that change interfaces).




Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now.

Arun




As always - thanks for all the fish (err .. working code),



Joydeep







Re: Add your project or company to the powered by page?

2008-02-21 Thread Allen Wittenauer
On 2/21/08 11:34 AM, Jeff Hammerbacher [EMAIL PROTECTED]
wrote:
 yeah, i've heard those facebook groups can be a great way to get the word
 out...
 
 anyways, just got approval yesterday for a 320 node cluster.  each node has
 8 cores and 4 TB of raw storage so this guy is gonna be pretty powerful.
 can we claim largest cluster outside of yahoo?

I guess it depends upon how you define outside.

   *Technically*, M45 is outside of a Yahoo! building, given that it is in
one of those shipping-container-data-center-thingies ...



Re: Add your project or company to the powered by page?

2008-02-21 Thread Paco NATHAN
More on the subject of outreach, not specific uses at companies, but...
A couple things might help get the word out:

   - Add a community group in LinkedIn (shows up on profile searches)
 http://www.linkedin.com/static?key=groups_faq

   - Add a link on the wiki to the Facebook group about Hadoop
http://www.facebook.com/pages/Hadoop/9887781514

There's also a small but growing network of local user groups for
Amazon AWS, and much interest there for presentations and discussions
about Hadoop:
   
http://www.amazon.com/Upcoming-Events-AWS-home-page/b/ref=sc_fe_c_0_371080011_1/103-5668663-1566203?ie=UTF8node=16284451no=371080011me=A36L942TSJ2AJA


I'd be happy to help with any of those.
Paco



On Wed, Feb 20, 2008 at 10:26 PM, Eric Baldeschwieler
[EMAIL PROTECTED] wrote:
 Hi Folks,

  Let's get the word out that Hadoop is being used and is useful in
  your organizations, ok?  Please add yourselves to the Hadoop powered
  by page, or reply to this email with what details you would like to
  add and I'll do it.

  http://wiki.apache.org/hadoop/PoweredBy

  Thanks!

  E14

  ---
  eric14 a.k.a. Eric Baldeschwieler
  senior director, grid computing
  Yahoo!  Inc.



Re: Question on metrics via ganglia solved

2008-02-21 Thread Jason Venner
Instead of localhost, in the servers block, we now put the machine that 
has gmetad running.


dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=GMETAD_HOST:8649


Jason Venner wrote:
Well, with the metrics file changed to perform file based logging, 
metrics do appear.
On digging into the GangliaContext source, it looks like it is using 
udp for reporting, and we modified the gmond.conf to receive via udp 
as well as tcp. netstat -a -p shows gmond monitoring 8649 for both tcp 
and udp.
Still nothing visible via the ganglia ui and no rrd file for anything 
hadoop related.


Jason Venner wrote:
We have modified my metrics file, distributed it and restarted our 
cluster. We have gmond running on the nodes, and a machine on the 
vlan with gmetad running.
We have statistics for the machines in the web ui, and our statistics 
reported by the gmetric program are present. We don't see any hadoop 
reporting.


Clearly we have something basic wrong in our understanding of how to 
set this up.


# Configuration of the dfs context for null
# dfs.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the dfs context for file
#dfs.class=org.apache.hadoop.metrics.file.FileContext
#dfs.period=10
#dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the dfs context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=localhost:8649


# Configuration of the mapred context for null
# mapred.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the mapred context for file
#mapred.class=org.apache.hadoop.metrics.file.FileContext
#mapred.period=10
#mapred.fileName=/tmp/mrmetrics.log

# Configuration of the mapred context for ganglia
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=localhost:8649


# Configuration of the jvm context for null
# jvm.class=org.apache.hadoop.metrics.spi.NullContext

# Configuration of the jvm context for file
#jvm.class=org.apache.hadoop.metrics.file.FileContext
#jvm.period=10
#jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the jvm context for ganglia
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=localhost:8649



--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested


RE: changes to compression interfaces in 0.15?

2008-02-21 Thread Joydeep Sen Sarma
 To maintain backward compat, we cannot remove old apis - the standard 
 procedure is to deprecate them for the next release and remove them 
 in subsequent releases.

you've got to be kidding.

we didn't maintain backwards compatibility. my app broke. Simple and 
straightforward. and the old interfaces are not deprecated (to quote 0.15.3 on 
a 'deprecated' interface:

  /**
   * Set the compression type for sequence files.
   * @param job the configuration to modify
   * @param val the new compression type (none, block, record)
   */
  static public void setCompressionType(Configuration job,
                                        CompressionType val) {
)

I (and i would suspect any average user willing to recompile code) would much 
rather that we broke backwards compatibility immediately than carry over 
defunct apis that insidiously break application behavior.

and of course - this does not address the point that the option strings 
themselves are deprecated. (remember - people set options explicitly from xml 
files and streaming. not everyone goes through java apis).

--

as one of my dear professors once said - put yourself in the other person's 
shoes. consider that you were in my position and that a production app suddenly 
went from consuming 100G to 1TB. and everything slowed down drastically. and it 
did not give any sign that anything was amiss. everything looked golden on the 
outside. what would be your reaction if you found out after a week that the 
system was full and numerous processes had to be re-run? how would you have 
figured that was going to happen by looking at the INCOMPATIBLE section (which 
btw - i did read carefully before sending my mail).

(fortunately i escaped the worst case - but i think this is a real call to 
action)


-Original Message-
From: Arun C Murthy [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 11:21 AM
To: core-user@hadoop.apache.org
Subject: Re: changes to compression interfaces in 0.15?
 
Joydeep,

On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote:

 Hi developers,

 In migrating to 0.15 - i am noticing that the compression interfaces
 have changed:

 -  compression type for sequencefile outputs used to be set  
 by:
 SequenceFile.setCompressionType()

 -  now it seems to be set using:
 sequenceFileOutputFormat.setOutputCompressionType()



Yes, we added SequenceFileOutputFormat.setOutputCompressionType and  
deprecated the old api. (HADOOP-1851)


 The change is for the better - but would it be possible to:

 -  remove old/dead interfaces. That would have been a
 straightforward hint for applications to look for new interfaces.
 (hadoop-default.xml also still has setting for old conf variable:
 io.seqfile.compression.type)


To maintain backward compat, we cannot remove old apis - the standard  
procedure is to deprecate them for the next release and remove them  
in subsequent releases.

 -  if possible - document changed interfaces in the release
 notes (there's no way we can find this out by looking at the long list
 of Jiras).


Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt,  
HADOOP-1851 is listed there. Admittedly we can do better, but that is  
a good place to look for when upgrading to newer releases.

 i am not sure how updated the wiki is on the compression stuff (my
 responsibility to update it) - but please do consider the impact of

Please use the forrest-based docs (on the hadoop website - e.g.  
mapred_tutorial.html) rather than the wiki as the gold-standard. The  
reason we moved away from the wiki is precisely this - harder to  
maintain docs per release etc.

 changing interfaces on existing applications. (maybe we should have a
 JIRA tag to mark out bugs that change interfaces).



Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now.

Arun



 As always - thanks for all the fish (err .. working code),



 Joydeep







Re: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread Tim Wintle
I would certainly appreciate being able to watch them online too, and
they would help spread the word about hadoop - think of all the people
who watch Google's Techtalks (am I allowed to say the G word around
here?).



On Thu, 2008-02-21 at 08:34 +0100, Lukas Vlcek wrote:
 Online webcast/recorded video would be really appreciated by lot of people.
 Please post the content online! (not only you can target much greater
 audience but you can significantly save on break/lunch/beer food budget :-).
 Lukas
 
 On Wed, Feb 20, 2008 at 9:10 PM, Ajay Anand [EMAIL PROTECTED] wrote:
 
  The registration page for the Hadoop summit is now up:
  http://developer.yahoo.com/hadoop/summit/
 
  Space is limited, so please sign up early if you are interested in
  attending.
 
  About the summit:
  Yahoo! is hosting the first summit on Apache Hadoop on March 25th in
  Sunnyvale. The summit is sponsored by the Computing Community Consortium
  (CCC) and brings together leaders from the Hadoop developer and user
  communities. The speakers will cover topics in the areas of extensions
  being developed for Hadoop, case studies of applications being built and
  deployed on Hadoop, and a discussion on future directions for the
  platform.
 
  Agenda:
  8:30-8:55 Breakfast
  8:55-9:00 Welcome to Yahoo!  Logistics - Ajay Anand, Yahoo!
  9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo!
  9:30-10:00 Pig - Chris Olston, Yahoo!
  10:00-10:30 JAQL - Kevin Beyer, IBM
  10:30-10:45 Break
  10:45-11:15 DryadLINQ - Michael Isard, Microsoft
  11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei
  Zaharia, UC Berkeley
  11:45-12:15 Zookeeper - Ben Reed, Yahoo!
  12:15-1:15 Lunch
  1:15-1:45 Hbase - Michael Stack, Powerset
  1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf
  2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
  2:45-3:00 Break
  3:00-3:20 Building Ground Models of Southern California - Steve
  Schossler, David O'Hallaron, Intel / CMU
  3:20-3:40 Online search for engineering design content - Mike Haley,
  Autodesk
  3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
  4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland /
  Christophe Bisciglia, Google
  4:30-4:45 Break
  4:45-5:30 Panel on future directions
  5:30-7:00 Happy hour
 
  Look forward to seeing you there!
  Ajay
 
  -Original Message-
  From: Bradford Stephens [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 20, 2008 9:17 AM
  To: core-user@hadoop.apache.org
  Subject: Re: Hadoop summit / workshop at Yahoo!
 
  Hrm yes, I'd like to make a visit as well :)
 
  On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote:
 Hey All:
  
 Is this going forward?  I'd like to make plans to attend and the
  sooner I can get plane tickets the happier the bean counters will be
  :-).
  
 Thx,
 C G
  
  Ajay Anand wrote:
   
Yahoo plans to host a summit / workshop on Apache Hadoop at our
Sunnyvale campus on March 25th. Given the interest we are seeing
  from
developers in a broad range of organizations, this seems like a
  good
time to get together and brief each other on the progress that is
being
made.
   
   
   
We would like to cover topics in the areas of extensions being
developed
for Hadoop, innovative applications being built and deployed on
Hadoop,
and future extensions to the platform. Some of the speakers who
  have
already committed to present are from organizations such as IBM,
Intel,
Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and
we are
actively recruiting other leaders in the space.
   
   
   
If you have an innovative application you would like to talk about,
please let us know. Although there are limitations on the amount of
time
we have, we would love to hear from you. You can contact me at
[EMAIL PROTECTED]
   
   
   
Thanks and looking forward to hearing about your cool apps,
   
Ajay
   
   
   
   
   
   
--
View this message in context:
  http://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-tp14889262p15
  393386.htmlhttp://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-tp14889262p15393386.html
Sent from the Hadoop lucene-users mailing list archive at
  Nabble.com.
   
  
  
  
  
  
  
   -
   Be a better friend, newshound, and know-it-all with Yahoo! Mobile.
  Try it now.
 
 
 
 



Re: changes to compression interfaces in 0.15?

2008-02-21 Thread Arun C Murthy


On Feb 21, 2008, at 12:20 PM, Joydeep Sen Sarma wrote:


To maintain backward compat, we cannot remove old apis - the standard
procedure is to deprecate them for the next release and remove them
in subsequent releases.


you've got to be kidding.

we didn't maintain backwards compatibility. my app broke. Simple  
and straightforward. and the old interfaces are not deprecated (to  
quote 0.15.3 on a 'deprecated' interface:




You are right, HADOOP-1851 didn't fix it right. I've filed HADOOP-2869.

We do need to be more diligent about listing config changes in  
CHANGES.txt for starters, and that point is taken. However, we can't  
start pulling out apis without deprecating them first.


Arun



  /**
   * Set the compression type for sequence files.
   * @param job the configuration to modify
   * @param val the new compression type (none, block, record)
   */
  static public void setCompressionType(Configuration job,
CompressionType val) {
)

I (and i would suspect any average user willing to recompile code)  
would much much rather that we broke backwards compatibility  
immediately rather than maintain carry over defunct apis that  
insidiously break application behavior.


and of course - this does not address the point that the option  
strings themselves are deprecated. (remember - people set options  
explicitly from xml files and streaming. not everyone goes through  
java apis)).


--

as one of my dear professors once said - put ur self in the other  
person's shoe. consider that u were in my position and that a  
production app suddenly went from consuming 100G to 1TB. and  
everything slowed down drastically. and it did not give any sign  
that anything was amiss. everything looked golden on the outside.  
what would be ur reaction if u find out after a week that the  
system was full and numerous processes had to be re-run? how would  
you have figured that was going to happen by looking at the  
INCOMPATIBLE section (which btw - i did carefully before sending my  
mail).


(fortunately i escaped the worst case - but i think this is a real  
call to action)



-Original Message-
From: Arun C Murthy [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 11:21 AM
To: core-user@hadoop.apache.org
Subject: Re: changes to compression interfaces in 0.15?

Joydeep,

On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote:


Hi developers,

In migrating to 0.15 - i am noticing that the compression interfaces
have changed:

-  compression type for sequencefile outputs used to be set
by:
SequenceFile.setCompressionType()

-  now it seems to be set using:
sequenceFileOutputFormat.setOutputCompressionType()




Yes, we added SequenceFileOutputFormat.setOutputCompressionType and
deprecated the old api. (HADOOP-1851)



The change is for the better - but would it be possible to:

-  remove old/dead interfaces. That would have been a
straightforward hint for applications to look for new interfaces.
(hadoop-default.xml also still has setting for old conf variable:
io.seqfile.compression.type)



To maintain backward compat, we cannot remove old apis - the standard
procedure is to deprecate them for the next release and remove them
in subsequent releases.


-  if possible - document changed interfaces in the release
notes (there's no way we can find this out by looking at the long  
list

of Jiras).



Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt,
HADOOP-1851 is listed there. Admittedly we can do better, but that is
a good place to look for when upgrading to newer releases.


i am not sure how updated the wiki is on the compression stuff (my
responsibility to update it) - but please do consider the impact of


Please use the forrest-based docs (on the hadoop website - e.g.
mapred_tutorial.html) rather than the wiki as the gold-standard. The
reason we moved away from the wiki is precisely this - harder to
maintain docs per release etc.


changing interfaces on existing applications. (maybe we should have a
JIRA tag to mark out bugs that change interfaces).




Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now.

Arun




As always - thanks for all the fish (err .. working code),



Joydeep










Re: changes to compression interfaces in 0.15?

2008-02-21 Thread Pete Wyckoff

If the API semantics are changing under you, you have to change your code
whether or not the API is pulled or deprecated.  Pulling it makes it more
obvious that the user has to change his/her code.

-- pete


On 2/21/08 12:41 PM, Arun C Murthy [EMAIL PROTECTED] wrote:

 
 On Feb 21, 2008, at 12:20 PM, Joydeep Sen Sarma wrote:
 
 To maintain backward compat, we cannot remove old apis - the standard
 procedure is to deprecate them for the next release and remove them
 in subsequent releases.
 
 you've got to be kidding.
 
 we didn't maintain backwards compatibility. my app broke. Simple
 and straightforward. and the old interfaces are not deprecated (to
 quote 0.15.3 on a 'deprecated' interface:
 
 
 You are right, HADOOP-1851 didn't fix it right. I've filed HADOOP-2869.
 
 We do need to be more diligent about listing config changes in
 CHANGES.txt for starters, and that point is taken. However, we can't
 start pulling out apis without deprecating them first.
 
 Arun
 
 
   /**
* Set the compression type for sequence files.
* @param job the configuration to modify
* @param val the new compression type (none, block, record)
*/
   static public void setCompressionType(Configuration job,
 CompressionType val) {
 )
 
 I (and i would suspect any average user willing to recompile code)
 would much much rather that we broke backwards compatibility
 immediately rather than maintain carry over defunct apis that
 insidiously break application behavior.
 
 and of course - this does not address the point that the option
 strings themselves are deprecated. (remember - people set options
 explicitly from xml files and streaming. not everyone goes through
 java apis)).
 
 --
 
 as one of my dear professors once said - put ur self in the other
 person's shoe. consider that u were in my position and that a
 production app suddenly went from consuming 100G to 1TB. and
 everything slowed down drastically. and it did not give any sign
 that anything was amiss. everything looked golden on the outside.
 what would be ur reaction if u find out after a week that the
 system was full and numerous processes had to be re-run? how would
 you have figured that was going to happen by looking at the
 INCOMPATIBLE section (which btw - i did carefully before sending my
 mail).
 
 (fortunately i escaped the worst case - but i think this is a real
 call to action)
 
 
 -Original Message-
 From: Arun C Murthy [mailto:[EMAIL PROTECTED]
 Sent: Thu 2/21/2008 11:21 AM
 To: core-user@hadoop.apache.org
 Subject: Re: changes to compression interfaces in 0.15?
 
 Joydeep,
 
 On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote:
 
 Hi developers,
 
 In migrating to 0.15 - i am noticing that the compression interfaces
 have changed:
 
 -  compression type for sequencefile outputs used to be set
 by:
 SequenceFile.setCompressionType()
 
 -  now it seems to be set using:
 sequenceFileOutputFormat.setOutputCompressionType()
 
 
 
 Yes, we added SequenceFileOutputFormat.setOutputCompressionType and
 deprecated the old api. (HADOOP-1851)
 
 
 The change is for the better - but would it be possible to:
 
 -  remove old/dead interfaces. That would have been a
 straightforward hint for applications to look for new interfaces.
 (hadoop-default.xml also still has setting for old conf variable:
 io.seqfile.compression.type)
 
 
 To maintain backward compat, we cannot remove old apis - the standard
 procedure is to deprecate them for the next release and remove them
 in subsequent releases.
 
 -  if possible - document changed interfaces in the release
 notes (there's no way we can find this out by looking at the long
 list
 of Jiras).
 
 
 Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt,
 HADOOP-1851 is listed there. Admittedly we can do better, but that is
 a good place to look for when upgrading to newer releases.
 
 i am not sure how updated the wiki is on the compression stuff (my
 responsibility to update it) - but please do consider the impact of
 
 Please use the forrest-based docs (on the hadoop website - e.g.
 mapred_tutorial.html) rather than the wiki as the gold-standard. The
 reason we moved away from the wiki is precisely this - harder to
 maintain docs per release etc.
 
 changing interfaces on existing applications. (maybe we should have a
 JIRA tag to mark out bugs that change interfaces).
 
 
 
 Again, CHANGES.txt and INCOMPATIBLE CHANGES section for now.
 
 Arun
 
 
 
 As always - thanks for all the fish (err .. working code),
 
 
 
 Joydeep
 
 
 
 
 
 



RE: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread Ajay Anand
We do plan to make the video available online after the event.

Ajay

-Original Message-
From: Tim Wintle [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 21, 2008 12:22 PM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop summit / workshop at Yahoo!

I would certainly appreciate being able to watch them online too, and
they would help spread the word about hadoop - think of all the people
who watch Google's Techtalks (am I allowed to say the G word around
here?).



On Thu, 2008-02-21 at 08:34 +0100, Lukas Vlcek wrote:
 Online webcast/recorded video would be really appreciated by lot of
people.
 Please post the content online! (not only you can target much greater
 audience but you can significantly save on break/lunch/beer food
budget :-).
 Lukas
 
 On Wed, Feb 20, 2008 at 9:10 PM, Ajay Anand [EMAIL PROTECTED]
wrote:
 
  The registration page for the Hadoop summit is now up:
  http://developer.yahoo.com/hadoop/summit/
 
  Space is limited, so please sign up early if you are interested in
  attending.
 
  About the summit:
  Yahoo! is hosting the first summit on Apache Hadoop on March 25th in
  Sunnyvale. The summit is sponsored by the Computing Community
Consortium
  (CCC) and brings together leaders from the Hadoop developer and user
  communities. The speakers will cover topics in the areas of
extensions
  being developed for Hadoop, case studies of applications being built
and
  deployed on Hadoop, and a discussion on future directions for the
  platform.
 
  Agenda:
  8:30-8:55 Breakfast
  8:55-9:00 Welcome to Yahoo!  Logistics - Ajay Anand, Yahoo!
  9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler,
Yahoo!
  9:30-10:00 Pig - Chris Olston, Yahoo!
  10:00-10:30 JAQL - Kevin Beyer, IBM
  10:30-10:45 Break
  10:45-11:15 DryadLINQ - Michael Isard, Microsoft
  11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and
Matei
  Zaharia, UC Berkeley
  11:45-12:15 Zookeeper - Ben Reed, Yahoo!
  12:15-1:15 Lunch
  1:15-1:45 Hbase - Michael Stack, Powerset
  1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf
  2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
  2:45-3:00 Break
  3:00-3:20 Building Ground Models of Southern California - Steve
  Schossler, David O'Hallaron, Intel / CMU
  3:20-3:40 Online search for engineering design content - Mike Haley,
  Autodesk
  3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
  4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland /
  Christophe Bisciglia, Google
  4:30-4:45 Break
  4:45-5:30 Panel on future directions
  5:30-7:00 Happy hour
 
  Look forward to seeing you there!
  Ajay
 
  -Original Message-
  From: Bradford Stephens [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, February 20, 2008 9:17 AM
  To: core-user@hadoop.apache.org
  Subject: Re: Hadoop summit / workshop at Yahoo!
 
  Hrm yes, I'd like to make a visit as well :)
 
  On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote:
 Hey All:
  
 Is this going forward?  I'd like to make plans to attend and the
  sooner I can get plane tickets the happier the bean counters will be
  :-).
  
 Thx,
 C G
  
  Ajay Anand wrote:
   
Yahoo plans to host a summit / workshop on Apache Hadoop at our
Sunnyvale campus on March 25th. Given the interest we are
seeing
  from
developers in a broad range of organizations, this seems like a
  good
time to get together and brief each other on the progress that
is
being
made.
   
   
   
We would like to cover topics in the areas of extensions being
developed
for Hadoop, innovative applications being built and deployed on
Hadoop,
and future extensions to the platform. Some of the speakers who
  have
already committed to present are from organizations such as
IBM,
Intel,
Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!,
and
we are
actively recruiting other leaders in the space.
   
   
   
If you have an innovative application you would like to talk
about,
please let us know. Although there are limitations on the
amount of
time
we have, we would love to hear from you. You can contact me at
[EMAIL PROTECTED]
   
   
   
Thanks and looking forward to hearing about your cool apps,
   
Ajay
   
   
   
   
   
   
--
View this message in context:
 
http://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-tp14889262p15
 
393386.htmlhttp://www.nabble.com/Hadoop-summit---workshop-at-Yahoo%21-t
p14889262p15393386.html
Sent from the Hadoop lucene-users mailing list archive at
  Nabble.com.
   
  
  
  
  
  
  
   -
   Be a better friend, newshound, and know-it-all with Yahoo! Mobile.
  Try it now.
 
 
 
 



define backwards compatibility (was: changes to compression interfaces in 0.15?)

2008-02-21 Thread Joydeep Sen Sarma
Arun - if you can't pull the api - then you must redirect the api to the new call 
that preserves its semantics.

in this case - had we re-implemented SequenceFile.setCompressionType in 0.15 to 
call SequenceFileOutputFormat.setOutputCompressionType() - then it would have 
been a backwards compatible change. + deprecation would have served fair 
warning for eventual pullout.
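
A sketch of the kind of forwarding shim being suggested here (illustrative only, 
not the actual Hadoop source; the old method is shown taking a JobConf for 
simplicity):

    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class SequenceFileCompatShim {
      /**
       * Deprecated entry point kept for one release: callers compiled against
       * the old api keep working because the call is forwarded to the new api.
       * @deprecated use SequenceFileOutputFormat.setOutputCompressionType
       */
      @Deprecated
      public static void setCompressionType(JobConf job, CompressionType val) {
        SequenceFileOutputFormat.setOutputCompressionType(job, val);
      }
    }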

i find the confusion over what backwards compatibility means scary - and i am 
really hoping that the outcome of this thread is a clear definition from the 
committers/hadoop-board of what to reasonably expect (or not!) going forward.



-Original Message-
From: Pete Wyckoff [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 12:47 PM
To: core-user@hadoop.apache.org
Subject: Re: changes to compression interfaces in 0.15?
 

If the API semantics are changing under you, you have to change your code
whether or not the API is pulled or deprecated.  Pulling it makes it more
obvious that the user has to change his/her code.

-- pete


On 2/21/08 12:41 PM, Arun C Murthy [EMAIL PROTECTED] wrote:

 
 On Feb 21, 2008, at 12:20 PM, Joydeep Sen Sarma wrote:
 
 To maintain backward compat, we cannot remove old apis - the standard
 procedure is to deprecate them for the next release and remove them
 in subsequent releases.
 
 you've got to be kidding.
 
 we didn't maintain backwards compatibility. my app broke. Simple
 and straightforward. and the old interfaces are not deprecated (to
 quote 0.15.3 on a 'deprecated' interface:
 
 
 You are right, HADOOP-1851 didn't fix it right. I've filed HADOOP-2869.
 
 We do need to be more diligent about listing config changes in
 CHANGES.txt for starters, and that point is taken. However, we can't
 start pulling out apis without deprecating them first.
 
 Arun
 
 
   /**
* Set the compression type for sequence files.
* @param job the configuration to modify
* @param val the new compression type (none, block, record)
*/
   static public void setCompressionType(Configuration job,
 CompressionType val) {
 )
 
 I (and i would suspect any average user willing to recompile code)
 would much much rather that we broke backwards compatibility
 immediately rather than maintain carry over defunct apis that
 insidiously break application behavior.
 
 and of course - this does not address the point that the option
 strings themselves are deprecated. (remember - people set options
 explicitly from xml files and streaming. not everyone goes through
 java apis)).
 
 --
 
 as one of my dear professors once said - put ur self in the other
 person's shoe. consider that u were in my position and that a
 production app suddenly went from consuming 100G to 1TB. and
 everything slowed down drastically. and it did not give any sign
 that anything was amiss. everything looked golden on the outside.
 what would be ur reaction if u find out after a week that the
 system was full and numerous processes had to be re-run? how would
 you have figured that was going to happen by looking at the
 INCOMPATIBLE section (which btw - i did carefully before sending my
 mail).
 
 (fortunately i escaped the worst case - but i think this is a real
 call to action)
 
 
 -Original Message-
 From: Arun C Murthy [mailto:[EMAIL PROTECTED]
 Sent: Thu 2/21/2008 11:21 AM
 To: core-user@hadoop.apache.org
 Subject: Re: changes to compression interfaces in 0.15?
 
 Joydeep,
 
 On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote:
 
 Hi developers,
 
 In migrating to 0.15 - i am noticing that the compression interfaces
 have changed:
 
 -  compression type for sequencefile outputs used to be set
 by:
 SequenceFile.setCompressionType()
 
 -  now it seems to be set using:
 sequenceFileOutputFormat.setOutputCompressionType()
 
 
 
 Yes, we added SequenceFileOutputFormat.setOutputCompressionType and
 deprecated the old api. (HADOOP-1851)
 
 
 The change is for the better - but would it be possible to:
 
 -  remove old/dead interfaces. That would have been a
 straightforward hint for applications to look for new interfaces.
 (hadoop-default.xml also still has setting for old conf variable:
 io.seqfile.compression.type)
 
 
 To maintain backward compat, we cannot remove old apis - the standard
 procedure is to deprecate them for the next release and remove them
 in subsequent releases.
 
 -  if possible - document changed interfaces in the release
 notes (there's no way we can find this out by looking at the long
 list
 of Jiras).
 
 
 Please look at the INCOMPATIBLE CHANGES section of CHANGES.txt,
 HADOOP-1851 is listed there. Admittedly we can do better, but that is
 a good place to look for when upgrading to newer releases.
 
 i am not sure how updated the wiki is on the compression stuff (my
 responsibility to update it) - but please do consider the impact of
 
 Please use the forrest-based docs (on the hadoop website - e.g.
 mapred_tutorial.html) rather than the wiki 

Python access to HDFS

2008-02-21 Thread Steve Sapovits


Are there any existing HDFS access packages out there for Python?

I've had some success using SWIG and the C HDFS code, as documented
here:

http://www.stat.purdue.edu/~sguha/code.html

(halfway down the page) but it's slow adding support for some of the more
complex functions.  If there's anything out there I missed, I'd like to hear
about it.

--
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
[EMAIL PROTECTED]




Re: define backwards compatibility

2008-02-21 Thread Doug Cutting

Joydeep Sen Sarma wrote:

i find the confusion over what backwards compatibility means scary - and i am 
really hoping that the outcome of this thread is a clear definition from the 
committers/hadoop-board of what to reasonably expect (or not!) going forward.


The goal is clear: code that compiles and runs warning-free in one 
release should not have to be altered to try the next release.  It 
may generate warnings, and these should be addressed before another 
upgrade is attempted.


Sometimes it is not possible to achieve this.  In these cases 
applications should fail with a clear error message, either at 
compilation or runtime.


In both cases, incompatible changes should be well documented in the 
release notes.


This is described (in part) in http://wiki.apache.org/hadoop/Roadmap

That's the goal.  Implementing and enforcing it is another story.  For 
that we depend on developer and user vigilance.  The current issue seems 
a case of failure to implement the policy rather than a lack of policy.


Doug


Re: Questions regarding configuration parameters...

2008-02-21 Thread Andy Li
Try these two parameters to utilize all the cores per node/host.

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

The default value is 2, so you might see only 2 cores used by Hadoop per
node/host.
If each system/machine has 4 cores (two dual-core CPUs), then you can change
them to 3.
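
These per-tasktracker maxima cap how many tasks run concurrently on each node; 
the mapred.map.tasks / mapred.reduce.tasks values from the original question are 
per-job totals, which can also be set in code. A minimal sketch, assuming an 
existing JobConf and reusing the 173/23 figures from the question:

    import org.apache.hadoop.mapred.JobConf;

    public class TaskCountExample {
      // Per-job task counts; the per-node concurrency limits come from the
      // mapred.tasktracker.{map,reduce}.tasks.maximum properties shown above.
      public static void configure(JobConf conf) {
        conf.setNumMapTasks(173);    // a hint; the actual map count follows the input splits
        conf.setNumReduceTasks(23);  // sets the number of reduce tasks directly
      }
    }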

Hope this works for you.

-Andy


On Wed, Feb 20, 2008 at 9:30 AM, C G [EMAIL PROTECTED] wrote:

 Hi All:

  The documentation for the configuration parameters mapred.map.tasks and
 mapred.reduce.tasks discuss these  values in terms of number of available
 hosts in the grid.  This description strikes me as a bit odd given that a
 host could be anything from a uniprocessor to an N-way box, where values
 for N could vary from 2..16 or more.  The documentation is also vague about
 computing the actual value.  For example, for mapred.map.tasks the doc
 says …a prime number several times greater….  I'm curious about how people
 are interpreting the descriptions and what values people are using.
  Specifically, I'm wondering if I should be using core count instead of
 host count to set these values.

  In the specific case of my system, we have 24 hosts where each host is a
 4-way system (i.e. 96 cores total).  For mapred.map.tasks I chose the
 value 173, as that is a prime number which is near 7*24.  For
 mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
  Is this what was intended?

  Beyond curiousity, I'm concerned about setting these values and other
 configuration parameters correctly because I am pursuing some performance
 issues where it is taking a very long time to process small amounts of data.
  I am hoping that some amount of tuning will resolve the problems.

  Any thoughts and insights most appreciated.

  Thanks,
   C G



 -
 Never miss a thing.   Make Yahoo your homepage.



Re: Sorting output data on value

2008-02-21 Thread Owen O'Malley


On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote:

It may be sorted within the output for a single reducer and,  
indeed, you can
even guarantee that it is sorted but *only* by the reduce key.  The  
order

that values appear will not be deterministic.


Actually, there is a better answer for this. If you put both the  
primary and secondary key into the key, you can use  
JobConf.setOutputValueGroupingComparator to set a comparator that  
only compares the primary key. Reduce will be called once per  
primary key, but all of the values will be sorted by the secondary key.


See http://tinyurl.com/32gld4
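
A hedged sketch of that setup, assuming a composite Text key of the 
(hypothetical) form "primary\tsecondary": the default sort orders the full 
composite key, while the grouping comparator registered via 
JobConf.setOutputValueGroupingComparator compares only the primary part. A 
partitioner that also looks only at the primary part is needed so that all 
records for one primary key reach the same reducer.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;

    // Groups composite keys of the assumed form "primary\tsecondary" by the
    // primary part only; the normal key sort still orders the full composite
    // key, so within one reduce call the values arrive ordered by the
    // secondary part.
    public class PrimaryKeyGroupingComparator extends WritableComparator {

      public PrimaryKeyGroupingComparator() {
        super(Text.class, true);  // instantiate Text keys for comparison
      }

      public int compare(WritableComparable a, WritableComparable b) {
        return primary(a.toString()).compareTo(primary(b.toString()));
      }

      private static String primary(String composite) {
        int tab = composite.indexOf('\t');
        return tab < 0 ? composite : composite.substring(0, tab);
      }

      // Hypothetical job setup showing where the comparator is registered.
      public static void configure(JobConf conf) {
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueGroupingComparator(PrimaryKeyGroupingComparator.class);
      }
    }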

-- Owen


Problems running a HOD test cluster

2008-02-21 Thread Luca

Hello everyone,
	I've been trying to run HOD on a sample cluster with three nodes that 
already have Torque installed and (hopefully?) properly working. I also 
prepared a configuration file for hod, that I'm gonna paste at the end 
of this email.


A few questions:
- is Java6 ok for HOD?
- I have an externally running HDFS cluster, as specified in 
[gridservice-hdfs]: how do I find out the fs_port of my cluster? Is it 
something specified in the hadoop-site.xml file?
- what should I expect at the end of an allocate command? Currently what 
I get is the output below, but should I in theory return to the 
shell prompt, to issue a hadoop command?



[2008-02-21 19:45:34,349] DEBUG/10 hod:144 - ('server.com', 10029)
[2008-02-21 19:45:34,350] INFO/20 hod:216 - Service Registry Started.
[2008-02-21 19:45:34,353] DEBUG/10 hadoop:425 - allocate 
/mnt/scratch/grid/test 3 3
[2008-02-21 19:45:34,357] DEBUG/10 torque:72 - ringmaster cmd: 
/mnt/scratch/grid/hod/bin/ringmaster 
--hodring.tarball-retry-initial-time 1.0 
--hodring.cmd-retry-initial-time 2.0 --hodring.http-port-range 
1-11000 --hodring.log-dir /mnt/scratch/grid/hod/logs 
--hodring.temp-dir /tmp/hod --hodring.register --hodring.userid hadoop 
--hodring.java-home /usr/java/jdk1.6.0_04 
--hodring.tarball-retry-interval 3.0 --hodring.cmd-retry-interval 2.0 
--hodring.xrs-port-range 1-11000 --hodring.debug 4 
--resource_manager.queue hadoop --resource_manager.env-vars 
HOD_PYTHON_HOME=/usr/bin/python2.5 --resource_manager.id torque 
--resource_manager.batch-home /usr --gridservice-hdfs.fs_port 10007 
--gridservice-hdfs.host localhost --gridservice-hdfs.pkgs 
/mnt/scratch/grid/hadoop/current --gridservice-hdfs.info_port 10009 
--gridservice-hdfs.external --ringmaster.http-port-range 1-11000 
--ringmaster.hadoop-tar-ball hadoop/hadoop-releases/hadoop-0.16.0.tar.gz 
--ringmaster.temp-dir /tmp/hod --ringmaster.register --ringmaster.userid 
hadoop --ringmaster.work-dirs /tmp/hod/1,/tmp/hod/2 
--ringmaster.svcrgy-addr server.com:10029 --ringmaster.log-dir 
/mnt/scratch/grid/hod/logs --ringmaster.max-connect 30 
--ringmaster.xrs-port-range 1-11000 --ringmaster.jt-poll-interval 
120 --ringmaster.debug 4 --ringmaster.idleness-limit 3600 
--gridservice-mapred.tracker_port 10003 --gridservice-mapred.host 
localhost --gridservice-mapred.pkgs /mnt/scratch/grid/hadoop/current 
--gridservice-mapred.info_port 10008
[2008-02-21 19:45:34,361] DEBUG/10 torque:44 - qsub - /usr/bin/qsub -l 
nodes=3 -W x= -l nodes=3 -W x= -N HOD -r n -d /tmp/ -q hadoop -v 
HOD_PYTHON_HOME=/usr/bin/python2.5

[2008-02-21 19:45:34,373] DEBUG/10 torque:54 - qsub stdin: #!/bin/sh
[2008-02-21 19:45:34,374] DEBUG/10 torque:54 - qsub stdin: 
/mnt/scratch/grid/hod/bin/ringmaster 
--hodring.tarball-retry-initial-time 1.0 
--hodring.cmd-retry-initial-time 2.0 --hodring.http-port-range 
1-11000 --hodring.log-dir /mnt/scratch/grid/hod/logs 
--hodring.temp-dir /tmp/hod --hodring.register --hodring.userid hadoop 
--hodring.java-home /usr/java/jdk1.6.0_04 
--hodring.tarball-retry-interval 3.0 --hodring.cmd-retry-interval 2.0 
--hodring.xrs-port-range 1-11000 --hodring.debug 4 
--resource_manager.queue hadoop --resource_manager.env-vars 
HOD_PYTHON_HOME=/usr/bin/python2.5 --resource_manager.id torque 
--resource_manager.batch-home /usr --gridservice-hdfs.fs_port 10007 
--gridservice-hdfs.host localhost --gridservice-hdfs.pkgs 
/mnt/scratch/grid/hadoop/current --gridservice-hdfs.info_port 10009 
--gridservice-hdfs.external --ringmaster.http-port-range 1-11000 
--ringmaster.hadoop-tar-ball hadoop/hadoop-releases/hadoop-0.16.0.tar.gz 
--ringmaster.temp-dir /tmp/hod --ringmaster.register --ringmaster.userid 
hadoop --ringmaster.work-dirs /tmp/hod/1,/tmp/hod/2 
--ringmaster.svcrgy-addr server.com:10029 --ringmaster.log-dir 
/mnt/scratch/grid/hod/logs --ringmaster.max-connect 30 
--ringmaster.xrs-port-range 1-11000 --ringmaster.jt-poll-interval 
120 --ringmaster.debug 4 --ringmaster.idleness-limit 3600 
--gridservice-mapred.tracker_port 10003 --gridservice-mapred.host 
localhost --gridservice-mapred.pkgs /mnt/scratch/grid/hadoop/current 
--gridservice-mapred.info_port 10008

[2008-02-21 19:45:36,385] DEBUG/10 torque:76 - qsub jobid: 207.server.com
[2008-02-21 19:45:36,389] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:38,952] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:41,524] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:44,066] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:46,612] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:49,155] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:51,696] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:54,236] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 
207.server.com
[2008-02-21 19:45:56,797] DEBUG/10 torque:87 - /usr/bin/qstat -f -1 

RE: Questions regarding configuration parameters...

2008-02-21 Thread C G
My performance problems fall into 2 categories:
   
  1.  Extremely slow reduce phases - our map phases march along at impressive 
speed, but during reduce phases most nodes go idle...the active machines mostly 
clunk along at 10-30% CPU.  Compare this to the map phase where I get all grid 
nodes cranking away at  100% CPU.  This is a vague explanation I realize.
   
  2.  Pregnant pauses during dfs -copyToLocal and -cat operations.  Frequently 
I'll be iterating over a list of HDFS files cat-ing them into one file to bulk 
load into a database.  Many times I'll see one of the copies/cats sit for 
anywhere from 2-5 minutes.  During that time no data is transferred, all nodes 
are idle, and absolutely nothing is written to any of the logs.  The file sizes 
being copied are relatively small...less than 1G each in most cases.
   
  Both of these issues persist in 0.16.0 and definitely have me puzzled.  I'm 
sure that I'm doing something wrong/non-optimal w/r/t slow reduce phases, but 
the long pauses during a dfs command line operation seems like a bug to me.  
Unfortunately I've not seen anybody else report this.
   
  Any thoughts/ideas most welcome...
   
  Thanks,
  C G
  

Joydeep Sen Sarma [EMAIL PROTECTED] wrote:
  
 The default value are 2 so you might only see 2 cores used by Hadoop per
 node/host.

that's 2 each for map and reduce. so theoretically - one could fully utilize a 
4 core box with this setting. in practice - a little bit of oversubscription (3 
each on a 4 core) seems to be working out well for us (maybe overlapping some 
compute and io - but mostly we are trading off for higher # concurrent jobs 
against per job latency).

unlikely that these settings are causing slowness in processing small amounts 
of data. send more details - what's slow (map/shuffle/reduce)? check cpu 
consumption when map task is running .. etc.


-Original Message-
From: Andy Li [mailto:[EMAIL PROTECTED]
Sent: Thu 2/21/2008 2:36 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions regarding configuration parameters...

Try the 2 parameters to utilize all the cores per node/host.



<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>




The default value are 2 so you might only see 2 cores used by Hadoop per
node/host.
If each system/machine has 4 cores (dual dual core), then you can change
them to 3.

Hope this works for you.

-Andy


On Wed, Feb 20, 2008 at 9:30 AM, C G 
wrote:

 Hi All:

 The documentation for the configuration parameters mapred.map.tasks and
 mapred.reduce.tasks discuss these values in terms of number of available
 hosts in the grid. This description strikes me as a bit odd given that a
 host could be anything from a uniprocessor to an N-way box, where values
 for N could vary from 2..16 or more. The documentation is also vague about
 computing the actual value. For example, for mapred.map.tasks the doc
 says .a prime number several times greater.. I'm curious about how people
 are interpreting the descriptions and what values people are using.
 Specifically, I'm wondering if I should be using core count instead of
 host count to set these values.

 In the specific case of my system, we have 24 hosts where each host is a
 4-way system (i.e. 96 cores total). For mapred.map.tasks I chose the
 value 173, as that is a prime number which is near 7*24. For
 mapred.reduce.tasks I chose 23 since that is a prime number close to 24.
 Is this what was intended?

 Beyond curiousity, I'm concerned about setting these values and other
 configuration parameters correctly because I am pursuing some performance
 issues where it is taking a very long time to process small amounts of data.
 I am hoping that some amount of tuning will resolve the problems.

 Any thoughts and insights most appreciated.

 Thanks,
 C G



 -
 Never miss a thing. Make Yahoo your homepage.




   
-
Looking for last minute shopping deals?  Find them fast with Yahoo! Search.

Re: Sorting output data on value

2008-02-21 Thread Ted Dunning

But this only guarantees that the results will be sorted within each
reducer's input.  Thus, this won't result in getting the results sorted by
the reducer's output value.


On 2/21/08 8:40 PM, Owen O'Malley [EMAIL PROTECTED] wrote:

 
 On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote:
 
 It may be sorted within the output for a single reducer and,
 indeed, you can
 even guarantee that it is sorted but *only* by the reduce key.  The
 order
 that values appear will not be deterministic.
 
 Actually, there is a better answer for this. If you put both the
 primary and secondary key into the key, you can use
 JobConf.setOutputValueGroupingComparator to set a comparator that
 only compares the primary key. Reduce will be called once per a
 primary key, but all of the values will be sorted by the secondary key.
 
 See http://tinyurl.com/32gld4
 
 -- Owen