New York user group?

2008-07-18 Thread Alex Dorman
Please let me know if you would be interested in joining a NY Hadoop user group 
if one existed. 

I know about 5-6 people in New York City running Hadoop. I am sure there are 
many more. 

 

Let me know. If there is some interest, I will try to put together a first 
meeting.

 

thanks  

-Alex

 

 

 

ALEX DORMAN
[EMAIL PROTECTED]
contextweb.com




Is Hadoop compatible with IBM JDK 1.5 64 bit for AIX 5?

2008-07-18 Thread Amber
The Hadoop documentation says Sun's JDK must be used; I am posting this message 
to confirm whether there is an official statement about this.

RE: New York user group?

2008-07-18 Thread Leon Yu

Yes. I am interested.

 Date: Fri, 18 Jul 2008 05:59:33 -0700
 From: [EMAIL PROTECTED]
 Subject: New York user group?
 To: core-user@hadoop.apache.org

 Please let me know if you would be interested in joining NY Hadoop user group
 if one existed.

 I know about 5-6 people in New York City running Hadoop. I am sure there are
 many more. Let me know. If there is some interest, I will try to put together
 first meeting.

 thanks
 -Alex

 ALEX DORMAN
 [EMAIL PROTECTED]
 contextweb.com

Hadoop 0.17.1 namenode service can't start on Windows XP.

2008-07-18 Thread Amber
Hi, I followed the instructions from 
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/ to install Hadoop 
0.17.1 on my Windows XP computer, whose computer name is AMBER, and the current 
user name is User. I installed Cygwin on G:\. I have verified that ssh and 
bin/hadoop version work fine. But when trying to start the DFS service I found 
the following problems:

1. Hadoop can't create the logs directory automatically if it does not exist in 
the install directory.
2. The datanode service can automatically create the 
G:\tmp\hadoop-SYSTEM\dfs\data directory, but the namenode service can't 
automatically create the G:\tmp\hadoop-User directory and its subdirectories. 
Even after I manually created the G:\tmp\hadoop-User\dfs\name\image directory, 
the namenode service still couldn't start. I found the following exceptions in 
the namenode's log file:
2008-07-18 22:11:46,578 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG: 
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = amber/116.76.140.27
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.1
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; 
compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008
/
2008-07-18 22:11:47,234 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
Initializing RPC Metrics with hostName=NameNode, port=47110
2008-07-18 22:11:47,250 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: 
localhost/127.0.0.1:47110
2008-07-18 22:11:47,265 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-07-18 22:11:47,281 INFO org.apache.hadoop.dfs.NameNodeMetrics: 
Initializing NameNodeMeterics using context 
object:org.apache.hadoop.metrics.spi.NullContext
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: 
fsOwner=User,None,root,Administrators,Users,ORA_DBA
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: 
supergroup=supergroup
2008-07-18 22:11:48,296 INFO org.apache.hadoop.fs.FSNamesystem: 
isPermissionEnabled=true
2008-07-18 22:11:48,359 INFO org.apache.hadoop.dfs.Storage: Storage directory 
G:\tmp\hadoop-User\dfs\name does not exist.
2008-07-18 22:11:48,359 INFO org.apache.hadoop.ipc.Server: Stopping server on 
47110
2008-07-18 22:11:48,359 ERROR org.apache.hadoop.dfs.NameNode: 
org.apache.hadoop.dfs.InconsistentFSStateException: Directory 
G:\tmp\hadoop-User\dfs\name is in an inconsistent state: storage directory does 
not exist or is not accessible.
 at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:154)
 at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
 at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
 at org.apache.hadoop.dfs.FSNamesystem.init(FSNamesystem.java:255)
 at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
 at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
 at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
 at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
 at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

2008-07-18 22:11:48,359 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG: 
/
SHUTDOWN_MSG: Shutting down NameNode at amber/116.76.140.27
/
2008-07-18 22:26:35,734 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG: 
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = amber/116.76.140.27
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.1
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; 
compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008
/
2008-07-18 22:26:36,046 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
Initializing RPC Metrics with hostName=NameNode, port=47110
2008-07-18 22:26:36,062 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: 
localhost/127.0.0.1:47110
2008-07-18 22:26:36,062 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null
2008-07-18 22:26:36,093 INFO org.apache.hadoop.dfs.NameNodeMetrics: 
Initializing NameNodeMeterics using context 
object:org.apache.hadoop.metrics.spi.NullContext
2008-07-18 22:26:37,421 INFO org.apache.hadoop.fs.FSNamesystem: 
fsOwner=User,None,root,Administrators,Users,ORA_DBA
2008-07-18 22:26:37,421 INFO org.apache.hadoop.fs.FSNamesystem: 
supergroup=supergroup
2008-07-18 22:26:37,421 INFO org.apache.hadoop.fs.FSNamesystem: 
isPermissionEnabled=true
2008-07-18 22:26:37,515 INFO org.apache.hadoop.dfs.Storage: Storage directory 
G:\tmp\hadoop-User\dfs\name does not exist.
2008-07-18 22:26:37,515 INFO org.apache.hadoop.ipc.Server: Stopping server on 
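
A likely explanation for the "Storage directory ... does not exist" error above, 
offered as a guess rather than an official answer: the name directory is only 
created when the namenode is formatted (bin/hadoop namenode -format) before its 
first start, and its location comes from hadoop.tmp.dir / dfs.name.dir. A 
minimal sketch to check where the namenode will look under the 0.17-era 
defaults (the class below is illustrative, not part of Hadoop):

import org.apache.hadoop.conf.Configuration;

// Hedged sketch: print the configured name-directory location. With the 0.17-era
// defaults, hadoop.tmp.dir resolves to /tmp/hadoop-${user.name} (G:\tmp\hadoop-User
// under Cygwin on G:) and dfs.name.dir to ${hadoop.tmp.dir}/dfs/name; that directory
// only gets created and populated by running "bin/hadoop namenode -format".
public class ShowNameDir {
  public static void main(String[] args) {
    Configuration conf = new Configuration();   // loads hadoop-default.xml and hadoop-site.xml
    System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
    System.out.println("dfs.name.dir   = " + conf.get("dfs.name.dir"));
  }
}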

Re: Is Hadoop compatible with IBM JDK 1.5 64 bit for AIX 5?

2008-07-18 Thread Colin Freas
I'm not sure if this is useful info, but I used both the Sun and the IBM JDK
under Linux to run version 0.16.x (I forget which) of Hadoop, without any
problems.  I did some brief performance testing, didn't see any significant
difference, and then we switched over to the Sun JDK exclusively as per the
recommendation of the docs.

-Colin

On Fri, Jul 18, 2008 at 9:24 AM, Amber [EMAIL PROTECTED] wrote:

 The Hadoop documentation says Sun's JDK must be used, this message is
 post to make sure that there is official statement about this.


using too many mappers?

2008-07-18 Thread Ashish Venugopal
Is it possible that using too many mappers causes issues in Hadoop 0.17.1? I
have an input data directory with 100 files in it. I am running a job that
takes these files as input. When I set -jobconf mapred.map.tasks=200 in
the job invocation, it seems like the mappers received empty inputs (which
my binary does not cleanly handle). When I unset the mapred.map.tasks
parameter, the job runs fine, and many mappers do get used because the input
files are manually split. Can anyone offer an explanation? Have there been
changes in the use of this parameter between 0.16.4 and 0.17.1?
Ashish
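
One thing worth noting, offered as a hedged sketch rather than a definitive
answer: mapred.map.tasks is only a hint to the InputFormat, and the real number
of map tasks comes from the computed input splits, so asking for far more maps
than there is data can yield very small or empty splits. A minimal illustration
against the Java API (the class name below is made up):

import org.apache.hadoop.mapred.JobConf;

// Hedged sketch: setNumMapTasks / mapred.map.tasks is a hint, not a hard setting;
// FileInputFormat still decides the actual split count from the input files, and a
// hint far above the amount of data can produce tiny or empty splits.
public class MapHintExample {
  public static void main(String[] args) {
    JobConf job = new JobConf(MapHintExample.class);
    job.setNumMapTasks(200);                          // equivalent of -jobconf mapred.map.tasks=200
    System.out.println(job.get("mapred.map.tasks"));  // prints 200, but it is only advisory
  }
}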


Re: Reduce stalling

2008-07-18 Thread brainstorm
I'm having the same problem :-/ Maps are going fine while the reduce phase
stalls at 9-16%, and then resumes after a long while (30-40 minutes).

I'm using Hadoop 0.16.0 (r618351) and the wordcount Hadoop example... next
week I'll try a newer Hadoop version (perhaps trunk) to see if I
can reproduce this issue :-S

On Sat, Jun 21, 2008 at 8:28 AM, Arnie Horta [EMAIL PROTECTED] wrote:
 Hello,


 I am having a problem...when I configure more than one node on my hadoop
 cluster, the reduce jobs all stall. The logs look EXACTLY like the logs
 described by Amit Kumar Singh in his post last month. I have checked that
 the disk is fine (I formatted the HDFS and deleted all the data on the data
 nodes to make sure) and this is not a firewall issue or a problem with
 /etc/hosts. It works fine with one node, but adding a second node starts the
 stalling.


 I have tried using both .16.2 and .16.4, to no avail.


 HELP!



Timeouts when running balancer

2008-07-18 Thread David J. O'Dell
I'm trying to rebalance my cluster as I've added two more nodes.
When I run balancer with the default threshold I am seeing timeouts in
the logs:

2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Decided to
move block -8432927406854991437 with a length of 128 MB bytes from
10.11.6.234:50010 to 10.11.6.235:50010 using proxy source 10.11.6.234:50010
2008-07-18 09:50:46,636 INFO org.apache.hadoop.dfs.Balancer: Starting
Block mover for -8432927406854991437 from 10.11.6.234:50010 to
10.11.6.235:50010
2008-07-18 09:52:46,826 WARN org.apache.hadoop.dfs.Balancer: Timeout
moving block -8432927406854991437 from 10.11.6.234:50010 to
10.11.6.235:50010 through 10.11.6.234:50010

I read in the balancer guide
(http://issues.apache.org/jira/secure/attachment/12370966/BalancerUserGuide2)
that the default transfer rate is 1 MB/sec.
I tried increasing this to 1 GB/sec but I'm still seeing the timeouts.
All of the nodes have GigE NICs and are on the same switch.
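
For reference, a hedged note: if the cap in question is the
dfs.balance.bandwidthPerSec datanode property (an assumption based on the
0.17-era hadoop-default.xml, where it defaults to 1048576 bytes/sec), its value
is in bytes per second, so 1 GB/sec would be 1073741824, and it has to be set
on the datanodes and the datanodes restarted to take effect. A small sketch
that just reads the value (the class name below is made up):

import org.apache.hadoop.conf.Configuration;

// Hedged sketch: read the balancer bandwidth cap in bytes/sec. The property name
// and its 1 MB/s default are assumptions based on the 0.17-era hadoop-default.xml.
public class ShowBalancerBandwidth {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    long bytesPerSec = conf.getLong("dfs.balance.bandwidthPerSec", 1024 * 1024);
    System.out.println(bytesPerSec + " bytes/sec = " + (bytesPerSec / (1024 * 1024)) + " MB/sec");
  }
}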


-- 
David O'Dell
Director, Operations
e: [EMAIL PROTECTED]
t:  (415) 738-5152
180 Townsend St., Third Floor
San Francisco, CA 94107 



help request: 0.16.0 java.io.IOException: Filesystem closed org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find task_....

2008-07-18 Thread Jason Venner

I am seeing an odd mix of errors in a job we have running on a particular 
cluster of machines.
Has anyone seen this before, and what is actually the problem?
We are running Linux (CentOS 5.1, on 8-way Xeons, with all disks under RAID 5) 
with GigE switches between the machines.
The namenode machine does not run a datanode or a tasktracker and is 
essentially idle.

Thanks!


2008-07-18 09:25:43,626 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
java.io.IOException: Filesystem closed
at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:158)
at org.apache.hadoop.dfs.DFSClient.access$500(DFSClient.java:58)
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.close(DFSClient.java:1095)
at java.io.FilterInputStream.close(FilterInputStream.java:155)
at 
org.apache.hadoop.mapred.LineRecordReader$LineReader.close(LineRecordReader.java:97)
at 
org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:277)
at 
org.apache.hadoop.mapred.KeyValueLineRecordReader.close(KeyValueLineRecordReader.java:113)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:155)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:212)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)

and

2008-07-18 04:26:41,056 WARN org.apache.hadoop.mapred.TaskTracker: Error 
running child
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find 
task_200807141819_0007_m_001981_1/spill0.out in any of the configured local 
directories
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:359)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at 
org.apache.hadoop.mapred.MapOutputFile.getSpillFile(MapOutputFile.java:77)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpillToDisk(MapTask.java:464)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:713)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:209)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)
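
One pattern that has produced this exact "Filesystem closed" trace elsewhere,
offered as a hedged guess rather than a diagnosis of this particular job: in
many Hadoop versions FileSystem.get() hands back a shared, cached instance, so
a close() in map/reduce user code can pull the filesystem out from under the
framework's record reader. A small illustration (class name and paths are made
up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged illustration: FileSystem.get() may return a shared, cached instance, so
// closing it in user code can later surface as "java.io.IOException: Filesystem
// closed" when something else (e.g. a record reader) touches the same instance.
public class SharedFsPitfall {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.exists(new Path("/")));
    // fs.close();   // a call like this in user code is the kind of thing to look for
    System.out.println(fs.exists(new Path("/")));  // would fail if fs had been closed above
  }
}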


--
Jason Venner
Attributor - Program the Web http://www.attributor.com/
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


What is the difference between streaming options: -file and -CacheFile ?

2008-07-18 Thread Steve Gao
Seems that they mean the same thing, right?
Another confusing pair of options is -NumReduceTasks and -jobconf 
mapred.reduce.tasks. Both are used to control (or give a hint about) the number 
of reducers.



  

Re: can hadoop read files backwards

2008-07-18 Thread Elia Mazzawi

Well, here is the problem I'm trying to solve.

I have a data set that looks like this:

ID    Type   Timestamp

A1    X      1215647404
A2    X      1215647405
A3    X      1215647406
A1    Y      1215647409

I want to count how many A1 Y records show up within 5 seconds of an A1 X.

I was planning to have the data sorted by ID then timestamp,
then read it backwards (or have it sorted by reverse timestamp),

and go through it, caching all Y's for the same ID for 5 seconds, to either 
find a matching X or not.

The results don't need to be 100% accurate.

So if Hadoop gives me the same file with the same lines in order, then this 
will work.

It seems Hadoop is really good at solving problems that depend on one line at 
a time, but not on multiple lines?

Hadoop has to get data in order, and be able to work on multiple lines; 
otherwise, how could it be setting records in data sorts?

I'd appreciate other suggestions on how to go about doing this.
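
One possible way to attack it with a single MapReduce job, sketched roughly
against the 0.17-era org.apache.hadoop.mapred API (class and field names are
made up, and the code is untested): key every record by its ID in the map
phase, then sort each ID's events by timestamp inside the reducer and count
the Y events that fall within 5 seconds of a preceding X. The order in which
input lines arrive then stops mattering, because each reduce call sees all
events for one ID.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WindowCount {

  // Map: key every input line by its ID so all events for one ID meet in one reduce call.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] f = line.toString().trim().split("\\s+");       // ID, type, timestamp
      if (f.length == 3) {
        out.collect(new Text(f[0]), new Text(f[1] + "\t" + f[2]));
      }
    }
  }

  // Reduce: buffer this ID's (type, timestamp) pairs, sort by timestamp, then count
  // the Y events that occur within 5 seconds after an X event.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, LongWritable> {
    public void reduce(Text id, Iterator<Text> values,
                       OutputCollector<Text, LongWritable> out, Reporter r)
        throws IOException {
      List<long[]> events = new ArrayList<long[]>();            // {timestamp, isY}
      while (values.hasNext()) {
        String[] f = values.next().toString().split("\t");
        events.add(new long[] { Long.parseLong(f[1]), "Y".equals(f[0]) ? 1 : 0 });
      }
      Collections.sort(events, new Comparator<long[]>() {
        public int compare(long[] a, long[] b) { return Long.signum(a[0] - b[0]); }
      });
      long count = 0;
      long lastX = -1;                                           // timestamp of the last X seen
      for (long[] e : events) {
        if (e[1] == 0) lastX = e[0];                             // X: remember when it happened
        else if (lastX >= 0 && e[0] - lastX <= 5) count++;       // Y within 5 seconds of an X
      }
      out.collect(id, new LongWritable(count));
    }
  }
}

Buffering per ID in the reducer is fine as long as no single ID has an enormous
number of events; a secondary sort on timestamp would avoid the in-memory sort
if that ever becomes a problem.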

Jim R. Wilson wrote:

Does wordcount get the lines in order, or are they random? Can I have
Hadoop return them in reverse order?



You can't really depend on the order that the lines are given - it's
best to think of them as random.  The purpose of MapReduce/Hadoop is
to distribute a problem among a number of cooperating nodes.

The idea is that any given line can be interpreted separately,
completely independent of any other line.  So in wordcount, this makes
sense.  For example, say you and I are nodes. Each of us gets half the
lines in a file and we can count the words we see and report on them -
it doesn't matter what order we're given the lines, or which lines
we're given, or even whether we get the same number of lines (if
you're faster at it, or maybe you get shorter lines, you may get more
lines to process in the interest of saving time).

So if the project you're working on requires getting the lines in a
particular order, then you probably need to rethink your approach. It
may be that hadoop isn't right for your problem, or maybe that the
problem just needs to be attacked in a different way.  Without knowing
more about what you're trying to achieve, I can't offer any specifics.

Good luck!

-- Jim

On Thu, Jul 17, 2008 at 4:41 PM, Elia Mazzawi
[EMAIL PROTECTED] wrote:
  

I have a program based on wordcount.java,
and I have files that are smaller than 64 MB (so I believe each file is
one task).

So does wordcount get the lines in order, or are they random? Can I have
Hadoop return them in reverse order?

Jim R. Wilson wrote:


It sounds to me like you're talking about hadoop streaming (correct me
if I'm wrong there).  In that case, there's really no order to the
lines being doled out as I understand it.  Any given line could be
handed to any given mapper task running on any given node.

I may be wrong, of course, someone closer to the project could give
you the right answer in that case.

-- Jim R. Wilson (jimbojw)

On Thu, Jul 17, 2008 at 4:06 PM, Elia Mazzawi
[EMAIL PROTECTED] wrote:

  

Is there a way to have Hadoop hand over the lines of a file backwards to my
mapper? As in, give the last line first.








Re: [PIG LATIN] how to get the size of a data bag

2008-07-18 Thread Arun C Murthy

Charles,

 The right forum for Pig is [EMAIL PROTECTED], I'm  
redirecting you there... good luck!


Arun

On Jul 18, 2008, at 11:51 AM, charles du wrote:


Hi:

Just started learning Hadoop and Pig Latin. How can I get the number of
elements in a data bag?

For example, a data bag like the following has four elements:
  B = {1, 2, 3, 5}

I tried C = COUNT(B), but it did not work. Thanks.

--
tp




[Streaming]What is the difference between streaming options: -file and -CacheFile ?

2008-07-18 Thread Steve Gao
Hi All,  
    I am using Hadoop Streaming. I am confused by the streaming options -file 
and -CacheFile. Seems that they mean the same thing, right?

    Another confusing pair of options is -NumReduceTasks and -jobconf 
mapred.reduce.tasks. Both are used to control (or give a hint about) the number 
of reducers.

  Thanks



  

Re: [Streaming]What is the difference between streaming options: -file and -CacheFile ?

2008-07-18 Thread Arun C Murthy


On Jul 18, 2008, at 4:53 PM, Steve Gao wrote:


Hi All,
I am using Hadoop Streaming. I am confused by streaming  
options: -file and -CacheFile. Seems that they mean the same thing,  
right?




The difference is that -file will 'ship' your file (local file) to  
the cluster, while -cachefile assumes that it is already present on  
HDFS at the given path.


Another misleading options are : -NumReduceTasks and -jobconf  
mapred.reduce.tasks. Both are used to control (or give hit to) the  
number of reducers.




Yes, they are both equivalent.

hth,
Arun
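
For anyone mapping the streaming options back onto the Java API, a hedged
sketch (the HDFS path and symlink name below are made up, and this is offered
as an analogy rather than a statement of exactly what streaming does
internally): -file ships a local file with the job, while -cacheFile behaves
like registering an already-uploaded HDFS file through the DistributedCache
API.

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

// Hedged sketch of the -cacheFile analogue: register a file that already lives on
// HDFS so it is localized onto each task node; the #dict fragment names the symlink
// the task sees in its working directory.
public class CacheFileExample {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheFileExample.class);
    DistributedCache.addCacheFile(new URI("hdfs:///user/steve/dict.txt#dict"), conf);
    DistributedCache.createSymlink(conf);   // ask the framework to create the #dict symlink
  }
}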


Re: [Streaming]What is the difference between streaming options: -file and -CacheFile ?

2008-07-18 Thread Steve Gao
One more little question: why is Hadoop Streaming designed this way, with two 
different options that do the same thing (i.e. control the number of reducers)? 
What's the point?
Thanks

--- On Fri, 7/18/08, Arun C Murthy [EMAIL PROTECTED] wrote:
From: Arun C Murthy [EMAIL PROTECTED]
Subject: Re: [Streaming]What is the difference between streaming options: -file 
and -CacheFile ?
To: core-user@hadoop.apache.org, Steve Gao [EMAIL PROTECTED]
Date: Friday, July 18, 2008, 8:27 PM

On Jul 18, 2008, at 4:53 PM, Steve Gao wrote:

 Hi All,
 I am using Hadoop Streaming. I am confused by streaming  
 options: -file and -CacheFile. Seems that they mean the same thing,  
 right?


The difference is that -file will 'ship' your file (local file) to  
the cluster, while -cachefile assumes that it is already present on  
HDFS at the given path.

 Another misleading options are : -NumReduceTasks and -jobconf  
 mapred.reduce.tasks. Both are used to control (or give hit to) the  
 number of reducers.


Yes, they are both equivalent.

hth,
Arun


  

Hadoop with Axis

2008-07-18 Thread Kylie McCormick
Hello again:
I'm currently running Hadoop with various Client objects in the Map phase.
A given Axis service provides the class of the Client to be used in this
situation, which runs the call over the wire to the provided URL and
translates the objects returned into Writable objects.

When I use the code without Hadoop, it runs just fine--objects are returned
from over the wire. When I run the code inside of Hadoop's structure, I am
getting null objects within the return type (although the return type itself
is not null) from the service. This is literally the same code.

Do you think this is a timing issue, where the connection is taking too long
so Hadoop kills it? It's only a few seconds, but I thought I should ask. Are
there other things I should be looking into?

Thanks,
Kylie

-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

Light, seeking light, doth the light of light beguile!
-- William Shakespeare's Love's Labor's Lost