When will hadoop 0.19.2 be released?

2009-04-24 Thread Zhou, Yunqing
Currently I'm managing a 64-node Hadoop 0.19.1 cluster with 100 TB of data.
I have found 0.19.1 to be buggy and have already applied some patches from
the Hadoop JIRA to work around problems.
But I'm looking forward to a more stable release of Hadoop.
Do you know when 0.19.2 will be released?

Thanks.


Re: When will hadoop 0.19.2 be released?

2009-04-24 Thread Zhou, Yunqing
But there are already 100 TB of data stored on DFS.
Is there a safe way to do such a downgrade?

On Fri, Apr 24, 2009 at 2:08 PM, jason hadoop jason.had...@gmail.com wrote:
 You could try the Cloudera release, which is based on 0.18.3 with many
 backported features.
 http://www.cloudera.com/distribution

 On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing azure...@gmail.com wrote:

 Currently I'm managing a 64-node Hadoop 0.19.1 cluster with 100 TB of data.
 I have found 0.19.1 to be buggy and have already applied some patches from
 the Hadoop JIRA to work around problems.
 But I'm looking forward to a more stable release of Hadoop.
 Do you know when 0.19.2 will be released?

 Thanks.




 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422



How to exclude machines from a cluster

2008-11-13 Thread Zhou, Yunqing
We have a cluster with 13 machines, and due to a lack of storage space we set
the replication factor to 1.
Recently we found that 2 machines in the cluster are not stable, so I'd
like to exclude them from the cluster.
But I can't simply raise the replication factor and then remove them,
because of the large amount of data involved.
So is there a way I can force Hadoop to move the blocks stored on them
to other machines?

Thanks.
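One approach in Hadoop of this era is datanode decommissioning rather than
changing replication: list the hosts to retire in an exclude file referenced
by dfs.hosts.exclude on the namenode, then tell the namenode to re-read its
host lists with bin/hadoop dfsadmin -refreshNodes. A rough sketch of the
config (the file path is purely illustrative):

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/home/hadoop/conf/dfs.exclude</value>
  </property>

While a node is being decommissioned the namenode re-replicates its blocks
onto the remaining datanodes, so this should work even with a replication
factor of 1, provided the other nodes have enough free space.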


Re: Can anyone recommend an inter-language data file format?

2008-11-02 Thread Zhou, Yunqing
I finally decided to use Protocol Buffers.
But there is a problem: when Hadoop handles a file larger than the block
size, the file will be split.
How can I determine the boundary of a sequence of Protocol Buffers records?
I was thinking of using Hadoop's SequenceFile as a container, but it has no
C++ API.
Any advice?
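On the Java side, one workable pattern is to let SequenceFile handle record
boundaries and splitting (it embeds sync markers that readers can seek to)
and store each serialized message as an opaque BytesWritable value. A rough
sketch, assuming a hypothetical generated Protocol Buffers class named
LogRecord and an illustrative output path:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.SequenceFile;

  public class ProtoSeqFileWriter {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path out = new Path("/data/records.seq");        // illustrative path

      // SequenceFile takes care of record boundaries and split points,
      // so the protobuf payload can stay an opaque byte array.
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, out, NullWritable.class, BytesWritable.class);
      try {
        // LogRecord is a hypothetical generated protobuf class.
        byte[] payload = LogRecord.newBuilder()
            .setId(1).setText("hello").build().toByteArray();
        writer.append(NullWritable.get(), new BytesWritable(payload));
      } finally {
        writer.close();
      }
    }
  }

A raw stream of concatenated protobuf messages, by contrast, is not
self-delimiting, so a reader starting in the middle of a split cannot find
the next record boundary; if SequenceFile is ruled out because of the C++
side, the usual alternative is to add your own framing, for example a length
prefix plus a sync marker the reader can scan for.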

On Sun, Nov 2, 2008 at 1:45 PM, Bryan Duxbury [EMAIL PROTECTED] wrote:

 Agree, we use Thrift at Rapleaf for this purpose. It's trivial to make a
 ThriftWritable if you want to be crafty, but you can also just use byte[]s
 and do the serialization and deserialization yourself.

 -Bryan


 On Nov 1, 2008, at 8:01 PM, Alex Loddengaard wrote:

 Take a look at Thrift:
 http://developers.facebook.com/thrift/

 Alex

 On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing [EMAIL PROTECTED] wrote:

 The project I'm working on has many modules written in different languages
 (several of them are Hadoop jobs), so I'd like to use a common record-based
 data file format for data exchange.
 XML is not efficient for appending new records.
 SequenceFile does not seem to have an API for languages other than Java.
 Protocol Buffers' Hadoop API seems to be under development.
 Any recommendations?

 Thanks





Can anyone recommend an inter-language data file format?

2008-11-01 Thread Zhou, Yunqing
The project I'm working on has many modules written in different languages
(several of them are Hadoop jobs), so I'd like to use a common record-based
data file format for data exchange.
XML is not efficient for appending new records.
SequenceFile does not seem to have an API for languages other than Java.
Protocol Buffers' Hadoop API seems to be under development.
Any recommendations?

Thanks


Re: Can anyone recommend an inter-language data file format?

2008-11-01 Thread Zhou, Yunqing
Can Thrift be easily used in Hadoop?
It seems a lot of things would have to be written: input/output formats,
Writables, a split method, etc.
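One way to keep that small is to treat the serialized struct as an opaque
byte array (the byte[] approach mentioned elsewhere in this thread): the only
Hadoop-side glue is then a small Writable wrapper, and the existing
SequenceFile input/output formats work unchanged, so no custom InputFormat or
split logic is needed. A rough sketch (the class name is illustrative, not an
existing Hadoop or Thrift API):

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  // Minimal Writable that carries an already-serialized record (e.g. a
  // Thrift or protobuf payload) as raw bytes; the actual (de)serialization
  // is done by the application with the Thrift/protobuf runtime it prefers.
  public class RawBytesWritable implements Writable {
    private byte[] bytes = new byte[0];

    public RawBytesWritable() {}                       // required by Hadoop

    public RawBytesWritable(byte[] bytes) { this.bytes = bytes; }

    public byte[] get() { return bytes; }

    public void write(DataOutput out) throws IOException {
      out.writeInt(bytes.length);                      // simple length-prefixed framing
      out.write(bytes);
    }

    public void readFields(DataInput in) throws IOException {
      bytes = new byte[in.readInt()];
      in.readFully(bytes);
    }
  }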

On Sun, Nov 2, 2008 at 11:01 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:

 Take a look at Thrift:
 http://developers.facebook.com/thrift/

 Alex

 On Sat, Nov 1, 2008 at 7:15 PM, Zhou, Yunqing [EMAIL PROTECTED] wrote:

   The project I'm working on has many modules written in different languages
   (several of them are Hadoop jobs), so I'd like to use a common record-based
   data file format for data exchange.
   XML is not efficient for appending new records.
   SequenceFile does not seem to have an API for languages other than Java.
   Protocol Buffers' Hadoop API seems to be under development.
   Any recommendations?
 
  Thanks
 



Task Random Fail

2008-10-22 Thread Zhou, Yunqing
Recently, tasks on our cluster have been failing at random (both map tasks
and reduce tasks). When rerun, they all succeed.
The whole job is an IO-bound job (250 GB input, 500 GB map output, and 10 GB
final output).
From the jobtracker, I can see the failed task reported as:
   Task:    task_200810220830_0004_m_000653_0
   TIP:     tip_200810220830_0004_m_000653
            (http://hadoop5:50030/taskdetails.jsp?jobid=job_200810220830_0004&tipid=tip_200810220830_0004_m_000653)
   Machine: vidi-005 (http://vidi-005:50060/)
   Status:  FAILED
   Error:   java.io.IOException: Task process exit with nonzero status of 65.
            at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:479)
            at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:391)
   Logs:    Last 4KB  http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0&start=-4097
            Last 8KB  http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0&start=-8193
            All       http://vidi-005:50060/tasklog?taskid=task_200810220830_0004_m_000653_0
and the task log (following the log links above) says:

 Task Logs: 'task_200810220830_0004_m_000653_0'

*stdout logs*

--


*stderr logs*

--


*syslog logs*

2008-10-22 19:59:51,640 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2008-10-22 19:59:59,507 INFO org.apache.hadoop.mapred.MapTask:
numReduceTasks: 26
2008-10-22 20:12:25,968 INFO org.apache.hadoop.mapred.TaskRunner:
Communication exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
at java.lang.Thread.run(Thread.java:619)

2008-10-22 20:13:29,015 INFO org.apache.hadoop.mapred.TaskRunner:
Communication exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
at java.lang.Thread.run(Thread.java:619)

2008-10-22 20:14:32,030 INFO org.apache.hadoop.mapred.TaskRunner:
Communication exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:559)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
at org.apache.hadoop.mapred.Task$1.run(Task.java:316)
at java.lang.Thread.run(Thread.java:619)

2008-10-22 20:14:32,781 INFO org.apache.hadoop.mapred.TaskRunner:
Process Thread Dump: Communication exception
9 active threads
Thread 13 (Comm thread for task_200810220830_0004_m_000653_0):
  State: RUNNABLE
  Blocked count: 2
  Waited count: 430
  Stack:
sun.management.ThreadImpl.getThreadInfo0(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)

org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)

org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:168)
org.apache.hadoop.mapred.Task$1.run(Task.java:338)
java.lang.Thread.run(Thread.java:619)
Thread 12 ([EMAIL PROTECTED]):
  State: TIMED_WAITING
  Blocked count: 0
  Waited count: 872
  Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:763)
java.lang.Thread.run(Thread.java:619)
Thread 11 (IPC Client connection to hadoop5/192.168.4.105:9000):
  State: WAITING
  Blocked count: 0
  Waited count: 2
  Waiting on [EMAIL PROTECTED]
  Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:247)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:286)
Thread 9 (IPC Client connection to /127.0.0.1:49078):
  State: RUNNABLE
  Blocked count: 5
  Waited count: 214
  Stack:
sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)

org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:237)
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:155)
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:149)
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:122)
java.io.FilterInputStream.read(FilterInputStream.java:116)

Can I start up 2 datanodes on 1 machine?

2008-10-07 Thread Zhou, Yunqing
I have an existing Hadoop 0.17.1 cluster, and now I'd like to add a second
disk to every machine.
Can I start up multiple datanodes on one machine? Or do I have to set up each
machine with software RAID? (There is no RAID support on the mainboards.)

Thanks


Re: Can I start up 2 datanodes on 1 machine?

2008-10-07 Thread Zhou, Yunqing
Thanks, I will try it then.

On Tue, Oct 7, 2008 at 4:40 PM, Miles Osborne [EMAIL PROTECTED] wrote:

 you can specify multiple data directories in your conf file

 
 dfs.data.dir Comma separated list of paths on the local filesystem
 of a DataNode where it should store its blocks.  If this is a
 comma-delimited list of directories, then data will be stored in all
 named directories, typically on different devices.
 

 Miles
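
 In hadoop-site.xml that might look something like this (the directory names
 are only examples); the single datanode process then spreads its blocks
 across both disks, so no second datanode or RAID setup is needed:

   <property>
     <name>dfs.data.dir</name>
     <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
   </property>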

 2008/10/7 Zhou, Yunqing [EMAIL PROTECTED]:
   I have an existing Hadoop 0.17.1 cluster, and now I'd like to add a second
   disk to every machine.
   Can I start up multiple datanodes on one machine? Or do I have to set up
   each machine with software RAID? (There is no RAID support on the
   mainboards.)
 
  Thanks
 



 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.



Re: A question about Mapper

2008-10-04 Thread Zhou, Yunqing
But the close() function doesn't give me a Collector to put pairs into.

Is it reasonable to store a reference to the collector in advance?

I'm not sure whether the collector is still available at that point.
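
With the old mapred API the usual pattern is exactly that: keep the
OutputCollector passed to map() in a field and use it again in close() to
flush the last buffered group. A rough sketch for a map-only job (class name
and key/value types are illustrative):

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class GroupByFlagMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {

    private final List<String> buffer = new ArrayList<String>();
    private OutputCollector<Text, NullWritable> output;   // saved for close()

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, NullWritable> collector,
                    Reporter reporter) throws IOException {
      output = collector;                        // remember the collector
      String line = value.toString();
      if (line.equals("flag")) {
        flush();                                 // emit the previous group
      } else {
        buffer.add(line);
      }
    }

    public void close() throws IOException {
      flush();                                   // emit the final group (e.g. "f")
    }

    private void flush() throws IOException {
      if (output != null && !buffer.isEmpty()) {
        // emit the buffered group as one record; formatting is up to you
        output.collect(new Text(buffer.toString()), NullWritable.get());
        buffer.clear();
      }
    }
  }

One caveat: each map task only sees its own input split, so this assumes a
group never straddles a split boundary (or that the input format is made
non-splittable).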




On Sat, Oct 4, 2008 at 12:17 PM, Joman Chu [EMAIL PROTECTED] wrote:

 Hello,

 Does MapReduceBase.close() fit your needs? Take a look at
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapReduceBase.html#close()

 On Fri, October 3, 2008 11:36 pm, Zhou, Yunqing said:
   The input is as follows: flag, a, b, flag, c, d, e, flag, f.

   I used a mapper to first store values and then emit them all whenever it
   meets a line containing flag, but when the file reaches its end I have no
   chance to emit the last record (in this case, f). So how can I detect the
   end of the mapper's life, or how can I emit a last record before the
   mapper exits?
 
  Thanks
 

 Have a good one,
 --
 Joman Chu
 Carnegie Mellon University
 School of Computer Science 2011
 AIM: ARcanUSNUMquam




Re: A question about Mapper

2008-10-04 Thread Zhou, Yunqing
Thanks a lot for such a detailed explanation. But I think the reducer is
unnecessary here, so I set the number of reducers to 0 and would like to
handle everything in the mappers.
That is how I ran into this problem.
Thanks anyway.

On Sat, Oct 4, 2008 at 3:33 PM, Joman Chu [EMAIL PROTECTED] wrote:

 Hello,

 I assume you want to associate {a,b}, {c,d,e}, and {f} into sets.

 One way to do this is by associating some value with each flag and then
 emitting the data associated with that value. For example,

 flag
 a
 b
 flag
 c
 d
 e
 flag
 f

  I define flag, a, b, c, d, e, f to be the keys while in the Mapper context.

  Whenever the mapper sees a key, it will emit <UID, key>. UID is some unique
  identifier associated with a certain set, and key is the key that was passed
  into the mapper. We are essentially inverting the association here.

  Let's step through this test case.
   1. Choose UID = mapper1flag1.
   2. <flag, null> -> Mapper -> <mapper1flag1, flag>
   3. We have reached a flag, so we change the UID to mapper1flag2.
   4. <a, null> -> Mapper -> <mapper1flag2, a>
   5. <b, null> -> Mapper -> <mapper1flag2, b>
   6. <flag, null> -> Mapper -> <mapper1flag2, flag>
   7. We have reached a flag, so we change the UID to mapper1flag3.
   8. <c, null> -> Mapper -> <mapper1flag3, c>
   9. <d, null> -> Mapper -> <mapper1flag3, d>
  10. <e, null> -> Mapper -> <mapper1flag3, e>
  11. <flag, null> -> Mapper -> <mapper1flag3, flag>
  12. We have reached a flag, so we change the UID to mapper1flag4.
  13. <f, null> -> Mapper -> <mapper1flag4, f>
  14. EOF

  Then the reducers will collect all values with the same UID, so here is
  what we get:

  1. <mapper1flag1, {flag}> -> Reducer -> <{}, null>
  2. <mapper1flag2, {a, b, flag}> -> Reducer -> <{a, b}, null>
  3. <mapper1flag3, {c, d, e, flag}> -> Reducer -> <{c, d, e}, null>
  4. <mapper1flag4, {f}> -> Reducer -> <{f}, null>

 Hopefully this solves your problem.
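
  A sketch of the mapper side of this scheme might look as follows (class and
  property names are illustrative; the UID combines the task attempt id with a
  per-mapper group counter):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class UidTagMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String taskId = "mapper";   // made unique per mapper in configure()
      private int group = 1;              // bumped every time a "flag" line is seen

      public void configure(JobConf job) {
        taskId = job.get("mapred.task.id", "mapper");  // keeps UIDs unique across mappers
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        // emit <UID, key>; the reducer later gathers everything with the same UID
        output.collect(new Text(taskId + "-flag" + group), new Text(line));
        if (line.equals("flag")) {
          group++;                         // subsequent lines belong to the next set
        }
      }
    }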

 On Sat, October 4, 2008 2:48 am, Zhou, Yunqing said:
   But the close() function doesn't give me a Collector to put pairs into.
  
   Is it reasonable to store a reference to the collector in advance?
  
  
   I'm not sure whether the collector is still available at that point.
 
 
 
 
  On Sat, Oct 4, 2008 at 12:17 PM, Joman Chu [EMAIL PROTECTED]
 wrote:
 
 
  Hello,
 
  Does MapReduceBase.close() fit your needs? Take a look at
  http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred
  /MapReduceBase.html#close()
 
  On Fri, October 3, 2008 11:36 pm, Zhou, Yunqing said:
  The input is as follows: flag, a, b, flag, c, d, e, flag, f.
 
  I used a mapper to first store values and then emit them all whenever it
  meets a line containing flag, but when the file reaches its end I have no
  chance to emit the last record (in this case, f). So how can I detect the
  end of the mapper's life, or how can I emit a last record before the
  mapper exits?
 
  Thanks
 
 
   Have a good one,
   --
   Joman Chu
   Carnegie Mellon University
   School of Computer Science 2011
   AIM: ARcanUSNUMquam
 
 
 


 --
 Joman Chu
 Carnegie Mellon University
 School of Computer Science 2011
 AIM: ARcanUSNUMquam




A question about Mapper

2008-10-03 Thread Zhou, Yunqing
the input is as follows.
flag
a
b
flag
c
d
e
flag
f

I used a mapper to first store values and then emit them all whenever it
meets a line containing flag.
But when the file reaches its end, I have no chance to emit the last record
(in this case, f).
So how can I detect the end of the mapper's life, or how can I emit a last
record before the mapper exits?

Thanks


Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
I only use it to do something in parallel, but the reduce step will cost me
several additional days. Is it possible to make Hadoop not run a reduce
step?

Thanks


Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Zhou, Yunqing
I've tried it and it works.
Thank you very much
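
The change amounts to a single line on the JobConf; a minimal sketch of the
driver (class names and paths are illustrative):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(MapOnlyJob.class);
      conf.setJobName("map-only");
      conf.setMapperClass(MyMapper.class);      // MyMapper is an illustrative mapper class
      conf.setNumReduceTasks(0);                // no reduce phase: map output goes straight to HDFS
      conf.setOutputKeyClass(Text.class);       // match whatever the mapper emits
      conf.setOutputValueClass(Text.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }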

On Mon, Jul 21, 2008 at 6:33 PM, Miles Osborne [EMAIL PROTECTED] wrote:

 Then just do what I said: set the number of reducers to zero. This should
 just run the map phase.

 2008/7/21 Zhou, Yunqing [EMAIL PROTECTED]:

   Since the whole data set is 5 TB, the identity reducer still costs a lot
   of time.
 
  On Mon, Jul 21, 2008 at 5:09 PM, Christian Ulrik Søttrup 
 [EMAIL PROTECTED]
  wrote:
 
   Hi,
  
    You can simply use the built-in reducer that just copies the map
  output:
  
  
 conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);
  
   Cheers,
   Christian
  
  
   Zhou, Yunqing wrote:
  
    I only use it to do something in parallel, but the reduce step will cost
    me several additional days. Is it possible to make Hadoop not run a
    reduce step?
  
   Thanks
  
  
  
  
  
 



 --
 The University of Edinburgh is a charitable body, registered in Scotland,
 with registration number SC005336.



Re: Monthly Hadoop user group meetings

2008-05-06 Thread Zhou, Yunqing
Thirded. I'm doing my machine learning experiments on a Hadoop cluster and am
eager to pick up more info about it. :-)

2008/5/7, Leon Mergen [EMAIL PROTECTED]:

 On Tue, May 6, 2008 at 6:59 PM, Cole Flournoy [EMAIL PROTECTED]
 wrote:

  Is there any way we could set up some off-site webcam conferencing? I
  would love to attend, but I am on the east coast.


 Seconded. I'm from Europe, and am pretty sure that I will watch any video
 about a Hadoop conference I can get my hands on, including this. :-)

 --
 Leon Mergen
 http://www.solatis.com