[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-22 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793720#action_12793720
 ] 

Zheng Shao commented on HDFS-770:
-

bad machine:
{code}
/proc/sys/vm/block_dump:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:40
/proc/sys/vm/dirty_writeback_centisecs:500
/proc/sys/vm/hugetlb_shm_group:0
/proc/sys/vm/laptop_mode:0
/proc/sys/vm/legacy_va_layout:0
/proc/sys/vm/lowmem_reserve_ratio:256   32
/proc/sys/vm/max_map_count:65536
/proc/sys/vm/min_free_kbytes:16383
/proc/sys/vm/nr_hugepages:0
/proc/sys/vm/nr_pdflush_threads:2
/proc/sys/vm/overcommit_memory:0
/proc/sys/vm/overcommit_ratio:50
/proc/sys/vm/page-cluster:3
/proc/sys/vm/swappiness:60
/proc/sys/vm/swap_token_timeout:0   0
/proc/sys/vm/vfs_cache_pressure:100
{code}


good machine:
{code}
/proc/sys/vm/block_dump:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_expire_centisecs:2999
/proc/sys/vm/dirty_ratio:40
/proc/sys/vm/dirty_writeback_centisecs:499
/proc/sys/vm/drop_caches:0
/proc/sys/vm/hugetlb_shm_group:0
/proc/sys/vm/laptop_mode:0
/proc/sys/vm/legacy_va_layout:0
/proc/sys/vm/lowmem_reserve_ratio:256   256
/proc/sys/vm/max_map_count:65536
/proc/sys/vm/min_free_kbytes:23004
/proc/sys/vm/min_slab_ratio:5
/proc/sys/vm/min_unmapped_ratio:1
/proc/sys/vm/nr_hugepages:0
/proc/sys/vm/nr_pdflush_threads:2
/proc/sys/vm/overcommit_memory:0
/proc/sys/vm/overcommit_ratio:50
/proc/sys/vm/page-cluster:3
/proc/sys/vm/panic_on_oom:0
/proc/sys/vm/percpu_pagelist_fraction:0
/proc/sys/vm/swappiness:60
/proc/sys/vm/vfs_cache_pressure:100
/proc/sys/vm/zone_reclaim_mode:0
{code}


> SocketTimeoutException: timeout while waiting for channel to be ready for read
> --
>
> Key: HDFS-770
> URL: https://issues.apache.org/jira/browse/HDFS-770
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: contrib/libhdfs, data-node, hdfs client, name-node
>Affects Versions: 0.20.1
> Environment: Ubuntu Linux 8.04
>Reporter: Leon Mergen
> Attachments: client.txt, datanode.txt, filewriter.cpp, namenode.txt
>
>
> We're having issues with timeouts occurring in our client: for some reason, a 
> timeout of 63000 milliseconds is triggered while writing HDFS data. Since we 
> currently have a single-server setup, this results in our client terminating 
> with an "All datanodes are bad" IOException.
> We're running all services, including the client, on our single server, so it 
> cannot be a network error. The load on the client is extremely low during 
> this period: only a few kilobytes a minute were being written around the time 
> the error occurred. 
> After browsing a bit online, a lot of people talk about setting 
> "dfs.datanode.socket.write.timeout" to 0 as a solution for this problem. Due 
> to the low load of our system during this period, however, I do feel this is 
> a real error and a timeout that should not be occurring. I have attached 3 
> logs of the namenode, datanode and client.
> It could be that this is related to 
> http://issues.apache.org/jira/browse/HDFS-693
> Any pointers on how I can assist to resolve this issue will be greatly 
> appreciated.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-21 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793385#action_12793385
 ] 

Todd Lipcon commented on HDFS-770:
--

Hey Zheng,

Any chance you can run "grep . /proc/sys/vm/*" on the system that does show the 
problem, and compare the results to the one that doesn't? I'm thinking this 
might just be a matter of system-level tuning. See 
http://www.westnet.com/~gsmith/content/linux-pdflush.htm




[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-21 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793374#action_12793374
 ] 

Todd Lipcon commented on HDFS-770:
--

I assume the _fbk kernel doesn't have wacky patches backported from the last 
couple of months of kernel development? I'll run your test program on a couple 
of boxes here as well; thanks for uploading it.




[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-21 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793371#action_12793371
 ] 

Zheng Shao commented on HDFS-770:
-

The new machine is running: "Linux hadoop0100.xxx.yyy.com 2.6.20-39_fbk #1 SMP 
Mon Mar 16 20:33:46 PDT 2009 x86_64 x86_64 x86_64 GNU/Linux" (uname -a) with 12 
local disks.
You can use the attached filewriter.cpp to see if your box has the problem 
mentioned above.





[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-21 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793363#action_12793363
 ] 

Todd Lipcon commented on HDFS-770:
--

Hi Zheng,

What kernel is the newer box running? Is it a brand new kernel? 2.6.32 has 
significant changes to dirty page writeback.

-Todd




[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-11 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789514#action_12789514
 ] 

Zheng Shao commented on HDFS-770:
-

In my case, it seems to be a problem with my box.
It's running "2.6.12-1.1398_FC4smp #1 SMP Fri Jul 15 01:05:24 EDT 2005 x86_64 
x86_64 x86_64 GNU/Linux" (uname -a) with 4 local hard disks.

I wrote a simple program (tried both Java and C) to write out a huge file, and 
the writes frequently block after writing out 5~10GB of data. The stalls were 
sometimes as long as 18 seconds.

I tried it on a newer box and the problem disappeared. The blocking happens in 
"FileOutputStream.write" and "fwrite". Each write is 60KB of data.

I guess it's not a hardware problem, because "iostat -x -n 2" shows that the 
disk write speed is still non-zero (10-30MB/s) when the blocking happens.
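
To make the symptom concrete, here is a minimal Java sketch of the kind of test 
described above (this is not the attached filewriter.cpp; the file path, total 
size and stall threshold are illustrative assumptions):

{code}
import java.io.FileOutputStream;
import java.io.IOException;

// Write a large file in 60KB chunks and report any write() call that stalls
// longer than a threshold, to reproduce the multi-second blocking described above.
public class StallDetectingWriter {
  public static void main(String[] args) throws IOException {
    byte[] chunk = new byte[60 * 1024];            // 60KB per write, as in the comment
    long totalBytes = 20L * 1024 * 1024 * 1024;    // ~20GB, enough to get past the 5~10GB mark
    long stallThresholdMs = 1000;                  // report writes slower than 1 second
    FileOutputStream out = new FileOutputStream("/data1/stall-test.bin");
    try {
      long written = 0;
      while (written < totalBytes) {
        long start = System.currentTimeMillis();
        out.write(chunk);
        long elapsed = System.currentTimeMillis() - start;
        if (elapsed > stallThresholdMs) {
          System.err.println("write() stalled for " + elapsed + " ms after "
              + (written >> 20) + " MB");
        }
        written += chunk.length;
      }
    } finally {
      out.close();
    }
  }
}
{code}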




[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-09 Thread Leon Mergen (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788005#action_12788005
 ] 

Leon Mergen commented on HDFS-770:
--

For what it's worth, the timeout problems I was having are gone after I set 
"dfs.datanode.socket.write.timeout" to 63, so it definitely appears as 
if the DN went to sleep for some reason.




[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-12-08 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787900#action_12787900
 ] 

Zheng Shao commented on HDFS-770:
-

I saw the same thing with Hadoop 0.19 while doing heavy-weight writes.

{code}
09/10/29 01:05:18 WARN hdfs.DFSClient: DataStreamer Exception: 
java.net.SocketTimeoutException: 3 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/aa.bb.cc.dd:55040 remote=/ee.ff.gg.hh:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2323)
{code}





[jira] Commented: (HDFS-770) SocketTimeoutException: timeout while waiting for channel to be ready for read

2009-11-13 Thread Raghu Angadi (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777612#action_12777612
 ] 

Raghu Angadi commented on HDFS-770:
---

From the datanode log:

> 2009-11-13 06:18:21,965 DEBUG org.apache.hadoop.ipc.RPC: Call: sendHeartbeat 14
> 2009-11-13 06:19:38,081 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to dfs.hadoop.tsukku.solatis/127.0.0.1:9000 from hadoop: closed

Note that there is no activity on the DataNode for 77 seconds. There are a 
number of possibilities, a common one being GC, though we haven't seen GC take 
this long on a DN.

Assuming the DN went to sleep for some reason, the rest of the behaviour is 
expected. If you do expect such delays, what you need to increase is the read 
timeout for the "responder thread" in DFSOutputStream (there is a config for a 
generic read timeout that applies to sockets in many contexts).
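
As a concrete illustration of that tuning, here is a minimal client-side sketch, 
assuming the 0.20-era keys "dfs.socket.timeout" and 
"dfs.datanode.socket.write.timeout"; the 300000 ms values are purely 
illustrative, not a recommendation:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: widen the client-side socket timeouts so an occasionally stalled DN
// is not declared bad. Values below are illustrative only.
public class TimeoutTuningSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Generic socket read timeout (ms) used by the DFS client, including the
    // thread that reads acks from the datanode pipeline.
    conf.setInt("dfs.socket.timeout", 300000);
    // Write timeout (ms) towards datanodes; 0 disables it, as mentioned in the
    // issue description above.
    conf.setInt("dfs.datanode.socket.write.timeout", 300000);

    FileSystem fs = FileSystem.get(conf);
    // Streams created from this FileSystem pick up the timeouts above.
    fs.create(new Path("/tmp/timeout-sketch")).close();
  }
}
{code}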
