[
https://issues.apache.org/jira/browse/HADOOP-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12570484#action_12570484
]
Craig Macdonald commented on HADOOP-4:
--------------------------------------
Pete,
I have been experimenting with fuse_dfs.c and have a few questions:
(1) I am using a previous version of fuse_dfs.c, mainly because I don't have
bootstrap.sh. However, with respect to the new fuse_dfs.c option parsing: is
this compatible with calling via mount.fuse and autofs?
This is how I currently mount, using an autofs map containing:
{code}
hdfs -fstype=fuse,rw,nodev,nonempty,noatime,allow_other
:/path/to/fuse_dfs_moutn/fuse_dfs.sh\#dfs\://namenode\:9000
{code}
fuse_dfs.sh is just a shell script that sets CLASSPATH and LD_LIBRARY_PATH and
essentially just execs fuse_dfs. If I changed to the more recent version, I
would probably have to move the dfs://namenode:9000 configuration into the
script, I think.
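For reference, the wrapper is roughly the following. This is a minimal sketch, not the actual script: the HADOOP_HOME/JAVA_HOME values, jar locations, and the fuse_dfs path are all assumptions about a typical Hadoop 0.x layout.

```shell
#!/bin/sh
# fuse_dfs.sh -- minimal sketch of the wrapper (all paths are assumptions)
HADOOP_HOME=/path/to/hadoop

# Put the Hadoop conf directory and jars on the classpath for libhdfs's JVM.
CLASSPATH="$HADOOP_HOME/conf"
for jar in "$HADOOP_HOME"/*.jar "$HADOOP_HOME"/lib/*.jar; do
  CLASSPATH="$CLASSPATH:$jar"
done
export CLASSPATH

# libhdfs and libjvm must be resolvable at run time.
export LD_LIBRARY_PATH="$HADOOP_HOME/libhdfs:$JAVA_HOME/jre/lib/i386/server:$LD_LIBRARY_PATH"

# Hand all mount arguments straight through to the real binary.
exec /path/to/fuse_dfs "$@"
```

With the newer option parsing, the dfs://namenode:9000 argument would presumably be appended here rather than passed in from the autofs map.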
(2) Have you done any sort of performance testing? I'm experimenting with HDFS
for use in a mixed environment (Hadoop and non-Hadoop jobs), and the throughput
I see is miserable. For example, I use a test network of 8 P3-1GHz nodes, with
a similar client on a 100Mbit network.
Below, I compare cat-ing a 512MB file from (a) an NFS mount on the same network
as the cluster nodes (b) using the hadoop frontend and (c) using the FUSE HDFS
filesystem.
{noformat}
# (a)
$ time cat /mnt/tmp/data.df > /dev/null
real 0m47.280s
user 0m0.059s
sys 0m2.476s
# (b)
$ time bin/hadoop fs -cat hdfs:///user/craigm/data.df > /dev/null
real 0m48.839s
user 0m16.256s
sys 0m7.001s
# (c)
$ time cat /misc/hdfs/user/craigm/data.df >/dev/null
real 1m41.686s
user 0m0.135s
sys 0m2.302s
{noformat}
Note that the NFS mount and hadoop fs -cat both obtain about 10.5MB/sec, while
the HDFS fuse mount (in /misc/hdfs) achieves only 5MB/sec. Is this an expected
overhead for FUSE?
I did try tuning rd_buf_size to match the size of the reads that the kernel was
requesting, i.e. 128KB instead of 32KB; however, this made matters worse:
{noformat}
# with 128KB buffer size
$ time cat /misc/hdfs/user/craigm/data.df >/dev/null
real 2m11.080s
user 0m0.113s
sys 0m2.180s
{noformat}
Perhaps an option would be to keep the HDFS file open between reads and time
out the connection when idle; or to read ahead more than we need and keep it in
memory? Both would overly complicate the neat code, though!
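The first idea could look something like the sketch below: cache the hdfsFile handle in FUSE's per-open fi->fh slot at open time, so each read is a positioned hdfsPread on the already-open file instead of a fresh open/seek/read/close cycle. This assumes the libhdfs C API (hdfs.h) and FUSE 2.x; the function names are illustrative, not the actual fuse_dfs.c code, and it won't run without a live cluster.

```c
/* Sketch only: cache the open hdfsFile across reads instead of
 * reopening per read. Assumes libhdfs (hdfs.h) and FUSE 2.x;
 * illustrative names, not the actual fuse_dfs.c. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <hdfs.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>

static int dfs_open(const char *path, struct fuse_file_info *fi)
{
    hdfsFS fs = (hdfsFS) fuse_get_context()->private_data;
    hdfsFile f = hdfsOpenFile(fs, path, O_RDONLY, 0, 0, 0);
    if (f == NULL)
        return -EIO;
    fi->fh = (uint64_t) f;   /* keep the handle for subsequent reads */
    return 0;
}

static int dfs_read(const char *path, char *buf, size_t size,
                    off_t offset, struct fuse_file_info *fi)
{
    hdfsFS fs = (hdfsFS) fuse_get_context()->private_data;
    hdfsFile f = (hdfsFile) fi->fh;
    /* positioned read on the cached handle: no reopen, no seek */
    tSize n = hdfsPread(fs, f, offset, buf, size);
    return n < 0 ? -EIO : n;
}

static int dfs_release(const char *path, struct fuse_file_info *fi)
{
    hdfsFS fs = (hdfsFS) fuse_get_context()->private_data;
    hdfsCloseFile(fs, (hdfsFile) fi->fh);
    return 0;
}
```

The release hook closes the handle when the last file descriptor goes away, so nothing leaks even though the file stays open between reads.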
(3) If I use autofs for HDFS, then mounts will time out quickly (after 30
seconds) and then reconnect on demand. Perhaps fuse_dfs.c could implement the
destroy FUSE operation to free up the connection to the namenode, etc.?
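A sketch of that, again assuming libhdfs: FUSE's .destroy hook runs once at unmount and receives the private_data set at init time, which is a natural place to drop the namenode connection. Illustrative code, not the actual fuse_dfs.c.

```c
/* Sketch: free the namenode connection at unmount via FUSE's
 * destroy hook. Assumes libhdfs; not the actual fuse_dfs.c. */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <hdfs.h>
#include <stddef.h>

static void dfs_destroy(void *private_data)
{
    hdfsFS fs = (hdfsFS) private_data;  /* connection stored at init time */
    if (fs != NULL)
        hdfsDisconnect(fs);
}

static struct fuse_operations dfs_oper = {
    /* ... getattr, open, read, readdir, etc. ... */
    .destroy = dfs_destroy,
};
```

That way each autofs expire/unmount cycle would cleanly release its namenode connection rather than leaving it dangling until process exit.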
Cheers
Craig
> tool to mount dfs on linux
> --------------------------
>
> Key: HADOOP-4
> URL: https://issues.apache.org/jira/browse/HADOOP-4
> Project: Hadoop Core
> Issue Type: Improvement
> Components: fs
> Affects Versions: 0.5.0
> Environment: linux only
> Reporter: John Xing
> Assignee: Doug Cutting
> Attachments: fuse-dfs.tar.gz, fuse-dfs.tar.gz, fuse-dfs.tar.gz,
> fuse-hadoop-0.1.0_fuse-j.2.2.3_hadoop.0.5.0.tar.gz,
> fuse-hadoop-0.1.0_fuse-j.2.4_hadoop.0.5.0.tar.gz, fuse-hadoop-0.1.1.tar.gz,
> fuse-j-hadoopfs-03.tar.gz, fuse_dfs.c, fuse_dfs.c, Makefile
>
>
> tool to mount dfs on linux