Re: Using HDFS to serve www requests
Indeed, it would be a very nice interface to have (if anyone has some free time)! I know a few Caltech people who'd like to see how their WAN transfer product (http://monalisa.cern.ch/FDT/) would work with HDFS; if there were an HDFS NIO interface, playing around with HDFS and FDT would be fairly trivial.

Brian

On Apr 3, 2009, at 5:16 AM, Steve Loughran wrote:
> Snehal Nagmote wrote:
>> Can you please explain exactly what adding an NIO bridge means, how it can be done, and what the advantages would be in this case?
>
> NIO: Java non-blocking IO. It's a standard API to talk to different filesystems; support has been discussed in JIRA. If the DFS APIs were accessible under an NIO front end, then applications written for the NIO APIs would work with the supported filesystems, with no need to code specifically for Hadoop's not-yet-stable APIs.
> [snip]
Re: Using HDFS to serve www requests
Snehal Nagmote wrote:
> Can you please explain exactly what adding an NIO bridge means, how it can be done, and what the advantages would be in this case?

NIO: Java non-blocking IO. It's a standard API to talk to different filesystems; support has been discussed in JIRA. If the DFS APIs were accessible under an NIO front end, then applications written for the NIO APIs would work with the supported filesystems, with no need to code specifically for Hadoop's not-yet-stable APIs.

Steve Loughran wrote:
> Edward Capriolo wrote:
>> It is a little more natural to connect to HDFS from Apache Tomcat. This will allow you to skip the FUSE mounts and just use the HDFS API. I have modified this code to run inside Tomcat: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
>> I will not testify to how well this setup will perform under internet traffic, but it does work.
>
> If someone adds an NIO bridge to Hadoop filesystems then it would be easier, leaving you only with the performance issues.
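To make the NIO-bridge idea concrete, here is a minimal sketch of what filesystem-agnostic application code looks like under the NIO.2 API (JSR 203, which later shipped in Java 7). The HDFS provider does not exist in this thread's timeframe and is hypothetical, so the demo runs against the local default filesystem; the point is that the reading code itself never names a concrete filesystem:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class NioBridgeSketch {
    // Application code written only against java.nio.file. With a
    // (hypothetical) HDFS FileSystemProvider on the classpath, this same
    // method could be handed a path resolved from an hdfs:// URI.
    static String readAll(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate against the local (default) filesystem.
        Path tmp = Files.createTempFile("nio-bridge-demo", ".txt");
        Files.write(tmp, "hello from nio".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(tmp)); // prints: hello from nio
        Files.delete(tmp);
    }
}
```

With an HDFS provider registered, the same `readAll` would accept a path like `Paths.get(URI.create("hdfs://namenode:9000/data/img.jpg"))` unchanged, which is exactly the portability benefit Steve describes.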
Re: Using HDFS to serve www requests
Edward Capriolo wrote:
> It is a little more natural to connect to HDFS from Apache Tomcat. This will allow you to skip the FUSE mounts and just use the HDFS API. I have modified this code to run inside Tomcat: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
> I will not testify to how well this setup will perform under internet traffic, but it does work.

If someone adds an NIO bridge to Hadoop filesystems then it would be easier, leaving you only with the performance issues.
Re: Using HDFS to serve www requests
Hello Sir,

I am doing an M.Tech at IIIT Hyderabad, and in our project we have a similar requirement of accessing HDFS from an Apache server (Tomcat) directly. Can you please explain how to do this, with some example (probably the same code you modified)? Does it require the Hadoop installation directory to sit on the same machine, or would just copying the jars and config file do?

Thanks in advance

Edward Capriolo wrote:
> It is a little more natural to connect to HDFS from Apache Tomcat. This will allow you to skip the FUSE mounts and just use the HDFS API. I have modified this code to run inside Tomcat: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
> I will not testify to how well this setup will perform under internet traffic, but it does work.
>
> GlusterFS is more like a traditional POSIX filesystem. It supports locking and appends, and you can do things like put the MySQL data directory on it. GlusterFS is geared for storing data to be accessed with low latency. Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand. The GlusterFS volume is mounted directly on a Unix system.
>
> Hadoop is a userspace filesystem. The latency is higher. Nodes are connected by gigabit Ethernet. It is closely coupled with MapReduce. You can use the API or the FUSE module to mount Hadoop, but that is not a direct goal of Hadoop. Hope that helps.

--
View this message in context: http://www.nabble.com/Using-HDFS-to-serve-www-requests-tp22725659p22765743.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Using HDFS to serve www requests
> Yes. IMHO GlusterFS advertises benchmarks vs Lustre.

You're right, I've found those now. Thanks for the reply, it helped.

P

On Fri, Mar 27, 2009 at 5:04 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
>> but does Sun's Lustre follow in the steps of Gluster then?
>
> Yes. IMHO GlusterFS advertises benchmarks vs Lustre. The main difference is that GlusterFS is a FUSE (userspace) filesystem, while Lustre has to be patched into the kernel, or built as a module.
Re: Using HDFS to serve www requests
> but does Sun's Lustre follow in the steps of Gluster then?

Yes. IMHO GlusterFS advertises benchmarks vs Lustre. The main difference is that GlusterFS is a FUSE (userspace) filesystem, while Lustre has to be patched into the kernel, or built as a module.
Re: Using HDFS to serve www requests
In general, Hadoop is unsuitable for the application you're suggesting. Systems like FUSE-HDFS do exist, though they're not widely used. I don't know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have huge images, how big is huge? It might be suitable if these images are 1 GB or larger, but in general, "huge" on Hadoop means tens of GBs up to TBs. If you have a large number of moderately-sized files, you'll find that HDFS responds very poorly for your needs. It sounds like GlusterFS is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer p...@cryer.us wrote:
> This is somewhat of a noob question I know, but after learning about Hadoop, testing it in a small cluster, and running MapReduce jobs on it, I'm still not sure if Hadoop is the right distributed file system to serve web requests. In other words, can (or is it right to) serve images and data from HDFS, using something like FUSE to mount a filesystem where Apache could serve images from it? We have huge images, thus the need for a distributed file system; they go in, get stored with lots of metadata, and are redundant with Hadoop/HDFS. But is it the right way to serve web content? I looked at GlusterFS before; they had an Apache and Lighttpd module which made it simple. Does HDFS have something like this? Do people just use a FUSE option as I described, or is this not a good use of Hadoop?
>
> Thanks
> P
Re: Using HDFS to serve www requests
On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:
> In general, Hadoop is unsuitable for the application you're suggesting. Systems like FUSE-HDFS do exist, though they're not widely used.

We use FUSE on a 270TB cluster to serve up physics data because the client (2.5M lines of C++) doesn't understand how to connect to HDFS directly.

Brian

> I don't know of anyone trying to connect Hadoop with Apache httpd.
> [snip]
Re: Using HDFS to serve www requests
It is a little more natural to connect to HDFS from Apache Tomcat. This will allow you to skip the FUSE mounts and just use the HDFS API. I have modified this code to run inside Tomcat: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
I will not testify to how well this setup will perform under internet traffic, but it does work.

GlusterFS is more like a traditional POSIX filesystem. It supports locking and appends, and you can do things like put the MySQL data directory on it. GlusterFS is geared for storing data to be accessed with low latency. Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand. The GlusterFS volume is mounted directly on a Unix system.

Hadoop is a userspace filesystem. The latency is higher. Nodes are connected by gigabit Ethernet. It is closely coupled with MapReduce. You can use the API or the FUSE module to mount Hadoop, but that is not a direct goal of Hadoop. Hope that helps.
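The skip-FUSE approach above amounts to a handler that copies an HDFS input stream to the HTTP response. Here is a minimal, runnable sketch using the JDK's built-in com.sun.net.httpserver as a stand-in for a Tomcat servlet; the path, port, and payload are invented for the example, and the HDFS calls shown in the comment follow the FileSystem/Path API from the wiki example rather than running here:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HdfsHttpSketch {
    // Start an HTTP server whose /file handler plays the role of the
    // Tomcat servlet described above.
    static HttpServer start() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/file", exchange -> {
            // In the real setup the bytes would come from the HDFS API:
            //   FileSystem fs = FileSystem.get(new Configuration());
            //   InputStream in = fs.open(new Path("/images/foo.jpg"));
            // and be copied to the response body instead of this canned payload.
            byte[] body = "bytes-from-hdfs".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = start();
        // Fetch the "file" back over HTTP to show the round trip works.
        URL url = new URL("http://localhost:" + server.getAddress().getPort() + "/file");
        StringBuilder sb = new StringBuilder();
        try (InputStream in = url.openStream()) {
            int c;
            while ((c = in.read()) != -1) sb.append((char) c);
        }
        server.stop(0);
        System.out.println(sb); // prints: bytes-from-hdfs
    }
}
```

The same copy-the-stream structure drops into a servlet's doGet unchanged; the open question the thread raises (latency under real internet traffic) is not answered by a sketch like this.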
Re: Using HDFS to serve www requests
> When you say that you have huge images, how big is huge?

Yes, we're looking at some images that are 100 MB in size, but nothing like what you're speaking of. This helps me understand Hadoop's usage better; unfortunately it won't be the fit I was hoping for.

> You can use the API or the FUSE module to mount Hadoop, but that is not a direct goal of Hadoop. Hope that helps.

Very interesting, and yes, that indeed does help. Not to veer off thread too much, but does Sun's Lustre follow in the steps of Gluster then? I know Lustre requires kernel patches to install, so it's at a different level than the others, but I have seen some articles about large-scale clusters built with Lustre and want to look at that as another option. Again, thanks for the info; if anyone has general information on cluster software, or knows of a more appropriate list, I'd appreciate the advice.

Thanks
P

On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> It is a little more natural to connect to HDFS from Apache Tomcat. This will allow you to skip the FUSE mounts and just use the HDFS API. I have modified this code to run inside Tomcat: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
> [snip]
Re: Using HDFS to serve www requests
Have you looked into MogileFS already? It seems like a good fit, based on your description. This question has come up more than once here, and MogileFS is an oft-recommended solution.

Norbert

On 3/26/09, phil cryer p...@cryer.us wrote:
> Yes, we're looking at some images that are 100 MB in size, but nothing like what you're speaking of. This helps me understand Hadoop's usage better; unfortunately it won't be the fit I was hoping for.
> [snip]
Re: Using HDFS to serve www requests
On Mar 26, 2009, at 8:55 PM, phil cryer wrote:
>> When you say that you have huge images, how big is huge?
>
> Yes, we're looking at some images that are 100 MB in size, but nothing like what you're speaking of. This helps me understand Hadoop's usage better; unfortunately it won't be the fit I was hoping for.

I wouldn't split hairs between 100MB and 1GB. However, it may be less reliable due to the extra layer via FUSE if you want to serve it via Apache. It wouldn't be too bad to whip up a Tomcat webapp that goes through Hadoop...

It really depends on your hardware level and redundancy. If you have the money to get the hardware necessary to go with a Lustre-based solution, do that. If you have enough money to load up your pre-existing cluster with lots of disk, HDFS might be better. Certainly it will be outperformed by Lustre if you have lots of reliable hardware, especially in terms of latency.

Brian

> [snip]
Re: Using HDFS to serve www requests
HDFS itself has some facilities for serving data over HTTP: https://issues.apache.org/jira/browse/HADOOP-5010. YMMV.

On Thu, Mar 26, 2009 at 3:47 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
> I wouldn't split hairs between 100MB and 1GB. However, it may be less reliable due to the extra layer via FUSE if you want to serve it via Apache. It wouldn't be too bad to whip up a Tomcat webapp that goes through Hadoop...
>
> It really depends on your hardware level and redundancy. If you have the money to get the hardware necessary to go with a Lustre-based solution, do that. If you have enough money to load up your pre-existing cluster with lots of disk, HDFS might be better. Certainly it will be outperformed by Lustre if you have lots of reliable hardware, especially in terms of latency.
>
> Brian
> [snip]
Re: Using HDFS to serve www requests
Brian,

Can you share some performance figures for typical workloads with your HDFS/FUSE setup? Obviously, latency is going to be bad but throughput will probably be reasonable... but I'm curious to hear about concrete latency/throughput numbers. And, of course, I'm interested in these numbers as a function of concurrent clients... ;)

Somewhat independent of file size is the workload... you can have huge TB-size files, but still have a seek-heavy workload (in which case HDFS is probably a sub-optimal choice). But if seek-heavy loads are reasonable, one can solve the lots-of-little-files problem by simple concatenation.

Finally, I'm curious about the FUSE overhead (vs. directly using the Java API). Thanks in advance for your insights!

-Jimmy

Brian Bockelman wrote:
> On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:
>> In general, Hadoop is unsuitable for the application you're suggesting. Systems like FUSE-HDFS do exist, though they're not widely used.
>
> We use FUSE on a 270TB cluster to serve up physics data because the client (2.5M lines of C++) doesn't understand how to connect to HDFS directly.
>
> Brian
> [snip]
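Jimmy's concatenation remark is worth a sketch: pack many small files into one large blob and keep a name-to-(offset, length) index, which is roughly what Hadoop's SequenceFile and Hadoop Archives do for the small-files problem. The class below is a minimal in-memory illustration with an invented API; a real version would write the blob as a single large HDFS file and store the index alongside it:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ConcatIndex {
    // name -> {offset, length} into one concatenated blob
    private final Map<String, long[]> index = new HashMap<>();
    private byte[] blob = new byte[0];

    // Append one small file's bytes to the blob and record where it landed.
    void add(String name, byte[] data) {
        long offset = blob.length;
        byte[] merged = Arrays.copyOf(blob, blob.length + data.length);
        System.arraycopy(data, 0, merged, (int) offset, data.length);
        blob = merged;
        index.put(name, new long[]{offset, data.length});
    }

    // Recover one file by slicing the blob at its recorded offset/length.
    byte[] get(String name) {
        long[] entry = index.get(name);
        return Arrays.copyOfRange(blob, (int) entry[0], (int) (entry[0] + entry[1]));
    }

    public static void main(String[] args) {
        ConcatIndex c = new ConcatIndex();
        c.add("a.jpg", "AAAA".getBytes(StandardCharsets.UTF_8));
        c.add("b.jpg", "BB".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(c.get("b.jpg"), StandardCharsets.UTF_8)); // prints: BB
    }
}
```

Reading one small file then costs one large-file open plus one seek, which fits HDFS's strengths far better than one namenode lookup and block fetch per tiny file.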