Re: Using HDFS to serve www requests

2009-04-06 Thread Brian Bockelman
Indeed, it would be a very nice interface to have (if anyone has some
free time)!


I know a few Caltech people who'd like to see how their WAN
transfer product (http://monalisa.cern.ch/FDT/) would work with HDFS;
if there were an HDFS NIO interface, playing around with HDFS and FDT
would be fairly trivial.


Brian

On Apr 3, 2009, at 5:16 AM, Steve Loughran wrote:


Snehal Nagmote wrote:
Can you please explain what exactly adding an NIO bridge means, how it
can be done, and what the advantages would be in this case?


NIO: Java non-blocking I/O. It's a standard API for talking to different
filesystems; support has been discussed in JIRA. If the DFS APIs
were accessible under an NIO front end, then applications written
for the NIO APIs would work with the supported filesystems, with no
need to code specifically against Hadoop's not-yet-stable APIs.



Steve Loughran wrote:

Edward Capriolo wrote:

It is a little more natural to connect to HDFS from Apache Tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS API.


I have modified this code to run inside Tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.

If someone adds an NIO bridge to Hadoop filesystems then it would
be easier, leaving you only with the performance issues.







Re: Using HDFS to serve www requests

2009-04-03 Thread Steve Loughran

Snehal Nagmote wrote:

Can you please explain what exactly adding an NIO bridge means, how it
can be done, and what the advantages would be in this case?


NIO: Java non-blocking I/O. It's a standard API for talking to different
filesystems; support has been discussed in JIRA. If the DFS APIs were
accessible under an NIO front end, then applications written for the NIO
APIs would work with the supported filesystems, with no need to code
specifically against Hadoop's not-yet-stable APIs.
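
For illustration, a sketch of what application code could look like if a
JSR 203 ("NIO.2") provider were registered for the hdfs scheme. No such
provider ships with Hadoop, and the namenode URI and path below are
placeholders, so treat this purely as a what-if:

    import java.net.URI;
    import java.nio.file.FileSystem;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Collections;

    public class NioHdfsSketch {
        public static void main(String[] args) throws Exception {
            // Looks up whatever provider is registered for the "hdfs" scheme
            // (hypothetical -- this throws if no such provider is installed).
            FileSystem hdfs = FileSystems.newFileSystem(
                    URI.create("hdfs://namenode:9000/"),
                    Collections.<String, Object>emptyMap());
            Path p = hdfs.getPath("/images/foo.jpg");
            // From here on, the standard Files calls work against any
            // registered filesystem -- no Hadoop-specific code needed.
            byte[] data = Files.readAllBytes(p);
            System.out.println(p + ": " + data.length + " bytes");
        }
    }

The point of the bridge is exactly this last part: once a provider exists,
generic NIO code needs no knowledge of HDFS at all.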







Steve Loughran wrote:

Edward Capriolo wrote:

It is a little more natural to connect to HDFS from Apache Tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS API.

I have modified this code to run inside Tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.

If someone adds an NIO bridge to Hadoop filesystems then it would be
easier, leaving you only with the performance issues.







Re: Using HDFS to serve www requests

2009-03-30 Thread Steve Loughran

Edward Capriolo wrote:

It is a little more natural to connect to HDFS from Apache Tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS API.

I have modified this code to run inside Tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.



If someone adds an NIO bridge to Hadoop filesystems then it would be
easier, leaving you only with the performance issues.


Re: Using HDFS to serve www requests

2009-03-29 Thread Snehal Nagmote

Hello Sir,
I am doing an MTech at IIIT Hyderabad, and in our project we have a similar
requirement of accessing HDFS directly from an Apache server (Tomcat). Can
you please explain how to do this with some example, probably the same one
you modified? Does it require the Hadoop installation directory to sit on
the same machine, or would just copying the jars and config files do?

Thanks in advance




Edward Capriolo wrote:
 
 It is a little more natural to connect to HDFS from Apache Tomcat.
 This will allow you to skip the FUSE mounts and just use the HDFS API.

 I have modified this code to run inside Tomcat.
 http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

 I will not testify to how well this setup will perform under internet
 traffic, but it does work.

 GlusterFS is more like a traditional POSIX filesystem. It supports
 locking and appends, and you can do things like put the MySQL data
 directory on it.

 GlusterFS is geared for storing data to be accessed with low latency.
 Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand.
 The GlusterFS volume is mounted directly on a Unix system.

 Hadoop is a user-space filesystem. The latency is higher. Nodes are
 connected by gigabit Ethernet. It is closely coupled with MapReduce.

 You can use the API or the FUSE module to mount Hadoop, but that is not
 a direct goal of Hadoop. Hope that helps.
 
 




Re: Using HDFS to serve www requests

2009-03-28 Thread phil cryer
Yes. IMHO GlusterFS advertises benchmarks vs. Lustre.
You're right, I've found those now. Thanks for the reply - it helped.

P

On Fri, Mar 27, 2009 at 5:04 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
but does Sun's Lustre follow in the steps of Gluster then

 Yes. IMHO GlusterFS advertises benchmarks vs. Lustre.

 The main difference is that GlusterFS is a FUSE (userspace) filesystem,
 while Lustre has to be patched into the kernel, or loaded as a module.



Re: Using HDFS to serve www requests

2009-03-27 Thread Edward Capriolo
but does Sun's Lustre follow in the steps of Gluster then

Yes. IMHO GlusterFS advertises benchmarks vs. Lustre.

The main difference is that GlusterFS is a FUSE (userspace) filesystem,
while Lustre has to be patched into the kernel, or loaded as a module.


Re: Using HDFS to serve www requests

2009-03-26 Thread Aaron Kimball
In general, Hadoop is unsuitable for the application you're suggesting.
Systems like FUSE HDFS do exist, though they're not widely used. I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have "huge" images, how big is huge? HDFS might be
useful if these images are 1 GB or larger. But in general, "huge" on Hadoop
means tens of GBs up to TBs.  If you have a large number of moderately-sized
files, you'll find that HDFS responds very poorly to your needs.

It sounds like GlusterFS is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer p...@cryer.us wrote:

 This is somewhat of a noob question I know, but after learning about
 Hadoop, testing it in a small cluster, and running MapReduce jobs on
 it, I'm still not sure if Hadoop is the right distributed file system
 to serve web requests.  In other words, can it, or is it right to, serve
 images and data from HDFS, using something like FUSE to mount a
 filesystem where Apache could serve images from it?  We have huge
 images, thus the need for a distributed file system, and they go in,
 get stored with lots of metadata, and are redundant with Hadoop/HDFS -
 but is it the right way to serve web content?

 I looked at GlusterFS before; they had an Apache and a Lighttpd module
 which made it simple. Does HDFS have something like this, do people
 just use a FUSE option as I described, or is this not a good use of
 Hadoop?

 Thanks

 P



Re: Using HDFS to serve www requests

2009-03-26 Thread Brian Bockelman


On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:

In general, Hadoop is unsuitable for the application you're suggesting.

Systems like FUSE HDFS do exist, though they're not widely used.


We use FUSE on a 270TB cluster to serve up physics data because the
client (2.5M lines of C++) doesn't understand how to connect to HDFS
directly.


Brian


I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have "huge" images, how big is huge? HDFS might be
useful if these images are 1 GB or larger. But in general, "huge" on
Hadoop means tens of GBs up to TBs.  If you have a large number of
moderately-sized files, you'll find that HDFS responds very poorly to
your needs.

It sounds like GlusterFS is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer p...@cryer.us wrote:


This is somewhat of a noob question I know, but after learning about
Hadoop, testing it in a small cluster, and running MapReduce jobs on
it, I'm still not sure if Hadoop is the right distributed file system
to serve web requests.  In other words, can it, or is it right to, serve
images and data from HDFS, using something like FUSE to mount a
filesystem where Apache could serve images from it?  We have huge
images, thus the need for a distributed file system, and they go in,
get stored with lots of metadata, and are redundant with Hadoop/HDFS -
but is it the right way to serve web content?

I looked at GlusterFS before; they had an Apache and a Lighttpd module
which made it simple. Does HDFS have something like this, do people
just use a FUSE option as I described, or is this not a good use of
Hadoop?

Thanks

P





Re: Using HDFS to serve www requests

2009-03-26 Thread Edward Capriolo
It is a little more natural to connect to HDFS from Apache Tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS API.

I have modified this code to run inside Tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.
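
To make the idea concrete, a minimal sketch of such a servlet using the
stock FileSystem API. The namenode address, port, and path mapping are
placeholders, not taken from the wiki example:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.URI;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsImageServlet extends HttpServlet {
        private FileSystem fs;

        @Override
        public void init() throws ServletException {
            try {
                // Placeholder namenode address -- use your cluster's
                // fs.default.name here.
                fs = FileSystem.get(URI.create("hdfs://namenode:9000/"),
                                    new Configuration());
            } catch (IOException e) {
                throw new ServletException(e);
            }
        }

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            String rel = req.getPathInfo();   // e.g. /images/foo.jpg
            if (rel == null) {
                resp.sendError(HttpServletResponse.SC_BAD_REQUEST);
                return;
            }
            Path p = new Path(rel);           // map the URL path onto HDFS
            if (!fs.exists(p)) {
                resp.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }
            resp.setContentType(getServletContext().getMimeType(p.getName()));
            FSDataInputStream in = fs.open(p);
            OutputStream out = resp.getOutputStream();
            try {
                // Stream the file; "false" leaves the response stream open
                // for Tomcat to finish the request.
                IOUtils.copyBytes(in, out, 4096, false);
            } finally {
                in.close();
            }
        }
    }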

GlusterFS is more like a traditional POSIX filesystem. It supports
locking and appends, and you can do things like put the MySQL data
directory on it.

GlusterFS is geared for storing data to be accessed with low latency.
Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand.
The GlusterFS volume is mounted directly on a Unix system.

Hadoop is a user-space filesystem. The latency is higher. Nodes are
connected by gigabit Ethernet. It is closely coupled with MapReduce.

You can use the API or the FUSE module to mount Hadoop, but that is not
a direct goal of Hadoop. Hope that helps.


Re: Using HDFS to serve www requests

2009-03-26 Thread phil cryer
 When you say that you have "huge" images, how big is huge?

Yes, we're looking at some images that are 100 MB in size, but
nothing like what you're speaking of.  This helps me understand
Hadoop's usage better; unfortunately it won't be the fit I was
hoping for.

 You can use the API or the FUSE module to mount Hadoop, but that is not
 a direct goal of Hadoop. Hope that helps.

Very interesting, and yes, that does indeed help. Not to veer off
thread too much, but does Sun's Lustre follow in the steps of Gluster
then?  I know Lustre requires kernel patches to install, so it's at a
different level than the others, but I have seen some articles about
large-scale clusters built with Lustre and want to look at that as
another option.

Again, thanks for the info. If anyone has general information on
cluster software, or knows of a more appropriate list, I'd appreciate
the advice.

Thanks

P

On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 It is a little more natural to connect to HDFS from Apache Tomcat.
 This will allow you to skip the FUSE mounts and just use the HDFS API.

 I have modified this code to run inside Tomcat.
 http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

 I will not testify to how well this setup will perform under internet
 traffic, but it does work.

 GlusterFS is more like a traditional POSIX filesystem. It supports
 locking and appends, and you can do things like put the MySQL data
 directory on it.

 GlusterFS is geared for storing data to be accessed with low latency.
 Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand.
 The GlusterFS volume is mounted directly on a Unix system.

 Hadoop is a user-space filesystem. The latency is higher. Nodes are
 connected by gigabit Ethernet. It is closely coupled with MapReduce.

 You can use the API or the FUSE module to mount Hadoop, but that is not
 a direct goal of Hadoop. Hope that helps.



Re: Using HDFS to serve www requests

2009-03-26 Thread Norbert Burger
Have you looked into MogileFS already?  Seems like a good fit, based
on your description.  This question has come up more than once here,
and MogileFS is an oft-recommended solution.

Norbert

On 3/26/09, phil cryer p...@cryer.us wrote:
  When you say that you have "huge" images, how big is huge?


  Yes, we're looking at some images that are 100 MB in size, but
  nothing like what you're speaking of.  This helps me understand
  Hadoop's usage better; unfortunately it won't be the fit I was
  hoping for.


   You can use the API or the FUSE module to mount Hadoop, but that is not
   a direct goal of Hadoop. Hope that helps.


  Very interesting, and yes, that does indeed help. Not to veer off
  thread too much, but does Sun's Lustre follow in the steps of Gluster
  then?  I know Lustre requires kernel patches to install, so it's at a
  different level than the others, but I have seen some articles about
  large-scale clusters built with Lustre and want to look at that as
  another option.

  Again, thanks for the info. If anyone has general information on
  cluster software, or knows of a more appropriate list, I'd appreciate
  the advice.

  Thanks


  P


  On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
   It is a little more natural to connect to HDFS from Apache Tomcat.
   This will allow you to skip the FUSE mounts and just use the HDFS API.

   I have modified this code to run inside Tomcat.
   http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

   I will not testify to how well this setup will perform under internet
   traffic, but it does work.

   GlusterFS is more like a traditional POSIX filesystem. It supports
   locking and appends, and you can do things like put the MySQL data
   directory on it.

   GlusterFS is geared for storing data to be accessed with low latency.
   Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand.
   The GlusterFS volume is mounted directly on a Unix system.

   Hadoop is a user-space filesystem. The latency is higher. Nodes are
   connected by gigabit Ethernet. It is closely coupled with MapReduce.

   You can use the API or the FUSE module to mount Hadoop, but that is not
   a direct goal of Hadoop. Hope that helps.
  



Re: Using HDFS to serve www requests

2009-03-26 Thread Brian Bockelman


On Mar 26, 2009, at 8:55 PM, phil cryer wrote:


When you say that you have "huge" images, how big is huge?


Yes, we're looking at some images that are 100 MB in size, but
nothing like what you're speaking of.  This helps me understand
Hadoop's usage better; unfortunately it won't be the fit I was
hoping for.



I wouldn't split hairs between 100MB and 1GB.  However, it may be less
reliable due to the extra layer of FUSE if you want to serve it via
Apache.  It wouldn't be too bad to whip up a Tomcat webapp that goes
through Hadoop...


It really depends on your hardware level and redundancy.  If you have
the money to get the hardware necessary to go with a Lustre-based
solution, do that.  If you have enough money to load up your
pre-existing cluster with lots of disk, HDFS might be better.  Certainly
it will be outperformed by Lustre if you have lots of reliable
hardware, especially in terms of latency.


Brian

You can use the API or the FUSE module to mount Hadoop, but that is not
a direct goal of Hadoop. Hope that helps.


Very interesting, and yes, that does indeed help. Not to veer off
thread too much, but does Sun's Lustre follow in the steps of Gluster
then?  I know Lustre requires kernel patches to install, so it's at a
different level than the others, but I have seen some articles about
large-scale clusters built with Lustre and want to look at that as
another option.

Again, thanks for the info. If anyone has general information on
cluster software, or knows of a more appropriate list, I'd appreciate
the advice.

Thanks

P

On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

It is a little more natural to connect to HDFS from Apache Tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS API.


I have modified this code to run inside Tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

I will not testify to how well this setup will perform under internet
traffic, but it does work.

GlusterFS is more like a traditional POSIX filesystem. It supports
locking and appends, and you can do things like put the MySQL data
directory on it.

GlusterFS is geared for storing data to be accessed with low latency.
Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand.
The GlusterFS volume is mounted directly on a Unix system.

Hadoop is a user-space filesystem. The latency is higher. Nodes are
connected by gigabit Ethernet. It is closely coupled with MapReduce.

You can use the API or the FUSE module to mount Hadoop, but that is not
a direct goal of Hadoop. Hope that helps.





Re: Using HDFS to serve www requests

2009-03-26 Thread Jeff Hammerbacher
HDFS itself has some facilities for serving data over HTTP:
https://issues.apache.org/jira/browse/HADOOP-5010. YMMV.
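
For anyone who wants to try that route, a rough sketch of fetching a file
over HDFS's built-in HTTP interface. The servlet path and port below are
assumptions that have varied across Hadoop versions, so check your own
NameNode/DataNode web UI before relying on them:

    import java.io.InputStream;
    import java.net.URL;

    public class HdfsHttpFetch {
        public static void main(String[] args) throws Exception {
            // Assumed endpoint: a file-streaming servlet on a datanode's
            // web port. Both the path and the port are placeholders.
            URL url = new URL(
                "http://datanode:50075/streamFile?filename=/images/foo.jpg");
            InputStream in = url.openStream();
            try {
                byte[] buf = new byte[4096];
                long total = 0;
                for (int n; (n = in.read(buf)) != -1; ) {
                    total += n;  // count bytes; a real client would write them out
                }
                System.out.println("Read " + total + " bytes over HTTP");
            } finally {
                in.close();
            }
        }
    }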

On Thu, Mar 26, 2009 at 3:47 PM, Brian Bockelman bbock...@cse.unl.edu wrote:


 On Mar 26, 2009, at 8:55 PM, phil cryer wrote:

  When you say that you have "huge" images, how big is huge?


 Yes, we're looking at some images that are 100 MB in size, but
 nothing like what you're speaking of.  This helps me understand
 Hadoop's usage better; unfortunately it won't be the fit I was
 hoping for.


 I wouldn't split hairs between 100MB and 1GB.  However, it may be less
 reliable due to the extra layer of FUSE if you want to serve it via
 Apache.  It wouldn't be too bad to whip up a Tomcat webapp that goes
 through Hadoop...

 It really depends on your hardware level and redundancy.  If you have the
 money to get the hardware necessary to go with a Lustre-based solution, do
 that.  If you have enough money to load up your pre-existing cluster with
 lots of disk, HDFS might be better.  Certainly it will be outperformed by
 Lustre if you have lots of reliable hardware, especially in terms of
 latency.

 Brian


  You can use the API or the FUSE module to mount Hadoop, but that is not
  a direct goal of Hadoop. Hope that helps.


 Very interesting, and yes, that does indeed help. Not to veer off
 thread too much, but does Sun's Lustre follow in the steps of Gluster
 then?  I know Lustre requires kernel patches to install, so it's at a
 different level than the others, but I have seen some articles about
 large-scale clusters built with Lustre and want to look at that as
 another option.

 Again, thanks for the info. If anyone has general information on
 cluster software, or knows of a more appropriate list, I'd appreciate
 the advice.

 Thanks

 P

 On Thu, Mar 26, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 It is a little more natural to connect to HDFS from Apache Tomcat.
 This will allow you to skip the FUSE mounts and just use the HDFS API.

 I have modified this code to run inside Tomcat.
 http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

 I will not testify to how well this setup will perform under internet
 traffic, but it does work.

 GlusterFS is more like a traditional POSIX filesystem. It supports
 locking and appends, and you can do things like put the MySQL data
 directory on it.

 GlusterFS is geared for storing data to be accessed with low latency.
 Nodes (bricks) are normally connected via gigabit Ethernet or InfiniBand.
 The GlusterFS volume is mounted directly on a Unix system.

 Hadoop is a user-space filesystem. The latency is higher. Nodes are
 connected by gigabit Ethernet. It is closely coupled with MapReduce.

 You can use the API or the FUSE module to mount Hadoop, but that is not
 a direct goal of Hadoop. Hope that helps.





Re: Using HDFS to serve www requests

2009-03-26 Thread Jimmy Lin

Brian---

Can you share some performance figures for typical workloads with your
HDFS/FUSE setup?  Obviously, latency is going to be bad, but throughput
will probably be reasonable... I'm curious to hear about concrete
latency/throughput numbers.  And, of course, I'm interested in these
numbers as a function of the number of concurrent clients... ;)


Somewhat independent of file size is the workload... you can have huge
TB-size files but still have a seek-heavy workload (in which case HDFS
is probably a sub-optimal choice).  But if the seek load is reasonable,
one can solve the lots-of-little-files problem by simple concatenation
(see the sketch below).
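
A minimal sketch of that concatenation idea, packing many small files into
one SequenceFile keyed by filename so HDFS sees a few big files instead of
millions of small ones. The paths here are placeholders:

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/packed/images.seq");  // placeholder output
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, BytesWritable.class);
            try {
                // Walk a directory of small files (placeholder path).
                for (FileStatus st : fs.listStatus(new Path("/small-images"))) {
                    InputStream in = fs.open(st.getPath());
                    ByteArrayOutputStream bos = new ByteArrayOutputStream();
                    IOUtils.copyBytes(in, bos, 4096, true);  // closes the stream
                    // Key = original filename, value = raw file contents.
                    writer.append(new Text(st.getPath().getName()),
                                  new BytesWritable(bos.toByteArray()));
                }
            } finally {
                writer.close();
            }
        }
    }

Readers then look files up by key (or iterate sequentially), trading random
per-file access for a layout that HDFS and the namenode handle well.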


Finally, I'm curious about the FUSE overhead (vs. directly using the
Java API).


Thanks in advance for your insights!

-Jimmy

Brian Bockelman wrote:


On Mar 26, 2009, at 5:44 PM, Aaron Kimball wrote:


In general, Hadoop is unsuitable for the application you're suggesting.
Systems like FUSE HDFS do exist, though they're not widely used.


We use FUSE on a 270TB cluster to serve up physics data because the 
client (2.5M lines of C++) doesn't understand how to connect to HDFS 
directly.


Brian


I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have "huge" images, how big is huge? HDFS might be
useful if these images are 1 GB or larger. But in general, "huge" on
Hadoop means tens of GBs up to TBs.  If you have a large number of
moderately-sized files, you'll find that HDFS responds very poorly to
your needs.

It sounds like GlusterFS is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer p...@cryer.us wrote:


This is somewhat of a noob question I know, but after learning about
Hadoop, testing it in a small cluster, and running MapReduce jobs on
it, I'm still not sure if Hadoop is the right distributed file system
to serve web requests.  In other words, can it, or is it right to, serve
images and data from HDFS, using something like FUSE to mount a
filesystem where Apache could serve images from it?  We have huge
images, thus the need for a distributed file system, and they go in,
get stored with lots of metadata, and are redundant with Hadoop/HDFS -
but is it the right way to serve web content?

I looked at GlusterFS before; they had an Apache and a Lighttpd module
which made it simple. Does HDFS have something like this, do people
just use a FUSE option as I described, or is this not a good use of
Hadoop?

Thanks

P