Re: hadoop question using VMWARE

2011-09-28 Thread Steve Loughran

On 28/09/11 08:37, N Keywal wrote:

For example:
- It's adding two layers (Windows & Linux) that can both fail, especially
under heavy workload (and Hadoop is built to use all the resources
available). They will need to be managed as well (software upgrades,
hardware support...), which is an extra cost.
- These two layers will use the different resources (HDD, CPU, network)
unpredictably, making troubleshooting and performance analysis more
complicated.
- There will be a real performance impact. It depends on what you do and on
how Windows & VMware are configured, but on my non-optimized laptop I lose
more than 50%. VMware claims 15% max, but that's without Windows (running
ESX directly).



Where you take a big hit is in disk IO, as what the guest OS thinks is a
disk with sequentially stored files is just a single file in the host OS
that may be scattered around the real HDD. Disk IO goes through too many
layers. It's often faster to NFS-mount the real HDD.


For compute intensive work, the performance hit isn't so bad, at least 
provided you don't swap.



- Last time I checked (a few months ago), VMware was not able to use all the
cores & memory of medium-sized servers.


Same with VirtualBox, which I like because it is lighter weight.

I use VMs because the infrastructure provides them; things like ElasticMR
from AWS also offer it. Your code may be slower, but what you get is the
ability to bring up clusters on a pay-per-hour basis, and to vary the
number of machines based on the workload/execution plan. If you can
compensate for the IO hit by renting four more servers, you may still
come out ahead.


http://www.slideshare.net/steve_l/farming-hadoop-inthecloud


Re: hadoop question using VMWARE

2011-09-28 Thread N Keywal
For example:
- It's adding two layers (Windows & Linux) that can both fail, especially
under heavy workload (and Hadoop is built to use all the resources
available). They will need to be managed as well (software upgrades,
hardware support...), which is an extra cost.
- These two layers will use the different resources (HDD, CPU, network)
unpredictably, making troubleshooting and performance analysis more
complicated.
- There will be a real performance impact. It depends on what you do and on
how Windows & VMware are configured, but on my non-optimized laptop I lose
more than 50%. VMware claims 15% max, but that's without Windows (running
ESX directly).
- Last time I checked (a few months ago), VMware was not able to use all the
cores & memory of medium-sized servers.
- The namenode needs to be secured, as it's a SPOF (single point of failure).

On Wed, Sep 28, 2011 at 9:07 AM, praveenesh kumar wrote:

>  "it's not something you can do for production nor performance
> analysis."
> Can you please tell me what does it mean ?
> Why Can't we use this approach for production ???
>
> Thanks
>
> On Tue, Sep 27, 2011 at 11:56 PM, N Keywal  wrote:
>
> > Hi,
> >
> > Yes, it will work. HBase won't see the difference; it's purely a VMware
> > matter.
> > Obviously, it's not something you can do for production or performance
> > analysis.
> >
> > Cheers,
> >
> > N.
> >
> > On Wed, Sep 28, 2011 at 8:38 AM, praveenesh kumar wrote:
> >
> > > Hi,
> > >
> > > Suppose I have 10 Windows machines, each running one VM instance
> > > independently. Can these VM instances communicate with each other so
> > > that I can build a Hadoop cluster out of them?
> > >
> > > Has anyone tried this?
> > >
> > > I know we can set up multiple VM instances on the same machine, but can
> > > we do it across different machines as well?
> > > And if I do it this way, is it a good approach, considering I don't have
> > > dedicated Ubuntu machines for Hadoop?
> > >
> > > Thanks,
> > > Praveenesh
> > >
> >
>


Re: hadoop question using VMWARE

2011-09-28 Thread praveenesh kumar
 "it's not something you can do for production nor performance
analysis."
Can you please tell me what does it mean ?
Why Can't we use this approach for production ???

Thanks

On Tue, Sep 27, 2011 at 11:56 PM, N Keywal  wrote:

> Hi,
>
> Yes, it will work. HBase won't see the difference; it's purely a VMware
> matter.
> Obviously, it's not something you can do for production or performance
> analysis.
>
> Cheers,
>
> N.
>
> On Wed, Sep 28, 2011 at 8:38 AM, praveenesh kumar wrote:
>
> > Hi,
> >
> > Suppose I have 10 Windows machines, each running one VM instance
> > independently. Can these VM instances communicate with each other so
> > that I can build a Hadoop cluster out of them?
> >
> > Has anyone tried this?
> >
> > I know we can set up multiple VM instances on the same machine, but can
> > we do it across different machines as well?
> > And if I do it this way, is it a good approach, considering I don't have
> > dedicated Ubuntu machines for Hadoop?
> >
> > Thanks,
> > Praveenesh
> >
>


Re: hadoop question using VMWARE

2011-09-27 Thread N Keywal
Hi,

Yes, it will work. HBase won't see the difference; it's purely a VMware matter.
Obviously, it's not something you can do for production or performance
analysis.

Cheers,

N.
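
As a rough illustration of why it works: as long as each VM can resolve the
NameNode VM by hostname, a client on any of them reaches the same HDFS,
regardless of which physical Windows box hosts it. A minimal connectivity
check against the 0.20-era API; the hostname, port, and key value below are
placeholders for your own setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 0.20-era key; "namenode-vm" and 9000 are placeholders.
    conf.set("fs.default.name", "hdfs://namenode-vm:9000");
    FileSystem fs = FileSystem.get(conf);
    // If this lists the root directory, the VM can talk to the NameNode
    // no matter which physical host it runs on.
    for (FileStatus s : fs.listStatus(new Path("/"))) {
      System.out.println(s.getPath());
    }
  }
}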

On Wed, Sep 28, 2011 at 8:38 AM, praveenesh kumar wrote:

> Hi,
>
> Suppose I have 10 Windows machines, each running one VM instance
> independently. Can these VM instances communicate with each other so that
> I can build a Hadoop cluster out of them?
>
> Has anyone tried this?
>
> I know we can set up multiple VM instances on the same machine, but can we
> do it across different machines as well?
> And if I do it this way, is it a good approach, considering I don't have
> dedicated Ubuntu machines for Hadoop?
>
> Thanks,
> Praveenesh
>


hadoop question using VMWARE

2011-09-27 Thread praveenesh kumar
Hi,

Suppose I have 10 Windows machines, each running one VM instance
independently. Can these VM instances communicate with each other so that I
can build a Hadoop cluster out of them?

Has anyone tried this?

I know we can set up multiple VM instances on the same machine, but can we
do it across different machines as well?
And if I do it this way, is it a good approach, considering I don't have
dedicated Ubuntu machines for Hadoop?

Thanks,
Praveenesh


Re: Hadoop Question

2011-07-28 Thread George Datskos

Nitin,

On 2011/07/28 14:51, Nitin Khandelwal wrote:

> How can I determine if a file is being written to (by any thread) in HDFS?

That information is exposed by the NameNode HTTP servlet. You can obtain it
with the fsck tool (hadoop fsck /path/to/dir -openforwrite) or with an HTTP GET:

http://namenode:port/fsck?path=/your/path&openforwrite=1


George
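
A rough sketch of polling that servlet from Java, assuming it returns the
usual plain-text fsck report in which files still being written are tagged
OPENFORWRITE; the hostname and port (50070 was the default NameNode HTTP
port in that era) are placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class OpenForWriteCheck {
  // True if the fsck report still lists the given path as open for writing.
  static boolean isOpenForWrite(String namenodeHttp, String path) throws Exception {
    URL url = new URL(namenodeHttp + "/fsck?path=" + path + "&openforwrite=1");
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.contains(path) && line.contains("OPENFORWRITE")) {
          return true;
        }
      }
      return false;
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println(isOpenForWrite("http://namenode:50070", "/your/path"));
  }
}

The command-line fsck above gives the same report; the servlet is just
easier to hit from a monitoring process.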




Re: Hadoop Question

2011-07-28 Thread Joey Echeverria
How about having the slave write to a temp file first, then move it to the
file the master is monitoring for after closing it?

-Joey
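
A minimal sketch of that pattern, assuming the slave already has a
FileSystem handle; the scratch and destination directory names are
placeholders. Because an HDFS rename is a metadata-only operation, the file
appears in the watched folder only once it is complete:

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AtomicPublish {
  // Write into a scratch directory first, then rename into the watched folder
  // after the stream is closed, so the master never sees a half-written file.
  public static void publish(FileSystem fs, byte[] data, String fileName) throws Exception {
    Path tmp = new Path("/data/_tmp/" + fileName);      // placeholder scratch dir
    Path dst = new Path("/data/incoming/" + fileName);  // folder the master watches
    OutputStream out = fs.create(tmp);
    try {
      out.write(data);
    } finally {
      out.close();
    }
    if (!fs.rename(tmp, dst)) {
      throw new RuntimeException("rename failed for " + tmp);
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    publish(fs, "hello".getBytes("UTF-8"), "part-00000");
  }
}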



On Jul 27, 2011, at 22:51, Nitin Khandelwal wrote:

> Hi All,
> 
> How can I determine if a file is being written to (by any thread) in HDFS? I
> have a continuous process on the master node which tracks a particular
> folder in HDFS for files to process. On the slave nodes, I am creating files
> in the same folder using the following code:
>
> At the slave node:
>
> import org.apache.commons.io.IOUtils;
> import org.apache.hadoop.fs.FileSystem;
> import java.io.OutputStream;
>
> OutputStream oStream = fileSystem.create(path);
> // "data" is a placeholder; the payload argument was lost from the archived message.
> IOUtils.write(data, oStream);
> IOUtils.closeQuietly(oStream);
>
> At the master node:
> I am picking up the earliest-modified file in the folder. At times, when I
> try reading the file, I get nothing in it, most likely because the slave is
> still writing to it. Is there any way to tell the master that the slave is
> still writing to the file, so that it checks back later for the actual
> content?
> 
> Thanks,
> -- 
> 
> 
> Nitin Khandelwal


Hadoop Question

2011-07-27 Thread Nitin Khandelwal
Hi All,

How can I determine if a file is being written to (by any thread) in HDFS? I
have a continuous process on the master node which tracks a particular
folder in HDFS for files to process. On the slave nodes, I am creating files
in the same folder using the following code:

At the slave node:

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import java.io.OutputStream;

OutputStream oStream = fileSystem.create(path);
// "data" is a placeholder; the payload argument was lost from the archived message.
IOUtils.write(data, oStream);
IOUtils.closeQuietly(oStream);

At the master node:
I am picking up the earliest-modified file in the folder. At times, when I
try reading the file, I get nothing in it, most likely because the slave is
still writing to it. Is there any way to tell the master that the slave is
still writing to the file, so that it checks back later for the actual
content?

Thanks,
-- 


Nitin Khandelwal