Re: JNI in Map Reduce

2010-02-16 Thread Jason Rutherglen
How would this work?

On Fri, Feb 12, 2010 at 10:45 AM, Allen Wittenauer
 wrote:
>
> ... or just use distributed cache.
>
>
> On 2/12/10 10:02 AM, "Alex Kozlov"  wrote:
>
>> All native libraries should be on each of the cluster nodes.  You need to
>> set the "java.library.path" property to point to your libraries (or just put
>> them in the default system dirs).
>>
>> On Fri, Feb 12, 2010 at 9:12 AM, Utkarsh Agarwal
>> wrote:
>>
>>> Can anybody point me to how to use JNI calls in a MapReduce program? My .so
>>> files have other dependencies as well; is there a way to set
>>> LD_LIBRARY_PATH for the child processes? Should all the native files be in
>>> HDFS?
>>>
>>> Thanks,
>>> Utkarsh.
>>>
>
>
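
A minimal sketch of the distributed cache approach suggested above (the HDFS path, library name, and job class are placeholders; the idea is to symlink the .so into each task's working directory and point the child JVM's java.library.path at it):

Configuration conf = new Configuration();
// Ship the native library (and any dependent .so files) to every task node;
// the "#libfoo.so" fragment symlinks it into each task's working directory.
DistributedCache.createSymlink(conf);
DistributedCache.addCacheFile(new URI("hdfs:///libs/libfoo.so#libfoo.so"), conf);
// Point the child task JVMs at the working directory so System.loadLibrary() finds it.
conf.set("mapred.child.java.opts", "-Djava.library.path=.");
JobConf job = new JobConf(conf, MyJob.class); // MyJob is a placeholder job class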


MiniDFSCluster accessed via hdfs:// URL

2010-02-16 Thread Jason Rutherglen
Is it possible to access a MiniDFSCluster via an hdfs:// URL?  I ask
because it seems to not work...


Re: MiniDFSCluster accessed via hdfs:// URL

2010-02-17 Thread Jason Rutherglen
Philip,

Thanks... I examined your patch; however, I don't see the difference
between it and what I've got currently, which is:

Configuration conf = new Configuration();
MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
URI uri = dfs.getFileSystem().getUri();
System.out.println("uri:" + uri);

What could be the difference?

Jason

On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger  wrote:
> It is, though you have to ask it what port it's running on.  See the patch in
> https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that does
> that.
>
> -- Philip
>
> On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> Is it possible to access a MiniDFSCluster via an hdfs:// URL?  I ask
>> because it seems to not work...
>>
>


Re: MiniDFSCluster accessed via hdfs:// URL

2010-02-17 Thread Jason Rutherglen
Ok, I got this working... Thanks Philip!

On Wed, Feb 17, 2010 at 4:01 PM, Jason Rutherglen
 wrote:
> Philip,
>
> Thanks... I examined your patch; however, I don't see the difference
> between it and what I've got currently, which is:
>
> Configuration conf = new Configuration();
> MiniDFSCluster dfs = new MiniDFSCluster(conf, 1, true, null);
> URI uri = dfs.getFileSystem().getUri();
> System.out.println("uri:" + uri);
>
> What could be the difference?
>
> Jason
>
> On Tue, Feb 16, 2010 at 5:42 PM, Philip Zeyliger  wrote:
>> It is, though you have to ask it what port it's running on.  See the patch in
>> https://issues.apache.org/jira/browse/MAPREDUCE-987 for some code that does
>> that.
>>
>> -- Philip
>>
>> On Tue, Feb 16, 2010 at 5:30 PM, Jason Rutherglen <
>> jason.rutherg...@gmail.com> wrote:
>>
>>> Is it possible to access a MiniDFSCluster via an hdfs:// URL?  I ask
>>> because it seems to not work...
>>>
>>
>
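
A minimal sketch of talking to the mini cluster through an hdfs:// URI (the /test path is illustrative; the key point, as noted above, is to ask the running cluster for its port rather than assuming one):

Configuration conf = new Configuration();
MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
// Ask the running cluster which port the namenode actually bound to.
int port = cluster.getNameNodePort();
URI uri = URI.create("hdfs://localhost:" + port + "/");
FileSystem fs = FileSystem.get(uri, conf);
fs.mkdirs(new Path("/test"));
System.out.println("uri: " + uri + " exists: " + fs.exists(new Path("/test")));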


Re: Hadoop in Real time applications

2011-02-17 Thread Jason Rutherglen
Ted, thanks for the links; the yahoo.com one doesn't seem to exist?

On Wed, Feb 16, 2011 at 11:48 PM, Ted Dunning  wrote:
> Unless you go beyond the current standard semantics, this is true.
>
> See here: http://code.google.com/p/hop/ and
> http://labs.yahoo.com/node/476 for alternatives.
>
> On Wed, Feb 16, 2011 at 10:30 PM, madhu phatak  wrote:
>
>> Hadoop is not suited for real-time applications
>>
>> On Thu, Feb 17, 2011 at 9:47 AM, Karthik Kumar  wrote:
>>
>> > Can Hadoop be used for real-time applications such as banking
>> > solutions...
>> >
>> > --
>> > With Regards,
>> > Karthik
>> >
>>
>


Re: Memory mapped resources

2011-04-11 Thread Jason Rutherglen
Yes, you can; however, it will require customization of HDFS.  Take a
look at HDFS-347, specifically the HDFS-347-branch-20-append.txt patch.
I have been altering it for use with HBASE-3529.  Note that the patch
is for the -append branch, which is mainly for HBase.

On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies  wrote:
> We have some very large files that we access via memory mapping in
> Java. Someone's asked us about how to make this conveniently
> deployable in Hadoop. If we tell them to put the files into hdfs, can
> we obtain a File for the underlying file on any given node?
>


Re: Memory mapped resources

2011-04-11 Thread Jason Rutherglen
What do you mean by local chunk?  I think it's providing access to the
underlying file block?

On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning  wrote:
> Also, it only provides access to a local chunk of a file which isn't very
> useful.
>
> On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo 
> wrote:
>>
>> On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
>>  wrote:
>> > Yes you can however it will require customization of HDFS.  Take a
>> > look at HDFS-347 specifically the HDFS-347-branch-20-append.txt patch.
>> >  I have been altering it for use with HBASE-3529.  Note that the patch
>> > noted is for the -append branch which is mainly for HBase.
>> >
>> > On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies
>> >  wrote:
>> >> We have some very large files that we access via memory mapping in
>> >> Java. Someone's asked us about how to make this conveniently
>> >> deployable in Hadoop. If we tell them to put the files into hdfs, can
>> >> we obtain a File for the underlying file on any given node?
>> >>
>> >
>>
>> This feature is not yet part of Hadoop, so doing this is not "convenient".
>
>


Re: Memory mapped resources

2011-04-12 Thread Jason Rutherglen
Then one could MMap the blocks pertaining to the HDFS file and piece
them together.  Lucene's MMapDirectory implementation does just this
to avoid an obscure JVM bug.
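
A minimal sketch of that chunked mapping for a local file, using plain java.nio (the path is a placeholder; each MappedByteBuffer is limited to 2 GB, hence mapping in fixed-size pieces):

RandomAccessFile raf = new RandomAccessFile("/path/to/local/blockfile", "r");
FileChannel channel = raf.getChannel();
long chunkSize = 1L << 30; // 1 GB per mapping (each MappedByteBuffer caps at 2 GB)
long length = channel.size();
List<MappedByteBuffer> buffers = new ArrayList<MappedByteBuffer>();
for (long offset = 0; offset < length; offset += chunkSize) {
  long size = Math.min(chunkSize, length - offset);
  buffers.add(channel.map(FileChannel.MapMode.READ_ONLY, offset, size));
}
// A reader then selects buffers.get((int)(pos / chunkSize)) and positions within it.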

On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning  wrote:
> Yes.  But only one such block. That is what I meant by chunk.
> That is fine if you want that chunk, but if you want to mmap the entire file,
> it isn't really useful.
>
> On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen
>  wrote:
>>
>> What do you mean by local chunk?  I think it's providing access to the
>> underlying file block?
>>
>> On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning 
>> wrote:
>> > Also, it only provides access to a local chunk of a file which isn't
>> > very
>> > useful.
>> >
>> > On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo 
>> > wrote:
>> >>
>> >> On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
>> >>  wrote:
>> >> > Yes you can however it will require customization of HDFS.  Take a
>> >> > look at HDFS-347 specifically the HDFS-347-branch-20-append.txt
>> >> > patch.
>> >> >  I have been altering it for use with HBASE-3529.  Note that the
>> >> > patch
>> >> > noted is for the -append branch which is mainly for HBase.
>> >> >
>> >> > On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies
>> >> >  wrote:
>> >> >> We have some very large files that we access via memory mapping in
>> >> >> Java. Someone's asked us about how to make this conveniently
>> >> >> deployable in Hadoop. If we tell them to put the files into hdfs,
>> >> >> can
>> >> >> we obtain a File for the underlying file on any given node?
>> >> >>
>> >> >
>> >>
>> >> This feature is not yet part of Hadoop, so doing this is not
>> >> "convenient".
>> >
>> >
>
>


Setting a larger block size at runtime in the DFSClient

2011-04-12 Thread Jason Rutherglen
Are there performance implications to setting the block size to 1 GB
or higher (via the DFSClient.create method)?


Re: Memory mapped resources

2011-04-12 Thread Jason Rutherglen
>  The others you will have to read more conventionally

True.  I think there are emergent use cases that demand data locality,
e.g., an optimized HBase system, search, and MMap'ing.

> If all blocks are guaranteed local, this would work.  I don't think that 
> guarantee is possible
> on a non-trivial cluster

Interesting.  I'm not familiar with how blocks end up local; however, I'm
interested in how to make this occur via an explicit call.  E.g.,
is there an option available that guarantees locality, and if not,
is there work being done in that direction?

On Tue, Apr 12, 2011 at 8:08 AM, Ted Dunning  wrote:
> Well, no.
> You could mmap all the blocks that are local to the node your program is on.
>  The others you will have to read more conventionally.  If all blocks are
> guaranteed local, this would work.  I don't think that guarantee is possible
> on a non-trivial cluster.
>
> On Tue, Apr 12, 2011 at 6:32 AM, Jason Rutherglen
>  wrote:
>>
>> Then one could MMap the blocks pertaining to the HDFS file and piece
>> them together.  Lucene's MMapDirectory implementation does just this
>> to avoid an obscure JVM bug.
>>
>> On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning 
>> wrote:
>> > Yes.  But only one such block. That is what I meant by chunk.
>> > That is fine if you want that chunk but if you want to mmap the entire
>> > file,
>> > it isn't real useful.
>> >
>> > On Mon, Apr 11, 2011 at 6:48 PM, Jason Rutherglen
>> >  wrote:
>> >>
>> >> What do you mean by local chunk?  I think it's providing access to the
>> >> underlying file block?
>> >>
>> >> On Mon, Apr 11, 2011 at 6:30 PM, Ted Dunning 
>> >> wrote:
>> >> > Also, it only provides access to a local chunk of a file which isn't
>> >> > very
>> >> > useful.
>> >> >
>> >> > On Mon, Apr 11, 2011 at 5:32 PM, Edward Capriolo
>> >> > 
>> >> > wrote:
>> >> >>
>> >> >> On Mon, Apr 11, 2011 at 7:05 PM, Jason Rutherglen
>> >> >>  wrote:
>> >> >> > Yes you can however it will require customization of HDFS.  Take a
>> >> >> > look at HDFS-347 specifically the HDFS-347-branch-20-append.txt
>> >> >> > patch.
>> >> >> >  I have been altering it for use with HBASE-3529.  Note that the
>> >> >> > patch
>> >> >> > noted is for the -append branch which is mainly for HBase.
>> >> >> >
>> >> >> > On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies
>> >> >> >  wrote:
>> >> >> >> We have some very large files that we access via memory mapping
>> >> >> >> in
>> >> >> >> Java. Someone's asked us about how to make this conveniently
>> >> >> >> deployable in Hadoop. If we tell them to put the files into hdfs,
>> >> >> >> can
>> >> >> >> we obtain a File for the underlying file on any given node?
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >> This feature is not yet part of Hadoop, so doing this is not
>> >> >> "convenient".
>> >> >
>> >> >
>> >
>> >
>
>


Re: Setting a larger block size at runtime in the DFSClient

2011-04-12 Thread Jason Rutherglen
Harsh, thanks, and sounds good!

On Tue, Apr 12, 2011 at 7:08 AM, Harsh J  wrote:
> Hey Jason,
>
> On Tue, Apr 12, 2011 at 7:06 PM, Jason Rutherglen
>  wrote:
>> Are there performance implications to setting the block size to 1 GB
>> or higher (via the DFSClient.create method)?
>
> You'll be streaming 1 complete GB per block to a DN with that value
> (before the next block gets scheduled on another); there shouldn't be any
> differences beyond that.
>
> --
> Harsh J
>
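
A sketch of passing a per-file block size at create time (path, replication, and buffer size are illustrative; FileSystem.create has an overload that takes the block size directly, and DFSClient.create takes an equivalent argument):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
long blockSize = 1024L * 1024 * 1024; // 1 GB blocks for this file only
Path path = new Path("/data/large-file"); // illustrative
int bufferSize = conf.getInt("io.file.buffer.size", 4096);
short replication = 3;
FSDataOutputStream out = fs.create(path, true, bufferSize, replication, blockSize);
// ... write the data ...
out.close();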


Re: Memory mapped resources

2011-04-12 Thread Jason Rutherglen
To get around the chunks-or-blocks problem, I've been implementing a
system that simply sets a max block size that is too large for a file
to reach.  In this way there will only be one block per HDFS file, and
so MMap'ing or other single-file ops become trivial.
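
A sketch of that setup, with a check that the file really did land in a single block (path, sizes, and replication are illustrative; whether a given version accepts an oversized block size may vary):

FileSystem fs = FileSystem.get(new Configuration());
long maxBlockSize = 4L * 1024 * 1024 * 1024; // larger than any file we expect to write
Path path = new Path("/data/mmap-candidate"); // illustrative
FSDataOutputStream out = fs.create(path, true, 4096, (short) 1, maxBlockSize);
// ... write the file ...
out.close();
FileStatus status = fs.getFileStatus(path);
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
System.out.println("blocks: " + blocks.length); // expect 1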

On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies
 wrote:
> Here's the OP again.
>
> I want to make it clear that my question here has to do with the
> problem of distributing 'the program' around the cluster, not 'the
> data'. In the case at hand, the issue is a system that has a large data
> resource that it needs to do its work. Every instance of the code
> needs the entire model. Not just some blocks or pieces.
>
> Memory mapping is a very attractive tactic for this kind of data
> resource. The data is read-only. Memory-mapping it allows the
> operating system to ensure that only one copy of the thing ends up in
> physical memory.
>
> If we force the model into a conventional file (storable in HDFS) and
> read it into the JVM in a conventional way, then we get as many copies
> in memory as we have JVMs.  On a big machine with a lot of cores, this
> begins to add up.
>
> For people who are running a cluster of relatively conventional
> systems, just putting copies on all the nodes in a conventional place
> is adequate.
>


Does changing the block size of MiniDFSCluster work?

2011-04-12 Thread Jason Rutherglen
I'm using the 0.20.3 append branch and am wondering why the following
fails: setting the block size either in the Configuration or in the
DFSClient.create method causes a failure later on when writing a file
out.

Configuration conf = new Configuration();
long blockSize = 32L * 1024 * 1024 * 1024; // 32 GB
conf.setLong("dfs.block.size", blockSize); // doesn't work
MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
FileSystem fileSys = cluster.getFileSystem();
Path path = new Path("/test/file"); // path, bufferSize, and replication are example values
int bufferSize = 4096;
short replication = 1;
fileSys.create(path, false, bufferSize, replication, blockSize); // doesn't work
fileSys.create(path); // works

Output: http://pastebin.com/MrQJcbJr


Re: Poor IO performance on a 10 node cluster.

2011-05-30 Thread Jason Rutherglen
That's a small town in Iceland.

On Mon, May 30, 2011 at 10:01 AM, James Seigel  wrote:
> Not sure that will help ;)
>
> Sent from my mobile. Please excuse the typos.
>
> On 2011-05-30, at 9:23 AM, Boris Aleksandrovsky  wrote:
>
>> Ljddfjfjfififfifjftjiifjfjjjffkxbznzsjxodiewisshsudddudsjidhddueiweefiuftttoitfiirriifoiffkllddiririiriioerorooiieirrioeekroooeoooirjjfdijdkkduddjudiiehs
>> On May 30, 2011 5:28 AM, "Gyuribácsi"  wrote:
>>>
>>>
>>> Hi,
>>>
>>> I have a 10 node cluster (IBM blade servers, 48GB RAM, 2x500GB Disk, 16 HT
>>> cores).
>>>
>>> I've uploaded 10 files to HDFS. Each file is 10GB. I used the streaming jar
>>> with 'wc -l' as mapper and 'cat' as reducer.
>>>
>>> I use 64MB block size and the default replication (3).
>>>
>>> The wc on the 100 GB took about 220 seconds, which translates to about 3.5
>>> Gbit/sec processing speed. One disk can do a sequential read at 1 Gbit/sec, so
>>> I would expect something around 20 Gbit/sec (minus some overhead), and I'm
>>> getting only 3.5.
>>>
>>> Is my expectation valid?
>>>
>>> I checked the jobtracker and it seems all nodes are working, each reading
>>> the right blocks. I have not played with the number of mappers and reducers
>>> yet. It seems the number of mappers is the same as the number of blocks, and
>>> the number of reducers is 20 (there are 20 disks). This looks OK to me.
>>>
>>> We also did an experiment with TestDFSIO, with similar results. Aggregate
>>> read I/O speed is around 3.5 Gbit/sec. It is just too far from my
>>> expectation :(
>>>
>>> Please help!
>>>
>>> Thank you,
>>> Gyorgy
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Poor-IO-performance-on-a-10-node-cluster.-tp31732971p31732971.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>