Read and Write throughputs via JVM

2011-04-12 Thread Matthew John
Hi all, I wanted to figure out the Read and Write throughputs that happen in a Map task (Read - reading from the input splits, Write - writing the map output back) inside a JVM. Do we have any counters that can help me with this? Or where exactly should I focus on tweaking the code to add some ad
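
A minimal sketch of one way to pull these numbers out after a job completes, assuming the 0.20-era mapred API; the counter group and counter names below ("FileSystemCounters", "HDFS_BYTES_READ", "MAP_OUTPUT_BYTES") are as I recall them from that era and may differ in other versions:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class MapIoCounters {
        // Runs the job, then prints bytes read from splits and bytes of map
        // output. Dividing by task runtimes gives a rough throughput figure.
        public static void printIoCounters(JobConf conf) throws Exception {
            RunningJob job = JobClient.runJob(conf);
            Counters counters = job.getCounters();
            long splitBytesRead = counters.findCounter(
                "FileSystemCounters", "HDFS_BYTES_READ").getCounter();
            long mapOutputBytes = counters.findCounter(
                "org.apache.hadoop.mapred.Task$Counter", "MAP_OUTPUT_BYTES").getCounter();
            System.out.println("bytes read from splits: " + splitBytesRead);
            System.out.println("map output bytes:       " + mapOutputBytes);
        }
    }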

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Benson is actually a pretty sophisticated guy who knows a lot about mmap. I engaged with him yesterday on this since I know him from Apache. On Tue, Apr 12, 2011 at 7:16 PM, M. C. Srivas wrote: > I am not sure if you realize, but HDFS is not VM integrated.

Re: Memory mapped resources

2011-04-12 Thread M. C. Srivas
I am not sure if you realize, but HDFS is not VM integrated. What you are asking for is support *inside* the Linux kernel for HDFS file systems. I don't see that happening for the next few years, and probably never at all. (HDFS is all Java today, and Java certainly is not going to go inside the

namenode format error

2011-04-12 Thread Jeffrey Wang
Hey all, I'm trying to format my NameNode (I've done it successfully in the past), but I'm getting a strange error: 11/04/12 16:47:32 INFO common.Storage: java.io.IOException: Input/output error at sun.nio.ch.FileChannelImpl.lock0(Native Method) at sun.nio.ch.FileChannelImpl.tryL

Does changing the block size of MiniDFSCluster work?

2011-04-12 Thread Jason Rutherglen
I'm using the 0.20.3 append branch and am wondering why the following fails: setting the block size, either in the Configuration or via the DFSClient.create method, causes a failure later on when writing a file out. Configuration conf = new Configuration(); long blockSize = (long)32 * 1024 * 1024
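
For context, a minimal sketch of the pattern under discussion, assuming the 0.20-era MiniDFSCluster and FileSystem APIs (an illustration, not the poster's code); one known trap with explicit block sizes is that they must be a multiple of io.bytes.per.checksum (512 by default):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    public class BlockSizeTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            long blockSize = 32L * 1024 * 1024;        // 32 MB, a multiple of 512
            conf.setLong("dfs.block.size", blockSize); // cluster-wide default
            MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
            try {
                FileSystem fs = cluster.getFileSystem();
                // Per-file override: create(path, overwrite, bufferSize, replication, blockSize)
                FSDataOutputStream out =
                    fs.create(new Path("/test"), true, 4096, (short) 1, blockSize);
                out.write(new byte[1024]);
                out.close();
            } finally {
                cluster.shutdown();
            }
        }
    }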

Re: cluster restart error

2011-04-12 Thread bikash sharma
P.S. Also, I am unable to connect while doing hadoop/bin/hadoop fs -ls, with the following error: inti76.cse.psu.edu: starting tasktracker, logging to /i3c/hpcl/bus145/cse598g/hadoop/bin/../logs/hadoop-bus145-tasktracker-inti76.cse.psu.edu.out inti79.cse.psu.edu 36% hadoop/bin/hadoop fs -ls 11/04/12

cluster restart error

2011-04-12 Thread bikash sharma
Hi, I changed some config parameters in the core-site/mapred.xml files and then stopped the dfs and mapred services. While restarting them, I am unable to do so; looking at the logs, the following error occurs: 2011-04-12 17:27:39,343 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapt

Re: Memory mapped resources

2011-04-12 Thread Luke Lu
You can use distributed cache for memory mapped files (they're local to the node the tasks run on.) http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies wrote: > Here's the OP again. > > I want to make it clear that my question here h
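
A minimal sketch of that approach, assuming the 0.20-era org.apache.hadoop.filecache API; the path /data/resource.bin is a hypothetical HDFS file standing in for the real resource:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheResource {
        // In the driver: ship an HDFS file to every node that runs a task.
        public static void register(JobConf conf) throws Exception {
            DistributedCache.addCacheFile(new URI("/data/resource.bin"), conf);
        }

        // In the task (e.g. Mapper.configure): the file is now on local disk,
        // so it can be opened or memory mapped like any ordinary local file.
        public static Path localCopy(JobConf conf) throws Exception {
            Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
            return localFiles[0];
        }
    }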

io.sort.mb based on HDFS block size

2011-04-12 Thread Shrinivas Joshi
Looking at workloads like TeraSort, where intermediate map output is proportional to HDFS block size, I was wondering whether it would be beneficial to have a mechanism for setting buffer spaces like io.sort.mb to be a certain factor of the HDFS block size. I am sure there are other config parameters th
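
A minimal sketch of the idea in driver code, assuming the client can see the cluster's dfs.block.size; the 1.5x factor is purely illustrative (and a larger io.sort.mb also needs a correspondingly larger task heap):

    import org.apache.hadoop.mapred.JobConf;

    public class SortBufferSizing {
        // Derive the map-side sort buffer from the HDFS block size instead
        // of hard-coding it.
        public static void tuneSortBuffer(JobConf conf) {
            long blockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
            int sortMb = (int) (blockSize * 3 / 2 / (1024 * 1024)); // 1.5 x block size, in MB
            conf.setInt("io.sort.mb", sortMb);
        }
    }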

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Actually, it doesn't become trivial. It just becomes total fail or total win instead of almost always being partial win. It doesn't meet Benson's need. On Tue, Apr 12, 2011 at 11:09 AM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > To get around the chunks or blocks problem, I've been

Re: Memory mapped resources

2011-04-12 Thread Jason Rutherglen
To get around the chunks or blocks problem, I've been implementing a system that simply sets a max block size that is too large for a file to reach. In this way there will only be one block per HDFS file, and so MMap'ing or other single file ops become trivial. On Tue, Apr 12, 2011 at 10:40 AM, B
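
A minimal sketch of that trick, assuming the FileSystem.create overload that takes an explicit block size; the 1 GB cap is an arbitrary illustration and just has to exceed the largest file you will ever write:

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OneBlockFile {
        // Create the file with an oversized block size so the whole file
        // lands in a single HDFS block on each replica.
        public static void write(FileSystem fs, Path path, byte[] data) throws Exception {
            long maxBlockSize = 1024L * 1024 * 1024; // 1 GB; must exceed the file size
            FSDataOutputStream out = fs.create(path, true, 4096, (short) 3, maxBlockSize);
            out.write(data);
            out.close();
        }
    }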

Re: Memory mapped resources

2011-04-12 Thread Benson Margulies
Here's the OP again. I want to make it clear that my question here has to do with the problem of distributing 'the program' around the cluster, not 'the data'. In the case at hand, the issue is a system that has a large data resource that it needs to do its work. Every instance of the code needs the

Re: HOD exception: java.io.IOException: No valid local directories in property: mapred.local.dir

2011-04-12 Thread Harsh J
Hello Sridhar, The mapred.local.dir values are considered local directories. On Tue, Apr 12, 2011 at 9:25 PM, sridhar basam wrote: > On Tue, Apr 12, 2011 at 11:49 AM, Steve Loughran wrote: > >> The job tracker can't find any of the local filesystem directories listed >> in the mapred.local.d

Re: HOD exception: java.io.IOException: No valid local directories in property: mapred.local.dir

2011-04-12 Thread sridhar basam
On Tue, Apr 12, 2011 at 11:49 AM, Steve Loughran wrote: > The job tracker can't find any of the local filesystem directories listed > in the mapred.local.dir property, either the conf file or the machine is > misconfigured > Is the mapred.local.dir a local directory or is it relative to the hdfs

Re: Reg HDFS checksum

2011-04-12 Thread Steve Loughran
On 12/04/2011 07:06, Josh Patterson wrote: If you take a look at: https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java you'll see a single-process version of what HDFS does under the hood, albeit in a highly distributed fashi

Re: HOD exception: java.io.IOException: No valid local directories in property: mapred.local.dir

2011-04-12 Thread Steve Loughran
On 11/04/2011 16:48, Boyu Zhang wrote: Exception in thread "main" org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid local directories in property: mapred.local.dir The job tracker can't find any of the local filesystem directories listed in the mapred.local.dir property, eit

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Blocks live where they land when first created. They can be moved due to node failure or rebalancing, but it is typically pretty expensive to do this. It certainly is slower than just reading the file. If you really, really want mmap to work, then you need to set up some native code that builds

Re: Setting a larger block size at runtime in the DFSClient

2011-04-12 Thread Jason Rutherglen
Harsh, thanks, and sounds good! On Tue, Apr 12, 2011 at 7:08 AM, Harsh J wrote: > Hey Jason, > > On Tue, Apr 12, 2011 at 7:06 PM, Jason Rutherglen > wrote: >> Are there performance implications to setting the block size to 1 GB >> or higher (via the DFSClient.create method)? > > You'll be stream

Re: Memory mapped resources

2011-04-12 Thread Jason Rutherglen
> The others you will have to read more conventionally True. I think there are emergent use cases that demand data locality, e.g., an optimized HBase system, search, and MMap'ing. > If all blocks are guaranteed local, this would work. I don't think that > guarantee is possible > on a non-trivia

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Well, no. You could mmap all the blocks that are local to the node your program is on. The others you will have to read more conventionally. If all blocks are guaranteed local, this would work. I don't think that guarantee is possible on a non-trivial cluster. On Tue, Apr 12, 2011 at 6:32 AM,

Re: Memory mapped resources

2011-04-12 Thread Ted Dunning
Kevin, You present a good discussion of architectural alternatives here. But my comment really had more to do with whether a particular HDFS patch would provide what the original poster seemed to be asking about. This is especially pertinent since the patch was intended to scratch a different it

Re: "Retrying connect" error while configuring hadoop

2011-04-12 Thread Sonal Goyal
Are your datanodes and namenode machines able to see each other - ping etc.? Is the /etc/hosts configured correctly? Is the namenode process (seen through jps on master) up? Thanks and Regards, Sonal Hadoop ETL and Data Integration

Re: Setting a larger block size at runtime in the DFSClient

2011-04-12 Thread Harsh J
Hey Jason, On Tue, Apr 12, 2011 at 7:06 PM, Jason Rutherglen wrote: > Are there performance implications to setting the block size to 1 GB > or higher (via the DFSClient.create method)? You'll be streaming 1 complete GB per block to a DN with that value (before the next block gets scheduled on a

Re: Memory mapped resources

2011-04-12 Thread Michael Flester
> We have some very large files that we access via memory mapping in > Java. Someone's asked us about how to make this conveniently > deployable in Hadoop. If we tell them to put the files into hdfs, can > we obtain a File for the underlying file on any given node? We sometimes find it convenient

Re: Reg HDFS checksum

2011-04-12 Thread Josh Patterson
If you take a look at: https://github.com/jpatanooga/IvoryMonkey/blob/master/src/tv/floe/IvoryMonkey/hadoop/fs/ExternalHDFSChecksumGenerator.java you'll see a single-process version of what HDFS does under the hood, albeit in a highly distributed fashion. What's going on here is that for every 512
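
A single-process sketch of just the chunking step described here, one CRC32 per io.bytes.per.checksum (512) bytes; the full HDFS file checksum additionally MD5s the per-block CRC streams and then MD5s those digests:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.zip.CRC32;

    public class ChunkedCrc {
        public static void crcPer512Bytes(String path) throws Exception {
            InputStream in = new FileInputStream(path);
            byte[] chunk = new byte[512];
            int n, chunkNo = 0;
            while ((n = in.read(chunk)) > 0) {
                CRC32 crc = new CRC32();
                crc.update(chunk, 0, n); // last chunk may be shorter than 512
                System.out.println("chunk " + (chunkNo++) + ": crc32=" + crc.getValue());
            }
            in.close();
        }
    }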

Setting a larger block size at runtime in the DFSClient

2011-04-12 Thread Jason Rutherglen
Are there performance implications to setting the block size to 1 GB or higher (via the DFSClient.create method)?

Re: hadoop branch-0.20-append Build error:build.xml:933: exec returned: 1

2011-04-12 Thread Marcos Ortiz
On 4/11/2011 10:45 PM, Alex Luya wrote: BUILD FAILED .../branch-0.20-append/build.xml:927: The following error occurred while executing this line: ../branch-0.20-append/build.xml:933: exec returned: 1 Total time: 1 minute 17 seconds + RESULT=1 + '[' 1 '!=' 0 ']' + echo 'Build Failed

Re: Memory mapped resources

2011-04-12 Thread Jason Rutherglen
Then one could MMap the blocks pertaining to the HDFS file and piece them together. Lucene's MMapDirectory implementation does just this to avoid an obscure JVM bug. On Mon, Apr 11, 2011 at 9:09 PM, Ted Dunning wrote: > Yes.  But only one such block. That is what I meant by chunk. > That is fine
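
A minimal sketch of chunked mapping in the same spirit as Lucene's MMapDirectory, shown here for a plain local file (each java.nio mapping is limited to Integer.MAX_VALUE bytes, hence the pieces):

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ChunkedMmap {
        // Map a large file as a series of read-only buffers; callers index
        // into the right buffer by dividing an offset by chunkSize.
        public static MappedByteBuffer[] mapInChunks(String path, long chunkSize) throws Exception {
            RandomAccessFile raf = new RandomAccessFile(path, "r");
            FileChannel channel = raf.getChannel();
            long size = channel.size();
            int numChunks = (int) ((size + chunkSize - 1) / chunkSize);
            MappedByteBuffer[] buffers = new MappedByteBuffer[numChunks];
            for (int i = 0; i < numChunks; i++) {
                long offset = i * chunkSize;
                long length = Math.min(chunkSize, size - offset);
                buffers[i] = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
            }
            return buffers;
        }
    }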

RE: Memory mapped resources

2011-04-12 Thread Kevin.Leach
This is the age-old argument of what to share in a partitioned environment. IBM and Teradata have always used "shared nothing", which is what only getting one chunk of the file in each Hadoop node is doing. Oracle has always used "shared disk", which is not an easy thing to do, especially in scale, a

Re: Reg HDFS checksum

2011-04-12 Thread Thamizh
Thanks a lot, Josh. I have been given a .gz file and been told that it has been downloaded from HDFS. When I tried to check the integrity of that file using "gzip -t", it ended up with "invalid compressed data--format violated", and even "gzip -d" gave the same result. I am a bit worried
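
For what it's worth, a minimal Java equivalent of the "gzip -t" test: stream the file through GZIPInputStream and treat any IOException (bad header, corrupt data, failed trailing CRC) as corruption:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.GZIPInputStream;

    public class GzipCheck {
        public static boolean isValidGzip(String path) {
            byte[] buf = new byte[8192];
            try {
                GZIPInputStream in = new GZIPInputStream(new FileInputStream(path));
                while (in.read(buf) != -1) {
                    // discard the output; we only care that it decodes cleanly
                }
                in.close();
                return true;
            } catch (IOException e) {
                return false; // truncated or corrupt stream
            }
        }
    }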

Re: "Retrying connect to server" error while configuring hadoop

2011-04-12 Thread praveen.peddi
Did you check if all ports are open between all nodes across the cluster? Ports need to be open not just between master and slaves but also between slaves for data nodes to talk to each other. Praveen On Apr 12, 2011, at 1:57 AM, "ext prasunb" wrote: > > Hello, > > I am trying to configure