Multiple Input
Can anyone help me with this? http://stackoverflow.com/questions/26341913/hadoop-multipleinputs
Using a different security mechanism instead of Kerberos/Simple
Hi, while going through the code we learned that, apart from simple and Kerberos, there seems to be no way to plug in our own authentication mechanism. Are there any plans to accommodate such a change any time soon? Also, if we want to abstract out the security component, is there any way other than replacing the UserGroupInformation logic?
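For context, the mechanism is selected via the hadoop.security.authentication property in core-site.xml, and the stock code paths only understand two values; there is no documented hook for a third. A sketch of the relevant fragment (illustrative only, not an extension point):

```xml
<!-- core-site.xml: the only supported values are "simple" and "kerberos";
     plugging in another mechanism currently means changing the
     UserGroupInformation code paths that interpret this setting. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
```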
Re: read from a hdfs file on the same host as client
Demai, you are right. HDFS's default BlockPlacementPolicyDefault makes sure one replica of your block is available on the writer's datanode. Replica selection for the read operation is also aimed at minimizing bandwidth/latency, so the block will be served from the reader's local node. If you want to optimize this further, you can set 'dfs.client.read.shortcircuit' to true. This allows the client to bypass the datanode and read the block file directly.

On Mon, Oct 13, 2014 at 11:58 AM, Demai Ni nid...@gmail.com wrote:
hi, folks, a very simple question; looking forward to a couple of pointers. Let's say I have an HDFS file, testfile, which has only one block (256MB), and that block has a replica on the datanode host1.hdfs.com (the whole HDFS cluster may have 100 nodes, and the other two replicas are on other datanodes). If, on host1.hdfs.com, I run hadoop fs -cat testfile or use a Java client to read the file, can I assume there won't be any significant data movement through the network? That is, is the namenode smart enough to give me the data on host1.hdfs.com directly? thanks, Demai

--
Thanks
Shivram
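The local-read preference described above can be pictured with a small, self-contained simulation. This is a hypothetical simplification, not the real BlockPlacementPolicyDefault/NetworkTopology code: it only shows that when a block's replica locations are sorted by "distance" from the reader, a replica on the reader's own host ends up first, so the read never leaves the machine.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model of HDFS read-replica selection: the namenode returns the
// replica locations sorted so that a replica on the reader's host (if
// any) comes first. Hostnames here are made-up example values.
public class LocalReadSketch {
    static List<String> sortByDistance(List<String> replicaHosts, String readerHost) {
        List<String> sorted = new ArrayList<>(replicaHosts);
        // Local replica sorts first; the sort is stable, so the other
        // replicas keep their original relative order.
        sorted.sort(Comparator.comparingInt(h -> h.equals(readerHost) ? 0 : 1));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> replicas =
            Arrays.asList("host7.hdfs.com", "host1.hdfs.com", "host42.hdfs.com");
        // A client running on host1.hdfs.com reads the block:
        // its own copy is chosen first, so no network transfer is needed.
        System.out.println(sortByDistance(replicas, "host1.hdfs.com").get(0));
    }
}
```

When the reader's host holds no replica, the order is simply unchanged and the client streams from a remote datanode as usual.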
Re: HDFS openforwrite CORRUPT - HEALTHY
Hi Ulul, thanks for trying. I will try the dev list to see if they can help me with this. Thanks, Vinayak

On 10/11/14, 5:33 AM, Ulul wrote:
Hi Vinayak, sorry, this is beyond my understanding. I would need to test further to try to understand the problem. Hope you'll find help from someone else. Ulul

On 08/10/2014 07:18, Vinayak Borkar wrote:
Hi Ulul, I think I can explain why the sizes differ and the block names vary. There was no client interaction. My client writes data, calls hsync, and then writes more data to the same file. My understanding is that under these circumstances the file size is not reflected accurately in HDFS until the file is actually closed, so the namenode's view of the file size will be lower than the actual size of the data in the block. If you look closely, you will see that the block number is the same for the two blocks; the part that differs is the trailing generation stamp, which is consistent with HDFS's behavior when hsyncing the output stream and then continuing to write. It looks like the namenode is informed much later about the last block state that the datanode actually wrote. My client was not started when the machine came back up, so all changes seen in the fsck output were owing to HDFS. Vinayak

On 10/7/14, 2:37 PM, Ulul wrote:
Hi Vinayak, I find it strange that the file should have a different size and the block a different name. Are you sure your writing client wasn't interfering? Ulul

On 07/10/2014 19:41, Vinayak Borkar wrote:
Trying again, since I did not get a reply. Please let me know if I should use a different forum to ask this question. Thanks, Vinayak

On 10/4/14, 8:45 PM, Vinayak Borkar wrote:
Hi, I was experimenting with HDFS to push its boundaries on fault tolerance. Here is what I observed. I am using HDFS from Hadoop 2.2. I started the NameNode and then a single DataNode. I started writing to a DFS file from a Java client, periodically calling hsync().
After some time, I powered off the machine that was running this test (not a shutdown, just an abrupt power-off). When the system came back up, the HDFS processes were running, and HDFS was out of safe mode, I ran fsck on the DFS filesystem (with the -openforwrite -files -blocks options), and here is the output:

/test/test.log 388970 bytes, 1 block(s), OPENFORWRITE: MISSING 1 blocks of total size 388970 B
0. BP-1471648347-10.211.55.100-1412458980748:blk_1073743243_2420{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[[DISK]DS-e5bed5ae-1fa9-45ed-8d4c-8006919b4d9c:NORMAL|RWR]]} len=388970 MISSING!

Status: CORRUPT
 Total size: 7214119 B
 Total dirs: 54
 Total files: 232
 Total symlinks: 0
 Total blocks (validated): 214 (avg. block size 33710 B)
 CORRUPT FILES: 1
 MISSING BLOCKS: 1
 MISSING SIZE: 388970 B
 Minimally replicated blocks: 213 (99.53271 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 213 (99.53271 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 0.9953271
 Corrupt blocks: 0
 Missing replicas: 426 (66.35514 %)
 Number of data-nodes: 1
 Number of racks: 1
FSCK ended at Sat Oct 04 23:09:40 EDT 2014 in 47 milliseconds

I let the system sit for a while and reran fsck (after about 15-20 minutes), and surprisingly the output was very different. The corruption was magically gone:

/test/test.log 1859584 bytes, 1 block(s): Under replicated BP-1471648347-10.211.55.100-1412458980748:blk_1073743243_2421. Target Replicas is 3 but found 1 replica(s).
0. BP-1471648347-10.211.55.100-1412458980748:blk_1073743243_2421 len=1859584 repl=1

Status: HEALTHY
 Total size: 8684733 B
 Total dirs: 54
 Total files: 232
 Total symlinks: 0
 Total blocks (validated): 214 (avg. block size 40582 B)
 Minimally replicated blocks: 214 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 214 (100.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 1.0
 Corrupt blocks: 0
 Missing replicas: 428 (66.64 %)
 Number of data-nodes: 1
 Number of racks: 1
FSCK ended at Sat Oct 04 23:24:23 EDT 2014 in 63 milliseconds

The filesystem under path '/' is HEALTHY

So my question is this: what just happened? How did the NameNode recover that missing block, and why did it take 15 minutes or so? Is there some kind of lease on the file (because it was open for write) that expired after 15-20 minutes? Can someone with knowledge of HDFS internals shed some light on what could be going on, or point me to the sections of the code that could answer my questions? Also, is there a way to speed this process up, say by triggering the expiration of the lease (assuming it is a lease)? Thanks,
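On the last question, lease recovery can be requested explicitly rather than waiting for the lease limits to expire. A hedged sketch, not a definitive recipe: DistributedFileSystem exposes recoverLease(Path) in the 2.x client API (newer releases also ship an `hdfs debug recoverLease` CLI command, which is not in 2.2). The path below is the file from the fsck output, and the sketch assumes fs.defaultFS in the loaded configuration points at the cluster; it needs a running cluster, so no standalone test is given.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: ask the NameNode to start lease recovery on a file left
// open-for-write, instead of waiting for the lease hard limit.
public class RecoverLeaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml/hdfs-site.xml
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        // Returns true once the file is closed and the lease released;
        // may need to be polled, since recovery is asynchronous.
        boolean recovered = dfs.recoverLease(new Path("/test/test.log"));
        System.out.println("lease recovered: " + recovered);
    }
}
```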
Re: read from a hdfs file on the same host as client
Shivram, many thanks for confirming the behavior. I will also turn on short-circuit reads as you suggested. Appreciate the help. Demai
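For reference, a sketch of the hdfs-site.xml fragment involved in turning this on. On Hadoop 2.x the short-circuit implementation also requires a UNIX domain socket path shared between the client and the datanode; the path below is just an example value.

```xml
<!-- hdfs-site.xml: illustrative sketch of enabling short-circuit local
     reads. The socket path is an example; it must be creatable by the
     datanode user and readable by clients on the same host. -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```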
Redirect Writes/Reads to a particular Node
Hi, is it possible to redirect writes to one particular node, i.e. always store the primary replica on the same node, and have reads served from this primary node? If the primary node goes down, hadoop replication works as per its policy; but when this node comes back up it should again become the primary node. I don't see any config parameter available in core-site.xml or hdfs-site.xml to serve this purpose. Is there any way I can do this? Regards, Dhiraj

PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies).
RE: Redirect Writes/Reads to a particular Node
Hi Dhiraj, AFAIK there is a mechanism to pass a set of "favored nodes" (datanodes) that should be preferred as targets while creating a file. But this is only treated as a hint: sometimes, due to cluster state (for example, a given datanode doesn't have sufficient space, or isn't available), the namenode may not be able to place the blocks on these datanodes, and then hadoop replication works as per its policy. Please go through the API below, which can be used to achieve this behavior; I hope this helps you get some idea.

DistributedFileSystem.java:

/**
 * Same as {@link #create(Path, FsPermission, boolean, int, short, long,
 * Progressable)} with the addition of favoredNodes that is a hint to
 * where the namenode should place the file blocks.
 * The favored nodes hint is not persisted in HDFS. Hence it may be honored
 * at the creation time only. HDFS could move the blocks during balancing or
 * replication, to move the blocks from favored nodes. A value of null means
 * no favored nodes for this create.
 */
public HdfsDataOutputStream create(final Path f,
    final FsPermission permission, final boolean overwrite,
    final int bufferSize, final short replication, final long blockSize,
    final Progressable progress, final InetSocketAddress[] favoredNodes)

Regards, Rakesh

From: Dhiraj Kamble [mailto:dhiraj.kam...@sandisk.com]
Sent: 14 October 2014 10:38
To: user@hadoop.apache.org
Subject: Redirect Writes/Reads to a particular Node
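A minimal usage sketch of that create overload, assuming a Hadoop 2.x client. The hostname, port, and file path are made-up example values (50010 is the default datanode transfer port), and since the hint is not persisted, the "primary" node must be passed again on every create; this needs a running cluster, so it is a sketch rather than a tested program.

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;

public class FavoredNodeWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        // Hypothetical "primary" datanode to favor for the first replica.
        InetSocketAddress[] favored =
            { new InetSocketAddress("primary.example.com", 50010) };
        try (HdfsDataOutputStream out = dfs.create(new Path("/data/pinned.log"),
                FsPermission.getFileDefault(), true /* overwrite */,
                4096 /* bufferSize */, (short) 3 /* replication */,
                128 * 1024 * 1024L /* blockSize */, null /* progress */, favored)) {
            out.writeBytes("hello");
        }
    }
}
```

Note that this only influences initial placement; it does not make reads stick to the node, and the balancer may later move blocks away, so it is at best an approximation of the "pinned primary" behavior you describe.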