Hadoop writes data to the local filesystem; when the block size is reached, it is written into HDFS. Think of HDFS as a block management system rather than a file system, even though the end result is a series of blocks that constitute a file. You will not see the data in HDFS until the file is closed - that is the reality of the implementation. The only way to 'flush' is what you have already discovered - closing the file. Flushing data often in HDFS can be a very expensive operation when you consider that it will affect multiple nodes distributed over a network; I suspect that is why it isn't there. I believe there is a JIRA somewhere to have 'sync' force the data out to disk, but I do not know the number or what its status is.
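For what it's worth, here is a minimal sketch of what that 'sync' call might look like, assuming a Hadoop version whose FSDataOutputStream actually exposes sync(); the host, the path, and whether anything becomes visible before close() are assumptions, not something established in this thread:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SyncSketch {
    public static void main(String[] args) throws IOException {
      Configuration config = new Configuration();
      // example namenode URI, as in the thread below
      config.set("fs.default.name", "hdfs://foo.bar.com:9000/");
      FileSystem fs = FileSystem.get(config);
      FSDataOutputStream out = fs.create(new Path("/tmp/stream.log"), false);
      out.write("some event\n".getBytes());
      // push buffered packets toward the datanodes; whether readers can see the
      // data before close() depends on the version and the JIRA mentioned above
      out.sync();
      out.close();  // closing remains the only guaranteed way to make the data visible
    }
  }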
Assuming you are collecting data as an unending process, you might consider closing the HDFS output at periodic intervals and/or collecting data locally (with your intervening flushes) and then moving it into HDFS so it can get processed by map/reduce. It is a prudent approach to minimize potential data loss if the HDFS connection gets broken. Every implementation is different, so you gotta be creative. :o) (See the sketch after the quoted thread below for one way to do the periodic rollover.)

Bill

-----Original Message-----
From: Sasha Dolgy [mailto:sdo...@gmail.com]
Sent: Monday, May 18, 2009 9:50 AM
To: core-user@hadoop.apache.org
Subject: Re: proper method for writing files to hdfs

Ok, on the same page with that. Going back to the original question.

In our scenario we are trying to stream data into HDFS, and despite the posts and hints I've been reading, it's still tough to crack this nut. This is why I thought (and thankfully I wasn't right) that we were going about this the wrong way: we open up a new file, get the FSDataOutputStream, and start to write data and flush as concurrent information comes in:

2009-05-17 06:16:50,921 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2451 to 2048 meta file offset to 23
2009-05-17 06:16:50,921 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 adding seqno 3 to ack queue.
2009-05-17 06:16:50,921 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_5834867413110307425_1064 acking for packet 3
2009-05-17 06:16:51,111 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_5834867413110307425_1064 of length 735 seqno 4 offsetInBlock 2048 lastPacketInBlock false
2009-05-17 06:16:51,111 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2518 to 2048 meta file offset to 23
2009-05-17 06:16:51,111 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 adding seqno 4 to ack queue.
2009-05-17 06:16:51,112 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_5834867413110307425_1064 acking for packet 4
2009-05-17 06:16:51,297 DEBUG org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving one packet for block blk_5834867413110307425_1064 of length 509 seqno 5 offsetInBlock 2560 lastPacketInBlock false
2009-05-17 06:16:51,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Changing block file offset of block blk_5834867413110307425_1064 from 2771 to 2560 meta file offset to 27

The file gets bigger and bigger, but it is not committed to HDFS until we close() the stream. We've waited for the block size to go above 64k and even higher, and it never writes itself out to HDFS. I've seen the JIRA bug reports, etc. Has no one done this? Is it bad to stream data into it? How do I force it to flush the data to disk?

The POC is with environmental data collected every moment from multiple sources for monitoring temperature in computers / facilities. I suppose I'm just a little frustrated. I see that Hadoop is brilliant for large sets of data that you already have or are happy to move onto HDFS ...

-sd

> On Mon, May 18, 2009 at 2:09 PM, Bill Habermaas <b...@habermaas.us> wrote:
>> Sasha,
>>
>> Connecting to the namenode is the proper way to establish the hdfs
>> connection. Afterwards the Hadoop client handler that is called by your
>> code will go directly to the datanodes.
>> There is no reason for you to communicate directly with a datanode, nor
>> is there a way for you to even know where the data nodes are located.
>> That is all done by the Hadoop client code, silently under the covers,
>> by Hadoop itself.
>>
>> Bill
>>
>> -----Original Message-----
>> From: sdo...@gmail.com [mailto:sdo...@gmail.com] On Behalf Of Sasha Dolgy
>> Sent: Sunday, May 17, 2009 10:55 AM
>> To: core-user@hadoop.apache.org
>> Subject: proper method for writing files to hdfs
>>
>> The following graphic outlines the architecture for HDFS:
>> http://hadoop.apache.org/core/docs/current/images/hdfsarchitecture.gif
>>
>> If one is to write a client that adds data into HDFS, it needs to add it
>> through the Data Node. Now, from the graphic I am to understand that the
>> client doesn't communicate with the NameNode, only with the Data Node.
>>
>> In the examples I've seen and the playing I am doing, I am connecting to
>> the hdfs url as a configuration parameter before I create a file. Is this
>> the incorrect way to create files in HDFS?
>>
>> Configuration config = new Configuration();
>> config.set("fs.default.name", "hdfs://foo.bar.com:9000/");
>> String path = "/tmp/i/am/a/path/to/a/file.name";
>> Path hdfsPath = new Path(path);
>> FileSystem fileSystem = FileSystem.get(config);
>> FSDataOutputStream os = fileSystem.create(hdfsPath, false);
>> os.write("something".getBytes());
>> os.close();
>>
>> Should the client be connecting to a data node to create the file as
>> indicated in the graphic above?
>>
>> If connecting to a data node is possible and suggested, where can I find
>> more details about this process?
>>
>> Thanks in advance,
>> -sasha
>>
>> --
>> Sasha Dolgy
>> sasha.do...@gmail.com
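As a follow-up to the suggestion above about closing the output at periodic intervals, here is a rough sketch of one way to roll the file. The class name, path layout, and rollIntervalMs parameter are illustrative, not anything taken from the thread:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Rolls the HDFS output at a fixed interval so each closed file becomes
  // visible (and processable by map/reduce) while new data keeps arriving.
  public class RollingHdfsWriter {
    private final FileSystem fs;
    private final long rollIntervalMs;   // illustrative knob
    private FSDataOutputStream current;
    private long openedAt;

    public RollingHdfsWriter(Configuration conf, long rollIntervalMs) throws IOException {
      this.fs = FileSystem.get(conf);
      this.rollIntervalMs = rollIntervalMs;
      roll();
    }

    private void roll() throws IOException {
      if (current != null) {
        current.close();   // closing makes the finished file visible in HDFS
      }
      // hypothetical path layout: one part file per roll
      Path p = new Path("/data/stream/part-" + System.currentTimeMillis());
      current = fs.create(p, false);
      openedAt = System.currentTimeMillis();
    }

    public void write(byte[] record) throws IOException {
      if (System.currentTimeMillis() - openedAt > rollIntervalMs) {
        roll();
      }
      current.write(record);
    }

    public void close() throws IOException {
      if (current != null) {
        current.close();
      }
    }
  }

The other variant mentioned above, collecting locally with your own flushes and then moving the finished file in, would end with something like fs.copyFromLocalFile(new Path("/local/staging/file"), new Path("/data/stream/file")), with both paths again just placeholders.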