Re: issue about distcp Source and target differ in block-size. Use -pb to preserve block-sizes during copy.

2014-07-25 Thread Stanley Shi
Your client side was running at 14/07/24 18:35:58 (INFO mapreduce.Job:
T***), but the NN log you are pasting is from 2014-07-24 17:39:34,255; those timestamps do not match.

By the way, which version of HDFS are you using?
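
For reference, acting on the hint in the subject line would mean rerunning
the copy with block sizes preserved, something like:

  hadoop distcp -pb hdfs://sourceNN:8020/src/path hdfs://targetNN:8020/dst/path

(The NameNode addresses and paths above are placeholders, not taken from
this thread.)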

Regards,
Stanley Shi



On Fri, Jul 25, 2014 at 10:36 AM, ch huang justlo...@gmail.com wrote:

 2014-07-24 17:33:04,783 WARN
 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
 as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException:
 Operation category READ is not supported in state standby
 2014-07-24 17:33:05,742 WARN
 org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
 as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.ipc.StandbyException:
 Operation category READ is not supported in state standby
  2014-07-24 17:33:33,179 INFO
 org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Triggering log
 roll on remote NameNode hz24/192.168.10.24:8020
 2014-07-24 17:33:33,442 INFO
 org.apache.hadoop.hdfs.server.namenode.FSImage: Reading
 org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@67698344
 expecting start txid #62525
 2014-07-24 17:33:33,442 INFO
 org.apache.hadoop.hdfs.server.namenode.FSImage: Start loading edits file
 http://hz24:8480/getJournal?jid=develop&segmentTxId=62525&storageInfo=-55%3A466484546%3A0%3ACID-a140fb1a-ac10-4053-8b91-8f19f2809b7c,

 http://hz23:8480/getJournal?jid=develop&segmentTxId=62525&storageInfo=-55%3A466484546%3A0%3ACID-a140fb1a-ac10-4053-8b91-8f19f2809b7c
 2014-07-24 17:33:33,442 INFO
 org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding
 stream '
 http://hz24:8480/getJournal?jid=develop&segmentTxId=62525&storageInfo=-55%3A466484546%3A0%3ACID-a140fb1a-ac10-4053-8b91-8f19f2809b7c,

 http://hz23:8480/getJournal?jid=develop&segmentTxId=62525&storageInfo=-55%3A466484546%3A0%3ACID-a140fb1a-ac10-4053-8b91-8f19f2809b7c'
 to transaction ID 62525
 2014-07-24 17:33:33,442 INFO
 org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: Fast-forwarding
 stream '
 http://hz24:8480/getJournal?jid=develop&segmentTxId=62525&storageInfo=-55%3A466484546%3A0%3ACID-a140fb1a-ac10-4053-8b91-8f19f2809b7c'
 to transaction ID 62525
 2014-07-24 17:33:33,480 INFO BlockStateChange: BLOCK* addToInvalidates:
 blk_1073753268_12641 192.168.10.51:50010 192.168.10.49:50010
 192.168.10.50:50010
 2014-07-24 17:33:33,482 INFO BlockStateChange: BLOCK* addStoredBlock:
 blockMap updated: 192.168.10.50:50010 is added to 
 blk_1073753337_12710{blockUCState=UNDER_CONSTRUCTION,
 primaryNodeIndex=-1,
 replicas=[ReplicaUnderConstruction[[DISK]DS-7496d6a7-2a8f-4884-8a8f-f3a0f3037c0e:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-23f57228-24d8-4e51-afe9-c13a8b47a0a5:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-a4cfa75c-28f4-4e73-9e17-b6e3f129864f:NORMAL|RBW]]}
 size 0
 2014-07-24 17:33:33,482 INFO BlockStateChange: BLOCK* addStoredBlock:
 blockMap updated: 192.168.10.51:50010 is added to 
 blk_1073753337_12710{blockUCState=UNDER_CONSTRUCTION,
 primaryNodeIndex=-1,
 replicas=[ReplicaUnderConstruction[[DISK]DS-7496d6a7-2a8f-4884-8a8f-f3a0f3037c0e:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-23f57228-24d8-4e51-afe9-c13a8b47a0a5:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-a4cfa75c-28f4-4e73-9e17-b6e3f129864f:NORMAL|RBW]]}
 size 0
 2014-07-24 17:33:33,482 INFO BlockStateChange: BLOCK* addStoredBlock:
 blockMap updated: 192.168.10.49:50010 is added to 
 blk_1073753337_12710{blockUCState=UNDER_CONSTRUCTION,
 primaryNodeIndex=-1,
 replicas=[ReplicaUnderConstruction[[DISK]DS-7496d6a7-2a8f-4884-8a8f-f3a0f3037c0e:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-23f57228-24d8-4e51-afe9-c13a8b47a0a5:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-a4cfa75c-28f4-4e73-9e17-b6e3f129864f:NORMAL|RBW]]}
 size 0
 2014-07-24 17:33:33,484 INFO BlockStateChange: BLOCK* addStoredBlock:
 blockMap updated: 192.168.10.51:50010 is added to 
 blk_1073753338_12711{blockUCState=UNDER_CONSTRUCTION,
 primaryNodeIndex=-1,
 replicas=[ReplicaUnderConstruction[[DISK]DS-a4cfa75c-28f4-4e73-9e17-b6e3f129864f:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-7496d6a7-2a8f-4884-8a8f-f3a0f3037c0e:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-23f57228-24d8-4e51-afe9-c13a8b47a0a5:NORMAL|RBW]]}
 size 0
 2014-07-24 17:33:33,484 INFO BlockStateChange: BLOCK* addStoredBlock:
 blockMap updated: 192.168.10.49:50010 is added to 
 blk_1073753338_12711{blockUCState=UNDER_CONSTRUCTION,
 primaryNodeIndex=-1,
 replicas=[ReplicaUnderConstruction[[DISK]DS-a4cfa75c-28f4-4e73-9e17-b6e3f129864f:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-7496d6a7-2a8f-4884-8a8f-f3a0f3037c0e:NORMAL|RBW],
 ReplicaUnderConstruction[[DISK]DS-23f57228-24d8-4e51-afe9-c13a8b47a0a5:NORMAL|RBW]]}
 size 0
 2014-07-24 17:33:33,484 INFO BlockStateChange: BLOCK* addStoredBlock:
 blockMap updated: 192.168.10.50:50010 is added to 
 blk_1073753338_12711{blockUCState=UNDER_CONSTRUCTION,
 primaryNodeIndex=-1,
 replicas=[ReplicaUnderConstruction[[DISK]DS-a4cfa75c-28f4-4e73-9e17-b6e3f129864f:NORMAL|RBW],
 

Re: Skippin those gosh darn 0 byte files

2014-07-25 Thread Bertrand Dechoux
For reference : https://issues.apache.org/jira/browse/SPARK-1960
(which seems highly related)

I don't know if anything is tracked on the Hadoop/MapReduce side.

Bertrand Dechoux


On Wed, Jul 23, 2014 at 5:15 PM, Edward Capriolo edlinuxg...@gmail.com
wrote:

 Anyway, a solution (seen in Flume if I remember correctly) is having a
 good file-name strategy. For example, all new files should end in .open,
 and only when they are finished is the suffix removed. Then, for
 processing, you target only the finished files.

 I am not sure this will help. The sequence file reader will still try to
 open the file regardless of its name.

 For Hive, you might need to adapt the strategy a bit because Hive may not
 be able to target only files with a specific name (you are the expert). A
 simple move of the file from a temporary directory to the table directory
 would have the same effect (because, from the point of view of HDFS, it is
 the same operation: a metadata-only change).

 I would like to consider the files as soon as there is reasonable data in
 them. If I have to rename/move files, I will not be able to see the data
 until the file is moved/renamed. (I am building files for N minutes before
 closing them.) The problem only happens with 0-byte files; files currently
 being written work fine.

 It seems like the split calculation could throw away 0-byte files before
 we ever get down to the record reader and parsing the header. An
 interesting thing is that even though dfs -ls shows the files as 0
 bytes, sometimes I can dfs -text these 0-byte files and they actually
 have data! Sometimes when I dfs -text them I get the exception attached!

 So it is interesting that the semantics here are not obvious. Can we
 MapReduce over a file that is still being written? How does that work? It
 would be nice to understand the semantics here.
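
 To make the split-calculation idea concrete, here is a minimal sketch
 (untested, my own, against the old mapred API) of an input format that
 drops 0-byte files before splits are even computed:

   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.hadoop.fs.FileStatus;
   import org.apache.hadoop.mapred.JobConf;
   import org.apache.hadoop.mapred.SequenceFileInputFormat;

   public class NonEmptySequenceFileInputFormat<K, V>
       extends SequenceFileInputFormat<K, V> {
     @Override
     protected FileStatus[] listStatus(JobConf job) throws IOException {
       FileStatus[] all = super.listStatus(job);
       List<FileStatus> nonEmpty = new ArrayList<FileStatus>();
       for (FileStatus f : all) {
         if (f.getLen() > 0) { // skip files the NameNode reports as empty
           nonEmpty.add(f);
         }
       }
       return nonEmpty.toArray(new FileStatus[nonEmpty.size()]);
     }
   }

 Of course, as noted above, a file that dfs -ls reports as 0 bytes may
 actually contain data not yet reflected in the NameNode's length, so this
 would skip that data too.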

 On Wed, Jul 23, 2014 at 2:00 AM, Bertrand Dechoux decho...@gmail.com
 wrote:

 The best would be to get hold of a Flume developer. I am not strictly
 sure of all the differences between sync/flush/hsync/hflush and the
 different Hadoop versions. It might be the case that you are only flushing
 on the client side. Even if it were a clean strategy, creation+flush is
 unlikely to be an atomic operation.
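
 For what it's worth, a minimal sketch of the flush variants in question
 (my own illustration, assuming the Hadoop 2.x Syncable API and a
 hypothetical path):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class FlushVariants {
     public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       // Hypothetical demo path, not from this thread.
       FSDataOutputStream out = fs.create(new Path("/tmp/flush-demo"));
       out.write("record\n".getBytes("UTF-8"));
       out.hflush(); // new readers can now see the bytes, but they may
                     // still sit only in datanode memory
       out.hsync();  // additionally asks each datanode to sync to disk
       out.close();
     }
   }

 The older flush()/sync() names changed meaning across Hadoop versions,
 which is exactly the confusion above.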

 It is worth testing the read of an empty sequence file (both truly empty
 and with only a header). It should be quite easy with a unit test. A
 solution would indeed be to validate the behaviour of SequenceFileReader /
 InputFormat on such edge cases. But nothing guarantees that you won't have
 a record split between two HDFS blocks. This implies that, during the
 write, only the first block is visible, and with it only part of the
 record. It would be normal for the reader to fail in that case. You could
 tweak MapReduce's bad-records skipping, but that feels like hacking a
 system whose design is wrong from the beginning.
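
 A minimal sketch of such a test (my own, untested; uses the deprecated
 but still-present 2.x convenience constructors and a hypothetical local
 path):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.SequenceFile;
   import org.apache.hadoop.io.Text;

   public class HeaderOnlyReadTest {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       FileSystem fs = FileSystem.getLocal(conf);
       Path p = new Path("/tmp/header-only.seq"); // hypothetical test path

       // Writing zero records and closing leaves only the header on disk.
       SequenceFile.Writer w =
           SequenceFile.createWriter(fs, conf, p, Text.class, Text.class);
       w.close();

       // The reader should parse the header and simply report no records.
       SequenceFile.Reader r = new SequenceFile.Reader(fs, p, conf);
       Text k = new Text();
       Text v = new Text();
       System.out.println("has record: " + r.next(k, v)); // expected: false
       r.close();
     }
   }

 A truly 0-byte file is the nastier case: I would expect the Reader
 constructor itself to fail there, since there is no header to parse.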

 Anyway, a solution (seen in Flume if I remember correctly) is having a
 good file-name strategy. For example, all new files should end in .open,
 and only when they are finished is the suffix removed. Then, for
 processing, you target only the finished files.

 For Hive, you might need to adapt the strategy a bit because Hive may not
 be able to target only files with a specific name (you are the expert). A
 simple move of the file from a temporary directory to the table directory
 would have the same effect (because, from the point of view of HDFS, it is
 the same operation: a metadata-only change).
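
 As a sketch, that promotion step is just one metadata call (my own
 illustration; directory names hypothetical):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class PromoteFinishedFile {
     public static void main(String[] args) throws Exception {
       FileSystem fs = FileSystem.get(new Configuration());
       Path open = new Path("/data/incoming/part-0001.seq.open");
       Path done = new Path("/data/warehouse/mytable/part-0001.seq");
       // rename() is metadata-only in HDFS, so readers of the table
       // directory see either the finished file or nothing at all.
       if (!fs.rename(open, done)) {
         throw new java.io.IOException("rename failed: " + open);
       }
     }
   }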

 Bertrand Dechoux


 On Wed, Jul 23, 2014 at 12:16 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Here is the stack trace...

 Caused by: java.io.EOFException
   at java.io.DataInputStream.readByte(DataInputStream.java:267)
   at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
   at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
   at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
   at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
   at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
   at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
   at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
   at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
   ... 15 more




 On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Currently using:

 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-hdfs</artifactId>
   <version>2.3.0</version>
 </dependency>


 I have this piece of code that creates the writer:

 writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
 CompressionType.BLOCK, codec);

 Then I have a piece of code like this...

   public static final long SYNC_EVERY_LINES = 1000;
  if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0){
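
 A guess at the general shape of such a periodic-sync branch (purely
 illustrative, assuming the Hadoop 2.x Syncable methods on
 SequenceFile.Writer, with writer and meta as quoted above):

   if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0) {
     // Drop a sync marker so readers can re-align inside the file.
     writer.sync();
     // Push buffered bytes to the datanode pipeline so concurrent readers
     // can see them; hsync() would additionally force them to disk.
     writer.hflush();
   }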