For reference: https://issues.apache.org/jira/browse/SPARK-1960 (which seems highly related).
I don't know if anything is tracked on the Hadoop/MapReduce side.

Bertrand Dechoux


On Wed, Jul 23, 2014 at 5:15 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Anyway, a solution (seen in Flume if I remember correctly) is having a
> good file name strategy. For example, all new files should end in ".open"
> and only when they are finished is the suffix removed. Then for
> processing, you only target the latter.
>
> I am not sure this will help. The sequence file reader will still try to
> open the file regardless of its name.
>
> For Hive, you might need to adapt the strategy a bit because Hive may not
> be able to target only files with a specific name (you are the expert). A
> simple move of the file from a temporary directory to the table directory
> would have the same effect (because from the point of view of HDFS, it is
> the same operation: a metadata change only).
>
> I would like to consider the files as soon as there is reasonable data in
> them. If I have to rename/move files, I will not be able to see the data
> until the file is moved/renamed. (I am building files for N minutes
> before closing them.) The problem only happens with 0-byte files; files
> being written currently work fine.
>
> It seems like the split calculation could throw away 0-byte files before
> we ever get down to the record reader and parsing the header. An
> interesting thing is that even though dfs -ls shows the files as 0 bytes,
> sometimes I can dfs -text these 0-byte files and they actually have data!
> Sometimes when I dfs -text them, I get the exception attached!
>
> So it is interesting that the semantics here are not obvious. Can we
> MapReduce a file that is being written? How does it work? It would be
> nice to understand the semantics here.
>
>
> On Wed, Jul 23, 2014 at 2:00 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
>> The best would be to get hold of a Flume developer. I am not strictly
>> sure of all the differences between sync/flush/hsync/hflush across the
>> different Hadoop versions. It might be the case that you are only
>> flushing on the client side. Even if it were a clean strategy,
>> creation+flush is unlikely to be an atomic operation.
>>
>> It is worth testing the read of an empty sequence file (truly empty,
>> and with only a header). It should be quite easy with a unit test. A
>> solution would indeed be to validate the behaviour of SequenceFile.Reader
>> / InputFormat on edge cases. But nothing guarantees that you won't have a
>> record split between two HDFS blocks. This implies that during the write
>> only the first block is visible, and with it only a part of the record.
>> It would be normal for the reader to fail in that case. You could tweak
>> MapReduce bad-record skipping, but that feels like hacking a system whose
>> design is wrong from the beginning.
>>
>> Anyway, a solution (seen in Flume if I remember correctly) is having a
>> good file name strategy. For example, all new files should end in ".open"
>> and only when they are finished is the suffix removed. Then for
>> processing, you only target the latter.
>>
>> For Hive, you might need to adapt the strategy a bit because Hive may
>> not be able to target only files with a specific name (you are the
>> expert). A simple move of the file from a temporary directory to the
>> table directory would have the same effect (because from the point of
>> view of HDFS, it is the same operation: a metadata change only).
>>
>> Bertrand Dechoux
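
A minimal sketch of the rename-on-publish strategy described above,
assuming a Hadoop 2.x client. The directory layout and class name are
illustrative only, not taken from either poster's code; the key point is
that fs.rename() is a NameNode metadata operation, so a reader listing the
table directory sees the finished file or nothing:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class StagedWriter {

  /** Writes under an ".open" name in a staging dir, then publishes by rename. */
  public static void writeAndPublish(Configuration conf, Path stagingDir,
      Path tableDir, String name) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path open = new Path(stagingDir, name + ".open");
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, open,
        Text.class, Text.class, CompressionType.BLOCK);
    try {
      writer.append(new Text("k"), new Text("v"));
    } finally {
      writer.close();
    }
    // Publish atomically from the reader's point of view: rename only
    // changes NameNode metadata, never copies data.
    if (!fs.rename(open, new Path(tableDir, name))) {
      throw new IOException("could not publish " + open);
    }
  }
}

The trade-off is the one Edward raises above: nothing in the file is
visible to readers until the rename happens at close time.
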
>>
>> On Wed, Jul 23, 2014 at 12:16 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>>> Here is the stack trace...
>>>
>>> Caused by: java.io.EOFException
>>>     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
>>>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
>>>     ... 15 more
>>>
>>>
>>> On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> Currently using:
>>>>
>>>> <dependency>
>>>>     <groupId>org.apache.hadoop</groupId>
>>>>     <artifactId>hadoop-hdfs</artifactId>
>>>>     <version>2.3.0</version>
>>>> </dependency>
>>>>
>>>> I have this piece of code that does:
>>>>
>>>> writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
>>>>     CompressionType.BLOCK, codec);
>>>>
>>>> Then I have a piece of code like this:
>>>>
>>>> public static final long SYNC_EVERY_LINES = 1000;
>>>>
>>>> if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0) {
>>>>     meta.getWriter().sync();
>>>> }
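
A side note on the snippet above: SequenceFile.Writer.sync() only writes a
sync marker into the stream (the marker readers use to find record
boundaries in splits); it does not push bytes to HDFS. On Hadoop 2.x,
where SequenceFile.Writer implements Syncable, a flushing variant might
look like the following untested sketch, called in place of
meta.getWriter().sync() above (hflush()/hsync() availability depends on
the exact 2.x version):

import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;

public final class Flush {
  public static void markAndFlush(SequenceFile.Writer writer) throws IOException {
    writer.sync();   // writes a sync *marker* into the stream (used by
                     // split readers); does not move bytes to DataNodes
    writer.hflush(); // pushes buffered bytes down the DataNode pipeline so
                     // concurrent readers can see them
    // writer.hsync(); // stronger: also asks DataNodes to sync to disk
  }
}

Even after an hflush(), dfs -ls can keep reporting 0 bytes, because the
NameNode only learns the new length when a block completes or the file is
closed; that would be consistent with the "0 bytes in ls, but data in
-text" behaviour reported below.
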
>>>>
>>>> And I commonly see:
>>>>
>>>> [ecapriolo@staging-hadoop-cdh-67-14 ~]$ hadoop dfs -ls /user/beacon/2014072117
>>>> DEPRECATED: Use of this script to execute hdfs command is deprecated.
>>>> Instead use the hdfs command for it.
>>>>
>>>> Found 12 items
>>>> -rw-r--r--   3 service-igor supergroup  1065682 2014-07-21 17:50 /user/beacon/2014072117/0bb6cd71-70ac-405a-a8b7-b8caf9af8da1
>>>> -rw-r--r--   3 service-igor supergroup  1029041 2014-07-21 17:40 /user/beacon/2014072117/1b0ef6b3-bd51-4100-9d4b-1cecdd565f93
>>>> -rw-r--r--   3 service-igor supergroup  1002096 2014-07-21 17:10 /user/beacon/2014072117/34e2acb4-2054-44df-bbf7-a4ce7f1e5d1b
>>>> -rw-r--r--   3 service-igor supergroup  1028450 2014-07-21 17:30 /user/beacon/2014072117/41c7aa62-d27f-4d53-bed8-df2fb5803c92
>>>> -rw-r--r--   3 service-igor supergroup        0 2014-07-21 17:50 /user/beacon/2014072117/5450f246-7623-4bbd-8c97-8176a0c30351
>>>> -rw-r--r--   3 service-igor supergroup  1084873 2014-07-21 17:30 /user/beacon/2014072117/8b36fbca-6f5b-48a3-be3c-6df6254c3db2
>>>> -rw-r--r--   3 service-igor supergroup  1043108 2014-07-21 17:20 /user/beacon/2014072117/949da11a-247b-4992-b13a-5e6ce7e51e9b
>>>> -rw-r--r--   3 service-igor supergroup   986866 2014-07-21 17:10 /user/beacon/2014072117/979bba76-4d2e-423f-92f6-031bc41f6fbd
>>>> -rw-r--r--   3 service-igor supergroup        0 2014-07-21 17:50 /user/beacon/2014072117/b76db189-054f-4dac-84a4-a65f39a6c1a9
>>>> -rw-r--r--   3 service-igor supergroup  1040931 2014-07-21 17:50 /user/beacon/2014072117/bba6a677-226c-4982-8fb2-4b136108baf1
>>>> -rw-r--r--   3 service-igor supergroup  1012137 2014-07-21 17:40 /user/beacon/2014072117/be940202-f085-45bb-ac84-51ece2e1ba47
>>>> -rw-r--r--   3 service-igor supergroup  1028467 2014-07-21 17:20 /user/beacon/2014072117/c336e0c8-76e7-40e7-98e2-9f529f25577b
>>>>
>>>> Sometimes, even though they show as 0 bytes, you can read data from
>>>> them. Sometimes it blows up with a stack trace I have lost.
>>>>
>>>>
>>>> On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>>
>>>>> I looked at the source out of curiosity. For the latest version
>>>>> (2.4), the header is flushed during the writer creation. Of course,
>>>>> the key/value classes are provided. By 0 bytes, do you really mean
>>>>> even without the header? Or 0 bytes of payload?
>>>>>
>>>>>
>>>>> On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>>>
>>>>>> The header is expected to hold the full names of the key class and
>>>>>> value class, so if those are only known with the first record (?)
>>>>>> then indeed the file cannot respect its own format.
>>>>>>
>>>>>> I haven't tried it, but LazyOutputFormat should solve your problem.
>>>>>> https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/LazyOutputFormat.html
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 22, 2014 at 10:39 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>
>>>>>>> I have two processes: one that writes sequence files directly to
>>>>>>> HDFS, and a Hive table that reads those files.
>>>>>>>
>>>>>>> All works well, with the exception that I am only flushing the
>>>>>>> files periodically. The SequenceFile input format gets angry when
>>>>>>> it encounters 0-byte seq files.
>>>>>>>
>>>>>>> I was considering a flush and sync on the first record write. I was
>>>>>>> also thinking I should just be able to hack the sequence file input
>>>>>>> format to skip 0-byte files and not throw an exception on
>>>>>>> readFully(), which it sometimes does.
>>>>>>>
>>>>>>> Anyone ever tackled this?
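
On the "hack the sequence file input format" idea from that first message,
a minimal untested sketch using the old mapred API (the one appearing in
the stack trace), which filters zero-length files out before splits are
computed. The class name is made up. Note Edward's caveat above: a file
the NameNode lists as 0 bytes may still contain readable data, and this
filter would skip those too; whether Hive would pick such a class up
through its own input-format path is a separate question.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class NonEmptySequenceFileInputFormat
    extends SequenceFileInputFormat<Text, Text> {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    // Drop files the NameNode reports as empty before split calculation,
    // so the record reader never tries to parse a missing header.
    List<FileStatus> nonEmpty = new ArrayList<FileStatus>();
    for (FileStatus stat : super.listStatus(job)) {
      if (stat.getLen() > 0) {
        nonEmpty.add(stat);
      }
    }
    return nonEmpty.toArray(new FileStatus[nonEmpty.size()]);
  }
}
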