For reference: https://issues.apache.org/jira/browse/SPARK-1960 (which seems highly related).
I don't know if anything is tracked on the Hadoop/MapReduce side.

Bertrand Dechoux


On Wed, Jul 23, 2014 at 5:15 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> Anyway, a solution (seen in Flume if I remember correctly) is having a
> good file name strategy. For example, all new files should end in ".open"
> and only when they are finished is the suffix removed. Then for
> processing, you only target the latter.
>
> I am not sure this will help. The sequence file reader will still try to
> open the file regardless of its name.
>
> For Hive, you might need to adapt the strategy a bit because Hive may not
> be able to target only files with a specific name (you are the expert). A
> simple move of the file from a temporary directory to the table directory
> would have the same effect (because from the point of view of HDFS, it is
> the same operation: a metadata change only).
>
> I would like to consider the files as soon as there is reasonable data in
> them. If I have to rename/move files, I will not be able to see the data
> until the file is moved/renamed. (I am building files for N minutes
> before closing them.) The problem only happens with 0-byte files; files
> being written currently work fine.
>
> It seems like the split calculation could throw away 0-byte files before
> we ever get down to the record reader and parsing the header. An
> interesting thing is that even though dfs -ls shows the files as 0 bytes,
> sometimes I can dfs -text these 0-byte files and they actually have data!
> Sometimes when I dfs -text them, I get the exception attached!
>
> So it is interesting that the semantics here are not obvious. Can we
> MapReduce a file that is being written? How does it work? It would be
> nice to understand the semantics here.
>
>
> On Wed, Jul 23, 2014 at 2:00 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
>> The best would be to get hold of a Flume developer. I am not strictly
>> sure of all the differences between sync/flush/hsync/hflush across the
>> different Hadoop versions. It might be the case that you are only
>> flushing on the client side. Even if it were a clean strategy,
>> creation+flush is unlikely to be an atomic operation.
>>
>> It is worth testing the read of an empty sequence file (truly empty,
>> and with only a header). It should be quite easy with a unit test. A
>> solution would indeed be to validate the behaviour of SequenceFile.Reader
>> / InputFormat on edge cases. But nothing guarantees that you won't have a
>> record split between two HDFS blocks. This implies that during the write
>> only the first block is visible, and with it only a part of the record.
>> It would be normal for the reader to fail in that case. You could tweak
>> MapReduce bad-record skipping, but that feels like hacking a system whose
>> design is wrong from the beginning.
>>
>> Anyway, a solution (seen in Flume if I remember correctly) is having a
>> good file name strategy. For example, all new files should end in ".open"
>> and only when they are finished is the suffix removed. Then for
>> processing, you only target the latter.
>>
>> For Hive, you might need to adapt the strategy a bit because Hive may
>> not be able to target only files with a specific name (you are the
>> expert). A simple move of the file from a temporary directory to the
>> table directory would have the same effect (because from the point of
>> view of HDFS, it is the same operation: a metadata change only).
>>
>> Bertrand Dechoux
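
A minimal sketch of the rename-on-publish strategy described above,
assuming a Hadoop 2.x client. The directory layout and class name are
illustrative only, not taken from either poster's code; the key point is
that fs.rename() is a NameNode metadata operation, so a reader listing the
table directory sees the finished file or nothing:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class StagedWriter {

  /** Writes under an ".open" name in a staging dir, then publishes by rename. */
  public static void writeAndPublish(Configuration conf, Path stagingDir,
      Path tableDir, String name) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path open = new Path(stagingDir, name + ".open");
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, open,
        Text.class, Text.class, CompressionType.BLOCK);
    try {
      writer.append(new Text("k"), new Text("v"));
    } finally {
      writer.close();
    }
    // Publish atomically from the reader's point of view: rename only
    // changes NameNode metadata, never copies data.
    if (!fs.rename(open, new Path(tableDir, name))) {
      throw new IOException("could not publish " + open);
    }
  }
}

The trade-off is the one Edward raises above: nothing in the file is
visible to readers until the rename happens at close time.
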
>>
>> On Wed, Jul 23, 2014 at 12:16 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>
>>> Here is the stack trace...
>>>
>>> Caused by: java.io.EOFException
>>>     at java.io.DataInputStream.readByte(DataInputStream.java:267)
>>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:2072)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:2139)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2214)
>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:109)
>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:84)
>>>     at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:274)
>>>     ... 15 more
>>>
>>>
>>> On Tue, Jul 22, 2014 at 6:14 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> Currently using:
>>>>
>>>> <dependency>
>>>>     <groupId>org.apache.hadoop</groupId>
>>>>     <artifactId>hadoop-hdfs</artifactId>
>>>>     <version>2.3.0</version>
>>>> </dependency>
>>>>
>>>> I have this piece of code that does:
>>>>
>>>> writer = SequenceFile.createWriter(fs, conf, p, Text.class, Text.class,
>>>>     CompressionType.BLOCK, codec);
>>>>
>>>> Then I have a piece of code like this:
>>>>
>>>> public static final long SYNC_EVERY_LINES = 1000;
>>>>
>>>> if (meta.getLinesWritten() % SYNC_EVERY_LINES == 0) {
>>>>     meta.getWriter().sync();
>>>> }
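
A side note on the snippet above: SequenceFile.Writer.sync() only writes a
sync marker into the stream (the marker readers use to find record
boundaries in splits); it does not push bytes to HDFS. On Hadoop 2.x,
where SequenceFile.Writer implements Syncable, a flushing variant might
look like the following untested sketch, called in place of
meta.getWriter().sync() above (hflush()/hsync() availability depends on
the exact 2.x version):

import java.io.IOException;
import org.apache.hadoop.io.SequenceFile;

public final class Flush {
  public static void markAndFlush(SequenceFile.Writer writer) throws IOException {
    writer.sync();   // writes a sync *marker* into the stream (used by
                     // split readers); does not move bytes to DataNodes
    writer.hflush(); // pushes buffered bytes down the DataNode pipeline so
                     // concurrent readers can see them
    // writer.hsync(); // stronger: also asks DataNodes to sync to disk
  }
}

Even after an hflush(), dfs -ls can keep reporting 0 bytes, because the
NameNode only learns the new length when a block completes or the file is
closed; that would be consistent with the "0 bytes in ls, but data in
-text" behaviour reported below.
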
>>>>
>>>> And I commonly see:
>>>>
>>>> [ecapriolo@staging-hadoop-cdh-67-14 ~]$ hadoop dfs -ls /user/beacon/2014072117
>>>> DEPRECATED: Use of this script to execute hdfs command is deprecated.
>>>> Instead use the hdfs command for it.
>>>>
>>>> Found 12 items
>>>> -rw-r--r--   3 service-igor supergroup  1065682 2014-07-21 17:50 /user/beacon/2014072117/0bb6cd71-70ac-405a-a8b7-b8caf9af8da1
>>>> -rw-r--r--   3 service-igor supergroup  1029041 2014-07-21 17:40 /user/beacon/2014072117/1b0ef6b3-bd51-4100-9d4b-1cecdd565f93
>>>> -rw-r--r--   3 service-igor supergroup  1002096 2014-07-21 17:10 /user/beacon/2014072117/34e2acb4-2054-44df-bbf7-a4ce7f1e5d1b
>>>> -rw-r--r--   3 service-igor supergroup  1028450 2014-07-21 17:30 /user/beacon/2014072117/41c7aa62-d27f-4d53-bed8-df2fb5803c92
>>>> -rw-r--r--   3 service-igor supergroup        0 2014-07-21 17:50 /user/beacon/2014072117/5450f246-7623-4bbd-8c97-8176a0c30351
>>>> -rw-r--r--   3 service-igor supergroup  1084873 2014-07-21 17:30 /user/beacon/2014072117/8b36fbca-6f5b-48a3-be3c-6df6254c3db2
>>>> -rw-r--r--   3 service-igor supergroup  1043108 2014-07-21 17:20 /user/beacon/2014072117/949da11a-247b-4992-b13a-5e6ce7e51e9b
>>>> -rw-r--r--   3 service-igor supergroup   986866 2014-07-21 17:10 /user/beacon/2014072117/979bba76-4d2e-423f-92f6-031bc41f6fbd
>>>> -rw-r--r--   3 service-igor supergroup        0 2014-07-21 17:50 /user/beacon/2014072117/b76db189-054f-4dac-84a4-a65f39a6c1a9
>>>> -rw-r--r--   3 service-igor supergroup  1040931 2014-07-21 17:50 /user/beacon/2014072117/bba6a677-226c-4982-8fb2-4b136108baf1
>>>> -rw-r--r--   3 service-igor supergroup  1012137 2014-07-21 17:40 /user/beacon/2014072117/be940202-f085-45bb-ac84-51ece2e1ba47
>>>> -rw-r--r--   3 service-igor supergroup  1028467 2014-07-21 17:20 /user/beacon/2014072117/c336e0c8-76e7-40e7-98e2-9f529f25577b
>>>>
>>>> Sometimes, even though they show as 0 bytes, you can read data from
>>>> them. Sometimes it blows up with a stack trace I have lost.
>>>>
>>>>
>>>> On Tue, Jul 22, 2014 at 5:45 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>>
>>>>> I looked at the source out of curiosity. For the latest version
>>>>> (2.4), the header is flushed during the writer creation. Of course,
>>>>> the key/value classes are provided. By 0 bytes, do you really mean
>>>>> even without the header? Or 0 bytes of payload?
>>>>>
>>>>>
>>>>> On Tue, Jul 22, 2014 at 11:05 PM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>>>
>>>>>> The header is expected to hold the full names of the key class and
>>>>>> value class, so if those are only known with the first record (?)
>>>>>> then indeed the file cannot respect its own format.
>>>>>>
>>>>>> I haven't tried it, but LazyOutputFormat should solve your problem.
>>>>>> https://hadoop.apache.org/docs/current/api/index.html?org/apache/hadoop/mapred/lib/LazyOutputFormat.html
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand Dechoux
>>>>>>
>>>>>>
>>>>>> On Tue, Jul 22, 2014 at 10:39 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>>>
>>>>>>> I have two processes: one that writes sequence files directly to
>>>>>>> HDFS, and a Hive table that reads those files.
>>>>>>>
>>>>>>> All works well, with the exception that I am only flushing the
>>>>>>> files periodically. The SequenceFile input format gets angry when
>>>>>>> it encounters 0-byte seq files.
>>>>>>>
>>>>>>> I was considering a flush and sync on the first record write. I was
>>>>>>> also thinking I should just be able to hack the sequence file input
>>>>>>> format to skip 0-byte files and not throw an exception on
>>>>>>> readFully(), which it sometimes does.
>>>>>>>
>>>>>>> Anyone ever tackled this?
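
On the "hack the sequence file input format" idea from that first message,
a minimal untested sketch using the old mapred API (the one appearing in
the stack trace), which filters zero-length files out before splits are
computed. The class name is made up. Note Edward's caveat above: a file
the NameNode lists as 0 bytes may still contain readable data, and this
filter would skip those too; whether Hive would pick such a class up
through its own input-format path is a separate question.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class NonEmptySequenceFileInputFormat
    extends SequenceFileInputFormat<Text, Text> {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    // Drop files the NameNode reports as empty before split calculation,
    // so the record reader never tries to parse a missing header.
    List<FileStatus> nonEmpty = new ArrayList<FileStatus>();
    for (FileStatus stat : super.listStatus(job)) {
      if (stat.getLen() > 0) {
        nonEmpty.add(stat);
      }
    }
    return nonEmpty.toArray(new FileStatus[nonEmpty.size()]);
  }
}
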