IIRC, the HW Trucking Demo creates a temporary table from CSV files of the new data, then issues an INSERT INTO … SELECT into an ORC table.
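If memory serves, the pattern is roughly the sketch below, via Hive JDBC (the connection URL, table, and column names are my guesses, not necessarily what the demo uses):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CsvToOrc {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/events", "storm", "");
             Statement stmt = conn.createStatement()) {
            // Stage the raw CSV as a text table (standing in for the
            // demo's temporary table).
            stmt.execute("CREATE EXTERNAL TABLE csv_staging (ts BIGINT, msg STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "STORED AS TEXTFILE LOCATION '/tmp/new_data'");
            // Copy it into the ORC table: the same rows hit disk twice.
            stmt.execute("INSERT INTO TABLE connection_events_orc "
                    + "SELECT * FROM csv_staging");
        }
    }
}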
For the love of Google, I can't find this demo atm to confirm, and I'm out of time. If that recollection is right, it strikes me as suboptimal compared to writing ORC files directly: the data must first be written to disk in a huge (uncompressed text) format and then copied. I'll dig deeper here as soon as I get a chance.

On 4/14/15, 6:09 PM, "Grant Overby (groverby)" <grove...@cisco.com> wrote:

>Submitting patches or test cases is tricky business for a Cisco employee.
>I'll put in the legal admin effort to get approval to do this. :/ The
>majority of the issues I mentioned /should/ find their way to Apache via
>Hortonworks.
>
>Additional responses are inline.
>
>On 4/14/15, 5:28 PM, "Gopal Vijayaraghavan" <gop...@apache.org> wrote:
>
>>>0.14. Acid tables have been a real pain for us. We don't believe they
>>>are production ready. At least in our use cases, Tez crashes for
>>>assorted reasons or only assigns 1 mapper to the partition. Having
>>>delta files and no base files borks mapper assignments.
>>
>>Some of the chicken-egg problems for those were solved recently in
>>HIVE-10114.
>>
>>Then TEZ-1993 is coming out in the next version of Tez, into which we're
>>plugging in HIVE-7428 (no fix yet).
>>
>>Currently delta-only splits have 0 bytes as the "file size", so they get
>>grouped together to make a 16MB chunk (or rather, one huge single
>>0-sized split).
>>
>>Those patches are the effect of me shaving the yak from the "1 mapper"
>>issue.
>>
>>After which the writer has to follow up on HIVE-9933 to get the locality
>>of files fixed.
>
>I'll look into this. If the 1 mapper issue is solved, that would be a
>huge win for streaming for us.
>
>>>name are left scattered about, borking queries. Latency is higher with
>>>streaming than writing to an orc file in hdfs, forcing obscene
>>>quantities of buckets and orc files smaller than any reasonable orc
>>>stripe / hdfs block size. The compactor hangs seemingly at random for
>>>no reason we've been able to discern.
>>
>>I haven't seen these issues yet, but I am not dealing with a large
>>volume insert rate, so haven't produced latency issues there.
>>
>>Since I work on Hive performance and haven't seen too many bugs filed,
>>I haven't paid attention to the performance of ACID.
>>
>>Please file bugs when you find them, so that they appear on the radar
>>for folks like me.
>>
>>I'm poking about because I want a live stream into LLAP to work
>>seamlessly & return sub-second query results when queried
>>(pre-cache/stage & merge etc).
>
>These files aren't ORC, but Hive expects them to be, leading to errors.
>They are made by using the Hive streaming API.
>
>root@twig13:~# hdfs dfs -ls -R /apps/hive/warehouse/events.db/connection_events4/ | grep flush | head -n 1
>-rw-r--r--   3 storm hadoop        200 2015-04-09 17:12 /apps/hive/warehouse/events.db/connection_events4/dt=1428613200/delta_11714703_11714802/bucket_00007_flush_length
>root@twig13:~# hdfs dfs -ls -R /apps/hive/warehouse/events.db/connection_events4/ | grep flush | wc -l
>283
>
>This may be addressed by HIVE-8966, which is in the 1.0.0 release.
>kill -9 to the process writing to Hive is a near-guaranteed way to leave
>these orphaned flush files, but we have seen them on several occasions
>when there is no indication that .close() was skipped.
>
>Our insert rate is about 100k/s for a 4-box cluster. Storm, Kafka, HDFS,
>Hive, etc. are 'pancaked' on this cluster. To keep up with this insert
>rate we need somewhere between 64 and 128 buckets for streaming to
>support an equal number of threads. We can keep up this same pace when
>writing ORC files directly to HDFS with only 8 threads and thus 8 ORC
>files.
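>
>For reference, our write path is the stock streaming pattern, roughly
>the sketch below (the metastore URI, partition value, and schema are
>placeholders):
>
>import java.util.Arrays;
>import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
>import org.apache.hive.hcatalog.streaming.HiveEndPoint;
>import org.apache.hive.hcatalog.streaming.StreamingConnection;
>import org.apache.hive.hcatalog.streaming.TransactionBatch;
>
>public class StreamToHive {
>    public static void main(String[] args) throws Exception {
>        // Endpoint for one partition of the streaming table.
>        HiveEndPoint endPt = new HiveEndPoint("thrift://metastore:9083",
>                "events", "connection_events4", Arrays.asList("1428613200"));
>        StreamingConnection conn = endPt.newConnection(true);
>        DelimitedInputWriter writer = new DelimitedInputWriter(
>                new String[] {"ts", "msg"}, ",", endPt);
>        TransactionBatch batch = conn.fetchTransactionBatch(10, writer);
>        try {
>            while (batch.remainingTransactions() > 0) {
>                batch.beginNextTransaction();
>                batch.write("1428613200,example".getBytes("UTF-8"));
>                batch.commit();
>            }
>        } finally {
>            // Skipping these closes (e.g. on kill -9) is what strands
>            // the *_flush_length side files in the delta directories.
>            batch.close();
>            conn.close();
>        }
>    }
>}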
>The ORC files from streaming are on the order of 5 MB apiece (15-minute
>insert-time base partitions). Even if ORC stripes this small aren't a
>problem, they're still going to waste a lot of disk space due to the
>HDFS block size.
>
>>>An orc file without a footer is junk data (or, at least, the last
>>>stripe is junk data). I suppose my question should have been 'what
>>>will the hive query do when it encounters this? Skip the stripe /
>>>file? Error out the query? Something else?'
>>
>>It should throw an exception, because that's a corrupt ORC file.
>>
>>The trucking demo uses Storm without ACID - this is likely to get
>>better once we use Apache Falcon to move the data around.
>>
>>Cheers,
>>Gopal
>
>I suppose the best thing to do then is to write the orc file outside of
>the partition directory, then issue an mv when the file is closed?
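>Something like the sketch below, assuming FileSystem.rename is the
>right 'mv' here (the paths are placeholders and the actual ORC write is
>elided):
>
>import org.apache.hadoop.conf.Configuration;
>import org.apache.hadoop.fs.FileSystem;
>import org.apache.hadoop.fs.Path;
>
>public class MoveWhenClosed {
>    public static void main(String[] args) throws Exception {
>        FileSystem fs = FileSystem.get(new Configuration());
>        Path tmp = new Path("/tmp/orc-staging/part-00000.orc");
>        Path dest = new Path("/apps/hive/warehouse/events.db/"
>                + "connection_events4/dt=1428613200/part-00000.orc");
>        // ... write and close the ORC file at tmp ...
>        // The file only appears in the partition after the rename,
>        // which is atomic in HDFS, so readers never see a footer-less
>        // file.
>        if (!fs.rename(tmp, dest)) {
>            throw new java.io.IOException(
>                    "rename failed: " + tmp + " -> " + dest);
>        }
>    }
>}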