Re: Writing orc files with storm via java API

Bobby Evans Mon, 31 Jul 2017 07:47:09 -0700

It should be possible to make this work, but it is not going to be simple.  The 
real issue is the format of the ORC file.  It is not written one record at a 
time, the way CSV and the other supported formats are.  Sadly, writing one 
record at a time is currently an assumption built into AbstractHdfsBolt.
https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/bolt/format/RecordFormat.java
So to support it we would need to make some modifications.  That is not 
impossible, but it is not a drop-in replacement either.  If this is something 
you want to tackle and contribute back, I think we would all love it.  You 
might also run into some issues with the format's metadata being written at 
the end of the file.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
I am not totally sure how easy it is to recover an ORC file if that footer is 
missing because a worker crashed.  You might end up with data loss in some 
cases if you are not extremely careful.  To truly fix it, you might also need 
to modify the ORC APIs themselves so that the metadata can be stored in, and 
recovered from, an external location, and then store it in ZooKeeper on each 
flush until the file is rotated.
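
To make the footer concern concrete, here is a minimal sketch in plain Java. 
It does not use the real ORC or Storm APIs: PerRecordFormat, FooterWriter and 
isReadable are hypothetical stand-ins, and the "file" is just a byte array. It 
only illustrates the difference between a format where every record is usable 
as soon as it is written and one where the index lands at the end on close().

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class FooterSketch {

    /** Hypothetical stand-in for a record-at-a-time format: one record in, one usable byte[] out. */
    public interface PerRecordFormat {
        byte[] format(String record);
    }

    /** Hypothetical footer-based writer: rows are grouped into stripes, the index is written only on close(). */
    public static final class FooterWriter {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        private final List<Integer> stripeOffsets = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();
        private final int stripeSize;

        public FooterWriter(int stripeSize) {
            this.stripeSize = stripeSize;
        }

        public void addRow(String row) {
            pending.add(row);
            if (pending.size() >= stripeSize) {
                flushStripe();
            }
        }

        private void flushStripe() {
            if (pending.isEmpty()) {
                return;
            }
            stripeOffsets.add(out.size()); // remember where this stripe starts
            byte[] stripe = String.join("|", pending).getBytes(StandardCharsets.UTF_8);
            out.write(stripe, 0, stripe.length);
            pending.clear();
        }

        /** Only here does the stripe index (the "footer") reach the file. */
        public void close() {
            flushStripe();
            byte[] footer = ("\nFOOTER" + stripeOffsets).getBytes(StandardCharsets.UTF_8);
            out.write(footer, 0, footer.length);
        }

        public byte[] bytes() {
            return out.toByteArray();
        }
    }

    /** A toy reader check: without the footer it cannot locate any stripe. */
    public static boolean isReadable(byte[] file) {
        return new String(file, StandardCharsets.UTF_8).contains("\nFOOTER");
    }

    public static void main(String[] args) {
        PerRecordFormat csv = r -> (r + "\n").getBytes(StandardCharsets.UTF_8);
        System.out.println("csv record bytes: " + csv.format("a,b").length);

        FooterWriter crashed = new FooterWriter(2);
        crashed.addRow("r1");
        crashed.addRow("r2");
        crashed.addRow("r3"); // the worker dies here, close() is never called
        System.out.println("crashed file readable: " + isReadable(crashed.bytes()));
    }
}
```

In this toy model the crashed file holds a full stripe of data but no way to 
find it, which is the situation an externally stored copy of the metadata 
would be meant to repair.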


The Trident HdfsState
https://github.com/apache/storm/blob/master/external/storm-hdfs/src/main/java/org/apache/storm/hdfs/trident/HdfsState.java
might be a more appropriate place to start, as the updated state is written out 
in micro-batches, but you would still have to deal with the footer issues, as 
Trident really cares about exactly-once processing.
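
As a rough illustration of why micro-batches are a more natural fit, here is a 
sketch of one possible approach: write each batch as its own complete file, 
named after the batch's transaction id, so that a replayed batch replaces its 
earlier file instead of appending duplicates. This is an assumption for 
illustration only, not how HdfsState behaves today. BatchFileSketch, commit and 
the file naming are hypothetical, and an in-memory map stands in for HDFS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BatchFileSketch {

    /** In-memory stand-in for a directory: file name -> rows of one complete, closed file. */
    private final Map<String, List<String>> files = new TreeMap<>();

    /** Writes one micro-batch as a complete file named after its transaction id. */
    public void commit(long txid, List<String> rows) {
        // A replay of the same txid overwrites the earlier file rather than appending to it.
        files.put("batch-" + txid + ".orc", new ArrayList<>(rows));
    }

    /** Returns every row from every committed file, in file-name order. */
    public List<String> readAll() {
        List<String> all = new ArrayList<>();
        for (List<String> rows : files.values()) {
            all.addAll(rows);
        }
        return all;
    }

    public int fileCount() {
        return files.size();
    }
}
```

The obvious cost of this layout is a large number of small files, so some form 
of later compaction would probably be needed.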

So overall it is not a simple problem, and relying on an external server like 
Hive would make it a lot simpler.


- Bobby


On Tuesday, July 25, 2017, 8:38:42 AM CDT, Igor Kuzmenko <f1she...@gmail.com> 
wrote:

Is there any implementation of a Storm bolt that can write files to HDFS in
ORC format without using the Hive Streaming API?
I've found a Java API for writing ORC files <https://github.com/apache/orc>
and I'm wondering whether any existing Hive bolt uses it, or whether there
are any plans to create one.
