You want to look into ADD JAR, CREATE FUNCTION (for UDFs), and ROW FORMAT SERDE 'full.class.name' (plus STORED AS INPUTFORMAT/OUTPUTFORMAT) for the SerDe and custom InputFormat.
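As a rough sketch of what that DDL could look like (the jar path, class names, columns, and SERDEPROPERTIES keys below are placeholders, not real artifacts — substitute your own):

```sql
-- Make the jar containing your custom SerDe/InputFormat visible to Hive
-- (path and com.example.* class names are hypothetical).
ADD JAR /tmp/my-binary-serde.jar;

CREATE TABLE binary_records (
  col1 INT,
  col2 STRING
)
ROW FORMAT SERDE 'com.example.MyBinarySerDe'
-- property names are made up; your SerDe defines its own keys
WITH SERDEPROPERTIES ('header.bytes' = '128', 'row.bytes' = '64')
STORED AS
  INPUTFORMAT 'com.example.FixedLengthInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```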
For tutorials, google for "adding a custom SerDe"; I found one from Cloudera: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

Depending on your numbers (rows per file, bytes per file, files per time interval, number of containers or map slots, memory per slot or container), splitting your files may not be necessary to obtain good performance.

- Douglas

On 12/12/14 2:17 AM, "Ingo Thon" <ist...@gmx.de> wrote:

>Dear List,
>
>I want to set up a DW based on Hive. However, my data does not come as
>handy CSV files but as binary files in a proprietary format.
>
>The binary file consists of:
>- 1 header with a dynamic number of bytes; its length can be read from
> the contents of the header. The header tells me how to parse the rows
> and how many bytes each row has.
>- n rows of k bytes, where k is defined within the header.
>
>The solution I have in mind looks as follows:
>- Write a custom InputFormat which chunks the data into blobs of length k
> but skips the bytes of the header, so I'd have two parameters for the
> InputFormat (bytes to skip, bytes per row).
> Do I really have to build this myself, or does something like this
> already exist? Worst case, I could also remove the header prior to
> pushing the data into HDFS.
>- Write a custom SerDe to parse the blobs. At least in theory, easy.
>
>The coding part does not look too complicated; however, I'm kind of
>struggling with how to compile and install such a SerDe. I installed Hive
>from source and imported it into Eclipse.
>I guess I now have to build my own project… Still, I'm a little bit lost.
>Is there any tutorial which describes the process?
>And also, is my general idea OK?
>
>thanks in advance
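The record layout described in the question (a self-describing header followed by n rows of k bytes) can be sketched in plain Java, independent of Hadoop. The header layout here is invented purely for illustration (first 4 bytes = total header length, next 4 bytes = row width) — a real reader would decode the proprietary header instead:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class FixedLengthReader {
    // Hypothetical header layout: a 4-byte header length followed by a
    // 4-byte row width. Substitute the real proprietary header decoding.
    public static List<byte[]> readRows(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int headerLen = buf.getInt();   // total bytes to skip before the rows
        int rowWidth = buf.getInt();    // k: bytes per row, defined in the header
        buf.position(headerLen);        // skip the remainder of the header

        // Chunk the rest of the file into fixed-length k-byte rows.
        List<byte[]> rows = new ArrayList<>();
        while (buf.remaining() >= rowWidth) {
            byte[] row = new byte[rowWidth];
            buf.get(row);
            rows.add(row);
        }
        return rows;
    }
}
```

A custom InputFormat/RecordReader would apply the same skip-then-chunk logic to the underlying stream, and the SerDe would then decode each k-byte blob into typed columns.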