Hi all,

I'm sending this to both core-u...@hadoop.apache.org and d...@hive.apache.org.
I'm trying to process Omniture's data log files with Hadoop/Hive. The file format is tab-delimited and, while pretty simple for the most part, it allows multiple newlines and tabs within a field, each escaped by a backslash (\\n and \\t). As a result, I've opted to write my own InputFormat that joins the lines of a multi-line record and converts the escaped tabs to spaces, so that Hive doesn't break a field apart when it splits each row on tabs.

I've found a fairly good reference for doing this with the newer InputFormat API at http://blog.rguha.net/?p=293 but unfortunately my version of Hive (0.7.0) still uses the old InputFormat API. I haven't been able to find many tutorials on writing a custom InputFormat against the older API, so I'm hoping to get some guidance on what may be wrong with the following two classes:

https://gist.github.com/3141e9d27d4e07f5f9ed
https://gist.github.com/79fdab227950a0776616

SELECT statements in Hive currently return nothing, and my other variations returned nothing but NULL values.

This question is also on StackOverflow at http://stackoverflow.com/questions/7692994/custom-inputformat-with-hive. If there's a resource someone can point me to, that'd also be great.

Many thanks in advance,
Mike
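
P.S. To make the intended record handling concrete, here is a minimal sketch in plain Java with no Hadoop classes. It is not the code from the gists. It assumes that an escape is a literal backslash immediately followed by a real newline or tab character, and that an escaped newline should become a space just like an escaped tab; the class and method names (EscapedRecordJoiner, nextRecord, readAll) are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class EscapedRecordJoiner {

    /**
     * Reads one logical record, or returns null at end of input.
     * Assumption: a physical line ending in a backslash continues on the
     * next physical line (the newline was escaped).
     */
    static String nextRecord(BufferedReader in) throws IOException {
        String line = in.readLine();
        if (line == null) {
            return null;
        }
        StringBuilder record = new StringBuilder();
        while (line != null) {
            if (line.endsWith("\\")) {
                // Drop the trailing backslash and replace the escaped
                // newline with a space, then keep reading.
                record.append(line, 0, line.length() - 1).append(' ');
                line = in.readLine();
            } else {
                record.append(line);
                break;
            }
        }
        // Replace each escaped tab (backslash + tab) with a single space so
        // a later split on tabs sees only real field separators.
        return record.toString().replace("\\\t", " ");
    }

    /** Convenience helper: reads every logical record from the input. */
    static List<String> readAll(BufferedReader in) throws IOException {
        List<String> records = new ArrayList<>();
        String record;
        while ((record = nextRecord(in)) != null) {
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Two logical records: the first spans two physical lines, the
        // second contains an escaped tab inside its first field.
        String sample = "a\tb\\\nc\td\n" + "e\\\tf\tg\n";
        for (String record : readAll(new BufferedReader(new StringReader(sample)))) {
            System.out.println(String.join(" | ", record.split("\t", -1)));
        }
    }
}
```

In an old-API reader, I'd expect the same loop to live inside the next(key, value) method of an org.apache.hadoop.mapred.RecordReader, with the joined record written into the Text value, but that wiring is exactly the part I'm unsure about.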