Any ideas on this?
> I am trying to figure out how Hive parses an input file into a table,
> to use it as a model for implementing a similar parser. Having had a
> look at the source code of the org.apache.hadoop.hive.ql.parse
> package, I am not sure whether this is the (only) place to search for
> the answer.
>
> For example, to parse in an Apache weblog, I have found this HQL example:
>
> CREATE TABLE apachelog(host STRING, identity STRING,
> user STRING, time STRING, request STRING, status STRING,
> size STRING, referer STRING, agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
> "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
> "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
>
> whereas for CSV the row format would be something like
>
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
>
> So, my question is,
> - Does Hive have a "conventional" parser library that uses a separate
> class (e.g., regexParser, CSVParser) to implement the above commands,
> - Does it embed any 3rd-party code (like the Apache Commons CSV
> library) to do its parsing? or
> - Does it work in a different way?
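
To make the question concrete, here is a minimal standalone sketch of what the "input.regex" in the quoted example seems to do to a single line: match the whole line and turn each capture group into one column. This is only my reading of the regex, not Hive's actual code. The class name ApacheLogRegexSketch, the parse helper and the sample log line are all made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is not how Hive is known to implement it.
class ApacheLogRegexSketch {

    // Same pattern as "input.regex" in the quoted HQL. The Java string
    // escaping happens to be identical to the escaping used in the HQL literal.
    static final Pattern LOG_LINE = Pattern.compile(
        "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") "
        + "(-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?");

    // Returns one string per capture group (nine here), or null when the
    // line does not match. Groups inside the optional tail (referer, agent)
    // are null when that tail is absent.
    static String[] parse(String line) {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] columns = new String[m.groupCount()];
        for (int i = 0; i < columns.length; i++) {
            columns[i] = m.group(i + 1); // group 0 is the whole match
        }
        return columns;
    }

    public static void main(String[] args) {
        // Made-up sample line in Apache common log format.
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(Arrays.toString(parse(line)));
    }
}
```

If that reading is right, the nine groups would line up with the nine columns of the apachelog table, and what I am really asking is where the equivalent of parse() lives in Hive and whether the DELIMITED case goes through the same mechanism.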
