Re: Hive Parsing

Public Network Services Fri, 17 Jun 2011 23:13:53 -0700

Any ideas on this?


> I am trying to figure out how Hive parses an input file into a table,
> to use it as a model for implementing a similar parser. Having had a
> look at the source code of the org.apache.hadoop.hive.ql.parse
> package, I am not sure whether this is the (only) place to search for
> the answer.
>
> For example, to parse an Apache weblog, I have found this HQL example:
>
> CREATE TABLE apachelog(host STRING, identity STRING,
>        user STRING,  time STRING, request STRING, status STRING,
>        size STRING, referer STRING, agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>        "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
>        "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
>
> whereas for CSV the row format would be something like
>
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
>
> So, my questions are:
> - Does Hive have a "conventional" parser library that uses a separate
> class (e.g., regexParser, CSVParser) to implement the above commands?
> - Does it embed any 3rd-party code (like the Apache Commons CSV
> library) to do its parsing? Or
> - Does it work in a different way?
>

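For reference, the input.regex in the quoted CREATE TABLE can be exercised
outside Hive with plain java.util.regex. This is only a minimal sketch of
regex-based row splitting, not a copy of Hive's code path: the class name
ApacheLogRegexSketch, the parse() helper, and the sample log line are all
made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Hive.
class ApacheLogRegexSketch {
    // Same pattern as "input.regex" in the quoted DDL, written as a Java string literal.
    static final Pattern INPUT_REGEX = Pattern.compile(
        "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") "
        + "(-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?");

    // Returns one string per capture group (one per table column),
    // or null if the whole line does not match the pattern.
    static String[] parse(String line) {
        Matcher m = INPUT_REGEX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // groups are 1-based; unmatched optional groups are null
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample line in Apache common log format (no referer/agent).
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
            + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        String[] cols = parse(line);
        String[] names = {"host", "identity", "user", "time", "request",
                          "status", "size", "referer", "agent"};
        for (int i = 0; i < names.length; i++) {
            System.out.println(names[i] + " = " + cols[i]);
        }
    }
}
```

Each capture group maps to one column of the apachelog table, in order. The
last two groups (referer, agent) sit inside an optional non-capturing group,
so they come back as null for lines that lack them.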