(As an FYI I'm relatively new to Hive and have no previous SQL experience, so have been struggling a bit with the Language manual which seems to assume previous SQL experience)

Suppose I have a table, within which there is a column which contains domain names (ie such as hadoop.apache.org). I want to perform a count of all second-level domains, ie hadoop.apache.org and hive.apache.org would count in the same bucket.

Now I could count things for a particular second-level domain like this:

SELECT
  year, month, day, hour, COUNT(1) as count
FROM
  domainlog
WHERE
  year = 2011 AND
  month = 05 AND
  day = 15 AND
  (
    domain RLIKE ".*[.]apache[.]org"
  )
GROUP BY
  year, month, day, hour

however I'm not seeing a way to sum up all second-level domains rather than a particular one. I basically want to group everything using a regular expression along the lines of ".*[.][^.]*[.][^.]*" and then output lines with a count for the common portion. Any pointers in the correct direction would be welcome.

Thanks
- Adam

Reply via email to