Hive question, summing second-level domain names

Adam Phelps Mon, 23 May 2011 13:04:20 -0700

(As an FYI, I'm relatively new to Hive and have no previous SQL experience, so I have been struggling a bit with the Language Manual, which seems to assume previous SQL experience.)

Suppose I have a table with a column that contains domain names (such as hadoop.apache.org). I want to perform a count of all second-level domains, i.e. hadoop.apache.org and hive.apache.org would count in the same bucket.


Now I could count things for a particular second-level domain like this:

SELECT
  year, month, day, hour, COUNT(1) as count
FROM
  domainlog
WHERE
  year = 2011 AND
  month = 05 AND
  day = 15 AND
  (
    domain RLIKE ".*[.]apache[.]org"
  )
GROUP BY
  year, month, day, hour

However, I'm not seeing a way to sum up all second-level domains rather than a particular one. I basically want to group everything using a regular expression along the lines of ".*[.][^.]*[.][^.]*" and then output lines with a count for the common portion. Any pointers in the correct direction would be welcome.


Thanks
- Adam
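
One possible direction, offered as a minimal sketch rather than a tested answer: Hive's built-in regexp_extract function can pull the last two dot-separated labels out of each domain, and the result can then be used as a grouping key. The alias names sld, cnt and t are made up for illustration, and the pattern assumes the second-level domain is always the final two labels, so a name like example.co.uk would land in a "co.uk" bucket.

```sql
-- Sketch: count rows per second-level domain for one day.
-- Assumes the same domainlog table and columns as the query above.
SELECT
  year, month, day, hour, sld, COUNT(1) AS cnt
FROM
  (
    SELECT
      year, month, day, hour,
      -- Capture the last two labels, e.g. hadoop.apache.org -> apache.org
      regexp_extract(domain, '([^.]+[.][^.]+)$', 1) AS sld
    FROM
      domainlog
    WHERE
      year = 2011 AND
      month = 05 AND
      day = 15
  ) t
GROUP BY
  year, month, day, hour, sld
```

The subquery is there so that the extracted value has a column name to group on. Repeating the regexp_extract expression directly in GROUP BY may also work, depending on the Hive version.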
