hey guys As data wranglers and programmers we often need quick tools. One such tool I need almost everyday is one that greps a file based on contents of another file. One can write this in perl, python but since I am already using hadoop ecosystem extensively, I said why not do this in Hive ? Perhaps you guys already know this and have better solutions....nevertheless :-) here goes...
Best regards sanjay(Hive super-fan) I just posted this on my bloghttps://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/ In case the blog URL does not work for any reason, here is the logic Using Hive as a file comparison and grep-ping tool==================================================1. Logon to your linux terminal where u run Hive queries from 2. Create a database called "myutils" in Hive Create two hive tables myutils.file1 and myutils.file2 in Hive - each of these tables will have a partition called "fn" ----> fn is short for "filename" - each of these tables will have just one column called "ln" ----> ln is short for "line" An easy script to help do that would be as follows for r in 1 2 ; do hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils; DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING) PARTITIONED BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"; done 3. Create a permanent base location folder in HDFS hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/ hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/ USECASE 1 : ===========Search if a bunch of IP addresses exist in another file containing (larger) bunch of IPs [1] registeredIPs.txt 10.456.34.90 123.675.654.1 21.87.657.456 234.109.34.234 visitorIPs.txt 10.456.34.90 12.367.54.23 218.7.657.456 23.4.109.3 [2] Output which IPs in File1 are present in File2 [3] Put each file in a separate HDFS location hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt hdfs dfs -put VisitorIPs.txt /workspace/myutils/filecomparator/file1/visitorIPs.txt hdfs dfs -put registeredIPs.txt /workspace/myutils/filecomparator/file1/registeredIPs.txt hdfs dfs -put visitorIPs.txt /workspace/myutils/filecomparator/file1/visitorIPs.txt [4] Add partition to myutils.file1 For simplicity keep the partition names identical to the file names themselves hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(ln='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'" hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(ln='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt'" [5] Check that partitions can be accesd by Hive # This should give u the same answer as # wc -l registeredIPs.txt hive -e "select count(*) from myutils.file1 where fn='registeredIPs.txt'" # This should give u the same answer as # wc -l visitorIPs.txt hive -e "select count(*) from myutils.file2 where fn='visitorIPs.txt'" [6] Count the number of IPs in registeredIPs.txt that are in visitorIPs.txt # This dumps to a local file systemhive -e "SELECT f1.ln FROM (SELECT ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (select ln from myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt # This dumps to a new "internally-managed-by-hive" table # Make sure u already dont have some valuable hive table called "myutils.registered_in_visitors_list" - else this will overwrite that hive table with the results of this hive query hive -e "USE myutils; DROP TABLE IF EXITS registered_in_visitors_list; CREATE TABLE if not exists registered_in_visitors_list AS SELECT f1.ln FROM (select ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)" # This dumps to a directory on HDFS# Make sure u already dont have some valuable directory called "registered_in_visitors_list" - else this will overwrite that director and all its contents with the results of this hive query hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (select ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"