What do you mean by joining the data sets?

A fake sample log file:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390
2013-07-04 20:00:01 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390
2013-07-04 20:00:02 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:03
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:03 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390
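To make the dependency concrete, here is a minimal Python sketch of how such a file has to be read: every entry is interpreted using the most recent "#Fields:" directive seen before it. The function name `parse_iis_log` and the shortened sample (fewer columns, and a field list that changes mid-file) are made up for illustration; this is not an existing library API.

```python
from typing import Dict, Iterable, Iterator


def parse_iis_log(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per log entry, keyed by the most recent '#Fields:' names."""
    fields = None  # column names from the latest '#Fields:' directive
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; only '#Fields:' changes how entries are read.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        if fields is None:
            # The situation of a reader that starts after the header.
            raise ValueError("log entry seen before any '#Fields:' directive")
        values = line.split()
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} values, got {len(values)}")
        yield dict(zip(fields, values))


# Hypothetical, shortened sample: the field list changes in the middle.
sample = """\
#Fields: date time cs-method cs-uri-stem sc-status
2013-07-04 20:00:00 GET /gs.gif 200
#Fields: date time cs-uri-stem sc-status time-taken
2013-07-04 20:00:03 /gs.gif 200 390
"""
records = list(parse_iis_log(sample.splitlines()))
```

A reader that starts in the middle of the file never sees the directive, which is the problem with handing an arbitrary block to a map task.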
"#Fileds:" line is needed to parse the following IIS log, however, it may change, which makes splitting not supported. 2013/12/30 Azuryy Yu <azury...@gmail.com> > You can run a mapreduce firstly, Join these data sets into one data set. > then analyze the joined dataset. > > > On Mon, Dec 30, 2013 at 3:58 PM, Fengyun RAO <raofeng...@gmail.com> wrote: > >> Hi, >> >> HDFS splits files into blocks, and mapreduce runs a map task for each >> block. However, Fields could be changed in IIS log files, which means >> fields in one block may depend on another, and thus make it not suitable >> for mapreduce job. It seems there should be some preprocess before storing >> and analyzing the IIS log files. We plan to parse each line to the same >> fields and store in Avro files with compression. Any other alternatives? >> Hbase? or any suggestions on analyzing IIS log files? >> >> thanks! >> >> >> >