What do you mean by joining the data sets?

Here is a fake sample log file:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390
2013-07-04 20:00:01 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390
2013-07-04 20:00:02 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:03
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:03 1.2.2.9 GET /gs.gif xxx 80 - 98.248.105.227 Mozilla/5.0+(Windows+NT+6.1;+WOW64)+AppleWebKit/537.1+(KHTML,+like+Gecko)+Chrome/21.0.1180.89+Safari/537.1 200 0 0 390


The "#Fields:" line is needed to parse the IIS log lines that follow it.
However, it may change within a file, so the file cannot safely be split.
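
To make the dependency concrete, here is a minimal sketch in Python of the
kind of stateful parsing I mean. It is only an illustration, not our actual
code: the names parse_iis_log, normalize and SCHEMA are made up for this
example, the sample uses a shortened field list, and it assumes each entry
occupies one physical line in the real file.

```python
def parse_iis_log(lines):
    """Yield one dict per entry, keyed by the most recent #Fields directive."""
    fields = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The field list may change mid-file; remember the latest one.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            # Other directives (#Software, #Version, #Date) are skipped here.
            continue
        if fields is None:
            # This is what a map task sees if its split starts after #Fields.
            raise ValueError("entry seen before any #Fields directive")
        values = line.split(" ")
        if len(values) != len(fields):
            raise ValueError("entry does not match current #Fields: " + line)
        yield dict(zip(fields, values))


def normalize(record, schema):
    """Project a record onto a fixed schema; missing fields become None."""
    return {name: record.get(name) for name in schema}


# Hypothetical fixed schema for the preprocessed output.
SCHEMA = ["date", "time", "cs-method", "cs-uri-stem", "sc-status", "time-taken"]

sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status",
    "2013-07-04 20:00:00 GET /gs.gif 200",
    "#Fields: date time cs-uri-stem sc-status time-taken",
    "2013-07-04 20:00:03 /gs.gif 200 390",
]

records = [normalize(r, SCHEMA) for r in parse_iis_log(sample)]
```

Once every entry has the same fields, the output no longer depends on
earlier lines and could be stored in a splittable format.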


2013/12/30 Azuryy Yu <azury...@gmail.com>

> You can first run a MapReduce job to join these data sets into one data set,
> then analyze the joined data set.
>
>
> On Mon, Dec 30, 2013 at 3:58 PM, Fengyun RAO <raofeng...@gmail.com> wrote:
>
>> Hi,
>>
>> HDFS splits files into blocks, and MapReduce runs a map task for each
>> block. However, the fields can change within an IIS log file, which means
>> the fields in one block may depend on an earlier block, making the files
>> unsuitable for a MapReduce job as they are. It seems some preprocessing is
>> needed before storing and analyzing the IIS log files. We plan to parse each
>> line into the same set of fields and store the result in compressed Avro
>> files. Are there any alternatives, such as HBase? Any suggestions on
>> analyzing IIS log files?
>>
>> thanks!
>>
>>
>>
>
