Hi,
 When I try to index the data using batchSize=150, the default batchTimeout, and 
a TimedRotationPolicy set to 30 minutes, some of the JSON files created in HDFS 
contain incomplete data: the last record in the file is only a portion of a 
JSON record. When I try to read the indexed data through a Hive external table, 
it throws an exception because of the partial JSON in the file. So while the 
data is streaming in, I am not able to do any operation on the indexed data. 
 
Indexed file example: 
    {"key1":"value1","key2":"value2","key3":"value3"}
    {"key1":"value1","key2":"value2","key3" 
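To illustrate why Hive fails on such a file, any line-delimited JSON reader chokes on the truncated last line. A minimal Python sketch, with the file contents hardcoded to mirror the example above:

```python
import json

# Mirror the example file above: one complete record, one truncated mid-record.
lines = [
    '{"key1":"value1","key2":"value2","key3":"value3"}',
    '{"key1":"value1","key2":"value2","key3"',
]

parsed, errors = [], []
for line in lines:
    try:
        parsed.append(json.loads(line))
    except json.JSONDecodeError as e:
        errors.append(str(e))

print(len(parsed), len(errors))  # one good record, one parse failure
```

A Hive JSON SerDe hits the same parse failure on that last line, except it aborts the whole query instead of skipping the record.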
        
When I tried to find the root cause of this behavior, I came across the 
following observations: 
  1. Metron flushes the data to HDFS based on a CountSyncPolicy; by default 
its value is set to the batchSize.
  2. When Metron performs the file rotation, it first closes the current file, 
which also results in a flush to HDFS.
  3. Regardless of the batchSize, Metron writes the data to HDFS after the 
batchTimeout expires.
  4. The CountSyncPolicy has no relation to the batchTimeout: even if the 
batchTimeout expires and Metron writes the data to HDFS, it won't initiate 
the flush; it still waits for the number of messages to reach the 
CountSyncPolicy threshold. Is this behavior intentional? Without the sync, the 
end user is not able to access the data completely, which defeats the 
advantage of batchTimeout.
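To make observation 4 concrete, here is a toy model of the interaction I am describing. This is plain Python with hypothetical names, not Metron code; it only encodes the behavior from the observations above (write on timeout, sync only on count):

```python
class ToyHdfsWriter:
    """Toy model of the observed write/sync behavior (NOT Metron internals).

    write() buffers messages; data only becomes visible to readers after
    sync(), which fires solely when the CountSyncPolicy threshold is reached.
    """

    def __init__(self, sync_count):
        self.sync_count = sync_count   # CountSyncPolicy threshold (= batchSize)
        self.unsynced = 0              # messages written but not yet synced
        self.visible = 0               # messages a reader can safely see

    def write(self, n=1):
        self.unsynced += n
        if self.unsynced >= self.sync_count:   # CountSyncPolicy fires
            self.sync()

    def on_batch_timeout(self):
        # Observations 3 and 4: data is written to HDFS on batchTimeout,
        # but no sync is initiated -- the unsynced count keeps accumulating.
        pass

    def sync(self):
        self.visible += self.unsynced
        self.unsynced = 0


w = ToyHdfsWriter(sync_count=150)
w.write(100)          # fewer than 150 messages arrive...
w.on_batch_timeout()  # ...then the batchTimeout expires
print(w.visible, w.unsynced)  # nothing visible; 100 records stuck unsynced
```

In this model, the 100 records written at timeout stay invisible (or partially visible, hence the truncated JSON) until 50 more messages arrive and trip the count, which matches what I see in the files.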
  
Due to the amount of data I am writing, I cannot set the CountSyncPolicy to 1, 
as that would hurt performance.

Currently our indexing directory structure is "yyyy/MM/dd". I need to perform 
some operations on the newly indexed data based on a sliding window, currently 
configured with max_window = 1 and a window size of one hour. Every hour I 
move the window to current_window_hour + 1. While the data is streaming in, I 
hit the JSON format error in Hive.
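For reference, the window selection I described can be sketched as follows. Only the "yyyy/MM/dd" layout is from our setup; the base path and function name are placeholders:

```python
from datetime import datetime, timedelta

def window_partition(base, now):
    """Return the yyyy/MM/dd partition path plus the one-hour window bounds
    ending at the current hour (max_window = 1, window size = one hour)."""
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    return f"{base}/{start:%Y/%m/%d}", start, end

# Example: at 10:30 the window covers 09:00-10:00 of the same day.
path, start, end = window_partition("/indexing", datetime(2017, 6, 5, 10, 30))
print(path, start, end)
```

Each hourly run reads only the partition containing the window, which is exactly when it trips over a file whose last record has not yet been synced.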

Can you suggest any methods to overcome this issue?
