Hi,

When I try to index data using batchSize=150, the default batchTimeout, and a TimedRotationPolicy set to 30 minutes, Metron creates some JSON files in HDFS with incomplete data: the last record in an HDFS file sometimes contains only a portion of a JSON record. When I try to read the indexed data through a Hive external table, it throws an exception because of the partial JSON in the file. So while data is streaming in, I am not able to run any operation on the indexed data.

Example of an indexed file:

{"key1":"value1","key2":"value2","key3":"value3"}
{"key1":"value1","key2":"value2","key3"

While trying to find the root cause of this behavior, I came across the following observations:

1. Metron syncs data to HDFS based on a CountSyncPolicy whose value defaults to the batchSize.
2. When Metron performs a file rotation, it first closes the current file, which also flushes it to HDFS.
3. Regardless of the batchSize, Metron writes the data to HDFS after the batchTimeout expires.
4. The CountSyncPolicy has no relation to the batchTimeout: even when the batchTimeout expires and Metron writes the data to HDFS, it does not trigger a sync; it still waits for the message count to reach the CountSyncPolicy threshold.

Is this behavior intentional? Without the sync, the end user cannot read the data completely, which defeats the advantage of batchTimeout. Due to the amount of data I am writing, I cannot set the CountSyncPolicy to 1, as that would hurt performance.
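To make observation 4 concrete, here is a minimal sketch of a count-based sync policy. It is modelled on Storm's CountSyncPolicy, which I assume Metron's HDFS writer follows; the class and method names here are my own simplification, not the actual implementation. The point is that the sync counter advances only per message, so a timeout-triggered write neither consults nor resets it:

```java
// Minimal sketch of a count-based sync policy, modelled on Storm's
// CountSyncPolicy (assumption: Metron's HDFS writer has the same semantics).
public class CountSyncPolicySketch {
    private final int count;   // sync threshold; defaults to batchSize in Metron
    private int executed = 0;  // messages seen since the last sync

    public CountSyncPolicySketch(int count) { this.count = count; }

    // Called once per message; true means "hsync() the file now".
    public boolean mark() {
        executed++;
        return executed >= count;
    }

    // Called only after a sync actually happens.
    public void reset() { executed = 0; }

    public static void main(String[] args) {
        CountSyncPolicySketch policy = new CountSyncPolicySketch(150);
        boolean syncDue = false;
        // 40 messages arrive, then the batchTimeout fires and the data is
        // written to HDFS -- but the write path does not reset the policy,
        // so no sync happens and readers may see a partial last record.
        for (int i = 0; i < 40; i++) { syncDue = policy.mark(); }
        System.out.println("sync due after timeout write: " + syncDue);
    }
}
```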
Currently our indexing directory structure follows "yyyy/MM/dd". I need to run some operations on the newly indexed data based on a sliding window, currently configured with max_window = 1 and a window size of one hour; every hour I move the window to current_window_hour + 1. While data is still streaming in, I hit the JSON format error in Hive. Can you suggest any methods to overcome this issue?
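One idea I have considered is to have the hourly window job only pick up files that can no longer be appended to, i.e. files whose last modification is older than the rotation interval. A sketch of such a filter (the helper class and its names are hypothetical, and the 30-minute interval matches the TimedRotationPolicy above):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: treat an HDFS file as safe to query only once it is
// older than the rotation interval, so the writer cannot still be appending
// a partial JSON record to it.
public class ClosedFileFilter {
    private final Duration rotationInterval;

    public ClosedFileFilter(Duration rotationInterval) {
        this.rotationInterval = rotationInterval;
    }

    // fileModificationTime would come from the HDFS FileStatus of the file.
    public boolean isSafeToRead(Instant fileModificationTime, Instant now) {
        return Duration.between(fileModificationTime, now)
                       .compareTo(rotationInterval) >= 0;
    }

    public static void main(String[] args) {
        ClosedFileFilter filter = new ClosedFileFilter(Duration.ofMinutes(30));
        Instant now = Instant.now();
        // A file last touched 45 minutes ago must have been rotated out.
        System.out.println(filter.isSafeToRead(now.minus(Duration.ofMinutes(45)), now));
        // A file touched 5 minutes ago may still be open for writing.
        System.out.println(filter.isSafeToRead(now.minus(Duration.ofMinutes(5)), now));
    }
}
```

This still does not fix the missing sync itself, which is why I am asking whether the CountSyncPolicy behavior is intentional.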