[ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259521#comment-16259521 ]
Don Bosco Durai edited comment on RANGER-1837 at 11/20/17 5:37 PM:
-------------------------------------------------------------------

bq. current implementation uses existing AuditQueue for the data pipe into destination. If this has to be avoided we need to have a one more major refactoring on the audit framework

Not sure I understood your concern. What are we currently using to write to HDFS in batches? Is it using AuditFileCacheProvider + ???

The audit framework consists of Queues and Destinations. Ideally, we could just clone AuditBatchQueue and simplify it. E.g. in the new class AuditFileQueue, we can read from the consumer and write to a local file from the method log(AuditEventBase event). At fixed intervals, we could then write the local file to the Destination (HDFS/S3). Similar to AuditFileSpool, we might have to keep track of the local files that still need to be written to the destination.

I also think BaseAuditHandler should have another method called logFile(File localFile), which HDFSDestination can implement to simply copy the file over, or to convert it to ORC and write it to HDFS. The default implementation could read the file line by line and call logJSON().

bq. Current framework provides 3 buffer size

If possible, I would suggest using one of the current buffers. Looking at the current code, it seems AuditFileCacheProvider replaces AuditAsyncQueue at the top level. That could be a problem, because now we are sinking into the file up front. The original design was that every destination has a queue backing it, so that the queue can help regulate the output flow to the destination. In this way, the queue manages the buffer, not the destination. Even with the current implementation, I would suggest introducing something like AuditFileQueue and setting the file size and time on it. This will let users pick a larger file size, e.g. a 1-hour interval, and write to the destination after 1 hour.

bq.
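The AuditFileQueue/logFile() proposal above could be sketched roughly as follows. This is a simplified illustration, not actual Ranger code: AuditEventBase and BaseAuditHandler are real Ranger classes, but the stand-in definitions, fields, and method bodies here are assumptions made only to show the intended flow (buffer events to a local spool file in log(), hand the file to the destination on a periodic flush, with a line-by-line logJSON() replay as the default logFile() behavior).

```java
import java.io.*;
import java.nio.file.Files;

// Stand-in for Ranger's audit event class (real class has many more fields).
class AuditEventBase {
    final String json;
    AuditEventBase(String json) { this.json = json; }
}

abstract class BaseAuditHandler {
    abstract void logJSON(String event);   // existing per-event sink

    // Proposed new hook: the default implementation replays the spool file
    // line by line; an HDFS destination could override this to copy the
    // file as-is, or convert it to ORC before writing.
    void logFile(File localFile) throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader(localFile))) {
            String line;
            while ((line = r.readLine()) != null) {
                logJSON(line);
            }
        }
    }
}

class AuditFileQueue {
    private final BaseAuditHandler destination;
    private final File spoolFile;

    AuditFileQueue(BaseAuditHandler destination, File spoolFile) {
        this.destination = destination;
        this.spoolFile = spoolFile;
    }

    // Called by the consumer for every audit event: append to the local file,
    // so no unbounded in-memory buffer builds up in the host component.
    void log(AuditEventBase event) throws IOException {
        try (Writer w = new FileWriter(spoolFile, true)) {
            w.write(event.json);
            w.write('\n');
        }
    }

    // Invoked on a timer (e.g. a 1-hour interval): push the spool file to
    // the destination, then discard it.
    void flush() throws IOException {
        destination.logFile(spoolFile);
        Files.delete(spoolFile.toPath());
    }
}
```

A real implementation would also rotate spool files and track unsent ones (as AuditFileSpool does), but the sketch shows why the queue, not the destination, ends up managing the buffer.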
I see this article https://community.hortonworks.com/articles/75501/orc-creation-best-practices.html which has some details.

This is good information. It seems advanced users can reload the data using Hive at regular intervals to get good distribution, sort order on audit dates, etc. They can also analyze it for better runtime performance. We can add these to a best practices or recommendations section for this feature.

bq. I tested it for 1 hr data for hdfs plugin and having all three 10000 was fine. It didn't create multiple files for the amount, but this depends on the amount of hdfs activities. I need to check with KAFKA plugin

I am worried that if we do this with an in-memory buffer, we risk affecting the native component by hogging memory.
> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
> Issue Type: Improvement
> Components: audit
> Reporter: Kevin Risden
> Assignee: Ramesh Mani
> Attachments: 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
>
> This is currently very verbose and would benefit from compression since this data is not frequently accessed.
>
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed files in ORC were less than 10% of the original size. So, it was significant decrease in size. Also, it is easier to run analytics on the Hive tables.
>
> So, there are couple of ways of doing it.
> * Write an Oozie job which runs every night and loads the previous day worth audit logs into ORC or other format
> * Write a AuditDestination which can write into the format you want to.
>
> Regardless which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E

-- This message was sent by Atlassian JIRA (v6.4.14#64029)