[ https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261466#comment-16261466 ]
Ramesh Mani edited comment on RANGER-1837 at 11/21/17 8:54 PM:
---------------------------------------------------------------

!AuditDataFlow.png|thumbnail!


was (Author: rmani):
[~bosco]
The current implementation is:
# AuditFileCacheProvider -> This gets the audit logs and stores them on the local filesystem via AuditFileCacheProviderSpooler.
# AuditFileCacheProviderSpooler has a thread that reads the local audit files in chunks (configured via the param "xasecure.audit.provider.filecache.filespool.buffer.size") and sends them to AsyncAuditQueue. This chunk becomes the batch size of the data going to the next point in this flow, in this case AsyncAuditQueue.
# AsyncAuditQueue: this is the existing one, which uses an AuditBatchQueue for each of the configured destinations. Here the queue size is one buffer, which can be configured via "xasecure.audit.destination.<destination>.batch.batch.size" (<destination> = hdfs/solr/etc.). I used the existing AsyncAuditQueue so that, in case of failures in the destination, it can back up with its own spooling and forwarding mechanism.
# Finally, HDFSAuditDestination has a WRITER, which can write JSON or ORC files. When the writer is ORCWriter, it has a buffer size which determines the batch size of each ORC file that is going to be created in HDFS or another destination.

So configuring these buffers will determine the batch size when ORC files are created.

I believe that you wanted to eliminate AsyncAuditQueue in this flow and send directly to HDFSDestination / SOLR destination via an AuditFileQueue. If you are proposing this, then that is what I was referring to when I mentioned refactoring / introducing a new pipeline to handle this scenario. Please correct me if I am wrong about this.

I have one more request, which is related to the data flow rate to different destinations. Currently, if we store the data locally and then forward it, all destinations will get the data at the same rate.
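As a rough illustration of the flow described above (this is not Ranger code; the function names and the in-memory "destinations" are hypothetical stand-ins), the sketch below reads spooled records in chunks of a configured buffer size and forwards each chunk as one batch to every destination. Because every destination is fed from the same spool, they all receive identical batches at the same pace.

```python
from typing import Dict, Iterable, Iterator, List


def read_in_chunks(records: Iterable[str], buffer_size: int) -> Iterator[List[str]]:
    """Yield lists of at most buffer_size records, like a spooler reading a file in chunks."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    chunk: List[str] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == buffer_size:
            yield chunk
            chunk = []
    if chunk:
        # Flush the final, possibly smaller, chunk.
        yield chunk


def forward(records: Iterable[str], buffer_size: int,
            destinations: List[str]) -> Dict[str, List[List[str]]]:
    """Send every chunk to every destination and return the batches each one received."""
    received: Dict[str, List[List[str]]] = {name: [] for name in destinations}
    for chunk in read_in_chunks(records, buffer_size):
        for name in destinations:
            # Each destination gets its own copy of the same batch.
            received[name].append(list(chunk))
    return received
```

With five records and a buffer size of 2, each destination receives three batches of sizes 2, 2 and 1, and the batches are the same for every destination.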
Suppose the AuditFileCacheProvider file rollover time is 1 hr; each destination will then get the data only after 1 hr. Some may want the SOLR destination to have the data more quickly than HDFS / S3. In that case we need to keep the existing pipeline for one or more destinations and use store-and-forward for the other destinations. So this also needs refactoring, to introduce a mechanism to pick queues for each destination or group of destinations.

Please let me know about this.


> Enhance Ranger Audit to HDFS to support ORC file format
> -------------------------------------------------------
>
>                 Key: RANGER-1837
>                 URL: https://issues.apache.org/jira/browse/RANGER-1837
>             Project: Ranger
>          Issue Type: Improvement
>          Components: audit
>            Reporter: Kevin Risden
>            Assignee: Ramesh Mani
>         Attachments: 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch, AuditDataFlow.png
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this data is not frequently accessed.
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in HDFS itself as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed files in ORC were less than 10% of the original size. So, it was significant decrease in size. Also, it is easier to run analytics on the Hive tables.
>
> So, there are couple of ways of doing it.
>
> Write an Oozie job which runs every night and loads the previous day worth audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>
> Regardless which approach you take, this would be a good feature for Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)