Re: [Dev] Logging Implementation {was: Re: Any Possibility of defining the Hive output directory programmatically?}

Buddhika Chamith Mon, 23 Jul 2012 09:39:00 -0700

Please find my comments inline.

On Mon, Jul 23, 2012 at 8:35 PM, Manisha Gayathri <mani...@wso2.com> wrote:


> To give a clearer picture of the scenario:
> We have separate column families for each tenant, server and day.
> Eg:
>
> log_0_esbserver_2012_07_23
> log_1_esbserver_2012_07_23
> log_2_esbserver_2012_07_23
> log_0_esbserver_2012_07_24
> log_2_appserver_2012_07_24
> log_3_appserver_2012_07_24   (0,1,2.. denotes the tenantID)
>
>
> With the task/summarizer running at the end of the day, we need to create
> compressed files containing the info in each of the above column families.
> Eg:
>
> ....../0/esbserver/2012_07_23/logs.gz
> ....../1/esbserver/2012_07_23/logs.gz
> ....../2/esbserver/2012_07_23/logs.gz
> ....../0/esbserver/2012_07_24/logs.gz
>
>
> If we are doing this with Hive, we need to consider the following:
>
>    1. Dynamically pick all the column families that are related to the
>    particular date.
>    2. Dynamically generate the file URL for each of the log files.
>
Any particular reason why we need to assume the above needs to be a
responsibility of Hive? I am under the impression that this dynamic URL can
easily be created outside the Hive query, from within the logging component,
rather than pushing it into a Hive query (using concat etc.). Please
correct me if I am wrong.
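
To illustrate building these names outside Hive, here is a minimal sketch in
Python (the logging component itself would be Java; the function names and the
/tmp/logs base directory are hypothetical, chosen only for illustration). It
derives the column family name and the output directory from the tenant ID,
server name and date, following the naming shown in the quoted examples.

```python
from datetime import date
from pathlib import PurePosixPath


def column_family_name(tenant_id: int, server: str, day: date) -> str:
    """Build a column family name such as log_0_esbserver_2012_07_23."""
    return f"log_{tenant_id}_{server}_{day.strftime('%Y_%m_%d')}"


def output_directory(base_dir: str, tenant_id: int, server: str, day: date) -> str:
    """Build an output path of the form <base_dir>/<tenant>/<server>/<date>."""
    path = PurePosixPath(base_dir) / str(tenant_id) / server / day.strftime("%Y_%m_%d")
    return str(path)


day = date(2012, 7, 23)
print(column_family_name(0, "esbserver", day))
print(output_directory("/tmp/logs", 0, "esbserver", day))
```

Nothing here needs Hive: both values are plain strings that can be computed
before the query is submitted.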

If there isn't an issue preventing that, we can do the following.

Assume the below is your Hive query, which transfers data from Cassandra to
the local file system.

DROP TABLE cassandraTable;
CREATE TABLE cassandraTable (tenantId INT, serverName STRING,
    serviceName STRING, date TIMESTAMP, logString STRING)
    STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
    WITH SERDEPROPERTIES (
    "cassandra.host" = "127.0.0.1",
    "cassandra.port" = "9160",
    "cassandra.ks.name" = "EVENT_KS",
    "cassandra.ks.username" = "admin",
    "cassandra.ks.password" = "admin",
    "cassandra.cf.name" = "${hiveconf:cfname}",
    "cassandra.columns.mapping" =
        ":key,serverName,payload_user,serviceName,date,logString" );
INSERT OVERWRITE LOCAL DIRECTORY '{hiveconf:dir}'
    select * from cassandraTable;

Now say we have the following log column families.

log_0_esbserver_2012_07_23
log_1_esbserver_2012_07_23
log_2_esbserver_2012_07_23

Now within the logging component we can do the following. (in pseudo-code)

for (each tenant i)
  for (each server j)
    var logCf := "log_" + i + "_" + j + "_" + cur_date()
    // May need to create the folders at this point if they do not exist
    var logLocation := "file:///../" + i + "/" + j + "/" + cur_date()

    var hiveScript := getScriptFromRegistry()
    var setCfName := "SET cfname=" + logCf + ";\n" // set the cf name variable
    hiveScript := setCfName + hiveScript // prepend it to the script

    // Need to do a textual (e.g. regex) replace here, since it is not
    // possible to use variable substitution for entity names such as
    // directories, tables etc.
    hiveScript := hiveScript.replace("{hiveconf:dir}", logLocation)

    // Execute the runtime-modified Hive script using the Hive script
    // execution OSGi service
    HiveExecutionService.execute(hiveScript)
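
The script preparation step in the pseudo-code can be sketched as runnable
Python. This is only an illustration under stated assumptions: the `execute`
callable stands in for the Hive script execution service (its real API is not
shown here), `prepare_script` and `run_for_all` are hypothetical names, and
the file:///tmp/logs base URL is made up for the demo.

```python
from typing import Callable, Iterable

DIR_PLACEHOLDER = "{hiveconf:dir}"  # placeholder text used in the Hive script


def prepare_script(template: str, cf_name: str, log_location: str) -> str:
    """Prepend the SET statement and substitute the output directory."""
    if DIR_PLACEHOLDER not in template:
        raise ValueError("script template has no directory placeholder")
    set_cf_name = f"SET cfname={cf_name};\n"
    return set_cf_name + template.replace(DIR_PLACEHOLDER, log_location)


def run_for_all(template: str, tenants: Iterable[int], servers: Iterable[str],
                day: str, base_url: str, execute: Callable[[str], None]) -> int:
    """Execute one prepared script per (tenant, server) pair; return the count."""
    server_list = list(servers)  # allow re-iteration for every tenant
    count = 0
    for tenant in tenants:
        for server in server_list:
            cf_name = f"log_{tenant}_{server}_{day}"
            location = f"{base_url}/{tenant}/{server}/{day}"
            execute(prepare_script(template, cf_name, location))
            count += 1
    return count


# Demo with a stand-in executor that just records the scripts it receives.
template = ("INSERT OVERWRITE LOCAL DIRECTORY '{hiveconf:dir}' "
            "select * from cassandraTable;")
executed = []
run_for_all(template, [0, 1], ["esbserver"], "2012_07_23",
            "file:///tmp/logs", executed.append)
print(executed[0])
```

Failing early when the placeholder is missing avoids silently running a script
that would write every column family to the same location.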

There should be opportunities to parallelize this script execution once we
get this basic form working. HTH.
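
As a rough illustration of the parallelizing idea, here is a small Python
sketch using a thread pool. It assumes that the prepared scripts are
independent of each other and that the execution service is safe to call
concurrently; both assumptions would need to be checked. `execute_in_parallel`
and `fake_execute` are hypothetical names, and `fake_execute` only records the
scripts instead of running them.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def execute_in_parallel(scripts: List[str], execute: Callable[[str], None],
                        max_workers: int = 4) -> int:
    """Run each script on a small thread pool; return how many completed."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all results and re-raises the first failure, if any.
        results = list(pool.map(execute, scripts))
    return len(results)


# Stand-in for the real execution service: records scripts under a lock.
done = []
lock = threading.Lock()


def fake_execute(script: str) -> None:
    with lock:
        done.append(script)


scripts = [f"SET cfname=log_{i}_esbserver_2012_07_23;" for i in range(3)]
print(execute_in_parallel(scripts, fake_execute))
```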

Regards
Buddhika





> I tried various options to achieve the above, but with no luck. In [1], which
> you suggested, we need to give the file URL. But how can I dynamically
> generate the URL for each day, each server and each tenant? (None of the
> operations like concat work for providing this file URL.)
>
> I would appreciate it if you could suggest a concrete plan to implement this.
>
> [1].
> https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-Writingdataintofilesystemfromqueries
>
> Thanks
> Rgds
> Manisha
>
> On Mon, Jul 23, 2012 at 7:55 PM, Buddhika Chamith <buddhi...@wso2.com> wrote:
>
>> So if I understand right, the data are stored in separate column families
>> for each tenant, server and day, and the requirement is to transfer this
>> column family data directly to a flat file which corresponds to the logs
>> from a tenant for a server on a given day, with no analytics involved. If
>> that is the case, may I suggest using what Tharindu suggested (insert
>> select * from foo) in combination with [1], in a loop over the column
>> families. In order to dynamically provide the directory name and the column
>> family name, we can use the Hive SET command and prepend it to the script
>> before passing it in to the Hive execution service, as also suggested at [2].
>>
>> Regards
>> Buddhika
>>
>> [1]
>> https://cwiki.apache.org/Hive/languagemanual-dml.html#LanguageManualDML-Writingdataintofilesystemfromqueries
>>
>> [2]
>> http://mail-archives.apache.org/mod_mbox/hive-user/201207.mbox/browser
>>
>>
>> On Mon, Jul 23, 2012 at 7:21 PM, Tharindu Mathew <thari...@wso2.com> wrote:
>>
>>> insert select * from foo
>>>
>>>
>>> On Mon, Jul 23, 2012 at 7:15 PM, Afkham Azeez <az...@wso2.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Jul 23, 2012 at 6:41 PM, Tharindu Mathew <thari...@wso2.com> wrote:
>>>>
>>>>> If you are planning to do a few MB, that would mean that the size of
>>>>> logs will be ( size of logs * no. of tenants ), so roughly for 200 active
>>>>> tenants and 2 MB of logs, it would come to around 400 MB. This is still
>>>>> manageable in a custom task if your data processing is low.
>>>>>
>>>>> On Mon, Jul 23, 2012 at 6:24 PM, Afkham Azeez <az...@wso2.com> wrote:
>>>>>
>>>>>> Like you said, the task may not be the best way to do this. Like we
>>>>>> discussed the other day, we can publish logs to unique column families
>>>>>> which contain the <Service>_<Tenant>_<Date> as the unique identifier. We
>>>>>> need to generate logs in a file format & allow tenant users to download
>>>>>> those. What is the best approach to generate these log files from the
>>>>>> data collected? Typically, such a log file can run into a few MB.
>>>>>
>>>>> I'm a bit confused as we did not need to use Hive as per our earlier
>>>>> conversation. This is because as the data is published it is already
>>>>> grouped by server/ tenant and date.
>>>>>
>>>>
>>>> Yeah, there is no analytics to be done. It is a problem of converting
>>>> data stored in Cassandra into a flat file.
>>>>
>>>>
>>>>>
>>>>>> Azeez
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 23, 2012 at 6:18 PM, Tharindu Mathew <thari...@wso2.com> wrote:
>>>>>>
>>>>>>> I'm no expert, but I immediately question the scale of this approach.
>>>>>>>
>>>>>>> Do you have an idea of how much of logs you plan to process per task?
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 23, 2012 at 6:13 PM, Afkham Azeez <az...@wso2.com> wrote:
>>>>>>>
>>>>>>>> The requirement is simple. We need to generate log files on a per
>>>>>>>> tenant, per date, per Service basis. Now as a big data & analytics
>>>>>>>> expert, please advise us on what is the best solution for this.
>>>>>>>>
>>>>>>>> Azeez
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jul 23, 2012 at 6:05 PM, Tharindu Mathew <thari...@wso2.com> wrote:
>>>>>>>>
>>>>>>>>> So through this custom java task, what is the scale of log
>>>>>>>>> processing you will support? 100MB, 1 GB, 100 GB, 1 TB?
>>>>>>>>>
>>>>>>>>> On Mon, Jul 23, 2012 at 5:14 PM, Manisha Gayathri <mani...@wso2.com> wrote:
>>>>>>>>>
>>>>>>>>>> Contacted Hive User Group as well on this matter.
>>>>>>>>>> They also mentioned that this approach is not possible.
>>>>>>>>>> Also, as per the chat I had with Buddhika, this kind of dynamic
>>>>>>>>>> variable creation is not possible right now in the Hive that comes
>>>>>>>>>> with BAM2.
>>>>>>>>>>
>>>>>>>>>> Therefore IMO, rather than going ahead with this cumbersome process,
>>>>>>>>>> the best way will be to run a scheduled Java task to pick data from
>>>>>>>>>> the relevant Cassandra column families and dynamically generate the
>>>>>>>>>> relevant log files (according to the tenantID and current date),
>>>>>>>>>> which will be stored in Apache Directory.
>>>>>>>>>>
>>>>>>>>> You are going to store the results in a LDAP?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As per the offline chat I had with Azeez, I will start to work on a
>>>>>>>>>> custom Java task that can handle the above scenario.
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 23, 2012 at 2:27 PM, Manisha Gayathri <mani...@wso2.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> For a log file storing scenario using BAM2, I have a requirement
>>>>>>>>>>> to generate separate log files for each date. For that I have
>>>>>>>>>>> created a Hive Analytic query along with a Hive UDF as well.
>>>>>>>>>>>
>>>>>>>>>>> I have the getFilePath function which should return a URL like
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> home/user/Desktop/logDir/logs/log_0_testServer_2012_07_22
>>>>>>>>>>>
>>>>>>>>>>> The defined function works perfectly if I put
>>>>>>>>>>> getFilePath( "0","testServer" ) into the select statement.
>>>>>>>>>>>
>>>>>>>>>>> But I want to use that particular URL as the local directory
>>>>>>>>>>> name. (The requirement is that this should not be hard-coded
>>>>>>>>>>> in the Hive query; rather, it should be generated in the custom
>>>>>>>>>>> UDF.)
>>>>>>>>>>>
>>>>>>>>>>> So can I do something like what I have shown below?
>>>>>>>>>>>
>>>>>>>>>>> set file_name= getFilePath( "0","testServer" );    //Define a parameter.
>>>>>>>>>>> .................
>>>>>>>>>>> ..............
>>>>>>>>>>> INSERT OVERWRITE LOCAL DIRECTORY 'file:///${hiveconf:file_name}'    //Assign the above parameter as the file URL
>>>>>>>>>>>
>>>>>>>>>>> I tried this way. But the directory name is returned as
>>>>>>>>>>>
>>>>>>>>>>> file:/getFilePath( "0" , "testServer" )
>>>>>>>>>>>
>>>>>>>>>>> Does that mean I cannot use a UDF to define the local directory
>>>>>>>>>>> name?
>>>>>>>>>>> Or am I doing something wrong here?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> ~Regards
>>>>>>>>>>> *Manisha Eleperuma*
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> WSO2, Inc.: http://wso2.com
>>>>>>>>>>> lean.enterprise.middleware
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Dev mailing list
>>>>>>>>>> Dev@wso2.org
>>>>>>>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Tharindu
>>>>>>>>>
>>>>>>>>> blog: http://mackiemathew.com/
>>>>>>>>> M: +94777759908
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Afkham Azeez
>>>>>>>> Director of Architecture; WSO2, Inc.; http://wso2.com
>>>>>>>> Member; Apache Software Foundation; http://www.apache.org/
>>>>>>>> email: az...@wso2.com cell: +94 77 3320919
>>>>>>>> blog: http://blog.afkham.org
>>>>>>>> twitter: http://twitter.com/afkham_azeez
>>>>>>>> linked-in: http://lk.linkedin.com/in/afkhamazeez
>>>>>>>>
>>>>>>>> Lean . Enterprise . Middleware
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
>

