Re: Optimal Config for the worker

Odin Guillermo Caudillo Gallegos Fri, 09 Oct 2015 10:43:53 -0700

Good. Thank you for the tips about cleaning.
Is there any way to configure Tajo with Kerberos or a security tool like
Sentry yet?


2015-10-09 11:13 GMT-05:00 Jihoon Son <[email protected]>:

> You mean dfs-dir-aware didn't work, so you set resource.disks to a value
> yourself, right? If so, I'll check the dfs-dir-aware configuration.
>
> Regarding space cleaning, you can delete any directory. Tajo
> automatically recreates the system directories and files it needs.
> Deleting data, in contrast, does not stop Tajo from working normally, but
> you can no longer see the deleted data. For example, if you delete the
> query detail directory, you can no longer see query details in the web UI.
> The query detail directories are deleted automatically over time, so you
> don't need to clean them up unless you are running low on available space.
>
> In addition, you may want to delete Tajo's temporary data, which is stored
> during query execution. The default temporary directory is created at
> /tmp/tajo-${user.name}/tmpdir. You can delete it yourself, or set
> 'tajo.worker.tmpdir.cleanup-at-startup' for automatic cleanup.
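>
> A minimal sketch of the automatic cleanup setting, assuming it is a
> boolean property placed in the worker's tajo-site.xml (the file name and
> the value 'true' are assumptions, not confirmed in this thread):
>
> ```xml
> <!-- Assumed location: conf/tajo-site.xml on each worker -->
> <property>
>   <name>tajo.worker.tmpdir.cleanup-at-startup</name>
>   <!-- Assumed boolean: true would clear the temporary directory at worker startup -->
>   <value>true</value>
> </property>
> ```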
>
> Jihoon
>
> On Sat, Oct 10, 2015 at 12:50 AM, Odin Guillermo Caudillo Gallegos <
> [email protected]> wrote:
>
>> Hi.
>> I set dfs-dir-aware to true, but the performance wasn't what I expected.
>> So for test purposes, I left it with resource.disks.
>> About the HDFS space cleaning, which directories can I delete from my
>> Hadoop?
>> For example, is there a problem if I delete the query detail? Can I delete
>> any other folder?
>> Thanks
>>
>> 2015-10-09 10:15 GMT-05:00 Jihoon Son <[email protected]>:
>>
>>> Hi Odin, yes you can make your query faster.
>>>
>>> First of all, you can increase the disk resource for Tajo workers by
>>> setting 'tajo.worker.resource.disks'. The disk resource determines how
>>> many tasks are executed in parallel: a higher value allows more parallel
>>> tasks. For example, given 10 tasks, each of which reads data from HDFS, a
>>> Tajo worker with a disk resource of 1 will execute those tasks one by
>>> one. With a disk resource of 2, two tasks can be executed simultaneously,
>>> which can improve performance.
>>> However, as you may know, if too many tasks access a single disk at the
>>> same time, the resulting random accesses make query performance worse.
>>> So, I recommend setting this value to the actual number of physical
>>> disks. Or, if you have already configured multiple disks for HDFS, Tajo
>>> can detect them automatically and use that number as the worker's disk
>>> resource when you set 'tajo.worker.resource.dfs-dir-aware' to true.
>>> Please refer to
>>> http://tajo.apache.org/docs/devel/configuration/worker_configuration.html
>>> for more information.
>>> After changing configuration values, you need to restart your Tajo
>>> cluster.
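>>>
>>> As a rough sketch of the two alternatives described above, the worker
>>> configuration might look like one of the following. The value 4 is a
>>> hypothetical disk count used only for illustration, and placing these
>>> properties in tajo-site.xml is an assumption:
>>>
>>> ```xml
>>> <!-- Option 1: set the disk resource by hand (4 is a hypothetical count) -->
>>> <property>
>>>   <name>tajo.worker.resource.disks</name>
>>>   <value>4</value>
>>> </property>
>>>
>>> <!-- Option 2: let the worker derive the disk resource from the HDFS data dirs -->
>>> <property>
>>>   <name>tajo.worker.resource.dfs-dir-aware</name>
>>>   <value>true</value>
>>> </property>
>>> ```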
>>>
>>> In addition, I *strongly recommend* enabling
>>> 'dfs.datanode.hdfs-blocks-metadata.enabled' for your HDFS. With this
>>> configuration, Tajo can achieve higher data locality when assigning
>>> tasks to workers, which improves Tajo's performance significantly. You
>>> need to restart HDFS after configuring this, too.
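>>>
>>> A minimal sketch of that HDFS setting, assuming it goes in the
>>> hdfs-site.xml used by your datanodes (the file location is an
>>> assumption about your deployment):
>>>
>>> ```xml
>>> <!-- Assumed location: hdfs-site.xml on the HDFS datanodes -->
>>> <property>
>>>   <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
>>>   <value>true</value>
>>> </property>
>>> ```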
>>>
>>> Best regards,
>>> Jihoon
>>>
>>> On Fri, Oct 9, 2015 at 11:43 PM, Odin Guillermo Caudillo Gallegos <
>>> [email protected]> wrote:
>>>
>>>> Hi.
>>>> I ran a select count over data in HDFS, which returned a total of
>>>> almost 17 million records.
>>>> The count took 2 minutes.
>>>> This is my current config for the worker:
>>>>
>>>> <property>
>>>>   <name>tajo.worker.resource.memory-mb</name>
>>>>   <value>4096</value>
>>>>   <description>Available memory size (MB)</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>tajo.worker.resource.disks</name>
>>>>   <value>1</value>
>>>>   <description>Available disk capacity (usually number of disks)</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>tajo.worker.tmpdir.locations</name>
>>>>   <value>/tmp/tajo-11/tmpdir,/tmp/tajo-11/tmpdir1,/tmp/tajo-11/tmpdir2</value>
>>>>   <description>A base for other temporary directories.</description>
>>>> </property>
>>>>
>>>> Is there any way to give the query more power to make it faster?
>>>> Do I need any other configuration?
>>>>
>>>>
>>
