Good, thank you for the cleaning tips. Is there any way yet to configure Tajo with Kerberos or a security tool like Sentry?
2015-10-09 11:13 GMT-05:00 Jihoon Son <[email protected]>:

> You mean dfs-dir-aware doesn't work, so you set resource.disks to some
> value yourself, right? If so, I'll check the dfs-dir-aware configuration.
>
> Regarding space cleaning, you can delete any directory. Some system
> directories and files will be recreated automatically by Tajo when they
> are needed.
> In contrast, deleting data means that Tajo keeps working normally but you
> can no longer see the deleted data. For example, if you delete the query
> detail directory, you cannot see query details on the web UI anymore. The
> query detail directories are deleted automatically as time goes by, so you
> don't need to clean them up unless you are running low on available space.
>
> In addition, you may want to delete Tajo's temporary data, which is stored
> during query execution. The default temporary directory is created at
> /tmp/tajo-${user.name}/tmpdir. You can delete it yourself, or set
> 'tajo.worker.tmpdir.cleanup-at-startup' for automatic cleanup.
>
> Jihoon
>
> On Sat, Oct 10, 2015 at 12:50 AM, Odin Guillermo Caudillo Gallegos <
> [email protected]> wrote:
>
>> Hi.
>> I set dfs-dir-aware to true, but the performance wasn't what I expected.
>> So for test purposes, I left it with resource.disks.
>> About the HDFS space cleaning, which directories can I delete from my
>> Hadoop?
>> For example, is there a problem if I delete the query detail directory?
>> Can I delete another folder?
>> Thanks
>>
>> 2015-10-09 10:15 GMT-05:00 Jihoon Son <[email protected]>:
>>
>>> Hi Odin, yes you can make your query faster.
>>>
>>> First of all, you can increase the disk resource for Tajo workers by
>>> setting 'tajo.worker.resource.disks'. This disk resource determines the
>>> number of tasks that are executed in parallel: a higher disk resource
>>> means more parallel tasks. For example, given 10 tasks each of which
>>> reads data from HDFS, a Tajo worker with a disk resource of 1 will
>>> execute those tasks one by one. With a disk resource of 2, two tasks can
>>> be executed simultaneously, which can improve performance.
>>> However, as you may know, if too many tasks access a single disk at the
>>> same time, there will be a lot of random accesses, which make query
>>> performance worse.
>>> So, I recommend using the real number of physical disks for this
>>> configuration. Or, if you have already configured multiple disks for
>>> HDFS, Tajo can detect them automatically and use them as the worker's
>>> disk resource if you set 'tajo.worker.resource.dfs-dir-aware' to true.
>>> Please refer to
>>> http://tajo.apache.org/docs/devel/configuration/worker_configuration.html
>>> for more information.
>>> After changing configuration values, you need to restart your Tajo
>>> cluster.
>>>
>>> In addition, I *strongly recommend* enabling
>>> 'dfs.datanode.hdfs-blocks-metadata.enabled' for your HDFS. With this
>>> configuration, Tajo can achieve higher data locality when assigning its
>>> tasks to workers. This will improve Tajo's performance significantly. You
>>> need to restart your HDFS after configuring this, too.
>>>
>>> Best regards,
>>> Jihoon
>>>
>>> On Fri, Oct 9, 2015 at 11:43 PM, Odin Guillermo Caudillo Gallegos <
>>> [email protected]> wrote:
>>>
>>>> Hi.
>>>> I ran a select count on HDFS data, which returned a total of almost
>>>> 17 million records.
>>>> The count took 2 minutes.
>>>> I have the current config for the worker:
>>>>
>>>> <property>
>>>>   <name>tajo.worker.resource.memory-mb</name>
>>>>   <value>4096</value>
>>>>   <description>Available memory size (MB)</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>tajo.worker.resource.disks</name>
>>>>   <value>1</value>
>>>>   <description>Available disk capacity (usually number of
>>>>   disks)</description>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>tajo.worker.tmpdir.locations</name>
>>>>   <value>/tmp/tajo-11/tmpdir,/tmp/tajo-11/tmpdir1,/tmp/tajo-11/tmpdir2</value>
>>>>   <description>A base for other temporary directories.</description>
>>>> </property>
>>>>
>>>> Is there any way to give the query more power to make it faster?
>>>> Do I need to do another configuration?
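
For reference, the settings discussed in this thread could be collected roughly as follows. This is only a sketch: the property names are the ones quoted in the messages above, but every value is an assumption for illustration (including the disk count of 3 and the boolean value for the cleanup setting), and placing the first group in tajo-site.xml and the last property in hdfs-site.xml is the usual layout rather than something stated in the thread.

```xml
<!-- tajo-site.xml (worker side); restart the Tajo cluster afterwards -->

<!-- Option A: let Tajo derive the disk resource from the HDFS data dirs -->
<property>
  <name>tajo.worker.resource.dfs-dir-aware</name>
  <value>true</value>
</property>

<!-- Option B: set the disk resource by hand instead of Option A.
     The value 3 is hypothetical; use the real number of physical disks. -->
<!--
<property>
  <name>tajo.worker.resource.disks</name>
  <value>3</value>
</property>
-->

<!-- Clean temporary query data when a worker starts.
     A boolean value is assumed here. -->
<property>
  <name>tajo.worker.tmpdir.cleanup-at-startup</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml (HDFS side); restart HDFS afterwards -->
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>
```

Options A and B are shown as alternatives because the thread describes dfs-dir-aware as a replacement for setting the disk count manually; which one performs better on a given cluster would need to be measured.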
