There are of course a lot of considerations that affect query performance,
depending on your particular data and workload. For example, within HDFS,
data stored as Parquet will generally be more efficient to query than data
stored as text.

In general, HDFS is highly optimized for large scans (where a large number
of rows need to be processed), while Kudu is more efficient for highly
selective queries (where only a single row or a few rows are affected).

Note also that there are functional differences: Kudu tables support the
UPDATE and DELETE statements while HDFS tables do not, for example.
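
As a rough sketch of the points above, the statements below create one
HDFS-backed Parquet table and one Kudu table, then run a row-level change
against the Kudu table. The table names, column names, and partition count
are made up for illustration; adjust them to your own schema.

```sql
-- Hypothetical HDFS-backed table stored as Parquet (suits large scans).
CREATE TABLE events_parquet (
  id BIGINT,
  name STRING
)
STORED AS PARQUET;

-- Hypothetical Kudu table. Kudu tables need a primary key; the hash
-- partitioning and the partition count of 4 are assumptions, not a
-- recommendation.
CREATE TABLE events_kudu (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Row-level changes work on the Kudu table...
UPDATE events_kudu SET name = 'renamed' WHERE id = 1;
DELETE FROM events_kudu WHERE id = 1;

-- ...but the same statements against events_parquet are rejected.
```

A highly selective lookup such as "WHERE id = 1" is the kind of query where
the Kudu table would be expected to do well, while a scan over most of the
rows is where the Parquet table would be expected to do well.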

Some more documentation you might find useful:
http://impala.apache.org/docs/build/html/topics/impala_performance.html
http://impala.apache.org/docs/build/html/topics/impala_kudu.html

On Fri, Dec 6, 2019 at 10:01 AM l vic <[email protected]> wrote:

> Thank you for the clarification. What are the performance implications
> for search queries of using HDFS vs. Kudu if storing large datasets
> (~10,000 records) per table? Does storing large datasets in Kudu improve
> search performance?
> Thanks again,
> -V
>
> On Fri, Dec 6, 2019 at 12:51 PM Thomas Tauber-Marshall <
> [email protected]> wrote:
>
>> Yes, you can use Impala to run queries against data in HDFS. Kudu is not
>> required.
>>
>> By default, new tables will be created for HDFS. To create Kudu tables,
>> or to control the file format that the table's data is saved as in HDFS,
>> you can use the "STORED AS" clause with CREATE TABLE. To control where in HDFS
>> the data is stored, you can use the LOCATION clause with CREATE TABLE. To
>> query data that is already in HDFS (rather than creating a new, empty
>> table) you can use EXTERNAL and LOCATION with CREATE TABLE.
>>
>> There are a bunch more details in the documentation:
>> http://impala.apache.org/docs/build/html/topics/impala_create_table.html
>>
>>
>> On Fri, Dec 6, 2019 at 9:43 AM l vic <[email protected]> wrote:
>>
>>> After first look at documentation and tutorial i am still confused with
>>> how to use/ configure storage backend for impala... Can I use impala sql to
>>> run queries against data in hdfs, or do i need backend data server like
>>> "kudu"? How to specify data storage in "create table" statement?
>>> Thank you,
>>> -V
>>>
>>
