Hi:
    
 I am following the development documentation on the official Kudu website to
analyze Kudu data with Spark (Kudu version 1.6.0):

The official code is:

    import org.apache.kudu.spark.kudu._

    val df = sqlContext.read.options(Map("kudu.master" -> "kudu.master:7051",
      "kudu.table" -> "kudu_table")).kudu
    // Query using the Spark API...
    df.select("id").filter("id >= 5").show()


My questions are:
(1) If I use the code from the official website, my table contains about 1.8
billion rows. When the DataFrame df is created and the filter is applied
afterwards, is this equivalent to loading all 1.8 billion rows into memory on
every query? The performance I am seeing is very poor.
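Here is a minimal sketch of what I am running, reusing the placeholder names
from the snippet above; I call explain() to check whether the filter actually
reaches the Kudu scan:

    import org.apache.kudu.spark.kudu._

    val df = sqlContext.read
      .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
      .kudu

    // Filters written as Column expressions (or SQL strings) are handed to the
    // data source, so supported comparisons can be turned into Kudu scan
    // predicates instead of being evaluated over the full table in Spark.
    val result = df.select("id").filter(df("id") >= 5)
    result.explain() // pushed filters appear in the physical plan's scan node
    result.show()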

(2) If I instead create a time-based range partition on the 1.8 billion-row
table and scan the relevant partitions directly with the underlying Java API,
would each scan then load only the data in the specified partitions rather
than all 1.8 billion rows? A sketch of what I mean follows.
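This is only a sketch against the Java client (called from Scala); the time
column event_time and the microsecond bounds are my own placeholders:

    import scala.collection.JavaConverters._
    import org.apache.kudu.client.KuduClient
    import org.apache.kudu.client.KuduPredicate
    import org.apache.kudu.client.KuduPredicate.ComparisonOp

    val client = new KuduClient.KuduClientBuilder("kudu.master:7051").build()
    val table = client.openTable("kudu_table")
    val tsCol = table.getSchema.getColumn("event_time") // placeholder range-partition column

    // Range predicates on the range-partition column let Kudu prune whole
    // partitions, so only the tablets covering [lower, upper) are scanned.
    val lowerMicros = 1514764800000000L // 2018-01-01T00:00:00Z, example bound
    val upperMicros = 1517443200000000L // 2018-02-01T00:00:00Z, example bound

    val scanner = client.newScannerBuilder(table)
      .addPredicate(KuduPredicate.newComparisonPredicate(tsCol, ComparisonOp.GREATER_EQUAL, lowerMicros))
      .addPredicate(KuduPredicate.newComparisonPredicate(tsCol, ComparisonOp.LESS, upperMicros))
      .setProjectedColumnNames(Seq("id", "event_time").asJava)
      .build()

    while (scanner.hasMoreRows) {
      val rows = scanner.nextRows()
      while (rows.hasNext) {
        val row = rows.next() // process one RowResult here
      }
    }
    scanner.close()
    client.shutdown()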

Please give me some suggestions, thanks!



UCE Logistics Co., Ltd.
Big Data Center    Feng Baoli
Mobile: 15050552430
Email: fengba...@uce.cn
