Hi,

I am following the Kudu official website's development documentation, using Spark to analyze Kudu data (Kudu version 1.6.0).
The official example code is (note: I corrected the filter expression, which otherwise does not compile):

    val df = sqlContext.read
      .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
      .kudu
    // Query using the Spark API...
    df.select("id").filter("id >= 5").show()

My questions are:

(1) My table contains about 1.8 billion rows. If I use the official code above to create the DataFrame and then apply the filter, does that load all 1.8 billion rows into memory on every query? The performance I see is very poor.

(2) If I instead create a time-based range partition on the 1.8 billion-row table and scan partitions directly with the underlying Java API, would each query then only load the data in the specified partitions rather than all 1.8 billion rows?

Please give me some suggestions. Thanks!

Feng Baoli (冯宝利)
Big Data Center, UCE Logistics Co., Ltd. (优速物流有限公司)
Mobile: 15050552430
Email: fengba...@uce.cn
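For context on question (1), this is roughly the read path I have in mind, assuming the `sqlContext` from spark-shell and placeholder master/table/column names (the comments state my understanding of the connector's behavior, which is what I'd like confirmed):

```scala
import org.apache.kudu.spark.kudu._

// Placeholder master address and table name.
val df = sqlContext.read
  .options(Map("kudu.master" -> "kudu.master:7051", "kudu.table" -> "kudu_table"))
  .kudu

// DataFrames are lazy: nothing is read until an action runs, and my
// understanding is that the kudu-spark connector pushes simple comparison
// filters down to Kudu as scan predicates, so the scan should not need to
// materialize all 1.8 billion rows in Spark memory.
df.select("id").filter("id >= 5").show()
```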
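For question (2), the direct Java-API scan I have in mind looks roughly like this (written in Scala against the kudu-client API; the master address, table name, `event_time` column, and the timestamp literal are placeholders):

```scala
import java.util.Collections
import org.apache.kudu.client.{KuduClient, KuduPredicate}

// Placeholder master address, table, and column names.
val client = new KuduClient.KuduClientBuilder("kudu.master:7051").build()
try {
  val table = client.openTable("kudu_table")
  val timeCol = table.getSchema.getColumn("event_time")

  // A predicate on the range-partition column should let Kudu prune
  // partitions, so only tablets whose key range can match are scanned.
  val scanner = client.newScannerBuilder(table)
    .addPredicate(KuduPredicate.newComparisonPredicate(
      timeCol, KuduPredicate.ComparisonOp.GREATER_EQUAL, 1514764800000L))
    .setProjectedColumnNames(Collections.singletonList("id"))
    .build()

  while (scanner.hasMoreRows) {
    val rows = scanner.nextRows()
    while (rows.hasNext) {
      val row = rows.next()
      // process row.getLong("id") ...
    }
  }
} finally {
  client.close()
}
```

Is this the recommended way to restrict a scan to specific partitions, or does the Spark path above achieve the same pruning?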