Re: SparkSQL read Hive transactional table

2018-10-16 Thread daily
Hi,


Spark version: 2.3.0
Hive version: 2.1.0



Best regards.





------------------ Original message ------------------
From: "Gourav Sengupta";
Date: 2018-10-16 (Tue) 6:35
To: "daily";
Cc: "user"; "dev";
Subject: Re: SparkSQL read Hive transactional table



Hi,

May I ask which versions of Hive and Spark you are using?


Regards,
Gourav Sengupta


On Tue, Oct 16, 2018 at 2:42 AM daily wrote:

 
Hi,

I use the HCatalog Streaming Mutation API to write data to a Hive transactional table, and then I use SparkSQL to read data from that table. I get the correct result.
However, SparkSQL takes more time to read a Hive ORC bucketed transactional table, because it reads all columns (not only the columns involved in the SQL), and so it uses more time.
My question is: why does SparkSQL read all columns of a Hive ORC bucketed transactional table rather than only the columns involved in the SQL? Is it possible to make SparkSQL read only the columns involved in the SQL?
 
   
For example:

Hive tables:

create table dbtest.t_a1 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string) clustered by (t0) into 10 buckets
stored as orc TBLPROPERTIES ('transactional'='true');

create table dbtest.t_a2 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string) clustered by (t0) into 10 buckets
stored as orc TBLPROPERTIES ('transactional'='false');

SparkSQL:

select sum(t1), sum(t2) from dbtest.t_a1 group by t0;
select sum(t1), sum(t2) from dbtest.t_a2 group by t0;

SparkSQL stage input size:

dbtest.t_a1 = 113.9 GB
dbtest.t_a2 = 96.5 MB

Best regards.


SparkSQL read Hive transactional table

2018-10-12 Thread daily
Hi,


I use the HCatalog Streaming Mutation API to write data to a Hive transactional table, and then I use SparkSQL to read data from that table. I get the correct result.
However, SparkSQL takes more time to read a Hive ORC bucketed transactional table, because it reads all columns (not only the columns involved in the SQL), and so it uses more time.
My question is: why does SparkSQL read all columns of a Hive ORC bucketed transactional table rather than only the columns involved in the SQL? Is it possible to make SparkSQL read only the columns involved in the SQL?
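
One way to check what each scan actually requests is to print the physical plan. A minimal sketch in Scala (Spark 2.3), assuming a Hive-enabled SparkSession and the example tables defined below:

// Minimal sketch: print the physical plans to see which columns each
// scan requests. The queries only touch t0, t1 and t2, so a pruned scan
// should list just those columns.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("check-column-pruning")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("select sum(t1), sum(t2) from dbtest.t_a1 group by t0").explain(true)
spark.sql("select sum(t1), sum(t2) from dbtest.t_a2 group by t0").explain(true)

If the transactional table shows a HiveTableScan (the Hive SerDe path) while the non-transactional one shows a native ORC file scan with a pruned column list, that would be consistent with the input-size gap below.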


For example:
Hive Table:


create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int) 
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as 
orc TBLPROPERTIES ('transactional'='true');


create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int) 
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as 
orc TBLPROPERTIES ('transactional'='false');



SparkSQL: 
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;


SparkSQL stage input size: dbtest.t_a1 = 113.9 GB, dbtest.t_a2 = 96.5 MB
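
In case it helps: Spark 2.3 also ships a native ORC reader that does column pruning. A sketch of the relevant settings, reusing the `spark` session from the sketch above; treat this as something to test rather than a confirmed fix, since to my knowledge the native reader does not understand the ACID base/delta file layout of transactional tables:

// Sketch only: route Hive metastore ORC tables through Spark's native
// ORC reader, which prunes columns. This path may not read the ACID
// base/delta layout of transactional tables, so verify results on
// dbtest.t_a1 carefully before relying on it.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")

// Re-run and compare the plans and stage input sizes.
spark.sql("select sum(t1), sum(t2) from dbtest.t_a2 group by t0").explain(true)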






Best regards.