Spark Structured Streaming resource contention / memory issue

2018-10-12 Thread Patrick McGloin
Hi all,

We have a Spark Structured Streaming query which uses mapGroupsWithState.
After processing in a stable manner for some time, each micro-batch suddenly
starts taking 40 seconds. Suspiciously, it looks like exactly 40 seconds each
time. Before this, the batches were taking less than a second.
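
For context, here is a minimal sketch of the general shape of such a query; the
rate source, event type, and simple counting state below are illustrative
placeholders, not our actual source, schema, or state logic:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, Trigger}

// Illustrative event and state types only.
case class Event(key: String, value: Long)
case class RunningState(count: Long)
case class Output(key: String, count: Long)

object StatefulStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mapGroupsWithState-sketch").getOrCreate()
    import spark.implicits._

    // Built-in "rate" test source: emits (timestamp, value) rows.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")
      .load()
      .select($"value")
      .as[Long]
      .map(v => Event((v % 10).toString, v))

    // Per-key state is kept on the executors across micro-batches.
    val counts = events
      .groupByKey(_.key)
      .mapGroupsWithState[RunningState, Output](GroupStateTimeout.NoTimeout) {
        (key: String, batch: Iterator[Event], state: GroupState[RunningState]) =>
          val previous = state.getOption.getOrElse(RunningState(0L))
          val updated = RunningState(previous.count + batch.size)
          state.update(updated)
          Output(key, updated.count)
      }

    counts.writeStream
      .outputMode("update")
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
      .awaitTermination()
  }
}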


Looking at the details for a particular task, most partitions are processed
really quickly but a few take exactly 40 seconds:

[task timing screenshot not included in the archive]

The GC was looking OK while the data was being processed quickly, but then the
full GCs etc. suddenly stop (at the same time the 40-second issue starts):

[GC metrics screenshot not included in the archive]

I took a thread dump from one of the executors while this issue was happening,
but I cannot see any resource the threads are blocked on:

[thread dump screenshot not included in the archive]

Are we hitting a GC problem, and if so, why is it manifesting in this way? Or
is some other resource blocking, and if so, which one?


Thanks,
Patrick


Timestamp Difference/operations

2018-10-12 Thread Paras Agarwal
Hello Spark Community,

Currently in Hive we can do operations on timestamps like:
CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS TIMESTAMP)

It seems this is not supported in Spark.
Is there a way to do this?

Kindly provide some insight on this.


Paras
9130006036


Code review and Coding livestreams today

2018-10-12 Thread Holden Karau
I’ll be doing my regular weekly code review at 10am Pacific today -
https://youtu.be/IlH-EGiWXK8 with a look at the current RC, and in the
afternoon at 3pm Pacific I’ll be doing some live coding around the WIP graceful
decommissioning PR -
https://youtu.be/4FKuYk2sbQ8
-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Timestamp Difference/operations

2018-10-12 Thread John Zhuge
Yeah, the "-" operator does not seem to be supported; however, you can use the
"datediff" function:

In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]:
+-----------------------------------------------------------------------------------------------------------------------+
| datediff(CAST(CAST(2000-02-01 12:34:34 AS TIMESTAMP) AS DATE), CAST(CAST(2000-01-01 00:00:00 AS TIMESTAMP) AS DATE))   |
+-----------------------------------------------------------------------------------------------------------------------+
| 31                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------+

In [10]: select datediff('2000-02-01 12:34:34', '2000-01-01 00:00:00')
Out[10]:
+---------------------------------------------------------------------------------+
| datediff(CAST(2000-02-01 12:34:34 AS DATE), CAST(2000-01-01 00:00:00 AS DATE))   |
+---------------------------------------------------------------------------------+
| 31                                                                              |
+---------------------------------------------------------------------------------+

In [11]: select datediff(timestamp '2000-02-01 12:34:34', timestamp '2000-01-01 00:00:00')
Out[11]:
+-----------------------------------------------------------------------------------------------------------------+
| datediff(CAST(TIMESTAMP('2000-02-01 12:34:34.0') AS DATE), CAST(TIMESTAMP('2000-01-01 00:00:00.0') AS DATE))   |
+-----------------------------------------------------------------------------------------------------------------+
| 31                                                                                                              |
+-----------------------------------------------------------------------------------------------------------------+
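
Note that datediff works on dates, so it returns whole days. If sub-day
precision is needed, one option (a sketch only, relying on unix_timestamp's
epoch-seconds behaviour; the literal timestamps are just the ones from your
example) is to convert both timestamps to epoch seconds and subtract:

import org.apache.spark.sql.SparkSession

object TimestampDiffSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("timestamp-diff-sketch").getOrCreate()

    // unix_timestamp converts each timestamp to epoch seconds, so the
    // subtraction gives the difference in seconds (31 days plus 12:34:34 here)
    // rather than whole days.
    spark.sql(
      """SELECT unix_timestamp(CAST('2000-02-01 12:34:34' AS TIMESTAMP))
        |     - unix_timestamp(CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS diff_seconds
        |""".stripMargin
    ).show(truncate = false)

    spark.stop()
  }
}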

On Fri, Oct 12, 2018 at 7:01 AM Paras Agarwal 
wrote:

> Hello Spark Community,
>
> Currently in Hive we can do operations on timestamps like:
> CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS TIMESTAMP)
>
> It seems this is not supported in Spark.
> Is there a way to do this?
>
> Kindly provide some insight on this.
>
>
> Paras
> 9130006036
>


-- 
John


SparkSQL read Hive transactional table

2018-10-12 Thread daily
Hi,
I use the HCatalog Streaming Mutation API to write data to a Hive transactional
table, and then I use SparkSQL to read data from that transactional table. I get
the correct result.
However, SparkSQL takes much longer to read the Hive ORC bucketed transactional
table, because it reads all columns, not only the columns involved in the SQL.
My question is: why does SparkSQL read all columns of the Hive ORC bucketed
transactional table rather than only the columns involved in the SQL? Is it
possible to make SparkSQL read only the columns involved in the SQL?


For example:
Hive Table:
create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int) 
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as 
orc TBLPROPERTIES ('transactional'='true');
create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int) 
partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as 
orc TBLPROPERTIES ('transactional'='false');
SparkSQL: 
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
SparkSQL's stage Input size: dbtest.t_a1=113.9 GB, dbtest.t_a2=96.5 MB
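
One way to compare what each query actually reads (a diagnostic sketch; it
assumes a Hive-enabled SparkSession and the table names above, and the exact
plan node names depend on the Spark version) is to print the physical plans and
look at the scan node for each table:

import org.apache.spark.sql.SparkSession

object OrcPruningCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("orc-pruning-check")
      .enableHiveSupport()
      .getOrCreate()

    // Print the full plans; the scan node for each table shows which columns
    // Spark plans to read (the native file scan also prints a ReadSchema).
    spark.sql("select sum(t1), sum(t2) from dbtest.t_a1 group by t0").explain(true)
    spark.sql("select sum(t1), sum(t2) from dbtest.t_a2 group by t0").explain(true)

    spark.stop()
  }
}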


Best regards.