SparkSQL read Hive transactional table

2018-10-15 Thread daily
Hi,

I use the HCatalog Streaming Mutation API to write data to a Hive transactional 
table, and then I use SparkSQL to read data from that table. I get the correct 
result.
However, SparkSQL takes more time to read the Hive ORC bucketed transactional 
table, because it reads all columns rather than only the columns referenced in 
the SQL.
My question is: why does SparkSQL read all columns of a Hive ORC bucketed 
transactional table instead of only the columns referenced in the SQL? Is there 
a way to make SparkSQL read only the referenced columns?
   
 
   
For example:

Hive tables:

create table dbtest.t_a1 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string) clustered by (t0) into 10 buckets
stored as orc TBLPROPERTIES ('transactional'='true');

create table dbtest.t_a2 (t0 VARCHAR(36), t1 string, t2 double, t5 int, t6 int)
partitioned by (sd string, st string) clustered by (t0) into 10 buckets
stored as orc TBLPROPERTIES ('transactional'='false');

SparkSQL:

select sum(t1), sum(t2) from dbtest.t_a1 group by t0;
select sum(t1), sum(t2) from dbtest.t_a2 group by t0;

SparkSQL stage input size:

dbtest.t_a1 = 113.9 GB
dbtest.t_a2 = 96.5 MB
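
As a side note, column pruning normally comes from Spark's own ORC data source path rather than the Hive SerDe path. A minimal sketch of the relevant settings, assuming Spark 2.3+; transactional (ACID) tables may still fall back to the Hive reader, in which case these settings will not help:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
// Use Spark's vectorized ORC reader instead of the Hive ORC SerDe.
spark.conf.set("spark.sql.orc.impl", "native")
// Convert metastore ORC tables to Spark's own data source where possible.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("select sum(t2) from dbtest.t_a2 group by t0").show()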
   
 
   
Best regards.


Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-15 Thread Dillon Dukek
In your program, persist the smaller table and call count to force it to
materialize. Then, in the Spark UI, go to the Storage tab; the size of your
table as Spark sees it should be displayed there. Out of curiosity, what
version and language of Spark are you using?
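
A minimal sketch of that check (the table name is hypothetical):

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().getOrCreate()
// Cache the smaller side and force materialization so its in-memory size
// shows up under the "Storage" tab of the Spark UI (port 4040 on the driver).
val small = spark.table("db.small_table")   // hypothetical table name
small.persist(StorageLevel.MEMORY_AND_DISK)
small.count()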

On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri  wrote:

> I am trying to do a broadcast join on two tables. The size of the
> smaller table will vary based upon the parameters but the size of the
> larger table is close to 2TB. What I have noticed is that if I don't
> set the spark.sql.autoBroadcastJoinThreshold to 10G some of these
> operations do a SortMergeJoin instead of a broadcast join. But the
> size of the smaller table shouldn't be this big at all. I wrote the
> smaller table to an S3 folder and it took only 12.6 MB of space. I
> did some operations on the smaller table so that the shuffle size
> appears on the Spark History Server, and the size in memory seemed to
> be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
> smaller table it takes a long time to broadcast, leading me to think
> that the table might not just be 150 MB in size. What would be a good
> way to figure out the actual size that Spark is seeing and deciding
> whether it crosses the spark.sql.autoBroadcastJoinThreshold?
>
>
>


RE: kerberos auth for MS SQL server jdbc driver

2018-10-15 Thread Luca Canali
We have a case where we interact with a Kerberized service and found a simple 
workaround to distribute and make use of the driver's Kerberos credential cache 
file in the executors. Maybe some of the ideas there can be of help for this 
case too? Our case is on Linux, though. Details:
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_Executors_Kerberos_HowTo.md

Regards,
Luca

From: Marcelo Vanzin 
Sent: Monday, October 15, 2018 18:32
To: foster.langb...@riskfrontiers.com
Cc: user 
Subject: Re: kerberos auth for MS SQL server jdbc driver

Spark only does Kerberos authentication on the driver. For executors it 
currently only supports Hadoop's delegation tokens for Kerberos.

To use something that does not support delegation tokens you have to manually 
manage the Kerberos login in your code that runs in executors, which might be 
tricky. It means distributing the keytab yourself (not with Spark's --keytab 
argument) and calling into the UserGroupInformation API directly.

I don't have any examples of that, though, maybe someone does. (We have a 
similar example for Kafka on our blog somewhere, but not sure how far that will 
get you with MS SQL.)
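
For illustration only (this is not a documented recipe; the principal, keytab name, and JDBC URL below are hypothetical), a rough sketch of what managing the Kerberos login inside executor code could look like, assuming the keytab is shipped to the executors with --files and that df is the DataFrame being written:

import java.security.PrivilegedExceptionAction
import java.sql.DriverManager
import org.apache.hadoop.security.UserGroupInformation

// Hedged sketch: log in from the keytab inside each task and run the JDBC
// work under that identity instead of the default cluster user.
df.foreachPartition { rows =>
  val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "svc_spark@EXAMPLE.COM",   // hypothetical principal
    "./svc_spark.keytab")      // hypothetical keytab, distributed via --files
  ugi.doAs(new PrivilegedExceptionAction[Unit] {
    override def run(): Unit = {
      val conn = DriverManager.getConnection(
        "jdbc:sqlserver://dbhost:1433;databaseName=mydb;" +
          "integratedSecurity=true;authenticationScheme=JavaKerberos")
      try {
        // write this partition's rows over the connection here
      } finally conn.close()
    }
  })
}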


On Mon, Oct 15, 2018 at 12:04 AM Foster Langbein
<foster.langb...@riskfrontiers.com> wrote:
Has anyone gotten spark to write to SQL server using Kerberos authentication 
with Microsoft's JDBC driver? I'm having limited success, though in theory it 
should work.

I'm using a YARN-mode 4-node Spark 2.3.0 cluster and trying to write a simple 
table to SQL Server 2016. I can get it to work if I use SQL server credentials, 
however this is not an option in my application. I need to use windows 
authentication - so-called integratedSecurity - and in particular I want to use 
a keytab file.

The solution half works - the spark driver creates a table on SQL server - so 
I'm pretty confident the Kerberos implementation/credentials etc are setup 
correctly and valid. However the executors then fail to write any data to the 
table with an exception: "java.security.PrivilegedActionException: 
GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)"

After much tracing/debugging it seems executors are behaving differently to the 
spark driver and ignoring the specification to use the credentials supplied in 
the keytab and instead trying to use the default spark cluster user. I simply 
haven't been able to force them to use what's in the keytab after trying many, 
many variations.

Very grateful if anyone has any help/suggestions/ideas on how to get this to 
work.


--



Dr Foster Langbein | Chief Technology Officer | Risk Frontiers

Level 2, 100 Christie St, St Leonards, NSW, 2065



Telephone: +61 2 8459 9777

Email: foster.langb...@riskfrontiers.com | Website: www.riskfrontiers.com



Risk Modelling | Risk Management | Resilience | Disaster Management | Social 
Research
Australia | New Zealand | Asia Pacific





--
Marcelo


Spark seems to think that a particular broadcast variable is large in size

2018-10-15 Thread Venkat Dabri
I am trying to do a broadcast join on two tables. The size of the
smaller table will vary based upon the parameters but the size of the
larger table is close to 2TB. What I have noticed is that if I don't
set the spark.sql.autoBroadcastJoinThreshold to 10G some of these
operations do a SortMergeJoin instead of a broadcast join. But the
size of the smaller table shouldn't be this big at all. I wrote the
smaller table to an S3 folder and it took only 12.6 MB of space. I
did some operations on the smaller table so that the shuffle size
appears on the Spark History Server, and the size in memory seemed to
be 150 MB, nowhere near 10G. Also, if I force a broadcast join on the
smaller table it takes a long time to broadcast, leading me to think
that the table might not just be 150 MB in size. What would be a good
way to figure out the actual size that Spark is seeing and deciding
whether it crosses the spark.sql.autoBroadcastJoinThreshold?
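
One way to see the size that Spark's optimizer assigns to a relation, and to compare it against an explicit broadcast hint, is sketched below. This assumes the Spark 2.3 Scala API; the table names and join key are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().getOrCreate()
val small = spark.table("db.small_table")   // hypothetical
val large = spark.table("db.large_table")   // hypothetical

// The estimate the planner compares against spark.sql.autoBroadcastJoinThreshold.
val estimatedBytes = small.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Optimizer size estimate: $estimatedBytes bytes")

// Force a broadcast regardless of the estimate.
val joined = large.join(broadcast(small), "key")   // hypothetical join key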




Re: kerberos auth for MS SQL server jdbc driver

2018-10-15 Thread Marcelo Vanzin
Spark only does Kerberos authentication on the driver. For executors it
currently only supports Hadoop's delegation tokens for Kerberos.

To use something that does not support delegation tokens you have to
manually manage the Kerberos login in your code that runs in executors,
which might be tricky. It means distributing the keytab yourself (not with
Spark's --keytab argument) and calling into the UserGroupInformation API
directly.

I don't have any examples of that, though, maybe someone does. (We have a
similar example for Kafka on our blog somewhere, but not sure how far that
will get you with MS SQL.)


On Mon, Oct 15, 2018 at 12:04 AM Foster Langbein <
foster.langb...@riskfrontiers.com> wrote:

> Has anyone gotten spark to write to SQL server using Kerberos
> authentication with Microsoft's JDBC driver? I'm having limited success,
> though in theory it should work.
>
> I'm using a YARN-mode 4-node Spark 2.3.0 cluster and trying to write a
> simple table to SQL Server 2016. I can get it to work if I use SQL server
> credentials, however this is not an option in my application. I need to
> use windows authentication - so-called integratedSecurity - and in
> particular I want to use a keytab file.
>
> The solution half works - the spark driver creates a table on SQL server -
> so I'm pretty confident the Kerberos implementation/credentials etc are
> setup correctly and valid. However the executors then fail to write any
> data to the table with an exception: "java.security.PrivilegedActionException:
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)"
>
> After much tracing/debugging it seems executors are behaving differently
> to the spark driver and ignoring the specification to use the credentials
> supplied in the keytab and instead trying to use the default spark cluster
> user. I simply haven't been able to force them to use what's in the keytab
> after trying many, many variations.
>
> Very grateful if anyone has any help/suggestions/ideas on how to get this
> to work.
>
>
> --
>
>
>
> *Dr Foster Langbein* | Chief Technology Officer | Risk Frontiers
>
> Level 2, 100 Christie St, St Leonards, NSW, 2065
>
>
> Telephone: +61 2 8459 9777
>
> Email: foster.langb...@riskfrontiers.com | Website: www.riskfrontiers.com
>
>
>
>
> *Risk Modelling | Risk Management | Resilience | Disaster Management
> | Social Research Australia | New Zealand | Asia Pacific*
>
>
>


-- 
Marcelo


Support nested keys in DataFrameWriter.bucketBy

2018-10-15 Thread Dávid Szakállas
Currently (in Spark 2.3.1) we cannot bucket DataFrames by nested columns, e.g.

df.write.bucketBy(10, "key.a").saveAsTable("junk")

will result in the following exception:

org.apache.spark.sql.AnalysisException: bucket column key.a is not defined in 
table junk, defined table columns are: key, value;
at 
org.apache.spark.sql.catalyst.catalog.CatalogUtils$$anonfun$org$apache$spark$sql$catalyst$catalog$CatalogUtils$$normalizeColumnName$2.apply(ExternalCatalogUtils.scala:246)
at 
org.apache.spark.sql.catalyst.catalog.CatalogUtils$$anonfun$org$apache$spark$sql$catalyst$catalog$CatalogUtils$$normalizeColumnName$2.apply(ExternalCatalogUtils.scala:246)
at scala.Option.getOrElse(Option.scala:121)
…

Are there plans to change this anytime soon?
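
Until that changes, a common workaround is to project the nested field to a top-level column before bucketing. A minimal sketch, assuming the Spark 2.3 Scala API and the column names from the example above:

import org.apache.spark.sql.functions.col

// Flatten key.a into its own column so bucketBy can reference it.
val flattened = df.withColumn("key_a", col("key.a"))
flattened.write
  .bucketBy(10, "key_a")
  .saveAsTable("junk")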

Thanks, David








Re: Timestamp Difference/operations

2018-10-15 Thread Brandon Geise
How about 

 

select unix_timestamp(timestamp2) - unix_timestamp(timestamp1)?
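
Spelled out as a runnable sketch (the timestamp literals are only for illustration), the subtraction returns the full date-and-time difference in seconds:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
// unix_timestamp() converts each timestamp to epoch seconds, so the
// subtraction gives the complete difference, not just whole days.
spark.sql("""
  SELECT unix_timestamp(CAST('2000-02-01 12:34:34' AS TIMESTAMP))
       - unix_timestamp(CAST('2000-01-01 00:00:00' AS TIMESTAMP)) AS diff_seconds
""").show()
// diff_seconds = 2723674, i.e. 31 days 12:34:34, assuming no DST shift
// in the session time zone over that interval.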

 

From: Paras Agarwal 
Date: Monday, October 15, 2018 at 2:41 AM
To: John Zhuge 
Cc: user , dev 
Subject: Re: Timestamp Difference/operations

 

Thanks John,

 

Actually, I need the full date and time difference, not just the date difference,

which I guess is not supported.

 

Let me know if it's possible, or whether any UDF is available for the same.

 

Thanks And Regards,

Paras

From: John Zhuge 
Sent: Friday, October 12, 2018 9:48:47 PM
To: Paras Agarwal
Cc: user; dev
Subject: Re: Timestamp Difference/operations 

 

Yeah, the "-" operator does not seem to be supported; however, you can use the 
"datediff" function:

 

In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]:
+-----------------------------------------------------------------------------------------------------------------------+
| datediff(CAST(CAST(2000-02-01 12:34:34 AS TIMESTAMP) AS DATE), CAST(CAST(2000-01-01 00:00:00 AS TIMESTAMP) AS DATE))   |
+-----------------------------------------------------------------------------------------------------------------------+
| 31                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------+

In [10]: select datediff('2000-02-01 12:34:34', '2000-01-01 00:00:00')
Out[10]:
+----------------------------------------------------------------------------------+
| datediff(CAST(2000-02-01 12:34:34 AS DATE), CAST(2000-01-01 00:00:00 AS DATE))    |
+----------------------------------------------------------------------------------+
| 31                                                                                 |
+----------------------------------------------------------------------------------+

In [11]: select datediff(timestamp '2000-02-01 12:34:34', timestamp '2000-01-01 00:00:00')
Out[11]:
+----------------------------------------------------------------------------------------------------------------+
| datediff(CAST(TIMESTAMP('2000-02-01 12:34:34.0') AS DATE), CAST(TIMESTAMP('2000-01-01 00:00:00.0') AS DATE))    |
+----------------------------------------------------------------------------------------------------------------+
| 31                                                                                                               |
+----------------------------------------------------------------------------------------------------------------+

 

On Fri, Oct 12, 2018 at 7:01 AM Paras Agarwal  
wrote:

Hello Spark Community,

Currently in Hive we can do operations on timestamps like:
CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS TIMESTAMP)

It seems this is not supported in Spark.
Is there any way to do it?

Kindly provide some insight on this.


Paras
9130006036


 

-- 

John



Re: Spark Structured Streaming resource contention / memory issue

2018-10-15 Thread Patrick McGloin
Hi Jungtaek,

Thanks, we thought that might be the issue but haven't tested yet as
building against an unreleased version of Spark is tough for us, due to
network restrictions. We will try though. I will report back if we find
anything.

Best regards,
Patrick
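
A lightweight way to watch state-store memory from the running application, without building an unreleased Spark, is to poll the streaming query progress. A sketch assuming the Spark 2.3 Scala API, where query stands for whatever writeStream.start() returned:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

val spark = SparkSession.builder().getOrCreate()
// Optionally publish structured streaming metrics through the Dropwizard sink.
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")

def logStateMemory(query: StreamingQuery): Unit = {
  val progress = query.lastProgress          // null until the first batch completes
  if (progress != null) {
    progress.stateOperators.foreach { op =>
      println(s"state rows=${op.numRowsTotal}, state memory=${op.memoryUsedBytes} bytes")
    }
  }
}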

On Fri, Oct 12, 2018, 2:57 PM Jungtaek Lim  wrote:

> Hi Patrick,
>
> Looks like you might be struggling with state memory, which multiple
> issues are going to be resolved in Spark 2.4.
>
> 1. SPARK-24441 [1]: Expose total estimated size of states in
> HDFSBackedStateStoreProvider
> 2. SPARK-24637 [2]: Add metrics regarding state and watermark to
> dropwizard metrics
> 3. SPARK-24717 [3]: Split out min retain version of state for memory in
> HDFSBackedStateStoreProvider
>
> There're other patches relevant to state store as well, but above issues
> are applied to map/flatmapGroupsWithState.
>
> Since Spark community is in progress on releasing Spark 2.4.0, could you
> try experimenting Spark 2.4.0 RC if you really don't mind? You could try
> out applying individual patches and see whether it helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-24441
> 2. https://issues.apache.org/jira/browse/SPARK-24637
> 3. https://issues.apache.org/jira/browse/SPARK-24717
>
>
> On Fri, Oct 12, 2018 at 9:31 PM, Patrick McGloin wrote:
>
>> Hi all, I sent this earlier but the screenshots were not attached.
>> Hopefully this time it is correct.
>>
>> We have a Spark Structured streaming stream which is using
>> mapGroupWithState. After some time of processing in a stable manner
>> suddenly each mini batch starts taking 40 seconds. Suspiciously it looks
>> like exactly 40 seconds each time. Before this the batches were taking less
>> than a second.
>>
>>
>> Looking at the details for a particular task most partitions are
>> processed really quickly but a few take exactly 40 seconds:
>>
>>
>>
>>
>> The GC was looking ok as the data was being processed quickly but
>> suddenly the full GCs etc stop (at the same time as the 40 second issue):
>>
>>
>>
>> I have taken a thread dump from one of the executors as this issue is
>> happening but I cannot see any resource they are blocked on:
>>
>>
>>
>>
>> Are we hitting a GC problem and why is it manifesting in this way? Is
>> there another resource that is blocking and what is it?
>>
>>
>> Thanks,
>> Patrick
>>
>>
>>
>> This message has been sent by ABN AMRO Bank N.V., which has its seat at
>> Gustav Mahlerlaan 10 (1082 PP) Amsterdam, the Netherlands, and is registered
>> in the Commercial Register of Amsterdam under number 34334259.
>>
>


kerberos auth for MS SQL server jdbc driver

2018-10-15 Thread Foster Langbein
Has anyone gotten spark to write to SQL server using Kerberos
authentication with Microsoft's JDBC driver? I'm having limited success,
though in theory it should work.

I'm using a YARN-mode 4-node Spark 2.3.0 cluster and trying to write a
simple table to SQL Server 2016. I can get it to work if I use SQL server
credentials, however this is not an option in my application. I need to use
windows authentication - so-called integratedSecurity - and in particular I
want to use a keytab file.

The solution half works - the spark driver creates a table on SQL server -
so I'm pretty confident the Kerberos implementation/credentials etc are
setup correctly and valid. However the executors then fail to write any
data to the table with an exception: "java.security.PrivilegedActionException:
GSSException: No valid credentials provided (Mechanism level: Failed to
find any Kerberos tgt)"

After much tracing/debugging it seems executors are behaving differently to
the spark driver and ignoring the specification to use the credentials
supplied in the keytab and instead trying to use the default spark cluster
user. I simply haven't been able to force them to use what's in the keytab
after trying many, many variations.

Very grateful if anyone has any help/suggestions/ideas on how to get this
to work.


-- 



*Dr Foster Langbein* | Chief Technology Officer | Risk Frontiers

Level 2, 100 Christie St, St Leonards, NSW, 2065


Telephone: +61 2 8459 9777

Email: foster.langb...@riskfrontiers.com | Website: www.riskfrontiers.com




*Risk Modelling | Risk Management | Resilience | Disaster Management
| Social Research Australia | New Zealand | Asia Pacific*


Re: Timestamp Difference/operations

2018-10-15 Thread Paras Agarwal
Thanks John,


Actually, I need the full date and time difference, not just the date difference,

which I guess is not supported.


Let me know if it's possible, or whether any UDF is available for the same.


Thanks And Regards,

Paras


From: John Zhuge 
Sent: Friday, October 12, 2018 9:48:47 PM
To: Paras Agarwal
Cc: user; dev
Subject: Re: Timestamp Difference/operations

Yeah, the "-" operator does not seem to be supported; however, you can use the 
"datediff" function:

In [9]: select datediff(CAST('2000-02-01 12:34:34' AS TIMESTAMP), CAST('2000-01-01 00:00:00' AS TIMESTAMP))
Out[9]:
+-----------------------------------------------------------------------------------------------------------------------+
| datediff(CAST(CAST(2000-02-01 12:34:34 AS TIMESTAMP) AS DATE), CAST(CAST(2000-01-01 00:00:00 AS TIMESTAMP) AS DATE))   |
+-----------------------------------------------------------------------------------------------------------------------+
| 31                                                                                                                      |
+-----------------------------------------------------------------------------------------------------------------------+

In [10]: select datediff('2000-02-01 12:34:34', '2000-01-01 00:00:00')
Out[10]:
+----------------------------------------------------------------------------------+
| datediff(CAST(2000-02-01 12:34:34 AS DATE), CAST(2000-01-01 00:00:00 AS DATE))    |
+----------------------------------------------------------------------------------+
| 31                                                                                 |
+----------------------------------------------------------------------------------+

In [11]: select datediff(timestamp '2000-02-01 12:34:34', timestamp '2000-01-01 00:00:00')
Out[11]:
+----------------------------------------------------------------------------------------------------------------+
| datediff(CAST(TIMESTAMP('2000-02-01 12:34:34.0') AS DATE), CAST(TIMESTAMP('2000-01-01 00:00:00.0') AS DATE))    |
+----------------------------------------------------------------------------------------------------------------+
| 31                                                                                                               |
+----------------------------------------------------------------------------------------------------------------+

On Fri, Oct 12, 2018 at 7:01 AM Paras Agarwal
<paras.agar...@datametica.com> wrote:

Hello Spark Community,

Currently in Hive we can do operations on timestamps like:
CAST('2000-01-01 12:34:34' AS TIMESTAMP) - CAST('2000-01-01 00:00:00' AS TIMESTAMP)

It seems this is not supported in Spark.
Is there any way to do it?

Kindly provide some insight on this.


Paras
9130006036


--
John