RE: Spark-SQL : Getting current user name in UDF

2022-02-22 Thread Lavelle, Shawn
Apologies, this is Spark 3.2.0.

~ Shawn

From: Lavelle, Shawn
Sent: Monday, February 21, 2022 5:39 PM
To: 'user@spark.apache.org' 
Subject: Spark-SQL : Getting current user name in UDF

Hello Spark Users,

I have a UDF I wrote for use with Spark-SQL that performs a lookup.  In 
that lookup, I need to get the current SQL user so I can validate their 
permissions.  I was using org.apache.spark.sql.util.Utils.getCurrentUserName() 
to retrieve the current active user from within the UDF, but today I discovered 
that that call returns a different user depending on the context:

select myUDF();
  returns the SQL user.
select myUDF() from myTable;
  returns the operating system (application?) user.

I can provide a code example if needed, but it's just calling 
Utils.getCurrentUserName() from within the UDF code.
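
For reference, here is a minimal sketch of the registration (a sketch only: the 
names are illustrative, the real UDF also does the permission lookup, and it 
assumes the internal org.apache.spark.util.Utils is the class in question):

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF0;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.util.Utils;   // internal Spark utility, not a public API

public class RegisterMyUdf {
    public static void register(SparkSession spark) {
        // Zero-argument UDF returning whatever Spark reports as the current
        // user at the time the expression is evaluated.
        spark.udf().register(
                "myUDF",
                (UDF0<String>) () -> Utils.getCurrentUserName(),
                DataTypes.StringType);
    }
}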

Does this sound like expected behavior or a defect?  Is there another way I can 
get the active SQL user inside a UDF?

Thanks in advance,

~ Shawn

PS: I can't add the username as a parameter to the UDF because I can't rely on the 
user not to submit someone else's username.




Shawn Lavelle
Software Development
OSI Digital Grid Solutions
4101 Arrowhead Drive
Medina, Minnesota 55340-9457
Phone: 763 551 0559
Email: shawn.lave...@osii.com
Website: www.osii.com


Spark-SQL : Getting current user name in UDF

2022-02-21 Thread Lavelle, Shawn
Hello Spark Users,

I have a UDF I wrote for use with Spark-SQL that performs a lookup.  In 
that lookup, I need to get the current SQL user so I can validate their 
permissions.  I was using org.apache.spark.sql.util.Utils.getCurrentUserName() 
to retrieve the current active user from within the UDF, but today I discovered 
that that call returns a different user depending on the context:

select myUDF();
  returns the SQL user.

select myUDF() from myTable;
  returns the operating system (application?) user.

I can provide a code example if needed, but it's just calling 
Utils.getCurrentUserName() from within the UDF code.

Does this sound like expected behavior or a defect?  Is there another way I can 
get the active SQL user inside a UDF?

Thanks in advance,

~ Shawn

PS: I can't add the username as a parameter to the UDF because I can't rely on the 
user not to submit someone else's username.






RE: DataSource API v2 & Spark-SQL

2021-07-02 Thread Lavelle, Shawn
Thanks for following up, I will give this a go!

 ~  Shawn

-Original Message-
From: roizaig 
Sent: Thursday, April 29, 2021 7:42 AM
To: user@spark.apache.org
Subject: Re: DataSource API v2 & Spark-SQL

You can create a custom data source by following this blog post. It shows how to 
read a Java log file using the Spark v3 API as an example.











Odd NoClassDefFoundError exception

2021-01-26 Thread Lavelle, Shawn
Hello Spark Community,
   I have a Spark-SQL problem where I am receiving a NoClassDefFoundError 
for org.apache.spark.sql.catalyst.util.RebaseDateTime$.  This happens for any 
query with a filter on a Timestamp column when the query is first run 
programmatically, but not when the query is first run via 
Beeline/HiveThriftServer CLI. That is, if I submit the query via beeline and 
HiveThriftServer, the query succeeds and I can also successfully call 
SparkSession.sqlContext().sql().  If I run it from the program first, however, 
it throws the aforementioned class loader error, and then beeline will also fail 
with the same error.

   I think there's something the HiveThriftServer does to initialize the job / 
session, but I can't sort out what it is.  I have tried the SparkSession (and 
the sqlContext) that is passed in to create the HiveThriftServer, as well as 
using the SparkSession builder pattern to create one.

 Can you help?  Thanks in advance and let me know if there's more 
information I can provide.

~ Shawn
PS Spark 3.0.0


The Exception
Exception occurred in target VM: Could not initialize class 
org.apache.spark.sql.catalyst.util.RebaseDateTime$
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.spark.sql.catalyst.util.RebaseDateTime$

Code Snippet:
// sql and ss are supplied by the caller
String sql = < passed in >;
SparkSession ss = < passed in instance >;
Dataset<Row> ds;
try {
    ds = ss.sql(sql);               // does not throw here
} catch (Exception ex) {
    return -1d;
}
MyRDD myRdd = (MyRDD) ds.rdd();     // the NoClassDefFoundError surfaces from here
for (Partition p : myRdd.getPartitions()) {
    // ... do the things
}

Note: ss.sql() doesn't throw an exception, but the ds.rdd() call surfaces one.  When 
the provided SQL is run via beeline, the code above works as expected.






RE: DataSource API v2 & Spark-SQL

2020-08-03 Thread Lavelle, Shawn
Thanks for clarifying, Russell.  Is a Spark native catalog reference on the 
roadmap for DSv2, or should I be trying to use something else?

~ Shawn

From: Russell Spitzer [mailto:russell.spit...@gmail.com]
Sent: Monday, August 3, 2020 8:27 AM
To: Lavelle, Shawn 
Cc: user 
Subject: Re: DataSource API v2 & Spark-SQL


That's a bad error message. Basically you can't make a Spark native catalog 
reference for a DSv2 source. You have to use that datasource's catalog or use 
the programmatic API. Both DSv1 and DSv2 programmatic APIs work (plus or minus 
some options).
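
For what it's worth, the programmatic route looks roughly like this (a sketch; 
the format name, option, and view name are illustrative):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ProgrammaticLoad {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("dsv2-demo").getOrCreate();

        // Load the DSv2 source by its short name (or fully qualified provider
        // class name) instead of registering it as a catalog table.
        Dataset<Row> logs = spark.read()
                .format("logtable")               // illustrative short name
                .option("path", "/some/where")    // whatever options the source expects
                .load();

        // Expose it to SQL as a temp view so queries work without a catalog entry.
        logs.createOrReplaceTempView("logs");
        spark.sql("SELECT count(*) FROM logs").show();
    }
}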

On Mon, Aug 3, 2020, 7:28 AM Lavelle, Shawn <shawn.lave...@osii.com> wrote:
Hello Spark community,
   I have a custom datasource in the v1 API that I’m trying to port to the v2 API, in 
Java.  Currently I have a DataSource registered via catalog.createTable(name, 
, schema, options map).  When trying to do this in data source API v2, 
I get an error saying my class (package) isn’t a valid data source. Can you help 
me out?

Spark version is 3.0.0 w/ Scala 2.12; artifacts are spark-core, spark-sql, 
spark-hive, spark-hive-thriftserver, spark-catalyst.

Here’s the data source definition:  public class LogTableSource implements 
TableProvider, SupportsRead, DataSourceRegister, Serializable

I’m guessing that I am missing one of the required interfaces. Note, I did try 
renaming the LogTableSource to “DefaultSource”, but the behavior is the same.  
Also, I keep reading about a DataSourceV2 marker interface, but it seems to be 
deprecated?

Also, I tried to add DataSourceV2ScanRelation but that won’t compile:
Output() in DataSourceV2ScanRelation cannot override Output() in QueryPlan 
return type Seq is not compatible with Seq

  I’m fairly stumped – everything I’ve read online says there’s a marker 
interface of some kind and yet I can’t find it in my package list.

  Looking forward to hearing from you,

~ Shawn







DataSource API v2 & Spark-SQL

2020-08-03 Thread Lavelle, Shawn
Hello Spark community,
   I have a custom datasource in the v1 API that I'm trying to port to the v2 API, in 
Java.  Currently I have a DataSource registered via catalog.createTable(name, 
, schema, options map).  When trying to do this in data source API v2, 
I get an error saying my class (package) isn't a valid data source. Can you help 
me out?

Spark version is 3.0.0 w/ Scala 2.12; artifacts are spark-core, spark-sql, 
spark-hive, spark-hive-thriftserver, spark-catalyst.

Here's the data source definition:  public class LogTableSource implements 
TableProvider, SupportsRead, DataSourceRegister, Serializable
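
For concreteness, here is a minimal sketch of how I understand the v2 wiring 
should look in 3.0 (class and format names are illustrative, the scan 
construction is elided, and as far as I can tell SupportsRead belongs on the 
Table rather than on the provider):

import java.util.Collections;
import java.util.Map;
import java.util.Set;

import org.apache.spark.sql.connector.catalog.SupportsRead;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.catalog.TableCapability;
import org.apache.spark.sql.connector.catalog.TableProvider;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.sources.DataSourceRegister;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class LogTableSource implements TableProvider, DataSourceRegister {

    @Override
    public StructType inferSchema(CaseInsensitiveStringMap options) {
        // Derive the real schema from the options; empty here for brevity.
        return new StructType();
    }

    @Override
    public Table getTable(StructType schema, Transform[] partitioning,
                          Map<String, String> properties) {
        return new LogTable(schema);
    }

    @Override
    public String shortName() {
        // Name usable with spark.read().format("logtable") or USING logtable.
        return "logtable";
    }

    // SupportsRead extends Table; in DSv2 the Table carries the read support.
    private static class LogTable implements SupportsRead {
        private final StructType schema;

        LogTable(StructType schema) { this.schema = schema; }

        @Override public String name() { return "logtable"; }
        @Override public StructType schema() { return schema; }
        @Override public Set<TableCapability> capabilities() {
            return Collections.singleton(TableCapability.BATCH_READ);
        }
        @Override public ScanBuilder newScanBuilder(CaseInsensitiveStringMap options) {
            // Actual scan / partition planning elided in this sketch.
            throw new UnsupportedOperationException("scan not shown");
        }
    }
}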

I'm guessing that I am missing one of the required interfaces. Note, I did try 
renaming the LogTableSource to "DefaultSource", but the behavior is the same.  
Also, I keep reading about a DataSourceV2 marker interface, but it seems to be 
deprecated?

Also, I tried to add DataSourceV2ScanRelation but that won't compile:
Output() in DataSourceV2ScanRelation cannot override Output() in QueryPlan 
return type Seq is not compatible with Seq

  I'm fairly stumped - everything I've read online says there's a marker 
interface of some kind and yet I can't find it in my package list.

  Looking forward to hearing from you,

~ Shawn









RE: Spark-SQL Query Optimization: overlapping ranges

2017-05-01 Thread Lavelle, Shawn
Jacek,

   Thanks for your help.  I didn’t want to file a bug/enhancement request unless 
warranted.

~ Shawn

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Thursday, April 27, 2017 8:39 AM
To: Lavelle, Shawn 
Cc: user 
Subject: Re: Spark-SQL Query Optimization: overlapping ranges

Hi Shawn,

If you're asking me if Spark SQL should optimize such queries, I don't know.

If you're asking me if it's possible to convince Spark SQL to do so, I'd say, 
sure, it is. Write your optimization rule and attach it to Optimizer (using 
extraOptimizations extension point).


Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Thu, Apr 27, 2017 at 3:22 PM, Lavelle, Shawn <shawn.lave...@osii.com> wrote:
Hi Jacek,

 I know that it is not currently doing so, but should it be?  The algorithm 
isn’t complicated and could be applied to both OR and AND logical operators 
with comparison operators as children.
 My users write programs to generate queries that aren’t checked for this 
sort of thing. We’re probably going to write our own 
org.apache.spark.sql.catalyst.rules.Rule to handle it.

~ Shawn

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Wednesday, April 26, 2017 2:55 AM
To: Lavelle, Shawn <shawn.lave...@osii.com>
Cc: user <user@spark.apache.org>
Subject: Re: Spark-SQL Query Optimization: overlapping ranges

explain it and you'll know what happens under the covers.

i.e. Use explain on the Dataset.
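
For example, roughly like this (the table and column names are taken from the 
query quoted below and are assumed to be registered already):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("explain-demo").getOrCreate();

        Dataset<Row> ds = spark.sql(
                "SELECT * FROM myTable WHERE col BETWEEN 3 AND 7 OR col BETWEEN 5 AND 9");

        // Prints the parsed, analyzed, optimized and physical plans, so you can
        // see whether the overlapping ranges were combined and what was pushed
        // down to the data source.
        ds.explain(true);
    }
}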

Jacek

On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" <shawn.lave...@osii.com> wrote:
Hello Spark Users!

   Does the Spark Optimization engine reduce overlapping column ranges?  If so, 
should it push this down to a Data Source?

  Example,
This:  Select * from table where col between 3 and 7 OR col between 5 and 9
Reduces to:  Select * from table where col between 3 and 9


  Thanks for your insight!

~ Shawn M Lavelle








RE: Spark-SQL Query Optimization: overlapping ranges

2017-04-27 Thread Lavelle, Shawn
Hi Jacek,

 I know that it is not currently doing so, but should it be?  The algorithm 
isn’t complicated and could be applied to both OR and AND logical operators 
with comparison operators as children.
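
To be concrete, the merge itself is just a sort-and-sweep over the bounds; a 
standalone sketch of the idea (not Catalyst code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Given [lo, hi] ranges OR'ed together on the same column, sort by lower
// bound and fold overlapping ranges into one.
public final class RangeMerge {

    public static List<long[]> merge(List<long[]> ranges) {
        List<long[]> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> out = new ArrayList<>();
        for (long[] r : sorted) {
            if (!out.isEmpty() && r[0] <= out.get(out.size() - 1)[1]) {
                long[] last = out.get(out.size() - 1);
                last[1] = Math.max(last[1], r[1]);   // extend the previous range
            } else {
                out.add(new long[] { r[0], r[1] });  // start a new range
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // col BETWEEN 3 AND 7 OR col BETWEEN 5 AND 9  ->  col BETWEEN 3 AND 9
        for (long[] r : merge(Arrays.asList(new long[]{3, 7}, new long[]{5, 9}))) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}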
 My users write programs to generate queries that aren’t checked for this 
sort of thing. We’re probably going to write our own 
org.apache.spark.sql.catalyst.rules.Rule to handle it.

~ Shawn

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Wednesday, April 26, 2017 2:55 AM
To: Lavelle, Shawn 
Cc: user 
Subject: Re: Spark-SQL Query Optimization: overlapping ranges

explain it and you'll know what happens under the covers.

i.e. Use explain on the Dataset.

Jacek

On 25 Apr 2017 12:46 a.m., "Lavelle, Shawn" <shawn.lave...@osii.com> wrote:
Hello Spark Users!

   Does the Spark Optimization engine reduce overlapping column ranges?  If so, 
should it push this down to a Data Source?

  Example,
This:  Select * from table where col between 3 and 7 OR col between 5 and 9
Reduces to:  Select * from table where col between 3 and 9


  Thanks for your insight!

~ Shawn M Lavelle







Spark-SQL Query Optimization: overlapping ranges

2017-04-24 Thread Lavelle, Shawn
Hello Spark Users!

   Does the Spark Optimization engine reduce overlapping column ranges?  If so, 
should it push this down to a Data Source?

  Example,
This:  Select * from table where col between 3 and 7 OR col between 5 and 9
Reduces to:  Select * from table where col between 3 and 9


  Thanks for your insight!

~ Shawn M Lavelle






RE: Register Spark UDF for use with Hive Thriftserver/Beeline

2017-02-28 Thread Lavelle, Shawn
I forgot to mention, I can do:
sparkSession.sql("create function z as 'QualityToString'");

prior to starting HiveThriftServer2 and that will register the UDF, but only in 
the default database.  It won’t be present in other databases. I can register 
it again in the other databases as needed, but it just seems to me that that 
shouldn’t be necessary.

Further detail on QualityToString: for testing purposes I’m implementing the Spark 
UDF and the Hive UDF in one class:

import java.io.Serializable;
import org.apache.hadoop.hive.ql.exec.UDF;        // Hive UDF base class
import org.apache.hadoop.io.Text;
import org.apache.spark.sql.api.java.UDF1;        // Spark SQL Java UDF interface

public class QualityToString extends UDF implements UDF1<Integer, String>, Serializable {

    // Spark SQL entry point
    @Override
    public String call(Integer t1) throws Exception {
        return QUALITY.toString(t1);   // QUALITY is an internal helper (not shown)
    }

    // Hive entry point
    public Text evaluate(Integer i) throws Exception {
        return new Text(this.call(i));
    }
}




From: Lavelle, Shawn
Sent: Tuesday, February 28, 2017 10:25 AM
To: user@spark.apache.org
Subject: Register Spark UDF for use with Hive Thriftserver/Beeline

Hello all,

   I’m trying to make my custom UDFs available from a beeline session via the Hive 
ThriftServer.  I’ve been successful in registering them via my DataSource API, as 
it provides the current sqlContext. However, the UDFs are not accessible at 
initial connection, meaning a query won’t parse because the UDFs aren’t yet 
registered.

   What is the right way to register a Spark UDF so it is available over the 
HiveThriftServer at connect time?

The following haven’t worked, but perhaps my timing is off?

   this.sparkSession = SparkSession.builder()…
   SQLContext sqlContext = sparkSession.sqlContext();
   sqlContext.sql("create temporary function z as 'QualityToString'");
   sparkSession.udf().register("v", new QualityToString(), QualityToString.returnType());
   SparkSession.setDefaultSession(this.sparkSession);
   HiveThriftServer2.startWithContext(sqlContext);

Neither z nor v shows up.  I’ve tried registering after starting the 
HiveThriftServer, also to no avail. I’ve tried grabbing the Spark or SQL 
context at user authentication time, but to no avail either.  (Registering at 
authentication time worked in Hive 0.11 and Shark 0.9.2, but I suspect the 
session is now created differently and/or after authentication.)

I’m sure I’m not the first person to want to use Spark SQL UDFs in Hive via 
beeline; how should I be registering them?

Thank you!

~ Shawn


Register Spark UDF for use with Hive Thriftserver/Beeline

2017-02-28 Thread Lavelle, Shawn
Hello all,

   I’m trying to make my custom UDFs available from a beeline session via the Hive 
ThriftServer.  I’ve been successful in registering them via my DataSource API, as 
it provides the current sqlContext. However, the UDFs are not accessible at 
initial connection, meaning a query won’t parse because the UDFs aren’t yet 
registered.

   What is the right way to register a Spark UDF so it is available over the 
HiveThriftServer at connect time?

The following haven’t worked, but perhaps my timing is off?

   this.sparkSession = SparkSession.builder()…
   SQLContext sqlContext = sparkSession.sqlContext();
   sqlContext.sql("create temporary function z as 'QualityToString'");
   sparkSession.udf().register("v", new QualityToString(), QualityToString.returnType());
   SparkSession.setDefaultSession(this.sparkSession);
   HiveThriftServer2.startWithContext(sqlContext);

Neither z nor v shows up.  I’ve tried registering after starting the 
HiveThriftServer, also to no avail. I’ve tried grabbing the Spark or SQL 
context at user authentication time, but to no avail either.  (Registering at 
authentication time worked in Hive 0.11 and Shark 0.9.2, but I suspect the 
session is now created differently and/or after authentication.)

I’m sure I’m not the first person to want to use Spark SQL UDFs in Hive via 
beeline; how should I be registering them?

Thank you!

~ Shawn





Spark-SQL 1.6.2 w/Hive UDF @Description

2016-12-23 Thread Lavelle, Shawn
Hello Spark Users,


I have a Hive UDF that I'm trying to use with Spark-SQL.  It's showing up a 
bit awkwardly:


I can load it into the Hive Thrift Server with a "Create function..." query 
against the Hive context.  I can then use the UDF in queries.  However, a "desc 
function <name>" says the function doesn't exist; meanwhile the function is 
loaded into the default database, i.e., it shows up as "default.<name>" in a "desc 
functions" call.


Any thoughts as to why this is? Any workarounds?


~ Shawn M Lavelle






Upgrading a Hive External Storage Handler...

2016-07-21 Thread Lavelle, Shawn
Hello,

   I am looking to upgrade a Hive 0.11 external storage handler that was run on 
Shark 0.9.2 to work on spark-sql 1.6.1.  I’ve run into a snag in that it seems 
that the storage handler is not receiving predicate pushdown information.

   Being fairly new to Spark’s development, would someone please tell me: A) can 
I still use external storage handlers in Hive, or am I forced to use the new 
DataFrame API, and B) did I miss something in how pushdown predicates work or 
are accessed due to the upgrade from Hive 0.11 to Hive 1.2.1?

   Thank you,

~ Shawn M Lavelle


