unsubscribe

2021-01-20 Thread luby

Lu Boying
IT Department, China Investment Corporation
Tel: +86 (0)10 84096521
Fax: +86 (0)10 64086851
8/F, New Poly Plaza, No. 1 Chaoyangmen North Street, Dongcheng District, Beijing 100010
Website: www.china-inv.cn

[SPARK SQL] How to overwrite a Hive table with spark sql (SPARK2)

2019-03-12 Thread luby
Hi, All,

I need to overwrite data in a Hive table and I use the following code to 
do so:

df = sqlContext.sql(my-spark-sql-statement);
df.count
df.write.format("orc").mode("overwrite").saveAsTable("foo")  // I also tried insertInto("foo")

The "df.count" shows that there are only 452 records in the result.
But "select count(*) from foo" (run in beeline) shows that there are 716 
records.

The final table contains more data than expected.

Does anyone know the reason and how to overwrite data in a Hive table with 
spark sql?

I'm using spark 2.2
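
One alternative worth trying is to write through Spark SQL's INSERT OVERWRITE instead of the DataFrame writer; a minimal sketch (assuming the target table foo already exists and the DataFrame's columns line up with its definition):

// Sketch only: register the result as a temporary view and let Spark SQL
// replace the table's contents in place. "spark" is the SparkSession
// (e.g. the one predefined in spark-shell).
df.createOrReplaceTempView("result_tmp")
spark.sql("INSERT OVERWRITE TABLE foo SELECT * FROM result_tmp")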

Thanks 

Boying 



 

The Spark SQL ODBC/JDBC driver that supports Kerberos delegation

2019-02-11 Thread luby
Hi, All,

We want to use SPARK SQL in Tableau, but according to 
https://onlinehelp.tableau.com/current/pro/desktop/en-us/examples_sparksql.htm

the driver provided by Tableau doesn't support Kerberos delegation.

Is there any SPARK SQL ODBC or JDBC driver that supports Kerberos 
delegation?

Thanks 

Boying 



 

Reply: Re: Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-28 Thread luby
Thank you so much. I tried your suggestion and it really works!





From: "Ramandeep Singh Nanda"
To: l...@china-inv.cn
Cc: "Shahab Yunus", "Tomas Bartalos", "user @spark/'user @spark'/spark users/user@spark"
Date: 2019/01/26 05:42
Subject:
Re: Re: How to get all input tables of a SPARK SQL 'select' statement



Hi, 

You don't have to run the SQL statement. You can just parse it; that does 
only the logical parsing.

val logicalPlan = ss.sessionState.sqlParser.parsePlan(sqlText = query)
println(logicalPlan.prettyJson)
[ {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
  "num-children" : 1,
  "projectList" : [ [ {
"class" : "org.apache.spark.sql.catalyst.analysis.UnresolvedStar",
"num-children" : 0
  } ] ],
  "child" : 0
}, {
  "class" : "org.apache.spark.sql.catalyst.analysis.UnresolvedRelation",
  "num-children" : 0,
  "tableIdentifier" : {
"product-class" : "org.apache.spark.sql.catalyst.TableIdentifier",
"table" : "abc"
  }
} ]
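
As a follow-up sketch (not part of the original reply): the input table names can be collected straight from that parsed plan by matching the UnresolvedRelation nodes. The query below is just an example.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

val ss = SparkSession.builder().master("local[*]").appName("parse-only").getOrCreate()
val plan = ss.sessionState.sqlParser.parsePlan(
  "SELECT a.x FROM foo a JOIN bar b ON a.id = b.id")

// Every table referenced by the statement appears as an UnresolvedRelation node.
// (Spark 2.x: the field is tableIdentifier; newer versions use multipartIdentifier.)
val tables = plan.collect { case r: UnresolvedRelation => r.tableIdentifier.unquotedString }.distinct
println(tables)  // List(foo, bar) for the query above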



On Fri, Jan 25, 2019 at 6:07 AM  wrote:
Hi, All, 

I tried the suggested approach and it works, but it requires 'running' the 
SQL statement first. 

I just want to parse the SQL statement without running it, so I can do 
this on my laptop without connecting to our production environment. 

I tried to write a tool which uses the SqlBase.g4 bundled with SPARK SQL 
to extract names of the input tables and it works as expected. 

But I have a question: 

The parser generated by SqlBase.g4 only accepts 'select' statements whose 
keywords (such as 'SELECT' and 'FROM') and table names are capitalized, 
e.g. it accepts 'SELECT * FROM FOO' but doesn't accept 'select * from 
foo'. 

But I can run spark.sql("select * from foo") in the spark2-shell 
without any problem. 

Is there another 'layer' in SPARK SQL that capitalizes those 'tokens' 
before invoking the parser? 

If so, why not just modify SqlBase.g4 to accept lower-case keywords? 

Thanks 

Boying 



From: "Shahab Yunus"
To: "Ramandeep Singh Nanda"
Cc: "Tomas Bartalos", l...@china-inv.cn, "user @spark/'user @spark'/spark users/user@spark"
Date: 2019/01/24 06:45
Subject:
Re: How to get all input tables of a SPARK SQL 'select' statement




Could be a tangential idea but might help: Why not use queryExecution and 
logicalPlan objects that are available when you execute a query using 
SparkSession and get a DataFrame back? The Json representation contains 
almost all the info that you need and you don't need to go to Hive to get 
this info. 

Some details here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Dataset.html#queryExecution
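
A minimal sketch of that idea (the table name is a placeholder, and the query has to be resolvable against your catalog):

// The analyzed plan has every relation resolved against the catalog, so its JSON
// (or the plan tree itself) carries the actual table identifiers.
// Assumes a live SparkSession named "spark", e.g. in spark-shell.
val df = spark.sql("SELECT * FROM some_db.some_table WHERE id > 10")
println(df.queryExecution.analyzed.prettyJson)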
 


On Wed, Jan 23, 2019 at 5:35 PM Ramandeep Singh Nanda <
ramannan...@gmail.com> wrote: 
Explain extended or explain would list the plan along with the tables. Not 
aware of any statements that explicitly list dependencies or tables 
directly. 
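
For example (table name is a placeholder):

// Prints the parsed, analyzed, optimized and physical plans, including the relations scanned.
spark.sql("EXPLAIN EXTENDED SELECT * FROM some_db.some_table").show(truncate = false)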

Regards,
Ramandeep Singh 

On Wed, Jan 23, 2019, 11:05 Tomas Bartalos wrote: 
Hi, All, 

We need to get all input tables of several SPARK SQL 'select' statements. 

We can get that information for Hive SQL statements by using 'explain 
dependency select'. 
But I can't find an equivalent command for SPARK SQL. 

Does anyone know how to get this information for a SPARK SQL 'select' 
statement? 

Thanks 

Boying 
 



   


-- 
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com




 

Reply: Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-25 Thread luby
Hi, All,

I tried the suggested approach and it works, but it requires 'running' the 
SQL statement first.

I just want to parse the SQL statement without running it, so I can do 
this on my laptop without connecting to our production environment.

I tried to write a tool which uses the SqlBase.g4 bundled with SPARK SQL 
to extract names of the input tables and it works as expected.

But I have a question:

The parser generated by SqlBase.g4 only accepts 'select' statements whose 
keywords (such as 'SELECT' and 'FROM') and table names are capitalized, 
e.g. it accepts 'SELECT * FROM FOO' but doesn't accept 'select * from 
foo'.

But I can run spark.sql("select * from foo") in the spark2-shell 
without any problem.

Is there another 'layer' in SPARK SQL that capitalizes those 'tokens' 
before invoking the parser?

If so, why not just modify SqlBase.g4 to accept lower-case keywords?
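
(For what it's worth, the usual trick, and roughly what Spark's own parser driver does, is not to change SqlBase.g4 but to wrap the character stream so that the lexer only ever "sees" upper-case characters. A sketch, assuming an ANTLR 4 runtime and a generated SqlBaseLexer on the classpath:)

import org.antlr.v4.runtime.{CharStream, CharStreams, CodePointCharStream, IntStream}
import org.antlr.v4.runtime.misc.Interval

// Delegates everything to the wrapped stream but upper-cases what the lexer looks at,
// so the grammar can keep only upper-case keyword tokens while queries stay lower-case.
class UpperCaseCharStream(wrapped: CodePointCharStream) extends CharStream {
  override def consume(): Unit = wrapped.consume()
  override def getSourceName: String = wrapped.getSourceName
  override def index(): Int = wrapped.index
  override def mark(): Int = wrapped.mark
  override def release(marker: Int): Unit = wrapped.release(marker)
  override def seek(where: Int): Unit = wrapped.seek(where)
  override def size(): Int = wrapped.size
  override def getText(interval: Interval): String = wrapped.getText(interval)
  override def LA(i: Int): Int = {
    val la = wrapped.LA(i)
    if (la == 0 || la == IntStream.EOF) la else Character.toUpperCase(la)
  }
}

// Usage sketch: feed the wrapped stream to the generated lexer.
// val lexer = new SqlBaseLexer(new UpperCaseCharStream(CharStreams.fromString("select * from foo")))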

Thanks 

Boying




From: "Shahab Yunus"
To: "Ramandeep Singh Nanda"
Cc: "Tomas Bartalos", l...@china-inv.cn, "user @spark/'user @spark'/spark users/user@spark"
Date: 2019/01/24 06:45
Subject:
Re: How to get all input tables of a SPARK SQL 'select' statement



Could be a tangential idea but might help: Why not use queryExecution and 
logicalPlan objects that are available when you execute a query using 
SparkSession and get a DataFrame back? The Json representation contains 
almost all the info that you need and you don't need to go to Hive to get 
this info.

Some details here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Dataset.html#queryExecution

On Wed, Jan 23, 2019 at 5:35 PM Ramandeep Singh Nanda <
ramannan...@gmail.com> wrote:
Explain extended or explain would list the plan along with the tables. Not 
aware of any statements that explicitly list dependencies or tables 
directly. 

Regards,
Ramandeep Singh 

On Wed, Jan 23, 2019, 11:05 Tomas Bartalos wrote:
Hi, All, 

We need to get all input tables of several SPARK SQL 'select' statements. 

We can get that information for Hive SQL statements by using 'explain 
dependency select'. 
But I can't find an equivalent command for SPARK SQL. 

Does anyone know how to get this information for a SPARK SQL 'select' 
statement? 

Thanks 

Boying 
 



   





Reply: Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-24 Thread luby
Thanks all for your help.

I'll try your suggestions.

Thanks again :)
 




From: "Shahab Yunus"
To: "Ramandeep Singh Nanda"
Cc: "Tomas Bartalos", l...@china-inv.cn, "user @spark/'user @spark'/spark users/user@spark"
Date: 2019/01/24 06:45
Subject:
Re: How to get all input tables of a SPARK SQL 'select' statement



Could be a tangential idea but might help: Why not use queryExecution and 
logicalPlan objects that are available when you execute a query using 
SparkSession and get a DataFrame back? The Json representation contains 
almost all the info that you need and you don't need to go to Hive to get 
this info.

Some details here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-Dataset.html#queryExecution

On Wed, Jan 23, 2019 at 5:35 PM Ramandeep Singh Nanda <
ramannan...@gmail.com> wrote:
Explain extended or explain would list the plan along with the tables. Not 
aware of any statements that explicitly list dependencies or tables 
directly. 

Regards,
Ramandeep Singh 

On Wed, Jan 23, 2019, 11:05 Tomas Bartalos wrote:
Hi, All, 

We need to get all input tables of several SPARK SQL 'select' statements. 

We can get that information for Hive SQL statements by using 'explain 
dependency select'. 
But I can't find an equivalent command for SPARK SQL. 

Does anyone know how to get this information for a SPARK SQL 'select' 
statement? 

Thanks 

Boying 
 



   





How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread luby
Hi, All,

We need to get all input tables of several SPARK SQL 'select' statements.

We can get that information for Hive SQL statements by using 'explain 
dependency select'.
But I can't find an equivalent command for SPARK SQL.

Does anyone know how to get this information for a SPARK SQL 'select' 
statement?

Thanks

Boying
 



 





[SPARK SQL] Difference between 'Hive on spark' and Spark SQL

2018-12-19 Thread luby
Hi, All,

We are starting to migrate our data to a Hadoop platform, hoping to use 
'Big Data' technologies to improve our business.

We are new in the area and want to get some help from you.

Currently all our data is put into Hive and some complicated SQL query 
statements are run daily.

We want to improve the performance of these queries and have two options 
at hand:
a. Turn on the 'Hive on Spark' feature and run the HQLs, or
b. Run those query statements with Spark SQL.

What is the difference between these options?

Another question is:
There is a Hive setting, 'hive.optimize.ppd', that enables the 'predicate 
pushdown' query optimization.
Is there an equivalent option in Spark SQL, or does the same setting also 
work for Spark SQL?
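
A hedged note rather than a definitive answer: Spark SQL has its own per-format pushdown settings instead of honoring hive.optimize.ppd, for example (defaults vary by version, so check the configuration docs for your release):

// Predicate pushdown toggles for the common columnar formats (Spark 2.x config keys).
// Assumes a live SparkSession named "spark".
spark.conf.set("spark.sql.parquet.filterPushdown", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")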

Thanks in advance

Boying


 





Failed to convert java.sql.Date to String

2018-11-13 Thread luby
Hi, All,

I'm new to Spark SQL and have just started to use it in our project. We are 
using Spark 2.

When importing data from a Hive table, I got the following error:

if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 3, processing_dttm), StringType), true) AS processing_dttm#91
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
    at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
    at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:573)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:190)
    at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
    at org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.sql.Date is not a valid external type for schema of string
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
    ... 21 more

This is related to the following lines in our code (actually code from a 
third party):

if (dataType != null && dataType.isDateOrTimestamp()) {
    field = new StructField(field.name(), DataTypes.StringType, field.nullable(), field.metadata());
}

Does anyone know why, and what kinds of types can be converted to string?
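
One hedged workaround sketch (the column name processing_dttm is taken from the error above): cast the date column to a string explicitly, so the data matches the declared StringType instead of only relabelling the StructField:

import org.apache.spark.sql.functions.col

// Convert the java.sql.Date values themselves, not just the schema entry.
// "df" stands for the DataFrame read from the Hive table.
val fixed = df.withColumn("processing_dttm", col("processing_dttm").cast("string"))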

Thanks 

Boying



 





Using Spark 2.2.0 SparkSession extensions to optimize file filtering

2017-10-25 Thread Chris Luby
I have an external catalog that has additional information on my Parquet files 
that I want to match up with the parsed filters from the plan to prune the list 
of files included in the scan.  I’m looking at doing this using the Spark 2.2.0 
SparkSession extensions similar to the built in partition pruning:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala

and this other project that is along the lines of what I want:

https://github.com/lightcopy/parquet-index/blob/master/src/main/scala/org/apache/spark/sql/execution/datasources/IndexSourceStrategy.scala

but isn’t caught up to 2.2.0, but I’m struggling to understand what type of 
extension I would use to do something like the above:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.SparkSessionExtensions

and if this is the appropriate strategy for this.

Are there any examples out there for using the new extension hooks to alter the 
files included in the plan?
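
Not a worked example of the file pruning itself, but a minimal sketch of how the 2.2.0 extension hooks get wired up; the rule below is a placeholder that returns the plan unchanged, and all class names are hypothetical:

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Placeholder rule: a real implementation would rewrite the relation's file listing
// using the external catalog, in the spirit of PruneFileSourcePartitions.
case class ExternalIndexPruning(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

class ExternalIndexExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    // Optimizer rules run after analysis; injectPlannerStrategy is the other hook
    // an IndexSourceStrategy-style approach would use.
    ext.injectOptimizerRule(ExternalIndexPruning)
  }
}

// Wiring it in:
val spark = SparkSession.builder()
  .withExtensions(new ExternalIndexExtensions)
  .getOrCreate()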

Thanks.