Re: [SQL] parse_url does not work for Internationalized domain names ?

2018-01-11 Thread StanZhai
This problem was introduced by
 which was designed to
improve the performance of PARSE_URL().

The same issue exists in the following SQL:

```SQL
SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p')

-- returns null in Spark 2.1+
-- returns ["abc"] in versions earlier than Spark 2.1
```

I think it's a regression.
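
For completeness, a minimal spark-shell sketch of both cases (the internationalized URL below is only an illustrative example, not taken from the original report):

```scala
// Run in spark-shell; `spark` is the SparkSession the shell provides.
// Per the report, the QUERY case returns null in Spark 2.1+ and ["abc"] before that.
spark.sql("""SELECT PARSE_URL('http://stanzhai.site?p=["abc"]', 'QUERY', 'p') AS p""").show(false)

// Hypothetical internationalized-host example for the case in the subject line,
// presumably affected by the same change.
spark.sql("""SELECT PARSE_URL('http://例子.测试/index.html', 'HOST') AS host""").show(false)
```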



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org




Re:[SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-13 Thread StanZhai
A workaround is difficult.
You should consider merging this PR
 into your Spark.
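
For reference, the rewrite that is usually tried first is to materialize the salted key in a subquery so that rand() sits in a Project instead of the join condition. This is only a sketch of that idea against the simplified query below; depending on how the optimizer collapses projections it may still be rejected, which is why merging the PR is the more reliable fix.

```scala
// Sketch only: pre-compute the salted join key so the nondeterministic rand()
// lives in a Project rather than in the LEFT OUTER JOIN condition.
spark.sql("""
  SELECT a.col1
  FROM (
    SELECT col1,
           CASE WHEN col2 IS NULL
                THEN cast(rand(9) * 1000 - 99 as string)
                ELSE col2
           END AS join_key
    FROM tbl1
  ) a
  LEFT OUTER JOIN tbl2 b
    ON a.join_key = b.col3
""")
```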







 
 "wangshuang [via Apache Spark Developers 
List]" wroted at 2017-07-13 18:43:

I'm trying to execute Hive SQL on Spark SQL (also on the Spark Thrift Server). To mitigate data skew, we use "case when" to handle nulls.
A simple SQL example follows:


SELECT a.col1
 FROM tbl1 a
 LEFT OUTER JOIN tbl2 b
 ON
 CASE
 WHEN a.col2 IS NULL
 THEN cast(rand(9)*1000 - 99 as string)
 ELSE
 a.col2 END
 = b.col3;


But I get the error:

== Physical Plan ==
org.apache.spark.sql.AnalysisException: nondeterministic expressions are only 
allowed in
Project, Filter, Aggregate or Window, found:
 (((CASE WHEN (a.`nav_tcdt` IS NULL) THEN CAST(((rand(9) * CAST(1000 AS 
DOUBLE)) - CAST(99L AS DOUBLE)) AS STRING) ELSE a.`nav_tcdt` END = 
c.`site_categ_id`) AND (CAST(a.`nav_tcd` AS INT) = 9)) AND (c.`cur_flag` = 1))
in operator Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN 
cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as string) 
ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && 
(cur_flag#77 = 1))
   ;;
GlobalLimit 10
+- LocalLimit 10
   +- Aggregate [date_id#7, CASE WHEN (cast(city_id#10 as string) IN 
(cast(19596 as string),cast(20134 as string),cast(10997 as string)) && 
nav_tcdt#25 RLIKE ^[0-9]+$) THEN city_id#10 ELSE nav_tpa_id#21 END], [date_id#7]
  +- Filter (date_id#7 = 2017-07-12)
 +- Join LeftOuter, (((CASE WHEN isnull(nav_tcdt#25) THEN 
cast(((rand(9) * cast(1000 as double)) - cast(99 as double)) as string) 
ELSE nav_tcdt#25 END = site_categ_id#80) && (cast(nav_tcd#26 as int) = 9)) && 
(cur_flag#77 = 1))
:- SubqueryAlias a
:  +- SubqueryAlias tmp_lifan_trfc_tpa_hive
: +- CatalogRelation `tmp`.`tmp_lifan_trfc_tpa_hive`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [date_id#7, chanl_id#8L, 
pltfm_id#9, city_id#10, sessn_id#11, gu_id#12, nav_refer_page_type_id#13, 
nav_refer_page_value#14, nav_refer_tpa#15, nav_refer_tpa_id#16, 
nav_refer_tpc#17, nav_refer_tpi#18, nav_page_type_id#19, nav_page_value#20, 
nav_tpa_id#21, nav_tpa#22, nav_tpc#23, nav_tpi#24, nav_tcdt#25, nav_tcd#26, 
nav_tci#27, nav_tce#28, detl_refer_page_type_id#29, detl_refer_page_value#30, 
... 33 more fields]
+- SubqueryAlias c
   +- SubqueryAlias dim_site_categ_ext
  +- CatalogRelation `dw`.`dim_site_categ_ext`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [site_categ_skid#64L, 
site_categ_type#65, site_categ_code#66, site_categ_name#67, 
site_categ_parnt_skid#68L, site_categ_kywrd#69, leaf_flg#70L, sort_seq#71L, 
site_categ_srch_name#72, vsbl_flg#73, delet_flag#74, etl_batch_id#75L, 
updt_time#76, cur_flag#77, bkgrnd_categ_skid#78L, bkgrnd_categ_id#79L, 
site_categ_id#80, site_categ_parnt_id#81]

Does Spark SQL not support the "case when" syntax in a JOIN condition? Additionally, my Spark version is 2.2.0.
Any help would be greatly appreciated.








--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-SQL-Syntax-case-when-doesn-t-be-supported-in-JOIN-tp21960.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re:Re: [SQL]Analysis failed when combining Window function and GROUP BY in Spark2.x

2017-03-08 Thread StanZhai
Thanks for your reply!
I know what's going on now.
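
For anyone hitting the same restriction, two possible rewrites (sketches only, not from the original discussion; which one applies depends on what the query is meant to compute):

```scala
// 1) If the goal is just the distinct count of b per value of a,
//    a plain aggregate avoids the window entirely:
spark.sql("SELECT a, count(DISTINCT b) AS cnt FROM tb GROUP BY a").show()

// 2) If per-row granularity must be kept, a distinct count over a window
//    can be emulated with collect_set (sketch):
spark.sql("SELECT a, size(collect_set(b) OVER (PARTITION BY a)) AS cnt FROM tb").show()
```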






"Herman van Hövell tot Westerflier-2 [via Apache Spark Developers 
List]" wroted at 2017-03-08 21:35:

You are seeing a bug in the Hive parser. Hive drops the window clause when it 
encounters a count(distinct ...). See 
https://issues.apache.org/jira/browse/HIVE-10141 for more information.

Spark 1.6 plans this as a regular distinct aggregate (dropping the window 
clause), which is wrong. Spark 2.x uses its own parser, and it does not allow you to use 'distinct' aggregates in window functions. You are getting this error because aggregates are planned before windows, and the aggregate cannot find b in its grouping expressions.


On Wed, Mar 8, 2017 at 5:21 AM, StanZhai <[hidden email]> wrote:
We can reproduce this using the following code:
val spark = SparkSession.builder().appName("test").master("local").getOrCreate()

val sql1 =
  """
|create temporary view tb as select * from values
|(1, 0),
|(1, 0),
|(2, 0)
|as grouping(a, b)
  """.stripMargin

val sql =
  """
|select count(distinct(b)) over (partition by a) from tb group by a
  """.stripMargin

spark.sql(sql1)
spark.sql(sql).show()

It will throw an exception like this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
'tb.`b`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which 
value you get.;;
Project [count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING)#4L]
+- Project [b#1, a#0, count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L, count(DISTINCT b) OVER 
(PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L]
   +- Window [count(distinct b#1) windowspecdefinition(a#0, ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS count(DISTINCT b) OVER 
(PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L], 
[a#0]
  +- Aggregate [a#0], [b#1, a#0]
 +- SubqueryAlias tb
+- Project [a#0, b#1]
   +- SubqueryAlias grouping
  +- LocalRelation [a#0, b#1]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:220)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)

But there is no exception in Spark 1.6.x.
I think the SQL `select count(distinct(b)) over (partition by a) from tb group by a` should be executed. I've no idea about

[SQL]Analysis failed when combining Window function and GROUP BY in Spark2.x

2017-03-07 Thread StanZhai
We can reproduce this using the following code:
val spark = SparkSession.builder().appName("test").master("local").getOrCreate()

val sql1 =
  """
|create temporary view tb as select * from values
|(1, 0),
|(1, 0),
|(2, 0)
|as grouping(a, b)
  """.stripMargin

val sql =
  """
|select count(distinct(b)) over (partition by a) from tb group by a
  """.stripMargin

spark.sql(sql1)
spark.sql(sql).show()

It will throw an exception like this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 
'tb.`b`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which 
value you get.;;
Project [count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING)#4L]
+- Project [b#1, a#0, count(DISTINCT b) OVER (PARTITION BY a ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L, count(DISTINCT b) OVER 
(PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L]
   +- Window [count(distinct b#1) windowspecdefinition(a#0, ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS count(DISTINCT b) OVER 
(PARTITION BY a ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)#4L], 
[a#0]
  +- Aggregate [a#0], [b#1, a#0]
 +- SubqueryAlias tb
+- Project [a#0, b#1]
   +- SubqueryAlias grouping
  +- LocalRelation [a#0, b#1]

  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:220)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$7.apply(CheckAnalysis.scala:247)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:247)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)

But there is no exception in Spark 1.6.x.
I think the SQL `select count(distinct(b)) over (partition by a) from tb group by a` should be executed. I have no idea about the exception. Is this in line with
expectations?
Any help is appreciated!
Best, 
Stan







--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Analysis-failed-when-combining-Window-function-and-GROUP-BY-in-Spark2-x-tp21131.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-23 Thread StanZhai
Thanks for Cheng's help.


Something must be wrong with InferFiltersFromConstraints. I just removed
InferFiltersFromConstraints from
org/apache/spark/sql/catalyst/optimizer/Optimizer.scala to avoid this issue. I
will analyze it using the method you provided.
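
A minimal sketch of the reduction Cheng suggested, replacing the real relations with temporary views built from local Scala collections (the table names and schemas here are hypothetical stand-ins):

```scala
// In spark-shell: build small local stand-ins for the real tables, then re-run
// a progressively simplified version of the problematic SQL against them.
import spark.implicits._

Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("t1")
Seq((1, 10L), (2, 20L)).toDF("id", "value").createOrReplaceTempView("t2")

val df = spark.sql("SELECT t1.id, t2.value FROM t1 JOIN t2 ON t1.id = t2.id")
df.rdd  // planning (analyzer + optimizer) runs here; no data is read
```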




-- Original --
From:  "Cheng Lian [via Apache Spark Developers 
List]";;
Send time: Friday, Feb 24, 2017 2:28 AM
To: "Stan Zhai"; 

Subject:  Re: The driver hangs at DataFrame.rdd in Spark 2.1.0



   
This one seems to be relevant, but it's already fixed in 2.1.0.

One way to debug is to turn on trace log and check how the analyzer/optimizer behaves.

On 2/22/17 11:11 PM, StanZhai wrote:
 
Could this be related to 
https://issues.apache.org/jira/browse/SPARK-17733 ?

 
 
 
 -- Original --
From:  "Cheng Lian-3 [via Apache Spark Developers   
  List]";<[hidden   email]>;
   Send time: Thursday, Feb 23, 2017 9:43 AM
   To: "Stan Zhai"<[hidden   email]>; 
   Subject:  Re: The driver hangs at DataFrame.rdd in Spark 
2.1.0
 
 
 
 
Just from the thread dump you provided, it seems that this particular query plan jams our optimizer. However, it's also possible that the driver just happened to be running optimizer rules at that particular time point.

Since query planning doesn't touch any actual data, could you please try to minimize this query by replacing the actual relations with temporary views derived from Scala local collections? In this way, it would be much easier for others to reproduce the issue.

Cheng
 
 
On 2/22/17 5:16 PM, Stan Zhai wrote:

Thanks for Lian's reply.

Here is the QueryPlan generated by Spark 1.6.2 (I can't get it in Spark 2.1.0):
...


 
-- Original --
Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

What is the query plan? We had once observed query plans that grow exponentially in iterative ML workloads and the query planner hangs forever. For example, each iteration combines 4 plan trees of the last iteration and forms a larger plan tree. The size of the plan tree can easily reach billions of nodes after 15 iterations.
 
 
On 2/22/17 9:29 AM, Stan Zhai wrote:

Hi all,

The driver hangs at DataFrame.rdd in Spark 2.1.0 when the DataFrame (SQL) is complex. The following is a thread dump of my driver:
...
 
 
 






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-The-driver-hangs-at-DataFrame-rdd-in-Spark-2-1-0-tp21052p21073.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-22 Thread StanZhai
Could this be related to https://issues.apache.org/jira/browse/SPARK-17733 ?




-- Original --
From:  "Cheng Lian-3 [via Apache Spark Developers 
List]";;
Send time: Thursday, Feb 23, 2017 9:43 AM
To: "Stan Zhai"; 

Subject:  Re: The driver hangs at DataFrame.rdd in Spark 2.1.0



   
Just from the thread dump you provided, it seems that this particular query plan jams our optimizer. However, it's also possible that the driver just happened to be running optimizer rules at that particular time point.

Since query planning doesn't touch any actual data, could you please try to minimize this query by replacing the actual relations with temporary views derived from Scala local collections? In this way, it would be much easier for others to reproduce the issue.

Cheng
 
 
On 2/22/17 5:16 PM, Stan Zhai wrote:

Thanks for Lian's reply.

Here is the QueryPlan generated by Spark 1.6.2 (I can't get it in Spark 2.1.0):
...


 
-- Original --
Subject: Re: The driver hangs at DataFrame.rdd in Spark 2.1.0

What is the query plan? We had once observed query plans that grow exponentially in iterative ML workloads and the query planner hangs forever. For example, each iteration combines 4 plan trees of the last iteration and forms a larger plan tree. The size of the plan tree can easily reach billions of nodes after 15 iterations.
 
 
On 2/22/17 9:29 AM, Stan Zhai wrote:

Hi all,

The driver hangs at DataFrame.rdd in Spark 2.1.0 when the DataFrame (SQL) is complex. The following is a thread dump of my driver:
...






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Re-The-driver-hangs-at-DataFrame-rdd-in-Spark-2-1-0-tp21052p21054.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

The driver hangs at DataFrame.rdd in Spark 2.1.0

2017-02-22 Thread StanZhai
Hi all,


The driver hangs at DataFrame.rdd in Spark 2.1.0 when the DataFrame (SQL) is
complex. The following is a thread dump of my driver:


org.apache.spark.sql.catalyst.expressions.AttributeReference.equals(namedExpressions.scala:230)
 
org.apache.spark.sql.catalyst.expressions.IsNotNull.equals(nullExpressions.scala:312)
 org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
org.apache.spark.sql.catalyst.expressions.Or.equals(predicates.scala:315) 
scala.collection.mutable.FlatHashTable$class.addEntry(FlatHashTable.scala:151) 
scala.collection.mutable.HashSet.addEntry(HashSet.scala:40) 
scala.collection.mutable.FlatHashTable$class.addElem(FlatHashTable.scala:139) 
scala.collection.mutable.HashSet.addElem(HashSet.scala:40) 
scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:59) 
scala.collection.mutable.HashSet.$plus$eq(HashSet.scala:40) 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59)
 scala.collection.mutable.HashSet.foreach(HashSet.scala:78) 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59) 
scala.collection.mutable.AbstractSet.$plus$plus$eq(Set.scala:46) 
scala.collection.mutable.HashSet.clone(HashSet.scala:83) 
scala.collection.mutable.HashSet.clone(HashSet.scala:40) 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:65)
 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus(ExpressionSet.scala:50)
 scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141) 
scala.collection.SetLike$$anonfun$$plus$plus$1.apply(SetLike.scala:141) 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
 scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316) 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) 
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:972) 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) 
scala.collection.AbstractTraversable.foldLeft(Traversable.scala:104) 
scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:151) 
scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104) 
scala.collection.SetLike$class.$plus$plus(SetLike.scala:141) 
org.apache.spark.sql.catalyst.expressions.ExpressionSet.$plus$plus(ExpressionSet.scala:50)
 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:300)
 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode$$anonfun$getAliasedConstraints$1.apply(LogicalPlan.scala:297)
 scala.collection.immutable.List.foreach(List.scala:381) 
org.apache.spark.sql.catalyst.plans.logical.UnaryNode.getAliasedConstraints(LogicalPlan.scala:297)
 
org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:58)
 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
 => holding 
Monitor(org.apache.spark.sql.catalyst.plans.logical.Join@1365611745}) 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints(QueryPlan.scala:187) 
org.apache.spark.sql.catalyst.plans.logical.Project.validConstraints(basicLogicalOperators.scala:58)
 
org.apache.spark.sql.catalyst.plans.QueryPlan.constraints$lzycompute(QueryPlan.scala:187)
 => holding 
Monitor(org.apache.spark.sql.catalyst.plans.logical.Join@1365611745}) 
org.apache.spark.sql

Re:compile about the code

2017-02-20 Thread StanZhai
Your antlr4-maven-plugin looks incomplete. You can try deleting ~/.m2 in
your home directory, then re-compiling Spark.




-- Original --
From:  " 萝卜丝炒饭 [via Apache Spark Developers 
List]";;
Date:  Feb 20, 2017
To:  "Stan Zhai"; 

Subject:  compile about the code



Hi all,

When I compile Spark 2.0.2, I get an error about antlr4.

I pasted the info in the attached file, would you please help me?

[Attachment: 0220_2.png]






--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/compile-about-the-code-tp21030p21031.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-13 Thread StanZhai
I've filed a JIRA about this problem: https://issues.apache.org/jira/browse/SPARK-19532

I've tried setting `spark.speculation` to `false`, but the off-heap memory still
exceeds the limit by about 10G after triggering a Full GC on the Executor
process (--executor-memory 30G), as follows:

test@test Online ~ $ ps aux | grep CoarseGrainedExecutorBackend
test  105371  106 21.5 67325492 42621992 ?   Sl   15:20  55:14
/home/test/service/jdk/bin/java -cp
/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/spark/conf/:/home/test/service/spark/jars/*:/home/test/service/hadoop/etc/hadoop/
-Xmx30720M -Dspark.driver.port=9835 -Dtag=spark_2_1_test -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:./gc.log -verbose:gc
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://CoarseGrainedScheduler@172.16.34.235:9835 --executor-id 4 --hostname
test-192 --cores 36 --app-id app-20170213152037-0043 --worker-url
spark://Worker@test-192:33890

So I think there are also other reasons for this problem.

We have been trying to upgrade our Spark since the release of Spark 2.1.0.

This version is unstable and unusable for us because of these memory
problems; we should pay attention to this.
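
For reference, this is roughly how speculation was turned off for the test above (a minimal sketch; passing --conf spark.speculation=false to spark-submit is equivalent):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: build the session with speculation disabled before running the SQL jobs.
val spark = SparkSession.builder()
  .appName("speculation-off-test")            // hypothetical app name
  .config("spark.speculation", "false")
  .getOrCreate()
```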


StanZhai wrote
> From the thread dump page of the Executor in the WebUI, I found that there are about
> 1300 threads named "DataStreamer for file
> /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
> in TIMED_WAITING state like this:
<http://apache-spark-developers-list.1001551.n3.nabble.com/file/n20881/QQ20170207-212340.png>
 
> 
> The excess off-heap memory may be caused by these abnormal threads.
> 
> This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing).
> 
> Could this be related to https://issues.apache.org/jira/browse/HDFS-9812 ?
> 
> It may be a bug in Spark when killing tasks while writing data. What's the difference between Spark 1.6.x and 2.1.0 in killing tasks?
> 
> This is a critical issue, I've worked on this for days.
> 
> Any help?





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697p20935.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-07 Thread StanZhai
From the thread dump page of the Executor in the WebUI, I found that there are about 1300
threads named "DataStreamer for file
/test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
in TIMED_WAITING state like this:

 

The excess off-heap memory may be caused by these abnormal threads.

This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing).

Could this be related to https://issues.apache.org/jira/browse/HDFS-9812 ?

It may be a bug in Spark when killing tasks while writing data. What's
the difference between Spark 1.6.x and 2.1.0 in killing tasks?

This is a critical issue, I've worked on this for days.

Any help?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697p20881.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[SQL]SQLParser fails to resolve nested CASE WHEN statement with parentheses in Spark 2.x

2017-02-06 Thread StanZhai
Hi all,

SQLParser fails to resolve nested CASE WHEN statement like this:

select case when
  (1) +
  case when 1>0 then 1 else 0 end = 2
then 1 else 0 end
from tb

 Exception 
Exception in thread "main"
org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input 'then' expecting {'.', '[', 'OR', 'AND', 'IN', NOT,
'BETWEEN', 'LIKE', RLIKE, 'IS', 'WHEN', EQ, '<=>', '<>', '!=', '<', LTE,
'>', GTE, '+', '-', '*', '/', '%', 'DIV', '&', '|', '^'}(line 5, pos 0)

== SQL ==

select case when
  (1) +
  case when 1>0 then 1 else 0 end = 2
then 1 else 0 end
^^^
from tb

But removing the parentheses works fine:

select case when
  1 +
  case when 1>0 then 1 else 0 end = 2
then 1 else 0 end
from tb
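
A spark-shell sketch for reproducing the two variants quickly (the one-column temp view is a hypothetical stand-in for `tb`):

```scala
import spark.implicits._

// Hypothetical stand-in for `tb`.
Seq(1, 2, 3).toDF("x").createOrReplaceTempView("tb")

// Fails to parse in Spark 2.x with "mismatched input 'then'".
spark.sql("select case when (1) + case when 1>0 then 1 else 0 end = 2 then 1 else 0 end from tb")

// Parses fine once the parentheses around the literal are removed.
spark.sql("select case when 1 + case when 1>0 then 1 else 0 end = 2 then 1 else 0 end from tb")
```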

I've already filed a JIRA for this: 
https://issues.apache.org/jira/browse/SPARK-19472
  

Any help is greatly appreciated!

Best,
Stan




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-SQLParser-fails-to-resolve-nested-CASE-WHEN-statement-with-parentheses-in-Spark-2-x-tp20867.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [SQL]A confusing NullPointerException when creating table using Spark2.1.0

2017-02-06 Thread StanZhai
This issue has been fixed by https://github.com/apache/spark/pull/16820.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-A-confusing-NullPointerException-when-creating-table-using-Spark2-1-0-tp20851p20866.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[SQL]A confusing NullPointerException when creating table using Spark2.1.0

2017-02-03 Thread StanZhai
Hi all,

After upgrading our Spark from 1.6.2 to 2.1.0, I encountered a confusing
NullPointerException when creating a table under Spark 2.1.0; the problem
does not exist in Spark 1.6.1.

Environment: Hive 1.2.1, Hadoop 2.6.4

 Code 
// spark is an instance of HiveContext
// merge is a Hive UDF
val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b
FROM tb_1 group by field_a, field_b")
df.createTempView("tb_temp")
spark.sql("create table tb_result stored as parquet as " +
  "SELECT new_a" +
  "FROM tb_temp" +
  "LEFT JOIN `tb_2` ON " +
  "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL),
concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) =
`tb_2`.`fka6862f17`")

 Physical Plan 
*Project [new_a]
+- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_,
cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b],
[fka6862f17], LeftOuter, BuildRight
   :- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a,
new_b, _nondeterministic])
   :  +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180),
coordinator[target post-shuffle partition size: 1024880]
   : +- *HashAggregate(keys=[field_a, field_b], functions=[],
output=[field_a, field_b])
   :+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true,
Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1,
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string,
true]))
  +- *Project [fka6862f17]
 +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format:
Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2,
PartitionFilters: [], PushedFilters: [], ReadSchema: struct

What does '*' mean before HashAggregate?

 Exception 
org.apache.spark.SparkException: Task failed while writing rows
...
java.lang.NullPointerException
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
Source)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260)
at
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259)
at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
at
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197)
at
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:138)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$4.apply(FileFormatWriter.scala:137)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I also found that when I changed my code as follows:

spark.sql("create table tb_result stored as parquet as " +
  "SELECT new_b" +
  "FROM tb_temp" +
  "LEFT JOIN `tb_2` ON " +
  "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL),
concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) =
`tb_2`.`fka6862f17`")

Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-02-02 Thread StanZhai
CentOS 7.1,
Linux version 3.10.0-229.el7.x86_64 (buil...@kbuilder.dev.centos.org) (gcc
version 4.8.2 20140120 (Red Hat 4.8.2-16) (GCC) ) #1 SMP Fri Mar 6 11:36:42
UTC 2015


Michael Allman-2 wrote
> Hi Stan,
> 
> What OS/version are you using?
> 
> Michael
> 
>> On Jan 22, 2017, at 11:36 PM, StanZhai <

> mail@

> > wrote:
>> 
>> I'm using Parallel GC.
>> rxin wrote
>>> Are you using G1 GC? G1 sometimes uses a lot more memory than the size
>>> allocated.
>>> 
>>> 
>>> On Sun, Jan 22, 2017 at 12:58 AM StanZhai <
>> 
>>> mail@
>> 
>>> > wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> 
>>>> 
>>>> We just upgraded our Spark from 1.6.2 to 2.1.0.
>>>> 
>>>> 
>>>> 
>>>> Our Spark application is started by spark-submit with config of
>>>> 
>>>> `--executor-memory 35G` in standalone model, but the actual use of
>>>> memory
>>>> up
>>>> 
>>>> to 65G after a full gc(jmap -histo:live $pid) as follow:
>>>> 
>>>> 
>>>> 
>>>> test@c6 ~ $ ps aux | grep CoarseGrainedExecutorBackend
>>>> 
>>>> test  181941  181 34.7 94665384 68836752 ?   Sl   09:25 711:21
>>>> 
>>>> /home/test/service/jdk/bin/java -cp
>>>> 
>>>> 
>>>> /home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/spark/conf/:/home/test/service/spark/jars/*:/home/test/service/hadoop/etc/hadoop/
>>>> 
>>>> -Xmx35840M -Dspark.driver.port=47781 -XX:+PrintGCDetails
>>>> 
>>>> -XX:+PrintGCDateStamps -Xloggc:./gc.log -verbose:gc
>>>> 
>>>> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
>>>> 
>>>> spark://
>> 
>>> CoarseGrainedScheduler@.xxx
>> 
>>> :47781 --executor-id 1
>>>> 
>>>> --hostname test-192 --cores 36 --app-id app-20170122092509-0017
>>>> --worker-url
>>>> 
>>>> spark://Worker@test-192:33890
>>>> 
>>>> 
>>>> 
>>>> Our Spark jobs are all sql.
>>>> 
>>>> 
>>>> 
>>>> The exceed memory looks like off-heap memory, but the default value of
>>>> 
>>>> `spark.memory.offHeap.enabled` is `false`.
>>>> 
>>>> 
>>>> 
>>>> We didn't find the problem in Spark 1.6.x, what causes this in Spark
>>>> 2.1.0?
>>>> 
>>>> 
>>>> 
>>>> Any help is greatly appreicated!
>>>> 
>>>> 
>>>> 
>>>> Best,
>>>> 
>>>> Stan
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> 
>>>> View this message in context:
>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697.html
>>>> 
>>>> Sent from the Apache Spark Developers List mailing list archive at
>>>> Nabble.com <http://nabble.com/>;.
>>>> 
>>>> 
>>>> 
>>>> -
>>>> 
>>>> To unsubscribe e-mail: 
>> 
>>> dev-unsubscribe@.apache
>> 
>>>> 
>>>> 
>>>> 
>>>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697p20707.html
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697p20707.html>;
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com <http://nabble.com/>;.
>> 
>> -
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>  <mailto:

> dev-unsubscribe@.apache

> >





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697p20833.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread StanZhai
I'm using Parallel GC.
rxin wrote
> Are you using G1 GC? G1 sometimes uses a lot more memory than the size
> allocated.
> 
> 
> On Sun, Jan 22, 2017 at 12:58 AM StanZhai <

> mail@

> > wrote:
> 
>> Hi all,
>>
>>
>>
>> We just upgraded our Spark from 1.6.2 to 2.1.0.
>>
>>
>>
>> Our Spark application is started by spark-submit with config of
>>
>> `--executor-memory 35G` in standalone model, but the actual use of memory
>> up
>>
>> to 65G after a full gc(jmap -histo:live $pid) as follow:
>>
>>
>>
>> test@c6 ~ $ ps aux | grep CoarseGrainedExecutorBackend
>>
>> test  181941  181 34.7 94665384 68836752 ?   Sl   09:25 711:21
>>
>> /home/test/service/jdk/bin/java -cp
>>
>>
>> /home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/spark/conf/:/home/test/service/spark/jars/*:/home/test/service/hadoop/etc/hadoop/
>>
>> -Xmx35840M -Dspark.driver.port=47781 -XX:+PrintGCDetails
>>
>> -XX:+PrintGCDateStamps -Xloggc:./gc.log -verbose:gc
>>
>> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
>>
>> spark://

> CoarseGrainedScheduler@.xxx

> :47781 --executor-id 1
>>
>> --hostname test-192 --cores 36 --app-id app-20170122092509-0017
>> --worker-url
>>
>> spark://Worker@test-192:33890
>>
>>
>>
>> Our Spark jobs are all sql.
>>
>>
>>
>> The exceed memory looks like off-heap memory, but the default value of
>>
>> `spark.memory.offHeap.enabled` is `false`.
>>
>>
>>
>> We didn't find the problem in Spark 1.6.x, what causes this in Spark
>> 2.1.0?
>>
>>
>>
>> Any help is greatly appreicated!
>>
>>
>>
>> Best,
>>
>> Stan
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697.html
>>
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: 

> dev-unsubscribe@.apache

>>
>>
>>
>>





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697p20707.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Executors exceed maximum memory defined with `--executor-memory` in Spark 2.1.0

2017-01-22 Thread StanZhai
Hi all,

We just upgraded our Spark from 1.6.2 to 2.1.0.

Our Spark application is started by spark-submit with a config of
`--executor-memory 35G` in standalone mode, but the actual memory use goes up
to 65G after a full GC (jmap -histo:live $pid), as follows:

test@c6 ~ $ ps aux | grep CoarseGrainedExecutorBackend
test  181941  181 34.7 94665384 68836752 ?   Sl   09:25 711:21
/home/test/service/jdk/bin/java -cp
/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/hadoop/share/hadoop/common/hadoop-lzo-0.4.20-SNAPSHOT.jar:/home/test/service/spark/conf/:/home/test/service/spark/jars/*:/home/test/service/hadoop/etc/hadoop/
-Xmx35840M -Dspark.driver.port=47781 -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -Xloggc:./gc.log -verbose:gc
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
spark://coarsegrainedschedu...@xxx.xxx.xxx.xxx:47781 --executor-id 1
--hostname test-192 --cores 36 --app-id app-20170122092509-0017 --worker-url
spark://Worker@test-192:33890

Our Spark jobs are all sql.

The excess memory looks like off-heap memory, but the default value of
`spark.memory.offHeap.enabled` is `false`.

We didn't see this problem in Spark 1.6.x; what causes it in Spark 2.1.0?

Any help is greatly appreciated!

Best,
Stan



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Executors-exceed-maximum-memory-defined-with-executor-memory-in-Spark-2-1-0-tp20697.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[SparkSQL]How does spark handle a parquet file in parallel?

2015-09-19 Thread StanZhai
Hi all,

I'm using Spark (1.4.1) + Hive (0.13.1). I found that a large amount of
network IO appears when querying a parquet table *with only one part file* using
SparkSQL.

The SQL is: SELECT concat(year(fkbb5855f0), "-", month(fkbb5855f0), "-",
day(fkbb5855f0), " 00:00:00"),COUNT(fk919b1d80) FROM test WHERE fkbb5855f0
>= '2015-08-02 00:00:00' AND fkbb5855f0 <
'2015-09-01 00:00:00' AND fk418c5509 IN ('add_summary') AND (fkbb5855f0
!= '' AND fkbb5855f0 is not NULL) GROUP BY year(fkbb5855f0),
month(fkbb5855f0), day(fkbb5855f0)

The SQL's query explain is:

== Parsed Logical Plan ==
'Limit 1
 'Aggregate ['year('fkbb5855f0),'month('fkbb5855f0),'day('fkbb5855f0)],
['concat('year('fkbb5855f0),-,'month('fkbb5855f0),-,'day('fkbb5855f0),
00:00:00) AS _c0#14,COUNT('fk919b1d80) AS _c1#15]
  'Filter 'fkbb5855f0 >= 2015-08-02 00:00:00) &&
('fkbb5855f0 < 2015-09-01 00:00:00)) && 'fk418c5509 IN
(add_summary)) && (NOT ('fkbb5855f0 = ) && IS NOT NULL
'fkbb5855f0))
   'UnresolvedRelation [test], None

== Analyzed Logical Plan ==
_c0: string, _c1: bigint
Limit 1
 Aggregate
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(fkbb5855f0#36),HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFMonth(fkbb5855f0#36),HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFDayOfMonth(fkbb5855f0#36)],
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(fkbb5855f0#36),-,HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFMonth(fkbb5855f0#36),-,HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFDayOfMonth(fkbb5855f0#36),
00:00:00) AS _c0#14,COUNT(fk919b1d80#34) AS _c1#15L]
  Filter fkbb5855f0#36 >= 2015-08-02 00:00:00) &&
(fkbb5855f0#36 < 2015-09-01 00:00:00)) && fk418c5509#35 IN
(add_summary)) && (NOT (fkbb5855f0#36 = ) && IS NOT NULL
fkbb5855f0#36))
   Subquery test
   
Relation[fkb80bb774#33,fk919b1d80#34,fk418c5509#35,fkbb5855f0#36]
org.apache.spark.sql.parquet.ParquetRelation2@5a271032

== Optimized Logical Plan ==
Limit 1
 Aggregate
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(fkbb5855f0#36),HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFMonth(fkbb5855f0#36),HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFDayOfMonth(fkbb5855f0#36)],
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(fkbb5855f0#36),-,HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFMonth(fkbb5855f0#36),-,HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFDayOfMonth(fkbb5855f0#36),
00:00:00) AS _c0#14,COUNT(fk919b1d80#34) AS _c1#15L]
  Project [fkbb5855f0#36,fk919b1d80#34]
   Filter fkbb5855f0#36 >= 2015-08-02 00:00:00) &&
(fkbb5855f0#36 < 2015-09-01 00:00:00)) && fk418c5509#35 INSET
(add_summary)) && (NOT (fkbb5855f0#36 = ) && IS NOT NULL
fkbb5855f0#36))
   
Relation[fkb80bb774#33,fk919b1d80#34,fk418c5509#35,fkbb5855f0#36]
org.apache.spark.sql.parquet.ParquetRelation2@5a271032

== Physical Plan ==
Limit 1
 Aggregate false, [PartialGroup#42,PartialGroup#43,PartialGroup#44],
[HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFConcat(PartialGroup#42,-,PartialGroup#43,-,PartialGroup#44,
00:00:00) AS _c0#14,Coalesce(SUM(PartialCount#41L),0) AS _c1#15L]
  Exchange (HashPartitioning 200)
   Aggregate true,
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(fkbb5855f0#36),HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFMonth(fkbb5855f0#36),HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFDayOfMonth(fkbb5855f0#36)],
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(fkbb5855f0#36) AS
PartialGroup#42,HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFMonth(fkbb5855f0#36)
AS
PartialGroup#43,HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFDayOfMonth(fkbb5855f0#36)
AS PartialGroup#44,COUNT(fk919b1d80#34) AS PartialCount#41L]
    Project [fkbb5855f0#36,fk919b1d80#34]
     Filter (fkbb5855f0#36 >= 2015-08-02 00:00:00)
&& (fkbb5855f0#36 < 2015-09-01 00:00:00)) &&
fk418c5509#35 INSET (add_summary)) && NOT (fkbb5855f0#36 = ))
&& IS NOT NULL fkbb5855f0#36)
      PhysicalRDD
[fkbb5855f0#36,fk919b1d80#34,fk418c5509#35], MapPartitionsRDD[5] at

Code Generation: false
== RDD ==

The size of the `test` table is 3.3GB. I have 5 nodes in the Hadoop cluster,
and Spark uses the same cluster. There are 3 replicas of the test table and
the block size is 64MB.

The task count of the first stage is 54 when SparkSQL executes the SQL, and the
Locality Level of every task is NODE_LOCAL. I used dstat to monitor a node
of the cluster, and there is a large amount of network IO:

total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  1   0  99   0   0   0| 107k  389k|   0     0 | 193B 1097B|2408  5443
  0   0 100   0   0   0|   0     0 |5709B 3285B|   0     0 |1921  4282
  0   0 100   0   0   0|1936k    0 |3761B 1251B|   0     0 |1907  4197
  0   0 100   0   0   0|   0   584k|3399B 1539B|   0     0 |1903  4338
  0   0  99   0   0   0|1936k    0 |4332B 1447B|   0     0 

Re: [SparkSQL]Could not alter table in Spark 1.5 use HiveContext

2015-09-10 Thread StanZhai
Thanks a lot! I've fixed this issue by setting:
spark.sql.hive.metastore.version = 0.13.1
spark.sql.hive.metastore.jars = maven
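
For anyone else hitting this, a minimal sketch of setting the same two options programmatically in Spark 1.5 (spark-defaults.conf works just as well; `maven` makes Spark download the Hive 0.13.1 client jars itself):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch: point the metastore client at Hive 0.13.1 before creating the HiveContext.
val conf = new SparkConf()
  .setAppName("hive-metastore-0.13-test")               // hypothetical app name
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "maven")
val sc = new SparkContext(conf)
val hc = new HiveContext(sc)
```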


Yin Huai-2 wrote
> Yes, Spark 1.5 use Hive 1.2's metastore client by default. You can change
> it by putting the following settings in your spark conf.
> 
> spark.sql.hive.metastore.version = 0.13.1
> spark.sql.hive.metastore.jars = maven or the path of your hive 0.13 jars
> and hadoop jars
> 
> For spark.sql.hive.metastore.jars, basically, it tells spark sql where to
> find metastore client's classes of Hive 0.13.1. If you set it to maven, we
> will download needed jars directly (it is an easy way to do testing work).
> 
> On Thu, Sep 10, 2015 at 7:45 PM, StanZhai <

> mail@

> > wrote:
> 
>> Thank you for the swift reply!
>>
>> The version of my hive metastore server is 0.13.1, I've build spark use
>> sbt
>> like this:
>> build/sbt -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver assembly
>>
>> Is spark 1.5 bind the hive client version of 1.2 by default?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-Could-not-alter-table-in-Spark-1-5-use-HiveContext-tp14029p14044.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: 

> dev-unsubscribe@.apache

>> For additional commands, e-mail: 

> dev-help@.apache

>>
>>





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-Could-not-alter-table-in-Spark-1-5-use-HiveContext-tp14029p14047.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [SparkSQL]Could not alter table in Spark 1.5 use HiveContext

2015-09-10 Thread StanZhai
Thank you for the swift reply!

The version of my Hive metastore server is 0.13.1. I've built Spark using sbt like this:
build/sbt -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver assembly

Does Spark 1.5 bind to Hive client version 1.2 by default?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-Could-not-alter-table-in-Spark-1-5-use-HiveContext-tp14029p14044.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[SparkSQL]Could not alter table in Spark 1.5 use HiveContext

2015-09-09 Thread StanZhai
After upgrading Spark from 1.4.1 to 1.5.0, I encountered the following
exception when using an ALTER TABLE statement in HiveContext:

The sql is: ALTER TABLE a RENAME TO b

The exception is:

FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid
method name: 'alter_table_with_cascade'
msg: org.apache.spark.sql.execution.QueryExecutionException: FAILED:
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Unable to alter table. Invalid method name: 'alter_table_with_cascade'
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:433)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$runHive$1.apply(ClientWrapper.scala:418)
at
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at
org.apache.spark.sql.hive.client.ClientWrapper.runHive(ClientWrapper.scala:418)
at
org.apache.spark.sql.hive.client.ClientWrapper.runSqlHive(ClientWrapper.scala:408)
at 
org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:558)
at
org.apache.spark.sql.hive.execution.HiveNativeCommand.run(HiveNativeCommand.scala:33)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:144)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:129)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
at test.service.QueryService.query(QueryService.scala:28)
at test.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:39)
at test.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:30)
at test.web.JettyUtils$$anon$1.getOrPost(JettyUtils.scala:81)
at test.web.JettyUtils$$anon$1.doPost(JettyUtils.scala:119)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:755)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:982)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1043)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at
org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at
org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)

The SQL runs fine in both Spark 1.4.1 and Hive, so I think this should be a
bug in Spark 1.5. Any suggestions?

Best, Stan



--
View this message in context: 
http://apache-spark-deve

Re: Parquet SaveMode.Append Trouble.

2015-07-30 Thread StanZhai
You should import org.apache.spark.sql.SaveMode
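For example, a minimal sketch (assuming an existing DataFrame df and a writable Parquet path of your choosing):

```scala
import org.apache.spark.sql.SaveMode

// Append to the existing Parquet data instead of failing when the path already exists.
df.write.mode(SaveMode.Append).parquet("/path/to/parquet")
```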



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Parquet-SaveMode-Append-Trouble-tp13529p13531.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[Spark SQL]Could not read parquet table after recreating it with the same table name

2015-07-28 Thread StanZhai
Hi all,

I'm using Spark SQL in Spark 1.4.1. I encountered an error when using a Parquet
table after recreating it with the same name; the error can be reproduced as follows:

```scala
// hc is an instance of HiveContext
import org.apache.spark.sql.SaveMode // needed for SaveMode.Overwrite below

hc.sql("select * from b").show() // this is ok; b is a parquet table
val df = hc.sql("select * from a")
df.write.mode(SaveMode.Overwrite).saveAsTable("b")
hc.sql("select * from b").show() // got error
```

The error is: 

java.io.FileNotFoundException: File does not exist:
/user/hive/warehouse/test.db/b/part-r-4-3abcbb07-e20a-4b5e-a6e5-59356c3d3149.gz.parquet
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:65)
at
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:55)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1716)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1659)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1639)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1613)
at
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:497)
at
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:322)
at
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
at
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
at
org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1144)
at 
org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1132)
at 
org.apache.hadoop.hdfs.DFSClient.getBlockLocations(DFSClient.java:1182)
at
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:218)
at
org.apache.hadoop.hdfs.DistributedFileSystem$1.doCall(DistributedFileSystem.java:214)
at
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:214)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileBlockLocations(DistributedFileSystem.java:206)
at
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:625)
at
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat$$anonfun$getTaskSideSplits$1.apply(ParquetTableOperations.scala:621)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getTaskSideSplits(ParquetTableOperations.scala:621)
at
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:511)
at 
parquet.hadoop.ParquetInputFormat.getSplits(ParquetInputFormat.java:245)
at
org.apache.spark.sql.parquet.FilteringParquetRowInputFormat.getSplits(ParquetTableOperations.scala:464)
at
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$buildScan$1$$anon$1.getPartitions(newParquet.scala:305)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at
org.apache.sp
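
For anyone who hits the same thing, a sketch of a possible workaround (an assumption on my side, not a verified fix): ask the HiveContext to refresh its cached metadata for the table before reading it again, since the stack trace points at a stale Parquet file listing.

```scala
// Sketch of a possible workaround (assumes HiveContext.refreshTable is available
// in this Spark version and that stale cached metadata for b is the cause):
hc.refreshTable("b")             // drop the cached file listing for table b
hc.sql("select * from b").show() // re-read with fresh metadata
```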

[SparkSQL 1.4.0]The result of SUM(xxx) in SparkSQL is 0.0 but not null when the column xxx is all null

2015-07-02 Thread StanZhai
Hi all, 

I have a table named test like this:

|  a  |  b  |
|  1  | null |
|  2  | null |

After upgrading the cluster from Spark 1.3.1 to 1.4.0, I found that the SUM
function behaves differently in Spark 1.4 and 1.3.

The SQL is: select sum(b) from test

In Spark 1.4.0 the result is 0.0, while in Spark 1.3.1 it is null. I think the
result should be null. Why is the result 0.0 in 1.4.0 instead of null? Is this
a bug?
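
A self-contained reproduction sketch (assuming a spark-shell session, so sc and a HiveContext named hc are in scope; the schema below is my own choice):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, DoubleType}

// Build the two-row table above: column b is entirely null.
val schema = StructType(Seq(
  StructField("a", IntegerType, nullable = false),
  StructField("b", DoubleType, nullable = true)))
val rows = sc.parallelize(Seq(Row(1, null), Row(2, null)))
hc.createDataFrame(rows, schema).registerTempTable("test")

hc.sql("select sum(b) from test").show()
// Spark 1.3.1 prints null; Spark 1.4.0 prints 0.0
```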

Any hint is appreciated.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-1-4-0-The-result-of-SUM-xxx-in-SparkSQL-is-0-0-but-not-null-when-the-column-xxx-is-all-null-tp13008.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-24 Thread StanZhai
Hi Michael Armbrust,

I have filed an issue on JIRA for this:
https://issues.apache.org/jira/browse/SPARK-8588



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSQL-1-4-Could-not-use-concat-with-UDF-in-where-clause-tp12832p12848.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[SparkSQL 1.4]Could not use concat with UDF in where clause

2015-06-23 Thread StanZhai
Hi all,

After upgrading the cluster from Spark 1.3.1 to 1.4.0 (RC4), I encountered the
following exception when using concat with a UDF in the where clause. A sketch
of the kind of query involved is shown below, followed by the exception:
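
The query itself is not quoted here, so the following is only a guess at its shape (hc is a HiveContext; the table and column names are made up):

```scala
// A sketch of the kind of query that hits this (an assumption -- the original query
// is not included). The unresolved tree in the trace suggests concat() applied to
// the Hive year() UDF inside an IN predicate of the WHERE clause:
hc.sql("SELECT * FROM t WHERE concat(year(date), '年') IN ('2015年')").show()
```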

===Exception
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to
dataType on unresolved object, tree:
'concat(HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFYear(date#1776),年)
at
org.apache.spark.sql.catalyst.analysis.UnresolvedFunction.dataType(unresolved.scala:82)
at
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
at
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5$$anonfun$applyOrElse$15.apply(HiveTypeCoercion.scala:299)
at
scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
at scala.collection.immutable.List.exists(List.scala:84)
at
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:299)
at
org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$InConversion$$anonfun$apply$5.applyOrElse(HiveTypeCoercion.scala:298)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:75)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:85)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:94)
at
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:136)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformAllExpressions$1.applyOrElse(QueryPlan.scala:135)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:272)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:227)
at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.colle

Re: A confusing ClassNotFoundException error

2015-06-13 Thread StanZhai
I have encountered a similar error at Spark 1.4.0 too.

The same code runs fine on Spark 1.3.1.

My code is (it can be run in spark-shell):
===
  // hc is an instance of HiveContext
  import scala.collection.mutable
  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.sql.Row
  val df = hc.sql("select * from test limit 10")
  val sb = new mutable.StringBuilder
  def mapHandle = (row: Row) => {
val rowData = ArrayBuffer[String]()
for (i <- 0 until row.size) {
  val d = row.get(i)

  d match {
case data: ArrayBuffer[Any] =>
  sb.clear()
  sb.append('[')
  for (j <- 0 until data.length) {
val elm = data(j)
if (elm != null) {
  sb.append('"')
  sb.append(elm.toString)
  sb.append('"')
} else {
  sb.append("null")
}
sb.append(',')
  }
  if (sb.length > 1) {
sb.deleteCharAt(sb.length - 1)
  }
  sb.append(']')
  rowData += sb.toString()
case _ =>
  rowData += (if (d != null) d.toString else null)
  }
}
rowData
  }
  df.map(mapHandle).foreach(println)


My submit script is: spark-submit --class cn.zhaishidan.trans.Main --master
local[8] test-spark.jar
===The error
java.lang.ClassNotFoundException:
cn.zhaishidan.trans.service.SparkHiveService$$anonfun$mapHandle$1$1$$anonfun$apply$1
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at
org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455)
at
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
Source)
at
com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown
Source)
at
org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101)
at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1891)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
at org.apache.spark.rdd.RDD.map(RDD.scala:293)
at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210)
at
cn.zhaishidan.trans.service.SparkHiveService.formatDF(SparkHiveService.scala:66)
at
cn.zhaishidan.trans.service.SparkHiveService.query(SparkHiveService.scala:80)
at
cn.zhaishidan.trans.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:39)
at
cn.zhaishidan.trans.api.DatabaseApi$$anonfun$query$1.apply(DatabaseApi.scala:30)
at
cn.zhaishidan.trans.web.JettyUtils$$anon$1.getOrPost(JettyUtils.scala:56)
at cn.zhaishidan.trans.web.JettyUtils$$anon$1.doGet(JettyUtils.scala:73)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:735)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at 
org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at 
org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpP