[jira] [Commented] (SPARK-15467) Getting stack overflow when attempting to query a wide Dataset (>200 fields)

2018-01-20 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1694#comment-1694
 ] 

Takeshi Yamamuro commented on SPARK-15467:
--

I think this should already be fixed in master (that is, the next v2.3 release). 
You could check this against the current master.

> Getting stack overflow when attempting to query a wide Dataset (>200 fields)
> 
>
> Key: SPARK-15467
> URL: https://issues.apache.org/jira/browse/SPARK-15467
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Don Drake
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.1.0
>
>
> This can be reproduced in a spark-shell; I am running Spark 2.0.0-preview.
> {code}
> import spark.implicits._
> case class Wide(
> val f0:String = "",
> val f1:String = "",
> val f2:String = "",
> val f3:String = "",
> val f4:String = "",
> val f5:String = "",
> val f6:String = "",
> val f7:String = "",
> val f8:String = "",
> val f9:String = "",
> val f10:String = "",
> val f11:String = "",
> val f12:String = "",
> val f13:String = "",
> val f14:String = "",
> val f15:String = "",
> val f16:String = "",
> val f17:String = "",
> val f18:String = "",
> val f19:String = "",
> val f20:String = "",
> val f21:String = "",
> val f22:String = "",
> val f23:String = "",
> val f24:String = "",
> val f25:String = "",
> val f26:String = "",
> val f27:String = "",
> val f28:String = "",
> val f29:String = "",
> val f30:String = "",
> val f31:String = "",
> val f32:String = "",
> val f33:String = "",
> val f34:String = "",
> val f35:String = "",
> val f36:String = "",
> val f37:String = "",
> val f38:String = "",
> val f39:String = "",
> val f40:String = "",
> val f41:String = "",
> val f42:String = "",
> val f43:String = "",
> val f44:String = "",
> val f45:String = "",
> val f46:String = "",
> val f47:String = "",
> val f48:String = "",
> val f49:String = "",
> val f50:String = "",
> val f51:String = "",
> val f52:String = "",
> val f53:String = "",
> val f54:String = "",
> val f55:String = "",
> val f56:String = "",
> val f57:String = "",
> val f58:String = "",
> val f59:String = "",
> val f60:String = "",
> val f61:String = "",
> val f62:String = "",
> val f63:String = "",
> val f64:String = "",
> val f65:String = "",
> val f66:String = "",
> val f67:String = "",
> val f68:String = "",
> val f69:String = "",
> val f70:String = "",
> val f71:String = "",
> val f72:String = "",
> val f73:String = "",
> val f74:String = "",
> val f75:String = "",
> val f76:String = "",
> val f77:String = "",
> val f78:String = "",
> val f79:String = "",
> val f80:String = "",
> val f81:String = "",
> val f82:String = "",
> val f83:String = "",
> val f84:String = "",
> val f85:String = "",
> val f86:String = "",
> val f87:String = "",
> val f88:String = "",
> val f89:String = "",
> val f90:String = "",
> val f91:String = "",
> val f92:String = "",
> val f93:String = "",
> val f94:String = "",
> val f95:String = "",
> val f96:String = "",
> val f97:String = "",
> val f98:String = "",
> val f99:String = "",
> val f100:String = "",
> val f101:String = "",
> val f102:String = "",
> val f103:String = "",
> val f104:String = "",
> val f105:String = "",
> val f106:String = "",
> val f107:String = "",
> val f108:String = "",
> val f109:String = "",
> val f110:String = "",
> val f111:String = "",
> val f112:String = "",
> val f113:String = "",
> val f114:String = "",
> val f115:String = "",
> val f116:String = "",
> val f117:String = "",
> val f118:String = "",
> val f119:String = "",
> val f120:String = "",
> val f121:String = "",
> val f122:String = "",
> val f123:String = "",
> val f124:String = "",
> val f125:String = "",
> val f126:String = "",
> val f127:String = "",
> val f128:String = "",
> val f129:String = "",
> val f130:String = "",
> val f131:String = "",
> val f132:String = "",
> val f133:String = "",
> val f134:String = "",
> val f135:String = "",
> val f136:String = "",
> val f137:String = "",
> val f138:String = "",
> val f139:String = "",
> val f140:String = "",
> val f141:String = "",
> val f142:String = "",
> val f143:String = "",
> val f144:String = "",
> val f145:String = "",
> val f146:String = "",
> val f147:String = "",
> val f148:String = "",
> val f149:String = "",
> val f150:String = "",
> val f151:String = "",
> val f152:String = "",
> val f153:String = "",
> val f154:String = "",
> val f155:String = "",
> val f156:String = "",
> val f157:String = "",
> val f158:String = "",
> val f159:String = "",
> val f160:String = "",
> val f161:String = "",
> val f162:String = "",
> val f163:String = "",
> val f164:String = "",
> val f165:String = "",
> val f166:String = "",
> val f167:String = "",
> val f168:String = "",
> val f169:String = 

[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-01-20 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1698#comment-1698
 ] 

Takeshi Yamamuro commented on SPARK-19217:
--

If this still makes sense and nobody else takes it, I'll do so. How about adding a 
new method for supported cast types in UserDefinedType 
([https://github.com/apache/spark/compare/master...maropu:CastUDF])?
 In this example we could cast from VectorUDT to Array and _vice versa_.
{code:java}
scala> import org.apache.spark.ml.linalg._
scala> val df1 = Seq((1, Vectors.dense(Array(1.0, 2.0, 3.0)))).toDF("a", "b")
scala> val df2 = df1.selectExpr("CAST(b AS ARRAY<DOUBLE>)")
scala> df2.printSchema
root
 |-- b: array (nullable = true)
 |    |-- element: double (containsNull = true)

scala> df2.show
+---------------+
|              b|
+---------------+
|[1.0, 2.0, 3.0]|
+---------------+

scala> val df3 = Seq((1, Seq(1.0, 2.0, 3.0))).toDF("a", "b")
scala> val df4 = df3.select(df3("b").cast(new VectorUDT()))
scala> df4.printSchema
root
 |-- b: vector (nullable = true)

scala> df4.show
+-------------+
|            b|
+-------------+
|[1.0,2.0,3.0]|
+-------------+
{code}
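
For reference, a minimal sketch of the UDF-based workaround that already works on stock Spark 2.x (spark-shell style; `vecToArray` and `b_arr` are just illustrative names, and `df1` is the DataFrame defined above):
{code:java}
// Workaround available today: convert a Vector column into an array column with a UDF.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf

// Turns an ml Vector into Array[Double].
val vecToArray = udf { v: Vector => v.toArray }
val dfWithArray = df1.withColumn("b_arr", vecToArray(df1("b")))
dfWithArray.printSchema()
{code}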

WDYT cc: [~cloud_fan]
 

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Comment Edited] (SPARK-19217) Offer easy cast from vector to array

2018-01-20 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1698#comment-1698
 ] 

Takeshi Yamamuro edited comment on SPARK-19217 at 1/21/18 5:32 AM:
---

If this still makes sense and nobody else takes it, I'll do so. How about adding a 
new method for supported cast types in UserDefinedType?
 ([https://github.com/apache/spark/compare/master...maropu:CastUDF])
 In this example we could cast from VectorUDT to Array and _vice versa_.
{code:java}
scala> import org.apache.spark.ml.linalg._
scala> val df1 = Seq((1, Vectors.dense(Array(1.0, 2.0, 3.0)))).toDF("a", "b")
scala> val df2 = df1.selectExpr("CAST(b AS ARRAY<DOUBLE>)")
scala> df2.printSchema
root
 |-- b: array (nullable = true)
 |    |-- element: double (containsNull = true)

scala> df2.show
+---------------+
|              b|
+---------------+
|[1.0, 2.0, 3.0]|
+---------------+

scala> val df3 = Seq((1, Seq(1.0, 2.0, 3.0))).toDF("a", "b")
scala> val df4 = df3.select(df3("b").cast(new VectorUDT()))
scala> df4.printSchema
root
 |-- b: vector (nullable = true)

scala> df4.show
+-------------+
|            b|
+-------------+
|[1.0,2.0,3.0]|
+-------------+
{code}

WDYT cc: [~cloud_fan]
 


was (Author: maropu):
If this still makes sense and nobody else takes it, I'll do so. How about adding a 
new method for supported cast types in UserDefinedType 
([https://github.com/apache/spark/compare/master...maropu:CastUDF])?
 In this example we could cast from VectorUDT to Array and _vice versa_.
{code:java}
scala> import org.apache.spark.ml.linalg._
scala> val df1 = Seq((1, Vectors.dense(Array(1.0, 2.0, 3.0)))).toDF("a", "b")
scala> val df2 = df1.selectExpr("CAST(b AS ARRAY<DOUBLE>)")
scala> df2.printSchema
root
 |-- b: array (nullable = true)
 |    |-- element: double (containsNull = true)

scala> df2.show
+---------------+
|              b|
+---------------+
|[1.0, 2.0, 3.0]|
+---------------+

scala> val df3 = Seq((1, Seq(1.0, 2.0, 3.0))).toDF("a", "b")
scala> val df4 = df3.select(df3("b").cast(new VectorUDT()))
scala> df4.printSchema
root
 |-- b: vector (nullable = true)

scala> df4.show
+-------------+
|            b|
+-------------+
|[1.0,2.0,3.0]|
+-------------+
{code}

WDYT cc: [~cloud_fan]
 

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Created] (SPARK-23168) Hints for fact tables and unique columns

2018-01-20 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-23168:


 Summary: Hints for fact tables and unique columns
 Key: SPARK-23168
 URL: https://issues.apache.org/jira/browse/SPARK-23168
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.1
 Environment: We already have fact table and unique column inference 
in StarSchemaDetection for decision-making queries. IMHO, in most cases, users 
already know which table is a fact table and which columns are unique. So, fact 
table and unique column hints might help these users.
For example,
{code}
scala> factTable.hint("factTable").hint("uid").join(dimTable)
{code}

Reporter: Takeshi Yamamuro









[jira] [Updated] (SPARK-23168) Hints for fact tables and unique columns

2018-01-20 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-23168:
-
Environment: (was: We already have fact table and unique column 
inference in StarSchemaDetection for decision-making queries. IMHO, in most 
cases, users already know which table is a fact table and which columns are unique. 
So, fact table and unique column hints might help these users.
For example,
{code}
scala> factTable.hint("factTable").hint("uid").join(dimTable)
{code}
)

> Hints for fact tables and unique columns
> 
>
> Key: SPARK-23168
> URL: https://issues.apache.org/jira/browse/SPARK-23168
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>







[jira] [Commented] (SPARK-23168) Hints for fact tables and unique columns

2018-01-20 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333402#comment-16333402
 ] 

Takeshi Yamamuro commented on SPARK-23168:
--

Example code is here: 
[https://github.com/apache/spark/compare/master...maropu:FactTableHintSpike] 
cc: [~r...@databricks.com] [~smilegator]

> Hints for fact tables and unique columns
> 
>
> Key: SPARK-23168
> URL: https://issues.apache.org/jira/browse/SPARK-23168
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>







[jira] [Updated] (SPARK-23168) Hints for fact tables and unique columns

2018-01-20 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-23168:
-
Description: 
We already have fact table and unique column inference in StarSchemaDetection 
for decision-making queries. IMHO, in most cases, users already know which table is 
a fact table and which columns are unique. So, fact table and unique column hints 
might help these users.
 For example,
{code:java}
scala> factTable.hint("factTable").hint("uid").join(dimTable)
{code}
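
For context, these calls would piggyback on the existing `Dataset.hint` API; today hints with unrecognized names are dropped during analysis, while broadcast-style hints are resolved. A self-contained sketch of current hint usage (table names are illustrative only; "factTable"/"uid" above are proposed names, not existing hints):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("hint-example").getOrCreate()
// Illustrative fact/dimension tables.
val factTable = spark.range(1000).selectExpr("id AS uid", "id % 7 AS v")
val dimTable  = spark.range(10).selectExpr("id AS uid", "id AS name")
// The broadcast hint is one that Spark already resolves today (Spark 2.2+).
factTable.join(dimTable.hint("broadcast"), Seq("uid")).explain(true)
{code}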

> Hints for fact tables and unique columns
> 
>
> Key: SPARK-23168
> URL: https://issues.apache.org/jira/browse/SPARK-23168
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We already have fact table and unique column inference in StarSchemaDetection 
> for decision-making queries. IMHO, in most cases, users already know which table 
> is a fact table and which columns are unique. So, fact table and unique column 
> hints might help these users.
>  For example,
> {code:java}
> scala> factTable.hint("factTable").hint("uid").join(dimTable)
> {code}






[jira] [Commented] (SPARK-23167) Update TPCDS queries from v1.4 to v2.7 (latest)

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333536#comment-16333536
 ] 

Takeshi Yamamuro commented on SPARK-23167:
--

ok, will do.

> Update TPCDS queries from v1.4 to v2.7 (latest)
> ---
>
> Key: SPARK-23167
> URL: https://issues.apache.org/jira/browse/SPARK-23167
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We currently use the TPCDS v1.4 queries 
> ([https://github.com/apache/spark/commits/master/sql/core/src/test/resources/tpcds]), 
> though the latest version is v2.7 
> ([http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp]). 
> I found that some queries differ between v1.4 and v2.7 (e.g., q4, q5, q6, ...) and 
> some queries are newly added (e.g., q10a, ...). I think it makes sense to update 
> the queries for a more correct evaluation.
> Raw generated queries from TPCDS v2.7 query templates:
>  [https://github.com/maropu/spark_tpcds_v2.7.0/tree/master/generated]
> Modified TPCDS v2.7 queries to pass TPCDSQuerySuite (e.g., replacing 
> unsupported syntaxes, + 14 days -> interval 14 days):
>  [https://github.com/apache/spark/compare/master...maropu:TPCDSV2_7]
>  






[jira] [Commented] (SPARK-23171) Reduce the time costs of the rule runs that do not change the plans

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333551#comment-16333551
 ] 

Takeshi Yamamuro commented on SPARK-23171:
--

ok, I'll check the code based on these metrics.
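
For anyone who wants to reproduce numbers like the ones below, a minimal sketch (assuming the catalyst-internal `RuleExecutor.dumpTimeSpent()` helper that produces this report is available in the build at hand):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.rules.RuleExecutor

val spark = SparkSession.builder().master("local[*]").appName("rule-metrics").getOrCreate()
import spark.implicits._

// Run some queries so the analyzer/optimizer rules accumulate timing metrics.
Seq((1, "a"), (2, "b"), (3, "a")).toDF("id", "v").groupBy("v").count().collect()

// Dump per-rule total/effective time and run counts collected so far.
println(RuleExecutor.dumpTimeSpent())
{code}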

> Reduce the time costs of the rule runs that do not change the plans 
> 
>
> Key: SPARK-23171
> URL: https://issues.apache.org/jira/browse/SPARK-23171
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Below are the time stats of the Analyzer/Optimizer rules. Try to improve the 
> rules and reduce their time costs, especially for the runs that do not change 
> the plans.
> {noformat}
> === Metrics of Analyzer/Optimizer Rules ===
> Total number of runs = 175827
> Total time: 20.699042877 seconds
> Rule                                                                             Total Time  Effective Time  Total Runs  Effective Runs
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning                           2340563794  1338268224      1875        761
> org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution                 1632672623  1625071881      788         37
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions       1395087131  347339931       1982        38
> org.apache.spark.sql.catalyst.optimizer.PruneFilters                            1177711364  21344174        1590        3
> org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries            1145135465  1131417128      285         39
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences               1008347217  663112062       1982        616
> org.apache.spark.sql.catalyst.optimizer.ReorderJoin                             767024424   693001699       1590        132
> org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability                  598524650   40802876        742         12
> org.apache.spark.sql.catalyst.analysis.DecimalPrecision                         595384169   436153128       1982        211
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery                 548178270   459695885       1982        49
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts           423002864   139869503       1982        86
> org.apache.spark.sql.catalyst.optimizer.BooleanSimplification                   405544962   17250184        1590        7
> org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin                383837603   284174662       1590        708
> org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases                  372901885   3362332         1590        9
> org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints             364628214   343815519       285         192
> org.apache.spark.sql.execution.datasources.FindDataSourceTable                  303293296   285344766       1982        233
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions                233195019   92648171        1982        294
> org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion  220568919   73932736        1982        38
> org.apache.spark.sql.catalyst.optimizer.NullPropagation                         207976072   9072305         1590        26

[jira] [Commented] (SPARK-23168) Hints for fact tables and unique columns

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333572#comment-16333572
 ] 

Takeshi Yamamuro commented on SPARK-23168:
--

ok

> Hints for fact tables and unique columns
> 
>
> Key: SPARK-23168
> URL: https://issues.apache.org/jira/browse/SPARK-23168
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We already have fact table and unique column inference in StarSchemaDetection 
> for decision-making queries. IMHO, in most cases, users already know which table 
> is a fact table and which columns are unique. So, fact table and unique column 
> hints might help these users.
>  For example,
> {code:java}
> scala> factTable.hint("factTable").hint("uid").join(dimTable)
> {code}






[jira] [Updated] (SPARK-23168) Hints for fact tables and unique columns

2018-01-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-23168:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-19842

> Hints for fact tables and unique columns
> 
>
> Key: SPARK-23168
> URL: https://issues.apache.org/jira/browse/SPARK-23168
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> We already have fact table and unique column inference in StarSchemaDetection 
> for decision-making queries. IMHO, in most cases, users already know which table 
> is a fact table and which columns are unique. So, fact table and unique column 
> hints might help these users.
>  For example,
> {code:java}
> scala> factTable.hint("factTable").hint("uid").join(dimTable)
> {code}






[jira] [Commented] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333576#comment-16333576
 ] 

Takeshi Yamamuro commented on SPARK-19842:
--

What's the status of this ticket now? Do we need to discuss the benefits of this 
work first? 
https://github.com/apache/spark/pull/18994#issuecomment-331368062

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
>Priority: Major
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.
> Link to the google doc: 
> [InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]






[jira] [Created] (SPARK-23172) Respect Project nodes in ReorderJoin

2018-01-21 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-23172:


 Summary: Respect Project nodes in ReorderJoin
 Key: SPARK-23172
 URL: https://issues.apache.org/jira/browse/SPARK-23172
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Takeshi Yamamuro


The current `ReorderJoin` optimizer rule cannot flatten a `Join -> Project -> Join` 
pattern because `ExtractFiltersAndInnerJoins` doesn't handle `Project` nodes. 
So, the current master cannot reorder the joins in the query below:
{code}
val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id % 
10 AS k2", "id AS v1")
val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)

== Analyzed Logical Plan ==
k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
bigint
Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
+- Join Inner, (k2#5L = k2#31L)
   :- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
   :  +- Join Inner, (k1#4L = k1#23L)
   :     :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
   :     :  +- Join Inner, (k0#3L = k0#15L)
   :     :     :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0L AS v1#6L]
   :     :     :  +- Range (0, 100, step=1, splits=Some(4))
   :     :     +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
   :     :        +- Range (0, 10, step=1, splits=Some(4))
   :     +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
   :        +- Range (0, 10, step=1, splits=Some(4))
   +- Project [id#28L AS k2#31L, id#28L AS v4#32L]
      +- Range (0, 10, step=1, splits=Some(4))
{code}








[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-01-21 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333959#comment-16333959
 ] 

Takeshi Yamamuro commented on SPARK-19217:
--

If we can, I think it's best to reuse `sqlType`, but IMO it's difficult to do so 
because `sqlType` just denotes the internal type of the underlying data structure. 
If `sqlType` is the type that most users want to cast UDT data to, it's totally ok. 
But, if not, we cannot tell which type we should cast the data to. `VectorUDT` is a 
good example; I think most users want to cast vectors to arrays, but its `sqlType` 
is not an array type.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Commented] (SPARK-19217) Offer easy cast from vector to array

2018-01-22 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333987#comment-16333987
 ] 

Takeshi Yamamuro commented on SPARK-19217:
--

ok, I'll reconsider this.

> Offer easy cast from vector to array
> 
>
> Key: SPARK-19217
> URL: https://issues.apache.org/jira/browse/SPARK-19217
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Working with ML often means working with DataFrames with vector columns. You 
> can't save these DataFrames to storage (edit: at least as ORC) without 
> converting the vector columns to array columns, and there doesn't appear to 
> be an easy way to make that conversion.
> This is a common enough problem that it is [documented on Stack 
> Overflow|http://stackoverflow.com/q/35855382/877069]. The current solutions 
> to making the conversion from a vector column to an array column are:
> # Convert the DataFrame to an RDD and back
> # Use a UDF
> Both approaches work fine, but it really seems like you should be able to do 
> something like this instead:
> {code}
> (le_data
> .select(
> col('features').cast('array').alias('features')
> ))
> {code}
> We already have an {{ArrayType}} in {{pyspark.sql.types}}, but it appears 
> that {{cast()}} doesn't support this conversion.
> Would this be an appropriate thing to add?






[jira] [Updated] (SPARK-23172) Expand the ReorderJoin rule to handle Project nodes

2018-01-22 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-23172:
-
Summary: Expand the ReorderJoin rule to handle Project nodes  (was: Respect 
Project nodes in ReorderJoin)

> Expand the ReorderJoin rule to handle Project nodes
> ---
>
> Key: SPARK-23172
> URL: https://issues.apache.org/jira/browse/SPARK-23172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current `ReorderJoin` optimizer rule cannot flatten a `Join -> Project -> Join` 
> pattern because `ExtractFiltersAndInnerJoins` doesn't handle `Project` nodes. 
> So, the current master cannot reorder the joins in the query below:
> {code}
> val df1 = spark.range(100).selectExpr("id % 10 AS k0", s"id % 10 AS k1", s"id 
> % 10 AS k2", "id AS v1")
> val df2 = spark.range(10).selectExpr("id AS k0", "id AS v2")
> val df3 = spark.range(10).selectExpr("id AS k1", "id AS v3")
> val df4 = spark.range(10).selectExpr("id AS k2", "id AS v4")
> df1.join(df2, "k0").join(df3, "k1").join(df4, "k2").explain(true)
> == Analyzed Logical Plan ==
> k2: bigint, k1: bigint, k0: bigint, v1: bigint, v2: bigint, v3: bigint, v4: 
> bigint
> Project [k2#5L, k1#4L, k0#3L, v1#6L, v2#16L, v3#24L, v4#32L]
> +- Join Inner, (k2#5L = k2#31L)
>    :- Project [k1#4L, k0#3L, k2#5L, v1#6L, v2#16L, v3#24L]
>    :  +- Join Inner, (k1#4L = k1#23L)
>    :     :- Project [k0#3L, k1#4L, k2#5L, v1#6L, v2#16L]
>    :     :  +- Join Inner, (k0#3L = k0#15L)
>    :     :     :- Project [(id#0L % cast(10 as bigint)) AS k0#3L, (id#0L % cast(10 as bigint)) AS k1#4L, (id#0L % cast(10 as bigint)) AS k2#5L, id#0L AS v1#6L]
>    :     :     :  +- Range (0, 100, step=1, splits=Some(4))
>    :     :     +- Project [id#12L AS k0#15L, id#12L AS v2#16L]
>    :     :        +- Range (0, 10, step=1, splits=Some(4))
>    :     +- Project [id#20L AS k1#23L, id#20L AS v3#24L]
>    :        +- Range (0, 10, step=1, splits=Some(4))
>    +- Project [id#28L AS k2#31L, id#28L AS v4#32L]
>       +- Range (0, 10, step=1, splits=Some(4))
> {code}






[jira] [Created] (SPARK-23264) Support interval values without INTERVAL clauses

2018-01-29 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-23264:


 Summary: Support interval values without INTERVAL clauses
 Key: SPARK-23264
 URL: https://issues.apache.org/jira/browse/SPARK-23264
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1
Reporter: Takeshi Yamamuro


The master currently cannot parse the SQL query below:
{code:java}
SELECT cast('2017-08-04' as date) + 1 days;
{code}
Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it 
might help to support it in Spark.
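
For comparison, the explicit INTERVAL form already parses; the proposal is to also accept the bare `1 days` form above. A quick check in spark-shell (a sketch; output omitted):
{code:java}
// Works today: interval literal with the INTERVAL keyword.
spark.sql("SELECT cast('2017-08-04' as date) + interval 1 day").show()

// Fails to parse today: interval value without the INTERVAL keyword.
// spark.sql("SELECT cast('2017-08-04' as date) + 1 days").show()
{code}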






[jira] [Commented] (SPARK-20425) Support an extended display mode to print a column data per line

2018-02-05 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16352271#comment-16352271
 ] 

Takeshi Yamamuro commented on SPARK-20425:
--

Yea, ok. If nobody takes this on, I'll make a PR.
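
For reference, a vertical display mode is available on `Dataset.show` as of Spark 2.3 (which is what this ticket's Fix Version points at); a minimal sketch in spark-shell:
{code:java}
// Vertical mode prints one column per line per record instead of a very wide table.
val df = spark.range(3).selectExpr((0 until 10).map(i => s"rand() AS c$i"): _*)
df.show(3, 0, vertical = true)
{code}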

> Support an extended display mode to print a column data per line
> 
>
> Key: SPARK-20425
> URL: https://issues.apache.org/jira/browse/SPARK-20425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.3.0
>
>
> In the master, when printing a Dataset with many columns, the readability is 
> very low, like:
> {code}
> scala> val df = spark.range(100).selectExpr((0 until 100).map(i => s"rand() 
> AS c$i"): _*)
> scala> df.show(3, 0)
> +--+--+--+---+--+--+---+--+--+--+--+---+--+--+--+---+---+---+--+--+---+--+---+--+---+---+---++---+--+---++--+--+---+---+---+--+--+---+--+--+---+---+---+--+++---+---+---+---+---+---++---+---+---+---+--+--+---+---+--+---+--+--+-+---+---+--+---+--+---+---+---+--+---+--+---+---+---+---+---+---+---+---+--+---+---+--+--+--+---+--+---+--+---+---+---+
> |c0|c1|c2|c3 
> |c4|c5|c6 |c7
> |c8|c9|c10   |c11
> |c12   |c13   |c14   |c15
> |c16|c17|c18   |c19   
> |c20|c21   |c22|c23   
> |c24|c25|c26|c27  
>|c28|c29   |c30|c31
>  |c32   |c33   |c34|c35   
>  |c36|c37   |c38   |c39   
>  |c40   |c41   |c42|c43   
>  |c44|c45   |c46 |c47 
> |c48|c49|c50|c51  
>   |c52|c53|c54 |c55   
>  |c56|c57|c58|c59 
>   |c60   |c61|c62|c63 
>   |c64|c65   |c66   |c67  
> |c68|c69|c70   |c71   
>  |c72   |c73|c74|c75  
>   |c76   |c77|c78   |c79  
>   |c80|c81|c82
> |c83|c84|c85|c86  
>   |c87   |c88|c89|c90 
>   |c91   |c92   |c93|c94  
>  |c95|c96   |c97

[jira] [Commented] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name

2018-11-19 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692546#comment-16692546
 ] 

Takeshi Yamamuro commented on SPARK-26077:
--

Thanks for reporting! It seems the master doesn't quote JDBC table names now. 
Are you interested in contributing a fix for this? I think this is a kind of 
starter issue, maybe...
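
Until the writer quotes identifiers itself, one workaround is to quote the reserved table name manually in the `dbtable` option. An untested sketch of the reporter's snippet with MySQL backticks:
{code:java}
import spark.implicits._

(spark
  .createDataset(Seq("a", "b", "c"))
  .toDF("order")
  .write
  .format("jdbc")
  .option("url", "jdbc:mysql://root@localhost:3306/test")
  .option("driver", "com.mysql.cj.jdbc.Driver")
  .option("dbtable", "`condition`")  // reserved word quoted by hand
  .save)
{code}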

> Reserved SQL words are not escaped by JDBC writer for table name
> 
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Eugene Golovan
>Priority: Major
>
> This bug is similar to SPARK-16387, but this time the table name is not escaped.
> How to reproduce:
> 1/ Start spark shell with mysql connector
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
>  
> 2/ Execute the following code
>  
> import spark.implicits._
> (spark
> .createDataset(Seq("a","b","c"))
> .toDF("order")
> .write
> .format("jdbc")
> .option("url", s"jdbc:mysql://root@localhost:3306/test")
> .option("driver", "com.mysql.cj.jdbc.Driver")
> .option("dbtable", "condition")
> .save)
>  
> , where condition is a reserved word.
>  
> Error message:
>  
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check 
> the manual that corresponds to your MySQL server version for the right syntax 
> to use near 'condition (`order` TEXT )' at line 1
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
>  at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  ... 59 elided
>  
>  
>  






[jira] [Updated] (SPARK-23264) Support interval values without INTERVAL clauses

2018-12-03 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-23264:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-26217

> Support interval values without INTERVAL clauses
> 
>
> Key: SPARK-23264
> URL: https://issues.apache.org/jira/browse/SPARK-23264
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> The master currently cannot parse the SQL query below:
> {code:java}
> SELECT cast('2017-08-04' as date) + 1 days;
> {code}
> Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it 
> might help to support it in Spark.






[jira] [Reopened] (SPARK-23264) Support interval values without INTERVAL clauses

2018-12-03 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro reopened SPARK-23264:
--

> Support interval values without INTERVAL clauses
> 
>
> Key: SPARK-23264
> URL: https://issues.apache.org/jira/browse/SPARK-23264
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> The master currently cannot parse the SQL query below:
> {code:java}
> SELECT cast('2017-08-04' as date) + 1 days;
> {code}
> Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it 
> might help to support it in Spark.






[jira] [Commented] (SPARK-23264) Support interval values without INTERVAL clauses

2018-12-03 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707006#comment-16707006
 ] 

Takeshi Yamamuro commented on SPARK-23264:
--

Since the next release version is v3.0, I reopened.

> Support interval values without INTERVAL clauses
> 
>
> Key: SPARK-23264
> URL: https://issues.apache.org/jira/browse/SPARK-23264
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
>
> The master currently cannot parse the SQL query below:
> {code:java}
> SELECT cast('2017-08-04' as date) + 1 days;
> {code}
> Since other DBMS-like systems support this syntax (e.g., Hive and MySQL), it 
> might help to support it in Spark.






[jira] [Commented] (SPARK-26215) define reserved keywords after SQL standard

2018-12-03 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707095#comment-16707095
 ] 

Takeshi Yamamuro commented on SPARK-26215:
--

Should these reserved words be handled inside SqlBase.g4? It seems PostgreSQL 
does so: 
https://github.com/postgres/postgres/blob/ee2b37ae044f34851baba69e9ba737077326414e/src/backend/parser/gram.y#L15366

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> There are 2 kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too tolerant about non-reserved keywords. A lot of 
> keywords are non-reserved, and it sometimes causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it will be better to just follow other databases or the SQL standard to 
> define reserved keywords, so that we don't need to think very hard about how 
> to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html






[jira] [Commented] (SPARK-26215) define reserved keywords after SQL standard

2018-12-03 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707099#comment-16707099
 ] 

Takeshi Yamamuro commented on SPARK-26215:
--

I found some useful documents about the reserved words:
https://developer.mimer.com/mimer-sql-standard-compliance/
https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf

> define reserved keywords after SQL standard
> ---
>
> Key: SPARK-26215
> URL: https://issues.apache.org/jira/browse/SPARK-26215
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Major
>
> There are 2 kinds of SQL keywords: reserved and non-reserved. Reserved 
> keywords can't be used as identifiers.
> In Spark SQL, we are too tolerant about non-reserved keywords. A lot of 
> keywords are non-reserved, and it sometimes causes ambiguity (IIRC we hit a 
> problem when improving the INTERVAL syntax).
> I think it will be better to just follow other databases or the SQL standard to 
> define reserved keywords, so that we don't need to think very hard about how 
> to avoid ambiguity.
> For reference: https://www.postgresql.org/docs/8.1/sql-keywords-appendix.html






[jira] [Created] (SPARK-26262) Run SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false

2018-12-03 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-26262:


 Summary: Run SQLQueryTestSuite with 
WHOLESTAGE_CODEGEN_ENABLED=false
 Key: SPARK-26262
 URL: https://issues.apache.org/jira/browse/SPARK-26262
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Takeshi Yamamuro


For better test coverage, we need to set `WHOLESTAGE_CODEGEN_ENABLED` to `false` 
for interpreted-execution tests when running `SQLQueryTestSuite`. 






[jira] [Updated] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE

2018-12-04 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26262:
-
Description: 
For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed config 
sets:
1. 


set `false` at `WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests 
when running . 

  was:For better test coverage, we need to set `false` at 
`WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests when running 
`SQLQueryTestSuite`. 


> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
> CODEGEN_FACTORY_MODE
> 
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed 
> config sets:
> 1. 
> set `false` at `WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests 
> when running . 






[jira] [Updated] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE

2018-12-04 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26262:
-
Summary: Runs SQLQueryTestSuite on mixed config sets: 
WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE  (was: Run 
SQLQueryTestSuite with WHOLESTAGE_CODEGEN_ENABLED=false)

> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
> CODEGEN_FACTORY_MODE
> 
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> For better test coverage, we need to set `WHOLESTAGE_CODEGEN_ENABLED` to `false` 
> for interpreted-execution tests when running `SQLQueryTestSuite`. 






[jira] [Updated] (SPARK-26262) Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and CODEGEN_FACTORY_MODE

2018-12-04 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26262:
-
Description: 
For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed config 
sets:
1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN
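
A sketch of exercising these four combinations ad hoc (assuming the `spark.sql.codegen.factoryMode` conf behind `CODEGEN_FACTORY_MODE` in Spark 2.4+ can be set at runtime; the suite itself would loop over the configs internally):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("codegen-matrix").getOrCreate()

for {
  wholeStage  <- Seq("true", "false")
  factoryMode <- Seq("CODEGEN_ONLY", "NO_CODEGEN")
} {
  spark.conf.set("spark.sql.codegen.wholeStage", wholeStage)
  spark.conf.set("spark.sql.codegen.factoryMode", factoryMode)
  // Any query from the suite would do; this one is just a placeholder.
  spark.range(10).selectExpr("sum(id)").collect()
}
{code}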

  was:
For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed config 
sets:
1. 


set `false` at `WHOLESTAGE_CODEGEN_ENABLED` for interpreter execution tests 
when running . 


> Runs SQLQueryTestSuite on mixed config sets: WHOLESTAGE_CODEGEN_ENABLED and 
> CODEGEN_FACTORY_MODE
> 
>
> Key: SPARK-26262
> URL: https://issues.apache.org/jira/browse/SPARK-26262
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> For better test coverage, we need to run `SQLQueryTestSuite` on 4 mixed 
> config sets:
> 1. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 2. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=CODEGEN_ONLY
> 3. WHOLESTAGE_CODEGEN_ENABLED=true, CODEGEN_FACTORY_MODE=NO_CODEGEN
> 4. WHOLESTAGE_CODEGEN_ENABLED=false, CODEGEN_FACTORY_MODE=NO_CODEGEN






[jira] [Updated] (SPARK-26182) Cost increases when optimizing scalaUDF

2018-12-05 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26182:
-
Issue Type: Improvement  (was: Bug)

> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: Jiayi Liao
>Priority: Major
>
> Let's assume that we have a UDF called splitUDF which outputs map data.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan of
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that the splitUDF is executed twice instead of once.
> The optimization is from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I was wrong 
> about this.






[jira] [Commented] (SPARK-26182) Cost increases when optimizing scalaUDF

2018-12-05 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711014#comment-16711014
 ] 

Takeshi Yamamuro commented on SPARK-26182:
--

This is expected behaviour and a known issue, e.g., 
https://issues.apache.org/jira/browse/SPARK-15282. This is not a bug because it 
doesn't affect correctness.
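
If re-running the UDF is expensive, one known workaround is to mark it non-deterministic so that `CollapseProject` keeps the two Projects separate and the UDF runs only once (a sketch for spark-shell, assuming Spark 2.3+ where `asNondeterministic` is available; note it can also block other optimizations around that expression):
{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.udf

// Same shape as the example above: a UDF that produces a map.
val splitUDF = udf { s: String => s.split(",").map(t => t -> t).toMap }.asNondeterministic()

val tbl = Seq("a,b", "c,d").toDF("x")
tbl.select(splitUDF($"x").as("g"))
   .select($"g".getItem("a"), $"g".getItem("b"))
   .explain(true)  // with a non-deterministic UDF, the plan should keep splitUDF in the inner Project only
{code}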

> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: Jiayi Liao
>Priority: Major
>
> Let's assume that we have a UDF called splitUDF which outputs map data.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan of
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that the splitUDF is executed twice instead of once.
> The optimization is from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I was wrong 
> about this.






[jira] [Updated] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.

2018-12-07 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26224:
-
Component/s: (was: Spark Core)
 SQL

> Results in stackOverFlowError when trying to add 3000 new columns using 
> withColumn function of dataframe.
> -
>
> Key: SPARK-26224
> URL: https://issues.apache.org/jira/browse/SPARK-26224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: On macbook, used Intellij editor. Ran the above sample 
> code as unit test.
>Reporter: Dorjee Tsering
>Priority: Minor
>
> Reproduction step:
> Run this sample code on your laptop. I am trying to add 3000 new columns to a 
> base dataframe with 1 column.
>  
>  
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types.{DataTypes, StructField}
> import spark.implicits._
> 
> val newColumnsToBeAdded: Seq[StructField] =
>   for (i <- 1 to 3000) yield StructField("field_" + i, DataTypes.LongType)
> val baseDataFrame: DataFrame = Seq((1)).toDF("employee_id")
> val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) =>
>   df.withColumn(newColumn.name, lit(0)))
> result.show(false)
> {code}
> Ends up with the following stacktrace:
> java.lang.StackOverflowError
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>  at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>  at scala.collection.immutable.List.map(List.scala:296)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
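
A common workaround (a sketch, not part of the original report; it assumes the 
same spark-shell session and the baseDataFrame defined above) is to add all of 
the columns in a single select, which keeps the plan shallow:

{code}
// Sketch: build the 3000 columns up front and add them with one select
// instead of 3000 nested withColumn calls.
import org.apache.spark.sql.functions.{col, lit}

val newCols = (1 to 3000).map(i => lit(0L).as("field_" + i))
val result = baseDataFrame.select(col("*") +: newCols: _*)
result.show(false)
{code}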



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-02-18 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152225#comment-15152225
 ] 

Takeshi Yamamuro commented on SPARK-13337:
--

Is it not enough to use `df1.join(df2, $"df1Key" <=> $"df2Key", "outer")` for 
your case?
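
For reference, a minimal sketch of that null-safe equality join in a Spark 2.x 
spark-shell (the frames and key names here are hypothetical, not from the ticket):

{code}
// <=> matches NULL with NULL, unlike ===
import spark.implicits._

val df1 = Seq("a", null).toDF("df1Key")
val df2 = Seq("a", null).toDF("df2Key")
df1.join(df2, $"df1Key" <=> $"df2Key", "outer").show()
{code}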

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data

2016-02-19 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15154044#comment-15154044
 ] 

Takeshi Yamamuro commented on SPARK-8000:
-

Oh, detecting formats based on file names is a simple and good approach as 
a first step.
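
Such extension-based detection could look roughly like the sketch below (this is 
illustrative only; the mapping and the fallback to the default source are 
assumptions, not Spark's actual implementation):

{code}
// Guess a data source name from the file extension.
def guessFormat(path: String): String = path.toLowerCase match {
  case p if p.endsWith(".parquet") => "parquet"
  case p if p.endsWith(".json")    => "json"
  case p if p.endsWith(".orc")     => "orc"
  case p if p.endsWith(".csv")     => "csv"
  case _                           => "parquet"  // fall back to the default
}

// e.g. spark.read.format(guessFormat(path)).load(path)
{code}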

> SQLContext.read.load() should be able to auto-detect input data
> ---
>
> Key: SPARK-8000
> URL: https://issues.apache.org/jira/browse/SPARK-8000
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> If it is a parquet file, use parquet. If it is a JSON file, use JSON. If it 
> is an ORC file, use ORC. If it is a CSV file, use CSV.
> Maybe Spark SQL can also write an output metadata file to specify the schema 
> & data source that's used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13484) Filter outer joined result using a non-nullable column from the right table

2016-02-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15167344#comment-15167344
 ] 

Takeshi Yamamuro commented on SPARK-13484:
--

I'm working on this and have almost finished.

> Filter outer joined result using a non-nullable column from the right table
> ---
>
> Key: SPARK-13484
> URL: https://issues.apache.org/jira/browse/SPARK-13484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
>Reporter: Xiangrui Meng
>
> Technically speaking, this is not a bug. But
> {code}
> val a = sqlContext.range(10).select(col("id"), lit(0).as("count"))
> val b = sqlContext.range(10).select((col("id") % 
> 3).as("id")).groupBy("id").count()
> a.join(b, a("id") === b("id"), "left_outer").filter(b("count").isNull).show()
> {code}
> returns nothing. This is because `b("count")` is not nullable and the filter 
> condition is always false by static analysis. However, it is common for users 
> to use `a(...)` and `b(...)` to filter the joined result.
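
If the intent of the snippet above is "rows of a with no match in b", a left 
anti join avoids filtering on a non-nullable right-side column altogether (a 
sketch, assuming Spark 2.0+ where the "left_anti" join type is available):

{code}
import org.apache.spark.sql.functions.{col, lit}

val a = sqlContext.range(10).select(col("id"), lit(0).as("count"))
val b = sqlContext.range(10).select((col("id") % 3).as("id")).groupBy("id").count()

// Keeps only the rows of a that have no matching id in b.
a.join(b, a("id") === b("id"), "left_anti").show()
{code}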



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13484) Filter outer joined result using a non-nullable column from the right table

2016-02-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168246#comment-15168246
 ] 

Takeshi Yamamuro commented on SPARK-13484:
--

Thanks!

> Filter outer joined result using a non-nullable column from the right table
> ---
>
> Key: SPARK-13484
> URL: https://issues.apache.org/jira/browse/SPARK-13484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0, 2.0.0
>Reporter: Xiangrui Meng
>
> Technically speaking, this is not a bug. But
> {code}
> val a = sqlContext.range(10).select(col("id"), lit(0).as("count"))
> val b = sqlContext.range(10).select((col("id") % 
> 3).as("id")).groupBy("id").count()
> a.join(b, a("id") === b("id"), "left_outer").filter(b("count").isNull).show()
> {code}
> returns nothing. This is because `b("count")` is not nullable and the filter 
> condition is always false by static analysis. However, it is common for users 
> to use `a(...)` and `b(...)` to filter the joined result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11691) Allow to specify compression codec in HadoopFsRelation when saving

2016-02-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170333#comment-15170333
 ] 

Takeshi Yamamuro commented on SPARK-11691:
--

I think it's okay to close this ticket. As [~hyukjin.kwon] said, this issue is 
totally almost resolved by his PR.
Also, making this ticket an umbrella one for compression stuff is kind of 
confusing to other developers, I think.
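
For reference, with the per-source write options added around that time, 
specifying a codec from DataFrameWriter looks roughly like this (a sketch; the 
option name and values assume the Spark 2.0+ built-in sources, and df is a 
hypothetical DataFrame):

{code}
// Compression is passed as a data source write option rather than a new API.
df.write.option("compression", "gzip").json("/tmp/out_json")
df.write.option("compression", "snappy").parquet("/tmp/out_parquet")
{code}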

> Allow to specify compression codec in HadoopFsRelation when saving 
> ---
>
> Key: SPARK-11691
> URL: https://issues.apache.org/jira/browse/SPARK-11691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> Currently, there's no way to specify a compression codec when saving a data 
> frame to HDFS. It would be nice to allow specifying a compression codec in 
> DataFrameWriter, just as we did in the RDD API:
> {code}
> def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = 
> withScope {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11691) Allow to specify compression codec in HadoopFsRelation when saving

2016-02-26 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170333#comment-15170333
 ] 

Takeshi Yamamuro edited comment on SPARK-11691 at 2/27/16 3:38 AM:
---

I think it's okay to close this ticket. As [~hyukjin.kwon] said, this issue is 
almost resolved by his PR.
Also, making this ticket an umbrella one for compression stuff is kind of 
confusing to other developers, I think.


was (Author: maropu):
I think it's okay to close this ticket. As [~hyukjin.kwon] said, this issue is 
totally almost resolved by his PR.
Also, making this ticket an umbrella one for compression stuff is kind of 
confusing to other developers, I think.

> Allow to specify compression codec in HadoopFsRelation when saving 
> ---
>
> Key: SPARK-11691
> URL: https://issues.apache.org/jira/browse/SPARK-11691
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> Currently, there's no way to specify a compression codec when saving a data 
> frame to HDFS. It would be nice to allow specifying a compression codec in 
> DataFrameWriter, just as we did in the RDD API:
> {code}
> def saveAsTextFile(path: String, codec: Class[_ <: CompressionCodec]): Unit = 
> withScope {
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13528) Make the short names of compression codecs consistent in spark

2016-02-26 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13528:


 Summary: Make the short names of compression codecs consistent in 
spark
 Key: SPARK-13528
 URL: https://issues.apache.org/jira/browse/SPARK-13528
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


Add common utility code to map short names to fully-qualified codec names.
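
A rough sketch of what such a utility could look like (the object name and the 
codec entries below are illustrative assumptions, not the actual patch):

{code}
object CompressionCodecs {
  // Map user-facing short names to fully-qualified Hadoop codec class names.
  private val shortNameToClass = Map(
    "snappy" -> "org.apache.hadoop.io.compress.SnappyCodec",
    "gzip"   -> "org.apache.hadoop.io.compress.GzipCodec",
    "lz4"    -> "org.apache.hadoop.io.compress.Lz4Codec",
    "bzip2"  -> "org.apache.hadoop.io.compress.BZip2Codec")

  def getCodecClassName(shortName: String): String =
    shortNameToClass.getOrElse(shortName.toLowerCase,
      throw new IllegalArgumentException(s"Unknown compression codec: $shortName"))
}
{code}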



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13607) Improves compression performance for integer-typed values on cache to reduce GC pressure

2016-03-01 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13607:


 Summary: Improves compression performance for integer-typed values 
on cache to reduce GC pressure
 Key: SPARK-13607
 URL: https://issues.apache.org/jira/browse/SPARK-13607
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-02 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175548#comment-15175548
 ] 

Takeshi Yamamuro commented on SPARK-13337:
--

ISTM that an interface to get TableC directly would be confusing for other 
users. Are there any real and common use cases that need this interface 
frequently?

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2016-03-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182872#comment-15182872
 ] 

Takeshi Yamamuro commented on SPARK-13656:
--

Is it okay to simply remove the code related to 
`spark.sql.parquet.cacheMetadata`?
If so, I'll take this on.
Thanks,

> Delete spark.sql.parquet.cacheMetadata
> -
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2016-03-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182949#comment-15182949
 ] 

Takeshi Yamamuro commented on SPARK-13656:
--

I looked over the related code; does this ticket mean that the Parquet metadata 
cache is always enabled, without the option?

> Delete spark.sql.parquet.cacheMetadata
> -
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13337) DataFrame join-on-columns function should support null-safe equal

2016-03-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15182962#comment-15182962
 ] 

Takeshi Yamamuro commented on SPARK-13337:
--

Oh, I got your point ;)
However, it seems that all other joins in DataFrame preserve the key columns 
from both input tables. I'm not sure it is okay to drop the column from one 
side in the output schema. 
How about making a PR and discussing it on GitHub if it is easy to fix?
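
What is being asked for could be sketched as follows (hypothetical frames and 
key names; uses the drop(Column) overload of recent Spark versions):

{code}
// Null-safe join, then drop the right-side key so only one key column remains
// in the output, similar to join(right, usingColumns).
val joined = df1.join(df2, df1("key") <=> df2("key"), "outer")
  .drop(df2("key"))
joined.printSchema()
{code}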

> DataFrame join-on-columns function should support null-safe equal
> -
>
> Key: SPARK-13337
> URL: https://issues.apache.org/jira/browse/SPARK-13337
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Zhong Wang
>Priority: Minor
>
> Currently, the join-on-columns function:
> {code}
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
> {code}
> performs a null-unsafe join. It would be great if there were an option for a 
> null-safe join.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2016-03-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184110#comment-15184110
 ] 

Takeshi Yamamuro commented on SPARK-13656:
--

Okay, I'll make a PR in a day.

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2016-03-22 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207941#comment-15207941
 ] 

Takeshi Yamamuro commented on SPARK-13656:
--

[~yhuai] I looked over the code, and basically it'd be better to simply remove 
the cache functionality in ParquetRelation and integrate it into HDFSFileCatalog.
Anyway, I think this issue should be kept unresolved so it's easy to track, though? 

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2016-03-22 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207941#comment-15207941
 ] 

Takeshi Yamamuro edited comment on SPARK-13656 at 3/23/16 6:11 AM:
---

[~yhuai] I looked over the code, and basically it'd be better to simply remove 
the cache functionality in ParquetRelation and integrate it into HDFSFileCatalog.
Anyway, I think this issue should be kept unresolved so it's easy to track, 
thoughts? 


was (Author: maropu):
[~yhuai] I looked over the code, and basically it'd be better to simply remove 
the cache functionality in ParquetRelation and integrate it into HDFSFileCatalog.
Anyway, I think this issue should be kept unresolved so it's easy to track, though? 

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6747) Support List<> as a return type in Hive UDF

2015-06-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Affects Version/s: 1.4.0

> Support List<> as a return type in Hive UDF
> ---
>
> Key: SPARK-6747
> URL: https://issues.apache.org/jira/browse/SPARK-6747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>  Labels: 1.5.0
>
> The current implementation can't handle List<> as a return type in Hive UDF.
> We assume a UDF below:
> public class UDFToListString extends UDF {
> public List evaluate(Object o) {
> return Arrays.asList("xxx", "yyy", "zzz");
> }
> }
> A scala.MatchError exception is thrown as follows when the UDF is used:
> scala.MatchError: interface java.util.List (of class java.lang.Class)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
>   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
> ...
> To fix this problem, we need to add an entry for List<> in 
> HiveInspectors#javaClassToDataType.
> However, there is one difficulty because of type erasure in the JVM.
> We assume that the lines below are appended to HiveInspectors#javaClassToDataType:
> // list type
> case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
> val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
> println(tpe.getActualTypeArguments()(0).toString()) => 'E'
> This logic fails to catch the component type of List<>.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6747) Support List<> as a return type in Hive UDF

2015-06-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Labels: 1.5.0  (was: )

> Support List<> as a return type in Hive UDF
> ---
>
> Key: SPARK-6747
> URL: https://issues.apache.org/jira/browse/SPARK-6747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>  Labels: 1.5.0
>
> The current implementation can't handle List<> as a return type in Hive UDF.
> We assume a UDF below:
> public class UDFToListString extends UDF {
> public List evaluate(Object o) {
> return Arrays.asList("xxx", "yyy", "zzz");
> }
> }
> A scala.MatchError exception is thrown as follows when the UDF is used:
> scala.MatchError: interface java.util.List (of class java.lang.Class)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
>   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
> ...
> To fix this problem, we need to add an entry for List<> in 
> HiveInspectors#javaClassToDataType.
> However, there is one difficulty because of type erasure in the JVM.
> We assume that the lines below are appended to HiveInspectors#javaClassToDataType:
> // list type
> case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
> val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
> println(tpe.getActualTypeArguments()(0).toString()) => 'E'
> This logic fails to catch the component type of List<>.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6747) Support List<> as a return type in Hive UDF

2015-06-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Labels:   (was: 1.5.0)

> Support List<> as a return type in Hive UDF
> ---
>
> Key: SPARK-6747
> URL: https://issues.apache.org/jira/browse/SPARK-6747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> The current implementation can't handle List<> as a return type in Hive UDF.
> We assume a UDF below:
> public class UDFToListString extends UDF {
> public List evaluate(Object o) {
> return Arrays.asList("xxx", "yyy", "zzz");
> }
> }
> A scala.MatchError exception is thrown as follows when the UDF is used:
> scala.MatchError: interface java.util.List (of class java.lang.Class)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
>   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
> ...
> To fix this problem, we need to add an entry for List<> in 
> HiveInspectors#javaClassToDataType.
> However, there is one difficulty because of type erasure in the JVM.
> We assume that the lines below are appended to HiveInspectors#javaClassToDataType:
> // list type
> case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
> val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
> println(tpe.getActualTypeArguments()(0).toString()) => 'E'
> This logic fails to catch the component type of List<>.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6912) Support Map as a return type in Hive UDF

2015-06-24 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6912:

Affects Version/s: 1.4.0

> Support Map as a return type in Hive UDF
> -
>
> Key: SPARK-6912
> URL: https://issues.apache.org/jira/browse/SPARK-6912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> The current implementation can't handle Map as a return type in Hive 
> UDF. 
> We assume a UDF below:
> public class UDFToIntIntMap extends UDF {
> public Map evaluate(Object o);
> }
> Hive supports this type; see the links below for details:
> https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163
> https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6747) Throw an AnalysisException when unsupported Java list types used in Hive UDF

2015-07-06 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Summary: Throw an AnalysisException when unsupported Java list types used 
in Hive UDF  (was: Support List<> as a return type in Hive UDF)

> Throw an AnalysisException when unsupported Java list types used in Hive UDF
> 
>
> Key: SPARK-6747
> URL: https://issues.apache.org/jira/browse/SPARK-6747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> The current implementation can't handle List<> as a return type in Hive UDF.
> We assume a UDF below:
> public class UDFToListString extends UDF {
> public List evaluate(Object o) {
> return Arrays.asList("xxx", "yyy", "zzz");
> }
> }
> A scala.MatchError exception is thrown as follows when the UDF is used:
> scala.MatchError: interface java.util.List (of class java.lang.Class)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
>   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
>   at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
>   at 
> scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
> ...
> To fix this problem, we need to add an entry for List<> in 
> HiveInspectors#javaClassToDataType.
> However, there is one difficulty because of type erasure in the JVM.
> We assume that the lines below are appended to HiveInspectors#javaClassToDataType:
> // list type
> case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
> val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
> println(tpe.getActualTypeArguments()(0).toString()) => 'E'
> This logic fails to catch the component type of List<>.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6747) Throw an AnalysisException when unsupported Java list types used in Hive UDF

2015-07-06 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6747:

Description: 
The current implementation can't handle List<> as a return type in Hive UDF and
throws a meaningless MatchError.
We assume a UDF below:

public class UDFToListString extends UDF {
public List evaluate(Object o) {
return Arrays.asList("xxx", "yyy", "zzz");
}
}

A scala.MatchError exception is thrown as follows when the UDF is used:

scala.MatchError: interface java.util.List (of class java.lang.Class)
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...

To make this easier for UDF developers to understand, we need to throw a more 
suitable exception.

  was:
The current implementation can't handle List<> as a return type in Hive UDF.
We assume a UDF below:

public class UDFToListString extends UDF {
public List evaluate(Object o) {
return Arrays.asList("xxx", "yyy", "zzz");
}
}

A scala.MatchError exception is thrown as follows when the UDF is used:

scala.MatchError: interface java.util.List (of class java.lang.Class)
at 
org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
at 
org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
at 
org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
at 
org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:94)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at 
scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
...

To fix this problem, we need to add an entry for List<> in 
HiveInspectors#javaClassToDataType.
However, there is one difficulty because of type erasure in the JVM.
We assume that the lines below are appended to HiveInspectors#javaClassToDataType:

// list type
case c: Class[_] if c == classOf[java.util.List[java.lang.Object]] =>
val tpe = c.getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
println(tpe.getActualTypeArguments()(0).toString()) => 'E'

This logic fails to catch the component type of List<>.
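
The erasure problem described above can be seen directly in a plain Scala 
session (illustration only, not Spark code):

{code}
import java.lang.reflect.ParameterizedType

// Both parameterizations erase to the same runtime Class, so the element
// type (String vs. Integer) is not recoverable from the Class object.
val a: Class[_] = classOf[java.util.List[String]]
val b: Class[_] = classOf[java.util.List[Integer]]
println(a == b)  // true

// Only the type variable E survives in the generic interface information.
val tpe = classOf[java.util.ArrayList[String]]
  .getGenericInterfaces()(0).asInstanceOf[ParameterizedType]
println(tpe.getActualTypeArguments()(0))  // prints: E
{code}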


> Throw an AnalysisException when unsupported Java list types used in Hive UDF
> 
>
> Key: SPARK-6747
> URL: https://issues.apache.org/jira/browse/SPARK-6747
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> The current implementation can't handle List<> as a return type in Hive UDF 
> and throws a meaningless MatchError.
> We assume a UDF below:
> public class UDFToListString extends UDF {
> public List evaluate(Object o) {
> return Arrays.asList("xxx", "yyy", "zzz");
> }
> }
> A scala.MatchError exception is thrown as follows when the UDF is used:
> scala.MatchError: interface java.util.List (of class java.lang.Class)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:174)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.javaClassToDataType(hiveUdfs.scala:76)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUdf.dataType$lzycompute(hiveUdfs.scala:106)
>   at org.apache.spark.sql.hive.HiveSimpleUdf.dataType(hiveUdfs.scala:106)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:131)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOperation$$anonfun$collectAliases$1.applyOrElse(patterns.scala:95)
>   at 
> org.apache.spark.sql.catalyst.planning.PhysicalOpera

[jira] [Updated] (SPARK-6912) Throw an AnalysisException when unsupported Java Map types used in Hive UDF

2015-07-07 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-6912:

Summary: Throw an AnalysisException when unsupported Java Map types 
used in Hive UDF  (was: Support Map as a return type in Hive UDF)

> Throw an AnalysisException when unsupported Java Map types used in Hive 
> UDF
> 
>
> Key: SPARK-6912
> URL: https://issues.apache.org/jira/browse/SPARK-6912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> The current implementation can't handle Map as a return type in Hive 
> UDF. 
> We assume a UDF below:
> public class UDFToIntIntMap extends UDF {
> public Map evaluate(Object o);
> }
> Hive supports this type; see the links below for details:
> https://github.com/kyluka/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBridge.java#L163
> https://github.com/l1x/apache-hive/blob/master/hive-0.8.1/src/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java#L118



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8930) Support a star '*' in the generator function argument

2015-07-08 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-8930:
---

 Summary: Support a star '*' in the generator function argument
 Key: SPARK-8930
 URL: https://issues.apache.org/jira/browse/SPARK-8930
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro


The current implementation throws an exception if generators contain a star '*' 
like the code below:

val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
:: Row("2", "4", "2:4")
:: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
"7,8,9", "3:9")
:: Nil
)

[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
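
Until the star is supported, a possible workaround is to list the columns 
explicitly (a sketch against the 1.4-era explode API; it reuses the df defined 
above):

{code}
import org.apache.spark.sql.Row

// Same logic as above, but with the input columns spelled out instead of '*'.
df.explode($"prefix", $"csv") { case Row(prefix: String, csv: String) =>
  csv.split(",").map(v => Tuple1(prefix + ":" + v))
}.show()
{code}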



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8930) Support a star '*' in generator function arguments

2015-07-08 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Summary: Support a star '*' in generator function arguments  (was: Support 
a star '*' in the generator function argument)

> Support a star '*' in generator function arguments
> --
>
> Key: SPARK-8930
> URL: https://issues.apache.org/jira/browse/SPARK-8930
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> The current implementation throws an exception if generators contain a star 
> '*' like the code below:
> val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", 
> "csv")
> checkAnswer(
>   df.explode($"*") { case Row(prefix: String, csv: String) =>
> csv.split(",").map(v => Tuple1(prefix + ":" + v))
>   },
>   Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
> :: Row("2", "4", "2:4")
> :: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
> "7,8,9", "3:9")
> :: Nil
> )
> [info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
> input columns prefix, csv;
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
> 21)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> [info]   at scala.collection.immutable.List.foreach(List.scala:318)
> [info]   at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> [info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8955) Replace a duplicated initialize() in HiveGenericUDTF with new one

2015-07-09 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-8955:
---

 Summary: Replace a duplicated initialize() in HiveGenericUDTF with 
new one
 Key: SPARK-8955
 URL: https://issues.apache.org/jira/browse/SPARK-8955
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro


HiveGenericUDTF#initialize(ObjectInspector[] argOIs) in v0.13.1 is duplicated, 
so it needs to be replaced with a new one. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8930) Support a star '*' in generator function arguments

2015-07-09 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Description: 
The current implementation throws an exception if generators contain a star '*' 
like the code below:

val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
:: Row("2", "4", "2:4")
:: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
"7,8,9", "3:9")
:: Nil
)

```
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
```

  was:
The current implementation throws an exception if generators contain a star '*' 
like the code below:

val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
:: Row("2", "4", "2:4")
:: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
"7,8,9", "3:9")
:: Nil
)

[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)


> Support a star '*' in generator function arguments
> --

[jira] [Updated] (SPARK-8930) Support a star '*' in generator function arguments

2015-07-09 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Description: 
The current implementation throws an exception if generators contain a star '*' 
like the code below:

val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
:: Row("2", "4", "2:4")
:: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
"7,8,9", "3:9")
:: Nil
)


{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}

  was:
The current implementation throws an exception if generators contain a star '*' 
like the code below:

val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
:: Row("2", "4", "2:4")
:: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
"7,8,9", "3:9")
:: Nil
)

```
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
```


> Support a star '*' in generator function arguments
> ---

[jira] [Updated] (SPARK-8930) Support a star '*' in generator function arguments

2015-07-09 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-8930:

Description: 
The current implementation throws an exception if generators contain a star '*' 
like the code below:

{code}
val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
:: Row("2", "4", "2:4")
:: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
"7,8,9", "3:9")
:: Nil
)
{code}

{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:1
21)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}

  was:
The current implementation throws an exception if generators contain a star '*' 
like the code below:

val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")
checkAnswer(
  df.explode($"*") { case Row(prefix: String, csv: String) =>
csv.split(",").map(v => Tuple1(prefix + ":" + v))
  },
  Row("1", "1,2", "1:1") :: Row("1", "1,2", "1:2")
:: Row("2", "4", "2:4")
:: Row("3", "7,8,9", "3:7") :: Row("3", "7,8,9", "3:8") :: Row("3", 
"7,8,9", "3:9")
:: Nil
)


{code}
[info] - explode takes UnresolvedStar *** FAILED *** (14 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: cannot resolve '_1' given 
input columns prefix, csv;
[info]   at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:55)
[info]   at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:291)
[info]   at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
[info]   at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:290)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
[info]   at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
[info]   at scala.collection.immutable.List.foreach(List.scala:318)
[info]   at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
[info]   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
{code}


> Support a star '*' in generator function arguments
> ---
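
For reference, here is a minimal sketch of the case described above, using the (now-deprecated) DataFrame#explode API in a spark-shell style session; the column names and data mirror the example in the description, and the star form is exactly what this ticket adds.

{code}
// Minimal sketch (assumes a 2.x spark-shell session and import spark.implicits._).
import org.apache.spark.sql.Row

val df = Seq(("1", "1,2"), ("2", "4"), ("3", "7,8,9")).toDF("prefix", "csv")

// Works today: every input column is listed explicitly.
val exploded = df.explode($"prefix", $"csv") { case Row(prefix: String, csv: String) =>
  csv.split(",").map(v => Tuple1(prefix + ":" + v))
}
exploded.show()

// What this ticket enables: the same call with a star argument.
// df.explode($"*") { case Row(prefix: String, csv: String) => ... }
{code}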

[jira] [Created] (SPARK-9034) Reflect field names defined in GenericUDTF

2015-07-14 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-9034:
---

 Summary: Reflect field names defined in GenericUDTF
 Key: SPARK-9034
 URL: https://issues.apache.org/jira/browse/SPARK-9034
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Takeshi Yamamuro


Although GenericUDTF#initialize() defines field names in the returned schema,
the current HiveGenericUDTF drops these names.
We might need to reflect these names in the logical plan tree.
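
For readers unfamiliar with the Hive side, a short sketch of where these field names come from is below (Hive 1.x-era GenericUDTF API assumed; this is an illustration, not code from this ticket): the names are declared by the UDTF itself in initialize(), and they are what should be surfaced in Spark's plan instead of being dropped.

{code}
// A minimal GenericUDTF sketch; the field name "word" declared in initialize()
// is the kind of name this ticket proposes to reflect in the logical plan.
import java.util.Arrays
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, ObjectInspectorFactory, StructObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class ExplodeCsvUDTF extends GenericUDTF {
  override def initialize(argOIs: Array[ObjectInspector]): StructObjectInspector =
    ObjectInspectorFactory.getStandardStructObjectInspector(
      Arrays.asList("word"),
      Arrays.asList(PrimitiveObjectInspectorFactory.javaStringObjectInspector: ObjectInspector))

  // Split a comma-separated string and emit one row per token.
  override def process(args: Array[AnyRef]): Unit =
    args(0).toString.split(",").foreach(w => forward(Array[AnyRef](w)))

  override def close(): Unit = {}
}
{code}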



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9034) Reflect field names defined in GenericUDTF

2015-07-14 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-9034:

Description: 
Hive GenericUDTF#initialize() defines field names in a returned schema though,
the current HiveGenericUDTF drops these names.
We might need to reflect these in a logical plan tree.

  was:
GenericUDTF#initialize() in Hive defines field names in a returned schema 
though,
the current HiveGenericUDTF drops these names.
We might need to reflect these in a logical plan tree.


> Reflect field names defined in GenericUDTF
> --
>
> Key: SPARK-9034
> URL: https://issues.apache.org/jira/browse/SPARK-9034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> Hive GenericUDTF#initialize() defines field names in a returned schema though,
> the current HiveGenericUDTF drops these names.
> We might need to reflect these in a logical plan tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9034) Reflect field names defined in GenericUDTF

2015-07-14 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-9034:

Description: 
GenericUDTF#initialize() in Hive defines field names in a returned schema 
though,
the current HiveGenericUDTF drops these names.
We might need to reflect these in a logical plan tree.

  was:
GenericUDTF#initialize() defines field names in a returned schema though,
the current HiveGenericUDTF drops these names.
We might need to reflect these in a logical plan tree.


> Reflect field names defined in GenericUDTF
> --
>
> Key: SPARK-9034
> URL: https://issues.apache.org/jira/browse/SPARK-9034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> GenericUDTF#initialize() in Hive defines field names in a returned schema 
> though,
> the current HiveGenericUDTF drops these names.
> We might need to reflect these in a logical plan tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9034) Reflect field names defined in GenericUDTF

2015-07-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626279#comment-14626279
 ] 

Takeshi Yamamuro commented on SPARK-9034:
-

I'll make a PR for this after SPARK-8955 and SPARK-8930 are resolved.

> Reflect field names defined in GenericUDTF
> --
>
> Key: SPARK-9034
> URL: https://issues.apache.org/jira/browse/SPARK-9034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Takeshi Yamamuro
>
> Hive GenericUDTF#initialize() defines field names in a returned schema though,
> the current HiveGenericUDTF drops these names.
> We might need to reflect these in a logical plan tree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join

2015-11-13 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005111#comment-15005111
 ] 

Takeshi Yamamuro commented on SPARK-11704:
--

ISTM that some earlier stages in rdd2 are skipped in all the iterations except 
the first one when rdd2 comes from a ShuffleRDD.
That said, it is still worth doing this optimization.

> Optimize the Cartesian Join
> ---
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
>  y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
> which is really heavy and may never finish if n is large, especially when 
> rdd2 comes from a ShuffleRDD.
> We should have some optimization on CartesianProduct by caching rightResults. 
> The problem is that we don’t have cleanup hook to unpersist rightResults 
> AFAIK. I think we should have some cleanup hook after query execution.
> With the hook available, we can easily optimize such Cartesian join. I 
> believe such cleanup hook may also benefit other query optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11704) Optimize the Cartesian Join

2015-11-13 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005217#comment-15005217
 ] 

Takeshi Yamamuro commented on SPARK-11704:
--

You're right; they're not automatically cached.
I just mean that the earlier stages of rdd2 are skipped and the iterator only 
fetches blocks from the remote BlockManager (the blocks are written by the ShuffleRDD). 
Do you mean that fetching remote blocks is too slow?

Anyway, adding a cleanup hook could have a big impact on the SparkPlan interfaces.
As an alternative idea, how about caching rdd2 in unsafe space with logic similar 
to that of UnsafeExternalSorter?
We can release the space by using TaskContext#addTaskCompletionListener.

If you have no time, I'm okay with taking this on.
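
A rough sketch of the alternative mentioned above is below; it is plain Scala rather than the actual CartesianRDD code, and it simply buffers the right-hand partition on the heap (a real implementation would spill to unsafe or disk storage as UnsafeExternalSorter does), releasing it through the task-completion hook.

{code}
// Sketch only: materialize rdd2's partition once per task instead of recomputing
// it for every row of rdd1, and free the buffer when the task completes.
import org.apache.spark.TaskContext
import scala.collection.mutable.ArrayBuffer

def cartesianIterator[T, U](
    leftIter: Iterator[T],
    rightIter: Iterator[U],
    context: TaskContext): Iterator[(T, U)] = {
  val rightRows = new ArrayBuffer[U]()
  rightIter.foreach(rightRows += _)
  // Cleanup hook: release the buffered rows once the task finishes.
  context.addTaskCompletionListener { _ => rightRows.clear() }
  for (x <- leftIter; y <- rightRows.iterator) yield (x, y)
}
{code}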

> Optimize the Cartesian Join
> ---
>
> Key: SPARK-11704
> URL: https://issues.apache.org/jira/browse/SPARK-11704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Zhan Zhang
>
> Currently CartesianProduct relies on RDD.cartesian, in which the computation 
> is realized as follows
>   override def compute(split: Partition, context: TaskContext): Iterator[(T, 
> U)] = {
> val currSplit = split.asInstanceOf[CartesianPartition]
> for (x <- rdd1.iterator(currSplit.s1, context);
>  y <- rdd2.iterator(currSplit.s2, context)) yield (x, y)
>   }
> From the above loop, if rdd1.count is n, rdd2 needs to be recomputed n times, 
> which is really heavy and may never finish if n is large, especially when 
> rdd2 comes from a ShuffleRDD.
> We should have some optimization on CartesianProduct by caching rightResults. 
> The problem is that we don’t have cleanup hook to unpersist rightResults 
> AFAIK. I think we should have some cleanup hook after query execution.
> With the hook available, we can easily optimize such Cartesian join. I 
> believe such cleanup hook may also benefit other query optimizations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6521) Bypass network shuffle read if both endpoints are local

2015-11-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033016#comment-15033016
 ] 

Takeshi Yamamuro commented on SPARK-6521:
-

Performance of the current Spark heavily depends on CPU, so this shuffle 
optimization has little effect on that ( benchmark results could be found in 
pullreq #9478). For a while, this ticket needs not to be considered and

> Bypass network shuffle read if both endpoints are local
> ---
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: xukun
>
> In the past, executors read other executors' shuffle files on the same node over 
> the network. This PR makes executors on the same node read the local shuffle file 
> in sort-based shuffle. It will reduce network transport.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6521) Bypass network shuffle read if both endpoints are local

2015-12-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033016#comment-15033016
 ] 

Takeshi Yamamuro edited comment on SPARK-6521 at 12/15/15 9:23 AM:
---

Performance of the current Spark heavily depends on CPU, so this shuffle 
optimization has little effect on that ( benchmark results could be found in 
pullreq #9478). For a while, this ticket needs not to be considered.


was (Author: maropu):
Performance of the current Spark heavily depends on CPU, so this shuffle 
optimization has little effect on that ( benchmark results could be found in 
pullreq #9478). For a while, this ticket needs not to be considered and

> Bypass network shuffle read if both endpoints are local
> ---
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: xukun
>
> In the past, executors read other executors' shuffle files on the same node over 
> the network. This PR makes executors on the same node read the local shuffle file 
> in sort-based shuffle. It will reduce network transport.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12392) Optimize a location order of broadcast blocks by considering preferred local hosts

2015-12-16 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12392:


 Summary: Optimize a location order of broadcast blocks by 
considering preferred local hosts
 Key: SPARK-12392
 URL: https://issues.apache.org/jira/browse/SPARK-12392
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: Takeshi Yamamuro


When multiple workers exist on a host, we can bypass unnecessary remote access 
for broadcasts; block managers can fetch broadcast blocks from the same host 
instead of from remote hosts.
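
A simplified sketch of the proposed ordering is below; it is plain Scala for illustration only (the real change lives in the block manager), and it assumes "host:port" location strings purely as a placeholder format.

{code}
// Sketch: prefer broadcast block locations on the local host, then remote hosts.
import scala.util.Random

def preferLocalHost(locations: Seq[String], localHost: String): Seq[String] = {
  val (local, remote) = locations.partition(_.startsWith(localHost + ":"))
  Random.shuffle(local) ++ Random.shuffle(remote)
}

// preferLocalHost(Seq("hostB:50010", "hostA:50010", "hostA:50011"), "hostA")
// returns the two hostA locations (in some order) before the hostB one.
{code}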



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12446) Add unit tests for JDBCRDD internal functions

2015-12-20 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12446:


 Summary: Add unit tests for JDBCRDD internal functions
 Key: SPARK-12446
 URL: https://issues.apache.org/jira/browse/SPARK-12446
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.5.2
Reporter: Takeshi Yamamuro


No tests exist for JDBCRDD#compileFilter.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter

2015-12-21 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12476:


 Summary: Implement JdbcRelation#unhandledFilters for removing 
unnecessary Spark Fileter
 Key: SPARK-12476
 URL: https://issues.apache.org/jira/browse/SPARK-12476
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takeshi Yamamuro


Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

```
Current plan:
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
```

This patch enables a plan below;
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
```
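
For context, unhandledFilters is the data source API hook (available since Spark 1.6) that this ticket implements for JDBCRelation; the sketch below is a generic illustration of the contract, not the actual JDBCRelation change.

{code}
// A relation reports which filters it cannot fully evaluate; Spark keeps a
// separate Filter operator only for those, which is what removes the redundant
// Spark-side Filter from the physical plan above.
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

trait FilterPushdownRelation extends BaseRelation {
  // Filters this source can evaluate completely on its own, e.g. by compiling
  // them into a JDBC WHERE clause.
  private def handledBySource(f: Filter): Boolean = f match {
    case EqualTo(_, _) => true
    case _             => false
  }

  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(handledBySource)
}
{code}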






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter

2015-12-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12476:
-
Description: 
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

{code}
Current plan:
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}

This patch enables a plan below;
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
```




  was:
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

```
Current plan:
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
```

This patch enables a plan below;
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
```





> Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter
> --
>
> Key: SPARK-12476
> URL: https://issues.apache.org/jira/browse/SPARK-12476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>
> Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
> {code}
> Current plan:
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}
> This patch enables a plan below;
> ```
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> ```



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter

2015-12-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12476:
-
Description: 
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

Current plan:
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}

This patch enables a plan below;
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}



  was:
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

{code}
Current plan:
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}

This patch enables a plan below;
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}




> Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
> -
>
> Key: SPARK-12476
> URL: https://issues.apache.org/jira/browse/SPARK-12476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>
> Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
> Current plan:
> {code}
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}
> This patch enables a plan below;
> {code}
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter

2015-12-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12476:
-
Summary: Implement JdbcRelation#unhandledFilters for removing unnecessary 
Spark Filter  (was: Implement JdbcRelation#unhandledFilters for removing 
unnecessary Spark Fileter)

> Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
> -
>
> Key: SPARK-12476
> URL: https://issues.apache.org/jira/browse/SPARK-12476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>
> Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
> {code}
> Current plan:
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}
> This patch enables a plan below;
> {code}
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter

2015-12-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12476:
-
Description: 
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

{code}
Current plan:
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}

This patch enables a plan below;
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}



  was:
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

{code}
Current plan:
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}

This patch enables a plan below;
```
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
```





> Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Fileter
> --
>
> Key: SPARK-12476
> URL: https://issues.apache.org/jira/browse/SPARK-12476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>
> Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
> {code}
> Current plan:
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}
> This patch enables a plan below;
> {code}
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12476) Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter

2015-12-21 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12476:
-
Description: 
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

Current plan:
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}

This patch enables a plan below;
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}



  was:
Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'

Current plan:
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}

This patch enables a plan below;
{code}
== Optimized Logical Plan ==
Project [col0#0,col1#1]
+- Filter (col0#0 = xxx)
   +- Relation[col0#0,col1#1] 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})

== Physical Plan ==
+- Filter (col0#0 = xxx)
   +- Scan 
JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
 password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
[EqualTo(col0,xxx)]
{code}




> Implement JdbcRelation#unhandledFilters for removing unnecessary Spark Filter
> -
>
> Key: SPARK-12476
> URL: https://issues.apache.org/jira/browse/SPARK-12476
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takeshi Yamamuro
>
> Input: SELECT * FROM jdbcTable WHERE col0 = 'xxx'
> Current plan:
> {code}
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> +- Filter (col0#0 = xxx)
>+- Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}
> This patch enables a plan below;
> {code}
> == Optimized Logical Plan ==
> Project [col0#0,col1#1]
> +- Filter (col0#0 = xxx)
>+- Relation[col0#0,col1#1] 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})
> == Physical Plan ==
> Scan 
> JDBCRelation(jdbc:postgresql:postgres,testRel,[Lorg.apache.spark.Partition;@2ac7c683,{user=maropu,
>  password=, driver=org.postgresql.Driver})[col0#0,col1#1] PushedFilters: 
> [EqualTo(col0,xxx)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5753) add basic support to JDBCRDD for postgresql types: uuid, hstore, and array

2015-12-24 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15070822#comment-15070822
 ] 

Takeshi Yamamuro commented on SPARK-5753:
-

[~rxin] This ticket should be closed because it has already been fixed in 
SPARK-10186.

> add basic support to JDBCRDD for postgresql types: uuid, hstore, and array
> --
>
> Key: SPARK-5753
> URL: https://issues.apache.org/jira/browse/SPARK-5753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Ricky Nguyen
>
> I recently saw the new JDBCRDD merged into master. Thanks for that, it works 
> pretty well and is really convenient.
> It would be nice if it could have basic support for a few more types.
> * uuid (as StringType)
> * hstore (as MapType). keys and values are both strings.
> * array (as ArrayType)
> I have a patch that gets started in this direction. Not sure where or how to 
> write/run tests, but I ran manual tests in spark-shell against my postgres db.
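
For reference, the mechanism that SPARK-10186 used for these mappings is the JdbcDialect hook shown below; the dialect object is hypothetical and only illustrates how uuid, hstore, and array columns can be mapped to Catalyst types.

{code}
// Hypothetical dialect sketch; Spark's bundled PostgresDialect is the real one.
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types._

object ExamplePostgresDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    typeName match {
      case "uuid"                      => Some(StringType)
      case "hstore"                    => Some(MapType(StringType, StringType))
      case _ if sqlType == Types.ARRAY => Some(ArrayType(StringType))
      case _                           => None
    }
}

// JdbcDialects.registerDialect(ExamplePostgresDialect)
{code}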



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12581) Support case-sensitive table names in postgresql

2015-12-30 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12581:


 Summary: Support case-sensitive table names in postgresql
 Key: SPARK-12581
 URL: https://issues.apache.org/jira/browse/SPARK-12581
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
Reporter: Takeshi Yamamuro


Table names in postgresql are case-insensitive by default.
To support case-sensitive table names, we need to wrap the names in double quotes.
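
A small sketch of the quoting is below; quoteIdentifier is the JdbcDialect hook intended for this, and the dialect object here is hypothetical, shown only to illustrate the double-quote wrapping.

{code}
// Sketch: wrap identifiers in double quotes so postgresql preserves their case;
// embedded double quotes are escaped by doubling them.
import org.apache.spark.sql.jdbc.JdbcDialect

object QuotedIdentifierDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  override def quoteIdentifier(colName: String): String =
    "\"" + colName.replace("\"", "\"\"") + "\""
}
{code}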



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12686) Support group-by push down into data sources

2016-01-06 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12686:


 Summary: Support group-by push down into data sources
 Key: SPARK-12686
 URL: https://issues.apache.org/jira/browse/SPARK-12686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


For logical plan trees like 'Aggregate -> Project -> (Filter) -> Scan', we can 
push partial aggregation processing down into data sources that can aggregate 
their own data efficiently; for example, Orc/Parquet can fetch MIN/MAX values 
from their statistics data, and some databases have efficient aggregation 
implementations.
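
Until such a push down exists, the effect can be approximated by hand against a JDBC source with a pre-aggregating subquery; the sketch below is an illustration only (the URL, table name, and properties are placeholders, and a 2.x spark-shell session is assumed; use sqlContext.read on 1.6).

{code}
import java.util.Properties
import org.apache.spark.sql.functions.max

val url = "jdbc:postgresql:postgres"     // placeholder
val props = new Properties()             // placeholder connection properties

// Today: raw rows are fetched and Spark aggregates them itself.
val sparkSide = spark.read.jdbc(url, "testRel", props)
  .groupBy("col0").agg(max("col1"))

// Manual approximation of a group-by push down: the source pre-aggregates and
// Spark only sees the already-aggregated rows.
val pushedDown = spark.read.jdbc(
  url, "(SELECT col0, MAX(col1) AS max_col1 FROM testRel GROUP BY col0) t", props)
{code}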




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-01-25 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12978:


 Summary: Skip unnecessary final group-by when input data already 
clustered with group-by keys
 Key: SPARK-12978
 URL: https://issues.apache.org/jira/browse/SPARK-12978
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


This pr enables the optimization below;

Without opt.:

== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None

With opt.:

== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None
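
A sketch of a case where the optimization applies is below (2.x spark-shell style assumed; use sqlContext on 1.6): the repartition call is what makes the input "already clustered" by the group-by key, so the partial aggregate and exchange in the first plan become unnecessary.

{code}
import spark.implicits._
import org.apache.spark.sql.functions.{avg, sum}

val df = spark.range(0, 1000000L)
  .selectExpr("id % 10 AS col0", "id AS col1", "id * 2 AS col2")
  .repartition($"col0")   // input is now hash-partitioned by the group-by key
  .cache()

// With the optimization, this can run as a single final aggregate on top of the
// existing partitioning instead of Partial -> Exchange -> Final.
df.groupBy("col0").agg(sum("col1"), avg("col2")).explain()
{code}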




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-01-25 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12978:
-
Description: 
This pr enables the optimization below;

Without opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None

With opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None


  was:
This pr enables the optimization below;

Without opt.:

== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None

With opt.:

== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None



> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This pr enables the optimization below;
> Without opt.:
> ```
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> With opt.:
> ```
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-01-25 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12978:
-
Description: 
This pr enables the optimization to skip an unnecessary group-by operation 
below;

Without opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None
{code}

With opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None
{code}


  was:
This pr enables the optimization below;

Without opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None
{code}

With opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None
{code}



> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This pr enables the optimization to skip an unnecessary group-by operation 
> below;
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-01-25 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12978:
-
Description: 
This pr enables the optimization below;

Without opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None
{code}

With opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None
{code}


  was:
This pr enables the optimization below;

Without opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None

With opt.:
```
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None



> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This pr enables the optimization below;
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-01-25 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12978:
-
Description: 
This ticket targets the optimization to skip an unnecessary group-by operation 
below;

Without opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None
{code}

With opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None
{code}


  was:
This pr enables the optimization to skip an unnecessary group-by operation 
below;

Without opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
 output=[col0#159,sum#200,sum#201,count#202L])
   +- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
InMemoryRelation [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, 
true, false, true, 1), ConvertToUnsafe, None
{code}

With opt.:
{code}
== Physical Plan ==
TungstenAggregate(key=[col0#159], 
functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
 output=[col0#159,sum(col1)#177,avg(col2)#178])
+- TungstenExchange hashpartitioning(col0#159,200), None
  +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
[col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
true, 1), ConvertToUnsafe, None
{code}



> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket targets the optimization to skip an unnecessary group-by 
> operation below;
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115076#comment-15115076
 ] 

Takeshi Yamamuro commented on SPARK-12890:
--

I looked over the related code; it seems that the partition pruning optimization 
itself has already been implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).
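
For concreteness, a small sketch of the kind of query in question (the path and 
the partition column below are hypothetical):

{code}
// Hypothetical layout: /data/events/date=2016-01-01/part-*.parquet, /data/events/date=2016-01-02/...
val df = sqlContext.read.parquet("/data/events")  // "date" becomes a partition column via partition discovery
df.registerTempTable("events")

// Ideally this would be answered from the discovered partition values alone,
// without scanning the Parquet files themselves.
sqlContext.sql("SELECT max(date) FROM events").show()
{code}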

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115076#comment-15115076
 ] 

Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 11:39 AM:


I looked over the related code; the partition pruning optimization itself has been 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).


was (Author: maropu):
I looked over the related code; it seems that the partition pruning optimization 
itself has already been implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115076#comment-15115076
 ] 

Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 1:32 PM:
---

I looked over the related code; the partition pruning optimization itself has been 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrameReader#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).


was (Author: maropu):
I looked over the related code; the partition pruning optimization itself has been 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115257#comment-15115257
 ] 

Takeshi Yamamuro commented on SPARK-12890:
--

Ah, I see.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Created] (SPARK-12995) Remove deprecate APIs from Pregel

2016-01-25 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12995:


 Summary: Remove deprecate APIs from Pregel
 Key: SPARK-12995
 URL: https://issues.apache.org/jira/browse/SPARK-12995
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro









[jira] [Updated] (SPARK-12995) Remove deprecate APIs from Pregel

2016-01-25 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12995:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11806

> Remove deprecate APIs from Pregel
> -
>
> Key: SPARK-12995
> URL: https://issues.apache.org/jira/browse/SPARK-12995
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>







[jira] [Created] (SPARK-13057) Add benchmark codes and the performance results for implemented compression schemes for InMemoryRelation

2016-01-27 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13057:


 Summary: Add benchmark codes and the performance results for 
implemented compression schemes for InMemoryRelation
 Key: SPARK-13057
 URL: https://issues.apache.org/jira/browse/SPARK-13057
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


This ticket adds benchmark code for in-memory cache compression to make future 
development and discussion smoother.
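
As a rough illustration of the kind of measurement meant here (a hand-rolled 
sketch, not the benchmark code added by this ticket; it only toggles the existing 
spark.sql.inMemoryColumnarStorage.compressed setting):

{code}
// Cache a DataFrame with columnar compression enabled and time a scan over it.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

val df = sqlContext.range(0, 10000000L)
  .selectExpr("id % 1024 AS lowCard", "id AS highCard")
df.cache().count()  // materialize the in-memory columnar cache

val start = System.nanoTime()
df.selectExpr("sum(lowCard)", "sum(highCard)").collect()
println(s"scan took ${(System.nanoTime() - start) / 1e6} ms")
{code}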






[jira] [Created] (SPARK-13158) Show the information of broadcast blocks in WebUI

2016-02-03 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13158:


 Summary: Show the information of broadcast blocks in WebUI
 Key: SPARK-13158
 URL: https://issues.apache.org/jira/browse/SPARK-13158
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro


This ticket targets a function to show information about broadcast blocks in the 
Web UI: the number of blocks and their total size in memory/on disk across a cluster.






[jira] [Created] (SPARK-13361) Add benchmark codes for Encoder#compress() in CompressionSchemeBenchmark

2016-02-17 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-13361:


 Summary: Add benchmark codes for Encoder#compress() in 
CompressionSchemeBenchmark
 Key: SPARK-13361
 URL: https://issues.apache.org/jira/browse/SPARK-13361
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro









[jira] [Commented] (SPARK-8000) SQLContext.read.load() should be able to auto-detect input data

2016-02-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150348#comment-15150348
 ] 

Takeshi Yamamuro commented on SPARK-8000:
-

Does this ticket cover files exported by other systems such as Impala?
I'm not sure how to automatically detect these kinds of unknown files in Spark.
One idea: try to read a format-specific header (magic bytes) and detect the 
format in ResolvedDataSource.
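
As a rough sketch of that idea (plain JVM code rather than an existing Spark API; 
the helper and its fallback are hypothetical, though the Parquet and ORC magic 
bytes are well known):

{code}
import java.io.{DataInputStream, FileInputStream}

// Peek at the first bytes of a file and guess its format from well-known magic numbers.
def sniffFormat(path: String): String = {
  val in = new DataInputStream(new FileInputStream(path))
  try {
    val head = new Array[Byte](4)
    in.readFully(head)  // throws EOFException for files shorter than 4 bytes
    val s = new String(head, "US-ASCII")
    if (s.startsWith("PAR1")) "parquet"                                // Parquet files start with "PAR1"
    else if (s.startsWith("ORC")) "orc"                                // ORC files start with "ORC"
    else if (s.trim.startsWith("{") || s.trim.startsWith("[")) "json"  // probably JSON records
    else "text"                                                        // unknown; fall back to text/CSV handling
  } finally {
    in.close()
  }
}
{code}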

> SQLContext.read.load() should be able to auto-detect input data
> ---
>
> Key: SPARK-8000
> URL: https://issues.apache.org/jira/browse/SPARK-8000
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> If it is a parquet file, use parquet. If it is a JSON file, use JSON. If it 
> is an ORC file, use ORC. If it is a CSV file, use CSV.
> Maybe Spark SQL can also write an output metadata file to specify the schema 
> & data source that's used.






[jira] [Commented] (SPARK-12449) Pushing down arbitrary logical plans to data sources

2016-02-17 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150372#comment-15150372
 ] 

Takeshi Yamamuro commented on SPARK-12449:
--

I agree, though there are many JIRA tickets (12506, 12126, 12686, 9182, 10195, 
...) related to this topic, and it is hard to redesign the data source code to 
satisfy all the requirements...

> Pushing down arbitrary logical plans to data sources
> 
>
> Key: SPARK-12449
> URL: https://issues.apache.org/jira/browse/SPARK-12449
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Stephan Kessler
> Attachments: pushingDownLogicalPlans.pdf
>
>
> With the help of the DataSource API we can pull data from external sources 
> for processing. Implementing interfaces such as {{PrunedFilteredScan}} allows 
> filters and projections to be pushed down, pruning unnecessary fields and rows 
> directly in the data source.
> However, data sources such as SQL Engines are capable of doing even more 
> preprocessing, e.g., evaluating aggregates. This is beneficial because it 
> would reduce the amount of data transferred from the source to Spark. The 
> existing interfaces do not allow such kind of processing in the source.
> We would propose to add a new interface {{CatalystSource}} that allows to 
> defer the processing of arbitrary logical plans to the data source. We have 
> already shown the details at the Spark Summit 2015 Europe 
> [https://spark-summit.org/eu-2015/events/the-pushdown-of-everything/]
> I will add a design document explaining details. 
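
To make the proposal concrete, a rough sketch of what a {{CatalystSource}}-style 
interface might look like (an illustration only, not the actual signatures from 
the attached design document):

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Hypothetical: a relation implementing this would receive the (sub)plan that
// Spark wants to push down and return the already-processed rows.
trait CatalystSource {
  /** Whether the source can evaluate the given logical plan by itself. */
  def supportsLogicalPlan(plan: LogicalPlan): Boolean

  /** Execute the pushed-down plan inside the source and return the result rows. */
  def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
{code}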






[jira] [Commented] (SPARK-13656) Delete spark.sql.parquet.cacheMetadata

2016-03-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209561#comment-15209561
 ] 

Takeshi Yamamuro commented on SPARK-13656:
--

Okay, I'll make a new ticket with a basic idea later.

> Delete spark.sql.parquet.cacheMetadata
> --
>
> Key: SPARK-13656
> URL: https://issues.apache.org/jira/browse/SPARK-13656
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Looks like spark.sql.parquet.cacheMetadata is not used anymore. Let's delete 
> it to avoid any potential confusion.






[jira] [Created] (SPARK-14193) Skip unnecessary sorts if input data have been already ordered in InMemoryRelation

2016-03-28 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-14193:


 Summary: Skip unnecessary sorts if input data have been already 
ordered in InMemoryRelation
 Key: SPARK-14193
 URL: https://issues.apache.org/jira/browse/SPARK-14193
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.1
Reporter: Takeshi Yamamuro


This PR is to skip unnecessary sorts if the input data has already been ordered in 
InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

```
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
```

If you say `df2.sort("a")`, the current spark generates a plan like;
```
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

```
This PR removes this unnecessary sort.







[jira] [Updated] (SPARK-14193) Skip unnecessary sorts if input data have been already ordered in InMemoryRelation

2016-03-28 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-14193:
-
Description: 
This ticket is to skip unnecessary sorts if the input data has already been 
ordered in InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

```
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
```

If you say `df2.sort("a")`, the current spark generates a plan like;
```
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

```
This PR removes this unnecessary sort.


  was:
This PR is to skip unnecessary sorts if the input data has already been ordered in 
InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

```
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
```

If you say `df2.sort("a")`, the current spark generates a plan like;
```
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

```
This PR removes this unnecessary sort.



> Skip unnecessary sorts if input data have been already ordered in 
> InMemoryRelation
> --
>
> Key: SPARK-14193
> URL: https://issues.apache.org/jira/browse/SPARK-14193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>
> This ticket is to skip unnecessary sorts if the input data has already been 
> ordered in InMemoryTable.
> Let's say we have a cached table with column 'a' sorted;
> ```
> val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
> val df2 = df1.sort("a").cache
> df2.show // just cache data
> ```
> If you say `df2.sort("a")`, the current spark generates a plan like;
> ```
> == Physical Plan ==
> Sort [a#13 ASC], true, 0
> +- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
> 1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, 
> None
> ```
> This PR removes this unnecessary sort.






[jira] [Updated] (SPARK-14193) Skip unnecessary sorts if input data have been already ordered in InMemoryRelation

2016-03-28 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-14193:
-
Description: 
This ticket is to skip unnecessary sorts if the input data has already been 
ordered in InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

{code}
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
{code}

If you say `df2.sort("a")`, the current spark generates a plan like;
{code}
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

{code}
This PR removes this unnecessary sort.


  was:
This ticket is to skip unnecessary sorts if the input data has already been 
ordered in InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

```
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
```

If you say `df2.sort("a")`, the current spark generates a plan like;
```
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

```
This PR removes this unnecessary sort.



> Skip unnecessary sorts if input data have been already ordered in 
> InMemoryRelation
> --
>
> Key: SPARK-14193
> URL: https://issues.apache.org/jira/browse/SPARK-14193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>
> This ticket is to skip unnecessary sorts if the input data has already been 
> ordered in InMemoryTable.
> Let's say we have a cached table with column 'a' sorted;
> {code}
> val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
> val df2 = df1.sort("a").cache
> df2.show // just cache data
> {code}
> If you say `df2.sort("a")`, the current spark generates a plan like;
> {code}
> == Physical Plan ==
> Sort [a#13 ASC], true, 0
> +- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
> 1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, 
> None
> {code}
> This PR removes this unnecessary sort.






[jira] [Updated] (SPARK-14193) Skip unnecessary sorts if input data have been already ordered in InMemoryRelation

2016-03-28 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-14193:
-
Description: 
This ticket is to skip unnecessary sorts if the input data has already been 
ordered in InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

{code}
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
{code}

If you say `df2.sort("a")`, the current spark generates a plan like;
{code}
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

{code}
This ticket targets removing this unnecessary sort.


  was:
This ticket is to skip unnecessary sorts if the input data has already been 
ordered in InMemoryTable.
Let's say we have a cached table with column 'a' sorted;

{code}
val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
val df2 = df1.sort("a").cache
df2.show // just cache data
{code}

If you say `df2.sort("a")`, the current spark generates a plan like;
{code}
== Physical Plan ==
Sort [a#13 ASC], true, 0
+- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, None

{code}
This PR removes this unnecessary sort.



> Skip unnecessary sorts if input data have been already ordered in 
> InMemoryRelation
> --
>
> Key: SPARK-14193
> URL: https://issues.apache.org/jira/browse/SPARK-14193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Takeshi Yamamuro
>
> This ticket is to skip unnecessary sorts if the input data has already been 
> ordered in InMemoryTable.
> Let's say we have a cached table with column 'a' sorted;
> {code}
> val df1 = Seq((1, 0), (3, 0), (2, 0), (1, 0)).toDF("a", "b")
> val df2 = df1.sort("a").cache
> df2.show // just cache data
> {code}
> If you say `df2.sort("a")`, the current spark generates a plan like;
> {code}
> == Physical Plan ==
> Sort [a#13 ASC], true, 0
> +- InMemoryColumnarTableScan [a#13,b#14], InMemoryRelation [a#13,b#14], true, 
> 1, StorageLevel(true, true, false, true, 1), Sort [a#13 ASC], true, 0, 
> None
> {code}
> This ticket targets removing this unnecessary sort.






[jira] [Commented] (SPARK-14260) Increase default value for maxCharsPerColumn

2016-03-30 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217664#comment-15217664
 ] 

Takeshi Yamamuro commented on SPARK-14260:
--

In addition to the performance issue, I think a bigger value could make the CSV 
parser process very long columns that users never intended to read, e.g., due to 
bugs in the input data. Would it be a bad idea to instead make the exception 
message more readable for users, e.g., "you have to set a bigger value for 
'maxCharsPerColumn'", when this kind of error happens?


> Increase default value for maxCharsPerColumn
> 
>
> Key: SPARK-14260
> URL: https://issues.apache.org/jira/browse/SPARK-14260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> I guess the default value of the option {{maxCharsPerColumn}} looks 
> relatively small, 1000000 characters meaning 976KB.
> It looks like some users have run into this problem and ended up setting the 
> value manually.
> https://github.com/databricks/spark-csv/issues/295
> https://issues.apache.org/jira/browse/SPARK-14103
> According to [univocity 
> API|http://docs.univocity.com/parsers/2.0.0/com/univocity/parsers/common/CommonSettings.html#setMaxCharsPerColumn(int)],
>  this exists to avoid {{OutOfMemoryErrors}}.
> If this does not harm performance, then I think it would be better to make 
> the default value much bigger (e.g., 10MB or 100MB) so that users do not have 
> to care about the length of each field in a CSV file.
> Apparently Apache CSV Parser does not have such limits.





