[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898855#comment-15898855
 ] 

Nick Pentreath commented on SPARK-14409:


[~josephkb] the proposed input schema above encompasses that - the {{labelCol}} 
is the true relevance score (rating, confidence, etc.), while the 
{{predictionCol}} is the predicted relevance (rating, confidence, etc.). Note 
that we can name these columns something more specific ({{labelCol}} and 
{{predictionCol}} are really just re-used from the other evaluators).

This also allows "weighted" forms of ranking metrics later (e.g. some metrics 
can incorporate the true relevance score into the computation, which serves as 
a form of weighting) - the metrics we currently have don't do that. So for now 
the true relevance serves as a filter - for example, when computing a ranking 
metric for recommendation, we *don't* want to include negative ratings in the 
"ground truth set of relevant documents" - hence the {{goodThreshold}} param 
above (I would rather call it something like {{relevanceThreshold}} myself).

*Note* that there are 2 formats I detail in my comment above - the first is 
the actual schema of the {{DataFrame}} used as input to the 
{{RankingEvaluator}} - this must therefore be the schema of the DF output by 
{{model.transform}} (whether that is ALS for recommendation, a logistic 
regression for ad prediction, or whatever).

The second format mentioned is simply illustrating the *intermediate internal 
transformation* that the evaluator will do in the {{evaluate}} method. You can 
see a rough draft of it in Danilo's PR 
[here|https://github.com/apache/spark/pull/16618/files#diff-0345c4cb1878d3bb0d84297202fdc95fR93].
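
To make that intermediate transformation concrete, here is a rough sketch (not 
the actual implementation in the PR) of what {{evaluate}} could do internally, 
assuming an input {{DataFrame}} whose first four columns are (user, item, 
label, prediction) in that order, plus the relevance-threshold param discussed 
above:

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.DataFrame

// Hypothetical sketch: group the transformed DF per user, build the ground-truth
// set from rows whose true relevance exceeds the threshold, rank items by
// predicted relevance, and hand both arrays to mllib's RankingMetrics.
def evaluateMAP(df: DataFrame, relevanceThreshold: Double, k: Int): Double = {
  val perUser = df.rdd
    .map(r => (r.getInt(0), (r.getInt(1), r.getDouble(2), r.getDouble(3))))
    .groupByKey()
    .map { case (_, rows) =>
      val actual = rows.filter(_._2 > relevanceThreshold).map(_._1).toArray
      val predicted = rows.toSeq.sortBy(-_._3).map(_._1).take(k).toArray
      (predicted, actual)
    }
  new RankingMetrics(perUser).meanAveragePrecision
}
{code}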

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18549:


Assignee: Apache Spark

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> // So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18549:


Assignee: (was: Apache Spark)

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Priority: Critical
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> // So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898834#comment-15898834
 ] 

Apache Spark commented on SPARK-18549:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17097

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Priority: Critical
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> // So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18549:


Assignee: (was: Apache Spark)

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Priority: Critical
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> // So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18549:


Assignee: Apache Spark

> Failed to Uncache a View that References a Dropped Table.
> -
>
> Key: SPARK-18549
> URL: https://issues.apache.org/jira/browse/SPARK-18549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Critical
>
> {code}
>   spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1")
>   spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2")
>   sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2")
>   // Cache is empty at the beginning
>   assert(spark.sharedState.cacheManager.isEmpty)
>   sql("CACHE TABLE testView")
>   assert(spark.catalog.isCached("testView"))
>   // Cache is not empty
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}
> {code}
>   // drop a table referenced by a cached view
>   sql("DROP TABLE jt1")
> // So far everything is fine
>   // Failed to uncache the view
>   val e = intercept[AnalysisException] {
> sql("UNCACHE TABLE testView")
>   }.getMessage
>   assert(e.contains("Table or view not found: `default`.`jt1`"))
>   // We are unable to drop it from the cache
>   assert(!spark.sharedState.cacheManager.isEmpty)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19832:
---

Assignee: Song Jun

> DynamicPartitionWriteTask should escape the partition name 
> ---
>
> Key: SPARK-19832
> URL: https://issues.apache.org/jira/browse/SPARK-19832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> Currently in DynamicPartitionWriteTask, when we build the partitionPath of a 
> partition, we only escape the partition value, not the partition name.
> This causes problems for special partition names, for example:
> 1) If the partition name contains '%' etc., two partition paths are created in 
> the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like 
> '/path/a%b=1'. The inserted data is stored in the unescaped path, while SHOW 
> PARTITIONS returns the escaped name 'a%25b=1', so the two are inconsistent. The 
> data should be stored in the escaped path in the filesystem, which is also what 
> Hive 2.0.0 does.
> 2) If the partition name contains ':', new Path("/path", "a:b") throws an 
> exception, because a colon in a relative path is illegal.
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: a:b
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:88)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 50 more
> {code}
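
A minimal sketch of the direction this fix implies, assuming the existing 
{{ExternalCatalogUtils.escapePathName}} helper: escape the partition name as 
well as the value when building the dynamic-partition directory, so that a 
name like 'a%b' yields 'a%25b=1' and a name containing ':' no longer produces 
an invalid {{Path}}:

{code}
import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName

// Build the directory name for one dynamic partition, escaping both sides
// of the name=value pair (today only the value is escaped).
def partitionDirName(partitionName: String, partitionValue: String): String =
  s"${escapePathName(partitionName)}=${escapePathName(partitionValue)}"

// e.g. partitionDirName("a%b", "1")  => "a%25b=1"
//      partitionDirName("a:b", "1")  => "a%3Ab=1"
{code}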



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19832) DynamicPartitionWriteTask should escape the partition name

2017-03-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19832.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17173
[https://github.com/apache/spark/pull/17173]

> DynamicPartitionWriteTask should escape the partition name 
> ---
>
> Key: SPARK-19832
> URL: https://issues.apache.org/jira/browse/SPARK-19832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
> Fix For: 2.2.0
>
>
> Currently in DynamicPartitionWriteTask, when we build the partitionPath of a 
> partition, we only escape the partition value, not the partition name.
> This causes problems for special partition names, for example:
> 1) If the partition name contains '%' etc., two partition paths are created in 
> the filesystem: an escaped path like '/path/a%25b=1' and an unescaped path like 
> '/path/a%b=1'. The inserted data is stored in the unescaped path, while SHOW 
> PARTITIONS returns the escaped name 'a%25b=1', so the two are inconsistent. The 
> data should be stored in the escaped path in the filesystem, which is also what 
> Hive 2.0.0 does.
> 2) If the partition name contains ':', new Path("/path", "a:b") throws an 
> exception, because a colon in a relative path is illegal.
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: a:b
>   at org.apache.hadoop.fs.Path.initialize(Path.java:205)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:171)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:88)
>   ... 48 elided
> Caused by: java.net.URISyntaxException: Relative path in absolute URI: a:b
>   at java.net.URI.checkPath(URI.java:1823)
>   at java.net.URI.<init>(URI.java:745)
>   at org.apache.hadoop.fs.Path.initialize(Path.java:202)
>   ... 50 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19847) port hive read to FileFormat API

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898824#comment-15898824
 ] 

Apache Spark commented on SPARK-19847:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17187

> port hive read to FileFormat API
> 
>
> Key: SPARK-19847
> URL: https://issues.apache.org/jira/browse/SPARK-19847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19847) port hive read to FileFormat API

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19847:


Assignee: Apache Spark  (was: Wenchen Fan)

> port hive read to FileFormat API
> 
>
> Key: SPARK-19847
> URL: https://issues.apache.org/jira/browse/SPARK-19847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2017-03-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898823#comment-15898823
 ] 

Joseph K. Bradley commented on SPARK-14409:
---

Thanks [~nick.pentre...@gmail.com]!  I like this general approach.  A few 
initial thoughts:

Schema for evaluator:
* Some evaluators will take rating or confidence values as well.  Will those be 
appended as extra columns?
* If a recommendation model like ALSModel returns top K recommendations for 
each user, that will not fit the RankingEvaluator input.  Do you plan to have 
RankingEvaluator or CrossValidator handle efficient calculation of top K 
recommendations?
* Relatedly, I'll comment on the schema in 
[https://github.com/apache/spark/pull/17090] directly in that PR in case we 
want to make changes in a quick follow-up.
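
Regarding the second point, one possible (hypothetical) way to bridge the two 
shapes would be to flatten the model's top-K output into the proposed 
per-(user, item) schema before evaluation. The sketch below assumes a 
recommend-all method like {{recommendForAllUsers(k)}} and the default 
user/item column names; none of this is the actual agreed API:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Turn (user, recommendations: array<struct<item, rating>>) into one row per
// (user, item, prediction), then attach the held-out true relevance as `label`.
def flattenTopK(topK: DataFrame, heldOut: DataFrame): DataFrame = {
  topK
    .select(col("user"), explode(col("recommendations")).as("rec"))
    .select(col("user"), col("rec.item").as("item"), col("rec.rating").as("prediction"))
    .join(heldOut, Seq("user", "item"), "outer")   // heldOut: (user, item, label)
}

// val evalDF = flattenTopK(alsModel.recommendForAllUsers(k), heldOutRatings)
{code}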

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19847) port hive read to FileFormat API

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19847:


Assignee: Wenchen Fan  (was: Apache Spark)

> port hive read to FileFormat API
> 
>
> Key: SPARK-19847
> URL: https://issues.apache.org/jira/browse/SPARK-19847
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19847) port hive read to FileFormat API

2017-03-06 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19847:
---

 Summary: port hive read to FileFormat API
 Key: SPARK-19847
 URL: https://issues.apache.org/jira/browse/SPARK-19847
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19818) rbind should check for name consistency of input data frames

2017-03-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19818.
--
  Resolution: Fixed
Assignee: Wayne Zhang
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

> rbind should check for name consistency of input data frames
> 
>
> Key: SPARK-19818
> URL: https://issues.apache.org/jira/browse/SPARK-19818
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.2.0
>
>
> The current implementation accepts data frames with different schemas. See 
> issues below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = 
> c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19818) rbind should check for name consistency of input data frames

2017-03-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19818:
-
Labels: releasenotes  (was: )

> rbind should check for name consistency of input data frames
> 
>
> Key: SPARK-19818
> URL: https://issues.apache.org/jira/browse/SPARK-19818
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>  Labels: releasenotes
>
> The current implementation accepts data frames with different schemas. See 
> issues below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = 
> c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19818) rbind should check for name consistency of input data frames

2017-03-06 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19818:
-
Summary: rbind should check for name consistency of input data frames  
(was: SparkR union should check for name consistency of input data frames )

> rbind should check for name consistency of input data frames
> 
>
> Key: SPARK-19818
> URL: https://issues.apache.org/jira/browse/SPARK-19818
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Wayne Zhang
>Priority: Minor
>
> The current implementation accepts data frames with different schemas. See 
> issues below:
> {code}
> df <- createDataFrame(data.frame(name = c("Michael", "Andy", "Justin"), age = 
> c(1, 30, 19)))
> union(df, df[, c(2, 1)])
>      name     age
> 1 Michael     1.0
> 2    Andy    30.0
> 3  Justin    19.0
> 4     1.0 Michael
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19350) Cardinality estimation of Limit and Sample

2017-03-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19350:
---

Assignee: Zhenhua Wang

> Cardinality estimation of Limit and Sample
> --
>
> Key: SPARK-19350
> URL: https://issues.apache.org/jira/browse/SPARK-19350
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Currently, LocalLimit/GlobalLimit/Sample propagate the same row count and 
> column stats as their child, which is incorrect.
> We can compute the correct rowCount in Statistics for Limit/Sample whether CBO 
> is enabled or not, and column stats should not be propagated because we don't 
> know the distribution of columns after a Limit or Sample.
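
As a rough illustration of the intended estimates (a sketch with made-up 
names, not the actual Spark code): Limit can only cap the child's row count 
and Sample scales it by the sampling fraction, while per-column stats are 
dropped because the post-Limit/Sample value distribution is unknown:

{code}
// Simplified stand-in for plan statistics; the names are illustrative only.
case class EstimatedStats(rowCount: BigInt, sizeInBytes: BigInt)

def limitStats(child: EstimatedStats, limit: Long, avgRowSize: Long): EstimatedStats = {
  val rows = child.rowCount.min(BigInt(limit))                 // Limit never increases rows
  EstimatedStats(rows, (rows * avgRowSize).max(1))
}

def sampleStats(child: EstimatedStats, fraction: Double, avgRowSize: Long): EstimatedStats = {
  val rows = (BigDecimal(child.rowCount) * fraction).toBigInt  // expected rows after sampling
  EstimatedStats(rows, (rows * avgRowSize).max(1))
}
{code}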



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19350) Cardinality estimation of Limit and Sample

2017-03-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19350.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Cardinality estimation of Limit and Sample
> --
>
> Key: SPARK-19350
> URL: https://issues.apache.org/jira/browse/SPARK-19350
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.0
>
>
> Currently, LocalLimit/GlobalLimit/Sample propagate the same row count and 
> column stats as their child, which is incorrect.
> We can compute the correct rowCount in Statistics for Limit/Sample whether CBO 
> is enabled or not, and column stats should not be propagated because we don't 
> know the distribution of columns after a Limit or Sample.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19846) Add a flag to disable constraint propagation

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19846:


Assignee: (was: Apache Spark)

> Add a flag to disable constraint propagation
> 
>
> Key: SPARK-19846
> URL: https://issues.apache.org/jira/browse/SPARK-19846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> Constraint propagation can be computationally expensive and can block driver 
> execution for a long time. For example, the benchmark below takes 30 minutes.
> Compared with other attempts to change how constraint propagation works, this 
> is a much simpler option: add a flag to disable constraint propagation.
> {code}
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, 
> VectorAssembler}
> spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, 
> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> val indexers = df.columns.tail.map(c => new StringIndexer()
>   .setInputCol(c)
>   .setOutputCol(s"${c}_indexed")
>   .setHandleInvalid("skip"))
> val encoders = indexers.map(indexer => new OneHotEncoder()
>   .setInputCol(indexer.getOutputCol)
>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>   .setDropLast(true))
> val stages: Array[PipelineStage] = indexers ++ encoders
> val pipeline = new Pipeline().setStages(stages)
> val startTime = System.nanoTime
> pipeline.fit(df).transform(df).show
> val runningTime = System.nanoTime - startTime
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19846) Add a flag to disable constraint propagation

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19846:


Assignee: Apache Spark

> Add a flag to disable constraint propagation
> 
>
> Key: SPARK-19846
> URL: https://issues.apache.org/jira/browse/SPARK-19846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Constraint propagation can be computationally expensive and can block driver 
> execution for a long time. For example, the benchmark below takes 30 minutes.
> Compared with other attempts to change how constraint propagation works, this 
> is a much simpler option: add a flag to disable constraint propagation.
> {code}
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, 
> VectorAssembler}
> spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, 
> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> val indexers = df.columns.tail.map(c => new StringIndexer()
>   .setInputCol(c)
>   .setOutputCol(s"${c}_indexed")
>   .setHandleInvalid("skip"))
> val encoders = indexers.map(indexer => new OneHotEncoder()
>   .setInputCol(indexer.getOutputCol)
>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>   .setDropLast(true))
> val stages: Array[PipelineStage] = indexers ++ encoders
> val pipeline = new Pipeline().setStages(stages)
> val startTime = System.nanoTime
> pipeline.fit(df).transform(df).show
> val runningTime = System.nanoTime - startTime
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19846) Add a flag to disable constraint propagation

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898774#comment-15898774
 ] 

Apache Spark commented on SPARK-19846:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/17186

> Add a flag to disable constraint propagation
> 
>
> Key: SPARK-19846
> URL: https://issues.apache.org/jira/browse/SPARK-19846
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Liang-Chi Hsieh
>
> Constraint propagation can be computationally expensive and can block driver 
> execution for a long time. For example, the benchmark below takes 30 minutes.
> Compared with other attempts to change how constraint propagation works, this 
> is a much simpler option: add a flag to disable constraint propagation.
> {code}
> import org.apache.spark.ml.{Pipeline, PipelineStage}
> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, 
> VectorAssembler}
> spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)
> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, 
> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
> val indexers = df.columns.tail.map(c => new StringIndexer()
>   .setInputCol(c)
>   .setOutputCol(s"${c}_indexed")
>   .setHandleInvalid("skip"))
> val encoders = indexers.map(indexer => new OneHotEncoder()
>   .setInputCol(indexer.getOutputCol)
>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>   .setDropLast(true))
> val stages: Array[PipelineStage] = indexers ++ encoders
> val pipeline = new Pipeline().setStages(stages)
> val startTime = System.nanoTime
> pipeline.fit(df).transform(df).show
> val runningTime = System.nanoTime - startTime
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19846) Add a flag to disable constraint propagation

2017-03-06 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-19846:
---

 Summary: Add a flag to disable constraint propagation
 Key: SPARK-19846
 URL: https://issues.apache.org/jira/browse/SPARK-19846
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Liang-Chi Hsieh


Constraint propagation can be computationally expensive and can block driver 
execution for a long time. For example, the benchmark below takes 30 minutes.

Compared with other attempts to change how constraint propagation works, this 
is a much simpler option: add a flag to disable constraint propagation.

{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, 
VectorAssembler}

spark.conf.set(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key, false)

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, 
"baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))

val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val stages: Array[PipelineStage] = indexers ++ encoders
val pipeline = new Pipeline().setStages(stages)

val startTime = System.nanoTime
pipeline.fit(df).transform(df).show
val runningTime = System.nanoTime - startTime
{code}
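
For illustration only (the names below are assumptions, not the actual Spark 
internals), the flag would essentially gate the expensive constraint 
computation, returning an empty constraint set when propagation is disabled:

{code}
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical sketch of how a plan node could consult such a flag.
trait ConstraintsWithFlag {
  def constraintPropagationEnabled: Boolean             // would come from SQLConf
  protected def computeConstraints(): Set[Expression]   // the expensive part

  lazy val constraints: Set[Expression] =
    if (constraintPropagationEnabled) computeConstraints() else Set.empty
}
{code}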



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db name>.<table name>.<column name>)

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898741#comment-15898741
 ] 

Apache Spark commented on SPARK-19602:
--

User 'skambha' has created a pull request for this issue:
https://github.com/apache/spark/pull/17185

> Unable to query using the fully qualified column name of the form 
> (<db name>.<table name>.<column name>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.pdf
>
>
> 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: in DB2 and Oracle the notion is <schema name>.<table name>.<column name>.
> 2) Another scenario where the fully qualified name is useful is as follows:
>   // current database is db1.
>   select t1.i1 from t1, db2.t1
> If the i1 column exists in both tables db1.t1 and db2.t1, this throws an error 
> during column resolution in the analyzer, as the reference is ambiguous.
> Now suppose the user intended to retrieve i1 from db1.t1, but in this example 
> only db2.t1 has an i1 column: the query would succeed instead of throwing an 
> error.
> One way to avoid the confusion would be to explicitly specify the fully 
> qualified name db1.t1.i1.
> For example: select db1.t1.i1 from t1, db2.t1
> Workarounds:
> There is a workaround for these issues, which is to use an alias.
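
For completeness, a small illustration of the alias workaround mentioned 
above, assuming a {{SparkSession}} named {{spark}} and that both db1.t1 and 
db2.t1 contain an i1 column:

{code}
// Aliasing the tables disambiguates i1 without needing db1.t1.i1 support.
spark.sql("SELECT a.i1 FROM db1.t1 AS a, db2.t1 AS b").show()
{code}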



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-06 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better not receive the heartbeat master by *receive* method.  Because 
any other rpc message may block the *receive* method. Then worker won't receive 
the heartbeat message. So it had better send the heartbeat master at an 
asynchronous timing thread .

  was:
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better receive the heartbeat master by *receive* method.  Because any 
other rpc message may block the *receive* method. Then worker won't receive the 
heartbeat message. So it had better send the heartbeat master at an 
asynchronous timing thread .


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>Priority: Minor
>
> Cleaning the application may cost much time at worker, then it will block 
> that  the worker send heartbeats master because the worker is extend 
> *ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
> message *ApplicationFinished*,  master will think the worker is dead. If the 
> worker has a driver, the driver will be scheduled by master again. So I think 
> it is the bug on spark. It may solve this problem by the followed suggests:
> 1. It had better  put the cleaning the application in a single asynchronous 
> thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
> like *SendHeartbeat*;
> 2. It had better not receive the heartbeat master by *receive* method.  
> Because any other rpc message may block the *receive* method. Then worker 
> won't receive the heartbeat message. So it had better send the heartbeat 
> master at an asynchronous timing thread .
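
A rough sketch of suggestion 2 above (hypothetical names, not the actual 
Worker code): send the heartbeat from a dedicated scheduler thread, so that a 
slow message in the endpoint's *receive* loop cannot delay it:

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical fragment of the Worker: `sendHeartbeatToMaster` stands in for
// whatever actually delivers the heartbeat (e.g. sending a message to the
// master's RpcEndpointRef).
val heartbeater = Executors.newSingleThreadScheduledExecutor()

def scheduleHeartbeats(intervalMs: Long)(sendHeartbeatToMaster: () => Unit): Unit = {
  heartbeater.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = sendHeartbeatToMaster()
  }, 0L, intervalMs, TimeUnit.MILLISECONDS)
}
{code}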



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-06 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better not receive the heartbeat master by *receive* method.  Because 
any other rpc message may block the *receive* method. Then worker won't receive 
the heartbeat message timely. So it had better send the heartbeat master at an 
asynchronous timing thread .

  was:
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better not receive the heartbeat master by *receive* method.  Because 
any other rpc message may block the *receive* method. Then worker won't receive 
the heartbeat message. So it had better send the heartbeat master at an 
asynchronous timing thread .


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>Priority: Minor
>
> Cleaning the application may cost much time at worker, then it will block 
> that  the worker send heartbeats master because the worker is extend 
> *ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
> message *ApplicationFinished*,  master will think the worker is dead. If the 
> worker has a driver, the driver will be scheduled by master again. So I think 
> it is the bug on spark. It may solve this problem by the followed suggests:
> 1. It had better  put the cleaning the application in a single asynchronous 
> thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
> like *SendHeartbeat*;
> 2. It had better not receive the heartbeat master by *receive* method.  
> Because any other rpc message may block the *receive* method. Then worker 
> won't receive the heartbeat message timely. So it had better send the 
> heartbeat master at an asynchronous timing thread .



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-06 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better receive the heartbeat master by *receive* method.  Because any 
other rpc message may block the *receive* method. Then worker won't receive the 
heartbeat message. So it had better send the heartbeat master at an 
asynchronous timing thread .

  was:
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better not send the heartbeat master by *receive* method.  Because 
any other rpc message may block the *receive* method. Then worker won't receive 
the heartbeat message. So it had better send the heartbeat master at an 
asynchronous timing thread .


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>Priority: Minor
>
> Cleaning the application may cost much time at worker, then it will block 
> that  the worker send heartbeats master because the worker is extend 
> *ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
> message *ApplicationFinished*,  master will think the worker is dead. If the 
> worker has a driver, the driver will be scheduled by master again. So I think 
> it is the bug on spark. It may solve this problem by the followed suggests:
> 1. It had better  put the cleaning the application in a single asynchronous 
> thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
> like *SendHeartbeat*;
> 2. It had better receive the heartbeat master by *receive* method.  Because 
> any other rpc message may block the *receive* method. Then worker won't 
> receive the heartbeat message. So it had better send the heartbeat master at 
> an asynchronous timing thread .



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-06 Thread hustfxj (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hustfxj updated SPARK-19831:

Description: 
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better not send the heartbeat master by *receive* method.  Because 
any other rpc message may block the *receive* method. Then worker won't receive 
the heartbeat message. So it had better send the heartbeat master at an 
asynchronous timing thread .

  was:
Cleaning the application may cost much time at worker, then it will block that  
the worker send heartbeats master because the worker is extend 
*ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
message *ApplicationFinished*,  master will think the worker is dead. If the 
worker has a driver, the driver will be scheduled by master again. So I think 
it is the bug on spark. It may solve this problem by the followed suggests:

1. It had better  put the cleaning the application in a single asynchronous 
thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
like *SendHeartbeat*;

2. It had better not send the heartbeat master by Rpc channel. Because any 
other rpc message may block the rpc channel. It had better send the heartbeat 
master at an asynchronous timing thread .


> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>Priority: Minor
>
> Cleaning the application may cost much time at worker, then it will block 
> that  the worker send heartbeats master because the worker is extend 
> *ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
> message *ApplicationFinished*,  master will think the worker is dead. If the 
> worker has a driver, the driver will be scheduled by master again. So I think 
> it is the bug on spark. It may solve this problem by the followed suggests:
> 1. It had better  put the cleaning the application in a single asynchronous 
> thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
> like *SendHeartbeat*;
> 2. It had better not send the heartbeat master by *receive* method.  Because 
> any other rpc message may block the *receive* method. Then worker won't 
> receive the heartbeat message. So it had better send the heartbeat master at 
> an asynchronous timing thread .



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-06 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898704#comment-15898704
 ] 

hustfxj edited comment on SPARK-19831 at 3/7/17 4:15 AM:
-

[~zsxwing] The only code I have found that handles the *ApplicationFinished* 
message slowly is this one, so I also think such code should run in a separate 
thread. I will submit a PR that moves it to a separate thread.


was (Author: hustfxj):
[~zsxwing] So far, the only code I have found that handles the 
*ApplicationFinished* message slowly is this one, so I also think such code 
should run in a separate thread.

> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>Priority: Minor
>
> Cleaning the application may cost much time at worker, then it will block 
> that  the worker send heartbeats master because the worker is extend 
> *ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
> message *ApplicationFinished*,  master will think the worker is dead. If the 
> worker has a driver, the driver will be scheduled by master again. So I think 
> it is the bug on spark. It may solve this problem by the followed suggests:
> 1. It had better  put the cleaning the application in a single asynchronous 
> thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
> like *SendHeartbeat*;
> 2. It had better not send the heartbeat master by Rpc channel. Because any 
> other rpc message may block the rpc channel. It had better send the heartbeat 
> master at an asynchronous timing thread .



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19829) The log about driver should support rolling like executor

2017-03-06 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898713#comment-15898713
 ] 

hustfxj commented on SPARK-19829:
-

[~sowen] Yes, this is handled by systems like YARN, but my Spark cluster is 
standalone. A standalone cluster should roll the driver log itself by default, 
rather than relying on a user-defined log4j configuration.

> The log about driver should support rolling like executor
> -
>
> Key: SPARK-19829
> URL: https://issues.apache.org/jira/browse/SPARK-19829
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: hustfxj
>Priority: Minor
>
> We should roll the driver log, otherwise it may grow very large.
> {code:title=DriverRunner.scala|borderStyle=solid}
> // modify the runDriver
>   private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: 
> Boolean): Int = {
> builder.directory(baseDir)
> def initialize(process: Process): Unit = {
>   // Redirect stdout and stderr to files-- the old code
> //  val stdout = new File(baseDir, "stdout")
> //  CommandUtils.redirectStream(process.getInputStream, stdout)
> //
> //  val stderr = new File(baseDir, "stderr")
> //  val formattedCommand = builder.command.asScala.mkString("\"", "\" 
> \"", "\"")
> //  val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, 
> "=" * 40)
> //  Files.append(header, stderr, StandardCharsets.UTF_8)
> //  CommandUtils.redirectStream(process.getErrorStream, stderr)
>   // Redirect its stdout and stderr to files-support rolling
>   val stdout = new File(baseDir, "stdout")
>   stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>   val stderr = new File(baseDir, "stderr")
>   val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", 
> "\"")
>   val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" 
> * 40)
>   Files.append(header, stderr, StandardCharsets.UTF_8)
>   stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
> }
> runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages

2017-03-06 Thread hustfxj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898704#comment-15898704
 ] 

hustfxj commented on SPARK-19831:
-

[~zsxwing] So far, the only code I have found that handles the 
*ApplicationFinished* message slowly is this one, so I also think such code 
should run in a separate thread.

> Sending the heartbeat  master from worker  maybe blocked by other rpc messages
> --
>
> Key: SPARK-19831
> URL: https://issues.apache.org/jira/browse/SPARK-19831
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: hustfxj
>Priority: Minor
>
> Cleaning the application may cost much time at worker, then it will block 
> that  the worker send heartbeats master because the worker is extend 
> *ThreadSafeRpcEndpoint*. If the heartbeat from a worker  is blocked  by the 
> message *ApplicationFinished*,  master will think the worker is dead. If the 
> worker has a driver, the driver will be scheduled by master again. So I think 
> it is the bug on spark. It may solve this problem by the followed suggests:
> 1. It had better  put the cleaning the application in a single asynchronous 
> thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages 
> like *SendHeartbeat*;
> 2. It had better not send the heartbeat master by Rpc channel. Because any 
> other rpc message may block the rpc channel. It had better send the heartbeat 
> master at an asynchronous timing thread .



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19843:


Assignee: Apache Spark

> UTF8String => (int / long) conversion expensive for invalid inputs
> --
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>
> For invalid inputs, converting a UTF8String to int or long returns null. This 
> comes at a cost: the conversion method (e.g. [0]) throws an exception, and 
> exception handling is expensive because it converts the UTF8String into a Java 
> string and populates the stack trace (a native call). While migrating workloads 
> from Hive -> Spark, I see that, at an aggregate level, this hurts query 
> performance in comparison with Hive.
> The exception merely indicates that the conversion failed; it is not propagated 
> to users, so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), which can be set to NULL 
> if the conversion fails. This is boxing and very bad for performance, so a big no.
> - Hive has a pre-check [1] for this, which is not a perfect safety net but 
> helps catch typical bad inputs, e.g. an empty string or "null".
> [0] : 
> https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : 
> https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898632#comment-15898632
 ] 

Apache Spark commented on SPARK-19843:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17184

> UTF8String => (int / long) conversion expensive for invalid inputs
> --
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>
> For invalid inputs, converting a UTF8String to int or long returns null. This 
> comes at a cost: the conversion method (e.g. [0]) throws an exception, and 
> exception handling is expensive because it converts the UTF8String into a Java 
> string and populates the stack trace (a native call). While migrating workloads 
> from Hive -> Spark, I see that, at an aggregate level, this hurts query 
> performance in comparison with Hive.
> The exception merely indicates that the conversion failed; it is not propagated 
> to users, so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), which can be set to NULL 
> if the conversion fails. This is boxing and very bad for performance, so a big no.
> - Hive has a pre-check [1] for this, which is not a perfect safety net but 
> helps catch typical bad inputs, e.g. an empty string or "null".
> [0] : 
> https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : 
> https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19843:


Assignee: (was: Apache Spark)

> UTF8String => (int / long) conversion expensive for invalid inputs
> --
>
> Key: SPARK-19843
> URL: https://issues.apache.org/jira/browse/SPARK-19843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>
> For invalid inputs, converting a UTF8String to int or long returns null. This 
> comes at a cost: the conversion method (e.g. [0]) throws an exception, and 
> exception handling is expensive because it converts the UTF8String into a Java 
> string and populates the stack trace (a native call). While migrating workloads 
> from Hive -> Spark, I see that, at an aggregate level, this hurts query 
> performance in comparison with Hive.
> The exception merely indicates that the conversion failed; it is not propagated 
> to users, so it would be good to avoid it.
> A couple of options:
> - Return Integer / Long (instead of primitive types), which can be set to NULL 
> if the conversion fails. This is boxing and very bad for performance, so a big no.
> - Hive has a pre-check [1] for this, which is not a perfect safety net but 
> helps catch typical bad inputs, e.g. an empty string or "null".
> [0] : 
> https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : 
> https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18085) Better History Server scalability for many / large applications

2017-03-06 Thread DjvuLee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DjvuLee updated SPARK-18085:


Yes, you're right. I just want to impact as little as possible.




> Better History Server scalability for many / large applications
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626
 ] 

Song Jun edited comment on SPARK-19845 at 3/7/17 2:33 AM:
--

Yes, this jira addresses that: https://issues.apache.org/jira/browse/SPARK-19784
Once the location has changed, it becomes more complex to uncache the table and 
recache other tables that reference it.

I will dig into it more.


was (Author: windpiger):
yes, this jira is doing this https://issues.apache.org/jira/browse/SPARK-19784
 the location changed, it make it more complex to uncache the table and recache 
other tables reference this.

> failed to uncache datasource table after the table location altered
> ---
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table location, 
> and then drop the table, uncaching the table will fail in DropTableCommand, 
> because the location has changed and sameResult for two InMemoryFileIndex 
> instances with different locations returns false, so we can't find the table 
> key in the cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898626#comment-15898626
 ] 

Song Jun commented on SPARK-19845:
--

Yes, this jira addresses that: https://issues.apache.org/jira/browse/SPARK-19784
Once the location has changed, it becomes more complex to uncache the table and 
recache other tables that reference it.

> failed to uncache datasource table after the table location altered
> ---
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table location, 
> and then drop the table, uncaching the table will fail in DropTableCommand, 
> because the location has changed and sameResult for two InMemoryFileIndex 
> instances with different locations returns false, so we can't find the table 
> key in the cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898591#comment-15898591
 ] 

Wenchen Fan commented on SPARK-19845:
-

In this case, I think we have to uncache the table when altering the location.

> failed to uncache datasource table after the table location altered
> ---
>
> Key: SPARK-19845
> URL: https://issues.apache.org/jira/browse/SPARK-19845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> Currently, if we first cache a datasource table, then alter the table location, 
> and then drop the table, uncaching the table will fail in DropTableCommand, 
> because the location has changed and sameResult for two InMemoryFileIndex 
> instances with different locations returns false, so we can't find the table 
> key in the cache.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19845) failed to uncache datasource table after the table location altered

2017-03-06 Thread Song Jun (JIRA)
Song Jun created SPARK-19845:


 Summary: failed to uncache datasource table after the table 
location altered
 Key: SPARK-19845
 URL: https://issues.apache.org/jira/browse/SPARK-19845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


Currently, if we first cache a datasource table, then alter the table location, 
and then drop the table, uncaching the table will fail in DropTableCommand, 
because the location has changed and sameResult for two InMemoryFileIndex 
instances with different locations returns false, so we can't find the table 
key in the cache.
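
For reference, a minimal reproduction sketch of the scenario above, assuming a 
spark-shell session; the table name and new location are placeholders:

{code}
// Reproduction sketch (assumed names/paths), following the description above.
spark.range(10).write.format("parquet").saveAsTable("t_loc")
spark.sql("CACHE TABLE t_loc")
// Alter the table location; the cached InMemoryFileIndex still points at the old path.
spark.sql("ALTER TABLE t_loc SET LOCATION '/tmp/t_loc_new'")
// Dropping the table tries to uncache it, but the cache lookup no longer matches.
spark.sql("DROP TABLE t_loc")
{code}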



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2017-03-06 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-19842:
--
Description: 
*Informational Referential Integrity Constraints Support in Spark*

This work proposes support for _informational primary key_ and _foreign key 
(referential integrity) constraints_ in Spark. The main purpose is to open up 
an area of query optimization techniques that rely on referential integrity 
constraints semantics. 

An _informational_ or _statistical constraint_ is a constraint such as a 
_unique_, _primary key_, _foreign key_, or _check constraint_, that can be used 
by Spark to improve query performance. Informational constraints are not 
enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize 
the query processing. They provide semantics information that allows Catalyst 
to rewrite queries to eliminate joins, push down aggregates, remove unnecessary 
Distinct operations, and perform a number of other optimizations. Informational 
constraints are primarily targeted to applications that load and analyze data 
that originated from a data warehouse. For such applications, the conditions 
for a given constraint are known to be true, so the constraint does not need to 
be enforced during data load operations. 

The attached document covers constraint definition, metastore storage, 
constraint validation, and maintenance. The document shows many examples of 
query performance improvements that utilize referential integrity constraints 
and can be implemented in Spark.


Link to the google doc: 
[InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]




  was:
*Informational Referential Integrity Constraints Support in Spark*

This work proposes support for _informational primary key_ and _foreign key 
(referential integrity) constraints_ in Spark. The main purpose is to open up 
an area of query optimization techniques that rely on referential integrity 
constraints semantics. 

An _informational_ or _statistical constraint_ is a constraint such as a 
_unique_, _primary key_, _foreign key_, or _check constraint_, that can be used 
by Spark to improve query performance. Informational constraints are not 
enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize 
the query processing. They provide semantics information that allows Catalyst 
to rewrite queries to eliminate joins, push down aggregates, remove unnecessary 
Distinct operations, and perform a number of other optimizations. Informational 
constraints are primarily targeted to applications that load and analyze data 
that originated from a data warehouse. For such applications, the conditions 
for a given constraint are known to be true, so the constraint does not need to 
be enforced during data load operations. 

The attached document covers constraint definition, metastore storage, 
constraint validation, and maintenance. The document shows many examples of 
query performance improvements that utilize referential integrity constraints 
and can be implemented in Spark.

Link to the google doc: 
[InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]





> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 

[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2017-03-06 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-19842:
--
Description: 
*Informational Referential Integrity Constraints Support in Spark*

This work proposes support for _informational primary key_ and _foreign key 
(referential integrity) constraints_ in Spark. The main purpose is to open up 
an area of query optimization techniques that rely on referential integrity 
constraints semantics. 

An _informational_ or _statistical constraint_ is a constraint such as a 
_unique_, _primary key_, _foreign key_, or _check constraint_, that can be used 
by Spark to improve query performance. Informational constraints are not 
enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize 
the query processing. They provide semantics information that allows Catalyst 
to rewrite queries to eliminate joins, push down aggregates, remove unnecessary 
Distinct operations, and perform a number of other optimizations. Informational 
constraints are primarily targeted to applications that load and analyze data 
that originated from a data warehouse. For such applications, the conditions 
for a given constraint are known to be true, so the constraint does not need to 
be enforced during data load operations. 

The attached document covers constraint definition, metastore storage, 
constraint validation, and maintenance. The document shows many examples of 
query performance improvements that utilize referential integrity constraints 
and can be implemented in Spark.

Link to the google doc: 
[InformationalRIConstraints|https://docs.google.com/document/d/17r-cOqbKF7Px0xb9L7krKg2-RQB_gD2pxOmklm-ehsw/edit]




  was:
*Informational Referential Integrity Constraints Support in Spark*

This work proposes support for _informational primary key_ and _foreign key 
(referential integrity) constraints_ in Spark. The main purpose is to open up 
an area of query optimization techniques that rely on referential integrity 
constraints semantics. 

An _informational_ or _statistical constraint_ is a constraint such as a 
_unique_, _primary key_, _foreign key_, or _check constraint_, that can be used 
by Spark to improve query performance. Informational constraints are not 
enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize 
the query processing. They provide semantics information that allows Catalyst 
to rewrite queries to eliminate joins, push down aggregates, remove unnecessary 
Distinct operations, and perform a number of other optimizations. Informational 
constraints are primarily targeted to applications that load and analyze data 
that originated from a data warehouse. For such applications, the conditions 
for a given constraint are known to be true, so the constraint does not need to 
be enforced during data load operations. 

The attached document covers constraint definition, metastore storage, 
constraint validation, and maintenance. The document shows many examples of 
query performance improvements that utilize referential integrity constraints 
and can be implemented in Spark.





> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many 

[jira] [Created] (SPARK-19844) UDF in when control function is executed before the when clause is evaluated.

2017-03-06 Thread Franklyn Dsouza (JIRA)
Franklyn Dsouza created SPARK-19844:
---

 Summary: UDF in when control function is executed before the when 
clause is evaluated.
 Key: SPARK-19844
 URL: https://issues.apache.org/jira/browse/SPARK-19844
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.1.0, 2.0.1
Reporter: Franklyn Dsouza


Sometimes we try to filter out the argument to a udf using {code}when(clause, 
udf).otherwise(default){code}

but we've noticed that sometimes the udf is being run on data that shouldn't 
have matched the clause.

Here's some code to reproduce the issue:

{code}
from pyspark.sql import functions as F
from pyspark.sql import types

df = spark.createDataFrame([{'r': None}], 
schema=types.StructType([types.StructField('r', types.StringType())]))

simple_udf = F.udf(lambda ref: ref.strip("/"), types.StringType())

df.withColumn('test', 
   F.when(F.col("r").isNotNull(), simple_udf(F.col("r")))
.otherwise(F.lit(None))
 ).collect()
{code}

This causes an exception because the udf is running on null data: I get 
AttributeError: 'NoneType' object has no attribute 'strip'. 

So it looks like the udf is being evaluated before the clause in the when is 
evaluated. Oddly enough, when I change {code}F.col("r").isNotNull(){code} to 
{code}df["r"] != None{code} then it works. 

might be related to https://issues.apache.org/jira/browse/SPARK-13773
 
and https://issues.apache.org/jira/browse/SPARK-15282



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19719) Structured Streaming write to Kafka

2017-03-06 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-19719.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17043
[https://github.com/apache/spark/pull/17043]

> Structured Streaming write to Kafka
> ---
>
> Key: SPARK-19719
> URL: https://issues.apache.org/jira/browse/SPARK-19719
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
> Fix For: 2.2.0
>
>
> This issue deals with writing to Apache Kafka for both streaming and batch 
> queries. 
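
For context, a brief usage sketch of the streaming Kafka sink this adds, as 
documented for Spark 2.2; the broker address, topic, and checkpoint path below 
are placeholders:

{code}
import org.apache.spark.sql.SparkSession

object KafkaSinkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("kafka-sink").getOrCreate()
    import spark.implicits._

    // Toy streaming source; the Kafka sink expects a string/binary 'value' column.
    val input = spark.readStream.format("rate").load()
      .select($"value".cast("string").as("value"))

    val query = input.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
      .option("topic", "events")                            // placeholder topic
      .option("checkpointLocation", "/tmp/kafka-sink-chk")  // placeholder path
      .start()

    query.awaitTermination()
  }
}
{code}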



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19843) UTF8String => (int / long) conversion expensive for invalid inputs

2017-03-06 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-19843:
---

 Summary: UTF8String => (int / long) conversion expensive for 
invalid inputs
 Key: SPARK-19843
 URL: https://issues.apache.org/jira/browse/SPARK-19843
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Tejas Patil


For invalid inputs, converting a UTF8String to int or long returns null. This 
comes at a cost: the conversion method (e.g. [0]) throws an exception, and 
exception handling is expensive because it converts the UTF8String into a Java 
string and populates the stack trace (a native call). While migrating workloads 
from Hive -> Spark, I see that, at an aggregate level, this hurts query 
performance in comparison with Hive.

The exception merely indicates that the conversion failed; it is not propagated 
to users, so it would be good to avoid it.

A couple of options:
- Return Integer / Long (instead of primitive types), which can be set to NULL 
if the conversion fails. This is boxing and very bad for performance, so a big no.
- Hive has a pre-check [1] for this, which is not a perfect safety net but 
helps catch typical bad inputs, e.g. an empty string or "null".

[0] : 
https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
[1] : 
https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
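
For illustration, a minimal sketch of the Hive-style pre-check idea (editor's 
example; like the Hive check, it is not a perfect safety net - for instance it 
does not guard against overflow):

{code}
object NumericPrecheck {
  // Cheap scan that rejects obviously invalid numeric strings before parsing,
  // so common bad inputs (empty string, "null", garbage) never pay the cost of
  // constructing an exception and filling in a stack trace.
  private def looksLikeLong(s: String): Boolean = {
    if (s == null || s.isEmpty) return false
    var i = if (s.charAt(0) == '-' || s.charAt(0) == '+') 1 else 0
    if (i == s.length) return false
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < '0' || c > '9') return false
      i += 1
    }
    true
  }

  // Returns None for invalid input instead of throwing.
  def safeToLong(s: String): Option[Long] =
    if (looksLikeLong(s)) Some(s.toLong) else None

  def main(args: Array[String]): Unit = {
    println(safeToLong("42"))    // Some(42)
    println(safeToLong(""))      // None
    println(safeToLong("null"))  // None
  }
}
{code}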



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2017-03-06 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-19842:
--
Target Version/s:   (was: 2.2.0)

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2017-03-06 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-19842:
--
Attachment: InformationalRIConstraints.doc

> Informational Referential Integrity Constraints Support in Spark
> 
>
> Key: SPARK-19842
> URL: https://issues.apache.org/jira/browse/SPARK-19842
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ioana Delaney
> Attachments: InformationalRIConstraints.doc
>
>
> *Informational Referential Integrity Constraints Support in Spark*
> This work proposes support for _informational primary key_ and _foreign key 
> (referential integrity) constraints_ in Spark. The main purpose is to open up 
> an area of query optimization techniques that rely on referential integrity 
> constraints semantics. 
> An _informational_ or _statistical constraint_ is a constraint such as a 
> _unique_, _primary key_, _foreign key_, or _check constraint_, that can be 
> used by Spark to improve query performance. Informational constraints are not 
> enforced by the Spark SQL engine; rather, they are used by Catalyst to 
> optimize the query processing. They provide semantics information that allows 
> Catalyst to rewrite queries to eliminate joins, push down aggregates, remove 
> unnecessary Distinct operations, and perform a number of other optimizations. 
> Informational constraints are primarily targeted to applications that load 
> and analyze data that originated from a data warehouse. For such 
> applications, the conditions for a given constraint are known to be true, so 
> the constraint does not need to be enforced during data load operations. 
> The attached document covers constraint definition, metastore storage, 
> constraint validation, and maintenance. The document shows many examples of 
> query performance improvements that utilize referential integrity constraints 
> and can be implemented in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19842) Informational Referential Integrity Constraints Support in Spark

2017-03-06 Thread Ioana Delaney (JIRA)
Ioana Delaney created SPARK-19842:
-

 Summary: Informational Referential Integrity Constraints Support 
in Spark
 Key: SPARK-19842
 URL: https://issues.apache.org/jira/browse/SPARK-19842
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Ioana Delaney


*Informational Referential Integrity Constraints Support in Spark*

This work proposes support for _informational primary key_ and _foreign key 
(referential integrity) constraints_ in Spark. The main purpose is to open up 
an area of query optimization techniques that rely on referential integrity 
constraints semantics. 

An _informational_ or _statistical constraint_ is a constraint such as a 
_unique_, _primary key_, _foreign key_, or _check constraint_, that can be used 
by Spark to improve query performance. Informational constraints are not 
enforced by the Spark SQL engine; rather, they are used by Catalyst to optimize 
the query processing. They provide semantics information that allows Catalyst 
to rewrite queries to eliminate joins, push down aggregates, remove unnecessary 
Distinct operations, and perform a number of other optimizations. Informational 
constraints are primarily targeted to applications that load and analyze data 
that originated from a data warehouse. For such applications, the conditions 
for a given constraint are known to be true, so the constraint does not need to 
be enforced during data load operations. 

The attached document covers constraint definition, metastore storage, 
constraint validation, and maintenance. The document shows many examples of 
query performance improvements that utilize referential integrity constraints 
and can be implemented in Spark.
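
As a concrete illustration of the join-elimination case, here is an editor's 
sketch with made-up tables (not taken from the attached document); the informational 
PK/FK pair is assumed, not declared through any existing DDL:

{code}
import org.apache.spark.sql.SparkSession

object JoinEliminationIllustration {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ri-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical star-schema fragment: dim.id is unique and every non-null
    // fact.dim_id matches exactly one dim.id (the informational PK/FK pair).
    Seq((1, 10.0), (2, 20.0), (1, 5.0)).toDF("dim_id", "amount").createOrReplaceTempView("fact")
    Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("dim")

    // Under those constraints the join neither filters nor duplicates fact rows,
    // so when only fact columns are selected the two queries are equivalent and
    // an optimizer that trusts the constraints may drop the join entirely.
    spark.sql("SELECT f.amount FROM fact f JOIN dim d ON f.dim_id = d.id").show()
    spark.sql("SELECT f.amount FROM fact f").show()

    spark.stop()
  }
}
{code}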






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19841:


Assignee: Shixiong Zhu  (was: Apache Spark)

> StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
> 
>
> Key: SPARK-19841
> URL: https://issues.apache.org/jira/browse/SPARK-19841
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now it just uses the rows to filter, but a column's position in 
> keyExpressions may be different from its position in the row.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19841:


Assignee: Apache Spark  (was: Shixiong Zhu)

> StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
> 
>
> Key: SPARK-19841
> URL: https://issues.apache.org/jira/browse/SPARK-19841
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Right now it just uses the rows to filter, but a column's position in 
> keyExpressions may be different from its position in the row.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898446#comment-15898446
 ] 

Apache Spark commented on SPARK-19841:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/17183

> StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys
> 
>
> Key: SPARK-19841
> URL: https://issues.apache.org/jira/browse/SPARK-19841
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now it just uses the rows to filter, but a column's position in 
> keyExpressions may be different from its position in the row.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19841) StreamingDeduplicateExec.watermarkPredicate should filter rows based on keys

2017-03-06 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19841:


 Summary: StreamingDeduplicateExec.watermarkPredicate should filter 
rows based on keys
 Key: SPARK-19841
 URL: https://issues.apache.org/jira/browse/SPARK-19841
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now it just uses the rows to filter, but a column's position in 
keyExpressions may be different from its position in the row.
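
As a conceptual illustration of why that matters, here is an editor's sketch in 
plain Scala (not Spark internals; all names are made up):

{code}
// The dedup key is a projection that may reorder columns relative to the row.
case class Event(id: Int, eventTime: Long, user: String)

// Key layout: (eventTime, user) - eventTime is position 0 in the key,
// even though it is position 1 in the row.
val keyOf: Event => (Long, String) = e => (e.eventTime, e.user)

// A watermark predicate written against the *key* layout must therefore be
// applied to the projected key, not to the raw row.
val keyWatermarkPredicate: ((Long, String)) => Boolean = { case (ts, _) => ts >= 1000L }

val rows = Seq(Event(1, 500L, "a"), Event(2, 1500L, "b"))
val kept = rows.filter(e => keyWatermarkPredicate(keyOf(e)))
println(kept)  // List(Event(2,1500,b))
{code}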



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19840) Disallow creating permanent functions with invalid class names

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898402#comment-15898402
 ] 

Apache Spark commented on SPARK-19840:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17182

> Disallow creating permanent functions with invalid class names
> --
>
> Key: SPARK-19840
> URL: https://issues.apache.org/jira/browse/SPARK-19840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>
> Currently, Spark raises exceptions on creating invalid **temporary** 
> functions, but doesn't for **permanent** functions. This issue aims to 
> disallow creating permanent functions with invalid class names.
> **BEFORE**
> {code}
> scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid at 
> ...
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> ++
> ||
> ++
> ++
> {code}
> **AFTER**
> {code}
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19840) Disallow creating permanent functions with invalid class names

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19840:


Assignee: (was: Apache Spark)

> Disallow creating permanent functions with invalid class names
> --
>
> Key: SPARK-19840
> URL: https://issues.apache.org/jira/browse/SPARK-19840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>
> Currently, Spark raises exceptions on creating invalid **temporary** 
> functions, but doesn't for **permanent** functions. This issue aims to 
> disallow creating permanent functions with invalid class names.
> **BEFORE**
> {code}
> scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid at 
> ...
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> ++
> ||
> ++
> ++
> {code}
> **AFTER**
> {code}
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19840) Disallow creating permanent functions with invalid class names

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19840:


Assignee: Apache Spark

> Disallow creating permanent functions with invalid class names
> --
>
> Key: SPARK-19840
> URL: https://issues.apache.org/jira/browse/SPARK-19840
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> Currently, Spark raises exceptions on creating invalid **temporary** 
> functions, but doesn't for **permanent** functions. This issue aims to 
> disallow creating permanent functions with invalid class names.
> **BEFORE**
> {code}
> scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid at 
> ...
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> ++
> ||
> ++
> ++
> {code}
> **AFTER**
> {code}
> scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
> 'org.invalid'").show
> java.lang.ClassNotFoundException: org.invalid
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19840) Disallow creating permanent functions with invalid class names

2017-03-06 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-19840:
-

 Summary: Disallow creating permanent functions with invalid class 
names
 Key: SPARK-19840
 URL: https://issues.apache.org/jira/browse/SPARK-19840
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Dongjoon Hyun


Currently, Spark raises exceptions on creating invalid **temporary** functions, 
but doesn't for **permanent** functions. This issue aims to disallow creating 
permanent functions with invalid class names.

**BEFORE**
{code}
scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 
'org.invalid'").show
java.lang.ClassNotFoundException: org.invalid at 
...

scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
'org.invalid'").show
++
||
++
++
{code}

**AFTER**
{code}
scala> sql("CREATE FUNCTION function_with_invalid_classname AS 
'org.invalid'").show
java.lang.ClassNotFoundException: org.invalid
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15522) DataFrame Column Names That are Numbers aren't referenced correctly in SQL

2017-03-06 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897638#comment-15897638
 ] 

Hyukjin Kwon edited comment on SPARK-15522 at 3/6/17 11:26 PM:
---

We can use backticks for it as below:

{code}
scala> Seq(Some(1), None).toDF("82").createOrReplaceTempView("piv_qhID")

scala> spark.sql("select * from piv_qhID where '82' is NULL limit 20").show()
+---+
| 82|
+---+
+---+


scala> spark.sql("select * from piv_qhID where `82` is NULL limit 20").show()
++
|  82|
++
|null|
++
{code}

It seems {{'82'}} or {{"82"}} was being created as a constant. I am resolving 
this JIRA. Please reopen this if I misunderstood. 


was (Author: hyukjin.kwon):
We can use backticks for it as below:

{code}
scala> spark.range(10).toDF("82").createOrReplaceTempView("piv_qhID")

scala> spark.sql("select * from piv_qhID where `82` is NULL limit 20").show()
+---+
| 82|
+---+
+---+
{code}

It seems {{'82'}} or {{"82"}} was being created as a constant. I am resolving 
this JIRA. Please reopen this if I misunderstood. 

> DataFrame Column Names That are Numbers aren't referenced correctly in SQL
> --
>
> Key: SPARK-15522
> URL: https://issues.apache.org/jira/browse/SPARK-15522
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Jason Pohl
>
> The following code is run:
> val pre_piv_df_a = sqlContext.sql("""
> SELECT
> CASE WHEN Gender = 'M' Then 1 ELSE 0 END AS has_male,
> CASE WHEN Gender = 'F' Then 1 ELSE 0 END AS has_female,
> CAST(StartAge AS Double) AS StartAge_dbl,
> CAST(EndAge AS Double) AS EndAge_dbl,
> *
> FROM alldem_union_curr
> """)
> .withColumn("JavaStartTimestamp", create_ts($"StartTimestamp"))
> .drop("StartTimestamp").withColumnRenamed("JavaStartTimestamp", 
> "StartTimestamp")
> .drop("StartAge").drop("EndAge")
> .withColumnRenamed("StartAge_dbl", 
> "StartAge").withColumnRenamed("EndAge_dbl", "EndAge")
> val pre_piv_df_b = pre_piv_df_a
> .withColumn("media_month_cc", media_month_cc($"MediaMonth"))
> .withColumn("media_week_cc", media_week_sts_cc($"StartTimestamp"))
> .withColumn("media_day_cc", media_day_sts_cc($"StartTimestamp"))
> .withColumn("week_day", week_day($"StartTimestamp"))
> .withColumn("week_end", week_end($"StartTimestamp"))
> .join(sqlContext.table("cad_nets"), $"Network" === $"nielsen_network", 
> "inner")
> .withColumnRenamed("cad_network", "norm_net_code_a")
> .withColumn("norm_net_code", reCodeNets($"norm_net_code_a"))
> pre_piv_df_b.registerTempTable("pre_piv_df")
> val piv_qhID_df = pre_piv_df_b.groupBy("Network", "Audience", "StartDate", 
> "rating_category_cd")
> .pivot("qaID").agg("rating" -> "mean")
> The pivot creates a lot of columns (96) with names that are like 
> ‘01’,’02’,…,’96’ as a result of pivoting a table that has quarter hour IDs.
> In the below SQL the highlighted section causes problems. If I rename the 
> columns to ‘col01’,’col02’,…,’col96’ I can run the SQL correctly and get the 
> expected results.
> select * from piv_qhID where 82 is NULL limit 20
> And I am getting no rows even though there are nulls.
> On the other hand the query:
> select * from piv_qhID where 82 is NOT NULL limit 20
> Returns all rows (even those with nulls)
> Renaming the columns fixes this, but it would be nice if the columns were 
> referenced correctly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14423) Handle jar conflict issue when uploading to distributed cache

2017-03-06 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898256#comment-15898256
 ] 

Junping Du commented on SPARK-14423:


Thanks [~jerryshao] for reporting this issue. I think YARN should fix this 
problem as well. If the same jars are added to the distributed cache, it should 
detect that and fail fast with a clear error message. YARN-5306 has already been 
filed to track this.

> Handle jar conflict issue when uploading to distributed cache
> -
>
> Key: SPARK-14423
> URL: https://issues.apache.org/jira/browse/SPARK-14423
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
> Fix For: 2.0.0
>
>
> Currently, with the introduction of assembly-free deployment of Spark, by 
> default yarn#client uploads all the jars in the assembly to the HDFS staging 
> folder. If a jar in the assembly and one specified with \--jars have the same 
> name, this will cause an exception while downloading these jars and make the 
> application fail to run.
> Here is the exception when running example with {{run-example}}:
> {noformat}
> 16/04/06 10:29:48 INFO Client: Application report for 
> application_1459907402325_0004 (state: FAILED)
> 16/04/06 10:29:48 INFO Client:
>client token: N/A
>diagnostics: Application application_1459907402325_0004 failed 2 times 
> due to AM Container for appattempt_1459907402325_0004_02 exited with  
> exitCode: -1000
> For more detailed output, check application tracking 
> page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, 
> click on links to logs of each attempt.
> Diagnostics: Resource 
> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
>  changed on src filesystem (expected 1459909780508, was 1459909782590
> java.io.IOException: Resource 
> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
>  changed on src filesystem (expected 1459909780508, was 1459909782590
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The problem is that the jar {{avro-mapred-1.7.7-hadoop2.jar}} exists in both 
> the assembly and the examples folder.
> We should handle this situation, since the Spark examples currently fail to run 
> in yarn mode.
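
For illustration, a minimal sketch of the kind of fail-fast check described 
above (editor's example, not the actual yarn#client code):

{code}
// Detect duplicate jar file names up front, so a conflict produces a clear
// error instead of a confusing "changed on src filesystem" failure later.
def checkDuplicateJarNames(jarPaths: Seq[String]): Unit = {
  val byName = jarPaths.groupBy(_.split("/").last)
  val dups = byName.collect { case (name, paths) if paths.distinct.size > 1 => name }
  require(dups.isEmpty,
    s"These jar names would collide in the staging directory: ${dups.mkString(", ")}")
}

checkDuplicateJarNames(Seq(
  "/opt/spark/jars/avro-mapred-1.7.7-hadoop2.jar",
  "/opt/spark/examples/jars/avro-mapred-1.7.7-hadoop2.jar"))  // fails fast
{code}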



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-06 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898159#comment-15898159
 ] 

Sean Owen commented on SPARK-19767:
---

Hm, have you installed all the ruby-based dependencies? I don't know exactly 
what pulls in liquid.

sudo gem install jekyll jekyll-redirect-from pygments.rb

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> lead to API pages that do not have class ConsumerStrategies) .  The API doc 
> package names  also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19709) CSV datasource fails to read empty file

2017-03-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19709:
---

Assignee: Wojciech Szymanski

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Wojciech Szymanski
>Priority: Minor
> Fix For: 2.2.0
>
>
> I just ran {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19709) CSV datasource fails to read empty file

2017-03-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19709.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17068
[https://github.com/apache/spark/pull/17068]

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.0
>
>
> I just ran {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19094) Plumb through logging/error messages from the JVM to Jupyter PySpark

2017-03-06 Thread Kyle Kelley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898114#comment-15898114
 ] 

Kyle Kelley commented on SPARK-19094:
-

Super interested in this, as it's been confusing for our users. I've thought 
about making an alternate endpoint for a kernel to get logs out of, but it would 
be much better to re-route these logs so that the python kernel can handle them 
directly.

> Plumb through logging/error messages from the JVM to Jupyter PySpark
> 
>
> Key: SPARK-19094
> URL: https://issues.apache.org/jira/browse/SPARK-19094
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: holdenk
>Priority: Trivial
>
> Jupyter/IPython notebooks work by overriding sys.stdout & sys.stderr; as 
> such, the error messages that show up in Jupyter/IPython are often missing 
> the related logs - which are often more useful than the exception itself.
> This could make it easier for Python developers getting started with Spark on 
> their local laptops to debug their applications, since otherwise they need to 
> remember to keep going to the terminal where they launched the notebook from.
> One counterpoint to this is that Spark's logging is fairly verbose, but since 
> we provide the ability for the user to tune the log messages from within the 
> notebook that should be OK.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16845) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB

2017-03-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-16845:

Fix Version/s: 2.0.3
   1.6.4

> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> -
>
> Key: SPARK-16845
> URL: https://issues.apache.org/jira/browse/SPARK-16845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: hejie
>Assignee: Liwei Lin
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
> Attachments: error.txt.zip
>
>
> I have a wide table (400 columns); when I try fitting the training data on all 
> columns, the fatal error below occurs. 
>   ... 46 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:854)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19824) Standalone master JSON not showing cores for running applications

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19824:


Assignee: Apache Spark

> Standalone master JSON not showing cores for running applications
> -
>
> Key: SPARK-19824
> URL: https://issues.apache.org/jira/browse/SPARK-19824
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: Dan
>Assignee: Apache Spark
>Priority: Minor
>
> The JSON API of the standalone master ("/json") does not show the number of 
> cores for a running application, which is available on the UI.
>   "activeapps" : [ {
> "starttime" : 1488702337788,
> "id" : "app-20170305102537-19717",
> "name" : "POPAI_Aggregated",
> "user" : "ibiuser",
> "memoryperslave" : 16384,
> "submitdate" : "Sun Mar 05 10:25:37 IST 2017",
> "state" : "RUNNING",
> "duration" : 1141934
>   } ],



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19824) Standalone master JSON not showing cores for running applications

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898111#comment-15898111
 ] 

Apache Spark commented on SPARK-19824:
--

User 'yongtang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17181

> Standalone master JSON not showing cores for running applications
> -
>
> Key: SPARK-19824
> URL: https://issues.apache.org/jira/browse/SPARK-19824
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: Dan
>Priority: Minor
>
> The JSON API of the standalone master ("/json") does not show the number of 
> cores for a running application, which is available on the UI.
>   "activeapps" : [ {
> "starttime" : 1488702337788,
> "id" : "app-20170305102537-19717",
> "name" : "POPAI_Aggregated",
> "user" : "ibiuser",
> "memoryperslave" : 16384,
> "submitdate" : "Sun Mar 05 10:25:37 IST 2017",
> "state" : "RUNNING",
> "duration" : 1141934
>   } ],



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19824) Standalone master JSON not showing cores for running applications

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19824:


Assignee: (was: Apache Spark)

> Standalone master JSON not showing cores for running applications
> -
>
> Key: SPARK-19824
> URL: https://issues.apache.org/jira/browse/SPARK-19824
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.1.0
>Reporter: Dan
>Priority: Minor
>
> The JSON API of the standalone master ("/json") does not show the number of 
> cores for a running application, which is available on the UI.
>   "activeapps" : [ {
> "starttime" : 1488702337788,
> "id" : "app-20170305102537-19717",
> "name" : "POPAI_Aggregated",
> "user" : "ibiuser",
> "memoryperslave" : 16384,
> "submitdate" : "Sun Mar 05 10:25:37 IST 2017",
> "state" : "RUNNING",
> "duration" : 1141934
>   } ],



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19382) Test sparse vectors in LinearSVCSuite

2017-03-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19382.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16784
[https://github.com/apache/spark/pull/16784]

> Test sparse vectors in LinearSVCSuite
> -
>
> Key: SPARK-19382
> URL: https://issues.apache.org/jira/browse/SPARK-19382
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> Currently, LinearSVCSuite does not test sparse vectors.  We should.  I 
> recommend that generateSVMInput be modified to create a mix of dense and 
> sparse vectors, rather than adding an additional test.
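
A hedged sketch of what the mixed generation could look like (the helper name and signature below are illustrative, not the actual generateSVMInput):
{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import scala.util.Random

// Hedged sketch: emit alternating dense and sparse vectors so both code paths
// in LinearSVC get exercised by the same generator.
def mixedVectors(n: Int, dim: Int, seed: Long): Seq[Vector] = {
  val rnd = new Random(seed)
  Seq.tabulate(n) { i =>
    val values = Array.fill(dim)(rnd.nextGaussian())
    if (i % 2 == 0) {
      Vectors.dense(values)
    } else {
      // keep only a couple of non-zeros so the vector is genuinely sparse
      Vectors.sparse(dim, Array(0, dim - 1), Array(values(0), values(dim - 1)))
    }
  }
}
{code}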



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19382) Test sparse vectors in LinearSVCSuite

2017-03-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-19382:
-

Assignee: Miao Wang

> Test sparse vectors in LinearSVCSuite
> -
>
> Key: SPARK-19382
> URL: https://issues.apache.org/jira/browse/SPARK-19382
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Minor
>
> Currently, LinearSVCSuite does not test sparse vectors.  We should.  I 
> recommend that generateSVMInput be modified to create a mix of dense and 
> sparse vectors, rather than adding an additional test.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF

2017-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898052#comment-15898052
 ] 

Dongjoon Hyun commented on SPARK-18832:
---

Could you check the permissions on the following jar? Is it accessible to both 
`hive` and `spark` on your server and on the server running Spark Thrift Server?
{code}
'/root/spark_files/experiments-1.2.jar';
{code}
If there is a permission problem, it also ends up as 
`org.apache.spark.sql.AnalysisException: Undefined function`, with additional 
logs.


> Spark SQL: Thriftserver unable to run a registered Hive UDTF
> 
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: HDP: 2.5
> Spark: 2.0.0
>Reporter: Lokesh Yadav
> Attachments: SampleUDTF.java
>
>
> Spark Thriftserver is unable to run a HiveUDTF.
> It throws an error that it is unable to find the function, although the 
> function registration succeeds and the function does show up in the list 
> output by {{show functions}}.
> I am using a Hive UDTF, registering it using a jar placed on my local 
> machine. Calling it using the following commands:
> //Registering the functions, this command succeeds.
> {{CREATE FUNCTION SampleUDTF AS 
> 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR 
> '/root/spark_files/experiments-1.2.jar';}}
> //Thriftserver is able to look up the functuion, on this command:
> {{DESCRIBE FUNCTION SampleUDTF;}}
> {quote}
> {noformat}
> Output: 
> +---+--+
> |   function_desc   |
> +---+--+
> | Function: default.SampleUDTF  |
> | Class: com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF  |
> | Usage: N/A.   |
> +---+--+
> {noformat}
> {quote}
> // Calling the function: 
> {{SELECT SampleUDTF('Paris');}}
> bq. Output of the above command: Error: 
> org.apache.spark.sql.AnalysisException: Undefined function: 'SampleUDTF'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.; line 1 pos 7 (state=,code=0)
> I have also tried with using a non-local (on hdfs) jar, but I get the same 
> error.
> My environment: HDP 2.5 with spark 2.0.0
> I have attached the class file for the UDTF I am using in testing this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-03-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19211:
---

Assignee: Jiang Xingbo

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19211
> URL: https://issues.apache.org/jira/browse/SPARK-19211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.
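
For concreteness, a hedged sketch of the two statements in question (run via spark.sql; table/view names are made up for illustration):
{code}
// Hedged sketch of the statements that should be rejected up front.
spark.sql("CREATE TABLE tbl(id INT) USING parquet")
spark.sql("CREATE VIEW v AS SELECT id FROM tbl")

spark.sql("INSERT INTO v VALUES (1)")                       // case 2: insert into a view
spark.sql("CREATE VIEW v2 AS INSERT INTO tbl VALUES (1)")   // case 1: create view as insert
{code}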



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19211) Explicitly prevent Insert into View or Create View As Insert

2017-03-06 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19211.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Explicitly prevent Insert into View or Create View As Insert
> 
>
> Key: SPARK-19211
> URL: https://issues.apache.org/jira/browse/SPARK-19211
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> Currently we don't explicitly forbid the following behaviors:
> 1. The statement CREATE VIEW AS INSERT INTO throws the following exception 
> from SQLBuilder:
> `java.lang.UnsupportedOperationException: unsupported plan InsertIntoTable 
> MetastoreRelation default, tbl, false, false`;
> 2. The statement INSERT INTO view VALUES throws the following exception from 
> checkAnalysis:
> `Error in query: Inserting into an RDD-based table is not allowed.;;`
> We should check for these behaviors earlier and explicitly prevent them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2017-03-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898005#comment-15898005
 ] 

Reynold Xin commented on SPARK-17495:
-

We should probably create subtickets next time for this ...


> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.
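
For illustration, Spark already exposes its Murmur3-based hash as a SQL function; a matching Hive-compatible hash is what this ticket adds. A small sketch of the existing side only, assuming {{spark}} is an active SparkSession:
{code}
import org.apache.spark.sql.functions.{col, hash}

// Spark's `hash` SQL function is Murmur3-based, i.e. the family of hash Spark uses
// for shuffle partitioning/bucketing. Hive's bucketing hash is a different function,
// so the same key can land in different buckets across the two engines.
spark.range(5).select(col("id"), hash(col("id")).as("spark_murmur3_hash")).show()
{code}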



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure

2017-03-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898003#comment-15898003
 ] 

Steve Loughran commented on SPARK-19790:


Thinking some more & looking at code snippets

# FileOutputFormat with algorithm 2 can recover from a failed task commit, 
somehow. That code is too complex to make sense of now that it mixes two 
algorithms with corecursion and stuff to run in client and server.
# The s3guard committer will have the task upload its files in parallel, but 
will not complete the multipart commit; the information to do this will be 
persisted to HDFS for execution by the job committer.
# I plan a little spark extension which will do the same for files with 
absolute destinations, this time passing the data back to the job committer.

Which means a failed task can be recovered from. All pending writes for that 
task will need to be found (by scanning the filesystem) and aborted. There is still an issue 
about the ordering of the PUT vs the save of the upload data; some GC of pending commits 
to a destination dir would be the way to avoid running up bills. 

This is one of those coordination problems where someone with TLA+ algorithm 
specification skills would be good, along with the foundation specs for 
filesystems and object stores. Someone needs to find a CS student looking for a 
project.

> OutputCommitCoordinator should not allow another task to commit after an 
> ExecutorFailure
> 
>
> Key: SPARK-19790
> URL: https://issues.apache.org/jira/browse/SPARK-19790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> The OutputCommitCoordinator resets the allowed committer when the task fails. 
>  
> https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
> However, if a task fails because of an ExecutorFailure, we actually have no 
> idea what the status is of the task.  The task may actually still be running, 
> and perhaps successfully commit its output.  By allowing another task to 
> commit its output, there is a chance that multiple tasks commit, which can 
> result in corrupt output.  This would be particularly problematic when 
> commit() is an expensive operation, eg. moving files on S3.
> For other task failures, we can allow other tasks to commit.  But with an 
> ExecutorFailure, its not clear what the right thing to do is.  The only safe 
> thing to do may be to fail the job.
> This is related to SPARK-19631, and was discovered during discussion on that 
> PR https://github.com/apache/spark/pull/16959#discussion_r103549134



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-06 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897997#comment-15897997
 ] 

Imran Rashid commented on SPARK-19796:
--

I'm opposed to (b) as well.

It feels wrong to do a one-off just for JOB_DESCRIPTION, but maybe it's a 
large enough savings that it's worth doing.  I was thinking of something larger, 
along the lines of SPARK-19108.  Another option would be to add new APIs, e.g., 
jobs would take `driverProperties` and `executorProperties`, but maybe that is 
overkill.
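
For anyone landing here, the underlying JDK limit is easy to see in isolation (a standalone sketch, nothing Spark-specific):
{code}
import java.io.{ByteArrayOutputStream, DataOutputStream}

// java.io.DataOutputStream.writeUTF refuses strings whose modified-UTF-8
// encoding exceeds 65535 bytes, which is what bites a 64KB+ job description.
val out = new DataOutputStream(new ByteArrayOutputStream())
out.writeUTF("x" * 70000)   // throws java.io.UTFDataFormatException
{code}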

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Assignee: Imran Rashid
>Priority: Blocker
> Fix For: 2.2.0
>
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19839) Fix memory leak in BytesToBytesMap

2017-03-06 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897975#comment-15897975
 ] 

Zhan Zhang commented on SPARK-19839:


When BytesToBytesMap spills, its longArray should be released. Otherwise, it 
may not be released until the task completes. This array may take a significant 
amount of memory that cannot be used by later operators, such as 
UnsafeShuffleExternalSorter, resulting in more frequent spills in the sorter. This 
patch releases the array, since the destructive iterator will not use it anymore.

> Fix memory leak in BytesToBytesMap
> --
>
> Key: SPARK-19839
> URL: https://issues.apache.org/jira/browse/SPARK-19839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19839) Fix memory leak in BytesToBytesMap

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897970#comment-15897970
 ] 

Apache Spark commented on SPARK-19839:
--

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17180

> Fix memory leak in BytesToBytesMap
> --
>
> Key: SPARK-19839
> URL: https://issues.apache.org/jira/browse/SPARK-19839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19839) Fix memory leak in BytesToBytesMap

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19839:


Assignee: Apache Spark

> Fix memory leak in BytesToBytesMap
> --
>
> Key: SPARK-19839
> URL: https://issues.apache.org/jira/browse/SPARK-19839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19839) Fix memory leak in BytesToBytesMap

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19839:


Assignee: (was: Apache Spark)

> Fix memory leak in BytesToBytesMap
> --
>
> Key: SPARK-19839
> URL: https://issues.apache.org/jira/browse/SPARK-19839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Zhan Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-19796:


Assignee: Imran Rashid

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Assignee: Imran Rashid
>Priority: Blocker
> Fix For: 2.2.0
>
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19837) Fetch failure throws a SparkException in SparkHiveWriter

2017-03-06 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897961#comment-15897961
 ] 

Sital Kedia commented on SPARK-19837:
-

`SparkHiveDynamicPartitionWriterContainer` has been refactored in the latest 
master. Not sure if the issue still exists. Will close this JIRA and reopen if 
we still see the issue with the latest. 
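
For context, the general shape of the fix discussed here is to let fetch failures escape unwrapped instead of burying them in a generic SparkException. A hedged sketch, not the actual hiveWriterContainers code:
{code}
import org.apache.spark.SparkException
import org.apache.spark.shuffle.FetchFailedException

// Hedged sketch: rethrow fetch failures as-is so the scheduler can recognize them
// and regenerate the lost map output, and only wrap genuinely write-side errors.
def writeRowsSafely(write: () => Unit): Unit = {
  try {
    write()
  } catch {
    case fe: FetchFailedException => throw fe
    case t: Throwable => throw new SparkException("Task failed while writing rows.", t)
  }
}
{code}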

> Fetch failure throws a SparkException in SparkHiveWriter
> 
>
> Key: SPARK-19837
> URL: https://issues.apache.org/jira/browse/SPARK-19837
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>
> Currently Fetchfailure in SparkHiveWriter fails the job with following 
> exception
> {code}
> 0_0): org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:385)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Connection reset by 
> peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
> at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
> at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:343)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Closed] (SPARK-19837) Fetch failure throws a SparkException in SparkHiveWriter

2017-03-06 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia closed SPARK-19837.
---
Resolution: Fixed

> Fetch failure throws a SparkException in SparkHiveWriter
> 
>
> Key: SPARK-19837
> URL: https://issues.apache.org/jira/browse/SPARK-19837
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
>Reporter: Sital Kedia
>
> Currently Fetchfailure in SparkHiveWriter fails the job with following 
> exception
> {code}
> 0_0): org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:385)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:84)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.shuffle.FetchFailedException: Connection reset by 
> peer
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
> at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
> at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
> at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:343)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current

2017-03-06 Thread Nick Afshartous (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897965#comment-15897965
 ] 

Nick Afshartous commented on SPARK-19767:
-

I believe all the dependencies are installed, but I got this error about a missing 
tag {{'include_example'}}:

{code}
SKIP_API=1 jekyll build
Configuration file: none
Source: /home/nafshartous/Projects/spark
   Destination: /home/nafshartous/Projects/spark/_site
 Incremental build: disabled. Enable with --incremental
  Generating... 
 Build Warning: Layout 'global' requested in docs/api.md does not exist.
 Build Warning: Layout 'global' requested in docs/building-spark.md does 
not exist.
 Build Warning: Layout 'global' requested in docs/cluster-overview.md does 
not exist.
 Build Warning: Layout 'global' requested in docs/configuration.md does not 
exist.
 Build Warning: Layout 'global' requested in docs/contributing-to-spark.md 
does not exist.
 Build Warning: Layout 'global' requested in docs/ec2-scripts.md does not 
exist.
  Liquid Exception: Liquid syntax error (line 581): Unknown tag 
'include_example' in docs/graphx-programming-guide.md
jekyll 3.4.0 | Error:  Liquid syntax error (line 581): Unknown tag 
'include_example'
{code}
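
From the log it looks like the build was run from the repository root ({{Source: /home/nafshartous/Projects/spark}}, {{Configuration file: none}}); the {{include_example}} tag comes from a Jekyll plugin under {{docs/_plugins}}, so running {{SKIP_API=1 jekyll build}} from inside the {{docs}} directory should pick up both the config and that tag.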

> API Doc pages for Streaming with Kafka 0.10 not current
> ---
>
> Key: SPARK-19767
> URL: https://issues.apache.org/jira/browse/SPARK-19767
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Nick Afshartous
>Priority: Minor
>
> The API docs linked from the Spark Kafka 0.10 Integration page are not 
> current.  For instance, on the page
>https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
> the code examples show the new API (i.e. class ConsumerStrategies).  However, 
> following the links
> API Docs --> (Scala | Java)
> leads to API pages that do not have class ConsumerStrategies.  The API doc 
> package names  also have {code}streaming.kafka{code} as opposed to 
> {code}streaming.kafka10{code} 
> as in the code examples on streaming-kafka-0-10-integration.html.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server

2017-03-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-19796.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17140
[https://github.com/apache/spark/pull/17140]

> taskScheduler fails serializing long statements received by thrift server
> -
>
> Key: SPARK-19796
> URL: https://issues.apache.org/jira/browse/SPARK-19796
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Giambattista
>Priority: Blocker
> Fix For: 2.2.0
>
>
> This problem was observed after the changes made for SPARK-17931.
> In my use-case I'm sending very long insert statements to Spark thrift server 
> and they are failing at TaskDescription.scala:89 because writeUTF fails if 
> requested to write strings longer than 64Kb (see 
> https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for 
> a description of the issue).
> As suggested by Imran Rashid I tracked down the offending key: it is 
> "spark.job.description" and it contains the complete SQL statement.
> The problem can be reproduced by creating a table like:
> create table test (a int) using parquet
> and by sending an insert statement like:
> scala> val r = 1 to 128000
> scala> println("insert into table test values (" + r.mkString("),(") + ")")



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19839) Fix memory leak in BytesToBytesMap

2017-03-06 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19839:
--

 Summary: Fix memory leak in BytesToBytesMap
 Key: SPARK-19839
 URL: https://issues.apache.org/jira/browse/SPARK-19839
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Zhan Zhang






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17495) Hive hash implementation

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17495:


Assignee: Apache Spark  (was: Tejas Patil)

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17495) Hive hash implementation

2017-03-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17495:


Assignee: Tejas Patil  (was: Apache Spark)

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17495) Hive hash implementation

2017-03-06 Thread Tejas Patil (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil reopened SPARK-17495:
-

Re-opening. This is not done yet, as there are time-related datatypes that need 
to be handled, and this hash still needs to be used in the codebase.

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility so that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-06 Thread Amit Sela (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897868#comment-15897868
 ] 

Amit Sela commented on SPARK-19067:
---

Sweet!
I will make time tomorrow to go over the PR thoroughly (we're on ~opposite 
timezones ;) ). 
I also see a note about the State Store API, which is something I'm really looking 
forward to. Any news there?

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.  However, this does not give users control of emission or expiration of 
> state, making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> structured streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {  
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], 
> State[S]) => U)
> def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, 
> Iterator[V], State[S]) => Iterator[U])
>   // Java friendly
>def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, 
> R], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>def flatMapGroupsWithState[S, U](func: 
> FlatMapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], 
> resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends 
> Serializable {
>   Iterator<R> call(K key, Iterator<V> values, State<S> state) throws 
> Exception;
> }
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean   
>   def get(): S  // throws an exception if state does not exist
>   def getOption(): Option[S]   
>   def update(newState: S): Unit
>   def remove(): Unit  // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If the state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: 
> KeyedState[Long]) => {
> val newCount = words.size + runningCount.getOption.getOrElse(0L)
> runningCount.update(newCount)
>(word, newCount)
> }
> dataset   // type 
> is Dataset[String]
>   .groupByKey[String](w => w) // generates 
> KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)// returns 
> Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (that has not received data for a while)
> - General expression based expiration 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19790) OutputCommitCoordinator should not allow another task to commit after an ExecutorFailure

2017-03-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897863#comment-15897863
 ] 

Steve Loughran commented on SPARK-19790:


The only time a task output committer should be making observable state changes 
is during the actual commit operation. If it is doing things before that commit 
operation, that's a bug, in that it doesn't meet the goal of being a "committer".

The Hadoop output committer has two stages here: the FileOutputFormat work and 
then the rename of files; together they are not a transaction, but on a real 
filesystem they are fast.

The now-deleted DirectOutputCommitter was doing things as it went along, but 
that's why it got pulled.

That leaves the Hadoop output committer committing work on object stores which 
implement rename() as a copy, which is slow and leaves a large enough failure 
window. HADOOP-13786 is going to make that window very small indeed, at least 
for job completion.

One thing to look at here is the 
{{org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter}} protocol, where 
the committer can be asked whether or not it supports recovery, as well as 
{{isCommitJobRepeatable}} to probe for a job commit being repeatable even if it 
fails partway through. The committer gets to implement its policy there.
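
A hedged sketch of that probing, assuming a Hadoop version where both probes exist (the exact method signatures may vary across Hadoop releases, so treat this as an assumption rather than a reference):
{code}
import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter}

// Hedged sketch: ask the committer what it claims to support before deciding
// whether a task/job commit can be retried or recovered after a failure.
def describeCommitter(committer: OutputCommitter, jobContext: JobContext): Unit = {
  println(s"recovery supported:    ${committer.isRecoverySupported}")
  println(s"job commit repeatable: ${committer.isCommitJobRepeatable(jobContext)}")
}
{code}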


> OutputCommitCoordinator should not allow another task to commit after an 
> ExecutorFailure
> 
>
> Key: SPARK-19790
> URL: https://issues.apache.org/jira/browse/SPARK-19790
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>
> The OutputCommitCoordinator resets the allowed committer when the task fails. 
>  
> https://github.com/apache/spark/blob/8aa560b75e6b083b2a890c52301414285ba35c3d/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L143
> However, if a task fails because of an ExecutorFailure, we actually have no 
> idea what the status is of the task.  The task may actually still be running, 
> and perhaps successfully commit its output.  By allowing another task to 
> commit its output, there is a chance that multiple tasks commit, which can 
> result in corrupt output.  This would be particularly problematic when 
> commit() is an expensive operation, eg. moving files on S3.
> For other task failures, we can allow other tasks to commit.  But with an 
> ExecutorFailure, its not clear what the right thing to do is.  The only safe 
> thing to do may be to fail the job.
> This is related to SPARK-19631, and was discovered during discussion on that 
> PR https://github.com/apache/spark/pull/16959#discussion_r103549134



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897859#comment-15897859
 ] 

Apache Spark commented on SPARK-19067:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/17179

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF.  However, this does not give users control of emission or expiration of 
> state, making it hard to implement things like sessionization.  We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> structured streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following
> - Access the input row corresponding to a key
> - Access the previous state corresponding to a key
> - Optionally, update or remove the state
> - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {  
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], 
> State[S]) => U)
> def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, 
> Iterator[V], State[S]) => Iterator[U])
>   // Java friendly
>def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, 
> R], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>def flatMapGroupsWithState[S, U](func: 
> FlatMapGroupsWithStateFunction[K, V, S, R], stateEncoder: Encoder[S], 
> resultEncoder: Encoder[U])
> }
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, State<S> state) throws Exception;
> }
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends 
> Serializable {
>   Iterator<R> call(K key, Iterator<V> values, State<S> state) throws 
> Exception;
> }
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean   
>   def get(): S  // throws an exception if state does not exist
>   def getOption(): Option[S]   
>   def update(newState: S): Unit
>   def remove(): Unit  // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If the state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers.
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: 
> KeyedState[Long]) => {
> val newCount = words.size + runningCount.getOption.getOrElse(0L)
> runningCount.update(newCount)
>(word, newCount)
> }
> dataset   // type 
> is Dataset[String]
>   .groupByKey[String](w => w) // generates 
> KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)// returns 
> Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (that has not received data for a while)
> - General expression based expiration 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF

2017-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897799#comment-15897799
 ] 

Dongjoon Hyun edited comment on SPARK-18832 at 3/6/17 7:11 PM:
---

Hi, [~roadster11x].

Thank you for the sample file. I tried the following locally with your Sample 
code on Apache Spark 2.0.0. (I removed the package name line from the code just 
for simplicity.)

*HIVE*
{code}
$ hive
hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar';
hive> exit;

$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}


was (Author: dongjoon):
Hi, [~roadster11x].

Thank you for the sample file. I tried the following locally with your Sample 
code on Apache Spark 2.0.0. (I removed the package name line from the code just 
for simplicity.)

*HIVE*
{code}
$ hive
hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar';
hive> exit;

$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}

I'm wondering if your Hive works with your function.

> Spark SQL: Thriftserver unable to run a registered Hive UDTF
> 
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: HDP: 2.5
> Spark: 2.0.0
>Reporter: Lokesh Yadav
> Attachments: SampleUDTF.java
>
>
> Spark Thriftserver is unable to run a HiveUDTF.
> It throws an error that it is unable to find the function, although the 
> function registration succeeds and the function does show up in the list 
> output by {{show functions}}.
> I am using a Hive UDTF, registering it using a jar placed on my local 
> machine. Calling it using the following commands:
> //Registering the functions, this command succeeds.
> {{CREATE FUNCTION SampleUDTF AS 
> 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR 
> 

[jira] [Commented] (SPARK-18568) vertex attributes in the edge triplet not getting updated in super steps for Pregel API

2017-03-06 Thread Laurent Philippart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897852#comment-15897852
 ] 

Laurent Philippart commented on SPARK-18568:


Does it also occur when the vertex attribute or the message contains an array 
(despite using only array copy)?
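
For reference, the workaround described in the report looks roughly like this (a hedged sketch, assuming {{sc}} is the active SparkContext; the attribute type is kept trivial on purpose):
{code}
import org.apache.spark.graphx.{Edge, Graph}

// Hedged sketch of the workaround from the description: rebuilding the graph between
// supersteps forces the triplet view to see the latest vertex attributes, at the cost
// of re-creating the graph on every iteration.
val vertices = sc.parallelize(Seq((1L, 0), (2L, 0)))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0)))
var g = Graph(vertices, edges)

for (superstep <- 1 to 3) {
  g = g.mapVertices((_, attr) => attr + 1)   // stand-in for the real vprog update
  g = Graph(g.vertices, g.edges)             // workaround: refresh the triplet view
}
{code}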

> vertex attributes in the edge triplet not getting updated in super steps for 
> Pregel API
> ---
>
> Key: SPARK-18568
> URL: https://issues.apache.org/jira/browse/SPARK-18568
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.2
>Reporter: Rohit
>
> When running the Pregel API with vertex attributes that are complex objects, the 
> vertex attributes are not getting updated in the triplet view. For example, if 
> the vertex attribute changes in the first superstep for vertex "a", the triplet 
> src attributes in the sendMsg program for the first superstep get the 
> latest attributes of vertex "a", but in the 2nd superstep, if the vertex 
> attribute changes in the vprog, the edge triplets are not updated with this 
> new state of the vertex for all the edge triplets having vertex "a" as 
> src or destination. If I re-create the graph using g = Graph(g.vertices, 
> g.edges) in the while loop before the next superstep, then it gets 
> updated. But this fix is not good performance-wise. A detailed description of 
> the bug along with the code to recreate it is in the attached URL.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)

2017-03-06 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897851#comment-15897851
 ] 

Tathagata Das commented on SPARK-19067:
---

Yes, they will be resettable. Just see the timeout subtask that I added just 
now.

> mapGroupsWithState - arbitrary stateful operations with Structured Streaming 
> (similar to DStream.mapWithState)
> --
>
> Key: SPARK-19067
> URL: https://issues.apache.org/jira/browse/SPARK-19067
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tathagata Das
>Priority: Critical
>
> Right now the only way to do stateful operations is with Aggregator or 
> UDAF. However, this does not give users control over the emission or expiration of 
> state, making it hard to implement things like sessionization. We should add 
> a more general construct (probably similar to {{DStream.mapWithState}}) to 
> Structured Streaming. Here is the design. 
> *Requirements*
> - Users should be able to specify a function that can do the following:
>   - Access the input row corresponding to a key
>   - Access the previous state corresponding to a key
>   - Optionally, update or remove the state
>   - Output any number of new rows (or none at all)
> *Proposed API*
> {code}
> //  New methods on KeyValueGroupedDataset 
> class KeyValueGroupedDataset[K, V] {
>   // Scala friendly
>   def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
>   def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])
> 
>   // Java friendly
>   def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
>   def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
> }
> 
> // --- New Java-friendly function classes --- 
> public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
> }
> 
> public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
>   Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
> }
> 
> // -- Wrapper class for state data -- 
> trait KeyedState[S] {
>   def exists(): Boolean
>   def get(): S                  // throws an Exception if state does not exist
>   def getOption(): Option[S]
>   def update(newState: S): Unit
>   def remove(): Unit            // exists() will be false after this
> }
> {code}
> Key Semantics of the State class
> - The state can be null.
> - If state.remove() is called, then state.exists() will return false, and 
> getOption will return None.
> - After state.update(newState) is called, state.exists() will 
> return true, and getOption will return Some(...). 
> - None of the operations are thread-safe. This is to avoid memory barriers. (A 
> minimal, self-contained sketch of these semantics follows after this description.)
> *Usage*
> {code}
> val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
>   val newCount = words.size + runningCount.getOption.getOrElse(0L)
>   runningCount.update(newCount)
>   (word, newCount)
> }
> 
> dataset                                                  // type is Dataset[String]
>   .groupByKey[String](w => w)                            // generates KeyValueGroupedDataset[String, String]
>   .mapGroupsWithState[Long, (String, Long)](stateFunc)   // returns Dataset[(String, Long)]
> {code}
> *Future Directions*
> - Timeout based state expiration (that has not received data for a while)
> - General expression based expiration 
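
For illustration, the following self-contained Scala sketch mirrors the state semantics listed above using a toy in-memory stand-in for the proposed {{KeyedState}} trait. The {{InMemoryKeyedState}} stub is purely illustrative and not part of any Spark API.

{code}
// Toy in-memory stand-in for the proposed KeyedState trait; illustrative only.
class InMemoryKeyedState[S] extends Serializable {
  private var value: Option[S] = None

  def exists(): Boolean = value.isDefined
  def get(): S = value.getOrElse(throw new NoSuchElementException("state does not exist"))
  def getOption(): Option[S] = value
  def update(newState: S): Unit = { value = Some(newState) }
  def remove(): Unit = { value = None }
}

object KeyedStateSemanticsDemo {
  def main(args: Array[String]): Unit = {
    val state = new InMemoryKeyedState[Long]

    assert(!state.exists() && state.getOption().isEmpty)       // initially: no state

    state.update(1L)
    assert(state.exists() && state.getOption().contains(1L))   // after update: exists, Some(...)

    state.remove()
    assert(!state.exists() && state.getOption().isEmpty)       // after remove: gone, None
  }
}
{code}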



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19838) Adding Processing Time based timeout

2017-03-06 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-19838:
-

 Summary: Adding Processing Time based timeout
 Key: SPARK-19838
 URL: https://issues.apache.org/jira/browse/SPARK-19838
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Tathagata Das
Assignee: Tathagata Das


When a key does not get any new data in mapGroupsWithState, the mapping 
function is never called on it. So we need a timeout feature that calls the 
function again in such cases, so that the user can decide whether to continue 
waiting or clean up (remove state, save stuff externally, etc.).

Timeouts can be based on either processing time or event time. This JIRA is 
for processing time, but defines the high-level API design for both. The usage 
would look like this:

{code}
def stateFunction(key: K, value: Iterator[V], state: KeyedState[S]): U = {
  ...
  state.setTimeoutDuration(1)
  ...
}

dataset                            // type is Dataset[T]
  .groupByKey[K](keyingFunc)       // generates KeyValueGroupedDataset[K, T]
  .mapGroupsWithState[S, U](
    func = stateFunction,
    timeout = KeyedStateTimeout.withProcessingTime)   // returns Dataset[U]
{code}

Note the following design aspects. 

- The timeout type is provided to mapGroupsWithState as a parameter global to 
all the keys. This is so that the planner knows this at planning time and can 
accordingly optimize the execution based on whether to save extra info in 
state or not (e.g. timeout durations or timestamps).

- The exact timeout duration is provided inside the function call so that it 
can be customized on a per key basis.

- When the timeout occurs for a key, the function is called with no values, and 
{{KeyedState.isTimingOut()}} set to {{true}}.

- The timeout is reset for a key every time the function is called on the key, 
that is, when the key has new data or the key has timed out. So the user has 
to set the timeout duration every time the function is called; otherwise no 
timeout will be set. (A sketch of a state function that handles both cases 
follows below.)
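
For illustration, here is a minimal sketch of a user state function written against the proposed API. {{KeyedState}}, {{isTimingOut()}} and {{setTimeoutDuration}} follow this proposal and SPARK-19067 rather than any released Spark API; the stand-in trait, the session-count logic and the 30-minute duration are assumptions made only to keep the sketch self-contained.

{code}
// Stand-in for the proposed KeyedState API (SPARK-19067 plus the timeout methods
// proposed here); defined locally only so the sketch compiles on its own.
trait KeyedState[S] {
  def exists(): Boolean
  def getOption(): Option[S]
  def update(newState: S): Unit
  def remove(): Unit
  def isTimingOut(): Boolean
  def setTimeoutDuration(durationMs: Long): Unit
}

// Counts events per user and handles both the data path and the timeout path.
def sessionCountFunc(
    user: String,
    events: Iterator[String],
    state: KeyedState[Long]): (String, Long) = {
  if (state.isTimingOut()) {
    // No new data arrived within the timeout: emit the final count and drop the state.
    val finalCount = state.getOption().getOrElse(0L)
    state.remove()
    (user, finalCount)
  } else {
    // New data for this key: update the count and re-arm the timeout.
    val newCount = state.getOption().getOrElse(0L) + events.size
    state.update(newCount)
    state.setTimeoutDuration(30 * 60 * 1000L)  // must be re-set on every call, as noted above
    (user, newCount)
  }
}
{code}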







--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF

2017-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897841#comment-15897841
 ] 

Dongjoon Hyun commented on SPARK-18832:
---

Could you upload your `jar` file? It could be a problem of jar packaging.

> Spark SQL: Thriftserver unable to run a registered Hive UDTF
> 
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: HDP: 2.5
> Spark: 2.0.0
>Reporter: Lokesh Yadav
> Attachments: SampleUDTF.java
>
>
> Spark Thriftserver is unable to run a Hive UDTF.
> It throws an error that it is unable to find the function, although the 
> function registration succeeds and the function does show up in the list 
> output by {{show functions}}.
> I am using a Hive UDTF, registering it using a jar placed on my local 
> machine and calling it using the following commands:
> //Registering the functions, this command succeeds.
> {{CREATE FUNCTION SampleUDTF AS 
> 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR 
> '/root/spark_files/experiments-1.2.jar';}}
> //Thriftserver is able to look up the function with this command:
> {{DESCRIBE FUNCTION SampleUDTF;}}
> {quote}
> {noformat}
> Output: 
> +---+--+
> |   function_desc   |
> +---+--+
> | Function: default.SampleUDTF  |
> | Class: com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF  |
> | Usage: N/A.   |
> +---+--+
> {noformat}
> {quote}
> // Calling the function: 
> {{SELECT SampleUDTF('Paris');}}
> bq. Output of the above command: Error: 
> org.apache.spark.sql.AnalysisException: Undefined function: 'SampleUDTF'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.; line 1 pos 7 (state=,code=0)
> I have also tried using a non-local (on HDFS) jar, but I get the same 
> error.
> My environment: HDP 2.5 with Spark 2.0.0
> I have attached the class file for the UDTF I am using in testing this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF

2017-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897799#comment-15897799
 ] 

Dongjoon Hyun edited comment on SPARK-18832 at 3/6/17 6:55 PM:
---

Hi, [~roadster11x].

Thank you for the sample file. I tried the following locally with your Sample 
code on Apache Spark 2.0.0. (I removed the package name line from the code just 
for simplicity.)

*HIVE*
{code}
$ hive
hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar';
hive> exit;

$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}

I'm wondering if your Hive works with your function.


was (Author: dongjoon):
Hi, [~roadster11x].

Thank you for the sample file. I tried the following locally with your Sample 
code on Apache Spark 2.0.0. (I removed the package name line from the code just 
for simplicity.)

*HIVE*
{code}
$ hive
hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar';
hive> exit;

$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}

I'm wondering if Hive works with your function.

> Spark SQL: Thriftserver unable to run a registered Hive UDTF
> 
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: HDP: 2.5
> Spark: 2.0.0
>Reporter: Lokesh Yadav
> Attachments: SampleUDTF.java
>
>
> Spark Thriftserver is unable to run a Hive UDTF.
> It throws an error that it is unable to find the function, although the 
> function registration succeeds and the function does show up in the list 
> output by {{show functions}}.
> I am using a Hive UDTF, registering it using a jar placed on my local 
> machine and calling it using the following commands:
> //Registering the functions, this command succeeds.
> {{CREATE FUNCTION SampleUDTF AS 
> 

[jira] [Comment Edited] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF

2017-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897799#comment-15897799
 ] 

Dongjoon Hyun edited comment on SPARK-18832 at 3/6/17 6:53 PM:
---

Hi, [~roadster11x].

Thank you for the sample file. I tried the following locally with your Sample 
code on Apache Spark 2.0.0. (I removed the package name line from the code just 
for simplicity.)

*HIVE*
{code}
$ hive
hive> CREATE FUNCTION hello AS 'SampleUDTF' USING JAR '/Users/dhyun/UDF/a.jar';
hive> exit;

$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}

I'm wondering if Hive works with your function.


was (Author: dongjoon):
Hi, [~roadster11x].

Thank you for the sample file. I tried the following locally with your Sample 
code on Apache Spark 2.0.0. (I removed the package name line from the code just 
for simplicity.)

*HIVE*
{code}
$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}

I'm wondering if Hive works with your function.

> Spark SQL: Thriftserver unable to run a registered Hive UDTF
> 
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: HDP: 2.5
> Spark: 2.0.0
>Reporter: Lokesh Yadav
> Attachments: SampleUDTF.java
>
>
> Spark Thriftserver is unable to run a Hive UDTF.
> It throws an error that it is unable to find the function, although the 
> function registration succeeds and the function does show up in the list 
> output by {{show functions}}.
> I am using a Hive UDTF, registering it using a jar placed on my local 
> machine and calling it using the following commands:
> //Registering the functions, this command succeeds.
> {{CREATE FUNCTION SampleUDTF AS 
> 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR 
> '/root/spark_files/experiments-1.2.jar';}}
> //Thriftserver is able 

[jira] [Comment Edited] (SPARK-18832) Spark SQL: Thriftserver unable to run a registered Hive UDTF

2017-03-06 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897799#comment-15897799
 ] 

Dongjoon Hyun edited comment on SPARK-18832 at 3/6/17 6:46 PM:
---

Hi, [~roadster11x].

Thank you for the sample file. I tried the following locally with your Sample 
code on Apache Spark 2.0.0. (I removed the package name line from the code just 
for simplicity.)

*HIVE*
{code}
$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}

I'm wondering if Hive works with your function.


was (Author: dongjoon):
Hi, [~roadster11x].

Thank you for the sample file. I tried the following with your Sample code on 
Apache Spark 2.0.0. (I removed the package name line from the code just for 
simplicity.)

*HIVE*
{code}
$ hive
Logging initialized using configuration in 
jar:file:/usr/local/Cellar/hive12/1.2.1/libexec/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> select hello('a');
Added [/Users/dhyun/UDF/a.jar] to class path
Added resources: [/Users/dhyun/UDF/a.jar]
OK
***a*** ###a###
Time taken: 1.347 seconds, Fetched: 1 row(s)
{code}

*SPARK THRIFTSERVER*
{code}
$ SPARK_HOME=$PWD sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to 
/Users/dhyun/spark-release/spark-2.0.0-bin-hadoop2.7/logs/spark-dhyun-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-HW13499.local.out

$ bin/beeline -u jdbc:hive2://localhost:1/default
Connecting to jdbc:hive2://localhost:1/default
...
Connected to: Spark SQL (version 2.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1.spark2 by Apache Hive
0: jdbc:hive2://localhost:1/default> select hello('a');
+--+--+--+
|  first   |  second  |
+--+--+--+
| ***a***  | ###a###  |
+--+--+--+
1 row selected (2.031 seconds)
0: jdbc:hive2://localhost:1/default> describe function hello;
+--+--+
|  function_desc   |
+--+--+
| Function: default.hello  |
| Class: SampleUDTF|
| Usage: N/A.  |
+--+--+
3 rows selected (0.041 seconds)
0: jdbc:hive2://localhost:1/default>
{code}

I'm wondering if Hive works with your function.

> Spark SQL: Thriftserver unable to run a registered Hive UDTF
> 
>
> Key: SPARK-18832
> URL: https://issues.apache.org/jira/browse/SPARK-18832
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Environment: HDP: 2.5
> Spark: 2.0.0
>Reporter: Lokesh Yadav
> Attachments: SampleUDTF.java
>
>
> Spark Thriftserver is unable to run a Hive UDTF.
> It throws an error that it is unable to find the function, although the 
> function registration succeeds and the function does show up in the list 
> output by {{show functions}}.
> I am using a Hive UDTF, registering it using a jar placed on my local 
> machine and calling it using the following commands:
> //Registering the functions, this command succeeds.
> {{CREATE FUNCTION SampleUDTF AS 
> 'com.fuzzylogix.experiments.udf.hiveUDF.SampleUDTF' USING JAR 
> '/root/spark_files/experiments-1.2.jar';}}
> //Thriftserver is able to look up the function with this command:
> {{DESCRIBE FUNCTION SampleUDTF;}}
> {quote}
> {noformat}
> 

[jira] [Assigned] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-03-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19257:
---

Assignee: Song Jun

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> Currently we treat `CatalogStorageFormat.locationUri` as a URI string and 
> always convert it to a path by `new Path(new URI(locationUri))`.
> It will be safer if we make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52 
> (a short illustration of the motivation follows below).
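
For illustration, the following hypothetical Scala sketch shows why the String round trip is fragile: a raw location containing a space is not a valid URI string, so {{new URI(locationUri)}} throws before a {{Path}} is ever built, whereas carrying a {{java.net.URI}} end-to-end avoids the re-parse. The paths below are made up.

{code}
import java.net.{URI, URISyntaxException}
import org.apache.hadoop.fs.Path

object LocationUriSketch {
  def main(args: Array[String]): Unit = {
    val rawLocation = "/warehouse/my db/my table"   // hypothetical location containing spaces

    // Current String-based handling: re-parsing the raw string fails.
    try {
      new Path(new URI(rawLocation))
    } catch {
      case e: URISyntaxException =>
        println(s"String round trip failed: ${e.getMessage}")
    }

    // Carrying a java.net.URI avoids the fragile re-parse entirely.
    val locationUri: URI = new java.io.File(rawLocation).toURI  // file:/warehouse/my%20db/my%20table
    val path = new Path(locationUri)                            // Path accepts a URI directly
    println(path)
  }
}
{code}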



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19257) The type of CatalogStorageFormat.locationUri should be java.net.URI instead of String

2017-03-06 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19257.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17149
[https://github.com/apache/spark/pull/17149]

> The type of CatalogStorageFormat.locationUri should be java.net.URI instead 
> of String
> -
>
> Key: SPARK-19257
> URL: https://issues.apache.org/jira/browse/SPARK-19257
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 2.2.0
>
>
> Currently we treat `CatalogStorageFormat.locationUri` as a URI string and 
> always convert it to a path by `new Path(new URI(locationUri))`.
> It will be safer if we make the type of 
> `CatalogStorageFormat.locationUri` java.net.URI. We should finish the TODO in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L50-L52



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

2017-03-06 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-19304.
-
  Resolution: Fixed
   Fix Version/s: 2.2.0
Target Version/s: 2.2.0

Resolved by: https://github.com/apache/spark/pull/16842

> Kinesis checkpoint recovery is 10x slow
> ---
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: using s3 for checkpoints using 1 executor, with 19g mem 
> & 3 cores per executor
>Reporter: Gaurav Shah
>Assignee: Gaurav Shah
>  Labels: kinesis
> Fix For: 2.2.0
>
>
> The application runs fine initially, running batches of 1 hour, and the 
> processing time is less than 30 minutes on average. If for some reason the 
> application crashes and we try to restart from a checkpoint, the processing 
> now takes forever and does not move forward. We tried to test the same 
> thing at a batch interval of 1 minute: the processing runs fine and takes 1.2 
> minutes for a batch to finish, but when we recover from a checkpoint it takes 
> about 15 minutes for each batch. After the recovery the batches again process 
> at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: 
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19304) Kinesis checkpoint recovery is 10x slow

2017-03-06 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-19304:
---

Assignee: Gaurav Shah

> Kinesis checkpoint recovery is 10x slow
> ---
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
> Environment: using s3 for checkpoints using 1 executor, with 19g mem 
> & 3 cores per executor
>Reporter: Gaurav Shah
>Assignee: Gaurav Shah
>  Labels: kinesis
>
> The application runs fine initially, running batches of 1 hour, and the 
> processing time is less than 30 minutes on average. If for some reason the 
> application crashes and we try to restart from a checkpoint, the processing 
> now takes forever and does not move forward. We tried to test the same 
> thing at a batch interval of 1 minute: the processing runs fine and takes 1.2 
> minutes for a batch to finish, but when we recover from a checkpoint it takes 
> about 15 minutes for each batch. After the recovery the batches again process 
> at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details: 
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


