[jira] [Resolved] (SPARK-24188) /api/v1/version not working

2018-05-07 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-24188.
-
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21245
[https://github.com/apache/spark/pull/21245]

> /api/v1/version not working
> ---
>
> Key: SPARK-24188
> URL: https://issues.apache.org/jira/browse/SPARK-24188
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> That URI from the REST API is currently returning a 404.
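
For reference, once the endpoint is wired up it can be spot-checked from a running application. A minimal Scala sketch, assuming the driver UI is reachable on the default port 4040:
{code}
// Fetch the REST API's version resource from a locally running application.
// Assumes the Spark UI is at http://localhost:4040; adjust host/port as needed.
val json = scala.io.Source.fromURL("http://localhost:4040/api/v1/version").mkString
println(json) // a small JSON document containing the Spark version
{code}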






[jira] [Assigned] (SPARK-24188) /api/v1/version not working

2018-05-07 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-24188:
---

Assignee: Marcelo Vanzin

> /api/v1/version not working
> ---
>
> Key: SPARK-24188
> URL: https://issues.apache.org/jira/browse/SPARK-24188
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
>
> That URI from the REST API is currently returning a 404.






[jira] [Commented] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466948#comment-16466948
 ] 

kumar commented on SPARK-24200:
---

This is not a question about how to do it; it's an improvement suggestion. I found a 
solution to make it work, but I am wondering why subdirectories are not 
picked up without giving asterisks?
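
For what it's worth, one workaround that avoids encoding the directory depth with asterisks is to turn on recursive input listing. A minimal Scala sketch, assuming a Hadoop 2.x FileInputFormat-based read (the flag below is a Hadoop setting, not a Spark API):
{code}
// Enable recursive listing so files in nested subdirectories are picked up
// regardless of depth, then read the top-level folder directly.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
val lines = sc.textFile("/Users/test/data")
println(lines.count())
{code}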

> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/* /* ";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? 
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ? I know 
> you might have thought about this, but i am wondering why this has not been 
> implemented ?






[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24200:
--
Description: 
String folder = "/Users/test/data/* /* ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? 

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ? I know you 
might have thought about this, but i am wondering why this has not been 
implemented ?

  was:
String folder = "/Users/test/data/* /* ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ? I know you 
might have thought about this, but i am wondering why this has not been 
implemented ?


> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/* /* ";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? 
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ? I know 
> you might have thought about this, but i am wondering why this has not been 
> implemented ?






[jira] [Created] (SPARK-24207) PrefixSpan: R API

2018-05-07 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-24207:


 Summary: PrefixSpan: R API
 Key: SPARK-24207
 URL: https://issues.apache.org/jira/browse/SPARK-24207
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Felix Cheung









[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR

2018-05-07 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466920#comment-16466920
 ] 

Felix Cheung commented on SPARK-23780:
--

I suppose if you load googleVis first and then SparkR it would have the same 
effect as Ivan's steps?

> Failed to use googleVis library with new SparkR
> ---
>
> Key: SPARK-23780
> URL: https://issues.apache.org/jira/browse/SPARK-23780
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.1
>Reporter: Ivan Dzikovsky
>Priority: Major
>
> I've tried to use googleVis library with Spark 2.2.1, and faced with problem.
> Steps to reproduce:
> # Install R with googleVis library.
> # Run SparkR:
> {code}
> sparkR --master yarn --deploy-mode client
> {code}
> # Run code that uses googleVis:
> {code}
> library(googleVis)
> df=data.frame(country=c("US", "GB", "BR"), 
>   val1=c(10,13,14), 
>   val2=c(23,12,32))
> Bar <- gvisBarChart(df)
> cat("%html ", Bar$html$chart)
> {code}
> Than I got following error message:
> {code}
> Error : .onLoad failed in loadNamespace() for 'googleVis', details:
>   call: rematchDefinition(definition, fdef, mnames, fnames, signature)
>   error: methods can add arguments to the generic 'toJSON' only if '...' is 
> an argument to the generic
> Error : package or namespace load failed for 'googleVis'
> {code}
> But expected result is to get some HTML code output, as it was with Spark 
> 2.1.0.






[jira] [Commented] (SPARK-24206) Improve DataSource benchmark code for read and pushdown

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466902#comment-16466902
 ] 

Apache Spark commented on SPARK-24206:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21266

> Improve DataSource benchmark code for read and pushdown
> ---
>
> Key: SPARK-24206
> URL: https://issues.apache.org/jira/browse/SPARK-24206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> I improved the DataSource code for read and pushdown in the parquet v1.10.0 
> upgrade activity: [https://github.com/apache/spark/pull/21070]
> Based on the code, we need to brush up the benchmark code and results in the 
> master.






[jira] [Assigned] (SPARK-24206) Improve DataSource benchmark code for read and pushdown

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24206:


Assignee: Apache Spark

> Improve DataSource benchmark code for read and pushdown
> ---
>
> Key: SPARK-24206
> URL: https://issues.apache.org/jira/browse/SPARK-24206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> I improved the DataSource code for read and pushdown in the parquet v1.10.0 
> upgrade activity: [https://github.com/apache/spark/pull/21070]
> Based on the code, we need to brush up the benchmark code and results in the 
> master.






[jira] [Assigned] (SPARK-24206) Improve DataSource benchmark code for read and pushdown

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24206:


Assignee: (was: Apache Spark)

> Improve DataSource benchmark code for read and pushdown
> ---
>
> Key: SPARK-24206
> URL: https://issues.apache.org/jira/browse/SPARK-24206
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> I improved the DataSource code for read and pushdown in the parquet v1.10.0 
> upgrade activity: [https://github.com/apache/spark/pull/21070]
> Based on the code, we need to brush up the benchmark code and results in the 
> master.






[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-07 Thread Vikram Agrawal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466897#comment-16466897
 ] 

Vikram Agrawal commented on SPARK-18165:


Thanks [~marmbrus]

- Planning to start the work on porting the connector in the next few weeks. Will 
share my feedback / ask for help once I am ready. 
- Thanks for your suggestion. Will check out Apache Bahir / Spark Packages and 
start a PR once I have ported my changes to the DataSourceV2 APIs.

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming






[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-05-07 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-20114:
---
Component/s: (was: PySpark)

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First list a few design issues to be discussed, then subtasks like Scala, 
> Python and R API will be created.
> # Wrapping the MLlib PrefixSpan and provide a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which is not good to be used directly for predicting on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access for frequent sequential patterns.
>  #*  Adding the feature to extract sequential rules from sequential 
> patterns. Then use the sequential rules in the transform as FPGrowthModel.  
> The rules extracted are of the form X–> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Different from association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from the users to see which kind of Sequential rules 
> are more practical. 
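
For context, the spark.mllib estimator that would be wrapped is used roughly like this (Scala sketch with illustrative data and parameters):
{code}
import org.apache.spark.mllib.fpm.PrefixSpan

// Each record is a sequence of itemsets.
val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))), 2).cache()

val model = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .run(sequences)

// Frequent sequential patterns and their frequencies.
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ",", "]")).mkString(",") + " -> " + fs.freq)
}
{code}
A DataFrame-based wrapper would mostly need to decide how these patterns are exposed, which is what the options above discuss.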






[jira] [Updated] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API

2018-05-07 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-24146:
---
Component/s: PySpark

> spark.ml parity for sequential pattern mining - PrefixSpan: Python API
> --
>
> Key: SPARK-24146
> URL: https://issues.apache.org/jira/browse/SPARK-24146
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Major
>
> spark.ml parity for sequential pattern mining - PrefixSpan: Python API






[jira] [Commented] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466896#comment-16466896
 ] 

Apache Spark commented on SPARK-24146:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21265

> spark.ml parity for sequential pattern mining - PrefixSpan: Python API
> --
>
> Key: SPARK-24146
> URL: https://issues.apache.org/jira/browse/SPARK-24146
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Major
>
> spark.ml parity for sequential pattern mining - PrefixSpan: Python API






[jira] [Assigned] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24146:


Assignee: Apache Spark

> spark.ml parity for sequential pattern mining - PrefixSpan: Python API
> --
>
> Key: SPARK-24146
> URL: https://issues.apache.org/jira/browse/SPARK-24146
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>Priority: Major
>
> spark.ml parity for sequential pattern mining - PrefixSpan: Python API






[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-05-07 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-20114:
---
Component/s: PySpark

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First list a few design issues to be discussed, then subtasks like Scala, 
> Python and R API will be created.
> # Wrapping the MLlib PrefixSpan and provide a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which is not good to be used directly for predicting on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access for frequent sequential patterns.
>  #*  Adding the feature to extract sequential rules from sequential 
> patterns. Then use the sequential rules in the transform as FPGrowthModel.  
> The rules extracted are of the form X–> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Different from association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from the users to see which kind of Sequential rules 
> are more practical. 






[jira] [Assigned] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24146:


Assignee: (was: Apache Spark)

> spark.ml parity for sequential pattern mining - PrefixSpan: Python API
> --
>
> Key: SPARK-24146
> URL: https://issues.apache.org/jira/browse/SPARK-24146
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Major
>
> spark.ml parity for sequential pattern mining - PrefixSpan: Python API






[jira] [Created] (SPARK-24206) Improve DataSource benchmark code for read and pushdown

2018-05-07 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-24206:


 Summary: Improve DataSource benchmark code for read and pushdown
 Key: SPARK-24206
 URL: https://issues.apache.org/jira/browse/SPARK-24206
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Takeshi Yamamuro


I improved the DataSource read and pushdown code as part of the Parquet v1.10.0 
upgrade work: [https://github.com/apache/spark/pull/21070]

Based on that code, we need to brush up the benchmark code and results in 
master.






[jira] [Assigned] (SPARK-24128) Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg

2018-05-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24128:


Assignee: Henry Robinson

> Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg
> ---
>
> Key: SPARK-24128
> URL: https://issues.apache.org/jira/browse/SPARK-24128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Assignee: Henry Robinson
>Priority: Minor
> Fix For: 2.3.1, 2.4.0
>
>
> The error message given when a query contains an implicit cartesian product 
> suggests rewriting the query using {{CROSS JOIN}}, but not disabling the 
> check using {{spark.sql.crossJoin.enabled=true}}. It's sometimes easier to 
> change a config variable than edit a query, so it would be helpful to make 
> the user aware of their options. 
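
For illustration, these are the two options such an error message could point at; a Scala sketch for a spark-shell session with made-up tables:
{code}
import spark.implicits._
val left  = Seq(1, 2).toDF("a")
val right = Seq(3, 4).toDF("b")

// Option 1: rewrite the query with an explicit cross join.
left.crossJoin(right).show()

// Option 2: relax the check via configuration instead of editing the query.
spark.conf.set("spark.sql.crossJoin.enabled", "true")
left.join(right).show()
{code}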






[jira] [Resolved] (SPARK-24128) Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg

2018-05-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24128.
--
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 21201
[https://github.com/apache/spark/pull/21201]

> Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg
> ---
>
> Key: SPARK-24128
> URL: https://issues.apache.org/jira/browse/SPARK-24128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Assignee: Henry Robinson
>Priority: Minor
> Fix For: 2.4.0, 2.3.1
>
>
> The error message given when a query contains an implicit cartesian product 
> suggests rewriting the query using {{CROSS JOIN}}, but not disabling the 
> check using {{spark.sql.crossJoin.enabled=true}}. It's sometimes easier to 
> change a config variable than edit a query, so it would be helpful to make 
> the user aware of their options. 






[jira] [Resolved] (SPARK-23975) Allow Clustering to take Arrays of Double as input features

2018-05-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-23975.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Allow Clustering to take Arrays of Double as input features
> ---
>
> Key: SPARK-23975
> URL: https://issues.apache.org/jira/browse/SPARK-23975
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Clustering algorithms should accept Arrays in addition to Vectors as input 
> features. The python interface should also be changed so that it would make 
> PySpark a lot easier to use. 
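
Until this lands, the usual workaround is a manual conversion; a minimal Scala sketch with made-up data and column names:
{code}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Array-typed features, as many feature pipelines produce them.
val df = Seq(Seq(0.0, 0.1), Seq(0.2, 0.1), Seq(9.0, 9.5)).toDF("featureArray")

// Convert array<double> to an ML Vector before handing it to KMeans.
val toVec = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
val model = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .fit(df.withColumn("features", toVec($"featureArray")))
{code}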






[jira] [Updated] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos

2018-05-07 Thread joy-m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

joy-m updated SPARK-24205:
--
Attachment: 屏幕快照 2018-05-08 上午10.58.08.png

> java.util.concurrent.locks.LockSupport.parkNanos
> 
>
> Key: SPARK-24205
> URL: https://issues.apache.org/jira/browse/SPARK-24205
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: joy-m
>Priority: Major
> Attachments: 屏幕快照 2018-05-08 上午10.58.08.png
>
>
> when i use yarn client mode, the spark task locked in the collect stage
> Because of the data is in the driver machine, so I used the client mode to 
> run my application!
> but the the stage collect was locked!
> countDf.collect().map(_.getLong(0)).mkString.toLong
> {{sun.misc.Unsafe.park(Native Method) 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>  
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
>  org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) 
> org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231)
>  
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) 
> java.security.AccessController.doPrivileged(Native Method) 
> javax.security.auth.Subject.doAs(Subject.java:422) 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>  
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}}






[jira] [Updated] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos

2018-05-07 Thread joy-m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

joy-m updated SPARK-24205:
--
Attachment: (was: 屏幕快照 2018-05-06 上午10.04.27.png)

> java.util.concurrent.locks.LockSupport.parkNanos
> 
>
> Key: SPARK-24205
> URL: https://issues.apache.org/jira/browse/SPARK-24205
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: joy-m
>Priority: Major
>
> when i use yarn client mode, the spark task locked in the collect stage
> Because of the data is in the driver machine, so I used the client mode to 
> run my application!
> but the the stage collect was locked!
> countDf.collect().map(_.getLong(0)).mkString.toLong
> {{sun.misc.Unsafe.park(Native Method) 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>  
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
>  org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) 
> org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231)
>  
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) 
> java.security.AccessController.doPrivileged(Native Method) 
> javax.security.auth.Subject.doAs(Subject.java:422) 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>  
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}}






[jira] [Updated] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos

2018-05-07 Thread joy-m (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

joy-m updated SPARK-24205:
--
Attachment: 屏幕快照 2018-05-06 上午10.04.27.png

> java.util.concurrent.locks.LockSupport.parkNanos
> 
>
> Key: SPARK-24205
> URL: https://issues.apache.org/jira/browse/SPARK-24205
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: joy-m
>Priority: Major
> Attachments: 屏幕快照 2018-05-06 上午10.04.27.png
>
>
> when i use yarn client mode, the spark task locked in the collect stage
> Because of the data is in the driver machine, so I used the client mode to 
> run my application!
> but the the stage collect was locked!
> countDf.collect().map(_.getLong(0)).mkString.toLong
> {{sun.misc.Unsafe.park(Native Method) 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
>  
> java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
>  org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) 
> org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231)
>  
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) 
> java.security.AccessController.doPrivileged(Native Method) 
> javax.security.auth.Subject.doAs(Subject.java:422) 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>  
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
>  
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}}






[jira] [Created] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos

2018-05-07 Thread joy-m (JIRA)
joy-m created SPARK-24205:
-

 Summary: java.util.concurrent.locks.LockSupport.parkNanos
 Key: SPARK-24205
 URL: https://issues.apache.org/jira/browse/SPARK-24205
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: joy-m


When I use YARN client mode, the Spark task gets stuck in the collect stage.
Because the data is on the driver machine, I used client mode to run 
my application, but the collect stage hangs on the following call:
countDf.collect().map(_.getLong(0)).mkString.toLong

{{sun.misc.Unsafe.park(Native Method) 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 
java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
 org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) 
org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281) 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231)
 org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) 
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) 
java.security.AccessController.doPrivileged(Native Method) 
javax.security.auth.Subject.doAs(Subject.java:422) 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}}






[jira] [Commented] (SPARK-24204) Verify a write schema in OrcFileFormat

2018-05-07 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466742#comment-16466742
 ] 

Takeshi Yamamuro commented on SPARK-24204:
--

The fix looks like this: 
https://github.com/apache/spark/compare/master...maropu:VerifySchemaInOrc
{code}
scala> df.write.orc("/tmp/orc")
java.lang.UnsupportedOperationException: ORC data source does not support null 
data type.
  at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer$.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$verifyType$1(OrcSerializer.scala:251)
  at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer$$anonfun$verifySchema$1.apply(OrcSerializer.scala:255)
  at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer$$anonfun$verifySchema$1.apply(OrcSerializer.scala:255)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
  at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer$.verifySchema(OrcSerializer.scala:255)
  at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.prepareWrite(OrcFileFormat.scala:92)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140)
  at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
{code}
cc: [~dongjoon]

> Verify a write schema in OrcFileFormat
> --
>
> Key: SPARK-24204
> URL: https://issues.apache.org/jira/browse/SPARK-24204
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The native orc file format throws an exception with a meaningless message in 
> executor-sides when unsupported types passed;
> {code}
> scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
> null)))
> scala> val schema = StructType(StructField("a", IntegerType) :: 
> StructField("b", NullType) :: Nil)
> scala> val df = spark.createDataFrame(rdd, schema)
> scala> df.write.orc("/tmp/orc")
> java.lang.IllegalArgumentException: Can't parse category at 
> 'struct'
> at 
> org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
> at 
> org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
> at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
> at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
> er.scala:226)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcSerializer.(OrcSerializer.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.(OrcOutputWriter.scala:36)
> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
> (FileFormatWriter.scala:278)
> {code}
> It seems to be better to verify a write schema in a driver side for users 
> along with the CSV fromat;
> https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65






[jira] [Assigned] (SPARK-24084) Add job group id for query through spark-sql

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24084:


Assignee: (was: Apache Spark)

> Add job group id for query through spark-sql
> 
>
> Key: SPARK-24084
> URL: https://issues.apache.org/jira/browse/SPARK-24084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> For spark-sql we can add job group id for the same statement.
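
For reference, the underlying SparkContext API already exists; the change would essentially call it per statement. A Scala sketch (the group-id scheme and table are made up):
{code}
import spark.implicits._
Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("some_table")

// Tag all jobs spawned by one SQL statement with a common group id so they can
// be found (and cancelled) together in the UI or via cancelJobGroup.
val statement = "SELECT count(*) FROM some_table"
spark.sparkContext.setJobGroup("spark-sql-stmt-001", statement)
try {
  spark.sql(statement).show()
} finally {
  spark.sparkContext.clearJobGroup()
}
{code}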






[jira] [Commented] (SPARK-24084) Add job group id for query through spark-sql

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466741#comment-16466741
 ] 

Apache Spark commented on SPARK-24084:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/21263

> Add job group id for query through spark-sql
> 
>
> Key: SPARK-24084
> URL: https://issues.apache.org/jira/browse/SPARK-24084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> For spark-sql we can add job group id for the same statement.






[jira] [Assigned] (SPARK-24084) Add job group id for query through spark-sql

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24084:


Assignee: Apache Spark

> Add job group id for query through spark-sql
> 
>
> Key: SPARK-24084
> URL: https://issues.apache.org/jira/browse/SPARK-24084
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Major
>
> For spark-sql we can add job group id for the same statement.






[jira] [Created] (SPARK-24204) Verify a write schema in OrcFileFormat

2018-05-07 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-24204:


 Summary: Verify a write schema in OrcFileFormat
 Key: SPARK-24204
 URL: https://issues.apache.org/jira/browse/SPARK-24204
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Takeshi Yamamuro


The native ORC file format throws an exception with a meaningless message on 
the executor side when unsupported types are passed;
{code}

scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, 
null)))
scala> val schema = StructType(StructField("a", IntegerType) :: 
StructField("b", NullType) :: Nil)
scala> val df = spark.createDataFrame(rdd, schema)
scala> df.write.orc("/tmp/orc")
java.lang.IllegalArgumentException: Can't parse category at 
'struct'
at 
org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332)
at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializ
er.scala:226)
at 
org.apache.spark.sql.execution.datasources.orc.OrcSerializer.(OrcSerializer.scala:36)
at 
org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.(OrcOutputWriter.scala:36)
at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply
(FileFormatWriter.scala:278)
{code}
It seems better to verify the write schema on the driver side for users, as the 
CSV format already does;
https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65
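
A driver-side check could be a simple recursive walk over the write schema; a rough Scala sketch (names and the error wording are assumptions, not the actual patch):
{code}
import org.apache.spark.sql.types._

// Fail fast on the driver when the schema contains a type ORC cannot write.
def verifyOrcWriteSchema(schema: StructType): Unit = {
  def verify(dt: DataType): Unit = dt match {
    case NullType =>
      throw new UnsupportedOperationException(
        "ORC data source does not support null data type.")
    case st: StructType => st.fields.foreach(f => verify(f.dataType))
    case ArrayType(elementType, _) => verify(elementType)
    case MapType(keyType, valueType, _) => verify(keyType); verify(valueType)
    case _ => // atomic types are fine
  }
  schema.fields.foreach(f => verify(f.dataType))
}
{code}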






[jira] [Commented] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466700#comment-16466700
 ] 

Hyukjin Kwon commented on SPARK-24200:
--

If it's a question for now, I would suggest asking it on the mailing list first 
before filing a JIRA issue.

> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/* /* ";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? How to handle this scenario ?
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ? I know 
> you might have thought about this, but i am wondering why this has not been 
> implemented ?






[jira] [Resolved] (SPARK-24199) Structured Streaming

2018-05-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24199.
--
Resolution: Invalid

Questions should go to the mailing list rather than being filed as an issue here. 
I believe you would get a better answer there.

> Structured Streaming
> 
>
> Key: SPARK-24199
> URL: https://issues.apache.org/jira/browse/SPARK-24199
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: shuke
>Priority: Major
>
> h3. Hey,when i use the {color:#FF}where {color}operate to filter data 
> while using Structured Streaming
> I got some problem about it 
>  
> 
> get_json_object(col("value"),"$.type").cast(DataTypes.StringType).alias("type"),
>  
> get_json_object(col("value"),"$.saleData.type").cast(DataTypes.StringType).alias("saleDataType",get_json_object(col("value"),
>  "$.uid").cast(DataTypes.IntegerType).alias(ROI_SHOP_KEY),
>  from_unixtime(get_json_object(col("value"), 
> "$.time").cast(DataTypes.IntegerType),"-MM-dd").alias("event_time"),
>  //get_json_object(xjson, '$.balanceData.money')/100 as money,
>  (get_json_object(col("value"), 
> "$.balanceData.money").cast(DataTypes.DoubleType) / 
> 100).alias(BUSINESS_AMOUNT),
>  get_json_object(col("value"), 
> "$.shopData.id").cast(DataTypes.LongType).alias(DARK_ID),
>  get_json_object(col("value"), 
> "$.balanceData.out_trade_no").cast(DataTypes.StringType).alias(OUT_TRADE_NO),
>  get_json_object(col("value"), 
> "$.balanceData.type").cast(DataTypes.StringType).alias(BalanceData_type)
> )
> {color:#FF}.where("(type = 'residue' and saleDataType = '4' and shop_id 
> in ('8610022','5382783')) or type = 'promotion' "{color} )
>  .select(col("*"))
>  .writeStream
>  .trigger(Trigger.ProcessingTime(5000))
>  .outputMode("Update")
>  .format("console")
>  .start()
> =
> i find that its not while using this way to filter data 
> anyone can help 
> best wishes
>  
>  
>  
> h1.  






[jira] [Commented] (SPARK-24172) we should not apply operator pushdown to data source v2 many times

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466638#comment-16466638
 ] 

Apache Spark commented on SPARK-24172:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21262

> we should not apply operator pushdown to data source v2 many times
> --
>
> Key: SPARK-24172
> URL: https://issues.apache.org/jira/browse/SPARK-24172
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-05-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20114.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20973
[https://github.com/apache/spark/pull/20973]

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Creating this jira to track the feature parity for PrefixSpan and sequential 
> pattern mining in Spark ml with DataFrame API. 
> First list a few design issues to be discussed, then subtasks like Scala, 
> Python and R API will be created.
> # Wrapping the MLlib PrefixSpan and provide a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which is not good to be used directly for predicting on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add 
> new column to the input DataSet. The PrefixSpanModel is only used to provide 
> access for frequent sequential patterns.
>  #*  Adding the feature to extract sequential rules from sequential 
> patterns. Then use the sequential rules in the transform as FPGrowthModel.  
> The rules extracted are of the form X–> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Different from association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is 
> unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from the users to see which kind of Sequential rules 
> are more practical. 






[jira] [Resolved] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning

2018-05-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22885.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20261
[https://github.com/apache/spark/pull/20261]

> ML test for StructuredStreaming: spark.ml.tuning
> 
>
> Key: SPARK-22885
> URL: https://issues.apache.org/jira/browse/SPARK-22885
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843
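
For anyone picking up similar sub-tasks, the general shape of such a test with public(ish) APIs is roughly the following Scala sketch; the real suites use shared test helpers, so treat this only as an outline (Binarizer chosen arbitrarily):
{code}
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.execution.streaming.MemoryStream
import spark.implicits._

implicit val sqlCtx = spark.sqlContext

// Feed rows through a MemoryStream and check the transformer on a streaming DF.
val input = MemoryStream[Double]
val streamingDF = input.toDF().toDF("feature")

val binarizer = new Binarizer()
  .setInputCol("feature").setOutputCol("binarized").setThreshold(0.5)

val query = binarizer.transform(streamingDF)
  .writeStream.format("memory").queryName("binarized_out").start()

input.addData(0.1, 0.9)
query.processAllAvailable()
spark.table("binarized_out").show()  // expect 0.0 and 1.0
{code}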






[jira] [Updated] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-23291:

Fix Version/s: 2.3.1

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Defect Description :
> -
> For example ,an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a a new column named "col2" with the value "12" 
> which is inside the string ."12" can be extracted with "starting position" as 
> "6" and "Ending position" as "7"
>  (the starting position of the first character is considered as "1" )
> But,the current code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,7,8)))
> Observe that the first argument in the "substr" API , which indicates the 
> 'starting position', is mentioned as "7" 
>  Also, observe that the second argument in the "substr" API , which indicates 
> the 'ending position', is mentioned as "8"
> i.e the number that should be mentioned to indicate the position should be 
> the "actual position + 1"
> Expected behavior :
> 
> The code that needs to be written is :
>  
>  df <- withColumn(df,"col2",substr(df$col1,6,7)))
> Note :
> ---
>  This defect is observed with only when the starting position is greater than 
> 1.






[jira] [Resolved] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark

2018-05-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15750.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 13493
[https://github.com/apache/spark/pull/13493]

> Constructing FPGrowth fails when no numPartitions specified in pyspark
> --
>
> Key: SPARK-15750
> URL: https://issues.apache.org/jira/browse/SPARK-15750
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
> Fix For: 2.4.0
>
>
> {code}
> >>> model1 = FPGrowth.train(rdd, 0.6)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, 
> in train
> model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), 
> int(numPartitions))
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 130, in callMLlibFunc
> return callJavaFunc(sc, api, *args)
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 123, in callJavaFunc
> return _java2py(sc, func(*args))
>   File 
> "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, 
> in deco
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of 
> partitions must be positive but got -1'
> {code}
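For anyone hitting this before a fix lands, a minimal workaround sketch against the underlying Scala MLlib API that the PySpark wrapper delegates to, with the partition count set explicitly (assumes an existing SparkContext {{sc}}; the transaction data is made up for illustration):

{code:scala}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Made-up transaction data, only for illustration.
val transactions: RDD[Array[String]] = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("a", "c")))

val fpg = new FPGrowth()
  .setMinSupport(0.6)
  // Passing an explicit, positive partition count avoids the
  // "Number of partitions must be positive" requirement failure.
  .setNumPartitions(transactions.getNumPartitions)

val model = fpg.run(transactions)
model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
{code}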



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2018-05-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466513#comment-16466513
 ] 

Joseph K. Bradley commented on SPARK-24152:
---

Thank you all!

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> The PR builder and master branch tests fail with the following SparkR error 
> for an unknown reason. The error message is shown below.
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started to merge PRs while ignoring 
> this **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24203) Make executor's bindAddress configurable

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466474#comment-16466474
 ] 

Apache Spark commented on SPARK-24203:
--

User 'lukmajercak' has created a pull request for this issue:
https://github.com/apache/spark/pull/21261

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-24203
> URL: https://issues.apache.org/jira/browse/SPARK-24203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lukas Majercak
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24203) Make executor's bindAddress configurable

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24203:


Assignee: (was: Apache Spark)

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-24203
> URL: https://issues.apache.org/jira/browse/SPARK-24203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lukas Majercak
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24203) Make executor's bindAddress configurable

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24203:


Assignee: Apache Spark

> Make executor's bindAddress configurable
> 
>
> Key: SPARK-24203
> URL: https://issues.apache.org/jira/browse/SPARK-24203
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: Lukas Majercak
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24203) Make executor's bindAddress configurable

2018-05-07 Thread Lukas Majercak (JIRA)
Lukas Majercak created SPARK-24203:
--

 Summary: Make executor's bindAddress configurable
 Key: SPARK-24203
 URL: https://issues.apache.org/jira/browse/SPARK-24203
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Lukas Majercak






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24202) Separate SQLContext dependencies from SparkSession.implicits

2018-05-07 Thread Gerard Maas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Maas updated SPARK-24202:

Description: 
The current implementation of the implicits in SparkSession passes the current 
active SQLContext to the SQLImplicits class. This implies that all usage of 
these (extremely helpful) implicits require the prior creation of a Spark 
Session instance.

Usage is typically done as follows:

 
{code:java}
val sparkSession = SparkSession.builder()
build()
import sparkSession.implicits._
{code}
 

This is OK in user code, but it burdens the creation of library code that uses 
Spark, where  static imports for _Encoder_ support is required.

A simple example would be:

 
{code:java}
class SparkTransformation[In: Encoder, Out: Encoder] {
    def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
 

Attempting to compile such code would result in the following exception:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
String, etc) and Product types (case classes) are supported by importing 
spark.implicits._  Support for serializing other types will be added in future 
releases.

The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two 
utilities to transform _RDD_ and local collections into a _Dataset_.

These are 2 methods of the 46 implicit conversions offered by this class.

The request is to separate the two implicit methods that depend on the instance 
creation into a separate class:
{code:java}
SQLImplicits#214-229
/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
 DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] 
= {
 DatasetHolder(_sqlContext.createDataset(s))
}{code}
By separating the static methods from these two methods that depend on 
_sqlContext_ into  separate classes, we could provide static imports for all 
the other functionality and only require the instance-bound  implicits for the 
RDD and collection support (Which is an uncommon use case these days)

As this is potentially breaking the current interface, this might be a 
candidate for Spark 3.0. Although there's nothing stopping us from creating a 
separate hierarchy for the static encoders already. 

  was:
The current implementation of the implicits in SparkSession passes the current 
active SQLContext to the SQLImplicits class. This implies that all usage of 
these (extremely helpful) implicits require the prior creation of a Spark 
Session instance.

Usage is typically done as follows:

 
{code:java}
val sparkSession = SessionBuilderbuild()
import sparkSession.implicits._
{code}
 

This is OK in user code, but it burdens the creation of library code that uses 
Spark, where  static imports for _Encoder_ support is required.

A simple example would be:

 
{code:java}
class SparkTransformation[In: Encoder, Out: Encoder] {
    def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
 

Attempting to compile such code would result in the following exception:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
String, etc) and Product types (case classes) are supported by importing 
spark.implicits._  Support for serializing other types will be added in future 
releases.

The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two 
utilities to transform _RDD_ and local collections into a _Dataset_.

These are 2 methods of the 46 implicit conversions offered by this class.

The request is to separate the two implicit methods that depend on the instance 
creation into a separate class:
{code:java}
SQLImplicits#214-229
/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
 DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] 
= {
 DatasetHolder(_sqlContext.createDataset(s))
}{code}
By separating the static methods from these two methods that depend on 
_sqlContext_ into  separate classes, we could provide static imports for all 
the other functionality and only require the instance-bound  implicits for the 
RDD and collection support (Which is an uncommon use case these days)

As this is potentially breaking the current interface, this might be a 
candidate for Spark 3.0. Although there's nothing stopping us from creating a 
separate hierarchy for the static encoders already. 


> Separate SQLContext dependencies from SparkSession.implicits
> 
>
> Key: SPARK-24202
> URL: https://issues.apache.o

[jira] [Updated] (SPARK-24202) Separate SQLContext dependencies from SparkSession.implicits

2018-05-07 Thread Gerard Maas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerard Maas updated SPARK-24202:

Description: 
The current implementation of the implicits in SparkSession passes the current 
active SQLContext to the SQLImplicits class. This implies that any usage of 
these (extremely helpful) implicits requires the prior creation of a 
SparkSession instance.

Usage is typically done as follows:

 
{code:java}
val sparkSession = SparkSession.builder()
  .getOrCreate()
import sparkSession.implicits._
{code}
 

This is OK in user code, but it burdens the creation of library code that uses 
Spark, where static imports for _Encoder_ support are required.

A simple example would be:

 
{code:java}
class SparkTransformation[In: Encoder, Out: Encoder] {
    def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
 

Attempting to compile such code would result in the following exception:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
String, etc) and Product types (case classes) are supported by importing 
spark.implicits._  Support for serializing other types will be added in future 
releases.

The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two 
utilities to transform _RDD_ and local collections into a _Dataset_.

These are 2 methods of the 46 implicit conversions offered by this class.

The request is to separate the two implicit methods that depend on the instance 
creation into a separate class:
{code:java}
SQLImplicits#214-229
/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
 DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] 
= {
 DatasetHolder(_sqlContext.createDataset(s))
}{code}
By separating the static methods from these two methods that depend on 
_sqlContext_ into separate classes, we could provide static imports for all 
the other functionality and only require the instance-bound implicits for the 
RDD and collection support (which is an uncommon use case these days).

As this is potentially breaking the current interface, this might be a 
candidate for Spark 3.0. Although there's nothing stopping us from creating a 
separate hierarchy for the static encoders already. 

  was:
The current implementation of the implicits in SparkSession passes the current 
active SQLContext to the SQLImplicits class. This implies that all usage of 
these (extremely helpful) implicits require the prior creation of a Spark 
Session instance.

Usage is typically done as follows:

 
{code:java}
val sparkSession = SparkSession.builder()
build()
import sparkSession.implicits._
{code}
 

This is OK in user code, but it burdens the creation of library code that uses 
Spark, where  static imports for _Encoder_ support is required.

A simple example would be:

 
{code:java}
class SparkTransformation[In: Encoder, Out: Encoder] {
    def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
 

Attempting to compile such code would result in the following exception:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
String, etc) and Product types (case classes) are supported by importing 
spark.implicits._  Support for serializing other types will be added in future 
releases.

The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two 
utilities to transform _RDD_ and local collections into a _Dataset_.

These are 2 methods of the 46 implicit conversions offered by this class.

The request is to separate the two implicit methods that depend on the instance 
creation into a separate class:
{code:java}
SQLImplicits#214-229
/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
 DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] 
= {
 DatasetHolder(_sqlContext.createDataset(s))
}{code}
By separating the static methods from these two methods that depend on 
_sqlContext_ into  separate classes, we could provide static imports for all 
the other functionality and only require the instance-bound  implicits for the 
RDD and collection support (Which is an uncommon use case these days)

As this is potentially breaking the current interface, this might be a 
candidate for Spark 3.0. Although there's nothing stopping us from creating a 
separate hierarchy for the static encoders already. 


> Separate SQLContext dependencies from SparkSession.implicits
> 
>
> Key: SPARK-24202
> URL: https://

[jira] [Created] (SPARK-24202) Separate SQLContext dependencies from SparkSession.implicits

2018-05-07 Thread Gerard Maas (JIRA)
Gerard Maas created SPARK-24202:
---

 Summary: Separate SQLContext dependencies from 
SparkSession.implicits
 Key: SPARK-24202
 URL: https://issues.apache.org/jira/browse/SPARK-24202
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Gerard Maas


The current implementation of the implicits in SparkSession passes the current 
active SQLContext to the SQLImplicits class. This implies that all usage of 
these (extremely helpful) implicits require the prior creation of a Spark 
Session instance.

Usage is typically done as follows:

 
{code:java}
val sparkSession = SessionBuilderbuild()
import sparkSession.implicits._
{code}
 

This is OK in user code, but it burdens the creation of library code that uses 
Spark, where  static imports for _Encoder_ support is required.

A simple example would be:

 
{code:java}
class SparkTransformation[In: Encoder, Out: Encoder] {
    def transform(ds: Dataset[In]): Dataset[Out]
}
{code}
 

Attempting to compile such code would result in the following exception:

Unable to find encoder for type stored in a Dataset.  Primitive types (Int, 
String, etc) and Product types (case classes) are supported by importing 
spark.implicits._  Support for serializing other types will be added in future 
releases.

The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two 
utilities to transform _RDD_ and local collections into a _Dataset_.

These are 2 methods of the 46 implicit conversions offered by this class.

The request is to separate the two implicit methods that depend on the instance 
creation into a separate class:
{code:java}
SQLImplicits#214-229
/**
 * Creates a [[Dataset]] from an RDD.
 *
 * @since 1.6.0
 */
implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = {
 DatasetHolder(_sqlContext.createDataset(rdd))
}

/**
 * Creates a [[Dataset]] from a local Seq.
 * @since 1.6.0
 */
implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] 
= {
 DatasetHolder(_sqlContext.createDataset(s))
}{code}
By separating the static methods from these two methods that depend on 
_sqlContext_ into  separate classes, we could provide static imports for all 
the other functionality and only require the instance-bound  implicits for the 
RDD and collection support (Which is an uncommon use case these days)

As this is potentially breaking the current interface, this might be a 
candidate for Spark 3.0. Although there's nothing stopping us from creating a 
separate hierarchy for the static encoders already. 
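To make the request concrete, a rough sketch of the kind of split being asked for, with only the instance-independent encoders exposed for static import. The object and method names below are illustrative assumptions, not existing Spark API:

{code:scala}
import scala.reflect.runtime.universe.TypeTag
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Hypothetical object: exposes only encoders that need no SQLContext,
// so library code can import them without building a SparkSession first.
object StaticEncoders {
  implicit def newIntEncoder: Encoder[Int] = Encoders.scalaInt
  implicit def newStringEncoder: Encoder[String] = Encoders.STRING
  implicit def newProductEncoder[T <: Product : TypeTag]: Encoder[T] =
    Encoders.product[T]
}

// Library code then compiles with a static import, no session required:
import StaticEncoders._

abstract class SparkTransformation[In: Encoder, Out: Encoder] {
  def transform(ds: Dataset[In]): Dataset[Out]
}
{code}

The RDD- and Seq-to-Dataset conversions would stay in the session-bound implicits, since they genuinely need a SQLContext to create a Dataset.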



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-07 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18165:
-
Component/s: (was: DStreams)
 Structured Streaming

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming

2018-05-07 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466410#comment-16466410
 ] 

Michael Armbrust commented on SPARK-18165:
--

This is great!  I'm glad there are more connectors for Structured Streaming!

A few high-level thoughts:
 - The current Source/Sink APIs are internal/unstable.  We are working on 
building public/stable APIs as part of DataSourceV2. It would be great to get 
feedback on those APIs if this is ported to them.
 - In general, as the Spark project scales, we are trying to move more of the 
connectors out of the core project.  I'd suggest looking at contributing this 
to Apache Bahir and/or Spark Packages.

> Kinesis support in Structured Streaming
> ---
>
> Key: SPARK-18165
> URL: https://issues.apache.org/jira/browse/SPARK-18165
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Lauren Moos
>Priority: Major
>
> Implement Kinesis based sources and sinks for Structured Streaming



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24201) IllegalArgumentException originating from ClosureCleaner in Java 9+

2018-05-07 Thread Grant Henke (JIRA)
Grant Henke created SPARK-24201:
---

 Summary: IllegalArgumentException originating from ClosureCleaner 
in Java 9+ 
 Key: SPARK-24201
 URL: https://issues.apache.org/jira/browse/SPARK-24201
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
 Environment: java version "9.0.4"

scala version "2.11.12"
Reporter: Grant Henke


Apache Kudu's kudu-spark tests are failing on Java 9. 

I assume Java 9 is supported and this is an unexpected bug given the docs say 
"Spark runs on Java 8+" [here|https://spark.apache.org/docs/2.3.0/].

The stacktrace seen is below:
{code}
java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.(Unknown Source)
at org.apache.xbean.asm5.ClassReader.(Unknown Source)
at org.apache.xbean.asm5.ClassReader.(Unknown Source)
at 
org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at 
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:392)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2292)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
at 
org.apache.kudu.spark.kudu.KuduRDDTest$$anonfun$1.apply(KuduRDDTest.scala:30)
at 
org.apache.kudu.spark.kudu.KuduRDDTest$$anonfun$1.apply(KuduRDDTest.scala:27)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196)
at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560)
at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
at 
org.apache.kudu.spark.kudu.KuduRDDTest.org$scalatest$BeforeAndAfter$$super$runTest(KuduRDDTest.scala:25)
at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203)
at org.apache.kudu.spark.kudu.KuduRDDTest.runTest(KuduRDDTest.scala:25)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine$$anonfun$traverseSu

[jira] [Commented] (SPARK-24176) The hdfs file path with wildcard can not be identified when loading data

2018-05-07 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466263#comment-16466263
 ] 

kevin yu commented on SPARK-24176:
--

I am looking at this one and will provide a proposed fix soon. 

> The hdfs file path with wildcard can not be identified when loading data
> 
>
> Key: SPARK-24176
> URL: https://issues.apache.org/jira/browse/SPARK-24176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE11
> Spark Version:2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> # Launch spark-sql
>  # create table wild1 (time timestamp, name string, isright boolean, 
> datetoday date, num binary, height double, score float, decimaler 
> decimal(10,0), id tinyint, age int, license bigint, length smallint) row 
> format delimited fields terminated by ',' stored as textfile;
>  # loaded data into the table as below; some cases failed and the behavior is not consistent
>  # load data inpath '/user/testdemo1/user1/?ype* ' into table wild1; - Success
> load data inpath '/user/testdemo1/user1/t??eddata60.txt' into table wild1; - 
> *Failed*
> load data inpath '/user/testdemo1/user1/?ypeddata60.txt' into table wild1; - 
> Success
> Exception as below
> > load data inpath '/user/testdemo1/user1/t??eddata61.txt' into table wild1;
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_database: one
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_database: one
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1
> 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com 
> ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1
> *Error in query: LOAD DATA input path does not exist: 
> /user/testdemo1/user1/t??eddata61.txt;*
> spark-sql>
> The behavior is not consistent. This needs to be fixed for all combinations of 
> wildcard characters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466203#comment-16466203
 ] 

Apache Spark commented on SPARK-23529:
--

User 'andrusha' has created a pull request for this issue:
https://github.com/apache/spark/pull/21260

> Specify hostpath volume and mount the volume in Spark driver and executor 
> pods in Kubernetes
> 
>
> Key: SPARK-23529
> URL: https://issues.apache.org/jira/browse/SPARK-23529
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Suman Somasundar
>Assignee: Anirudh Ramanathan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24112) Add `spark.sql.hive.convertMetastoreTableProperty` for backward compatiblility

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466189#comment-16466189
 ] 

Apache Spark commented on SPARK-24112:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21259

> Add `spark.sql.hive.convertMetastoreTableProperty` for backward compatiblility
> --
>
> Key: SPARK-24112
> URL: https://issues.apache.org/jira/browse/SPARK-24112
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims not to surprise previous Parquet Hive table users with 
> behavior changes. They had Hive Parquet tables, and all of them have been 
> converted by default, without table properties, since Spark 2.0.
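If the flag lands as proposed, enabling the legacy behavior would presumably look like the sketch below; the configuration key is taken from the issue title and should be treated as an assumption until a patch is actually merged:

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical usage: the key name comes from this issue's title and may
// not exist in any released Spark version.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreTableProperty", "true")
  .getOrCreate()
{code}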



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2018-05-07 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466099#comment-16466099
 ] 

Paul Wu commented on SPARK-22371:
-

Got the same problem with 2.3, and the program also stalled:

{{ Uncaught exception in thread heartbeat-receiver-event-loop-thread}}
{{java.lang.IllegalStateException: Attempted to access garbage collected 
accumulator 8825}}
{{    at 
org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)}}
{{    at 
org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)}}
{{    at scala.Option.map(Option.scala:146)}}
{{    at 
org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)}}
{{    at 
org.apache.spark.util.AccumulatorV2$$anonfun$name$1.apply(AccumulatorV2.scala:87)}}
{{    at 
org.apache.spark.util.AccumulatorV2$$anonfun$name$1.apply(AccumulatorV2.scala:87)}}
{{    at scala.Option.orElse(Option.scala:289)}}
{{    at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:87)}}
{{    at 
org.apache.spark.util.AccumulatorV2.toInfo(AccumulatorV2.scala:108)}}

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
>Priority: Major
> Attachments: Helper.scala, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt, sampledata
>
>
> Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler 
> thread is stopped because of *Attempted to access garbage collected 
> accumulator 5605982*.
> From our investigation it looks like the accumulator is cleaned by GC first, and 
> the same accumulator is then used for merging the results from the executor on the 
> task completion event.
> Because java.lang.IllegalAccessError is a LinkageError, which is treated as a 
> fatal error, the dag-scheduler event loop terminates with the exception below.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of driver as well 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23161) Add missing APIs to Python GBTClassifier

2018-05-07 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23161:
-
Description: 
GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved to 
{{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.

GBTClassificationModel is missing {{numClasses}}. It should inherit from 
{{JavaClassificationModel}} instead of prediction model which will give it this 
param.

  was:
GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved 
{{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.

GBTClassificationModel is missing {{numClasses}}. It should inherit from 
{{JavaClassificationModel}} instead of prediction model which will give it this 
param.


> Add missing APIs to Python GBTClassifier
> 
>
> Key: SPARK-23161
> URL: https://issues.apache.org/jira/browse/SPARK-23161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Minor
>  Labels: starter
>
> GBTClassifier is missing \{{featureSubsetStrategy}}.  This should be moved to 
> {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs.
> GBTClassificationModel is missing {{numClasses}}. It should inherit from 
> {{JavaClassificationModel}} instead of prediction model which will give it 
> this param.
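For context, a short sketch of the Scala-side usage that the Python API is expected to mirror, assuming {{featureSubsetStrategy}} is already exposed on the Scala GBTClassifier in the targeted Spark version:

{code:scala}
import org.apache.spark.ml.classification.GBTClassifier

// Scala exposes featureSubsetStrategy via TreeEnsembleParams; the request
// here is to surface the same param (plus numClasses on the fitted model)
// in the PySpark wrapper.
val gbt = new GBTClassifier()
  .setMaxIter(10)
  .setFeatureSubsetStrategy("sqrt")
{code}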



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-05-07 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466088#comment-16466088
 ] 

Darek edited comment on SPARK-18673 at 5/7/18 4:09 PM:
---

[PR20819|https://github.com/apache/spark/pull/20819] for Spark => Hive 2.x was 
done but not merged and deleted.


was (Author: bidek):
PR20819 for Spark => Hive 2.x was done but not merged and deleted.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-05-07 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466088#comment-16466088
 ] 

Darek commented on SPARK-18673:
---

PR20819 for Spark => Hive 2.x was done but not merged and deleted.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23458) Flaky test: OrcQuerySuite

2018-05-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466078#comment-16466078
 ] 

Xiao Li commented on SPARK-23458:
-

Yeah. [~dongjoon] Please investigate why they still fail. 

After your fix, I still found that HiveExternalCatalogVersionsSuite never passes in 
this test branch. Do you know the reason? 
https://spark-tests.appspot.com/jobs/spark-master-test-sbt-hadoop-2.7



>  Flaky test: OrcQuerySuite
> --
>
> Key: SPARK-23458
> URL: https://issues.apache.org/jira/browse/SPARK-23458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
> Environment: AMPLab Jenkins
>Reporter: Marco Gaido
>Priority: Major
>
> Sometimes we have UT failures with the following stacktrace:
> {code:java}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 
> 10.01396221801 seconds. Last failure message: There are 1 possibly leaked 
> file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.afterEach(OrcQuerySuite.scala:583)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.runTest(OrcQuerySuite.scala:583)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are 
> 1 possibly leaked file streams.
>   at 
> org.apach

[jira] [Resolved] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table

2018-05-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24170.
-
Resolution: Not A Bug

> [Spark SQL] json file format is not dropped after dropping table
> 
>
> Key: SPARK-24170
> URL: https://issues.apache.org/jira/browse/SPARK-24170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
>  # Launch spark-sql --master yarn
>  #  create table json(name STRING, age int, gender string, id INT) using 
> org.apache.spark.sql.json options(path "hdfs:///user/testdemo/");
>  # Execute the below SQL queries 
> INSERT into json
> SELECT 'Shaan',21,'Male',1
> UNION ALL
> SELECT 'Xing',20,'Female',11
> UNION ALL
> SELECT 'Mile',4,'Female',20
> UNION ALL
> SELECT 'Malan',10,'Male',9;
> The following 4 json part files were created: 
> BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls 
> /user/testdemo
> Found 14 items
> -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS
> -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv
> -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 
> /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 
> /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
> -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 
> /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 
> /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
> -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 
> /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:44 
> /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 
> /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 
> /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
>  
> Issue is:
> Now executed below drop command
> spark-sql> drop table json;
>  
> The table was dropped successfully, but the json files are still present in the 
> path /user/testdemo



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table

2018-05-07 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466071#comment-16466071
 ] 

Xiao Li commented on SPARK-24170:
-

They are external tables when you specify the path in CREATE TABLE. Thus, the 
files will not be dropped. 
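A short sketch of the distinction, assuming an active SparkSession {{spark}} (table names and paths are made up):

{code:scala}
// Managed table: no explicit path, data lives under the warehouse directory
// and is deleted together with the metadata on DROP TABLE.
spark.sql("CREATE TABLE managed_json (name STRING, age INT) USING json")

// External table: an explicit path is supplied, so DROP TABLE removes only
// the metadata and leaves the files at that location untouched.
spark.sql(
  """CREATE TABLE external_json (name STRING, age INT)
    |USING json OPTIONS (path 'hdfs:///user/testdemo/')""".stripMargin)

spark.sql("DROP TABLE managed_json")   // data files are removed
spark.sql("DROP TABLE external_json")  // data files remain in /user/testdemo/
{code}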

> [Spark SQL] json file format is not dropped after dropping table
> 
>
> Key: SPARK-24170
> URL: https://issues.apache.org/jira/browse/SPARK-24170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> Steps:
>  # Launch spark-sql --master yarn
>  #  create table json(name STRING, age int, gender string, id INT) using 
> org.apache.spark.sql.json options(path "hdfs:///user/testdemo/");
>  # Execute the below SQL queries 
> INSERT into json
> SELECT 'Shaan',21,'Male',1
> UNION ALL
> SELECT 'Xing',20,'Female',11
> UNION ALL
> SELECT 'Mile',4,'Female',20
> UNION ALL
> SELECT 'Malan',10,'Male',9;
> The following 4 json part files were created: 
> BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls 
> /user/testdemo
> Found 14 items
> -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS
> -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv
> -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 
> /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 
> /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
> -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 
> /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 
> /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
> -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 
> /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:44 
> /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 
> /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json
> -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 
> /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json
>  
> Issue is:
> Now executed below drop command
> spark-sql> drop table json;
>  
> The table was dropped successfully, but the json files are still present in the 
> path /user/testdemo



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions

2018-05-07 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-24043.
---
   Resolution: Fixed
 Assignee: Bruce Robbins
Fix Version/s: 2.4.0

> InterpretedPredicate.eval fails if expression tree contains Nondeterministic 
> expressions
> 
>
> Key: SPARK-24043
> URL: https://issues.apache.org/jira/browse/SPARK-24043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Minor
> Fix For: 2.4.0
>
>
> When whole-stage codegen and predicate codegen both fail, FilterExec falls 
> back to using InterpretedPredicate. If the predicate's expression contains 
> any non-deterministic expressions, the evaluation throws an error:
> {noformat}
> scala> val df = Seq((1)).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: int]
> scala> df.filter('a > 0).show // this works fine
> 2018-04-21 20:39:26 WARN  FilterExec:66 - Codegen disabled for this 
> expression:
>  (value#1 > 0)
> +---+
> |  a|
> +---+
> |  1|
> +---+
> scala> df.filter('a > rand(7)).show // this will throw an error
> 2018-04-21 20:39:40 WARN  FilterExec:66 - Codegen disabled for this 
> expression:
>  (cast(value#1 as double) > rand(7))
> 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
> expression org.apache.spark.sql.catalyst.expressions.Rand should be 
> initialized before eval.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326)
>   at 
> org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34)
> {noformat}
> This is because no code initializes the Nondeterministic expressions before 
> eval is called on them.
> This is a low impact issue, since it would require both whole-stage codegen 
> and predicate codegen to fail before FilterExec would fall back to using 
> InterpretedPredicate.
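The missing step amounts to walking the expression tree and initializing any Nondeterministic expression before evaluation. A minimal sketch using Catalyst internals is shown below; it relies on internal APIs ({{Nondeterministic.initialize}}, {{TreeNode.foreach}}) whose exact shape should be verified against the Spark version in use:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, Nondeterministic}

// Initialize every nondeterministic subexpression (e.g. rand()) for the
// given partition before eval is called on the interpreted predicate.
def initializeForEval(expr: Expression, partitionIndex: Int): Unit =
  expr.foreach {
    case n: Nondeterministic => n.initialize(partitionIndex)
    case _ =>
  }
{code}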



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-05-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465950#comment-16465950
 ] 

Steve Loughran commented on SPARK-18673:


Good Q, [~Bidek]. That SPARK-23807 POM fixes up the build, but without the 
mutant org.spark-project.hive JAR fixed up to not throw an exception whenever 
Hadoop version == 3, you can't run the code, including tests. I do have such a 
fixed-up JAR; what I'm proposing here is cherry-picking in the least amount of 
change needed there.

This work is part of the overall "Spark on Hadoop 3.x" effort. 

Oh and yes, I'm targeting 3.1+ too, though the key issue here is the "3", not 
the suffix.

What would supersede this is Spark => Hive 2.x. This is an interim artifact 
until that is done by someone.

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-05-07 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465934#comment-16465934
 ] 

Darek edited comment on SPARK-18673 at 5/7/18 1:59 PM:
---

Based on the recent PR, the community is moving toward Hadoop 3.1, why do you 
even bother with this ticket? Check the recent PR like SPARK-23807


was (Author: bidek):
Based on the recent PR, the community is moving toward Hadoop 3.1, why do you 
event bother with this ticket? Check the recent PR like SPARK-23807

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-05-07 Thread Darek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465934#comment-16465934
 ] 

Darek commented on SPARK-18673:
---

Based on the recent PR, the community is moving toward Hadoop 3.1, why do you 
event bother with this ticket? Check the recent PR like SPARK-23807

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2018-05-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465925#comment-16465925
 ] 

Steve Loughran commented on SPARK-18673:


Josh Rosen added some changes, particularly:

* 8f5918ad3dc7f3aa84ea04f3ef7761493c009d22 Update version to 1.2.1.spark2
* 10d91dca6c602a9f6c6fa428f341f135054c2c16 Re-shade Kryo
* 721aa7e4904a8a6069afe815af7cbf5ed3bde936 Change groupId to 
org.spark-project.hive; keep relocated Kryo under Hive namespace.
* aa9f5557b60facfe862f1f6c0a60537da8e88076 Put shaded protobuf classes under 
Hive package namespace.


Int-HDP patches/changes that I also plan to pull in, on the basis that (a) they 
were clearly deemed important and (b) they apparently work:
* HIVE-11102  ReaderImpl: getColumnIndicesFromNames does not work for some cases
* allow the repo for publishing artifacts to be reconfigured from the normal 
Sonatype one
* updating the group assembly plugin to use the same package names as from 
721aa7e4 


> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2018-05-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465917#comment-16465917
 ] 

Steve Loughran commented on SPARK-23977:


It will need the hadoop-aws module and its dependencies, as that is where the core 
code is. This patch just does the binding to the InsertIntoHadoopFsRelation path 
(move to the Hadoop MRv2 FileOutputFormat and expect the new superclass, 
PathOutputCommitter, rather than always a FileOutputCommitter, and for Parquet, 
something similar with a ParquetOutputCommitter).

It's only in Hadoop 3.1, though you can backport to branch-2, especially if you 
are prepared to bump up the minimum Java version to 8 in that branch.

It should work on k8s, given it works standalone. All it needs is an endpoint 
supporting the multipart upload operation of S3, which includes some non-AWS 
object stores.

Look at the HADOOP-13786 work and the paper [a zero rename 
committer|https://github.com/steveloughran/zero-rename-committer/releases/download/tag_draft_003/a_zero_rename_committer.pdf].

And there are some integration tests downstream in 
https://github.com/hortonworks-spark/cloud-integration . I can help set you up 
to run those, if you email me directly. Essentially, you need to choose which 
stores to test against from s3, openstack and azure, and configure them.

Note that of the two variant committers, "staging" and "magic", the magic one 
needs a consistent S3 endpoint, which you only get on AWS S3 with an external 
service, usually DynamoDB based (S3mper, EMR consistent S3, S3Guard). The 
staging one needs enough local HDD to buffer the output of all active tasks, 
but doesn't need that consistency for its own query. You will need a plan for 
chaining together work though, which is inevitably one of "consistency layer" 
or "wait long enough between writer and reader that you expect the metadata to 
be consistent".

Finally, if you are using Spark to write directly to S3 today, without any 
consistency layer, then your commit algorithm had better not be mimicking 
directory rename by list + copy + delete. You need this code for safe as well 
as performant committing of work to S3.
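
For anyone wanting to try this out once the binding lands, a minimal configuration 
sketch is below. It assumes a Spark build that includes the hadoop-cloud module from 
SPARK-23807 plus Hadoop 3.1 and hadoop-aws on the classpath; the binding class names 
and option values follow the ones discussed in the patch and should be treated as 
assumptions rather than a confirmed API, and "some-bucket" is a placeholder.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: class names and options are assumptions taken from the proposed patch.
val spark = SparkSession.builder()
  .appName("s3a-committer-binding-sketch")
  .config("spark.hadoop.fs.s3a.committer.name", "directory") // or "partitioned" / "magic"
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// Any structured write now goes through the PathOutputCommitter chosen above.
spark.range(1000).write.mode("overwrite").parquet("s3a://some-bucket/tmp/committer-demo")
{code}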



> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and Spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, and a failure of a task at any stage, including partway through 
> task commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24200:
--
Description: 
String folder = "/Users/test/data/* /* ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ? I know you 
might have thought about this, but i am wondering why this has not been 
implemented ?

  was:
String folder = "/Users/test/data/* /* ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?


> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/* /* ";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? How to handle this scenario ?
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ? I know 
> you might have thought about this, but i am wondering why this has not been 
> implemented ?
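
A workaround that is often suggested for this kind of layout, shown below as a 
spark-shell sketch (where {{sc}} is predefined): turn on recursive input listing 
in the underlying Hadoop FileInputFormat so a single top-level path picks up files 
at any depth. The flag name is the standard Hadoop one; whether it is honoured by 
the TextInputFormat behind {{textFile}} should be verified on the version in use.

{code:scala}
// Workaround sketch, not part of the report: enable recursive directory listing
// for Hadoop input formats, then read the top-level folder directly.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

val lines = sc.textFile("/Users/test/data/", 1)
println(lines.count())
{code}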



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24200:
--
Description: 
String folder = "/Users/test/data/ */* ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?

  was:
String folder = "/Users/test/data/ ** /** ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?


> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/ */* ";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? How to handle this scenario ?
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24200:
--
Description: 
String folder = "/Users/test/data/* /* ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?

  was:
String folder = "/Users/test/data/ */* ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?


> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/* /* ";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? How to handle this scenario ?
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24200:
--
Description: 
String folder = "/Users/test/data/ ** /** ";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?

  was:
String folder = "/Users/test/data/*/*";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?


> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/ ** /** ";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? How to handle this scenario ?
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24200:
--
Description: 
String folder = "/Users/test/data/*/*";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?

  was:
String folder = "/Users/test/data/";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?


> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/*/*";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? How to handle this scenario ?
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kumar updated SPARK-24200:
--
Description: 
String folder = "/Users/test/data/";

sparkContext.textFile(folder, 1).toJavaRDD() 

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?

  was:
{{String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 
1).toJavaRDD() }}

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/*/*}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?


> Read subdirectories with out asterisks
> --
>
> Key: SPARK-24200
> URL: https://issues.apache.org/jira/browse/SPARK-24200
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: kumar
>Priority: Major
>
> String folder = "/Users/test/data/";
> sparkContext.textFile(folder, 1).toJavaRDD() 
> Is asterisks mandatory to read a folder -Yes, otherwise it does not read 
> files under subdirectories.
> What if I get a folder which is having more subdirectories than the number of 
> asterisks mentioned ? How to handle this scenario ?
> For example:
> 1) {{/Users/test/data/}} This would work ONLY if I get data as 
> /Users/test/data/folder1/file.txt
> 2)How to make this expression as *generic* ? It should still work if I get a 
> folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}
> My input folder structure is not same all the time.
> Is there anything exists in Spark to handle this kind of scenario ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24200) Read subdirectories with out asterisks

2018-05-07 Thread kumar (JIRA)
kumar created SPARK-24200:
-

 Summary: Read subdirectories with out asterisks
 Key: SPARK-24200
 URL: https://issues.apache.org/jira/browse/SPARK-24200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: kumar


{{String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 
1).toJavaRDD() }}

Is asterisks mandatory to read a folder -Yes, otherwise it does not read files 
under subdirectories.

What if I get a folder which is having more subdirectories than the number of 
asterisks mentioned ? How to handle this scenario ?

For example:

1) {{/Users/test/data/*/*}} This would work ONLY if I get data as 
/Users/test/data/folder1/file.txt

2)How to make this expression as *generic* ? It should still work if I get a 
folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}}

My input folder structure is not same all the time.

Is there anything exists in Spark to handle this kind of scenario ?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23933) High-order function: map(array, array) → map

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23933:


Assignee: (was: Apache Spark)

> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}
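
Until such a builtin exists in Spark, the requested semantics can be approximated 
with a small UDF. The spark-shell sketch below (where {{spark}} is predefined) only 
illustrates the behaviour; it is not the proposed implementation, and the UDF name 
is made up for the example.

{code:scala}
import org.apache.spark.sql.functions.{array, lit, udf}

// Zip two equal-length arrays into a map, mirroring Presto's map(array, array).
val mapFromArrays = udf((ks: Seq[Int], vs: Seq[Int]) => ks.zip(vs).toMap)

spark.range(1)
  .select(mapFromArrays(array(lit(1), lit(3)), array(lit(2), lit(4))).as("m"))
  .show(false)   // expected map: 1 -> 2, 3 -> 4
{code}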



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465746#comment-16465746
 ] 

Apache Spark commented on SPARK-23933:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21258

> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23933) High-order function: map(array, array) → map

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23933:


Assignee: Apache Spark

> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24194:


Assignee: Apache Spark

> HadoopFsRelation cannot overwrite a path that is also being read from
> -
>
> Key: SPARK-24194
> URL: https://issues.apache.org/jira/browse/SPARK-24194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master
>Reporter: yangz
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running
> {code:java}
> INSERT OVERWRITE TABLE territory_count_compare select * from 
> territory_count_compare where shop_count!=real_shop_count
> {code}
> and territory_count_compare is a Parquet table, there will be an error: 
> Cannot overwrite a path that is also being read from.
>  
> And in the file MetastoreDataSourceSuite.scala there is a test case
> {code:java}
> table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)
> {code}
>  
> But when the table territory_count_compare is a plain Hive table, there is 
> no error. 
> So I think the reason is that when inserting overwrite into a HadoopFsRelation 
> with a static partition, it first deletes the partition in the output, but 
> that should only happen at the time the job is committed.
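
A common workaround for this error, shown below as a spark-shell sketch (where 
{{spark}} has Hive support enabled), is to stage the filtered rows in a separate 
table first so the INSERT OVERWRITE no longer reads from the path it rewrites. The 
"_staging" table name is invented for the example; this is not part of the report 
or the proposed fix.

{code:scala}
// Workaround sketch: materialise the filtered rows, then overwrite from the copy.
spark.sql("DROP TABLE IF EXISTS territory_count_compare_staging")
spark.sql(
  """CREATE TABLE territory_count_compare_staging AS
    |SELECT * FROM territory_count_compare WHERE shop_count != real_shop_count""".stripMargin)
spark.sql(
  """INSERT OVERWRITE TABLE territory_count_compare
    |SELECT * FROM territory_count_compare_staging""".stripMargin)
spark.sql("DROP TABLE territory_count_compare_staging")
{code}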



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24194:


Assignee: (was: Apache Spark)

> HadoopFsRelation cannot overwrite a path that is also being read from
> -
>
> Key: SPARK-24194
> URL: https://issues.apache.org/jira/browse/SPARK-24194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master
>Reporter: yangz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When
> {code:java}
> INSERT OVERWRITE TABLE territory_count_compare select * from 
> territory_count_compare where shop_count!=real_shop_count
> {code}
> And territory_count_compare is a table with parquet, there will be a error 
> Cannot overwrite a path that is also being read from
>  
> And in file MetastoreDataSourceSuite.scala, there have a test case
>  
>  
> {code:java}
> table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)
> {code}
>  
> But when the table territory_count_compare is a common hive table, there is 
> no error. 
> So I think the reason is when insert overwrite into hadoopfs relation with 
> static partition, it first delete the partition in the output. But it should 
> be the time when the job commited.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465732#comment-16465732
 ] 

Apache Spark commented on SPARK-24194:
--

User 'zheh12' has created a pull request for this issue:
https://github.com/apache/spark/pull/21257

> HadoopFsRelation cannot overwrite a path that is also being read from
> -
>
> Key: SPARK-24194
> URL: https://issues.apache.org/jira/browse/SPARK-24194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: spark master
>Reporter: yangz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.4.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When
> {code:java}
> INSERT OVERWRITE TABLE territory_count_compare select * from 
> territory_count_compare where shop_count!=real_shop_count
> {code}
> And territory_count_compare is a table with parquet, there will be a error 
> Cannot overwrite a path that is also being read from
>  
> And in file MetastoreDataSourceSuite.scala, there have a test case
>  
>  
> {code:java}
> table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName)
> {code}
>  
> But when the table territory_count_compare is a common hive table, there is 
> no error. 
> So I think the reason is when insert overwrite into hadoopfs relation with 
> static partition, it first delete the partition in the output. But it should 
> be the time when the job commited.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))

2018-05-07 Thread Ajay Monga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465729#comment-16465729
 ] 

Ajay Monga commented on SPARK-24177:


Thanks Marco. We have a few systems running on the latest version of Spark, but 
the system that behaved erratically is still on 1.6. We are planning to move it 
to a later version, possibly 2.2, but I would appreciate it if someone could 
confirm my understanding.

> Spark returning inconsistent rows and data in a join query when run using 
> Spark SQL (using SQLContext.sql(...))
> ---
>
> Key: SPARK-24177
> URL: https://issues.apache.org/jira/browse/SPARK-24177
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
> Environment: Production
>Reporter: Ajay Monga
>Priority: Major
>
> Spark SQL is returning inconsistent results for a JOIN query. It returns 
> different rows, and the value of the column on which a simple multiplication 
> takes place comes back with different values:
> The query is like:
> SELECT
>  second_table.date_value, SUM(XXX * second_table.shift_value)
>  FROM
>  (
>  SELECT
>  date_value, SUM(value) as XXX
>  FROM first_table
>  WHERE
>  AND date IN ( '2018-01-01', '2018-01-02' )
>  GROUP BY date_value
>  )
>  intermediate LEFT OUTER
>  JOIN second_table ON second_table.date_value = (<the shifted 'date_value' from first table: say if it's a Saturday or 
> Sunday then use Monday, else the next valid working date>)
>  AND second_table.date_value IN (
>  '2018-01-02',
>  '2018-01-03'
>  )
>  GROUP BY second_table.date_value
>  
> The suspicion is that the execution of the above query is split into two 
> queries - one for first_table and one for second_table - before joining. The 
> results then get split across partitions, seemingly grouped/distributed by the 
> join column, which is 'date_value'. In the join there is date-shift logic 
> that fails to join in some cases when it should, primarily for the 
> date_values at the edge of the partitions distributed across the executors. 
> So the execution depends on how the data (or the RDD) of the individual 
> queries is partitioned in the first place, which is not ideal, as a normal 
> looking ANSI-standard SQL query is not behaving consistently.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24199) Structured Streaming

2018-05-07 Thread shuke (JIRA)
shuke created SPARK-24199:
-

 Summary: Structured Streaming
 Key: SPARK-24199
 URL: https://issues.apache.org/jira/browse/SPARK-24199
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.2.0
Reporter: shuke


h3. Hey, when I use the where operator to filter data while using Structured 
Streaming, I run into a problem.

The relevant part of the code (the beginning of the snippet was cut off in the 
report):

// ... column list passed to select(...) on the streaming DataFrame
get_json_object(col("value"), "$.type").cast(DataTypes.StringType).alias("type"),
get_json_object(col("value"), "$.saleData.type").cast(DataTypes.StringType).alias("saleDataType"),
get_json_object(col("value"), "$.uid").cast(DataTypes.IntegerType).alias(ROI_SHOP_KEY),
from_unixtime(get_json_object(col("value"), "$.time").cast(DataTypes.IntegerType), "yyyy-MM-dd").alias("event_time"),
// get_json_object(xjson, '$.balanceData.money')/100 as money,
(get_json_object(col("value"), "$.balanceData.money").cast(DataTypes.DoubleType) / 100).alias(BUSINESS_AMOUNT),
get_json_object(col("value"), "$.shopData.id").cast(DataTypes.LongType).alias(DARK_ID),
get_json_object(col("value"), "$.balanceData.out_trade_no").cast(DataTypes.StringType).alias(OUT_TRADE_NO),
get_json_object(col("value"), "$.balanceData.type").cast(DataTypes.StringType).alias(BalanceData_type)
)
.where("(type = 'residue' and saleDataType = '4' and shop_id in ('8610022','5382783')) or type = 'promotion' ")
.select(col("*"))
.writeStream
.trigger(Trigger.ProcessingTime(5000))
.outputMode("Update")
.format("console")
.start()

I find that the data is not filtered as expected when the where clause is used 
this way.

Can anyone help?

Best wishes



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16406) Reference resolution for large number of columns should be faster

2018-05-07 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-16406.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Reference resolution for large number of columns should be faster
> -
>
> Key: SPARK-16406
> URL: https://issues.apache.org/jira/browse/SPARK-16406
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Major
> Fix For: 2.4.0
>
>
> Resolving columns in a LogicalPlan on average takes n / 2 (n being the number 
> of columns). This gets problematic as soon as you try to resolve a large 
> number of columns (m) on a large table: O(m * n / 2)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24197) add array_sort function

2018-05-07 Thread Marek Novotny (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marek Novotny updated SPARK-24197:
--
Description: Add a SparkR equivalent function to 
[SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].  (was: Add a 
SparkR equivalent function to SPARK-23921.)

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24197) add array_sort function

2018-05-07 Thread Marek Novotny (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marek Novotny updated SPARK-24197:
--
Description: Add a SparkR equivalent function to SPARK-23921.  (was: Add a 
SparkR equivalent function for 
[SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].)

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function to SPARK-23921.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24198) add slice function

2018-05-07 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465627#comment-16465627
 ] 

Marek Novotny commented on SPARK-24198:
---

I will work on this. Thanks.

> add slice function
> --
>
> Key: SPARK-24198
> URL: https://issues.apache.org/jira/browse/SPARK-24198
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function to 
> [SPARK-23930|https://issues.apache.org/jira/browse/SPARK-23930].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24198) add slice function

2018-05-07 Thread Marek Novotny (JIRA)
Marek Novotny created SPARK-24198:
-

 Summary: add slice function
 Key: SPARK-24198
 URL: https://issues.apache.org/jira/browse/SPARK-24198
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Marek Novotny


Add a SparkR equivalent function to 
[SPARK-23930|https://issues.apache.org/jira/browse/SPARK-23930].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24197) add array_sort function

2018-05-07 Thread Marek Novotny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465626#comment-16465626
 ] 

Marek Novotny commented on SPARK-24197:
---

I will work on this. Thanks.

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Priority: Major
>
> Add a SparkR equivalent function for 
> [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24197) add array_sort function

2018-05-07 Thread Marek Novotny (JIRA)
Marek Novotny created SPARK-24197:
-

 Summary: add array_sort function
 Key: SPARK-24197
 URL: https://issues.apache.org/jira/browse/SPARK-24197
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Marek Novotny


Add a SparkR equivalent function for 
[SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23930) High-order function: slice(x, start, length) → array

2018-05-07 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23930.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21040
[https://github.com/apache/spark/pull/21040]

> High-order function: slice(x, start, length) → array
> 
>
> Key: SPARK-23930
> URL: https://issues.apache.org/jira/browse/SPARK-23930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Subsets array x starting from index start (or starting from the end if start 
> is negative) with a length of length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23930) High-order function: slice(x, start, length) → array

2018-05-07 Thread Takuya Ueshin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23930:
-

Assignee: Marco Gaido

> High-order function: slice(x, start, length) → array
> 
>
> Key: SPARK-23930
> URL: https://issues.apache.org/jira/browse/SPARK-23930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Marco Gaido
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Subsets array x starting from index start (or starting from the end if start 
> is negative) with a length of length.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections does't show db artefacts

2018-05-07 Thread rr (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rr updated SPARK-24196:
---
Attachment: screenshot-1.png

> Spark Thrift Server - SQL Client connections does't show db artefacts
> -
>
> Key: SPARK-24196
> URL: https://issues.apache.org/jira/browse/SPARK-24196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: rr
>Priority: Major
> Attachments: screenshot-1.png
>
>
> When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are 
> not showing up, whereas when connecting to hiveserver2 the schema, tables, 
> columns ... are shown.
> SQL clients used: IBM Data Studio, DBeaver SQL Client



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24160) ShuffleBlockFetcherIterator should fail if it receives zero-size blocks

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465536#comment-16465536
 ] 

Apache Spark commented on SPARK-24160:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21256

> ShuffleBlockFetcherIterator should fail if it receives zero-size blocks
> ---
>
> Key: SPARK-24160
> URL: https://issues.apache.org/jira/browse/SPARK-24160
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Major
> Fix For: 2.4.0
>
>
> In the shuffle layer, we guarantee that zero-size blocks will never be 
> requested (a block containing zero records is always 0 bytes in size and is 
> marked as empty such that it will never be legitimately requested by 
> executors). However, we failed to take advantage of this in the shuffle-read 
> path: the existing code did not explicitly check whether blocks are 
> non-zero-size.
>  
> We should add `buf.size != 0` checks to ShuffleBlockFetcherIterator to take 
> advantage of this invariant and prevent potential data loss / corruption 
> issues. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections does't show db artefacts

2018-05-07 Thread rr (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rr updated SPARK-24196:
---
Description: 
When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not 
showing up,

whereas when connecting to hiveserver2 the schema, tables, columns ... are shown.

SQL clients used: IBM Data Studio, DBeaver SQL Client

  was:
When connecting to Spark Thrift Server via JDBC artefacts(db objects are not 
showing up)

whereas when connecting to hiveserver2 is shows the schema, tables, columns ...

SQL Client user: IBM Data Studio, DBeaver SQL Client


> Spark Thrift Server - SQL Client connections does't show db artefacts
> -
>
> Key: SPARK-24196
> URL: https://issues.apache.org/jira/browse/SPARK-24196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: rr
>Priority: Major
>
> When connecting to Spark Thrift Server via JDBC artefacts(db objects are not 
> showing up)
> whereas when connecting to hiveserver2 it shows the schema, tables, columns 
> ...
> SQL Client user: IBM Data Studio, DBeaver SQL Client



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24186) add array reverse and concat

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24186:


Assignee: (was: Apache Spark)

> add array reverse and concat 
> -
>
> Key: SPARK-24186
> URL: https://issues.apache.org/jira/browse/SPARK-24186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
> https://issues.apache.org/jira/browse/SPARK-23926
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24186) add array reverse and concat

2018-05-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465526#comment-16465526
 ] 

Apache Spark commented on SPARK-24186:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/21255

> add array reverse and concat 
> -
>
> Key: SPARK-24186
> URL: https://issues.apache.org/jira/browse/SPARK-24186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
> https://issues.apache.org/jira/browse/SPARK-23926
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections does't show db artefacts

2018-05-07 Thread rr (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rr updated SPARK-24196:
---
Description: 
When connecting to Spark Thrift Server via JDBC artefacts(db objects are not 
showing up)

whereas when connecting to hiveserver2 is shows the schema, tables, columns ...

SQL Client user: IBM Data Studio, DBeaver SQL Client

  was:
When connecting to Spark Thrift Server via JDBC artefacts(db objects are not 
showing up)

whereas when connecting to hiveserver2 is shows the schema, tables, colums ...

SQL Client user: IBM Data Studio, DBeaver SQL Client


> Spark Thrift Server - SQL Client connections does't show db artefacts
> -
>
> Key: SPARK-24196
> URL: https://issues.apache.org/jira/browse/SPARK-24196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: rr
>Priority: Major
>
> When connecting to Spark Thrift Server via JDBC artefacts(db objects are not 
> showing up)
> whereas when connecting to hiveserver2 is shows the schema, tables, columns 
> ...
> SQL Client user: IBM Data Studio, DBeaver SQL Client



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24196) Spark Thrift Server - SQL Client connections does't show db artefacts

2018-05-07 Thread rr (JIRA)
rr created SPARK-24196:
--

 Summary: Spark Thrift Server - SQL Client connections does't show 
db artefacts
 Key: SPARK-24196
 URL: https://issues.apache.org/jira/browse/SPARK-24196
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: rr


When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not 
showing up,

whereas when connecting to hiveserver2 the schema, tables, columns ... are shown.

SQL clients used: IBM Data Studio, DBeaver SQL Client



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24186) add array reverse and concat

2018-05-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24186:


Assignee: Apache Spark

> add array reverse and concat 
> -
>
> Key: SPARK-24186
> URL: https://issues.apache.org/jira/browse/SPARK-24186
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and 
> https://issues.apache.org/jira/browse/SPARK-23926
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org