[jira] [Resolved] (SPARK-24188) /api/v1/version not working
[ https://issues.apache.org/jira/browse/SPARK-24188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-24188. - Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21245 [https://github.com/apache/spark/pull/21245] > /api/v1/version not working > --- > > Key: SPARK-24188 > URL: https://issues.apache.org/jira/browse/SPARK-24188 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.0, 2.3.1 > > > That URI from the REST API is currently returning a 404. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24188) /api/v1/version not working
[ https://issues.apache.org/jira/browse/SPARK-24188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao reassigned SPARK-24188: --- Assignee: Marcelo Vanzin > /api/v1/version not working > --- > > Key: SPARK-24188 > URL: https://issues.apache.org/jira/browse/SPARK-24188 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Major > > That URI from the REST API is currently returning a 404. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24200) Read subdirectories without asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466948#comment-16466948 ] kumar commented on SPARK-24200: --- This is not a question about how; it's an improvement suggestion. I found a solution to make it work, but I am wondering why subdirectories are not considered without giving asterisks. > Read subdirectories without asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/*/*"; > sparkContext.textFile(folder, 1).toJavaRDD() > Are asterisks mandatory to read a folder? Yes; otherwise files under subdirectories are not read. > What if I get a folder that has more subdirectories than the number of asterisks mentioned? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2) How can this expression be made *generic*? It should still work if I get a > folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}} > My input folder structure is not the same all the time. > Does anything exist in Spark to handle this kind of scenario? I know > you might have thought about this, but I am wondering why it has not been > implemented. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24200) Read subdirectories without asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar updated SPARK-24200: -- Description: String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 1).toJavaRDD() Are asterisks mandatory to read a folder? Yes; otherwise files under subdirectories are not read. What if I get a folder that has more subdirectories than the number of asterisks mentioned? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2) How can this expression be made *generic*? It should still work if I get a folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not the same all the time. Does anything exist in Spark to handle this kind of scenario? I know you might have thought about this, but I am wondering why it has not been implemented. was: String folder = "/Users/test/data/* /* "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? I know you might have thought about this, but i am wondering why this has not been implemented ? 
> Read subdirectories without asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/*/*"; > sparkContext.textFile(folder, 1).toJavaRDD() > Are asterisks mandatory to read a folder? Yes; otherwise files under subdirectories are not read. > What if I get a folder that has more subdirectories than the number of asterisks mentioned? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2) How can this expression be made *generic*? It should still work if I get a > folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}} > My input folder structure is not the same all the time. > Does anything exist in Spark to handle this kind of scenario? I know > you might have thought about this, but I am wondering why it has not been > implemented. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
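The depth problem described in SPARK-24200 is a general property of fixed-depth glob patterns, not something specific to Spark. A small stand-alone illustration in plain Python (directory names are invented for the demo; this is not Spark code):

```python
import glob
import os
import tempfile

# Build a deeply nested tree: root/folder1/folder2/folder3/file.txt
root = tempfile.mkdtemp()
deep = os.path.join(root, "folder1", "folder2", "folder3")
os.makedirs(deep)
with open(os.path.join(deep, "file.txt"), "w") as f:
    f.write("hello")

# Fixed-depth pattern "*/*": matches entries exactly two levels down,
# so a file three levels down is never found.
fixed = glob.glob(os.path.join(root, "*", "*"))
fixed_files = [p for p in fixed if os.path.isfile(p)]

# Recursive pattern "**": walks any depth (requires recursive=True).
recursive_files = glob.glob(os.path.join(root, "**", "*.txt"), recursive=True)

print(len(fixed_files), len(recursive_files))  # 0 1
```

In Spark itself, one workaround sometimes suggested is to enable recursive input listing via the Hadoop property `mapreduce.input.fileinputformat.input.dir.recursive`; whether that applies depends on the input format and Spark version, so treat it as a lead rather than a guaranteed fix.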
[jira] [Created] (SPARK-24207) PrefixSpan: R API
Felix Cheung created SPARK-24207: Summary: PrefixSpan: R API Key: SPARK-24207 URL: https://issues.apache.org/jira/browse/SPARK-24207 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.4.0 Reporter: Felix Cheung -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23780) Failed to use googleVis library with new SparkR
[ https://issues.apache.org/jira/browse/SPARK-23780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466920#comment-16466920 ] Felix Cheung commented on SPARK-23780: -- I suppose if you load googleVis first and then SparkR it would have the same effect as Ivan's steps? > Failed to use googleVis library with new SparkR > --- > > Key: SPARK-23780 > URL: https://issues.apache.org/jira/browse/SPARK-23780 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.1 >Reporter: Ivan Dzikovsky >Priority: Major > > I've tried to use the googleVis library with Spark 2.2.1 and ran into a problem. > Steps to reproduce: > # Install R with the googleVis library. > # Run SparkR: > {code} > sparkR --master yarn --deploy-mode client > {code} > # Run code that uses googleVis: > {code} > library(googleVis) > df=data.frame(country=c("US", "GB", "BR"), > val1=c(10,13,14), > val2=c(23,12,32)) > Bar <- gvisBarChart(df) > cat("%html ", Bar$html$chart) > {code} > Then I got the following error message: > {code} > Error : .onLoad failed in loadNamespace() for 'googleVis', details: > call: rematchDefinition(definition, fdef, mnames, fnames, signature) > error: methods can add arguments to the generic 'toJSON' only if '...' is > an argument to the generic > Error : package or namespace load failed for 'googleVis' > {code} > But the expected result is to get some HTML output, as it was with Spark > 2.1.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24206) Improve DataSource benchmark code for read and pushdown
[ https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466902#comment-16466902 ] Apache Spark commented on SPARK-24206: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/21266 > Improve DataSource benchmark code for read and pushdown > --- > > Key: SPARK-24206 > URL: https://issues.apache.org/jira/browse/SPARK-24206 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > I improved the DataSource code for read and pushdown in the parquet v1.10.0 > upgrade activity: [https://github.com/apache/spark/pull/21070] > Based on the code, we need to brush up the benchmark code and results in the > master. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24206) Improve DataSource benchmark code for read and pushdown
[ https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24206: Assignee: Apache Spark > Improve DataSource benchmark code for read and pushdown > --- > > Key: SPARK-24206 > URL: https://issues.apache.org/jira/browse/SPARK-24206 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > I improved the DataSource code for read and pushdown in the parquet v1.10.0 > upgrade activity: [https://github.com/apache/spark/pull/21070] > Based on the code, we need to brush up the benchmark code and results in the > master. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24206) Improve DataSource benchmark code for read and pushdown
[ https://issues.apache.org/jira/browse/SPARK-24206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24206: Assignee: (was: Apache Spark) > Improve DataSource benchmark code for read and pushdown > --- > > Key: SPARK-24206 > URL: https://issues.apache.org/jira/browse/SPARK-24206 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > I improved the DataSource code for read and pushdown in the parquet v1.10.0 > upgrade activity: [https://github.com/apache/spark/pull/21070] > Based on the code, we need to brush up the benchmark code and results in the > master. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466897#comment-16466897 ] Vikram Agrawal commented on SPARK-18165: Thanks [~marmbrus] - Planning to start the work on porting the connector in the next few weeks. Will share my feedback / ask for help once I am ready. - Thanks for your suggestion. Will check out Apache Bahir/Spark Packages and start a PR once I have ported my changes to the DataSourceV2 APIs. > Kinesis support in Structured Streaming > --- > > Key: SPARK-18165 > URL: https://issues.apache.org/jira/browse/SPARK-18165 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Lauren Moos >Priority: Major > > Implement Kinesis-based sources and sinks for Structured Streaming -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-20114: --- Component/s: (was: PySpark) > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Creating this JIRA to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ML with the DataFrame API. > First, a few design issues to be discussed; subtasks such as the Scala, > Python and R APIs will then be created. > # Wrapping the MLlib PrefixSpan and providing a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which are not well suited to direct use for prediction on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks to Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, the options > are: > #* Implement a dummy transform for PrefixSpanModel that does not add a > new column to the input Dataset. The PrefixSpanModel is then only used to provide > access to the frequent sequential patterns. > #* Add the ability to extract sequential rules from sequential > patterns, then use the sequential rules in the transform as FPGrowthModel does. > The rules extracted are of the form X -> Y where X and Y are sequential > patterns. In practice, however, these rules are not very good, as they are too > precise and thus not noise tolerant. > # Unlike association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth and ERMiner. The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y; this is more general and can work > better in practice for prediction. > I'd like to hear more from users about which kind of sequential rules > is more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
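The rule semantics described above (X and Y are unordered sets of items, with every item of X occurring strictly before every item of Y) can be illustrated with a small stand-alone check. This is a toy sketch of the matching predicate only, not Spark or PrefixSpan code, and the helper name `rule_matches` is invented:

```python
def rule_matches(sequence, x, y):
    """Toy check that sequential rule X -> Y holds in a sequence of itemsets:
    all items of X must appear before all items of Y (both unordered)."""
    # Find the earliest position by which every item of X has occurred.
    remaining = set(x)
    last_x = -1
    for i, itemset in enumerate(sequence):
        remaining -= set(itemset)
        if not remaining:
            last_x = i
            break
    if remaining:
        return False  # some item of X never occurs at all
    # Every item of Y must occur strictly after that position.
    needed = set(y)
    for itemset in sequence[last_x + 1:]:
        needed -= set(itemset)
    return not needed

seq = [["a"], ["b", "c"], ["d"], ["e"]]
print(rule_matches(seq, {"a", "b"}, {"d"}))  # True: a and b occur before d
print(rule_matches(seq, {"d"}, {"a"}))       # False: a never occurs after d
```

Using the earliest completion point of X is the permissive interpretation (it leaves the most room for Y), which matches the "X before Y" definition quoted in the issue.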
[jira] [Updated] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API
[ https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-24146: --- Component/s: PySpark > spark.ml parity for sequential pattern mining - PrefixSpan: Python API > -- > > Key: SPARK-24146 > URL: https://issues.apache.org/jira/browse/SPARK-24146 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Major > > spark.ml parity for sequential pattern mining - PrefixSpan: Python API -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API
[ https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466896#comment-16466896 ] Apache Spark commented on SPARK-24146: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/21265 > spark.ml parity for sequential pattern mining - PrefixSpan: Python API > -- > > Key: SPARK-24146 > URL: https://issues.apache.org/jira/browse/SPARK-24146 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Major > > spark.ml parity for sequential pattern mining - PrefixSpan: Python API -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API
[ https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24146: Assignee: Apache Spark > spark.ml parity for sequential pattern mining - PrefixSpan: Python API > -- > > Key: SPARK-24146 > URL: https://issues.apache.org/jira/browse/SPARK-24146 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > spark.ml parity for sequential pattern mining - PrefixSpan: Python API -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-20114: --- Component/s: PySpark > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Creating this JIRA to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ML with the DataFrame API. > First, a few design issues to be discussed; subtasks such as the Scala, > Python and R APIs will then be created. > # Wrapping the MLlib PrefixSpan and providing a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which are not well suited to direct use for prediction on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks to Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, the options > are: > #* Implement a dummy transform for PrefixSpanModel that does not add a > new column to the input Dataset. The PrefixSpanModel is then only used to provide > access to the frequent sequential patterns. > #* Add the ability to extract sequential rules from sequential > patterns, then use the sequential rules in the transform as FPGrowthModel does. > The rules extracted are of the form X -> Y where X and Y are sequential > patterns. In practice, however, these rules are not very good, as they are too > precise and thus not noise tolerant. > # Unlike association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth and ERMiner. The rules are X -> Y where X is unordered and Y is > unordered, but X must appear before Y; this is more general and can work > better in practice for prediction. > I'd like to hear more from users about which kind of sequential rules > is more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24146) spark.ml parity for sequential pattern mining - PrefixSpan: Python API
[ https://issues.apache.org/jira/browse/SPARK-24146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24146: Assignee: (was: Apache Spark) > spark.ml parity for sequential pattern mining - PrefixSpan: Python API > -- > > Key: SPARK-24146 > URL: https://issues.apache.org/jira/browse/SPARK-24146 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Major > > spark.ml parity for sequential pattern mining - PrefixSpan: Python API -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24206) Improve DataSource benchmark code for read and pushdown
Takeshi Yamamuro created SPARK-24206: Summary: Improve DataSource benchmark code for read and pushdown Key: SPARK-24206 URL: https://issues.apache.org/jira/browse/SPARK-24206 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro I improved the DataSource code for read and pushdown in the parquet v1.10.0 upgrade activity: [https://github.com/apache/spark/pull/21070] Based on the code, we need to brush up the benchmark code and results in the master. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24128) Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg
[ https://issues.apache.org/jira/browse/SPARK-24128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-24128: Assignee: Henry Robinson > Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg > --- > > Key: SPARK-24128 > URL: https://issues.apache.org/jira/browse/SPARK-24128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Assignee: Henry Robinson >Priority: Minor > Fix For: 2.3.1, 2.4.0 > > > The error message given when a query contains an implicit cartesian product > suggests rewriting the query using {{CROSS JOIN}}, but not disabling the > check using {{spark.sql.crossJoin.enabled=true}}. It's sometimes easier to > change a config variable than edit a query, so it would be helpful to make > the user aware of their options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24128) Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg
[ https://issues.apache.org/jira/browse/SPARK-24128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24128. -- Resolution: Fixed Fix Version/s: 2.3.1 2.4.0 Issue resolved by pull request 21201 [https://github.com/apache/spark/pull/21201] > Mention spark.sql.crossJoin.enabled in implicit cartesian product error msg > --- > > Key: SPARK-24128 > URL: https://issues.apache.org/jira/browse/SPARK-24128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Henry Robinson >Assignee: Henry Robinson >Priority: Minor > Fix For: 2.4.0, 2.3.1 > > > The error message given when a query contains an implicit cartesian product > suggests rewriting the query using {{CROSS JOIN}}, but not disabling the > check using {{spark.sql.crossJoin.enabled=true}}. It's sometimes easier to > change a config variable than edit a query, so it would be helpful to make > the user aware of their options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-23975. --- Resolution: Fixed Fix Version/s: 2.4.0 > Allow Clustering to take Arrays of Double as input features > --- > > Key: SPARK-23975 > URL: https://issues.apache.org/jira/browse/SPARK-23975 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Lu Wang >Assignee: Lu Wang >Priority: Major > Fix For: 2.4.0 > > > Clustering algorithms should accept Arrays in addition to Vectors as input > features. The Python interface should also be changed accordingly, which would make > PySpark a lot easier to use. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos
[ https://issues.apache.org/jira/browse/SPARK-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] joy-m updated SPARK-24205: -- Attachment: 屏幕快照 2018-05-08 上午10.58.08.png > java.util.concurrent.locks.LockSupport.parkNanos > > > Key: SPARK-24205 > URL: https://issues.apache.org/jira/browse/SPARK-24205 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: joy-m >Priority: Major > Attachments: 屏幕快照 2018-05-08 上午10.58.08.png > > > When I use YARN client mode, the Spark task gets stuck in the collect stage. > Because the data is on the driver machine, I used client mode to run my > application, but the collect stage hung. > countDf.collect().map(_.getLong(0)).mkString.toLong > {{sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > > java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475) > org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) > org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231) > > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) > java.security.AccessController.doPrivileged(Native Method) > javax.security.auth.Subject.doAs(Subject.java:422) > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) > > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284) > > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos
[ https://issues.apache.org/jira/browse/SPARK-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] joy-m updated SPARK-24205: -- Attachment: (was: 屏幕快照 2018-05-06 上午10.04.27.png) > java.util.concurrent.locks.LockSupport.parkNanos > > > Key: SPARK-24205 > URL: https://issues.apache.org/jira/browse/SPARK-24205 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: joy-m >Priority: Major > > When I use YARN client mode, the Spark task gets stuck in the collect stage. > Because the data is on the driver machine, I used client mode to run my > application, but the collect stage hung. > countDf.collect().map(_.getLong(0)).mkString.toLong > {{sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > > java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475) > org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) > org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231) > > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) > java.security.AccessController.doPrivileged(Native Method) > javax.security.auth.Subject.doAs(Subject.java:422) > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) > > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284) > > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos
[ https://issues.apache.org/jira/browse/SPARK-24205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] joy-m updated SPARK-24205: -- Attachment: 屏幕快照 2018-05-06 上午10.04.27.png > java.util.concurrent.locks.LockSupport.parkNanos > > > Key: SPARK-24205 > URL: https://issues.apache.org/jira/browse/SPARK-24205 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: joy-m >Priority: Major > Attachments: 屏幕快照 2018-05-06 上午10.04.27.png > > > When I use YARN client mode, the Spark task gets stuck in the collect stage. > Because the data is on the driver machine, I used client mode to run my > application, but the collect stage hung. > countDf.collect().map(_.getLong(0)).mkString.toLong > {{sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) > > java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475) > org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) > org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231) > > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) > org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) > java.security.AccessController.doPrivileged(Native Method) > javax.security.auth.Subject.doAs(Subject.java:422) > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) > > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188) > > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284) > > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24205) java.util.concurrent.locks.LockSupport.parkNanos
joy-m created SPARK-24205: - Summary: java.util.concurrent.locks.LockSupport.parkNanos Key: SPARK-24205 URL: https://issues.apache.org/jira/browse/SPARK-24205 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.2.0 Reporter: joy-m When I use YARN client mode, the Spark task locks up in the collect stage. Because the data is on the driver machine, I used client mode to run my application, but the collect stage was locked: countDf.collect().map(_.getLong(0)).mkString.toLong {{sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475) org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180) org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281) org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231) org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67) org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66) java.security.AccessController.doPrivileged(Native Method) javax.security.auth.Subject.doAs(Subject.java:422) org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692) org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66) org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188) org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284) org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24204) Verify a write schema in OrcFileFormat
[ https://issues.apache.org/jira/browse/SPARK-24204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466742#comment-16466742 ] Takeshi Yamamuro commented on SPARK-24204: -- This fix is like: https://github.com/apache/spark/compare/master...maropu:VerifySchemaInOrc {code} scala> df.write.orc("/tmp/orc") java.lang.UnsupportedOperationException: ORC data source does not support null data type. at org.apache.spark.sql.execution.datasources.orc.OrcSerializer$.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$verifyType$1(OrcSerializer.scala:251) at org.apache.spark.sql.execution.datasources.orc.OrcSerializer$$anonfun$verifySchema$1.apply(OrcSerializer.scala:255) at org.apache.spark.sql.execution.datasources.orc.OrcSerializer$$anonfun$verifySchema$1.apply(OrcSerializer.scala:255) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) at org.apache.spark.sql.execution.datasources.orc.OrcSerializer$.verifySchema(OrcSerializer.scala:255) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.prepareWrite(OrcFileFormat.scala:92) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:140) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) {code} cc: [~dongjoon] > Verify a write schema in OrcFileFormat > -- > > Key: SPARK-24204 > URL: https://issues.apache.org/jira/browse/SPARK-24204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > The native orc file format throws an exception with a meaningless 
message on the > executor side when unsupported types are passed: > {code} > scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, > null))) > scala> val schema = StructType(StructField("a", IntegerType) :: > StructField("b", NullType) :: Nil) > scala> val df = spark.createDataFrame(rdd, schema) > scala> df.write.orc("/tmp/orc") > java.lang.IllegalArgumentException: Can't parse category at > 'struct' > at > org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223) > at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332) > at > org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327) > at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385) > at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406) > at > org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializer.scala:226) > at > org.apache.spark.sql.execution.datasources.orc.OrcSerializer.(OrcSerializer.scala:36) > at > org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.(OrcOutputWriter.scala:36) > at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) > {code} > It seems better to verify the write schema on the driver side for users, > as is already done for the CSV format; > 
https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
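The driver-side verification proposed above can be sketched in isolation. The snippet below is an illustration only — plain Python with a hypothetical `verify_schema` helper, not Spark's actual OrcFileFormat code — showing the idea of walking a write schema recursively and failing fast, on the driver, when an unsupported type such as null is found:

```python
# Illustrative sketch (not Spark's implementation): verify a write schema
# up front so the job fails on the driver with a clear message, instead of
# failing later on an executor with an unparseable type string.

UNSUPPORTED = {"null"}  # hypothetical set of types the sink cannot write

def verify_schema(schema):
    """schema: list of (field_name, dtype) pairs; dtype is a type-name
    string or a nested list for struct fields."""
    for name, dtype in schema:
        if isinstance(dtype, list):   # nested struct: recurse into its fields
            verify_schema(dtype)
        elif dtype in UNSUPPORTED:
            raise ValueError(
                f"ORC data source does not support {dtype} data type "
                f"(field '{name}')")

verify_schema([("a", "int")])  # a supported schema passes silently
try:
    # models StructField("a", IntegerType) :: StructField("b", NullType)
    verify_schema([("a", "int"), ("b", "null")])
except ValueError as e:
    print(e)  # raised before any executor task starts
```

The point of the check is only to move the failure earlier and make the message meaningful; the set of rejected types and the schema representation here are assumptions for the sketch.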
[jira] [Assigned] (SPARK-24084) Add job group id for query through spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24084: Assignee: (was: Apache Spark) > Add job group id for query through spark-sql > > > Key: SPARK-24084 > URL: https://issues.apache.org/jira/browse/SPARK-24084 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhoukang >Priority: Major > > For spark-sql we can add job group id for the same statement. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24084) Add job group id for query through spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466741#comment-16466741 ] Apache Spark commented on SPARK-24084: -- User 'caneGuy' has created a pull request for this issue: https://github.com/apache/spark/pull/21263 > Add job group id for query through spark-sql > > > Key: SPARK-24084 > URL: https://issues.apache.org/jira/browse/SPARK-24084 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhoukang >Priority: Major > > For spark-sql we can add job group id for the same statement. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24084) Add job group id for query through spark-sql
[ https://issues.apache.org/jira/browse/SPARK-24084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24084: Assignee: Apache Spark > Add job group id for query through spark-sql > > > Key: SPARK-24084 > URL: https://issues.apache.org/jira/browse/SPARK-24084 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: zhoukang >Assignee: Apache Spark >Priority: Major > > For spark-sql we can add job group id for the same statement. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24204) Verify a write schema in OrcFileFormat
Takeshi Yamamuro created SPARK-24204: Summary: Verify a write schema in OrcFileFormat Key: SPARK-24204 URL: https://issues.apache.org/jira/browse/SPARK-24204 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Takeshi Yamamuro The native ORC file format throws an exception with a meaningless message on the executor side when unsupported types are passed: {code} scala> val rdd = spark.sparkContext.parallelize(List(Row(1, null), Row(2, null))) scala> val schema = StructType(StructField("a", IntegerType) :: StructField("b", NullType) :: Nil) scala> val df = spark.createDataFrame(rdd, schema) scala> df.write.orc("/tmp/orc") java.lang.IllegalArgumentException: Can't parse category at 'struct' at org.apache.orc.TypeDescription.parseCategory(TypeDescription.java:223) at org.apache.orc.TypeDescription.parseType(TypeDescription.java:332) at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:327) at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385) at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406) at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializer.scala:226) at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.(OrcSerializer.scala:36) at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.(OrcOutputWriter.scala:36) at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376) at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278) {code} It seems to
be better to verify the write schema on the driver side for users, as is already done for the CSV format; https://github.com/apache/spark/blob/76ecd095024a658bf68e5db658e4416565b30c17/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L65 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24200) Read subdirectories with out asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466700#comment-16466700 ] Hyukjin Kwon commented on SPARK-24200: -- If it's a question, I would suggest asking it on the mailing list first before filing a JIRA issue. > Read subdirectories with out asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/* /* "; > sparkContext.textFile(folder, 1).toJavaRDD() > Are asterisks mandatory to read a folder? Yes; otherwise it does not read > files under subdirectories. > What if I get a folder that has more subdirectories than the number of > asterisks mentioned? How do I handle this scenario? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2) How can this expression be made *generic*? It should still work if I get a > folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}}. > My input folder structure is not the same all the time. > Does anything exist in Spark to handle this kind of scenario? I know > you might have thought about this, but I am wondering why it has not been > implemented. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
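The "generic" matching the reporter asks about is essentially a recursive glob. As a plain-Python sketch of the idea (illustrative only — Spark's own path resolution goes through Hadoop's glob support, not Python's), a single `**` pattern covers any nesting depth without counting asterisks:

```python
import glob
import os
import tempfile

# Build a small tree with files at two different depths.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "folder1", "folder2"))
open(os.path.join(root, "folder1", "a.txt"), "w").close()
open(os.path.join(root, "folder1", "folder2", "b.txt"), "w").close()

# '**' with recursive=True matches any number of directory levels,
# so one pattern finds both a.txt and b.txt.
files = glob.glob(os.path.join(root, "**", "*.txt"), recursive=True)
print(sorted(os.path.basename(f) for f in files))  # ['a.txt', 'b.txt']
```

Hadoop-style path globs (what `textFile` accepts) have no direct `**` equivalent, which is why a fixed number of `/*` levels must match the actual depth; listing leaf directories yourself, or a recursive pattern at another layer, is the usual workaround.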
[jira] [Resolved] (SPARK-24199) Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-24199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-24199. -- Resolution: Invalid Questions should go to the mailing list rather than being filed as an issue here; I believe you would get a better answer there. > Structured Streaming > > > Key: SPARK-24199 > URL: https://issues.apache.org/jira/browse/SPARK-24199 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: shuke >Priority: Major > > h3. Hey, when I use the {color:#FF}where{color} operator to filter data > while using Structured Streaming, > I run into a problem: > > > get_json_object(col("value"),"$.type").cast(DataTypes.StringType).alias("type"), > > get_json_object(col("value"),"$.saleData.type").cast(DataTypes.StringType).alias("saleDataType",get_json_object(col("value"), > "$.uid").cast(DataTypes.IntegerType).alias(ROI_SHOP_KEY), > from_unixtime(get_json_object(col("value"), > "$.time").cast(DataTypes.IntegerType),"-MM-dd").alias("event_time"), > //get_json_object(xjson, '$.balanceData.money')/100 as money, > (get_json_object(col("value"), > "$.balanceData.money").cast(DataTypes.DoubleType) / > 100).alias(BUSINESS_AMOUNT), > get_json_object(col("value"), > "$.shopData.id").cast(DataTypes.LongType).alias(DARK_ID), > get_json_object(col("value"), > "$.balanceData.out_trade_no").cast(DataTypes.StringType).alias(OUT_TRADE_NO), > get_json_object(col("value"), > "$.balanceData.type").cast(DataTypes.StringType).alias(BalanceData_type) > ) > {color:#FF}.where("(type = 'residue' and saleDataType = '4' and shop_id > in ('8610022','5382783')) or type = 'promotion' "{color} ) > .select(col("*")) > .writeStream > .trigger(Trigger.ProcessingTime(5000)) > .outputMode("Update") > .format("console") > .start() > = > I find that it does not work when I filter data this way. > Can anyone help? > Best wishes > 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24172) we should not apply operator pushdown to data source v2 many times
[ https://issues.apache.org/jira/browse/SPARK-24172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466638#comment-16466638 ] Apache Spark commented on SPARK-24172: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/21262 > we should not apply operator pushdown to data source v2 many times > -- > > Key: SPARK-24172 > URL: https://issues.apache.org/jira/browse/SPARK-24172 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20114. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20973 [https://github.com/apache/spark/pull/20973] > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. 
> # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is > unordered, but X must appear before Y, which is more general and can work > better in practice for prediction. > I'd like to hear more from the users to see which kind of Sequential rules > are more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22885. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20261 [https://github.com/apache/spark/pull/20261] > ML test for StructuredStreaming: spark.ml.tuning > > > Key: SPARK-22885 > URL: https://issues.apache.org/jira/browse/SPARK-22885 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1
[ https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-23291: Fix Version/s: 2.3.1 > SparkR : substr : In SparkR dataframe , starting and ending position > arguments in "substr" is giving wrong result when the position is greater > than 1 > -- > > Key: SPARK-23291 > URL: https://issues.apache.org/jira/browse/SPARK-23291 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0 >Reporter: Narendra >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.1, 2.4.0 > > > Defect Description : > - > For example, an input string "2017-12-01" is read into a SparkR dataframe > "df" with column name "col1". > The target is to create a new column named "col2" with the value "12", > which is inside the string. "12" can be extracted with a "starting position" of > "6" and an "ending position" of "7" > (the starting position of the first character is considered "1"). > But the code that currently needs to be written is: > > df <- withColumn(df,"col2",substr(df$col1,7,8)) > Observe that the first argument in the "substr" API, which indicates the > 'starting position', is given as "7". > Also observe that the second argument in the "substr" API, which indicates > the 'ending position', is given as "8", > i.e. the number mentioned to indicate the position has to be > the "actual position + 1". > Expected behavior : > > The code that should need to be written is: > > df <- withColumn(df,"col2",substr(df$col1,6,7)) > Note : > --- > This defect is observed only when the starting position is greater than > 1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
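The off-by-one described in this report is easiest to see against the intended semantics. Below is a plain-Python model (illustrative only, not SparkR code) of what a 1-based, inclusive `substr(col, start, stop)` should return:

```python
def substr_1based(s, start, stop):
    """Intended semantics of substr: positions are 1-based and the
    range is inclusive on both ends, so (6, 7) selects characters 6 and 7."""
    return s[start - 1:stop]

s = "2017-12-01"
print(substr_1based(s, 6, 7))  # '12' — the month, as the reporter expects
print(substr_1based(s, 1, 4))  # '2017' — start=1 means the first character
```

With the buggy behavior the report describes, users had to shift both arguments by one (`substr(df$col1, 7, 8)`) to get the same result, which is what the fix in 2.3.1/2.4.0 removes.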
[jira] [Resolved] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark
[ https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-15750. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 13493 [https://github.com/apache/spark/pull/13493] > Constructing FPGrowth fails when no numPartitions specified in pyspark > -- > > Key: SPARK-15750 > URL: https://issues.apache.org/jira/browse/SPARK-15750 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Major > Fix For: 2.4.0 > > > {code} > >>> model1 = FPGrowth.train(rdd, 0.6) > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, > in train > model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), > int(numPartitions)) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 130, in callMLlibFunc > return callJavaFunc(sc, api, *args) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 123, in callJavaFunc > return _java2py(sc, func(*args)) > File > "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, > in deco > raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of > partitions must be positive but got -1' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
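The failure above comes from forwarding a placeholder partition count to the JVM when the caller omits `numPartitions`. The fix amounts to resolving a sensible default first; the helper below is a hypothetical plain-Python sketch of that guard, not the actual `pyspark.mllib.fpm` code:

```python
def resolve_num_partitions(num_partitions, rdd_partitions):
    """If the caller passed no value (modeled here as None or a
    non-positive sentinel such as -1), fall back to the input RDD's
    partition count instead of forwarding an invalid value onward."""
    if num_partitions is None or num_partitions < 1:
        return rdd_partitions
    return num_partitions

print(resolve_num_partitions(None, 4))  # 4 — default to the RDD's partitioning
print(resolve_num_partitions(-1, 4))    # 4 — sentinel treated the same way
print(resolve_num_partitions(8, 4))     # 8 — explicit value wins
```

The exact sentinel and fallback used by the real patch may differ; the point is that the default is resolved on the Python side before the `requirement failed: Number of partitions must be positive` check can trigger.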
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466513#comment-16466513 ] Joseph K. Bradley commented on SPARK-24152: --- Thank you all! > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > The PR builder and master branch tests fail with the following SparkR error, for an > unknown reason. The following is the error message: > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we have already started merging PRs while ignoring this > **known unknown** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466474#comment-16466474 ] Apache Spark commented on SPARK-24203: -- User 'lukmajercak' has created a pull request for this issue: https://github.com/apache/spark/pull/21261 > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24203: Assignee: (was: Apache Spark) > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24203) Make executor's bindAddress configurable
[ https://issues.apache.org/jira/browse/SPARK-24203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24203: Assignee: Apache Spark > Make executor's bindAddress configurable > > > Key: SPARK-24203 > URL: https://issues.apache.org/jira/browse/SPARK-24203 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Lukas Majercak >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24203) Make executor's bindAddress configurable
Lukas Majercak created SPARK-24203: -- Summary: Make executor's bindAddress configurable Key: SPARK-24203 URL: https://issues.apache.org/jira/browse/SPARK-24203 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.1.1 Reporter: Lukas Majercak -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24202) Separate SQLContext dependencies from SparkSession.implicits
[ https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerard Maas updated SPARK-24202: Description: The current implementation of the implicits in SparkSession passes the current active SQLContext to the SQLImplicits class. This implies that all usage of these (extremely helpful) implicits require the prior creation of a Spark Session instance. Usage is typically done as follows: {code:java} val sparkSession = SparkSession.builder() build() import sparkSession.implicits._ {code} This is OK in user code, but it burdens the creation of library code that uses Spark, where static imports for _Encoder_ support is required. A simple example would be: {code:java} class SparkTransformation[In: Encoder, Out: Encoder] { def transform(ds: Dataset[In]): Dataset[Out] } {code} Attempting to compile such code would result in the following exception: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two utilities to transform _RDD_ and local collections into a _Dataset_. These are 2 methods of the 46 implicit conversions offered by this class. The request is to separate the two implicit methods that depend on the instance creation into a separate class: {code:java} SQLImplicits#214-229 /** * Creates a [[Dataset]] from an RDD. * * @since 1.6.0 */ implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(rdd)) } /** * Creates a [[Dataset]] from a local Seq. 
* @since 1.6.0 */ implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(s)) }{code} By separating the static methods from these two methods that depend on _sqlContext_ into separate classes, we could provide static imports for all the other functionality and only require the instance-bound implicits for the RDD and collection support (Which is an uncommon use case these days) As this is potentially breaking the current interface, this might be a candidate for Spark 3.0. Although there's nothing stopping us from creating a separate hierarchy for the static encoders already. was: The current implementation of the implicits in SparkSession passes the current active SQLContext to the SQLImplicits class. This implies that all usage of these (extremely helpful) implicits require the prior creation of a Spark Session instance. Usage is typically done as follows: {code:java} val sparkSession = SessionBuilderbuild() import sparkSession.implicits._ {code} This is OK in user code, but it burdens the creation of library code that uses Spark, where static imports for _Encoder_ support is required. A simple example would be: {code:java} class SparkTransformation[In: Encoder, Out: Encoder] { def transform(ds: Dataset[In]): Dataset[Out] } {code} Attempting to compile such code would result in the following exception: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two utilities to transform _RDD_ and local collections into a _Dataset_. These are 2 methods of the 46 implicit conversions offered by this class. 
The request is to separate the two implicit methods that depend on the instance creation into a separate class: {code:java} SQLImplicits#214-229 /** * Creates a [[Dataset]] from an RDD. * * @since 1.6.0 */ implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(rdd)) } /** * Creates a [[Dataset]] from a local Seq. * @since 1.6.0 */ implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(s)) }{code} By separating the static methods from these two methods that depend on _sqlContext_ into separate classes, we could provide static imports for all the other functionality and only require the instance-bound implicits for the RDD and collection support (Which is an uncommon use case these days) As this is potentially breaking the current interface, this might be a candidate for Spark 3.0. Although there's nothing stopping us from creating a separate hierarchy for the static encoders already. > Separate SQLContext dependencies from SparkSession.implicits > > > Key: SPARK-24202 > URL: https://issues.apache.o
[jira] [Updated] (SPARK-24202) Separate SQLContext dependencies from SparkSession.implicits
[ https://issues.apache.org/jira/browse/SPARK-24202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerard Maas updated SPARK-24202: Description: The current implementation of the implicits in SparkSession passes the current active SQLContext to the SQLImplicits class. This implies that all usage of these (extremely helpful) implicits requires the prior creation of a SparkSession instance. Usage is typically done as follows: {code:java} val sparkSession = SparkSession.builder().getOrCreate() import sparkSession.implicits._ {code} This is OK in user code, but it burdens the creation of library code that uses Spark, where static imports for _Encoder_ support are required. A simple example would be: {code:java} class SparkTransformation[In: Encoder, Out: Encoder] { def transform(ds: Dataset[In]): Dataset[Out] } {code} Attempting to compile such code results in the following error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two utilities that transform an _RDD_ or a local collection into a _Dataset_. These are 2 of the 46 implicit conversions offered by this class. The request is to separate the two implicit methods that depend on the instance creation into a separate class: {code:java} SQLImplicits#214-229 /** * Creates a [[Dataset]] from an RDD. * * @since 1.6.0 */ implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(rdd)) } /** * Creates a [[Dataset]] from a local Seq. 
* @since 1.6.0 */ implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(s)) }{code} By separating the static methods from these two methods that depend on _sqlContext_ into separate classes, we could provide static imports for all the other functionality and only require the instance-bound implicits for the RDD and collection support (Which is an uncommon use case these days) As this is potentially breaking the current interface, this might be a candidate for Spark 3.0. Although there's nothing stopping us from creating a separate hierarchy for the static encoders already. was: The current implementation of the implicits in SparkSession passes the current active SQLContext to the SQLImplicits class. This implies that all usage of these (extremely helpful) implicits require the prior creation of a Spark Session instance. Usage is typically done as follows: {code:java} val sparkSession = SparkSession.builder() build() import sparkSession.implicits._ {code} This is OK in user code, but it burdens the creation of library code that uses Spark, where static imports for _Encoder_ support is required. A simple example would be: {code:java} class SparkTransformation[In: Encoder, Out: Encoder] { def transform(ds: Dataset[In]): Dataset[Out] } {code} Attempting to compile such code would result in the following exception: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two utilities to transform _RDD_ and local collections into a _Dataset_. These are 2 methods of the 46 implicit conversions offered by this class. 
The request is to separate the two implicit methods that depend on the instance creation into a separate class: {code:java} SQLImplicits#214-229 /** * Creates a [[Dataset]] from an RDD. * * @since 1.6.0 */ implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(rdd)) } /** * Creates a [[Dataset]] from a local Seq. * @since 1.6.0 */ implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(s)) }{code} By separating the static methods from these two methods that depend on _sqlContext_ into separate classes, we could provide static imports for all the other functionality and only require the instance-bound implicits for the RDD and collection support (Which is an uncommon use case these days) As this is potentially breaking the current interface, this might be a candidate for Spark 3.0. Although there's nothing stopping us from creating a separate hierarchy for the static encoders already. > Separate SQLContext dependencies from SparkSession.implicits > > > Key: SPARK-24202 > URL: https://
[jira] [Created] (SPARK-24202) Separate SQLContext dependencies from SparkSession.implicits
Gerard Maas created SPARK-24202: --- Summary: Separate SQLContext dependencies from SparkSession.implicits Key: SPARK-24202 URL: https://issues.apache.org/jira/browse/SPARK-24202 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Gerard Maas The current implementation of the implicits in SparkSession passes the current active SQLContext to the SQLImplicits class. This implies that all usage of these (extremely helpful) implicits requires the prior creation of a SparkSession instance. Usage is typically done as follows: {code:java} val sparkSession = SparkSession.builder().getOrCreate() import sparkSession.implicits._ {code} This is OK in user code, but it burdens the creation of library code that uses Spark, where static imports for _Encoder_ support are required. A simple example would be: {code:java} class SparkTransformation[In: Encoder, Out: Encoder] { def transform(ds: Dataset[In]): Dataset[Out] } {code} Attempting to compile such code results in the following error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. The usage of the _SQLContext_ instance in _SQLImplicits_ is limited to two utilities that transform an _RDD_ or a local collection into a _Dataset_. These are 2 of the 46 implicit conversions offered by this class. The request is to separate the two implicit methods that depend on the instance creation into a separate class: {code:java} SQLImplicits#214-229 /** * Creates a [[Dataset]] from an RDD. * * @since 1.6.0 */ implicit def rddToDatasetHolder[T : Encoder](rdd: RDD[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(rdd)) } /** * Creates a [[Dataset]] from a local Seq. 
* @since 1.6.0 */ implicit def localSeqToDatasetHolder[T : Encoder](s: Seq[T]): DatasetHolder[T] = { DatasetHolder(_sqlContext.createDataset(s)) }{code} By separating the static methods from these two methods that depend on _sqlContext_ into separate classes, we could provide static imports for all the other functionality and only require the instance-bound implicits for the RDD and collection support (Which is an uncommon use case these days) As this is potentially breaking the current interface, this might be a candidate for Spark 3.0. Although there's nothing stopping us from creating a separate hierarchy for the static encoders already. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
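The split the ticket proposes can be sketched as follows. This is a minimal, hypothetical illustration only — the object name `StaticImplicits` is not part of Spark, and only two of the many session-independent conversions are shown; `Encoders.scalaInt` and `Encoders.STRING` are existing static factories in `org.apache.spark.sql.Encoders`:

```scala
// Hypothetical sketch: hoist the session-independent implicits into a
// static object so library code can import them without a SparkSession.
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

object StaticImplicits { // illustrative name, not part of Spark
  implicit def newIntEncoder: Encoder[Int] = Encoders.scalaInt
  implicit def newStringEncoder: Encoder[String] = Encoders.STRING
  // ... the remaining conversions that never touch _sqlContext
}

// Library code could then compile with no session in scope:
import StaticImplicits._

class SparkTransformation[In: Encoder, Out: Encoder] {
  def transform(ds: Dataset[In]): Dataset[Out] = ???
}
```

Only `rddToDatasetHolder` and `localSeqToDatasetHolder` would remain on the instance-bound class, since they are the only members that actually call `_sqlContext.createDataset`.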
[jira] [Updated] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18165: - Component/s: (was: DStreams) Structured Streaming > Kinesis support in Structured Streaming > --- > > Key: SPARK-18165 > URL: https://issues.apache.org/jira/browse/SPARK-18165 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Lauren Moos >Priority: Major > > Implement Kinesis based sources and sinks for Structured Streaming -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18165) Kinesis support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-18165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466410#comment-16466410 ] Michael Armbrust commented on SPARK-18165: -- This is great! I'm glad there are more connectors for Structured Streaming! A few high-level thoughts: - The current Source/Sink APIs are internal/unstable. We are working on building public/stable APIs as part of DataSourceV2. Would be great to get feedback on those APIs if this is ported to them - In general as the Spark project scales, we are trying to move more of the connectors out of the core project. I'd suggest looking at contributing this to Apache Bahir and/or Spark Packages. > Kinesis support in Structured Streaming > --- > > Key: SPARK-18165 > URL: https://issues.apache.org/jira/browse/SPARK-18165 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Lauren Moos >Priority: Major > > Implement Kinesis based sources and sinks for Structured Streaming -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24201) IllegalArgumentException originating from ClosureCleaner in Java 9+
Grant Henke created SPARK-24201: --- Summary: IllegalArgumentException originating from ClosureCleaner in Java 9+ Key: SPARK-24201 URL: https://issues.apache.org/jira/browse/SPARK-24201 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Environment: java version "9.0.4" scala version "2.11.12" Reporter: Grant Henke Apache Kudu's kudu-spark tests are failing on Java 9. I assume Java 9 is supported and this is an unexpected bug given the docs say "Spark runs on Java 8+" [here|https://spark.apache.org/docs/2.3.0/]. The stacktrace seen is below: {code} java.lang.IllegalArgumentException at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46) at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449) at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432) at org.apache.xbean.asm5.ClassReader.a(Unknown Source) at org.apache.xbean.asm5.ClassReader.b(Unknown Source) at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) at 
org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262) at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2292) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2066) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2092) at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) at org.apache.spark.rdd.RDD.collect(RDD.scala:938) at org.apache.kudu.spark.kudu.KuduRDDTest$$anonfun$1.apply(KuduRDDTest.scala:30) at org.apache.kudu.spark.kudu.KuduRDDTest$$anonfun$1.apply(KuduRDDTest.scala:27) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) at org.scalatest.TestSuite$class.withFixture(TestSuite.scala:196) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1560) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196) at 
org.apache.kudu.spark.kudu.KuduRDDTest.org$scalatest$BeforeAndAfter$$super$runTest(KuduRDDTest.scala:25) at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:203) at org.apache.kudu.spark.kudu.KuduRDDTest.runTest(KuduRDDTest.scala:25) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) at org.scalatest.SuperEngine$$anonfun$traverseSu
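Judging from the trace, the failure should be reproducible by any job that ships a closure, since ClosureCleaner feeds the closure's class bytes to xbean-asm5, which rejects the Java 9 class-file version. A minimal sketch of an assumed repro — the names and setup here are illustrative, not taken from the Kudu test suite:

```scala
// Assumed minimal repro on Java 9: rdd.map forces ClosureCleaner to parse
// the closure's class file with ASM 5, which cannot read class-file
// version 53 and throws IllegalArgumentException.
import org.apache.spark.sql.SparkSession

object Java9ClosureRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("java9-closure-repro")
      .getOrCreate()
    // The map closure is what triggers ClosureCleaner / ClassReader.<init>.
    val n = spark.sparkContext.parallelize(1 to 10).map(_ * 2).count()
    println(n)
    spark.stop()
  }
}
```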
[jira] [Commented] (SPARK-24176) The hdfs file path with wildcard can not be identified when loading data
[ https://issues.apache.org/jira/browse/SPARK-24176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466263#comment-16466263 ] kevin yu commented on SPARK-24176: -- I am looking at this one, will provide a proposal fix soon. > The hdfs file path with wildcard can not be identified when loading data > > > Key: SPARK-24176 > URL: https://issues.apache.org/jira/browse/SPARK-24176 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE11 > Spark Version:2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Minor > > # Launch spark-sql > # create table wild1 (time timestamp, name string, isright boolean, > datetoday date, num binary, height double, score float, decimaler > decimal(10,0), id tinyint, age int, license bigint, length smallint) row > format delimited fields terminated by ',' stored as textfile; > # loaded data in table as below and it failed some cases not consistent > # load data inpath '/user/testdemo1/user1/?ype* ' into table wild1; - Success > load data inpath '/user/testdemo1/user1/t??eddata60.txt' into table wild1; - > *Failed* > load data inpath '/user/testdemo1/user1/?ypeddata60.txt' into table wild1; - > Success > Exception as below > > load data inpath '/user/testdemo1/user1/t??eddata61.txt' into table wild1; > 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_database: one > 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com > ip=unknown-ip-addr cmd=get_database: one > 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1 > 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com > ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1 > 2018-05-04 13:16:25 INFO HiveMetaStore:746 - 0: get_table : db=one tbl=wild1 > 2018-05-04 13:16:25 INFO audit:371 - ugi=spark/had...@hadoop.com > ip=unknown-ip-addr cmd=get_table : db=one tbl=wild1 > *Error in query: LOAD DATA input path does not exist: > 
/user/testdemo1/user1/t??eddata61.txt;* > spark-sql> > Behavior is not consistent. Need to fix with all combination of wild card > char as it is not consistent. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
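One way to narrow down whether the `?`/`*` pattern itself is at fault is to ask Hadoop directly what the glob matches, since `FileSystem.globStatus` uses the same wildcard syntax. A small diagnostic sketch, assuming the files sit on the cluster's default filesystem (path taken from the report above):

```scala
// Diagnostic sketch: print what a glob actually resolves to on HDFS.
// If globStatus matches the file but LOAD DATA still reports "input path
// does not exist", the bug is in Spark's path handling, not the pattern.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object GlobCheck {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    // globStatus may return null when nothing matches, so guard for it.
    val matches = fs.globStatus(new Path("/user/testdemo1/user1/t??eddata60.txt"))
    Option(matches).getOrElse(Array.empty).foreach(s => println(s.getPath))
  }
}
```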
[jira] [Commented] (SPARK-23529) Specify hostpath volume and mount the volume in Spark driver and executor pods in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466203#comment-16466203 ] Apache Spark commented on SPARK-23529: -- User 'andrusha' has created a pull request for this issue: https://github.com/apache/spark/pull/21260 > Specify hostpath volume and mount the volume in Spark driver and executor > pods in Kubernetes > > > Key: SPARK-23529 > URL: https://issues.apache.org/jira/browse/SPARK-23529 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Suman Somasundar >Assignee: Anirudh Ramanathan >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24112) Add `spark.sql.hive.convertMetastoreTableProperty` for backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-24112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466189#comment-16466189 ] Apache Spark commented on SPARK-24112: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21259 > Add `spark.sql.hive.convertMetastoreTableProperty` for backward compatibility > -- > > Key: SPARK-24112 > URL: https://issues.apache.org/jira/browse/SPARK-24112 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > > This issue aims not to surprise previous Parquet Hive table users with > behavior changes. They had Hive Parquet tables, and all of them have been > converted by default without table properties since Spark 2.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466099#comment-16466099 ] Paul Wu commented on SPARK-22371: - Got the same problem with 2.3 and also the program stalled: {{ Uncaught exception in thread heartbeat-receiver-event-loop-thread}} {{java.lang.IllegalStateException: Attempted to access garbage collected accumulator 8825}} {{ at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:265)}} {{ at org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:261)}} {{ at scala.Option.map(Option.scala:146)}} {{ at org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:261)}} {{ at org.apache.spark.util.AccumulatorV2$$anonfun$name$1.apply(AccumulatorV2.scala:87)}} {{ at org.apache.spark.util.AccumulatorV2$$anonfun$name$1.apply(AccumulatorV2.scala:87)}} {{ at scala.Option.orElse(Option.scala:289)}} {{ at org.apache.spark.util.AccumulatorV2.name(AccumulatorV2.scala:87)}} {{ at org.apache.spark.util.AccumulatorV2.toInfo(AccumulatorV2.scala:108)}} > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal >Priority: Major > Attachments: Helper.scala, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt, sampledata > > > Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > from our investigation it look like accumulator is cleaned by GC first and > same accumulator is used for merging the results from executor on task > completion event. 
> As the error java.lang.IllegalAccessError is LinkageError which is treated as > FatalError so dag-scheduler loop is finished with below exception. > ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
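Since `AccumulatorContext` holds accumulators only weakly, a commonly suggested mitigation while a fix lands is to keep strong driver-side references to every accumulator for the lifetime of the job, so the GC cannot collect one that a late task-completion event still needs. A hedged sketch — the holder object below is hypothetical, not a Spark API:

```scala
// Mitigation sketch (assumption): retaining a strong reference on the
// driver prevents the accumulator from being garbage collected while
// task-completion events may still merge into it.
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.LongAccumulator

object AccumulatorHolder { // hypothetical helper, not part of Spark
  private val retained = ArrayBuffer.empty[LongAccumulator]

  def register(spark: SparkSession, name: String): LongAccumulator = {
    val acc = spark.sparkContext.longAccumulator(name)
    retained += acc // strong reference kept for the job's lifetime
    acc
  }
}
```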
[jira] [Updated] (SPARK-23161) Add missing APIs to Python GBTClassifier
[ https://issues.apache.org/jira/browse/SPARK-23161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-23161: - Description: GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved to {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. GBTClassificationModel is missing {{numClasses}}. It should inherit from {{JavaClassificationModel}} instead of prediction model which will give it this param. was: GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. GBTClassificationModel is missing {{numClasses}}. It should inherit from {{JavaClassificationModel}} instead of prediction model which will give it this param. > Add missing APIs to Python GBTClassifier > > > Key: SPARK-23161 > URL: https://issues.apache.org/jira/browse/SPARK-23161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bryan Cutler >Priority: Minor > Labels: starter > > GBTClassifier is missing \{{featureSubsetStrategy}}. This should be moved to > {{TreeEnsembleParams}}, as in Scala, and it will then be part of GBTs. > GBTClassificationModel is missing {{numClasses}}. It should inherit from > {{JavaClassificationModel}} instead of prediction model which will give it > this param. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466088#comment-16466088 ] Darek edited comment on SPARK-18673 at 5/7/18 4:09 PM: --- [PR20819|https://github.com/apache/spark/pull/20819] for Spark => Hive 2.x was done but not merged and deleted. was (Author: bidek): PR20819 for Spark => Hive 2.x was done but not merged and deleted. > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466088#comment-16466088 ] Darek commented on SPARK-18673: --- PR20819 for Spark => Hive 2.x was done but not merged and deleted. > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23458) Flaky test: OrcQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-23458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466078#comment-16466078 ] Xiao Li commented on SPARK-23458: - Yeah. [~dongjoon] Please investigate why they still fail. After your fix, I still found HiveExternalCatalogVersionsSuite never pass in this test branch. Do you know the reason? https://spark-tests.appspot.com/jobs/spark-master-test-sbt-hadoop-2.7 > Flaky test: OrcQuerySuite > -- > > Key: SPARK-23458 > URL: https://issues.apache.org/jira/browse/SPARK-23458 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 > Environment: AMPLab Jenkins >Reporter: Marco Gaido >Priority: Major > > Sometimes we have UT failures with the following stacktrace: > {code:java} > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 15 times over > 10.01396221801 seconds. Last failure message: There are 1 possibly leaked > file streams.. 
> at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) > at > org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308) > at > org.apache.spark.sql.execution.datasources.orc.OrcTest.eventually(OrcTest.scala:45) > at > org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114) > at > org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.afterEach(OrcQuerySuite.scala:583) > at > org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379) > at > org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375) > at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454) > at org.scalatest.Status$class.withAfterEffect(Status.scala:375) > at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426) > at > org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232) > at > org.apache.spark.sql.execution.datasources.orc.OrcQuerySuite.runTest(OrcQuerySuite.scala:583) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384) > at scala.collection.immutable.List.foreach(List.scala:381) > at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229) > at 
org.scalatest.FunSuite.runTests(FunSuite.scala:1560) > at org.scalatest.Suite$class.run(Suite.scala:1147) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233) > at org.scalatest.SuperEngine.runImpl(Engine.scala:521) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233) > at > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52) > at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: sbt.ForkMain$ForkError: java.lang.IllegalStateException: There are > 1 possibly leaked file streams. > at > org.apach
[jira] [Resolved] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table
[ https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24170. - Resolution: Not A Bug > [Spark SQL] json file format is not dropped after dropping table > > > Key: SPARK-24170 > URL: https://issues.apache.org/jira/browse/SPARK-24170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > # Launch spark-sql --master yarn > # create table json(name STRING, age int, gender string, id INT) using > org.apache.spark.sql.json options(path "hdfs:///user/testdemo/"); > # Execute the below SQL queries > INSERT into json > SELECT 'Shaan',21,'Male',1 > UNION ALL > SELECT 'Xing',20,'Female',11 > UNION ALL > SELECT 'Mile',4,'Female',20 > UNION ALL > SELECT 'Malan',10,'Male',9; > Below 4 json file format created > BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls > /user/testdemo > Found 14 items > -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS > -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv > -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 > /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 > /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 > /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:44 > /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 
49 2018-04-26 17:32 > /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > > Issue is: > Now executed below drop command > spark-sql> drop table json; > > Table dropped successfully but json file still present in the path > /user/testdemo -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24170) [Spark SQL] json file format is not dropped after dropping table
[ https://issues.apache.org/jira/browse/SPARK-24170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466071#comment-16466071 ] Xiao Li commented on SPARK-24170: - They are external tables when you specify the path in CREATE TABLE. Thus, the files will not be dropped. > [Spark SQL] json file format is not dropped after dropping table > > > Key: SPARK-24170 > URL: https://issues.apache.org/jira/browse/SPARK-24170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: OS: SUSE 11 > Spark Version: 2.3 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > Steps: > # Launch spark-sql --master yarn > # create table json(name STRING, age int, gender string, id INT) using > org.apache.spark.sql.json options(path "hdfs:///user/testdemo/"); > # Execute the below SQL queries > INSERT into json > SELECT 'Shaan',21,'Male',1 > UNION ALL > SELECT 'Xing',20,'Female',11 > UNION ALL > SELECT 'Mile',4,'Female',20 > UNION ALL > SELECT 'Malan',10,'Male',9; > Below 4 json file format created > BLR123111:/opt/Antsecure/install/hadoop/namenode/bin # ./hdfs dfs -ls > /user/testdemo > Found 14 items > -rw-r--r-- 3 spark hadoop 0 2018-04-26 17:44 /user/testdemo/_SUCCESS > -rw-r--r-- 3 spark hadoop 4802 2018-04-24 18:20 /user/testdemo/customer1.csv > -rw-r--r-- 3 spark hadoop 92 2018-04-26 17:02 /user/testdemo/json1.txt > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-0-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-0-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:32 > /user/testdemo/part-1-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 51 2018-04-26 17:44 > /user/testdemo/part-1-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 50 2018-04-26 17:32 > /user/testdemo/part-2-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 
spark hadoop 50 2018-04-26 17:44 > /user/testdemo/part-2-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:32 > /user/testdemo/part-3-4311f66b-ba1b-4a4d-a289-1a211f27f653-c000.json > -rw-r--r-- 3 spark hadoop 49 2018-04-26 17:44 > /user/testdemo/part-3-b8a8e16a-91a8-48ec-9998-2d741c52cf5a-c000.json > > Issue is: > Now executed below drop command > spark-sql> drop table json; > > Table dropped successfully but json file still present in the path > /user/testdemo -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
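As the comment above explains, specifying a path in CREATE TABLE makes the table external, so DROP TABLE removes only the metastore entry. A minimal SQL sketch of the distinction (table names and paths here are illustrative, not taken from the report):

```sql
-- Managed table: no explicit path, so Spark owns the data directory.
-- DROP TABLE removes both the metadata and the files.
CREATE TABLE managed_json (name STRING, age INT, gender STRING, id INT)
USING org.apache.spark.sql.json;

-- External table: the explicit path marks the data as externally owned.
-- DROP TABLE removes only the metastore entry; the JSON files stay in HDFS.
CREATE TABLE external_json (name STRING, age INT, gender STRING, id INT)
USING org.apache.spark.sql.json
OPTIONS (path 'hdfs:///user/testdemo/');

DROP TABLE external_json;  -- files under /user/testdemo/ are left in place
```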
[jira] [Resolved] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions
[ https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-24043. --- Resolution: Fixed Assignee: Bruce Robbins Fix Version/s: 2.4.0 > InterpretedPredicate.eval fails if expression tree contains Nondeterministic > expressions > > > Key: SPARK-24043 > URL: https://issues.apache.org/jira/browse/SPARK-24043 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Minor > Fix For: 2.4.0 > > > When whole-stage codegen and predicate codegen both fail, FilterExec falls > back to using InterpretedPredicate. If the predicate's expression contains > any non-deterministic expressions, the evaluation throws an error: > {noformat} > scala> val df = Seq((1)).toDF("a") > df: org.apache.spark.sql.DataFrame = [a: int] > scala> df.filter('a > 0).show // this works fine > 2018-04-21 20:39:26 WARN FilterExec:66 - Codegen disabled for this > expression: > (value#1 > 0) > +---+ > | a| > +---+ > | 1| > +---+ > scala> df.filter('a > rand(7)).show // this will throw an error > 2018-04-21 20:39:40 WARN FilterExec:66 - Codegen disabled for this > expression: > (cast(value#1 as double) > rand(7)) > 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.IllegalArgumentException: requirement failed: Nondeterministic > expression org.apache.spark.sql.catalyst.expressions.Rand should be > initialized before eval. > at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326) > at > org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34) > {noformat} > This is because no code initializes the Nondeterministic expressions before > eval is called on them. 
> This is a low impact issue, since it would require both whole-stage codegen > and predicate codegen to fail before FilterExec would fall back to using > InterpretedPredicate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
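The contract the bug violates is that a `Nondeterministic` expression must have its per-partition state initialized before `eval` is called, which the interpreted fallback path was skipping. A small Python analogy of that contract (class and method names mirror Spark's Scala API for readability, but this is an illustration, not Spark code):

```python
import random

class NondeterministicExpr:
    """Analogy of Spark's Nondeterministic trait: eval() is only legal
    after initialize(partition_index) has seeded per-partition state."""
    def __init__(self, seed):
        self.seed = seed
        self.rng = None  # unusable until initialized

    def initialize(self, partition_index):
        # Seed deterministically per (seed, partition) so a task retry
        # reproduces the same sequence.
        self.rng = random.Random(self.seed * 1000003 + partition_index)

    def eval(self, row=None):
        if self.rng is None:
            raise RuntimeError(
                "Nondeterministic expression should be initialized before eval.")
        return self.rng.random()

def interpreted_predicate(expr, partition_index, rows):
    # The shape of the fix, in miniature: initialize nondeterministic
    # expressions in the tree before evaluating any row.
    expr.initialize(partition_index)
    return [r for r in rows if r > expr.eval()]
```

Calling `eval` without `initialize` raises, matching the `requirement failed` error in the stack trace above; routing evaluation through the predicate wrapper avoids it.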
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465950#comment-16465950 ] Steve Loughran commented on SPARK-18673: Good Q, [~Bidek]. That SPARK-23807 POM fixes up the build, but without the mutant org.spark-project.hive JAR fixed up to not throw an exception whenever Hadoop version == 3, you can't run the code, including tests. I do have such a fixed-up JAR; what I'm proposing here is cherry picking in the least amount of change needed there. This work is part of the overall "spark on Hadoop 3.x". Oh and yes, I'm targeting 3.1+ too, though the key issue here is the "3", not the suffix. What would supersede this is Spark => Hive 2.x. This is an interim artifact until that is done by someone. > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465934#comment-16465934 ] Darek edited comment on SPARK-18673 at 5/7/18 1:59 PM: --- Based on the recent PR, the community is moving toward Hadoop 3.1, why do you even bother with this ticket? Check the recent PR like SPARK-23807 was (Author: bidek): Based on the recent PR, the community is moving toward Hadoop 3.1, why do you event bother with this ticket? Check the recent PR like SPARK-23807 > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465934#comment-16465934 ] Darek commented on SPARK-18673: --- Based on the recent PR, the community is moving toward Hadoop 3.1, why do you event bother with this ticket? Check the recent PR like SPARK-23807 > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
[ https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465925#comment-16465925 ] Steve Loughran commented on SPARK-18673: Josh Rosen added some changes, particularly: * 8f5918ad3dc7f3aa84ea04f3ef7761493c009d22 Update version to 1.2.1.spark2 * 10d91dca6c602a9f6c6fa428f341f135054c2c16 Re-shade Kryo * 721aa7e4904a8a6069afe815af7cbf5ed3bde936 Change groupId to org.spark-project.hive; keep relocated Kryo under Hive namespace. * aa9f5557b60facfe862f1f6c0a60537da8e88076 Put shaded protobuf classes under Hive package namespace. Int-HDP patches/changes that I also plan to pull in, on the basis that (a) they were clearly deemed important and (b) they apparently work: * HIVE-11102 ReaderImpl: getColumnIndicesFromNames does not work for some cases * allow the repo for publishing artifacts to be reconfigured from the normal sonatype one * updating the group assembly plugin to use the same package names as from 721aa7e4 > Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version > -- > > Key: SPARK-18673 > URL: https://issues.apache.org/jira/browse/SPARK-18673 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 > Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT >Reporter: Steve Loughran >Priority: Major > > Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader > considers 3.x to be an unknown Hadoop version. > Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it > will need to be updated to match. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465917#comment-16465917 ] Steve Loughran commented on SPARK-23977: It will need the hadoop-aws module and dependencies, as that is where the core code is. This patch just does the binding to the InsertIntoHadoopFS relation (move to Hadoop MRv2 FileOutputFormat & expect the new superclass, PathOutputCommitter, rather than always a FileOutputCommitter), and for Parquet, something similar with a ParquetOutputCommitter. It's only in Hadoop 3.1, though you can backport to branch-2, especially if you are prepared to bump up the minimum java version to 8 in that branch. It should work on k8s, given it works standalone. All it needs is an endpoint supporting the multipart upload operation of S3, which includes some non-AWS object stores. Look @ the HADOOP-13786 work and the paper [a zero rename committer|https://github.com/steveloughran/zero-rename-committer/releases/download/tag_draft_003/a_zero_rename_committer.pdf]. And there are some integration tests downstream in https://github.com/hortonworks-spark/cloud-integration . I can help set you up to run those, if you email me directly. Essentially: you need to choose which stores to test against from s3, openstack, azure, and configure them. Note that of the two variant committers, "staging" and "magic", the magic one needs a consistent S3 endpoint, which you only get on AWS S3 with an external service, usually DynamoDB based (S3mper, EMR consistent S3, S3Guard). The staging one needs enough local HDD to buffer the output of all active tasks, but doesn't need that consistency for its own query. 
You will need a plan for chaining together work though, which is inevitably one of "consistency layer" or "wait long enough between writer and reader that you expect the metadata to be consistent". Finally, if you are using Spark to write directly to S3 today, without any consistency layer, then your commit algorithm had better not be mimicking directory rename by list + copy + delete. You need this code for safe as well as performant committing of work to S3. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Priority: Minor > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit; a failure of a task at any stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation that "directory renames are atomic" doesn't hold. > * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that the FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t. observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. 
> The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any others, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
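With the binding classes described above in place, selecting an S3A committer becomes a configuration exercise. A hedged sketch of the kind of `spark-defaults.conf` settings involved — the property and class names follow the Hadoop 3.1 S3A committer and Spark cloud-integration documentation, but verify them against the exact versions you build:

```properties
# Choose the "staging" committer (buffers task output on local disk,
# uploads via S3 multipart upload, completes the uploads in job commit).
spark.hadoop.fs.s3a.committer.name                          staging
# Make the MRv2 factory hand out S3A committers for s3a:// destinations.
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a   org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
# Spark-side bindings: commit protocol plus the Parquet shim described above.
spark.sql.sources.commitProtocolClass                       org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class                    org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

Per the comment above, switching `fs.s3a.committer.name` to `magic` trades the local-disk buffering requirement for a consistent-listing requirement on the S3 endpoint.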
[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar updated SPARK-24200: -- Description: String folder = "/Users/test/data/* /* "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? I know you might have thought about this, but i am wondering why this has not been implemented ? was: String folder = "/Users/test/data/* /* "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? 
> Read subdirectories with out asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/* /* "; > sparkContext.textFile(folder, 1).toJavaRDD() > Is asterisks mandatory to read a folder -Yes, otherwise it does not read > files under subdirectories. > What if I get a folder which is having more subdirectories than the number of > asterisks mentioned ? How to handle this scenario ? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2)How to make this expression as *generic* ? It should still work if I get a > folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} > My input folder structure is not same all the time. > Is there anything exists in Spark to handle this kind of scenario ? I know > you might have thought about this, but i am wondering why this has not been > implemented ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar updated SPARK-24200: -- Description: String folder = "/Users/test/data/ */* "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? was: String folder = "/Users/test/data/ ** /** "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? > Read subdirectories with out asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/ */* "; > sparkContext.textFile(folder, 1).toJavaRDD() > Is asterisks mandatory to read a folder -Yes, otherwise it does not read > files under subdirectories. 
> What if I get a folder which is having more subdirectories than the number of > asterisks mentioned ? How to handle this scenario ? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2)How to make this expression as *generic* ? It should still work if I get a > folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} > My input folder structure is not same all the time. > Is there anything exists in Spark to handle this kind of scenario ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar updated SPARK-24200: -- Description: String folder = "/Users/test/data/* /* "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? was: String folder = "/Users/test/data/ */* "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? > Read subdirectories with out asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/* /* "; > sparkContext.textFile(folder, 1).toJavaRDD() > Is asterisks mandatory to read a folder -Yes, otherwise it does not read > files under subdirectories. 
> What if I get a folder which is having more subdirectories than the number of > asterisks mentioned ? How to handle this scenario ? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2)How to make this expression as *generic* ? It should still work if I get a > folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} > My input folder structure is not same all the time. > Is there anything exists in Spark to handle this kind of scenario ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar updated SPARK-24200: -- Description: String folder = "/Users/test/data/ ** /** "; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? was: String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? > Read subdirectories with out asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/ ** /** "; > sparkContext.textFile(folder, 1).toJavaRDD() > Is asterisks mandatory to read a folder -Yes, otherwise it does not read > files under subdirectories. 
> What if I get a folder which is having more subdirectories than the number of > asterisks mentioned ? How to handle this scenario ? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2)How to make this expression as *generic* ? It should still work if I get a > folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} > My input folder structure is not same all the time. > Is there anything exists in Spark to handle this kind of scenario ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar updated SPARK-24200: -- Description: String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? was: String folder = "/Users/test/data/"; sparkContext.textFile(folder, 1).toJavaRDD() Is asterisks mandatory to read a folder -Yes, otherwise it does not read files under subdirectories. What if I get a folder which is having more subdirectories than the number of asterisks mentioned ? How to handle this scenario ? For example: 1) {{/Users/test/data/}} This would work ONLY if I get data as /Users/test/data/folder1/file.txt 2)How to make this expression as *generic* ? It should still work if I get a folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} My input folder structure is not same all the time. Is there anything exists in Spark to handle this kind of scenario ? > Read subdirectories with out asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/*/*"; > sparkContext.textFile(folder, 1).toJavaRDD() > Is asterisks mandatory to read a folder -Yes, otherwise it does not read > files under subdirectories. 
> What if I get a folder which is having more subdirectories than the number of > asterisks mentioned ? How to handle this scenario ? > For example: > 1) {{/Users/test/data/}} This would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2)How to make this expression as *generic* ? It should still work if I get a > folder as: {{/Users/test/data/folder1/folder2/folder3/folder4}} > My input folder structure is not same all the time. > Is there anything exists in Spark to handle this kind of scenario ? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24200) Read subdirectories with out asterisks
[ https://issues.apache.org/jira/browse/SPARK-24200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kumar updated SPARK-24200: -- Description: String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 1).toJavaRDD() Are asterisks mandatory to read a folder? Yes, otherwise files under subdirectories are not read. What if I get a folder which has more subdirectories than the number of asterisks mentioned? How to handle this scenario? For example: 1) {{/Users/test/data/*/*}} would work ONLY if I get data as /Users/test/data/folder1/file.txt 2) How to make this expression *generic*? It should still work if I get a folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}}. My input folder structure is not the same all the time. Does anything exist in Spark to handle this kind of scenario? was: {{String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 1).toJavaRDD() }} Are asterisks mandatory to read a folder? Yes, otherwise files under subdirectories are not read. What if I get a folder which has more subdirectories than the number of asterisks mentioned? How to handle this scenario? For example: 1) {{/Users/test/data/*/*}} would work ONLY if I get data as /Users/test/data/folder1/file.txt 2) How to make this expression *generic*? It should still work if I get a folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}}. My input folder structure is not the same all the time. Does anything exist in Spark to handle this kind of scenario? > Read subdirectories without asterisks > -- > > Key: SPARK-24200 > URL: https://issues.apache.org/jira/browse/SPARK-24200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: kumar >Priority: Major > > String folder = "/Users/test/data/*/*"; > sparkContext.textFile(folder, 1).toJavaRDD() > Are asterisks mandatory to read a folder? Yes, otherwise files under > subdirectories are not read.
> What if I get a folder which has more subdirectories than the number of > asterisks mentioned? How to handle this scenario? > For example: > 1) {{/Users/test/data/*/*}} would work ONLY if I get data as > /Users/test/data/folder1/file.txt > 2) How to make this expression *generic*? It should still work if I get a > folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}}. > My input folder structure is not the same all the time. > Does anything exist in Spark to handle this kind of scenario? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24200) Read subdirectories without asterisks
kumar created SPARK-24200: - Summary: Read subdirectories without asterisks Key: SPARK-24200 URL: https://issues.apache.org/jira/browse/SPARK-24200 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: kumar {{String folder = "/Users/test/data/*/*"; sparkContext.textFile(folder, 1).toJavaRDD() }} Are asterisks mandatory to read a folder? Yes, otherwise files under subdirectories are not read. What if I get a folder which has more subdirectories than the number of asterisks mentioned? How to handle this scenario? For example: 1) {{/Users/test/data/*/*}} would work ONLY if I get data as /Users/test/data/folder1/file.txt 2) How to make this expression *generic*? It should still work if I get a folder such as {{/Users/test/data/folder1/folder2/folder3/folder4}}. My input folder structure is not the same all the time. Does anything exist in Spark to handle this kind of scenario? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
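[Editorial note: the fixed-depth behaviour described above is ordinary glob semantics rather than anything Spark-specific: a "*/*" pattern matches exactly two path components, while a recursive "**" pattern matches any depth. A minimal Python sketch of the difference, using a hypothetical temporary tree mirroring the report:]

```python
import glob
import os
import tempfile

# Build a small tree: one shallow file and one deeply nested file.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "folder1", "folder2", "folder3", "folder4"))
open(os.path.join(root, "folder1", "file.txt"), "w").close()
open(os.path.join(root, "folder1", "folder2", "folder3", "folder4", "deep.txt"), "w").close()

# "*/*.txt" matches exactly two path components, so deep.txt is invisible to it.
shallow = glob.glob(os.path.join(root, "*", "*.txt"))

# "**" with recursive=True matches any number of components.
recursive = glob.glob(os.path.join(root, "**", "*.txt"), recursive=True)

print(len(shallow))    # 1: only folder1/file.txt
print(len(recursive))  # 2: both files
```

On the Spark side, enabling recursive directory reading through the Hadoop input configuration (commonly cited as `mapreduce.input.fileinputformat.input.dir.recursive=true`) is the usual workaround; whether it applies depends on the input format used, so treat that as a hint rather than a guarantee.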
[jira] [Assigned] (SPARK-23933) High-order function: map(array, array) → map
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23933: Assignee: (was: Apache Spark) > High-order function: map(array, array) → map > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465746#comment-16465746 ] Apache Spark commented on SPARK-23933: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/21258 > High-order function: map(array, array) → map > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
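[Editorial note: the Presto semantics quoted above (the i-th key is paired with the i-th value) can be mirrored outside SQL; a minimal Python sketch of the same positional pairing:]

```python
def map_from_arrays(keys, values):
    # Pair keys and values positionally, like Presto's map(ARRAY[...], ARRAY[...]).
    if len(keys) != len(values):
        raise ValueError("key and value arrays must have the same length")
    return dict(zip(keys, values))

print(map_from_arrays([1, 3], [2, 4]))  # {1: 2, 3: 4}
```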
[jira] [Assigned] (SPARK-23933) High-order function: map(array, array) → map
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23933: Assignee: Apache Spark > High-order function: map(array, array) → map > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from
[ https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24194: Assignee: Apache Spark > HadoopFsRelation cannot overwrite a path that is also being read from > - > > Key: SPARK-24194 > URL: https://issues.apache.org/jira/browse/SPARK-24194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master >Reporter: yangz >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > When running > {code:java} > INSERT OVERWRITE TABLE territory_count_compare select * from > territory_count_compare where shop_count!=real_shop_count > {code} > and territory_count_compare is a Parquet table, there is an error: > Cannot overwrite a path that is also being read from. > > The file MetastoreDataSourceSuite.scala also has a test case: > > {code:java} > table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName) > {code} > > But when the table territory_count_compare is a common Hive table, there is > no error. > So I think the reason is that when inserting overwrite into a HadoopFs relation with > a static partition, Spark first deletes the partition in the output, but the delete should > happen when the job is committed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from
[ https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24194: Assignee: (was: Apache Spark) > HadoopFsRelation cannot overwrite a path that is also being read from > - > > Key: SPARK-24194 > URL: https://issues.apache.org/jira/browse/SPARK-24194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master >Reporter: yangz >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > When running > {code:java} > INSERT OVERWRITE TABLE territory_count_compare select * from > territory_count_compare where shop_count!=real_shop_count > {code} > and territory_count_compare is a Parquet table, there is an error: > Cannot overwrite a path that is also being read from. > > The file MetastoreDataSourceSuite.scala also has a test case: > > {code:java} > table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName) > {code} > > But when the table territory_count_compare is a common Hive table, there is > no error. > So I think the reason is that when inserting overwrite into a HadoopFs relation with > a static partition, Spark first deletes the partition in the output, but the delete should > happen when the job is committed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24194) HadoopFsRelation cannot overwrite a path that is also being read from
[ https://issues.apache.org/jira/browse/SPARK-24194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465732#comment-16465732 ] Apache Spark commented on SPARK-24194: -- User 'zheh12' has created a pull request for this issue: https://github.com/apache/spark/pull/21257 > HadoopFsRelation cannot overwrite a path that is also being read from > - > > Key: SPARK-24194 > URL: https://issues.apache.org/jira/browse/SPARK-24194 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 > Environment: spark master >Reporter: yangz >Priority: Major > Labels: pull-request-available > Fix For: 2.4.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > When running > {code:java} > INSERT OVERWRITE TABLE territory_count_compare select * from > territory_count_compare where shop_count!=real_shop_count > {code} > and territory_count_compare is a Parquet table, there is an error: > Cannot overwrite a path that is also being read from. > > The file MetastoreDataSourceSuite.scala also has a test case: > > {code:java} > table(tableName).write.mode(SaveMode.Overwrite).insertInto(tableName) > {code} > > But when the table territory_count_compare is a common Hive table, there is > no error. > So I think the reason is that when inserting overwrite into a HadoopFs relation with > a static partition, Spark first deletes the partition in the output, but the delete should > happen when the job is committed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
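[Editorial note: the commit-timing argument at the end of the report can be sketched outside Spark: finish reading the input, write results to a staging location, and only replace the target at commit time. A simplified single-file Python analogue of that pattern (not Spark's actual committer code; the function name is hypothetical):]

```python
import os
import tempfile

def overwrite_from_self(path, transform):
    # Finish consuming the input before touching the target. Deleting the
    # target up front (as the reporter suspects Spark does for static
    # partitions) would destroy the data still being read.
    with open(path) as src:
        rows = [transform(line) for line in src]  # read side fully drained here
    # Stage the output next to the target, then "commit" with an atomic rename.
    fd, staged = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as out:
        out.writelines(rows)
    os.replace(staged, path)
```

The staging directory plus rename-at-commit is the same idea the Hadoop output committers use; the sketch just compresses it to one file.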
[jira] [Commented] (SPARK-24177) Spark returning inconsistent rows and data in a join query when run using Spark SQL (using SQLContext.sql(...))
[ https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465729#comment-16465729 ] Ajay Monga commented on SPARK-24177: Thanks Marco. We have a few systems running on the latest version of Spark, but the system that behaved erratically is still on 1.6. We are planning to move it to a later version, possibly to 2.2, but I would appreciate it if someone could confirm my understanding. > Spark returning inconsistent rows and data in a join query when run using > Spark SQL (using SQLContext.sql(...)) > --- > > Key: SPARK-24177 > URL: https://issues.apache.org/jira/browse/SPARK-24177 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 > Environment: Production >Reporter: Ajay Monga >Priority: Major > > Spark SQL is returning inconsistent results for a JOIN query. It returns > different rows, and the value of the column on which a simple multiplication > takes place returns different values. > The query is like: > SELECT > second_table.date_value, SUM(XXX * second_table.shift_value) > FROM > ( > SELECT > date_value, SUM(value) as XXX > FROM first_table > WHERE > AND date IN ( '2018-01-01', '2018-01-02' ) > GROUP BY date_value > ) > intermediate LEFT OUTER > JOIN second_table ON second_table.date_value = <shifted 'date_value' from the first table: if it is a Saturday or Sunday then use Monday, else the next valid working date> > AND second_table.date_value IN ( > '2018-01-02', > '2018-01-03' > ) > GROUP BY second_table.date_value > > The suspicion is that the execution of the above query is split into two queries - > one for first_table and the other for second_table - before joining. The > results then get split across partitions, seemingly grouped/distributed by the > join column, which is 'date_value'. The join contains date-shift logic > that fails to join in some cases when it should, primarily for the > date_values at the edge of the partitions distributed across the executors.
> So the execution depends on how the data (or the RDD) of the individual > queries is partitioned in the first place, which is not ideal, as a > normal-looking ANSI-standard SQL query is not behaving consistently. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
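[Editorial note: the date-shift rule described in the join condition is deterministic on its own; a Python sketch, assuming "next valid working date" means only skipping weekends (holiday calendars are out of scope):]

```python
from datetime import date, timedelta

def shift_to_business_day(d):
    # Saturday (weekday 5) and Sunday (6) roll forward to the following
    # Monday; weekdays pass through unchanged.
    if d.weekday() >= 5:
        return d + timedelta(days=7 - d.weekday())
    return d

print(shift_to_business_day(date(2018, 1, 6)))  # 2018-01-08 (Sat -> Mon)
```

Note that a join key computed this way is many-to-one (both weekend days map to the same Monday), which is worth keeping in mind when reasoning about duplicated or missing rows in the join output.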
[jira] [Created] (SPARK-24199) Structured Streaming
shuke created SPARK-24199: - Summary: Structured Streaming Key: SPARK-24199 URL: https://issues.apache.org/jira/browse/SPARK-24199 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.2.0 Reporter: shuke Hey, when I use the where operator to filter data while using Structured Streaming, I run into a problem:

    get_json_object(col("value"), "$.type").cast(DataTypes.StringType).alias("type"),
    get_json_object(col("value"), "$.saleData.type").cast(DataTypes.StringType).alias("saleDataType"),
    get_json_object(col("value"), "$.uid").cast(DataTypes.IntegerType).alias(ROI_SHOP_KEY),
    from_unixtime(get_json_object(col("value"), "$.time").cast(DataTypes.IntegerType), "-MM-dd").alias("event_time"),
    // get_json_object(xjson, '$.balanceData.money')/100 as money
    (get_json_object(col("value"), "$.balanceData.money").cast(DataTypes.DoubleType) / 100).alias(BUSINESS_AMOUNT),
    get_json_object(col("value"), "$.shopData.id").cast(DataTypes.LongType).alias(DARK_ID),
    get_json_object(col("value"), "$.balanceData.out_trade_no").cast(DataTypes.StringType).alias(OUT_TRADE_NO),
    get_json_object(col("value"), "$.balanceData.type").cast(DataTypes.StringType).alias(BalanceData_type))
    .where("(type = 'residue' and saleDataType = '4' and shop_id in ('8610022','5382783')) or type = 'promotion' ")
    .select(col("*"))
    .writeStream
    .trigger(Trigger.ProcessingTime(5000))
    .outputMode("Update")
    .format("console")
    .start()

I find that it does not work when filtering data this way. Can anyone help? Best wishes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
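[Editorial note: the where() predicate above can be checked in isolation. Below is a Python sketch of the same predicate applied to raw JSON strings; the field paths and the assumption that shop_id comes from $.uid are taken from the snippet and are guesses about the reporter's data. The key observation: a missing JSON path yields None, which, like SQL NULL, makes the whole "residue" conjunct false, a common reason such filters appear not to work.]

```python
import json

def keep(raw):
    # Mirror of the report's where() predicate. Missing JSON paths come back
    # as None (analogous to SQL NULL), so the "residue" branch fails for any
    # record that lacks saleData.type or uid.
    d = json.loads(raw)
    t = d.get("type")
    sale_type = (d.get("saleData") or {}).get("type")
    shop_id = str(d.get("uid")) if d.get("uid") is not None else None
    residue = (t == "residue" and sale_type == "4"
               and shop_id in ("8610022", "5382783"))
    return residue or t == "promotion"
```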
[jira] [Resolved] (SPARK-16406) Reference resolution for large number of columns should be faster
[ https://issues.apache.org/jira/browse/SPARK-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-16406. --- Resolution: Fixed Fix Version/s: 2.4.0 > Reference resolution for large number of columns should be faster > - > > Key: SPARK-16406 > URL: https://issues.apache.org/jira/browse/SPARK-16406 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Major > Fix For: 2.4.0 > > > Resolving columns in a LogicalPlan on average takes n / 2 (n being the number > of columns). This gets problematic as soon as you try to resolve a large > number of columns (m) on a large table: O(m * n / 2) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
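[Editorial note: the complexity claim can be illustrated with a toy resolver: a linear scan costs O(n) per column, while building a hash index once makes each subsequent lookup O(1), so resolving m columns drops from O(m * n) to O(m + n). A sketch of the idea, not Spark's actual resolution code, which also handles qualifiers and case sensitivity:]

```python
def resolve_linear(columns, name):
    # O(n) per lookup: scan the attribute list from the front.
    for i, c in enumerate(columns):
        if c == name:
            return i
    raise KeyError(name)

def build_index(columns):
    # One O(n) pass builds a hash index; every later lookup is O(1).
    return {c: i for i, c in enumerate(columns)}

cols = ["c%d" % i for i in range(1000)]
print(resolve_linear(cols, "c999"))   # 999, after scanning all 1000 entries
index = build_index(cols)
print(index["c999"])                  # 999, via a single hash lookup
```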
[jira] [Updated] (SPARK-24197) add array_sort function
[ https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Novotny updated SPARK-24197: -- Description: Add a SparkR equivalent function to [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921]. (was: Add a SparkR equivalent function to SPARK-23921.) > add array_sort function > --- > > Key: SPARK-24197 > URL: https://issues.apache.org/jira/browse/SPARK-24197 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Add a SparkR equivalent function to > [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24197) add array_sort function
[ https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marek Novotny updated SPARK-24197: -- Description: Add a SparkR equivalent function to SPARK-23921. (was: Add a SparkR equivalent function for [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].) > add array_sort function > --- > > Key: SPARK-24197 > URL: https://issues.apache.org/jira/browse/SPARK-24197 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Add a SparkR equivalent function to SPARK-23921. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24198) add slice function
[ https://issues.apache.org/jira/browse/SPARK-24198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465627#comment-16465627 ] Marek Novotny commented on SPARK-24198: --- I will work on this. Thanks. > add slice function > -- > > Key: SPARK-24198 > URL: https://issues.apache.org/jira/browse/SPARK-24198 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Add a SparkR equivalent function to > [SPARK-23930|https://issues.apache.org/jira/browse/SPARK-23930]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24198) add slice function
Marek Novotny created SPARK-24198: - Summary: add slice function Key: SPARK-24198 URL: https://issues.apache.org/jira/browse/SPARK-24198 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.4.0 Reporter: Marek Novotny Add a SparkR equivalent function to [SPARK-23930|https://issues.apache.org/jira/browse/SPARK-23930]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24197) add array_sort function
[ https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465626#comment-16465626 ] Marek Novotny commented on SPARK-24197: --- I will work on this. Thanks. > add array_sort function > --- > > Key: SPARK-24197 > URL: https://issues.apache.org/jira/browse/SPARK-24197 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Priority: Major > > Add a SparkR equivalent function for > [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24197) add array_sort function
Marek Novotny created SPARK-24197: - Summary: add array_sort function Key: SPARK-24197 URL: https://issues.apache.org/jira/browse/SPARK-24197 Project: Spark Issue Type: Sub-task Components: SparkR Affects Versions: 2.4.0 Reporter: Marek Novotny Add a SparkR equivalent function for [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
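[Editorial note: the Scala-side array_sort this SparkR ticket mirrors (SPARK-23921) sorts ascending with NULL elements placed last, as far as the Spark function documentation describes; a Python sketch of those assumed semantics:]

```python
def array_sort(xs):
    # Ascending sort with None (NULL) elements moved to the end,
    # modelling the documented behaviour of Spark's array_sort.
    non_null = sorted(x for x in xs if x is not None)
    return non_null + [None] * sum(1 for x in xs if x is None)

print(array_sort([3, None, 1, 2]))  # [1, 2, 3, None]
```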
[jira] [Resolved] (SPARK-23930) High-order function: slice(x, start, length) → array
[ https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-23930. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21040 [https://github.com/apache/spark/pull/21040] > High-order function: slice(x, start, length) → array > > > Key: SPARK-23930 > URL: https://issues.apache.org/jira/browse/SPARK-23930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > Fix For: 2.4.0 > > > Ref: https://prestodb.io/docs/current/functions/array.html > Subsets array x starting from index start (or starting from the end if start > is negative) with a length of length. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
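[Editorial note: the Presto semantics referenced above use a 1-based start, with a negative start counting from the end of the array. A Python sketch of that contract; edge-case handling beyond the quoted description is an assumption:]

```python
def slice_(x, start, length):
    # Presto-style slice: 1-based positive start; a negative start counts
    # back from the end; returns at most `length` elements.
    if start == 0:
        raise ValueError("SQL slice is 1-based; start must be non-zero")
    i = start - 1 if start > 0 else len(x) + start
    return x[i:i + length]

print(slice_([1, 2, 3, 4], 2, 2))   # [2, 3]
print(slice_([1, 2, 3, 4], -2, 2))  # [3, 4]
```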
[jira] [Assigned] (SPARK-23930) High-order function: slice(x, start, length) → array
[ https://issues.apache.org/jira/browse/SPARK-23930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin reassigned SPARK-23930: - Assignee: Marco Gaido > High-order function: slice(x, start, length) → array > > > Key: SPARK-23930 > URL: https://issues.apache.org/jira/browse/SPARK-23930 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Marco Gaido >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Subsets array x starting from index start (or starting from the end if start > is negative) with a length of length. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections don't show db artefacts
[ https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rr updated SPARK-24196: --- Attachment: screenshot-1.png > Spark Thrift Server - SQL Client connections don't show db artefacts > - > > Key: SPARK-24196 > URL: https://issues.apache.org/jira/browse/SPARK-24196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: rr >Priority: Major > Attachments: screenshot-1.png > > > When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not > showing up, whereas when connecting to hiveserver2 it shows the schema, tables, columns > ... > SQL Clients used: IBM Data Studio, DBeaver SQL Client -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24160) ShuffleBlockFetcherIterator should fail if it receives zero-size blocks
[ https://issues.apache.org/jira/browse/SPARK-24160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465536#comment-16465536 ] Apache Spark commented on SPARK-24160: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21256 > ShuffleBlockFetcherIterator should fail if it receives zero-size blocks > --- > > Key: SPARK-24160 > URL: https://issues.apache.org/jira/browse/SPARK-24160 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 2.4.0 > > > In the shuffle layer, we guarantee that zero-size blocks will never be > requested (a block containing zero records is always 0 bytes in size and is > marked as empty such that it will never be legitimately requested by > executors). However, we failed to take advantage of this in the shuffle-read > path: the existing code did not explicitly check whether blocks are > non-zero-size. > > We should add `buf.size != 0` checks to ShuffleBlockFetcherIterator to take > advantage of this invariant and prevent potential data loss / corruption > issues. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
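[Editorial note: the proposed invariant check can be sketched abstractly: since a zero-size block is never legitimately requested, receiving one indicates lost or corrupted data and should fail fast instead of being silently skipped. A Python sketch with hypothetical names, not the actual ShuffleBlockFetcherIterator code:]

```python
def validate_fetched_blocks(blocks):
    # blocks: iterable of (block_id, buffer) pairs from a shuffle fetch.
    # Empty blocks are marked as such up front and never requested, so a
    # zero-size buffer here means something went wrong in transit; raising
    # immediately beats propagating silent data loss downstream.
    for block_id, buf in blocks:
        if len(buf) == 0:
            raise IOError("Received a zero-size buffer for block %s" % block_id)
        yield block_id, buf
```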
[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections don't show db artefacts
[ https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rr updated SPARK-24196: --- Description: When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not showing up, whereas when connecting to hiveserver2 it shows the schema, tables, columns ... SQL Clients used: IBM Data Studio, DBeaver SQL Client was: When connecting to Spark Thrift Server via JDBC artefacts(db objects are not showing up) whereas when connecting to hiveserver2 is shows the schema, tables, columns ... SQL Client user: IBM Data Studio, DBeaver SQL Client > Spark Thrift Server - SQL Client connections don't show db artefacts > - > > Key: SPARK-24196 > URL: https://issues.apache.org/jira/browse/SPARK-24196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: rr >Priority: Major > > When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not > showing up, whereas when connecting to hiveserver2 it shows the schema, tables, columns > ... > SQL Clients used: IBM Data Studio, DBeaver SQL Client -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24186) add array reverse and concat
[ https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24186: Assignee: (was: Apache Spark) > add array reverse and concat > - > > Key: SPARK-24186 > URL: https://issues.apache.org/jira/browse/SPARK-24186 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and > https://issues.apache.org/jira/browse/SPARK-23926 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24186) add array reverse and concat
[ https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465526#comment-16465526 ] Apache Spark commented on SPARK-24186: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/21255 > add array reverse and concat > - > > Key: SPARK-24186 > URL: https://issues.apache.org/jira/browse/SPARK-24186 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Priority: Major > > Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and > https://issues.apache.org/jira/browse/SPARK-23926 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
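[Editorial note: the two Scala functions these SparkR wrappers target behave, as far as the linked tickets describe, like element reversal and array concatenation; trivial Python analogues for reference:]

```python
def reverse(xs):
    # Return the array with element order inverted.
    return xs[::-1]

def concat(*arrays):
    # Append the input arrays end to end into one array.
    out = []
    for a in arrays:
        out.extend(a)
    return out

print(reverse([1, 2, 3]))        # [3, 2, 1]
print(concat([1, 2], [3], [4]))  # [1, 2, 3, 4]
```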
[jira] [Updated] (SPARK-24196) Spark Thrift Server - SQL Client connections don't show db artefacts
[ https://issues.apache.org/jira/browse/SPARK-24196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] rr updated SPARK-24196: --- Description: When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not showing up, whereas when connecting to hiveserver2 it shows the schema, tables, columns ... SQL Clients used: IBM Data Studio, DBeaver SQL Client was: When connecting to Spark Thrift Server via JDBC artefacts(db objects are not showing up) whereas when connecting to hiveserver2 is shows the schema, tables, colums ... SQL Client user: IBM Data Studio, DBeaver SQL Client > Spark Thrift Server - SQL Client connections don't show db artefacts > - > > Key: SPARK-24196 > URL: https://issues.apache.org/jira/browse/SPARK-24196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: rr >Priority: Major > > When connecting to Spark Thrift Server via JDBC, artefacts (db objects) are not > showing up, whereas when connecting to hiveserver2 it shows the schema, tables, columns > ... > SQL Clients used: IBM Data Studio, DBeaver SQL Client -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24196) Spark Thrift Server - SQL Client connections does't show db artefacts
rr created SPARK-24196: -- Summary: Spark Thrift Server - SQL Client connections does't show db artefacts Key: SPARK-24196 URL: https://issues.apache.org/jira/browse/SPARK-24196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: rr When connecting to Spark Thrift Server via JDBC artefacts(db objects are not showing up) whereas when connecting to hiveserver2 is shows the schema, tables, colums ... SQL Client user: IBM Data Studio, DBeaver SQL Client -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24186) add array reverse and concat
[ https://issues.apache.org/jira/browse/SPARK-24186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24186: Assignee: Apache Spark > add array reverse and concat > - > > Key: SPARK-24186 > URL: https://issues.apache.org/jira/browse/SPARK-24186 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Add R versions of https://issues.apache.org/jira/browse/SPARK-23736 and > https://issues.apache.org/jira/browse/SPARK-23926 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org