[jira] [Assigned] (SPARK-27034) Nested schema pruning for ORC
[ https://issues.apache.org/jira/browse/SPARK-27034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27034: Assignee: Apache Spark > Nested schema pruning for ORC > - > > Key: SPARK-27034 > URL: https://issues.apache.org/jira/browse/SPARK-27034 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > We only support nested schema pruning for Parquet currently. This is opened > to propose to support nested schema pruning for ORC too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27034) Nested schema pruning for ORC
Liang-Chi Hsieh created SPARK-27034: --- Summary: Nested schema pruning for ORC Key: SPARK-27034 URL: https://issues.apache.org/jira/browse/SPARK-27034 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Liang-Chi Hsieh We only support nested schema pruning for Parquet currently. This is opened to propose to support nested schema pruning for ORC too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27034) Nested schema pruning for ORC
[ https://issues.apache.org/jira/browse/SPARK-27034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27034: Assignee: (was: Apache Spark) > Nested schema pruning for ORC > - > > Key: SPARK-27034 > URL: https://issues.apache.org/jira/browse/SPARK-27034 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > We only support nested schema pruning for Parquet currently. This is opened > to propose to support nested schema pruning for ORC too. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
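As editorial context for the proposal above: nested schema pruning means that when a query reads only `person.name` from a deeply nested struct, the reader should request only that leaf from the file instead of the whole struct. A toy pure-Python sketch of the idea (this is illustrative pseudologic, not Spark's Catalyst implementation; the dict-based schema encoding is an assumption made up for this example):

```python
def prune_schema(schema, paths):
    """Keep only the leaves named by dotted paths.

    schema: dict mapping field name -> nested dict (struct) or None (leaf).
    paths:  iterable of dotted paths like "person.name".
    """
    pruned = {}
    for path in paths:
        head, _, rest = path.partition(".")
        if head not in schema:
            continue  # unknown field: ignore in this sketch
        child = schema[head]
        if rest and isinstance(child, dict):
            sub = prune_schema(child, [rest])
            if head in pruned and isinstance(pruned[head], dict):
                pruned[head].update(sub)  # merge paths into the same struct
            else:
                pruned[head] = sub
        else:
            pruned[head] = child  # whole field requested
    return pruned

full = {"person": {"name": None, "address": {"city": None, "zip": None}}, "id": None}
print(prune_schema(full, ["person.name", "id"]))  # -> {'person': {'name': None}, 'id': None}
```

With a columnar format like ORC or Parquet, reading the pruned schema instead of `full` avoids decoding `person.address` entirely, which is the I/O saving the issue proposes to extend to ORC.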
[jira] [Assigned] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27033: Assignee: Apache Spark > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.4.0 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > > Currently, filters like this "select * from table where a + 1 >= 3" cannot be > pushed down, this optimizer can convert it to "select * from table where a >= > 3 - 1", and then be "select * from table where a >= 2". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27033: Assignee: (was: Apache Spark) > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Minor > > Currently, filters like this "select * from table where a + 1 >= 3" cannot be > pushed down, this optimizer can convert it to "select * from table where a >= > 3 - 1", and then be "select * from table where a >= 2". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27033: --- Affects Version/s: (was: 3.0.0) 2.4.0 > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Minor > > Currently, filters like this "select * from table where a + 1 >= 3" cannot be > pushed down, this optimizer can convert it to "select * from table where a >= > 3 - 1", and then be "select * from table where a >= 2". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27033: --- Description: Currently, filters like this "select * from table where a + 1 >= 3" cannot be pushed down, this optimizer can convert it to "select * from table where a >= 3 - 1", and then be "select * from table where a >= 2". (was: _emphasized text_) > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > > Currently, filters like this "select * from table where a + 1 >= 3" cannot be > pushed down, this optimizer can convert it to "select * from table where a >= > 3 - 1", and then be "select * from table where a >= 2". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27033: --- Description: _emphasized text_ > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > > _emphasized text_ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
EdisonWang created SPARK-27033: -- Summary: Add rule to optimize binary comparisons to its push down format Key: SPARK-27033 URL: https://issues.apache.org/jira/browse/SPARK-27033 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 3.0.0 Reporter: EdisonWang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
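The rewrite proposed in this thread can be sketched outside Spark. A minimal pure-Python model of the rule (tuple-based expression trees and the `rewrite` helper are invented for illustration; Spark's Catalyst rule would need overflow and floating-point guards that this toy skips):

```python
def rewrite(cmp):
    """Rewrite (attr + lit1) >= lit2 into attr >= (lit2 - lit1).

    cmp is a toy expression: ("GE", left, right), where left may be
    ("ADD", attr_name, literal). The result is then constant-folded,
    e.g. a + 1 >= 3 becomes a >= 3 - 1 becomes a >= 2.
    """
    op, left, right = cmp
    if isinstance(left, tuple) and left[0] == "ADD" and isinstance(left[2], (int, float)):
        attr, delta = left[1], left[2]
        return (op, attr, right - delta)  # Python folds the constant here
    return cmp

print(rewrite(("GE", ("ADD", "a", 1), 3)))  # -> ('GE', 'a', 2)
```

Once the comparison is against a bare attribute, it matches the shape that data-source filter pushdown accepts, which is the point of the proposal.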
[jira] [Commented] (SPARK-26860) RangeBetween docs appear to be wrong
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782575#comment-16782575 ] Jagadesh Kiran N commented on SPARK-26860: -- As a follow-up: I mentioned a few days back that I would be uploading the pull request. > RangeBetween docs appear to be wrong > - > > Key: SPARK-26860 > URL: https://issues.apache.org/jira/browse/SPARK-26860 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Shelby Vanhooser >Priority: Major > Labels: docs, easyfix, python > Original Estimate: 1h > Remaining Estimate: 1h > > The docs describing > [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween] > for PySpark appear to be duplicates of > [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween] > even though these are functionally different windows. Rows between > reference preceding and succeeding rows, but rangeBetween is based on the > values in these rows.
[jira] [Commented] (SPARK-26860) RangeBetween docs appear to be wrong
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782573#comment-16782573 ] Jagadesh Kiran N commented on SPARK-26860: -- [~srowen] I took the time to understand the steps for both R and Python, and I'm almost done with the fix. Please move this to the In Progress state so that I can raise a pull request. > RangeBetween docs appear to be wrong > - > > Key: SPARK-26860 > URL: https://issues.apache.org/jira/browse/SPARK-26860 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Shelby Vanhooser >Priority: Major > Labels: docs, easyfix, python > Original Estimate: 1h > Remaining Estimate: 1h > > The docs describing > [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween] > for PySpark appear to be duplicates of > [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween] > even though these are functionally different windows. Rows between > reference preceding and succeeding rows, but rangeBetween is based on the > values in these rows.
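The semantic distinction the corrected docs need to state can be shown without Spark at all. A toy pure-Python model over a sorted ORDER BY column (the helper names are invented for this sketch; PySpark's actual frames are set via `Window.rowsBetween`/`Window.rangeBetween`):

```python
def rows_between(vals, i, start, end):
    """ROWS frame: physical row offsets relative to row index i."""
    lo, hi = max(0, i + start), min(len(vals) - 1, i + end)
    return vals[lo:hi + 1]

def range_between(vals, i, start, end):
    """RANGE frame: rows whose ORDER BY value lies within [cur+start, cur+end]."""
    cur = vals[i]
    return [v for v in vals if cur + start <= v <= cur + end]

vals = [1, 2, 2, 5]                    # a sorted ORDER BY column
print(rows_between(vals, 2, -1, 1))    # -> [2, 2, 5]  (one row either side of index 2)
print(range_between(vals, 2, -1, 1))   # -> [1, 2, 2]  (values within 1 of the value 2)
```

The two frames coincide only when values are distinct and evenly spaced; with the duplicate `2` above they already diverge, which is why duplicated docstrings between the two methods are misleading.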
[jira] [Commented] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782565#comment-16782565 ] Felix Cheung commented on SPARK-26918: -- [~srowen] what do you think? > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Minor > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26918) All .md should have ASF license header
[ https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782564#comment-16782564 ] Felix Cheung commented on SPARK-26918: -- I'm for doing this (reopen this issue). Also [~rmsm...@gmail.com], this needs to: # be done on all .md files # remove the rat filter for .md Then, after that: # run the doc build to check that the docs are generated properly, i.e. as at the beginning of the "Update the Spark Website" section of https://spark.apache.org/release-process.html {{$ cd docs}} {{$ PRODUCTION=1 jekyll build}} > All .md should have ASF license header > -- > > Key: SPARK-26918 > URL: https://issues.apache.org/jira/browse/SPARK-26918 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.4.0, 3.0.0 >Reporter: Felix Cheung >Priority: Minor > > per policy, all md files should have the header, like eg. > [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md] > or > [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md] > > currently it does not > [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
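Step 1 above (verifying which .md files carry the header) could be scripted. A rough sketch, assuming the standard ASF header text as the marker; the marker string and helper name are illustrative, not part of Spark's build tooling:

```python
import pathlib
import tempfile

# First line of the standard ASF license header (assumption for this sketch).
ASF_MARKER = "Licensed to the Apache Software Foundation (ASF)"

def files_missing_header(root):
    """Return .md files under root whose text lacks the ASF header marker."""
    return [p for p in pathlib.Path(root).rglob("*.md")
            if ASF_MARKER not in p.read_text(encoding="utf-8", errors="ignore")]

# Demo on a throwaway directory (a stand-in for the docs/ tree):
root = tempfile.mkdtemp()
pathlib.Path(root, "ok.md").write_text(f"<!-- {ASF_MARKER} ... -->\n# Title\n")
pathlib.Path(root, "missing.md").write_text("# Title only\n")
print([p.name for p in files_missing_header(root)])  # -> ['missing.md']
```

Because Markdown has no native comment syntax, the header would go inside an HTML comment as in the Arrow and Hadoop examples linked above, which Jekyll drops from the rendered pages.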
[jira] [Resolved] (SPARK-27029) Update Thrift to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-27029. --- Resolution: Fixed Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23935 > Update Thrift to 0.12.0 > --- > > Key: SPARK-27029 > URL: https://issues.apache.org/jira/browse/SPARK-27029 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Fix For: 3.0.0 > > > We should update to Thrift 0.12.0 to pick up security and bug fixes. It > appears to be compatible with the current build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782542#comment-16782542 ] Maxim Gekk commented on SPARK-26016: [~srowen] JSON datasource uses Text datasource in schema inferring to split text input by lines. For that, line separator should be converted to a correct sequence of bytes and passed to Hadoop's Line Reader. Here we need correct encoding in Text datasource. Text datasource itself does copy bytes from Hadoop Line Reader to InternalRow: [https://github.com/apache/spark/blob/36a2e6371b4d173c3e03cc0d869c39335a0d7682/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L135] . After that, the bytes are read by Jackson parser from streams decoded according to encoding: [https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala#L88-L93] The PR you pointed out just makes necessary changes in Text datasource to support encodings in JSON datasource. > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Priority: Major > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. > If I read in a ISO-8859-1 encoded file and simply write out what was read; > the contents in the part file matches what was read. Which is great. > However, the second I use a map / mapPartitions function it looks like the > encoding is not correct. In addition a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
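The symptom in this report can be reproduced in pure Python, independent of Spark: ISO-8859-1 bytes must be decoded with that charset at every hop, or any non-ASCII character is mangled, which is consistent with a map/mapPartitions stage silently falling back to a default charset:

```python
# Charset round-trip demo (no Spark involved).
raw = "café".encode("iso-8859-1")                 # bytes as stored on disk
ok = raw.decode("iso-8859-1")                     # charset honored end-to-end
mangled = raw.decode("utf-8", errors="replace")   # one hop uses the wrong charset
print(ok)       # -> café
print(mangled)  # 0xE9 is invalid UTF-8, so 'é' becomes U+FFFD
```

This is why the comment above stresses that the line separator and the Jackson input streams must both be constructed with the user-supplied encoding.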
[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781670#comment-16781670 ] shahid edited comment on SPARK-27019 at 3/2/19 9:44 PM: seems event reordering has happened. Job start event came after sql execution end event, when the query failed. Could you please share spark eventLog for the application, if possible. was (Author: shahid): seems event reordering has happened. Job start event came after sql execution end event, when the query failed. Could you please share spark eventLog for the application, if possible, as I'm unable to reproduce in our cluster. > Spark UI's SQL tab shows inconsistent values > > > Key: SPARK-27019 > URL: https://issues.apache.org/jira/browse/SPARK-27019 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.4.0 >Reporter: peay >Priority: Major > Attachments: Screenshot from 2019-03-01 21-31-48.png, > application_1550040445209_4748, query-1-details.png, query-1-list.png, > query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png > > > Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the > Spark UI, where submitted/duration make no sense, description has the ID > instead of the actual description. > Clicking on the link to open a query, the SQL plan is missing as well. > I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to > very large values like 30k out of paranoia that we may have too many events, > but to no avail. I have not identified anything particular that leads to > that: it doesn't occur in all my jobs, but it does occur in a lot of them > still. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27019: Assignee: (was: Apache Spark) > Spark UI's SQL tab shows inconsistent values > > > Key: SPARK-27019 > URL: https://issues.apache.org/jira/browse/SPARK-27019 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.4.0 >Reporter: peay >Priority: Major > Attachments: Screenshot from 2019-03-01 21-31-48.png, > application_1550040445209_4748, query-1-details.png, query-1-list.png, > query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png > > > Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the > Spark UI, where submitted/duration make no sense, description has the ID > instead of the actual description. > Clicking on the link to open a query, the SQL plan is missing as well. > I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to > very large values like 30k out of paranoia that we may have too many events, > but to no avail. I have not identified anything particular that leads to > that: it doesn't occur in all my jobs, but it does occur in a lot of them > still. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27019) Spark UI's SQL tab shows inconsistent values
[ https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27019: Assignee: Apache Spark > Spark UI's SQL tab shows inconsistent values > > > Key: SPARK-27019 > URL: https://issues.apache.org/jira/browse/SPARK-27019 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.4.0 >Reporter: peay >Assignee: Apache Spark >Priority: Major > Attachments: Screenshot from 2019-03-01 21-31-48.png, > application_1550040445209_4748, query-1-details.png, query-1-list.png, > query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png > > > Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the > Spark UI, where submitted/duration make no sense, description has the ID > instead of the actual description. > Clicking on the link to open a query, the SQL plan is missing as well. > I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to > very large values like 30k out of paranoia that we may have too many events, > but to no avail. I have not identified anything particular that leads to > that: it doesn't occur in all my jobs, but it does occur in a lot of them > still. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25762) Upgrade guava version in spark dependency lists due to CVE issue
[ https://issues.apache.org/jira/browse/SPARK-25762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-25762: -- Target Version/s: 3.0.0 Updating Guava can be really disruptive, but we shade it, so it's possible I think. > Upgrade guava version in spark dependency lists due to CVE issue > - > > Key: SPARK-25762 > URL: https://issues.apache.org/jira/browse/SPARK-25762 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.2.1, 2.2.2, 2.3.1, 2.3.2 >Reporter: Debojyoti >Priority: Major > > In spark2.x dependency list we have guava-14.0.1.jar. However there are lot > vulnerabilities exists in this version.eg. CVE-2018-10237 > [https://www.cvedetails.com/cve/CVE-2018-10237/] > Do we have any solution to resolve it or is there any plan to upgrade guava > version any of the spark's future release? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25781) relative importance of linear regression
[ https://issues.apache.org/jira/browse/SPARK-25781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782537#comment-16782537 ] Sean Owen commented on SPARK-25781: --- PS are you referring to Shapley values? > relative importance of linear regression > > > Key: SPARK-25781 > URL: https://issues.apache.org/jira/browse/SPARK-25781 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.3.2 >Reporter: ruxi zhang >Priority: Minor > Labels: features > Attachments: v17i01.pdf > > > There is an R package relaimpo that generates relative importance for linear > regression features. This method utilizes sharply value regression, which > will take a long time to run on big datasets. This method is quite useful > for many use cases such as attribution model in marketing. It will be great > if it is written in Spark with paralleled computing, which would be producing > result within a much short time. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
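Assuming the answer to Sean's question is yes, the requested "relative importance" is a Shapley-value (LMG-style) decomposition of R², which is why it is expensive: it averages a feature's marginal R² contribution over all orderings. A toy exact computation (the subset R² numbers are made-up illustrative values, not from any real model):

```python
from itertools import permutations

# Hypothetical R^2 of a linear model fit on each feature subset.
r2 = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.5,
    frozenset({"b"}): 0.3,
    frozenset({"a", "b"}): 0.6,
}

def shapley(feature, features, value):
    """Average marginal contribution of `feature` over all orderings."""
    orders = list(permutations(features))
    total = 0.0
    for order in orders:
        before = frozenset(order[:order.index(feature)])
        total += value[before | {feature}] - value[before]
    return total / len(orders)

print(shapley("a", ["a", "b"], r2))  # -> 0.4
print(shapley("b", ["a", "b"], r2))  # -> 0.2 (shares sum to the full R^2, 0.6)
```

The factorial number of orderings (and the model refit per subset) is exactly the cost that a parallel Spark implementation, as the reporter suggests, would help amortize on large datasets.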
[jira] [Resolved] (SPARK-25941) Random forest score decreased due to updating spark version
[ https://issues.apache.org/jira/browse/SPARK-25941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25941. --- Resolution: Not A Problem This isn't a bug. The implementation has changed in Spark and MLlib a lot over time. I would not expect exactly the same answer, and this is quite close. > Random forest score decreased due to updating spark version > --- > > Key: SPARK-25941 > URL: https://issues.apache.org/jira/browse/SPARK-25941 > Project: Spark > Issue Type: Bug > Components: Deploy, Input/Output, ML >Affects Versions: 2.3.2 >Reporter: jack li >Priority: Major > Labels: ML, forest, random > > h3. Problem description > I use different versions of spark to analyze random forest scores.. > * spark-core_2.10 and version 2.0.0 > ** RandomForestsKaggle Score = 0.8978765219058574 > * spark-core_2.11 and version 2.4.0 > ** RandomForestsKaggle Score = 0.8886987035251259 > Source : [https://github.com/smartscity/Kaggle_Titanic_spark] > [Example github source and > readme|https://github.com/smartscity/Kaggle_Titanic_spark/blob/master/README.md] > > h3. Introduce > This case is Titanic Competitions on the Kaggle. > [https://www.kaggle.com/c/titanic] > h3. Conclusion > After upgrading the spark version({{version 2.4.0}}), the random forest score > dropped({{0.01}}). > h3. Expectation > Expect random forest score not to drop as the version upgrades. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
[ https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25814. --- Resolution: Duplicate I think there are a few related items that this could be a duplicate of; picked one here. > spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore > -- > > Key: SPARK-25814 > URL: https://issues.apache.org/jira/browse/SPARK-25814 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.2, 2.3.2 >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: driver, memory-analysis, memory-leak, statestore > Attachments: image-2018-10-23-14-06-53-722.png > > > We're looking into issue when even huge spark driver memory gets eventually > exhausted and GC makes driver stop responding. > Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is > used by > > {noformat} > org.apache.spark.status.AppStatusStore > -> org.apache.spark.status.ElementTrackingStore > -> org.apache.spark.util.kvstore.InMemoryStore > > {noformat} > > Is there is a way to tune this particular spark driver's memory region down? > > > !image-2018-10-23-14-06-53-722.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25911) [spark-ml] Hypothesis testing module
[ https://issues.apache.org/jira/browse/SPARK-25911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25911. --- Resolution: Won't Fix I don't think we'd add all of those. Some of these are already in JIRA as ideas. While I don't think we'll add much more like this to ML, if you have one you can argue is widely used, and you can implement it, then I'd create (or find) a JIRA for that one to discuss first. > [spark-ml] Hypothesis testing module > > > Key: SPARK-25911 > URL: https://issues.apache.org/jira/browse/SPARK-25911 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 3.0.0 >Reporter: Uday Babbar >Priority: Minor > > h2. Why this ticket was created > Feasibility determination of some subset of hypothesis testing module mainly > along value proposition front and to get a preliminary opinion of how does it > generally sound. Can work on a more comprehensive proposal if say, it's > generally agreed upon that including dataframe API for t-test makes sense in > the o.a.s.ml package. > h2. Current state > There are some streaming implementation in the o.a.s.mllib module, but there > are no dataframe APIs for some standard tests (t-test). > ||Test ||Current state||Proposed state|| > |t-test (welch's, student)|only streaming |Dataframe API| > |chi-squared|streaming, Dataframe/RDD API present| - | > |ANOVA|-|Dataframe API| > |mann-whitney-u-test|-|RDD API (in maintenance mode so probably doesn't make > sense to include this)| > h2. Rationale > The utility of experimentation platforms is pervasive and most of them that > operate at scale (a large portion of them use spark for offline computation) > require distributed implementation of hypothesis tests to calculate p-values > of different metrics/features. 
These APIs would enable distributed > computation of the relevant stats and prevent overhead in moving data (or > some downstream view of it) to a framework where such stats computation is > available (R, scipy). > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25928) NoSuchMethodError net.jpountz.lz4.LZ4BlockInputStream.&lt;init&gt;(Ljava/io/InputStream;Z)V
[ https://issues.apache.org/jira/browse/SPARK-25928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25928. --- Resolution: Not A Problem Yeah, you have an lz4-java version conflict somewhere. That's not a Spark problem per se, but a problem. I'd work around it, see if Oozie can update (? assuming it's on an older version) or manually update your Oozie install. Vendors sometimes make these kind of workarounds too. > NoSuchMethodError > net.jpountz.lz4.LZ4BlockInputStream.(Ljava/io/InputStream;Z)V > - > > Key: SPARK-25928 > URL: https://issues.apache.org/jira/browse/SPARK-25928 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.1 > Environment: EMR 5.17 which is using oozie 5.0.0 and spark 2.3.1 >Reporter: Jerry Chabot >Priority: Major > > I am not sure if this is an Oozie problem, a Spark problem or a user error. > It is blocking our upcoming release. > We are upgrading from Amazon's EMR 5.7 to EMR 5.17. The version changes are: > oozie 4.3.0 -> 5.0.0 > spark 2.1.1 -> 2.3.1 > All our Oozie/Spark jobs were working in EMR 5.7. After ugprading, some of > our jobs which use a spark action are failing with the NoSuchMethod as shown > further in the description. It seems like conflicting classes. > I noticed the spark share lib directory has two versions of the LZ4 jar. > sudo -u hdfs hadoop fs -ls /user/oozie/share/lib/lib_20181029182704/spark/*lz* > -rw-r--r-- 3 oozie oozie 79845 2018-10-29 18:27 > /user/oozie/share/lib/lib_20181029182704/spark/compress-lzf-1.0.3.jar > -rw-r--r-- 3 hdfs oozie 236880 2018-11-01 18:22 > /user/oozie/share/lib/lib_20181029182704/spark/lz4-1.3.0.jar > -rw-r--r-- 3 oozie oozie 370119 2018-10-29 18:27 > /user/oozie/share/lib/lib_20181029182704/spark/lz4-java-1.4.0.jar > But, both of these jars have the constructor > LZ4BlockInputStream(java/io/InputStream). The spark/jars directory has only > lz4-java-1.4.0.jar. 
share lib seems to be getting it from the > /usr/lib/oozie/oozie-sharelib.tar.gz. > Unfortunately, my team member that knows most about Spark is on vacation. > Does anyone have any suggestions on how best to troubleshoot this problem? > Here is the strack trace. > diagnostics: User class threw exception: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 > (TID 3, ip-172-27-113-49.ec2.internal, executor 2): > java.lang.NoSuchMethodError: > net.jpountz.lz4.LZ4BlockInputStream.(Ljava/io/InputStream;Z)V > at > org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:304) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1346) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > 
at org.apache.spark.scheduler.Task.run(Task.scala:109) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
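A NoSuchMethodError like the one above usually means two jars on the classpath bundle the same class in different versions, and the classloader picks the wrong one. A small, hypothetical helper (not part of Spark or Oozie) can scan a jar directory to find out which jars actually contain a given class:

```python
import zipfile
from pathlib import Path

def jars_containing(jar_dir, class_name):
    """Return the names of jars under jar_dir that bundle the given class.

    class_name uses dots, e.g. 'net.jpountz.lz4.LZ4BlockInputStream'.
    """
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                hits.append(jar.name)
    return hits

# If more than one jar is returned, classloader order decides which
# version of the class is loaded -- a common source of NoSuchMethodError
# when a constructor or method signature changed between versions.
```

Running it against the Oozie share lib directory above would flag both lz4-1.3.0.jar and lz4-java-1.4.0.jar as providers of LZ4BlockInputStream, which is exactly the conflict Sean describes.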
[jira] [Commented] (SPARK-25982) Dataframe write is non blocking in fair scheduling mode
[ https://issues.apache.org/jira/browse/SPARK-25982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782527#comment-16782527 ] Ramandeep Singh commented on SPARK-25982: - No, as I said, those operations at a stage are independent, and I explicitly wait for them to complete before launching the next stage. The problem is that operations from the next stage start running before all futures have completed. > Dataframe write is non blocking in fair scheduling mode > --- > > Key: SPARK-25982 > URL: https://issues.apache.org/jira/browse/SPARK-25982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Ramandeep Singh >Priority: Major > > Hi, > I have noticed that the expected behavior of a dataframe write operation to block > is not working in fair scheduling mode. > Ideally, when a dataframe write is occurring and a future is blocking on > AwaitResult, no other job should be started, but this is not the case. I have > noticed that other jobs are started while the partitions are being written. > > Regards, > Ramandeep Singh > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26146) CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782526#comment-16782526 ] Sean Owen commented on SPARK-26146: --- It's possible, even likely, that the dependency is transitive. In this case: {code} [INFO] net.jgp.books:spark-chapter01:jar:1.0.0-SNAPSHOT [INFO] +- org.apache.spark:spark-core_2.11:jar:2.4.0:compile [INFO] | +- org.apache.avro:avro:jar:1.8.2:compile [INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile [INFO] | | +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile [INFO] | | +- com.thoughtworks.paranamer:paranamer:jar:2.7:compile {code} Here's what I'm confused about: Spark has depended on paranamer 2.8 since 2.3.0: https://github.com/apache/spark/blob/v2.4.0/pom.xml#L185 {code} [INFO] +- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] | +- com.thoughtworks.paranamer:paranamer:jar:2.8:runtime [INFO] | +- org.apache.avro:avro:jar:1.8.2:compile [INFO] | | +- org.codehaus.jackson:jackson-core-asl:jar {code} My only guess right now is that this is due to the arcane scoping rules for transitive dependencies in Maven. paranamer is a runtime dependency. Here, Spark is given as a compile-time dependency. It should be 'provided'. That doesn't change what Maven reports (still weird) but might solve the problem. Spark itself really does have 2.8. You can see it in https://search.maven.org/artifact/org.apache.spark/spark-parent_2.12/2.4.0/pom I think this is still probably a weird issue between the code here and Maven. > CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12 > -- > > Key: SPARK-26146 > URL: https://issues.apache.org/jira/browse/SPARK-26146 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Jean Georges Perrin >Priority: Major > > Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, > where it works ok with Scala v2.11. 
> When running a simple CSV ingestion like:{{ }} > {code:java} > // Creates a session on a local master > SparkSession spark = SparkSession.builder() > .appName("CSV to Dataset") > .master("local") > .getOrCreate(); > // Reads a CSV file with header, called books.csv, stores it in a > dataframe > Dataset df = spark.read().format("csv") > .option("header", "true") > .load("data/books.csv"); > {code} > With Scala 2.12, I get: > {code:java} > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582 > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103) > at > com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240) > ... > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37) > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21) > {code} > Where it works pretty smoothly if I switch back to 2.11. > Full example available at > [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can > modify pom.xml to change easily the Scala version in the property section: > {code:java} > > UTF-8 > 1.8 > 2.11 > 2.4.0 > {code} > > (ps. 
It's my first bug submission, so I hope I did not mess too much, be > tolerant if I did) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
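Sean's suggestion above (depend on Spark with 'provided' scope, so Spark's own runtime jars, including paranamer 2.8, are used instead of a re-resolved transitive version) can be sketched in the consumer's pom.xml. A hedged example: the artifact and versions come from the thread, the rest is illustrative, not a verified fix:

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.0</version>
    <!-- 'provided': compile against Spark, but rely on the Spark
         distribution's own jars at runtime instead of re-resolving
         transitive versions into the application classpath -->
    <scope>provided</scope>
  </dependency>
  <!-- Belt and braces: pin paranamer 2.8 explicitly so Avro's
       transitive paranamer 2.7 cannot win Maven's nearest-wins
       dependency mediation -->
  <dependency>
    <groupId>com.thoughtworks.paranamer</groupId>
    <artifactId>paranamer</artifactId>
    <version>2.8</version>
  </dependency>
</dependencies>
```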
[jira] [Commented] (SPARK-26146) CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782519#comment-16782519 ] Michael Heuer commented on SPARK-26146: --- [~srowen] Hey, wait a minute, the full example from the original submitter appears to be in github here [https://github.com/jgperrin/net.jgp.books.spark.ch01] This example has no explicit dependency on paranamer, and neither does Disq for that matter [https://github.com/disq-bio/disq/blob/5760a80b3322c537226a876e0df4f7710188f7b2/pom.xml] This is not the first instance I've seen (and reported) where Spark itself has dependencies that conflict or otherwise are not compatible with each other, and those do not manifest themselves at test scope in Spark CI, only when an external project has a compile scope dependency on Spark jars. > CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12 > -- > > Key: SPARK-26146 > URL: https://issues.apache.org/jira/browse/SPARK-26146 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Jean Georges Perrin >Priority: Major > > Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, > where it works ok with Scala v2.11. 
> When running a simple CSV ingestion like:{{ }} > {code:java} > // Creates a session on a local master > SparkSession spark = SparkSession.builder() > .appName("CSV to Dataset") > .master("local") > .getOrCreate(); > // Reads a CSV file with header, called books.csv, stores it in a > dataframe > Dataset df = spark.read().format("csv") > .option("header", "true") > .load("data/books.csv"); > {code} > With Scala 2.12, I get: > {code:java} > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582 > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103) > at > com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240) > ... > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37) > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21) > {code} > Where it works pretty smoothly if I switch back to 2.11. > Full example available at > [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can > modify pom.xml to change easily the Scala version in the property section: > {code:java} > > UTF-8 > 1.8 > 2.11 > 2.4.0 > {code} > > (ps. 
It's my first bug submission, so I hope I did not mess too much, be > tolerant if I did) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26146) CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782522#comment-16782522 ] Michael Heuer commented on SPARK-26146: --- It appears that this is a duplicate of SPARK-26583 and will be resolved in 2.4.1. > CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12 > -- > > Key: SPARK-26146 > URL: https://issues.apache.org/jira/browse/SPARK-26146 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Jean Georges Perrin >Priority: Major > > Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, > where it works ok with Scala v2.11. > When running a simple CSV ingestion like:{{ }} > {code:java} > // Creates a session on a local master > SparkSession spark = SparkSession.builder() > .appName("CSV to Dataset") > .master("local") > .getOrCreate(); > // Reads a CSV file with header, called books.csv, stores it in a > dataframe > Dataset df = spark.read().format("csv") > .option("header", "true") > .load("data/books.csv"); > {code} > With Scala 2.12, I get: > {code:java} > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582 > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103) > at > com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58) > at > 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240) > ... > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37) > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21) > {code} > Where it works pretty smoothly if I switch back to 2.11. > Full example available at > [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can > modify pom.xml to change easily the Scala version in the property section: > {code:java} > > UTF-8 > 1.8 > 2.11 > 2.4.0 > {code} > > (ps. It's my first bug submission, so I hope I did not mess too much, be > tolerant if I did) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25982) Dataframe write is non blocking in fair scheduling mode
[ https://issues.apache.org/jira/browse/SPARK-25982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782521#comment-16782521 ] Sean Owen commented on SPARK-25982: --- I don't understand this; you're running operations in parallel on purpose, but expecting one to wait for the other? > Dataframe write is non blocking in fair scheduling mode > --- > > Key: SPARK-25982 > URL: https://issues.apache.org/jira/browse/SPARK-25982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Ramandeep Singh >Priority: Major > > Hi, > I have noticed that expected behavior of dataframe write operation to block > is not working in fair scheduling mode. > Ideally when a dataframe write is occurring and a future is blocking on > AwaitResult, no other job should be started, but this is not the case. I have > noticed that other jobs are started when the partitions are being written. > > Regards, > Ramandeep Singh > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
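The disagreement above is really about where the blocking happens: if work is submitted from multiple threads, nothing waits between stages unless the caller gates on the futures explicitly. A plain-Python analog (not Spark code; ThreadPoolExecutor stands in for jobs submitted under the FAIR scheduler) of "wait for stage 1 before launching stage 2":

```python
from concurrent.futures import ThreadPoolExecutor, wait

def run_stage(tasks, pool):
    """Submit all tasks of one 'stage' and block until every one finishes."""
    futures = [pool.submit(t) for t in tasks]
    wait(futures)                          # gate: nothing after this runs early
    return [f.result() for f in futures]   # re-raises any task exception

events = []
with ThreadPoolExecutor(max_workers=4) as pool:
    run_stage([lambda i=i: events.append(("stage1", i)) for i in range(3)], pool)
    run_stage([lambda i=i: events.append(("stage2", i)) for i in range(3)], pool)

# Every stage-1 event precedes every stage-2 event, because wait() blocks
# the submitting thread -- the scheduler itself never enforces this order.
```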
[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782520#comment-16782520 ] Sean Owen commented on SPARK-26016: --- Yeah, the TL;DR from reading the code is that it writes UTF-8, right? df.write.text always writes UTF-8 strings; I think that's the most immediate explanation for all of this. df.read.text also only reads UTF-8, as it pushes down to the Hadoop input formats and Text works in terms of UTF-8. I think it's working insofar as ISO-8859-1's character mapping is a subset of UTF-8's. I'm not as sure why the write path doesn't accidentally work; BOMs? I think this is 80% right at least as an explanation. The problem is that this isn't documented anywhere I can find. I think people are just accustomed to always working in UTF-8 and the fact that you can read some common encodings as UTF-8 anyway. That much can easily be documented. It's exacerbated by the fact that there's an 'encoding' option on DataFrameReader/Writer, which does nothing but affect how the line separator is treated. (I don't quite get why you could use a non-UTF-8 encoding of your line separator if everything else assumes UTF-8!) [~maxgekk] it looks like this option was added in https://github.com/apache/spark/pull/20937/files#diff-cbd2ad0c69c1517fbbc42b587fd37e25R44 ; was it necessary for the text data source? for the reason above I'm not sure about how it works. I see it's also used for JSON, separately. Is it meant to be 'hidden'? > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Priority: Major > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. 
> If I read in a ISO-8859-1 encoded file and simply write out what was read; > the contents in the part file matches what was read. Which is great. > However, the second I use a map / mapPartitions function it looks like the > encoding is not correct. In addition a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
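The encoding behavior discussed above is easy to reproduce outside Spark. A small Python illustration of why an ISO-8859-1 file read as UTF-8 only round-trips for the ASCII subset, which is why such pipelines often appear to work until a non-ASCII character shows up:

```python
text = "café"                        # 'é' is outside ASCII

iso = text.encode("iso-8859-1")      # b'caf\xe9' -- one byte for 'é'
utf8 = text.encode("utf-8")          # b'caf\xc3\xa9' -- two bytes for 'é'

# Pure-ASCII text is byte-identical in both encodings, so mixed
# pipelines look fine as long as the data is effectively ASCII:
assert "cafe".encode("iso-8859-1") == "cafe".encode("utf-8")

# But ISO-8859-1 bytes beyond ASCII are not valid UTF-8, so a reader
# that assumes UTF-8 will reject or mangle them:
try:
    iso.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e.reason)
```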
[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`
[ https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782518#comment-16782518 ] Will Uto commented on SPARK-26943: -- Thanks for the information - I was hoping that if I e.g. installed PySpark v2.4.0 in each Python Virtual Environment on each cluster worker/node, then I could run against Spark v2.4.0, but it sounds like I would need to upgrade Spark through something like Cloudera Manager. > Weird behaviour with `.cache()` > --- > > Key: SPARK-26943 > URL: https://issues.apache.org/jira/browse/SPARK-26943 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Will Uto >Priority: Major > > > {code:java} > sdf.count(){code} > > works fine. However: > > {code:java} > sdf = sdf.cache() > sdf.count() > {code} > does not, and produces error > {code:java} > Py4JJavaError: An error occurred while calling o314.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 > in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 > (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable > number: "(N/A)" > at java.text.NumberFormat.parse(NumberFormat.java:350) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26045) Error in the spark 2.4 release package with the spark-avro_2.11 dependency
[ https://issues.apache.org/jira/browse/SPARK-26045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782508#comment-16782508 ] Sean Owen commented on SPARK-26045: --- Avro is already on 1.8.2 in 2.4.0. I think the problem may be that you're adding a separate copy of Avro, which might end up in a different classloader. You wouldn't want to do this in general, and it certainly is unlikely to work with Spark 2.3, which was on Avro 1.7.7. Try without adding an avro jar, and make sure you don't have a different Avro dependency in your app or environment. If that's not it, my final guess is that something in hive-exec has a copy of avro. But then I am not sure how the tests would pass. > Error in the spark 2.4 release package with the spark-avro_2.11 dependency > > > Key: SPARK-26045 > URL: https://issues.apache.org/jira/browse/SPARK-26045 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 > Environment: 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC > 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Oscar garcía >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > Hello, I have been having problems with the latest Spark 2.4 release: the avro > file read feature does not seem to be working. I have fixed it locally by building > the source code and updating the *avro-1.8.2.jar* in the *$SPARK_HOME*/jars/ > dependencies. > With the default Spark 2.4 release, when I try to read an avro file Spark > raises the following exception. 
> {code:java} > spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0 > scala> spark.read.format("avro").load("file.avro") > java.lang.NoSuchMethodError: > org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType; > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:51) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105 > {code} > Checksum: spark-2.4.0-bin-without-hadoop.tgz: 7670E29B 59EAE7A8 5DBC9350 > 085DD1E0 F056CA13 11365306 7A6A32E9 B607C68E A8DAA666 EF053350 008D0254 > 318B70FB DE8A8B97 6586CA19 D65BA2B3 FD7F919E > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.
[ https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782510#comment-16782510 ] Alfredo Gimenez commented on SPARK-24295: - Our current workaround, FWIW: We've added a streaming query listener that, at every query progress event, writes out a manual checkpoint (from the QueryProgressEvent sourceOffset member that contains the last used source offsets). We gracefully stop the stream job every 6 hours, purge the _spark_metadata and spark checkpoints, and upon restart check for the existence of the manual checkpoint and use it if available. We do the stop/purge/restart via Airflow, but it would be trivial to do this by looping around a stream awaitTermination with a provided timeout. A simple solution would be to just have an option to disable metadata file compaction that also allows old metadata files to be deleted after a delay. Currently it appears that all files stay around until compaction, upon which files older than the delay and not in the compaction are purged. > Purge Structured streaming FileStreamSinkLog metadata compact file data. > > > Key: SPARK-24295 > URL: https://issues.apache.org/jira/browse/SPARK-24295 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Iqbal Singh >Priority: Major > Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz > > > FileStreamSinkLog metadata logs are concatenated to a single compact file > after a defined compact interval. > For long-running jobs, the compact file can grow to tens of GBs, causing > slowness while reading the data from the FileStreamSinkLog dir, as Spark > defaults to the "__spark__metadata" dir for the read. > We need functionality to purge the compact file data. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
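The "delete old metadata files after a delay" option proposed in the comment above can be sketched generically. This is a hypothetical helper, not Spark's FileStreamSinkLog implementation; it knows nothing about compaction and would only be safe to run while the streaming query is stopped, as in the Airflow workaround described:

```python
import time
from pathlib import Path

def purge_old_metadata(metadata_dir, max_age_seconds):
    """Delete metadata log files older than max_age_seconds.

    Hypothetical sketch of the 'delete after a delay' idea: age is
    judged by file modification time, and the caller is responsible
    for ensuring no query is writing to the directory concurrently.
    """
    now = time.time()
    removed = []
    for f in sorted(Path(metadata_dir).iterdir()):
        if f.is_file() and now - f.stat().st_mtime > max_age_seconds:
            f.unlink()
            removed.append(f.name)
    return removed
```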
[jira] [Resolved] (SPARK-26017) SVD++ error rate is high in the test suite.
[ https://issues.apache.org/jira/browse/SPARK-26017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26017. --- Resolution: Invalid I don't think this states an actionable issue; high relative to what? > SVD++ error rate is high in the test suite. > --- > > Key: SPARK-26017 > URL: https://issues.apache.org/jira/browse/SPARK-26017 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.2 >Reporter: shahid >Priority: Major > Attachments: image-2018-11-12-20-41-49-370.png > > > In the test suite, "{color:#008000}Test SVD++ with mean square error on > training set", {color}error rate is quite high, even for large number of > iterations. > > !image-2018-11-12-20-41-49-370.png! > > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap
[ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782502#comment-16782502 ] Sean Owen commented on SPARK-26166: --- I agree there's a problem there. I don't think checkpoint() is appropriate as it would try to save the whole RDD to a local file. Although you would generally get the same order of evaluation for the same dataset, it's not guaranteed. cache()-ing the whole thing takes memory and, as you say, even that isn't guaranteed to work. The Scala implementation does it differently, and more correctly, in MLUtils.kFold. I think the solution is to call that from Pyspark to get the training, validation splits. Are you open to trying a fix? > CrossValidator.fit() bug, training and validation dataset may overlap > > > Key: SPARK-26166 > URL: https://issues.apache.org/jira/browse/SPARK-26166 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Xinyong Tian >Priority: Major > > In the code pyspark.ml.tuning.CrossValidator.fit(), after adding the random column > df = dataset.select("*", rand(seed).alias(randCol)) > we should add > df.checkpoint() > If df is not checkpointed, it will be recomputed each time the train and > validation dataframes need to be created. The order of rows in df, which > rand(seed) depends on, is not deterministic. Thus, each time, the random > column value could be different for a specific row, even with a seed. Note, > checkpoint() cannot be replaced with cache(), because when a node fails the > cached table must be recomputed, so the random numbers could be different. > This might especially be a problem when the input 'dataset' dataframe > results from a query including a 'where' clause. See below. 
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit] > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
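The fix Sean suggests relies on assigning rows to folds deterministically up front, instead of tagging rows with a lazily recomputed rand(seed) column. The idea, sketched in plain Python (this is not the Spark MLUtils.kFold implementation, just an illustration of the property it provides):

```python
import random

def k_fold_indices(n_rows, k, seed):
    """Deterministically split row indices into k (train, validation) pairs.

    The fold assignment is computed once and fixed, so train and
    validation sets can never overlap, even if the underlying data
    would otherwise be re-read and re-ordered between uses.
    """
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in range(n_rows)]
    folds = []
    for fold in range(k):
        validation = [i for i, a in enumerate(assignment) if a == fold]
        train = [i for i, a in enumerate(assignment) if a != fold]
        folds.append((train, validation))
    return folds
```

The same seed always yields the same split, and each pair partitions the rows exactly; it is the lazy re-evaluation of the random column, not the randomness itself, that causes the overlap described in the issue.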
[jira] [Updated] (SPARK-26058) Incorrect logging class loaded for all the logs.
[ https://issues.apache.org/jira/browse/SPARK-26058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26058: -- Priority: Minor (was: Major) Description: In order to make the bug more evident, please change the log4j configuration to use this pattern, instead of default. {code} log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: %m%n {code} The logging class recorded in the log is : {code} INFO org.apache.spark.internal.Logging$class {code} instead of the actual logging class. Sample output of the logs, after applying the above log4j configuration change. {code} 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark web UI at http://9.234.206.241:4040 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MapOutputTrackerMasterEndpoint stopped! 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore cleared 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager stopped 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManagerMaster stopped 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: OutputCommitCoordinator stopped! 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Successfully stopped SparkContext {code} This happens due to the fact, actual logging is done inside the trait logging and that is picked up as logging class for the log message. It can either be corrected by using `log` variable directly instead of delegator logInfo methods or if we would like to not miss out on theoretical performance benefits of pre-checking logXYZ.isEnabled, then we can use scala macro to inject those checks. Later has a disadvantage, that during debugging wrong line number information may be produced. was: In order to make the bug more evident, please change the log4j configuration to use this pattern, instead of default. 
{code} log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: %m%n {code} The logging class recorded in the log is : {code} INFO org.apache.spark.internal.Logging$class {code} instead of the actual logging class. Sample output of the logs, after applying the above log4j configuration change. {code} 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark web UI at http://9.234.206.241:4040 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MapOutputTrackerMasterEndpoint stopped! 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore cleared 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager stopped 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManagerMaster stopped 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: OutputCommitCoordinator stopped! 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Successfully stopped SparkContext {code} This happens due to the fact, actual logging is done inside the trait logging and that is picked up as logging class for the log message. It can either be corrected by using `log` variable directly instead of delegator logInfo methods or if we would like to not miss out on theoretical performance benefits of pre-checking logXYZ.isEnabled, then we can use scala macro to inject those checks. Later has a disadvantage, that during debugging wrong line number information may be produced. Issue Type: Improvement (was: Bug) I don't think that's a bug, but if there's a way to get the logging class properly without changing all the logging code, OK. > Incorrect logging class loaded for all the logs. 
> > > Key: SPARK-26058 > URL: https://issues.apache.org/jira/browse/SPARK-26058 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Prashant Sharma >Priority: Minor > > In order to make the bug more evident, please change the log4j configuration > to use this pattern, instead of default. > {code} > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: > %m%n > {code} > The logging class recorded in the log is : > {code} > INFO org.apache.spark.internal.Logging$class > {code} > instead of the actual logging class. > Sample output of the logs, after applying the above log4j configuration > change. > {code} > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark > web UI at http://9.234.206.241:4040 > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: > MapOutputTrackerMasterEndpoint stopped! > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore > cleared > 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager > stopped > 18/11/14 13:44:48
[jira] [Resolved] (SPARK-26128) filter breaks input_file_name
[ https://issues.apache.org/jira/browse/SPARK-26128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26128. --- Resolution: Cannot Reproduce > filter breaks input_file_name > - > > Key: SPARK-26128 > URL: https://issues.apache.org/jira/browse/SPARK-26128 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.2 >Reporter: Paul Praet >Priority: Minor > > This works: > {code:java} > scala> > spark.read.parquet("/tmp/newparquet").select(input_file_name).show(5,false) > +-+ > |input_file_name() > | > +-+ > |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet| > |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet| > |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet| > |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet| > |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet| > +-+ > {code} > When adding a filter: > {code:java} > scala> > spark.read.parquet("/tmp/newparquet").where("key.station='XYZ'").select(input_file_name()).show(5,false) > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26146) CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26146. --- Resolution: Cannot Reproduce Is that code from Spark? It doesn't quite look like it from the stack trace. It's a paranamer problem with Java 8 that I think is triggered by Scala 2.12, which uses lambdas for closures. This was updated in Spark 2.4 though in SPARK-22128 for this reason. I'm wondering if you are using a different paranamer dependency in your app, either in the non-Spark or Spark code. Update to 2.8. In any event, at least, the CSV tests are fine in Java 8 + Scala 2.12 + Spark 2.4. > CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12 > -- > > Key: SPARK-26146 > URL: https://issues.apache.org/jira/browse/SPARK-26146 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 >Reporter: Jean Georges Perrin >Priority: Major > > Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, > where it works ok with Scala v2.11. 
> When running a simple CSV ingestion like:{{ }} > {code:java} > // Creates a session on a local master > SparkSession spark = SparkSession.builder() > .appName("CSV to Dataset") > .master("local") > .getOrCreate(); > // Reads a CSV file with header, called books.csv, stores it in a > dataframe > Dataset df = spark.read().format("csv") > .option("header", "true") > .load("data/books.csv"); > {code} > With Scala 2.12, I get: > {code:java} > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582 > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338) > at > com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103) > at > com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58) > at > com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58) > at > scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240) > ... > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37) > at > net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21) > {code} > Where it works pretty smoothly if I switch back to 2.11. > Full example available at > [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can > modify pom.xml to change easily the Scala version in the property section: > {code:java} > > UTF-8 > 1.8 > 2.11 > 2.4.0 > {code} > > (ps. 
It's my first bug submission, so I hope I did not mess too much, be > tolerant if I did) > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
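Sean's "Update to 2.8" suggestion above refers to the paranamer version. Assuming a Maven build like the linked example project, pinning the transitive dependency could be sketched as:

```xml
<!-- Hypothetical pom.xml fragment: force paranamer 2.8 so that
     jackson-module-scala can read Scala 2.12's lambda-heavy bytecode
     without the ArrayIndexOutOfBoundsException seen above. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.thoughtworks.paranamer</groupId>
      <artifactId>paranamer</artifactId>
      <version>2.8</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```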
[jira] [Commented] (SPARK-26152) Flaky test: BroadcastSuite
[ https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782505#comment-16782505 ] Sean Owen commented on SPARK-26152: --- I haven't seen this one in a while, FWIW. > Flaky test: BroadcastSuite > -- > > Key: SPARK-26152 > URL: https://issues.apache.org/jira/browse/SPARK-26152 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Critical > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627 > (2018-11-16) > {code} > BroadcastSuite: > - Using TorrentBroadcast locally > - Accessing TorrentBroadcast variables from multiple threads > - Accessing TorrentBroadcast variables in a local cluster (encryption = off) > java.util.concurrent.RejectedExecutionException: Task > scala.concurrent.impl.CallbackRunnable@59428a1 rejected from > java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = > 1, active threads = 1, queued tasks = 0, completed tasks = 0] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) > at > scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > at scala.concurrent.Promise.complete(Promise.scala:49) > at 
scala.concurrent.Promise.complete$(Promise.scala:48) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183) > at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63) > at > scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12) > at > scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81) > at > scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55) > at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870) > at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106) > at > scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103) > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868) > at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284) > at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) > at scala.concurrent.Promise.complete(Promise.scala:49) > at scala.concurrent.Promise.complete$(Promise.scala:48) > at > scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183) > at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29) > at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > java.util.concurrent.RejectedExecutionException: Task > 
scala.concurrent.impl.CallbackRunnable@40a5bf17 rejected from > java.util.concurrent.ThreadPoolExecutor@5a73967[Shutting down, pool size = 1, > active threads = 1, queued tasks = 0, completed tasks = 0] > at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047) > at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369) > at > java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668) > at >
[jira] [Resolved] (SPARK-26162) ALS results vary with user or item ID encodings
[ https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26162. --- Resolution: Not A Problem I don't think it's a bug; the distributed computation and blocking might produce slightly different results depending on the order it encounters the vectors. Reopen if you have more detail that shows the results are meaningfully different. > ALS results vary with user or item ID encodings > --- > > Key: SPARK-26162 > URL: https://issues.apache.org/jira/browse/SPARK-26162 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Pablo J. Villacorta >Priority: Major > > When calling ALS.fit() with the same seed on a dataset, the results (both the > latent factors matrices and the accuracy of the recommendations) differ when > we change the labels used to encode the users or items. The code example > below illustrates this by just changing user ID 1 or 2 to an unused ID like > 30. The user factors matrix changes, but not only the rows corresponding to > users 1 or 2 but also the other rows. > Is this the intended behaviour? 
> {code:java} > val r = scala.util.Random > r.setSeed(123456) > val trainDataset1 = spark.sparkContext.parallelize( > (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // > users go from 0 to 19 > ).toDF("user", "item", "rating") > val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, > 30).otherwise(col("user"))) > val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, > 30).otherwise(col("user"))) > val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map( > _.groupBy("user").agg(collect_list("item").alias("watched")) > ) > val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, > trainDataset3).map(new ALS().setSeed(12345).fit(_)) > als1.userFactors.show(5, false) > als2.userFactors.show(5, false) > als3.userFactors.show(5, false){code} > If we ask for recommendations and compare them with a test dataset also > modified accordingly (in this example, the test dataset is exactly the train > dataset) the results also differ: > {code:java} > val recommendations = Array(als1, als2, als3).map(x => > x.recommendForAllUsers(20).map{ > case Row(user: Int, recommendations: WrappedArray[Row]) => { > val items = recommendations.map{case Row(item: Int, score: Float) > => item} > (user, items) > } > }.toDF("user", "recommendations") > ) > val predictionsAndActualRDD = testDatasets.zip(recommendations).map{ > case (testDataset, recommendationsDF) => > testDataset.join(recommendationsDF, "user") > .rdd.map(r => { > > (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array, > r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array > ) > }) > } > val metrics = predictionsAndActualRDD.map(new RankingMetrics(_)) > println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}") > println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}") > println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}") > {code} > EDIT: The results also change if we 
just swap the IDs of some users, like: > {code:java} > val trainDataset4 = trainDataset1.withColumn("user", when(col("user")===1, 4) > .when(col("user")===4, > 1).otherwise(col("user"))) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator
[ https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26183. --- Resolution: Not A Problem This doesn't look like a bug. It arises because you are (inadvertently) serializing the accumulator because it's a field in your app. It's not meant to be anywhere but the driver. The serializer is reading it while you're writing it, as it doesn't know about your locks. > ConcurrentModificationException when using Spark collectionAccumulator > -- > > Key: SPARK-26183 > URL: https://issues.apache.org/jira/browse/SPARK-26183 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Rob Dawson >Priority: Major > > I'm trying to run a Spark-based application on an Azure HDInsight on-demand > cluster, and am seeing lots of SparkExceptions (caused by > ConcurrentModificationExceptions) being logged. The application runs without > these errors when I start a local Spark instance. > I've seen reports of [similar errors when using > accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code > is indeed using a CollectionAccumulator, however I have placed synchronized > blocks everywhere I use it, and it makes no difference. The accumulator code > looks like this: > {code:scala} > class MySparkClass(sc : SparkContext) { > val myAccumulator = sc.collectionAccumulator[MyRecord] > override def add(record: MyRecord) = { > synchronized { > myAccumulator.add(record) > } > } > override def endOfBatch() = { > synchronized { > myAccumulator.value.asScala.foreach((record: MyRecord) => { > processIt(record) > }) > } > } > }{code} > The exceptions don't cause the application to fail, however when the code > tries to read values out of the accumulator it is empty and {{processIt}} is > never called. 
> {code} > 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in > heartbeater > org.apache.spark.SparkException: Exception thrown in awaitResult: > at > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785) > at > org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814) > at > org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814) > at > org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) > at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:770) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at > java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441) > at > java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
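Sean's resolution above points at the real issue: the accumulator was a field of a serialized object, so the heartbeat serializer read it concurrently with driver-side writes. A minimal sketch of the safe pattern (my own illustration, not code from the ticket; class and record names are hypothetical, and it needs a Spark runtime to run):

```scala
import scala.collection.JavaConverters._
import org.apache.spark.sql.SparkSession

object AccumulatorOnDriverSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("acc-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hold the accumulator in a local val rather than a field of a
    // serializable class, so task closures capture only the accumulator.
    val acc = sc.collectionAccumulator[String]("records")

    // Executors only ever add; no synchronized blocks are needed.
    sc.parallelize(1 to 10).foreach(i => acc.add(s"record-$i"))

    // Read .value on the driver, only after the action has completed.
    acc.value.asScala.foreach(println)
    spark.stop()
  }
}
```

The key change versus the reporter's {{MySparkClass}} is that nothing holding the accumulator gets serialized into tasks, so the driver is the only reader of {{value}}.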
[jira] [Resolved] (SPARK-26214) Add "broadcast" method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26214. --- Resolution: Won't Fix > Add "broadcast" method to DataFrame > --- > > Key: SPARK-26214 > URL: https://issues.apache.org/jira/browse/SPARK-26214 > Project: Spark > Issue Type: Wish > Components: SQL >Affects Versions: 2.4.0 >Reporter: Thomas Decaux >Priority: Trivial > Labels: broadcast, dataframe > > As discussed at > [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,] > it's possible to force broadcast of a DataFrame, even if its total size is greater > than {{spark.sql.autoBroadcastJoinThreshold}}. > But this is not trivial for beginners, because there is no "broadcast" method (I > know, I am lazy ...). > We could add this method, with a WARN if size is greater than the threshold. > (if it's an easy one, I could do it?) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
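For context on the Won't Fix above: the existing {{org.apache.spark.sql.functions.broadcast}} hint already covers the request, as a function rather than a DataFrame method. A sketch (hypothetical table names; requires a Spark runtime):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("bcast-sketch").getOrCreate()
    import spark.implicits._

    val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
    val small = Seq((1, "x"), (2, "y")).toDF("id", "w")

    // broadcast() marks `small` for a broadcast hash join, regardless of
    // spark.sql.autoBroadcastJoinThreshold.
    large.join(broadcast(small), "id").show()
    spark.stop()
  }
}
```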
[jira] [Commented] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.
[ https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782496#comment-16782496 ] Sean Owen commented on SPARK-26224: --- The most realistic thing I can imagine is exposing `withColumns`. But for this use case, there are pretty easy workarounds, like just mapping the DataFrame Rows to contain a bunch more 0s and specifying your new schema. > Results in stackOverFlowError when trying to add 3000 new columns using > withColumn function of dataframe. > - > > Key: SPARK-26224 > URL: https://issues.apache.org/jira/browse/SPARK-26224 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 > Environment: On macbook, used Intellij editor. Ran the above sample > code as unit test. >Reporter: Dorjee Tsering >Priority: Minor > > Reproduction step: > Run this sample code on your laptop. I am trying to add 3000 new columns to a > base dataframe with 1 column. > > > {code:java} > import spark.implicits._ > val newColumnsToBeAdded : Seq[StructField] = for (i <- 1 to 3000) yield new > StructField("field_" + i, DataTypes.LongType) > val baseDataFrame: DataFrame = Seq((1)).toDF("employee_id") > val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) => > df.withColumn(newColumn.name, lit(0))) > result.show(false) > > {code} > Ends up with following stacktrace: > java.lang.StackOverflowError > at > scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57) > at > scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52) > at > scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:233) > at scala.collection.immutable.List.map(List.scala:296) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
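Sean's suggested workaround above can be sketched as a single wide {{select}}: one projection node in the plan instead of 3000 nested ones from repeated {{withColumn}} calls. This is my own illustration of the idea, not code from the ticket, and it needs a Spark runtime:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object WideSelectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("wide-select").getOrCreate()
    import spark.implicits._

    val base = Seq(1).toDF("employee_id")

    // Build all 3000 new columns up front, then add them in one select,
    // avoiding the deep plan tree that overflows the stack.
    val newCols = (1 to 3000).map(i => lit(0L).as(s"field_$i"))
    val result = base.select(col("employee_id") +: newCols: _*)

    result.printSchema()
    spark.stop()
  }
}
```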
[jira] [Resolved] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark
[ https://issues.apache.org/jira/browse/SPARK-26229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26229. --- Resolution: Won't Fix It's pretty internal. I don't think we'd want to expose it further, especially as it can't account for Python memory usage. > Expose SizeEstimator as a developer API in pyspark > -- > > Key: SPARK-26229 > URL: https://issues.apache.org/jira/browse/SPARK-26229 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Ranjith Pulluru >Priority: Minor > > SizeEstimator is not available in pyspark. > This api will be helpful for understanding the memory footprint of an object. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving
[ https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782494#comment-16782494 ] Sean Owen commented on SPARK-26247: --- I tend to think of this problem as about as solved as it will be by MLeap. There just hasn't been a representation that 'rules them all', so a library to translate Spark models is probably about as good as it gets. PMML is the best shot at a format and it doesn't cover enough; PFA seems similar in this regard. > SPIP - ML Model Extension for no-Spark MLLib Online Serving > --- > > Key: SPARK-26247 > URL: https://issues.apache.org/jira/browse/SPARK-26247 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Anne Holler >Priority: Major > Labels: SPIP > Attachments: SPIPMlModelExtensionForOnlineServing.pdf > > > This ticket tracks an SPIP to improve model load time and model serving > interfaces for online serving of Spark MLlib models. The SPIP is here > [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub] > > The improvement opportunity exists in all versions of spark. We developed > our set of changes wrt version 2.1.0 and can port them forward to other > versions (e.g., we have ported them forward to 2.3.2). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions
[ https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782491#comment-16782491 ] Sean Owen commented on SPARK-26257: --- I'm not sure how much of the Pyspark and SparkR integration is meaningfully refactorable to support other languages. They both wrap Spark, effectively. I'm not sure therefore it's necessary or a good thing to further modify Spark for these things, if they already manage to support Python and R. With the exception of C#, those are niche languages anyway. I'd be open to seeing concrete proposals. > SPIP: Interop Support for Spark Language Extensions > --- > > Key: SPARK-26257 > URL: https://issues.apache.org/jira/browse/SPARK-26257 > Project: Spark > Issue Type: Improvement > Components: PySpark, R, Spark Core >Affects Versions: 2.4.0 >Reporter: Tyson Condie >Priority: Minor > > h2. ** Background and Motivation: > There is a desire for third party language extensions for Apache Spark. Some > notable examples include: > * C#/F# from project Mobius [https://github.com/Microsoft/Mobius] > * Haskell from project sparkle [https://github.com/tweag/sparkle] > * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl] > Presently, Apache Spark supports Python and R via a tightly integrated > interop layer. It would seem that much of that existing interop layer could > be refactored into a clean surface for general (third party) language > bindings, such as the above mentioned. More specifically, could we generalize > the following modules: > * Deploy runners (e.g., PythonRunner and RRunner) > * DataFrame Executors > * RDD operations? > The last being questionable: integrating third party language extensions at > the RDD level may be too heavy-weight and unnecessary given the preference > towards the Dataframe abstraction. 
> The main goals of this effort would be: > * Provide a clean abstraction for third party language extensions making it > easier to maintain (the language extension) with the evolution of Apache Spark > * Provide guidance to third party language authors on how a language > extension should be implemented > * Provide general reusable libraries that are not specific to any language > extension > * Open the door to developers that prefer alternative languages > * Identify and clean up common code shared between Python and R interops > h2. Target Personas: > Data Scientists, Data Engineers, Library Developers > h2. Goals: > Data scientists and engineers will have the opportunity to work with Spark in > languages other than what’s natively supported. Library developers will be > able to create language extensions for Spark in a clean way. The interop > layer should also provide guidance for developing language extensions. > h2. Non-Goals: > The proposal does not aim to create an actual language extension. Rather, it > aims to provide a stable interop layer for third party language extensions to > dock. > h2. Proposed API Changes: > Much of the work will involve generalizing existing interop APIs for PySpark > and R, specifically for the Dataframe API. For instance, it would be good to > have a general deploy.Runner (similar to PythonRunner) for language extension > efforts. In Spark SQL, it would be good to have a general InteropUDF and > evaluator (similar to BatchEvalPythonExec). > Low-level RDD operations should not be needed in this initial offering; > depending on the success of the interop layer and with proper demand, RDD > interop could be added later. However, one open question is supporting a > subset of low-level functions that are core to ETL e.g., transform. > h2. 
Optional Design Sketch: > The work would be broken down into two top-level phases: > Phase 1: Introduce general interop API for deploying a driver/application, > running an interop UDF along with any other low-level transformations that > aid with ETL. > Phase 2: Port existing Python and R language extensions to the new interop > layer. This port should be contained solely to the Spark core side, and all > protocols specific to Python and R should not change e.g., Python should > continue to use py4j as the protocol between the Python process and core > Spark. The port itself should be contained to a handful of files e.g., some > examples for Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+, > PythonRDD (possibly), and will mostly involve refactoring common logic into > abstract implementations and utilities. > h2. Optional Rejected Designs: > The clear alternative is the status quo; developers that want to provide a > third-party language extension to Spark do so directly; often by extending > existing Python classes and overriding the portions that are relevant to the > new extension. Not only is
[jira] [Commented] (SPARK-26261) Spark does not check completeness temporary file
[ https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782489#comment-16782489 ] Sean Owen commented on SPARK-26261: --- Why would you truncate the file -- how would this happen normally? > Spark does not check completeness temporary file > - > > Key: SPARK-26261 > URL: https://issues.apache.org/jira/browse/SPARK-26261 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: Jialin LIu >Priority: Minor > > Spark does not check temporary files' completeness. When persisting to disk > is enabled on some RDDs, a bunch of temporary files will be created in the > blockmgr folder. The block manager is able to detect missing blocks but is > not able to detect file content being modified during execution. > Our initial test shows that if we truncate the block file before it is used > by executors, the program will finish without detecting any error, but the > result content is totally wrong. > We believe there should be a file checksum on every RDD file block and these > files should be protected by checksums. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26266) Update to Scala 2.12.8
[ https://issues.apache.org/jira/browse/SPARK-26266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26266. --- Resolution: Fixed Assignee: Sean Owen (was: Yuming Wang) Fix Version/s: 3.0.0 2.4.1 > Update to Scala 2.12.8 > -- > > Key: SPARK-26266 > URL: https://issues.apache.org/jira/browse/SPARK-26266 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > Labels: release-notes > Fix For: 2.4.1, 3.0.0 > > > [~yumwang] notes that Scala 2.12.8 is out and fixes two minor issues: > Don't reject views with result types which are TypeVars (#7295) > Don't emit static forwarders (which simplify the use of methods in top-level > objects from Java) for bridge methods (#7469) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26272) Please delete old releases from mirroring system
[ https://issues.apache.org/jira/browse/SPARK-26272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26272. --- Resolution: Fixed Fix Version/s: 2.3.3 Yep, we do this regularly, and do it alongside releases now, and 2.3.1 was zapped a while ago. 2.3.2 should be removed; not sure why it wasn't with 2.3.3's release. > Please delete old releases from mirroring system > > > Key: SPARK-26272 > URL: https://issues.apache.org/jira/browse/SPARK-26272 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, Web UI >Affects Versions: 2.3.1 >Reporter: Sebb >Priority: Major > Fix For: 2.3.3 > > > The release notes say 2.3.2 is the latest release for the 2.3.x line. > As such, earlier releases such as 2.3.1 should no longer be hosted on the > mirrors. > Please drop https://www.apache.org/dist/spark/spark-2.3.1/ and adjust any > remaining download links accordingly -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27029) Update Thrift to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782441#comment-16782441 ] Apache Spark commented on SPARK-27029: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/23935 > Update Thrift to 0.12.0 > --- > > Key: SPARK-27029 > URL: https://issues.apache.org/jira/browse/SPARK-27029 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > We should update to Thrift 0.12.0 to pick up security and bug fixes. It > appears to be compatible with the current build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27029) Update Thrift to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27029: Assignee: Sean Owen (was: Apache Spark) > Update Thrift to 0.12.0 > --- > > Key: SPARK-27029 > URL: https://issues.apache.org/jira/browse/SPARK-27029 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > We should update to Thrift 0.12.0 to pick up security and bug fixes. It > appears to be compatible with the current build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27029) Update Thrift to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782439#comment-16782439 ] Apache Spark commented on SPARK-27029: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/23935 > Update Thrift to 0.12.0 > --- > > Key: SPARK-27029 > URL: https://issues.apache.org/jira/browse/SPARK-27029 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > We should update to Thrift 0.12.0 to pick up security and bug fixes. It > appears to be compatible with the current build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27029) Update Thrift to 0.12.0
[ https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27029: Assignee: Apache Spark (was: Sean Owen) > Update Thrift to 0.12.0 > --- > > Key: SPARK-27029 > URL: https://issues.apache.org/jira/browse/SPARK-27029 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > We should update to Thrift 0.12.0 to pick up security and bug fixes. It > appears to be compatible with the current build. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository
[ https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26048. --- Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 3.0.0 2.4.1 This is resolved via https://github.com/apache/spark/pull/23931 > Flume connector for Spark 2.4 does not exist in Maven repository > > > Key: SPARK-26048 > URL: https://issues.apache.org/jira/browse/SPARK-26048 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Aki Tanaka >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > Flume connector for Spark 2.4 does not exist in the Maven repository. > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume] > > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink] > These packages will be removed in Spark 3. But Spark 2.4 branch still has > these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository
[ https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782435#comment-16782435 ] Dongjoon Hyun edited comment on SPARK-26048 at 3/2/19 4:18 PM: --- This is resolved via https://github.com/apache/spark/pull/23931 Please note that the code is merged into master branch only. `branch-2.4` release code has `-Pflume` already. As [~vanzin] explained, we should use the release script in the `master` branch and [~dbtsai] will use the script for next 2.4.1-RC to include `flume`. So, I set `2.4.1/3.0.0` as `Fixed Versions` (although the code is merged into only `master`.) was (Author: dongjoon): This is resolved via https://github.com/apache/spark/pull/23931 > Flume connector for Spark 2.4 does not exist in Maven repository > > > Key: SPARK-26048 > URL: https://issues.apache.org/jira/browse/SPARK-26048 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Aki Tanaka >Assignee: Marcelo Vanzin >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > Flume connector for Spark 2.4 does not exist in the Maven repository. > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume] > > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink] > These packages will be removed in Spark 3. But Spark 2.4 branch still has > these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`
[ https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782429#comment-16782429 ] Sean Owen commented on SPARK-26943: --- You can download and run the Spark distro wherever you like. If you want to execute against a cluster of course that cluster has to be updated. I suppose you can spin up a hosted version of Spark with a newer version somewhere too. > Weird behaviour with `.cache()` > --- > > Key: SPARK-26943 > URL: https://issues.apache.org/jira/browse/SPARK-26943 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Will Uto >Priority: Major > > > {code:java} > sdf.count(){code} > > works fine. However: > > {code:java} > sdf = sdf.cache() > sdf.count() > {code} > does not, and produces error > {code:java} > Py4JJavaError: An error occurred while calling o314.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 > in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 > (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable > number: "(N/A)" > at java.text.NumberFormat.parse(NumberFormat.java:350) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
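[Editor's note] The failure pattern in SPARK-26943 (a count that succeeds uncached but fails after `.cache()`) is characteristic of lazy evaluation: without the cache, Spark can prune columns and never parse the malformed `"(N/A)"` value, while caching materializes every column and surfaces the parse error. A stdlib-only Python sketch of the same effect — the data and parser below are illustrative, not Spark's:

```python
# Simulate lazy column pruning vs. eager materialization with plain Python.
# Rows carry a numeric column that sometimes holds the junk value "(N/A)".
rows = [{"id": i, "amount": "(N/A)" if i == 3 else str(i * 10)} for i in range(5)]

def parse_amount(raw):
    # Stand-in for a per-column parser that chokes on junk input, like the
    # java.text.NumberFormat.parse call in the stack trace above.
    return float(raw)

# An uncached count only needs row existence: the bad column is never
# parsed, so the query succeeds.
count = sum(1 for _ in rows)

# Caching must materialize every column of every row, so the same data
# now raises -- mirroring the ParseException that appears only after .cache().
try:
    cached = [(row["id"], parse_amount(row["amount"])) for row in rows]
    cache_failed = False
except ValueError:
    cache_failed = True
```

The practical takeaway for the reporter's situation is that the bad value was always in the data; `.cache()` merely forced it to be parsed.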
[jira] [Updated] (SPARK-26274) Download page must link to https://www.apache.org/dist/spark for current releases
[ https://issues.apache.org/jira/browse/SPARK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26274: -- Priority: Minor (was: Major) Component/s: (was: Web UI) (was: Deploy) Just saw this one -- sounds fine to me. See https://github.com/apache/spark-website/pull/184 > Download page must link to https://www.apache.org/dist/spark for current > releases > - > > Key: SPARK-26274 > URL: https://issues.apache.org/jira/browse/SPARK-26274 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.3.2, 2.4.0 >Reporter: Sebb >Priority: Minor > > The download page currently uses the archive server: > https://archive.apache.org/dist/spark/... > for all sigs and hashes. > This is fine for archived releases, however current ones must link to the > mirror system, i.e. > https://www.apache.org/dist/spark/... > Also, the page does not link directly to the hash or sig. > This makes it very difficult for the user, as they have to choose the correct > file. > The download page must link directly to the actual sig or hash. > Ideally do so for the archived releases as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`
[ https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782418#comment-16782418 ] Will Uto commented on SPARK-26943: -- Thanks for explanation [~srowen], makes sense - I think this is why I couldn't reproduce it locally (on a smaller dataset). Out of curiosity, is there a way to run a newer version of Spark on a cluster e.g. within Python Virtual Environments, or do I have to upgrade an entire cluster? > Weird behaviour with `.cache()` > --- > > Key: SPARK-26943 > URL: https://issues.apache.org/jira/browse/SPARK-26943 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0 >Reporter: Will Uto >Priority: Major > > > {code:java} > sdf.count(){code} > > works fine. However: > > {code:java} > sdf = sdf.cache() > sdf.count() > {code} > does not, and produces error > {code:java} > Py4JJavaError: An error occurred while calling o314.count. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 > in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 > (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable > number: "(N/A)" > at java.text.NumberFormat.parse(NumberFormat.java:350) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27007) add rawPrediction to OneVsRest in PySpark
[ https://issues.apache.org/jira/browse/SPARK-27007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27007. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23910 [https://github.com/apache/spark/pull/23910] > add rawPrediction to OneVsRest in PySpark > - > > Key: SPARK-27007 > URL: https://issues.apache.org/jira/browse/SPARK-27007 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Add RawPrediction to OneVsRest in PySpark to make it consistent with scala > implementation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27007) add rawPrediction to OneVsRest in PySpark
[ https://issues.apache.org/jira/browse/SPARK-27007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27007: - Assignee: Huaxin Gao > add rawPrediction to OneVsRest in PySpark > - > > Key: SPARK-27007 > URL: https://issues.apache.org/jira/browse/SPARK-27007 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Add RawPrediction to OneVsRest in PySpark to make it consistent with scala > implementation -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26860) RangeBetween docs appear to be wrong
[ https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-26860. --- Resolution: Not A Problem > RangeBetween docs appear to be wrong > - > > Key: SPARK-26860 > URL: https://issues.apache.org/jira/browse/SPARK-26860 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Shelby Vanhooser >Priority: Major > Labels: docs, easyfix, python > Original Estimate: 1h > Remaining Estimate: 1h > > The docs describing > [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween] > for PySpark appear to be duplicates of > [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween] > even though these are functionally different windows. rowsBetween > references preceding and succeeding rows, but rangeBetween is based on the > values in these rows. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27032) Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision
[ https://issues.apache.org/jira/browse/SPARK-27032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27032: Assignee: Sean Owen (was: Apache Spark) > Flaky test: > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: > metadata directory collision > --- > > Key: SPARK-27032 > URL: https://issues.apache.org/jira/browse/SPARK-27032 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > Locally and on Jenkins, I've frequently seen this test fail: > {code} > Error Message > The await method on Waiter timed out. > Stacktrace > org.scalatest.exceptions.TestFailedException: The await method on > Waiter timed out. > at org.scalatest.concurrent.Waiters$Waiter.awaitImpl(Waiters.scala:406) > at org.scalatest.concurrent.Waiters$Waiter.await(Waiters.scala:540) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19(HDFSMetadataLogSuite.scala:158) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19$adapted(HDFSMetadataLogSuite.scala:133) > ... > {code} > See for example > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6057/testReport/ > There aren't obvious errors or problems with the test. Because it passes > sometimes, my guess is that the timeout is simply too short or the test too > long. I'd like to try reducing the number of threads/batches in the test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27032) Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision
[ https://issues.apache.org/jira/browse/SPARK-27032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27032: Assignee: Apache Spark (was: Sean Owen) > Flaky test: > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: > metadata directory collision > --- > > Key: SPARK-27032 > URL: https://issues.apache.org/jira/browse/SPARK-27032 > Project: Spark > Issue Type: Test > Components: Spark Core, Tests >Affects Versions: 3.0.0 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Critical > > Locally and on Jenkins, I've frequently seen this test fail: > {code} > Error Message > The await method on Waiter timed out. > Stacktrace > org.scalatest.exceptions.TestFailedException: The await method on > Waiter timed out. > at org.scalatest.concurrent.Waiters$Waiter.awaitImpl(Waiters.scala:406) > at org.scalatest.concurrent.Waiters$Waiter.await(Waiters.scala:540) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19(HDFSMetadataLogSuite.scala:158) > at > org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19$adapted(HDFSMetadataLogSuite.scala:133) > ... > {code} > See for example > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6057/testReport/ > There aren't obvious errors or problems with the test. Because it passes > sometimes, my guess is that the timeout is simply too short or the test too > long. I'd like to try reducing the number of threads/batches in the test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27032) Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision
Sean Owen created SPARK-27032: - Summary: Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision Key: SPARK-27032 URL: https://issues.apache.org/jira/browse/SPARK-27032 Project: Spark Issue Type: Test Components: Spark Core, Tests Affects Versions: 3.0.0 Reporter: Sean Owen Assignee: Sean Owen Locally and on Jenkins, I've frequently seen this test fail: {code} Error Message The await method on Waiter timed out. Stacktrace org.scalatest.exceptions.TestFailedException: The await method on Waiter timed out. at org.scalatest.concurrent.Waiters$Waiter.awaitImpl(Waiters.scala:406) at org.scalatest.concurrent.Waiters$Waiter.await(Waiters.scala:540) at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19(HDFSMetadataLogSuite.scala:158) at org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19$adapted(HDFSMetadataLogSuite.scala:133) ... {code} See for example https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6057/testReport/ There aren't obvious errors or problems with the test. Because it passes sometimes, my guess is that the timeout is simply too short or the test too long. I'd like to try reducing the number of threads/batches in the test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
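[Editor's note] The diagnosis in SPARK-27032 — the test passes sometimes, so the await timeout is likely just too tight for the amount of work — can be illustrated with a minimal waiter pattern. The thread and batch counts below are made up for the demonstration, not the suite's real values:

```python
import threading
import time

def run_batches(n_batches, per_batch_s, timeout_s):
    """Run n_batches fake 'writes' on a worker thread and await completion,
    the way a scalatest Waiter awaits dismissals. Returns True if the
    await finished inside the timeout, False if it timed out."""
    done = threading.Event()

    def worker():
        for _ in range(n_batches):
            time.sleep(per_batch_s)  # stand-in for one metadata-log batch
        done.set()

    threading.Thread(target=worker, daemon=True).start()
    return done.wait(timeout=timeout_s)

# Too many batches for the deadline: the await times out, not because the
# logic is wrong but because the work takes longer than the timeout allows.
flaky = run_batches(n_batches=50, per_batch_s=0.01, timeout_s=0.2)

# Fewer batches (the fix proposed above): same logic, comfortably in time.
stable = run_batches(n_batches=5, per_batch_s=0.01, timeout_s=0.2)
```

Reducing the work (or raising the timeout) removes the flakiness without changing what the test verifies.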
[jira] [Assigned] (SPARK-27031) Avoid double formatting in timestampToString
[ https://issues.apache.org/jira/browse/SPARK-27031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27031: Assignee: Apache Spark > Avoid double formatting in timestampToString > > > Key: SPARK-27031 > URL: https://issues.apache.org/jira/browse/SPARK-27031 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > The method DateTimeUtils.timestampToString performs converting of its input > to string twice: > - First time by converting microseconds to java.sql.Timestamp and toString: > https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L103 > - Second time by using provided formatter: > https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L104 > This could be avoided by using specialised TimestampFormatter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27031) Avoid double formatting in timestampToString
[ https://issues.apache.org/jira/browse/SPARK-27031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27031: Assignee: (was: Apache Spark) > Avoid double formatting in timestampToString > > > Key: SPARK-27031 > URL: https://issues.apache.org/jira/browse/SPARK-27031 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > The method DateTimeUtils.timestampToString performs converting of its input > to string twice: > - First time by converting microseconds to java.sql.Timestamp and toString: > https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L103 > - Second time by using provided formatter: > https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L104 > This could be avoided by using specialised TimestampFormatter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27031) Avoid double formatting in timestampToString
Maxim Gekk created SPARK-27031: -- Summary: Avoid double formatting in timestampToString Key: SPARK-27031 URL: https://issues.apache.org/jira/browse/SPARK-27031 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk The method DateTimeUtils.timestampToString performs converting of its input to string twice: - First time by converting microseconds to java.sql.Timestamp and toString: https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L103 - Second time by using provided formatter: https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L104 This could be avoided by using specialised TimestampFormatter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
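[Editor's note] The double conversion described in SPARK-27031 can be sketched outside Spark with the stdlib: rendering through an intermediate object's own string form and then re-parsing and re-formatting does the work twice, whereas one specialized formatter applied to the raw microseconds value does it once. The function names here are illustrative, not `DateTimeUtils`' actual API:

```python
from datetime import datetime, timezone

MICROS = 1_551_500_000_123_456  # sample value: microseconds since the epoch

def micros_to_datetime(micros):
    # Exact conversion of microseconds-since-epoch to a UTC datetime.
    seconds, rem = divmod(micros, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=rem)

def timestamp_to_string_twice(micros):
    # First rendering: go through an intermediate object's own string form
    # (analogous to converting to java.sql.Timestamp and calling toString).
    intermediate = micros_to_datetime(micros).isoformat()
    # Second rendering: parse that string back and format it again with
    # the "provided formatter" -- the formatting work happens twice.
    reparsed = datetime.fromisoformat(intermediate)
    return reparsed.strftime("%Y-%m-%d %H:%M:%S.%f")

def timestamp_to_string_once(micros):
    # Specialized formatter: one conversion straight from the raw value.
    return micros_to_datetime(micros).strftime("%Y-%m-%d %H:%M:%S.%f")
```

Both functions produce the same string; the second simply skips the intermediate round-trip, which is the optimization the ticket proposes.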
[jira] [Created] (SPARK-27030) DataFrameWriter.insertInto fails when writing in parallel to a hive table
Lev Katzav created SPARK-27030: -- Summary: DataFrameWriter.insertInto fails when writing in parallel to a hive table Key: SPARK-27030 URL: https://issues.apache.org/jira/browse/SPARK-27030 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Lev Katzav When writing to a hive table, the following temp directory is used: {code:java} /path/to/table/_temporary/0/{code} (the 0 at the end comes from the config {code:java} "mapreduce.job.application.attempt.id"{code} since that config is missing, it falls back to 0) when there are 2 processes that write to the same table, there could be the following race condition: # p1 creates temp folder and uses it # p2 uses temp folder # p1 finishes and deletes temp folder # p2 fails since temp folder is missing It is possible to recreate this error locally with the following code: (the code runs locally, but I experienced the same error when running on a cluster with 2 jobs writing to the same table) {code:java} import org.apache.spark.sql.functions._ val df = spark .range(1000) .toDF("a") .withColumn("partition", lit(0)) .cache() //create db sqlContext.sql("CREATE DATABASE IF NOT EXISTS db").count() //create table df .write .partitionBy("partition") .saveAsTable("db.table") val x = (1 to 100).par x.tasksupport = new ForkJoinTaskSupport( new ForkJoinPool(10)) //insert to different partitions in parallel x.foreach { p => val df2 = df .withColumn("partition",lit(p)) df2 .write .mode(SaveMode.Overwrite) .insertInto("db.table") } {code} the error would be: {code:java} java.io.FileNotFoundException: File file:/path/to/warehouse/db.db/table/_temporary/0 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:406) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537) at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:669) at 
org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497) at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:283) at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:325) at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48) at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:185) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668) at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:325) at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:311) at company.name.spark.hive.SparkHiveUtilsTest$$anonfun$3$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(SparkHiveUtilsTest.scala:190) at scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91) at scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972) at
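[Editor's note] The race in SPARK-27030 exists because both writers resolve the same literal staging path `<table>/_temporary/0` (the attempt id defaulting to 0). A stdlib-only sketch of the four-step interleaving from the report, and of the usual remedy — a staging directory unique to each job — with an illustrative directory layout:

```python
import shutil
import tempfile
import uuid
from pathlib import Path

table = Path(tempfile.mkdtemp()) / "db.db" / "table"  # hypothetical table root
table.mkdir(parents=True)

def commit_with_shared_staging(job_name):
    # Every job resolves the same staging dir, as with attempt id 0.
    staging = table / "_temporary" / "0"
    staging.mkdir(parents=True, exist_ok=True)
    (staging / f"{job_name}.part").write_text("data")
    committed = sorted(p.name for p in staging.iterdir())
    shutil.rmtree(staging)  # job commit deletes the SHARED staging dir
    return committed

# Interleaving from the report: p2 stages work, p1 commits and deletes the
# shared dir, and p2's own later commit would hit FileNotFoundError.
shared = table / "_temporary" / "0"
shared.mkdir(parents=True)                # step 1: temp folder created
(shared / "p2.part").write_text("data")   # step 2: p2 uses the temp folder
commit_with_shared_staging("p1")          # step 3: p1 finishes, deletes it
p2_dir_gone = not shared.exists()         # step 4: p2 is stranded

def commit_with_unique_staging(job_name):
    # Fix sketch: a per-job staging dir (e.g. a unique attempt/job id)
    # cannot be deleted out from under us by another writer.
    staging = table / f"_temporary-{uuid.uuid4().hex}"
    staging.mkdir(parents=True)
    (staging / f"{job_name}.part").write_text("data")
    committed = sorted(p.name for p in staging.iterdir())
    shutil.rmtree(staging)
    return committed

p2_committed = commit_with_unique_staging("p2")
```

Note that in the shared case p1's commit also sweeps up p2's half-staged file, so even when the directory survives, concurrent writers can commit each other's partial output.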
[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782330#comment-16782330 ] t oo commented on SPARK-26998: -- [https://github.com/apache/spark/pull/23820] is only about hiding password from log file, SPARK-26998 is about hiding passwords from showing in 'ps -ef' process list > spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor > processes in Standalone mode > --- > > Key: SPARK-26998 > URL: https://issues.apache.org/jira/browse/SPARK-26998 > Project: Spark > Issue Type: Bug > Components: Scheduler, Security, Spark Core >Affects Versions: 2.3.3, 2.4.0 >Reporter: t oo >Priority: Major > Labels: SECURITY, Security, secur, security, security-issue > > Run spark standalone mode, then start a spark-submit requiring at least 1 > executor. Do a 'ps -ef' on linux (ie putty terminal) and you will be able to > see spark.ssl.keyStorePassword value in plaintext! > > spark.ssl.keyStorePassword and spark.ssl.keyPassword don't need to be passed > to CoarseGrainedExecutorBackend. Only spark.ssl.trustStorePassword is used. > > Can be resolved if below PR is merged: > [[Github] Pull Request #21514 > (tooptoop4)|https://github.com/apache/spark/pull/21514] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
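[Editor's note] The exposure described in SPARK-26998 is inherent to command-line arguments: `ps -ef` just prints `/proc/<pid>/cmdline`, which any local user can read. A Linux-only demonstration — the child command below is a stand-in for an executor launch, not Spark's actual invocation, and only the property name mirrors the report:

```python
import subprocess
import sys
import time

# Spawn a child with a "password" on its command line, then read it back
# from the process table the same way `ps -ef` does.
secret = "hunter2"
child = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(10)",
     f"-Dspark.ssl.keyStorePassword={secret}"]
)
try:
    time.sleep(0.5)  # give the child time to exec
    with open(f"/proc/{child.pid}/cmdline", "rb") as f:
        argv = f.read().decode().split("\0")
    leaked = any(secret in arg for arg in argv)
finally:
    child.kill()
    child.wait()
```

This is why the fix has to avoid putting the keystore password on the executor's argv at all (e.g. passing it via the environment or an inherited file descriptor), rather than merely redacting logs.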
[jira] [Comment Edited] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322 ] Erik van Oosten edited comment on SPARK-27025 at 3/2/19 8:43 AM: - The point is to _not_ fetch pro-actively. I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when it's ready? was (Author: erikvanoosten): I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when it's ready? > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27025) Speed up toLocalIterator
[ https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322 ] Erik van Oosten commented on SPARK-27025: - I have a program in which several steps need to be executed before anything can be transferred to the driver. So why can't the executors start executing immediately, and only transfer the results to the driver when it's ready? > Speed up toLocalIterator > > > Key: SPARK-27025 > URL: https://issues.apache.org/jira/browse/SPARK-27025 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.3.3 >Reporter: Erik van Oosten >Priority: Major > > Method {{toLocalIterator}} fetches the partitions to the driver one by one. > However, as far as I can see, any required computation for the > yet-to-be-fetched-partitions is not kicked off until it is fetched. > Effectively only one partition is being computed at the same time. > Desired behavior: immediately start calculation of all partitions while > retaining the download-a-partition at a time behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
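[Editor's note] The behavior requested in SPARK-27025 — kick off every partition immediately, but still hand results to the driver one at a time — can be sketched with a thread pool: submit all partition jobs up front, then consume the futures in order. This is only an illustration of the requested semantics, not a claim about how Spark would implement it; a common workaround on the user side is simply to `cache()` and `count()` the dataset before calling `toLocalIterator`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def compute_partition(i):
    # Stand-in for the per-partition computation done on executors.
    time.sleep(0.1)
    return [i * 10 + j for j in range(3)]

partitions = range(4)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit ALL partitions immediately (the desired behavior), instead of
    # launching partition i+1 only after partition i has been fetched.
    futures = [pool.submit(compute_partition, i) for i in partitions]

    start = time.monotonic()
    fetched = []
    for f in futures:          # still consumed one partition at a time
        fetched.extend(f.result())
    elapsed = time.monotonic() - start

# All four 0.1 s computations overlapped, so the total wait is close to
# one partition's cost rather than the sum over all partitions.
```

The iteration order is preserved and memory still holds only one fetched partition's results at a time; only the computation is overlapped.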