[jira] [Assigned] (SPARK-27034) Nested schema pruning for ORC

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27034:


Assignee: Apache Spark

> Nested schema pruning for ORC
> -
>
> Key: SPARK-27034
> URL: https://issues.apache.org/jira/browse/SPARK-27034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> We currently support nested schema pruning only for Parquet. This ticket 
> proposes supporting nested schema pruning for ORC as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27034) Nested schema pruning for ORC

2019-03-02 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-27034:
---

 Summary: Nested schema pruning for ORC
 Key: SPARK-27034
 URL: https://issues.apache.org/jira/browse/SPARK-27034
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


We currently support nested schema pruning only for Parquet. This ticket proposes 
supporting nested schema pruning for ORC as well.
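
As a hedged illustration of what the ticket is about, assuming a spark-shell 
session and the existing Parquet-side flag 
spark.sql.optimizer.nestedSchemaPruning.enabled (paths and names below are only 
for the example):

{code:scala}
// Nested schema pruning as it works for Parquet today; this ticket proposes
// the same behaviour for the ORC reader.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

case class Name(first: String, last: String)
case class Contact(id: Int, name: Name)

import spark.implicits._
Seq(Contact(0, Name("Jane", "Doe"))).toDF
  .write.mode("overwrite").parquet("/tmp/contacts")   // illustrative path

// With pruning enabled, only the name.first leaf column should be read from
// the Parquet files; the same query against ORC currently reads the full struct.
spark.read.parquet("/tmp/contacts").select("name.first").explain(true)
{code}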



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27034) Nested schema pruning for ORC

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27034:


Assignee: (was: Apache Spark)

> Nested schema pruning for ORC
> -
>
> Key: SPARK-27034
> URL: https://issues.apache.org/jira/browse/SPARK-27034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> We currently support nested schema pruning only for Parquet. This ticket 
> proposes supporting nested schema pruning for ORC as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27033:


Assignee: Apache Spark

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be 
> pushed down. This optimizer rule can convert it to "select * from table where 
> a >= 3 - 1", and then to "select * from table where a >= 2".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27033:


Assignee: (was: Apache Spark)

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be 
> pushed down. This optimizer rule can convert it to "select * from table where 
> a >= 3 - 1", and then to "select * from table where a >= 2".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27033:
---
Affects Version/s: (was: 3.0.0)
   2.4.0

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be 
> pushed down. This optimizer rule can convert it to "select * from table where 
> a >= 3 - 1", and then to "select * from table where a >= 2".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27033:
---
Description: Currently, filters like "select * from table where a + 1 >= 3" 
cannot be pushed down. This optimizer rule can convert it to "select * from 
table where a >= 3 - 1", and then to "select * from table where a >= 2".   
(was: _emphasized text_)

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be 
> pushed down. This optimizer rule can convert it to "select * from table where 
> a >= 3 - 1", and then to "select * from table where a >= 2".



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27033:
---
Description: _emphasized text_

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> _emphasized text_



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)
EdisonWang created SPARK-27033:
--

 Summary: Add rule to optimize binary comparisons to its push down 
format
 Key: SPARK-27033
 URL: https://issues.apache.org/jira/browse/SPARK-27033
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.0.0
Reporter: EdisonWang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26860) RangeBetween docs appear to be wrong

2019-03-02 Thread Jagadesh Kiran N (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782575#comment-16782575
 ] 

Jagadesh Kiran N commented on SPARK-26860:
--

Just as a follow-up: I mentioned a few days back that I would be uploading the 
pull request.

> RangeBetween docs appear to be wrong 
> -
>
> Key: SPARK-26860
> URL: https://issues.apache.org/jira/browse/SPARK-26860
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Shelby Vanhooser
>Priority: Major
>  Labels: docs, easyfix, python
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The docs describing 
> [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween]
>  for PySpark appear to be duplicates of 
> [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween]
>  even though these are functionally different windows.  Rows between 
> reference proceeding and succeeding rows, but rangeBetween is based on the 
> values in these rows.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26860) RangeBetween docs appear to be wrong

2019-03-02 Thread Jagadesh Kiran N (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782573#comment-16782573
 ] 

Jagadesh Kiran N commented on SPARK-26860:
--

[~srowen] I took time to understand the steps for both R and Python. I'm almost 
done with the fix; please move it to the In Progress state so that I can raise a 
pull request.

> RangeBetween docs appear to be wrong 
> -
>
> Key: SPARK-26860
> URL: https://issues.apache.org/jira/browse/SPARK-26860
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Shelby Vanhooser
>Priority: Major
>  Labels: docs, easyfix, python
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The docs describing 
> [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween]
>  for PySpark appear to be duplicates of 
> [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween]
>  even though these are functionally different windows.  Rows between 
> reference proceeding and succeeding rows, but rangeBetween is based on the 
> values in these rows.  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-03-02 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782565#comment-16782565
 ] 

Felix Cheung commented on SPARK-26918:
--

[~srowen] what do you think?

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> per policy, all md files should have the header, like eg. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> currently it does not
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26918) All .md should have ASF license header

2019-03-02 Thread Felix Cheung (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782564#comment-16782564
 ] 

Felix Cheung commented on SPARK-26918:
--

I'm for doing this (reopen this issue).

Also [~rmsm...@gmail.com], the steps are:
 # add the header to all .md files
 # remove the RAT filter for .md, then after that
 # run the doc build to check the docs are generated properly (a sample header 
follows the commands below)

i.e. as at the beginning of the section "Update the Spark Website" on 
https://spark.apache.org/release-process.html

{{$ cd docs }}

{{$ PRODUCTION=1 jekyll build }}
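
For reference, the header being discussed is the standard ASF source header, 
wrapped in an HTML comment so it does not render in the generated site. This is 
a sketch; the exact wrapping should match what the Arrow/Hadoop files linked in 
the description below use:

{code}
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
{code}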

> All .md should have ASF license header
> --
>
> Key: SPARK-26918
> URL: https://issues.apache.org/jira/browse/SPARK-26918
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Felix Cheung
>Priority: Minor
>
> per policy, all md files should have the header, like eg. 
> [https://raw.githubusercontent.com/apache/arrow/master/docs/README.md]
>  or
> [https://raw.githubusercontent.com/apache/hadoop/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md]
>  
> currently it does not
> [https://raw.githubusercontent.com/apache/spark/master/docs/sql-reference.md] 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27029) Update Thrift to 0.12.0

2019-03-02 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27029.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23935

> Update Thrift to 0.12.0
> ---
>
> Key: SPARK-27029
> URL: https://issues.apache.org/jira/browse/SPARK-27029
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> We should update to Thrift 0.12.0 to pick up security and bug fixes. It 
> appears to be compatible with the current build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-02 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782542#comment-16782542
 ] 

Maxim Gekk commented on SPARK-26016:


[~srowen] The JSON datasource uses the Text datasource during schema inference to 
split text input by lines. For that, the line separator must be converted to the 
correct sequence of bytes and passed to Hadoop's line reader; this is where we 
need the correct encoding in the Text datasource. The Text datasource itself just 
copies bytes from Hadoop's line reader into InternalRow: 
[https://github.com/apache/spark/blob/36a2e6371b4d173c3e03cc0d869c39335a0d7682/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala#L135]
 . After that, the bytes are read by the Jackson parser from streams decoded 
according to the encoding: 
[https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala#L88-L93]

The PR you pointed out just makes the necessary changes in the Text datasource to 
support encodings in the JSON datasource.
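
A hedged sketch of the user-facing side of this, assuming a spark-shell session 
(path and charset are illustrative); the encoding and lineSep reader options are 
what rely on the Text-datasource plumbing described above:

{code:scala}
// Reading JSON that is not UTF-8: the encoding is used both for decoding the
// records via Jackson and for turning lineSep into the right byte sequence
// for Hadoop's line reader during schema inference.
val df = spark.read
  .option("encoding", "ISO-8859-1")   // illustrative charset
  .option("lineSep", "\n")
  .option("multiLine", "false")
  .json("/path/to/latin1.json")       // illustrative path
{code}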

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Priority: Major
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in a ISO-8859-1 encoded file and simply write out what was read; 
> the contents in the part file matches what was read.  Which is great.
> However, the second I use a map / mapPartitions function it looks like the 
> encoding is not correct.  In addition a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-02 Thread shahid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16781670#comment-16781670
 ] 

shahid edited comment on SPARK-27019 at 3/2/19 9:44 PM:


It seems event reordering has happened: the job start event came after the SQL 
execution end event when the query failed. Could you please share the Spark 
event log for the application, if possible?


was (Author: shahid):
seems event reordering has happened. Job start event came after sql execution 
end event, when the query failed. Could you please share spark eventLog for the 
application, if possible, as I'm unable to reproduce in our cluster.

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27019:


Assignee: (was: Apache Spark)

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27019) Spark UI's SQL tab shows inconsistent values

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27019:


Assignee: Apache Spark

> Spark UI's SQL tab shows inconsistent values
> 
>
> Key: SPARK-27019
> URL: https://issues.apache.org/jira/browse/SPARK-27019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.4.0
>Reporter: peay
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screenshot from 2019-03-01 21-31-48.png, 
> application_1550040445209_4748, query-1-details.png, query-1-list.png, 
> query-job-1.png, screenshot-spark-ui-details.png, screenshot-spark-ui-list.png
>
>
> Since 2.4.0, I am frequently seeing broken outputs in the SQL tab of the 
> Spark UI, where submitted/duration make no sense, description has the ID 
> instead of the actual description.
> Clicking on the link to open a query, the SQL plan is missing as well.
> I have tried to increase `spark.scheduler.listenerbus.eventqueue.capacity` to 
> very large values like 30k out of paranoia that we may have too many events, 
> but to no avail. I have not identified anything particular that leads to 
> that: it doesn't occur in all my jobs, but it does occur in a lot of them 
> still.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25762) Upgrade guava version in spark dependency lists due to CVE issue

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-25762:
--
Target Version/s: 3.0.0

Updating Guava can be really disruptive, but we shade it, so it's possible I 
think.

> Upgrade guava version in spark dependency lists due to  CVE issue
> -
>
> Key: SPARK-25762
> URL: https://issues.apache.org/jira/browse/SPARK-25762
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Spark Core
>Affects Versions: 2.2.1, 2.2.2, 2.3.1, 2.3.2
>Reporter: Debojyoti
>Priority: Major
>
> In the Spark 2.x dependency list we have guava-14.0.1.jar. However, a lot of 
> vulnerabilities exist in this version, e.g. CVE-2018-10237
> [https://www.cvedetails.com/cve/CVE-2018-10237/]
> Do we have any solution to resolve it, or is there any plan to upgrade the 
> guava version in any of Spark's future releases?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25781) relative importance of linear regression

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782537#comment-16782537
 ] 

Sean Owen commented on SPARK-25781:
---

PS are you referring to Shapley values?

> relative importance of linear regression
> 
>
> Key: SPARK-25781
> URL: https://issues.apache.org/jira/browse/SPARK-25781
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.3.2
>Reporter: ruxi zhang
>Priority: Minor
>  Labels: features
> Attachments: v17i01.pdf
>
>
> There is an R package, relaimpo, that generates relative importance for linear 
> regression features.  This method utilizes Shapley value regression, which 
> takes a long time to run on big datasets.  This method is quite useful 
> for many use cases, such as attribution models in marketing.  It would be great 
> if it were written in Spark with parallel computing, which would produce 
> results within a much shorter time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25941) Random forest score decreased due to updating spark version

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25941.
---
Resolution: Not A Problem

This isn't a bug. The implementation has changed in Spark and MLlib a lot over 
time. I would not expect exactly the same answer, and this is quite close.

> Random forest score decreased due to updating spark version
> ---
>
> Key: SPARK-25941
> URL: https://issues.apache.org/jira/browse/SPARK-25941
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Input/Output, ML
>Affects Versions: 2.3.2
>Reporter: jack li
>Priority: Major
>  Labels: ML, forest, random
>
> h3. Problem description
> I use different versions of spark to analyze random forest scores..
>  * spark-core_2.10 and version 2.0.0 
>  ** RandomForestsKaggle Score = 0.8978765219058574
>  * spark-core_2.11 and version 2.4.0 
>  ** RandomForestsKaggle Score = 0.8886987035251259
> Source :  [https://github.com/smartscity/Kaggle_Titanic_spark]
> [Example github source and 
> readme|https://github.com/smartscity/Kaggle_Titanic_spark/blob/master/README.md]
>  
> h3. Introduce
> This case is Titanic Competitions on the Kaggle. 
> [https://www.kaggle.com/c/titanic]
> h3. Conclusion
> After upgrading the spark version({{version 2.4.0}}), the random forest score 
> dropped({{0.01}}).
> h3. Expectation
> Expect random forest score not to drop as the version upgrades.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25814) spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25814.
---
Resolution: Duplicate

I think there are a few related items that this could be a duplicate of; picked 
one here.

> spark driver runs out of memory on org.apache.spark.util.kvstore.InMemoryStore
> --
>
> Key: SPARK-25814
> URL: https://issues.apache.org/jira/browse/SPARK-25814
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.2
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: driver, memory-analysis, memory-leak, statestore
> Attachments: image-2018-10-23-14-06-53-722.png
>
>
>  We're looking into an issue where even a very large Spark driver heap 
> eventually gets exhausted and GC makes the driver stop responding.
> Used [JXRay.com|http://jxray.com/] tool and found that most of driver heap is 
> used by 
>  
> {noformat}
> org.apache.spark.status.AppStatusStore
>   -> org.apache.spark.status.ElementTrackingStore
> -> org.apache.spark.util.kvstore.InMemoryStore
>  
> {noformat}
>  
> Is there is a way to tune this particular spark driver's memory region down?
>  
>  
> !image-2018-10-23-14-06-53-722.png!
>  
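
Not a fix for the underlying issue this duplicates, but a hedged note on the 
knobs that bound how much job/stage/task/SQL status data the driver keeps in the 
InMemoryStore-backed status store (the values below are illustrative, not 
recommendations):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bounded-status-store")                  // illustrative name
  .config("spark.ui.retainedJobs", "200")           // jobs kept in the status store
  .config("spark.ui.retainedStages", "200")         // stages kept
  .config("spark.ui.retainedTasks", "20000")        // tasks remembered per stage
  .config("spark.sql.ui.retainedExecutions", "200") // SQL executions kept
  .getOrCreate()
{code}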



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25911) [spark-ml] Hypothesis testing module

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25911.
---
Resolution: Won't Fix

I don't think we'd add all of those. Some of these are already in JIRA as 
ideas. While I don't think we'll add much more like this to ML, if you have one 
you can argue is widely used, and you can implement it, then I'd create (or 
find) a JIRA for that one to discuss first.

> [spark-ml] Hypothesis testing module
> 
>
> Key: SPARK-25911
> URL: https://issues.apache.org/jira/browse/SPARK-25911
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 3.0.0
>Reporter: Uday Babbar
>Priority: Minor
>
> h2. Why this ticket was created
> Feasibility determination of some subset of hypothesis testing module mainly 
> along value proposition front and to get a preliminary opinion of how does it 
> generally sound. Can work on a more comprehensive proposal if say, it's 
> generally agreed upon that including dataframe API for t-test makes sense in 
> the o.a.s.ml package. 
> h2. Current state
> There are some streaming implementation in the o.a.s.mllib module, but there 
> are no dataframe APIs for some standard tests (t-test). 
> ||Test ||Current state||Proposed state||
> |t-test (welch's, student)|only streaming |Dataframe API|
> |chi-squared|streaming, Dataframe/RDD API present| - |
> |ANOVA|-|Dataframe API|
> |mann-whitney-u-test|-|RDD API (in maintenance mode so probably doesn't make 
> sense to include this)|
> h2. Rationale 
> The utility of experimentation platforms is pervasive and most of them that 
> operate at scale (a large portion of them use spark for offline computation) 
> require distributed implementation of hypothesis tests to calculate p-values 
> of different metrics/features. These APIs would enable distributed 
> computation of the relevant stats and prevent overhead in moving data (or 
> some downstream view of it) to a framework where such stats computation is 
> available (R, scipy). 
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25928) NoSuchMethodError net.jpountz.lz4.LZ4BlockInputStream.(Ljava/io/InputStream;Z)V

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25928.
---
Resolution: Not A Problem

Yeah, you have an lz4-java version conflict somewhere. That's not a Spark 
problem per se, but a problem. I'd work around it, see if Oozie can update (? 
assuming it's on an older version) or manually update your Oozie install. 
Vendors sometimes make these kinds of workarounds too.
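
One hedged way to troubleshoot which copy of the class actually wins, run from a 
spark-shell (or inside the failing job) launched the same way Oozie launches it:

{code:scala}
// Prints the jar that LZ4BlockInputStream was loaded from on the driver;
// run the same check inside a task to see what the executors picked up.
println(classOf[net.jpountz.lz4.LZ4BlockInputStream]
  .getProtectionDomain.getCodeSource.getLocation)
{code}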

> NoSuchMethodError 
> net.jpountz.lz4.LZ4BlockInputStream.(Ljava/io/InputStream;Z)V
> -
>
> Key: SPARK-25928
> URL: https://issues.apache.org/jira/browse/SPARK-25928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.1
> Environment: EMR 5.17 which is using oozie 5.0.0 and spark 2.3.1
>Reporter: Jerry Chabot
>Priority: Major
>
> I am not sure if this is an Oozie problem, a Spark problem or a user error. 
> It is blocking our upcoming release.
> We are upgrading from Amazon's EMR 5.7 to EMR 5.17. The version changes are:
>     oozie 4.3.0 -> 5.0.0
>      spark 2.1.1 -> 2.3.1
> All our Oozie/Spark jobs were working in EMR 5.7. After upgrading, some of 
> our jobs which use a spark action are failing with the NoSuchMethod as shown 
> further in the description. It seems like conflicting classes.
> I noticed the spark share lib directory has two versions of the LZ4 jar.
> sudo -u hdfs hadoop fs -ls /user/oozie/share/lib/lib_20181029182704/spark/*lz*
>  -rw-r--r--   3 oozie oozie  79845 2018-10-29 18:27 
> /user/oozie/share/lib/lib_20181029182704/spark/compress-lzf-1.0.3.jar
>  -rw-r--r--   3 hdfs  oozie 236880 2018-11-01 18:22 
> /user/oozie/share/lib/lib_20181029182704/spark/lz4-1.3.0.jar
>  -rw-r--r--   3 oozie oozie 370119 2018-10-29 18:27 
> /user/oozie/share/lib/lib_20181029182704/spark/lz4-java-1.4.0.jar
> But, both of these jars have the constructor 
> LZ4BlockInputStream(java/io/InputStream). The spark/jars directory has only 
> lz4-java-1.4.0.jar. share lib seems to be getting it from the 
> /usr/lib/oozie/oozie-sharelib.tar.gz.
> Unfortunately, my team member that knows most about Spark is on vacation. 
> Does anyone have any suggestions on how best to troubleshoot this problem?
> Here is the strack trace.
>   diagnostics: User class threw exception: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 
> (TID 3, ip-172-27-113-49.ec2.internal, executor 2): 
> java.lang.NoSuchMethodError: 
> net.jpountz.lz4.LZ4BlockInputStream.(Ljava/io/InputStream;Z)V
>     at 
> org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
>     at scala.Option.map(Option.scala:146)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:304)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
>     at scala.Option.getOrElse(Option.scala:121)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
>     at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1346)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
>     at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>     at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
>     at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>     at org.apache.spark.scheduler.Task.run(Task.scala:109)
>     at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-25982) Dataframe write is non blocking in fair scheduling mode

2019-03-02 Thread Ramandeep Singh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782527#comment-16782527
 ] 

Ramandeep Singh commented on SPARK-25982:
-

No, as I said those operations at a stage are independent, and I explicitly 
await their completion before launching the next stage. The issue is that 
operations from the next stage start running before all futures have completed.
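
A hedged sketch of the pattern being described, with stand-in dataframes and 
illustrative /tmp paths, just to make the reported sequencing concrete:

{code:scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Two independent writes belonging to the same "stage", run in parallel
// under FAIR scheduling.
val df1 = spark.range(100).toDF("x")   // stand-in dataframes
val df2 = spark.range(100).toDF("y")

val writes = Seq(
  Future { df1.write.mode("overwrite").parquet("/tmp/out1") },
  Future { df2.write.mode("overwrite").parquet("/tmp/out2") }
)

// The next stage is only launched after this returns; the report is that jobs
// from later code nevertheless show up before these writes finish.
Await.result(Future.sequence(writes), Duration.Inf)
{code}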

> Dataframe write is non blocking in fair scheduling mode
> ---
>
> Key: SPARK-25982
> URL: https://issues.apache.org/jira/browse/SPARK-25982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Ramandeep Singh
>Priority: Major
>
> Hi,
> I have noticed that expected behavior of dataframe write operation to block 
> is not working in fair scheduling mode.
> Ideally when a dataframe write is occurring and a future is blocking on 
> AwaitResult, no other job should be started, but this is not the case. I have 
> noticed that other jobs are started when the partitions are being written.  
>  
> Regards,
> Ramandeep Singh
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26146) CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782526#comment-16782526
 ] 

Sean Owen commented on SPARK-26146:
---

It's possible, even likely, that the dependency is transitive. In this case:

{code}
[INFO] net.jgp.books:spark-chapter01:jar:1.0.0-SNAPSHOT
[INFO] +- org.apache.spark:spark-core_2.11:jar:2.4.0:compile
[INFO] |  +- org.apache.avro:avro:jar:1.8.2:compile
[INFO] |  |  +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO] |  |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO] |  |  +- com.thoughtworks.paranamer:paranamer:jar:2.7:compile
{code}

Here's what I'm confused about: Spark has depended on paranamer 2.8 since 2.3.0:
https://github.com/apache/spark/blob/v2.4.0/pom.xml#L185
{code}
[INFO] +- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile
[INFO] |  +- com.thoughtworks.paranamer:paranamer:jar:2.8:runtime
[INFO] |  +- org.apache.avro:avro:jar:1.8.2:compile
[INFO] |  |  +- org.codehaus.jackson:jackson-core-asl:jar
{code}

My only guess right now is that this is due to the arcane scoping rules for 
transitive dependencies in Maven. paranamer is a runtime dependency. Here, 
Spark is given as a compile-time dependency. It should be 'provided'. That 
doesn't change what Maven reports (weird still) but might solve the problem. 
Spark itself really does have 2.8. You can see it in 
https://search.maven.org/artifact/org.apache.spark/spark-parent_2.12/2.4.0/pom

I think this is still probably a weird issue between the code here and Maven.

> CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12
> --
>
> Key: SPARK-26146
> URL: https://issues.apache.org/jira/browse/SPARK-26146
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Jean Georges Perrin
>Priority: Major
>
> Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, 
> where it works ok with Scala v2.11.
> When running a simple CSV ingestion like:{{ }}
> {code:java}
>     // Creates a session on a local master
>     SparkSession spark = SparkSession.builder()
>         .appName("CSV to Dataset")
>         .master("local")
>         .getOrCreate();
>     // Reads a CSV file with header, called books.csv, stores it in a 
> dataframe
>     Dataset df = spark.read().format("csv")
>         .option("header", "true")
>         .load("data/books.csv");
> {code}
>   With Scala 2.12, I get: 
> {code:java}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
> at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
> at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
> ...
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37)
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21)
> {code}
> Where it works pretty smoothly if I switch back to 2.11.
> Full example available at 
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can 
> modify pom.xml to change easily the Scala version in the property section:
> {code:java}
> 
>  UTF-8
>  1.8
>  2.11
>  2.4.0
> {code}
>  
> (ps. It's my first bug submission, so I hope I did not mess too much, be 
> tolerant if I did)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26146) CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12

2019-03-02 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782519#comment-16782519
 ] 

Michael Heuer commented on SPARK-26146:
---

[~srowen] Hey, wait a minute, the full example from the original submitter 
appears to be in github here

[https://github.com/jgperrin/net.jgp.books.spark.ch01]

This example has no explicit dependency on paranamer, and neither does Disq for 
that matter 

[https://github.com/disq-bio/disq/blob/5760a80b3322c537226a876e0df4f7710188f7b2/pom.xml]

This is not the first instance I've seen (and reported) where Spark itself has 
dependencies that conflict or otherwise are not compatible with each other, and 
those do not manifest themselves at test scope in Spark CI, only when an 
external project has a compile scope dependency on Spark jars.

> CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12
> --
>
> Key: SPARK-26146
> URL: https://issues.apache.org/jira/browse/SPARK-26146
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Jean Georges Perrin
>Priority: Major
>
> Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, 
> where it works ok with Scala v2.11.
> When running a simple CSV ingestion like:{{ }}
> {code:java}
>     // Creates a session on a local master
>     SparkSession spark = SparkSession.builder()
>         .appName("CSV to Dataset")
>         .master("local")
>         .getOrCreate();
>     // Reads a CSV file with header, called books.csv, stores it in a 
> dataframe
>     Dataset df = spark.read().format("csv")
>         .option("header", "true")
>         .load("data/books.csv");
> {code}
>   With Scala 2.12, I get: 
> {code:java}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
> at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
> at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
> ...
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37)
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21)
> {code}
> Where it works pretty smoothly if I switch back to 2.11.
> Full example available at 
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can 
> modify pom.xml to change easily the Scala version in the property section:
> {code:java}
> 
>  UTF-8
>  1.8
>  2.11
>  2.4.0
> {code}
>  
> (ps. It's my first bug submission, so I hope I did not mess too much, be 
> tolerant if I did)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26146) CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12

2019-03-02 Thread Michael Heuer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782522#comment-16782522
 ] 

Michael Heuer commented on SPARK-26146:
---

It appears that this is a duplicate of SPARK-26583 and will be resolved in 
2.4.1.

> CSV wouln't be ingested in Spark 2.4.0 with Scala 2.12
> --
>
> Key: SPARK-26146
> URL: https://issues.apache.org/jira/browse/SPARK-26146
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Jean Georges Perrin
>Priority: Major
>
> Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, 
> where it works ok with Scala v2.11.
> When running a simple CSV ingestion like:{{ }}
> {code:java}
>     // Creates a session on a local master
>     SparkSession spark = SparkSession.builder()
>         .appName("CSV to Dataset")
>         .master("local")
>         .getOrCreate();
>     // Reads a CSV file with header, called books.csv, stores it in a 
> dataframe
>     Dataset df = spark.read().format("csv")
>         .option("header", "true")
>         .load("data/books.csv");
> {code}
>   With Scala 2.12, I get: 
> {code:java}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
> at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
> at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
> ...
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37)
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21)
> {code}
> Where it works pretty smoothly if I switch back to 2.11.
> Full example available at 
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can 
> modify pom.xml to change easily the Scala version in the property section:
> {code:java}
> 
>  UTF-8
>  1.8
>  2.11
>  2.4.0
> {code}
>  
> (ps. It's my first bug submission, so I hope I did not mess too much, be 
> tolerant if I did)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25982) Dataframe write is non blocking in fair scheduling mode

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782521#comment-16782521
 ] 

Sean Owen commented on SPARK-25982:
---

I don't understand this; you're running operations in parallel on purpose, but 
expecting one to wait for the other?

> Dataframe write is non blocking in fair scheduling mode
> ---
>
> Key: SPARK-25982
> URL: https://issues.apache.org/jira/browse/SPARK-25982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Ramandeep Singh
>Priority: Major
>
> Hi,
> I have noticed that expected behavior of dataframe write operation to block 
> is not working in fair scheduling mode.
> Ideally when a dataframe write is occurring and a future is blocking on 
> AwaitResult, no other job should be started, but this is not the case. I have 
> noticed that other jobs are started when the partitions are being written.  
>  
> Regards,
> Ramandeep Singh
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782520#comment-16782520
 ] 

Sean Owen commented on SPARK-26016:
---

Yeah, the TL;DR from reading the code is that it writes UTF-8, right?

df.write.text always writes UTF-8 strings; I think that's the most immediate 
explanation for all of this. df.read.text also only reads UTF-8, as it pushes 
down to the Hadoop input formats and Text works in terms of UTF-8. I think it's 
working insofar as ISO-8859-1's character mapping is a subset of UTF-8's. I'm 
not as sure why the write path doesn't accidentally work; BOMs? I think this is 
80% right at least as an explanation.

The problem is that this isn't documented anywhere I can find. I think people 
are just accustomed to always working in UTF-8 and the fact that you can read 
some common encodings as UTF-8 anyway. That much can easily be documented.

It's exacerbated by the fact that there's an 'encoding' option on 
DataFrameReader/Writer, which does nothing but affect how the line separator is 
treated. (I don't quite get why you could use a non-UTF-8 encoding of your line 
separator if everything else assumes UTF-8!)

[~maxgekk] it looks like this option was added in 
https://github.com/apache/spark/pull/20937/files#diff-cbd2ad0c69c1517fbbc42b587fd37e25R44
 ; was it necessary for the text data source? for the reason above I'm not sure 
about how it works. I see it's also used for JSON, separately. Is it meant to 
be 'hidden'?
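
Until the UTF-8 assumption is documented, a hedged workaround sketch that decodes 
the bytes explicitly instead of going through the text reader (path and charset 
are illustrative, assuming a spark-shell session):

{code:scala}
import java.nio.charset.Charset

// Read whole files as bytes and decode them with the known charset, so the
// strings that reach map/mapPartitions are already correct Java strings.
val latin1 = Charset.forName("ISO-8859-1")
val lines = spark.sparkContext
  .binaryFiles("/path/to/latin1/*.txt")   // illustrative path
  .flatMap { case (_, stream) => new String(stream.toArray, latin1).split("\n") }

import spark.implicits._
val df = lines.toDF("value")
{code}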


> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Priority: Major
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in a ISO-8859-1 encoded file and simply write out what was read; 
> the contents in the part file matches what was read.  Which is great.
> However, the second I use a map / mapPartitions function it looks like the 
> encoding is not correct.  In addition a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`

2019-03-02 Thread Will Uto (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782518#comment-16782518
 ] 

Will Uto commented on SPARK-26943:
--

Thanks for the information - I was hoping that if I e.g. installed PySpark 
v2.4.0 in each Python Virtual Environment on each cluster worker/node, then I 
could run against Spark v2.4.0, but it sounds like I would need to upgrade 
Spark through something like Cloudera Manager.

> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26045) Error in the spark 2.4 release package with the spark-avro_2.11 depdency

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782508#comment-16782508
 ] 

Sean Owen commented on SPARK-26045:
---

Avro is already on 1.8.2 in 2.4.0. I think the problem may be that you're 
adding a separate copy of Avro, which might end up in a different classloader. 
You wouldn't want to do this in general, and it almost certainly won't work with 
Spark 2.3, which was on Avro 1.7.7.

Try without adding an avro jar, and make sure you don't have a different Avro 
dependency in your app or environment. If that's not it, my final guess is that 
something in hive-exec has a copy of avro. But then I am not sure how the tests 
would pass.

> Error in the spark 2.4 release package with the spark-avro_2.11 depdency
> 
>
> Key: SPARK-26045
> URL: https://issues.apache.org/jira/browse/SPARK-26045
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
> Environment: 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Oscar garcía 
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Hello, I have been having problems with the latest Spark 2.4 release: the 
> read-avro-file feature does not seem to be working. I have fixed it locally by 
> building the source code and updating the *avro-1.8.2.jar* in the 
> *$SPARK_HOME*/jars/ dependencies.
> With the default Spark 2.4 release, when I try to read an avro file Spark 
> raises the following exception.  
> {code:java}
> spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
> scala> spark.read.format("avro").load("file.avro")
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:51)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105
> {code}
> Checksum:  spark-2.4.0-bin-without-hadoop.tgz: 7670E29B 59EAE7A8 5DBC9350 
> 085DD1E0 F056CA13 11365306 7A6A32E9 B607C68E A8DAA666 EF053350 008D0254 
> 318B70FB DE8A8B97 6586CA19 D65BA2B3 FD7F919E
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24295) Purge Structured streaming FileStreamSinkLog metadata compact file data.

2019-03-02 Thread Alfredo Gimenez (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782510#comment-16782510
 ] 

Alfredo Gimenez commented on SPARK-24295:
-

Our current workaround, FWIW:

We've added a streaming query listener that, at every query progress event, 
writes out a manual checkpoint (from the QueryProgressEvent sourceOffset member 
that contains the last used source offsets). We gracefully stop the stream job 
every 6 hours, purge the _spark_metadata and spark checkpoints, and upon 
restart check for the existence of the manual checkpoint and use it if 
available. We do the stop/purge/restart via Airflow but it would be trivial to 
do this by looping around a stream awaitTermination with a provided timeout. 

A simple solution would be to just have an option to disable metadata file 
compaction that also allows old metadata files to be deleted after a delay. 
Currently it appears that all files stay around until compaction, upon which 
files older than the delay and not in the compaction are purged.
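
A hedged sketch of the kind of listener described above; the class name is 
hypothetical and the persistence step is left as a comment, since the real 
implementation writes the offsets somewhere durable:

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

class ManualOffsetListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // endOffset of each source is a JSON string describing the last committed offsets.
    val offsets = event.progress.sources.map(_.endOffset).mkString("[", ",", "]")
    // Persist `offsets` to durable storage here (e.g. via the Hadoop FileSystem
    // API), so a restart can use it after the checkpoint dir has been purged.
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
}

// Registration: spark.streams.addListener(new ManualOffsetListener)
{code}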

> Purge Structured streaming FileStreamSinkLog metadata compact file data.
> 
>
> Key: SPARK-24295
> URL: https://issues.apache.org/jira/browse/SPARK-24295
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Iqbal Singh
>Priority: Major
> Attachments: spark_metadatalog_compaction_perfbug_repro.tar.gz
>
>
> FileStreamSinkLog metadata logs are concatenated to a single compact file 
> after a defined compact interval.
> For long-running jobs, the compact file can grow to tens of GBs, causing 
> slowness while reading the data from the FileStreamSinkLog dir, as Spark 
> defaults to the "__spark__metadata" dir for the read.
> We need functionality to purge the compact file data.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26017) SVD++ error rate is high in the test suite.

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26017.
---
Resolution: Invalid

I don't think this states an actionable issue; high relative to what?

> SVD++ error rate is high in the test suite.
> ---
>
> Key: SPARK-26017
> URL: https://issues.apache.org/jira/browse/SPARK-26017
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.2
>Reporter: shahid
>Priority: Major
> Attachments: image-2018-11-12-20-41-49-370.png
>
>
> In the test suite, "{color:#008000}Test SVD++ with mean square error on 
> training set", {color}the error rate is quite high, even for a large number of 
> iterations.
>  
> !image-2018-11-12-20-41-49-370.png!
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782502#comment-16782502
 ] 

Sean Owen commented on SPARK-26166:
---

I agree there's a problem there. I don't think checkpoint() is appropriate as 
it would try to save the whole RDD to a local file. Although you would 
generally get the same order of evaluation for the same dataset, it's not 
guaranteed. cache()-ing the whole thing takes memory and as you say even that 
isn't guaranteed to work.

The Scala implementation does it differently, and more correctly, in 
MLUtils.kFold. I think the solution is to call that from PySpark to get the 
training/validation splits. Are you open to trying a fix?
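
For reference, a rough Scala sketch of what kFold provides (assuming `dataset` 
is the input DataFrame):

{code:scala}
import org.apache.spark.mllib.util.MLUtils

// kFold deterministically builds disjoint (training, validation) RDD pairs from
// a seed, so the splits cannot drift between passes the way a re-evaluated
// rand() column can.
val folds = MLUtils.kFold(dataset.rdd, 3, 42) // 3 folds, fixed seed
folds.zipWithIndex.foreach { case ((training, validation), i) =>
  println(s"fold $i: train=${training.count()}, validation=${validation.count()}")
}
{code}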

> CrossValidator.fit() bug,training and validation dataset may overlap
> 
>
> Key: SPARK-26166
> URL: https://issues.apache.org/jira/browse/SPARK-26166
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xinyong Tian
>Priority: Major
>
> In the code for pyspark.ml.tuning.CrossValidator.fit(), after adding the random column
> df = dataset.select("*", rand(seed).alias(randCol))
> it should add
> df.checkpoint()
> If df is not checkpointed, it will be recomputed each time the train and 
> validation dataframes need to be created. The order of rows in df, which 
> rand(seed) depends on, is not deterministic. Thus the random column value could 
> differ for a specific row on each evaluation, even with a seed. Note: 
> checkpoint() cannot be replaced with cache(), because when a node fails the 
> cached table needs to be recomputed, so the random numbers could differ.
> This might especially be a problem when the input 'dataset' dataframe results 
> from a query including a 'where' clause; see below.
> [https://dzone.com/articles/non-deterministic-order-for-select-with-limit]
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26058) Incorrect logging class loaded for all the logs.

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26058:
--
   Priority: Minor  (was: Major)
Description: 
In order to make the bug more evident, please change the log4j configuration to 
use this pattern, instead of default.
{code}
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: 
%m%n
{code}

The logging class recorded in the log is :
{code}
INFO org.apache.spark.internal.Logging$class
{code}
instead of the actual logging class.

Sample output of the logs, after applying the above log4j configuration change.
{code}
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark 
web UI at http://9.234.206.241:4040
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: 
MapOutputTrackerMasterEndpoint stopped!
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore 
cleared
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager 
stopped
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: 
BlockManagerMaster stopped
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: 
OutputCommitCoordinator stopped!
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Successfully 
stopped SparkContext
{code}


This happens because the actual logging is done inside the Logging trait, which 
is therefore picked up as the logging class for the message. It can be corrected 
either by using the `log` variable directly instead of the delegating logInfo 
methods, or, if we do not want to lose the theoretical performance benefit of 
pre-checking logXYZ.isEnabled, by using a Scala macro to inject those checks. The 
latter has the disadvantage that wrong line number information may be produced 
during debugging.

  was:

In order to make the bug more evident, please change the log4j configuration to 
use this pattern, instead of default.
{code}
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: 
%m%n
{code}

The logging class recorded in the log is :
{code}
INFO org.apache.spark.internal.Logging$class
{code}
instead of the actual logging class.

Sample output of the logs, after applying the above log4j configuration change.
{code}
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark 
web UI at http://9.234.206.241:4040
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: 
MapOutputTrackerMasterEndpoint stopped!
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore 
cleared
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager 
stopped
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: 
BlockManagerMaster stopped
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: 
OutputCommitCoordinator stopped!
18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Successfully 
stopped SparkContext
{code}


This happens due to the fact, actual logging is done inside the trait logging 
and that is picked up as logging class for the log message. It can either be 
corrected by using `log` variable directly instead of delegator logInfo methods 
or if we would like to not miss out on theoretical performance benefits of 
pre-checking logXYZ.isEnabled, then we can use scala macro to inject those 
checks. Later has a disadvantage, that during debugging wrong line number 
information may be produced.

 Issue Type: Improvement  (was: Bug)

I don't think that's a bug, but if there's a way to get the logging class 
properly without changing all the logging code, OK.
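
To make the tradeoff concrete, here is a small self-contained sketch (the real 
Logging trait is Spark-internal, so a stand-in that mirrors it is used):

{code:scala}
import org.slf4j.{Logger, LoggerFactory}

// Stand-in for Spark's internal Logging trait: the logger is named after the
// implementing class, but the delegating logInfo call is what the %C caller
// lookup attributes to the trait rather than to MyComponent.
trait MiniLogging {
  protected lazy val log: Logger = LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))
  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
}

class MyComponent extends MiniLogging {
  def run(): Unit = {
    logInfo("with %C this line is attributed to the trait, not MyComponent") // delegated call
    if (log.isInfoEnabled) {
      log.info("with %C this line is attributed to MyComponent")             // direct call
    }
  }
}
{code}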

> Incorrect logging class loaded for all the logs.
> 
>
> Key: SPARK-26058
> URL: https://issues.apache.org/jira/browse/SPARK-26058
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Prashant Sharma
>Priority: Minor
>
> In order to make the bug more evident, please change the log4j configuration 
> to use this pattern, instead of default.
> {code}
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %C: 
> %m%n
> {code}
> The logging class recorded in the log is :
> {code}
> INFO org.apache.spark.internal.Logging$class
> {code}
> instead of the actual logging class.
> Sample output of the logs, after applying the above log4j configuration 
> change.
> {code}
> 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: Stopped Spark 
> web UI at http://9.234.206.241:4040
> 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: 
> MapOutputTrackerMasterEndpoint stopped!
> 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: MemoryStore 
> cleared
> 18/11/14 13:44:48 INFO org.apache.spark.internal.Logging$class: BlockManager 
> stopped
> 18/11/14 13:44:48 

[jira] [Resolved] (SPARK-26128) filter breaks input_file_name

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26128.
---
Resolution: Cannot Reproduce

> filter breaks input_file_name
> -
>
> Key: SPARK-26128
> URL: https://issues.apache.org/jira/browse/SPARK-26128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.2
>Reporter: Paul Praet
>Priority: Minor
>
> This works:
> {code:java}
> scala> 
> spark.read.parquet("/tmp/newparquet").select(input_file_name).show(5,false)
> +-+
> |input_file_name()
>     |
> +-+
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> |file:///tmp/newparquet/parquet-5-PT6H/junit/data/tenant=NA/year=2017/month=201704/day=20170406/hour=2017040618/data.eu-west-1b.290.PT6H.FINAL.parquet|
> +-+
> {code}
> When adding a filter:
> {code:java}
> scala> 
> spark.read.parquet("/tmp/newparquet").where("key.station='XYZ'").select(input_file_name()).show(5,false)
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26146) CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26146.
---
Resolution: Cannot Reproduce

Is that code from Spark? It doesn't quite look like it from the stack trace. 
It's a paranamer problem with Java 8 that I think is triggered by Scala 2.12, 
which uses lambdas for closures. This was updated in Spark 2.4, though, in 
SPARK-22128 for this reason. I'm wondering if you are using a different 
paranamer dependency in your app, in either the non-Spark or Spark code. Update 
it to 2.8.

In any event, at least, the CSV tests are fine in Java 8 + Scala 2.12 + Spark 
2.4.

> CSV wouldn't be ingested in Spark 2.4.0 with Scala 2.12
> --
>
> Key: SPARK-26146
> URL: https://issues.apache.org/jira/browse/SPARK-26146
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Jean Georges Perrin
>Priority: Major
>
> Ingestion of a CSV file seems to fail with Spark v2.4.0 and Scala v2.12, 
> whereas it works fine with Scala v2.11.
> When running a simple CSV ingestion like:
> {code:java}
>     // Creates a session on a local master
>     SparkSession spark = SparkSession.builder()
>         .appName("CSV to Dataset")
>         .master("local")
>         .getOrCreate();
>     // Reads a CSV file with header, called books.csv, stores it in a 
> dataframe
>     Dataset<Row> df = spark.read().format("csv")
>         .option("header", "true")
>         .load("data/books.csv");
> {code}
>   With Scala 2.12, I get: 
> {code:java}
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 10582
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.accept(BytecodeReadingParanamer.java:563)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer$ClassReader.access$200(BytecodeReadingParanamer.java:338)
> at 
> com.thoughtworks.paranamer.BytecodeReadingParanamer.lookupParameterNames(BytecodeReadingParanamer.java:103)
> at 
> com.thoughtworks.paranamer.CachingParanamer.lookupParameterNames(CachingParanamer.java:90)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.getCtorParams(BeanIntrospector.scala:44)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1(BeanIntrospector.scala:58)
> at 
> com.fasterxml.jackson.module.scala.introspect.BeanIntrospector$.$anonfun$apply$1$adapted(BeanIntrospector.scala:58)
> at 
> scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:240)
> ...
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.start(CsvToDataframeApp.java:37)
> at 
> net.jgp.books.sparkWithJava.ch01.CsvToDataframeApp.main(CsvToDataframeApp.java:21)
> {code}
> Where it works pretty smoothly if I switch back to 2.11.
> Full example available at 
> [https://github.com/jgperrin/net.jgp.books.sparkWithJava.ch01.] You can 
> modify pom.xml to easily change the Scala version in the properties section:
> {code:xml}
> <properties>
>   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   <java.version>1.8</java.version>
>   <scala.version>2.11</scala.version>
>   <spark.version>2.4.0</spark.version>
> </properties>
> {code}
>  
> (ps. It's my first bug submission, so I hope I did not mess too much, be 
> tolerant if I did)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26152) Flaky test: BroadcastSuite

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782505#comment-16782505
 ] 

Sean Owen commented on SPARK-26152:
---

I haven't seen this one in a while, FWIW.

> Flaky test: BroadcastSuite
> --
>
> Key: SPARK-26152
> URL: https://issues.apache.org/jira/browse/SPARK-26152
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/5627
>  (2018-11-16)
> {code}
> BroadcastSuite:
> - Using TorrentBroadcast locally
> - Accessing TorrentBroadcast variables from multiple threads
> - Accessing TorrentBroadcast variables in a local cluster (encryption = off)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@59428a1 rejected from 
> java.util.concurrent.ThreadPoolExecutor@4096a677[Shutting down, pool size = 
> 1, active threads = 1, queued tasks = 0, completed tasks = 0]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>   at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:134)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:63)
>   at 
> scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:78)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)
>   at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
>   at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:106)
>   at 
> scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>   at scala.concurrent.Promise.complete(Promise.scala:49)
>   at scala.concurrent.Promise.complete$(Promise.scala:48)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:183)
>   at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> java.util.concurrent.RejectedExecutionException: Task 
> scala.concurrent.impl.CallbackRunnable@40a5bf17 rejected from 
> java.util.concurrent.ThreadPoolExecutor@5a73967[Shutting down, pool size = 1, 
> active threads = 1, queued tasks = 0, completed tasks = 0]
>   at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
>   at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>   at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
>   at 
> java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
>   at 
> 

[jira] [Resolved] (SPARK-26162) ALS results vary with user or item ID encodings

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26162.
---
Resolution: Not A Problem

I don't think it's a bug; the distributed computation and blocking might 
produce slightly different results depending on the order it encounters the 
vectors. Reopen if you have more detail that shows the results are meaningfully 
different.

> ALS results vary with user or item ID encodings
> ---
>
> Key: SPARK-26162
> URL: https://issues.apache.org/jira/browse/SPARK-26162
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Pablo J. Villacorta
>Priority: Major
>
> When calling ALS.fit() with the same seed on a dataset, the results (both the 
> latent factors matrices and the accuracy of the recommendations) differ when 
> we change the labels used to encode the users or items. The code example 
> below illustrates this by just changing user ID 1 or 2 to an unused ID like 
> 30. The user factors matrix changes, but not only the rows corresponding to 
> users 1 or 2 but also the other rows. 
> Is this the intended behaviour?
> {code:java}
> val r = scala.util.Random
> r.setSeed(123456)
> val trainDataset1 = spark.sparkContext.parallelize(
> (1 to 1000).map(_=> (r.nextInt(20), r.nextInt(100), r.nextInt(4) + 1)) // 
> users go from 0 to 19
> ).toDF("user", "item", "rating")
> val trainDataset2 = trainDataset1.withColumn("user", when(col("user")===1, 
> 30).otherwise(col("user")))
> val trainDataset3 = trainDataset1.withColumn("user", when(col("user")===2, 
> 30).otherwise(col("user")))
> val testDatasets = Array(trainDataset1, trainDataset2, trainDataset3).map(
> _.groupBy("user").agg(collect_list("item").alias("watched"))
> )
> val Array(als1, als2, als3) = Array(trainDataset1, trainDataset2, 
> trainDataset3).map(new ALS().setSeed(12345).fit(_))
> als1.userFactors.show(5, false)
> als2.userFactors.show(5, false)
> als3.userFactors.show(5, false){code}
> If we ask for recommendations and compare them with a test dataset also 
> modified accordingly (in this example, the test dataset is exactly the train 
> dataset) the results also differ:
> {code:java}
> val recommendations = Array(als1, als2, als3).map(x =>
> x.recommendForAllUsers(20).map{
> case Row(user: Int, recommendations: WrappedArray[Row]) => {
> val items = recommendations.map{case Row(item: Int, score: Float) 
> => item}
> (user, items)
> }
> }.toDF("user", "recommendations")
> )
> val predictionsAndActualRDD = testDatasets.zip(recommendations).map{
> case (testDataset, recommendationsDF) =>
> testDataset.join(recommendationsDF, "user")
> .rdd.map(r => {
> 
> (r.getAs[WrappedArray[Int]](r.fieldIndex("recommendations")).array,
> r.getAs[WrappedArray[Int]](r.fieldIndex("watched")).array
> )
> })
> }
> val metrics = predictionsAndActualRDD.map(new RankingMetrics(_))
> println(s"Precision at 5 of first model = ${metrics(0).precisionAt(5)}")
> println(s"Precision at 5 of second model = ${metrics(1).precisionAt(5)}")
> println(s"Precision at 5 of third model = ${metrics(2).precisionAt(5)}")
> {code}
> EDIT: The results also change if we just swap the IDs of some users, like:
> {code:java}
> val trainDataset4 = trainDataset1.withColumn("user", when(col("user")===1, 4)
> .when(col("user")===4, 
> 1).otherwise(col("user")))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26183) ConcurrentModificationException when using Spark collectionAccumulator

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26183.
---
Resolution: Not A Problem

This doesn't look like a bug. It arises because you are (inadvertently) 
serializing the accumulator, since it's a field in your app class. It's not 
meant to be anywhere but the driver. The serializer is reading it while you're 
writing it, as it doesn't know about your locks.
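
For anyone hitting this, a minimal sketch of the driver-only pattern (names are 
made up): the accumulator lives in a local val, executors only call add() inside 
the closure, and value is read on the driver after the action finishes.

{code:scala}
import scala.collection.JavaConverters._

import org.apache.spark.sql.SparkSession

object DriverOnlyAccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("acc-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Local val on the driver, not a field of a serialized application object.
    val badRecords = sc.collectionAccumulator[String]("badRecords")

    sc.parallelize(1 to 100).foreach { i =>
      if (i % 10 == 0) badRecords.add(s"record-$i") // executor side: add() only
    }

    // Driver side, after the action has completed: safe to read the values.
    badRecords.value.asScala.foreach(println)

    spark.stop()
  }
}
{code}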

> ConcurrentModificationException when using Spark collectionAccumulator
> --
>
> Key: SPARK-26183
> URL: https://issues.apache.org/jira/browse/SPARK-26183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Rob Dawson
>Priority: Major
>
> I'm trying to run a Spark-based application on an Azure HDInsight on-demand 
> cluster, and am seeing lots of SparkExceptions (caused by 
> ConcurrentModificationExceptions) being logged. The application runs without 
> these errors when I start a local Spark instance.
> I've seen reports of [similar errors when using 
> accumulators|https://issues.apache.org/jira/browse/SPARK-17463] and my code 
> is indeed using a CollectionAccumulator, however I have placed synchronized 
> blocks everywhere I use it, and it makes no difference. The accumulator code 
> looks like this:
> {code:scala}
> class MySparkClass(sc: SparkContext) {
>   val myAccumulator = sc.collectionAccumulator[MyRecord]
>
>   override def add(record: MyRecord) = {
>     synchronized {
>       myAccumulator.add(record)
>     }
>   }
>
>   override def endOfBatch() = {
>     synchronized {
>       myAccumulator.value.asScala.foreach((record: MyRecord) => {
>         processIt(record)
>       })
>     }
>   }
> }{code}
> The exceptions don't cause the application to fail; however, when the code 
> tries to read values out of the accumulator, it is empty and {{processIt}} is 
> never called.
> {code}
> 18/11/26 11:04:37 WARN Executor: Issue communicating with driver in 
> heartbeater
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
> at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
> at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:770)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at 
> java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:441)
> at 
> java.util.Collections$SynchronizedCollection.writeObject(Collections.java:2081)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 

[jira] [Resolved] (SPARK-26214) Add "broadcast" method to DataFrame

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26214.
---
Resolution: Won't Fix

> Add "broadcast" method to DataFrame
> ---
>
> Key: SPARK-26214
> URL: https://issues.apache.org/jira/browse/SPARK-26214
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Thomas Decaux
>Priority: Trivial
>  Labels: broadcast, dataframe
>
> As discussed at 
> [https://stackoverflow.com/questions/43984068/does-spark-sql-autobroadcastjointhreshold-work-for-joins-using-datasets-join-op/43994022,]
> it's possible to force a broadcast of a DataFrame, even if its total size is 
> greater than {{spark.sql.autoBroadcastJoinThreshold}}.
> But this is not trivial for a beginner, because there is no "broadcast" method (I 
> know, I am lazy ...).
> We could add this method, with a WARN if the size is greater than the threshold.
> (if it's an easy one, I could do it?)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26224) Results in stackOverFlowError when trying to add 3000 new columns using withColumn function of dataframe.

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782496#comment-16782496
 ] 

Sean Owen commented on SPARK-26224:
---

The most realistic thing I can imagine is exposing `withColumns`.
But for this use case, there are pretty easy workarounds, like just mapping the 
DataFrame Rows to contain a bunch more 0s and specifying your new schema.
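
A sketch of that kind of workaround, assuming the `baseDataFrame` from the 
reporter's snippet below: build all the literal columns up front and add them in 
a single select, so the plan stays flat instead of nesting 3000 projections.

{code:scala}
import org.apache.spark.sql.functions.{col, lit}

// One projection with all 3000 literal columns instead of 3000 chained
// withColumn calls, which each add another nested Project to the plan.
val newCols = (1 to 3000).map(i => lit(0L).as("field_" + i))
val result = baseDataFrame.select(col("*") +: newCols: _*)
result.show(false)
{code}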

> Results in stackOverFlowError when trying to add 3000 new columns using 
> withColumn function of dataframe.
> -
>
> Key: SPARK-26224
> URL: https://issues.apache.org/jira/browse/SPARK-26224
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: On macbook, used Intellij editor. Ran the above sample 
> code as unit test.
>Reporter: Dorjee Tsering
>Priority: Minor
>
> Reproduction step:
> Run this sample code on your laptop. I am trying to add 3000 new columns to a 
> base dataframe with 1 column.
>  
>  
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.lit
> import org.apache.spark.sql.types.{DataTypes, StructField}
> import spark.implicits._
>
> val newColumnsToBeAdded: Seq[StructField] = for (i <- 1 to 3000) yield new 
> StructField("field_" + i, DataTypes.LongType)
> val baseDataFrame: DataFrame = Seq((1)).toDF("employee_id")
> val result = newColumnsToBeAdded.foldLeft(baseDataFrame)((df, newColumn) => 
> df.withColumn(newColumn.name, lit(0)))
> result.show(false)
>  
> {code}
> Ends up with following stacktrace:
> java.lang.StackOverflowError
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>  at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>  at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>  at scala.collection.immutable.List.map(List.scala:296)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:333)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26229) Expose SizeEstimator as a developer API in pyspark

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26229.
---
Resolution: Won't Fix

It's pretty internal. I don't think we'd want to expose it further, especially 
as it can't account for Python memory usage.
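
For reference, the JVM-side developer API is a one-liner from Scala; it only 
measures JVM object graphs, which is part of why it would say little about 
Python-side memory:

{code:scala}
import org.apache.spark.util.SizeEstimator

// Rough walk-the-object-graph estimate in bytes; JVM objects only.
val bytes = SizeEstimator.estimate(Array.fill(1000)("some string"))
println(s"approx. size: $bytes bytes")
{code}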

> Expose SizeEstimator as a developer API in pyspark
> --
>
> Key: SPARK-26229
> URL: https://issues.apache.org/jira/browse/SPARK-26229
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Ranjith Pulluru
>Priority: Minor
>
> SizeEstimator is not available in PySpark.
> This API would be helpful for understanding the memory footprint of an object.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26247) SPIP - ML Model Extension for no-Spark MLLib Online Serving

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782494#comment-16782494
 ] 

Sean Owen commented on SPARK-26247:
---

I tend to think of this problem as about as solved as it will be by MLeap. 
There just hasn't been a representation that 'rules them all', so a library to 
translate Spark models is probably about as good as it gets. PMML is the best 
shot at a format and it doesn't cover enough; PFA seems similar in this regard. 

> SPIP - ML Model Extension for no-Spark MLLib Online Serving
> ---
>
> Key: SPARK-26247
> URL: https://issues.apache.org/jira/browse/SPARK-26247
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.0
>Reporter: Anne Holler
>Priority: Major
>  Labels: SPIP
> Attachments: SPIPMlModelExtensionForOnlineServing.pdf
>
>
> This ticket tracks an SPIP to improve model load time and model serving 
> interfaces for online serving of Spark MLlib models.  The SPIP is here
> [https://docs.google.com/a/uber.com/document/d/e/2PACX-1vRttVNNMBt4pBU2oBWKoiK3-7PW6RDwvHNgSMqO67ilxTX_WUStJ2ysUdAk5Im08eyHvlpcfq1g-DLF/pub]
>  
> The improvement opportunity exists in all versions of spark.  We developed 
> our set of changes wrt version 2.1.0 and can port them forward to other 
> versions (e.g., we have ported them forward to 2.3.2).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26257) SPIP: Interop Support for Spark Language Extensions

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782491#comment-16782491
 ] 

Sean Owen commented on SPARK-26257:
---

I'm not sure how much of the Pyspark and SparkR integration is meaningfully 
refactorable to support other languages. They both wrap Spark, effectively. I'm 
not sure therefore it's necessary or a good thing to further modify Spark for 
these things, if they already manage to support Python and R. With the 
exception of C#, those are niche languages anyway. I'd be open to seeing 
concrete proposals.

> SPIP: Interop Support for Spark Language Extensions
> ---
>
> Key: SPARK-26257
> URL: https://issues.apache.org/jira/browse/SPARK-26257
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, R, Spark Core
>Affects Versions: 2.4.0
>Reporter: Tyson Condie
>Priority: Minor
>
> h2. Background and Motivation:
> There is a desire for third party language extensions for Apache Spark. Some 
> notable examples include:
>  * C#/F# from project Mobius [https://github.com/Microsoft/Mobius]
>  * Haskell from project sparkle [https://github.com/tweag/sparkle]
>  * Julia from project Spark.jl [https://github.com/dfdx/Spark.jl]
> Presently, Apache Spark supports Python and R via a tightly integrated 
> interop layer. It would seem that much of that existing interop layer could 
> be refactored into a clean surface for general (third party) language 
> bindings, such as the above mentioned. More specifically, could we generalize 
> the following modules:
>  * Deploy runners (e.g., PythonRunner and RRunner)
>  * DataFrame Executors
>  * RDD operations?
> The last being questionable: integrating third party language extensions at 
> the RDD level may be too heavy-weight and unnecessary given the preference 
> towards the Dataframe abstraction.
> The main goals of this effort would be:
>  * Provide a clean abstraction for third party language extensions making it 
> easier to maintain (the language extension) with the evolution of Apache Spark
>  * Provide guidance to third party language authors on how a language 
> extension should be implemented
>  * Provide general reusable libraries that are not specific to any language 
> extension
>  * Open the door to developers that prefer alternative languages
>  * Identify and clean up common code shared between Python and R interops
> h2. Target Personas:
> Data Scientists, Data Engineers, Library Developers
> h2. Goals:
> Data scientists and engineers will have the opportunity to work with Spark in 
> languages other than what’s natively supported. Library developers will be 
> able to create language extensions for Spark in a clean way. The interop 
> layer should also provide guidance for developing language extensions.
> h2. Non-Goals:
> The proposal does not aim to create an actual language extension. Rather, it 
> aims to provide a stable interop layer for third party language extensions to 
> dock.
> h2. Proposed API Changes:
> Much of the work will involve generalizing existing interop APIs for PySpark 
> and R, specifically for the Dataframe API. For instance, it would be good to 
> have a general deploy.Runner (similar to PythonRunner) for language extension 
> efforts. In Spark SQL, it would be good to have a general InteropUDF and 
> evaluator (similar to BatchEvalPythonExec).
> Low-level RDD operations should not be needed in this initial offering; 
> depending on the success of the interop layer and with proper demand, RDD 
> interop could be added later. However, one open question is supporting a 
> subset of low-level functions that are core to ETL e.g., transform.
> h2. Optional Design Sketch:
> The work would be broken down into two top-level phases:
>  Phase 1: Introduce general interop API for deploying a driver/application, 
> running an interop UDF along with any other low-level transformations that 
> aid with ETL.
> Phase 2: Port existing Python and R language extensions to the new interop 
> layer. This port should be contained solely to the Spark core side, and all 
> protocols specific to Python and R should not change e.g., Python should 
> continue to use py4j as the protocol between the Python process and core 
> Spark. The port itself should be contained to a handful of files e.g., some 
> examples for Python: PythonRunner, BatchEvalPythonExec, +PythonUDFRunner+, 
> PythonRDD (possibly), and will mostly involve refactoring common logic 
> abstract implementations and utilities.
> h2. Optional Rejected Designs:
> The clear alternative is the status quo; developers that want to provide a 
> third-party language extension to Spark do so directly; often by extending 
> existing Python classes and overriding the portions that are relevant to the 
> new extension. Not only is 

[jira] [Commented] (SPARK-26261) Spark does not check completeness of temporary file

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782489#comment-16782489
 ] 

Sean Owen commented on SPARK-26261:
---

Why would you truncate the file -- how would this happen normally?

> Spark does not check completeness of temporary file 
> -
>
> Key: SPARK-26261
> URL: https://issues.apache.org/jira/browse/SPARK-26261
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: Jialin LIu
>Priority: Minor
>
> Spark does not check temporary files' completeness. When persisting to disk 
> is enabled on some RDDs, a bunch of temporary files will be created in the 
> blockmgr folder. The block manager is able to detect missing blocks, but it is 
> not able to detect file content being modified during execution. 
> Our initial test shows that if we truncate a block file before it is used 
> by executors, the program will finish without detecting any error, but the 
> result content is totally wrong.
> We believe there should be a checksum on every RDD file block, and these 
> files should be protected by it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26266) Update to Scala 2.12.8

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26266.
---
   Resolution: Fixed
 Assignee: Sean Owen  (was: Yuming Wang)
Fix Version/s: 3.0.0
   2.4.1

> Update to Scala 2.12.8
> --
>
> Key: SPARK-26266
> URL: https://issues.apache.org/jira/browse/SPARK-26266
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: release-notes
> Fix For: 2.4.1, 3.0.0
>
>
> [~yumwang] notes that Scala 2.12.8 is out and fixes two minor issues:
> Don't reject views with result types which are TypeVars (#7295)
> Don't emit static forwarders (which simplify the use of methods in top-level 
> objects from Java) for bridge methods (#7469)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26272) Please delete old releases from mirroring system

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26272.
---
   Resolution: Fixed
Fix Version/s: 2.3.3

Yep, we do this regularly, and do it alongside releases now, and 2.3.1 was 
zapped a while ago. 2.3.2 should be removed; not sure why it wasn't with 
2.3.3's release.

> Please delete old releases from mirroring system
> 
>
> Key: SPARK-26272
> URL: https://issues.apache.org/jira/browse/SPARK-26272
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Documentation, Web UI
>Affects Versions: 2.3.1
>Reporter: Sebb
>Priority: Major
> Fix For: 2.3.3
>
>
> The release notes say 2.3.2 is the latest release for the 2.3.x line.
> As such, earlier releases such as 2.3.1 should no longer be hosted on the 
> mirrors.
> Please drop https://www.apache.org/dist/spark/spark-2.3.1/ and adjust any 
> remaining download links accordingly



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27029) Update Thrift to 0.12.0

2019-03-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782441#comment-16782441
 ] 

Apache Spark commented on SPARK-27029:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23935

> Update Thrift to 0.12.0
> ---
>
> Key: SPARK-27029
> URL: https://issues.apache.org/jira/browse/SPARK-27029
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> We should update to Thrift 0.12.0 to pick up security and bug fixes. It 
> appears to be compatible with the current build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27029) Update Thrift to 0.12.0

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27029:


Assignee: Sean Owen  (was: Apache Spark)

> Update Thrift to 0.12.0
> ---
>
> Key: SPARK-27029
> URL: https://issues.apache.org/jira/browse/SPARK-27029
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> We should update to Thrift 0.12.0 to pick up security and bug fixes. It 
> appears to be compatible with the current build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27029) Update Thrift to 0.12.0

2019-03-02 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782439#comment-16782439
 ] 

Apache Spark commented on SPARK-27029:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/23935

> Update Thrift to 0.12.0
> ---
>
> Key: SPARK-27029
> URL: https://issues.apache.org/jira/browse/SPARK-27029
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> We should update to Thrift 0.12.0 to pick up security and bug fixes. It 
> appears to be compatible with the current build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27029) Update Thrift to 0.12.0

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27029:


Assignee: Apache Spark  (was: Sean Owen)

> Update Thrift to 0.12.0
> ---
>
> Key: SPARK-27029
> URL: https://issues.apache.org/jira/browse/SPARK-27029
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> We should update to Thrift 0.12.0 to pick up security and bug fixes. It 
> appears to be compatible with the current build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository

2019-03-02 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26048.
---
   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 3.0.0
   2.4.1

This is resolved via https://github.com/apache/spark/pull/23931

> Flume connector for Spark 2.4 does not exist in Maven repository
> 
>
> Key: SPARK-26048
> URL: https://issues.apache.org/jira/browse/SPARK-26048
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Aki Tanaka
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> Flume connector for Spark 2.4 does not exist in the Maven repository.
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume]
>  
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink]
> These packages will be removed in Spark 3. But Spark 2.4 branch still has 
> these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository

2019-03-02 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782435#comment-16782435
 ] 

Dongjoon Hyun edited comment on SPARK-26048 at 3/2/19 4:18 PM:
---

This is resolved via https://github.com/apache/spark/pull/23931

Please note that the code is merged into the master branch only.
The `branch-2.4` release code has `-Pflume` already.

As [~vanzin] explained, we should use the release script from the `master` branch, 
and [~dbtsai] will use that script for the next 2.4.1 RC to include `flume`. So I 
set `2.4.1/3.0.0` as `Fixed Versions` (although the code is merged only into 
`master`).


was (Author: dongjoon):
This is resolved via https://github.com/apache/spark/pull/23931

> Flume connector for Spark 2.4 does not exist in Maven repository
> 
>
> Key: SPARK-26048
> URL: https://issues.apache.org/jira/browse/SPARK-26048
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Aki Tanaka
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> Flume connector for Spark 2.4 does not exist in the Maven repository.
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume]
>  
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink]
> These packages will be removed in Spark 3. But Spark 2.4 branch still has 
> these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`

2019-03-02 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782429#comment-16782429
 ] 

Sean Owen commented on SPARK-26943:
---

You can download and run the Spark distro wherever you like. If you want to 
execute against a cluster, of course, that cluster has to be updated. I suppose 
you could also spin up a hosted Spark with a newer version somewhere.

> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26274) Download page must link to https://www.apache.org/dist/spark for current releases

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26274:
--
   Priority: Minor  (was: Major)
Component/s: (was: Web UI)
 (was: Deploy)

Just saw this one -- sounds fine to me.
See https://github.com/apache/spark-website/pull/184

> Download page must link to https://www.apache.org/dist/spark for current 
> releases
> -
>
> Key: SPARK-26274
> URL: https://issues.apache.org/jira/browse/SPARK-26274
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sebb
>Priority: Minor
>
> The download page currently uses the archive server:
> https://archive.apache.org/dist/spark/...
> for all sigs and hashes.
> This is fine for archived releases; however, current ones must link to the 
> mirror system, i.e.
> https://www.apache.org/dist/spark/...
> Also, the page does not link directly to the hash or sig.
> This makes it very difficult for the user, as they have to choose the correct 
> file.
> The download page must link directly to the actual sig or hash.
> Ideally do so for the archived releases as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26943) Weird behaviour with `.cache()`

2019-03-02 Thread Will Uto (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782418#comment-16782418
 ] 

Will Uto commented on SPARK-26943:
--

Thanks for the explanation [~srowen], makes sense - I think this is why I couldn't 
reproduce it locally (on a smaller dataset).

Out of curiosity, is there a way to run a newer version of Spark on a cluster, 
e.g. within a Python virtual environment, or do I have to upgrade the entire 
cluster?

> Weird behaviour with `.cache()`
> ---
>
> Key: SPARK-26943
> URL: https://issues.apache.org/jira/browse/SPARK-26943
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Will Uto
>Priority: Major
>
>  
> {code:java}
> sdf.count(){code}
>  
> works fine. However:
>  
> {code:java}
> sdf = sdf.cache()
> sdf.count()
> {code}
>  does not, and produces error
> {code:java}
> Py4JJavaError: An error occurred while calling o314.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 75 
> in stage 8.0 failed 4 times, most recent failure: Lost task 75.3 in stage 8.0 
> (TID 438, uat-datanode-02, executor 1): java.text.ParseException: Unparseable 
> number: "(N/A)"
>   at java.text.NumberFormat.parse(NumberFormat.java:350)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27007) add rawPrediction to OneVsRest in PySpark

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27007.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23910
[https://github.com/apache/spark/pull/23910]

> add rawPrediction to OneVsRest in PySpark
> -
>
> Key: SPARK-27007
> URL: https://issues.apache.org/jira/browse/SPARK-27007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Add rawPrediction to OneVsRest in PySpark to make it consistent with the Scala 
> implementation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27007) add rawPrediction to OneVsRest in PySpark

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27007:
-

Assignee: Huaxin Gao

> add rawPrediction to OneVsRest in PySpark
> -
>
> Key: SPARK-27007
> URL: https://issues.apache.org/jira/browse/SPARK-27007
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Add rawPrediction to OneVsRest in PySpark to make it consistent with the Scala 
> implementation



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26860) RangeBetween docs appear to be wrong

2019-03-02 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-26860.
---
Resolution: Not A Problem

> RangeBetween docs appear to be wrong 
> -
>
> Key: SPARK-26860
> URL: https://issues.apache.org/jira/browse/SPARK-26860
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Shelby Vanhooser
>Priority: Major
>  Labels: docs, easyfix, python
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The docs describing 
> [RangeBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rangeBetween]
>  for PySpark appear to be duplicates of 
> [RowsBetween|http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween]
>  even though these are functionally different windows. rowsBetween references 
> preceding and succeeding rows, while rangeBetween is based on the values in 
> those rows.
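
A minimal PySpark sketch (toy data and column name {{x}} invented for 
illustration) contrasting the two frames, which is the distinction the docs 
should spell out:

{code:python}
# rowsBetween counts physical rows around the current row;
# rangeBetween selects rows whose ORDER BY value falls within the given offsets.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-frames-sketch").getOrCreate()

df = spark.createDataFrame([(1,), (2,), (2,), (5,)], ["x"])

rows_w = Window.orderBy("x").rowsBetween(Window.currentRow, 1)    # current row + next row
range_w = Window.orderBy("x").rangeBetween(Window.currentRow, 1)  # rows with x in [x, x + 1]

df.select(
    "x",
    F.sum("x").over(rows_w).alias("sum_rows"),
    F.sum("x").over(range_w).alias("sum_range"),
).show()
{code}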



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27032) Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27032:


Assignee: Sean Owen  (was: Apache Spark)

> Flaky test: 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog:
>  metadata directory collision
> ---
>
> Key: SPARK-27032
> URL: https://issues.apache.org/jira/browse/SPARK-27032
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Critical
>
> Locally and on Jenkins, I've frequently seen this test fail:
> {code}
> Error Message
> The await method on Waiter timed out.
> Stacktrace
>   org.scalatest.exceptions.TestFailedException: The await method on 
> Waiter timed out.
>   at org.scalatest.concurrent.Waiters$Waiter.awaitImpl(Waiters.scala:406)
>   at org.scalatest.concurrent.Waiters$Waiter.await(Waiters.scala:540)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19(HDFSMetadataLogSuite.scala:158)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19$adapted(HDFSMetadataLogSuite.scala:133)
> ...
> {code}
> See for example 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6057/testReport/
> There aren't obvious errors or problems with the test. Because it passes 
> sometimes, my guess is that the timeout is simply too short or the test too 
> long. I'd like to try reducing the number of threads/batches in the test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27032) Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27032:


Assignee: Apache Spark  (was: Sean Owen)

> Flaky test: 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog:
>  metadata directory collision
> ---
>
> Key: SPARK-27032
> URL: https://issues.apache.org/jira/browse/SPARK-27032
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Critical
>
> Locally and on Jenkins, I've frequently seen this test fail:
> {code}
> Error Message
> The await method on Waiter timed out.
> Stacktrace
>   org.scalatest.exceptions.TestFailedException: The await method on 
> Waiter timed out.
>   at org.scalatest.concurrent.Waiters$Waiter.awaitImpl(Waiters.scala:406)
>   at org.scalatest.concurrent.Waiters$Waiter.await(Waiters.scala:540)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19(HDFSMetadataLogSuite.scala:158)
>   at 
> org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19$adapted(HDFSMetadataLogSuite.scala:133)
> ...
> {code}
> See for example 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6057/testReport/
> There aren't obvious errors or problems with the test. Because it passes 
> sometimes, my guess is that the timeout is simply too short or the test too 
> long. I'd like to try reducing the number of threads/batches in the test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27032) Flaky test: org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: metadata directory collision

2019-03-02 Thread Sean Owen (JIRA)
Sean Owen created SPARK-27032:
-

 Summary: Flaky test: 
org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.HDFSMetadataLog: 
metadata directory collision
 Key: SPARK-27032
 URL: https://issues.apache.org/jira/browse/SPARK-27032
 Project: Spark
  Issue Type: Test
  Components: Spark Core, Tests
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


Locally and on Jenkins, I've frequently seen this test fail:

{code}
Error Message
The await method on Waiter timed out.
Stacktrace
  org.scalatest.exceptions.TestFailedException: The await method on Waiter 
timed out.
  at org.scalatest.concurrent.Waiters$Waiter.awaitImpl(Waiters.scala:406)
  at org.scalatest.concurrent.Waiters$Waiter.await(Waiters.scala:540)
  at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19(HDFSMetadataLogSuite.scala:158)
  at 
org.apache.spark.sql.execution.streaming.HDFSMetadataLogSuite.$anonfun$new$19$adapted(HDFSMetadataLogSuite.scala:133)
...
{code}

See for example 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6057/testReport/

There aren't obvious errors or problems with the test. Because it passes 
sometimes, my guess is that the timeout is simply too short or the test too 
long. I'd like to try reducing the number of threads/batches in the test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27031) Avoid double formatting in timestampToString

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27031:


Assignee: Apache Spark

> Avoid double formatting in timestampToString
> 
>
> Key: SPARK-27031
> URL: https://issues.apache.org/jira/browse/SPARK-27031
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> The method DateTimeUtils.timestampToString converts its input to a string 
> twice:
> - First, by converting the microseconds to java.sql.Timestamp and calling 
> toString: 
> https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L103
> - Second, by using the provided formatter: 
> https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L104
> This could be avoided by using a specialised TimestampFormatter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27031) Avoid double formatting in timestampToString

2019-03-02 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27031:


Assignee: (was: Apache Spark)

> Avoid double formatting in timestampToString
> 
>
> Key: SPARK-27031
> URL: https://issues.apache.org/jira/browse/SPARK-27031
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> The method DateTimeUtils.timestampToString converts its input to a string 
> twice:
> - First, by converting the microseconds to java.sql.Timestamp and calling 
> toString: 
> https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L103
> - Second, by using the provided formatter: 
> https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L104
> This could be avoided by using a specialised TimestampFormatter.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27031) Avoid double formatting in timestampToString

2019-03-02 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-27031:
--

 Summary: Avoid double formatting in timestampToString
 Key: SPARK-27031
 URL: https://issues.apache.org/jira/browse/SPARK-27031
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


The method DateTimeUtils.timestampToString converts its input to a string 
twice:
- First, by converting the microseconds to java.sql.Timestamp and calling 
toString: 
https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L103
- Second, by using the provided formatter: 
https://github.com/apache/spark/blob/8e5f9995cad409799f3646b3d03761a771ea1664/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L104

This could be avoided by using a specialised TimestampFormatter.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27030) DataFrameWriter.insertInto fails when writing in parallel to a hive table

2019-03-02 Thread Lev Katzav (JIRA)
Lev Katzav created SPARK-27030:
--

 Summary: DataFrameWriter.insertInto fails when writing in parallel 
to a hive table
 Key: SPARK-27030
 URL: https://issues.apache.org/jira/browse/SPARK-27030
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Lev Katzav


When writing to a Hive table, the following temp directory is used:
{code:java}
/path/to/table/_temporary/0/{code}
(the 0 at the end comes from the config
{code:java}
"mapreduce.job.application.attempt.id"{code}
which is missing here, so the attempt id falls back to 0)

When two processes write to the same table, the following race condition can 
occur:
 # p1 creates the temp folder and uses it
 # p2 uses the same temp folder
 # p1 finishes and deletes the temp folder
 # p2 fails because the temp folder is missing

 

It is possible to reproduce this error locally with the following code (it runs 
locally, but I experienced the same error when running on a cluster with 2 jobs 
writing to the same table):
{code:java}
// Imports added so the snippet runs as-is in spark-shell on Scala 2.11
// (with Scala 2.12, use java.util.concurrent.ForkJoinPool instead).
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

val df = spark
  .range(1000)
  .toDF("a")
  .withColumn("partition", lit(0))
  .cache()

// create db
sqlContext.sql("CREATE DATABASE IF NOT EXISTS db").count()

// create table
df.write
  .partitionBy("partition")
  .saveAsTable("db.table")

// 10 threads inserting into different partitions in parallel
val x = (1 to 100).par
x.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(10))

x.foreach { p =>
  val df2 = df.withColumn("partition", lit(p))
  df2.write
    .mode(SaveMode.Overwrite)
    .insertInto("db.table")
}
{code}
 

The error would be:
{code:java}
java.io.FileNotFoundException: File 
file:/path/to/warehouse/db.db/table/_temporary/0 does not exist
 at 
org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:406)
 at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497)
 at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
 at 
org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:669)
 at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1497)
 at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1537)
 at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:283)
 at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:325)
 at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
 at 
org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:185)
 at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
 at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
 at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
 at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
 at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
 at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:325)
 at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:311)
 at 
company.name.spark.hive.SparkHiveUtilsTest$$anonfun$3$$anonfun$apply$mcV$sp$1.apply$mcVI$sp(SparkHiveUtilsTest.scala:190)
 at 
scala.collection.parallel.immutable.ParRange$ParRangeIterator.foreach(ParRange.scala:91)
 at 
scala.collection.parallel.ParIterableLike$Foreach.leaf(ParIterableLike.scala:972)
 at 

[jira] [Commented] (SPARK-26998) spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor processes in Standalone mode

2019-03-02 Thread t oo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782330#comment-16782330
 ] 

t oo commented on SPARK-26998:
--

[https://github.com/apache/spark/pull/23820] is only about hiding the password 
from the log file; SPARK-26998 is about hiding passwords from showing up in the 
'ps -ef' process list.

> spark.ssl.keyStorePassword in plaintext on 'ps -ef' output of executor 
> processes in Standalone mode
> ---
>
> Key: SPARK-26998
> URL: https://issues.apache.org/jira/browse/SPARK-26998
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Security, Spark Core
>Affects Versions: 2.3.3, 2.4.0
>Reporter: t oo
>Priority: Major
>  Labels: SECURITY, Security, secur, security, security-issue
>
> Run Spark in standalone mode, then start a spark-submit job requiring at least 
> 1 executor. Run 'ps -ef' on Linux (e.g. in a PuTTY terminal) and you will be 
> able to see the spark.ssl.keyStorePassword value in plaintext!
>  
> spark.ssl.keyStorePassword and  spark.ssl.keyPassword don't need to be passed 
> to  CoarseGrainedExecutorBackend. Only  spark.ssl.trustStorePassword is used.
>  
> This can be resolved if the PR below is merged:
> [[Github] Pull Request #21514 
> (tooptoop4)|https://github.com/apache/spark/pull/21514]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27025) Speed up toLocalIterator

2019-03-02 Thread Erik van Oosten (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322
 ] 

Erik van Oosten edited comment on SPARK-27025 at 3/2/19 8:43 AM:
-

The point is to _not_ fetch proactively.

I have a program in which several steps need to be executed before anything can 
be transferred to the driver. So why can't the executors start executing 
immediately, and only transfer the results to the driver when it's ready?


was (Author: erikvanoosten):
I have a program in which several steps need to be executed before anything can 
be transferred to the driver. So why can't the executors start executing 
immediately, and only transfer the results to the driver when it's ready?

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, the computation required for the 
> yet-to-be-fetched partitions is not kicked off until each one is fetched. 
> Effectively only one partition is being computed at a time. 
> Desired behavior: immediately start computing all partitions while 
> retaining the download-one-partition-at-a-time behavior.
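
For what it's worth, a minimal PySpark sketch of the usual workaround today: 
persist and force computation of all partitions first, then iterate, so 
toLocalIterator only pays the per-partition transfer cost (the numbers and app 
name below are arbitrary).

{code:python}
# Workaround sketch: force all partitions to be computed up front, then pull
# them to the driver one partition at a time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tolocaliterator-sketch").getOrCreate()

df = spark.range(0, 1000000, numPartitions=16)

df.persist()
df.count()  # triggers computation of every partition on the executors

for row in df.toLocalIterator():
    pass  # process rows on the driver; roughly one partition in memory at a time

df.unpersist()
{code}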



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27025) Speed up toLocalIterator

2019-03-02 Thread Erik van Oosten (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782322#comment-16782322
 ] 

Erik van Oosten commented on SPARK-27025:
-

I have a program in which several steps need to be executed before anything can 
be transferred to the driver. So why can't the executors start executing 
immediately, and only transfer the results to the driver when it's ready?

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, the computation required for the 
> yet-to-be-fetched partitions is not kicked off until each one is fetched. 
> Effectively only one partition is being computed at a time. 
> Desired behavior: immediately start computing all partitions while 
> retaining the download-one-partition-at-a-time behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org