[jira] [Created] (SPARK-24541) TCP based shuffle

2018-06-12 Thread Jose Torres (JIRA)
Jose Torres created SPARK-24541:
---

 Summary: TCP based shuffle
 Key: SPARK-24541
 URL: https://issues.apache.org/jira/browse/SPARK-24541
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jose Torres









[jira] [Updated] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-06-12 Thread Ashwin K (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashwin K updated SPARK-24540:
-
Description: 
Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data only 
supports a single-character delimiter. If we try to provide a multi-character delimiter, 
we observe the following error message.

e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true")
                                    .option("header", "false")
                                    .option("delimiter", ", ")
                                    .csv("C:\\test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot 
be more than one character: , 

at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
 at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at scala.Option.orElse(Option.scala:289)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

Generally, the data to be processed contains multi-character delimiters, and at present 
we need to do a manual clean-up of the source/input file, which doesn't work well in 
large applications that consume numerous files.

There seem to be workarounds, such as reading the data as text and splitting it 
manually, but in my opinion this defeats the purpose, advantage, and efficiency of 
a direct read from a CSV file.
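
For reference, a minimal Scala sketch (not from the report; the path and column names are hypothetical) of the read-as-text-and-split workaround mentioned above, assuming a two-character delimiter of ", ":
{code:scala}
// Workaround sketch: read each line as plain text, then split on the
// multi-character delimiter ourselves. Path and column names are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

val spark = SparkSession.builder().appName("multi-char-delimiter").master("local[*]").getOrCreate()

val raw = spark.read.textFile("C:\\test.txt")                   // Dataset[String], one row per line
val fields = raw.select(split(col("value"), ", ").as("fields")) // split on the ", " delimiter
val df = fields.selectExpr("fields[0] as c0", "fields[1] as c1", "fields[2] as c2")
df.show()
{code}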

 

  was:
Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data only 
supports a single-character delimiter. If we try to provide a multi-character delimiter, 
we observe the following error message.

e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true")
                                    .option("header", "false")
                                    .option("delimiter", ", ")
                                    .csv("C:\\test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot 
be more than one character: , 

at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
 at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at scala.Option.orElse(Option.scala:289)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

Generally, the data to be processed contains multi-character delimiters, and at present 
we need to do a manual clean-up of the source/input file, which doesn't work well in 
large applications that consume numerous files.

There seem to be workarounds, such as reading the data as text and splitting it 
manually, but in my opinion this defeats the purpose, advantage, and efficiency of 
a direct read from a CSV file.

 


> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data 
> only supports a single-character delimiter. If we try to provide a multi-character 
> delimiter, we observe the following error message.

[jira] [Created] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2018-06-12 Thread Ashwin K (JIRA)
Ashwin K created SPARK-24540:


 Summary: Support for multiple delimiter in Spark CSV read
 Key: SPARK-24540
 URL: https://issues.apache.org/jira/browse/SPARK-24540
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Ashwin K


Currently, the delimiter option used by Spark 2.0 to read and split CSV files/data only 
supports a single-character delimiter. If we try to provide a multi-character delimiter, 
we observe the following error message.

e.g.: Dataset<Row> df = spark.read().option("inferSchema", "true")
                                    .option("header", "false")
                                    .option("delimiter", ", ")
                                    .csv("C:\\test.txt");

Exception in thread "main" java.lang.IllegalArgumentException: Delimiter cannot 
be more than one character: , 

at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
 at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
 at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
 at scala.Option.orElse(Option.scala:289)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
 at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)

Generally, the data to be processed contains multi-character delimiters, and at present 
we need to do a manual clean-up of the source/input file, which doesn't work well in 
large applications that consume numerous files.

There seem to be workarounds, such as reading the data as text and splitting it 
manually, but in my opinion this defeats the purpose, advantage, and efficiency of 
a direct read from a CSV file.

 






[jira] [Updated] (SPARK-20592) Alter table concatenate is not working as expected.

2018-06-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-20592:
--
Affects Version/s: 2.2.1
   2.3.1

> Alter table concatenate is not working as expected.
> ---
>
> Key: SPARK-20592
> URL: https://issues.apache.org/jira/browse/SPARK-20592
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.2.1, 2.3.1
>Reporter: Guru Prabhakar Reddy Marthala
>Priority: Major
>  Labels: hive, pyspark
>
> Created a table using CTAS from CSV to Parquet. The Parquet table generated 
> numerous small files. I tried "alter table ... concatenate", but it's not working 
> as expected.
> spark.sql("CREATE TABLE flight.flight_data(year INT,   month INT,   day INT,  
>  day_of_week INT,   dep_time INT,   crs_dep_time INT,   arr_time INT,   
> crs_arr_time INT,   unique_carrier STRING,   flight_num INT,   tail_num 
> STRING,   actual_elapsed_time INT,   crs_elapsed_time INT,   air_time INT,   
> arr_delay INT,   dep_delay INT,   origin STRING,   dest STRING,   distance 
> INT,   taxi_in INT,   taxi_out INT,   cancelled INT,   cancellation_code 
> STRING,   diverted INT,   carrier_delay STRING,   weather_delay STRING,   
> nas_delay STRING,   security_delay STRING,   late_aircraft_delay STRING) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' stored as textfile")
> spark.sql("load data local INPATH 'i:/2008/2008.csv' INTO TABLE 
> flight.flight_data")
> spark.sql("create table flight.flight_data_pq stored as parquet as select * 
> from flight.flight_data")
> spark.sql("create table flight.flight_data_orc stored as orc as select * from 
> flight.flight_data")
> pyspark.sql.utils.ParseException: u'\nOperation not allowed: alter table 
> concatenate(line 1, pos 0)\n\n== SQL ==\nalter table 
> flight_data.flight_data_pq concatenate\n^^^\n'
> Tried on both ORC and Parquet formats. It's not working.
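
Not part of the report, but for context: since Spark SQL rejects ALTER TABLE ... CONCATENATE, small files are usually compacted by rewriting the table's data into fewer, larger files. A minimal sketch, assuming the table names above and that a full rewrite is acceptable:
{code:scala}
// Hedged sketch of a common compaction approach (not from the report):
// rewrite the table through coalesce() so each output file is larger.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

spark.table("flight.flight_data_pq")
  .coalesce(8)                                      // target number of output files (assumption)
  .write.mode(SaveMode.Overwrite)
  .saveAsTable("flight.flight_data_pq_compacted")   // write to a new table, then swap if desired
{code}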






[jira] [Resolved] (SPARK-23606) Flakey FileBasedDataSourceSuite

2018-06-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-23606.
---
Resolution: Duplicate

Thank you for reporting, [~henryr]. This is already tracked by SPARK-23390. 
I'll close this one to avoid splitting the information across two issues.

> Flakey FileBasedDataSourceSuite
> ---
>
> Key: SPARK-23606
> URL: https://issues.apache.org/jira/browse/SPARK-23606
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Priority: Major
>
> I've seen the following exception twice today in PR builds (one example: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/87978/testReport/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/).
>  It's not deterministic, as I've had one PR build pass in the same span.
> {code:java}
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 15 times over 10.016101897 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:30)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:30)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:379)
>   at 
> org.scalatest.Status$$anonfun$withAfterEffect$1.apply(Status.scala:375)
>   at org.scalatest.SucceededStatus$.whenCompleted(Status.scala:454)
>   at org.scalatest.Status$class.withAfterEffect(Status.scala:375)
>   at org.scalatest.SucceededStatus$.withAfterEffect(Status.scala:426)
>   at 
> org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:232)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.runTest(FileBasedDataSourceSuite.scala:30)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite$class.run(Suite.scala:1147)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:52)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 

[jira] [Commented] (SPARK-23797) SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-06-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510612#comment-16510612
 ] 

Dongjoon Hyun commented on SPARK-23797:
---

In general, I agree with [~maropu]. This is too general a question and lacks 
information.

[~tinvukhac], if you are using vanilla Spark with the default configuration, 
please try again with Apache Spark 2.4.0-SNAPSHOT, or with Apache Spark 2.3.1 plus 
`spark.sql.orc.impl=native` and `spark.sql.hive.convertMetastoreOrc=true`.
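
For reference, a minimal sketch of how the suggested settings can be applied when building the session (the application and table names are illustrative):
{code:scala}
// Sketch: enabling the native ORC reader and metastore ORC conversion suggested above.
// The same settings can also be passed via --conf on spark-submit.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tpcds-orc-native")                          // hypothetical application name
  .config("spark.sql.orc.impl", "native")
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SELECT * FROM store_sales LIMIT 10").show()  // store_sales: a TPC-DS table
{code}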

> SparkSQL performance on small TPCDS tables is very low when compared to Drill 
> or Presto
> ---
>
> Key: SPARK-23797
> URL: https://issues.apache.org/jira/browse/SPARK-23797
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Submit, SQL
>Affects Versions: 2.3.0
>Reporter: Tin Vu
>Priority: Major
>
> I am executing a benchmark to compare the performance of SparkSQL, Apache Drill, 
> and Presto. My experimental setup:
>  * TPCDS dataset with scale factor 100 (size 100 GB).
>  * Spark, Drill, and Presto have the same number of workers: 12.
>  * Each worker has the same allocated amount of memory: 4 GB.
>  * Data is stored by Hive in ORC format.
> I executed a very simple SQL query: "SELECT * from table_name".
>  The issue is that for some small tables (even tables with a few dozen records), 
> SparkSQL still required about 7-8 seconds to finish, while Drill and Presto needed 
> less than 1 second.
>  For other large tables with billions of records, SparkSQL performance was 
> reasonable, requiring 20-30 seconds to scan the whole table.
>  Do you have any idea or a reasonable explanation for this issue?
> Thanks,






[jira] [Resolved] (SPARK-24485) Measure and log elapsed time for filesystem operations in HDFSBackedStateStoreProvider

2018-06-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24485.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21506
[https://github.com/apache/spark/pull/21506]

> Measure and log elapsed time for filesystem operations in 
> HDFSBackedStateStoreProvider
> --
>
> Key: SPARK-24485
> URL: https://issues.apache.org/jira/browse/SPARK-24485
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 2.4.0
>
>
> There are a couple of operations in HDFSBackedStateStoreProvider which communicate 
> with the file system (mostly remote HDFS in production) and contribute a huge part 
> of the latency.
> It would be better to measure the latency (elapsed time) and log it, to help 
> debugging when there is unexpectedly high latency on the state store.
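
For context, a minimal sketch (not the actual patch; names are illustrative) of the kind of elapsed-time measurement described:
{code:scala}
// Sketch: time a block of code and report the elapsed milliseconds.
// Spark itself would log through its Logging trait; println keeps the example self-contained.
object Timing {
  def timeMs[T](label: String)(body: => T): T = {
    val start = System.nanoTime()
    try body
    finally {
      val elapsedMs = (System.nanoTime() - start) / 1e6
      println(f"$label took $elapsedMs%.1f ms")
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical example: timing a local filesystem listing.
    val entries = timeMs("list /tmp")(new java.io.File("/tmp").list().toSeq)
    println(s"${entries.size} entries")
  }
}
{code}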






[jira] [Assigned] (SPARK-24485) Measure and log elapsed time for filesystem operations in HDFSBackedStateStoreProvider

2018-06-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24485:


Assignee: Jungtaek Lim

> Measure and log elapsed time for filesystem operations in 
> HDFSBackedStateStoreProvider
> --
>
> Key: SPARK-24485
> URL: https://issues.apache.org/jira/browse/SPARK-24485
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 2.4.0
>
>
> There are a couple of operations in HDFSBackedStateStoreProvider which communicate 
> with the file system (mostly remote HDFS in production) and contribute a huge part 
> of the latency.
> It would be better to measure the latency (elapsed time) and log it, to help 
> debugging when there is unexpectedly high latency on the state store.






[jira] [Resolved] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24466.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21497
[https://github.com/apache/spark/pull/21497]

> TextSocketMicroBatchReader no longer works with nc utility
> --
>
> Key: SPARK-24466
> URL: https://issues.apache.org/jira/browse/SPARK-24466
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0
>
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits before 
> reading the actual data, so the query also exits with an error.
> 
> The reason is that a temporary reader is launched to read the schema, then closed, 
> and the reader is re-opened. While a reliable socket server should be able to 
> handle this without any issue, the nc command normally can't handle multiple 
> connections and simply exits when the temporary reader is closed.
> 
> Given that the socket source is expected to be used in examples from the official 
> documentation or in quick experiments, for which we tend to simply use netcat, this 
> is better treated as a bug, though it is really a limitation of netcat.
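
For context, a minimal sketch of the usual netcat experiment the description refers to (start {{nc -lk 9999}} in another terminal first); this mirrors the standard socket-source example and is not the fix itself:
{code:scala}
// Minimal Structured Streaming job reading from the socket source.
import org.apache.spark.sql.SparkSession

object SocketDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("socket-demo")
      .master("local[2]")
      .getOrCreate()

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // With the behavior described above, nc exits once the schema-probing
    // connection is closed, and this query then fails.
    val query = lines.writeStream.format("console").start()
    query.awaitTermination()
  }
}
{code}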






[jira] [Assigned] (SPARK-24466) TextSocketMicroBatchReader no longer works with nc utility

2018-06-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24466:


Assignee: Jungtaek Lim

> TextSocketMicroBatchReader no longer works with nc utility
> --
>
> Key: SPARK-24466
> URL: https://issues.apache.org/jira/browse/SPARK-24466
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.0
>
>
> While playing with Spark 2.4.0-SNAPSHOT, I found that the nc command exits before 
> reading the actual data, so the query also exits with an error.
> 
> The reason is that a temporary reader is launched to read the schema, then closed, 
> and the reader is re-opened. While a reliable socket server should be able to 
> handle this without any issue, the nc command normally can't handle multiple 
> connections and simply exits when the temporary reader is closed.
> 
> Given that the socket source is expected to be used in examples from the official 
> documentation or in quick experiments, for which we tend to simply use netcat, this 
> is better treated as a bug, though it is really a limitation of netcat.






[jira] [Created] (SPARK-24539) HistoryServer does not display metrics from tasks that complete after stage failure

2018-06-12 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-24539:


 Summary: HistoryServer does not display metrics from tasks that 
complete after stage failure
 Key: SPARK-24539
 URL: https://issues.apache.org/jira/browse/SPARK-24539
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.1
Reporter: Imran Rashid


I noticed that task metrics for completed tasks with a stage failure do not 
show up in the new history server.  I have a feeling this is because all of the 
tasks succeeded *after* the stage had failed (so they were completions from a 
"zombie" taskset).  The task metrics (e.g. the shuffle read size & shuffle write 
size) do not show up at all, either in the task table, the executor table, or the 
overall stage summary metrics.  (They might not show up in the job summary page 
either, but in the event logs I have, there is another successful stage attempt 
after this one, and that is the only thing which shows up in the jobs page.)  If 
you get task details from the API endpoint (e.g. 
http://[host]:[port]/api/v1/applications/[app-id]/stages/[stage-id]/[stage-attempt])
 then you can see the successful tasks and all the metrics.

Unfortunately the event logs I have are huge and I don't have a small repro 
handy, but I hope that description is enough to go on.

I loaded the event logs I have into the SHS from Spark 2.2 and they appear fine.






[jira] [Commented] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510570#comment-16510570
 ] 

Lee Dongjin commented on SPARK-24530:
-

In my case, it works correctly on current master (commit 9786ce6). My 
environment is Ubuntu 18.04, Python 2.7, and Sphinx 1.7.5.

!pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png! 

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated the Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my local machine. 
> Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?






[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Lee Dongjin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lee Dongjin updated SPARK-24530:

Attachment: pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png, 
> pyspark-ml-doc-utuntu18.04-python2.7-sphinx-1.7.5.png
>
>
> I generated the Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from the master docs generated on my local machine. 
> Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?






[jira] [Updated] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Description: 
The latest Parquet supports decimal type statistics, so we can push decimal predicates down:
{noformat}
LM-SHC-16502798:parquet-mr yumwang$ java -jar 
./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet

file:         
file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet

creator:      parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)

extra:        org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}



file schema:  spark_schema



id:           REQUIRED INT64 R:0 D:0

d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1

d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1

d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1

d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1

d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1

d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1



row group 1:  RC:241867 TS:15480513 OFFSET:4



id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]

d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]

d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]

d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]

d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., num_nulls: 0]

d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
241866, num_nulls: 0]

d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, max: 
241866.00, num_nulls: 0]



row group 2:  RC:241867 TS:15480513 OFFSET:8584904



id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]

d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]

d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]

d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]

d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., num_nulls: 0]

d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
max: 483733, num_nulls: 0]

d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
241867.00, max: 483733.00, num_nulls: 
0]{noformat}
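
For context, a hedged sketch (not from the report) of how a Parquet file with decimal columns like d1-d6 above might be produced, so that parquet-mr writes the decimal min/max statistics a pushed-down filter could use:
{code:scala}
// Write a Parquet file with decimal columns of several physical types
// (INT32, INT64, FIXED_LEN_BYTE_ARRAY), similar to the dump above.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DecimalParquetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("decimal-stats").master("local[*]").getOrCreate()

    spark.range(0L, 483734L)
      .select(
        col("id"),
        col("id").cast("decimal(9,0)").as("d1"),
        col("id").cast("decimal(9,2)").as("d2"),
        col("id").cast("decimal(18,0)").as("d3"),
        col("id").cast("decimal(18,4)").as("d4"),
        col("id").cast("decimal(38,0)").as("d5"),
        col("id").cast("decimal(38,18)").as("d6"))
      .write.mode("overwrite").parquet("/tmp/spark/parquet/decimal")
  }
}
{code}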

> Decimal type support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> The latest Parquet supports decimal type statistics, so we can push decimal predicates down:
> {noformat}
> LM-SHC-16502798:parquet-mr yumwang$ java -jar 
> ./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
> /tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet
> file:         
> 

[jira] [Updated] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Description: 
The latest Parquet supports decimal type statistics, so we can push decimal predicates 
down to the data sources:
{noformat}
LM-SHC-16502798:parquet-mr yumwang$ java -jar 
./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet

file:         
file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet

creator:      parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)

extra:        org.apache.spark.sql.parquet.row.metadata = 
{"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}},{"name":"d1","type":"decimal(9,0)","nullable":true,"metadata":{}},{"name":"d2","type":"decimal(9,2)","nullable":true,"metadata":{}},{"name":"d3","type":"decimal(18,0)","nullable":true,"metadata":{}},{"name":"d4","type":"decimal(18,4)","nullable":true,"metadata":{}},{"name":"d5","type":"decimal(38,0)","nullable":true,"metadata":{}},{"name":"d6","type":"decimal(38,18)","nullable":true,"metadata":{}}]}



file schema:  spark_schema



id:           REQUIRED INT64 R:0 D:0

d1:           OPTIONAL INT32 O:DECIMAL R:0 D:1

d2:           OPTIONAL INT32 O:DECIMAL R:0 D:1

d3:           OPTIONAL INT64 O:DECIMAL R:0 D:1

d4:           OPTIONAL INT64 O:DECIMAL R:0 D:1

d5:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1

d6:           OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1



row group 1:  RC:241867 TS:15480513 OFFSET:4



id:            INT64 SNAPPY DO:0 FPO:4 SZ:968154/1935071/2.00 VC:241867 
ENC:BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]

d1:            INT32 SNAPPY DO:0 FPO:968158 SZ:967555/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]

d2:            INT32 SNAPPY DO:0 FPO:1935713 SZ:967558/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0.00, max: 241866.00, num_nulls: 0]

d3:            INT64 SNAPPY DO:0 FPO:2903271 SZ:968866/1935047/2.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 241866, num_nulls: 0]

d4:            INT64 SNAPPY DO:0 FPO:3872137 SZ:1247007/1935047/1.55 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0., max: 241866., num_nulls: 0]

d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5119144 
SZ:1266850/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0, max: 
241866, num_nulls: 0]

d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:6385994 
SZ:2198910/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 0E-18, max: 
241866.00, num_nulls: 0]



row group 2:  RC:241867 TS:15480513 OFFSET:8584904



id:            INT64 SNAPPY DO:0 FPO:8584904 SZ:968131/1935071/2.00 VC:241867 
ENC:BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]

d1:            INT32 SNAPPY DO:0 FPO:9553035 SZ:967563/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]

d2:            INT32 SNAPPY DO:0 FPO:10520598 SZ:967563/967515/1.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867.00, max: 483733.00, num_nulls: 0]

d3:            INT64 SNAPPY DO:0 FPO:11488161 SZ:968110/1935047/2.00 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, max: 483733, num_nulls: 0]

d4:            INT64 SNAPPY DO:0 FPO:12456271 SZ:1247071/1935047/1.55 VC:241867 
ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867., max: 483733., num_nulls: 0]

d5:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:13703342 
SZ:1270587/3870159/3.05 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 241867, 
max: 483733, num_nulls: 0]

d6:            FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:14973929 
SZ:2197306/3870159/1.76 VC:241867 ENC:RLE,BIT_PACKED,PLAIN ST:[min: 
241867.00, max: 483733.00, num_nulls: 
0]{noformat}

  was:
The latest Parquet supports decimal type statistics, so we can push decimal predicates down:
{noformat}
LM-SHC-16502798:parquet-mr yumwang$ java -jar 
./parquet-tools/target/parquet-tools-1.10.10-column-index-SNAPSHOT.jar meta 
/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet

file:         
file:/tmp/spark/parquet/decimal/part-0-3880e69a-6dd1-4c2b-946c-e7dae047f65c-c000.snappy.parquet

creator:      parquet-mr version 1.10.0 (build 
031a6654009e3b82020012a18434c582bd74c73a)

extra:        org.apache.spark.sql.parquet.row.metadata = 

[jira] [Assigned] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24538:


Assignee: Apache Spark

> Decimal type support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24538:


Assignee: (was: Apache Spark)

> Decimal type support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Issue Comment Deleted] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-24538:

Comment: was deleted

(was: I'm working on this.)

> Decimal type support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Commented] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510508#comment-16510508
 ] 

Apache Spark commented on SPARK-24538:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/21547

> Decimal type support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Commented] (SPARK-22357) SparkContext.binaryFiles ignore minPartitions parameter

2018-06-12 Thread John Brock (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510507#comment-16510507
 ] 

John Brock commented on SPARK-22357:


What are people's opinions of [~bomeng]'s fix? This bug just bit me, so I'd 
like to see this fixed.

> SparkContext.binaryFiles ignore minPartitions parameter
> ---
>
> Key: SPARK-22357
> URL: https://issues.apache.org/jira/browse/SPARK-22357
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.2, 2.2.0
>Reporter: Weichen Xu
>Priority: Major
>
> This is a bug in binaryFiles: even though we give it the number of partitions, 
> binaryFiles ignores it.
> The bug was introduced in Spark 2.1 (relative to Spark 2.0): in the file 
> PortableDataStream.scala the argument “minPartitions” is no longer used (with 
> the push to master on 11/7/6):
> {code}
> /**
> Allow minPartitions set by end-user in order to keep compatibility with old 
> Hadoop API
> which is set through setMaxSplitSize
> */
> def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: 
> Int) {
> val defaultMaxSplitBytes = 
> sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
> val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
> val defaultParallelism = sc.defaultParallelism
> val files = listStatus(context).asScala
> val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
> val bytesPerCore = totalBytes / defaultParallelism
> val maxSplitSize = Math.min(defaultMaxSplitBytes, 
> Math.max(openCostInBytes, bytesPerCore))
> super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The code previously, in version 2.0, was:
> {code}
> def setMinPartitions(context: JobContext, minPartitions: Int) {
> val totalLen = listStatus(context).asScala.filterNot(_.isDirectory).map(_.getLen).sum
> val maxSplitSize = math.ceil(totalLen / math.max(minPartitions, 
> 1.0)).toLong
> super.setMaxSplitSize(maxSplitSize)
> }
> {code}
> The new code is very smart, but it ignores what the user passes in and uses the 
> data size instead, which is kind of a breaking change in some sense.
> In our specific case this was a problem: we initially read in just the file names, 
> and only after that does the dataframe become very large, when reading in the 
> images themselves – and in this case the new code does not handle the partitioning 
> very well.
> I’m not sure if it can be easily fixed because I don’t understand the full context 
> of the change in Spark (but at the very least the unused parameter should be 
> removed to avoid confusion).
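
For context, a minimal sketch (the path and partition count are hypothetical) of the call whose minPartitions hint is reportedly ignored:
{code:scala}
// The caller asks for at least 512 input partitions; per this report, Spark 2.1+
// derives the split size from the data size instead and effectively ignores the hint.
import org.apache.spark.sql.SparkSession

object BinaryFilesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("binary-files-demo").getOrCreate()
    val sc = spark.sparkContext

    val images = sc.binaryFiles("hdfs:///data/images", minPartitions = 512)
    println(s"actual partitions: ${images.getNumPartitions}")
  }
}
{code}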






[jira] [Commented] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510503#comment-16510503
 ] 

Yuming Wang commented on SPARK-24538:
-

I'm working on this.

> Decimal type support push down to the data sources
> --
>
> Key: SPARK-24538
> URL: https://issues.apache.org/jira/browse/SPARK-24538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>







[jira] [Created] (SPARK-24538) Decimal type support push down to the data sources

2018-06-12 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-24538:
---

 Summary: Decimal type support push down to the data sources
 Key: SPARK-24538
 URL: https://issues.apache.org/jira/browse/SPARK-24538
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang









[jira] [Resolved] (SPARK-22239) User-defined window functions with pandas udf

2018-06-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22239.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21082
[https://github.com/apache/spark/pull/21082]

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place we can benefit from vectorized udfs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}






[jira] [Assigned] (SPARK-22239) User-defined window functions with pandas udf

2018-06-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22239:


Assignee: Li Jin

> User-defined window functions with pandas udf
> -
>
> Key: SPARK-22239
> URL: https://issues.apache.org/jira/browse/SPARK-22239
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: 
>Reporter: Li Jin
>Assignee: Li Jin
>Priority: Major
> Fix For: 2.4.0
>
>
> Window functions are another place we can benefit from vectorized udfs and add 
> another useful function to the pandas_udf suite.
> Example usage (preliminary):
> {code:java}
> w = Window.partitionBy('id').orderBy('time').rangeBetween(-200, 0)
> @pandas_udf(DoubleType())
> def ema(v1):
>     return v1.ewm(alpha=0.5).mean().iloc[-1]
> df.withColumn('v1_ema', ema(df.v1).over(w))
> {code}






[jira] [Commented] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct

2018-06-12 Thread Huaxin Gao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510418#comment-16510418
 ] 

Huaxin Gao commented on SPARK-24537:


I will work on this. Thanks!

> Add array_remove / array_zip / map_from_arrays / array_distinct
> ---
>
> Key: SPARK-24537
> URL: https://issues.apache.org/jira/browse/SPARK-24537
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add R versions of 
>  * array_remove   -SPARK-23920-
>  * array_zip   -SPARK-23931-
>  * map_from_arrays   -SPARK-23933-
>  * array_distinct   -SPARK-23912-






[jira] [Created] (SPARK-24537) Add array_remove / array_zip / map_from_arrays / array_distinct

2018-06-12 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24537:
--

 Summary: Add array_remove / array_zip / map_from_arrays / 
array_distinct
 Key: SPARK-24537
 URL: https://issues.apache.org/jira/browse/SPARK-24537
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Huaxin Gao


Add R versions of 
 * array_remove   -SPARK-23920-
 * array_zip   -SPARK-23931-
 * map_from_arrays   -SPARK-23933-
 * array_distinct   -SPARK-23912-






[jira] [Commented] (SPARK-24258) SPIP: Improve PySpark support for ML Matrix and Vector types

2018-06-12 Thread Leif Walsh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510406#comment-16510406
 ] 

Leif Walsh commented on SPARK-24258:


I think for PySpark users, we could just make it easy to use python UDFs and 
use numpy and pandas code directly. That’s probably the model those users want 
anyway, so they can be sure of numerical stability w.r.t. the pandas and numpy 
code they already have. 

I realize that then leaves the Scala and Java API users out in the cold, but 
it’s a possible place to start. Really, I think the “add many many linear 
algebra functions” problem is only a problem for the JVM APIs, where we’d need 
to link them up to breeze and test they do the same thing as the numpy ones. 

> SPIP: Improve PySpark support for ML Matrix and Vector types
> 
>
> Key: SPARK-24258
> URL: https://issues.apache.org/jira/browse/SPARK-24258
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Leif Walsh
>Priority: Major
>
> h1. Background and Motivation:
> In Spark ML ({{pyspark.ml.linalg}}), there are four column types you can 
> construct, {{SparseVector}}, {{DenseVector}}, {{SparseMatrix}}, and 
> {{DenseMatrix}}.  In PySpark, you can construct one of these vectors with 
> {{VectorAssembler}}, and then you can run python UDFs on these columns, and 
> use {{toArray()}} to get numpy ndarrays and do things with them.  They also 
> have a small native API where you can compute {{dot()}}, {{norm()}}, and a 
> few other things with them (I think these are computed in scala, not python, 
> could be wrong).
> For statistical applications, having the ability to manipulate columns of 
> matrix and vector values (from here on, I will use the term tensor to refer 
> to arrays of arbitrary dimensionality, matrices are 2-tensors and vectors are 
> 1-tensors) would be powerful.  For example, you could use PySpark to reshape 
> your data in parallel, assemble some matrices from your raw data, and then 
> run some statistical computation on them using UDFs leveraging python 
> libraries like statsmodels, numpy, tensorflow, and scikit-learn.
> I propose enriching the {{pyspark.ml.linalg}} types in the following ways:
> # Expand the set of column operations one can apply to tensor columns beyond 
> the few functions currently available on these types.  Ideally, the API 
> should aim to be as wide as the numpy ndarray API, but would wrap Breeze 
> operations.  For example, we should provide {{DenseVector.outerProduct()}} so 
> that a user could write something like {{df.withColumn("XtX", 
> df["X"].outerProduct(df["X"]))}}.
> # Make sure all ser/de mechanisms (including Arrow) understand these types, 
> and faithfully represent them as natural types in all languages (in scala and 
> java, Breeze objects, in python, numpy ndarrays rather than the 
> pyspark.ml.linalg types that wrap them, in SparkR, I'm not sure what, but 
> something natural) when applying UDFs or collecting with {{toPandas()}}.
> # Improve the construction of these types from scalar columns.  The 
> {{VectorAssembler}} API is not very ergonomic.  I propose something like 
> {{df.withColumn("predictors", Vector.of(df["feature1"], df["feature2"], 
> df["feature3"]))}}.
> h1. Target Personas:
> Data scientists, machine learning practitioners, machine learning library 
> developers.
> h1. Goals:
> This would allow users to do more statistical computation in Spark natively, 
> and would allow users to apply python statistical computation to data in 
> Spark using UDFs.
> h1. Non-Goals:
> I suppose one non-goal is to reimplement something like statsmodels using 
> Breeze data structures and computation.  That could be seen as an effort to 
> enrich Spark ML itself, but is out of scope of this effort.  This effort is 
> just to make it possible and easy to apply existing python libraries to 
> tensor values in parallel.
> h1. Proposed API Changes:
> Add the above APIs to PySpark and the other language bindings.  I think the 
> list is:
> # {{pyspark.ml.linalg.Vector.of(*columns)}}
> # {{pyspark.ml.linalg.Matrix.of(<not sure how to provide this>)}}
> # For each of the matrix and vector types in {{pyspark.ml.linalg}}, add more 
> methods like {{outerProduct}}, {{matmul}}, {{kron}}, etc.  
> https://docs.scipy.org/doc/numpy-1.14.0/reference/routines.linalg.html has a 
> good list to look at.
> Also, change python UDFs so that these tensor types are passed to the python 
> function not as \{Sparse,Dense\}\{Matrix,Vector\} objects that wrap 
> {{numpy.ndarray}}, but as {{numpy.ndarray}} objects by themselves, and 
> interpret return values that are {{numpy.ndarray}} objects back into the 
> spark types.





[jira] [Resolved] (SPARK-24506) Spark.ui.filters not applied to /sqlserver/ url

2018-06-12 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24506.

   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0
   2.3.2
   2.2.2

> Spark.ui.filters not applied to /sqlserver/ url
> ---
>
> Key: SPARK-24506
> URL: https://issues.apache.org/jira/browse/SPARK-24506
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: t oo
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.2.2, 2.3.2, 2.4.0
>
>
> With spark.ui.filters applied, the web UIs for 
> master/history/worker/storage/executors/stages etc. prompt for HTTP auth, but the 
> /sqlserver/ tab is not prompting for HTTP auth.






[jira] [Updated] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError

2018-06-12 Thread Alexander Behm (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm updated SPARK-24536:
---
Description: 
SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT)

fails in the QueryPlanner with:
{code}
java.lang.AssertionError: assertion failed: No plan for GlobalLimit null
{code}

I think this issue should be caught earlier during semantic analysis.


  was:
SELECT COUNT(*) FROM t LIMIT CAST(NULL AS INT)

fails in the QueryPlanner with:
{code}
java.lang.AssertionError: assertion failed: No plan for GlobalLimit null
{code}

I think this issue should be caught earlier during semantic analysis.



> Query with nonsensical LIMIT hits AssertionError
> 
>
> Key: SPARK-24536
> URL: https://issues.apache.org/jira/browse/SPARK-24536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Behm
>Priority: Trivial
>  Labels: beginner
>
> SELECT COUNT(1) FROM t LIMIT CAST(NULL AS INT)
> fails in the QueryPlanner with:
> {code}
> java.lang.AssertionError: assertion failed: No plan for GlobalLimit null
> {code}
> I think this issue should be caught earlier during semantic analysis.






[jira] [Updated] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError

2018-06-12 Thread Alexander Behm (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Behm updated SPARK-24536:
---
Labels: beginner  (was: )

> Query with nonsensical LIMIT hits AssertionError
> 
>
> Key: SPARK-24536
> URL: https://issues.apache.org/jira/browse/SPARK-24536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alexander Behm
>Priority: Trivial
>  Labels: beginner
>
> SELECT COUNT(*) FROM t LIMIT CAST(NULL AS INT)
> fails in the QueryPlanner with:
> {code}
> java.lang.AssertionError: assertion failed: No plan for GlobalLimit null
> {code}
> I think this issue should be caught earlier during semantic analysis.






[jira] [Created] (SPARK-24536) Query with nonsensical LIMIT hits AssertionError

2018-06-12 Thread Alexander Behm (JIRA)
Alexander Behm created SPARK-24536:
--

 Summary: Query with nonsensical LIMIT hits AssertionError
 Key: SPARK-24536
 URL: https://issues.apache.org/jira/browse/SPARK-24536
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Alexander Behm


SELECT COUNT(*) FROM t LIMIT CAST(NULL AS INT)

fails in the QueryPlanner with:
{code}
java.lang.AssertionError: assertion failed: No plan for GlobalLimit null
{code}

I think this issue should be caught earlier during semantic analysis.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24535) Fix java version parsing in SparkR

2018-06-12 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-24535:
-

 Summary: Fix java version parsing in SparkR
 Key: SPARK-24535
 URL: https://issues.apache.org/jira/browse/SPARK-24535
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.3.1, 2.4.0
Reporter: Shivaram Venkataraman


We see errors on CRAN of the form 
{code:java}
  java version "1.8.0_144"
  Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
  Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
  Picked up _JAVA_OPTIONS: -XX:-UsePerfData 
  -- 1. Error: create DataFrame from list or data.frame (@test_basic.R#21)  
--
  subscript out of bounds
  1: sparkR.session(master = sparkRTestMaster, enableHiveSupport = FALSE, 
sparkConfig = sparkRTestConfig) at 
D:/temp/RtmpIJ8Cc3/RLIBS_3242c713c3181/SparkR/tests/testthat/test_basic.R:21
  2: sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap, 
sparkExecutorEnvMap, 
 sparkJars, sparkPackages)
  3: checkJavaVersion()
  4: strsplit(javaVersionFilter[[1]], "[\"]")
{code}

The complete log file is at 
http://home.apache.org/~shivaram/SparkR_2.3.1_check_results/Windows/00check.log
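
For illustration only, a defensive version of the parsing step, sketched in Scala 
(the real fix lands in SparkR's R code): it returns None instead of failing when 
the expected quoted version string is absent.

{code:scala}
// Parse the version out of `java -version` output without assuming the quoted
// version string is always present (the cause of the "subscript out of bounds").
def parseJavaVersion(versionOutput: Seq[String]): Option[String] =
  versionOutput
    .find(_.contains(" version "))                     // e.g. java version "1.8.0_144"
    .flatMap(line => "\"([^\"]+)\"".r.findFirstMatchIn(line))
    .map(_.group(1))                                   // => 1.8.0_144

// Lines like "Picked up _JAVA_OPTIONS: ..." simply yield None instead of an error.
{code}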



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23030:


Assignee: (was: Apache Spark)

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.
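
The memory trade-off can be illustrated with a dependency-free sketch; this is 
conceptual only, not the actual Arrow/PySpark code path.

{code:scala}
// Collecting everything first keeps all batches resident at once.
def collectThenServe[T](batches: Iterator[Seq[T]])(send: Seq[T] => Unit): Unit = {
  val all = batches.toVector   // peak memory ~ total size of all batches
  all.foreach(send)
}

// Streaming forwards one batch at a time, so only one batch is resident.
def streamBatches[T](batches: Iterator[Seq[T]])(send: Seq[T] => Unit): Unit =
  batches.foreach(send)        // peak memory ~ size of a single batch
{code}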



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510235#comment-16510235
 ] 

Apache Spark commented on SPARK-23030:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/21546

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23030:


Assignee: Apache Spark

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-12 Thread Ricardo Martinelli de Oliveira (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricardo Martinelli de Oliveira updated SPARK-24534:
---
Description: 
As an improvement in the entrypoint.sh script, I'd like to propose that the 
Spark entrypoint do a passthrough if driver/executor/init is not the command 
passed. Currently it raises an error.

To be more specific, I'm talking about these lines:

[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]

This allows the openshift-spark image to continue to function as a Spark 
Standalone component, with custom configuration support etc. without 
compromising the previous method to configure the cluster inside a kubernetes 
environment.

  was:
As an improvement in the entrypoint.sh script, I'd like to propose that the 
Spark entrypoint do a passthrough if driver/executor/init is not the command 
passed. Currently it raises an error.

To be more specific, I'm talking about these lines:

[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]

This allows the openshift-spark image to continue to function as a Spark 
Standalone component, with custom configuration support etc.


> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement in the entrypoint.sh script, I'd like to propose that the 
> Spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc. without 
> compromising the previous method to configure the cluster inside a kubernetes 
> environment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-12 Thread Ricardo Martinelli de Oliveira (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ricardo Martinelli de Oliveira updated SPARK-24534:
---
Description: 
As an improvement in the entrypoint.sh script, I'd like to propose that the 
Spark entrypoint do a passthrough if driver/executor/init is not the command 
passed. Currently it raises an error.

To be more specific, I'm talking about these lines:

[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]

This allows the openshift-spark image to continue to function as a Spark 
Standalone component, with custom configuration support etc.

  was:
As an improvement in the entrypoint.sh script, I'd like to propose that the 
Spark entrypoint do a passthrough if driver/executor/init is not the command 
passed. Currently it raises an error.

To be more specific, I'm talking about these lines:

 

[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]

 

This allows the openshift-spark image to continue to function as a Spark 
Standalone component, with custom configuration support etc, but double as an 
OpenShift spark-on-k8s image.


> Add a way to bypass entrypoint.sh script if no spark cmd is passed
> --
>
> Key: SPARK-24534
> URL: https://issues.apache.org/jira/browse/SPARK-24534
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ricardo Martinelli de Oliveira
>Priority: Minor
>
> As an improvement in the entrypoint.sh script, I'd like to propose that the 
> Spark entrypoint do a passthrough if driver/executor/init is not the command 
> passed. Currently it raises an error.
> To be more specific, I'm talking about these lines:
> [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]
> This allows the openshift-spark image to continue to function as a Spark 
> Standalone component, with custom configuration support etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24534) Add a way to bypass entrypoint.sh script if no spark cmd is passed

2018-06-12 Thread Ricardo Martinelli de Oliveira (JIRA)
Ricardo Martinelli de Oliveira created SPARK-24534:
--

 Summary: Add a way to bypass entrypoint.sh script if no spark cmd 
is passed
 Key: SPARK-24534
 URL: https://issues.apache.org/jira/browse/SPARK-24534
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.0
Reporter: Ricardo Martinelli de Oliveira


As an improvement in the entrypoint.sh script, I'd like to propose that the 
Spark entrypoint do a passthrough if driver/executor/init is not the command 
passed. Currently it raises an error.

To be more specific, I'm talking about these lines:

 

[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L113-L114]

 

This allows the openshift-spark image to continue to function as a Spark 
Standalone component, with custom configuration support etc, but double as an 
OpenShift spark-on-k8s image.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23010) Add integration testing for Kubernetes backend into the apache/spark repository

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510095#comment-16510095
 ] 

Apache Spark commented on SPARK-23010:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/21545

> Add integration testing for Kubernetes backend into the apache/spark 
> repository
> ---
>
> Key: SPARK-23010
> URL: https://issues.apache.org/jira/browse/SPARK-23010
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Anirudh Ramanathan
>Priority: Major
> Fix For: 2.4.0
>
>
> Add tests for the scheduler backend into apache/spark
> /xref: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23931) High-order function: array_zip(array1, array2[, ...]) → array

2018-06-12 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-23931:

Description: 
Ref: https://prestodb.io/docs/current/functions/array.html

Merges the given arrays, element-wise, into a single array of rows. The M-th 
element of the N-th argument will be the N-th field of the M-th output element. 
If the arguments have an uneven length, missing values are filled with NULL.
{noformat}
SELECT array_zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), 
ROW(2, null), ROW(null, '3b')]
{noformat}

Note: while Presto's function name is {{zip}}, we used {{array_zip}} as the 
name. For details, please check the discussion in the PR.


  was:
Ref: https://prestodb.io/docs/current/functions/array.html

Merges the given arrays, element-wise, into a single array of rows. The M-th 
element of the N-th argument will be the N-th field of the M-th output element. 
If the arguments have an uneven length, missing values are filled with NULL.
{noformat}
SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
null), ROW(null, '3b')]
{noformat}




> High-order function: array_zip(array1, array2[, ...]) → array
> --
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dylan Guedes
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT array_zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), 
> ROW(2, null), ROW(null, '3b')]
> {noformat}
> Note: while Presto's function name is {{zip}}, we used {{array_zip}} as the 
> name. For details, please check the discussion in the PR.
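
The element-wise semantics can be illustrated in plain Scala (not the Spark SQL 
implementation); missing positions come back as None, mirroring the NULL padding 
described above.

{code:scala}
// zipAll pads the shorter side; Option(null) is None, matching SQL NULL handling.
def arrayZip[A, B](a: Seq[A], b: Seq[B]): Seq[(Option[A], Option[B])] =
  a.map(Option(_)).zipAll(b.map(Option(_)), None, None)

// arrayZip(Seq(1, 2), Seq("1b", null, "3b"))
//   == Seq((Some(1), Some("1b")), (Some(2), None), (None, Some("3b")))
{code}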



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23931) High-order function: array_zip(array1, array2[, ...]) → array

2018-06-12 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510079#comment-16510079
 ] 

Marco Gaido commented on SPARK-23931:
-

I just edited the title/description in order to update to the new name we used 
for the function. Please revert my changes if you consider them inappropriate.

> High-order function: array_zip(array1, array2[, ...]) → array
> --
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dylan Guedes
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT array_zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), 
> ROW(2, null), ROW(null, '3b')]
> {noformat}
> Note: while Presto's function name is {{zip}}, we used {{array_zip}} as the 
> name. For details, please check the discussion in the PR.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23931) High-order function: array_zip(array1, array2[, ...]) → array

2018-06-12 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-23931:

Summary: High-order function: array_zip(array1, array2[, ...]) → array 
 (was: High-order function: zip(array1, array2[, ...]) → array)

> High-order function: array_zip(array1, array2[, ...]) → array
> --
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dylan Guedes
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23933) High-order function: map(array, array) → map

2018-06-12 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23933.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21258
https://github.com/apache/spark/pull/21258

> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}
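
A plain Scala analogue of those semantics (not the Spark SQL implementation):

{code:scala}
// Pair the key array with the value array positionally and build a map.
val keys   = Seq(1, 3)
val values = Seq(2, 4)
val result = keys.zip(values).toMap   // Map(1 -> 2, 3 -> 4)
{code}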



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510072#comment-16510072
 ] 

Apache Spark commented on SPARK-24531:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21543

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Blocker
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23933) High-order function: map(array, array) → map

2018-06-12 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23933:
-

Assignee: Kazuaki Ishizaki

> High-order function: map(array, array) → map
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23435) R tests should support latest testthat

2018-06-12 Thread Weiqiang Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16510065#comment-16510065
 ] 

Weiqiang Zhuang commented on SPARK-23435:
-

I hit this since I just reinstalled R and a bunch of packages. It took me some 
time to locate this issue here. [~felixcheung], are you still working on this? 
If not, can I work on it? I can pull together the changes I have made and raise 
a PR.

> R tests should support latest testthat
> --
>
> Key: SPARK-23435
> URL: https://issues.apache.org/jira/browse/SPARK-23435
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Major
>
> To follow up on SPARK-22817, the latest version of testthat, 2.0.0 was 
> released in Dec 2017, and its method has been changed.
> In order for our tests to keep working, we need to detect that and call a 
> different method.
> Jenkins is running 1.0.1 though, so we need to check whether it is going to 
> work.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

2018-06-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24216:
---

Assignee: Fangshi Li

> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
> ---
>
> Key: SPARK-24216
> URL: https://issues.apache.org/jira/browse/SPARK-24216
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Fangshi Li
>Assignee: Fangshi Li
>Priority: Major
> Fix For: 2.4.0
>
>
> When a user creates an aggregator object in Scala and passes it to Spark 
> Dataset's agg() method, Spark will initialize TypedAggregateExpression with 
> the nodeName field as aggregator.getClass.getSimpleName. However, 
> getSimpleName is not safe in a Scala environment, depending on how the user 
> creates the aggregator object. For example, if the aggregator class's fully 
> qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw 
> java.lang.InternalError "Malformed class name". This has been reported in 
> scalatest 
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] 
> and discussed in several Scala upstream JIRAs such as SI-8110 and SI-5425.
> To fix this issue, we follow the solution in 
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] 
> to add a safer version of getSimpleName as a util method, and 
> TypedAggregateExpression will invoke this util method rather than 
> getClass.getSimpleName.
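
A rough sketch of the kind of fallback described above, not the exact util 
method added to Spark:

{code:scala}
// getSimpleName can throw java.lang.InternalError("Malformed class name") for
// some Scala-generated classes; fall back to trimming getName instead.
def safeSimpleName(cls: Class[_]): String =
  try cls.getSimpleName
  catch {
    case _: InternalError =>
      cls.getName.split('.').last.stripSuffix("$")
  }
{code}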



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24216) Spark TypedAggregateExpression uses getSimpleName that is not safe in scala

2018-06-12 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24216.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21276
[https://github.com/apache/spark/pull/21276]

> Spark TypedAggregateExpression uses getSimpleName that is not safe in scala
> ---
>
> Key: SPARK-24216
> URL: https://issues.apache.org/jira/browse/SPARK-24216
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Fangshi Li
>Assignee: Fangshi Li
>Priority: Major
> Fix For: 2.4.0
>
>
> When a user creates an aggregator object in Scala and passes it to Spark 
> Dataset's agg() method, Spark will initialize TypedAggregateExpression with 
> the nodeName field as aggregator.getClass.getSimpleName. However, 
> getSimpleName is not safe in a Scala environment, depending on how the user 
> creates the aggregator object. For example, if the aggregator class's fully 
> qualified name is "com.my.company.MyUtils$myAgg$2$", getSimpleName will throw 
> java.lang.InternalError "Malformed class name". This has been reported in 
> scalatest 
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] 
> and discussed in several Scala upstream JIRAs such as SI-8110 and SI-5425.
> To fix this issue, we follow the solution in 
> [scalatest/scalatest#1044|https://github.com/scalatest/scalatest/pull/1044] 
> to add a safer version of getSimpleName as a util method, and 
> TypedAggregateExpression will invoke this util method rather than 
> getClass.getSimpleName.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-06-12 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-23931.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21045
https://github.com/apache/spark/pull/21045

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dylan Guedes
>Priority: Major
> Fix For: 2.4.0
>
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23931) High-order function: zip(array1, array2[, ...]) → array

2018-06-12 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin reassigned SPARK-23931:
-

Assignee: Dylan Guedes

> High-order function: zip(array1, array2[, ...]) → array
> 
>
> Key: SPARK-23931
> URL: https://issues.apache.org/jira/browse/SPARK-23931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Dylan Guedes
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Merges the given arrays, element-wise, into a single array of rows. The M-th 
> element of the N-th argument will be the N-th field of the M-th output 
> element. If the arguments have an uneven length, missing values are filled 
> with NULL.
> {noformat}
> SELECT zip(ARRAY[1, 2], ARRAY['1b', null, '3b']); -- [ROW(1, '1b'), ROW(2, 
> null), ROW(null, '3b')]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24416) Update configuration definition for spark.blacklist.killBlacklistedExecutors

2018-06-12 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-24416:


Assignee: Sanket Reddy

> Update configuration definition for spark.blacklist.killBlacklistedExecutors
> 
>
> Key: SPARK-24416
> URL: https://issues.apache.org/jira/browse/SPARK-24416
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
> Fix For: 2.4.0
>
>
> spark.blacklist.killBlacklistedExecutors is defined as 
> (Experimental) If set to "true", allow Spark to automatically kill, and 
> attempt to re-create, executors when they are blacklisted. Note that, when an 
> entire node is added to the blacklist, all of the executors on that node will 
> be killed.
> I presume the killing of blacklisted executors only happens after the stage 
> completes successfully and all tasks have completed or on fetch failures 
> (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is 
> confusing because the definition states that an attempt will be made to 
> re-create the executor as soon as it is blacklisted. This is not true: while 
> the stage is in progress and an executor is blacklisted, Spark will not 
> attempt to clean it up until the stage finishes.
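
For reference, the setting under discussion is enabled as sketched below; this 
is a minimal illustration, not a recommendation.

{code:scala}
import org.apache.spark.SparkConf

// Blacklisting itself must be on for the kill behaviour to apply.
val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  .set("spark.blacklist.killBlacklistedExecutors", "true")
{code}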



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24416) Update configuration definition for spark.blacklist.killBlacklistedExecutors

2018-06-12 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-24416.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21475
[https://github.com/apache/spark/pull/21475]

> Update configuration definition for spark.blacklist.killBlacklistedExecutors
> 
>
> Key: SPARK-24416
> URL: https://issues.apache.org/jira/browse/SPARK-24416
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
> Fix For: 2.4.0
>
>
> spark.blacklist.killBlacklistedExecutors is defined as 
> (Experimental) If set to "true", allow Spark to automatically kill, and 
> attempt to re-create, executors when they are blacklisted. Note that, when an 
> entire node is added to the blacklist, all of the executors on that node will 
> be killed.
> I presume the killing of blacklisted executors only happens after the stage 
> completes successfully and all tasks have completed or on fetch failures 
> (updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). It is 
> confusing because the definition states that an attempt will be made to 
> re-create the executor as soon as it is blacklisted. This is not true: while 
> the stage is in progress and an executor is blacklisted, Spark will not 
> attempt to clean it up until the stage finishes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509949#comment-16509949
 ] 

Apache Spark commented on SPARK-24529:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/21542

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool, 
> [spotbugs|https://spotbugs.github.io/], to avoid possible integer overflow at 
> multiplication. Due to a limitation of the tool, some other checks will be 
> enabled as well.
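
The target bug class looks like this; an assumed illustration, not code taken 
from the Spark tree.

{code:scala}
val bytesPerRecord = 4096
val numRecords     = 1 << 20

// The multiplication happens in Int and overflows before the widening to Long.
val wrong: Long = bytesPerRecord * numRecords          // 0, not 4294967296
// Widening one operand first keeps the arithmetic in Long.
val right: Long = bytesPerRecord.toLong * numRecords   // 4294967296
{code}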



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24529:


Assignee: (was: Apache Spark)

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool, 
> [spotbugs|https://spotbugs.github.io/], to avoid possible integer overflow at 
> multiplication. Due to a limitation of the tool, some other checks will be 
> enabled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24529) Add spotbugs into maven build process

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24529:


Assignee: Apache Spark

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> We will enable a Java bytecode check tool, 
> [spotbugs|https://spotbugs.github.io/], to avoid possible integer overflow at 
> multiplication. Due to a limitation of the tool, some other checks will be 
> enabled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20168:


Assignee: Apache Spark  (was: Yash Sharma)

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Apache Spark
>Priority: Major
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp while creating a 
> stream. We should have an option to pass a timestamp in the config to allow 
> Kinesis to resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20168:


Assignee: Yash Sharma  (was: Apache Spark)

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>Priority: Major
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp while creating a 
> stream. We should have an option to pass a timestamp in the config to allow 
> Kinesis to resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2018-06-12 Thread Yash Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yash Sharma reopened SPARK-20168:
-

The last patch in the Kinesis streaming receiver sets the timestamp for the 
AT_TIMESTAMP mode, but this mode can only be set via
{{baseClientLibConfiguration.withTimestampAtInitialPositionInStream()}}
and can't be set directly using
{{.withInitialPositionInStream()}}.

This patch fixes the issue.
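
A sketch of that difference, using the KCL method names mentioned above (the 
KCL 1.x API and import path are assumptions):

{code:scala}
import java.util.Date
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

// AT_TIMESTAMP has to be requested through this helper, which also records the
// timestamp; withInitialPositionInStream(...) alone cannot carry a timestamp.
def startAtTimestamp(base: KinesisClientLibConfiguration,
                     ts: Date): KinesisClientLibConfiguration =
  base.withTimestampAtInitialPositionInStream(ts)
{code}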

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>Priority: Major
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp while creating a 
> stream. We should have an option to pass a timestamp in the config to allow 
> Kinesis to resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509933#comment-16509933
 ] 

Apache Spark commented on SPARK-20168:
--

User 'yashs360' has created a pull request for this issue:
https://github.com/apache/spark/pull/21541

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>Priority: Major
>  Labels: kinesis, streaming
> Fix For: 2.3.0
>
>
> The Kinesis client can resume from a specified timestamp while creating a 
> stream. We should have an option to pass a timestamp in the config to allow 
> Kinesis to resume from the given timestamp.
> I have started initial work and will post a PR after I test the patch:
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24533) typesafe has rebranded to lightbend. change the build/mvn endpoint from downloads.typesafe.com to downloads.lightbend.com

2018-06-12 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509850#comment-16509850
 ] 

Sanket Reddy commented on SPARK-24533:
--

I will put up a PR shortly, thanks.

> typesafe has rebranded to lightbend. change the build/mvn endpoint from 
> downloads.typesafe.com to downloads.lightbend.com
> -
>
> Key: SPARK-24533
> URL: https://issues.apache.org/jira/browse/SPARK-24533
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Sanket Reddy
>Priority: Trivial
>
> Typesafe has rebranded to Lightbend. Change the build/mvn endpoint from 
> downloads.typesafe.com to downloads.lightbend.com. Redirection works for 
> now, but it is nice to update the endpoint to stay up to date.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24533) typesafe has rebranded to lightbend. change the build/mvn endpoint from downloads.typesafe.com to downloads.lightbend.com

2018-06-12 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-24533:


 Summary: typesafe has rebranded to lightbend. change the build/mvn 
endpoint from downloads.typesafe.com to downloads.lightbend.com
 Key: SPARK-24533
 URL: https://issues.apache.org/jira/browse/SPARK-24533
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Sanket Reddy


Typesafe has rebranded to Lightbend. Change the build/mvn endpoint from 
downloads.typesafe.com to downloads.lightbend.com. Redirection works for now, 
but it is nice to update the endpoint to stay up to date.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15064) Locale support in StopWordsRemover

2018-06-12 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509799#comment-16509799
 ] 

Lee Dongjin commented on SPARK-15064:
-

[~mengxr] Hello. Please assign this issue to me.

> Locale support in StopWordsRemover
> --
>
> Key: SPARK-15064
> URL: https://issues.apache.org/jira/browse/SPARK-15064
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> We support case insensitive filtering (default) in StopWordsRemover. However, 
> case insensitive matching depends on the locale and region, which cannot be 
> explicitly set in StopWordsRemover. We should consider adding this support in 
> MLlib.
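
A small JVM example of why the locale matters for case-insensitive matching 
(standard library behaviour, shown outside of StopWordsRemover):

{code:scala}
import java.util.Locale

val default = "I".toLowerCase(Locale.ROOT)      // "i"
val turkish = "I".toLowerCase(new Locale("tr")) // "ı" (dotless i), so "I" no longer matches "i"
{code}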



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-12 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24531:
---
Target Version/s: 2.2.2, 2.3.2

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Blocker
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-12 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-24531:
---
Priority: Blocker  (was: Major)

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Blocker
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24532) HiveExternalCatalogVersionSuite should be resilient to missing versions

2018-06-12 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-24532:
--

 Summary: HiveExternalCatalogVersionSuite should be resilient to 
missing versions
 Key: SPARK-24532
 URL: https://issues.apache.org/jira/browse/SPARK-24532
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 2.3.1, 2.4.0
Reporter: Marcelo Vanzin


See SPARK-24531.

As part of our release process we clean up older releases from the mirror 
network. That causes this test to start failing.

The test should be more resilient to this: either ignore releases that are not 
found, or fall back to the ASF archive.
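
A rough sketch of that fallback; the URLs are illustrative assumptions, not the 
suite's actual constants.

{code:scala}
// Prefer the mirror network, then fall back to the permanent ASF archive;
// `exists` stands in for whatever reachability check the suite performs.
def resolveSparkDist(version: String, exists: String => Boolean): Option[String] = {
  val candidates = Seq(
    s"https://www.apache.org/dyn/closer.lua/spark/spark-$version",
    s"https://archive.apache.org/dist/spark/spark-$version"
  )
  candidates.find(exists)
}
{code}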



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-24112) Add `spark.sql.hive.convertMetastoreTableProperty` for backward compatibility

2018-06-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-24112.
-

> Add `spark.sql.hive.convertMetastoreTableProperty` for backward compatibility
> --
>
> Key: SPARK-24112
> URL: https://issues.apache.org/jira/browse/SPARK-24112
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims not to surprise previous Parquet Hive table users with 
> behavior changes. They had Hive Parquet tables, and since Spark 2.0 all of 
> them have been converted by default without table properties.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24531:


Assignee: (was: Apache Spark)

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509748#comment-16509748
 ] 

Apache Spark commented on SPARK-24531:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21540

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Priority: Major
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24531:


Assignee: Apache Spark

> HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version
> -
>
> Key: SPARK-24531
> URL: https://issues.apache.org/jira/browse/SPARK-24531
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Major
>
> We have many build failures caused by HiveExternalCatalogVersionsSuite 
> failing because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24530:
--
Description: 
I generated Python docs from master locally using `make html`. However, the 
generated HTML doc doesn't render class docs correctly. I attached screenshots 
from the Spark 2.3 docs and from master docs generated on my local machine. Not 
sure if this is because of my local setup.

cc: [~dongjoon] Could you help verify?


  was:
I generated Python docs from master locally using `make html`. However, the 
generated HTML doc doesn't render class docs correctly. I attached screenshots 
from the Spark 2.3 docs and from master docs generated on my local machine. Not 
sure if this is because of my local setup.




> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from master docs generated on my local machine. 
> Not sure if this is because of my local setup.
> cc: [~dongjoon] Could you help verify?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24530:
--
Priority: Blocker  (was: Major)

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Blocker
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from master docs generated on my local machine. 
> Not sure if this is because of my local setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24531) HiveExternalCatalogVersionsSuite failing due to missing 2.2.0 version

2018-06-12 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24531:
---

 Summary: HiveExternalCatalogVersionsSuite failing due to missing 
2.2.0 version
 Key: SPARK-24531
 URL: https://issues.apache.org/jira/browse/SPARK-24531
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Marco Gaido


We have many build failures caused by HiveExternalCatalogVersionsSuite failing 
because Spark 2.2.0 is not present anymore in the mirrors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24530:
--
Attachment: Screen Shot 2018-06-12 at 8.23.18 AM.png

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png
>
>
> I generated Python docs from master locally using `make html`. However, the 
> generated HTML doc doesn't render class docs correctly. I attached screenshots 
> from the Spark 2.3 docs and from master docs generated on my local machine. 
> Not sure if this is because of my local setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-24530:
--
Attachment: Screen Shot 2018-06-12 at 8.23.29 AM.png

> pyspark.ml doesn't generate class docs correctly
> 
>
> Key: SPARK-24530
> URL: https://issues.apache.org/jira/browse/SPARK-24530
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Priority: Major
> Attachments: Screen Shot 2018-06-12 at 8.23.18 AM.png, Screen Shot 
> 2018-06-12 at 8.23.29 AM.png
>
>
> I generated the Python docs from master locally using `make html`. However, 
> the generated HTML doc doesn't render class docs correctly. I attached 
> screenshots from the Spark 2.3 docs and from the master docs generated on my 
> local machine. Not sure if this is because of my local setup.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24530) pyspark.ml doesn't generate class docs correctly

2018-06-12 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-24530:
-

 Summary: pyspark.ml doesn't generate class docs correctly
 Key: SPARK-24530
 URL: https://issues.apache.org/jira/browse/SPARK-24530
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Xiangrui Meng


I generated the Python docs from master locally using `make html`. However, the 
generated HTML doc doesn't render class docs correctly. I attached screenshots 
from the Spark 2.3 docs and from the master docs generated on my local machine. 
Not sure if this is because of my local setup.





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15064) Locale support in StopWordsRemover

2018-06-12 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15064.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21501
[https://github.com/apache/spark/pull/21501]

> Locale support in StopWordsRemover
> --
>
> Key: SPARK-15064
> URL: https://issues.apache.org/jira/browse/SPARK-15064
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> We support case-insensitive filtering (the default) in StopWordsRemover. 
> However, case-insensitive matching depends on the locale and region, which 
> cannot be explicitly set in StopWordsRemover. We should consider adding this 
> support in MLlib.
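
For reference, a minimal usage sketch of the new behaviour; the setLocale 
setter name and the "tr_TR" locale identifier are assumptions about the API 
added here, not confirmed by this ticket:
{code:scala}
import org.apache.spark.ml.feature.StopWordsRemover

val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setCaseSensitive(false)   // default: locale-dependent lowercasing applies
  .setLocale("tr_TR")        // assumed setter; Turkish lowercasing of "I" differs from the root locale

val df = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon"))
)).toDF("id", "raw")

remover.transform(df).show(false)
{code}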



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming

2018-06-12 Thread bharath kumar avusherla (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509708#comment-16509708
 ] 

bharath kumar avusherla edited comment on SPARK-24476 at 6/12/18 2:48 PM:
--

[~ste...@apache.org] Our initial try was to enable speculative execution; we 
reran the job and it is working fine. Then we replaced s3n with s3a and reran 
the job; it is working fine too. Right now we are running two jobs 
simultaneously with these two different settings, and nothing has failed. Do 
you recommend enabling speculative execution even after changing from s3n to 
s3a?
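
For context, a minimal sketch of the two settings discussed above (bucket, 
topic, and broker names are illustrative, not from this issue):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-streaming-s3a-checkpoint")
  .config("spark.speculation", "true")                 // speculative execution, as tried above
  .getOrCreate()

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")    // hypothetical broker
  .option("subscribe", "events")                       // hypothetical topic
  .load()

events.writeStream
  .format("parquet")
  .option("path", "s3a://my-bucket/output/")                      // hypothetical bucket
  .option("checkpointLocation", "s3a://my-bucket/checkpoints/")   // s3a:// instead of s3n://
  .start()
{code}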


was (Author: abharath9):
Our initial try was to enable speculative execution; we reran the job and it is 
working fine. Then we replaced s3n with s3a and reran the job; it is working 
fine too. Right now we are running two jobs simultaneously with these two 
different settings, and nothing has failed. Do you recommend enabling 
speculative execution even after changing from s3n to s3a?

> java.net.SocketTimeoutException: Read timed out under jets3t while running 
> the Spark Structured Streaming
> -
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Minor
> Attachments: socket-timeout-exception
>
>
> We are working on a Spark streaming application using Spark Structured 
> Streaming with checkpointing in S3. When we start the application, it runs 
> just fine for some time, then it crashes with the error mentioned below. The 
> amount of time it runs successfully varies: sometimes it runs for 2 days 
> without any issues and then crashes, sometimes it crashes after 4 or 24 hours.
> Our streaming application joins (left and inner) multiple sources from Kafka 
> as well as S3 and an Aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the socket timeout?
> Here I'm pasting a few lines of the complete exception log below. The complete 
> exception is also attached to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24476) java.net.SocketTimeoutException: Read timed out under jets3t while running the Spark Structured Streaming

2018-06-12 Thread bharath kumar avusherla (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509708#comment-16509708
 ] 

bharath kumar avusherla commented on SPARK-24476:
-

Our initial try was to enable speculative execution; we reran the job and it is 
working fine. Then we replaced s3n with s3a and reran the job; it is working 
fine too. Right now we are running two jobs simultaneously with these two 
different settings, and nothing has failed. Do you recommend enabling 
speculative execution even after changing from s3n to s3a?

> java.net.SocketTimeoutException: Read timed out under jets3t while running 
> the Spark Structured Streaming
> -
>
> Key: SPARK-24476
> URL: https://issues.apache.org/jira/browse/SPARK-24476
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Minor
> Attachments: socket-timeout-exception
>
>
> We are working on a Spark streaming application using Spark Structured 
> Streaming with checkpointing in S3. When we start the application, it runs 
> just fine for some time, then it crashes with the error mentioned below. The 
> amount of time it runs successfully varies: sometimes it runs for 2 days 
> without any issues and then crashes, sometimes it crashes after 4 or 24 hours.
> Our streaming application joins (left and inner) multiple sources from Kafka 
> as well as S3 and an Aurora database.
> Can you please let us know how to solve this problem?
> Is it possible to somehow tweak the socket timeout?
> Here I'm pasting a few lines of the complete exception log below. The complete 
> exception is also attached to the issue.
> *_Exception:_*
> *_Caused by: java.net.SocketTimeoutException: Read timed out_*
>         _at java.net.SocketInputStream.socketRead0(Native Method)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:150)_
>         _at java.net.SocketInputStream.read(SocketInputStream.java:121)_
>         _at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)_
>         _at sun.security.ssl.InputRecord.read(InputRecord.java:503)_
>         _at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:954)_
>         _at 
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1343)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1371)_
>         _at 
> sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1355)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:553)_
>         _at 
> org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:412)_
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-06-12 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509667#comment-16509667
 ] 

Maxim Gekk commented on SPARK-24005:


[~smilegator] I am trying to reproduce the issue, but so far I have had no luck. 
The following test passes successfully:
{code:scala}
  test("canceling of parallel collections") {
    val conf = new SparkConf()
    sc = new SparkContext("local[1]", "par col", conf)

    val f = sc.parallelize(0 to 1, 1).map { i =>
      val par = (1 to 100).par
      val pool = ThreadUtils.newForkJoinPool("test pool", 2)
      par.tasksupport = new ForkJoinTaskSupport(pool)
      try {
        par.flatMap { j =>
          Thread.sleep(1000)
          1 to 100
        }.seq
      } finally {
        pool.shutdown()
      }
    }.takeAsync(100)

    val sem = new Semaphore(0)
    sc.addSparkListener(new SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart) {
        sem.release()
      }
    })

    // Wait until some tasks were launched before we cancel the job.
    sem.acquire()
    // Wait until a task executes parallel collection.
    Thread.sleep(1)
    f.cancel()

    val e = intercept[SparkException] { f.get() }.getCause
    assert(e.getMessage.contains("cancelled") || e.getMessage.contains("killed"))
  }
{code}

> Remove usage of Scala’s parallel collection
> ---
>
> Key: SPARK-24005
> URL: https://issues.apache.org/jira/browse/SPARK-24005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> {noformat}
> val par = (1 to 100).par.flatMap { i =>
>   Thread.sleep(1000)
>   1 to 1000
> }.toSeq
> {noformat}
> We are unable to interrupt the execution of parallel collections. We need to 
> create a common utility function to do this instead of using Scala parallel 
> collections directly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-12 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-24481.
-
Resolution: Not A Problem

I am resolving this as no further action can be taken IMHO (other than waiting 
for SPARK-22600 to be fixed).

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors. It happens with a large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://",
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
>  
> Exception:
> {code:java}
> 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520)
>   at 
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1392)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
>   at 
> 

[jira] [Comment Edited] (SPARK-24481) GeneratedIteratorForCodegenStage1 grows beyond 64 KB

2018-06-12 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509662#comment-16509662
 ] 

Marco Gaido edited comment on SPARK-24481 at 6/12/18 2:03 PM:
--

I am resolving this as no further action can be taken IMHO (other than waiting 
for SPARK-22600 to be fixed). Please reopen if needed. Thanks.


was (Author: mgaido):
I am resolving this as no further action can be taken IMHO (other than waiting 
for SPARK-2260)

> GeneratedIteratorForCodegenStage1 grows beyond 64 KB
> 
>
> Key: SPARK-24481
> URL: https://issues.apache.org/jira/browse/SPARK-24481
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Emr 5.13.0 and Databricks Cloud 4.0
>Reporter: Andrew Conegliano
>Priority: Major
> Attachments: log4j-active(1).log
>
>
> Similar to other "grows beyond 64 KB" errors. It happens with a large case 
> statement:
> {code:java}
> import org.apache.spark.sql.functions._
> import scala.collection.mutable
> import org.apache.spark.sql.Column
> var rdd = sc.parallelize(Array("""{
> "event":
> {
> "timestamp": 1521086591110,
> "event_name": "yu",
> "page":
> {
> "page_url": "https://",
> "page_name": "es"
> },
> "properties":
> {
> "id": "87",
> "action": "action",
> "navigate_action": "navigate_action"
> }
> }
> }
> """))
> var df = spark.read.json(rdd)
> df = 
> df.select("event.properties.id","event.timestamp","event.page.page_url","event.properties.action","event.page.page_name","event.event_name","event.properties.navigate_action")
> .toDF("id","event_time","url","action","page_name","event_name","navigation_action")
> var a = "case "
> for(i <- 1 to 300){
>   a = a + s"when action like '$i%' THEN '$i' "
> }
> a = a + " else null end as task_id"
> val expression = expr(a)
> df = df.filter("id is not null and id <> '' and event_time is not null")
> val transformationExpressions: mutable.HashMap[String, Column] = 
> mutable.HashMap(
> "action" -> expr("coalesce(action, navigation_action) as action"),
> "task_id" -> expression
> )
> for((col, expr) <- transformationExpressions)
> df = df.withColumn(col, expr)
> df = df.filter("(action is not null and action <> '') or (page_name is not 
> null and page_name <> '')")
> df.show
> {code}
>  
> Exception:
> {code:java}
> 18/06/07 01:06:34 ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Code of method 
> "project_doConsume$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIteratorForCodegenStage1;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>   at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1444)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1523)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1520)
>   at 
> com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3936)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4806)
>   at 
> 

[jira] [Commented] (SPARK-14376) spark.ml parity for trees

2018-06-12 Thread Lee Dongjin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509658#comment-16509658
 ] 

Lee Dongjin commented on SPARK-14376:
-

[~josephkb] Excuse me, is there any reason this issue is still open? It seems 
that all sub-issues and linked issues are resolved.

> spark.ml parity for trees
> -
>
> Key: SPARK-14376
> URL: https://issues.apache.org/jira/browse/SPARK-14376
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process

2018-06-12 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509569#comment-16509569
 ] 

Kazuaki Ishizaki commented on SPARK-24529:
--

I am working on this.

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable the Java bytecode checking tool 
> [spotbugs|https://spotbugs.github.io/] to catch possible integer overflow at 
> multiplication. Due to tool limitations, some other checks will also be enabled.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24529) Add spotbugs into maven build process

2018-06-12 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24529:


 Summary: Add spotbugs into maven build process
 Key: SPARK-24529
 URL: https://issues.apache.org/jira/browse/SPARK-24529
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


We will enable the Java bytecode checking tool 
[spotbugs|https://spotbugs.github.io/] to catch possible integer overflow at 
multiplication. Due to tool limitations, some other checks will also be enabled.
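
For illustration, a minimal Scala sketch (variable names and values are 
hypothetical) of the integer-overflow-at-multiplication pattern such a check is 
meant to flag:
{code:scala}
val numRecords: Int = 100000000   // 1e8 records
val recordSize: Int = 64          // bytes per record

// Int * Int is evaluated in 32-bit arithmetic, so the product wraps around
// before it is widened to Long: 2105032704 instead of 6400000000.
val overflowed: Long = numRecords * recordSize

// Widening one operand first keeps the whole multiplication in 64-bit arithmetic.
val correct: Long = numRecords.toLong * recordSize
{code}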



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24474) Cores are left idle when there are a lot of tasks to run

2018-06-12 Thread Al M (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-24474:
-
Summary: Cores are left idle when there are a lot of tasks to run  (was: 
Cores are left idle when there are a lot of stages to run)

> Cores are left idle when there are a lot of tasks to run
> 
>
> Key: SPARK-24474
> URL: https://issues.apache.org/jira/browse/SPARK-24474
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Al M
>Priority: Major
>
> I've observed an issue happening consistently when:
>  * A job contains a join of two datasets
>  * One dataset is much larger than the other
>  * Both datasets require some processing before they are joined
> What I have observed is:
>  * 2 stages are initially active to run processing on the two datasets
>  ** These stages are run in parallel
>  ** One stage has significantly more tasks than the other (e.g. one has 30k 
> tasks and the other has 2k tasks)
>  ** Spark allocates a similar (though not exactly equal) number of cores to 
> each stage
>  * First stage completes (for the smaller dataset)
>  ** Now there is only one stage running
>  ** It still has many tasks left (usually > 20k tasks)
>  ** Around half the cores are idle (e.g. Total Cores = 200, active tasks = 
> 103)
>  ** This continues until the second stage completes
>  * Second stage completes, and third begins (the stage that actually joins 
> the data)
>  ** This stage works fine, no cores are idle (e.g. Total Cores = 200, active 
> tasks = 200)
> Other interesting things about this:
>  * It seems that when we have multiple stages active, and one of them 
> finishes, it does not actually release any cores to existing stages
>  * Once all active stages are done, we release all cores to new stages
>  * I can't reproduce this locally on my machine, only on a cluster with YARN 
> enabled
>  * It happens when dynamic allocation is enabled, and when it is disabled
>  * The stage that hangs (referred to as "Second stage" above) has a lower 
> 'Stage Id' than the first one that completes
>  * This happens with spark.shuffle.service.enabled set to true and false



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-12 Thread Ohad Raviv (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509465#comment-16509465
 ] 

Ohad Raviv commented on SPARK-24528:


[~cloud_fan], [~viirya] - Hi, I found a somewhat similar issue to [SPARK-24410] 
and would really appreciate it if you could tell me what you think.

> Missing optimization for Aggregations/Windowing on a bucketed table
> ---
>
> Key: SPARK-24528
> URL: https://issues.apache.org/jira/browse/SPARK-24528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Ohad Raviv
>Priority: Major
>
> Closely related to SPARK-24410, we're trying to optimize a very common use 
> case we have: getting the most up-to-date row by id from a fact table.
> We're saving the table bucketed to skip the shuffle stage, but we still 
> "waste" time on the Sort operator even though the data is already sorted.
> Here's a good example:
> {code:java}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("key", "t1")
> .saveAsTable("a1"){code}
> {code:java}
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
> key#24L, t1, t1#25L, t2, t2#26L))])
> +- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, 
> t1#25L, key, key#24L, t1, t1#25L, t2, t2#26L))])
> +- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
> Format: Parquet, Location: ...{code}
>  
> And here's a worse, but more realistic, example:
> {code:java}
> sparkSession.sql("set spark.sql.shuffle.partitions=2")
> sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain
> == Physical Plan ==
> SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
> key#32L, t1, t1#33L, t2, t2#34L))])
> +- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, 
> t1#33L, key, key#32L, t1, t1#33L, t2, t2#34L))])
> +- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
> +- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
> Format: Parquet, Location: ...
> {code}
>  
> I've traced the problem to DataSourceScanExec#235:
> {code:java}
> val sortOrder = if (sortColumns.nonEmpty) {
>   // In case of bucketing, its possible to have multiple files belonging to 
> the
>   // same bucket in a given relation. Each of these files are locally sorted
>   // but those files combined together are not globally sorted. Given that,
>   // the RDD partition will not be sorted even if the relation has sort 
> columns set
>   // Current solution is to check if all the buckets have a single file in it
>   val files = selectedPartitions.flatMap(partition => partition.files)
>   val bucketToFilesGrouping =
> files.map(_.getPath.getName).groupBy(file => 
> BucketingUtils.getBucketId(file))
>   val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
> 1){code}
> So obviously the code avoids dealing with this situation for now.
> Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24528) Missing optimization for Aggregations/Windowing on a bucketed table

2018-06-12 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-24528:
--

 Summary: Missing optimization for Aggregations/Windowing on a 
bucketed table
 Key: SPARK-24528
 URL: https://issues.apache.org/jira/browse/SPARK-24528
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0, 2.4.0
Reporter: Ohad Raviv


Closely related to SPARK-24410, we're trying to optimize a very common use 
case we have: getting the most up-to-date row by id from a fact table.

We're saving the table bucketed to skip the shuffle stage, but we still 
"waste" time on the Sort operator even though the data is already sorted.

Here's a good example:
{code:java}
sparkSession.range(N).selectExpr(
  "id as key",
  "id % 2 as t1",
  "id % 3 as t2")
.repartition(col("key"))
.write
  .mode(SaveMode.Overwrite)
.bucketBy(3, "key")
.sortBy("key", "t1")
.saveAsTable("a1"){code}
{code:java}
sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain

== Physical Plan ==
SortAggregate(key=[key#24L], functions=[max(named_struct(t1, t1#25L, key, 
key#24L, t1, t1#25L, t2, t2#26L))])
+- SortAggregate(key=[key#24L], functions=[partial_max(named_struct(t1, t1#25L, 
key, key#24L, t1, t1#25L, t2, t2#26L))])
+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: true, 
Format: Parquet, Location: ...{code}
 

And here's a worse, but more realistic, example:
{code:java}
sparkSession.sql("set spark.sql.shuffle.partitions=2")
sparkSession.sql("select max(struct(t1, *)) from a1 group by key").explain

== Physical Plan ==
SortAggregate(key=[key#32L], functions=[max(named_struct(t1, t1#33L, key, 
key#32L, t1, t1#33L, t2, t2#34L))])
+- SortAggregate(key=[key#32L], functions=[partial_max(named_struct(t1, t1#33L, 
key, key#32L, t1, t1#33L, t2, t2#34L))])
+- *(1) Sort [key#32L ASC NULLS FIRST], false, 0
+- *(1) FileScan parquet default.a1[key#32L,t1#33L,t2#34L] Batched: true, 
Format: Parquet, Location: ...

{code}
 

I've traced the problem to DataSourceScanExec#235:
{code:java}
val sortOrder = if (sortColumns.nonEmpty) {
  // In case of bucketing, its possible to have multiple files belonging to the
  // same bucket in a given relation. Each of these files are locally sorted
  // but those files combined together are not globally sorted. Given that,
  // the RDD partition will not be sorted even if the relation has sort columns 
set
  // Current solution is to check if all the buckets have a single file in it

  val files = selectedPartitions.flatMap(partition => partition.files)
  val bucketToFilesGrouping =
files.map(_.getPath.getName).groupBy(file => 
BucketingUtils.getBucketId(file))
  val singleFilePartitions = bucketToFilesGrouping.forall(p => p._2.length <= 
1){code}
So obviously the code avoids dealing with this situation for now.

Could you think of a way to solve this or bypass it?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24500) UnsupportedOperationException when trying to execute Union plan with Stream of children

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24500:


Assignee: Herman van Hovell  (was: Apache Spark)

> UnsupportedOperationException when trying to execute Union plan with Stream 
> of children
> ---
>
> Key: SPARK-24500
> URL: https://issues.apache.org/jira/browse/SPARK-24500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Herman van Hovell
>Priority: Major
>
> To reproduce:
> {code}
> import org.apache.spark.sql.catalyst.plans.logical._
> def range(i: Int) = Range(1, i, 1, 1)
> val union = Union(Stream(range(3), range(5), range(7)))
> spark.sessionState.planner.plan(union).next().execute()
> {code}
> produces
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.execution.PlanLater.doExecute(SparkStrategies.scala:55)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> {code}
> The SparkPlan looks like this:
> {code}
> :- Range (1, 3, step=1, splits=1)
> :- PlanLater Range (1, 5, step=1, splits=Some(1))
> +- PlanLater Range (1, 7, step=1, splits=Some(1))
> {code}
> So not all of it was planned (some PlanLater still in there).
> This appears to be a longstanding issue.
> I traced it to the use of var in TreeNode.
> For example in mapChildren:
> {code}
> case args: Traversable[_] => args.map {
>   case arg: TreeNode[_] if containsChild(arg) =>
> val newChild = f(arg.asInstanceOf[BaseType])
> if (!(newChild fastEquals arg)) {
>   changed = true
> {code}
> If args is a Stream then changed will never be set here, ultimately causing 
> the method to return the original plan.
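
As an aside, a standalone Scala sketch (plain Scala 2.11/2.12 Stream semantics, 
not Spark code) of why a side-effecting flag inside map only observes the first 
element of a Stream, which matches the plan above where only the first child was 
planned:
{code:scala}
var mappedCount = 0

// Stream.map is strict in the head but lazy in the tail, so the closure has
// only run for the first element at this point.
val planned = Stream(1, 2, 3).map { i =>
  mappedCount += 1
  i * 10
}

println(mappedCount)   // 1

planned.toList         // forcing the stream maps the remaining elements
println(mappedCount)   // 3
{code}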



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24500) UnsupportedOperationException when trying to execute Union plan with Stream of children

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509453#comment-16509453
 ] 

Apache Spark commented on SPARK-24500:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/21539

> UnsupportedOperationException when trying to execute Union plan with Stream 
> of children
> ---
>
> Key: SPARK-24500
> URL: https://issues.apache.org/jira/browse/SPARK-24500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Herman van Hovell
>Priority: Major
>
> To reproduce:
> {code}
> import org.apache.spark.sql.catalyst.plans.logical._
> def range(i: Int) = Range(1, i, 1, 1)
> val union = Union(Stream(range(3), range(5), range(7)))
> spark.sessionState.planner.plan(union).next().execute()
> {code}
> produces
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.execution.PlanLater.doExecute(SparkStrategies.scala:55)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> {code}
> The SparkPlan looks like this:
> {code}
> :- Range (1, 3, step=1, splits=1)
> :- PlanLater Range (1, 5, step=1, splits=Some(1))
> +- PlanLater Range (1, 7, step=1, splits=Some(1))
> {code}
> So not all of it was planned (some PlanLater still in there).
> This appears to be a longstanding issue.
> I traced it to the use of var in TreeNode.
> For example in mapChildren:
> {code}
> case args: Traversable[_] => args.map {
>   case arg: TreeNode[_] if containsChild(arg) =>
> val newChild = f(arg.asInstanceOf[BaseType])
> if (!(newChild fastEquals arg)) {
>   changed = true
> {code}
> If args is a Stream then changed will never be set here, ultimately causing 
> the method to return the original plan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24500) UnsupportedOperationException when trying to execute Union plan with Stream of children

2018-06-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24500:


Assignee: Apache Spark  (was: Herman van Hovell)

> UnsupportedOperationException when trying to execute Union plan with Stream 
> of children
> ---
>
> Key: SPARK-24500
> URL: https://issues.apache.org/jira/browse/SPARK-24500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bogdan Raducanu
>Assignee: Apache Spark
>Priority: Major
>
> To reproduce:
> {code}
> import org.apache.spark.sql.catalyst.plans.logical._
> def range(i: Int) = Range(1, i, 1, 1)
> val union = Union(Stream(range(3), range(5), range(7)))
> spark.sessionState.planner.plan(union).next().execute()
> {code}
> produces
> {code}
> java.lang.UnsupportedOperationException
>   at 
> org.apache.spark.sql.execution.PlanLater.doExecute(SparkStrategies.scala:55)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> {code}
> The SparkPlan looks like this:
> {code}
> :- Range (1, 3, step=1, splits=1)
> :- PlanLater Range (1, 5, step=1, splits=Some(1))
> +- PlanLater Range (1, 7, step=1, splits=Some(1))
> {code}
> So not all of it was planned (some PlanLater still in there).
> This appears to be a longstanding issue.
> I traced it to the use of var in TreeNode.
> For example in mapChildren:
> {code}
> case args: Traversable[_] => args.map {
>   case arg: TreeNode[_] if containsChild(arg) =>
> val newChild = f(arg.asInstanceOf[BaseType])
> if (!(newChild fastEquals arg)) {
>   changed = true
> {code}
> If args is a Stream then changed will never be set here, ultimately causing 
> the method to return the original plan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23754) StopIterator exception in Python UDF results in partial result

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509442#comment-16509442
 ] 

Apache Spark commented on SPARK-23754:
--

User 'e-dorigatti' has created a pull request for this issue:
https://github.com/apache/spark/pull/21538

> StopIterator exception in Python UDF results in partial result
> --
>
> Key: SPARK-23754
> URL: https://issues.apache.org/jira/browse/SPARK-23754
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Li Jin
>Assignee: Emilio Dorigatti
>Priority: Blocker
> Fix For: 2.3.1, 2.4.0
>
>
> Reproduce:
> {code:java}
> df = spark.range(0, 1000)
> from pyspark.sql.functions import udf
> def foo(x):
> raise StopIteration()
> df.withColumn('v', udf(foo)).show()
> # Results
> # +---+---+
> # | id|  v|
> # +---+---+
> # +---+---+{code}
> I think the task should fail in this case



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24505) Convert strings in codegen to blocks: Cast and BoundAttribute

2018-06-12 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-24505:

Description: The CodeBlock interpolator now accepts strings. Based on 
previous discussion, we should forbid string interpolation. We will 
incrementally convert strings in codegen methods to blocks. This is for Cast 
and BoundAttribute.  (was: The CodeBlock interpolator now accepts strings. 
Based on previous discussion, we should forbid string interpolation.)

> Convert strings in codegen to blocks: Cast and BoundAttribute
> -
>
> Key: SPARK-24505
> URL: https://issues.apache.org/jira/browse/SPARK-24505
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> The CodeBlock interpolator now accepts strings. Based on previous discussion, 
> we should forbid string interpolation. We will incrementally convert strings 
> in codegen methods to blocks. This is for Cast and BoundAttribute.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24505) Convert strings in codegen to blocks: Cast and BoundAttribute

2018-06-12 Thread Liang-Chi Hsieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-24505:

Summary: Convert strings in codegen to blocks: Cast and BoundAttribute  
(was: Forbidding string interpolation in CodeBlock)

> Convert strings in codegen to blocks: Cast and BoundAttribute
> -
>
> Key: SPARK-24505
> URL: https://issues.apache.org/jira/browse/SPARK-24505
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> The CodeBlock interpolator now accepts strings. Based on previous discussion, 
> we should forbid string interpolation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24505) Forbidding string interpolation in CodeBlock

2018-06-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509362#comment-16509362
 ] 

Apache Spark commented on SPARK-24505:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/21537

> Forbidding string interpolation in CodeBlock
> 
>
> Key: SPARK-24505
> URL: https://issues.apache.org/jira/browse/SPARK-24505
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> The CodeBlock interpolator now accepts strings. Based on previous discussion, 
> we should forbid string interpolation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24527) select column alias should support quotation marks

2018-06-12 Thread ice bai (JIRA)
ice bai created SPARK-24527:
---

 Summary: select column alias should support quotation marks
 Key: SPARK-24527
 URL: https://issues.apache.org/jira/browse/SPARK-24527
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: ice bai


It fails when a user uses spark-sql or the SQL API to select columns with a 
quoted alias, whereas Hive handles this well. For example:

select 'name' as 'nm';
select 'name' as "nm";



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org