[jira] [Commented] (SPARK-29158) Expose SerializableConfiguration for DSv2

2019-12-16 Thread Jorge Machado (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997947#comment-16997947
 ] 

Jorge Machado commented on SPARK-29158:
---

How can we get SerializableConfiguration with 2.4.4? Is there any alternative?
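
One common workaround on 2.4.x, where org.apache.spark.util.SerializableConfiguration 
is still private[spark], is to ship your own small serializable wrapper around the 
Hadoop Configuration. A minimal sketch (the class name is arbitrary; it follows the 
same write/readFields approach Spark uses internally):

{code:scala}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Hadoop's Configuration is not Serializable, so write/read its entries
// manually during Java serialization.
class MySerializableConfiguration(@transient var value: Configuration)
  extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}

// Usage: broadcast it from the driver and read .value inside reader factories, e.g.
// val confBc = sc.broadcast(new MySerializableConfiguration(sc.hadoopConfiguration))
{code}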

> Expose SerializableConfiguration for DSv2
> -
>
> Key: SPARK-29158
> URL: https://issues.apache.org/jira/browse/SPARK-29158
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.0.0
>
>
> Since we use it frequently inside our own DataSourceV2 implementations (13
> times, per `grep -r broadcastedConf ./sql/core/src/ | grep val | wc -l`),
> we should expose SerializableConfiguration for DSv2 development work.






[jira] [Created] (SPARK-30283) V2 Command logical plan should use UnresolvedV2Relation for a table

2019-12-16 Thread Terry Kim (Jira)
Terry Kim created SPARK-30283:
-

 Summary: V2 Command logical plan should use UnresolvedV2Relation 
for a table
 Key: SPARK-30283
 URL: https://issues.apache.org/jira/browse/SPARK-30283
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim


For the following v2 commands, multi-part names are passed directly to the 
command without looking up temp views, so they are always resolved to tables:
 * DROP TABLE
 * REFRESH TABLE
 * RENAME TABLE
 * REPLACE TABLE

They should be updated to use UnresolvedV2Relation so that temp views are 
looked up first in Analyzer.ResolveTables.
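
For illustration, a sketch of the current behaviour, following the style of the 
example in SPARK-30282 (the catalog and table names are hypothetical):

{code:scala}
sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i")
sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i")
sql("USE testcat.ns")
sql("DROP TABLE t") // drops testcat.ns.t directly; the temp view 't' is never considered
{code}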






[jira] [Commented] (SPARK-16183) Large Spark SQL commands cause StackOverflowError in parser when using sqlContext.sql

2019-12-16 Thread Shubhradeep Majumdar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997931#comment-16997931
 ] 

Shubhradeep Majumdar commented on SPARK-16183:
--

Yes, the issue still exists in Spark 2.4.0. The point to note is that my code 
was developed on Spark 2.2.0 and used to work fine; after we recently upgraded 
to Spark 2.4.0, the `StackOverflowError` now occurs.

> Large Spark SQL commands cause StackOverflowError in parser when using 
> sqlContext.sql
> -
>
> Key: SPARK-16183
> URL: https://issues.apache.org/jira/browse/SPARK-16183
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1, 2.0.0
> Environment: Running on AWS EMR
>Reporter: Matthew Porter
>Priority: Major
>
> Hi,
> I have created a PySpark SQL-based tool which auto-generates a complex SQL 
> command to be run via sqlContext.sql(cmd) based on a large number of 
> parameters. As the number of input files to be filtered and joined in this 
> query grows, so does the length of the SQL query. The tool runs fine until 
> roughly 200 files are included in the join, at which point the SQL command 
> becomes very long (~100K characters). Only on these longer queries does Spark 
> fail, throwing an exception due to what appears to be excessive recursion 
> within the SparkSQL parser:
> {code}
> Traceback (most recent call last):
> ...
> merged_df = sqlsc.sql(cmd)
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 
> 580, in sql
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
> line 813, in __call__
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, 
> in deco
>   File "/usr/lib/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 
> 308, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o173.sql.
> : java.lang.StackOverflowError
>   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   ... (the same two parser-combinator frames repeat until the stack is exhausted; the trace is truncated here)

[jira] [Created] (SPARK-30282) UnresolvedV2Relation should be resolved to temp view first

2019-12-16 Thread Terry Kim (Jira)
Terry Kim created SPARK-30282:
-

 Summary: UnresolvedV2Relation should be resolved to temp view first
 Key: SPARK-30282
 URL: https://issues.apache.org/jira/browse/SPARK-30282
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim


For the following v2 commands, _Analyzer.ResolveTables_ does not check against 
the temp views before resolving _UnresolvedV2Relation_, thus it always resolves 
_UnresolvedV2Relation_ to a table:
 * ALTER TABLE
 * DESCRIBE TABLE
 * SHOW TBLPROPERTIES

Thus, in the following example, 't' will be resolved to a table, not a temp 
view:
{code:java}
sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i")
sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i")
sql("USE testcat.ns")
sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table
{code}
For v2 commands, if a table name resolves to a temp view, the command should 
error out with a message saying that v2 commands cannot handle temp views.
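
Sketch of the intended behaviour after the fix, continuing the example above 
(the error message shown is illustrative, not the final wording):

{code:scala}
sql("SHOW TBLPROPERTIES t")
// 't' should now be looked up as a temp view first, and the v2 command should fail, e.g.:
// org.apache.spark.sql.AnalysisException: 't' is a temp view, which is not supported by SHOW TBLPROPERTIES
{code}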






[jira] [Commented] (SPARK-30281) 'archive' option in FileStreamSource misses to consider partitioned and recursive option

2019-12-16 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997905#comment-16997905
 ] 

Jungtaek Lim commented on SPARK-30281:
--

Will submit a PR soon.

> 'archive' option in FileStreamSource misses to consider partitioned and 
> recursive option
> 
>
> Key: SPARK-30281
> URL: https://issues.apache.org/jira/browse/SPARK-30281
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> The cleanup option for FileStreamSource was introduced in SPARK-20568.
> To simplify the validation of the archive path, it relied on the fact that 
> FileStreamSource only reads files that meet one of two conditions: 1) the 
> parent directory matches the source pattern, or 2) the file itself matches 
> the source pattern.
> During post-hoc review we found other cases that invalidate this assumption: 
> the partitioned and recursive options. With these options, FileStreamSource 
> can read arbitrary files in subdirectories that match the source pattern, so 
> simply checking the depth of the archive path does not work.
> We need to restore the path-check logic, though it will not be easy to 
> explain to end users.






[jira] [Created] (SPARK-30281) 'archive' option in FileStreamSource misses to consider partitioned and recursive option

2019-12-16 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-30281:


 Summary: 'archive' option in FileStreamSource misses to consider 
partitioned and recursive option
 Key: SPARK-30281
 URL: https://issues.apache.org/jira/browse/SPARK-30281
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


The cleanup option for FileStreamSource was introduced in SPARK-20568.

To simplify the validation of the archive path, it relied on the fact that 
FileStreamSource only reads files that meet one of two conditions: 1) the 
parent directory matches the source pattern, or 2) the file itself matches the 
source pattern.

During post-hoc review we found other cases that invalidate this assumption: 
the partitioned and recursive options. With these options, FileStreamSource can 
read arbitrary files in subdirectories that match the source pattern, so simply 
checking the depth of the archive path does not work.

We need to restore the path-check logic, though it will not be easy to explain 
to end users.
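
For reference, a usage sketch of the cleanup option added in SPARK-20568 (option 
names as introduced there; the schema and paths are placeholders):

{code:scala}
val stream = spark.readStream
  .format("csv")
  .schema("id INT, value STRING")
  .option("cleanSource", "archive")            // archive completed files instead of leaving them in place
  .option("sourceArchiveDir", "/data/archive") // must not overlap with the source pattern
  .load("/data/in/*/*.csv")                    // with partitioned/recursive reads, matched files sit at varying depths
{code}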






[jira] [Created] (SPARK-30280) Update documentation

2019-12-16 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30280:
---

 Summary: Update documentation
 Key: SPARK-30280
 URL: https://issues.apache.org/jira/browse/SPARK-30280
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Yuming Wang









[jira] [Commented] (SPARK-16996) Hive ACID delta files not seen

2019-12-16 Thread SandhyaMora (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997886#comment-16997886
 ] 

SandhyaMora commented on SPARK-16996:
-

Any update on writing data into Hive ACID tables from Spark?

> Hive ACID delta files not seen
> --
>
> Key: SPARK-16996
> URL: https://issues.apache.org/jira/browse/SPARK-16996
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.1.2, 2.2.0
> Environment: Hive 1.2.1, Spark 1.5.2
>Reporter: Benjamin BONNET
>Priority: Major
>
> spark-sql seems not to see data stored as delta files in an ACID Hive table.
> Actually I encountered the same problem as described here: 
> http://stackoverflow.com/questions/35955666/spark-sql-is-not-returning-records-for-hive-transactional-tables-on-hdp
> For example, create an ACID table with HiveCLI and insert a row :
> {code}
> set hive.support.concurrency=true;
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.compactor.initiator.on=true;
> set hive.compactor.worker.threads=1;
>  CREATE TABLE deltas(cle string,valeur string) CLUSTERED BY (cle) INTO 1 
> BUCKETS
> ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS 
>   INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> TBLPROPERTIES ('transactional'='true');
> INSERT INTO deltas VALUES("a","a");
> {code}
> Then make a query with spark-sql CLI :
> {code}
> SELECT * FROM deltas;
> {code}
> That query gets no result and there are no errors in logs.
> If you go to HDFS to inspect table files, you find only deltas
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxr-x---   - me hdfs  0 2016-08-10 14:03 
> /apps/hive/warehouse/deltas/delta_0020943_0020943
> {code}
> Then if you run compaction on that table (in HiveCLI) :
> {code}
> ALTER TABLE deltas COMPACT 'MAJOR';
> {code}
> As a result, the delta will be compacted into a base file:
> {code}
> ~>hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 1 items
> drwxrwxrwx   - me hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> {code}
> Go back to spark-sql and the same query gets a result :
> {code}
> SELECT * FROM deltas;
> a   a
> Time taken: 0.477 seconds, Fetched 1 row(s)
> {code}
> But next time you make an insert into Hive table : 
> {code}
> INSERT INTO deltas VALUES("b","b");
> {code}
> spark-sql will immediately see changes : 
> {code}
> SELECT * FROM deltas;
> a   a
> b   b
> Time taken: 0.122 seconds, Fetched 2 row(s)
> {code}
> Yet there was no other compaction, but spark-sql "sees" the base AND the 
> delta file :
> {code}
> ~> hdfs dfs -ls /apps/hive/warehouse/deltas
> Found 2 items
> drwxrwxrwx   - valdata hdfs  0 2016-08-10 15:25 
> /apps/hive/warehouse/deltas/base_0020943
> drwxr-x---   - valdata hdfs  0 2016-08-10 15:31 
> /apps/hive/warehouse/deltas/delta_0020956_0020956
> {code}






[jira] [Resolved] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2019-12-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30201.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26831
[https://github.com/apache/spark/pull/26831]

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
> Fix For: 3.0.0
>
>
> Spark currently uses `ObjectInspectorCopyOption.JAVA` as the object inspector 
> option, which converts every string to a UTF-8 string. When writing data that 
> is not valid UTF-8, the replacement bytes `EFBFBD` appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` so that the raw bytes are 
> passed through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD
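
The reproduction steps above, written out as spark-shell statements (a sketch; 
the location is a placeholder for a directory holding a file with the raw bytes 
0xAABBCC):

{code:scala}
sql("CREATE TABLE test1 (c STRING) LOCATION '/tmp/non_utf8_data'")
sql("SELECT hex(c) FROM test1").show()  // AABBCC
sql("CREATE TABLE test2 (c STRING) AS SELECT c FROM test1")
sql("SELECT hex(c) FROM test2").show()  // EFBFBDEFBFBDEFBFBD with ObjectInspectorCopyOption.JAVA
{code}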






[jira] [Assigned] (SPARK-30201) HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT

2019-12-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30201:
---

Assignee: ulysses you

> HiveOutputWriter standardOI should use ObjectInspectorCopyOption.DEFAULT
> 
>
> Key: SPARK-30201
> URL: https://issues.apache.org/jira/browse/SPARK-30201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Critical
>
> Spark currently uses `ObjectInspectorCopyOption.JAVA` as the object inspector 
> option, which converts every string to a UTF-8 string. When writing data that 
> is not valid UTF-8, the replacement bytes `EFBFBD` appear.
> We should use `ObjectInspectorCopyOption.DEFAULT` so that the raw bytes are 
> passed through.
> Here is the way to reproduce:
> 1. make a file containing the hex bytes 'AABBCC', which are not valid UTF-8.
> 2. create table test1 (c string) location '$file_path';
> 3. select hex(c) from test1; // AABBCC
> 4. create table test2 (c string) as select c from test1;
> 5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD






[jira] [Created] (SPARK-30279) Support 32 or more grouping attributes for GROUPING_ID

2019-12-16 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-30279:


 Summary: Support 32 or more grouping attributes for GROUPING_ID 
 Key: SPARK-30279
 URL: https://issues.apache.org/jira/browse/SPARK-30279
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


This ticket targets supporting 32 or more grouping attributes for GROUPING_ID. 
In the current master, an integer overflow can occur when computing grouping IDs;
https://github.com/apache/spark/blob/e75d9afb2f282ce79c9fd8bce031287739326a4f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L613

For example, the query below generates wrong grouping IDs in the master;
{code}
scala> val numCols = 32 // or, 31
scala> val cols = (0 until numCols).map { i => s"c$i" }
scala> sql(s"create table test_$numCols (${cols.map(c => s"$c 
int").mkString(",")}, v int) using parquet")
scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
scala> sql(s"insert into test_$numCols values ($insertVals,3)")
scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping 
sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
scala> sql(s"drop table test_$numCols")

// numCols = 32
+-+--+
|grouping_id()|sum(v)|
+-+--+
|0|3 |
|0|3 | // Wrong Grouping ID
+-+--+

// numCols = 31
+-+--+
|grouping_id()|sum(v)|
+-+--+
|0|3 |
|1|3 |
+-+--+
{code}
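
The root cause is easy to see in isolation: the grouping ID is a bit vector with 
one bit per grouping column, so a 32-bit Int runs out of bits at 32 columns. A 
standalone sketch of the wrap-around (not the actual Spark code):

{code:scala}
scala> 1 << 31  // 31 grouping columns still fit, but this is already the sign bit
res0: Int = -2147483648

scala> 1 << 32  // Scala/Java shift counts wrap modulo 32, so this silently collides with bit 0
res1: Int = 1
{code}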






[jira] [Assigned] (SPARK-30094) Current namespace is not used during table resolution

2019-12-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30094:
---

Assignee: Terry Kim

> Current namespace is not used during table resolution
> -
>
> Key: SPARK-30094
> URL: https://issues.apache.org/jira/browse/SPARK-30094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> The following example shows the scenario where the current namespace is not 
> respected:
> {code:java}
> sql("CREATE TABLE testcat.t USING foo AS SELECT 1 AS id")
> sql("USE testcat")
> sql("SHOW CURRENT NAMESPACE").show
> +---+-+
> |catalog|namespace|
> +---+-+
> |testcat| |
> +---+-+
> // `t` is resolved to `testcat.t`.
> sql("DESCRIBE t").show
> +---+-+---+
> |   col_name|data_type|comment|
> +---+-+---+
> | id|  int|   |
> |   | |   |
> | # Partitioning| |   |
> |Not partitioned| |   |
> +---+-+---+
> // Now create a table under `ns` namespace.
> sql("CREATE TABLE testcat.ns.t USING foo AS SELECT 1 AS id")
> sql("USE testcat.ns")
> sql("SHOW CURRENT NAMESPACE").show
> +---+-+
> |catalog|namespace|
> +---+-+
> |testcat|   ns|
> +---+-+
> // `t` is not resolved any longer since the current namespace `ns` is not 
> used.
> sql("DESCRIBE t").show
> org.apache.spark.sql.AnalysisException: Invalid command: 't' is a view not a 
> table.; line 1 pos 0;
> 'DescribeTable 'UnresolvedV2Relation [t], 
> org.apache.spark.sql.connector.InMemoryTableCatalog@2c5ead80, `t`, false
> {code}






[jira] [Resolved] (SPARK-30094) Current namespace is not used during table resolution

2019-12-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30094.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26894
[https://github.com/apache/spark/pull/26894]

> Current namespace is not used during table resolution
> -
>
> Key: SPARK-30094
> URL: https://issues.apache.org/jira/browse/SPARK-30094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> The following example shows the scenario where the current namespace is not 
> respected:
> {code:java}
> sql("CREATE TABLE testcat.t USING foo AS SELECT 1 AS id")
> sql("USE testcat")
> sql("SHOW CURRENT NAMESPACE").show
> +---+-+
> |catalog|namespace|
> +---+-+
> |testcat| |
> +---+-+
> // `t` is resolved to `testcat.t`.
> sql("DESCRIBE t").show
> +---+-+---+
> |   col_name|data_type|comment|
> +---+-+---+
> | id|  int|   |
> |   | |   |
> | # Partitioning| |   |
> |Not partitioned| |   |
> +---+-+---+
> // Now create a table under `ns` namespace.
> sql("CREATE TABLE testcat.ns.t USING foo AS SELECT 1 AS id")
> sql("USE testcat.ns")
> sql("SHOW CURRENT NAMESPACE").show
> +---+-+
> |catalog|namespace|
> +---+-+
> |testcat|   ns|
> +---+-+
> // `t` is not resolved any longer since the current namespace `ns` is not 
> used.
> sql("DESCRIBE t").show
> org.apache.spark.sql.AnalysisException: Invalid command: 't' is a view not a 
> table.; line 1 pos 0;
> 'DescribeTable 'UnresolvedV2Relation [t], 
> org.apache.spark.sql.connector.InMemoryTableCatalog@2c5ead80, `t`, false
> {code}






[jira] [Resolved] (SPARK-30277) NoSuchMethodError in Spark 3.0.0-preview with Delta Lake

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-30277.

Resolution: Not A Problem

That's an internal Spark class, which means that if there is a problem, it's not 
in Spark.

> NoSuchMethodError in Spark 3.0.0-preview with Delta Lake
> 
>
> Key: SPARK-30277
> URL: https://issues.apache.org/jira/browse/SPARK-30277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark: 3.0.0-preview-bin-hadoop2.7
> Delta Lake: 0.5.0_2.12
> Java: 1.8.0_171
>Reporter: Victor Zhang
>Priority: Major
>
> Open spark shell with delta lake packages:
> {code:java}
> bin/spark-shell --master local --packages io.delta:delta-core_2.12:0.5.0{code}
> Create a delta table:
> {code:java}
> spark.range(5).write.format("delta").save("/tmp/delta-table1")
> {code}
> Throws NoSuchMethodException.
> {code:java}
> com.google.common.util.concurrent.ExecutionError: 
> java.lang.NoSuchMethodError: 
> org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at 
> com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
>   at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:740)
>   at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:702)
>   at 
> org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:126)
>   at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:71)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:69)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:87)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:110)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:109)
>   at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:829)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:829)
>   at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:309)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:236)
>   ... 47 elided
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
>   at 
> org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:122)
>   at 
> org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:120)
>   at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
>   at 
> org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:117)
>   at 
> org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:115)
>   at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
>   at org.apache.spark.sql.delta.DeltaLog.(DeltaLog.scala:79)
>   at 
> org.apache.spark.sql.delta.DeltaLog$$anon$3.$anonfun$call$2(DeltaLog.scala:744)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
>   at 
> org.apache.spark.sql.delta.DeltaLog$$anon$3.$anonfun$call$1(DeltaLog.scala:744)
>   at 
> com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77)
>   at 
> com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67)
>   at org.apache.spark.sql.delta.DeltaLog$.recordOperation(DeltaLog.scala:671)
>   at 
> 

[jira] [Created] (SPARK-30278) Update Spark SQL document menu for new changes

2019-12-16 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-30278:
---

 Summary: Update Spark SQL document menu for new changes
 Key: SPARK-30278
 URL: https://issues.apache.org/jira/browse/SPARK-30278
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.0.0
Reporter: Yuanjian Li


# Several recent changes to the Spark SQL documentation did not update 
menu-sql.yaml accordingly.
 # Update the demo code for the join strategy hints.






[jira] [Created] (SPARK-30277) NoSuchMethodError in Spark 3.0.0-preview with Delta Lake

2019-12-16 Thread Victor Zhang (Jira)
Victor Zhang created SPARK-30277:


 Summary: NoSuchMethodError in Spark 3.0.0-preview with Delta Lake
 Key: SPARK-30277
 URL: https://issues.apache.org/jira/browse/SPARK-30277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: Spark: 3.0.0-preview-bin-hadoop2.7

Delta Lake: 0.5.0_2.12

Java: 1.8.0_171
Reporter: Victor Zhang


Open spark shell with delta lake packages:
{code:java}
bin/spark-shell --master local --packages io.delta:delta-core_2.12:0.5.0{code}
Create a delta table:
{code:java}
spark.range(5).write.format("delta").save("/tmp/delta-table1")
{code}
Throws NoSuchMethodException.
{code:java}
com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: 
org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at 
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
  at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:740)
  at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:702)
  at 
org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:126)
  at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:71)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:69)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:87)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
  at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:110)
  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:109)
  at 
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:829)
  at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:829)
  at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:309)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:236)
  ... 47 elided
Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
  at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:122)
  at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:120)
  at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
  at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:117)
  at 
org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:115)
  at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
  at org.apache.spark.sql.delta.DeltaLog.<init>(DeltaLog.scala:79)
  at 
org.apache.spark.sql.delta.DeltaLog$$anon$3.$anonfun$call$2(DeltaLog.scala:744)
  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at 
org.apache.spark.sql.delta.DeltaLog$$anon$3.$anonfun$call$1(DeltaLog.scala:744)
  at 
com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77)
  at 
com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67)
  at org.apache.spark.sql.delta.DeltaLog$.recordOperation(DeltaLog.scala:671)
  at 
org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:103)
  at 
org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:89)
  at 
org.apache.spark.sql.delta.DeltaLog$.recordDeltaOperation(DeltaLog.scala:671)
  at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:743)
  at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:740)
  at 
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
  at 

[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-12-16 Thread Samuel Shepard (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997798#comment-16997798
 ] 

Samuel Shepard commented on SPARK-6235:
---

[~irashid] I meant the former (task result > 2G) as best I understand the 
architecture. Is there a different Jira for the ML library, since it affects 
PCA, that would be more appropriate?

Thanks for the suggestions. Spark is a beautiful system with a lot of kind 
effort put into it. Computational biology has huge feature spaces all over the 
place. The two could really work well together, I think. This issue feels like 
some sort of left over from 32-bit Java, cramping Spark's style. :(

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.






[jira] [Resolved] (SPARK-29164) Rewrite coalesce(boolean, booleanLit) as boolean expression

2019-12-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29164.
--
Resolution: Won't Fix

> Rewrite coalesce(boolean, booleanLit) as boolean expression
> ---
>
> Key: SPARK-29164
> URL: https://issues.apache.org/jira/browse/SPARK-29164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I propose the following expression rewrite optimizations:
> {code:java}
> coalesce(x: Boolean, true)  -> x or isnull(x)
> coalesce(x: Boolean, false) -> x and isnotnull(x){code}
> This pattern appears when translating Dataset filters on {{Option[Boolean]}} 
> columns: we might have a typed Dataset filter which looks like
> {code:java}
>  .filter(_.boolCol.getOrElse(DEFAULT_VALUE)){code}
> and the most idiomatic, user-friendly translation of this in Catalyst is to 
> use {{coalesce()}}. However, the {{coalesce()}} form of this expression is 
> not eligible for Parquet / data source filter pushdown.
> (We should write out truth-tables to double-check this rewrite's correctness)
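
For what it's worth, the rewrites can be checked against SQL three-valued logic 
directly in spark-shell; a quick sketch:

{code:scala}
sql("""
  SELECT x,
         coalesce(x, true)  AS c_true,  x OR isnull(x)     AS rw_true,
         coalesce(x, false) AS c_false, x AND isnotnull(x) AS rw_false
  FROM VALUES (true), (false), (CAST(NULL AS BOOLEAN)) AS t(x)
""").show()
// c_true == rw_true and c_false == rw_false for all three values of x,
// including x = NULL (NULL OR TRUE = TRUE, NULL AND FALSE = FALSE).
{code}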






[jira] [Commented] (SPARK-29164) Rewrite coalesce(boolean, booleanLit) as boolean expression

2019-12-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997791#comment-16997791
 ] 

Hyukjin Kwon commented on SPARK-29164:
--

Resolving per the discussion in the PR.

> Rewrite coalesce(boolean, booleanLit) as boolean expression
> ---
>
> Key: SPARK-29164
> URL: https://issues.apache.org/jira/browse/SPARK-29164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> I propose the following expression rewrite optimizations:
> {code:java}
> coalesce(x: Boolean, true)  -> x or isnull(x)
> coalesce(x: Boolean, false) -> x and isnotnull(x){code}
> This pattern appears when translating Dataset filters on {{Option[Boolean]}} 
> columns: we might have a typed Dataset filter which looks like
> {code:java}
>  .filter(_.boolCol.getOrElse(DEFAULT_VALUE)){code}
> and the most idiomatic, user-friendly translation of this in Catalyst is to 
> use {{coalesce()}}. However, the {{coalesce()}} form of this expression is 
> not eligible for Parquet / data source filter pushdown.
> (We should write out truth-tables to double-check this rewrite's correctness)






[jira] [Updated] (SPARK-30181) Throws runtime exception when filter metastore partition key that's not string type or integral types

2019-12-16 Thread Yu-Jhe Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu-Jhe Li updated SPARK-30181:
--
Description: 
The SQL below throws a runtime exception since Spark 2.4.0. I think it is a bug 
introduced by SPARK-22384:
{code:scala}
val df = Seq(
(1, java.sql.Timestamp.valueOf("2019-12-01 00:00:00"), 1), 
(2, java.sql.Timestamp.valueOf("2019-12-01 01:00:00"), 1)
  ).toDF("id", "dt", "value")
df.write.partitionBy("dt").mode("overwrite").saveAsTable("timestamp_part")

spark.sql("select * from timestamp_part where dt >= '2019-12-01 
00:00:00'").explain(true)
{code}
{noformat}
Caught Hive MetaException attempting to get partition metadata by filter from 
Hive. You can set the Spark configuration setting 
spark.sql.hive.manageFilesourcePartitions to false to work around this problem, 
however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK
  at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:774)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
  at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite.testMetastorePartitionFiltering(HiveClientSuite.scala:310)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite.org$apache$spark$sql$hive$client$HiveClientSuite$$testMetastorePartitionFiltering(HiveClientSuite.scala:282)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply$mcV$sp(HiveClientSuite.scala:105)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105)
  at 
org.apache.spark.sql.hive.client.HiveClientSuite$$anonfun$1.apply(HiveClientSuite.scala:105)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:103)
  at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
  at org.scalatest.FunSuite.runTest(FunSuite.scala:1560)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
  at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
  at scala.collection.immutable.List.foreach(List.scala:392)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
  at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
  at org.scalatest.Suite$class.run(Suite.scala:1147)
  at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
  at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
  at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
  at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
  at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:52)
  at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
  at 

[jira] [Updated] (SPARK-30276) Support Filter expression allows simultaneous use of DISTINCT

2019-12-16 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30276:
---
Summary: Support Filter expression allows simultaneous use of DISTINCT  
(was: Support Filter expression allow simultaneous use of DISTINCT)

> Support Filter expression allows simultaneous use of DISTINCT
> -
>
> Key: SPARK-30276
> URL: https://issues.apache.org/jira/browse/SPARK-30276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> SPARK-27986 only supports the FILTER clause without DISTINCT.
> We need to support the FILTER clause used together with DISTINCT.
> PostgreSQL supports this:
> {code:java}
> select ten, sum(distinct four) filter (where four > 10) from onek group by 
> ten;{code}






[jira] [Created] (SPARK-30276) Support Filter expression allow simultaneous use of DISTINCT

2019-12-16 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30276:
--

 Summary: Support Filter expression allow simultaneous use of 
DISTINCT
 Key: SPARK-30276
 URL: https://issues.apache.org/jira/browse/SPARK-30276
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: jiaan.geng


SPARK-27986 only supports the FILTER clause without DISTINCT.

We need to support the FILTER clause used together with DISTINCT.

PostgreSQL supports this:
{code:java}
select ten, sum(distinct four) filter (where four > 10) from onek group by 
ten;{code}
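
To make the gap concrete, a sketch in spark-shell terms (the onek/ten/four names 
come from the PostgreSQL example above):

{code:scala}
// Supported since SPARK-27986: FILTER on a non-DISTINCT aggregate.
sql("SELECT ten, sum(four) FILTER (WHERE four > 10) FROM onek GROUP BY ten")

// What this ticket asks for: FILTER combined with DISTINCT, as PostgreSQL allows.
sql("SELECT ten, sum(DISTINCT four) FILTER (WHERE four > 10) FROM onek GROUP BY ten")
{code}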






[jira] [Commented] (SPARK-30276) Support Filter expression allow simultaneous use of DISTINCT

2019-12-16 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997785#comment-16997785
 ] 

jiaan.geng commented on SPARK-30276:


I'm working on this.

> Support Filter expression allow simultaneous use of DISTINCT
> 
>
> Key: SPARK-30276
> URL: https://issues.apache.org/jira/browse/SPARK-30276
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> SPARK-27986 only supports the FILTER clause without DISTINCT.
> We need to support the FILTER clause used together with DISTINCT.
> PostgreSQL supports this:
> {code:java}
> select ten, sum(distinct four) filter (where four > 10) from onek group by 
> ten;{code}






[jira] [Created] (SPARK-30275) Add gitlab-ci.yml file for reproducible builds

2019-12-16 Thread Jim Kleckner (Jira)
Jim Kleckner created SPARK-30275:


 Summary: Add gitlab-ci.yml file for reproducible builds
 Key: SPARK-30275
 URL: https://issues.apache.org/jira/browse/SPARK-30275
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.4, 3.0.0
Reporter: Jim Kleckner


It would be desirable to have public reproducible builds, such as those provided 
by GitLab CI or similar services.
 
Here is a candidate patch set to build spark using gitlab-ci:

* https://gitlab.com/jkleckner/spark/tree/add-gitlab-ci-yml


Let me know if there is interest in a PR.






[jira] [Resolved] (SPARK-30233) Spark WebUI task table indentation issue

2019-12-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30233.
--
Resolution: Duplicate

> Spark WebUI task table indentation  issue
> -
>
> Key: SPARK-30233
> URL: https://issues.apache.org/jira/browse/SPARK-30233
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.4
>Reporter: jobit mathew
>Priority: Minor
> Attachments: sparkopensourceissue.PNG
>
>
> !sparkopensourceissue.PNG!






[jira] [Commented] (SPARK-30239) Creating a dataframe with Pandas rather than Numpy datatypes fails

2019-12-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997782#comment-16997782
 ] 

Hyukjin Kwon commented on SPARK-30239:
--

Can you show the self-contained reproducer?

> Creating a dataframe with Pandas rather than Numpy datatypes fails
> --
>
> Key: SPARK-30239
> URL: https://issues.apache.org/jira/browse/SPARK-30239
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: DataBricks: 48.00 GB | 24 Cores | DBR 6.0 | Spark 2.4.3 
> | Scala 2.11
>Reporter: Philip Kahn
>Priority: Minor
>
> It's possible to work with DataFrames in Pandas and shuffle them back over to 
> Spark DataFrames for processing; however, using Pandas extension datatypes 
> like {{Int64}} ( 
> [https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html] ) 
> throws an error (that long / float can't be converted).
> This is internally because {{np.nan}} is a float, and {{pd.Int64Dtype()}} 
> allows only integers except for the single float value {{np.nan}}.
>  
> The current workaround for this is to use the columns as floats, and after 
> conversion to the Spark DataFrame, to recast the column as {{LongType()}}. 
> For example:
>  
> {{sdfC = spark.createDataFrame(kgridCLinked)}}
> {{sdfC = sdfC.withColumn("gridID", sdfC["gridID"].cast(LongType()))}}
>  
> However, this is awkward and redundant.






[jira] [Commented] (SPARK-30242) Support reading Parquet files from Stream Buffer

2019-12-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997779#comment-16997779
 ] 

Hyukjin Kwon commented on SPARK-30242:
--

Nope, I don't think that will be possible, as it would require changing too many 
APIs (e.g., ORC, CSV, JSON, Text), but it can easily be worked around by writing 
the bytes out to a local directory and reading them back.

> Support reading Parquet files from Stream Buffer
> 
>
> Key: SPARK-30242
> URL: https://issues.apache.org/jira/browse/SPARK-30242
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Jelther Oliveira Gonçalves
>Priority: Trivial
>
> Reading from a Python BufferIO a parquet is not possible using Pyspark.
> Using:
>  
> {code:java}
> from io import BytesIO
> parquetbytes : Bytes = b'PAR...'
> df = spark.read.format("parquet").load(BytesIO(parquetbytes))
> {code}
> Raises :
> {code:java}
> java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
> java.lang.String{code}
>  
> Is there any chance this will be available in the future?
>  






[jira] [Resolved] (SPARK-30242) Support reading Parquet files from Stream Buffer

2019-12-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30242.
--
Resolution: Won't Fix

> Support reading Parquet files from Stream Buffer
> 
>
> Key: SPARK-30242
> URL: https://issues.apache.org/jira/browse/SPARK-30242
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Jelther Oliveira Gonçalves
>Priority: Trivial
>
> Reading from a Python BufferIO a parquet is not possible using Pyspark.
> Using:
>  
> {code:java}
> from io import BytesIO
> parquetbytes : Bytes = b'PAR...'
> df = spark.read.format("parquet").load(BytesIO(parquetbytes))
> {code}
> Raises :
> {code:java}
> java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
> java.lang.String{code}
>  
> Is there any chance this will be available in the future?
>  






[jira] [Commented] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed

2019-12-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997774#comment-16997774
 ] 

Hyukjin Kwon commented on SPARK-30249:
--

It seems to be valid in Parquet:

{code}
scala> Seq(1).toDF("a:b").write.parquet("/tmp/foo")

scala> spark.read.parquet("/tmp/foo").show()
+---+
|a:b|
+---+
|  1|
+---+
{code}

> Invalid Column Names in parquet tables should not be allowed
> 
>
> Key: SPARK-30249
> URL: https://issues.apache.org/jira/browse/SPARK-30249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Column names such as `a:b`, `??`, `,,`, `^^`, `++`, etc. are allowed when 
> creating Parquet tables.
> When creating tables with `orc`, however, all such column names are marked as 
> invalid and an AnalysisException is thrown.
> These column names should not be allowed for Parquet tables either.
> This also makes column-name handling inconsistent between Parquet and ORC.






[jira] [Resolved] (SPARK-30249) Invalid Column Names in parquet tables should not be allowed

2019-12-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30249.
--
Resolution: Not A Problem

> Invalid Column Names in parquet tables should not be allowed
> 
>
> Key: SPARK-30249
> URL: https://issues.apache.org/jira/browse/SPARK-30249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Minor
>
> Column names such as `a:b`, `??`, `,,`, `^^`, `++`, etc. are allowed when 
> creating Parquet tables.
> When creating tables with `orc`, however, all such column names are marked as 
> invalid and an AnalysisException is thrown.
> These column names should not be allowed for Parquet tables either.
> This also makes column-name handling inconsistent between Parquet and ORC.






[jira] [Updated] (SPARK-30264) Unexpected behaviour when using persist MEMORY_ONLY in RDD

2019-12-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30264:
-
Affects Version/s: 2.4.4

> Unexpected behaviour when using persist MEMORY_ONLY in RDD
> --
>
> Key: SPARK-30264
> URL: https://issues.apache.org/jira/browse/SPARK-30264
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.4.0, 2.4.4
>Reporter: moshe ohaion
>Priority: Major
> Attachments: GenericMain.java, users8.avro
>
>
> The persist method with MEMORY_ONLY behaves differently than with 
> MEMORY_ONLY_SER.
> persist(StorageLevel.MEMORY_ONLY()).distinct().count() returns 1,
> while persist(StorageLevel.MEMORY_ONLY_SER()).distinct().count() returns 100.
> I expect both to return the same result. The correct result is 100; for some 
> reason MEMORY_ONLY causes all the objects in the RDD to be the same one.






[jira] [Resolved] (SPARK-30270) Can't pickle abstract classes (with cloudpickle)

2019-12-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30270.
--
Resolution: Cannot Reproduce

I can confirm that's fixed and cannot be reproduced in the current master:

{code}
>>> import pickle
>>> from abc import ABC
>>> from pyspark import cloudpickle
>>>
>>>
>>> class Foo(ABC):
... pass
...
>>> class Bar(Foo):
... pass
...
>>> bar = Bar()
>>>
>>> # pickle dump works fine
... pickle.dumps(bar)
b'\x80\x03c__main__\nBar\nq\x00)\x81q\x01.'
>>> # cloudpickle doesn't
... cloudpickle.dumps(bar)
b'\x80\x04\x95e\x01\x00\x00\x00\x00\x00\x00\x8c\x13pyspark.cloudpickle\x94\x8c\x19_rehydrate_skeleton_class\x94\x93\x94(h\x00\x8c\x14_make_skeleton_class\x94\x93\x94(\x8c\x03abc\x94\x8c\x07ABCMeta\x94\x93\x94\x8c\x03Bar\x94h\x02(h\x04(h\x07\x8c\x03Foo\x94h\x05\x8c\x03ABC\x94\x93\x94\x85\x94}\x94(\x8c\x07__doc__\x94N\x8c\t__slots__\x94)u\x8c
 
ca482aff65274cbbaaf76887e5703bf5\x94Nt\x94R\x94}\x94(\x8c\n__module__\x94\x8c\x08__main__\x94\x8c\x13__abstractmethods__\x94(\x91\x94\x8c\t_abc_impl\x94]\x94utR\x85\x94}\x94(h\x0eNh\x0f)u\x8c
 
58e3cb2c9c6046d19cfe23fa1e6eb6a4\x94Nt\x94R\x94}\x94(h\x16(\x91\x94h\x18]\x94\x8c\r__slotnames__\x94]\x94utR)\x81\x94.'
{code}

> Can't pickle abstract classes (with cloudpickle)
> 
>
> Key: SPARK-30270
> URL: https://issues.apache.org/jira/browse/SPARK-30270
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Sebastian Straub
>Priority: Minor
>  Labels: cloudpickle
>
> I can't use any classes that are derived from abstract classes in PySpark, 
> because cloudpickle can't pickle them.
> Example:
> {code:java}
> import pickle
> from abc import ABC
> from pyspark import cloudpickle
> class Foo(ABC):
> pass
> class Bar(Foo):
> pass
> bar = Bar()
> # pickle dump works fine
> pickle.dumps(bar)
> # cloudpickle doesn't
> cloudpickle.dumps(bar)
> {code}
> A similar bug has already been reported in SPARK-21439 and marked resolved, 
> but I can confirm that the issue still persists.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30274) Avoid BytesToBytesMap lookup hang forever when holding keys reaching max capacity

2019-12-16 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-30274:
---

 Summary: Avoid BytesToBytesMap lookup hang forever when holding 
keys reaching max capacity
 Key: SPARK-30274
 URL: https://issues.apache.org/jira/browse/SPARK-30274
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


BytesToBytesMap.append allows appending keys until the number of keys reaches 
MAX_CAPACITY. But once the pointer array in the map holds MAX_CAPACITY keys, 
the next call to lookup will hang forever.
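For readers less familiar with open-addressing maps, a conceptual sketch in plain Python (not the actual Java implementation; names are illustrative) of why the probe loop can spin forever once every slot is occupied: the loop only exits on a key match or on an empty slot, and a completely full table has no empty slot left for an absent key.

{code:python}
def lookup(table, key):
    # Simplified linear probing over a fixed-size slot array.
    n = len(table)
    pos = hash(key) % n
    while True:
        if table[pos] is None:        # empty slot => key definitely absent
            return None
        if table[pos][0] == key:      # found the key
            return table[pos][1]
        pos = (pos + 1) % n           # probe the next slot (wraps around)

full = [("k%d" % i, i) for i in range(8)]   # every slot occupied, no None anywhere
# lookup(full, "missing")  # never returns: nothing ever breaks the probe loop
{code}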



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30268) Incorrect pyspark package name when releasing preview version

2019-12-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30268.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26909
[https://github.com/apache/spark/pull/26909]

> Incorrect pyspark package name when releasing preview version
> -
>
> Key: SPARK-30268
> URL: https://issues.apache.org/jira/browse/SPARK-30268
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> {noformat}
> cp: cannot stat 
> 'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': 
> No such file or directory
> gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
> gpg: signing failed: No such file or directory
> gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
> {noformat}
> But it is:
> {noformat}
> yumwang@ubuntu-3513086:~/spark-release/output$ ll 
> spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
> total 214140
> drwxr-xr-x 2 yumwang stack  4096 Dec 16 06:17 ./
> drwxr-xr-x 9 yumwang stack  4096 Dec 16 06:17 ../
> -rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
> {noformat}
> {noformat}
> /usr/local/lib/python3.6/dist-packages/setuptools/dist.py:476: UserWarning: 
> Normalizing '3.0.0.dev02' to '3.0.0.dev2'
>   normalized_version,
> {noformat}
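For reference, the 3.0.0.dev02 vs 3.0.0.dev2 mismatch is just PEP 440 version normalization, which setuptools applies when naming the sdist. A minimal illustration (assuming the `packaging` library is installed):

{code:python}
from packaging.version import Version

# PEP 440 drops the leading zero in the dev segment, so the built artifact is
# named pyspark-3.0.0.dev2.tar.gz even though the release script asked for dev02.
print(Version("3.0.0.dev02"))   # -> 3.0.0.dev2
{code}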



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30171) Eliminate warnings: part2

2019-12-16 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997740#comment-16997740
 ] 

Sean R. Owen commented on SPARK-30171:
--

Is this a dupe of SPARK-30258?

> Eliminate warnings: part2
> -
>
> Key: SPARK-30171
> URL: https://issues.apache.org/jira/browse/SPARK-30171
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> AvroFunctionsSuite.scala
> Warning:Warning:line (41)method to_avro in package avro is deprecated (since 
> 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (41)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (59)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (70)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df)
> Warning:Warning:line (76)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (118)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val readBackOne = dfOne.select(to_avro($"array").as("avro"))
> Warning:Warning:line (119)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
>   .select(from_avro($"avro", avroTypeArrStruct).as("array"))
> AvroPartitionReaderFactory.scala
> Warning:Warning:line (64)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> if (parsedOptions.ignoreExtension || 
> partitionedFile.filePath.endsWith(".avro")) {
> AvroFileFormat.scala
> Warning:Warning:line (98)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
>   if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) {
> AvroUtils.scala
> Warning:Warning:line (55)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30258) Eliminate warnings of deprecated Spark APIs in tests

2019-12-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30258.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26885
[https://github.com/apache/spark/pull/26885]

> Eliminate warnings of deprecated Spark APIs in tests
> 
>
> Key: SPARK-30258
> URL: https://issues.apache.org/jira/browse/SPARK-30258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Suppress deprecation warnings in tests that check deprecated Spark APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30258) Eliminate warnings of deprecated Spark APIs in tests

2019-12-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30258:


Assignee: Maxim Gekk

> Eliminate warnings of deprecated Spark APIs in tests
> 
>
> Key: SPARK-30258
> URL: https://issues.apache.org/jira/browse/SPARK-30258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Suppress deprecation warnings in tests that check deprecated Spark APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30247) GaussianMixtureModel in py side should expose gaussian

2019-12-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30247:


Assignee: Huaxin Gao

> GaussianMixtureModel in py side should expose gaussian
> --
>
> Key: SPARK-30247
> URL: https://issues.apache.org/jira/browse/SPARK-30247
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
>
> A GaussianMixtureModel contains two sets of coefficients: weights & 
> gaussians.
> However, the gaussians are not exposed on the Python side.
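For context, a small local-mode PySpark sketch (illustrative data only, not part of this ticket) of what is reachable from Python today: `weights` and the DataFrame view `gaussiansDF` are exposed, while the per-component gaussians this ticket asks for are what the Scala model exposes as `gaussians`.

{code:python}
from pyspark.ml.clustering import GaussianMixture
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("gmm-demo").getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.1]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.1, 9.1]),)],
    ["features"])

model = GaussianMixture(k=2, seed=42).fit(df)
print(model.weights)        # mixing weights, already available in Python
model.gaussiansDF.show()    # per-component mean and covariance as a DataFrame
spark.stop()
{code}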



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30247) GaussianMixtureModel in py side should expose gaussian

2019-12-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30247.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26882
[https://github.com/apache/spark/pull/26882]

> GaussianMixtureModel in py side should expose gaussian
> --
>
> Key: SPARK-30247
> URL: https://issues.apache.org/jira/browse/SPARK-30247
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> A GaussianMixtureModel contains two sets of coefficients: weights & 
> gaussians.
> However, the gaussians are not exposed on the Python side.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2019-12-16 Thread Kevin Grealish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997724#comment-16997724
 ] 

Kevin Grealish commented on SPARK-23015:


Here is something that may help craft a complete solution in the Spark 
scripts. It uses VBScript to create a GUID and assign it to an environment 
variable, and it depends on cscript, which has been part of Windows since 
Windows 95. Change the two %%i to just %i to run it outside a batch program. 
Instead of writing a temp .vbs file at runtime, ship the .vbs alongside the 
script that currently uses %RANDOM%.

echo WScript.StdOut.WriteLine Mid(CreateObject("Scriptlet.TypeLib").GUID, 2, 
36) > %TEMP%\uuid.vbs
for /f %%i in ('cscript //NoLogo %TEMP%\uuid.vbs') do @set UUID=%%i
echo made a UUID: %UUID%

As written, this code can still collide on writing uuid.vbs itself, so instead 
a dedicated .vbs file (say makeuuid.vbs) should be added to the scripts.

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> {quote}The process cannot access the file because it is being used by another 
> process. The system cannot find the file
> USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt.
> The process cannot access the file because it is being used by another 
> process.{quote}
> My hypothesis is that %RANDOM% is returning the same value for multiple jobs, 
> causing the launcher library to attempt to write to the same file from 
> multiple processes. Another mechanism is needed for reliably generating the 
> names of the temporary files so that the concurrency issue is resolved.
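For a rough sense of scale, a small Python sketch (my own illustration, assuming %RANDOM% were ideally uniform over 0..32767; in practice the per-second seeding discussed in the comments makes collisions even more likely) of the birthday-problem odds that N simultaneous launches pick the same temporary file name:

{code:python}
def collision_probability(n_jobs, space=32768):
    # Probability that at least two of n_jobs launches draw the same value
    # from a space of `space` equally likely names.
    p_no_collision = 1.0
    for i in range(n_jobs):
        p_no_collision *= (space - i) / space
    return 1.0 - p_no_collision

for n in (2, 10, 50, 200):
    print(n, round(collision_probability(n), 4))
# Around 200 parallel launches already have roughly a 45% chance of a clash.
{code}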



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997698#comment-16997698
 ] 

Marcelo Masiero Vanzin commented on SPARK-25392:


The fix basically hides pool details from the history server; actually showing 
pool info is a more involved change, and if that's wanted a new bug should be 
filed. (I know there was a PR for it, but, well, that requires more committer 
time for reviewing too...)

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on the incomplete JDBC application ID in the Job History page
> 4. Go to the Jobs tab in the application's web UI page
> 5. Click on "run at AccessController.java:0" under the Description column
> 6. Click "default" under the Pool Name column of the Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws the error below:
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But on the Yarn resource page it displays the summary under Fair Scheduler 
> Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-25392:
--

Assignee: shahid

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: shahid
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on the incomplete JDBC application ID in the Job History page
> 4. Go to the Jobs tab in the application's web UI page
> 5. Click on "run at AccessController.java:0" under the Description column
> 6. Click "default" under the Pool Name column of the Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws the error below:
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But on the Yarn resource page it displays the summary under Fair Scheduler 
> Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25392) [Spark Job History]Inconsistent behaviour for pool details in spark web UI and history server page

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-25392.

Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26616
[https://github.com/apache/spark/pull/26616]

> [Spark Job History]Inconsistent behaviour for pool details in spark web UI 
> and history server page 
> ---
>
> Key: SPARK-25392
> URL: https://issues.apache.org/jira/browse/SPARK-25392
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: OS: SUSE 11
> Spark Version: 2.3
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> Steps:
> 1.Enable spark.scheduler.mode = FAIR
> 2.Submitted beeline jobs
> create database JH;
> use JH;
> create table one12( id int );
> insert into one12 values(12);
> insert into one12 values(13);
> Select * from one12;
> 3. Click on the incomplete JDBC application ID in the Job History page
> 4. Go to the Jobs tab in the application's web UI page
> 5. Click on "run at AccessController.java:0" under the Description column
> 6. Click "default" under the Pool Name column of the Completed Stages table
> URL:http://blr123109:23020/history/application_1536399199015_0006/stages/pool/?poolname=default
> 7. It throws the error below:
> HTTP ERROR 400
> Problem accessing /history/application_1536399199015_0006/stages/pool/. 
> Reason:
> Unknown pool: default
> Powered by Jetty:// x.y.z
> But on the Yarn resource page it displays the summary under Fair Scheduler 
> Pool: default 
> URL:https://blr123110:64323/proxy/application_1536399199015_0006/stages/pool?poolname=default
> Summary
> Pool Name Minimum Share   Pool Weight Active Stages   Running Tasks   
> SchedulingMode
> default   0   1   0   0   FIFO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-29043:
--

Assignee: feiwang

> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Assignee: feiwang
>Priority: Major
> Attachments: image-2019-09-11-15-09-22-912.png, 
> image-2019-09-11-15-10-25-326.png, screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> the Spark history server.
> However, only one replay thread does any work because of a straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation for all replay tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29043.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25797
[https://github.com/apache/spark/pull/25797]

> [History Server]Only one replay thread of FsHistoryProvider work because of 
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: feiwang
>Assignee: feiwang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-09-11-15-09-22-912.png, 
> image-2019-09-11-15-10-25-326.png, screenshot-1.png
>
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for 
> the Spark history server.
> However, only one replay thread does any work because of a straggler.
> Let's check the code.
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation for all replay tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2019-12-16 Thread Kevin Grealish (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997669#comment-16997669
 ] 

Kevin Grealish commented on SPARK-23015:


%TIME% has a granularity of 10ms, so while this does reduce the probability of 
a collision, it does not remove the problem. Neither does using multiple 
%RANDOM%s, since %RANDOM% is a pseudo-random number generator. See 
https://devblogs.microsoft.com/oldnewthing/20100617-00/?p=13673 "Why cmd.exe's 
%RANDOM% isn't so random". Once the seed is set from the current time at a 
granularity of one second, the sequence of numbers coming from %RANDOM% is 
fixed, so if using %RANDOM% once will cause a collision, then so will 
%RANDOM%%RANDOM%%RANDOM%%RANDOM%...

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> {quote}The process cannot access the file because it is being used by another 
> process. The system cannot find the file
> USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt.
> The process cannot access the file because it is being used by another 
> process.{quote}
> My hypothesis is that %RANDOM% is returning the same value for multiple jobs, 
> causing the launcher library to attempt to write to the same file from 
> multiple processes. Another mechanism is needed for reliably generating the 
> names of the temporary files so that the concurrency issue is resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30209) Display stageId, attemptId, taskId with SQL max metric in UI

2019-12-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-30209:
-

Assignee: Niranjan Artal

> Display stageId, attemptId, taskId with SQL max metric in UI
> 
>
> Key: SPARK-30209
> URL: https://issues.apache.org/jira/browse/SPARK-30209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Niranjan Artal
>Assignee: Niranjan Artal
>Priority: Major
> Fix For: 3.0.0
>
>
> It would be helpful if we could add stageId, stage attemptId and taskId in 
> the SQL UI for each of the max metric values. These additional metrics help 
> in debugging jobs more quickly. For a given operator, it will be easy to 
> identify from the Spark UI the task that takes the longest to complete.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30209) Display stageId, attemptId, taskId with SQL max metric in UI

2019-12-16 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-30209.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Display stageId, attemptId, taskId with SQL max metric in UI
> 
>
> Key: SPARK-30209
> URL: https://issues.apache.org/jira/browse/SPARK-30209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Niranjan Artal
>Priority: Major
> Fix For: 3.0.0
>
>
> It would be helpful if we could add stageId, stage attemptId and taskId in 
> the SQL UI for each of the max metric values. These additional metrics help 
> in debugging jobs more quickly. For a given operator, it will be easy to 
> identify from the Spark UI the task that takes the longest to complete.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-12-16 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997641#comment-16997641
 ] 

Shane Knapp commented on SPARK-29106:
-

huh...  i created a new python 3.6 env, ran the python test and saw some 
strange failures (skipping pandas tests, etc)...  while everything seemed to 
install properly, when i went in to the interpreter and tried to import stuff 
it failed:

 
{noformat}
(py36) jenkins@spark-jenkins-arm-worker:~/python-envs$ python3
Python 3.6.9 (default, Nov  7 2019, 10:44:02)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/jenkins/python-envs/py36/lib/python3.6/site-packages/pandas/__init__.py",
 line 19, in 
"Missing required dependencies {0}".format(missing_dependencies))
ImportError: Missing required dependencies ['numpy']
>>> import numpy
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/jenkins/python-envs/py36/lib/python3.6/site-packages/numpy/__init__.py", 
line 150, in 
from . import random
  File 
"/home/jenkins/python-envs/py36/lib/python3.6/site-packages/numpy/random/__init__.py",
 line 143, in 
from .mtrand import *
ImportError: 
/home/jenkins/python-envs/py36/lib/python3.6/site-packages/numpy/random/mtrand.cpython-36m-aarch64-linux-gnu.so:
 undefined symbol: PyFPE_jbuf
>>>{noformat}
i'll poke around some more later today.

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, 
> SparkR-and-pyspark36-testing.txt, arm-python36.txt
>
>
> Add arm test jobs to amplab jenkins for spark.
> So far we have made two periodic arm test jobs for spark in OpenLab: one is 
> based on master with hadoop 2.7 (similar to the QA test on amplab jenkins), 
> the other is based on a new branch which we made on 09-09, see  
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>   and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the arm test with 
> amplab jenkins.
> About the k8s test on arm, we have tested it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later.
> We also plan to test other stable branches, and we can integrate them into 
> amplab when they are ready.
> We have offered an arm instance and sent the info to shane knapp; thanks 
> shane for adding the first arm job to amplab jenkins :) 
> The other important thing is about the leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  we can see there is no arm64 support. So we built an arm64-supporting 
> release of leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the spark pom.xml directly with something like 
> 'property'/'profile' to choose the correct jar package on arm or x86, because 
> spark depends on some hadoop packages like hadoop-hdfs, and those packages 
> depend on leveldbjni-all-1.8 too, unless hadoop releases with a new 
> arm-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 
> from openlabtesting and 'mvn install' it when testing spark on arm.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6235) Address various 2G limits

2019-12-16 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997631#comment-16997631
 ] 

Imran Rashid commented on SPARK-6235:
-

[~sammysheep] are you discussing the use case for task results > 2G?  Or large 
records?  Or did you mean one of the parts that was supposed to be fixed in the 
plan above?

I don't deny there is _some_ use for large task result -- I just haven't heard 
much demand for it (in fact you're the first person I've heard from).  Given 
that, I don't expect to see it fixed immediately.  You could open another jira, 
though honestly for the moment I think it would be more of a place for folks to 
voice their interest.

(I'm pretty sure nothing has changed since 2.4.0 on what is fixed and what is 
not.)

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30273) Add melt() function

2019-12-16 Thread Shelby Vanhooser (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shelby Vanhooser updated SPARK-30273:
-
Description: 
- Adds melt() functionality based on 
[this|https://stackoverflow.com/a/41673644/12474509] implementation

 

[https://github.com/apache/spark/pull/26912/files]

  was:- Adds melt() functionality based on 
[this|https://stackoverflow.com/a/41673644/12474509] implementation


> Add melt() function
> ---
>
> Key: SPARK-30273
> URL: https://issues.apache.org/jira/browse/SPARK-30273
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Shelby Vanhooser
>Priority: Major
>  Labels: PySpark, feature
>
> - Adds melt() functionality based on 
> [this|https://stackoverflow.com/a/41673644/12474509] implementation
>  
> [https://github.com/apache/spark/pull/26912/files]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30273) Add melt() function

2019-12-16 Thread Shelby Vanhooser (Jira)
Shelby Vanhooser created SPARK-30273:


 Summary: Add melt() function
 Key: SPARK-30273
 URL: https://issues.apache.org/jira/browse/SPARK-30273
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Shelby Vanhooser


- Adds melt() functionality based on 
[this|https://stackoverflow.com/a/41673644/12474509] implementation
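For reference, a minimal PySpark sketch of the explode-over-structs approach that the linked answer describes (the function name and signature here are illustrative, not the proposed API):

{code:python}
from pyspark.sql import functions as F

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # Build one struct per value column, explode them into rows, then flatten.
    pairs = F.explode(F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])).alias("_pair")
    return (df.select(*id_vars, pairs)
              .select(*id_vars,
                      F.col("_pair." + var_name),
                      F.col("_pair." + value_name)))

# Example: melt(df, id_vars=["id"], value_vars=["x", "y"]) turns columns x and y
# into (variable, value) rows keyed by id.
{code}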



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30273) Add melt() function

2019-12-16 Thread Shelby Vanhooser (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shelby Vanhooser updated SPARK-30273:
-
Labels: PySpark feature  (was: )

> Add melt() function
> ---
>
> Key: SPARK-30273
> URL: https://issues.apache.org/jira/browse/SPARK-30273
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Shelby Vanhooser
>Priority: Major
>  Labels: PySpark, feature
>
> - Adds melt() functionality based on 
> [this|https://stackoverflow.com/a/41673644/12474509] implementation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2019-12-16 Thread Evgenii (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997604#comment-16997604
 ] 

Evgenii edited comment on SPARK-23015 at 12/16/19 8:01 PM:
---

Here is a working solution:

set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt

The ::=0 substitution removes the ':' characters from the timestamp.

I've checked and tested this solution. No need for crutches in the code.


was (Author: lartcev):
Here is working solution:

set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt

instruction ::=0 is about to remove char ':' from timestamp

I've checked and tested this solution. No need to make a crutches in code.

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> {quote}The process cannot access the file because it is being used by another 
> process. The system cannot find the file
> USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt.
> The process cannot access the file because it is being used by another 
> process.{quote}
> My hypothesis is that %RANDOM% is returning the same value for multiple jobs, 
> causing the launcher library to attempt to write to the same file from 
> multiple processes. Another mechanism is needed for reliably generating the 
> names of the temporary files so that the concurrency issue is resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2019-12-16 Thread Evgenii (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997604#comment-16997604
 ] 

Evgenii edited comment on SPARK-23015 at 12/16/19 8:00 PM:
---

Here is a working solution:

set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt

The ::=0 substitution removes the ':' characters from the timestamp.

I've checked and tested this solution. No need for crutches in the code.


was (Author: lartcev):
Here is working solution:

set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt

instruction ::=0 is about to remove char ':' from timestamp

I checked that.

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> {quote}The process cannot access the file because it is being used by another 
> process. The system cannot find the file
> USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt.
> The process cannot access the file because it is being used by another 
> process.{quote}
> My hypothesis is that %RANDOM% is returning the same value for multiple jobs, 
> causing the launcher library to attempt to write to the same file from 
> multiple processes. Another mechanism is needed for reliably generating the 
> names of the temporary files so that the concurrency issue is resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23015) spark-submit fails when submitting several jobs in parallel

2019-12-16 Thread Evgenii (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997604#comment-16997604
 ] 

Evgenii commented on SPARK-23015:
-

Here is a working solution:

set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%%TIME::=0%.txt

The ::=0 substitution removes the ':' characters from the timestamp.

I checked that.

> spark-submit fails when submitting several jobs in parallel
> ---
>
> Key: SPARK-23015
> URL: https://issues.apache.org/jira/browse/SPARK-23015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
> Environment: Windows 10 (1709/16299.125)
> Spark 2.3.0
> Java 8, Update 151
>Reporter: Hugh Zabriskie
>Priority: Major
>
> Spark Submit's launching library prints the command to execute the launcher 
> (org.apache.spark.launcher.main) to a temporary text file, reads the result 
> back into a variable, and then executes that command.
> {code}
> set LAUNCHER_OUTPUT=%temp%\spark-class-launcher-output-%RANDOM%.txt
> "%RUNNER%" -Xmx128m -cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main 
> %* > %LAUNCHER_OUTPUT%
> {code}
> [bin/spark-class2.cmd, 
> L67|https://github.com/apache/spark/blob/master/bin/spark-class2.cmd#L66]
> That temporary text file is given a pseudo-random name by the %RANDOM% env 
> variable generator, which generates a number between 0 and 32767.
> This appears to be the cause of an error occurring when several spark-submit 
> jobs are launched simultaneously. The following error is returned from stderr:
> {quote}The process cannot access the file because it is being used by another 
> process. The system cannot find the file
> USER/AppData/Local/Temp/spark-class-launcher-output-RANDOM.txt.
> The process cannot access the file because it is being used by another 
> process.{quote}
> My hypothesis is that %RANDOM% is returning the same value for multiple jobs, 
> causing the launcher library to attempt to write to the same file from 
> multiple processes. Another mechanism is needed for reliably generating the 
> names of the temporary files so that the concurrency issue is resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30272) Remove usage of Guava that breaks in Guava 27

2019-12-16 Thread Sean R. Owen (Jira)
Sean R. Owen created SPARK-30272:


 Summary: Remove usage of Guava that breaks in Guava 27
 Key: SPARK-30272
 URL: https://issues.apache.org/jira/browse/SPARK-30272
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Sean R. Owen
Assignee: Sean R. Owen


Background:
https://issues.apache.org/jira/browse/SPARK-29250
https://github.com/apache/spark/pull/25932

Hadoop 3.2.1 will update Guava from 11 to 27. A number of methods changed 
between those releases, typically just a rename, but it means one set of code 
can't work with both, while we want to work with both Hadoop 2.x and 3.x. 
Among them:

- Objects.toStringHelper was moved to MoreObjects; we can just use the Commons 
Lang3 equivalent
- Objects.hashCode etc were renamed; use java.util.Objects equivalents
- MoreExecutors.sameThreadExecutor() became directExecutor(); for same-thread 
execution we can use a dummy implementation of ExecutorService / Executor
- TypeToken.isAssignableFrom become isSupertypeOf; work around with reflection

There is probably more to the Guava issue than just this change, but it will 
make Spark itself work with more versions and reduce our exposure to Guava 
along the way anyway.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-29574:
--

Assignee: Shahin Shakeri

> spark with user provided hadoop doesn't work on kubernetes
> --
>
> Key: SPARK-29574
> URL: https://issues.apache.org/jira/browse/SPARK-29574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4
>Reporter: Michał Wesołowski
>Assignee: Shahin Shakeri
>Priority: Major
> Fix For: 3.0.0
>
>
> When spark-submit is run with an image built from "hadoop free" Spark and 
> user-provided Hadoop, it fails on Kubernetes (the Hadoop libraries are not on 
> Spark's classpath). 
> I downloaded spark [Pre-built with user-provided Apache 
> Hadoop|https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz].
>  
> I created docker image with usage of 
> [docker-image-tool.sh|[https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]].
>  
>  
> Based on this image (2.4.4-without-hadoop)
> I created another one with Dockerfile
> {code:java}
> FROM spark-py:2.4.4-without-hadoop
> ENV SPARK_HOME=/opt/spark/
> # This is needed for newer kubernetes versions
> ADD 
> https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar
>  $SPARK_HOME/jars
> COPY spark-env.sh /opt/spark/conf/spark-env.sh
> RUN chmod +x /opt/spark/conf/spark-env.sh
> RUN wget -qO- 
> https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz 
> | tar xz  -C /opt/
> ENV HADOOP_HOME=/opt/hadoop-3.2.1
> ENV PATH=${HADOOP_HOME}/bin:${PATH}
> {code}
> Contents of spark-env.sh:
> {code:java}
> #!/usr/bin/env bash
> export SPARK_DIST_CLASSPATH=$(hadoop 
> classpath):$HADOOP_HOME/share/hadoop/tools/lib/*
> export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
> {code}
> spark-submit run with an image created this way fails since spark-env.sh is 
> overwritten by [volume created when pod 
> starts|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L108]
> As a quick workaround I tried to modify [entrypoint 
> script|https://github.com/apache/spark/blob/ea8b5df47476fe66b63bd7f7bcd15acfb80bde78/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh]
>  to run spark-env.sh during startup and moving spark-env.sh to a different 
> directory. 
>  The driver starts without issues in this setup; however, even though 
> SPARK_DIST_CLASSPATH is set, the executor is constantly failing:
> {code:java}
> PS 
> C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda>
>  kubectl logs rda-script-1571835692837-exec-12
> ++ id -u
> + myuid=0
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 0
> + uidentry=root:x:0:0:root:/root:/bin/ash
> + set -e
> + '[' -z root:x:0:0:root:/root:/bin/ash ']'
> + source /opt/spark-env.sh
> +++ hadoop classpath
> ++ export 
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoo++
>  
> SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> ++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ echo 
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> 

[jira] [Resolved] (SPARK-29574) spark with user provided hadoop doesn't work on kubernetes

2019-12-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29574.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26493
[https://github.com/apache/spark/pull/26493]

> spark with user provided hadoop doesn't work on kubernetes
> --
>
> Key: SPARK-29574
> URL: https://issues.apache.org/jira/browse/SPARK-29574
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4
>Reporter: Michał Wesołowski
>Priority: Major
> Fix For: 3.0.0
>
>
> When spark-submit is run with an image built from "hadoop free" Spark and 
> user-provided Hadoop, it fails on Kubernetes (the Hadoop libraries are not on 
> Spark's classpath). 
> I downloaded spark [Pre-built with user-provided Apache 
> Hadoop|https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-without-hadoop.tgz].
>  
> I created docker image with usage of 
> [docker-image-tool.sh|[https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]].
>  
>  
> Based on this image (2.4.4-without-hadoop)
> I created another one with Dockerfile
> {code:java}
> FROM spark-py:2.4.4-without-hadoop
> ENV SPARK_HOME=/opt/spark/
> # This is needed for newer kubernetes versions
> ADD 
> https://repo1.maven.org/maven2/io/fabric8/kubernetes-client/4.6.1/kubernetes-client-4.6.1.jar
>  $SPARK_HOME/jars
> COPY spark-env.sh /opt/spark/conf/spark-env.sh
> RUN chmod +x /opt/spark/conf/spark-env.sh
> RUN wget -qO- 
> https://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz 
> | tar xz  -C /opt/
> ENV HADOOP_HOME=/opt/hadoop-3.2.1
> ENV PATH=${HADOOP_HOME}/bin:${PATH}
> {code}
> Contents of spark-env.sh:
> {code:java}
> #!/usr/bin/env bash
> export SPARK_DIST_CLASSPATH=$(hadoop 
> classpath):$HADOOP_HOME/share/hadoop/tools/lib/*
> export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native
> {code}
> spark-submit run with an image created this way fails since spark-env.sh is 
> overwritten by [volume created when pod 
> starts|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/KubernetesClientApplication.scala#L108]
> As a quick workaround I tried to modify [entrypoint 
> script|https://github.com/apache/spark/blob/ea8b5df47476fe66b63bd7f7bcd15acfb80bde78/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh]
>  to run spark-env.sh during startup and moving spark-env.sh to a different 
> directory. 
>  The driver starts without issues in this setup; however, even though 
> SPARK_DIST_CLASSPATH is set, the executor is constantly failing:
> {code:java}
> PS 
> C:\Sandbox\projekty\roboticdrive-analytics\components\docker-images\spark-rda>
>  kubectl logs rda-script-1571835692837-exec-12
> ++ id -u
> + myuid=0
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 0
> + uidentry=root:x:0:0:root:/root:/bin/ash
> + set -e
> + '[' -z root:x:0:0:root:/root:/bin/ash ']'
> + source /opt/spark-env.sh
> +++ hadoop classpath
> ++ export 
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoo++
>  
> SPARK_DIST_CLASSPATH='/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> ++ export LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ LD_LIBRARY_PATH=/opt/hadoop-3.2.1/lib/native
> ++ echo 
> 'SPARK_DIST_CLASSPATH=/opt/hadoop-3.2.1/etc/hadoop:/opt/hadoop-3.2.1/share/hadoop/common/lib/*:/opt/hadoop-3.2.1/share/hadoop/common/*:/opt/hadoop-3.2.1/share/hadoop/hdfs:/opt/hadoop-3.2.1/share/hadoop/hdfs/lib/*:/opt/hadoop-3.2.1/share/hadoop/hdfs/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/lib/*:/opt/hadoop-3.2.1/share/hadoop/mapreduce/*:/opt/hadoop-3.2.1/share/hadoop/yarn:/opt/hadoop-3.2.1/share/hadoop/yarn/lib/*:/opt/hadoop-3.2.1/share/hadoop/yarn/*:/opt/hadoop-3.2.1/share/hadoop/tools/lib/*'
> 

[jira] [Comment Edited] (SPARK-6235) Address various 2G limits

2019-12-16 Thread Samuel Shepard (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995889#comment-16995889
 ] 

Samuel Shepard edited comment on SPARK-6235 at 12/16/19 4:39 PM:
-

[~tgraves] , [~irashid] 
 
 One use case could be fetching large results to the driver when computing PCA 
on large square matrices (e.g., distance matrices, similar to classical MDS). 
This is very helpful in bioinformatics. Sorry if this is already fixed after 
2.4.0...
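
For concreteness, a minimal Scala sketch of the kind of job meant above, assuming 
the matrix is loaded with spark.mllib's RowMatrix (the input path and the value 
of k are illustrative, not from the original report):
{code:scala}
// Hedged sketch: PCA over a large square (distance) matrix with spark.mllib.
// "hdfs:///distances.csv" and k = 10 are hypothetical values for illustration.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.textFile("hdfs:///distances.csv")
  .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
val mat = new RowMatrix(rows)

// computePrincipalComponents aggregates a Gramian and returns a *local* matrix
// on the driver, which is where very wide inputs run into the kind of 2G-style
// limits tracked by this ticket.
val pcs = mat.computePrincipalComponents(10)
val projected = mat.multiply(pcs)
{code}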

 


was (Author: sammysheep):
One use case could be fetching large results to the driver when computing PCA 
on large square matrices (e.g., distance matrices, similar to Classical MDS). 
This is very helpful in bioinformatics. Sorry if this already fixed past 
2.4.0...

> Address various 2G limits
> -
>
> Key: SPARK-6235
> URL: https://issues.apache.org/jira/browse/SPARK-6235
> Project: Spark
>  Issue Type: Umbrella
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Priority: Major
> Fix For: 2.4.0
>
> Attachments: SPARK-6235_Design_V0.02.pdf
>
>
> An umbrella ticket to track the various 2G limit we have in Spark, due to the 
> use of byte arrays and ByteBuffers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30264) Unexpected behaviour when using persist MEMORY_ONLY in RDD

2019-12-16 Thread moshe ohaion (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997396#comment-16997396
 ] 

moshe ohaion commented on SPARK-30264:
--

Steps to reproduce:
 # File users8.avro was created by GenericMain.java.
 # Run the following Spark job:
{code:java}
public static void main(String[] args) throws IOException {
    SparkConf sparkConf = new SparkConf()
        .setAppName("Test cache");

    sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
    sparkConf.set("spark.kryo.registrator", SparkKryoRegistrator.class.getName());

    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    JavaPairRDD records = sc.newAPIHadoopFile("/*.avro",
        AvroKeyInputFormat.class, AvroKey.class, NullWritable.class,
        sc.hadoopConfiguration());
    JavaRDD genericRecordJavaRDD = records.keys().map(x -> ((GenericRecord) x.datum()));
    JavaRDD cache = genericRecordJavaRDD.persist(StorageLevel.MEMORY_ONLY_SER());
    long count = cache.map(genericRecord -> genericRecord.get("username")).distinct().count();

    System.out.println(count);
}
{code}
 # The printed count will be 5, as it should be.
 # Replace *MEMORY_ONLY_SER* with *MEMORY_ONLY* and run the job again.
 # The printed count will be 1.

 

If you also add cache.saveAsTextFile() you will see that when running with 
*MEMORY_ONLY* you get the same user 5 times.

 

I tried on 2.4.0, 2.4.4 and 3.0.0 preview.

 

 

[^GenericMain.java] . [^users8.avro]

> Unexpected behaviour when using persist MEMORY_ONLY in RDD
> --
>
> Key: SPARK-30264
> URL: https://issues.apache.org/jira/browse/SPARK-30264
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: moshe ohaion
>Priority: Major
> Attachments: GenericMain.java, users8.avro
>
>
> The persist method with MEMORY_ONLY behaves differently than with 
> MEMORY_ONLY_SER.
> persist(StorageLevel.MEMORY_ONLY()).distinct().count() returns 1,
> while persist(StorageLevel.MEMORY_ONLY_SER()).distinct().count() returns 100.
> I expect both to return the same results. The right result is 100; for some 
> reason MEMORY_ONLY causes all the objects in the RDD to be the same one. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30264) Unexpected behaviour when using persist MEMORY_ONLY in RDD

2019-12-16 Thread moshe ohaion (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

moshe ohaion updated SPARK-30264:
-
Attachment: GenericMain.java

> Unexpected behaviour when using persist MEMORY_ONLY in RDD
> --
>
> Key: SPARK-30264
> URL: https://issues.apache.org/jira/browse/SPARK-30264
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: moshe ohaion
>Priority: Major
> Attachments: GenericMain.java, users8.avro
>
>
> The persist method with MEMORY_ONLY behaves differently than with 
> MEMORY_ONLY_SER.
> persist(StorageLevel.MEMORY_ONLY()).distinct().count() returns 1,
> while persist(StorageLevel.MEMORY_ONLY_SER()).distinct().count() returns 100.
> I expect both to return the same results. The right result is 100; for some 
> reason MEMORY_ONLY causes all the objects in the RDD to be the same one. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30264) Unexpected behaviour when using persist MEMORY_ONLY in RDD

2019-12-16 Thread moshe ohaion (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

moshe ohaion updated SPARK-30264:
-
Attachment: users8.avro

> Unexpected behaviour when using persist MEMORY_ONLY in RDD
> --
>
> Key: SPARK-30264
> URL: https://issues.apache.org/jira/browse/SPARK-30264
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: moshe ohaion
>Priority: Major
> Attachments: users8.avro
>
>
> The persist method with MEMORY_ONLY behaves differently than with 
> MEMORY_ONLY_SER.
> persist(StorageLevel.MEMORY_ONLY()).distinct().count() returns 1,
> while persist(StorageLevel.MEMORY_ONLY_SER()).distinct().count() returns 100.
> I expect both to return the same results. The right result is 100; for some 
> reason MEMORY_ONLY causes all the objects in the RDD to be the same one. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30072) Create dedicated planner for subqueries

2019-12-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997368#comment-16997368
 ] 

Wenchen Fan commented on SPARK-30072:
-

> The nested subquery "SELECT max(df2.k) FROM df1 JOIN df2 ON df1.k = df2.k AND 
> df2.id < 2" will be run in another QueryExecution

This is true, but we create `AdaptiveSparkPlanExec` for both the main query and 
all subqueries in `InsertAdaptiveSparkPlan`. That said, we have the 
`isSubquery` info when creating `AdaptiveSparkPlanExec`.

> Create dedicated planner for subqueries
> ---
>
> Key: SPARK-30072
> URL: https://issues.apache.org/jira/browse/SPARK-30072
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ali Afroozeh
>Assignee: Ali Afroozeh
>Priority: Minor
> Fix For: 3.0.0
>
>
> This PR changes subquery planning by calling the planner and plan preparation 
> rules on the subquery plan directly. Before we were creating a QueryExecution 
> instance for subqueries to get the executedPlan. This would re-run analysis 
> and optimization on the subqueries plan. Running the analysis again on an 
> optimized query plan can have unwanted consequences, as some rules, for 
> example DecimalPrecision, are not idempotent.
> As an example, consider the expression 1.7 * avg(a) which after applying the 
> DecimalPrecision rule becomes:
> promote_precision(1.7) * promote_precision(avg(a))
> After the optimization, more specifically the constant folding rule, this 
> expression becomes:
> 1.7 * promote_precision(avg(a))
> Now if we run the analyzer on this optimized query again, we will get:
> promote_precision(1.7) * promote_precision(promote_precision(avg(a)))
> This will later be optimized as:
> 1.7 * promote_precision(promote_precision(avg(a)))
> As can be seen, re-running the analysis and optimization on this expression 
> results in an expression with extra nested promote_precision nodes. Adding 
> unneeded nodes to the plan is problematic because it can eliminate situations 
> where we can reuse the plan.
> We opted to introduce dedicated planners for subqueries, instead of making 
> the DecimalPrecision rule idempotent, because this eliminates this entire 
> category of problems. Another benefit is that planning time for subqueries is 
> reduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30049) SQL fails to parse when comment contains an unmatched quote character

2019-12-16 Thread Oleg Bonar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997350#comment-16997350
 ] 

Oleg Bonar commented on SPARK-30049:


I would like to investigate this issue.

> SQL fails to parse when comment contains an unmatched quote character
> -
>
> Key: SPARK-30049
> URL: https://issues.apache.org/jira/browse/SPARK-30049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Darrell Lowe
>Priority: Major
>
> A SQL statement that contains a comment with an unmatched quote character can 
> lead to a parse error.  These queries parsed correctly in older versions of 
> Spark.  For example, here's an excerpt from an interactive spark-sql session 
> on a recent Spark-3.0.0-SNAPSHOT build (commit 
> e23c135e568d4401a5659bc1b5ae8fc8bf147693):
> {noformat}
> spark-sql> SELECT 1 -- someone's comment here
>  > ;
> Error in query: 
> extraneous input ';' expecting (line 2, pos 0)
> == SQL ==
> SELECT 1 -- someone's comment here
> ;
> ^^^
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30268) Incorrect pyspark package name when releasing preview version

2019-12-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30268:

Summary: Incorrect pyspark package name when releasing preview version  
(was: pyspark pyspark package name when releasing preview version)

> Incorrect pyspark package name when releasing preview version
> -
>
> Key: SPARK-30268
> URL: https://issues.apache.org/jira/browse/SPARK-30268
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {noformat}
> cp: cannot stat 
> 'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': 
> No such file or directory
> gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
> gpg: signing failed: No such file or directory
> gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
> {noformat}
> But it is:
> {noformat}
> yumwang@ubuntu-3513086:~/spark-release/output$ ll 
> spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
> total 214140
> drwxr-xr-x 2 yumwang stack  4096 Dec 16 06:17 ./
> drwxr-xr-x 9 yumwang stack  4096 Dec 16 06:17 ../
> -rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
> {noformat}
> {noformat}
> /usr/local/lib/python3.6/dist-packages/setuptools/dist.py:476: UserWarning: 
> Normalizing '3.0.0.dev02' to '3.0.0.dev2'
>   normalized_version,
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30271) dynamic allocation won't release some executor in some case.

2019-12-16 Thread angerszhu (Jira)
angerszhu created SPARK-30271:
-

 Summary: dynamic allocation won't release some executor in some 
case.
 Key: SPARK-30271
 URL: https://issues.apache.org/jira/browse/SPARK-30271
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.4.0
Reporter: angerszhu


Case (a configuration sketch follows below):
max executors: 5
min executors: 0
idle timeout: 5s

Stage-1 runs 10 tasks on 5 executors.
When stage-1 finishes on all 5 executors, every executor is added to `removeTimes` 
on its taskEnd event.
After 5s the release process starts, but since stage-2 has 20 tasks the executors 
are not released (the existing executor count is below the executor target number), 
and they are removed from `removeTimes` again.
However, if tasks are never scheduled onto some of these executors, e.g. executor-1 
never gets a task to run, it is never put back into `removeTimes`, and if no more 
tasks arrive that executor is never removed.
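
As a point of reference, a minimal Scala sketch of the dynamic allocation settings 
in the case above (the shuffle-service setting is an assumption that dynamic 
allocation normally requires, not something stated in this report):
{code:scala}
import org.apache.spark.SparkConf

// Hedged sketch: the allocation settings described in this case.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")            // assumed prerequisite
  .set("spark.dynamicAllocation.maxExecutors", "5")
  .set("spark.dynamicAllocation.minExecutors", "0")
  .set("spark.dynamicAllocation.executorIdleTimeout", "5s")
{code}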



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30150) Manage resources (ADD/LIST) does not support quoted path

2019-12-16 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997314#comment-16997314
 ] 

Rakesh Raushan commented on SPARK-30150:


Thanks!!

> Manage resources (ADD/LIST) does not support quoted path
> 
>
> Key: SPARK-30150
> URL: https://issues.apache.org/jira/browse/SPARK-30150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>
> Manage resources (ADD/LIST) does not support quoted path.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16180) Task hang on fetching blocks (cached RDD)

2019-12-16 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997310#comment-16997310
 ] 

angerszhu commented on SPARK-16180:
---

I met this problem recently in Spark 2.4.

> Task hang on fetching blocks (cached RDD)
> -
>
> Key: SPARK-16180
> URL: https://issues.apache.org/jira/browse/SPARK-16180
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.1
>Reporter: Davies Liu
>Priority: Major
>  Labels: bulk-closed
>
> Here is the stackdump of executor:
> {code}
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:107)
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:588)
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:585)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:585)
> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:570)
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:630)
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:46)
> org.apache.spark.scheduler.Task.run(Task.scala:96)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30269) Should use old partition stats to decide whether to update stats when analyzing partition

2019-12-16 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-30269:
-
Summary: Should use old partition stats to decide whether to update stats 
when analyzing partition  (was: Should use old partition stats to compare when 
analyzing partition)

> Should use old partition stats to decide whether to update stats when 
> analyzing partition
> -
>
> Key: SPARK-30269
> URL: https://issues.apache.org/jira/browse/SPARK-30269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Zhenhua Wang
>Priority: Major
> Fix For: 2.3.5, 2.4.5, 3.0.0
>
>
> It's an obvious bug: currently when analyzing partition stats, we use old 
> table stats to compare with newly computed stats to decide whether it should 
> update stats or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30270) Can't pickle abstract classes (with cloudpickle)

2019-12-16 Thread Sebastian Straub (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Straub updated SPARK-30270:
-
Description: 
I can't use any classes that are derived from abstract classes in PySpark, 
because cloudpickle can't pickle them.

Example:
{code:java}
import pickle
from abc import ABC
from pyspark import cloudpickle


class Foo(ABC):
pass

class Bar(Foo):
pass

bar = Bar()

# pickle dump works fine
pickle.dumps(bar)
# cloudpickle doesn't
cloudpickle.dumps(bar)
{code}
A similar bug has already been reported in SPARK-21439 and marked resolved, but 
I can confirm that the issue still persists.

 

  was:
I can't use any classes that are derived from abstract classes in PySpark, 
because cloudpickle  can't pickle them. Example:

 
{code:java}
import pickle
from abc import ABC
from pyspark import cloudpickle


class Foo(ABC):
pass

class Bar(Foo):
pass

bar = Bar()

# pickle dump works fine
pickle.dumps(bar)
# cloudpickle doesn't
cloudpickle.dumps(bar)
{code}
A similar bug has already been reported in SPARK-21439 and marked resolved, but 
I can confirm that the issue still persists.

 


> Can't pickle abstract classes (with cloudpickle)
> 
>
> Key: SPARK-30270
> URL: https://issues.apache.org/jira/browse/SPARK-30270
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Sebastian Straub
>Priority: Minor
>  Labels: cloudpickle
>
> I can't use any classes that are derived from abstract classes in PySpark, 
> because cloudpickle can't pickle them.
> Example:
> {code:java}
> import pickle
> from abc import ABC
> from pyspark import cloudpickle
> class Foo(ABC):
> pass
> class Bar(Foo):
> pass
> bar = Bar()
> # pickle dump works fine
> pickle.dumps(bar)
> # cloudpickle doesn't
> cloudpickle.dumps(bar)
> {code}
> A similar bug has already been reported in SPARK-21439 and marked resolved, 
> but I can confirm that the issue still persists.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30270) Can't pickle abstract classes (with cloudpickle)

2019-12-16 Thread Sebastian Straub (Jira)
Sebastian Straub created SPARK-30270:


 Summary: Can't pickle abstract classes (with cloudpickle)
 Key: SPARK-30270
 URL: https://issues.apache.org/jira/browse/SPARK-30270
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Sebastian Straub


I can't use any classes that are derived from abstract classes in PySpark, 
because cloudpickle  can't pickle them. Example:

 
{code:java}
import pickle
from abc import ABC
from pyspark import cloudpickle


class Foo(ABC):
pass

class Bar(Foo):
pass

bar = Bar()

# pickle dump works fine
pickle.dumps(bar)
# cloudpickle doesn't
cloudpickle.dumps(bar)
{code}
A similar bug has already been reported in SPARK-21439 and marked resolved, but 
I can confirm that the issue still persists.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30269) Should use old partition stats to compare when analyzing partition

2019-12-16 Thread Zhenhua Wang (Jira)
Zhenhua Wang created SPARK-30269:


 Summary: Should use old partition stats to compare when analyzing 
partition
 Key: SPARK-30269
 URL: https://issues.apache.org/jira/browse/SPARK-30269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 2.3.4, 3.0.0
Reporter: Zhenhua Wang
 Fix For: 2.3.5, 2.4.5, 3.0.0


It's an obvious bug: currently when analyzing partition stats, we use old table 
stats to compare with newly computed stats to decide whether it should update 
stats or not.
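
For context, a hedged Scala illustration of the partition-level analysis command 
whose stats-update decision is affected (the table and partition names are made up):
{code:scala}
// Hypothetical table "sales" partitioned by "dt"; only the command shape matters here.
spark.sql("ANALYZE TABLE sales PARTITION (dt = '2019-12-16') COMPUTE STATISTICS")
{code}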



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25250) Race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple ti

2019-12-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997306#comment-16997306
 ] 

Wenchen Fan commented on SPARK-25250:
-

It's https://issues.apache.org/jira/browse/SPARK-27474

> Race condition with tasks running when new attempt for same stage is created 
> leads to other task in the next attempt running on the same partition id 
> retry multiple times
> --
>
> Key: SPARK-25250
> URL: https://issues.apache.org/jira/browse/SPARK-25250
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.1
>Reporter: Parth Gandhi
>Assignee: Parth Gandhi
>Priority: Major
> Fix For: 2.3.4, 2.4.1, 3.0.0
>
>
> We recently had a scenario where a race condition occurred when a task from 
> the previous stage attempt finished just before a new attempt for the same stage 
> was created due to a fetch failure, so the new task created in the second 
> attempt on the same partition id kept retrying multiple times due to a 
> TaskCommitDenied exception without realizing that the task in the earlier attempt 
> had already succeeded.  
> For example, consider a task with partition id 9000 and index 9000 running in 
> stage 4.0. We see a fetch failure and thus spawn a new stage attempt 4.1. 
> Just within this timespan, the above task completes successfully, thus 
> marking partition id 9000 as complete for 4.0. However, as stage 4.1 has 
> not yet been created, the taskset info for that stage is not available to the 
> TaskScheduler, so, naturally, partition id 9000 has not been marked 
> completed for 4.1. Stage 4.1 now spawns a task with index 2000 on the same 
> partition id 9000. This task fails due to CommitDeniedException and, since it 
> does not see the corresponding partition id as having been marked successful, it 
> keeps retrying multiple times until the job finally succeeds. It doesn't 
> cause any job failures because the DAG scheduler tracks the partitions 
> separately from the task set managers.
>  
> Steps to Reproduce:
>  # Run any large job involving shuffle operation.
>  # When the ShuffleMap stage finishes and the ResultStage begins running, 
> cause this stage to throw a fetch failure exception(Try deleting certain 
> shuffle files on any host).
>  # Observe the task attempt numbers for the next stage attempt. Please note 
> that this issue is an intermittent one, so it might not happen all the time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30150) Manage resources (ADD/LIST) does not support quoted path

2019-12-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30150:
---

Assignee: Rakesh Raushan  (was: jobit mathew)

> Manage resources (ADD/LIST) does not support quoted path
> 
>
> Key: SPARK-30150
> URL: https://issues.apache.org/jira/browse/SPARK-30150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>
> Manage resources (ADD/LIST) does not support quoted path.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30268) pyspark pyspark package name when releasing preview version

2019-12-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-30268:

Description: 
{noformat}
cp: cannot stat 
'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': No 
such file or directory
gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
gpg: signing failed: No such file or directory
gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
{noformat}

But it is:

{noformat}
yumwang@ubuntu-3513086:~/spark-release/output$ ll 
spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
total 214140
drwxr-xr-x 2 yumwang stack  4096 Dec 16 06:17 ./
drwxr-xr-x 9 yumwang stack  4096 Dec 16 06:17 ../
-rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
{noformat}



{noformat}
/usr/local/lib/python3.6/dist-packages/setuptools/dist.py:476: UserWarning: 
Normalizing '3.0.0.dev02' to '3.0.0.dev2'
  normalized_version,
{noformat}



  was:

{noformat}
cp: cannot stat 
'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': No 
such file or directory
gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
gpg: signing failed: No such file or directory
gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
{noformat}

But it is:

{noformat}
yumwang@ubuntu-3513086:~/spark-release/output$ ll 
spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
total 214140
drwxr-xr-x 2 yumwang stack  4096 Dec 16 06:17 ./
drwxr-xr-x 9 yumwang stack  4096 Dec 16 06:17 ../
-rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
{noformat}




> pyspark pyspark package name when releasing preview version
> ---
>
> Key: SPARK-30268
> URL: https://issues.apache.org/jira/browse/SPARK-30268
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {noformat}
> cp: cannot stat 
> 'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': 
> No such file or directory
> gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
> gpg: signing failed: No such file or directory
> gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
> {noformat}
> But it is:
> {noformat}
> yumwang@ubuntu-3513086:~/spark-release/output$ ll 
> spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
> total 214140
> drwxr-xr-x 2 yumwang stack  4096 Dec 16 06:17 ./
> drwxr-xr-x 9 yumwang stack  4096 Dec 16 06:17 ../
> -rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
> {noformat}
> {noformat}
> /usr/local/lib/python3.6/dist-packages/setuptools/dist.py:476: UserWarning: 
> Normalizing '3.0.0.dev02' to '3.0.0.dev2'
>   normalized_version,
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30223) queries in thrift server may read wrong SQL configs

2019-12-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997298#comment-16997298
 ] 

Wenchen Fan commented on SPARK-30223:
-

It's not possible to pass around the `SQLConf` object in all the places; you 
can take a look at the callers of `SQLConf.get`.

BTW I mean adding `SparkSession.setActiveSession(this)` in `SparkSession.sql` 
and other similar places.
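
To make the idea concrete, a minimal Scala sketch of the intent (not the actual 
patch; the helper name and the usage are made up for illustration):
{code:scala}
import org.apache.spark.sql.SparkSession

// Hedged sketch: run a body with the given session marked as active, so that
// rules calling SQLConf.get resolve this session's conf instead of whatever
// session happens to be active on the same thread.
def runWithActiveSession[T](session: SparkSession)(body: => T): T = {
  SparkSession.setActiveSession(session)
  body
}

// Illustrative usage for a thrift-server request (names are hypothetical):
// runWithActiveSession(requestSession) { requestSession.sql(statement) }
{code}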

> queries in thrift server may read wrong SQL configs
> ---
>
> Key: SPARK-30223
> URL: https://issues.apache.org/jira/browse/SPARK-30223
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> The Spark thrift server creates many SparkSessions to serve requests, and the 
> thrift server serves requests using a single thread. One thread can only have 
> one active SparkSession, so SQLCong.get can't get the proper conf from the 
> session that runs the query.
> Whenever we issue an action on a SparkSession, we should set this session as 
> active session, e.g. `SparkSession.sql`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30150) Manage resources (ADD/LIST) does not support quoted path

2019-12-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997299#comment-16997299
 ] 

Wenchen Fan commented on SPARK-30150:
-

updated

> Manage resources (ADD/LIST) does not support quoted path
> 
>
> Key: SPARK-30150
> URL: https://issues.apache.org/jira/browse/SPARK-30150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>
> Manage resources (ADD/LIST) does not support quoted path.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30268) pyspark pyspark package name when releasing preview version

2019-12-16 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30268:
---

 Summary: pyspark pyspark package name when releasing preview 
version
 Key: SPARK-30268
 URL: https://issues.apache.org/jira/browse/SPARK-30268
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Yuming Wang
Assignee: Yuming Wang



{noformat}
cp: cannot stat 
'spark-3.0.0-preview2-bin-hadoop2.7/python/dist/pyspark-3.0.0.dev02.tar.gz': No 
such file or directory
gpg: can't open 'pyspark-3.0.0.dev02.tar.gz': No such file or directory
gpg: signing failed: No such file or directory
gpg: pyspark-3.0.0.dev02.tar.gz: No such file or directory
{noformat}

But it is:

{noformat}
yumwang@ubuntu-3513086:~/spark-release/output$ ll 
spark-3.0.0-preview2-bin-hadoop2.7/python/dist/
total 214140
drwxr-xr-x 2 yumwang stack  4096 Dec 16 06:17 ./
drwxr-xr-x 9 yumwang stack  4096 Dec 16 06:17 ../
-rw-r--r-- 1 yumwang stack 219267173 Dec 16 06:17 pyspark-3.0.0.dev2.tar.gz
{noformat}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27021) Leaking Netty event loop group for shuffle chunk fetch requests

2019-12-16 Thread roncenzhao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997288#comment-16997288
 ] 

roncenzhao commented on SPARK-27021:


[~attilapiros] Thanks. The issue is the same problem we have encountered.

> Leaking Netty event loop group for shuffle chunk fetch requests
> ---
>
> Key: SPARK-27021
> URL: https://issues.apache.org/jira/browse/SPARK-27021
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.1, 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-12-14-23-23-50-384.png
>
>
> The extra event loop group created for handling shuffle chunk fetch requests 
> are never closed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30267) avro deserializer: ArrayList cannot be cast to GenericData$Array

2019-12-16 Thread Steven Aerts (Jira)
Steven Aerts created SPARK-30267:


 Summary: avro deserializer: ArrayList cannot be cast to 
GenericData$Array
 Key: SPARK-30267
 URL: https://issues.apache.org/jira/browse/SPARK-30267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Steven Aerts


On some more complex avro objects, the Avro Deserializer fails with the 
following stack trace:

{code}
java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
org.apache.avro.generic.GenericData$Array
at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170)
at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169)
at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314)
at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310)
at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332)
at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329)
at 
org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56)
at 
org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70)
{code}

This is because the deserializer assumes that an array is always the very 
specific {{org.apache.avro.generic.GenericData$Array}} type, which is not always 
the case.

Treating it as a normal list works (see the sketch below).
A GitHub PR is coming up to fix this.
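
A small Scala sketch of the class-cast problem described above (a rough 
illustration under the assumption that the array value arrives as a plain 
java.util.ArrayList, as in the stack trace; not the actual fix):
{code:scala}
// Any java.util.List can back an Avro array field, not only GenericData.Array.
val value: AnyRef = new java.util.ArrayList[Integer]()

// This is effectively the assumption that fails:
// value.asInstanceOf[org.apache.avro.generic.GenericData.Array[Any]]  // ClassCastException

// Treating the value as a plain java.util.List works for both implementations:
val elements = value.asInstanceOf[java.util.List[Any]]
{code}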



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30265) Do not change R version when releasing preview versions

2019-12-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-30265.
-
Resolution: Fixed

Issue resolved by pull request 26904
https://github.com/apache/spark/pull/26904

> Do not change R version when releasing preview versions
> ---
>
> Key: SPARK-30265
> URL: https://issues.apache.org/jira/browse/SPARK-30265
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> {code:sh}
> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
> {code}
> {noformat}
> ++ . /opt/spark-rm/output/spark-3.0.0-preview2-bin-hadoop2.7/R/find-r.sh
> +++ '[' -z /usr/bin ']'
> ++ /usr/bin/Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
> Loading required package: usethis
> Updating SparkR documentation
> First time using roxygen2. Upgrading automatically...
> Loading SparkR
> Invalid DESCRIPTION:
> Malformed package version.
> See section 'The DESCRIPTION file' in the 'Writing R Extensions'
> manual.
> Error: invalid version specification '3.0.0-preview2'
> In addition: Warning message:
> roxygen2 requires Encoding: UTF-8
> Execution halted
> [ERROR] Command execution failed.
> org.apache.commons.exec.ExecuteException: Process exited with an error: 1 
> (Exit value: 1)
> at org.apache.commons.exec.DefaultExecutor.executeInternal 
> (DefaultExecutor.java:404)
> at org.apache.commons.exec.DefaultExecutor.execute 
> (DefaultExecutor.java:166)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:804)
> at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:751)
> at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:313)
> at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
> (DefaultBuildPluginManager.java:137)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:210)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:156)
> at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
> (MojoExecutor.java:148)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:117)
> at 
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
> (LifecycleModuleBuilder.java:81)
> at 
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
>  (SingleThreadedBuilder.java:56)
> at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
> (LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
> at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
> at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
> at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
> at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
> at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
> at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke 
> (NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke 
> (DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke (Method.java:498)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced 
> (Launcher.java:282)
> at org.codehaus.plexus.classworlds.launcher.Launcher.launch 
> (Launcher.java:225)
> at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode 
> (Launcher.java:406)
> at org.codehaus.plexus.classworlds.launcher.Launcher.main 
> (Launcher.java:347)
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.0.0-preview2:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [ 18.619 
> s]
> [INFO] Spark Project Tags . SUCCESS [ 13.652 
> s]
> [INFO] Spark Project Sketch ... SUCCESS [  5.673 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.081 
> s]
> [INFO] Spark Project Networking ... SUCCESS [  3.509 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  0.993 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  7.556 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  5.522 
> s]
> [INFO] Spark Project Core . FAILURE [01:06 
> min]
> [INFO] Spark Project ML 

[jira] [Resolved] (SPARK-30192) support column position in DS v2

2019-12-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30192.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26817
[https://github.com/apache/spark/pull/26817]

> support column position in DS v2
> 
>
> Key: SPARK-30192
> URL: https://issues.apache.org/jira/browse/SPARK-30192
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30266) Int overflow and MatchError in ApproximatePercentile

2019-12-16 Thread Kent Yao (Jira)
Kent Yao created SPARK-30266:


 Summary: Int overflow and MatchError in ApproximatePercentile 
 Key: SPARK-30266
 URL: https://issues.apache.org/jira/browse/SPARK-30266
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Kent Yao


accuracyExpression can accept a Long, which may cause an overflow error.

accuracyExpression can accept fractions, which are implicitly floored.

accuracyExpression can accept null, which is implicitly changed to 0.

percentageExpression can accept null but causes a MatchError.

percentageExpression can accept ArrayType(_, nullable=true), in which case the nulls 
are implicitly changed to zeros.
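
A few hedged Scala examples of inputs of this shape, going through the 
percentile_approx SQL function that backs ApproximatePercentile (the table and 
column names are made up, and the exact failure mode may differ by version):
{code:scala}
// Assumed table "t" with a numeric column "c", for illustration only.
spark.sql("SELECT percentile_approx(c, 0.5, 3000000000) FROM t") // Long accuracy: overflow risk
spark.sql("SELECT percentile_approx(c, 0.5, 100.5) FROM t")      // fractional accuracy, silently floored
spark.sql("SELECT percentile_approx(c, NULL) FROM t")            // null percentage: MatchError
{code}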

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29505) desc extended is case sensitive

2019-12-16 Thread pavithra ramachandran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16997090#comment-16997090
 ] 

pavithra ramachandran commented on SPARK-29505:
---

I will work on this.

> desc extended   is case sensitive
> --
>
> Key: SPARK-29505
> URL: https://issues.apache.org/jira/browse/SPARK-29505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> create table customer(id int, name String, *CName String*, address String, 
> city String, pin int, country String);
> insert into customer values(1,'Alfred','Maria','Obere Str 
> 57','Berlin',12209,'Germany');
> insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
> D.F.',05021,'Maxico');
> insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
> 2312','Maxico D.F.',05023,'Maxico');
> analyze table customer compute statistics for columns cname; – *Success 
> (though cname does not match the case of CName)*
> desc extended customer cname; – Failed
> jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | cname |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | NULL |
> | distinct_count | NULL |
> | avg_col_len | NULL |
> | max_col_len | NULL |
> | histogram | NULL |
> +-+--
> {code}
>  
> But 
> {code}
> desc extended customer CName; – SUCCESS
> 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | CName |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | 0 |
> | distinct_count | 3 |
> | avg_col_len | 9 |
> | max_col_len | 14 |
> | histogram | NULL |
> +-+-+
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29505) desc extended is case sensitive

2019-12-16 Thread Shivu Sondur (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivu Sondur updated SPARK-29505:
-
Comment: was deleted

(was: I am checking this issue)

> desc extended   is case sensitive
> --
>
> Key: SPARK-29505
> URL: https://issues.apache.org/jira/browse/SPARK-29505
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> create table customer(id int, name String, *CName String*, address String, 
> city String, pin int, country String);
> insert into customer values(1,'Alfred','Maria','Obere Str 
> 57','Berlin',12209,'Germany');
> insert into customer values(2,'Ana','trujilo','Adva de la','Maxico 
> D.F.',05021,'Maxico');
> insert into customer values(3,'Antonio','Antonio Moreno','Mataderos 
> 2312','Maxico D.F.',05023,'Maxico');
> analyze table customer compute statistics for columns cname; – *Success 
> (though cname does not match the case of CName)*
> desc extended customer cname; – Failed
> jdbc:hive2://10.18.19.208:23040/default> desc extended customer *cname;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | cname |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | NULL |
> | distinct_count | NULL |
> | avg_col_len | NULL |
> | max_col_len | NULL |
> | histogram | NULL |
> +-+--
> {code}
>  
> But 
> {code}
> desc extended customer CName; – SUCCESS
> 0: jdbc:hive2://10.18.19.208:23040/default> desc extended customer *CName;*
> +-+-+
> | info_name | info_value |
> +-+-+
> | col_name | CName |
> | data_type | string |
> | comment | NULL |
> | min | NULL |
> | max | NULL |
> | num_nulls | 0 |
> | distinct_count | 3 |
> | avg_col_len | 9 |
> | max_col_len | 14 |
> | histogram | NULL |
> +-+-+
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30265) Do not change R version when releasing preview versions

2019-12-16 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-30265:
---

 Summary: Do not change R version when releasing preview versions
 Key: SPARK-30265
 URL: https://issues.apache.org/jira/browse/SPARK-30265
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Yuming Wang
Assignee: Yuming Wang



{code:sh}
./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
{code}

{noformat}

++ . /opt/spark-rm/output/spark-3.0.0-preview2-bin-hadoop2.7/R/find-r.sh
+++ '[' -z /usr/bin ']'
++ /usr/bin/Rscript -e ' if("devtools" %in% rownames(installed.packages())) { 
library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) }'
Loading required package: usethis
Updating SparkR documentation
First time using roxygen2. Upgrading automatically...
Loading SparkR
Invalid DESCRIPTION:
Malformed package version.

See section 'The DESCRIPTION file' in the 'Writing R Extensions'
manual.

Error: invalid version specification '3.0.0-preview2'
In addition: Warning message:
roxygen2 requires Encoding: UTF-8
Execution halted
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit 
value: 1)
at org.apache.commons.exec.DefaultExecutor.executeInternal 
(DefaultExecutor.java:404)
at org.apache.commons.exec.DefaultExecutor.execute 
(DefaultExecutor.java:166)
at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:804)
at org.codehaus.mojo.exec.ExecMojo.executeCommandLine (ExecMojo.java:751)
at org.codehaus.mojo.exec.ExecMojo.execute (ExecMojo.java:313)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
(DefaultBuildPluginManager.java:137)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:210)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:156)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:148)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:117)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:81)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
 (SingleThreadedBuilder.java:56)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke 
(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke 
(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke (Method.java:498)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced 
(Launcher.java:282)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch 
(Launcher.java:225)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode 
(Launcher.java:406)
at org.codehaus.plexus.classworlds.launcher.Launcher.main 
(Launcher.java:347)
[INFO] 
[INFO] Reactor Summary for Spark Project Parent POM 3.0.0-preview2:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 18.619 s]
[INFO] Spark Project Tags . SUCCESS [ 13.652 s]
[INFO] Spark Project Sketch ... SUCCESS [  5.673 s]
[INFO] Spark Project Local DB . SUCCESS [  2.081 s]
[INFO] Spark Project Networking ... SUCCESS [  3.509 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [  0.993 s]
[INFO] Spark Project Unsafe ... SUCCESS [  7.556 s]
[INFO] Spark Project Launcher . SUCCESS [  5.522 s]
[INFO] Spark Project Core . FAILURE [01:06 min]
[INFO] Spark Project ML Local Library . SKIPPED
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming  SKIPPED
[INFO] Spark Project Catalyst . SKIPPED
[INFO] Spark Project SQL .. SKIPPED
[INFO] Spark Project ML Library ... SKIPPED
[INFO] Spark Project Tools  SKIPPED
[INFO] Spark Project Hive