[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-08-29 Thread Darcy Shen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597082#comment-16597082
 ] 

Darcy Shen commented on SPARK-14220:


Jenkins for Scala 2.12:

 

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>  Labels: release-notes
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.






[jira] [Assigned] (SPARK-25276) Redundant constraints when using alias

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25276:


Assignee: Apache Spark

> Redundant constraints when using alias
> -
>
> Key: SPARK-25276
> URL: https://issues.apache.org/jira/browse/SPARK-25276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: Ajith S
>Assignee: Apache Spark
>Priority: Major
> Attachments: test.patch
>
>
> Attaching a test to reproduce the issue. The test fails with the following message:
> == FAIL: Constraints do not match ===
> Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
> Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
> == Result ==
> Missing: N/A
> Found but not expected: isnotnull(z#5),(z#5 > 10)
> Here I think that, since z has an EqualNullSafe comparison with x, having 
> isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this 
> may cause overhead.
> So I suggest that at 
> [https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254]
>  we just assign with = instead of appending with ++= (addAll).






[jira] [Commented] (SPARK-25276) Redundant constraints when using alias

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597027#comment-16597027
 ] 

Apache Spark commented on SPARK-25276:
--

User 'ajithme' has created a pull request for this issue:
https://github.com/apache/spark/pull/22277

> Redundant constraints when using alias
> -
>
> Key: SPARK-25276
> URL: https://issues.apache.org/jira/browse/SPARK-25276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: Ajith S
>Priority: Major
> Attachments: test.patch
>
>
> Attaching a test to reproduce the issue. The test fails with the following message:
> == FAIL: Constraints do not match ===
> Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
> Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
> == Result ==
> Missing: N/A
> Found but not expected: isnotnull(z#5),(z#5 > 10)
> Here I think that, since z has an EqualNullSafe comparison with x, having 
> isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this 
> may cause overhead.
> So I suggest that at 
> [https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254]
>  we just assign with = instead of appending with ++= (addAll).






[jira] [Assigned] (SPARK-25276) Redundant constraints when using alias

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25276:


Assignee: (was: Apache Spark)

> Redundant constraints when using alias
> -
>
> Key: SPARK-25276
> URL: https://issues.apache.org/jira/browse/SPARK-25276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: Ajith S
>Priority: Major
> Attachments: test.patch
>
>
> Attaching a test to reproduce the issue. The test fails with the following message:
> == FAIL: Constraints do not match ===
> Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
> Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
> == Result ==
> Missing: N/A
> Found but not expected: isnotnull(z#5),(z#5 > 10)
> Here I think that, since z has an EqualNullSafe comparison with x, having 
> isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this 
> may cause overhead.
> So I suggest that at 
> [https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254]
>  we just assign with = instead of appending with ++= (addAll).






[jira] [Updated] (SPARK-25276) Redundant constraints when using alias

2018-08-29 Thread Ajith S (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-25276:

Attachment: test.patch

> Redundant constraints when using alias
> -
>
> Key: SPARK-25276
> URL: https://issues.apache.org/jira/browse/SPARK-25276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: Ajith S
>Priority: Major
> Attachments: test.patch
>
>
> Attaching a test to reproduce the issue. The test fails with the following message:
> == FAIL: Constraints do not match ===
> Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
> Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
> == Result ==
> Missing: N/A
> Found but not expected: isnotnull(z#5),(z#5 > 10)
> Here I think that, since z has an EqualNullSafe comparison with x, having 
> isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this 
> may cause overhead.
> So I suggest that at 
> [https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254]
>  we just assign with = instead of appending with ++= (addAll).






[jira] [Created] (SPARK-25276) Redundant constraints when using alias

2018-08-29 Thread Ajith S (JIRA)
Ajith S created SPARK-25276:
---

 Summary: Redundant constraints when using alias
 Key: SPARK-25276
 URL: https://issues.apache.org/jira/browse/SPARK-25276
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1, 2.1.0
Reporter: Ajith S
 Attachments: test.patch

Attaching a test to reproduce the issue. The test fails with the following message:

== FAIL: Constraints do not match ===
Found: isnotnull(z#5),(z#5 > 10),(x#3 > 10),(z#5 <=> x#3),(b#1 <=> y#4),isnotnull(x#3)
Expected: (x#3 > 10),isnotnull(x#3),(b#1 <=> y#4),(z#5 <=> x#3)
== Result ==
Missing: N/A
Found but not expected: isnotnull(z#5),(z#5 > 10)

Here I think that, since z has an EqualNullSafe comparison with x, having 
isnotnull(z#5) and (z#5 > 10) is redundant. If a query has a lot of aliases, this may 
cause overhead.

So I suggest that at 
[https://github.com/apache/spark/blob/v2.3.2-rc5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L254]
 we just assign with = instead of appending with ++= (addAll).
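For illustration, a minimal, self-contained Scala sketch of the suggestion (this is not Spark's actual LogicalPlan code; the constraint strings and the rewriteToAlias helper are made up for the example): appending with ++= (addAll) keeps both the original and the alias-rewritten constraints, while a plain assignment with = keeps only the rewritten ones.

{code:java}
import scala.collection.mutable

// Stand-ins for the child's constraints and the alias rewrite (x#3 aliased as z#5).
val original = Set("isnotnull(x#3)", "(x#3 > 10)")
def rewriteToAlias(c: String): String = c.replace("x#3", "z#5")

// Appending (++=, i.e. addAll): the x#3 and z#5 constraints both survive.
val appended = mutable.Set[String]() ++= original
appended ++= original.map(rewriteToAlias)
// appended: isnotnull(x#3), (x#3 > 10), isnotnull(z#5), (z#5 > 10)

// Assigning (=): only the rewritten constraints remain.
var assigned: Set[String] = original
assigned = original.map(rewriteToAlias)
// assigned: isnotnull(z#5), (z#5 > 10)
{code}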






[jira] [Updated] (SPARK-24785) Making sure REPL prints Spark UI info and then Welcome message

2018-08-29 Thread Saisai Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-24785:

Fix Version/s: 2.4.0

> Making sure REPL prints Spark UI info and then Welcome message
> --
>
> Key: SPARK-24785
> URL: https://issues.apache.org/jira/browse/SPARK-24785
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 2.4.0
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 2.4.0
>
>
> After [SPARK-24418] the welcome message is printed first, and then the Scala 
> prompt is shown before the Spark UI info is printed, as shown below.
> {code:java}
>  apache-spark git:(scala-2.11.12) ✗ ./bin/spark-shell 
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.0-SNAPSHOT
>   /_/
>  
> Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 
> 1.8.0_161)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> Spark context Web UI available at http://192.168.1.169:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1528180279528).
> Spark session available as 'spark'.
> scala> 
> {code}
> Although it's a minor issue, visually it doesn't look as nice as the 
> existing behavior.






[jira] [Created] (SPARK-25275) require membership in wheel to run 'su' (in dockerfiles)

2018-08-29 Thread Erik Erlandson (JIRA)
Erik Erlandson created SPARK-25275:
--

 Summary: require membership in wheel to run 'su' (in dockerfiles)
 Key: SPARK-25275
 URL: https://issues.apache.org/jira/browse/SPARK-25275
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.3.1, 2.3.0
Reporter: Erik Erlandson


For improved security, require that users be in the wheel group in order to 
run su.

See example:

https://github.com/openshift-evangelists/terminal-base-image/blob/master/image/Dockerfile#L53






[jira] [Commented] (SPARK-25217) Error thrown when creating BlockMatrix

2018-08-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596992#comment-16596992
 ] 

Liang-Chi Hsieh commented on SPARK-25217:
-

If no further question, I think we can close this ticket.

> Error thrown when creating BlockMatrix
> --
>
> Key: SPARK-25217
> URL: https://issues.apache.org/jira/browse/SPARK-25217
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cs5090237
>Priority: Major
>
> dm1 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])
> dm2 = Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12])
> sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 1, 2], [7, 11, 12])
> blocks1 = sc.parallelize([((0, 0), dm1)])
> sm_ = Matrix(3,2,sm)
> blocks2 = sc.parallelize([((0, 0), sm), ((1, 0), sm)])
> blocks3 = sc.parallelize([((0, 0), sm), ((1, 0), dm2)])
> mat2 = BlockMatrix(blocks2, 3, 2)
> mat3 = BlockMatrix(blocks3, 3, 2)
>  
> *Running the above sample code in PySpark from the documentation raises the 
> following error:*
>  
> An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in 
> stage 53.0 failed 4 times, most recent failure: Lost task 14.3 in stage 53.0 
> (TID 1081, , executor 15): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last): File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 230, 
> in main process() File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 225, 
> in process serializer.dump_stream(func(split_index, iterator), outfile) File 
> "/mnt/yarn/usercache/livy/appcache/application_1535051034290_0001/container_1535051034290_0001_01_23/pyspark.zip/pyspark/serializers.py",
>  line 372, in dump_stream vs = list(itertools.islice(iterator, batch)) File 
> "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1371, in 
> takeUpToNumLeft File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/util.py", line 55, in 
> wrapper return f(*args, **kwargs) File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/mllib/linalg/distributed.py",
>  line 975, in _convert_to_matrix_block_tuple raise TypeError("Cannot convert 
> type %s into a sub-matrix block tuple" % type(block)) TypeError: Cannot 
> convert type  into a sub-matrix block tuple
>  
>  






[jira] [Comment Edited] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-08-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596949#comment-16596949
 ] 

Liang-Chi Hsieh edited comment on SPARK-25271 at 8/30/18 12:11 AM:
---

I think this is a known issue with Hive and Parquet; some context can be found at 
https://issues.apache.org/jira/browse/HIVE-11625.

It can be reproduced by:
{code:java}
sql("create table vp_reader STORED AS PARQUET as select map() as a")
18/08/30 00:07:15 ERROR DataWritableWriter: Parquet record is malformed: empty 
fields are illegal, the field should be ommited completely instead
parquet.io.ParquetEncodingException: empty fields are illegal, the field should 
be ommited completely instead  
...
{code}
If you don't store it in Parquet format, it works:
{code:java}
sql("create table vp_reader STORED AS ORC as select map() as a")
sql("select * from vp_reader").show
+---+
|  a|
+---+
| []|
+---+
{code}


was (Author: viirya):
I think this is a known issue with Hive and Parquet; some context can be found at 
https://issues.apache.org/jira/browse/HIVE-11625.

It can be reproduced by:
{code}
sql("create table vp_reader STORED AS PARQUET as select map() as a")
18/08/30 00:07:15 ERROR DataWritableWriter: Parquet record is malformed: empty 
fields are illegal, the field should be ommited completely instead
parquet.io.ParquetEncodingException: empty fields are illegal, the field should 
be ommited completely instead  
...
{code}

If you don't store it in Parquet format, it works:
{code}
sql("create table vp_reader STORED AS ORC as select map() as a")
scala> sql("select * from vp_reader").show
+---+
|  a|
+---+
| []|
+---+
{code}


> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: shivusondur
>Priority: Major
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result:* Throws an exception (works fine with Spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> 

[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-08-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596949#comment-16596949
 ] 

Liang-Chi Hsieh commented on SPARK-25271:
-

I think this is a known issue with Hive and Parquet; some context can be found at 
https://issues.apache.org/jira/browse/HIVE-11625.

It can be reproduced by:
{code}
sql("create table vp_reader STORED AS PARQUET as select map() as a")
18/08/30 00:07:15 ERROR DataWritableWriter: Parquet record is malformed: empty 
fields are illegal, the field should be ommited completely instead
parquet.io.ParquetEncodingException: empty fields are illegal, the field should 
be ommited completely instead  
...
{code}

If you don't store it in Parquet format, it works:
{code}
sql("create table vp_reader STORED AS ORC as select map() as a")
scala> sql("select * from vp_reader").show
+---+
|  a|
+---+
| []|
+---+
{code}


> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: shivusondur
>Priority: Major
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result:* Throws an exception (works fine with Spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> 

[jira] [Resolved] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-08-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24909.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21976
[https://github.com/apache/spark/pull/21976]

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
> Fix For: 2.4.0
>
>
> The DAGScheduler can hang if the executor was lost (due to a fetch failure) and 
> all the tasks in the task sets are marked as completed 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265]).
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started. The DAG scheduler says "Ignoring 
> possibly bogus"... but on the TaskSetManager side those tasks have been marked as 
> completed for all stage attempts. The DAGScheduler hangs here. I did a 
> heap dump on the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list, but the TaskSetManagers are all complete.
> Note: to reproduce this, you need a situation where you have a ShuffleMapTask 
> (call it task1) fetching data from an executor that also has other 
> ShuffleMapTasks (call it task2) running (fetching from other hosts). task1, 
> which is fetching the data, has to hit a FetchFailure, which causes the stage 
> to fail and the executor to be marked as lost due to the fetch failure. A new 
> task set is started for the new stage attempt; then task2, which was running 
> on the executor that was marked lost, finishes. The scheduler ignores that 
> completion event ("Ignoring possibly bogus ..."). This results in a hang 
> because at this point the TaskSetManager has already marked all tasks for all 
> attempts of that stage as completed.
>  
> Configs that need to be on:
> spark.blacklist.application.fetchFailure.enabled=true
> spark.files.fetchFailure.unRegisterOutputOnHost=true
> spark.shuffle.service.enabled=true






[jira] [Assigned] (SPARK-24909) Spark scheduler can hang when fetch failures, executor lost, task running on lost executor, and multiple stage attempts

2018-08-29 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24909:
--

Assignee: Thomas Graves

> Spark scheduler can hang when fetch failures, executor lost, task running on 
> lost executor, and multiple stage attempts
> ---
>
> Key: SPARK-24909
> URL: https://issues.apache.org/jira/browse/SPARK-24909
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.3.1
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
> Fix For: 2.4.0
>
>
> The DAGScheduler can hang if the executor was lost (due to a fetch failure) and 
> all the tasks in the task sets are marked as completed 
> ([https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1265]).
> It never creates new task attempts in the task scheduler, but the DAG 
> scheduler still has pendingPartitions.
> {code:java}
> 8/07/22 08:30:00 INFO scheduler.TaskSetManager: Starting task 55769.0 in 
> stage 44.0 (TID 970752, host1.com, executor 33, partition 55769, 
> PROCESS_LOCAL, 7874 bytes)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Marking ShuffleMapStage 44 
> (repartition at Lift.scala:191) as failed due to a fetch failure from 
> ShuffleMapStage 42 (map at foo.scala:27)
> 18/07/22 08:30:29 INFO scheduler.DAGScheduler: Resubmitting ShuffleMapStage 
> 42 (map at foo.scala:27) and ShuffleMapStage 44 (repartition at 
> bar.scala:191) due to fetch failure
> 
> 18/07/22 08:30:56 INFO scheduler.DAGScheduler: Executor lost: 33 (epoch 18)
> 18/07/22 08:30:56 INFO schedulerDAGScheduler: Shuffle files lost for 
> executor: 33 (epoch 18)
> 18/07/22 08:31:20 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 44 
> (MapPartitionsRDD[70] at repartition at bar.scala:191), which has no missing 
> parents
> 18/07/22 08:31:21 INFO cluster.YarnClusterScheduler: Adding task set 44.1 
> with 59955 tasks
> 18/07/22 08:31:41 INFO scheduler.TaskSetManager: Finished task 55769.0 in 
> stage 44.0 (TID 970752) in 101505 ms on host1.com (executor 33) (15081/73320)
> 8/07/22 08:31:41 INFO scheduler.DAGScheduler: Ignoring possibly bogus 
> ShuffleMapTask(44, 55769) completion from executor 33{code}
>  
> In the logs above you will see that task 55769.0 finished after the executor 
> was lost and a new task set was started. The DAG scheduler says "Ignoring 
> possibly bogus"... but on the TaskSetManager side those tasks have been marked as 
> completed for all stage attempts. The DAGScheduler hangs here. I did a 
> heap dump on the process and can see that 55769 is still in the DAGScheduler 
> pendingPartitions list, but the TaskSetManagers are all complete.
> Note: to reproduce this, you need a situation where you have a ShuffleMapTask 
> (call it task1) fetching data from an executor that also has other 
> ShuffleMapTasks (call it task2) running (fetching from other hosts). task1, 
> which is fetching the data, has to hit a FetchFailure, which causes the stage 
> to fail and the executor to be marked as lost due to the fetch failure. A new 
> task set is started for the new stage attempt; then task2, which was running 
> on the executor that was marked lost, finishes. The scheduler ignores that 
> completion event ("Ignoring possibly bogus ..."). This results in a hang 
> because at this point the TaskSetManager has already marked all tasks for all 
> attempts of that stage as completed.
>  
> Configs that need to be on:
> spark.blacklist.application.fetchFailure.enabled=true
> spark.files.fetchFailure.unRegisterOutputOnHost=true
> spark.shuffle.service.enabled=true






[jira] [Assigned] (SPARK-25242) Suggestion to make sql config setting fluent

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25242:


Assignee: (was: Apache Spark)

> Suggestion to make sql config setting fluent
> 
>
> Key: SPARK-25242
> URL: https://issues.apache.org/jira/browse/SPARK-25242
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Florence Hope
>Priority: Minor
>
> Currently I am able to chain .conf.set() settings in the Spark session API. I 
> would also like to be able to chain config options in a similar manner in 
> pyspark.sql.conf.
> The same issue is present in the Scala SQL API, so this might also need to be 
> changed for consistency.
> See [https://github.com/apache/spark/blob/master/python/pyspark/conf.py#L130] 
> vs  
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/conf.py#L41]
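For illustration, a minimal Scala sketch of what a chainable setter looks like (FluentConf is a made-up name, not Spark's RuntimeConfig): set() returns the receiver, which is what makes chained calls possible.

{code:java}
class FluentConf {
  private val settings = scala.collection.mutable.Map.empty[String, String]

  def set(key: String, value: String): this.type = {
    settings(key) = value
    this  // returning the receiver enables chaining
  }

  def get(key: String): Option[String] = settings.get(key)
}

// Chained usage, analogous to the chained .conf.set() calls requested above.
val conf = new FluentConf()
  .set("spark.sql.shuffle.partitions", "64")
  .set("spark.sql.adaptive.enabled", "true")
{code}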






[jira] [Commented] (SPARK-25242) Suggestion to make sql config setting fluent

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596930#comment-16596930
 ] 

Apache Spark commented on SPARK-25242:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/22276

> Suggestion to make sql config setting fluent
> 
>
> Key: SPARK-25242
> URL: https://issues.apache.org/jira/browse/SPARK-25242
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Florence Hope
>Priority: Minor
>
> Currently I am able to chain .conf.set() settings in the Spark session API. I 
> would also like to be able to chain config options in a similar manner in 
> pyspark.sql.conf.
> The same issue is present in the Scala SQL API, so this might also need to be 
> changed for consistency.
> See [https://github.com/apache/spark/blob/master/python/pyspark/conf.py#L130] 
> vs  
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/conf.py#L41]






[jira] [Assigned] (SPARK-25242) Suggestion to make sql config setting fluent

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25242:


Assignee: Apache Spark

> Suggestion to make sql config setting fluent
> 
>
> Key: SPARK-25242
> URL: https://issues.apache.org/jira/browse/SPARK-25242
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Florence Hope
>Assignee: Apache Spark
>Priority: Minor
>
> Currently I am able to chain .conf.set() settings in the Spark session API. I 
> would also like to be able to chain config options in a similar manner in 
> pyspark.sql.conf.
> The same issue is present in the Scala SQL API, so this might also need to be 
> changed for consistency.
> See [https://github.com/apache/spark/blob/master/python/pyspark/conf.py#L130] 
> vs  
> [https://github.com/apache/spark/blob/master/python/pyspark/sql/conf.py#L41]






[jira] [Updated] (SPARK-25274) Improve toPandas with Arrow by sending out-of-order record batches

2018-08-29 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25274:
-
Summary: Improve toPandas with Arrow by sending out-of-order record batches 
 (was: Improve `toPandas` with Arrow by sending out-of-order record batches)

> Improve toPandas with Arrow by sending out-of-order record batches
> --
>
> Key: SPARK-25274
> URL: https://issues.apache.org/jira/browse/SPARK-25274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
> JVM out-of-order must be buffered before they can be sent to Python. This 
> causes an excess of memory to be used in the driver JVM and increases the 
> time it takes to complete because data must sit in the JVM waiting for 
> preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as 
> they arrive in the JVM, followed by a list of indices so that Python can 
> assemble the data in the correct order. This way, data is not buffered at the 
> JVM and there is no waiting on particular partitions so performance will be 
> increased.






[jira] [Assigned] (SPARK-25274) Improve `toPandas` with Arrow by sending out-of-order record batches

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25274:


Assignee: (was: Apache Spark)

> Improve `toPandas` with Arrow by sending out-of-order record batches
> 
>
> Key: SPARK-25274
> URL: https://issues.apache.org/jira/browse/SPARK-25274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
> JVM out-of-order must be buffered before they can be sent to Python. This 
> causes an excess of memory to be used in the driver JVM and increases the 
> time it takes to complete because data must sit in the JVM waiting for 
> preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as 
> they arrive in the JVM, followed by a list of indices so that Python can 
> assemble the data in the correct order. This way, data is not buffered at the 
> JVM and there is no waiting on particular partitions so performance will be 
> increased.






[jira] [Assigned] (SPARK-25274) Improve `toPandas` with Arrow by sending out-of-order record batches

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25274:


Assignee: Apache Spark

> Improve `toPandas` with Arrow by sending out-of-order record batches
> 
>
> Key: SPARK-25274
> URL: https://issues.apache.org/jira/browse/SPARK-25274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
> JVM out-of-order must be buffered before they can be sent to Python. This 
> causes an excess of memory to be used in the driver JVM and increases the 
> time it takes to complete because data must sit in the JVM waiting for 
> preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as 
> they arrive in the JVM, followed by a list of indices so that Python can 
> assemble the data in the correct order. This way, data is not buffered at the 
> JVM and there is no waiting on particular partitions so performance will be 
> increased.






[jira] [Commented] (SPARK-25274) Improve `toPandas` with Arrow by sending out-of-order record batches

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596891#comment-16596891
 ] 

Apache Spark commented on SPARK-25274:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/22275

> Improve `toPandas` with Arrow by sending out-of-order record batches
> 
>
> Key: SPARK-25274
> URL: https://issues.apache.org/jira/browse/SPARK-25274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
> JVM out-of-order must be buffered before they can be sent to Python. This 
> causes an excess of memory to be used in the driver JVM and increases the 
> time it takes to complete because data must sit in the JVM waiting for 
> preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as 
> they arrive in the JVM, followed by a list of indices so that Python can 
> assemble the data in the correct order. This way, data is not buffered at the 
> JVM and there is no waiting on particular partitions so performance will be 
> increased.






[jira] [Commented] (SPARK-25167) Minor fixes for R sql tests (tests that fail in development environment)

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596864#comment-16596864
 ] 

Apache Spark commented on SPARK-25167:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22274

> Minor fixes for R sql tests (tests that fail in development environment)
> 
>
> Key: SPARK-25167
> URL: https://issues.apache.org/jira/browse/SPARK-25167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 2.4.0
>
>
> A few SQL tests for R are failing in the development environment (Mac).
> *  The catalog API tests assume that catalog artifacts named "foo" do not 
> exist. I think names such as foo and bar are common and developers use them 
> frequently.
> *  One test assumes that we only have one database in the system. I had more 
> than one and it caused the test to fail. I have changed that check.






[jira] [Commented] (SPARK-25274) Improve `toPandas` with Arrow by sending out-of-order record batches

2018-08-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596860#comment-16596860
 ] 

Bryan Cutler commented on SPARK-25274:
--

This is a follow-up to SPARK-23030 that is now possible since the Arrow stream 
format is being used.

> Improve `toPandas` with Arrow by sending out-of-order record batches
> 
>
> Key: SPARK-25274
> URL: https://issues.apache.org/jira/browse/SPARK-25274
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Priority: Major
>
> When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
> JVM out-of-order must be buffered before they can be sent to Python. This 
> causes an excess of memory to be used in the driver JVM and increases the 
> time it takes to complete because data must sit in the JVM waiting for 
> preceding partitions to come in.
> This can be improved by sending out-of-order partitions to Python as soon as 
> they arrive in the JVM, followed by a list of indices so that Python can 
> assemble the data in the correct order. This way, data is not buffered at the 
> JVM and there is no waiting on particular partitions so performance will be 
> increased.






[jira] [Issue Comment Deleted] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25272:
-
Comment: was deleted

(was: This is a followup that is possible now that Arrow streaming format is 
being used.)

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596859#comment-16596859
 ] 

Bryan Cutler commented on SPARK-25272:
--

This is a followup that is possible now that Arrow streaming format is being 
used.

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Created] (SPARK-25274) Improve `toPandas` with Arrow by sending out-of-order record batches

2018-08-29 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-25274:


 Summary: Improve `toPandas` with Arrow by sending out-of-order 
record batches
 Key: SPARK-25274
 URL: https://issues.apache.org/jira/browse/SPARK-25274
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 2.4.0
Reporter: Bryan Cutler


When executing {{toPandas}} with Arrow enabled, partitions that arrive in the 
JVM out-of-order must be buffered before they can be sent to Python. This 
causes an excess of memory to be used in the driver JVM and increases the time 
it takes to complete because data must sit in the JVM waiting for preceding 
partitions to come in.

This can be improved by sending out-of-order partitions to Python as soon as 
they arrive in the JVM, followed by a list of indices so that Python can 
assemble the data in the correct order. This way, data is not buffered at the 
JVM and there is no waiting on particular partitions so performance will be 
increased.
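For illustration, a minimal Scala sketch of the JVM-side idea (the Batch type and the send callback are assumed names, not Spark's implementation): forward each record batch as soon as it arrives and send the arrival-order index list at the end, so the receiver can restore partition order without any buffering in the JVM.

{code:java}
case class Batch(partitionIndex: Int, payload: Array[Byte])

def streamOutOfOrder(arrivals: Iterator[Batch])(send: Array[Byte] => Unit): Seq[Int] = {
  val arrivalOrder = scala.collection.mutable.ArrayBuffer.empty[Int]
  arrivals.foreach { batch =>
    send(batch.payload)               // forward immediately, no waiting for preceding partitions
    arrivalOrder += batch.partitionIndex
  }
  arrivalOrder                        // sent last; the receiver places payload i at position arrivalOrder(i)
}
{code}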






[jira] [Updated] (SPARK-24005) Remove usage of Scala’s parallel collection

2018-08-29 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-24005:

Labels:   (was: starter)

> Remove usage of Scala’s parallel collection
> ---
>
> Key: SPARK-24005
> URL: https://issues.apache.org/jira/browse/SPARK-24005
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> {noformat}
> val par = (1 to 100).par.flatMap { i =>
>   Thread.sleep(1000)
>   1 to 1000
> }.toSeq
> {noformat}
> We are unable to interrupt the execution of parallel collections. We need to 
> create a common utility function for this, instead of using Scala parallel 
> collections.
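For illustration, a minimal Scala sketch of the kind of utility the ticket asks for (not Spark's actual code): running the work on an ExecutorService instead of .par, so that the remaining tasks can be interrupted by shutting the pool down.

{code:java}
import java.util.concurrent.{Callable, Executors, TimeUnit}
import scala.collection.JavaConverters._

def interruptibleParMap[A, B](items: Seq[A], parallelism: Int)(f: A => B): Seq[B] = {
  val pool = Executors.newFixedThreadPool(parallelism)
  try {
    val tasks = items.map(a => new Callable[B] { override def call(): B = f(a) })
    pool.invokeAll(tasks.asJava).asScala.map(_.get())   // blocks until all tasks finish
  } finally {
    pool.shutdownNow()                                  // interrupts anything still running
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}
{code}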






[jira] [Updated] (SPARK-25273) How to install testthat v1.0.2

2018-08-29 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-25273:
---
Summary: How to install testthat v1.0.2  (was: How to install testthat = 
1.0.2)

> How to install testthat v1.0.2
> --
>
> Key: SPARK-25273
> URL: https://issues.apache.org/jira/browse/SPARK-25273
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> The command installs testthat v2.0.x:
> {code:R}
> R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
> 'survival'), repos='http://cran.us.r-project.org')"
> {code}
> which prevents running the R tests. Need to update the section 
> http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
> according to https://github.com/apache/spark/pull/20003






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596850#comment-16596850
 ] 

Bryan Cutler commented on SPARK-25272:
--

Yeah, that would be great to make it easier to view the Python test output in 
Jenkins.

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596844#comment-16596844
 ] 

Apache Spark commented on SPARK-25272:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/22273

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Assigned] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25272:


Assignee: Apache Spark  (was: Bryan Cutler)

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Assigned] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25272:


Assignee: Bryan Cutler  (was: Apache Spark)

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.






[jira] [Assigned] (SPARK-25273) How to install testthat = 1.0.2

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25273:


Assignee: (was: Apache Spark)

> How to install testthat = 1.0.2
> ---
>
> Key: SPARK-25273
> URL: https://issues.apache.org/jira/browse/SPARK-25273
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> The command installs testthat v2.0.x:
> {code:R}
> R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
> 'survival'), repos='http://cran.us.r-project.org')"
> {code}
> which prevents running the R tests. Need to update the section 
> http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
> according to https://github.com/apache/spark/pull/20003






[jira] [Commented] (SPARK-25273) How to install testthat = 1.0.2

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596842#comment-16596842
 ] 

Apache Spark commented on SPARK-25273:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22272

> How to install testthat = 1.0.2
> ---
>
> Key: SPARK-25273
> URL: https://issues.apache.org/jira/browse/SPARK-25273
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Major
>
> The command installs testthat v2.0.x:
> {code:R}
> R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
> 'survival'), repos='http://cran.us.r-project.org')"
> {code}
> which prevents running the R tests. Need to update the section 
> http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
> according to https://github.com/apache/spark/pull/20003



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25273) How to install testthat = 1.0.2

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25273:


Assignee: Apache Spark

> How to install testthat = 1.0.2
> ---
>
> Key: SPARK-25273
> URL: https://issues.apache.org/jira/browse/SPARK-25273
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> The command installs testthat v2.0.x:
> {code:R}
> R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
> 'survival'), repos='http://cran.us.r-project.org')"
> {code}
> which prevents running the R tests. Need to update the section 
> http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
> according to https://github.com/apache/spark/pull/20003



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25273) How to install testthat = 1.0.2

2018-08-29 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-25273:
--

 Summary: How to install testthat = 1.0.2
 Key: SPARK-25273
 URL: https://issues.apache.org/jira/browse/SPARK-25273
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.3.1
Reporter: Maxim Gekk


The command installs testthat v2.0.x:
{code:R}
R -e "install.packages(c('knitr', 'rmarkdown', 'testthat', 'e1071', 
'survival'), repos='http://cran.us.r-project.org')"
{code}
which prevents running the R tests. Need to update the section 
http://spark.apache.org/docs/latest/building-spark.html#running-r-tests 
according to https://github.com/apache/spark/pull/20003



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596836#comment-16596836
 ] 

Imran Rashid commented on SPARK-25272:
--

It would also be great if we could get the PySpark test output into the same 
structure Jenkins expects.  I'm thinking this would be possible with some 
restructuring of the XML output that Python is generating, so that Jenkins can 
pick it up?

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-25272:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-22216

> Show some kind of test output to indicate pyarrow tests were run
> 
>
> Key: SPARK-25272
> URL: https://issues.apache.org/jira/browse/SPARK-25272
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Tests
>Affects Versions: 2.4.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Right now tests only output status when they are skipped and there is no way 
> to really see from the logs that pyarrow tests, like ArrowTests, have been 
> run except by the absence of a skipped message.  We can add a test that is 
> skipped if pyarrow is installed, which will give an output in our Jenkins 
> test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25272) Show some kind of test output to indicate pyarrow tests were run

2018-08-29 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-25272:


 Summary: Show some kind of test output to indicate pyarrow tests 
were run
 Key: SPARK-25272
 URL: https://issues.apache.org/jira/browse/SPARK-25272
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Tests
Affects Versions: 2.4.0
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Right now tests only output status when they are skipped and there is no way to 
really see from the logs that pyarrow tests, like ArrowTests, have been run 
except by the absence of a skipped message.  We can add a test that is skipped 
if pyarrow is installed, which will give an output in our Jenkins test.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25268:


Assignee: Apache Spark

> runParallelPersonalizedPageRank throws serialization Exception
> --
>
> Key: SPARK-25268
> URL: https://issues.apache.org/jira/browse/SPARK-25268
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>Priority: Critical
>
> A recent change to PageRank introduced a bug in the 
> ParallelPersonalizedPageRank implementation. The change prevents 
> serialization of a Map which needs to be broadcast to all workers. The issue 
> is in this line here: 
> [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]
> Because the GraphX unit tests are run in local mode, the serialization issue 
> is not caught.
>  
> {code:java}
> [info] - Star example parallel personalized PageRank *** FAILED *** (2 
> seconds, 160 milliseconds)
> [info] java.io.NotSerializableException: 
> scala.collection.immutable.MapLike$$anon$2
> [info] Serialization stack:
> [info] - object not serializable (class: 
> scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
> SparseVector(3)((2,1.0
> [info] at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> [info] at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> [info] at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
> [info] at 
> org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
> [info] at 
> org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
> [info] at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info] at 

[jira] [Commented] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596676#comment-16596676
 ] 

Apache Spark commented on SPARK-25268:
--

User 'shahidki31' has created a pull request for this issue:
https://github.com/apache/spark/pull/22271
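
For readers hitting the same error: `scala.collection.immutable.MapLike$$anon$2` 
is the lazy view returned by `Map.mapValues`, which is not java-serializable in 
Scala 2.11/2.12. A minimal sketch of the pattern and one possible workaround 
(materialize the view before broadcasting) follows; the variable names are 
illustrative, not the actual PageRank code.

{code:scala}
// Run in spark-shell; `sc` is the SparkContext provided by the shell.
val base = Map(1L -> 1.0, 2L -> 2.0, 3L -> 3.0)

// Map.mapValues returns a lazy view (scala.collection.immutable.MapLike$$anon$2)
// that is not serializable, so broadcasting it fails:
val lazyView = base.mapValues(_ * 2)
// sc.broadcast(lazyView)           // throws java.io.NotSerializableException

// Forcing the view into a plain immutable Map makes it broadcastable:
val materialized = base.mapValues(_ * 2).map(identity)
val bc = sc.broadcast(materialized)
bc.value                            // Map(1 -> 2.0, 2 -> 4.0, 3 -> 6.0)
{code}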

> runParallelPersonalizedPageRank throws serialization Exception
> --
>
> Key: SPARK-25268
> URL: https://issues.apache.org/jira/browse/SPARK-25268
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A recent change to PageRank introduced a bug in the 
> ParallelPersonalizedPageRank implementation. The change prevents 
> serialization of a Map which needs to be broadcast to all workers. The issue 
> is in this line here: 
> [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]
> Because the GraphX unit tests are run in local mode, the serialization issue 
> is not caught.
>  
> {code:java}
> [info] - Star example parallel personalized PageRank *** FAILED *** (2 
> seconds, 160 milliseconds)
> [info] java.io.NotSerializableException: 
> scala.collection.immutable.MapLike$$anon$2
> [info] Serialization stack:
> [info] - object not serializable (class: 
> scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
> SparseVector(3)((2,1.0
> [info] at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> [info] at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> [info] at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
> [info] at 
> org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
> [info] at 
> org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
> [info] at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info] at 
> 

[jira] [Assigned] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25268:


Assignee: (was: Apache Spark)

> runParallelPersonalizedPageRank throws serialization Exception
> --
>
> Key: SPARK-25268
> URL: https://issues.apache.org/jira/browse/SPARK-25268
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A recent change to PageRank introduced a bug in the 
> ParallelPersonalizedPageRank implementation. The change prevents 
> serialization of a Map which needs to be broadcast to all workers. The issue 
> is in this line here: 
> [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]
> Because the GraphX unit tests are run in local mode, the serialization issue 
> is not caught.
>  
> {code:java}
> [info] - Star example parallel personalized PageRank *** FAILED *** (2 
> seconds, 160 milliseconds)
> [info] java.io.NotSerializableException: 
> scala.collection.immutable.MapLike$$anon$2
> [info] Serialization stack:
> [info] - object not serializable (class: 
> scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
> SparseVector(3)((2,1.0
> [info] at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> [info] at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> [info] at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
> [info] at 
> org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
> [info] at 
> org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
> [info] at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info] at scala.collection.immutable.List.foreach(List.scala:383)
> [info] at 

[jira] [Commented] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596672#comment-16596672
 ] 

Apache Spark commented on SPARK-25267:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/22270

> Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
> -
>
> Key: SPARK-25267
> URL: https://issues.apache.org/jira/browse/SPARK-25267
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> In SharedSparkSession and TestHive, we need to disable the rule 
> ConvertToLocalRelation for better test case coverage. To exclude the rules, 
> we can use the SQLConf `spark.sql.optimizer.excludedRules`
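
A minimal sketch of what that exclusion looks like for a single session (the 
session setup is illustrative; the conf key is the one named above and the rule 
name is assumed to be its fully qualified class name):

{code:scala}
// Exclude ConvertToLocalRelation so queries exercise the full physical plan.
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
  .getOrCreate()

// Or toggle it at runtime on an existing session:
spark.conf.set("spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")
{code}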



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25267:


Assignee: Apache Spark

> Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
> -
>
> Key: SPARK-25267
> URL: https://issues.apache.org/jira/browse/SPARK-25267
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> In SharedSparkSession and TestHive, we need to disable the rule 
> ConvertToLocalRelation for better test case coverage. To exclude the rules, 
> we can use the SQLConf `spark.sql.optimizer.excludedRules`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25267) Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25267:


Assignee: (was: Apache Spark)

> Disable ConvertToLocalRelation in the test cases of sql/core and sql/hive
> -
>
> Key: SPARK-25267
> URL: https://issues.apache.org/jira/browse/SPARK-25267
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> In SharedSparkSession and TestHive, we need to disable the rule 
> ConvertToLocalRelation for better test case coverage. To exclude the rules, 
> we can use the SQLConf `spark.sql.optimizer.excludedRules`



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25180) Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname fails

2018-08-29 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596566#comment-16596566
 ] 

Steve Loughran commented on SPARK-25180:


Reviewing a bit more, I think the root cause was:

* standalone came up with the laptop on network AP 1, with IP address 1
* the laptop moved to a different room, a different AP and a different IP address
* so the old address no longer worked.

Assumption: the standalone service is coming up on the external address, not the 
loopback, and the shell can't reach it once that external address moves.

I'd close that as a WONTFIX "don't do that", though I worry there's a security 
implication: if things really are coming up on the public IP address and the 
ports aren't locked down, you've just granted malicious callers on the same 
network the ability to run anything they want on your system.
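
For anyone bitten by this on a laptop, a hedged workaround sketch: pin the 
driver side to loopback so a change of access point / external IP doesn't 
matter. The property keys below are standard Spark confs; the values are 
illustrative and assume the standalone master was also started on localhost.

{code:scala}
// Keep driver traffic on loopback for a single-laptop standalone setup.
val spark = org.apache.spark.sql.SparkSession.builder()
  .master("spark://localhost:7077")            // master started with -h localhost
  .config("spark.driver.host", "localhost")
  .config("spark.driver.bindAddress", "127.0.0.1")
  .appName("laptop-standalone")
  .getOrCreate()
{code}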

> Spark standalone failure in Utils.doFetchFile() if nslookup of local hostname 
> fails
> ---
>
> Key: SPARK-25180
> URL: https://issues.apache.org/jira/browse/SPARK-25180
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: mac laptop running on a corporate guest wifi, presumably 
> a wifi with odd DNS settings.
>Reporter: Steve Loughran
>Priority: Minor
>
> trying to save work on spark standalone can fail if netty RPC cannot 
> determine the hostname. While that's a valid failure on a real cluster, in 
> standalone falling back to localhost rather than inferred "hw13176.lan" value 
> may be the better option.
> note also, the abort* call failed; NPE.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25246) When the spark.eventLog.compress is enabled, the Application is not showing in the History server UI ('incomplete application' page), initially.

2018-08-29 Thread shahid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shahid updated SPARK-25246:
---
Comment: was deleted

(was: I am working on it :))

> When the spark.eventLog.compress is enabled, the Application is not showing 
> in the History server UI ('incomplete application' page), initially.
> 
>
> Key: SPARK-25246
> URL: https://issues.apache.org/jira/browse/SPARK-25246
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: shahid
>Priority: Major
>
> 1) bin/spark-shell --master yarn --conf "spark.eventLog.compress=true" 
> 2) hdfs dfs -ls /spark-logs 
> {code:java}
> -rwxrwx---   1 root supergroup  *0* 2018-08-27 03:26 
> /spark-logs/application_1535313809919_0005.lz4.inprogress
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23997) Configurable max number of buckets

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596525#comment-16596525
 ] 

Apache Spark commented on SPARK-23997:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/22269

> Configurable max number of buckets
> --
>
> Key: SPARK-23997
> URL: https://issues.apache.org/jira/browse/SPARK-23997
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Fernando Pereira
>Assignee: Fernando Pereira
>Priority: Major
> Fix For: 2.4.0
>
>
> When exporting data as a table the user can choose to split data in buckets 
> by choosing the columns and the number of buckets. Currently there is a 
> hard-coded limit of 99'999 buckets.
> However, for heavy workloads this limit might be too restrictive, a situation 
> that will eventually become more common as workloads grow.
> As per the comments in SPARK-19618 this limit could be made configurable.
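
For context, the limit is reached through the DataFrameWriter bucketing API; a 
minimal sketch (table and column names are made up) of the kind of write that 
runs into it:

{code:scala}
// A bucketed write that trips the hard-coded validation: 200000 > 99999 buckets.
val events = spark.range(0, 10000000L).selectExpr("id", "id % 1000000 AS user_id")

events.write
  .bucketBy(200000, "user_id")      // currently rejected by the hard-coded limit
  .sortBy("user_id")
  .saveAsTable("events_bucketed")   // the PR above makes this cap configurable
{code}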



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25269) SQL interface support specify StorageLevel when cache table

2018-08-29 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25269:

Comment: was deleted

(was: I'm working on.)

> SQL interface support specify StorageLevel when cache table
> ---
>
> Key: SPARK-25269
> URL: https://issues.apache.org/jira/browse/SPARK-25269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CACHE MEMORY_ONLY TABLE testData;
> {code}
> Supported {{StorageLevel}} should be:
> https://github.com/apache/spark/blob/eefdf9f9dd8afde49ad7d4e230e2735eb817ab0a/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L153-L164
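
Until such syntax exists, the storage level can only be chosen through the 
Dataset API; a minimal sketch of today's equivalent (table name is illustrative):

{code:scala}
import org.apache.spark.storage.StorageLevel

// Cache a table MEMORY_ONLY via the Dataset API, which is what the proposed
// CACHE MEMORY_ONLY TABLE syntax would expose in SQL.
spark.table("testData").persist(StorageLevel.MEMORY_ONLY)
spark.table("testData").count()     // materializes the cache

// The plain SQL path currently always uses the default level (MEMORY_AND_DISK):
spark.sql("CACHE TABLE testData")
{code}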



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25253) Refactor pyspark connection & authentication

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596409#comment-16596409
 ] 

Apache Spark commented on SPARK-25253:
--

User 'cclauss' has created a pull request for this issue:
https://github.com/apache/spark/pull/22265

> Refactor pyspark connection & authentication
> 
>
> Key: SPARK-25253
> URL: https://issues.apache.org/jira/browse/SPARK-25253
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.4.0
>
>
> We've got a few places in pyspark that connect to local sockets, with varying 
> levels of ipv6 handling, graceful error handling, and lots of copy-and-paste. 
>  should be pretty easy to clean this up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2018-08-29 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596404#comment-16596404
 ] 

Steve Loughran commented on SPARK-6305:
---

bq. Could be possible that nobody is swapping it out for JUL since JUL is 
terrible.

I don't disagree, though if you get into the depths of Kerberos debugging, you 
end up there.

Regarding usable back-ends to SLF4J, there's Logback, which is performant at the 
expense of being async (pro: reduced impact of logging in locked areas; con: a 
process crash can lose the last logged bits).

Is the logging control in Spark done only in tests, or on the production side of 
things? If it's only test-side, then log4j-core can go on the classpath there.



> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25266) Fix memory leak in Barrier Execution Mode

2018-08-29 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25266.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22258
[https://github.com/apache/spark/pull/22258]

> Fix memory leak in Barrier Execution Mode
> -
>
> Key: SPARK-25266
> URL: https://issues.apache.org/jira/browse/SPARK-25266
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
> Fix For: 2.4.0
>
>
> BarrierCoordinator uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  
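
A minimal sketch of the leak pattern in plain java.util.Timer terms (not the 
actual BarrierCoordinator code):

{code:scala}
import java.util.{Timer, TimerTask}

// Cancelling a TimerTask does not release the Timer's reference to it; the task
// object stays in the timer's internal queue until purge() (or expiry) drops it.
val timer = new Timer("barrier-epoch-timer", true)

val task = new TimerTask {
  override def run(): Unit = println("timed out")
}
timer.schedule(task, 60000L)   // delay in ms

task.cancel()                  // marks the task cancelled, queue still holds it
timer.purge()                  // actually removes cancelled tasks from the queue
{code}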



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-29 Thread Mateusz Jukiewicz (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596366#comment-16596366
 ] 

Mateusz Jukiewicz commented on SPARK-23298:
---

[~jansi]

I can see two issues in your code.
 # If you want to fully reproduce my example, please save the generated 
dataframe first and then read it back from disk. When you work on it directly, 
rand() generates different numbers every time (which is expected), which might 
yield non-deterministic results.
 # I believe dropDuplicates() works in a non-deterministic fashion, which is 
expected. You will end up with unique values in col1, but you don't know which 
row will remain; therefore, different values may appear in col2 in every run 
(meaning different numbers of distinct values), which can cause the effect 
you're observing.

Example with an exemplary dataframe:
{code:java}
df
--
c1, c2
--
1 , 2
1 , 3
1 , 3
2 , 2
2 , 3

Run1:
--
df.dropDuplicates("c1") >
--
c1, c2
--
1 , 2
2 , 2

.dropDuplicates("c2") >
--
c1, c2
--
1 , 2
.count = 1

Run2:
--
df.dropDuplicates("c1") >
--
c1, c2
--
1 , 2
2 , 3

.dropDuplicates("c2") >
--
c1, c2
--
1 , 2
2 , 3
.count = 2{code}
As far as I know, behaviour like that is totally expected.
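
One way to make this particular repro deterministic, purely as an illustration 
(assuming `df` is the dataframe read back from parquet): pick the surviving row 
explicitly instead of relying on which row dropDuplicates happens to keep.

{code:scala}
import org.apache.spark.sql.functions.min

// Keep the minimum col2 per col1 rather than an arbitrary row, so repeated
// counts agree for a fixed input.
val dedup = df.groupBy("col1").agg(min("col2").as("col2"))
dedup.dropDuplicates("col2").count()   // stable across runs
{code}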

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, correctness
>
> This is what happens (EDIT - managed to get a reproducible example):
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> // The spark.sql.shuffle.partitions is 2154 here, if that matters
> */
> val df = spark.range(1000).withColumn("col1", (rand() * 
> 1000).cast("long")).withColumn("col2", (rand() * 
> 1000).cast("long")).drop("id")
> df.repartition(5240).write.parquet("/test.parquet")
> // Then, ideally in a new session
> val df = spark.read.parquet("/test.parquet")
> df.distinct.count
> // res1: Long = 1001256                                                       
>      
> df.distinct.count
> // res2: Long = 55   {code}
> -The _text_dataset.out_ file is a dataset with one string per line. The 
> string has alphanumeric characters as well as colons and spaces. The line 
> length does not exceed 1200. I don't think that's important though, as the 
> issue appeared on various other datasets, I just tried to narrow it down to 
> the simplest possible case.- (the case is now fully reproducible with the 
> above code)
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> read.textFile).
>  * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
>  * Not a single container has failed in those multiple invalid executions.
>  * YARN doesn't show any warnings or errors in those invalid executions.
>  * The execution plan determined for both valid and invalid executions was 
> always the same (it's shown in the _SQL_ tab of the UI).
>  * The number returned in the invalid executions was always greater than the 
> correct number (24 014 227).
>  * This occurs even though the input is already completely deduplicated (i.e. 
> _distinct.count_ shouldn't change anything).
>  * The input isn't replicated (i.e. there's only one copy of each file block 
> on the HDFS).
>  * The problem is probably 

[jira] [Comment Edited] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-29 Thread Janne Jaanila (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596324#comment-16596324
 ] 

Janne Jaanila edited comment on SPARK-23298 at 8/29/18 1:27 PM:


I have faced a similar problem with multiple dropDuplicates() calls. I was able 
to reproduce it with Mateusz's dataframe in spark-shell:
{code:java}
./spark-shell --master yarn --deploy-mode client

scala> val df = spark.range(1000).withColumn("col1", (rand() * 
1000).cast("long")).withColumn("col2", (rand() * 1000).cast("long")).drop("id")
df: org.apache.spark.sql.DataFrame = [col1: bigint, col2: bigint]
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res77: Long = 639 
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res78: Long = 638
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res79: Long = 618
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res80: Long = 635
{code}
Count with a single dropDuplicates call works as expected.

I'm using Spark 2.3.1. We recently upgraded from 2.0.2 because of this issue.


was (Author: jansi):
I have faced a similar problem with multiple dropDuplicates() calls. I was able 
to reproduce it with Mateusz's dataframe in spark-shell:
{code:java}
./spark-shell --master yarn --deploy-mode client

scala> val df = spark.range(1000).withColumn("col1", (rand() * 
1000).cast("long")).withColumn("col2", (rand() * 1000).cast("long")).drop("id")
df: org.apache.spark.sql.DataFrame = [col1: bigint, col2: bigint]
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res77: Long = 639 
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res78: Long = 638
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res79: Long = 618
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res80: Long = 635
{code}
Count with a single dropDuplicates call works as expected.

I'm using Spark 2.3.1. We recently upgraded from 2.0.2 because of this issue.

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, correctness
>
> This is what happens (EDIT - managed to get a reproducible example):
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> // The spark.sql.shuffle.partitions is 2154 here, if that matters
> */
> val df = spark.range(1000).withColumn("col1", (rand() * 
> 1000).cast("long")).withColumn("col2", (rand() * 
> 1000).cast("long")).drop("id")
> df.repartition(5240).write.parquet("/test.parquet")
> // Then, ideally in a new session
> val df = spark.read.parquet("/test.parquet")
> df.distinct.count
> // res1: Long = 1001256                                                       
>      
> df.distinct.count
> // res2: Long = 55   {code}
> -The _text_dataset.out_ file is a dataset with one string per line. The 
> string has alphanumeric characters as well as colons and spaces. The line 
> length does not exceed 1200. I don't think that's important though, as the 
> issue appeared on various other datasets, I just tried to narrow it down to 
> the simplest possible case.- (the case is now fully reproducible with the 
> above code)
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> 

[jira] [Comment Edited] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-29 Thread Janne Jaanila (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596324#comment-16596324
 ] 

Janne Jaanila edited comment on SPARK-23298 at 8/29/18 1:26 PM:


I have faced a similar problem with multiple dropDuplicates() calls. I was able 
to reproduce it with Masteusz's dataframe in spark-shell:
{code:java}
./spark-shell --master yarn --deploy-mode client

scala> val df = spark.range(1000).withColumn("col1", (rand() * 
1000).cast("long")).withColumn("col2", (rand() * 1000).cast("long")).drop("id")
df: org.apache.spark.sql.DataFrame = [col1: bigint, col2: bigint]
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res77: Long = 639 
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res78: Long = 638
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res79: Long = 618
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res80: Long = 635
{code}
Count with a single dropDuplicates call works as expected.

I'm using Spark 2.3.1. We recently upgraded from 2.0.2 because of this issue.


was (Author: jansi):
I have faced a similar problem with multiple dropDuplicates() calls. I was able 
to reproduce it with Mateusz's dataframe in spark-shell:

 
{code:java}
./spark-shell --master yarn --deploy-mode client

scala> val df = spark.range(1000).withColumn("col1", (rand() * 
1000).cast("long")).withColumn("col2", (rand() * 1000).cast("long")).drop("id")
df: org.apache.spark.sql.DataFrame = [col1: bigint, col2: bigint]
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res77: Long = 639 
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res78: Long = 638
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res79: Long = 618
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res80: Long = 635
{code}
 

Count with a single dropDuplicates call works as expected.

I'm using Spark 2.3.1. We recently upgraded from 2.0.2 because of this issue.

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, correctness
>
> This is what happens (EDIT - managed to get a reproducible example):
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> // The spark.sql.shuffle.partitions is 2154 here, if that matters
> */
> val df = spark.range(1000).withColumn("col1", (rand() * 
> 1000).cast("long")).withColumn("col2", (rand() * 
> 1000).cast("long")).drop("id")
> df.repartition(5240).write.parquet("/test.parquet")
> // Then, ideally in a new session
> val df = spark.read.parquet("/test.parquet")
> df.distinct.count
> // res1: Long = 1001256                                                       
>      
> df.distinct.count
> // res2: Long = 55   {code}
> -The _text_dataset.out_ file is a dataset with one string per line. The 
> string has alphanumeric characters as well as colons and spaces. The line 
> length does not exceed 1200. I don't think that's important though, as the 
> issue appeared on various other datasets, I just tried to narrow it down to 
> the simplest possible case.- (the case is now fully reproducible with the 
> above code)
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> 

[jira] [Commented] (SPARK-23298) distinct.count on Dataset/DataFrame yields non-deterministic results

2018-08-29 Thread Janne Jaanila (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596324#comment-16596324
 ] 

Janne Jaanila commented on SPARK-23298:
---

I have faced a similar problem with multiple dropDuplicates() calls. I was able 
to reproduce it with Mateusz's dataframe in spark-shell:

 
{code:java}
./spark-shell --master yarn --deploy-mode client

scala> val df = spark.range(1000).withColumn("col1", (rand() * 
1000).cast("long")).withColumn("col2", (rand() * 1000).cast("long")).drop("id")
df: org.apache.spark.sql.DataFrame = [col1: bigint, col2: bigint]
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res77: Long = 639 
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res78: Long = 638
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res79: Long = 618
scala> df.dropDuplicates("col1").dropDuplicates("col2").count()
res80: Long = 635
{code}
 

Count with a single dropDuplicates call works as expected.

I'm using Spark 2.3.1. We recently upgraded from 2.0.2 because of this issue.

> distinct.count on Dataset/DataFrame yields non-deterministic results
> 
>
> Key: SPARK-23298
> URL: https://issues.apache.org/jira/browse/SPARK-23298
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, SQL, YARN
>Affects Versions: 2.1.0, 2.2.0
> Environment: Spark 2.2.0 or 2.1.0
> Java 1.8.0_144
> Yarn version:
> {code:java}
> Hadoop 2.6.0-cdh5.12.1
> Subversion http://github.com/cloudera/hadoop -r 
> 520d8b072e666e9f21d645ca6a5219fc37535a52
> Compiled by jenkins on 2017-08-24T16:43Z
> Compiled with protoc 2.5.0
> From source with checksum de51bf9693ab9426379a1cd28142cea0
> This command was run using 
> /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.12.1.jar{code}
>  
>  
>Reporter: Mateusz Jukiewicz
>Priority: Major
>  Labels: Correctness, CorrectnessBug, correctness
>
> This is what happens (EDIT - managed to get a reproducible example):
> {code:java}
> /* Exemplary spark-shell starting command 
> /opt/spark/bin/spark-shell \
> --num-executors 269 \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.kryoserializer.buffer.max=512m 
> // The spark.sql.shuffle.partitions is 2154 here, if that matters
> */
> val df = spark.range(1000).withColumn("col1", (rand() * 
> 1000).cast("long")).withColumn("col2", (rand() * 
> 1000).cast("long")).drop("id")
> df.repartition(5240).write.parquet("/test.parquet")
> // Then, ideally in a new session
> val df = spark.read.parquet("/test.parquet")
> df.distinct.count
> // res1: Long = 1001256                                                       
>      
> df.distinct.count
> // res2: Long = 55   {code}
> -The _text_dataset.out_ file is a dataset with one string per line. The 
> string has alphanumeric characters as well as colons and spaces. The line 
> length does not exceed 1200. I don't think that's important though, as the 
> issue appeared on various other datasets, I just tried to narrow it down to 
> the simplest possible case.- (the case is now fully reproducible with the 
> above code)
> The observations regarding the issue are as follows:
>  * I managed to reproduce it on both spark 2.2 and spark 2.1.
>  * The issue occurs in YARN cluster mode (I haven't tested YARN client mode).
>  * The issue is not reproducible on a single machine (e.g. laptop) in spark 
> local mode.
>  * It seems that once the correct count is computed, it is not possible to 
> reproduce the issue in the same spark session. In other words, I was able to 
> get 2-3 incorrect distinct.count results consecutively, but once it got 
> right, it always returned the correct value. I had to re-run spark-shell to 
> observe the problem again.
>  * The issue appears on both Dataset and DataFrame (i.e. using read.text or 
> read.textFile).
>  * The issue is not reproducible on RDD (i.e. dataset.rdd.distinct.count).
>  * Not a single container has failed in those multiple invalid executions.
>  * YARN doesn't show any warnings or errors in those invalid executions.
>  * The execution plan determined for both valid and invalid executions was 
> always the same (it's shown in the _SQL_ tab of the UI).
>  * The number returned in the invalid executions was always greater than the 
> correct number (24 014 227).
>  * This occurs even though the input is already completely deduplicated (i.e. 
> _distinct.count_ shouldn't change anything).
>  * The input isn't replicated (i.e. there's only one copy of each file block 
> on the HDFS).
>  * The problem is probably not related to reading from HDFS. Spark was always 
> able to correctly read all input records (which was shown in the UI), and 
> that number got malformed after the exchange phase:
>  ** correct execution:
>  Input 

[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-08-29 Thread shivusondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596300#comment-16596300
 ] 

shivusondur commented on SPARK-25271:
-

cc [~cloud_fan]

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: shivusondur
>Priority: Major
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result :* Throwing exception (Working fine with spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
>   ... 21 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-08-29 Thread shivusondur (JIRA)
shivusondur created SPARK-25271:
---

 Summary: Creating parquet table with all the column null throws 
exception
 Key: SPARK-25271
 URL: https://issues.apache.org/jira/browse/SPARK-25271
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: shivusondur



{code:java}
 1)cat /data/parquet.dat

1$abc2$pqr:3$xyz
null{code}
 


{code:java}
2)spark.sql("create table vp_reader_temp (projects map) ROW FORMAT 
DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' MAP KEYS 
TERMINATED BY '$'")
{code}

{code:java}
3)spark.sql("
LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
{code}

{code:java}
4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
vp_reader_temp")
{code}


*Result:* Throws an exception (works fine with Spark 2.2.1)

{code:java}
java.lang.RuntimeException: Parquet record is malformed: empty fields are 
illegal, the field should be ommited completely instead
at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
at 
org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
at 
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
at 
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
illegal, the field should be ommited completely instead
at 
org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
at 
org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
... 21 more
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25217) Error thrown when creating BlockMatrix

2018-08-29 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596250#comment-16596250
 ] 

Liang-Chi Hsieh commented on SPARK-25217:
-

I think you are mixing {{pyspark.ml.linalg.Matrix}} and {{Matrices}} with 
{{pyspark.mllib.linalg.distributed.BlockMatrix}}.

If you use {{Matrix}} and {{Matrices}} from {{pyspark.mllib.linalg}}, it works 
without the error.
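
For example, a minimal sketch (not taken from the docs; it assumes an existing 
SparkContext {{sc}}):

{code:python}
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Build the blocks from the mllib (not ml) linalg types.
dm = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 1, 2], [7, 11, 12])

blocks = sc.parallelize([((0, 0), sm), ((1, 0), dm)])
mat = BlockMatrix(blocks, 3, 2)
print(mat.numRows(), mat.numCols())  # 6 2
{code}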

> Error thrown when creating BlockMatrix
> --
>
> Key: SPARK-25217
> URL: https://issues.apache.org/jira/browse/SPARK-25217
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cs5090237
>Priority: Major
>
> dm1 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])
> dm2 = Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12])
> sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 1, 2], [7, 11, 12])
> blocks1 = sc.parallelize([((0, 0), dm1)])
> sm_ = Matrix(3,2,sm)
> blocks2 = sc.parallelize([((0, 0), sm), ((1, 0), sm)])
> blocks3 = sc.parallelize([((0, 0), sm), ((1, 0), dm2)])
> mat2 = BlockMatrix(blocks2, 3, 2)
> mat3 = BlockMatrix(blocks3, 3, 2)
>  
> *Running above sample code in Pyspark from documentation raises following 
> error:* 
>  
> An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in 
> stage 53.0 failed 4 times, most recent failure: Lost task 14.3 in stage 53.0 
> (TID 1081, , executor 15): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last): File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 230, 
> in main process() File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/worker.py", line 225, 
> in process serializer.dump_stream(func(split_index, iterator), outfile) File 
> "/mnt/yarn/usercache/livy/appcache/application_1535051034290_0001/container_1535051034290_0001_01_23/pyspark.zip/pyspark/serializers.py",
>  line 372, in dump_stream vs = list(itertools.islice(iterator, batch)) File 
> "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1371, in 
> takeUpToNumLeft File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/util.py", line 55, in 
> wrapper return f(*args, **kwargs) File 
> "/mnt/yarn/usercache/livy/appcache//pyspark.zip/pyspark/mllib/linalg/distributed.py",
>  line 975, in _convert_to_matrix_block_tuple raise TypeError("Cannot convert 
> type %s into a sub-matrix block tuple" % type(block)) TypeError: Cannot 
> convert type  into a sub-matrix block tuple
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25270) lint-python: Add flake8 to find syntax errors and undefined names

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25270:


Assignee: Apache Spark

> lint-python: Add flake8 to find syntax errors and undefined names
> -
>
> Key: SPARK-25270
> URL: https://issues.apache.org/jira/browse/SPARK-25270
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cclauss
>Assignee: Apache Spark
>Priority: Minor
>
> Flake8 has been a useful tool for finding and fixing undefined names in 
> Python code.  See: SPARK-23698  We should add flake8 testing to the 
> lint-python process to automate this testing on all pull requests.  
> https://github.com/apache/spark/pull/22266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25270) lint-python: Add flake8 to find syntax errors and undefined names

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25270:


Assignee: (was: Apache Spark)

> lint-python: Add flake8 to find syntax errors and undefined names
> -
>
> Key: SPARK-25270
> URL: https://issues.apache.org/jira/browse/SPARK-25270
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cclauss
>Priority: Minor
>
> Flake8 has been a useful tool for finding and fixing undefined names in 
> Python code.  See: SPARK-23698  We should add flake8 testing to the 
> lint-python process to automate this testing on all pull requests.  
> https://github.com/apache/spark/pull/22266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25270) lint-python: Add flake8 to find syntax errors and undefined names

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596234#comment-16596234
 ] 

Apache Spark commented on SPARK-25270:
--

User 'cclauss' has created a pull request for this issue:
https://github.com/apache/spark/pull/22266

> lint-python: Add flake8 to find syntax errors and undefined names
> -
>
> Key: SPARK-25270
> URL: https://issues.apache.org/jira/browse/SPARK-25270
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: cclauss
>Priority: Minor
>
> Flake8 has been a useful tool for finding and fixing undefined names in 
> Python code.  See: SPARK-23698  We should add flake8 testing to the 
> lint-python process to automate this testing on all pull requests.  
> https://github.com/apache/spark/pull/22266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25270) lint-python: Add flake8 to find syntax errors and undefined names

2018-08-29 Thread cclauss (JIRA)
cclauss created SPARK-25270:
---

 Summary: lint-python: Add flake8 to find syntax errors and 
undefined names
 Key: SPARK-25270
 URL: https://issues.apache.org/jira/browse/SPARK-25270
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 2.3.1
Reporter: cclauss


Flake8 has been a useful tool for finding and fixing undefined names in Python 
code.  See: SPARK-23698  We should add flake8 testing to the lint-python 
process to automate this testing on all pull requests.  
https://github.com/apache/spark/pull/22266



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25265) Fix memory leak vulnerability in Barrier Execution Mode

2018-08-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596221#comment-16596221
 ] 

Marco Gaido commented on SPARK-25265:
-

Isn't this a duplicate of the next one?

> Fix memory leak vulnerability in Barrier Execution Mode
> ---
>
> Key: SPARK-25265
> URL: https://issues.apache.org/jira/browse/SPARK-25265
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Critical
>
> BarrierCoordinator$ uses Timer and TimerTask. `TimerTask#cancel()` is invoked 
> in ContextBarrierState#cancelTimerTask but `Timer#purge()` is never invoked.
> Once a TimerTask is scheduled, the reference to it is not released until 
> `Timer#purge()` is invoked even though `TimerTask#cancel()` is invoked.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24716) Refactor ParquetFilters

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596176#comment-16596176
 ] 

Apache Spark commented on SPARK-24716:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22267

> Refactor ParquetFilters
> ---
>
> Key: SPARK-24716
> URL: https://issues.apache.org/jira/browse/SPARK-24716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-29 Thread Nikita Poberezkin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596110#comment-16596110
 ] 

Nikita Poberezkin commented on SPARK-25102:
---

Hello, [~zi]
I've tried this approach:

import org.apache.spark.{SPARK_VERSION}

override def getName: String = {
  SPARK_VERSION
}

Spark builds with this code. I also have another idea: add a configuration 
entry to SQLConf, which is already imported into ParquetWriteSupport.
The code would look something like this; below is the config added to SQLConf:

val SPARK_VERSION = buildConf("spark.sql.writerModelName")
  .doc("Version of Spark which create Parquet file")
  .stringConf
  .createWithDefault(SparkContext.getOrCreate().version)

And the getName method:

override def getName: String = {
  SQLConf.SPARK_VERSION.key
}

I've created a pull request with the first variant. Which one is better (if 
either of them is valid at all)?

> Write Spark version information to Parquet file footers
> ---
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Zoltan Ivanfi
>Priority: Major
>
> -PARQUET-352- added support for the "writer.model.name" property in the 
> Parquet metadata to identify the object model (application) that wrote the 
> file.
> The easiest way to write this property is by overriding getName() of 
> org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
> getName() to the 
> org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

2018-08-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596109#comment-16596109
 ] 

Marco Gaido commented on SPARK-25219:
-

Well, there are many differences between the Spark ML and SKLearn code you've 
posted. First of all, the number of clusters is different. Moreover, the input 
data to KMeans can be different.

Please store the data after the TF-IDF transformation, which is the interesting 
one. Then take the KMeans results and the centroids, and check whether the 
distance of each point to the centroid it has been assigned to is lower than 
its distance to all the other centroids. If that is the case, there is no issue 
with KMeans: you may have to increase the number of runs, change the 
initialization method, change the seed and so on to get a different result, but 
there is no evident bug in the algorithm itself. If this is not the case, then 
with the input data to KMeans and the reproducer I can investigate the problem. 
Thanks.
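
If it helps, here is a rough PySpark sketch of that check (the names are just 
placeholders: {{model}} is a fitted {{pyspark.ml.clustering.KMeansModel}}, 
{{df}} is the DataFrame fed to it, and the TF-IDF vectors are assumed to be in 
a column named "features" with predictions in the default "prediction" column):

{code:python}
import numpy as np

centers = [np.asarray(c) for c in model.clusterCenters()]
rows = model.transform(df).select("features", "prediction").collect()

for row in rows:
    point = np.asarray(row["features"].toArray())
    assigned = row["prediction"]
    dists = [np.linalg.norm(point - c) for c in centers]
    # The assigned centroid should be (one of) the closest centroids.
    assert dists[assigned] <= min(dists) + 1e-9, (assigned, dists)
{code}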

> KMeans Clustering - Text Data - Results are incorrect
> -
>
> Key: SPARK-25219
> URL: https://issues.apache.org/jira/browse/SPARK-25219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Vasanthkumar Velayudham
>Priority: Major
> Attachments: Apache_Logs_Results.xlsx, SKLearn_Kmeans.txt, 
> Spark_Kmeans.txt
>
>
> Hello Everyone,
> I am facing issues with the usage of KMeans Clustering on my text data. When 
> I apply clustering on my text data, after performing various transformations 
> such as RegexTokenizer, Stopword Processing, HashingTF, IDF, generated 
> clusters are not proper and one cluster is found to have lot of data points 
> assigned to it.
> I am able to perform clustering with similar kind of processing and with the 
> same attributes on the SKLearn KMeans algorithm. 
> Upon searching in internet, I observe many have reported the same issue with 
> KMeans clustering library of Spark.
> Request your help in fixing this issue.
> Please let me know if you require any additional details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25256:


Assignee: Apache Spark

> Plan mismatch errors in Hive tests in 2.12
> --
>
> Key: SPARK-25256
> URL: https://issues.apache.org/jira/browse/SPARK-25256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Major
>
> In Hive tests, in the Scala 2.12 build, still seeing a few failures that seem 
> to show mismatching schema inference. Not clear whether it's the same as 
> SPARK-25044. Examples:
> {code:java}
> - SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes 
> *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'Project ['arrayField, 'p]
> +- 'Filter ('p = 1)
> +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`
> == Analyzed Logical Plan ==
> arrayField: array, p: int
> Project [arrayField#82569, p#82570]
> +- Filter (p#82570 = 1)
> +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Optimized Logical Plan ==
> Project [arrayField#82569, p#82570]
> +- Filter (isnotnull(p#82570) && (p#82570 = 1))
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Physical Plan ==
> *(1) Project [arrayField#82569, p#82570]
> +- *(1) FileScan parquet 
> default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570]
>  Batched: false, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22...,
>  PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
> PushedFilters: [], ReadSchema: struct>
> == Results ==
> == Results ==
> !== Correct Answer - 10 == == Spark Answer - 10 ==
> !struct<> struct,p:int>
> ![Range 1 to 1,1] [WrappedArray(1),1]
> ![Range 1 to 10,1] [WrappedArray(1, 2),1]
> ![Range 1 to 2,1] [WrappedArray(1, 2, 3),1]
> ![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1]
> ![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1]
> ![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1]
> ![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
> ![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
> ![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
> ![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] 
> (QueryTest.scala:163){code}
> {code:java}
> - SPARK-2693 udaf aggregates test *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'GlobalLimit 1
> +- 'LocalLimit 1
> +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
> +- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> percentile(key, array(1, 1), 1): array
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 
> 0, 0) AS percentile(key, array(1, 1), 1)#205101]
> +- SubqueryAlias `default`.`src`
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Optimized Logical Plan ==
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, 
> array(1, 1), 1)#205101]
> +- Project [key#205098]
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Physical Plan ==
> CollectLimit 1
> +- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 
> 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
> +- Exchange SinglePartition
> +- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, 
> [1.0,1.0], 1, 0, 0)], output=[buf#205104])
> +- Scan hive default.src 

[jira] [Commented] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596073#comment-16596073
 ] 

Apache Spark commented on SPARK-25256:
--

User 'sadhen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22264

> Plan mismatch errors in Hive tests in 2.12
> --
>
> Key: SPARK-25256
> URL: https://issues.apache.org/jira/browse/SPARK-25256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> In Hive tests, in the Scala 2.12 build, still seeing a few failures that seem 
> to show mismatching schema inference. Not clear whether it's the same as 
> SPARK-25044. Examples:
> {code:java}
> - SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes 
> *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'Project ['arrayField, 'p]
> +- 'Filter ('p = 1)
> +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`
> == Analyzed Logical Plan ==
> arrayField: array, p: int
> Project [arrayField#82569, p#82570]
> +- Filter (p#82570 = 1)
> +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Optimized Logical Plan ==
> Project [arrayField#82569, p#82570]
> +- Filter (isnotnull(p#82570) && (p#82570 = 1))
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Physical Plan ==
> *(1) Project [arrayField#82569, p#82570]
> +- *(1) FileScan parquet 
> default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570]
>  Batched: false, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22...,
>  PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
> PushedFilters: [], ReadSchema: struct>
> == Results ==
> == Results ==
> !== Correct Answer - 10 == == Spark Answer - 10 ==
> !struct<> struct,p:int>
> ![Range 1 to 1,1] [WrappedArray(1),1]
> ![Range 1 to 10,1] [WrappedArray(1, 2),1]
> ![Range 1 to 2,1] [WrappedArray(1, 2, 3),1]
> ![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1]
> ![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1]
> ![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1]
> ![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
> ![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
> ![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
> ![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] 
> (QueryTest.scala:163){code}
> {code:java}
> - SPARK-2693 udaf aggregates test *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'GlobalLimit 1
> +- 'LocalLimit 1
> +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
> +- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> percentile(key, array(1, 1), 1): array
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 
> 0, 0) AS percentile(key, array(1, 1), 1)#205101]
> +- SubqueryAlias `default`.`src`
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Optimized Logical Plan ==
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, 
> array(1, 1), 1)#205101]
> +- Project [key#205098]
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Physical Plan ==
> CollectLimit 1
> +- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 
> 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
> +- Exchange SinglePartition
> +- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, 
> [1.0,1.0], 

[jira] [Assigned] (SPARK-25256) Plan mismatch errors in Hive tests in 2.12

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25256:


Assignee: (was: Apache Spark)

> Plan mismatch errors in Hive tests in 2.12
> --
>
> Key: SPARK-25256
> URL: https://issues.apache.org/jira/browse/SPARK-25256
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> In Hive tests, in the Scala 2.12 build, still seeing a few failures that seem 
> to show mismatching schema inference. Not clear whether it's the same as 
> SPARK-25044. Examples:
> {code:java}
> - SPARK-5775 read array from partitioned_parquet_with_key_and_complextypes 
> *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'Project ['arrayField, 'p]
> +- 'Filter ('p = 1)
> +- 'UnresolvedRelation `partitioned_parquet_with_key_and_complextypes`
> == Analyzed Logical Plan ==
> arrayField: array, p: int
> Project [arrayField#82569, p#82570]
> +- Filter (p#82570 = 1)
> +- SubqueryAlias `default`.`partitioned_parquet_with_key_and_complextypes`
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Optimized Logical Plan ==
> Project [arrayField#82569, p#82570]
> +- Filter (isnotnull(p#82570) && (p#82570 = 1))
> +- 
> Relation[intField#82566,stringField#82567,structField#82568,arrayField#82569,p#82570]
>  parquet
> == Physical Plan ==
> *(1) Project [arrayField#82569, p#82570]
> +- *(1) FileScan parquet 
> default.partitioned_parquet_with_key_and_complextypes[arrayField#82569,p#82570]
>  Batched: false, Format: Parquet, Location: 
> PrunedInMemoryFileIndex[file:/home/srowen/spark-2.12/sql/hive/target/tmp/spark-d8d87d74-33e7-4f22...,
>  PartitionCount: 1, PartitionFilters: [isnotnull(p#82570), (p#82570 = 1)], 
> PushedFilters: [], ReadSchema: struct>
> == Results ==
> == Results ==
> !== Correct Answer - 10 == == Spark Answer - 10 ==
> !struct<> struct,p:int>
> ![Range 1 to 1,1] [WrappedArray(1),1]
> ![Range 1 to 10,1] [WrappedArray(1, 2),1]
> ![Range 1 to 2,1] [WrappedArray(1, 2, 3),1]
> ![Range 1 to 3,1] [WrappedArray(1, 2, 3, 4),1]
> ![Range 1 to 4,1] [WrappedArray(1, 2, 3, 4, 5),1]
> ![Range 1 to 5,1] [WrappedArray(1, 2, 3, 4, 5, 6),1]
> ![Range 1 to 6,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7),1]
> ![Range 1 to 7,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8),1]
> ![Range 1 to 8,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9),1]
> ![Range 1 to 9,1] [WrappedArray(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),1] 
> (QueryTest.scala:163){code}
> {code:java}
> - SPARK-2693 udaf aggregates test *** FAILED ***
> Results do not match for query:
> Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> Timezone Env: 
> == Parsed Logical Plan ==
> 'GlobalLimit 1
> +- 'LocalLimit 1
> +- 'Project [unresolvedalias('percentile('key, 'array(1, 1)), None)]
> +- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> percentile(key, array(1, 1), 1): array
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, cast(array(1, 1) as array), 1, 
> 0, 0) AS percentile(key, array(1, 1), 1)#205101]
> +- SubqueryAlias `default`.`src`
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Optimized Logical Plan ==
> GlobalLimit 1
> +- LocalLimit 1
> +- Aggregate [percentile(key#205098, [1.0,1.0], 1, 0, 0) AS percentile(key, 
> array(1, 1), 1)#205101]
> +- Project [key#205098]
> +- HiveTableRelation `default`.`src`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#205098, value#205099]
> == Physical Plan ==
> CollectLimit 1
> +- ObjectHashAggregate(keys=[], functions=[percentile(key#205098, [1.0,1.0], 
> 1, 0, 0)], output=[percentile(key, array(1, 1), 1)#205101])
> +- Exchange SinglePartition
> +- ObjectHashAggregate(keys=[], functions=[partial_percentile(key#205098, 
> [1.0,1.0], 1, 0, 0)], output=[buf#205104])
> +- Scan hive default.src [key#205098], HiveTableRelation 

[jira] [Commented] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596070#comment-16596070
 ] 

Apache Spark commented on SPARK-25044:
--

User 'sadhen' has created a pull request for this issue:
https://github.com/apache/spark/pull/22264

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.4.0
>
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25259) Left/Right join support push down during-join predicates

2018-08-29 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-25259.
-
Resolution: Duplicate

> Left/Right join support push down during-join predicates
> 
>
> Key: SPARK-25259
> URL: https://issues.apache.org/jira/browse/SPARK-25259
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> create temporary view EMPLOYEE as select * from values
>   ("10", "HAAS", "A00"),
>   ("10", "THOMPSON", "B01"),
>   ("30", "KWAN", "C01"),
>   ("000110", "LUCCHESSI", "A00"),
>   ("000120", "O'CONNELL", "A))"),
>   ("000130", "QUINTANA", "C01")
>   as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);
> create temporary view DEPARTMENT as select * from values
>   ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
>   ("B01", "PLANNING", "20"),
>   ("C01", "INFORMATION CENTER", "30"),
>   ("D01", "DEVELOPMENT CENTER", null)
>   as EMPLOYEE(DEPTNO, DEPTNAME, MGRNO);
> create temporary view PROJECT as select * from values
>   ("AD3100", "ADMIN SERVICES", "D01"),
>   ("IF1000", "QUERY SERVICES", "C01"),
>   ("IF2000", "USER EDUCATION", "E01"),
>   ("MA2100", "WELD LINE AUDOMATION", "D01"),
>   ("PL2100", "WELD LINE PLANNING", "01")
>   as EMPLOYEE(PROJNO, PROJNAME, DEPTNO);
> {code}
> below SQL:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}
> can Optimized to:
> {code:sql}
> SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
> FROM PROJECT P LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
> ON P.DEPTNO = D.DEPTNO
> AND P.DEPTNO='E01';
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails

2018-08-29 Thread chen xiao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16596033#comment-16596033
 ] 

chen xiao commented on SPARK-25193:
---

It does look like the Hive issue. Thanks [~mgaido]

> insert overwrite doesn't throw exception when drop old data fails
> -
>
> Key: SPARK-25193
> URL: https://issues.apache.org/jira/browse/SPARK-25193
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: chen xiao
>Priority: Major
>
> dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName")
> Insert overwrite mode will drop old data in hive table if there's old data.
> But if data deleting fails, no exception will be thrown and the data folder 
> will be like:
> hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0
> hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513.
> Two copies of data will be kept.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25044.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22259
[https://github.com/apache/spark/pull/22259]

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.4.0
>
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25044) Address translation of LMF closure primitive args to Object in Scala 2.12

2018-08-29 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25044:
---

Assignee: Sean Owen

> Address translation of LMF closure primitive args to Object in Scala 2.12
> -
>
> Key: SPARK-25044
> URL: https://issues.apache.org/jira/browse/SPARK-25044
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
> Fix For: 2.4.0
>
>
> A few SQL-related tests fail in Scala 2.12, such as UDFSuite's "SPARK-24891 
> Fix HandleNullInputsForUDF rule":
> {code:java}
> - SPARK-24891 Fix HandleNullInputsForUDF rule *** FAILED ***
> Results do not match for query:
> ...
> == Results ==
> == Results ==
> !== Correct Answer - 3 == == Spark Answer - 3 ==
> !struct<> struct
> ![0,10,null] [0,10,0]
> ![1,12,null] [1,12,1]
> ![2,14,null] [2,14,2] (QueryTest.scala:163){code}
> You can kind of get what's going on reading the test:
> {code:java}
> test("SPARK-24891 Fix HandleNullInputsForUDF rule") {
> // assume(!ClosureCleanerSuite2.supportsLMFs)
> // This test won't test what it intends to in 2.12, as lambda metafactory 
> closures
> // have arg types that are not primitive, but Object
> val udf1 = udf({(x: Int, y: Int) => x + y})
> val df = spark.range(0, 3).toDF("a")
> .withColumn("b", udf1($"a", udf1($"a", lit(10
> .withColumn("c", udf1($"a", lit(null)))
> val plan = spark.sessionState.executePlan(df.logicalPlan).analyzed
> comparePlans(df.logicalPlan, plan)
> checkAnswer(
> df,
> Seq(
> Row(0, 10, null),
> Row(1, 12, null),
> Row(2, 14, null)))
> }{code}
>  
> It seems that the closure that is fed in as a UDF changes behavior, in a way 
> that primitive-type arguments are handled differently. For example an Int 
> argument, when fed 'null', acts like 0.
> I'm sure it's a difference in the LMF closure and how its types are 
> understood, but not exactly sure of the cause yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-08-29 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23030.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21546
[https://github.com/apache/spark/pull/21546]

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.
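
(For context, a rough sketch of the user-facing path this change affects; it 
assumes Spark 2.3+ with pandas and pyarrow installed:)

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-topandas-sketch").getOrCreate()

# Enable Arrow-based collection; toPandas() below then goes through the code
# path that this ticket optimizes.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = spark.range(0, 1000).selectExpr("id", "id * 2 AS doubled").toPandas()
print(pdf.head())
{code}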



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-08-29 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23030:


Assignee: Bryan Cutler

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25269) SQL interface support specify StorageLevel when cache table

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25269:


Assignee: Apache Spark

> SQL interface support specify StorageLevel when cache table
> ---
>
> Key: SPARK-25269
> URL: https://issues.apache.org/jira/browse/SPARK-25269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {code:sql}
> CACHE MEMORY_ONLY TABLE testData;
> {code}
> Supported {{StorageLevel}} should be:
> https://github.com/apache/spark/blob/eefdf9f9dd8afde49ad7d4e230e2735eb817ab0a/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L153-L164



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25269) SQL interface support specify StorageLevel when cache table

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595997#comment-16595997
 ] 

Apache Spark commented on SPARK-25269:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22263

> SQL interface support specify StorageLevel when cache table
> ---
>
> Key: SPARK-25269
> URL: https://issues.apache.org/jira/browse/SPARK-25269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CACHE MEMORY_ONLY TABLE testData;
> {code}
> Supported {{StorageLevel}} should be:
> https://github.com/apache/spark/blob/eefdf9f9dd8afde49ad7d4e230e2735eb817ab0a/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L153-L164



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25269) SQL interface support specify StorageLevel when cache table

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25269:


Assignee: (was: Apache Spark)

> SQL interface support specify StorageLevel when cache table
> ---
>
> Key: SPARK-25269
> URL: https://issues.apache.org/jira/browse/SPARK-25269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CACHE MEMORY_ONLY TABLE testData;
> {code}
> Supported {{StorageLevel}} should be:
> https://github.com/apache/spark/blob/eefdf9f9dd8afde49ad7d4e230e2735eb817ab0a/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L153-L164



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25175:


Assignee: (was: Apache Spark)

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.
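
(A rough sketch of how the ambiguity can arise; the path /tmp/orc_dup and the 
session name {{spark}} are just placeholders:)

{code:python}
# With case sensitivity on, write an ORC file whose schema contains both "a" and "A".
spark.conf.set("spark.sql.caseSensitive", "true")
spark.range(5).selectExpr("id AS a", "id * 10 AS A") \
    .write.mode("overwrite").orc("/tmp/orc_dup")

# Reading it back case-insensitively makes the lookup of column "a" ambiguous;
# per this ticket the native reader should fail here instead of silently
# picking one of the fields.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.read.orc("/tmp/orc_dup").select("a").show()
{code}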



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-08-29 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595994#comment-16595994
 ] 

Apache Spark commented on SPARK-25175:
--

User 'seancxmao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22262

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-08-29 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25175:


Assignee: Apache Spark

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found ORC files have similar issues, but not identical 
> to Parquet. Spark has two OrcFileFormat.
>  * Since SPARK-2883, Spark supports ORC inside sql/hive module with Hive 
> dependency. This hive OrcFileFormat always does case-insensitive field 
> resolution regardless of case sensitivity mode. When there is ambiguity, hive 
> OrcFileFormat always returns the first matched field, rather than failing the 
> reading operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution, however it cannot 
> handle duplicate fields.
> Besides data source tables, hive serde tables also have issues. If ORC data 
> file has more fields than table schema, we just can't read hive serde tables. 
> If ORC data file does not have more fields, hive serde tables always do field 
> resolution by ordinal, rather than by name.
> Both ORC data source hive impl and hive serde table rely on the hive orc 
> InputFormat/SerDe to read table. I'm not sure whether we can change 
> underlying hive classes to make all orc read behaviors consistent.
> This ticket aims to make read behavior of ORC data source native impl 
> consistent with Parquet data source.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25175) Field resolution should fail if there's ambiguity for ORC native reader

2018-08-29 Thread Chenxiao Mao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenxiao Mao updated SPARK-25175:
-
Summary: Field resolution should fail if there's ambiguity for ORC native 
reader  (was: Field resolution should fail if there is ambiguity for ORC data 
source native implementation)

> Field resolution should fail if there's ambiguity for ORC native reader
> ---
>
> Key: SPARK-25175
> URL: https://issues.apache.org/jira/browse/SPARK-25175
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Chenxiao Mao
>Priority: Major
>
> SPARK-25132 adds support for case-insensitive field resolution when reading 
> from Parquet files. We found that ORC files have similar issues, though not 
> identical to Parquet. Spark has two OrcFileFormat implementations.
>  * Since SPARK-2883, Spark has supported ORC inside the sql/hive module with a 
> Hive dependency. This Hive OrcFileFormat always does case-insensitive field 
> resolution, regardless of the case sensitivity mode. When there is ambiguity, 
> the Hive OrcFileFormat always returns the first matched field rather than 
> failing the read operation.
>  * SPARK-20682 adds a new ORC data source inside sql/core. This native 
> OrcFileFormat supports case-insensitive field resolution; however, it cannot 
> handle duplicate fields.
> Besides data source tables, Hive SerDe tables also have issues. If an ORC data 
> file has more fields than the table schema, we simply cannot read Hive SerDe 
> tables. If the ORC data file does not have extra fields, Hive SerDe tables 
> always resolve fields by ordinal rather than by name.
> Both the Hive implementation of the ORC data source and Hive SerDe tables rely 
> on the Hive ORC InputFormat/SerDe to read tables. I'm not sure whether we can 
> change the underlying Hive classes to make all ORC read behaviors consistent.
> This ticket aims to make the read behavior of the native ORC data source 
> implementation consistent with the Parquet data source.
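The updated summary pins down the requested rule: when resolution is case-insensitive, a requested column must match exactly one physical field or the read should fail. The sketch below only illustrates that resolution logic under stated assumptions; it is not the code of the eventual patch, and the method and exception choices are made up.

{code:scala}
// Illustrative resolution rule: in case-insensitive mode, fail on ambiguity
// instead of silently picking one of the candidate fields.
def resolveField(requested: String, physicalFields: Seq[String], caseSensitive: Boolean): String = {
  if (caseSensitive) {
    physicalFields.find(_ == requested).getOrElse(
      throw new IllegalArgumentException(s"Field $requested not found"))
  } else {
    physicalFields.filter(_.equalsIgnoreCase(requested)) match {
      case Seq(unique) => unique
      case Seq()       => throw new IllegalArgumentException(s"Field $requested not found")
      case matches     => throw new IllegalArgumentException(
        s"Ambiguous field $requested, candidates: ${matches.mkString(", ")}")
    }
  }
}

// resolveField("a", Seq("a", "A"), caseSensitive = false) throws: Ambiguous field a, candidates: a, A
{code}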



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25269) SQL interface support specify StorageLevel when cache table

2018-08-29 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16595967#comment-16595967
 ] 

Yuming Wang commented on SPARK-25269:
-

I'm working on it.

> SQL interface support specify StorageLevel when cache table
> ---
>
> Key: SPARK-25269
> URL: https://issues.apache.org/jira/browse/SPARK-25269
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CACHE MEMORY_ONLY TABLE testData;
> {code}
> The supported {{StorageLevel}} values should be those defined here:
> https://github.com/apache/spark/blob/eefdf9f9dd8afde49ad7d4e230e2735eb817ab0a/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L153-L164
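Until a syntax like the one above exists, one way to pick an explicit StorageLevel today is to persist the table's DataFrame directly. The snippet below is only a workaround sketch, assuming a local SparkSession and reusing the {{testData}} name from the example above.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().master("local[*]").appName("cache-level").getOrCreate()

// `testData` stands in for an existing table or temporary view.
spark.range(100).toDF("id").createOrReplaceTempView("testData")

// CACHE TABLE via SQL always uses the default level (MEMORY_AND_DISK);
// an explicit level currently has to go through the Dataset API.
spark.table("testData").persist(StorageLevel.MEMORY_ONLY)

// Materialize the cache and use it from SQL.
spark.sql("SELECT COUNT(*) FROM testData").show()
{code}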



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25269) SQL interface support specify StorageLevel when cache table

2018-08-29 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25269:
---

 Summary: SQL interface support specify StorageLevel when cache 
table
 Key: SPARK-25269
 URL: https://issues.apache.org/jira/browse/SPARK-25269
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


{code:sql}
CACHE MEMORY_ONLY TABLE testData;
{code}
The supported {{StorageLevel}} values should be those defined here:
https://github.com/apache/spark/blob/eefdf9f9dd8afde49ad7d4e230e2735eb817ab0a/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L153-L164
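For reference, a plausible way for a SQL-level implementation to translate the parsed level name into a StorageLevel is the existing StorageLevel.fromString helper; the helper function below is only an assumption about the eventual wiring, not the actual patch.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: map the level name that a `CACHE <LEVEL> TABLE t`
// statement would parse onto one of Spark's predefined storage levels.
// StorageLevel.fromString accepts the names defined in StorageLevel.scala
// (MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, ...) and throws
// IllegalArgumentException for anything else.
def cacheTableWithLevel(spark: SparkSession, table: String, levelName: String): Unit = {
  val level = StorageLevel.fromString(levelName.toUpperCase)
  spark.table(table).persist(level)
}
{code}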



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org