[jira] [Assigned] (SPARK-22692) Reduce the number of generated mutable states

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-22692:
---

Assignee: Marco Gaido

> Reduce the number of generated mutable states
> -
>
> Key: SPARK-22692
> URL: https://issues.apache.org/jira/browse/SPARK-22692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
>
> A large number of mutable states can cause an error during code generation 
> due to reaching the constant pool limit. There is an ongoing effort on 
> SPARK-18016 to fix the problem; nonetheless, we can also alleviate it by 
> avoiding the creation of global variables when they are not needed.
> Therefore I am creating this umbrella ticket to track the elimination of 
> global variables where they are not needed. This is not a duplicate of, or 
> an alternative to, SPARK-18016: it is a complementary effort which, together 
> with that work, can help support wider datasets.
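For context, an editor's sketch of the failure mode described above (hypothetical repro, not from the ticket; whether it actually fails, falls back to interpreted evaluation, or succeeds depends on the Spark version and the expressions involved):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object WidePlanRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("wide-plan").getOrCreate()

    // A very wide projection: the generated class can accumulate many fields
    // (mutable states) and constant pool entries, approaching the JVM's 64K
    // constant pool limit.
    val cols = (0 until 4000).map(i => (col("id") + i).as(s"c$i"))
    spark.range(10).select(cols: _*).collect()

    spark.stop()
  }
}
{code}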






[jira] [Updated] (SPARK-22939) Support Spark UDF in registerFunction

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22939:

Issue Type: Improvement  (was: Bug)

> Support Spark UDF in registerFunction
> -
>
> Key: SPARK-22939
> URL: https://issues.apache.org/jira/browse/SPARK-22939
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> {noformat}
> import random
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StringType
> random_udf = udf(lambda: int(random.random() * 100), 
> IntegerType()).asNondeterministic()
> spark.catalog.registerFunction("random_udf", random_udf, StringType())
> spark.sql("SELECT random_udf()").collect()
> {noformat}
> We will get the following error.
> {noformat}
> Py4JError: An error occurred while calling o29.__getnewargs__. Trace:
> py4j.Py4JException: Method __getnewargs__([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:274)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}






[jira] [Updated] (SPARK-22961) Constant columns no longer picked as constraints in 2.3

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22961:

Issue Type: Bug  (was: Improvement)

> Constant columns no longer picked as constraints in 2.3
> ---
>
> Key: SPARK-22961
> URL: https://issues.apache.org/jira/browse/SPARK-22961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Adrian Ionescu
>Assignee: Adrian Ionescu
>Priority: Major
>  Labels: constraints, optimizer, regression
> Fix For: 2.3.0
>
>
> We're no longer picking up {{x = 2}} as a constraint from something like 
> {{df.withColumn("x", lit(2))}}.
> The unit test below succeeds in {{branch-2.2}}:
> {code}
> test("constraints should be inferred from aliased literals") {
> val originalLeft = testRelation.subquery('left).as("left")
> val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && 
> 'a <=> 2).as("left")
> val right = Project(Seq(Literal(2).as("two")), 
> testRelation.subquery('right)).as("right")
> val condition = Some("left.a".attr === "right.two".attr)
> val original = originalLeft.join(right, Inner, condition)
> val correct = optimizedLeft.join(right, Inner, condition)
> comparePlans(Optimize.execute(original.analyze), correct.analyze)
>   }
> {code}
> but fails in {{branch-2.3}} with:
> {code}
> == FAIL: Plans do not match ===
>  'Join Inner, (two#0 = a#0)                'Join Inner, (two#0 = a#0)
> !:- Filter isnotnull(a#0)                  :- Filter ((2 <=> a#0) && isnotnull(a#0))
>  :  +- LocalRelation , [a#0, b#0, c#0]     :  +- LocalRelation , [a#0, b#0, c#0]
>  +- Project [2 AS two#0]                   +- Project [2 AS two#0]
>     +- LocalRelation , [a#0, b#0, c#0]        +- LocalRelation , [a#0, b#0, c#0]
> {code}






[jira] [Updated] (SPARK-16060) Vectorized Orc reader

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16060:

Labels: release-notes releasenotes  (was: release-notes)

> Vectorized Orc reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1
>Reporter: Liang-Chi Hsieh
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: release-notes, releasenotes
> Fix For: 2.3.0
>
>
> Currently the ORC reader in Spark SQL doesn't support vectorized reading. As 
> Hive's ORC reader already supports vectorization, we should add this support 
> to improve ORC reading performance.
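For reference, a usage sketch (an editor's addition, assuming the Spark 2.3.0 configuration keys {{spark.sql.orc.impl}} and {{spark.sql.orc.enableVectorizedReader}}):

{code}
import org.apache.spark.sql.SparkSession

object OrcVectorizedDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("orc-demo").getOrCreate()

    // Select the new ORC data source in sql/core instead of the Hive-based one...
    spark.conf.set("spark.sql.orc.impl", "native")
    // ...and make sure its vectorized reader is enabled.
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")

    spark.range(1000).toDF("id").write.mode("overwrite").orc("/tmp/orc_demo")
    spark.read.orc("/tmp/orc_demo").show(5)

    spark.stop()
  }
}
{code}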






[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22510:

Labels: releasenotes  (was: )

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.






[jira] [Assigned] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-22510:
---

Assignee: Kazuaki Ishizaki

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Kazuaki Ishizaki
>Priority: Major
>  Labels: releasenotes
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.






[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22510:

Fix Version/s: (was: 2.3.0)

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Priority: Major
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.






[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22510:

Fix Version/s: 2.3.0

> Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit 
> 
>
> Key: SPARK-22510
> URL: https://issues.apache.org/jira/browse/SPARK-22510
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Priority: Major
>
> Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant 
> pool entry limit.






[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20392:

Component/s: SQL

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on Stack 
> Overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing Spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24-core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here is a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 095_bucketizer_c5d853b95707
> 096_bucketizer_e677659ca253
> 097_bucketizer_396e35548c72
> 098_bucketizer_78a6410d7a84
> 099_bucketizer_e3ae6e54bca1
> 100_bucketizer_9fed5923fe8a
> 101_bucketizer_8925ba4c3ee2
> 102_bucketizer_95750b6942b8
> 103_bucketizer_6e8b50a1918b
> 104_bucketizer_36cfcc13d4ba
> 105_bucketizer_2716d0455512
> 106_bucketizer_9bcf2891652f
> 107_bucketizer_8c3d352915f7
> 108_bucketizer_0786c17d5ef9
> 109_bucketizer_f22df23ef56f
> 110_bucketizer_bad04578bd20
> 111_bucketizer_35cfbde7e28f
> 112_bucketizer_c

[jira] [Updated] (SPARK-20682) Add new ORCFileFormat based on Apache ORC

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20682:

Labels: releasenotes  (was: )

> Add new ORCFileFormat based on Apache ORC
> -
>
> Key: SPARK-20682
> URL: https://issues.apache.org/jira/browse/SPARK-20682
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> Since SPARK-2883, Apache Spark has supported Apache ORC inside the `sql/hive` 
> module with a Hive dependency. This issue aims to add a new and faster ORC 
> data source inside `sql/core` and to replace the old ORC data source 
> eventually. In this issue, the latest Apache ORC 1.4.0 (released yesterday) 
> is used.
> There are four key benefits.
> - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is 
> faster than the current implementation in Spark.
> - Stability: Apache ORC 1.4.0 has many fixes and we can depend more on the 
> ORC community.
> - Usability: Users can use `ORC` data sources without the Hive module, i.e., 
> without `-Phive`.
> - Maintainability: Reduce the Hive dependency so that old legacy code can be 
> removed later.






[jira] [Updated] (SPARK-23219) Rename ReadTask to DataReaderFactory

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23219:

Parent Issue: SPARK-15689  (was: SPARK-22386)

> Rename ReadTask to DataReaderFactory
> 
>
> Key: SPARK-23219
> URL: https://issues.apache.org/jira/browse/SPARK-23219
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-23280) add map type support to ColumnVector

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23280:


Assignee: Wenchen Fan  (was: Apache Spark)

> add map type support to ColumnVector
> 
>
> Key: SPARK-23280
> URL: https://issues.apache.org/jira/browse/SPARK-23280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-23280) add map type support to ColumnVector

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346372#comment-16346372
 ] 

Apache Spark commented on SPARK-23280:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20450

> add map type support to ColumnVector
> 
>
> Key: SPARK-23280
> URL: https://issues.apache.org/jira/browse/SPARK-23280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-23280) add map type support to ColumnVector

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23280:


Assignee: Apache Spark  (was: Wenchen Fan)

> add map type support to ColumnVector
> 
>
> Key: SPARK-23280
> URL: https://issues.apache.org/jira/browse/SPARK-23280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Updated] (SPARK-22400) rename some APIs and classes to make their meaning clearer

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22400:

Parent Issue: SPARK-15689  (was: SPARK-22386)

> rename some APIs and classes to make their meaning clearer
> --
>
> Key: SPARK-22400
> URL: https://issues.apache.org/jira/browse/SPARK-22400
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.3.0
>
>
> Both `ReadSupport` and `ReadTask` have a method called `createReader`, but 
> they create different things. This could cause some confusion for data source 
> developers. The same issue exists between `WriteSupport` and 
> `DataWriterFactory`, both of which have a method called `createWriter`.
> Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, 
> because it actually converts `InternalRow`s to `Row`s.
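To make the clash concrete, a simplified sketch (an editor's addition; the traits below are trimmed stand-ins, not the real DataSourceV2 interfaces):

{code}
// Trimmed stand-ins for the real interfaces, only to show the name clash:
// the same method name creates two very different objects.
trait DataSourceReader            // plans a scan and produces read tasks
trait DataReader[T] {             // reads rows for a single partition
  def next(): Boolean
  def get(): T
}

trait ReadSupport {
  def createReader(): DataSourceReader  // "reader" here = the per-scan planner
}

trait ReadTask[T] {
  def createReader(): DataReader[T]     // "reader" here = the per-partition reader
}
{code}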






[jira] [Updated] (SPARK-22452) DataSourceV2Options should have getInt, getBoolean, etc.

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22452:

Parent Issue: SPARK-15689  (was: SPARK-22386)

> DataSourceV2Options should have getInt, getBoolean, etc.
> 
>
> Key: SPARK-22452
> URL: https://issues.apache.org/jira/browse/SPARK-22452
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Sunitha Kambhampati
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Updated] (SPARK-22392) columnar reader interface

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22392:

Parent Issue: SPARK-15689  (was: SPARK-22386)

> columnar reader interface 
> --
>
> Key: SPARK-22392
> URL: https://issues.apache.org/jira/browse/SPARK-22392
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Updated] (SPARK-22389) partitioning reporting

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22389:

Fix Version/s: (was: 2.3.1)
   2.3.0

> partitioning reporting
> --
>
> Key: SPARK-22389
> URL: https://issues.apache.org/jira/browse/SPARK-22389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We should allow data sources to report their partitioning so that Spark can 
> avoid unnecessary shuffles on its side.






[jira] [Updated] (SPARK-22389) partitioning reporting

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22389:

Parent Issue: SPARK-15689  (was: SPARK-22386)

> partitioning reporting
> --
>
> Key: SPARK-22389
> URL: https://issues.apache.org/jira/browse/SPARK-22389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> We should allow data sources to report their partitioning so that Spark can 
> avoid unnecessary shuffles on its side.






[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22387:

Parent Issue: SPARK-15689  (was: SPARK-22386)

> propagate session configs to data source read/write options
> ---
>
> Key: SPARK-22387
> URL: https://issues.apache.org/jira/browse/SPARK-22387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Jiang Xingbo
>Priority: Major
> Fix For: 2.3.0
>
>
> This is an open discussion. The general idea is that we should allow users to 
> set some common configs in the session conf so that they don't need to type 
> them again and again for each data source operation.
> Proposal 1:
> Propagate every session config which starts with {{spark.datasource.config.}} 
> to the data source options. The downside is that users may only want to set 
> some common configs for a specific data source.
> Proposal 2:
> Propagate session configs which start with 
> {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} 
> operations. One downside is that some data sources may not have a short name, 
> which makes the config key pretty long, e.g. 
> {{spark.datasource.config.com.company.foo.bar.key1}}.
> Proposal 3:
> Introduce a trait `WithSessionConfig` which defines a session config key 
> prefix. Then we can pick session configs with this key prefix and propagate 
> them to this particular data source.
> Another thing worth thinking about: sometimes it's really annoying if users 
> have a typo in a config key and spend a lot of time figuring out why things 
> don't work as expected. We should allow data sources to validate the given 
> options and throw an exception if an option can't be recognized.
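A minimal sketch of Proposal 3 (an editor's addition with hypothetical names; the trait and prefix scheme are illustrative, not the merged API):

{code}
// Hypothetical sketch of Proposal 3: a data source declares a config key
// prefix, and matching session configs are forwarded as data source options.
trait WithSessionConfig {
  def keyPrefix: String
}

object SessionConfigPropagation {
  // Pick session configs under "spark.datasource.<keyPrefix>." and strip the
  // prefix so they look like ordinary data source options.
  def extractOptions(
      sessionConf: Map[String, String],
      source: WithSessionConfig): Map[String, String] = {
    val prefix = s"spark.datasource.${source.keyPrefix}."
    sessionConf.collect {
      case (k, v) if k.startsWith(prefix) => k.stripPrefix(prefix) -> v
    }
  }

  def main(args: Array[String]): Unit = {
    val source = new WithSessionConfig { val keyPrefix = "myDataSource" }
    val conf = Map(
      "spark.datasource.myDataSource.path" -> "/data",
      "spark.sql.shuffle.partitions" -> "200")
    println(extractOptions(conf, source)) // Map(path -> /data)
  }
}
{code}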






[jira] [Updated] (SPARK-22386) Data Source V2 improvements

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22386:

Labels: releasenotes  (was: )

> Data Source V2 improvements
> ---
>
> Key: SPARK-22386
> URL: https://issues.apache.org/jira/browse/SPARK-22386
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: releasenotes
>







[jira] [Created] (SPARK-23280) add map type support to ColumnVector

2018-01-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23280:
---

 Summary: add map type support to ColumnVector
 Key: SPARK-23280
 URL: https://issues.apache.org/jira/browse/SPARK-23280
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Updated] (SPARK-23260) remove V2 from the class name of data source reader/writer

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23260:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-15689

> remove V2 from the class name of data source reader/writer
> --
>
> Key: SPARK-23260
> URL: https://issues.apache.org/jira/browse/SPARK-23260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Updated] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23262:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-15689

> mix-in interface should extend the interface it aimed to mix in
> ---
>
> Key: SPARK-23262
> URL: https://issues.apache.org/jira/browse/SPARK-23262
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Updated] (SPARK-20960) make ColumnVector public

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-20960:

Labels: releasenotes  (was: )

> make ColumnVector public
> 
>
> Key: SPARK-20960
> URL: https://issues.apache.org/jira/browse/SPARK-20960
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> ColumnVector is an internal interface in Spark SQL, used only by the 
> vectorized Parquet reader to represent the in-memory columnar format.
> In Spark 2.3 we want to make ColumnVector public, so that we can provide a 
> more efficient way to exchange data between Spark and external systems. For 
> example, we can use ColumnVector to build the columnar read API in the data 
> source framework, to build a more efficient UDF API, etc.
> We also want to introduce a new ColumnVector implementation based on Apache 
> Arrow (basically just a wrapper over Arrow), so that external systems (like 
> Python Pandas DataFrames) can build a ColumnVector very easily.
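A small write/read sketch (an editor's addition; {{OnHeapColumnVector}} lives in an execution package and its exact constructor has varied across versions, so treat this as illustrative of the API shape only):

{code}
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType

object ColumnVectorDemo {
  def main(args: Array[String]): Unit = {
    // Allocate a small writable int vector, fill it, read it back, free it.
    val vec = new OnHeapColumnVector(4, IntegerType)
    (0 until 4).foreach(i => vec.putInt(i, i * 10))
    (0 until 4).foreach(i => println(vec.getInt(i)))
    vec.close() // releases the backing arrays
  }
}
{code}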






[jira] [Resolved] (SPARK-22969) aggregateByKey with aggregator compression

2018-01-30 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-22969.
--
Resolution: Not A Problem

> aggregateByKey with aggregator compression
> --
>
> Key: SPARK-22969
> URL: https://issues.apache.org/jira/browse/SPARK-22969
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I encountered a special case where the aggregator can be represented as two 
> types:
> a) high memory footprint, but fast {{update}}
> b) compact, but must be converted to type a) before calling {{update}} and 
> {{merge}}.
> I wonder whether it is possible to compress the fat aggregators in 
> {{aggregateByKey}} before the shuffle; how can I implement it?  [~cloud_fan]
> One similar case may be:
> using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (number of 
> non-zero values) per key on a large sparse dataset.
> We can use {{DenseVector}} as the aggregator to count the nnz, and then 
> compress it by calling {{Vector#compressed}} before sending it over the 
> network.
> Another similar case may be calling {{QuantileSummaries#compress}} before 
> communication.
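A runnable sketch of the nnz example (an editor's addition, assuming {{org.apache.spark.ml.linalg}} and a local session; note the compression happens only after the final merge, which is exactly the limitation being asked about):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vectors

object NnzByKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("nnz").getOrCreate()
    val sc = spark.sparkContext

    val dim = 5
    val data = sc.parallelize(Seq(          // (key, sparse feature vector)
      ("a", Vectors.sparse(dim, Array(0, 3), Array(1.0, 2.0))),
      ("a", Vectors.sparse(dim, Array(3), Array(5.0))),
      ("b", Vectors.sparse(dim, Array(1), Array(4.0)))))

    // A dense Array[Double] is the "fat" aggregator: fast in-place updates.
    val nnz = data.aggregateByKey(new Array[Double](dim))(
      (acc, v) => { v.foreachActive((i, x) => if (x != 0) acc(i) += 1); acc },
      (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a })
      // Compressed only after aggregation; the ticket asks how to also
      // compress it before the shuffle.
      .mapValues(arr => Vectors.dense(arr).compressed)

    nnz.collect().foreach(println)
    spark.stop()
  }
}
{code}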






[jira] [Resolved] (SPARK-23272) add calendar interval type support to ColumnVector

2018-01-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23272.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20438
[https://github.com/apache/spark/pull/20438]

> add calendar interval type support to ColumnVector
> --
>
> Key: SPARK-23272
> URL: https://issues.apache.org/jira/browse/SPARK-23272
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>







[jira] [Commented] (SPARK-22969) aggregateByKey with aggregator compression

2018-01-30 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346359#comment-16346359
 ] 

zhengruifeng commented on SPARK-22969:
--

[~srowen]  The mailing list is a better place to discuss this.  Thanks. 

> aggregateByKey with aggregator compression
> --
>
> Key: SPARK-22969
> URL: https://issues.apache.org/jira/browse/SPARK-22969
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I encountered a special case where the aggregator can be represented as two 
> types:
> a) high memory footprint, but fast {{update}}
> b) compact, but must be converted to type a) before calling {{update}} and 
> {{merge}}.
> I wonder whether it is possible to compress the fat aggregators in 
> {{aggregateByKey}} before the shuffle; how can I implement it?  [~cloud_fan]
> One similar case may be:
> using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (number of 
> non-zero values) per key on a large sparse dataset.
> We can use {{DenseVector}} as the aggregator to count the nnz, and then 
> compress it by calling {{Vector#compressed}} before sending it over the 
> network.
> Another similar case may be calling {{QuantileSummaries#compress}} before 
> communication.






[jira] [Updated] (SPARK-22971) OneVsRestModel should use temporary RawPredictionCol

2018-01-30 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22971:
-
Affects Version/s: (was: 2.3.0)
   2.4.0

> OneVsRestModel should use temporary RawPredictionCol
> 
>
> Key: SPARK-22971
> URL: https://issues.apache.org/jira/browse/SPARK-22971
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> The issue occurs when I transform one dataframe with two different 
> classification models, first by a {{RandomForestClassificationModel}}, then 
> by a {{OneVsRestModel}}.
> The first transform generates a new column "rawPrediction", which is 
> internally used in {{OneVsRestModel#transform}} and causes a failure.
> {code}
> scala> val df = 
> spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_multiclass_classification_data.txt")
> 18/01/05 17:08:18 WARN ObjectStore: Failed to get database global_temp, 
> returning NoSuchObjectException
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
> scala> val rf = new RandomForestClassifier()
> rf: org.apache.spark.ml.classification.RandomForestClassifier = 
> rfc_c11b1e1e1f7f
> scala> val rfm = rf.fit(df)
> rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_c11b1e1e1f7f) with 20 trees
> scala> val lr = new LogisticRegression().setMaxIter(1)
> lr: org.apache.spark.ml.classification.LogisticRegression = 
> logreg_f5a5285eba06
> scala> val ovr = new OneVsRest().setClassifier(lr)
> ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_8f5584190634
> scala> val ovrModel = ovr.fit(df)
> ovrModel: org.apache.spark.ml.classification.OneVsRestModel = 
> oneVsRest_8f5584190634
> scala> val df2 = rfm.setPredictionCol("rfPred").transform(df)
> df2: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 
> more fields]
> scala> val df3 = ovrModel.setPredictionCol("ovrPred").transform(df2)
> java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
> already exists.
>   at scala.Predef$.require(Predef.scala:224)
>   at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:101)
>   at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:91)
>   at 
> org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:43)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:77)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:904)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:904)
>   at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
>   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
>   at 
> org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:104)
>   at 
> org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:184)
>   at 
> org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:173)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>   at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>   at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:173)
>   ... 50 elided
> {code}
> {{OneVsRestModel#transform}} only generates a new prediction column, and 
> should not fail because of other columns.






[jira] [Assigned] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23040:


Assignee: (was: Apache Spark)

> BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator 
> or ordering is specified
> 
>
> Key: SPARK-23040
> URL: https://issues.apache.org/jira/browse/SPARK-23040
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: Xianjin YE
>Priority: Minor
>
> For example, if ordering is specified, the returned iterator is a 
> CompletionIterator:
> {code:java}
> dep.keyOrdering match {
>   case Some(keyOrd: Ordering[K]) =>
>     // Create an ExternalSorter to sort the data.
>     val sorter =
>       new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
>     sorter.insertAll(aggregatedIter)
>     context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
>     context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
>     context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
>     CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
>   case None =>
>     aggregatedIter
> }
> {code}
> However, the sorter consumes the aggregatedIter (which may be interruptible) 
> in sorter.insertAll, then creates an iterator which isn't interruptible.
> The problem with this is that the Spark task cannot be cancelled on stage 
> failure (unless interruptThread is enabled, which it is not by default), 
> which wastes executor resources.
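A hedged sketch of the general fix direction (an editor's addition, not the merged patch): wrap the non-interruptible iterator in Spark's public {{InterruptibleIterator}}, which consults the task's kill flag on every {{hasNext}}:

{code}
import org.apache.spark.{InterruptibleIterator, SparkConf, SparkContext, TaskContext}

object InterruptibleWrap {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("demo"))

    val result = sc.parallelize(1 to 1000, 4).mapPartitions { iter =>
      // Suppose `iter` came from an ExternalSorter and were therefore not
      // interruptible: wrapping it re-checks the task's interrupted flag on
      // every hasNext, so a kill request takes effect promptly.
      new InterruptibleIterator(TaskContext.get(), iter)
    }.sum()

    println(result)
    sc.stop()
  }
}
{code}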






[jira] [Assigned] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23040:


Assignee: Apache Spark

> BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator 
> or ordering is specified
> 
>
> Key: SPARK-23040
> URL: https://issues.apache.org/jira/browse/SPARK-23040
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: Xianjin YE
>Assignee: Apache Spark
>Priority: Minor
>
> For example, if ordering is specified, the returned iterator is a 
> CompletionIterator:
> {code:java}
> dep.keyOrdering match {
>   case Some(keyOrd: Ordering[K]) =>
>     // Create an ExternalSorter to sort the data.
>     val sorter =
>       new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
>     sorter.insertAll(aggregatedIter)
>     context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
>     context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
>     context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
>     CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
>   case None =>
>     aggregatedIter
> }
> {code}
> However, the sorter consumes the aggregatedIter (which may be interruptible) 
> in sorter.insertAll, then creates an iterator which isn't interruptible.
> The problem with this is that the Spark task cannot be cancelled on stage 
> failure (unless interruptThread is enabled, which it is not by default), 
> which wastes executor resources.






[jira] [Commented] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346325#comment-16346325
 ] 

Apache Spark commented on SPARK-23040:
--

User 'advancedxy' has created a pull request for this issue:
https://github.com/apache/spark/pull/20449

> BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator 
> or ordering is specified
> 
>
> Key: SPARK-23040
> URL: https://issues.apache.org/jira/browse/SPARK-23040
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1
>Reporter: Xianjin YE
>Priority: Minor
>
> For example, if ordering is specified, the returned iterator is a 
> CompletionIterator:
> {code:java}
> dep.keyOrdering match {
>   case Some(keyOrd: Ordering[K]) =>
>     // Create an ExternalSorter to sort the data.
>     val sorter =
>       new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = dep.serializer)
>     sorter.insertAll(aggregatedIter)
>     context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
>     context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
>     context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes)
>     CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
>   case None =>
>     aggregatedIter
> }
> {code}
> However, the sorter consumes the aggregatedIter (which may be interruptible) 
> in sorter.insertAll, then creates an iterator which isn't interruptible.
> The problem with this is that the Spark task cannot be cancelled on stage 
> failure (unless interruptThread is enabled, which it is not by default), 
> which wastes executor resources.






[jira] [Resolved] (SPARK-23279) Avoid triggering distributed job for Console sink

2018-01-30 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-23279.
-
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 2.3.0

> Avoid triggering distributed job for Console sink
> -
>
> Key: SPARK-23279
> URL: https://issues.apache.org/jira/browse/SPARK-23279
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Saisai Shao
>Priority: Minor
> Fix For: 2.3.0
>
>
> The Console sink redistributes the collected local data and triggers a 
> distributed job in each batch; this is not necessary, so this change makes it 
> a local job.






[jira] [Commented] (SPARK-23279) Avoid triggering distributed job for Console sink

2018-01-30 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346301#comment-16346301
 ] 

Saisai Shao commented on SPARK-23279:
-

Issue resolved by pull request 20447
https://github.com/apache/spark/pull/20447

> Avoid triggering distributed job for Console sink
> -
>
> Key: SPARK-23279
> URL: https://issues.apache.org/jira/browse/SPARK-23279
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The Console sink redistributes the collected local data and triggers a 
> distributed job in each batch; this is not necessary, so this change makes it 
> a local job.






[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase

2018-01-30 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-23202:
---
Target Version/s: 2.3.0

> Break down DataSourceV2Writer.commit into two phase
> ---
>
> Key: SPARK-23202
> URL: https://issues.apache.org/jira/browse/SPARK-23202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Gengliang Wang
>Priority: Blocker
>
> Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
> writing job with a list of commit messages.
> It makes sense in some scenarios, e.g. MicroBatchExecution.
> However, on receiving a commit message, the driver can start processing 
> messages (e.g. persisting messages into files) before all the messages are 
> collected.
> The proposal is to break down DataSourceV2Writer.commit into two phases:
>  # add(WriterCommitMessage message): handles a commit message produced by 
> {@link DataWriter#commit()}.
>  # commit(): commits the writing job.
> This should make the API more flexible, and more reasonable for implementing 
> some data sources.
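A minimal Scala sketch of the proposed two-phase shape (an editor's addition; the trait is hypothetical, and the real DataSourceV2Writer is a Java interface with more members):

{code}
// Hypothetical sketch of the proposed two-phase commit protocol.
trait WriterCommitMessage

trait TwoPhaseDataSourceWriter {
  // Phase 1: called once per received commit message; the driver may start
  // processing (e.g. persisting the message to a file) right away.
  def add(message: WriterCommitMessage): Unit

  // Phase 2: called after all messages have been added; finalizes the job.
  def commit(): Unit
}
{code}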






[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase

2018-01-30 Thread Gengliang Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-23202:
---
Affects Version/s: (was: 2.2.1)
   2.3.0

> Break down DataSourceV2Writer.commit into two phase
> ---
>
> Key: SPARK-23202
> URL: https://issues.apache.org/jira/browse/SPARK-23202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Gengliang Wang
>Priority: Blocker
>
> Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a 
> writing job with a list of commit messages.
> It makes sense in some scenarios, e.g. MicroBatchExecution.
> However, on receiving a commit message, the driver can start processing 
> messages (e.g. persisting messages into files) before all the messages are 
> collected.
> The proposal is to break down DataSourceV2Writer.commit into two phases:
>  # add(WriterCommitMessage message): handles a commit message produced by 
> {@link DataWriter#commit()}.
>  # commit(): commits the writing job.
> This should make the API more flexible, and more reasonable for implementing 
> some data sources.






[jira] [Updated] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23203:

Priority: Blocker  (was: Major)

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Blocker
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.






[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346253#comment-16346253
 ] 

Apache Spark commented on SPARK-23203:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20448

> DataSourceV2 should use immutable trees.
> 
>
> Key: SPARK-23203
> URL: https://issues.apache.org/jira/browse/SPARK-23203
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> The DataSourceV2 integration doesn't use [immutable 
> trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html],
>  which is a basic requirement of Catalyst. The v2 relation should not wrap a 
> mutable reader and change the logical plan by pushing projections and filters.






[jira] [Comment Edited] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder

2018-01-30 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346246#comment-16346246
 ] 

Bruce Robbins edited comment on SPARK-23251 at 1/31/18 4:35 AM:


[~srowen] This also occurs with compiled apps submitted via spark-submit.

For example, this app:
{code:java}
import org.apache.spark.sql.SparkSession

object Implicit1 {
  def main(args: Array[String]) {
    if (args.length < 1) {
      Console.err.println("No input file specified")
      System.exit(1)
    }
    val inputFilename = args(0)

    val spark = SparkSession.builder().appName("Implicit1").getOrCreate()
    import spark.implicits._

    val df = spark.read.json(inputFilename)
    // Workaround: uncommenting this implicit Kryo encoder makes it run fine:
    //implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
    val results = df.map(row => row.getValuesMap[Any](List("stationName", "year"))).take(15)
    results.foreach(println)
  }
}{code}
When run on Spark 2.3 (via spark-submit), I get the same exception as I see 
with spark-shell.

With the implicit mapEncoder line uncommented, this compiles and runs fine on 
both 2.2 and 2.3.

Here's the exception from spark-submit on spark 2.3:

 
{noformat}
bash-3.2$ ./bin/spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1-SNAPSHOT
      /_/
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_161
 Branch branch-2.3
 Compiled by user brobbins on 2018-01-28T01:25:18Z
 Revision 3b6fc286d105ae7de737c46e50cf941e6831ab98
 Url https://github.com/apache/spark.git
 Type --help for more information.
bash-3.2$ ./bin/spark-submit --class "Implicit1" 
~/github/sparkAppPlay/target/scala-2.11/temps_2.11-1.0.jar 
~/ncdc_gsod_short.jsonl 
 ..
 Exception in thread "main" java.lang.ClassNotFoundException: scala.Any
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:348)
 at 
scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555)
 at 
scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211)
 at 
scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203)
 at 
scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49)
 at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
 at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
 at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44)
 at 
scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203)
 at 
scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194)
 at 
scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65)
 at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
 at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
 at 
org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434)
 at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
 at org.apache.spark.sql.SQLImplicits.newMapEncoder(SQLImplicits.scala:172)
 at Implicit1$.main(Implicit1.scala:17)
 at Implicit1.main(Implicit1.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAcce


[jira] [Resolved] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23274.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala

[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-30 Thread Gaurav Garg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346200#comment-16346200
 ] 

Gaurav Garg commented on SPARK-18016:
-

Thanks [~kiszk] for helping me out. I have attached the logs of my test code,
which throws the constant pool exception. Please find the attached log file.

[^910825_9.zip]
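
For readers without the attachment, a minimal sketch of the kind of repro that
hits this limit (the shape below is an assumption for illustration, not the
attached test code):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("WidePlanRepro")
  .master("local[*]")
  .getOrCreate()

// A few thousand columns: each column adds entries to the constant pool of the
// generated projection class, which can exceed the JVM's per-class limit on
// affected versions.
val wide = (1 to 4000).foldLeft(spark.range(10).toDF("c0")) { (df, i) =>
  df.withColumn(s"c$i", lit(i))
}
wide.select("c3999").show()
{code}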

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 910825_9.zip
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.Un

[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-30 Thread Gaurav Garg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaurav Garg updated SPARK-18016:

Attachment: 910825_9.zip

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
> Attachments: 910825_9.zip
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(Si

[jira] [Commented] (SPARK-23279) Avoid triggering distributed job for Console sink

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346158#comment-16346158
 ] 

Apache Spark commented on SPARK-23279:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20447

> Avoid triggering distributed job for Console sink
> -
>
> Key: SPARK-23279
> URL: https://issues.apache.org/jira/browse/SPARK-23279
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The console sink redistributes the collected local data and triggers a
> distributed job in each batch. This is unnecessary, so change it to a local job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23279) Avoid triggering distributed job for Console sink

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23279:


Assignee: (was: Apache Spark)

> Avoid triggering distributed job for Console sink
> -
>
> Key: SPARK-23279
> URL: https://issues.apache.org/jira/browse/SPARK-23279
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Priority: Minor
>
> The console sink redistributes the collected local data and triggers a
> distributed job in each batch. This is unnecessary, so change it to a local job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23279) Avoid triggering distributed job for Console sink

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23279:


Assignee: Apache Spark

> Avoid triggering distributed job for Console sink
> -
>
> Key: SPARK-23279
> URL: https://issues.apache.org/jira/browse/SPARK-23279
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> The console sink redistributes the collected local data and triggers a
> distributed job in each batch. This is unnecessary, so change it to a local job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23277) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23277.
---
Resolution: Invalid

> Spark ALS : param coldStartStrategy does not exist.
> ---
>
> Key: SPARK-23277
> URL: https://issues.apache.org/jira/browse/SPARK-23277
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Surya Prakash Reddy
>Priority: Major
>
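
For anyone hitting this error: the coldStartStrategy param was only added to ALS
in later releases (2.2.0, if I recall correctly), so it does not exist in the
reported 2.1.0. A minimal sketch, assuming Spark 2.2 or later and illustrative
column names:
{code:java}
import org.apache.spark.ml.recommendation.ALS

// "drop" removes rows with NaN predictions for users or items
// that were unseen during training.
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop")
{code}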




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23279) Avoid triggering distributed job for Console sink

2018-01-30 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-23279:
---

 Summary: Avoid triggering distributed job for Console sink
 Key: SPARK-23279
 URL: https://issues.apache.org/jira/browse/SPARK-23279
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Saisai Shao


The console sink redistributes the collected local data and triggers a
distributed job in each batch. This is unnecessary, so change it to a local job.
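
For reference, a minimal sketch of driving the console sink (the rate source and
the option names below are standard Structured Streaming APIs; the numbers are
arbitrary):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ConsoleSinkDemo")
  .master("local[*]")
  .getOrCreate()

// The rate source emits (timestamp, value) rows; the console sink prints each
// batch. The rows it prints are already collected to the driver, which is why
// a local job is sufficient.
val query = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()
  .writeStream
  .format("console")
  .option("numRows", "10")
  .start()

query.awaitTermination(5000)  // let a few batches print
query.stop()
{code}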



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order

2018-01-30 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346148#comment-16346148
 ] 

Liang-Chi Hsieh commented on SPARK-23273:
-

The {{name}} column is added after {{age}} in {{ds2}}, so the schema of {{ds2}}
doesn't match {{ds1}} in column order. You can fix the column order with a
projection before the union:
{code:java}
scala> ds1.union(ds2.select("name", "age").as[NameAge]).show
+-+---+
| name|age|
+-+---+
|henriquedsg89|  1|
+-+---+
{code}
Since 2.3.0, there is an API, {{unionByName}}, that can be used for this kind of case:
{code:java}
scala> ds1.unionByName(ds2).show
+-+---+
| name|age|
+-+---+
|henriquedsg89|  1|
+-+---+
{code}

> Spark Dataset withColumn - schema column order isn't the same as case class 
> parameter order
> 
>
> Key: SPARK-23273
> URL: https://issues.apache.org/jira/browse/SPARK-23273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Henrique dos Santos Goulart
>Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>  
> It's going to raise this error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"{noformat}
> It seems that .as[CaseClass] doesn't keep the order of parameters as declared
> on the case class.
> If I change the case class parameter order, it works, like:
> {code:java}
> case class NameAge(age: Int, name: String){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346118#comment-16346118
 ] 

Apache Spark commented on SPARK-23254:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/20446

> Add user guide entry for DataFrame multivariate summary
> ---
>
> Key: SPARK-23254
> URL: https://issues.apache.org/jira/browse/SPARK-23254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
> guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
> updated with the relevant example (to be on par with the [MLlib user 
> guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
>  
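
For reference, a minimal sketch of the DataFrame vector-summary API that
SPARK-19634 added (org.apache.spark.ml.stat.Summarizer; the data here is
illustrative):
{code:java}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SummarizerExample")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
).toDF("features", "weight")

// Compute several metrics in a single pass over the vector column.
val (mean, variance) = df
  .select(Summarizer.metrics("mean", "variance")
    .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)]
  .first()

println(s"mean = $mean, variance = $variance")
{code}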



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23254:


Assignee: Apache Spark

> Add user guide entry for DataFrame multivariate summary
> ---
>
> Key: SPARK-23254
> URL: https://issues.apache.org/jira/browse/SPARK-23254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Minor
>
> SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
> guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
> updated with the relevant example (to be on par with the [MLlib user 
> guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23254:


Assignee: (was: Apache Spark)

> Add user guide entry for DataFrame multivariate summary
> ---
>
> Key: SPARK-23254
> URL: https://issues.apache.org/jira/browse/SPARK-23254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
> guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
> updated with the relevant example (to be on par with the [MLlib user 
> guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23092) Migrate MemoryStream to DataSource V2

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346110#comment-16346110
 ] 

Apache Spark commented on SPARK-23092:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/20445

> Migrate MemoryStream to DataSource V2
> -
>
> Key: SPARK-23092
> URL: https://issues.apache.org/jira/browse/SPARK-23092
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Burak Yavuz
>Priority: Major
>
> We should migrate the MemoryStream for Structured Streaming to DataSourceV2
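
For context, a minimal sketch of how MemoryStream is used today, e.g. in tests
(it lives in Spark's internal org.apache.spark.sql.execution.streaming package):
{code:java}
import org.apache.spark.sql.{SparkSession, SQLContext}
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder()
  .appName("MemoryStreamDemo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext  // required by MemoryStream.apply

// An in-memory source: rows are added directly from the driver.
val input = MemoryStream[Int]
input.addData(1, 2, 3)

val query = input.toDF().writeStream.format("console").start()
query.processAllAvailable()  // drain everything added so far
query.stop()
{code}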



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23276.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0

> Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
> --
>
> Key: SPARK-23276
> URL: https://issues.apache.org/jira/browse/SPARK-23276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.3.0
>
>
> Like the Parquet suite, the ORC test suite should enable UDT tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23261) Rename Pandas UDFs

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23261:

Description: 
Rename the public APIs of pandas udfs from 
 - PANDAS SCALAR UDF -> SCALAR PANDAS UDF

 - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF 

 - PANDAS GROUP AGG UDF -> PANDAS UDAF [Only 2.4]

  was:
Rename the public APIs of pandas udfs from 

- PANDAS SCALAR UDF -> SCALAR PANDAS UDF

- PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF 

- PANDAS GROUP AGG UDF -> PANDAS UDAF


> Rename Pandas UDFs
> --
>
> Key: SPARK-23261
> URL: https://issues.apache.org/jira/browse/SPARK-23261
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Rename the public APIs of pandas udfs from 
>  - PANDAS SCALAR UDF -> SCALAR PANDAS UDF
>  - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF 
>  - PANDAS GROUP AGG UDF -> PANDAS UDAF [Only 2.4]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23261) Rename Pandas UDFs

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23261:

Fix Version/s: (was: 2.4.0)
   2.3.0

> Rename Pandas UDFs
> --
>
> Key: SPARK-23261
> URL: https://issues.apache.org/jira/browse/SPARK-23261
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Rename the public APIs of pandas udfs from 
> - PANDAS SCALAR UDF -> SCALAR PANDAS UDF
> - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF 
> - PANDAS GROUP AGG UDF -> PANDAS UDAF



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phases

2018-01-30 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23202:

Priority: Blocker  (was: Major)

> Break down DataSourceV2Writer.commit into two phases
> ---
>
> Key: SPARK-23202
> URL: https://issues.apache.org/jira/browse/SPARK-23202
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Gengliang Wang
>Priority: Blocker
>
> Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a
> writing job with a list of commit messages.
> It makes sense in some scenarios, e.g. MicroBatchExecution.
> However, on receiving a commit message, the driver can start processing
> messages (e.g. persisting them into files) before all the messages are
> collected.
> The proposal is to break down DataSourceV2Writer.commit into two phases:
>  # add(WriterCommitMessage message): handles a commit message produced by
> {@link DataWriter#commit()}.
>  # commit(): commits the writing job.
> This should make the API more flexible and more reasonable for implementing
> some data sources.
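
As a rough illustration, a hypothetical trait sketching the proposed two-phase
shape (the names follow the ticket; this is not the actual Spark interface):
{code:java}
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

trait TwoPhaseWriter {
  // Phase 1: handle each commit message as soon as a writer produces it,
  // e.g. persist it to a file, without waiting for the remaining writers.
  def add(message: WriterCommitMessage): Unit

  // Phase 2: commit the whole writing job once every message has been added.
  def commit(): Unit
}
{code}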



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder

2018-01-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346075#comment-16346075
 ] 

Sean Owen commented on SPARK-23251:
---

Hm. I don't know if this is related to Encoders and the mechanism you cite, at
least not directly. The error is that {{scala.Any}} can't be found, which of
course must certainly be available. This is typically a classloader issue, and
in {{spark-shell}} the classloader situation is complicated.

It may be a real problem still, or at least a symptom of a known class of 
problems. But can you confirm that this doesn't happen without the shell?

> ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
> -
>
> Key: SPARK-23251
> URL: https://issues.apache.org/jira/browse/SPARK-23251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: mac os high sierra, centos 7
>Reporter: Bruce Robbins
>Priority: Minor
>
> In branch-2.2, when you attempt to use row.getValuesMap[Any] without an 
> implicit Map encoder, you get a nice descriptive compile-time error:
> {noformat}
> scala> df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
> :26: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing spark.implicits._  Support for serializing other types 
> will be added in future releases.
>        df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
>              ^
> scala> implicit val mapEncoder = 
> org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
> mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: 
> binary]
> scala> df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
> res1: Array[Map[String,Any]] = Array(Map(stationName -> 007026 9, year -> 
> 2014), Map(stationName -> 007026 9, year -> 2014), Map(stationName -> 
> 007026 9, year -> 2014),
> etc...
> {noformat}
>  
>  On the latest master and also on branch-2.3, the transformation compiles (at 
> least on spark-shell), but throws a ClassNotFoundException:
>  
> {noformat}
> scala> df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
> java.lang.ClassNotFoundException: scala.Any
>  at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>  at java.lang.Class.forName0(Native Method)
>  at java.lang.Class.forName(Class.java:348)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203)
>  at 
> scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49)
>  at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
>  at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
>  at 
> scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65)
>  at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
>  at 
> sc

[jira] [Assigned] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23274:


Assignee: Xiao Li  (was: Apache Spark)

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apa

[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346071#comment-16346071
 ] 

Apache Spark commented on SPARK-23274:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20444

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> o

[jira] [Assigned] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23274:


Assignee: Apache Spark  (was: Xiao Li)

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Apache Spark
>Priority: Blocker
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> or

[jira] [Commented] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder

2018-01-30 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346050#comment-16346050
 ] 

Bruce Robbins commented on SPARK-23251:
---

I commented out the following line in
sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala and the problem
went away:

{code:java}
implicit def newMapEncoder[T <: Map[_, _] : TypeTag]: Encoder[T] = 
ExpressionEncoder()
{code}
 

By "went away", I mean I now had to specify a Map encoder for my map function 
to compile (rather than have it compile and then throw an exception).

Checking with [~michalsenkyr], who will know more than I do.

> ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
> -
>
> Key: SPARK-23251
> URL: https://issues.apache.org/jira/browse/SPARK-23251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
> Environment: mac os high sierra, centos 7
>Reporter: Bruce Robbins
>Priority: Minor
>
> In branch-2.2, when you attempt to use row.getValuesMap[Any] without an 
> implicit Map encoder, you get a nice descriptive compile-time error:
> {noformat}
> scala> df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
> :26: error: Unable to find encoder for type stored in a Dataset.  
> Primitive types (Int, String, etc) and Product types (case classes) are 
> supported by importing spark.implicits._  Support for serializing other types 
> will be added in future releases.
>        df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
>              ^
> scala> implicit val mapEncoder = 
> org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
> mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: 
> binary]
> scala> df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
> res1: Array[Map[String,Any]] = Array(Map(stationName -> 007026 9, year -> 
> 2014), Map(stationName -> 007026 9, year -> 2014), Map(stationName -> 
> 007026 9, year -> 2014),
> etc...
> {noformat}
>  
>  On the latest master and also on branch-2.3, the transformation compiles (at 
> least on spark-shell), but throws a ClassNotFoundException:
>  
> {noformat}
> scala> df.map(row => row.getValuesMap[Any](List("stationName", 
> "year"))).collect
> java.lang.ClassNotFoundException: scala.Any
>  at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>  at java.lang.Class.forName0(Native Method)
>  at java.lang.Class.forName(Class.java:348)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203)
>  at 
> scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49)
>  at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
>  at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
>  at 
> scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194)
>  at 
> scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65)
>  at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflec

[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2018-01-30 Thread John Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345963#comment-16345963
 ] 

John Cheng commented on SPARK-18057:


Apache Kafka is now at version 1.0. For people who want to use Spark streaming 
against Kafka brokers on 1.0.0, it is preferable to use the 
`org.apache.kafka:kafka-clients:jar:1.0.0` client.

"Most of the discussion on the performance impact of [upgrading to the 0.10.0 
message 
format|https://kafka.apache.org/0110/documentation.html#upgrade_10_performance_impact]
 remains pertinent to the 0.11.0 upgrade. This mainly affects clusters that are 
not secured with TLS since "zero-copy" transfer is already not possible in that 
case. In order to avoid the cost of down-conversion, you should ensure that 
consumer applications are upgraded to the latest 0.11.0 client."
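
A build-level sketch of that recommendation (coordinates and versions here are 
illustrative; the Spark artifact keeps its 0.10-based name even when the client 
jar is overridden):

{code:java}
// build.sbt sketch: force the newer kafka-clients onto the classpath so
// consumers avoid the broker-side down-conversion cost described above.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.2.1",
  "org.apache.kafka" % "kafka-clients" % "1.0.0"  // overrides the transitive 0.10.x client
)
{code}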

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>Priority: Major
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345949#comment-16345949
 ] 

Apache Spark commented on SPARK-23157:
--

User 'henryr' has created a pull request for this issue:
https://github.com/apache/spark/pull/20443

> withColumn fails for a column that is a result of mapped DataSet
> 
>
> Key: SPARK-23157
> URL: https://issues.apache.org/jira/browse/SPARK-23157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Having 
> {code:java}
> case class R(id: String)
> val ds = spark.createDataset(Seq(R("1")))
> {code}
> This works:
> {code}
> scala> ds.withColumn("n", ds.col("id"))
> res16: org.apache.spark.sql.DataFrame = [id: string, n: string]
> {code}
> but when we map over ds it fails:
> {code}
> scala> ds.withColumn("n", ds.map(a => a).col("id"))
> org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing 
> from id#4 in operator !Project [id#4, id#55 AS n#57];;
> !Project [id#4, id#55 AS n#57]
> +- LocalRelation [id#4]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1150)
>   at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905)
>   ... 48 elided
> {code}
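
A sketch of a possible workaround (an assumption, not the project's fix): resolve 
the column against the mapped Dataset itself, so both sides of the projection 
come from the same plan:

{code:java}
// Keep a reference to the mapped Dataset and take the column from it directly.
val mapped = ds.map(a => a)
mapped.withColumn("n", mapped.col("id"))  // "id" resolves against mapped's own output
{code}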



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23275.
-
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 2.3.0

> hive/tests have been failing when run locally on the laptop (Mac) with OOM 
> ---
>
> Key: SPARK-23275
> URL: https://issues.apache.org/jira/browse/SPARK-23275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Major
> Fix For: 2.3.0
>
>
> hive tests have been failing when run locally (macOS) after a recent change 
> in trunk. After running the tests for some time, they fail with an OOM 
> error: unable to create new native thread. 
> I noticed the thread count goes all the way up to 2000+, after which we start 
> getting these OOM errors. Most of the threads seem to be related to the 
> connection pool in the hive metastore (BoneCP-x-). This behaviour change 
> began after we made the following change to HiveClientImpl.reset()
> {code}
> def reset(): Unit = withHiveState {
>   try {
>     // code
>   } finally {
>     runSqlHive("USE default")  // ===> this is causing the issue
>   }
> }
> {code}
> I am proposing to temporarily back out part of the fix made to address 
> SPARK-23000 to resolve this issue while we work out the exact reason for this 
> sudden increase in thread counts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23236) Make it easier to find the rest API, especially in local mode

2018-01-30 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345827#comment-16345827
 ] 

Alex Bozarth commented on SPARK-23236:
--

For #1, a REST API endpoint shouldn't return HTML, but we could return a custom 
error response (a 405 or another appropriate status code) whose body includes a 
short "maybe you meant..." description.

For #2 I would be OK with a "maybe you meant..." response as well, but with a 
link to the Spark REST API docs to aid the user.
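
For context, a sketch of the endpoint shapes those links would target (assuming 
the default local-mode UI port of 4040):

{noformat}
http://localhost:4040/api/v1/applications
http://localhost:4040/api/v1/applications/[app-id]/jobs
{noformat}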

> Make it easier to find the rest API, especially in local mode
> -
>
> Key: SPARK-23236
> URL: https://issues.apache.org/jira/browse/SPARK-23236
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Trivial
>  Labels: newbie
>
> This is really minor, but it always takes me a little bit to figure out how 
> to get from the UI to the REST API. It's especially a pain in local mode, 
> where you need the app-id; in general I don't know the app-id, so I have to 
> either look in logs or go to another endpoint first in the UI just to find 
> the app-id. While it wouldn't really help anybody accessing the endpoints 
> programmatically, we could make it easier for someone doing exploration via 
> their browser.
> Some things which could be improved:
> * /api/v1 just provides a link to "/api/v1/applications"
> * /api provides a link to "/api/v1/applications"
> * /api/v1/applications/[app-id] gives a list of links for the other endpoints
> * on the UI, there is a link to at least /api/v1/applications/[app-id] -- 
> better still if each UI page links to the corresponding endpoint, eg. the all 
> jobs page would link to /api/v1/applications/[app-id]/jobs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23237) Add UI / endpoint for threaddumps for executors with active tasks

2018-01-30 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345818#comment-16345818
 ] 

Alex Bozarth commented on SPARK-23237:
--

I would rather keep it to an API endpoint, but what I'm worried about is having 
an endpoint that returns a specific thread dump chosen by some opaque 
heuristic. Again, I'm willing to look at a PR to see if the exact implementation 
changes my mind.

> Add UI / endpoint for threaddumps for executors with active tasks
> -
>
> Key: SPARK-23237
> URL: https://issues.apache.org/jira/browse/SPARK-23237
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Priority: Major
>
> Frequently, when there are a handful of straggler tasks, users want to know 
> what is going on in those executors running the stragglers.  Currently, that 
> is a bit of a pain to do: you have to go to the page for your active stage, 
> find the task, figure out which executor it's on, then go to the executors 
> page, and get the thread dump.  Or maybe you just go to the executors page, 
> find the executor with an active task, and then click on that, but that 
> doesn't work if you've got multiple stages running.
> Users could figure this out by extracting the info from the stage REST endpoint, 
> but it's such a common thing to do that we should make it easy.
> I realize that figuring out a good way to do this is a little tricky.  We 
> don't want to make it easy to end up pulling thread dumps from 1000 executors 
> back to the driver.  So we've got to come up with a reasonable heuristic for 
> choosing which executors to poll.  And we've also got to find a suitable 
> place to put this.
> My suggestion is that the stage page always has a link to the thread dumps 
> for the *one* executor with the longest running task.  And there would be a 
> corresponding endpoint in the rest api with the same info, maybe at 
> {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/slowestTaskThreadDump}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-23274:
---
Labels:   (was: correctness)

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalys
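
A sketch of a possible interim rewrite (an assumption, not the fix that ships): 
EXCEPT's distinct semantics can be approximated with a left anti join, which does 
not go through ReplaceExceptWithFilter:

{code:java}
// Equivalent up to NULL-matching semantics: rows of df1's "b" not present in
// df2's "b", de-duplicated as EXCEPT would do.
df1.select("b")
  .join(df2.select("b"), Seq("b"), "left_anti")
  .distinct()
  .show()
{code}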

[jira] [Resolved] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23278.
---
Resolution: Duplicate

You opened this twice, so I closed it. Please don't reopen JIRAs.
Your description here is just a stack trace. Without anything more I'd close 
the other one too.

> Spark ALS : param coldStartStrategy does not exist.
> ---
>
> Key: SPARK-23278
> URL: https://issues.apache.org/jira/browse/SPARK-23278
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Surya Prakash Reddy
>Priority: Major
>
> An error occurred while calling o105.getParam. : 
> java.util.NoSuchElementException: Param coldStartStrategy does not exist. at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at 
> org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:280) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:214) at 
> java.lang.Thread.run(Thread.java:744)
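
Worth noting: as far as I can tell, {{coldStartStrategy}} was only added to ALS 
in 2.2.0, so the param genuinely does not exist on the reporter's 2.1.0. On a 
release that has it, a sketch of the usage (column names are illustrative):

{code:java}
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop")  // drop NaN predictions for unseen users/items
{code}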



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-23278.
-

> Spark ALS : param coldStartStrategy does not exist.
> ---
>
> Key: SPARK-23278
> URL: https://issues.apache.org/jira/browse/SPARK-23278
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Surya Prakash Reddy
>Priority: Major
>
> An error occurred while calling o105.getParam. : 
> java.util.NoSuchElementException: Param coldStartStrategy does not exist. at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at 
> org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:280) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:214) at 
> java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Surya Prakash Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surya Prakash Reddy reopened SPARK-23278:
-

> Spark ALS : param coldStartStrategy does not exist.
> ---
>
> Key: SPARK-23278
> URL: https://issues.apache.org/jira/browse/SPARK-23278
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Surya Prakash Reddy
>Priority: Major
>
> An error occurred while calling o105.getParam. : 
> java.util.NoSuchElementException: Param coldStartStrategy does not exist. at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at 
> org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:280) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:214) at 
> java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23278.
---
Resolution: Duplicate

> Spark ALS : param coldStartStrategy does not exist.
> ---
>
> Key: SPARK-23278
> URL: https://issues.apache.org/jira/browse/SPARK-23278
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Surya Prakash Reddy
>Priority: Major
>
> An error occurred while calling o105.getParam. : 
> java.util.NoSuchElementException: Param coldStartStrategy does not exist. at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at 
> org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:280) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:214) at 
> java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23265:


Assignee: (was: Apache Spark)

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param 
> {{numBuckets}} when transforming multiple columns, since that is then applied 
> to all columns.
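
A sketch of the multi-column usage that note describes (column names are 
illustrative):

{code:java}
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("c1", "c2"))
  .setOutputCols(Array("c1_buckets", "c2_buckets"))
  .setNumBuckets(4)  // single-column param, applied to every input column
{code}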



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23265:


Assignee: Apache Spark

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param 
> {{numBuckets}} when transforming multiple columns, since that is then applied 
> to all columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345723#comment-16345723
 ] 

Apache Spark commented on SPARK-23265:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/20442

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param 
> {{numBuckets}} when transforming multiple columns, since that is then applied 
> to all columns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Surya Prakash Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surya Prakash Reddy updated SPARK-23278:

Description: An error occurred while calling o105.getParam. : 
java.util.NoSuchElementException: Param coldStartStrategy does not exist. at 
org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) at 
org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) at 
scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at 
org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606) at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
py4j.Gateway.invoke(Gateway.java:280) at 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
py4j.commands.CallCommand.execute(CallCommand.java:79) at 
py4j.GatewayConnection.run(GatewayConnection.java:214) at 
java.lang.Thread.run(Thread.java:744)

> Spark ALS : param coldStartStrategy does not exist.
> ---
>
> Key: SPARK-23278
> URL: https://issues.apache.org/jira/browse/SPARK-23278
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Surya Prakash Reddy
>Priority: Major
>
> An error occurred while calling o105.getParam. : 
> java.util.NoSuchElementException: Param coldStartStrategy does not exist. at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at 
> org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) 
> at scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at 
> org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:280) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:214) at 
> java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Surya Prakash Reddy (JIRA)
Surya Prakash Reddy created SPARK-23278:
---

 Summary: Spark ALS : param coldStartStrategy does not exist.
 Key: SPARK-23278
 URL: https://issues.apache.org/jira/browse/SPARK-23278
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Surya Prakash Reddy






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23277) Spark ALS : param coldStartStrategy does not exist.

2018-01-30 Thread Surya Prakash Reddy (JIRA)
Surya Prakash Reddy created SPARK-23277:
---

 Summary: Spark ALS : param coldStartStrategy does not exist.
 Key: SPARK-23277
 URL: https://issues.apache.org/jira/browse/SPARK-23277
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.0
Reporter: Surya Prakash Reddy






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345686#comment-16345686
 ] 

Apache Spark commented on SPARK-23275:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/20441

> hive/tests have been failing when run locally on the laptop (Mac) with OOM 
> ---
>
> Key: SPARK-23275
> URL: https://issues.apache.org/jira/browse/SPARK-23275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>
> hive tests have been failing when run locally (macOS) after a recent change 
> in trunk. After running the tests for some time, they fail with an OOM 
> error: unable to create new native thread. 
> I noticed the thread count goes all the way up to 2000+, after which we start 
> getting these OOM errors. Most of the threads seem to be related to the 
> connection pool in the hive metastore (BoneCP-x-). This behaviour change 
> began after we made the following change to HiveClientImpl.reset()
> {code}
> def reset(): Unit = withHiveState {
>   try {
>     // code
>   } finally {
>     runSqlHive("USE default")  // ===> this is causing the issue
>   }
> }
> {code}
> I am proposing to temporarily back out part of the fix made to address 
> SPARK-23000 to resolve this issue while we work out the exact reason for this 
> sudden increase in thread counts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23275:


Assignee: (was: Apache Spark)

> hive/tests have been failing when run locally on the laptop (Mac) with OOM 
> ---
>
> Key: SPARK-23275
> URL: https://issues.apache.org/jira/browse/SPARK-23275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Priority: Major
>
> hive tests have been failing when run locally (macOS) after a recent change 
> in trunk. After running the tests for some time, they fail with an OOM 
> error: unable to create new native thread. 
> I noticed the thread count goes all the way up to 2000+, after which we start 
> getting these OOM errors. Most of the threads seem to be related to the 
> connection pool in the hive metastore (BoneCP-x-). This behaviour change 
> began after we made the following change to HiveClientImpl.reset()
> {code}
> def reset(): Unit = withHiveState {
>   try {
>     // code
>   } finally {
>     runSqlHive("USE default")  // ===> this is causing the issue
>   }
> }
> {code}
> I am proposing to temporarily back out part of the fix made to address 
> SPARK-23000 to resolve this issue while we work out the exact reason for this 
> sudden increase in thread counts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23275:


Assignee: Apache Spark

> hive/tests have been failing when run locally on the laptop (Mac) with OOM 
> ---
>
> Key: SPARK-23275
> URL: https://issues.apache.org/jira/browse/SPARK-23275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dilip Biswal
>Assignee: Apache Spark
>Priority: Major
>
> hive tests have been failing when run locally (macOS) after a recent change 
> in trunk. After running the tests for some time, they fail with an OOM 
> error: unable to create new native thread. 
> I noticed the thread count goes all the way up to 2000+, after which we start 
> getting these OOM errors. Most of the threads seem to be related to the 
> connection pool in the hive metastore (BoneCP-x-). This behaviour change 
> began after we made the following change to HiveClientImpl.reset()
> {code}
> def reset(): Unit = withHiveState {
>   try {
>     // code
>   } finally {
>     runSqlHive("USE default")  // ===> this is causing the issue
>   }
> }
> {code}
> I am proposing to temporarily back out part of the fix made to address 
> SPARK-23000 to resolve this issue while we work out the exact reason for this 
> sudden increase in thread counts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-23274:
---
Labels: correctness  (was: )

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
>  Labels: correctness
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  

[jira] [Assigned] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23276:


Assignee: (was: Apache Spark)

> Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
> --
>
> Key: SPARK-23276
> URL: https://issues.apache.org/jira/browse/SPARK-23276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet, the ORC test suite should enable UDT tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23276:


Assignee: Apache Spark

> Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
> --
>
> Key: SPARK-23276
> URL: https://issues.apache.org/jira/browse/SPARK-23276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Like Parquet, the ORC test suite should enable UDT tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345679#comment-16345679
 ] 

Apache Spark commented on SPARK-23276:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20440

> Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
> --
>
> Key: SPARK-23276
> URL: https://issues.apache.org/jira/browse/SPARK-23276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet, the ORC test suite should enable UDT tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite

2018-01-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23276:
--
Component/s: Tests

> Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
> --
>
> Key: SPARK-23276
> URL: https://issues.apache.org/jira/browse/SPARK-23276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet, the ORC test suite should enable UDT tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite

2018-01-30 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23276:
--
Description: Like Parquet, the ORC test suite should enable UDT tests.

> Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
> --
>
> Key: SPARK-23276
> URL: https://issues.apache.org/jira/browse/SPARK-23276
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Like Parquet, the ORC test suite should enable UDT tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite

2018-01-30 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23276:
-

 Summary: Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
 Key: SPARK-23276
 URL: https://issues.apache.org/jira/browse/SPARK-23276
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23267) Increase spark.sql.codegen.hugeMethodLimit to 65535

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23267.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Increase spark.sql.codegen.hugeMethodLimit to 65535
> ---
>
> Key: SPARK-23267
> URL: https://issues.apache.org/jira/browse/SPARK-23267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> We still saw the performance regression introduced by 
> `spark.sql.codegen.hugeMethodLimit` in our internal workloads. There are two 
> major issues in the current solution.
>  * The size of the compiled bytecode is not identical to the bytecode size 
> of the method, so the detection is still not accurate. 
>  * The bytecode size of a single operator (e.g., `SerializeFromObject`) could 
> still exceed the 8K limit. We saw the performance regression in such scenarios. 
> Since it is close to the 2.3 release, we decided to increase the limit to 64K to 
> avoid the perf regression.
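
For reference, the knob in question can be raised per session (conf name taken 
from this ticket; 65535 is the JVM's hard per-method bytecode limit):

{code:java}
// Sketch: raise the threshold above which codegen falls back for huge methods.
spark.conf.set("spark.sql.codegen.hugeMethodLimit", "65535")
{code}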



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM

2018-01-30 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-23275:


 Summary: hive/tests have been failing when run locally on the 
laptop (Mac) with OOM 
 Key: SPARK-23275
 URL: https://issues.apache.org/jira/browse/SPARK-23275
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dilip Biswal


hive tests have been failing when run locally (macOS) after a recent change 
in trunk. After running the tests for some time, they fail with an OOM 
error: unable to create new native thread.

I noticed the thread count goes all the way up to 2000+, after which we start 
getting these OOM errors. Most of the threads seem to be related to the 
connection pool in the hive metastore (BoneCP-x-). This behaviour change 
began after we made the following change to HiveClientImpl.reset()

{code}
def reset(): Unit = withHiveState {
  try {
    // code
  } finally {
    runSqlHive("USE default")  // ===> this is causing the issue
  }
}
{code}

I am proposing to temporarily back out part of the fix made to address 
SPARK-23000 to resolve this issue while we work out the exact reason for this 
sudden increase in thread counts.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23254) Add user guide entry for DataFrame multivariate summary

2018-01-30 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345651#comment-16345651
 ] 

Weichen Xu commented on SPARK-23254:


I will work on this. Thanks! 

> Add user guide entry for DataFrame multivariate summary
> ---
>
> Key: SPARK-23254
> URL: https://issues.apache.org/jira/browse/SPARK-23254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Minor
>
> SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user 
> guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be 
> updated, with the relevant example (to be in parity with the [MLlib user 
> guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]).
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23261) Rename Pandas UDFs

2018-01-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345641#comment-16345641
 ] 

Apache Spark commented on SPARK-23261:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20439

> Rename Pandas UDFs
> --
>
> Key: SPARK-23261
> URL: https://issues.apache.org/jira/browse/SPARK-23261
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> Rename the public APIs of Pandas UDFs as follows:
> - PANDAS SCALAR UDF -> SCALAR PANDAS UDF
> - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF 
> - PANDAS GROUP AGG UDF -> PANDAS UDAF



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2018-01-30 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345615#comment-16345615
 ] 

Marcelo Vanzin commented on SPARK-18085:


If you want the quick and dirty description: these changes decouple application 
status data storage from the UI code and allow different backends for storing 
the status data. In-memory and disk-based stores ship with 2.3. This lets the 
SHS keep status data on disk, using less memory and serving data more quickly 
after a restart.
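
For reference, a spark-defaults.conf sketch for the history server host (assuming 
the 2.3 property names for the new disk-backed store; the path is illustrative):

{noformat}
spark.history.store.path          /var/lib/spark/shs-store
spark.history.store.maxDiskUsage  10g
{noformat}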

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Major
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to solving them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23020:


Assignee: Marcelo Vanzin  (was: Apache Spark)

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Marcelo Vanzin
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23020:


Assignee: Apache Spark  (was: Marcelo Vanzin)

> Re-enable Flaky Test: 
> org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> 
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2018-01-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345598#comment-16345598
 ] 

Thomas Bünger commented on SPARK-12394:
---

Any news on this issue? Is it really fixed? I also can't find a corresponding 
pull request.

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on.  Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
>  Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitioning of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that it's distributed by these 
> columns
>  - When planning/executing a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc
>  - Should work at least with all Hadoop FS data sources
>  - Should work with any data source when caching
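
For context, this shipped in Spark 2.0 as the {{bucketBy}} API on 
{{DataFrameWriter}}. A minimal sketch (the column and table names are made up):

{code:scala}
// Pre-hash-partition the data by the join key at write time, so later joins
// or aggregations on that key can avoid a shuffle on the read side.
df.write
  .bucketBy(8, "user_id")         // hash-partition rows into 8 buckets
  .sortBy("user_id")              // optionally keep each bucket sorted
  .format("parquet")
  .saveAsTable("users_bucketed")  // bucketing requires saveAsTable
{code}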



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-01-30 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345585#comment-16345585
 ] 

Huaxin Gao commented on SPARK-23265:


I am working on it. Will submit a PR today. 

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param 
> {{numBuckets}} when transforming multiple columns, since that is then applied 
> to all columns.
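
For illustration, a minimal sketch of the multi-column usage described above 
(the input DataFrame {{df}} and its column names are made up):

{code:scala}
import org.apache.spark.ml.feature.QuantileDiscretizer

// Multi-column support from SPARK-22397: the single-column numBuckets param
// applies to every input column, so this combination should not error.
val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("hour", "clicks"))
  .setOutputCols(Array("hourBucket", "clicksBucket"))
  .setNumBuckets(5)

val bucketed = discretizer.fit(df).transform(df)
{code}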



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2018-01-30 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345551#comment-16345551
 ] 

Kazuaki Ishizaki edited comment on SPARK-18016 at 1/30/18 6:27 PM:
---

[~gaurav.garg] Thank you for your confirmation. I am running a program that has 
{{Statistics.corr(vec, "pearson")}}. It takes some time before the program stops.

I would appreciate it if you could share a log file that includes the generated 
Java code.
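
For the record, a sketch of the kind of program being run (this assumes a 
SparkContext {{sc}} and that {{vec}} is an RDD of wide dense vectors; the 
sizes are made up):

{code:scala}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Pearson correlation over wide vectors; very wide inputs push the generated
// code toward the JVM constant pool limit tracked in this ticket.
val vec = sc.parallelize(Seq.fill(1000)(
  Vectors.dense(Array.fill(4500)(scala.util.Random.nextDouble()))))
val corrMatrix = Statistics.corr(vec, "pearson")
{code}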


was (Author: kiszk):
[~gaurav.garg] Thank you for your confirmation. I am running a program that has 
\{{Statistics.corr(vec, "pearson"})}}. It takes some time until the program 
stops.

I would appreciate it if you could share a log file that includes the generated 
Java code.

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.U

[jira] [Updated] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23274:

Target Version/s: 2.3.0

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.Tre

[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column

2018-01-30 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345554#comment-16345554
 ] 

Xiao Li commented on SPARK-23274:
-

Since this is a regression, I will try to fix it ASAP.

> ReplaceExceptWithFilter fails on dataframes filtered on same column
> ---
>
> Key: SPARK-23274
> URL: https://issues.apache.org/jira/browse/SPARK-23274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Onur Satici
>Assignee: Xiao Li
>Priority: Blocker
>
> Currently affects:
> {code:java}
> $ git tag --contains 01f6ba0e7a
> v2.3.0-rc1
> v2.3.0-rc2
> {code}
> Steps to reproduce:
> {code:java}
> $ cat test.csv
> a,b
> 1,2
> 1,3
> 2,2
> 2,4
> {code}
> {code:java}
> val df = spark.read.format("csv").option("header", "true").load("test.csv")
> val df1 = df.filter($"a" === 1)
> val df2 = df.filter($"a" === 2)
> df1.select("b").except(df2.select("b")).show
> {code}
> results in:
> {code:java}
> java.util.NoSuchElementException: key not found: a
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:59)
>   at scala.collection.MapLike$class.apply(MapLike.scala:141)
>   at scala.collection.AbstractMap.apply(Map.scala:59)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transfo

  1   2   >