[jira] [Assigned] (SPARK-22692) Reduce the number of generated mutable states
[ https://issues.apache.org/jira/browse/SPARK-22692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22692: --- Assignee: Marco Gaido > Reduce the number of generated mutable states > - > > Key: SPARK-22692 > URL: https://issues.apache.org/jira/browse/SPARK-22692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Major > > A large number of mutable states can cause an error during code generation due > to reaching the constant pool limit. There is an ongoing effort on > SPARK-18016 to fix the problem; nonetheless, we can also alleviate it by avoiding > the creation of global variables when they are not needed. > Therefore I am creating this umbrella ticket to track the elimination of > global variables where they are not needed. This is not a duplicate of or an > alternative to SPARK-18016: it is a complementary effort which, together with it, > can help support wider datasets. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22939) Support Spark UDF in registerFunction
[ https://issues.apache.org/jira/browse/SPARK-22939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22939: Issue Type: Improvement (was: Bug) > Support Spark UDF in registerFunction > - > > Key: SPARK-22939 > URL: https://issues.apache.org/jira/browse/SPARK-22939 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.0 > > > {noformat} > import random > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType, StringType > random_udf = udf(lambda: int(random.random() * 100), > IntegerType()).asNondeterministic() > spark.catalog.registerFunction("random_udf", random_udf, StringType()) > spark.sql("SELECT random_udf()").collect() > {noformat} > We will get the following error. > {noformat} > Py4JError: An error occurred while calling o29.__getnewargs__. Trace: > py4j.Py4JException: Method __getnewargs__([]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:274) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22961) Constant columns no longer picked as constraints in 2.3
[ https://issues.apache.org/jira/browse/SPARK-22961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22961: Issue Type: Bug (was: Improvement) > Constant columns no longer picked as constraints in 2.3 > --- > > Key: SPARK-22961 > URL: https://issues.apache.org/jira/browse/SPARK-22961 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Adrian Ionescu >Assignee: Adrian Ionescu >Priority: Major > Labels: constraints, optimizer, regression > Fix For: 2.3.0 > > > We're no longer picking up {{x = 2}} as a constraint from something like > {{df.withColumn("x", lit(2))}} > The unit test below succeeds in {{branch-2.2}}: > {code} > test("constraints should be inferred from aliased literals") { > val originalLeft = testRelation.subquery('left).as("left") > val optimizedLeft = testRelation.subquery('left).where(IsNotNull('a) && > 'a <=> 2).as("left") > val right = Project(Seq(Literal(2).as("two")), > testRelation.subquery('right)).as("right") > val condition = Some("left.a".attr === "right.two".attr) > val original = originalLeft.join(right, Inner, condition) > val correct = optimizedLeft.join(right, Inner, condition) > comparePlans(Optimize.execute(original.analyze), correct.analyze) > } > {code} > but fails in {{branch-2.3}} with: > {code} > == FAIL: Plans do not match === > 'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0) > !:- Filter isnotnull(a#0) :- Filter ((2 <=> a#0) && > isnotnull(a#0)) > : +- LocalRelation , [a#0, b#0, c#0] : +- LocalRelation , > [a#0, b#0, c#0] > +- Project [2 AS two#0]+- Project [2 AS two#0] > +- LocalRelation , [a#0, b#0, c#0] +- LocalRelation , > [a#0, b#0, c#0] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
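For illustration, a minimal Scala sketch of the user-facing behavior described above (assumes an active {{spark}} session; the table shapes and column names are made up):
{code}
import org.apache.spark.sql.functions._

val left = spark.range(10).toDF("a")
val right = spark.range(5).toDF("b").withColumn("two", lit(2))

// In branch-2.2 the optimizer infers `a <=> 2` on the left side from the
// constant column `two` plus the join condition and pushes it as a filter;
// in the 2.3 RCs that inferred filter is missing from the optimized plan.
left.join(right, left("a") === right("two")).explain(true)
{code}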
[jira] [Updated] (SPARK-16060) Vectorized Orc reader
[ https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16060: Labels: release-notes releasenotes (was: release-notes) > Vectorized Orc reader > - > > Key: SPARK-16060 > URL: https://issues.apache.org/jira/browse/SPARK-16060 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1 >Reporter: Liang-Chi Hsieh >Assignee: Dongjoon Hyun >Priority: Major > Labels: release-notes, releasenotes > Fix For: 2.3.0 > > > Currently the Orc reader in Spark SQL doesn't support vectorized reading. As Hive > Orc already supports vectorization, we should add this support to improve Orc > reading performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
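A minimal sketch of trying the vectorized path: the config keys {{spark.sql.orc.impl}} and {{spark.sql.orc.enableVectorizedReader}} are the Spark 2.3 names and should be treated as assumptions on other versions; the input path is hypothetical.
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-vectorized")
  .master("local[*]")
  // use the new ORC implementation in sql/core instead of the Hive-based one
  .config("spark.sql.orc.impl", "native")
  // turn on the vectorized ORC reader added by this ticket
  .config("spark.sql.orc.enableVectorizedReader", "true")
  .getOrCreate()

val df = spark.read.orc("/tmp/example_orc_table")
df.show(5)
{code}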
[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22510: Labels: releasenotes (was: ) > Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit > > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: releasenotes > > Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant > pool entry limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
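As a rough illustration of how these limits get hit, a hedged sketch of a very wide projection (the column count needed in practice depends on the expressions involved):
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("wide-plan").master("local[*]").getOrCreate()
import spark.implicits._

// Each derived column adds generated code and constant pool entries to the
// generated class; with enough of them, whole-stage codegen can fail and
// Spark has to fall back to the interpreted path.
val base = Seq((1, 2.0)).toDF("id", "x")
val wide = (1 to 1000).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i", round($"x" * i, 2))
}
wide.explain()
{code}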
[jira] [Assigned] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-22510: --- Assignee: Kazuaki Ishizaki > Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit > > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Kazuaki Ishizaki >Priority: Major > Labels: releasenotes > > Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant > pool entry limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22510: Fix Version/s: (was: 2.3.0) > Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit > > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Priority: Major > > Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant > pool entry limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22510) Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit
[ https://issues.apache.org/jira/browse/SPARK-22510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22510: Fix Version/s: 2.3.0 > Exceptions caused by 64KB JVM bytecode or 64K constant pool entry limit > > > Key: SPARK-22510 > URL: https://issues.apache.org/jira/browse/SPARK-22510 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Priority: Major > > Codegen can throw an exception due to the 64KB JVM bytecode or 64K constant > pool entry limit. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows
[ https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20392: Component/s: SQL > Slow performance when calling fit on ML pipeline for dataset with many > columns but few rows > --- > > Key: SPARK-20392 > URL: https://issues.apache.org/jira/browse/SPARK-20392 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 2.1.0 >Reporter: Barry Becker >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.0 > > Attachments: blockbuster.csv, blockbuster_fewCols.csv, > giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip > > > This started as a [question on stack > overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro], > but it seems like a bug. > I am testing spark pipelines using a simple dataset (attached) with 312 > (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 > minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. > This seems much to long for such a tiny dataset. Similar pipelines run > quickly on datasets that have fewer columns and more rows. It's something > about the number of columns that is causing the slow performance. > Here are a list of the stages in my pipeline: > {code} > 000_strIdx_5708525b2b6c > 001_strIdx_ec2296082913 > 002_bucketizer_3cbc8811877b > 003_bucketizer_5a01d5d78436 > 004_bucketizer_bf290d11364d > 005_bucketizer_c3296dfe94b2 > 006_bucketizer_7071ca50eb85 > 007_bucketizer_27738213c2a1 > 008_bucketizer_bd728fd89ba1 > 009_bucketizer_e1e716f51796 > 010_bucketizer_38be665993ba > 011_bucketizer_5a0e41e5e94f > 012_bucketizer_b5a3d5743aaa > 013_bucketizer_4420f98ff7ff > 014_bucketizer_777cc4fe6d12 > 015_bucketizer_f0f3a3e5530e > 016_bucketizer_218ecca3b5c1 > 017_bucketizer_0b083439a192 > 018_bucketizer_4520203aec27 > 019_bucketizer_462c2c346079 > 020_bucketizer_47435822e04c > 021_bucketizer_eb9dccb5e6e8 > 022_bucketizer_b5f63dd7451d > 023_bucketizer_e0fd5041c841 > 024_bucketizer_ffb3b9737100 > 025_bucketizer_e06c0d29273c > 026_bucketizer_36ee535a425f > 027_bucketizer_ee3a330269f1 > 028_bucketizer_094b58ea01c0 > 029_bucketizer_e93ea86c08e2 > 030_bucketizer_4728a718bc4b > 031_bucketizer_08f6189c7fcc > 032_bucketizer_11feb74901e6 > 033_bucketizer_ab4add4966c7 > 034_bucketizer_4474f7f1b8ce > 035_bucketizer_90cfa5918d71 > 036_bucketizer_1a9ff5e4eccb > 037_bucketizer_38085415a4f4 > 038_bucketizer_9b5e5a8d12eb > 039_bucketizer_082bb650ecc3 > 040_bucketizer_57e1e363c483 > 041_bucketizer_337583fbfd65 > 042_bucketizer_73e8f6673262 > 043_bucketizer_0f9394ed30b8 > 044_bucketizer_8530f3570019 > 045_bucketizer_c53614f1e507 > 046_bucketizer_8fd99e6ec27b > 047_bucketizer_6a8610496d8a > 048_bucketizer_888b0055c1ad > 049_bucketizer_974e0a1433a6 > 050_bucketizer_e848c0937cb9 > 051_bucketizer_95611095a4ac > 052_bucketizer_660a6031acd9 > 053_bucketizer_aaffe5a3140d > 054_bucketizer_8dc569be285f > 055_bucketizer_83d1bffa07bc > 056_bucketizer_0c6180ba75e6 > 057_bucketizer_452f265a000d > 058_bucketizer_38e02ddfb447 > 059_bucketizer_6fa4ad5d3ebd > 060_bucketizer_91044ee766ce > 061_bucketizer_9a9ef04a173d > 062_bucketizer_3d98eb15f206 > 063_bucketizer_c4915bb4d4ed > 064_bucketizer_8ca2b6550c38 > 065_bucketizer_417ee9b760bc > 066_bucketizer_67f3556bebe8 > 067_bucketizer_0556deb652c6 > 068_bucketizer_067b4b3d234c > 069_bucketizer_30ba55321538 > 070_bucketizer_ad826cc5d746 > 071_bucketizer_77676a898055 > 072_bucketizer_05c37a38ce30 > 
073_bucketizer_6d9ae54163ed > 074_bucketizer_8cd668b2855d > 075_bucketizer_d50ea1732021 > 076_bucketizer_c68f467c9559 > 077_bucketizer_ee1dfc840db1 > 078_bucketizer_83ec06a32519 > 079_bucketizer_741d08c1b69e > 080_bucketizer_b7402e4829c7 > 081_bucketizer_8adc590dc447 > 082_bucketizer_673be99bdace > 083_bucketizer_77693b45f94c > 084_bucketizer_53529c6b1ac4 > 085_bucketizer_6a3ca776a81e > 086_bucketizer_6679d9588ac1 > 087_bucketizer_6c73af456f65 > 088_bucketizer_2291b2c5ab51 > 089_bucketizer_cb3d0fe669d8 > 090_bucketizer_e71f913c1512 > 091_bucketizer_156528f65ce7 > 092_bucketizer_f3ec5dae079b > 093_bucketizer_809fab77eee1 > 094_bucketizer_6925831511e6 > 095_bucketizer_c5d853b95707 > 096_bucketizer_e677659ca253 > 097_bucketizer_396e35548c72 > 098_bucketizer_78a6410d7a84 > 099_bucketizer_e3ae6e54bca1 > 100_bucketizer_9fed5923fe8a > 101_bucketizer_8925ba4c3ee2 > 102_bucketizer_95750b6942b8 > 103_bucketizer_6e8b50a1918b > 104_bucketizer_36cfcc13d4ba > 105_bucketizer_2716d0455512 > 106_bucketizer_9bcf2891652f > 107_bucketizer_8c3d352915f7 > 108_bucketizer_0786c17d5ef9 > 109_bucketizer_f22df23ef56f > 110_bucketizer_bad04578bd20 > 111_bucketizer_35cfbde7e28f > 112_bucketizer_c
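The stage list above amounts to one Bucketizer per input column. A schematic reconstruction of such a pipeline (column names, split points, and the column count are made up for illustration):
{code}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.Bucketizer

val numCols = 300
val stages: Array[PipelineStage] = (0 until numCols).map { i =>
  new Bucketizer()
    .setInputCol(s"c$i")
    .setOutputCol(s"c${i}_bucketed")
    .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
}.toArray

val pipeline = new Pipeline().setStages(stages)
// pipeline.fit(df) on a ~300-column, few-hundred-row DataFrame reproduces the
// slow fit reported here.
{code}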
[jira] [Updated] (SPARK-20682) Add new ORCFileFormat based on Apache ORC
[ https://issues.apache.org/jira/browse/SPARK-20682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20682: Labels: releasenotes (was: ) > Add new ORCFileFormat based on Apache ORC > - > > Key: SPARK-20682 > URL: https://issues.apache.org/jira/browse/SPARK-20682 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.1.1, 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > Since SPARK-2883, Apache Spark supports Apache ORC inside `sql/hive` module > with Hive dependency. This issue aims to add a new and faster ORC data source > inside `sql/core` and to replace the old ORC data source eventually. In this > issue, the latest Apache ORC 1.4.0 (released yesterday) is used. > There are four key benefits. > - Speed: Use both Spark `ColumnarBatch` and ORC `RowBatch` together. This is > faster than the current implementation in Spark. > - Stability: Apache ORC 1.4.0 has many fixes and we can depend on ORC > community more. > - Usability: User can use `ORC` data sources without hive module, i.e, > `-Phive`. > - Maintainability: Reduce the Hive dependency and can remove old legacy code > later. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23219) Rename ReadTask to DataReaderFactory
[ https://issues.apache.org/jira/browse/SPARK-23219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23219: Parent Issue: SPARK-15689 (was: SPARK-22386) > Rename ReadTask to DataReaderFactory > > > Key: SPARK-23219 > URL: https://issues.apache.org/jira/browse/SPARK-23219 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23280) add map type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23280: Assignee: Wenchen Fan (was: Apache Spark) > add map type support to ColumnVector > > > Key: SPARK-23280 > URL: https://issues.apache.org/jira/browse/SPARK-23280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23280) add map type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346372#comment-16346372 ] Apache Spark commented on SPARK-23280: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20450 > add map type support to ColumnVector > > > Key: SPARK-23280 > URL: https://issues.apache.org/jira/browse/SPARK-23280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23280) add map type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23280: Assignee: Apache Spark (was: Wenchen Fan) > add map type support to ColumnVector > > > Key: SPARK-23280 > URL: https://issues.apache.org/jira/browse/SPARK-23280 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22400) rename some APIs and classes to make their meaning clearer
[ https://issues.apache.org/jira/browse/SPARK-22400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22400: Parent Issue: SPARK-15689 (was: SPARK-22386) > rename some APIs and classes to make their meaning clearer > -- > > Key: SPARK-22400 > URL: https://issues.apache.org/jira/browse/SPARK-22400 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang >Priority: Major > Fix For: 2.3.0 > > > Both `ReadSupport` and `ReadTask` have a method called `createReader`, but > they create different things. This could cause some confusion for data source > developers. The same issue exists between `WriteSupport` and > `DataWriterFactory`, both of which have a method called `createWriter`. > Besides, the name of `RowToInternalRowDataWriterFactory` is not correct, > because it actually converts `InternalRow`s to `Row`s. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22452) DataSourceV2Options should have getInt, getBoolean, etc.
[ https://issues.apache.org/jira/browse/SPARK-22452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22452: Parent Issue: SPARK-15689 (was: SPARK-22386) > DataSourceV2Options should have getInt, getBoolean, etc. > > > Key: SPARK-22452 > URL: https://issues.apache.org/jira/browse/SPARK-22452 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Sunitha Kambhampati >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
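Not the actual Spark class, just a minimal sketch of the typed getters this ticket asks for, written over a case-insensitive option map:
{code}
class SimpleOptions(options: Map[String, String]) {
  private val map = options.map { case (k, v) => k.toLowerCase -> v }

  def get(key: String): Option[String] = map.get(key.toLowerCase)
  def getInt(key: String, defaultValue: Int): Int =
    get(key).map(_.toInt).getOrElse(defaultValue)
  def getBoolean(key: String, defaultValue: Boolean): Boolean =
    get(key).map(_.toBoolean).getOrElse(defaultValue)
}

val opts = new SimpleOptions(Map("path" -> "/tmp/data", "batchSize" -> "4096"))
opts.getInt("batchSize", 1024)        // 4096
opts.getBoolean("mergeSchema", false) // false, key absent
{code}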
[jira] [Updated] (SPARK-22392) columnar reader interface
[ https://issues.apache.org/jira/browse/SPARK-22392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22392: Parent Issue: SPARK-15689 (was: SPARK-22386) > columnar reader interface > -- > > Key: SPARK-22392 > URL: https://issues.apache.org/jira/browse/SPARK-22392 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22389) partitioning reporting
[ https://issues.apache.org/jira/browse/SPARK-22389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22389: Fix Version/s: (was: 2.3.1) 2.3.0 > partitioning reporting > -- > > Key: SPARK-22389 > URL: https://issues.apache.org/jira/browse/SPARK-22389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > > We should allow data source to report partitioning and avoid shuffle at Spark > side -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22389) partitioning reporting
[ https://issues.apache.org/jira/browse/SPARK-22389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22389: Parent Issue: SPARK-15689 (was: SPARK-22386) > partitioning reporting > -- > > Key: SPARK-22389 > URL: https://issues.apache.org/jira/browse/SPARK-22389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > > We should allow data source to report partitioning and avoid shuffle at Spark > side -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22387) propagate session configs to data source read/write options
[ https://issues.apache.org/jira/browse/SPARK-22387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22387: Parent Issue: SPARK-15689 (was: SPARK-22386) > propagate session configs to data source read/write options > --- > > Key: SPARK-22387 > URL: https://issues.apache.org/jira/browse/SPARK-22387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Jiang Xingbo >Priority: Major > Fix For: 2.3.0 > > > This is an open discussion. The general idea is that we should allow users to set > some common configs in the session conf so that they don't need to type them > again and again for every data source operation. > Proposal 1: > propagate every session config which starts with {{spark.datasource.config.}} > to data source options. The downside is that users may only want to set some > common configs for a specific data source. > Proposal 2: > propagate session configs which start with > {{spark.datasource.config.myDataSource.}} only to {{myDataSource}} > operations. One downside is that some data sources may not have a short name, which > makes the config key pretty long, e.g. > {{spark.datasource.config.com.company.foo.bar.key1}}. > Proposal 3: > Introduce a trait `WithSessionConfig` which defines a session config key > prefix. Then we can pick the session configs with this key prefix and propagate > them to that particular data source. > Another thing worth thinking about: sometimes it's really annoying if > users have a typo in the config key and spend a lot of time figuring out why > things don't work as expected. We should allow data sources to validate the > given options and throw an exception if an option can't be recognized. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
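Purely illustrative: one way the per-source prefix in proposals 2 and 3 could be mapped onto data source options (the helper name and config keys are hypothetical):
{code}
def extractSourceOptions(sessionConf: Map[String, String], shortName: String): Map[String, String] = {
  val prefix = s"spark.datasource.config.$shortName."
  sessionConf.collect {
    case (key, value) if key.startsWith(prefix) => key.stripPrefix(prefix) -> value
  }
}

val sessionConf = Map(
  "spark.datasource.config.myDataSource.user" -> "alice",
  "spark.datasource.config.otherSource.user" -> "bob")

extractSourceOptions(sessionConf, "myDataSource") // Map(user -> alice)
{code}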
[jira] [Updated] (SPARK-22386) Data Source V2 improvements
[ https://issues.apache.org/jira/browse/SPARK-22386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22386: Labels: releasenotes (was: ) > Data Source V2 improvements > --- > > Key: SPARK-22386 > URL: https://issues.apache.org/jira/browse/SPARK-22386 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > Labels: releasenotes > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23280) add map type support to ColumnVector
Wenchen Fan created SPARK-23280: --- Summary: add map type support to ColumnVector Key: SPARK-23280 URL: https://issues.apache.org/jira/browse/SPARK-23280 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23260) remove V2 from the class name of data source reader/writer
[ https://issues.apache.org/jira/browse/SPARK-23260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23260: Issue Type: Sub-task (was: Improvement) Parent: SPARK-15689 > remove V2 from the class name of data source reader/writer > -- > > Key: SPARK-23260 > URL: https://issues.apache.org/jira/browse/SPARK-23260 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23262) mix-in interface should extend the interface it aimed to mix in
[ https://issues.apache.org/jira/browse/SPARK-23262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23262: Issue Type: Sub-task (was: Bug) Parent: SPARK-15689 > mix-in interface should extend the interface it aimed to mix in > --- > > Key: SPARK-23262 > URL: https://issues.apache.org/jira/browse/SPARK-23262 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20960) make ColumnVector public
[ https://issues.apache.org/jira/browse/SPARK-20960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20960: Labels: releasenotes (was: ) > make ColumnVector public > > > Key: SPARK-20960 > URL: https://issues.apache.org/jira/browse/SPARK-20960 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > ColumnVector is an internal interface in Spark SQL, which is only used by > the vectorized Parquet reader to represent the in-memory columnar format. > In Spark 2.3 we want to make ColumnVector public, so that we can provide a > more efficient way to exchange data between Spark and external systems. For > example, we can use ColumnVector to build the columnar read API in the data > source framework, and we can use ColumnVector to build a more efficient UDF API, > etc. > We also want to introduce a new ColumnVector implementation based on Apache > Arrow (basically just a wrapper over Arrow), so that external systems (like a > Python Pandas DataFrame) can build a ColumnVector very easily. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
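A minimal sketch of writing and reading a column vector; the class, package, and constructor used here are Spark 2.3 internals and should be treated as assumptions for other versions:
{code}
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector
import org.apache.spark.sql.types.IntegerType

val vec = new OnHeapColumnVector(1024, IntegerType)
(0 until 1024).foreach(i => vec.putInt(i, i * 2))
val value = vec.getInt(10) // 20
vec.close()
{code}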
[jira] [Resolved] (SPARK-22969) aggregateByKey with aggregator compression
[ https://issues.apache.org/jira/browse/SPARK-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-22969. -- Resolution: Not A Problem > aggregateByKey with aggregator compression > -- > > Key: SPARK-22969 > URL: https://issues.apache.org/jira/browse/SPARK-22969 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > I encountered a special case where the aggregator can be represented as two > types: > a) high memory footprint, but fast {{update}} > b) compact, but must be converted to type a before calling {{update}} and > {{merge}}. > I wonder whether it is possible to compress the fat aggregators in > {{aggregateByKey}} before the shuffle; how can I implement it? [~cloud_fan] > One similar case might be: > Using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (number of > non-zero values) for different keys on a large sparse dataset. > We can use {{DenseVector}} as the aggregator to count the nnz, and then > compress it by calling {{Vector#compressed}} before sending it over the network. > Another similar case might be calling {{QuantileSummaries#compress}} before > communication. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
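A sketch of the nnz-per-key pattern described above: aggregate with a mutable dense count vector, then compress each result before it is sent anywhere else. Function and variable names are made up for illustration.
{code}
import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.rdd.RDD

def nnzByKey(data: RDD[(String, Vector)], numFeatures: Int): RDD[(String, Vector)] = {
  val zero: DenseVector = Vectors.zeros(numFeatures).toDense
  data.aggregateByKey(zero)(
    // seqOp: bump the count for every non-zero entry of the incoming vector
    (counts, v) => {
      v.foreachActive { (i, value) => if (value != 0.0) counts.values(i) += 1.0 }
      counts
    },
    // combOp: element-wise sum of two partial count vectors
    (c1, c2) => {
      var i = 0
      while (i < numFeatures) { c1.values(i) += c2.values(i); i += 1 }
      c1
    }
  ).mapValues(_.compressed) // the Vector#compressed step mentioned above
}
{code}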
[jira] [Resolved] (SPARK-23272) add calendar interval type support to ColumnVector
[ https://issues.apache.org/jira/browse/SPARK-23272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23272. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20438 [https://github.com/apache/spark/pull/20438] > add calendar interval type support to ColumnVector > -- > > Key: SPARK-23272 > URL: https://issues.apache.org/jira/browse/SPARK-23272 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22969) aggregateByKey with aggregator compression
[ https://issues.apache.org/jira/browse/SPARK-22969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346359#comment-16346359 ] zhengruifeng commented on SPARK-22969: -- [~srowen] The mailing list is a better place to discuss this. Thanks. > aggregateByKey with aggregator compression > -- > > Key: SPARK-22969 > URL: https://issues.apache.org/jira/browse/SPARK-22969 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > I encountered a special case where the aggregator can be represented as two > types: > a) high memory footprint, but fast {{update}} > b) compact, but must be converted to type a before calling {{update}} and > {{merge}}. > I wonder whether it is possible to compress the fat aggregators in > {{aggregateByKey}} before the shuffle; how can I implement it? [~cloud_fan] > One similar case might be: > Using {{aggregateByKey}}/{{reduceByKey}} to compute the nnz vector (number of > non-zero values) for different keys on a large sparse dataset. > We can use {{DenseVector}} as the aggregator to count the nnz, and then > compress it by calling {{Vector#compressed}} before sending it over the network. > Another similar case might be calling {{QuantileSummaries#compress}} before > communication. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22971) OneVsRestModel should use temporary RawPredictionCol
[ https://issues.apache.org/jira/browse/SPARK-22971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-22971: - Affects Version/s: (was: 2.3.0) 2.4.0 > OneVsRestModel should use temporary RawPredictionCol > > > Key: SPARK-22971 > URL: https://issues.apache.org/jira/browse/SPARK-22971 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: zhengruifeng >Priority: Minor > > Issue occurs when I transform one dataframe with two different classification > models, first by a {{RandomForestClassificationModel}}, then a > {{OneVsRestModel}}. > The first transform generate a new colum "rawPrediction", which will be > internally used in {{OneVsRestModel#transform}} and cause failure. > {code} > scala> val df = > spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_multiclass_classification_data.txt") > 18/01/05 17:08:18 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > df: org.apache.spark.sql.DataFrame = [label: double, features: vector] > scala> val rf = new RandomForestClassifier() > rf: org.apache.spark.ml.classification.RandomForestClassifier = > rfc_c11b1e1e1f7f > scala> val rfm = rf.fit(df) > rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_c11b1e1e1f7f) with 20 trees > scala> val lr = new LogisticRegression().setMaxIter(1) > lr: org.apache.spark.ml.classification.LogisticRegression = > logreg_f5a5285eba06 > scala> val ovr = new OneVsRest().setClassifier(lr) > ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_8f5584190634 > scala> val ovrModel = ovr.fit(df) > ovrModel: org.apache.spark.ml.classification.OneVsRestModel = > oneVsRest_8f5584190634 > scala> val df2 = rfm.setPredictionCol("rfPred").transform(df) > df2: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 > more fields] > scala> val df3 = ovrModel.setPredictionCol("ovrPred").transform(df2) > java.lang.IllegalArgumentException: requirement failed: Column rawPrediction > already exists. 
> at scala.Predef$.require(Predef.scala:224) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:101) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:91) > at > org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:43) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:77) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37) > at > org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:904) > at > org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265) > at > org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:904) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:104) > at > org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:184) > at > org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:173) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at > org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:173) > ... 50 elided > {code} > {{OneVsRestModel#transform}} only generates a new prediction column, and > should not fail by other columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
[ https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23040: Assignee: (was: Apache Spark) > BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator > or ordering is specified > > > Key: SPARK-23040 > URL: https://issues.apache.org/jira/browse/SPARK-23040 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 >Reporter: Xianjin YE >Priority: Minor > > For example, if ordering is specified, the returned iterator is an > CompletionIterator > {code:java} > dep.keyOrdering match { > case Some(keyOrd: Ordering[K]) => > // Create an ExternalSorter to sort the data. > val sorter = > new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), > serializer = dep.serializer) > sorter.insertAll(aggregatedIter) > context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) > context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) > > context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) > CompletionIterator[Product2[K, C], Iterator[Product2[K, > C]]](sorter.iterator, sorter.stop()) > case None => > aggregatedIter > } > {code} > However the sorter would consume(in sorter.insertAll) the > aggregatedIter(which may be interruptible), then creates an iterator which > isn't interruptible. > The problem with this is that Spark task cannot be cancelled due to stage > fail(without interruptThread enabled, which is disabled by default), which > wasting executor resource. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
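For context, Spark already ships an InterruptibleIterator wrapper that checks the task's kill flag while an iterator is consumed, and the fix direction is to wrap the sorter's output in it. Below is a self-contained sketch of the same mechanism (not the actual shuffle-reader code):
{code}
import java.util.concurrent.atomic.AtomicBoolean

// Wraps a delegate iterator and checks a cancellation flag on every hasNext,
// so a long-running consumer stops soon after the task is marked killed.
class CancellableIterator[T](killed: AtomicBoolean, delegate: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = {
    if (killed.get()) throw new InterruptedException("task killed")
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}

val killed = new AtomicBoolean(false)
val it = new CancellableIterator(killed, Iterator.range(0, 1000000))
// Setting `killed` from another thread makes the consumer stop at the next hasNext.
{code}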
[jira] [Assigned] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
[ https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23040: Assignee: Apache Spark > BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator > or ordering is specified > > > Key: SPARK-23040 > URL: https://issues.apache.org/jira/browse/SPARK-23040 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 >Reporter: Xianjin YE >Assignee: Apache Spark >Priority: Minor > > For example, if ordering is specified, the returned iterator is an > CompletionIterator > {code:java} > dep.keyOrdering match { > case Some(keyOrd: Ordering[K]) => > // Create an ExternalSorter to sort the data. > val sorter = > new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), > serializer = dep.serializer) > sorter.insertAll(aggregatedIter) > context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) > context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) > > context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) > CompletionIterator[Product2[K, C], Iterator[Product2[K, > C]]](sorter.iterator, sorter.stop()) > case None => > aggregatedIter > } > {code} > However the sorter would consume(in sorter.insertAll) the > aggregatedIter(which may be interruptible), then creates an iterator which > isn't interruptible. > The problem with this is that Spark task cannot be cancelled due to stage > fail(without interruptThread enabled, which is disabled by default), which > wasting executor resource. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23040) BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator or ordering is specified
[ https://issues.apache.org/jira/browse/SPARK-23040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346325#comment-16346325 ] Apache Spark commented on SPARK-23040: -- User 'advancedxy' has created a pull request for this issue: https://github.com/apache/spark/pull/20449 > BlockStoreShuffleReader's return Iterator isn't interruptible if aggregator > or ordering is specified > > > Key: SPARK-23040 > URL: https://issues.apache.org/jira/browse/SPARK-23040 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1 >Reporter: Xianjin YE >Priority: Minor > > For example, if ordering is specified, the returned iterator is an > CompletionIterator > {code:java} > dep.keyOrdering match { > case Some(keyOrd: Ordering[K]) => > // Create an ExternalSorter to sort the data. > val sorter = > new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), > serializer = dep.serializer) > sorter.insertAll(aggregatedIter) > context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled) > context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled) > > context.taskMetrics().incPeakExecutionMemory(sorter.peakMemoryUsedBytes) > CompletionIterator[Product2[K, C], Iterator[Product2[K, > C]]](sorter.iterator, sorter.stop()) > case None => > aggregatedIter > } > {code} > However the sorter would consume(in sorter.insertAll) the > aggregatedIter(which may be interruptible), then creates an iterator which > isn't interruptible. > The problem with this is that Spark task cannot be cancelled due to stage > fail(without interruptThread enabled, which is disabled by default), which > wasting executor resource. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23279) Avoid triggering distributed job for Console sink
[ https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao resolved SPARK-23279. - Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 2.3.0 > Avoid triggering distributed job for Console sink > - > > Key: SPARK-23279 > URL: https://issues.apache.org/jira/browse/SPARK-23279 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Minor > Fix For: 2.3.0 > > > The Console sink will redistribute collected local data and trigger a distributed > job in each batch; this is not necessary, so this change makes it a local job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23279) Avoid triggering distributed job for Console sink
[ https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346301#comment-16346301 ] Saisai Shao commented on SPARK-23279: - Issue resolved by pull request 20447 https://github.com/apache/spark/pull/20447 > Avoid triggering distributed job for Console sink > - > > Key: SPARK-23279 > URL: https://issues.apache.org/jira/browse/SPARK-23279 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > > The Console sink will redistribute collected local data and trigger a distributed > job in each batch; this is not necessary, so this change makes it a local job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sameer Agarwal updated SPARK-23202: --- Target Version/s: 2.3.0 > Break down DataSourceV2Writer.commit into two phase > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Blocker > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break DataSourceV2Writer.commit down into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > \{@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
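Purely illustrative: a Scala rendering of the two-phase API proposed above. These traits are a sketch of the proposal, not the actual Spark interfaces.
{code}
trait WriterCommitMessage extends Serializable

trait TwoPhaseDataSourceWriter {
  /** Phase 1: handle a single commit message as soon as a task reports it. */
  def add(message: WriterCommitMessage): Unit

  /** Phase 2: commit the whole writing job once every task's message has been added. */
  def commit(): Unit
}
{code}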
[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-23202: --- Affects Version/s: (was: 2.2.1) 2.3.0 > Break down DataSourceV2Writer.commit into two phase > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Gengliang Wang >Priority: Blocker > > Currently, the API DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving a commit message, the driver can start processing > messages (e.g. persisting messages into files) before all the messages are > collected. > The proposal is to break DataSourceV2Writer.commit down into two phases: > # add(WriterCommitMessage message): Handles a commit message produced by > \{@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some data sources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23203: Priority: Blocker (was: Major) > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Blocker > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23203) DataSourceV2 should use immutable trees.
[ https://issues.apache.org/jira/browse/SPARK-23203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346253#comment-16346253 ] Apache Spark commented on SPARK-23203: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20448 > DataSourceV2 should use immutable trees. > > > Key: SPARK-23203 > URL: https://issues.apache.org/jira/browse/SPARK-23203 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > The DataSourceV2 integration doesn't use [immutable > trees|https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html], > which is a basic requirement of Catalyst. The v2 relation should not wrap a > mutable reader and change the logical plan by pushing projections and filters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
[ https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346246#comment-16346246 ] Bruce Robbins edited comment on SPARK-23251 at 1/31/18 4:35 AM: [~srowen] This also occurs with compiled apps submitted via spark-submit. For example, this app: {code:java} object Implicit1 { def main(args: Array[String]) { if (args.length < 1) { Console.err.println("No input file specified") System.exit(1) } val inputFilename = args(0) val spark = SparkSession.builder().appName("Implicit1").getOrCreate() import spark.implicits._ val df = spark.read.json(inputFilename) //implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]] val results = df.map(row => row.getValuesMap[Any](List("stationName", "year"))).take(15) results.foreach(println) } }{code} When run on Spark 2.3 (via spark-submit), I get the same exception as I see with spark-shell. With the implicit mapEncoder line uncommented, this compiles and runs fine on both 2.2 and 2.3. Here's the exception from spark-submit on spark 2.3: {noformat} bash-3.2$ ./bin/spark-submit --version Welcome to __ / _/_ ___ / /_ \ \/ _ \/ _ `/ __/ '/ /__/ ./_,// //_\ version 2.3.1-SNAPSHOT /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_161 Branch branch-2.3 Compiled by user brobbins on 2018-01-28T01:25:18Z Revision 3b6fc286d105ae7de737c46e50cf941e6831ab98 Url https://github.com/apache/spark.git Type --help for more information. bash-3.2$ ./bin/spark-submit --class "Implicit1" ~/github/sparkAppPlay/target/scala-2.11/temps_2.11-1.0.jar ~/ncdc_gsod_short.jsonl .. Exception in thread "main" java.lang.ClassNotFoundException: scala.Any at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555) at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211) at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203) at scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49) at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19) at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16) at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44) at scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203) at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194) at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54) at org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71) at org.apache.spark.sql.SQLImplicits.newMapEncoder(SQLImplicits.scala:172) at Implicit1$.main(Implicit1.scala:17) at Implicit1.main(Implicit1.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAcce
[jira] [Commented] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
[ https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346246#comment-16346246 ] Bruce Robbins commented on SPARK-23251: --- [~srowen] This also occurs with compiled apps submitted via spark-submit. For example, this app: {code:java} object Implicit1 { def main(args: Array[String]) { if (args.length < 1) { Console.err.println("No input file specified") System.exit(1) } val inputFilename = args(0) val spark = SparkSession.builder().appName("Implicit1").getOrCreate() import spark.implicits._ val df = spark.read.json(inputFilename) //implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]] val results = df.map(row => row.getValuesMap[Any](List("stationName", "year"))).take(15) results.foreach(println) } }{code} When run on Spark 2.3 (via spark-submit), I get the same exception as I see with spark-shell. With the implicit mapEncoder line uncommented, this compiles and runs fine on both 2.2 and 2.3. Here's the exception from spark-submit on spark 2.3: {noformat} bash-3.2$ ./bin/spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.3.1-SNAPSHOT /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_161 Branch branch-2.3 Compiled by user brobbins on 2018-01-28T01:25:18Z Revision 3b6fc286d105ae7de737c46e50cf941e6831ab98 Url https://github.com/apache/spark.git Type --help for more information. bash-3.2$ ./bin/spark-submit --class "Implicit1" ~/github/sparkAppPlay/target/scala-2.11/temps_2.11-1.0.jar ~/ncdc_gsod_short.jsonl .. Exception in thread "main" java.lang.ClassNotFoundException: scala.Any at java.net.URLClassLoader.findClass(URLClassLoader.java:381) at java.lang.ClassLoader.loadClass(ClassLoader.java:424) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555) at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211) at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203) at scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49) at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19) at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16) at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44) at scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203) at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194) at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54) at org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at 
org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512) at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445) at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71) at org.apache.spark.sql.SQLImplicits.newMapEncoder(SQLImplicits.scala:172) at Implicit1$.main(Implicit1.scala:17) at Implicit1.main(Implicit1.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.ja
[jira] [Resolved] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23274. - Resolution: Fixed Fix Version/s: 2.3.0 > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.3.0 > > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346200#comment-16346200 ] Gaurav Garg commented on SPARK-18016: - Thanks [~kiszk] for helping me out. I have attached the logs of my test code which throws constant pool exception. Please find attached log file. [^910825_9.zip] > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > Attachments: 910825_9.zip > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.Un
[jira] [Updated] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gaurav Garg updated SPARK-18016: Attachment: 910825_9.zip > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > Attachments: 910825_9.zip > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > 
org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(Si
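For readers trying to reproduce the constant-pool failure quoted above, it is typically triggered by a Dataset with a very wide (or deeply nested) schema. The sketch below is only illustrative: it assumes a running SparkSession named {{spark}}, and the column count is a hypothetical value, chosen wide enough that whole-stage codegen may hit the limit on affected versions.
{code:scala}
// Illustrative reproduction sketch, not taken from the ticket: build a very
// wide DataFrame so the generated projection class grows large.
import org.apache.spark.sql.functions.lit

val numCols = 5000  // hypothetical width; adjust as needed
val wide = spark.range(10).select((1 to numCols).map(i => lit(i).as(s"c$i")): _*)

// Projecting over all columns forces code generation for the wide schema.
wide.selectExpr((1 to numCols).map(i => s"c$i + 1"): _*).collect()
{code}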
[jira] [Commented] (SPARK-23279) Avoid triggering distributed job for Console sink
[ https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346158#comment-16346158 ] Apache Spark commented on SPARK-23279: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/20447 > Avoid triggering distributed job for Console sink > - > > Key: SPARK-23279 > URL: https://issues.apache.org/jira/browse/SPARK-23279 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > > Console sink will redistribute collected local data and trigger a distributed > job in each batch, this is not necessary, so here change to local job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
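For context on what the console sink does today, a minimal Structured Streaming sketch follows; it assumes a SparkSession named {{spark}} and a socket source on an illustrative host/port. Each micro-batch is collected and printed to the driver's stdout, and that collection step is the distributed job this ticket proposes to turn into a local one.
{code:scala}
// Minimal console-sink usage sketch (host and port are illustrative).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Every micro-batch is printed to the driver's stdout.
val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
{code}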
[jira] [Assigned] (SPARK-23279) Avoid triggering distributed job for Console sink
[ https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23279: Assignee: (was: Apache Spark) > Avoid triggering distributed job for Console sink > - > > Key: SPARK-23279 > URL: https://issues.apache.org/jira/browse/SPARK-23279 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Minor > > Console sink will redistribute collected local data and trigger a distributed > job in each batch, this is not necessary, so here change to local job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23279) Avoid triggering distributed job for Console sink
[ https://issues.apache.org/jira/browse/SPARK-23279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23279: Assignee: Apache Spark > Avoid triggering distributed job for Console sink > - > > Key: SPARK-23279 > URL: https://issues.apache.org/jira/browse/SPARK-23279 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Minor > > Console sink will redistribute collected local data and trigger a distributed > job in each batch, this is not necessary, so here change to local job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23277) Spark ALS : param coldStartStrategy does not exist.
[ https://issues.apache.org/jira/browse/SPARK-23277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23277. --- Resolution: Invalid > Spark ALS : param coldStartStrategy does not exist. > --- > > Key: SPARK-23277 > URL: https://issues.apache.org/jira/browse/SPARK-23277 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Surya Prakash Reddy >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23279) Avoid triggering distributed job for Console sink
Saisai Shao created SPARK-23279: --- Summary: Avoid triggering distributed job for Console sink Key: SPARK-23279 URL: https://issues.apache.org/jira/browse/SPARK-23279 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Saisai Shao Console sink will redistribute collected local data and trigger a distributed job in each batch, this is not necessary, so here change to local job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class paramether order
[ https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346148#comment-16346148 ] Liang-Chi Hsieh commented on SPARK-23273: - The {{name}} column will be added after {{age}} in {{ds2}}. So the schema of {{ds2}} doesn't match {{ds1}} in the order of columns. You can change column order with a projection before union: {code:java} scala> ds1.union(ds2.select("name", "age").as[NameAge]).show +-+---+ | name|age| +-+---+ |henriquedsg89| 1| +-+---+ {code} Since 2.3.0, there is an API {{unionByName}} can be used for this kind of cases: {code:java} scala> ds1.unionByName(ds2).show +-+---+ | name|age| +-+---+ |henriquedsg89| 1| +-+---+ {code} > Spark Dataset withColumn - schema column order isn't the same as case class > paramether order > > > Key: SPARK-23273 > URL: https://issues.apache.org/jira/browse/SPARK-23273 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Henrique dos Santos Goulart >Priority: Major > > {code:java} > case class OnlyAge(age: Int) > case class NameAge(name: String, age: Int) > val ds1 = spark.emptyDataset[NameAge] > val ds2 = spark > .createDataset(Seq(OnlyAge(1))) > .withColumn("name", lit("henriquedsg89")) > .as[NameAge] > ds1.show() > ds2.show() > ds1.union(ds2) > {code} > > It's going to raise this error: > {noformat} > Cannot up cast `age` from string to int as it may truncate > The type path of the target object is: > - field (class: "scala.Int", name: "age") > - root class: "dw.NameAge"{noformat} > It seems that .as[CaseClass] doesn't keep the order of paramethers that is > typed on case class. > If I change the case class paramether order, it's going to work... like: > {code:java} > case class NameAge(age: Int, name: String){code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23254) Add user guide entry for DataFrame multivariate summary
[ https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346118#comment-16346118 ] Apache Spark commented on SPARK-23254: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/20446 > Add user guide entry for DataFrame multivariate summary > --- > > Key: SPARK-23254 > URL: https://issues.apache.org/jira/browse/SPARK-23254 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Minor > > SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user > guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be > updated, with the relevant example (to be in parity with the [MLlib user > guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
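For reference, the DataFrame-based API added by SPARK-19634 lives in {{org.apache.spark.ml.stat.Summarizer}}. The sketch below shows the kind of example the new user guide entry would presumably carry; the data and column names are made up, and it assumes a SparkSession named {{spark}} is in scope.
{code:scala}
// Sketch of the Summarizer-based vector summary statistics (illustrative data).
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import spark.implicits._

val df = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
).toDF("features", "weight")

// Weighted mean and variance of the vector column, computed in one pass.
val (meanVec, varVec) = df
  .select(Summarizer.metrics("mean", "variance")
    .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)]
  .first()

println(s"mean = $meanVec, variance = $varVec")
{code}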
[jira] [Assigned] (SPARK-23254) Add user guide entry for DataFrame multivariate summary
[ https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23254: Assignee: Apache Spark > Add user guide entry for DataFrame multivariate summary > --- > > Key: SPARK-23254 > URL: https://issues.apache.org/jira/browse/SPARK-23254 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Apache Spark >Priority: Minor > > SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user > guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be > updated, with the relevant example (to be in parity with the [MLlib user > guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23254) Add user guide entry for DataFrame multivariate summary
[ https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23254: Assignee: (was: Apache Spark) > Add user guide entry for DataFrame multivariate summary > --- > > Key: SPARK-23254 > URL: https://issues.apache.org/jira/browse/SPARK-23254 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Minor > > SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user > guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be > updated, with the relevant example (to be in parity with the [MLlib user > guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23092) Migrate MemoryStream to DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-23092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346110#comment-16346110 ] Apache Spark commented on SPARK-23092: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/20445 > Migrate MemoryStream to DataSource V2 > - > > Key: SPARK-23092 > URL: https://issues.apache.org/jira/browse/SPARK-23092 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Priority: Major > > We should migrate the MemoryStream for Structured Streaming to DataSourceV2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
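For context, {{MemoryStream}} is the in-memory source that the Structured Streaming test suites drive directly; a rough usage sketch is below. It is internal API, so details may differ between versions, and it assumes a SparkSession named {{spark}}.
{code:scala}
// Sketch of typical MemoryStream usage in tests (the source this ticket
// migrates to DataSourceV2). API details are as of the 2.x test suites.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.execution.streaming.MemoryStream
import spark.implicits._

implicit val sqlCtx: SQLContext = spark.sqlContext

val input = MemoryStream[Int]
val query = input.toDF()
  .writeStream
  .format("memory")     // collect results into an in-memory table
  .queryName("out")
  .outputMode("append")
  .start()

input.addData(1, 2, 3)        // push data into the source
query.processAllAvailable()   // run the resulting micro-batch
spark.table("out").show()
{code}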
[jira] [Resolved] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23276. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.3.0 > Enable UDT tests in (Hive)OrcHadoopFsRelationSuite > -- > > Key: SPARK-23276 > URL: https://issues.apache.org/jira/browse/SPARK-23276 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.3.0 > > > Like Parquet, ORC test suite should enable UDT tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23261: Description: Rename the public APIs of pandas udfs from - PANDAS SCALAR UDF -> SCALAR PANDAS UDF - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF - PANDAS GROUP AGG UDF -> PANDAS UDAF [Only 2.4] was: Rename the public APIs of pandas udfs from - PANDAS SCALAR UDF -> SCALAR PANDAS UDF - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF - PANDAS GROUP AGG UDF -> PANDAS UDAF > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.0 > > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF [Only 2.4] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23261: Fix Version/s: (was: 2.4.0) 2.3.0 > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.3.0 > > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23202) Break down DataSourceV2Writer.commit into two phase
[ https://issues.apache.org/jira/browse/SPARK-23202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23202: Priority: Blocker (was: Major) > Break down DataSourceV2Writer.commit into two phase > --- > > Key: SPARK-23202 > URL: https://issues.apache.org/jira/browse/SPARK-23202 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Gengliang Wang >Priority: Blocker > > Currently, the api DataSourceV2Writer#commit(WriterCommitMessage[]) commits a > writing job with a list of commit messages. > It makes sense in some scenarios, e.g. MicroBatchExecution. > However, on receiving commit message, driver can start processing > messages(e.g. persist messages into files) before all the messages are > collected. > The proposal is to Break down DataSourceV2Writer.commit into two phase: > # add(WriterCommitMessage message): Handles a commit message produced by > \{@link DataWriter#commit()}. > # commit(): Commits the writing job. > This should make the API more flexible, and more reasonable for implementing > some datasources. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
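To make the proposal concrete, a hypothetical sketch of the split is below. The trait name and signatures merely paraphrase the description in this ticket; they are not the actual DataSourceV2Writer interface.
{code:scala}
// Hypothetical two-phase commit API as described in the ticket (not real code).
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage

trait TwoPhaseDataSourceWriter {
  // Phase 1: invoked as each commit message reaches the driver, so the driver
  // can start processing messages (e.g. persisting them) before all arrive.
  def add(message: WriterCommitMessage): Unit

  // Phase 2: invoked once all tasks have committed; finalizes the whole job.
  def commit(): Unit
}
{code}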
[jira] [Commented] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
[ https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346075#comment-16346075 ] Sean Owen commented on SPARK-23251: --- Hm. I don't know is this is related to Encoders and the mechanism you cite, not directly. The error is that {{scala.Any}} can't be found, which of course must certainly be available. This is typically a classloader issue, and in {{spark-shell}} the classloader situation is complicated. It may be a real problem still, or at least a symptom of a known class of problems. But can you confirm that this doesn't happen without the shell? > ClassNotFoundException: scala.Any when there's a missing implicit Map encoder > - > > Key: SPARK-23251 > URL: https://issues.apache.org/jira/browse/SPARK-23251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: mac os high sierra, centos 7 >Reporter: Bruce Robbins >Priority: Minor > > In branch-2.2, when you attempt to use row.getValuesMap[Any] without an > implicit Map encoder, you get a nice descriptive compile-time error: > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > :26: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. > df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > ^ > scala> implicit val mapEncoder = > org.apache.spark.sql.Encoders.kryo[Map[String, Any]] > mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: > binary] > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > res1: Array[Map[String,Any]] = Array(Map(stationName -> 007026 9, year -> > 2014), Map(stationName -> 007026 9, year -> 2014), Map(stationName -> > 007026 9, year -> 2014), > etc... 
> {noformat} > > On the latest master and also on branch-2.3, the transformation compiles (at > least on spark-shell), but throws a ClassNotFoundException: > > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > java.lang.ClassNotFoundException: scala.Any > at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49) > at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19) > at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445) > at > sc
[jira] [Assigned] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23274: Assignee: Xiao Li (was: Apache Spark) > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apa
[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346071#comment-16346071 ] Apache Spark commented on SPARK-23274: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20444 > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > o
[jira] [Assigned] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23274: Assignee: Apache Spark (was: Xiao Li) > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Apache Spark >Priority: Blocker > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > or
[jira] [Commented] (SPARK-23251) ClassNotFoundException: scala.Any when there's a missing implicit Map encoder
[ https://issues.apache.org/jira/browse/SPARK-23251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346050#comment-16346050 ] Bruce Robbins commented on SPARK-23251: --- I commented out the following line in sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala and the problem went away: {code:java} implicit def newMapEncoder[T <: Map[_, _] : TypeTag]: Encoder[T] = ExpressionEncoder() {code} By "went away", I mean I now had to specify a Map encoder for my map function to compile (rather than have it compile and then throw an exception). Checking with [~michalsenkyr], who will know more than I do. > ClassNotFoundException: scala.Any when there's a missing implicit Map encoder > - > > Key: SPARK-23251 > URL: https://issues.apache.org/jira/browse/SPARK-23251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 > Environment: mac os high sierra, centos 7 >Reporter: Bruce Robbins >Priority: Minor > > In branch-2.2, when you attempt to use row.getValuesMap[Any] without an > implicit Map encoder, you get a nice descriptive compile-time error: > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > :26: error: Unable to find encoder for type stored in a Dataset. > Primitive types (Int, String, etc) and Product types (case classes) are > supported by importing spark.implicits._ Support for serializing other types > will be added in future releases. > df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > ^ > scala> implicit val mapEncoder = > org.apache.spark.sql.Encoders.kryo[Map[String, Any]] > mapEncoder: org.apache.spark.sql.Encoder[Map[String,Any]] = class[value[0]: > binary] > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > res1: Array[Map[String,Any]] = Array(Map(stationName -> 007026 9, year -> > 2014), Map(stationName -> 007026 9, year -> 2014), Map(stationName -> > 007026 9, year -> 2014), > etc... 
> {noformat} > > On the latest master and also on branch-2.3, the transformation compiles (at > least on spark-shell), but throws a ClassNotFoundException: > > {noformat} > scala> df.map(row => row.getValuesMap[Any](List("stationName", > "year"))).collect > java.lang.ClassNotFoundException: scala.Any > at > scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1211) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1203) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:49) > at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19) > at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16) > at > scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:44) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1203) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194) > at > scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:700) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:84) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor$1.apply(ScalaReflection.scala:65) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:64) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:512) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflec
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345963#comment-16345963 ] John Cheng commented on SPARK-18057: Apache Kafka is now at version 1.0. For people who want to use Spark streaming against Kafka brokers on 1.0.0, it is preferable to use the `org.apache.kafka:kafka-clients:jar:1.0.0` client. "Most of the discussion on the performance impact of [upgrading to the 0.10.0 message format|https://kafka.apache.org/0110/documentation.html#upgrade_10_performance_impact] remains pertinent to the 0.11.0 upgrade. This mainly affects clusters that are not secured with TLS since "zero-copy" transfer is already not possible in that case. In order to avoid the cost of down-conversion, you should ensure that consumer applications are upgraded to the latest 0.11.0 client." > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger >Priority: Major > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
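For anyone who wants to run against a 1.0.0 client before the built-in module moves, one hedged option is to pin {{kafka-clients}} in the application build. The coordinates below are the usual ones, but the Spark and client versions are illustrative and compatibility has to be verified against the connector release actually in use.
{code:scala}
// build.sbt sketch: Structured Streaming Kafka source plus an explicit
// kafka-clients pin, as suggested in the comment above. Versions illustrative.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0",
  "org.apache.kafka"  % "kafka-clients"         % "1.0.0"
)
{code}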
[jira] [Commented] (SPARK-23157) withColumn fails for a column that is a result of mapped DataSet
[ https://issues.apache.org/jira/browse/SPARK-23157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345949#comment-16345949 ] Apache Spark commented on SPARK-23157: -- User 'henryr' has created a pull request for this issue: https://github.com/apache/spark/pull/20443 > withColumn fails for a column that is a result of mapped DataSet > > > Key: SPARK-23157 > URL: https://issues.apache.org/jira/browse/SPARK-23157 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Tomasz Bartczak >Priority: Minor > > Having > {code:java} > case class R(id: String) > val ds = spark.createDataset(Seq(R("1"))) > {code} > This works: > {code} > scala> ds.withColumn("n", ds.col("id")) > res16: org.apache.spark.sql.DataFrame = [id: string, n: string] > {code} > but when we map over ds it fails: > {code} > scala> ds.withColumn("n", ds.map(a => a).col("id")) > org.apache.spark.sql.AnalysisException: resolved attribute(s) id#55 missing > from id#4 in operator !Project [id#4, id#55 AS n#57];; > !Project [id#4, id#55 AS n#57] > +- LocalRelation [id#4] > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:67) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2884) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1150) > at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1905) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM
[ https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23275. - Resolution: Fixed Assignee: Dilip Biswal Fix Version/s: 2.3.0 > hive/tests have been failing when run locally on the laptop (Mac) with OOM > --- > > Key: SPARK-23275 > URL: https://issues.apache.org/jira/browse/SPARK-23275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Dilip Biswal >Priority: Major > Fix For: 2.3.0 > > > hive tests have been failing when they are run locally (Mac Os) after a > recent change in the trunk. After running the tests for some time, the test > fails with OOM with Error: unable to create new native thread. > I noticed the thread count goes all the way up to 2000+ after which we start > getting these OOM errors. Most of the threads seem to be related to the > connection pool in hive metastore (BoneCP-x- ). This behaviour change > is happening after we made the following change to HiveClientImpl.reset() > {code} > def reset(): Unit = withHiveState { > try { > // code > } finally { > runSqlHive("USE default") ===> this is causing the issue > } > {code} > I am proposing to temporarily back-out part of a fix made to address > SPARK-23000 to resolve this issue while we work-out the exact reason for this > sudden increase in thread counts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23236) Make it easier to find the rest API, especially in local mode
[ https://issues.apache.org/jira/browse/SPARK-23236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345827#comment-16345827 ] Alex Bozarth commented on SPARK-23236: -- For #1, a REST API endpoint shouldn't return html, but we could return a custom html response (such as 405 or another response code) that includes a short "maybe you meant..." description. For #2 I would be ok with a "maybe you meant..." response but with a link to the Spark REST API Doc to aid the user. > Make it easier to find the rest API, especially in local mode > - > > Key: SPARK-23236 > URL: https://issues.apache.org/jira/browse/SPARK-23236 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Trivial > Labels: newbie > > This is really minor, but it always takes me a little bit to figure out how > to get from the UI to the rest api. Its especially a pain in local-mode, > where you need the app-id, though in general I don't know the app-id, so have > to either look in logs or go to another endpoint first in the ui just to find > the app-id. While it wouldn't really help anybody accessing the endpoints > programmatically, we could make it easier for someone doing exploration via > their browser. > Some things which could be improved: > * /api/v1 just provides a link to "/api/v1/applications" > * /api provides a link to "/api/v1/applications" > * /api/v1/applications/[app-id] gives a list of links for the other endpoints > * on the UI, there is a link to at least /api/v1/applications/[app-id] -- > better still if each UI page links to the corresponding endpoint, eg. the all > jobs page would link to /api/v1/applications/[app-id]/jobs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
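For illustration only, the kind of exploration the existing endpoint already supports; the URL assumes the default local-mode UI port 4040:
{code}
// Illustrative only: listing applications from the existing REST API.
import scala.io.Source

val json = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(json) // the returned JSON includes the app-id needed for the other endpoints
{code}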
[jira] [Commented] (SPARK-23237) Add UI / endpoint for threaddumps for executors with active tasks
[ https://issues.apache.org/jira/browse/SPARK-23237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345818#comment-16345818 ] Alex Bozarth commented on SPARK-23237: -- I would rather keep it to an api endpoint, but what I'm worried about is having an end point that returns a specific threadDump decided by some unknown algorithm. Again, I'm willing to look at a PR to see if the exact impl will change my mind. > Add UI / endpoint for threaddumps for executors with active tasks > - > > Key: SPARK-23237 > URL: https://issues.apache.org/jira/browse/SPARK-23237 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Imran Rashid >Priority: Major > > Frequently, when there are a handful of straggler tasks, users want to know > what is going on in those executors running the stragglers. Currently, that > is a bit of a pain to do: you have to go to the page for your active stage, > find the task, figure out which executor its on, then go to the executors > page, and get the thread dump. Or maybe you just go to the executors page, > find the executor with an active task, and then click on that, but that > doesn't work if you've got multiple stages running. > Users could figure this by extracting the info from the stage rest endpoint, > but it's such a common thing to do that we should make it easy. > I realize that figuring out a good way to do this is a little tricky. We > don't want to make it easy to end up pulling thread dumps from 1000 executors > back to the driver. So we've got to come up with a reasonable heuristic for > choosing which executors to poll. And we've also got to find a suitable > place to put this. > My suggestion is that the stage page always has a link to the thread dumps > for the *one* executor with the longest running task. And there would be a > corresponding endpoint in the rest api with the same info, maybe at > {{/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/slowestTaskThreadDump}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-23274: --- Labels: (was: correctness) > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalys
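A hedged workaround sketch while the rule is broken, not the actual fix: the same result can be expressed as a left-anti join plus {{distinct}}, which does not go through {{ReplaceExceptWithFilter}}. It assumes the same {{test.csv}} as in the repro and a shell session with {{spark}} in scope:
{code}
// Workaround sketch, not the fix.
import spark.implicits._

val df = spark.read.format("csv").option("header", "true").load("test.csv")
val df1 = df.filter($"a" === 1)
val df2 = df.filter($"a" === 2)

// Distinct values of b present in df1 but not in df2, i.e. the except() result.
df1.select("b")
  .join(df2.select("b"), Seq("b"), "left_anti")
  .distinct()
  .show()
{code}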
[jira] [Resolved] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.
[ https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23278. --- Resolution: Duplicate You opened this twice, so I closed it. Please don't reopen JIRAs. Your description here is just a stack trace. Without anything more I'd close the other one too. > Spark ALS : param coldStartStrategy does not exist. > --- > > Key: SPARK-23278 > URL: https://issues.apache.org/jira/browse/SPARK-23278 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Surya Prakash Reddy >Priority: Major > > An error occurred while calling o105.getParam. : > java.util.NoSuchElementException: Param coldStartStrategy does not exist. at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at > org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:280) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:214) at > java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
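For context rather than as part of the ticket: {{coldStartStrategy}} was added to ALS in a release later than the 2.1.0 reported here (2.2.0, to the best of my knowledge), which is why {{getParam}} cannot find it. A minimal sketch of how it is set on versions that have it, with made-up column names:
{code}
// Sketch for Spark 2.2.0+ only; column names are hypothetical.
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
  .setColdStartStrategy("drop") // drop NaN predictions for unseen users/items
{code}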
[jira] [Closed] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.
[ https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-23278. - > Spark ALS : param coldStartStrategy does not exist. > --- > > Key: SPARK-23278 > URL: https://issues.apache.org/jira/browse/SPARK-23278 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Surya Prakash Reddy >Priority: Major > > An error occurred while calling o105.getParam. : > java.util.NoSuchElementException: Param coldStartStrategy does not exist. at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at > org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:280) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:214) at > java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.
[ https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surya Prakash Reddy reopened SPARK-23278: - > Spark ALS : param coldStartStrategy does not exist. > --- > > Key: SPARK-23278 > URL: https://issues.apache.org/jira/browse/SPARK-23278 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Surya Prakash Reddy >Priority: Major > > An error occurred while calling o105.getParam. : > java.util.NoSuchElementException: Param coldStartStrategy does not exist. at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at > org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:280) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:214) at > java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.
[ https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23278. --- Resolution: Duplicate > Spark ALS : param coldStartStrategy does not exist. > --- > > Key: SPARK-23278 > URL: https://issues.apache.org/jira/browse/SPARK-23278 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Surya Prakash Reddy >Priority: Major > > An error occurred while calling o105.getParam. : > java.util.NoSuchElementException: Param coldStartStrategy does not exist. at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at > org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:280) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:214) at > java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23265: Assignee: (was: Apache Spark) > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and mulit-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets }}when transforming multiple columns, since that is then applied > to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23265: Assignee: Apache Spark > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Assignee: Apache Spark >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and mulit-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets }}when transforming multiple columns, since that is then applied > to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345723#comment-16345723 ] Apache Spark commented on SPARK-23265: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/20442 > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and mulit-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets }}when transforming multiple columns, since that is then applied > to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
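A small sketch of the multi-column usage the description refers to; the column names are invented, and the mixed single-/multi-column case is only described in comments:
{code}
// Column names are invented; numBuckets (a single-column param) applies to all inputs.
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCols(Array("hour", "amount"))
  .setOutputCols(Array("hourBucket", "amountBucket"))
  .setNumBuckets(3)

// Setting inputCol and inputCols together on one instance is the mixed case
// that should raise an error, mirroring the Bucketizer logic from SPARK-22799.
{code}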
[jira] [Updated] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.
[ https://issues.apache.org/jira/browse/SPARK-23278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surya Prakash Reddy updated SPARK-23278: Description: An error occurred while calling o105.getParam. : java.util.NoSuchElementException: Param coldStartStrategy does not exist. at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) at org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:744) > Spark ALS : param coldStartStrategy does not exist. > --- > > Key: SPARK-23278 > URL: https://issues.apache.org/jira/browse/SPARK-23278 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Surya Prakash Reddy >Priority: Major > > An error occurred while calling o105.getParam. : > java.util.NoSuchElementException: Param coldStartStrategy does not exist. at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at > org.apache.spark.ml.param.Params$$anonfun$getParam$2.apply(params.scala:601) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.ml.param.Params$class.getParam(params.scala:600) at > org.apache.spark.ml.PipelineStage.getParam(Pipeline.scala:42) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) at > py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at > py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at > py4j.Gateway.invoke(Gateway.java:280) at > py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at > py4j.commands.CallCommand.execute(CallCommand.java:79) at > py4j.GatewayConnection.run(GatewayConnection.java:214) at > java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23278) Spark ALS : param coldStartStrategy does not exist.
Surya Prakash Reddy created SPARK-23278: --- Summary: Spark ALS : param coldStartStrategy does not exist. Key: SPARK-23278 URL: https://issues.apache.org/jira/browse/SPARK-23278 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Surya Prakash Reddy -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23277) Spark ALS : param coldStartStrategy does not exist.
Surya Prakash Reddy created SPARK-23277: --- Summary: Spark ALS : param coldStartStrategy does not exist. Key: SPARK-23277 URL: https://issues.apache.org/jira/browse/SPARK-23277 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.1.0 Reporter: Surya Prakash Reddy -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM
[ https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345686#comment-16345686 ] Apache Spark commented on SPARK-23275: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/20441 > hive/tests have been failing when run locally on the laptop (Mac) with OOM > --- > > Key: SPARK-23275 > URL: https://issues.apache.org/jira/browse/SPARK-23275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Major > > hive tests have been failing when they are run locally (Mac Os) after a > recent change in the trunk. After running the tests for some time, the test > fails with OOM with Error: unable to create new native thread. > I noticed the thread count goes all the way up to 2000+ after which we start > getting these OOM errors. Most of the threads seem to be related to the > connection pool in hive metastore (BoneCP-x- ). This behaviour change > is happening after we made the following change to HiveClientImpl.reset() > {code} > def reset(): Unit = withHiveState { > try { > // code > } finally { > runSqlHive("USE default") ===> this is causing the issue > } > {code} > I am proposing to temporarily back-out part of a fix made to address > SPARK-23000 to resolve this issue while we work-out the exact reason for this > sudden increase in thread counts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM
[ https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23275: Assignee: (was: Apache Spark) > hive/tests have been failing when run locally on the laptop (Mac) with OOM > --- > > Key: SPARK-23275 > URL: https://issues.apache.org/jira/browse/SPARK-23275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Priority: Major > > hive tests have been failing when they are run locally (Mac Os) after a > recent change in the trunk. After running the tests for some time, the test > fails with OOM with Error: unable to create new native thread. > I noticed the thread count goes all the way up to 2000+ after which we start > getting these OOM errors. Most of the threads seem to be related to the > connection pool in hive metastore (BoneCP-x- ). This behaviour change > is happening after we made the following change to HiveClientImpl.reset() > {code} > def reset(): Unit = withHiveState { > try { > // code > } finally { > runSqlHive("USE default") ===> this is causing the issue > } > {code} > I am proposing to temporarily back-out part of a fix made to address > SPARK-23000 to resolve this issue while we work-out the exact reason for this > sudden increase in thread counts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM
[ https://issues.apache.org/jira/browse/SPARK-23275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23275: Assignee: Apache Spark > hive/tests have been failing when run locally on the laptop (Mac) with OOM > --- > > Key: SPARK-23275 > URL: https://issues.apache.org/jira/browse/SPARK-23275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Dilip Biswal >Assignee: Apache Spark >Priority: Major > > hive tests have been failing when they are run locally (Mac Os) after a > recent change in the trunk. After running the tests for some time, the test > fails with OOM with Error: unable to create new native thread. > I noticed the thread count goes all the way up to 2000+ after which we start > getting these OOM errors. Most of the threads seem to be related to the > connection pool in hive metastore (BoneCP-x- ). This behaviour change > is happening after we made the following change to HiveClientImpl.reset() > {code} > def reset(): Unit = withHiveState { > try { > // code > } finally { > runSqlHive("USE default") ===> this is causing the issue > } > {code} > I am proposing to temporarily back-out part of a fix made to address > SPARK-23000 to resolve this issue while we work-out the exact reason for this > sudden increase in thread counts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-23274: --- Labels: correctness (was: ) > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > Labels: correctness > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) >
[jira] [Assigned] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23276: Assignee: (was: Apache Spark) > Enable UDT tests in (Hive)OrcHadoopFsRelationSuite > -- > > Key: SPARK-23276 > URL: https://issues.apache.org/jira/browse/SPARK-23276 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Like Parquet, ORC test suite should enable UDT tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23276: Assignee: Apache Spark > Enable UDT tests in (Hive)OrcHadoopFsRelationSuite > -- > > Key: SPARK-23276 > URL: https://issues.apache.org/jira/browse/SPARK-23276 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Like Parquet, ORC test suite should enable UDT tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345679#comment-16345679 ] Apache Spark commented on SPARK-23276: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/20440 > Enable UDT tests in (Hive)OrcHadoopFsRelationSuite > -- > > Key: SPARK-23276 > URL: https://issues.apache.org/jira/browse/SPARK-23276 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Like Parquet, ORC test suite should enable UDT tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23276: -- Component/s: Tests > Enable UDT tests in (Hive)OrcHadoopFsRelationSuite > -- > > Key: SPARK-23276 > URL: https://issues.apache.org/jira/browse/SPARK-23276 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Like Parquet, ORC test suite should enable UDT tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
[ https://issues.apache.org/jira/browse/SPARK-23276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23276: -- Description: Like Parquet, ORC test suite should enable UDT tests. > Enable UDT tests in (Hive)OrcHadoopFsRelationSuite > -- > > Key: SPARK-23276 > URL: https://issues.apache.org/jira/browse/SPARK-23276 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Like Parquet, ORC test suite should enable UDT tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23276) Enable UDT tests in (Hive)OrcHadoopFsRelationSuite
Dongjoon Hyun created SPARK-23276: - Summary: Enable UDT tests in (Hive)OrcHadoopFsRelationSuite Key: SPARK-23276 URL: https://issues.apache.org/jira/browse/SPARK-23276 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23267) Increase spark.sql.codegen.hugeMethodLimit to 65535
[ https://issues.apache.org/jira/browse/SPARK-23267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23267. - Resolution: Fixed Fix Version/s: 2.3.0 > Increase spark.sql.codegen.hugeMethodLimit to 65535 > --- > > Key: SPARK-23267 > URL: https://issues.apache.org/jira/browse/SPARK-23267 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.3.0 > > > Still saw the performance regression introduced by > `spark.sql.codegen.hugeMethodLimit` in our internal workloads. There are two > major issues in the current solution. > * The size of the complied byte code is not identical to the bytecode size > of the method. The detection is still not accurate. > * The bytecode size of a single operator (e.g., `SerializeFromObject`) could > still exceed 8K limit. We saw the performance regression in such scenario. > Since it is close to the release of 2.3, we decide to increase it to 64K for > avoiding the perf regression. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
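Since {{spark.sql.codegen.hugeMethodLimit}} is a SQL conf, it can also be adjusted per session if a workload needs a different threshold; a one-line sketch, assuming {{spark}} is an active SparkSession:
{code}
// 65535 matches the new default described above.
spark.conf.set("spark.sql.codegen.hugeMethodLimit", 65535)
{code}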
[jira] [Created] (SPARK-23275) hive/tests have been failing when run locally on the laptop (Mac) with OOM
Dilip Biswal created SPARK-23275: Summary: hive/tests have been failing when run locally on the laptop (Mac) with OOM Key: SPARK-23275 URL: https://issues.apache.org/jira/browse/SPARK-23275 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Dilip Biswal hive tests have been failing when they are run locally (Mac Os) after a recent change in the trunk. After running the tests for some time, the test fails with OOM with Error: unable to create new native thread. I noticed the thread count goes all the way up to 2000+ after which we start getting these OOM errors. Most of the threads seem to be related to the connection pool in hive metastore (BoneCP-x- ). This behaviour change is happening after we made the following change to HiveClientImpl.reset() {code} def reset(): Unit = withHiveState { try { // code } finally { runSqlHive("USE default") ===> this is causing the issue } {code} I am proposing to temporarily back-out part of a fix made to address SPARK-23000 to resolve this issue while we work-out the exact reason for this sudden increase in thread counts. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23254) Add user guide entry for DataFrame multivariate summary
[ https://issues.apache.org/jira/browse/SPARK-23254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345651#comment-16345651 ] Weichen Xu commented on SPARK-23254: I will work on this. Thanks! > Add user guide entry for DataFrame multivariate summary > --- > > Key: SPARK-23254 > URL: https://issues.apache.org/jira/browse/SPARK-23254 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Minor > > SPARK-19634 added a DataFrame API for vector summary statistics. The [ML user > guide|http://spark.apache.org/docs/latest/ml-statistics.html] should be > updated, with the relevant example (to be in parity with the [MLlib user > guide|http://spark.apache.org/docs/latest/mllib-statistics.html#summary-statistics]). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
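A minimal sketch of the kind of example the user-guide entry could show, based on the {{Summarizer}} API added by SPARK-19634; the data is made up and the exact guide wording is still to be written:
{code}
// Made-up data; mirrors the style of the existing MLlib summary-statistics example.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col
import spark.implicits._

val df = Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0))
  .map(Tuple1.apply)
  .toDF("features")

df.select(Summarizer.metrics("mean", "variance").summary(col("features")).as("summary"))
  .select("summary.mean", "summary.variance")
  .show()
{code}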
[jira] [Commented] (SPARK-23261) Rename Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345641#comment-16345641 ] Apache Spark commented on SPARK-23261: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20439 > Rename Pandas UDFs > -- > > Key: SPARK-23261 > URL: https://issues.apache.org/jira/browse/SPARK-23261 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Major > Fix For: 2.4.0 > > > Rename the public APIs of pandas udfs from > - PANDAS SCALAR UDF -> SCALAR PANDAS UDF > - PANDAS GROUP MAP UDF -> GROUPED MAP PANDAS UDF > - PANDAS GROUP AGG UDF -> PANDAS UDAF -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345615#comment-16345615 ] Marcelo Vanzin commented on SPARK-18085: If you want the short and dirty description: these changes decouple the application status data storage from the UI code, and allow different ways of storing the status data. Shipped with 2.3 are in-memory and disk storage options. This allows the SHS to use disk-based storage so that it consumes less memory and serves data more quickly when restarted. > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Major > Labels: SPIP > Fix For: 2.3.0 > > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a path to how to solve them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23020: Assignee: Marcelo Vanzin (was: Apache Spark) > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Marcelo Vanzin >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23020) Re-enable Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23020: Assignee: Apache Spark (was: Marcelo Vanzin) > Re-enable Flaky Test: > org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.4.0 >Reporter: Sameer Agarwal >Assignee: Apache Spark >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)
[ https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345598#comment-16345598 ] Thomas Bünger commented on SPARK-12394: --- Any news on this issue? Is it really fixed? I also can't find a corresponding pull request. > Support writing out pre-hash-partitioned data and exploit that in join > optimizations to avoid shuffle (i.e. bucketing in Hive) > -- > > Key: SPARK-12394 > URL: https://issues.apache.org/jira/browse/SPARK-12394 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Reynold Xin >Assignee: Nong Li >Priority: Major > Fix For: 2.0.0 > > Attachments: BucketedTables.pdf > > > In many cases users know ahead of time the columns that they will be joining > or aggregating on. Ideally they should be able to leverage this information > and pre-shuffle the data so that subsequent queries do not require a shuffle. > Hive supports this functionality by allowing the user to define buckets, > which are hash partitioning of the data based on some key. > - Allow the user to specify a set of columns when caching or writing out data > - Allow the user to specify some parallelism > - Shuffle the data when writing / caching such that its distributed by these > columns > - When planning/executing a query, use this distribution to avoid another > shuffle when reading, assuming the join or aggregation is compatible with the > columns specified > - Should work with existing save modes: append, overwrite, etc > - Should work at least with all Hadoops FS data sources > - Should work with any data source when caching -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
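For readers landing here, a short sketch of the bucketing API this ticket introduced on the write path; the table and column names are hypothetical and {{spark}} is assumed to be an active SparkSession:
{code}
// Hypothetical table and column names.
val events = spark.range(0, 1000).selectExpr("id % 100 as userId", "id as eventId")

events.write
  .bucketBy(8, "userId")
  .sortBy("userId")
  .saveAsTable("events_bucketed")

// A join on userId between tables bucketed into the same number of buckets on
// that key can then avoid shuffling the bucketed side(s).
{code}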
[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345585#comment-16345585 ] Huaxin Gao commented on SPARK-23265: I am working on it. Will submit a PR today. > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and mulit-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets }}when transforming multiple columns, since that is then applied > to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345551#comment-16345551 ] Kazuaki Ishizaki edited comment on SPARK-18016 at 1/30/18 6:27 PM: --- [~gaurav.garg] Thank you for your confirmation. I am running a program that has {{Statistics.corr(vec, "pearson")}}. It takes some time until the program stops. I would appreciate it if you could share a log file that includes the generated Java code. was (Author: kiszk): [~gaurav.garg] Thank you for your confirmation. I am running a program that has \{{Statistics.corr(vec, "pearson"})}}. It takes some time until the program stops. I would appreciate it if you could share a log file that includes the generated Java code. > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.U
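A rough, hedged sketch of the kind of call being reproduced in the comment above; the vector width and row count are placeholders, and whether this particular shape actually reproduces the constant-pool failure depends on the scenario discussed in the comments:
{code}
// Width and row count are placeholders only.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val rows = spark.sparkContext.parallelize(Seq.fill(10)(
  Vectors.dense(Array.fill(2000)(scala.util.Random.nextDouble()))
))
val pearson = Statistics.corr(rows, "pearson")
{code}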
[jira] [Updated] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23274: Target Version/s: 2.3.0 > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.Tre
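A possible workaround sketch while the optimizer fix is pending, not taken from the ticket: {{except}} (which has EXCEPT DISTINCT semantics) can usually be expressed as a {{left_anti}} join followed by {{dropDuplicates}}, which avoids the {{ReplaceExceptWithFilter}} rewrite that throws here. Null-key handling differs between the two (EXCEPT treats NULLs as equal, a join equality condition does not), so this is an assumption to verify against the data:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.format("csv").option("header", "true").load("test.csv")
val df1 = df.filter($"a" === 1)
val df2 = df.filter($"a" === 2)

// Stand-in for df1.select("b").except(df2.select("b")) for non-null keys:
// keep rows of df1 whose "b" has no match in df2, then de-duplicate.
df1.select("b")
  .join(df2.select("b"), Seq("b"), "left_anti")
  .dropDuplicates()
  .show()
{code}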
[jira] [Commented] (SPARK-23274) ReplaceExceptWithFilter fails on dataframes filtered on same column
[ https://issues.apache.org/jira/browse/SPARK-23274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345554#comment-16345554 ] Xiao Li commented on SPARK-23274: - Since this is a regression, I will try to fix it ASAP > ReplaceExceptWithFilter fails on dataframes filtered on same column > --- > > Key: SPARK-23274 > URL: https://issues.apache.org/jira/browse/SPARK-23274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Onur Satici >Assignee: Xiao Li >Priority: Blocker > > Currently affects: > {code:java} > $ git tag --contains 01f6ba0e7a > v2.3.0-rc1 > v2.3.0-rc2 > {code} > Steps to reproduce: > {code:java} > $ cat test.csv > a,b > 1,2 > 1,3 > 2,2 > 2,4 > {code} > {code:java} > val df = spark.read.format("csv").option("header", "true").load("test.csv") > val df1 = df.filter($"a" === 1) > val df2 = df.filter($"a" === 2) > df1.select("b").except(df2.select("b")).show > {code} > results in: > {code:java} > java.util.NoSuchElementException: key not found: a > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at scala.collection.AbstractMap.apply(Map.scala:59) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition$1.applyOrElse(ReplaceExceptWithFilter.scala:60) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$.org$apache$spark$sql$catalyst$optimizer$ReplaceExceptWithFilter$$transformCondition(ReplaceExceptWithFilter.scala:60) > at > 
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:50) > at > org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter$$anonfun$apply$1.applyOrElse(ReplaceExceptWithFilter.scala:48) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transfo
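The {{key not found: a}} message suggests the rewrite only fails when the filter column has been projected away before {{except}}. A sketch of that contrast, reusing {{df1}} and {{df2}} from the reproduction above, under the assumption (not confirmed in the ticket) that keeping column {{a}} in both projections sidesteps the crash:

{code:java}
// Crashes on 2.3.0-rc: the filter column "a" is not in the projected output,
// so ReplaceExceptWithFilter cannot resolve it when rewriting the condition.
df1.select("b").except(df2.select("b")).show()

// Assumed to avoid the crash (not verified in the ticket): the filter column
// stays in the output, so the attribute lookup succeeds. Note this is an
// except over (a, b), which is a different query than an except over b alone.
df1.select("a", "b").except(df2.select("a", "b")).show()
{code}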