[jira] [Resolved] (SPARK-27539) Fix inaccurate aggregate outputRows estimation with column containing null values

2019-04-22 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27539.
---
   Resolution: Fixed
 Assignee: peng bo
Fix Version/s: 2.4.3
   3.0.0

This is resolved via https://github.com/apache/spark/pull/24436.

> Fix inaccurate aggregate outputRows estimation with column containing null 
> values
> -
>
> Key: SPARK-27539
> URL: https://issues.apache.org/jira/browse/SPARK-27539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: peng bo
>Assignee: peng bo
>Priority: Major
> Fix For: 3.0.0, 2.4.3
>
>
> This issue is a follow-up of [https://github.com/apache/spark/pull/24286]. As 
> [~smilegator] pointed out, the estimate for a column containing null values is inaccurate as well.
> {code:java}
> > select key from test;
> 2
> NULL
> 1
> spark-sql> desc extended test key;
> col_name key
> data_type int
> comment NULL
> min 1
> max 2
> num_nulls 1
> distinct_count 2{code}
> The distinct count should be distinct_count + 1 when the column contains null 
> values.
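
(A minimal sketch of the adjusted estimate, in Scala; ColStats here is a hypothetical stand-in for the distinct_count/num_nulls statistics shown above, not Spark's actual estimator code.)

{code:scala}
// Minimal sketch of the null-adjusted group-count estimate.
case class ColStats(distinctCount: BigInt, nullCount: BigInt)

def estimatedOutputRows(stats: ColStats): BigInt =
  if (stats.nullCount > 0) stats.distinctCount + 1  // NULL forms one extra group
  else stats.distinctCount

// key column above: distinct_count = 2, num_nulls = 1 => 3 groups, not 2.
assert(estimatedOutputRows(ColStats(distinctCount = 2, nullCount = 1)) == 3)
{code}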



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27539) Fix inaccurate aggregate outputRows estimation with column containing null values

2019-04-22 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27539:
--
Summary: Fix inaccurate aggregate outputRows estimation with column 
containing null values  (was: Inaccurate aggregate outputRows estimation with 
column contains null value)

> Fix inaccurate aggregate outputRows estimation with column containing null 
> values
> -
>
> Key: SPARK-27539
> URL: https://issues.apache.org/jira/browse/SPARK-27539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: peng bo
>Priority: Major
>
> This issue is a follow-up of [https://github.com/apache/spark/pull/24286]. As 
> [~smilegator] pointed out, the estimate for a column containing null values is inaccurate as well.
> {code:java}
> > select key from test;
> 2
> NULL
> 1
> spark-sql> desc extended test key;
> col_name key
> data_type int
> comment NULL
> min 1
> max 2
> num_nulls 1
> distinct_count 2{code}
> The distinct count should be distinct_count + 1 when the column contains null 
> values.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27505) autoBroadcastJoinThreshold including bigger table

2019-04-22 Thread Mike Chan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823637#comment-16823637
 ] 

Mike Chan commented on SPARK-27505:
---

Would you mind sharing any info on a self-reproducer? I tried to google it myself but 
nothing came through. Thank you.

> autoBroadcastJoinThreshold including bigger table
> -
>
> Key: SPARK-27505
> URL: https://issues.apache.org/jira/browse/SPARK-27505
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
> Environment: Hive table with Spark 2.3.1 on Azure, using Azure 
> storage as storage layer
>Reporter: Mike Chan
>Priority: Major
> Attachments: explain_plan.txt
>
>
> I'm working on a case where, when a certain table is exposed to a broadcast join, the 
> query eventually fails with a remote block error. 
>  
> First, we set spark.sql.autoBroadcastJoinThreshold to 10MB, namely 
> 10485760
> [inline screenshot: spark.sql.autoBroadcastJoinThreshold shown as 10485760]
>  
> Then we proceeded to run the query. In the SQL plan, we found that one table 
> that is 25MB in size is broadcast as well.
>  
> [inline screenshot: SQL plan showing the ~25MB table being broadcast]
>  
> Also, in desc extended, the table is 24452111 bytes. It is a Hive table. We 
> always run into an error when this table is broadcast. Below is a sample 
> error:
>  
> Caused by: java.io.IOException: org.apache.spark.SparkException: corrupt 
> remote block broadcast_477_piece0 of broadcast_477: 298778625 != -992055931 
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1350) at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
>  at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) 
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
>  
> Also attached the physical plan if you're interested. One thing to note: 
> if I turn down autoBroadcastJoinThreshold to 5MB, this query 
> executes successfully and default.product is NOT broadcast.
> However, when I change to another query that selects 
> even fewer columns than the previous one, even at 5MB this table still gets 
> broadcast and fails with the same error. I even changed to 1MB and still 
> the same.
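
(For readers trying to reproduce or work around this, a minimal spark-shell sketch of the relevant configuration; the table names and join key below are made up.)

{code:scala}
// spark-shell session assumed; table and join key names are illustrative only.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)  // 10 MB

// A BroadcastHashJoin node in this output means the planner still chose to
// broadcast the table, regardless of its reported size.
spark.sql("SELECT * FROM fact_table JOIN default.product USING (product_id)").explain()

// Blunt workaround while the size estimation is investigated: disable
// broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
{code}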



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2019-04-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823631#comment-16823631
 ] 

Hyukjin Kwon commented on SPARK-18673:
--

This is blocked by SPARK-23710

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27535) Date and timestamp JSON benchmarks

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27535:


Assignee: Maxim Gekk

> Date and timestamp JSON benchmarks
> --
>
> Key: SPARK-27535
> URL: https://issues.apache.org/jira/browse/SPARK-27535
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Extend JSONBenchmark by new benchmarks:
> * Write dates/timestamps to files
> * Read/infer dates/timestamp from files
> * Read/infer dates/timestamps from Dataset[String]
> * to_json/from_json for dates/timestamps
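
(A tiny spark-shell illustration of the to_json/from_json round-trip the last item above covers; the data is made up.)

{code:scala}
import org.apache.spark.sql.functions.{from_json, struct, to_json}
import org.apache.spark.sql.types.{StructField, StructType, TimestampType}

// Round-trip a timestamp through JSON, the shape the benchmark measures.
val df = Seq(java.sql.Timestamp.valueOf("2019-04-22 00:00:00")).toDF("ts")
val asJson = df.select(to_json(struct($"ts")).as("json"))
val schema = StructType(Seq(StructField("ts", TimestampType)))
asJson.select(from_json($"json", schema).as("parsed")).show(false)
{code}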



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



subscribe

2019-04-22 Thread Bowen Li



[jira] [Resolved] (SPARK-27535) Date and timestamp JSON benchmarks

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27535.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24430
[https://github.com/apache/spark/pull/24430]

> Date and timestamp JSON benchmarks
> --
>
> Key: SPARK-27535
> URL: https://issues.apache.org/jira/browse/SPARK-27535
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Extend JSONBenchmark by new benchmarks:
> * Write dates/timestamps to files
> * Read/infer dates/timestamp from files
> * Read/infer dates/timestamps from Dataset[String]
> * to_json/from_json for dates/timestamps



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27533) Date and timestamp CSV benchmarks

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27533.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24429
[https://github.com/apache/spark/pull/24429]

> Date and timestamp CSV benchmarks
> -
>
> Key: SPARK-27533
> URL: https://issues.apache.org/jira/browse/SPARK-27533
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Extend CSVBenchmark by new benchmarks:
> - Write dates/timestamps to files
> - Read/infer dates/timestamp from files
> - Read/infer dates/timestamps from Dataset[String]
> - to_csv/from_csv for dates/timestamps



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27533) Date and timestamp CSV benchmarks

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27533:


Assignee: Maxim Gekk

> Date and timestamp CSV benchmarks
> -
>
> Key: SPARK-27533
> URL: https://issues.apache.org/jira/browse/SPARK-27533
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Extend CSVBenchmark by new benchmarks:
> - Write dates/timestamps to files
> - Read/infer dates/timestamp from files
> - Read/infer dates/timestamps from Dataset[String]
> - to_csv/from_csv for dates/timestamps



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27528) Use Parquet logical type TIMESTAMP_MICROS by default

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27528:


Assignee: Maxim Gekk

> Use Parquet logical type TIMESTAMP_MICROS by default
> 
>
> Key: SPARK-27528
> URL: https://issues.apache.org/jira/browse/SPARK-27528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, Spark uses the INT96 type for timestamps written to Parquet files. To 
> store Catalyst's Timestamp values as INT96, Spark converts microseconds since 
> the epoch to nanoseconds in the Julian calendar. This conversion is not necessary if 
> Spark saves timestamps as the Parquet TIMESTAMP_MICROS logical type. The ticket 
> aims to switch the default write type from INT96 to TIMESTAMP_MICROS.
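
(For reference, a spark-shell sketch of opting in today via the existing spark.sql.parquet.outputTimestampType setting; the output path is made up.)

{code:scala}
// Opt in to TIMESTAMP_MICROS explicitly rather than relying on the default
// (valid values also include INT96 and TIMESTAMP_MILLIS).
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

Seq(java.sql.Timestamp.valueOf("2019-04-22 11:50:01.123456"))
  .toDF("ts")
  .write.mode("overwrite").parquet("/tmp/ts_micros")
{code}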



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27528) Use Parquet logical type TIMESTAMP_MICROS by default

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27528.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24425
[https://github.com/apache/spark/pull/24425]

> Use Parquet logical type TIMESTAMP_MICROS by default
> 
>
> Key: SPARK-27528
> URL: https://issues.apache.org/jira/browse/SPARK-27528
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, Spark uses the INT96 type for timestamps written to Parquet files. To 
> store Catalyst's Timestamp values as INT96, Spark converts microseconds since 
> the epoch to nanoseconds in the Julian calendar. This conversion is not necessary if 
> Spark saves timestamps as the Parquet TIMESTAMP_MICROS logical type. The ticket 
> aims to switch the default write type from INT96 to TIMESTAMP_MICROS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-04-22 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823613#comment-16823613
 ] 

zhoukang commented on SPARK-25299:
--

nice work!
Really looking forward to it, thanks [~yifeih]

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27543) Support getRequiredJars and getRequiredFiles APIs for Hive UDFs

2019-04-22 Thread Sergey (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated SPARK-27543:
---
Issue Type: Improvement  (was: Bug)

> Support getRequiredJars and getRequiredFiles APIs for Hive UDFs
> ---
>
> Key: SPARK-27543
> URL: https://issues.apache.org/jira/browse/SPARK-27543
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.4.1
>Reporter: Sergey
>Priority: Minor
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
>
> *getRequiredJars* and *getRequiredFiles* are functions to automatically include 
> additional resources required by a UDF. The files provided by these methods 
> would be accessible to executors by their simple file names. This is 
> necessary for UDFs that need some required files distributed, or 
> classes from third-party jars available on executors.
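
(A minimal sketch of how a Hive UDF declares these resources, written in Scala against Hive's GenericUDF API; the class name and resource paths are hypothetical.)

{code:scala}
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class LookupUDF extends GenericUDF {
  // Resources the engine is asked to ship before the UDF runs; the point of
  // this ticket is for Spark to honor these declarations.
  override def getRequiredFiles: Array[String] = Array("/shared/lookup.txt")
  override def getRequiredJars: Array[String] = Array("/shared/third-party-lib.jar")

  override def initialize(args: Array[ObjectInspector]): ObjectInspector =
    PrimitiveObjectInspectorFactory.javaStringObjectInspector

  override def evaluate(args: Array[DeferredObject]): AnyRef = {
    // Once the file has been distributed, it should be readable by its simple name.
    val src = scala.io.Source.fromFile("lookup.txt")
    try src.getLines().mkString(",") finally src.close()
  }

  override def getDisplayString(children: Array[String]): String = "lookup_udf()"
}
{code}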



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27543) Support getRequiredJars and getRequiredFiles APIs for Hive UDFs

2019-04-22 Thread Sergey (JIRA)
Sergey created SPARK-27543:
--

 Summary: Support getRequiredJars and getRequiredFiles APIs for 
Hive UDFs
 Key: SPARK-27543
 URL: https://issues.apache.org/jira/browse/SPARK-27543
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.1, 2.0.0
Reporter: Sergey


*getRequiredJars* and *getRequiredFiles* are functions to automatically include 
additional resources required by a UDF. The files provided by these methods 
would be accessible to executors by their simple file names. This is necessary for 
UDFs that need some required files distributed, or classes from 
third-party jars available on executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23773) JacksonGenerator does not include keys that have null value for StructTypes

2019-04-22 Thread Sergey (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated SPARK-23773:
---
Issue Type: Bug  (was: Improvement)

> JacksonGenerator does not include keys that have null value for StructTypes
> ---
>
> Key: SPARK-23773
> URL: https://issues.apache.org/jira/browse/SPARK-23773
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Sergey
>Priority: Trivial
>
> When "toJSON" is called on a dataset, the result JSON string will not have 
> keys displayed for StructTypes that have null value.
> Repro:
> {noformat}
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> ...
> scala> val df = sqlContext.sql(""" select NAMED_STRUCT('f1', null, 'f2', 
> ARRAY(TRUE, FALSE), 'f3', MAP(123L, 123.456), 'f4', 'some string') as 
> my_struct  """)
>  ...
> scala> df.toJSON.collect().foreach(println)
> {"my_struct":{"f2":[true,false],"f3":({"123":123.456},"f4":"some string"}}
> {noformat}
> The key "f1" is missing in JSON string.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23773) JacksonGenerator does not include keys that have null value for StructTypes

2019-04-22 Thread Sergey (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey updated SPARK-23773:
---
Issue Type: Improvement  (was: Bug)

> JacksonGenerator does not include keys that have null value for StructTypes
> ---
>
> Key: SPARK-23773
> URL: https://issues.apache.org/jira/browse/SPARK-23773
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0, 2.3.0
>Reporter: Sergey
>Priority: Trivial
>
> When "toJSON" is called on a dataset, the result JSON string will not have 
> keys displayed for StructTypes that have null value.
> Repro:
> {noformat}
> scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> ...
> scala> val df = sqlContext.sql(""" select NAMED_STRUCT('f1', null, 'f2', 
> ARRAY(TRUE, FALSE), 'f3', MAP(123L, 123.456), 'f4', 'some string') as 
> my_struct  """)
>  ...
> scala> df.toJSON.collect().foreach(println)
> {"my_struct":{"f2":[true,false],"f3":({"123":123.456},"f4":"some string"}}
> {noformat}
> The key "f1" is missing in JSON string.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27542) SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs for some legacy OutputFormats

2019-04-22 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-27542:
--

 Summary: SparkHadoopWriter doesn't set call setWorkOutputPath, 
causing NPEs for some legacy OutputFormats
 Key: SPARK-27542
 URL: https://issues.apache.org/jira/browse/SPARK-27542
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.0
Reporter: Josh Rosen


In Hadoop MapReduce, tasks call {{FileOutputFormat.setWorkOutputPath()}} after 
configuring the  output committer: 
[https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611]
 

Spark doesn't do this: 
[https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115]

As a result, certain legacy output formats can fail to work out-of-the-box on 
Spark. In particular, 
{{org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat}} can fail 
with NullPointerExceptions, e.g.
{code:java}
java.lang.NullPointerException
  at org.apache.hadoop.fs.Path.<init>(Path.java:105)
  at org.apache.hadoop.fs.Path.<init>(Path.java:94)
  at 
org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
[...]
  at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
{code}

It looks like someone on GitHub has hit the same problem: 
https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe

Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348

We might be able to fix this by having Spark mimic Hadoop's logic. I'm unsure 
of whether that change would pose compatibility risks for other existing 
workloads, though.
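
(A rough sketch, not a proposed patch, of the Hadoop-side call being referenced, using the mapred API directly; the paths are made up, and in a real fix the value would come from the output committer's task work path.)

{code:scala}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{FileOutputFormat, JobConf}

val conf = new JobConf()
FileOutputFormat.setOutputPath(conf, new Path("/tmp/job-output"))

// This is the call Hadoop's Task makes after configuring the committer and
// that SparkHadoopWriter currently skips; legacy OutputFormats that call
// getWorkOutputPath()/getDefaultWorkFile() need it to be non-null.
FileOutputFormat.setWorkOutputPath(
  conf, new Path("/tmp/job-output/_temporary/0/_temporary/attempt_000000_0"))
{code}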



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27542) SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs when using certain legacy OutputFormats

2019-04-22 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27542:
---
Summary: SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs 
when using certain legacy OutputFormats  (was: SparkHadoopWriter doesn't set 
call setWorkOutputPath, causing NPEs for some legacy OutputFormats)

> SparkHadoopWriter doesn't set call setWorkOutputPath, causing NPEs when using 
> certain legacy OutputFormats
> --
>
> Key: SPARK-27542
> URL: https://issues.apache.org/jira/browse/SPARK-27542
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> In Hadoop MapReduce, tasks call {{FileOutputFormat.setWorkOutputPath()}} 
> after configuring the  output committer: 
> [https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L611]
>  
> Spark doesn't do this: 
> [https://github.com/apache/spark/blob/2d085c13b7f715dbff23dd1f81af45ff903d1a79/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala#L115]
> As a result, certain legacy output formats can fail to work out-of-the-box on 
> Spark. In particular, 
> {{org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat}} can fail 
> with NullPointerExceptions, e.g.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:105)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:94)
>   at 
> org.apache.parquet.hadoop.mapred.DeprecatedParquetOutputFormat.getDefaultWorkFile(DeprecatedParquetOutputFormat.java:69)
> [...]
>   at org.apache.spark.SparkHadoopWriter.write(SparkHadoopWriter.scala:96)
> {code}
> It looks like someone on GitHub has hit the same problem: 
> https://gist.github.com/themodernlife/e3b07c23ba978f6cc98b73e3f3609abe
> Tez had a very similar bug: https://issues.apache.org/jira/browse/TEZ-3348
> We might be able to fix this by having Spark mimic Hadoop's logic. I'm unsure 
> of whether that change would pose compatibility risks for other existing 
> workloads, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2019-04-22 Thread KaiXu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823462#comment-16823462
 ] 

KaiXu commented on SPARK-18673:
---

I'm OOO, please expect slow email response, sorry for the inconvenience.


> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18673) Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version

2019-04-22 Thread shanyu zhao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823458#comment-16823458
 ] 

shanyu zhao commented on SPARK-18673:
-

Ping. What is the verdict here for users who want to use Spark 2.4 and Hadoop 3.1?

> Dataframes doesn't work on Hadoop 3.x; Hive rejects Hadoop version
> --
>
> Key: SPARK-18673
> URL: https://issues.apache.org/jira/browse/SPARK-18673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Spark built with -Dhadoop.version=3.0.0-alpha2-SNAPSHOT 
>Reporter: Steve Loughran
>Priority: Major
>
> Spark Dataframes fail to run on Hadoop 3.0.x, because hive.jar's shimloader 
> considers 3.x to be an unknown Hadoop version.
> Hive itself will have to fix this; as Spark uses its own hive 1.2.x JAR, it 
> will need to be updated to match.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27534) Do not load `content` column in binary data source if it is not selected

2019-04-22 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-27534:
-

Assignee: Weichen Xu

> Do not load `content` column in binary data source if it is not selected
> 
>
> Key: SPARK-27534
> URL: https://issues.apache.org/jira/browse/SPARK-27534
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> A follow-up task from SPARK-25348. To save I/O cost, Spark shouldn't attempt 
> to read the file if users didn't request the `content` column. For example:
> {code}
> spark.read.format("binaryFile").load(path).filter($"length" < 100).count()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27531) Improve explain output of describe table command to show the inputs to the command.

2019-04-22 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27531.
---
   Resolution: Fixed
 Assignee: Dilip Biswal
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24427

> Improve explain output of describe table command to show the inputs to the 
> command.
> ---
>
> Key: SPARK-27531
> URL: https://issues.apache.org/jira/browse/SPARK-27531
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Dilip Biswal
>Assignee: Dilip Biswal
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently "EXPLAIN DESC TABLE" is special cased and outputs a single row 
> relation as following. This is not consistent with how we handle explain 
> processing for other commands. 
> Current output :
> {code:java}
> spark-sql> EXPLAIN DESCRIBE TABLE t1;
> == Physical Plan ==
> *(1) Scan OneRowRelation[]{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-04-22 Thread Yifei Huang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823317#comment-16823317
 ] 

Yifei Huang commented on SPARK-25299:
-

You can follow the API refactor work here: 
[https://github.com/palantir/spark/pulls?utf8=%E2%9C%93=is%3Apr+base%3Aspark-25299].
 

We are also in the process of prototyping implementations using this API to 
further validate it. For example, this is an implementation of the API 
using Apache Ignite: 
[https://github.com/mccheah/ignite-shuffle-service/pull/1]. We are also aiming 
to try other prototypes (e.g. individual shuffle file servers, async uploads to 
S3) in the upcoming weeks. 

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27392) TestHive test tables should be placed in shared test state, not per session

2019-04-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27392.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24302
[https://github.com/apache/spark/pull/24302]

> TestHive test tables should be placed in shared test state, not per session
> ---
>
> Key: SPARK-27392
> URL: https://issues.apache.org/jira/browse/SPARK-27392
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Otherwise, tests that use tables from multiple sessions will run into issues 
> if they access the same table. The correct location is in shared state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27392) TestHive test tables should be placed in shared test state, not per session

2019-04-22 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27392:
-

Assignee: Eric Liang

> TestHive test tables should be placed in shared test state, not per session
> ---
>
> Key: SPARK-27392
> URL: https://issues.apache.org/jira/browse/SPARK-27392
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Eric Liang
>Assignee: Eric Liang
>Priority: Minor
>
> Otherwise, tests that use tables from multiple sessions will run into issues 
> if they access the same table. The correct location is in shared state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6

2019-04-22 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-25079.
---

i kept an eye on things over the weekend, and everything seemed to be working 
great!

> [PYTHON] upgrade python 3.4 -> 3.6
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 3.0.0
>
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.6.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.6

2019-04-22 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-25079:

Description: 
for the impending arrow upgrade 
(https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 3.4 
-> 3.6.

i have been testing this here:  
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]

my methodology:

1) upgrade python + arrow to 3.5 and 0.10.0

2) run python tests

3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
upgrade centos workers to python3.5

4) simultaneously do the following: 

  - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that points 
to python3.5 (this is currently being tested here:  
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]

  - push a change to python/run-tests.py replacing 3.4 with 3.5

5) once the python3.5 change to run-tests.py is merged, we will need to 
back-port this to all existing branches

6) then and only then can i remove the python3.4 -> python3.5 symlink

  was:
for the impending arrow upgrade 
(https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 3.4 
-> 3.5.

i have been testing this here:  
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]

my methodology:

1) upgrade python + arrow to 3.5 and 0.10.0

2) run python tests

3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
upgrade centos workers to python3.5

4) simultaneously do the following: 

  - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that points 
to python3.5 (this is currently being tested here:  
[https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]

  - push a change to python/run-tests.py replacing 3.4 with 3.5

5) once the python3.5 change to run-tests.py is merged, we will need to 
back-port this to all existing branches

6) then and only then can i remove the python3.4 -> python3.5 symlink


> [PYTHON] upgrade python 3.4 -> 3.6
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Fix For: 3.0.0
>
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.6.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27541) Refresh class definitions for jars added via addJar()

2019-04-22 Thread Naved Alam (JIRA)
Naved Alam created SPARK-27541:
--

 Summary: Refresh class definitions for jars added via addJar()
 Key: SPARK-27541
 URL: https://issues.apache.org/jira/browse/SPARK-27541
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.3
Reporter: Naved Alam


Currently, if a class is loaded by the executor, its definition cannot be 
updated (because classloaders won't load an already loaded class again). For 
use cases with long-running SparkContexts, this becomes a problem when there 
are requirements to update the definition of one of these classes.

There should be a Spark property which, when turned on, allows the executors to 
refresh the definitions of these classes if they are added again as a new jar 
using the addJar API.
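
(Illustrative spark-shell usage only; the jar paths and class name are hypothetical.)

{code:scala}
// Long-running SparkContext; executors load classes from this jar on demand.
spark.sparkContext.addJar("hdfs:///deploy/business-rules-v1.jar")
// ... jobs run, executors load and cache com.example.Rules from v1 ...

// Today, adding an updated jar does not change the definition of a class the
// executor classloader has already loaded; the proposed property would allow
// executors to pick up the v2 definition instead.
spark.sparkContext.addJar("hdfs:///deploy/business-rules-v2.jar")
{code}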

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27367) Faster RoaringBitmap Serialization with v0.8.0

2019-04-22 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823178#comment-16823178
 ] 

Liang-Chi Hsieh commented on SPARK-27367:
-

So I think the new serde API has a performance advantage over the old API. However, 
the advantage only shows when the serialized bytes are relatively big. I can see 
the advantage begin to show when there are at least 1 partitions in 
HighlyCompressedMapStatus. It seems to me it is not easy to see such a case.
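
(For readers unfamiliar with the change, a small sketch of the two serialization paths; the ByteBuffer overload is the one the linked RoaringBitmap 0.8.0 pull request adds, so treat its exact signature as an assumption.)

{code:scala}
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.{ByteBuffer, ByteOrder}
import org.roaringbitmap.RoaringBitmap

val bitmap = RoaringBitmap.bitmapOf(1, 2, 3, 1000000)

// Old path: stream-oriented serialization through DataOutput.
val bos = new ByteArrayOutputStream()
bitmap.serialize(new DataOutputStream(bos))

// New path (assumed overload from RoaringBitmap 0.8.0): write straight into a
// ByteBuffer; the portable format is little-endian.
val buf = ByteBuffer.allocate(bitmap.serializedSizeInBytes()).order(ByteOrder.LITTLE_ENDIAN)
bitmap.serialize(buf)
{code}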

> Faster RoaringBitmap Serialization with v0.8.0
> --
>
> Key: SPARK-27367
> URL: https://issues.apache.org/jira/browse/SPARK-27367
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> RoaringBitmap 0.8.0 adds faster serde, but also requires us to change how we 
> call the serde routines slightly to take advantage of it.  This is probably a 
> worthwhile optimization as the every shuffle map task with a large # of 
> partitions generates these bitmaps, and the driver especially has to 
> deserialize many of these messages.
> See 
> * https://github.com/apache/spark/pull/24264#issuecomment-479675572
> * https://github.com/RoaringBitmap/RoaringBitmap/pull/325
> * https://github.com/RoaringBitmap/RoaringBitmap/issues/319



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27540) Add 'meanAveragePrecision_at_k' metric to RankingMetrics

2019-04-22 Thread Tarush Grover (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823163#comment-16823163
 ] 

Tarush Grover commented on SPARK-27540:
---

[~tuananh238] I am working on this issue. Please assign the issue to me.

> Add 'meanAveragePrecision_at_k' metric to RankingMetrics
> 
>
> Key: SPARK-27540
> URL: https://issues.apache.org/jira/browse/SPARK-27540
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.4.1
>Reporter: Pham Nguyen Tuan Anh
>Priority: Minor
>
> Sometimes we only care about the MAP of the top-k results.
> This ticket adds MAP@k to RankingMetrics, alongside the existing MAP.
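
(For orientation, a spark-shell sketch of the existing RankingMetrics API the new metric would sit alongside; the data is made up and the eventual method name is whatever this ticket settles on.)

{code:scala}
import org.apache.spark.mllib.evaluation.RankingMetrics

// (predicted ranking, ground-truth relevant items) per query/user.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 2, 3, 4, 5), Array(1, 2, 5)),
  (Array(4, 1, 6, 2, 3), Array(1, 2, 3))
))
val metrics = new RankingMetrics(predictionAndLabels)

metrics.meanAveragePrecision  // existing MAP over the full ranking
metrics.precisionAt(3)        // existing precision@k; MAP@k would be analogous
{code}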



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared

2019-04-22 Thread Vinoo Ganesh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823120#comment-16823120
 ] 

Vinoo Ganesh commented on SPARK-27337:
--

Hey [~cltlfcjin] - the thread is called "Closing a SparkSession stops the 
SparkContext". I'll put up a PR for this shortly. 

> QueryExecutionListener never cleans up listeners from the bus after 
> SparkSession is cleared
> ---
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Vinoo Ganesh
>Priority: Major
> Attachments: image001-1.png
>
>
> As a result of 
> [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3],
>  it looks like there is a memory leak (specifically 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131]).
>  
> Because the Listener Bus on the context still has a reference to the listener 
> (even after the SparkSession is cleared), they are never cleaned up. This 
> means that if you close and remake spark sessions fairly frequently, you're 
> leaking every single time. 
>  
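
(A small sketch of the manual mitigation, explicitly unregistering the listener before dropping the session, using only the public QueryExecutionListener API.)

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val listener = new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = ()
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
}
spark.listenerManager.register(listener)

// Without this, the shared listener bus keeps a reference to the listener even
// after the session itself is cleared, which is the leak described above.
spark.listenerManager.unregister(listener)
{code}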



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27512) Decimal parsing leads to unexpected type inference

2019-04-22 Thread koert kuipers (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823115#comment-16823115
 ] 

koert kuipers commented on SPARK-27512:
---

i agree it is better than having two different decimal parsers. note that i was 
not able to get the old behavior back by using a locale, so that is not a 
workaround insofar as i can see.

i think this is just a change we won't fix.

> Decimal parsing leads to unexpected type inference
> --
>
> Key: SPARK-27512
> URL: https://issues.apache.org/jira/browse/SPARK-27512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT from this commit:
> {code:bash}
> commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed
> Author: Dilip Biswal 
> Date:   Mon Apr 15 21:26:45 2019 +0800
> {code}
>Reporter: koert kuipers
>Priority: Minor
>
> {code:bash}
> $ hadoop fs -text test.bsv
> x|y
> 1|1,2
> 2|2,3
> 3|3,4
> {code}
> in spark 2.4.1:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: string (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1|1,2|
> |  2|2,3|
> |  3|3,4|
> +---+---+
> {code}
> in spark 3.0.0-SNAPSHOT:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: decimal(2,0) (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1| 12|
> |  2| 23|
> |  3| 34|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823097#comment-16823097
 ] 

Thomas Graves commented on SPARK-27396:
---

thanks for the questions and comments; please also vote on the DEV list email 
chain - subject:

[VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support
 
 
I'm going to extend that vote by a few days to give more people time to comment, 
as I know it's a busy time of year.

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
>  
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to 
>  
>  # Expose to end users a new option of processing the data in a columnar 
> format, multiple rows at a time, with the data organized into contiguous 
> arrays in memory. 
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the end user.
>  # Allow for simple data exchange with other systems, DL/ML libraries, 
> pandas, etc. by having clean APIs to transform the columnar data into an 
> Apache Arrow compatible layout.
>  # Provide a plugin mechanism for columnar processing support so an advanced 
> user could avoid data transition between columnar and row based processing 
> even through shuffles. This means we should at least support pluggable APIs 
> so an advanced end user can implement the columnar partitioning themselves, 
> and provide the glue necessary to shuffle the data still in a columnar format.
>  # Expose new APIs that allow advanced users or frameworks to implement 
> columnar processing either as UDFs, or by adjusting the physical plan to do 
> columnar processing.  If the latter is too controversial we can move it to 
> another SPIP, but we plan to implement some accelerated computing in parallel 
> with this feature to be sure the APIs work, and without this feature it makes 
> that impossible.
>  
> Not Requirements, but things that would be nice to have.
>  # Provide default implementations for partitioning columnar data, so users 
> don’t have to.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  # Provide a clean transition from the existing code to the new one.  The 
> existing APIs which are public but evolving are not that far off from what is 
> being proposed.  We should be able to create a new parallel API that can wrap 
> the existing one. This means any file format that is trying to support 
> columnar can still do so until we make a conscious decision to deprecate and 
> then turn off the old APIs.
>  
> *Q2.* What problem is this proposal NOT designed to solve?
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation, and possibly default 
> implementations for partitioning of columnar shuffle. 
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
> # Input formats can optionally return a ColumnarBatch instead of rows.  The 
> code generation phase knows how to take that columnar data and iterate 
> through it as rows for stages that want rows, which currently is almost 
> everything.  The limitations here are mostly implementation specific. The 
> current standard is to abuse Scala’s type erasure to return ColumnarBatches 
> as the elements of an RDD[InternalRow]. The code generation can handle this 
> because it is generating java code, so it bypasses scala’s type checking and 
> just casts the InternalRow to the desired ColumnarBatch.  This makes it 
> difficult for others to implement the same functionality for different 
> processing because they can only do it through code generation. There really 
> is no clean separate path in the code generation for columnar vs row based. 
> Additionally, because it is only supported through code generation, if for any 
> reason code generation fails there is no backup.  This is typically fine 
> for input formats but can be problematic 

[jira] [Resolved] (SPARK-27438) Increase precision of to_timestamp

2019-04-22 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27438.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24420
[https://github.com/apache/spark/pull/24420]

> Increase precision of to_timestamp
> --
>
> Key: SPARK-27438
> URL: https://issues.apache.org/jira/browse/SPARK-27438
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> The to_timestamp() function currently parses input strings only up to second 
> precision, even if the specified pattern contains a second-fraction sub-pattern. 
> The ticket aims to improve the precision of to_timestamp() up to microsecond 
> precision. 
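
(A quick spark-shell illustration of the pattern in question; the sample value is made up.)

{code:scala}
import org.apache.spark.sql.functions.to_timestamp

val df = Seq("2019-04-22 11:50:01.123456").toDF("s")

// Before this change, the fractional part matched by .SSSSSS was effectively
// truncated to whole seconds; with it, the microseconds survive.
df.select(to_timestamp($"s", "yyyy-MM-dd HH:mm:ss.SSSSSS").as("ts")).show(false)
{code}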



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27438) Increase precision of to_timestamp

2019-04-22 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27438:
---

Assignee: Maxim Gekk

> Increase precision of to_timestamp
> --
>
> Key: SPARK-27438
> URL: https://issues.apache.org/jira/browse/SPARK-27438
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> The to_timestamp() function currently parses input strings only up to second 
> precision, even if the specified pattern contains a second-fraction sub-pattern. 
> The ticket aims to improve the precision of to_timestamp() up to microsecond 
> precision. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames

2019-04-22 Thread Rafik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823045#comment-16823045
 ] 

Rafik edited comment on SPARK-10925 at 4/22/19 11:50 AM:
-

I managed to solve this by renaming the column after the group by to something 
temporary, then renaming it back to the original name, and then joining.

was (Author: rafikamir):
I managed to solve this by renaming the column after group by to something 
temporary, and then renaming it again to the original column
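
A hedged Scala sketch of the rename-then-join workaround described in the comment above; the DataFrame, column names, and sample data are illustrative, not taken from the reporter's job:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().master("local[*]").appName("rename-workaround").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 1), ("alice", 2), ("bob", 3)).toDF("name", "value")

// Aggregate, then expose the grouping column under a temporary name so the
// re-join does not see two attributes that resolve to the same reference.
val counts = df.groupBy("name").agg(count("*").as("cnt"))
  .withColumnRenamed("name", "name_tmp")

// Join on the temporary column, then drop it (or rename it back) afterwards.
val joined = df.join(counts, df("name") === counts("name_tmp"))
  .drop("name_tmp")

joined.show()
{code}

The rename matters because withColumnRenamed introduces a fresh attribute, so the join condition no longer compares a column with itself.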

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at 

[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2019-04-22 Thread Rafik (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823045#comment-16823045
 ] 

Rafik commented on SPARK-10925:
---

I managed to solve this by renaming the column after group by to something 
temporary, and then renaming it again to the original column

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> 

[jira] [Commented] (SPARK-27512) Decimal parsing leads to unexpected type inference

2019-04-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823028#comment-16823028
 ] 

Hyukjin Kwon commented on SPARK-27512:
--

I see a behaviour change. Yes, it looks like how decimals are handled on the 
schema-inference path has changed. But I think the current behaviour makes more 
sense than having two different decimal parsers, one for schema inference and 
one for data parsing.
Still, the workaround is to set {{locale}}.
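
A hedged sketch of the two usual ways to keep this inference change from surprising you; the locale value shown is only an example of how the option is passed, and the explicit schema mirrors the 2.4.1 result from the report:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("csv-decimal-inference").getOrCreate()

// Option 1: pass a locale so the CSV parser's decimal handling matches the data.
val inferred = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .option("locale", "en-US")   // illustrative value; pick the locale that fits your files
  .load("test.bsv")

// Option 2: skip inference entirely and state the schema explicitly.
val schema = StructType(Seq(
  StructField("x", IntegerType, nullable = true),
  StructField("y", StringType, nullable = true)))

val explicit = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .schema(schema)
  .load("test.bsv")

explicit.printSchema()
{code}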

> Decimal parsing leads to unexpected type inference
> --
>
> Key: SPARK-27512
> URL: https://issues.apache.org/jira/browse/SPARK-27512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT from this commit:
> {code:bash}
> commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed
> Author: Dilip Biswal 
> Date:   Mon Apr 15 21:26:45 2019 +0800
> {code}
>Reporter: koert kuipers
>Priority: Minor
>
> {code:bash}
> $ hadoop fs -text test.bsv
> x|y
> 1|1,2
> 2|2,3
> 3|3,4
> {code}
> in spark 2.4.1:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: string (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1|1,2|
> |  2|2,3|
> |  3|3,4|
> +---+---+
> {code}
> in spark 3.0.0-SNAPSHOT:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: decimal(2,0) (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1| 12|
> |  2| 23|
> |  3| 34|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27512) Decimal parsing leads to unexpected type inference

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-27512:
--

> Decimal parsing leads to unexpected type inference
> --
>
> Key: SPARK-27512
> URL: https://issues.apache.org/jira/browse/SPARK-27512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: spark 3.0.0-SNAPSHOT from this commit:
> {code:bash}
> commit 3ab96d7acf870e53c9016b0b63d0b328eec23bed
> Author: Dilip Biswal 
> Date:   Mon Apr 15 21:26:45 2019 +0800
> {code}
>Reporter: koert kuipers
>Priority: Minor
>
> {code:bash}
> $ hadoop fs -text test.bsv
> x|y
> 1|1,2
> 2|2,3
> 3|3,4
> {code}
> in spark 2.4.1:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: string (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1|1,2|
> |  2|2,3|
> |  3|3,4|
> +---+---+
> {code}
> in spark 3.0.0-SNAPSHOT:
> {code:bash}
> scala> val data = spark.read.format("csv").option("header", 
> true).option("delimiter", "|").option("inferSchema", true).load("test.bsv")
> scala> data.printSchema
> root
>  |-- x: integer (nullable = true)
>  |-- y: decimal(2,0) (nullable = true)
> scala> data.show
> +---+---+
> |  x|  y|
> +---+---+
> |  1| 12|
> |  2| 23|
> |  3| 34|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2019-04-22 Thread Mahima Khatri (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823026#comment-16823026
 ] 

Mahima Khatri commented on SPARK-27298:
---

Yes, I can test this. I will surely let you know the results.

 

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mahima Khatri
>Priority: Major
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> public class ExcludeInTesting {
> public static void main(String[] args) {
> SparkSession spark = SparkSession.builder()
> .appName("ExcludeInTesting")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
> Dataset dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("delimiter", "|")
> .option("inferSchema", "true")
> //.load("E:/resources/customer.csv"); local //below path for VM
> .load("/home/myproject/bda/home/bin/customer.csv");
> dataReadFromCSV.printSchema();
> dataReadFromCSV.show();
> //Adding an extra step of saving to db and then loading it again
> dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
> Dataset dataLoaded = spark.sql("select * from customer");
> //Gender EQ M
> Column genderCol = dataLoaded.col("Gender");
> Dataset onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
> //Dataset onlyMaleDS = spark.sql("select count(*) from customer where 
> Gender='M'");
> onlyMaleDS.show();
> System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
> System.out.println("*");
> // Income in the list
> Object[] valuesArray = new Object[5];
> valuesArray[0]=503.65;
> valuesArray[1]=495.54;
> valuesArray[2]=486.82;
> valuesArray[3]=481.28;
> valuesArray[4]=479.79;
> Column incomeCol = dataLoaded.col("Income");
> Dataset incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) 
> valuesArray));
> System.out.println("The count of customers satisfaying Income is :"+ 
> incomeMatchingSet.count());
> System.out.println("*");
> Dataset maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
> System.out.println("The count of final customers is :"+ 
> maleExcptIncomeMatch.count());
> System.out.println("*");
> }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives the different 
> results below:
> *Windows*: The code gives the correct dataset count of 148237.
> *Linux*: The code gives a different {color:#172b4d}dataset count of 
> 129532.{color}
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used(attached)
> 3. Windows spec 
>           Windows 10- 64 bit OS 
> 4. Linux spec (Running on Oracle VM virtual box)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26703) Hive record writer will always depends on parquet-1.6 writer should fix it

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26703.
--
Resolution: Duplicate

> Hive record writer will always depends on parquet-1.6 writer should fix it 
> ---
>
> Key: SPARK-26703
> URL: https://issues.apache.org/jira/browse/SPARK-26703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: zhoukang
>Priority: Major
>
> Currently, when we use an insert-into-Hive-table related command, the parquet 
> file generated will always be version 1.6. The reason is below:
> 1. we rely on hive-exec HiveFileFormatUtils to get recordWriter
> {code:java}
> private val hiveWriter = HiveFileFormatUtils.getHiveRecordWriter(
> jobConf,
> tableDesc,
> serializer.getSerializedClass,
> fileSinkConf,
> new Path(path),
> Reporter.NULL)
> {code}
> 2. we will call 
> {code:java}
> public static RecordWriter getHiveRecordWriter(JobConf jc,
>   TableDesc tableInfo, Class outputClass,
>   FileSinkDesc conf, Path outPath, Reporter reporter) throws 
> HiveException {
> HiveOutputFormat hiveOutputFormat = getHiveOutputFormat(jc, 
> tableInfo);
> try {
>   boolean isCompressed = conf.getCompressed();
>   JobConf jc_output = jc;
>   if (isCompressed) {
> jc_output = new JobConf(jc);
> String codecStr = conf.getCompressCodec();
> if (codecStr != null && !codecStr.trim().equals("")) {
>   Class codec = 
>   (Class) 
> JavaUtils.loadClass(codecStr);
>   FileOutputFormat.setOutputCompressorClass(jc_output, codec);
> }
> String type = conf.getCompressType();
> if (type != null && !type.trim().equals("")) {
>   CompressionType style = CompressionType.valueOf(type);
>   SequenceFileOutputFormat.setOutputCompressionType(jc, style);
> }
>   }
>   return getRecordWriter(jc_output, hiveOutputFormat, outputClass,
>   isCompressed, tableInfo.getProperties(), outPath, reporter);
> } catch (Exception e) {
>   throw new HiveException(e);
> }
>   }
>   public static RecordWriter getRecordWriter(JobConf jc,
>   OutputFormat outputFormat,
>   Class valueClass, boolean isCompressed,
>   Properties tableProp, Path outPath, Reporter reporter
>   ) throws IOException, HiveException {
> if (!(outputFormat instanceof HiveOutputFormat)) {
>   outputFormat = new HivePassThroughOutputFormat(outputFormat);
> }
> return ((HiveOutputFormat)outputFormat).getHiveRecordWriter(
> jc, outPath, valueClass, isCompressed, tableProp, reporter);
>   }
> {code}
> 3. then in MapredParquetOutPutFormat
> {code:java}
> public org.apache.hadoop.hive.ql.exec.FileSinkOperator.RecordWriter 
> getHiveRecordWriter(
>   final JobConf jobConf,
>   final Path finalOutPath,
>   final Class valueClass,
>   final boolean isCompressed,
>   final Properties tableProperties,
>   final Progressable progress) throws IOException {
> LOG.info("creating new record writer..." + this);
> final String columnNameProperty = 
> tableProperties.getProperty(IOConstants.COLUMNS);
> final String columnTypeProperty = 
> tableProperties.getProperty(IOConstants.COLUMNS_TYPES);
> List columnNames;
> List columnTypes;
> if (columnNameProperty.length() == 0) {
>   columnNames = new ArrayList();
> } else {
>   columnNames = Arrays.asList(columnNameProperty.split(","));
> }
> if (columnTypeProperty.length() == 0) {
>   columnTypes = new ArrayList();
> } else {
>   columnTypes = 
> TypeInfoUtils.getTypeInfosFromTypeString(columnTypeProperty);
> }
> 
> DataWritableWriteSupport.setSchema(HiveSchemaConverter.convert(columnNames, 
> columnTypes), jobConf);
> return getParquerRecordWriterWrapper(realOutputFormat, jobConf, 
> finalOutPath.toString(),
> progress,tableProperties);
>   }
> {code}
> 4. then call 
> {code:java}
> public ParquetRecordWriterWrapper(
>   final OutputFormat realOutputFormat,
>   final JobConf jobConf,
>   final String name,
>   final Progressable progress, Properties tableProperties) throws
>   IOException {
> try {
>   // create a TaskInputOutputContext
>   TaskAttemptID taskAttemptID = 
> TaskAttemptID.forName(jobConf.get("mapred.task.id"));
>   if (taskAttemptID == null) {
> taskAttemptID = new TaskAttemptID();
>   }
>   taskContext = ContextUtil.newTaskAttemptContext(jobConf, taskAttemptID);
>   LOG.info("initialize serde with table properties.");
>   initializeSerProperties(taskContext, tableProperties);
>   LOG.info("creating real writer to write at " + 

[jira] [Updated] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27337:
-
Priority: Major  (was: Critical)

> QueryExecutionListener never cleans up listeners from the bus after 
> SparkSession is cleared
> ---
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Vinoo Ganesh
>Priority: Major
> Attachments: image001-1.png
>
>
> As a result of 
> [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3],
>  it looks like there is a memory leak (specifically 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131).]
>  
> Because the listener bus on the context still holds a reference to the listener 
> (even after the SparkSession is cleared), it is never cleaned up. This means 
> that if you close and recreate Spark sessions fairly frequently, you leak a 
> listener every single time. 
>  
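
Until the leak itself is fixed, a hedged Scala sketch of a defensive pattern: explicitly unregister any QueryExecutionListener you registered before discarding the session, so the context-level bus does not keep it alive. This assumes the 2.x/3.0 listener signatures and is a workaround sketch, not the eventual fix:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

val spark = SparkSession.builder().master("local[*]").appName("listener-cleanup").getOrCreate()

// A trivial listener that just logs the action name.
val listener = new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    println(s"query succeeded: $funcName")
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
    println(s"query failed: $funcName")
}

spark.listenerManager.register(listener)
try {
  spark.range(10).count()  // runs an action that notifies the listener
} finally {
  // Unregister before dropping the session so the shared listener bus
  // no longer references the listener instance.
  spark.listenerManager.unregister(listener)
  spark.stop()
}
{code}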



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27433) Spark Structured Streaming left outer joins returns outer nulls for already matched rows

2019-04-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823005#comment-16823005
 ] 

Hyukjin Kwon commented on SPARK-27433:
--

See SPARK-26154.

> Spark Structured Streaming left outer joins returns outer nulls for already 
> matched rows
> 
>
> Key: SPARK-27433
> URL: https://issues.apache.org/jira/browse/SPARK-27433
> Project: Spark
>  Issue Type: Question
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Binit
>Priority: Blocker
>
> I'm basically using the example given in Spark's documentation here: 
> [https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#outer-joins-with-watermarking]
>  with the built-in test stream, in which one stream is ahead by 3 seconds (I was 
> originally using Kafka but ran into the same issue). The results returned the 
> matched columns correctly; however, after a while the same key is returned with 
> an outer null.
> Is this the expected behavior? Is there a way to exclude the duplicate outer 
> null results when there was a match?
> Code:
> {code:scala}
> val testStream = session.readStream.format("rate")
>   .option("rowsPerSecond", "5").option("numPartitions", "1").load()
> val impressions = testStream
>   .select((col("value") + 15).as("impressionAdId"), col("timestamp").as("impressionTime"))
> val clicks = testStream
>   .select(col("value").as("clickAdId"), col("timestamp").as("clickTime"))
> // Apply watermarks on event-time columns
> val impressionsWithWatermark = impressions.withWatermark("impressionTime", "20 seconds")
> val clicksWithWatermark = clicks.withWatermark("clickTime", "30 seconds")
> // Join with event-time constraints
> val result = impressionsWithWatermark.join(
>   clicksWithWatermark,
>   expr("""
>     clickAdId = impressionAdId AND
>     clickTime >= impressionTime AND
>     clickTime <= impressionTime + interval 10 seconds
>     """),
>   joinType = "leftOuter"  // can be "inner", "leftOuter", "rightOuter"
> )
> val query = result.writeStream.outputMode("update").format("console").option("truncate", false).start()
> query.awaitTermination()
> {code}
> Result:
> {code}
> -------------------------------------------
> Batch: 19
> -------------------------------------------
> +--------------+-----------------------+---------+-----------------------+
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> +--------------+-----------------------+---------+-----------------------+
> |100           |2018-05-23 22:18:38.362|100      |2018-05-23 22:18:41.362|
> |101           |2018-05-23 22:18:38.562|101      |2018-05-23 22:18:41.562|
> |102           |2018-05-23 22:18:38.762|102      |2018-05-23 22:18:41.762|
> |103           |2018-05-23 22:18:38.962|103      |2018-05-23 22:18:41.962|
> |104           |2018-05-23 22:18:39.162|104      |2018-05-23 22:18:42.162|
> +--------------+-----------------------+---------+-----------------------+
>
> -------------------------------------------
> Batch: 57
> -------------------------------------------
> +--------------+-----------------------+---------+-----------------------+
> |impressionAdId|impressionTime         |clickAdId|clickTime              |
> +--------------+-----------------------+---------+-----------------------+
> |290           |2018-05-23 22:19:16.362|290      |2018-05-23 22:19:19.362|
> |291           |2018-05-23 22:19:16.562|291      |2018-05-23 22:19:19.562|
> |292           |2018-05-23 22:19:16.762|292      |2018-05-23 22:19:19.762|
> |293           |2018-05-23 22:19:16.962|293      |2018-05-23 22:19:19.962|
> |294           |2018-05-23 22:19:17.162|294      |2018-05-23 22:19:20.162|
> |100           |2018-05-23 22:18:38.362|null     |null                   |
> |99            |2018-05-23 22:18:38.162|null     |null                   |
> |103           |2018-05-23 22:18:38.962|null     |null                   |
> |101           |2018-05-23 22:18:38.562|null     |null                   |
> |102           |2018-05-23 22:18:38.762|null     |null                   |
> +--------------+-----------------------+---------+-----------------------+
> {code}
> This question was also asked on Stack Overflow; please find the link below:
> [https://stackoverflow.com/questions/50500111/spark-structured-streaming-left-outer-joins-returns-outer-nulls-for-already-matc/55616902#55616902]
> 101 and 103 have already been matched in the join, but they still come back in 
> the left outer join output with nulls.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27337) QueryExecutionListener never cleans up listeners from the bus after SparkSession is cleared

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27337:
-
Component/s: (was: Spark Core)
 SQL

> QueryExecutionListener never cleans up listeners from the bus after 
> SparkSession is cleared
> ---
>
> Key: SPARK-27337
> URL: https://issues.apache.org/jira/browse/SPARK-27337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Vinoo Ganesh
>Priority: Critical
> Attachments: image001-1.png
>
>
> As a result of 
> [https://github.com/apache/spark/commit/9690eba16efe6d25261934d8b73a221972b684f3],
>  it looks like there is a memory leak (specifically 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/util/QueryExecutionListener.scala#L131).]
>  
> Because the listener bus on the context still holds a reference to the listener 
> (even after the SparkSession is cleared), it is never cleaned up. This means 
> that if you close and recreate Spark sessions fairly frequently, you leak a 
> listener every single time. 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2019-04-22 Thread zhoukang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822997#comment-16822997
 ] 

zhoukang commented on SPARK-25299:
--

Is there any progress on this task? [~yifeih] [~mcheah]

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2019-04-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822990#comment-16822990
 ] 

Hyukjin Kwon commented on SPARK-27298:
--

Will you be able to test it against Spark 2.4.1 too?

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Mahima Khatri
>Priority: Major
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> public class ExcludeInTesting {
> public static void main(String[] args) {
> SparkSession spark = SparkSession.builder()
> .appName("ExcludeInTesting")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
> Dataset dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("delimiter", "|")
> .option("inferSchema", "true")
> //.load("E:/resources/customer.csv"); local //below path for VM
> .load("/home/myproject/bda/home/bin/customer.csv");
> dataReadFromCSV.printSchema();
> dataReadFromCSV.show();
> //Adding an extra step of saving to db and then loading it again
> dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
> Dataset dataLoaded = spark.sql("select * from customer");
> //Gender EQ M
> Column genderCol = dataLoaded.col("Gender");
> Dataset onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
> //Dataset onlyMaleDS = spark.sql("select count(*) from customer where 
> Gender='M'");
> onlyMaleDS.show();
> System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
> System.out.println("*");
> // Income in the list
> Object[] valuesArray = new Object[5];
> valuesArray[0]=503.65;
> valuesArray[1]=495.54;
> valuesArray[2]=486.82;
> valuesArray[3]=481.28;
> valuesArray[4]=479.79;
> Column incomeCol = dataLoaded.col("Income");
> Dataset incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) 
> valuesArray));
> System.out.println("The count of customers satisfaying Income is :"+ 
> incomeMatchingSet.count());
> System.out.println("*");
> Dataset maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
> System.out.println("The count of final customers is :"+ 
> maleExcptIncomeMatch.count());
> System.out.println("*");
> }
> }
> {code}
>  When the above code is executed on Spark 2.3.0, it gives the different 
> results below:
> *Windows*: The code gives the correct dataset count of 148237.
> *Linux*: The code gives a different {color:#172b4d}dataset count of 
> 129532.{color}
>  
> {color:#172b4d}Some more info related to this bug:{color}
> {color:#172b4d}1. Application Code (attached)
> 2. CSV file used(attached)
> 3. Windows spec 
>           Windows 10- 64 bit OS 
> 4. Linux spec (Running on Oracle VM virtual box)
>       Specifications: \{as captured from Vbox.log}
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 
> 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface 
> = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached){color}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27540) Add 'meanAveragePrecision_at_k' metric to RankingMetrics

2019-04-22 Thread Pham Nguyen Tuan Anh (JIRA)
Pham Nguyen Tuan Anh created SPARK-27540:


 Summary: Add 'meanAveragePrecision_at_k' metric to RankingMetrics
 Key: SPARK-27540
 URL: https://issues.apache.org/jira/browse/SPARK-27540
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.4.1
Reporter: Pham Nguyen Tuan Anh


Sometimes we only focus on the MAP of the top-k results.

This ticket adds MAP@k to RankingMetrics, in addition to the existing MAP.
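
For context, a minimal Scala sketch of how RankingMetrics is used today; meanAveragePrecision and precisionAt already exist, while the truncated metric is what this ticket proposes, so the method name in the last comment below is only a suggested shape, not an existing API:

{code:scala}
import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ranking-metrics").getOrCreate()
val sc = spark.sparkContext

// (predicted ranking, ground-truth relevant items) per user
val predictionsAndLabels = sc.parallelize(Seq(
  (Array(1, 2, 3, 4, 5), Array(1, 3, 6)),
  (Array(4, 1, 2, 6, 7), Array(2, 7, 9))))

val metrics = new RankingMetrics(predictionsAndLabels)

println(s"MAP         = ${metrics.meanAveragePrecision}")
println(s"Precision@3 = ${metrics.precisionAt(3)}")
// With SPARK-27540 one could additionally ask for MAP truncated at k,
// e.g. something like metrics.meanAveragePrecisionAt(3).
{code}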



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-19860.
--
Resolution: Incomplete

I am leaving this resolved due to the lack of information.

> DataFrame join get conflict error if two frames has a same name column.
> ---
>
> Key: SPARK-19860
> URL: https://issues.apache.org/jira/browse/SPARK-19860
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: wuchang
>Priority: Major
>
> {code}
> >>> print df1.collect()
> [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', 
> in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), 
> Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', 
> in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), 
> Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', 
> in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), 
> Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', 
> in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), 
> Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', 
> in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), 
> Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', 
> in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), 
> Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', 
> in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), 
> Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', 
> in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), 
> Row(fdate=u'20170301', in_amount1=10159653)]
> >>> print df2.collect()
> [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', 
> in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), 
> Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', 
> in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), 
> Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', 
> in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), 
> Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', 
> in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), 
> Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', 
> in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), 
> Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', 
> in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), 
> Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', 
> in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), 
> Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', 
> in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), 
> Row(fdate=u'20170301', in_amount2=9475418)]
> >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
> 2017-03-08 10:27:34,357 WARN  [Thread-2] sql.Column: Constructing trivially 
> true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
> jdf = self._jdf.join(other._jdf, on._jc, how)
>   File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"
> Failure when resolving conflicting references in Join:
> 'Join Inner, (fdate#42 = fdate#42)
> :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) 
> as int) AS in_amount1#97]
> :  +- Filter (inorout#44 = A)
> : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, 
> fdate#42]
> :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && 
> (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
> :   +- SubqueryAlias history_transfer_v
> :  +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
> fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
> bankwaterid#48, waterid#49, waterstate#50, source#51]
> : +- SubqueryAlias history_transfer
> :+- 
> Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
>  parquet
> +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) 
> as int) AS in_amount2#145]
>+- Filter (inorout#44 = B)
>   +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, 
> fdate#42]
>  +- Filter (((partnerid#45 = pmec) && 

[jira] [Commented] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.

2019-04-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822984#comment-16822984
 ] 

Hyukjin Kwon commented on SPARK-19860:
--

Does the size of the data matter for reproducing this issue, or is the query 
expected to be complicated? From what you have described, it doesn't look too 
difficult to post a reproducer.
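
While waiting for a reproducer: a hedged Scala sketch of the usual alias-based way to avoid this class of conflicting-reference errors when both sides of a join derive from the same DataFrame (the same idea applies in PySpark via DataFrame.alias); the sample data and column names are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("alias-join").getOrCreate()
import spark.implicits._

val base = Seq(("20170201", 100), ("20170201", 200), ("20170202", 300))
  .toDF("fdate", "inoutmoney")

// Two aggregates built from the same base share attribute references,
// so joining them on the "same" column can confuse the analyzer.
val in1 = base.groupBy("fdate").agg(sum("inoutmoney").as("in_amount1")).alias("a")
val in2 = base.groupBy("fdate").agg(sum("inoutmoney").as("in_amount2")).alias("b")

// Qualify the join keys through the aliases instead of reusing the original
// DataFrame objects' column references.
val joined = in1.join(in2, $"a.fdate" === $"b.fdate", "inner")
joined.show()
{code}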

> DataFrame join get conflict error if two frames has a same name column.
> ---
>
> Key: SPARK-19860
> URL: https://issues.apache.org/jira/browse/SPARK-19860
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: wuchang
>Priority: Major
>
> {code}
> >>> print df1.collect()
> [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', 
> in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), 
> Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', 
> in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), 
> Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', 
> in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), 
> Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', 
> in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), 
> Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', 
> in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), 
> Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', 
> in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), 
> Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', 
> in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), 
> Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', 
> in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), 
> Row(fdate=u'20170301', in_amount1=10159653)]
> >>> print df2.collect()
> [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', 
> in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), 
> Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', 
> in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), 
> Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', 
> in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), 
> Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', 
> in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), 
> Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', 
> in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), 
> Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', 
> in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), 
> Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', 
> in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), 
> Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', 
> in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), 
> Row(fdate=u'20170301', in_amount2=9475418)]
> >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner')
> 2017-03-08 10:27:34,357 WARN  [Thread-2] sql.Column: Constructing trivially 
> true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases.
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join
> jdf = self._jdf.join(other._jdf, on._jc, how)
>   File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u"
> Failure when resolving conflicting references in Join:
> 'Join Inner, (fdate#42 = fdate#42)
> :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) 
> as int) AS in_amount1#97]
> :  +- Filter (inorout#44 = A)
> : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, 
> fdate#42]
> :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && 
> (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201)))
> :   +- SubqueryAlias history_transfer_v
> :  +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, 
> fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, 
> bankwaterid#48, waterid#49, waterstate#50, source#51]
> : +- SubqueryAlias history_transfer
> :+- 
> Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51]
>  parquet
> +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) 
> as int) AS in_amount2#145]
>+- Filter (inorout#44 = B)

[jira] [Commented] (SPARK-27335) cannot collect() from Correlation.corr

2019-04-22 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822982#comment-16822982
 ] 

Hyukjin Kwon commented on SPARK-27335:
--

Can you post the steps as a code block? Otherwise, it looks like no one can 
reproduce this.

> cannot collect() from Correlation.corr
> --
>
> Key: SPARK-27335
> URL: https://issues.apache.org/jira/browse/SPARK-27335
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Natalino Busa
>Priority: Major
>
> reproducing the bug from the example in the documentation:
>  
>  
> {code:java}
> import pyspark
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.stat import Correlation
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> dataset = [[Vectors.dense([1, 0, 0, -2])],
>  [Vectors.dense([4, 5, 0, 3])],
>  [Vectors.dense([6, 7, 0, 8])],
>  [Vectors.dense([9, 0, 0, 1])]]
> dataset = spark.createDataFrame(dataset, ['features'])
> df = Correlation.corr(dataset, 'features', 'pearson')
> df.collect()
>  
> {code}
> This produces the following stack trace:
>  
> {code:java}
> ---
> AttributeErrorTraceback (most recent call last)
>  in ()
>  11 dataset = spark.createDataFrame(dataset, ['features'])
>  12 df = Correlation.corr(dataset, 'features', 'pearson')
> ---> 13 df.collect()
> /opt/spark/python/pyspark/sql/dataframe.py in collect(self)
> 530 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
> 531 """
> --> 532 with SCCallSiteSync(self._sc) as css:
> 533 sock_info = self._jdf.collectToPython()
> 534 return list(_load_from_socket(sock_info, 
> BatchedSerializer(PickleSerializer(
> /opt/spark/python/pyspark/traceback_utils.py in __enter__(self)
>  70 def __enter__(self):
>  71 if SCCallSiteSync._spark_stack_depth == 0:
> ---> 72 self._context._jsc.setCallSite(self._call_site)
>  73 SCCallSiteSync._spark_stack_depth += 1
>  74 
> AttributeError: 'NoneType' object has no attribute 'setCallSite'{code}
>  
>  
> Analysis:
> Somehow the DataFrame properties `df.sql_ctx.sparkSession._jsparkSession` 
> and `spark._jsparkSession` do not match the ones available in the Spark 
> session.
> The following code fixes the problem (I hope this helps you narrow down 
> the root cause):
>  
> {code:java}
> df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
> df._sc = spark._sc
> df.collect()
> >>> [Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, 
> >>> 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], 
> >>> False))]{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27539) Inaccurate aggregate outputRows estimation with column contains null value

2019-04-22 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27539:

Summary: Inaccurate aggregate outputRows estimation with column contains 
null value  (was: Inaccurate aggregate outputRows estimation with null value 
column)

> Inaccurate aggregate outputRows estimation with column contains null value
> --
>
> Key: SPARK-27539
> URL: https://issues.apache.org/jira/browse/SPARK-27539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: peng bo
>Priority: Major
>
> This issue is follow up of [https://github.com/apache/spark/pull/24286]. As 
> [~smilegator] pointed out that column with null value is inaccurate as well.
> {code:java}
> > select key from test;
> 2
> NULL
> 1
> spark-sql> desc extended test key;
> col_name key
> data_type int
> comment NULL
> min 1
> max 2
> num_nulls 1
> distinct_count 2{code}
> The distinct count should be distinct_count + 1 when the column contains null 
> value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27539) Inaccurate aggregate outputRows estimation with null value column

2019-04-22 Thread peng bo (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peng bo updated SPARK-27539:

Description: 
This issue is follow up of [https://github.com/apache/spark/pull/24286]. As 
[~smilegator] pointed out that column with null value is inaccurate as well.
{code:java}
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2{code}
The distinct count should be distinct_count + 1 when the column contains null 
value.

  was:
This issue is follow up of [https://github.com/apache/spark/pull/24286]. As 
[~smilegator] pointed out that column with null value is inaccurate as well.
{code:java}
> select * from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2{code}
The distinct count should be distinct_count + 1 when the column contains null 
value.


> Inaccurate aggregate outputRows estimation with null value column
> -
>
> Key: SPARK-27539
> URL: https://issues.apache.org/jira/browse/SPARK-27539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: peng bo
>Priority: Major
>
> This issue is follow up of [https://github.com/apache/spark/pull/24286]. As 
> [~smilegator] pointed out that column with null value is inaccurate as well.
> {code:java}
> > select key from test;
> 2
> NULL
> 1
> spark-sql> desc extended test key;
> col_name key
> data_type int
> comment NULL
> min 1
> max 2
> num_nulls 1
> distinct_count 2{code}
> The distinct count should be distinct_count + 1 when the column contains null 
> value.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27539) Inaccurate aggregate outputRows estimation with null value column

2019-04-22 Thread peng bo (JIRA)
peng bo created SPARK-27539:
---

 Summary: Inaccurate aggregate outputRows estimation with null 
value column
 Key: SPARK-27539
 URL: https://issues.apache.org/jira/browse/SPARK-27539
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: peng bo


This issue is follow up of [https://github.com/apache/spark/pull/24286]. As 
[~smilegator] pointed out that column with null value is inaccurate as well.
{code:java}
> select * from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2{code}
The distinct count should be distinct_count + 1 when the column contains null 
value.
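
A hedged sketch of the adjustment this ticket asks for, written as standalone Scala rather than the actual Catalyst estimation code; the ColStats case class and field names are illustrative stand-ins for the real column statistics:

{code:scala}
// Simplified stand-in for a column's statistics (not the actual ColumnStat API).
final case class ColStats(distinctCount: BigInt, nullCount: BigInt)

// Effective number of distinct values for aggregate output-row estimation:
// a null "group" contributes one extra output row when the column has nulls.
def effectiveNdv(stats: ColStats): BigInt =
  if (stats.nullCount > 0) stats.distinctCount + 1 else stats.distinctCount

// Example matching the description above: key has distinct_count = 2 and
// num_nulls = 1, so GROUP BY key should be estimated at 3 rows, not 2.
println(effectiveNdv(ColStats(distinctCount = 2, nullCount = 1)))  // prints 3
{code}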



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27522) Test migration from INT96 to TIMESTAMP_MICROS in parquet

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27522:


Assignee: Maxim Gekk

> Test migration from INT96 to TIMESTAMP_MICROS in parquet
> 
>
> Key: SPARK-27522
> URL: https://issues.apache.org/jira/browse/SPARK-27522
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Write tests to check:
> * Append timestamps of TIMESTAMP_MICROS to existing parquets with INT96 for 
> timestamps
> * Append timestamps of TIMESTAMP_MICROS to a table with INT96 for timestamps
> * Append INT96 timestamps to parquet files with TIMESTAMP_MICROS timestamps
> * Append INT96 timestamps to a table with TIMESTAMP_MICROS timestamps



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27522) Test migration from INT96 to TIMESTAMP_MICROS in parquet

2019-04-22 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27522.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24417
[https://github.com/apache/spark/pull/24417]

> Test migration from INT96 to TIMESTAMP_MICROS in parquet
> 
>
> Key: SPARK-27522
> URL: https://issues.apache.org/jira/browse/SPARK-27522
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Write tests to check:
> * Append timestamps of TIMESTAMP_MICROS to existing parquets with INT96 for 
> timestamps
> * Append timestamps of TIMESTAMP_MICROS to a table with INT96 for timestamps
> * Append INT96 timestamps to parquet files with TIMESTAMP_MICROS timestamps
> * Append INT96 timestamps to a table with TIMESTAMP_MICROS timestamps



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13263) SQL generation support for tablesample

2019-04-22 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822891#comment-16822891
 ] 

angerszhu commented on SPARK-13263:
---

[~Tagar] 

I made some changes in Spark SQL's ASTBuild that can support this.

> SQL generation support for tablesample
> --
>
> Key: SPARK-13263
> URL: https://issues.apache.org/jira/browse/SPARK-13263
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.0.0
>
>
> {code}
> SELECT s.id FROM t0 TABLESAMPLE(0.1 PERCENT) s
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org