[jira] [Created] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2016-11-16 Thread Damian Momot (JIRA)
Damian Momot created SPARK-18484:


 Summary: case class datasets - ability to specify decimal 
precision and scale
 Key: SPARK-18484
 URL: https://issues.apache.org/jira/browse/SPARK-18484
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.1, 2.0.0
Reporter: Damian Momot


Currently, when using a decimal type (BigDecimal in a Scala case class) there is no 
way to enforce precision and scale. This is quite critical when saving data, both for 
space usage and for compatibility with external systems (for example a Hive table), 
because Spark saves the data as Decimal(38,18).

{code:scala}
val spark: SparkSession = ???

case class TestClass(id: String, money: BigDecimal)

val testDs = spark.createDataset(Seq(
  TestClass("1", BigDecimal("22.50")),
  TestClass("2", BigDecimal("500.66"))
))

testDs.printSchema()
{code}

{code}
root
 |-- id: string (nullable = true)
 |-- money: decimal(38,18) (nullable = true)
{code}

A workaround is to convert the Dataset to a DataFrame before saving and manually cast 
the column to a specific decimal precision/scale:

{code:scala}
import org.apache.spark.sql.types.DecimalType
val testDf = testDs.toDF()

testDf
  .withColumn("money", testDf("money").cast(DecimalType(10,2)))
  .printSchema()
{code}

{code}
root
 |-- id: string (nullable = true)
 |-- money: decimal(10,2) (nullable = true)
{code}
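
For completeness, a minimal sketch (my own, with a hypothetical table name) of applying that cast right before the write, so the Hive table is created with decimal(10,2) instead of the default decimal(38,18):

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Cast just before writing so the persisted schema carries the intended precision and scale.
testDs.toDF()
  .withColumn("money", col("money").cast(DecimalType(10, 2)))
  .write
  .saveAsTable("test_money")  // hypothetical table name
{code}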



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2016-11-16 Thread Damian Momot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Momot updated SPARK-18484:
-
Description: 
Currently, when using a decimal type (BigDecimal in a Scala case class) there is no 
way to enforce precision and scale. This is quite critical when saving data, both for 
space usage and for compatibility with external systems (for example a Hive table), 
because Spark saves the data as Decimal(38,18).

{code}
case class TestClass(id: String, money: BigDecimal)

val testDs = spark.createDataset(Seq(
  TestClass("1", BigDecimal("22.50")),
  TestClass("2", BigDecimal("500.66"))
))

testDs.printSchema()
{code}

{code}
root
 |-- id: string (nullable = true)
 |-- money: decimal(38,18) (nullable = true)
{code}

A workaround is to convert the Dataset to a DataFrame before saving and manually cast 
the column to a specific decimal precision/scale:

{code}
import org.apache.spark.sql.types.DecimalType
val testDf = testDs.toDF()

testDf
  .withColumn("money", testDf("money").cast(DecimalType(10,2)))
  .printSchema()
{code}

{code}
root
 |-- id: string (nullable = true)
 |-- money: decimal(10,2) (nullable = true)
{code}

  was:
Currently, when using a decimal type (BigDecimal in a Scala case class) there is no 
way to enforce precision and scale. This is quite critical when saving data, both for 
space usage and for compatibility with external systems (for example a Hive table), 
because Spark saves the data as Decimal(38,18).

{code:scala}
val spark: SparkSession = ???

case class TestClass(id: String, money: BigDecimal)

val testDs = spark.createDataset(Seq(
  TestClass("1", BigDecimal("22.50")),
  TestClass("2", BigDecimal("500.66"))
))

testDs.printSchema()
{code}

{code}
root
 |-- id: string (nullable = true)
 |-- money: decimal(38,18) (nullable = true)
{code}

A workaround is to convert the Dataset to a DataFrame before saving and manually cast 
the column to a specific decimal precision/scale:

{code:scala}
import org.apache.spark.sql.types.DecimalType
val testDf = testDs.toDF()

testDf
  .withColumn("money", testDf("money").cast(DecimalType(10,2)))
  .printSchema()
{code}

{code}
root
 |-- id: string (nullable = true)
 |-- money: decimal(10,2) (nullable = true)
{code}


> case class datasets - ability to specify decimal precision and scale
> 
>
> Key: SPARK-18484
> URL: https://issues.apache.org/jira/browse/SPARK-18484
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Damian Momot
>
> Currently, when using a decimal type (BigDecimal in a Scala case class) there is no 
> way to enforce precision and scale. This is quite critical when saving data, both for 
> space usage and for compatibility with external systems (for example a Hive table), 
> because Spark saves the data as Decimal(38,18).
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> A workaround is to convert the Dataset to a DataFrame before saving and manually cast 
> the column to a specific decimal precision/scale:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs

2016-11-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673032#comment-15673032
 ] 

Takeshi Yamamuro commented on SPARK-18478:
--

ya, I'll make a pr later, thanks!

> Support codegen for Hive UDFs
> -
>
> Key: SPARK-18478
> URL: https://issues.apache.org/jira/browse/SPARK-18478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently does not codegen Hive UDFs in hiveUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs

2016-11-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673027#comment-15673027
 ] 

Reynold Xin commented on SPARK-18478:
-

Yeah, it seems like it's worth doing.


> Support codegen for Hive UDFs
> -
>
> Key: SPARK-18478
> URL: https://issues.apache.org/jira/browse/SPARK-18478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently does not codegen Hive UDFs in hiveUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs

2016-11-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15673017#comment-15673017
 ] 

Takeshi Yamamuro commented on SPARK-18478:
--

[~rxin] It seems we have some performance gains:
https://github.com/apache/spark/compare/master...maropu:SupportHiveUdfCodegen#diff-f781931d35e340e5550357fbb28d14e8R37
{code}
/*
 Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13 on Mac OS X 10.10.2
 Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz

 Call Hive UDF:                      Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------
 Call Hive UDF wholestage off                  3 /    4      40941.7          0.0       1.0X
 Call Hive UDF wholestage on                   1 /    2      96620.3          0.0       2.4X
 */
{code}
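
As a rough sketch of the code path being benchmarked (my own example, assuming a Hive-enabled SparkSession; the function name {{hive_upper}} is arbitrary), a Hive UDF is registered and called from Spark SQL like this:

{code:scala}
// Register a built-in Hive generic UDF and call it from Spark SQL. Without codegen
// support this expression is evaluated through the interpreted HiveGenericUDF path.
spark.sql("CREATE TEMPORARY FUNCTION hive_upper AS " +
  "'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'")
spark.range(1000000)
  .selectExpr("hive_upper(CAST(id AS string)) AS upper_id")
  .count()
{code}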


> Support codegen for Hive UDFs
> -
>
> Key: SPARK-18478
> URL: https://issues.apache.org/jira/browse/SPARK-18478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently does not codegen Hive UDFs in hiveUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14974) spark sql job creates too many files in HDFS when doing insert overwrite hive table

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672948#comment-15672948
 ] 

Apache Spark commented on SPARK-14974:
--

User 'baishuo' has created a pull request for this issue:
https://github.com/apache/spark/pull/15914

> spark sql job creates too many files in HDFS when doing insert overwrite hive 
> table
> --
>
> Key: SPARK-14974
> URL: https://issues.apache.org/jira/browse/SPARK-14974
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: zenglinxi
>Priority: Minor
>
> Recently, we often encounter problems using Spark SQL to insert data into a 
> partitioned table (e.g.: insert overwrite table $output_table partition(dt) 
> select xxx from tmp_table).
> After the Spark job starts running on YARN, the app can create too many files 
> (e.g. 2,000,000 or even 10,000,000), which puts HDFS under enormous pressure.
> We found that the number of files created by the Spark job depends on the number 
> of partitions of the Hive table being inserted into and the number of Spark SQL 
> partitions:
> files_num = hive_table_partitions_num * spark_sql_partitions_num.
> We often set spark_sql_partitions_num (spark.sql.shuffle.partitions) >= 1000, and 
> hive_table_partitions_num is very small under normal circumstances, but it can 
> turn out to be more than 2000 when we accidentally use a wrong field as the 
> partition field, which makes files_num >= 1000 * 2000 = 2,000,000.
> There is a configuration parameter in Hive, hive.exec.max.dynamic.partitions.pernode, 
> that can limit the maximum number of dynamic partitions allowed to be created in 
> each mapper/reducer, but this parameter doesn't work when we use hiveContext.
> Reducing spark_sql_partitions_num (spark.sql.shuffle.partitions) can make files_num 
> smaller, but it will reduce concurrency.
> Can we add configuration parameters to limit the maximum number of files allowed 
> to be created by each task, or limit spark_sql_partitions_num without affecting 
> concurrency?
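
One mitigation, sketched here under the assumption of a Hive-enabled session (table and column names taken from the example above), is to cluster the output by the dynamic partition column so that each partition value is written by a single task:

{code:scala}
// DISTRIBUTE BY dt routes all rows with the same dt value to one reducer, so the number of
// output files is roughly the number of distinct dt values rather than
// hive_table_partitions_num * spark_sql_partitions_num.
spark.sql(
  """INSERT OVERWRITE TABLE output_table PARTITION (dt)
    |SELECT xxx FROM tmp_table
    |DISTRIBUTE BY dt""".stripMargin)
{code}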



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18074) UDFs don't work on non-local environment

2016-11-16 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672950#comment-15672950
 ] 

roncenzhao commented on SPARK-18074:


I have encountered this problem too. Can anyone suggest a way to solve it?

> UDFs don't work on non-local environment
> 
>
> Key: SPARK-18074
> URL: https://issues.apache.org/jira/browse/SPARK-18074
> Project: Spark
>  Issue Type: Bug
>Reporter: Alberto Andreotti
>
> It seems that UDFs fail to deserialize when they are sent to the remote 
> cluster. This is an app that can help reproduce the problem,
> https://github.com/albertoandreottiATgmail/spark_udf
> and this is the stack trace with the exception,
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
> at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
> at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at 

[jira] [Created] (SPARK-18483) spark on yarn always connects to yarn resourcemanager at 0.0.0.0:8032

2016-11-16 Thread inred (JIRA)
inred created SPARK-18483:
-

 Summary: spark on yarn always connects to yarn resourcemanager at 
0.0.0.0:8032
 Key: SPARK-18483
 URL: https://issues.apache.org/jira/browse/SPARK-18483
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.1
 Environment: java8
SBT0.13
scala2.11.8
spark-2.0.1-bin-hadoop2.6
Reporter: inred


I have installed the YARN resource manager at 192.168.13.159:8032, set the 
YARN_CONF_DIR environment variable, and configured yarn-site.xml as follows, but the 
client always connects to 0.0.0.0:8032 instead of 192.168.13.159:8032.


set environment
E:\app>set yarn
YARN_CONF_DIR=D:\Documents\download\hadoop

E:\app>set had
HADOOP_HOME=D:\Documents\download\hadoop

E:\app>cat D:\Documents\download\hadoop\yarn-site.xml

<property>
  <name>yarn.resourcemanager.address</name>
  <value>192.168.13.159:8032</value>
</property>
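
A possible workaround sketch (my own suggestion, not from the report): pass the resource manager address to the YARN client explicitly through Spark's {{spark.hadoop.*}} passthrough, so it does not fall back to the 0.0.0.0:8032 default when yarn-site.xml is not picked up from the classpath:

{code:scala}
import org.apache.spark.sql.SparkSession

// spark.hadoop.* properties are copied into the Hadoop Configuration used by the YARN client.
val spark = SparkSession.builder()
  .master("yarn")
  .config("spark.hadoop.yarn.resourcemanager.address", "192.168.13.159:8032")
  .getOrCreate()
{code}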


"C:\Program Files\Java\jdk1.8.0_92\bin\java" -Didea.launcher.port=7532 
"-Didea.launcher.bin.path=C:\Program Files (x86)\JetBrains\IntelliJ IDEA 
Community Edition 2016.2.5\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\charsets.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\deploy.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\access-bridge-64.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\cldrdata.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\dnsns.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\jaccess.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\jfxrt.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\localedata.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\nashorn.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\sunec.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\sunjce_provider.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\sunmscapi.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\sunpkcs11.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\ext\zipfs.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\javaws.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\jce.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\jfr.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\jfxswt.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\jsse.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\management-agent.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\plugin.jar;C:\Program 
Files\Java\jdk1.8.0_92\jre\lib\resources.jar;C:\Program 

[jira] [Commented] (SPARK-17662) Dedup UDAF

2016-11-16 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672928#comment-15672928
 ] 

Ohad Raviv commented on SPARK-17662:


you're right,
great solution!
I didn't know about the "max(struct(date, *))" syntax.

thanks!

> Dedup UDAF
> --
>
> Key: SPARK-17662
> URL: https://issues.apache.org/jira/browse/SPARK-17662
> Project: Spark
>  Issue Type: New Feature
>Reporter: Ohad Raviv
>
> We have a common use case of deduping a table by creation order.
> For example, we have an event log of user actions. A user marks his favorite 
> category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id   action_type     value   date
> 123       fav category    1       2016-02-01
> 123       fav category    4       2016-02-02
> 123       fav category    8       2016-02-03
> 123       fav category    2       2016-02-04
> We would like to get only the last update by the date column.
> We could of course do it in SQL:
> select * from (
> select *, row_number() over (partition by user_id,action_type order by date 
> desc) as rnum from tbl)
> where rnum=1;
> But then, I believe, it can't be optimized on the mapper side and we'll get 
> all the data shuffled to the reducers instead of being partially aggregated on 
> the map side.
> We have written a UDAF for this, but then we have other issues, like it 
> blocking predicate push-down for columns.
> Do you have any idea for a proper solution?
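
A sketch of the {{max(struct(date, *))}} approach mentioned in the comment above (column and table names taken from the example; here the struct lists the columns explicitly):

{code:scala}
// max over a struct compares fields left to right, so putting date first keeps the latest
// row per (user_id, action_type), and the aggregation can be partially done on the map side.
spark.sql(
  """SELECT user_id, action_type, latest.date AS date, latest.value AS value
    |FROM (
    |  SELECT user_id, action_type, max(struct(date, value)) AS latest
    |  FROM tbl
    |  GROUP BY user_id, action_type
    |) t""".stripMargin)
{code}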



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18470) Provide Spark Streaming Monitor Rest Api

2016-11-16 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672745#comment-15672745
 ] 

Genmao Yu commented on SPARK-18470:
---

Thanks for your suggestions, I will provide it later.

> Provide Spark Streaming Monitor Rest Api
> 
>
> Key: SPARK-18470
> URL: https://issues.apache.org/jira/browse/SPARK-18470
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 1.6.3, 2.0.2
>Reporter: Genmao Yu
>
> Provide a Spark Streaming monitoring API:
> 1. list receiver information
> 2. list stream batch information



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672708#comment-15672708
 ] 

Apache Spark commented on SPARK-18481:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/15913

> ML 2.1 QA: Remove deprecated methods for ML 
> 
>
> Key: SPARK-18481
> URL: https://issues.apache.org/jira/browse/SPARK-18481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Remove deprecated methods for ML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18482) make sure Spark can access the table metadata created by older version of spark

2016-11-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-18482:
---

 Summary: make sure Spark can access the table metadata created by 
older version of spark
 Key: SPARK-18482
 URL: https://issues.apache.org/jira/browse/SPARK-18482
 Project: Spark
  Issue Type: Test
  Components: SQL
Reporter: Wenchen Fan


In Spark 2.1, the way table metadata is stored in the Hive metastore has changed a 
lot, e.g. no runtime schema inference, partition columns are stored, and the schema 
is stored for Hive tables. We should add compatibility tests for these changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1267) Add a pip installer for PySpark

2016-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1267:
--
Fix Version/s: 2.1.0

> Add a pip installer for PySpark
> ---
>
> Key: SPARK-1267
> URL: https://issues.apache.org/jira/browse/SPARK-1267
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Prabin Banka
>Assignee: holdenk
>Priority: Minor
>  Labels: pyspark
> Fix For: 2.1.0, 2.2.0
>
>
> Please refer to this mail archive,
> http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18129) Sign pip artifacts

2016-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18129:
---
Fix Version/s: 2.1.0

> Sign pip artifacts
> --
>
> Key: SPARK-18129
> URL: https://issues.apache.org/jira/browse/SPARK-18129
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.1.0, 2.2.0
>
>
> To allow us to distribute the Python pip installable artifacts we need to be 
> able to sign them with make-release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18481) ML 2.1 QA: Remove deprecated methods for ML

2016-11-16 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-18481:
---

 Summary: ML 2.1 QA: Remove deprecated methods for ML 
 Key: SPARK-18481
 URL: https://issues.apache.org/jira/browse/SPARK-18481
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Assignee: Yanbo Liang
Priority: Minor


Remove deprecated methods for ML.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18442) Fix nullability of WrapOption.

2016-11-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18442.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15887
[https://github.com/apache/spark/pull/15887]

> Fix nullability of WrapOption.
> --
>
> Key: SPARK-18442
> URL: https://issues.apache.org/jira/browse/SPARK-18442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The nullability of {{WrapOption}} should be {{false}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18442) Fix nullability of WrapOption.

2016-11-16 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18442:

Assignee: Takuya Ueshin

> Fix nullability of WrapOption.
> --
>
> Key: SPARK-18442
> URL: https://issues.apache.org/jira/browse/SPARK-18442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Minor
> Fix For: 2.1.0
>
>
> The nullability of {{WrapOption}} should be {{false}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18480) Link validation for ML guides

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18480:


Assignee: (was: Apache Spark)

> Link validation for ML guides
> -
>
> Key: SPARK-18480
> URL: https://issues.apache.org/jira/browse/SPARK-18480
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: zhengruifeng
>Priority: Trivial
>
> I did a quick pass over the links in the ML guides and found the following issues:
> 1. There are two {{[Graph.partitionBy]}} links in {{graphx-programming-guide.md}}; 
> the first one had no effect.
> 2. {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in 
> {{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake.
> 3. {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, 
> because class {{PythonMLLibAPI}} is private.
> 4. Other link updates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18480) Link validation for ML guides

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18480:


Assignee: Apache Spark

> Link validation for ML guides
> -
>
> Key: SPARK-18480
> URL: https://issues.apache.org/jira/browse/SPARK-18480
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Trivial
>
> I did a quick pass over the links in the ML guides and found the following issues:
> 1. There are two {{[Graph.partitionBy]}} links in {{graphx-programming-guide.md}}; 
> the first one had no effect.
> 2. {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in 
> {{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake.
> 3. {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, 
> because class {{PythonMLLibAPI}} is private.
> 4. Other link updates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18480) Link validation for ML guides

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672502#comment-15672502
 ] 

Apache Spark commented on SPARK-18480:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/15912

> Link validation for ML guides
> -
>
> Key: SPARK-18480
> URL: https://issues.apache.org/jira/browse/SPARK-18480
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: zhengruifeng
>Priority: Trivial
>
> I did a quick pass over the links in the ML guides and found the following issues:
> 1. There are two {{[Graph.partitionBy]}} links in {{graphx-programming-guide.md}}; 
> the first one had no effect.
> 2. {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in 
> {{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake.
> 3. {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, 
> because class {{PythonMLLibAPI}} is private.
> 4. Other link updates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18480) Link validation for ML guides

2016-11-16 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-18480:


 Summary: Link validation for ML guides
 Key: SPARK-18480
 URL: https://issues.apache.org/jira/browse/SPARK-18480
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: zhengruifeng
Priority: Trivial


I did a quick pass over the links in the ML guides and found the following issues:
1. There are two {{[Graph.partitionBy]}} links in {{graphx-programming-guide.md}}; 
the first one had no effect.
2. {{DataFrame}}, {{Transformer}}, {{Pipeline}} and {{Parameter}} in 
{{ml-pipeline.md}} were linked to {{ml-guide.html}} by mistake.
3. {{PythonMLLibAPI}} in {{mllib-linear-methods.md}} was not accessible, 
because class {{PythonMLLibAPI}} is private.
4. Other link updates.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs

2016-11-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672493#comment-15672493
 ] 

Takeshi Yamamuro commented on SPARK-18478:
--

okay, I'll check

> Support codegen for Hive UDFs
> -
>
> Key: SPARK-18478
> URL: https://issues.apache.org/jira/browse/SPARK-18478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently does not codegen Hive UDFs in hiveUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18319) ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit

2016-11-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18319:

Assignee: yuhao yang

> ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
> --
>
> Key: SPARK-18319
> URL: https://issues.apache.org/jira/browse/SPARK-18319
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Blocker
>
> We should make a pass through the items marked as Experimental or 
> DeveloperApi and see if any are stable enough to be unmarked.
> We should also check for items marked final or sealed to see if they are 
> stable enough to be opened up as APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18320) ML 2.1 QA: API: Python API coverage

2016-11-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-18320:

Assignee: Seth Hendrickson

> ML 2.1 QA: API: Python API coverage
> ---
>
> Key: SPARK-18320
> URL: https://issues.apache.org/jira/browse/SPARK-18320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Seth Hendrickson
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs

2016-11-16 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672453#comment-15672453
 ] 

Reynold Xin commented on SPARK-18478:
-

Are there any performance improvements we will get? I'm not sure how much gain 
we can get due to conversions.


> Support codegen for Hive UDFs
> -
>
> Key: SPARK-18478
> URL: https://issues.apache.org/jira/browse/SPARK-18478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently does not codegen Hive UDFs in hiveUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18317) ML, Graph 2.1 QA: API: Binary incompatible changes

2016-11-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-18317:
-

Assignee: Xiangrui Meng

> ML, Graph 2.1 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-18317
> URL: https://issues.apache.org/jira/browse/SPARK-18317
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18449) Name option is being ignored when submitting an R application via spark-submit

2016-11-16 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-18449:


Assignee: Felix Cheung

> Name option is being ignored when submitting an R application via spark-submit
> --
>
> Key: SPARK-18449
> URL: https://issues.apache.org/jira/browse/SPARK-18449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SparkR
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Alexander Eckert
>Assignee: Felix Cheung
>Priority: Minor
>
> The value of the _--name_ parameter is ignored when submitting an R 
> script via _spark-submit_ to a standalone Spark cluster. 
> This happens both when the R script starts a _sparkR.session()_ and when no 
> sparkR.session() is used. I would expect the value of the --name parameter to 
> be used when no appName is specified in the sparkR.session() call, or when the 
> session() function has not been called at all. And when an appName is 
> specified, I would expect the same behaviour as in Scala or Python 
> applications. Currently {{SparkR}} is displayed when no appName is specified.
> Example:
> 1. Edit _examples/src/main/r/dataframe.R_
> 2. Replace _sparkR.session(appName = "SparkR-DataFrame-example")_ with 
> _sparkR.session()_
> 3. Submit dataframe.R
> {code:none}
> bin/spark-submit --master spark://ubuntu:7077 --name MyApp 
> examples/src/main/r/dataframe.R
> {code}
> 4. Compare application names with names in the Spark WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18449) Name option is being ignored when submitting an R application via spark-submit

2016-11-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672403#comment-15672403
 ] 

Felix Cheung commented on SPARK-18449:
--

Good catch, this is likely because the R function has a default value of 
appName = "SparkR" which is then passed to the JVM, overriding the command-line 
value.

It gets a bit tricky regarding the order in which these get applied, though; I'll 
need to look into this to see what the right fix is.


> Name option is being ignored when submitting an R application via spark-submit
> --
>
> Key: SPARK-18449
> URL: https://issues.apache.org/jira/browse/SPARK-18449
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SparkR
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: Alexander Eckert
>Priority: Minor
>
> The value of the _--name_ parameter is ignored when submitting an R 
> script via _spark-submit_ to a standalone Spark cluster. 
> This happens both when the R script starts a _sparkR.session()_ and when no 
> sparkR.session() is used. I would expect the value of the --name parameter to 
> be used when no appName is specified in the sparkR.session() call, or when the 
> session() function has not been called at all. And when an appName is 
> specified, I would expect the same behaviour as in Scala or Python 
> applications. Currently {{SparkR}} is displayed when no appName is specified.
> Example:
> 1. Edit _examples/src/main/r/dataframe.R_
> 2. Replace _sparkR.session(appName = "SparkR-DataFrame-example")_ with 
> _sparkR.session()_
> 3. Submit dataframe.R
> {code:none}
> bin/spark-submit --master spark://ubuntu:7077 --name MyApp 
> examples/src/main/r/dataframe.R
> {code}
> 4. Compare application names with names in the Spark WebUI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-16 Thread Jason Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672312#comment-15672312
 ] 

Jason Pan commented on SPARK-18353:
---

Thanks Sean, it works. 

Just for the doc: "spark.network.timeout, or 10s in standalone 
clusters"
Actually the default is not 10s in standalone mode when using REST.

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html: 
> spark.rpc.askTimeout  120s  Duration for an RPC ask operation to wait 
> before timing out
> The default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, 
> the command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> -Dspark.rpc.askTimeout=10
> The value is 10, which is not the same as the documentation.
> Note: when I submit to the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column

2016-11-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18468:
-
Description: 
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/

https://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.4/71

Seems we failed to stop the driver
{code}
2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: 
Cannot receive any reply in 120 seconds. This timeout is controlled by 
spark.rpc.askTimeout
2016-11-15 18:36:47.76 - stderr>at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
2016-11-15 18:36:47.76 - stderr>at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
2016-11-15 18:36:47.76 - stderr>at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
2016-11-15 18:36:47.76 - stderr>at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
2016-11-15 18:36:47.76 - stderr>at 
scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
2016-11-15 18:36:47.76 - stderr>at scala.util.Try$.apply(Try.scala:192)
2016-11-15 18:36:47.76 - stderr>at 
scala.util.Failure.recover(Try.scala:216)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
2016-11-15 18:36:47.76 - stderr>at 
com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Promise$class.complete(Promise.scala:55)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
2016-11-15 18:36:47.76 - stderr>at 
scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
2016-11-15 18:36:47.76 - stderr>at 
org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
2016-11-15 18:36:47.76 - stderr>at 
org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
2016-11-15 18:36:47.76 - stderr>at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
2016-11-15 18:36:47.76 - stderr>at 

[jira] [Updated] (SPARK-18468) Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet relation with decimal column

2016-11-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-18468:
-
Component/s: (was: SQL)
 Spark Core

> Flaky test: org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist 
> Parquet relation with decimal column
> --
>
> Key: SPARK-18468
> URL: https://issues.apache.org/jira/browse/SPARK-18468
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>Priority: Critical
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.1-test-sbt-hadoop-2.4/71/testReport/junit/org.apache.spark.sql.hive/HiveSparkSubmitSuite/SPARK_9757_Persist_Parquet_relation_with_decimal_column/
> Seems we failed to stop the driver
> {code}
> 2016-11-15 18:36:47.76 - stderr> org.apache.spark.rpc.RpcTimeoutException: 
> Cannot receive any reply in 120 seconds. This timeout is controlled by 
> spark.rpc.askTimeout
> 2016-11-15 18:36:47.76 - stderr>  at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> 2016-11-15 18:36:47.76 - stderr>  at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> 2016-11-15 18:36:47.76 - stderr>  at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
> 2016-11-15 18:36:47.76 - stderr>  at scala.util.Try$.apply(Try.scala:192)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.util.Failure.recover(Try.scala:216)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> 2016-11-15 18:36:47.76 - stderr>  at 
> com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Promise$class.complete(Promise.scala:55)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
> 2016-11-15 18:36:47.76 - stderr>  at 
> scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
> 2016-11-15 18:36:47.76 - 

[jira] [Created] (SPARK-18479) spark.sql.shuffle.partitions defaults should be a prime number

2016-11-16 Thread Hamel Ajay Kothari (JIRA)
Hamel Ajay Kothari created SPARK-18479:
--

 Summary: spark.sql.shuffle.partitions defaults should be a prime 
number
 Key: SPARK-18479
 URL: https://issues.apache.org/jira/browse/SPARK-18479
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: Hamel Ajay Kothari


For most hash bucketing use cases, it is my understanding that a prime value 
such as 199 would be safer than the existing value of 200. Using a 
non-prime value makes the likelihood of collisions much higher when the hash 
function isn't great.

Consider the case where you've got a Timestamp or Long column with millisecond 
times at midnight each day. With the default value for 
spark.sql.shuffle.partitions, you'll end up with 120/200 partitions being 
completely empty.

Looking around, there doesn't seem to be a good reason why we chose 200, so I 
don't see a huge risk in changing it. 
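
A toy illustration of the collision concern (my own sketch, using plain modulo bucketing rather than Spark's actual hash partitioner):

{code:scala}
// Millisecond timestamps at midnight are multiples of 86,400,000, which is divisible by 200,
// so under simple modulo bucketing they all collapse into a single bucket.
val numPartitions = 200
val midnights = (0L until 1000L).map(_ * 86400000L)
val bucketsUsed = midnights.map(t => (t % numPartitions).toInt).distinct.size
println(s"$bucketsUsed of $numPartitions buckets used")   // 1 of 200

// A prime bucket count such as 199 spreads the same keys over every bucket.
val primeParts = 199
val primeUsed = midnights.map(t => (t % primeParts).toInt).distinct.size
println(s"$primeUsed of $primeParts buckets used")        // 199 of 199
{code}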




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18478) Support codegen for Hive UDFs

2016-11-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672272#comment-15672272
 ] 

Takeshi Yamamuro commented on SPARK-18478:
--

We can simply fix this 
(https://github.com/apache/spark/compare/master...maropu:SupportHiveUdfCodegen), 
though I'm not sure this fix makes sense because we plan to refactor the Hive 
integration in SPARK-15691. cc: [~rxin]

> Support codegen for Hive UDFs
> -
>
> Key: SPARK-18478
> URL: https://issues.apache.org/jira/browse/SPARK-18478
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Takeshi Yamamuro
>
> Spark currently does not codegen Hive UDFs in hiveUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672245#comment-15672245
 ] 

peay edited comment on SPARK-18473 at 11/17/16 12:55 AM:
-

Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct? (edit: 
nevermind, the fix appears to be in 2.0.2 indeed).


was (Author: peay):
Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-----+---------+------+--------+--------+----------+------+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-----+---------+------+--------+--------+----------+------+
> |    1|        1|     1|       0|       1|         0|     1|
> |    2|        2|     0|       0|       1|         0|     1|
> |    1|        3|     1|       0|       2|         0|     2|
> +-----+---------+------+--------+--------+----------+------+
> {code}
> with
> {code}
> +------+
> |sessId|
> +------+
> |     1|
> |     2|
> +------+
> {code}
> The result is
> {code}
> +------+-----+---------+------+--------+--------+----------+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +------+-----+---------+------+--------+--------+----------+
> |     1|    2|        2|     0|       0|       1|         0|
> |     2|    1|        1|     1|       0|       1|        -1|
> |     2|    1|        3|     1|       0|       2|         0|
> +------+-----+---------+------+--------+--------+----------+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18478) Support codegen for Hive UDFs

2016-11-16 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-18478:


 Summary: Support codegen for Hive UDFs
 Key: SPARK-18478
 URL: https://issues.apache.org/jira/browse/SPARK-18478
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.2
Reporter: Takeshi Yamamuro


Spark currently does not codegen Hive UDFs in hiveUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672225#comment-15672225
 ] 

Apache Spark commented on SPARK-18477:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15911

> Enable interrupts for HDFS in HDFSMetadataLog
> -
>
> Key: SPARK-18477
> URL: https://issues.apache.org/jira/browse/SPARK-18477
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> HDFS `write` may just hang until timeout if some network error happens. It's 
> better to enable interrupts to allow stopping the query fast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672245#comment-15672245
 ] 

peay commented on SPARK-18473:
--

Ok, I see, thanks. The fix is in 2.0.3 though, not 2.0.2, correct?

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18477:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Enable interrupts for HDFS in HDFSMetadataLog
> -
>
> Key: SPARK-18477
> URL: https://issues.apache.org/jira/browse/SPARK-18477
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> HDFS `write` may just hang until timeout if some network error happens. It's 
> better to enable interrupts to allow stopping the query fast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18477:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Enable interrupts for HDFS in HDFSMetadataLog
> -
>
> Key: SPARK-18477
> URL: https://issues.apache.org/jira/browse/SPARK-18477
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> HDFS `write` may just hang until timeout if some network error happens. It's 
> better to enable interrupts to allow stopping the query fast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672231#comment-15672231
 ] 

Dongjoon Hyun commented on SPARK-18473:
---

Hi, [~peay].
Maybe, the relevant one is https://issues.apache.org/jira/browse/SPARK-17806 .

I think we can close this issue since it's already proven by the example.

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672235#comment-15672235
 ] 

Dongjoon Hyun commented on SPARK-18473:
---

Wow. Please forget about my comment. :)
I didn't read [~hvanhovell]'s comments. Sorry.

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-18473.
-
Resolution: Fixed
  Assignee: Xiao Li

Fixed by gatorsmile's PR for SPARK-17981/SPARK-17957

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>Assignee: Xiao Li
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672223#comment-15672223
 ] 

Herman van Hovell commented on SPARK-18473:
---

This is probably caused by SPARK-17981. That bug creates incorrect nullability 
flags, which messes with the result of coalesce (fillna) in your query.
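
For readers less familiar with how nullability feeds the optimizer, here is a 
schematic sketch in plain Scala. The classes are hypothetical (they are not 
Catalyst's real expressions or rules); the point is only to show how a column 
that is wrongly flagged as non-nullable lets a simplification rule drop the 
coalesce that fillna adds.

{code:scala}
// Hypothetical mini-expressions, for illustration only.
sealed trait Expr { def nullable: Boolean }
case class ColumnRef(name: String, nullable: Boolean) extends Expr
case class Literal(value: Any) extends Expr { val nullable = false }
case class Coalesce(child: Expr, default: Expr) extends Expr {
  val nullable = child.nullable && default.nullable
}

// Simplification rule: if the child claims it can never be null, the default
// can never be used, so the whole Coalesce (i.e. the fillna) is removed.
def simplify(e: Expr): Expr = e match {
  case Coalesce(child, _) if !child.nullable => child
  case other => other
}

// With the flag incorrectly set to non-nullable (the SPARK-17981 symptom),
// fillna(0) is optimized away and real nulls survive into later arithmetic.
val wronglyFlagged = ColumnRef("numFiftyStrictlyBefore", nullable = false)
simplify(Coalesce(wronglyFlagged, Literal(0)))   // => ColumnRef(..., false)
{code}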

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18477) Enable interrupts for HDFS in HDFSMetadataLog

2016-11-16 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-18477:


 Summary: Enable interrupts for HDFS in HDFSMetadataLog
 Key: SPARK-18477
 URL: https://issues.apache.org/jira/browse/SPARK-18477
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


An HDFS `write` may hang until it times out if a network error happens. It's 
better to enable interrupts so that a query blocked on the write can be stopped 
quickly.
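
For context, a common way to make a blocking filesystem call interruptible is 
to run it on a dedicated thread and interrupt that thread from the stop path. 
The sketch below is only a generic illustration of that pattern, not 
necessarily what the pull request for this issue does; `doWrite` stands in for 
the blocking HDFS call.

{code:scala}
import java.util.concurrent.{Executors, TimeUnit, Future => JFuture}

// Generic pattern: run the blocking write on its own thread so that stop()
// can interrupt it instead of waiting out a long network timeout.
class InterruptibleWriter(doWrite: () => Unit) {
  private val pool = Executors.newSingleThreadExecutor()
  @volatile private var running: JFuture[_] = _

  def write(): Unit = {
    running = pool.submit(new Runnable { def run(): Unit = doWrite() })
    running.get()   // propagates failures (including interruption) to the caller
  }

  // Called from the query-stop path.
  def stop(): Unit = {
    Option(running).foreach(_.cancel(true))   // interrupts the writer thread
    pool.shutdownNow()
    pool.awaitTermination(10, TimeUnit.SECONDS)
  }
}
{code}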



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18476) SparkR Logistic Regression should support output original label.

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15672218#comment-15672218
 ] 

Apache Spark commented on SPARK-18476:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/15910

> SparkR Logistic Regression should support output original label.
> ---
>
> Key: SPARK-18476
> URL: https://issues.apache.org/jira/browse/SPARK-18476
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression 
> should support output original label instead of supporting index label.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18476) SparkR Logistic Regression should support output original label.

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18476:


Assignee: Apache Spark

> SparkR Logistic Regression should support output original label.
> ---
>
> Key: SPARK-18476
> URL: https://issues.apache.org/jira/browse/SPARK-18476
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>Assignee: Apache Spark
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression 
> should support output original label instead of supporting index label.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18476) SparkR Logistic Regression should support output original label.

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18476:


Assignee: (was: Apache Spark)

> SparkR Logistic Regression should support output original label.
> ---
>
> Key: SPARK-18476
> URL: https://issues.apache.org/jira/browse/SPARK-18476
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Miao Wang
>
> Similar to [SPARK-18401], as a classification algorithm, logistic regression 
> should support output original label instead of supporting index label.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18476) SparkR Logistic Regression should support output original label.

2016-11-16 Thread Miao Wang (JIRA)
Miao Wang created SPARK-18476:
-

 Summary: SparkR Logistic Regression should support output original label.
 Key: SPARK-18476
 URL: https://issues.apache.org/jira/browse/SPARK-18476
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Miao Wang


Similar to [SPARK-18401]: as a classification algorithm, logistic regression 
should support outputting the original labels instead of the indexed labels.
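
For illustration, this is roughly what the requested behaviour looks like with 
the Scala ML API (the SparkR wrapper would do the analogous mapping 
internally). An input `df` with a string column "label" and a vector column 
"features" is assumed; the key piece is IndexToString, which maps the numeric 
predictions back to the original labels.

{code:scala}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.DataFrame

def predictWithOriginalLabels(df: DataFrame): DataFrame = {
  // Index the string labels for training.
  val indexer = new StringIndexer()
    .setInputCol("label").setOutputCol("indexedLabel").fit(df)

  val lr = new LogisticRegression()
    .setLabelCol("indexedLabel").setFeaturesCol("features")

  // Map the numeric "prediction" column back to the original label strings.
  val converter = new IndexToString()
    .setInputCol("prediction").setOutputCol("predictedLabel")
    .setLabels(indexer.labels)

  new Pipeline()
    .setStages(Array[PipelineStage](indexer, lr, converter))
    .fit(df)
    .transform(df)   // contains both "prediction" and "predictedLabel"
}
{code}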



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671923#comment-15671923
 ] 

Apache Spark commented on SPARK-18475:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15909

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18475:


Assignee: (was: Apache Spark)

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18475:


Assignee: Apache Spark

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> --
>
> Key: SPARK-18475
> URL: https://issues.apache.org/jira/browse/SPARK-18475
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

2016-11-16 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-18475:
---

 Summary: Be able to provide higher parallelization for 
StructuredStreaming Kafka Source
 Key: SPARK-18475
 URL: https://issues.apache.org/jira/browse/SPARK-18475
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.0.2, 2.1.0
Reporter: Burak Yavuz


Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
there are TopicPartitions that we're going to read from Kafka.
This doesn't work well when we have data skew, and there is no reason why we 
shouldn't be able to increase parallelism further, i.e. have multiple Spark 
tasks reading from the same Kafka TopicPartition.

This means we won't be able to use the "CachedKafkaConsumer" for what it is 
designed for (being cached) in this use case, but the extra overhead is worth 
it for handling data skew and increasing parallelism, especially in ETL use 
cases.
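
To sketch the core idea, splitting one TopicPartition's offset range into 
several sub-ranges is enough to let more than one task read it. The names 
below (OffsetRange, the even-split policy) are illustrative assumptions, not 
the actual source API.

{code:scala}
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long)

// Split one partition's range into up to `parts` contiguous sub-ranges.
def split(range: OffsetRange, parts: Int): Seq[OffsetRange] = {
  val total = range.untilOffset - range.fromOffset
  val n = math.max(1, math.min(parts, total).toInt)   // never more parts than offsets
  (0 until n).map { i =>
    val start = range.fromOffset + total * i / n
    val end   = range.fromOffset + total * (i + 1) / n
    range.copy(fromOffset = start, untilOffset = end)
  }
}

// A skewed partition with 1M records becomes 4 tasks of ~250k records each:
// split(OffsetRange("events", 0, 0L, 1000000L), 4)
{code}

Each sub-range would need its own (non-cached) consumer, which is the extra 
overhead mentioned above.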



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18186) Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation support

2016-11-16 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-18186.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 15703
[https://github.com/apache/spark/pull/15703]

> Migrate HiveUDAFFunction to TypedImperativeAggregate for partial aggregation 
> support
> 
>
> Key: SPARK-18186
> URL: https://issues.apache.org/jira/browse/SPARK-18186
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> Currently, Hive UDAFs in Spark SQL don't support partial aggregation. Any 
> query involving any Hive UDAFs has to fall back to {{SortAggregateExec}} 
> without partial aggregation.
> This issue can be fixed by migrating {{HiveUDAFFunction}} to 
> {{TypedImperativeAggregate}}, which already provides partial aggregation 
> support for aggregate functions that may use arbitrary Java objects as 
> aggregation states.
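
As background, partial aggregation hinges on a buffer contract of roughly the 
following shape. This is a simplified schematic of what TypedImperativeAggregate 
provides, not Spark's actual class: each task updates its own buffer object, 
buffers are serialized for the shuffle, and the partials are merged before the 
final result is produced.

{code:scala}
// Simplified schematic of the partial-aggregation contract.
trait PartialAgg[BUF, IN, OUT] {
  def createBuffer(): BUF                      // per-task aggregation state
  def update(buf: BUF, input: IN): BUF         // map-side partial aggregation
  def merge(left: BUF, right: BUF): BUF        // combine shuffled partials
  def eval(buf: BUF): OUT                      // final result
  def serialize(buf: BUF): Array[Byte]         // buffers cross the shuffle as bytes
  def deserialize(bytes: Array[Byte]): BUF
}
{code}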



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1267) Add a pip installer for PySpark

2016-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1267.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Merged into master (2.2) and will consider for 2.1.

> Add a pip installer for PySpark
> ---
>
> Key: SPARK-1267
> URL: https://issues.apache.org/jira/browse/SPARK-1267
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Prabin Banka
>Assignee: holdenk
>Priority: Minor
>  Labels: pyspark
> Fix For: 2.2.0
>
>
> Please refer to this mail archive,
> http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18129) Sign pip artifacts

2016-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-18129:
---
Assignee: holdenk

> Sign pip artifacts
> --
>
> Key: SPARK-18129
> URL: https://issues.apache.org/jira/browse/SPARK-18129
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.2.0
>
>
> To allow us to distribute the Python pip installable artifacts we need to be 
> able to sign them with make-release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18129) Sign pip artifacts

2016-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-18129.

   Resolution: Fixed
Fix Version/s: 2.2.0

Merged to master (2.2).

> Sign pip artifacts
> --
>
> Key: SPARK-18129
> URL: https://issues.apache.org/jira/browse/SPARK-18129
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.2.0
>
>
> To allow us to distribute the Python pip installable artifacts we need to be 
> able to sign them with make-release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1267) Add a pip installer for PySpark

2016-11-16 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1267:
--
Assignee: holdenk

> Add a pip installer for PySpark
> ---
>
> Key: SPARK-1267
> URL: https://issues.apache.org/jira/browse/SPARK-1267
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Prabin Banka
>Assignee: holdenk
>Priority: Minor
>  Labels: pyspark
> Fix For: 2.2.0
>
>
> Please refer to this mail archive,
> http://mail-archives.apache.org/mod_mbox/spark-user/201311.mbox/%3CCAOEPXP7jKiw-3M8eh2giBcs8gEkZ1upHpGb=fqoucvscywj...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16609) Single function for parsing timestamps/dates

2016-11-16 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16609:

Assignee: (was: Reynold Xin)

> Single function for parsing timestamps/dates
> 
>
> Key: SPARK-16609
> URL: https://issues.apache.org/jira/browse/SPARK-16609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>
> Today, if you want to parse a date or timestamp, you have to use the unix 
> time function and then cast to a timestamp.  Its a little odd there isn't a 
> single function that does both.  I propose we add
> {code}
> to_date(<expr>, <format>) / to_timestamp(<expr>, <format>).
> {code}
> For reference, in other systems there are:
> MS SQL: {{convert(<type>, <expr>)}}. See: 
> https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx
> Netezza: {{to_timestamp(<string>, <format>)}}. See: 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html
> Teradata has special casting functionality: {{cast(<string> as timestamp 
> format '<format>')}}
> MySql: {{STR_TO_DATE(<str>, <format>)}}. This returns a datetime when you 
> define both date and time parts. See: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html
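
In Spark terms, the contrast this ticket is after looks roughly like the 
snippet below (illustrative Scala for the spark-shell; the two-argument 
overloads are the proposal, not API that existed when this was filed).

{code:scala}
import org.apache.spark.sql.functions.{to_date, unix_timestamp}
import spark.implicits._   // `spark` is the shell's SparkSession

// Today's workaround: parse via unix time, cast, then strip the time part.
val parsed = to_date(unix_timestamp($"date", "yyyy-dd-MM").cast("timestamp"))

// The proposal: a format-aware overload, so the intent reads directly.
// val proposed = to_date($"date", "yyyy-dd-MM")
{code}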



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18424) Single Function for Parsing Dates and Times with Formats

2016-11-16 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers resolved SPARK-18424.
---
Resolution: Duplicate

This is a duplicate of SPARK-16609. Work will continue there.

> Single Function for Parsing Dates and Times with Formats
> 
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16609) Single function for parsing timestamps/dates

2016-11-16 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671632#comment-15671632
 ] 

Bill Chambers commented on SPARK-16609:
---

I am working on this.

> Single function for parsing timestamps/dates
> 
>
> Key: SPARK-16609
> URL: https://issues.apache.org/jira/browse/SPARK-16609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Reynold Xin
>
> Today, if you want to parse a date or timestamp, you have to use the unix 
> time function and then cast to a timestamp.  Its a little odd there isn't a 
> single function that does both.  I propose we add
> {code}
> to_date(<expr>, <format>) / to_timestamp(<expr>, <format>).
> {code}
> For reference, in other systems there are:
> MS SQL: {{convert(<type>, <expr>)}}. See: 
> https://technet.microsoft.com/en-us/library/ms174450(v=sql.110).aspx
> Netezza: {{to_timestamp(<string>, <format>)}}. See: 
> https://www.ibm.com/support/knowledgecenter/SSULQD_7.0.3/com.ibm.nz.dbu.doc/r_dbuser_ntz_sql_extns_conversion_funcs.html
> Teradata has special casting functionality: {{cast(<string> as timestamp 
> format '<format>')}}
> MySql: {{STR_TO_DATE(<str>, <format>)}}. This returns a datetime when you 
> define both date and time parts. See: 
> https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18424) Single Function for Parsing Dates and Times with Formats

2016-11-16 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Summary: Single Function for Parsing Dates and Times with Formats  (was: 
Single Funct)

> Single Function for Parsing Dates and Times with Formats
> 
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18424) Single Funct

2016-11-16 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-18424:
--
Summary: Single Funct  (was: Improve Date Parsing Semantics & Functionality)

> Single Funct
> 
>
> Key: SPARK-18424
> URL: https://issues.apache.org/jira/browse/SPARK-18424
> Project: Spark
>  Issue Type: Improvement
>Reporter: Bill Chambers
>Assignee: Bill Chambers
>Priority: Minor
>
> I've found it quite cumbersome to work with dates thus far in Spark, it can 
> be hard to reason about the timeformat and what type you're working with, for 
> instance:
> say that I have a date in the format
> {code}
> 2017-20-12
> // Y-D-M
> {code}
> In order to parse that into a Date, I have to perform several conversions.
> {code}
>   to_date(
> unix_timestamp(col("date"), dateFormat)
> .cast("timestamp"))
>.alias("date")
> {code}
> I propose simplifying this by adding a to_date function (exists) but adding 
> one that accepts a format for that date. I also propose a to_timestamp 
> function that also supports a format.
> so that you can avoid entirely the above conversion.
> It's also worth mentioning that many other databases support this. For 
> instance, mysql has the STR_TO_DATE function, netezza supports the 
> to_timestamp semantic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671624#comment-15671624
 ] 

peay commented on SPARK-18473:
--

Ah, great, thanks. I had checked the CHANGELOG but couldn't find anything 
relevant. Is there a reference to the corresponding issue?

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671609#comment-15671609
 ] 

Herman van Hovell commented on SPARK-18473:
---

This has been fixed in spark 2.0.2.

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671609#comment-15671609
 ] 

Herman van Hovell edited comment on SPARK-18473 at 11/16/16 8:59 PM:
-

This has been fixed in spark 2.0.2.

See the following notebook: 
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2728434780191932/3885409687420673/6987336228780374/latest.html


was (Author: hvanhovell):
This has been fixed in spark 2.0.2.

> Correctness issue in INNER join result with window functions
> 
>
> Key: SPARK-18473
> URL: https://issues.apache.org/jira/browse/SPARK-18473
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: peay
>
> I have stumbled onto a corner case where an INNER join appears to return 
> incorrect results. I believe the join should behave as the identity, but 
> instead, some values are shuffled around, and some are just plain wrong.
> This can be reproduced as follows: joining
> {code}
> +-+-+--+++--+--+
> |index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
> +-+-+--+++--+--+
> |1|1| 1|   0|   1| 0| 1|
> |2|2| 0|   0|   1| 0| 1|
> |1|3| 1|   0|   2| 0| 2|
> +-+-+--+++--+--+
> {code}
> with
> {code}
> +--+
> |sessId|
> +--+
> | 1|
> | 2|
> +--+
> {code}
> The result is
> {code}
> +--+-+-+--+++--+
> |sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
> +--+-+-+--+++--+
> | 1|2|2| 0|   0|   1| 0|
> | 2|1|1| 1|   0|   1|-1|
> | 2|1|3| 1|   0|   2| 0|
> +--+-+-+--+++--+
> {code}
> Note how two rows have a sessId of 2 (instead of one row as expected), and 
> how `fiftyCount` can now be negative while always zero in the original 
> dataframe.
> The first dataframe uses two windows:
> - `hasOne` uses a `window.rowsBetween(-10, 0)`.
> - `hasFifty` uses a `window.rowsBetween(-10, -1)`.
> The result is *correct* if:
> - `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
> `window.rowsBetween(-10, -1)`.
> - I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
> `fillna` -- although there are no visible effect on the dataframe as shown by 
> `show` as far as I can tell.
> - I use a LEFT OUTER join instead of INNER JOIN.
> - I write both dataframes to Parquet, read them back and join these.
> This can be reproduced in pyspark using:
> {code}
> import pyspark.sql.functions as F
> from pyspark.sql.functions import col
> from pyspark.sql.window import Window
> df1 = sql_context.createDataFrame(
> pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
> )
> window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")
> df2 = (
> df1
> .withColumn("hasOne", (col("index") == 1).cast("int"))
> .withColumn("hasFifty", (col("index") == 50).cast("int"))
> .withColumn("numOnesBefore", 
> F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
> .withColumn("numFiftyStrictlyBefore", 
> F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
> .fillna({ 'numFiftyStrictlyBefore': 0 })
> .withColumn("sessId", col("numOnesBefore") - 
> col("numFiftyStrictlyBefore"))
> )
> df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
> df_joined = df_selector.join(df2, "sessId", how="inner")
> df2.show()
> df_selector.show()
> df_joined.show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18474) Add StreamingQuery.status in python

2016-11-16 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-18474:
-

 Summary: Add StreamingQuery.status in python
 Key: SPARK-18474
 URL: https://issues.apache.org/jira/browse/SPARK-18474
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how two rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative even though it is always zero in the original 
dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }){code} after the other call to 
`fillna` -- although, as far as I can tell, this has no visible effect on the 
dataframe as shown by `show`.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}
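For reference, a minimal sketch of the Parquet round-trip workaround listed 
above, reusing `df2`, `df_selector` and `sql_context` from the snippet; the 
output paths are hypothetical placeholders:

{code}
# Sketch of the Parquet round-trip workaround; the paths are placeholders.
df2.write.parquet("/tmp/spark18473_df2")
df_selector.write.parquet("/tmp/spark18473_selector")

# Joining the round-tripped dataframes reportedly yields the expected
# (identity) result.
df2_rt = sql_context.read.parquet("/tmp/spark18473_df2")
df_selector_rt = sql_context.read.parquet("/tmp/spark18473_selector")

df_selector_rt.join(df2_rt, "sessId", how="inner").show()
{code}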

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- 

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is *correct* if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 }) {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 })` {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- 

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add {code}.fillna({ 'numOnesBefore': 0 })` {code} after the other call to 
`fillna` -- although there are no visible effect on the dataframe as shown by 
`show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })` -- although there are no visible 
effect on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER 

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })` -- although there are no visible 
effect on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add .fillna({ 'numOnesBefore': 0 }), although there are no visible effect 
on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read 

[jira] [Updated] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

peay updated SPARK-18473:
-
Description: 
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add .fillna({ 'numOnesBefore': 0 }), although there are no visible effect 
on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}

  was:
I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative while always zero in the original dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })`, although there are no visible effect 
on the dataframe as shown by `show` as far as I can tell.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, 

[jira] [Created] (SPARK-18473) Correctness issue in INNER join result with window functions

2016-11-16 Thread peay (JIRA)
peay created SPARK-18473:


 Summary: Correctness issue in INNER join result with window 
functions
 Key: SPARK-18473
 URL: https://issues.apache.org/jira/browse/SPARK-18473
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.0.1
Reporter: peay


I have stumbled onto a corner case where an INNER join appears to return 
incorrect results. I believe the join should behave as the identity, but 
instead, some values are shuffled around, and some are just plain wrong.

This can be reproduced as follows: joining

{code}
+-----+---------+------+--------+--------+----------+------+
|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|sessId|
+-----+---------+------+--------+--------+----------+------+
|    1|        1|     1|       0|       1|         0|     1|
|    2|        2|     0|       0|       1|         0|     1|
|    1|        3|     1|       0|       2|         0|     2|
+-----+---------+------+--------+--------+----------+------+
{code}

with

{code}
+------+
|sessId|
+------+
|     1|
|     2|
+------+
{code}

The result is

{code}
+------+-----+---------+------+--------+--------+----------+
|sessId|index|timeStamp|hasOne|hasFifty|oneCount|fiftyCount|
+------+-----+---------+------+--------+--------+----------+
|     1|    2|        2|     0|       0|       1|         0|
|     2|    1|        1|     1|       0|       1|        -1|
|     2|    1|        3|     1|       0|       2|         0|
+------+-----+---------+------+--------+--------+----------+
{code}

Note how two rows have a sessId of 2 (instead of one row as expected), and how 
`fiftyCount` can now be negative even though it is always zero in the original 
dataframe.

The first dataframe uses two windows:
- `hasOne` uses a `window.rowsBetween(-10, 0)`.
- `hasFifty` uses a `window.rowsBetween(-10, -1)`.

The result is **correct** if:
- `hasFifty` is changed to `window.rowsBetween(-10, 0)` instead of  
`window.rowsBetween(-10, -1)`.
- I add `.fillna({ 'numOnesBefore': 0 })`, although, as far as I can tell, this 
has no visible effect on the dataframe as shown by `show`.
- I use a LEFT OUTER join instead of INNER JOIN.
- I write both dataframes to Parquet, read them back and join these.

This can be reproduced in pyspark using:

{code:python}
import pyspark.sql.functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

df1 = sql_context.createDataFrame(
pd.DataFrame({"index": [1, 2, 1], "timeStamp": [1, 2, 3]})
)

window = Window.partitionBy(F.lit(1)).orderBy("timeStamp", "index")

df2 = (
df1
.withColumn("hasOne", (col("index") == 1).cast("int"))
.withColumn("hasFifty", (col("index") == 50).cast("int"))
.withColumn("numOnesBefore", 
F.sum(col("hasOne")).over(window.rowsBetween(-10, 0)))
.withColumn("numFiftyStrictlyBefore", 
F.sum(col("hasFifty")).over(window.rowsBetween(-10, -1)))
.fillna({ 'numFiftyStrictlyBefore': 0 })
.withColumn("sessId", col("numOnesBefore") - col("numFiftyStrictlyBefore"))
)

df_selector = sql_context.createDataFrame(pd.DataFrame({"sessId": [1, 2]}))
df_joined = df_selector.join(df2, "sessId", how="inner")

df2.show()
df_selector.show()
df_joined.show()
{code}
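As a side note, a minimal sketch of the LEFT OUTER join variant mentioned 
above, reusing `df2` and `df_selector` from the snippet; this variant 
reportedly returns the expected rows:

{code}
# Same join keyed on sessId, but as a left outer join instead of an inner join.
df_joined_outer = df_selector.join(df2, "sessId", how="left_outer")
df_joined_outer.show()
{code}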



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16795) Spark's HiveThriftServer should be able to use multiple sqlContexts

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-16795.
-
Resolution: Duplicate

> Spark's HiveThriftServer should be able to use multiple sqlContexts
> ---
>
> Key: SPARK-16795
> URL: https://issues.apache.org/jira/browse/SPARK-16795
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> It seems that when sending multiple Hive queries to the thrift server, the 
> server cannot parallelize query planning, because it uses only one global 
> sqlContext.
> This makes the server very inefficient at handling many small concurrent 
> queries.
> It would be nice to have it use a pool of sqlContexts instead, with a 
> configurable maximum number of contexts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16795) Spark's HiveThriftServer should be able to use multiple sqlContexts

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671462#comment-15671462
 ] 

Herman van Hovell commented on SPARK-16795:
---

Spark uses one Hive client per Spark context. This is known to cause 
performance issues in a highly concurrent environment because multiple sessions 
are contending for a single Hive client. This is a duplicate of SPARK-14003, 
and I am closing this one as such. 

> Spark's HiveThriftServer should be able to use multiple sqlContexts
> ---
>
> Key: SPARK-16795
> URL: https://issues.apache.org/jira/browse/SPARK-16795
> Project: Spark
>  Issue Type: Wish
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> It seems that when sending multiple Hive queries to the thrift server, the 
> server cannot parallelize query planning, because it uses only one global 
> sqlContext.
> This makes the server very inefficient at handling many small concurrent 
> queries.
> It would be nice to have it use a pool of sqlContexts instead, with a 
> configurable maximum number of contexts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16865) A file-based end-to-end SQL query suite

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-16865.
---
   Resolution: Fixed
 Assignee: Peter Lee
Fix Version/s: 2.0.1
   2.1.0

> A file-based end-to-end SQL query suite
> ---
>
> Key: SPARK-16865
> URL: https://issues.apache.org/jira/browse/SPARK-16865
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.1.0, 2.0.1
>
>
> Spark currently has a large number of end-to-end SQL test cases in various 
> SQLQuerySuites. It is fairly difficult to manage and operate these end-to-end 
> test cases. This ticket proposes that we use files to specify queries and 
> expected outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16951) Alternative implementation of NOT IN to Anti-join

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-16951:
--
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-18455

> Alternative implementation of NOT IN to Anti-join
> -
>
> Key: SPARK-16951
> URL: https://issues.apache.org/jira/browse/SPARK-16951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Nattavut Sutyanyong
>
> The transformation currently used to process a {{NOT IN}} subquery is to 
> rewrite it to a null-aware Anti-join in the Logical Plan, and then translate 
> it to an {{OR}} predicate joining the parent side and the subquery side of 
> the {{NOT IN}}. As a result, the presence of the {{OR}} predicate limits 
> execution to the nested-loop join plan, which has a major performance 
> implication if both sides' results are large.
> This JIRA sketches an idea of changing the OR predicate to a form similar to 
> the technique used in the implementation of the Existence join that addresses 
> the problem of {{EXISTS (..) OR ..}} type of queries.
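For illustration, a tiny {{NOT IN}} query of the kind discussed above; the 
table and column names are made up and `spark` is assumed to be a SparkSession. 
The extended plan should show the null-aware anti-join with the {{OR}}-style 
predicate:

{code}
# Hypothetical tables t1(a) and t2(b); for illustration only.
spark.sql("""
    SELECT * FROM t1 WHERE t1.a NOT IN (SELECT t2.b FROM t2)
""").explain(True)
{code}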



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17268) Break Optimizer.scala apart

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17268.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> Break Optimizer.scala apart
> ---
>
> Key: SPARK-17268
> URL: https://issues.apache.org/jira/browse/SPARK-17268
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> Optimizer.scala has become too large to maintain. We would need to break it 
> apart into multiple files, each of which contains logically related rules.
> We can create the following files for logical grouping:
> - finish analysis
> - joins
> - expressions
> - subquery
> - objects



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2016-11-16 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671415#comment-15671415
 ] 

Sital Kedia commented on SPARK-13510:
-

[~shenhong] - We are seeing the same issue on our side. Do you have a PR for 
this yet? 

> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a SQL query, it throws an 
> exception and fails.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1 GB), the task will 
> allocate the same amount of memory, so it easily throws "FetchFailedException: 
> Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will 
> throw 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>   
>   In the MapReduce shuffle, it first checks whether the block can be cached in 
> memory, but Spark doesn't. 
>   If the block is larger than what we can cache in memory, we should write it 
> to disk.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17450) spark sql rownumber OOM

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671411#comment-15671411
 ] 

Herman van Hovell commented on SPARK-17450:
---

[~cenyuhai] did you have any luck with merging these? Can we close this issue?

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> There will be only 1 task to handle all records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>  +- Project [passenger_id#7L]
> +- Filter product#9 IN (kuai,gulf)
>+- HiveTableScan [passenger_id#7L,product#9], 
> MetastoreRelation pbs_dw, dwv_order_whole_day, None,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (SPARK-17662) Dedup UDAF

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell closed SPARK-17662.
-
Resolution: Not A Problem

> Dedup UDAF
> --
>
> Key: SPARK-17662
> URL: https://issues.apache.org/jira/browse/SPARK-17662
> Project: Spark
>  Issue Type: New Feature
>Reporter: Ohad Raviv
>
> We have a common use case of deduping a table in creation order.
> For example, we have an event log of user actions. A user marks his favorite 
> category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id  action_type   value  date
> 123      fav category  1      2016-02-01
> 123      fav category  4      2016-02-02
> 123      fav category  8      2016-02-03
> 123      fav category  2      2016-02-04
> We would like to get only the last update by the date column.
> We could of course do it in SQL:
> select * from (
> select *, row_number() over (partition by user_id,action_type order by date 
> desc) as rnum from tbl)
> where rnum=1;
> but then, I believe it can't be optimized on the mappers side and we'll get 
> all the data shuffled to the reducers instead of partially aggregated in the 
> map side.
> We have written a UDAF for this, but then we have other issues - like 
> blocking push-down-predicate for columns.
> do you have any idea for a proper solution?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17662) Dedup UDAF

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671401#comment-15671401
 ] 

Herman van Hovell commented on SPARK-17662:
---

This is more of a question for the user list or stack overflow. So I am closing 
this.

BTW: I would use max, for example: 
{noformat}
select user_id,
   action_type,
   max(struct(date, *)) last_record
from   tbl
group by 1,2
{noformat}
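For illustration only, a rough PySpark DataFrame equivalent of the SQL above, 
using the column names from the example (`df` stands for the event-log 
dataframe; this is a sketch, not a tested solution):

{code}
import pyspark.sql.functions as F

# Per (user_id, action_type), keep the record with the greatest date by
# taking the max over a struct whose first field is the date.
last = (
    df.groupBy("user_id", "action_type")
      .agg(F.max(F.struct("date", "value")).alias("last_record"))
      .select("user_id", "action_type",
              F.col("last_record.date").alias("date"),
              F.col("last_record.value").alias("value"))
)
{code}

Because max is an ordinary aggregate, it can be partially aggregated on the map 
side, which is the property the description asks about.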

> Dedup UDAF
> --
>
> Key: SPARK-17662
> URL: https://issues.apache.org/jira/browse/SPARK-17662
> Project: Spark
>  Issue Type: New Feature
>Reporter: Ohad Raviv
>
> We have a common use case of deduping a table in creation order.
> For example, we have an event log of user actions. A user marks his favorite 
> category from time to time.
> In our analytics we would like to know only the user's last favorite category.
> The data:
> user_id  action_type   value  date
> 123      fav category  1      2016-02-01
> 123      fav category  4      2016-02-02
> 123      fav category  8      2016-02-03
> 123      fav category  2      2016-02-04
> We would like to get only the last update by the date column.
> We could of course do it in SQL:
> select * from (
> select *, row_number() over (partition by user_id,action_type order by date 
> desc) as rnum from tbl)
> where rnum=1;
> but then, I believe it can't be optimized on the mappers side and we'll get 
> all the data shuffled to the reducers instead of partially aggregated in the 
> map side.
> We have written a UDAF for this, but then we have other issues - like 
> blocking push-down-predicate for columns.
> do you have any idea for a proper solution?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671361#comment-15671361
 ] 

Herman van Hovell commented on SPARK-18172:
---

This is different from SPARK-18300, this is not a case where foldable 
propagation is an issue. The thing is that we fixed a lot of these issues 
recently, so I am not sure which one fixed this one.

I am closing this one, please re-open if you hit this again.

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
> Fix For: 2.0.2
>
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (but it's not restricted to Python; similar code in Java fails 
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> 

[jira] [Resolved] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18172.
---
  Resolution: Fixed
   Fix Version/s: 2.0.2
Target Version/s:   (was: 2.1.0)

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
> Fix For: 2.0.2
>
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (but it's not restricted to Python; similar code in Java fails 
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> 

[jira] [Updated] (SPARK-17786) [SPARK 2.0] Sorting algorithm gives higher skewness of output

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17786:
--
Target Version/s: 2.1.0

> [SPARK 2.0] Sorting algorithm gives higher skewness of output
> -
>
> Key: SPARK-17786
> URL: https://issues.apache.org/jira/browse/SPARK-17786
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Maciej Bryński
>
> Hi,
> I'm using df.sort("column") to sort my data before saving it to parquet.
> When using Spark 1.6.2 all partitions were similar in size.
> On Spark 2.0.0, three of the partitions are much bigger than the rest.
> Can I go back to the previous sorting behaviour?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17788) RangePartitioner results in few very large tasks and many small to empty tasks

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17788:
--
Target Version/s: 2.1.0

> RangePartitioner results in few very large tasks and many small to empty 
> tasks 
> ---
>
> Key: SPARK-17788
> URL: https://issues.apache.org/jira/browse/SPARK-17788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
> Environment: Ubuntu 14.04 64bit
> Java 1.8.0_101
>Reporter: Babak Alipour
>
> Greetings everyone,
> I was trying to read a single field of a Hive table stored as Parquet in 
> Spark (~140GB for the entire table, this single field is a Double, ~1.4B 
> records) and look at the sorted output using the following:
> sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC") 
> But this simple line of code gives:
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> Same error for:
> sql("SELECT " + field + " FROM MY_TABLE).sort(field)
> and:
> sql("SELECT " + field + " FROM MY_TABLE).orderBy(field)
> After doing some searching, the issue seems to lie in the RangePartitioner 
> trying to create equal ranges. [1]
> [1] 
> https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
>  
> The Double values I'm trying to sort are mostly in the range [0,1] (~70% of 
> the data, which roughly equates to 1 billion records); other numbers in the 
> dataset are as high as 2000. With the RangePartitioner trying to create equal 
> ranges, some tasks are becoming almost empty while others are extremely 
> large, due to the heavily skewed distribution. 
> This is either a bug in Apache Spark or a major limitation of the framework. 
> I hope one of the devs can help solve this issue.
> P.S. Email thread on Spark user mailing list:
> http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17932) Failed to run SQL "show table extended like table_name" in Spark2.0.0

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671331#comment-15671331
 ] 

Herman van Hovell commented on SPARK-17932:
---

This is currently not implemented in Spark 2.0. Feel free to open up a PR for 
this.

> Failed to run SQL "show table extended  like table_name"  in Spark2.0.0
> ---
>
> Key: SPARK-17932
> URL: https://issues.apache.org/jira/browse/SPARK-17932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: pin_zhang
>
> SQL "show table extended  like table_name " doesn't work in spark 2.0.0
> that works in spark1.5.2
> Error: org.apache.spark.sql.catalyst.parser.ParseException: 
> missing 'FUNCTIONS' at 'extended'(line 1, pos 11)
> == SQL ==
> show table extended  like test
> ---^^^ (state=,code=0)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17897) not isnotnull is converted to the always false condition isnotnull && not isnotnull

2016-11-16 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-17897:
--
Target Version/s: 2.1.0

> not isnotnull is converted to the always false condition isnotnull && not 
> isnotnull
> ---
>
> Key: SPARK-17897
> URL: https://issues.apache.org/jira/browse/SPARK-17897
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Jordan Halterman
>
> When a logical plan is built containing the following somewhat nonsensical 
> filter:
> {{Filter (NOT isnotnull($f0#212))}}
> During optimization the filter is converted into a condition that will always 
> fail:
> {{Filter (isnotnull($f0#212) && NOT isnotnull($f0#212))}}
> This appears to be caused by the following check for {{NullIntolerant}}:
> https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R63
> Which recurses through the expression and extracts nested {{IsNotNull}} 
> calls, converting them to {{IsNotNull}} calls on the attribute at the root 
> level:
> https://github.com/apache/spark/commit/df68beb85de59bb6d35b2a8a3b85dbc447798bf5#diff-203ac90583cebe29a92c1d812c07f102R49
> This results in the nonsensical condition above.
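For illustration, a minimal PySpark expression that produces a {{Filter (NOT 
isnotnull(...))}} node of the kind described above; the dataframe and the 
column name `f0` are assumptions, not taken from the report:

{code}
import pyspark.sql.functions as F

# df is any dataframe with a column named "f0"; the negated isNotNull filter
# is what feeds the problematic constraint-inference rewrite in the optimizer.
df.filter(~F.col("f0").isNotNull()).explain(True)
{code}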



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18460) Include triggerDetails in StreamingQueryStatus.json

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671328#comment-15671328
 ] 

Apache Spark commented on SPARK-18460:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15908

> Include triggerDetails in StreamingQueryStatus.json
> ---
>
> Key: SPARK-18460
> URL: https://issues.apache.org/jira/browse/SPARK-18460
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>







[jira] [Commented] (SPARK-18459) Rename triggerId to batchId in StreamingQueryStatus.triggerDetails

2016-11-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671327#comment-15671327
 ] 

Apache Spark commented on SPARK-18459:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15908

> Rename triggerId to batchId in StreamingQueryStatus.triggerDetails
> --
>
> Key: SPARK-18459
> URL: https://issues.apache.org/jira/browse/SPARK-18459
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>
> triggerId sounds like a number that should increase with every trigger, 
> whether or not there is any data in it. In reality, triggerId only increases 
> when a trigger contains a batch of data, so it is better to rename it to 
> batchId.






[jira] [Comment Edited] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671316#comment-15671316
 ] 

Herman van Hovell edited comment on SPARK-17977 at 11/16/16 7:07 PM:
-

[~aassudani] do you want to open a PR for this?


was (Author: hvanhovell):
[~aassudani] want to open a PR for this?

> DataFrameReader and DataStreamReader should have an ancestor class
> --
>
> Key: SPARK-17977
> URL: https://issues.apache.org/jira/browse/SPARK-17977
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Amit Assudani
>
> There should be an ancestor class of DataFrameReader and DataStreamReader to 
> configure common options / format and use common methods. Most of the methods 
> are exactly the same and take exactly the same arguments. This would help in 
> creating utilities / generic code that can be used for both stream and batch 
> applications.






[jira] [Commented] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671316#comment-15671316
 ] 

Herman van Hovell commented on SPARK-17977:
---

[~aassudani] want to open a PR for this?

> DataFrameReader and DataStreamReader should have an ancestor class
> --
>
> Key: SPARK-17977
> URL: https://issues.apache.org/jira/browse/SPARK-17977
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Amit Assudani
>
> There should be an ancestor class of DataFrameReader and DataStreamReader to 
> configure common options / format and use common methods. Most of the methods 
> are exactly the same and take exactly the same arguments. This would help in 
> creating utilities / generic code that can be used for both stream and batch 
> applications.






[jira] [Commented] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-16 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671314#comment-15671314
 ] 

Herman van Hovell commented on SPARK-18458:
---

Nice find!

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Priority: Critical
>  Labels: core, dump
>
> Running a query on a 100TB parquet database using the Spark master dated 11/04 
> dumps cores on Spark executors.
> The query is TPCDS query 82 (this query is not the only one that can produce 
> this core dump, just the easiest one with which to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at 

[jira] [Resolved] (SPARK-18461) Improve docs on StreamingQueryListener and StreamingQuery.status

2016-11-16 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-18461.
--
Resolution: Fixed

Issue resolved by pull request 15897
[https://github.com/apache/spark/pull/15897]

> Improve docs on StreamingQueryListener and StreamingQuery.status
> 
>
> Key: SPARK-18461
> URL: https://issues.apache.org/jira/browse/SPARK-18461
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Structured Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 2.1.0
>
>







[jira] [Commented] (SPARK-17977) DataFrameReader and DataStreamReader should have an ancestor class

2016-11-16 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671287#comment-15671287
 ] 

Michael Armbrust commented on SPARK-17977:
--

No, they were actually the same class for a while.  I would be fine with 
creating a trait that has the common methods.
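
To make the suggestion concrete, here is a rough sketch of such a trait (this is 
not the actual Spark API; the trait and member names are made up for 
illustration) that pulls the shared builder-style methods into one place:

{code:scala}
import org.apache.spark.sql.types.StructType
import scala.collection.mutable

// Hypothetical common ancestor for DataFrameReader / DataStreamReader.
trait CommonReader {
  protected var source: String = ""
  protected val extraOptions: mutable.HashMap[String, String] = mutable.HashMap.empty
  protected var userSpecifiedSchema: Option[StructType] = None

  // Shared builder methods; each returns this.type so chaining keeps the
  // concrete reader type.
  def format(source: String): this.type = { this.source = source; this }
  def option(key: String, value: String): this.type = { extraOptions += key -> value; this }
  def schema(schema: StructType): this.type = { userSpecifiedSchema = Option(schema); this }
}

// Both readers could then extend it, keeping their own load()/stream() methods:
//   class DataFrameReader  extends CommonReader { ... }
//   class DataStreamReader extends CommonReader { ... }
{code}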

> DataFrameReader and DataStreamReader should have an ancestor class
> --
>
> Key: SPARK-17977
> URL: https://issues.apache.org/jira/browse/SPARK-17977
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Amit Assudani
>
> There should be an ancestor class of DataFrameReader and DataStreamReader to 
> configure common options / format and use common methods. Most of the methods 
> are exactly the same and take exactly the same arguments. This would help in 
> creating utilities / generic code that can be used for both stream and batch 
> applications.






[jira] [Assigned] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18458:


Assignee: (was: Apache Spark)

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Priority: Critical
>  Labels: core, dump
>
> Running a query on a 100TB parquet database using the Spark master dated 11/04 
> dumps cores on Spark executors.
> The query is TPCDS query 82 (this query is not the only one that can produce 
> this core dump, just the easiest one with which to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at 

[jira] [Assigned] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18458:


Assignee: Apache Spark

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Assignee: Apache Spark
>Priority: Critical
>  Labels: core, dump
>
> Running a query on a 100TB parquet database using the Spark master dated 11/04 
> dumps cores on Spark executors.
> The query is TPCDS query 82 (this query is not the only one that can produce 
> this core dump, just the easiest one with which to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
> at 

[jira] [Commented] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-16 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671237#comment-15671237
 ] 

Kazuaki Ishizaki commented on SPARK-18458:
--

I worked with [~jfc...@us.ibm.com] and identified the root cause as an integer 
overflow in a Java expression.

When the SIGSEGV occurred, we found that {{dest}} was less than 0 at 
[RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L253-L254],
 although it should always be non-negative. This means that {{offsets\[bucket\]}} 
was less than 0.
Next, checking 
[transformCountsToOffsets()|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L148-L168]
 showed that {{outputOffset}} was less than 0.
Then, checking 
[RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L244-L245]
 showed that {{outIndex = 0x10??????}}. Since {{outIndex}} and {{8}} are both of 
type int, the multiplication is done in 32-bit arithmetic and the result is 
{{0x80??????}}, which is negative as an int. In other words, the multiplication 
overflows.

Here, the expression must be {{array.getBaseOffset() + outIndex * 8L}}.
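
To illustrate the overflow outside Spark, here is a small self-contained Scala 
sketch (the numbers are hypothetical, chosen only to match the reported 
0x10?????? magnitude) showing why {{outIndex * 8}} wraps negative while 
{{outIndex * 8L}} does not:

{code:scala}
object RadixSortOverflowDemo {
  def main(args: Array[String]): Unit = {
    val baseOffset: Long = 16L        // hypothetical array base offset
    val outIndex: Int = 0x10000000    // a large index of the reported magnitude

    // Int * Int is evaluated in 32-bit arithmetic: 0x10000000 * 8 = 0x80000000,
    // which wraps to -2147483648 before being widened to Long for the addition.
    val wrongOffset = baseOffset + outIndex * 8

    // Multiplying by the Long literal 8L widens outIndex first, so no overflow.
    val correctOffset = baseOffset + outIndex * 8L

    println(s"outIndex * 8  = ${outIndex * 8}")    // negative
    println(s"outIndex * 8L = ${outIndex * 8L}")   // positive
    println(s"wrong offset = $wrongOffset, correct offset = $correctOffset")
  }
}
{code}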

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Priority: Critical
>  Labels: core, dump
>
> Running a query on a 100TB parquet database using the Spark master dated 11/04 
> dumps cores on Spark executors.
> The query is TPCDS query 82 (this query is not the only one that can produce 
> this core dump, just the easiest one with which to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname 

[jira] [Issue Comment Deleted] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-16 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-18458:
-
Comment: was deleted

(was: I worked with [~jfc...@us.ibm.com]. Then, I identified that a root cause 
is an integer overflow in a Java expression.

When SIGSEGV occurred, we identified {{dest}} is less than 0 at 
[RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L253-L254].
 It should be greater than 0. It means that {{offsets\[bucket\]}} is less than 
0.
When we checked 
[transformCountsToOffsets()|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L148-L168],
 we knew that {{outputOffset}} is less than 0.
When we checked 
[RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L244-L245],
 we knew that {{outIndex = 0x10??????}}. Since {{outIndex}} and {{8}} are both 
of type int, the result of the multiplication is {{0x80??????}}, which is 
negative as an int.

Here, the expression must be {{array.getBaseOffset() + outIndex * 8L}}.)

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Priority: Critical
>  Labels: core, dump
>
> Running a query on a 100TB parquet database using the Spark master dated 11/04 
> dumps cores on Spark executors.
> The query is TPCDS query 82 (this query is not the only one that can produce 
> this core dump, just the easiest one with which to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> 

[jira] [Commented] (SPARK-18458) core dumped running Spark SQL on large data volume (100TB)

2016-11-16 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671229#comment-15671229
 ] 

Kazuaki Ishizaki commented on SPARK-18458:
--

I worked with [~jfc...@us.ibm.com]. Then, I identified that a root cause is an 
integer overflow in a Java expression.

When SIGSEGV occurred, we identified {{dest}} is less than 0 at 
[RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L253-L254].
 It should be greater than 0. It means that {{offsets\[bucket\]}} is less than 
0.
When we checked 
[transformCountsToOffsets()|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L148-L168],
 we knew that {{outputOffset}} is less than 0.
When we checked 
[RadixSort.java|https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/RadixSort.java#L244-L245],
 we knew that {{outIndex = 0x10??????}}. Since {{outIndex}} and {{8}} are both 
of type int, the result of the multiplication is {{0x80??????}}, which is 
negative as an int.

Here, the expression must be {{array.getBaseOffset() + outIndex * 8L}}.

> core dumped running Spark SQL on large data volume (100TB)
> --
>
> Key: SPARK-18458
> URL: https://issues.apache.org/jira/browse/SPARK-18458
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: JESSE CHEN
>Priority: Critical
>  Labels: core, dump
>
> Running a query on a 100TB parquet database using the Spark master dated 11/04 
> dumps cores on Spark executors.
> The query is TPCDS query 82 (this query is not the only one that can produce 
> this core dump, just the easiest one with which to re-create the error).
> Spark output that showed the exception:
> {noformat}
> 16/11/14 10:38:51 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_e68_1478924651089_0018_01_74 on 
> host: mer05x.svl.ibm.com. Exit status: 134. Diagnostics: Exception from 
> container-launch.
> Container id: container_e68_1478924651089_0018_01_74
> Exit code: 134
> Exception message: /bin/bash: line 1: 4031216 Aborted (core 
> dumped) /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java 
> -server -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/__app__.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.databricks_spark-csv_2.10-1.3.0.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/org.apache.commons_commons-csv-1.1.jar
>  --user-class-path 
> file:/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/com.univocity_univocity-parsers-1.5.1.jar
>  > 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stdout
>  2> 
> /data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/stderr
> Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 4031216 
> Aborted (core dumped) 
> /usr/jdk64/java-1.8.0-openjdk-1.8.0.77-0.b03.el7_2.x86_64/bin/java -server 
> -Xmx24576m 
> -Djava.io.tmpdir=/data4/hadoop/yarn/local/usercache/spark/appcache/application_1478924651089_0018/container_e68_1478924651089_0018_01_74/tmp
>  '-Dspark.history.ui.port=18080' '-Dspark.driver.port=39855' 
> -Dspark.yarn.app.container.log.dir=/data4/hadoop/yarn/log/application_1478924651089_0018/container_e68_1478924651089_0018_01_74
>  -XX:OnOutOfMemoryError='kill %p' 
> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
> spark://CoarseGrainedScheduler@192.168.10.101:39855 --executor-id 73 
> --hostname mer05x.svl.ibm.com --cores 2 --app-id 
> application_1478924651089_0018 --user-class-path 
> 

[jira] [Commented] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-16 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671227#comment-15671227
 ] 

Emlyn Corrin commented on SPARK-18172:
--

It occurs on 2.0.1 and 2.0.2 (on Mac, installed via Homebrew), but I'd need 
more time to compile a more recent version and test that.
Since it was literally just the snippet above that triggered it, I think it's 
probably fixed if you can't reproduce it.
Maybe it is the same underlying problem as SPARK-18300, and the fix for that 
fixed this too?
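
Since the original report notes that similar JVM-side code fails the same way, 
here is a rough Scala equivalent of the pyspark snippet quoted below (the local 
setup and column names are assumed for illustration, not taken from the report):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{countDistinct, first}

object FirstWithDistinctRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
    import spark.implicits._

    // Same shape of data as the pyspark example: three rows, three int columns.
    val ds = Seq((1, 1, 2), (1, 2, 3), (1, 3, 4)).toDF("_1", "_2", "_3")

    // Combining first() with two countDistinct() aggregates is what reportedly
    // triggers the AnalysisException on 2.0.1; dropping any one of the three
    // aggregates reportedly makes it work.
    ds.groupBy($"_1")
      .agg(first($"_2"), countDistinct($"_2"), countDistinct($"_2", $"_3"))
      .show()

    spark.stop()
  }
}
{code}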


> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (but it's not restricted to Python; similar code in Java fails 
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> 
