[jira] [Resolved] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects

2017-09-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-22018.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 19240
[https://github.com/apache/spark/pull/19240]

> Catalyst Optimizer does not preserve top-level metadata while collapsing 
> projects
> -
>
> Key: SPARK-22018
> URL: https://issues.apache.org/jira/browse/SPARK-22018
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 3.0.0
>
>
> If there are two Projects as follows:
> {code}
> Project [a_with_metadata#27 AS b#26]
> +- Project [a#0 AS a_with_metadata#27]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
> The child Project has an output column with metadata attached, and the parent 
> Project has an alias that implicitly forwards that metadata, so the metadata 
> is visible to higher operators. Upon applying the CollapseProject optimizer 
> rule, the metadata is not preserved:
> {code}
> Project [a#0 AS b#26]
> +- LocalRelation <empty>, [a#0, b#1]
> {code}
> This is incorrect, as downstream operators that rely on certain metadata (e.g. 
> the watermark in Structured Streaming) to identify certain fields will fail to 
> do so.
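For reference, a minimal Java sketch (not from the report; the data, column names and metadata key are made up) of how column-level metadata is attached via Column.as(alias, metadata) and then implicitly forwarded through a plain alias, which is the propagation that CollapseProject needs to keep intact:

{code:java}
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.MetadataBuilder;
import static org.apache.spark.sql.functions.col;

public class MetadataForwardingSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[*]").appName("metadata-sketch").getOrCreate();

    // A tiny relation with columns a and b, mirroring the LocalRelation in the plan above.
    Dataset<Row> df = spark.range(1).selectExpr("id AS a", "id AS b");

    // Attach metadata to column "a" through an explicit alias.
    Metadata md = new MetadataBuilder().putString("note", "example metadata").build();
    Dataset<Row> withMd = df.select(col("a").as("a_with_metadata", md));

    // A second projection that merely re-aliases the column; the alias implicitly
    // forwards the child's metadata, which is what the optimizer rule must preserve.
    Dataset<Row> aliased = withMd.select(col("a_with_metadata").as("b"));

    System.out.println(aliased.schema().apply("b").metadata());
    spark.stop();
  }
}
{code}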






[jira] [Created] (SPARK-22019) JavaBean int type property

2017-09-14 Thread taiho choi (JIRA)
taiho choi created SPARK-22019:
--

 Summary: JavaBean int type property 
 Key: SPARK-22019
 URL: https://issues.apache.org/jira/browse/SPARK-22019
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: taiho choi


When the type of SampleData's id is int, the following code generates an error; 
when it is long, it works fine.
 


{code:java}
@Test
public void testDataSet2() {
    ArrayList<String> arr = new ArrayList<>();

    arr.add("{\"str\": \"everyone\", \"id\": 1}");
    arr.add("{\"str\": \"Hello\", \"id\": 1}");

    // 1. read the array and turn it into a Dataset of strings.
    JavaRDD<String> data = sc.parallelize(arr);
    Dataset<String> stringdataset = sqc.createDataset(data.rdd(), Encoders.STRING());
    stringdataset.show(); // PASS

    // 2. convert the string dataset to a SampleData dataset.
    Dataset<SampleData> df = sqc.read().json(stringdataset).as(Encoders.bean(SampleData.class));
    df.show();        // PASS
    df.printSchema(); // PASS

    Dataset<SampleDataFlat> fad = df.flatMap(SampleDataFlat::flatMap, Encoders.bean(SampleDataFlat.class));
    fad.show(); // ERROR
    fad.printSchema();
}

public static class SampleData implements Serializable {
    public String getStr() {
        return str;
    }

    public void setStr(String str) {
        this.str = str;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    String str;
    int id;
}

public static class SampleDataFlat {
    String str;

    public String getStr() {
        return str;
    }

    public void setStr(String str) {
        this.str = str;
    }

    public SampleDataFlat(String str, long id) {
        this.str = str;
    }

    public static Iterator<SampleDataFlat> flatMap(SampleData data) {
        ArrayList<SampleDataFlat> arr = new ArrayList<>();
        arr.add(new SampleDataFlat(data.getStr(), data.getId()));
        arr.add(new SampleDataFlat(data.getStr(), data.getId() + 1));
        arr.add(new SampleDataFlat(data.getStr(), data.getId() + 2));

        return arr.iterator();
    }
}
{code}




==Error message==
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 38, Column 16: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 38, 
Column 16: No applicable constructor/method found for actual parameters "long"; 
candidates are: "public void SparkUnitTest$SampleData.setId(int)"

/* 024 */   public java.lang.Object apply(java.lang.Object _i) {
/* 025 */ InternalRow i = (InternalRow) _i;
/* 026 */
/* 027 */ final SparkUnitTest$SampleData value1 = false ? null : new 
SparkUnitTest$SampleData();
/* 028 */ this.javaBean = value1;
/* 029 */ if (!false) {
/* 030 */
/* 031 */
/* 032 */   boolean isNull3 = i.isNullAt(0);
/* 033 */   long value3 = isNull3 ? -1L : (i.getLong(0));
/* 034 */
/* 035 */   if (isNull3) {
/* 036 */ throw new NullPointerException(((java.lang.String) 
references[0]));
/* 037 */   }
/* 038 */   javaBean.setId(value3);
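A possible workaround sketch (mine, not from the report): the JSON reader infers "id" as a long, so it can be cast back to int (or the JSON read with an explicit schema) before the bean encoder is applied, so that the generated call matches setId(int):

{code:java}
// Hypothetical workaround, meant to replace step 2 of the test above:
// cast the inferred bigint column "id" down to int before binding the bean encoder.
Dataset<SampleData> df = sqc.read().json(stringdataset)
        .withColumn("id", org.apache.spark.sql.functions.col("id").cast("int"))
        .as(Encoders.bean(SampleData.class));
{code}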








[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-14 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167280#comment-16167280
 ] 

Jia-Xuan Liu edited comment on SPARK-21994 at 9/15/17 3:40 AM:
---

I also can't reproduce this with the Spark 2.2 release. I think this may not be 
a problem in Spark.

{code:java}
Spark context available as 'sc' (master = local[*], app id = 
local-1505446512312).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
  /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++
scala> df.write.format("parquet").saveAsTable("test.spark22_test_2")
scala> spark.sql("select * from test.spark22_test_2").show()
++
|databaseName|
++
| default|
|test|
++
{code}



was (Author: goldmedal):
I also can't reproduce this in Spark 2.2 release.

{code:java}
Spark context available as 'sc' (master = local[*], app id = 
local-1505446512312).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
  /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++
scala> df.write.format("parquet").saveAsTable("test.spark22_test_2")
scala> spark.sql("select * from test.spark22_test_2").show()
++
|databaseName|
++
| default|
|test|
++
{code}


> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.
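One way to check whether the path property actually reached the metastore (a suggestion, not part of the original report; the table name is taken from the repro above):

{code:java}
// Hypothetical diagnostic: inspect the metastore entry for the table created above
// and look for a "path" entry under the storage/serde properties.
spark.sql("DESCRIBE FORMATTED test.spark22_test").show(100, false);
{code}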




[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-14 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167280#comment-16167280
 ] 

Jia-Xuan Liu commented on SPARK-21994:
--

I also can't reproduce this with the Spark 2.2 release.

{code:java}
Spark context available as 'sc' (master = local[*], app id = 
local-1505446512312).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
  /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++
scala> df.write.format("parquet").saveAsTable("test.spark22_test_2")
scala> spark.sql("select * from test.spark22_test_2").show()
++
|databaseName|
++
| default|
|test|
++
{code}


> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.






[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-14 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276
 ] 

Xiayun Sun edited comment on SPARK-21994 at 9/15/17 3:31 AM:
-

Looks like it's not a problem in the latest master build (commit a28728a, version 
2.3.0-SNAPSHOT):


{code:java}

Spark context available as 'sc' (master = local[*], app id = 
local-1505446109004).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.


scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++


scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
| default|
|test|
++


{code}



was (Author: xiayunsun):
I'm unable to reproduce this for latest master build (commit a28728a, version 
2.3.0-SNAPSHOT)


{code:java}

Spark context available as 'sc' (master = local[*], app id = 
local-1505446109004).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.


scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++


scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
| default|
|test|
++


{code}


> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version 

[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-14 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276
 ] 

Xiayun Sun edited comment on SPARK-21994 at 9/15/17 3:30 AM:
-

I'm unable to reproduce this with the latest master build (commit a28728a, version 
2.3.0-SNAPSHOT):


{code:java}

Spark context available as 'sc' (master = local[*], app id = 
local-1505446109004).
Spark session available as 'spark'.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0-SNAPSHOT
  /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.


scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++


scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
| default|
|test|
++


{code}



was (Author: xiayunsun):
I'm unable to reproduce this for latest master build (commit a28728a, version 
2.3.0-SNAPSHOT)


{code:java}


scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++


scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
| default|
|test|
++


{code}


> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.






[jira] [Comment Edited] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-14 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276
 ] 

Xiayun Sun edited comment on SPARK-21994 at 9/15/17 3:29 AM:
-

I'm unable to reproduce this with the latest master build (commit a28728a, version 
2.3.0-SNAPSHOT):


{code:java}


scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++


scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
| default|
|test|
++


{code}



was (Author: xiayunsun):
I'm unable to reproduce this for latest master build (commit a28728a, version 
2.3.0-SNAPSHOT)

{{

scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++


scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
| default|
|test|
++

}}

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.






[jira] [Commented] (SPARK-21994) Spark 2.2 can not read Parquet table created by itself

2017-09-14 Thread Xiayun Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167276#comment-16167276
 ] 

Xiayun Sun commented on SPARK-21994:


I'm unable to reproduce this with the latest master build (commit a28728a, version 
2.3.0-SNAPSHOT):

{code:java}

scala> spark.sql("create database test")
res0: org.apache.spark.sql.DataFrame = []

scala> val df = spark.sql("show databases")
df: org.apache.spark.sql.DataFrame = [databaseName: string]

scala> df.show()
++
|databaseName|
++
| default|
|test|
++


scala> df.write.format("parquet").saveAsTable("test.spark22_test")

scala> spark.sql("select * from test.spark22_test").show()
++
|databaseName|
++
| default|
|test|
++

{code}

> Spark 2.2 can not read Parquet table created by itself
> --
>
> Key: SPARK-21994
> URL: https://issues.apache.org/jira/browse/SPARK-21994
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Spark 2.2 on Cloudera CDH 5.10.1, Hive 1.1
>Reporter: Jurgis Pods
>
> This seems to be a new bug introduced in Spark 2.2, since it did not occur 
> under Spark 2.1.
> When writing a dataframe to a table in Parquet format, Spark SQL does not 
> write the 'path' of the table to the Hive metastore, unlike in previous 
> versions.
> As a consequence, Spark 2.2 is not able to read the table it just created. It 
> just outputs the table header without any row content. 
> A parallel installation of Spark 1.6 at least produces an appropriate error 
> trace:
> {code:java}
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Version information not found 
> in metastore. hive.metastore.schema.verification is not enabled so recording 
> the schema version 1.1.0
> 17/09/13 10:22:12 WARN metastore.ObjectStore: Failed to get database default, 
> returning NoSuchObjectException
> org.spark-project.guava.util.concurrent.UncheckedExecutionException: 
> java.util.NoSuchElementException: key not found: path
> [...]
> {code}
> h3. Steps to reproduce:
> Run the following in spark2-shell:
> {code:java}
> scala> val df = spark.sql("show databases")
> scala> df.show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> scala> df.write.format("parquet").saveAsTable("test.spark22_test")
> scala> spark.sql("select * from test.spark22_test").show()
> ++
> |databaseName|
> ++
> ++{code}
> When manually setting the path, it works:
> {code:java}
> scala> df.write.option("path", 
> "/hadoop/eco/hive/warehouse/test.db/spark22_parquet_with_path").format("parquet").saveAsTable("test.spark22_parquet_with_path")
> scala> spark.sql("select * from test.spark22_parquet_with_path").show()
> ++
> |databaseName|
> ++
> |   mydb1|
> |   mydb2|
> | default|
> |test|
> ++
> {code}
> It is kind of a disaster that we are not able to read tables created by the 
> very same Spark version and have to manually specify the path as an explicit 
> option.






[jira] [Commented] (SPARK-19459) ORC tables cannot be read when they contain char/varchar columns

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167258#comment-16167258
 ] 

Apache Spark commented on SPARK-19459:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19235

> ORC tables cannot be read when they contain char/varchar columns
> 
>
> Key: SPARK-19459
> URL: https://issues.apache.org/jira/browse/SPARK-19459
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.1.1, 2.2.0
>
>
> Reading from an ORC table which contains char/varchar columns can fail if the 
> table has been created using Spark. This is caused by the fact that Spark 
> internally replaces char and varchar columns with a string column; this 
> causes the ORC reader to use the wrong reader, which eventually results in a 
> ClassCastException.
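A minimal repro sketch along those lines (my own; the table name and values are illustrative, patterned after the Parquet example in SPARK-21997):

{code:java}
// Hypothetical repro: create an ORC table with CHAR/VARCHAR columns via Spark,
// then read it back; on affected versions this can end in a ClassCastException.
spark.sql("CREATE TABLE t_orc_char(a CHAR(10), b VARCHAR(10)) STORED AS orc");
spark.sql("INSERT INTO TABLE t_orc_char SELECT 'a', 'b'");
spark.sql("SELECT * FROM t_orc_char").show();
{code}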






[jira] [Commented] (SPARK-14387) Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167257#comment-16167257
 ] 

Apache Spark commented on SPARK-14387:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19235

> Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc
> -
>
> Key: SPARK-14387
> URL: https://issues.apache.org/jira/browse/SPARK-14387
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Rajesh Balamohan
>
> In the master branch, I tried to run TPC-DS queries (e.g. Query27) at 200 GB 
> scale. Initially I got the following exception (as FileScanRDD has been made 
> the default in the master branch):
> {noformat}
> 16/04/04 06:49:55 WARN TaskSetManager: Lost task 0.0 in stage 15.0. 
> java.lang.IllegalArgumentException: Field "s_store_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:157)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource$$anonfun$buildReader$2.apply(OrcRelation.scala:146)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:69)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:60)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$6$$anon$1.hasNext(WholeStageCodegen.scala:361)
> {noformat}
> When running with "spark.sql.sources.fileScan=false", the following exception 
> is thrown:
> {noformat}
> 16/04/04 09:02:00 ERROR SparkExecuteStatementOperation: Error executing 
> query, currentState RUNNING,
> java.lang.IllegalArgumentException: Field "cd_demo_sk" does not exist.
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:236)
> at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
> at scala.collection.AbstractMap.getOrElse(Map.scala:59)
> at 
> org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:235)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$$anonfun$13.apply(OrcRelation.scala:410)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:94)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:94)
> at 
> org.apache.spark.sql.hive.orc.OrcRelation$.setRequiredColumns(OrcRelation.scala:410)
> at 
> org.apache.spark.sql.hive.orc.OrcTableScan.execute(OrcRelation.scala:317)
> at 
> org.apache.spark.sql.hive.orc.DefaultSource.buildInternalScan(OrcRelation.scala:124)
>   

[jira] [Commented] (SPARK-22000) org.codehaus.commons.compiler.CompileException: toString method is not declared

2017-09-14 Thread taiho choi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167248#comment-16167248
 ] 

taiho choi commented on SPARK-22000:


@members, my code generates this error message, but I cannot put together sample 
code that reproduces this issue.

> org.codehaus.commons.compiler.CompileException: toString method is not 
> declared
> ---
>
> Key: SPARK-22000
> URL: https://issues.apache.org/jira/browse/SPARK-22000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> The error message says that toString is not declared on "value13", which has 
> the primitive "long" type in the generated code.
> I think value13 should be of the boxed Long type.
> ==Error message==
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 70, Column 32: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 70, Column 32: A method named "toString" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 033 */   private void apply1_2(InternalRow i) {
> /* 034 */
> /* 035 */
> /* 036 */ boolean isNull11 = i.isNullAt(1);
> /* 037 */ UTF8String value11 = isNull11 ? null : (i.getUTF8String(1));
> /* 038 */ boolean isNull10 = true;
> /* 039 */ java.lang.String value10 = null;
> /* 040 */ if (!isNull11) {
> /* 041 */
> /* 042 */   isNull10 = false;
> /* 043 */   if (!isNull10) {
> /* 044 */
> /* 045 */ Object funcResult4 = null;
> /* 046 */ funcResult4 = value11.toString();
> /* 047 */
> /* 048 */ if (funcResult4 != null) {
> /* 049 */   value10 = (java.lang.String) funcResult4;
> /* 050 */ } else {
> /* 051 */   isNull10 = true;
> /* 052 */ }
> /* 053 */
> /* 054 */
> /* 055 */   }
> /* 056 */ }
> /* 057 */ javaBean.setApp(value10);
> /* 058 */
> /* 059 */
> /* 060 */ boolean isNull13 = i.isNullAt(12);
> /* 061 */ long value13 = isNull13 ? -1L : (i.getLong(12));
> /* 062 */ boolean isNull12 = true;
> /* 063 */ java.lang.String value12 = null;
> /* 064 */ if (!isNull13) {
> /* 065 */
> /* 066 */   isNull12 = false;
> /* 067 */   if (!isNull12) {
> /* 068 */
> /* 069 */ Object funcResult5 = null;
> /* 070 */ funcResult5 = value13.toString();
> /* 071 */
> /* 072 */ if (funcResult5 != null) {
> /* 073 */   value12 = (java.lang.String) funcResult5;
> /* 074 */ } else {
> /* 075 */   isNull12 = true;
> /* 076 */ }
> /* 077 */
> /* 078 */
> /* 079 */   }
> /* 080 */ }
> /* 081 */ javaBean.setReasonCode(value12);
> /* 082 */
> /* 083 */   }
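For context, a small standalone Java sketch (mine, not from the generated code) of why calling toString on a primitive long fails and what does compile:

{code:java}
public class PrimitiveToStringSketch {
  public static void main(String[] args) {
    long value13 = 42L;
    // value13.toString();                       // does not compile: primitives have no methods
    String viaValueOf = String.valueOf(value13); // works
    String viaBoxed = Long.toString(value13);    // works
    System.out.println(viaValueOf + " " + viaBoxed);
  }
}
{code}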






[jira] [Updated] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet

2017-09-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21997:
--
Description: 
SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different 
results according to the SQL configuration, 
*spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the 
default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is 
wrong by default.

{code}
scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
scala> sql("SELECT * FROM t_char").show
+---+---+
|  a|  b|
+---+---+
|  a|  b|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")

scala> sql("SELECT * FROM t_char").show
+--+---+
| a|  b|
+--+---+
|a |  b|
+--+---+
{code}

  was:
SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different 
results according to the SQL configuration, 
*spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the 
default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is 
wrong by default.

For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
SPARK-19459 didn't resolve this together. For ORC, it will happen if we turn on 
it `true`.

{code}
scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
scala> sql("SELECT * FROM t_char").show
+---+---+
|  a|  b|
+---+---+
|  a|  b|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")

scala> sql("SELECT * FROM t_char").show
+--+---+
| a|  b|
+--+---+
|a |  b|
+--+---+
{code}


> Spark shows different results on char/varchar columns on Parquet
> 
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> {code}
> scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
> scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> {code}






[jira] [Assigned] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22018:


Assignee: Apache Spark  (was: Tathagata Das)

> Catalyst Optimizer does not preserve top-level metadata while collapsing 
> projects
> -
>
> Key: SPARK-22018
> URL: https://issues.apache.org/jira/browse/SPARK-22018
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> If there are two Projects as follows:
> {code}
> Project [a_with_metadata#27 AS b#26]
> +- Project [a#0 AS a_with_metadata#27]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
> The child Project has an output column with metadata attached, and the parent 
> Project has an alias that implicitly forwards that metadata, so the metadata 
> is visible to higher operators. Upon applying the CollapseProject optimizer 
> rule, the metadata is not preserved:
> {code}
> Project [a#0 AS b#26]
> +- LocalRelation <empty>, [a#0, b#1]
> {code}
> This is incorrect, as downstream operators that rely on certain metadata (e.g. 
> the watermark in Structured Streaming) to identify certain fields will fail to 
> do so.






[jira] [Assigned] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22018:


Assignee: Tathagata Das  (was: Apache Spark)

> Catalyst Optimizer does not preserve top-level metadata while collapsing 
> projects
> -
>
> Key: SPARK-22018
> URL: https://issues.apache.org/jira/browse/SPARK-22018
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> If there are two Projects as follows:
> {code}
> Project [a_with_metadata#27 AS b#26]
> +- Project [a#0 AS a_with_metadata#27]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
> The child Project has an output column with metadata attached, and the parent 
> Project has an alias that implicitly forwards that metadata, so the metadata 
> is visible to higher operators. Upon applying the CollapseProject optimizer 
> rule, the metadata is not preserved:
> {code}
> Project [a#0 AS b#26]
> +- LocalRelation <empty>, [a#0, b#1]
> {code}
> This is incorrect, as downstream operators that rely on certain metadata (e.g. 
> the watermark in Structured Streaming) to identify certain fields will fail to 
> do so.






[jira] [Commented] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167166#comment-16167166
 ] 

Apache Spark commented on SPARK-22018:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/19240

> Catalyst Optimizer does not preserve top-level metadata while collapsing 
> projects
> -
>
> Key: SPARK-22018
> URL: https://issues.apache.org/jira/browse/SPARK-22018
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Structured Streaming
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> If there are two Projects as follows:
> {code}
> Project [a_with_metadata#27 AS b#26]
> +- Project [a#0 AS a_with_metadata#27]
>    +- LocalRelation <empty>, [a#0, b#1]
> {code}
> The child Project has an output column with metadata attached, and the parent 
> Project has an alias that implicitly forwards that metadata, so the metadata 
> is visible to higher operators. Upon applying the CollapseProject optimizer 
> rule, the metadata is not preserved:
> {code}
> Project [a#0 AS b#26]
> +- LocalRelation <empty>, [a#0, b#1]
> {code}
> This is incorrect, as downstream operators that rely on certain metadata (e.g. 
> the watermark in Structured Streaming) to identify certain fields will fail to 
> do so.






[jira] [Created] (SPARK-22018) Catalyst Optimizer does not preserve top-level metadata while collapsing projects

2017-09-14 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-22018:
-

 Summary: Catalyst Optimizer does not preserve top-level metadata 
while collapsing projects
 Key: SPARK-22018
 URL: https://issues.apache.org/jira/browse/SPARK-22018
 Project: Spark
  Issue Type: Bug
  Components: Optimizer, Structured Streaming
Affects Versions: 2.2.0, 2.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das


If there are two Projects as follows:
{code}
Project [a_with_metadata#27 AS b#26]
+- Project [a#0 AS a_with_metadata#27]
   +- LocalRelation <empty>, [a#0, b#1]
{code}

The child Project has an output column with metadata attached, and the parent 
Project has an alias that implicitly forwards that metadata, so the metadata is 
visible to higher operators. Upon applying the CollapseProject optimizer rule, the 
metadata is not preserved:

{code}
Project [a#0 AS b#26]
+- LocalRelation <empty>, [a#0, b#1]
{code}

This is incorrect, as downstream operators that rely on certain metadata (e.g. the 
watermark in Structured Streaming) to identify certain fields will fail to do so.







[jira] [Commented] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-09-14 Thread Ihor Bobak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167033#comment-16167033
 ] 

Ihor Bobak commented on SPARK-21523:


For us this is also a critical issue: we are facing the same problem.

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need to merge this breeze bugfix into Spark because it influences a series 
> of algorithms in MLlib which use LBFGS.
> https://github.com/scalanlp/breeze/pull/651






[jira] [Commented] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16167020#comment-16167020
 ] 

Apache Spark commented on SPARK-22017:
--

User 'joseph-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/19239

> watermark evaluation with multi-input stream operators is unspecified
> -
>
> Key: SPARK-22017
> URL: https://issues.apache.org/jira/browse/SPARK-22017
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>
> Watermarks are stored as a single value in StreamExecution. If a query has 
> multiple watermark nodes (which can generally only happen with multi input 
> operators like union), a headOption call will arbitrarily pick one to use as 
> the real one. This will happen independently in each batch, possibly leading 
> to strange and undefined behavior.
> We should instead choose the minimum from all watermark exec nodes as the 
> query-wide watermark.






[jira] [Assigned] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22017:


Assignee: (was: Apache Spark)

> watermark evaluation with multi-input stream operators is unspecified
> -
>
> Key: SPARK-22017
> URL: https://issues.apache.org/jira/browse/SPARK-22017
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>
> Watermarks are stored as a single value in StreamExecution. If a query has 
> multiple watermark nodes (which can generally only happen with multi input 
> operators like union), a headOption call will arbitrarily pick one to use as 
> the real one. This will happen independently in each batch, possibly leading 
> to strange and undefined behavior.
> We should instead choose the minimum from all watermark exec nodes as the 
> query-wide watermark.






[jira] [Assigned] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22017:


Assignee: Apache Spark

> watermark evaluation with multi-input stream operators is unspecified
> -
>
> Key: SPARK-22017
> URL: https://issues.apache.org/jira/browse/SPARK-22017
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>
> Watermarks are stored as a single value in StreamExecution. If a query has 
> multiple watermark nodes (which can generally only happen with multi input 
> operators like union), a headOption call will arbitrarily pick one to use as 
> the real one. This will happen independently in each batch, possibly leading 
> to strange and undefined behavior.
> We should instead choose the minimum from all watermark exec nodes as the 
> query-wide watermark.






[jira] [Created] (SPARK-22017) watermark evaluation with multi-input stream operators is unspecified

2017-09-14 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22017:
---

 Summary: watermark evaluation with multi-input stream operators is 
unspecified
 Key: SPARK-22017
 URL: https://issues.apache.org/jira/browse/SPARK-22017
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Jose Torres


Watermarks are stored as a single value in StreamExecution. If a query has 
multiple watermark nodes (which can generally only happen with multi-input 
operators like union), a headOption call will arbitrarily pick one to use as 
the real one. This will happen independently in each batch, possibly leading to 
strange and undefined behavior.

We should instead choose the minimum from all watermark exec nodes as the 
query-wide watermark.
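A tiny sketch of that choice (illustrative only; the list of per-node watermarks is a hypothetical input, not an existing API):

{code:java}
import java.util.List;

final class WatermarkSketch {
  // Hypothetical helper: given the event-time watermark reported by each watermark
  // exec node (in ms), use the minimum as the query-wide watermark so late data
  // from the slower input is not dropped prematurely.
  static long queryWideWatermark(List<Long> perNodeWatermarksMs) {
    return perNodeWatermarksMs.stream().mapToLong(Long::longValue).min().orElse(0L);
  }
}
{code}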






[jira] [Assigned] (SPARK-22016) Add HiveDialect for JDBC connection to Hive

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22016:


Assignee: Apache Spark

> Add HiveDialect for JDBC connection to Hive
> ---
>
> Key: SPARK-22016
> URL: https://issues.apache.org/jira/browse/SPARK-22016
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.2.1
>Reporter: Daniel Fernandez
>Assignee: Apache Spark
>
> I found out there is no Dialect for Hive in Spark. So, I would like to add 
> HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it.
> Only two functions should be overridden:
> * canHandle
> * quoteIdentifier






[jira] [Commented] (SPARK-22016) Add HiveDialect for JDBC connection to Hive

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166988#comment-16166988
 ] 

Apache Spark commented on SPARK-22016:
--

User 'danielfx90' has created a pull request for this issue:
https://github.com/apache/spark/pull/19238

> Add HiveDialect for JDBC connection to Hive
> ---
>
> Key: SPARK-22016
> URL: https://issues.apache.org/jira/browse/SPARK-22016
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.2.1
>Reporter: Daniel Fernandez
>
> I found out there is no Dialect for Hive in Spark. So, I would like to add 
> HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it.
> Only two functions should be overridden:
> * canHandle
> * quoteIdentifier






[jira] [Assigned] (SPARK-22016) Add HiveDialect for JDBC connection to Hive

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22016:


Assignee: (was: Apache Spark)

> Add HiveDialect for JDBC connection to Hive
> ---
>
> Key: SPARK-22016
> URL: https://issues.apache.org/jira/browse/SPARK-22016
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.3, 2.2.1
>Reporter: Daniel Fernandez
>
> I found out there is no Dialect for Hive in Spark. So, I would like to add 
> HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it.
> Only two functions should be overridden:
> * canHandle
> * quoteIdentifier






[jira] [Created] (SPARK-22016) Add HiveDialect for JDBC connection to Hive

2017-09-14 Thread Daniel Fernandez (JIRA)
Daniel Fernandez created SPARK-22016:


 Summary: Add HiveDialect for JDBC connection to Hive
 Key: SPARK-22016
 URL: https://issues.apache.org/jira/browse/SPARK-22016
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.3, 2.2.1
Reporter: Daniel Fernandez


I found out there is no Dialect for Hive in Spark. So, I would like to add 
HiveDialect.scala in the package org.apache.spark.sql.jdbc to support it.

Only two functions should be overridden (a rough sketch follows the list below):
* canHandle
* quoteIdentifier
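A rough sketch of the idea, written here in Java against the public JdbcDialect API (the URL prefix and backtick quoting are assumptions on my part, not the final implementation):

{code:java}
import org.apache.spark.sql.jdbc.JdbcDialect;

// Hypothetical sketch of a Hive dialect; the real HiveDialect.scala may differ.
public class HiveDialectSketch extends JdbcDialect {
  @Override
  public boolean canHandle(String url) {
    // Assumption: Hive JDBC URLs start with "jdbc:hive2".
    return url.startsWith("jdbc:hive2");
  }

  @Override
  public String quoteIdentifier(String colName) {
    // Hive quotes identifiers with backticks rather than double quotes.
    return "`" + colName + "`";
  }
}

// Registration would happen once at startup, e.g.:
// org.apache.spark.sql.jdbc.JdbcDialects.registerDialect(new HiveDialectSketch());
{code}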






[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-14 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166945#comment-16166945
 ] 

Michael N commented on SPARK-21999:
---

Some related questions that would provide insight into this issue:

1. Based on the stack trace, what triggered Spark to serialize the 
application objects on the master node?

2. Why does Spark do this type of serialization asynchronously?

3. What other conditions trigger Spark to do this type of serialization 
asynchronously?

4. Can it be configured to do this type of serialization synchronously instead?

5. Excluding application code, what list objects does Spark use that would be 
part of this type of serialization?

Thanks.

Michael,



> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. Instead, Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> 

[jira] [Resolved] (SPARK-21988) Add default stats to StreamingExecutionRelation

2017-09-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21988.
--
   Resolution: Fixed
 Assignee: Jose Torres
Fix Version/s: 2.3.0

> Add default stats to StreamingExecutionRelation
> ---
>
> Key: SPARK-21988
> URL: https://issues.apache.org/jira/browse/SPARK-21988
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
> Fix For: 2.3.0
>
>
> StreamingExecutionRelation currently doesn't implement stats.
> This makes some sense, but unfortunately the LeafNode contract requires that 
> nodes which survive analysis implement stats, and StreamingExecutionRelation 
> can indeed survive analysis when running explain() on a streaming dataframe.
> This value won't ever be used during execution, because 
> StreamingExecutionRelation does *not* survive analysis on the execution path; 
> it's replaced with each batch.
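
For reference, a minimal sketch of the kind of call path the description refers to 
(my own illustration using the built-in rate source, not taken from the patch): 
explain() analyzes the plan without executing it, so the streaming leaf node must 
satisfy the LeafNode contract, stats included, even though it is never executed in 
that form.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("streaming-explain-demo")
  .getOrCreate()

// A streaming Dataset backed by the built-in rate source.
val stream = spark.readStream.format("rate").load()

// Triggers analysis of the streaming plan without starting a query.
stream.explain(true)
{code}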



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21987:


Assignee: (was: Apache Spark)

> Spark 2.3 cannot read 2.2 event logs
> 
>
> Key: SPARK-21987
> URL: https://issues.apache.org/jira/browse/SPARK-21987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> Reported by [~jincheng] in a comment in SPARK-18085:
> {noformat}
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
> Unrecognized field "metadata" (class 
> org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 
> known properties: "simpleString", "nodeName", "children", "metrics"])
>  at [Source: 
> {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
>  at 
> NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
>  
> Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
>  Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
> array\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- 
> LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange 
> RoundRobinPartitioning(200)\n+- Scan 
> ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
>  
> RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
>  
> ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
>  of output 
> rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
>  size total (min, med, 
> max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
> 1, column: 1622] (through reference chain: 
> org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
>   at 
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
> {noformat}
> This was caused by SPARK-17701 (which at this moment is still open even 
> though the patch has been committed).
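
The failure mode itself is standard Jackson behaviour: by default, deserialization 
fails on JSON fields that the target class does not declare. A small self-contained 
sketch of that mechanism (plain jackson-module-scala; this is only an illustration, 
not the fix taken in the pull request):

{code:scala}
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// A reader-side class that, like the 2.3 SparkPlanInfo, has no "metadata" field.
case class PlanInfo(nodeName: String, simpleString: String)

object UnknownFieldDemo {
  def main(args: Array[String]): Unit = {
    // Event written by an older version, carrying the extra "metadata" field.
    val json = """{"nodeName":"Exchange","simpleString":"Exchange","metadata":{}}"""

    val strict = new ObjectMapper().registerModule(DefaultScalaModule)
    // strict.readValue(json, classOf[PlanInfo])
    //   throws UnrecognizedPropertyException: "metadata" is not a known property

    val lenient = new ObjectMapper()
      .registerModule(DefaultScalaModule)
      .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)
    // The unknown field is silently dropped instead of failing the whole event.
    println(lenient.readValue(json, classOf[PlanInfo]))
  }
}
{code}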



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166637#comment-16166637
 ] 

Apache Spark commented on SPARK-21987:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19237

> Spark 2.3 cannot read 2.2 event logs
> 
>
> Key: SPARK-21987
> URL: https://issues.apache.org/jira/browse/SPARK-21987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> Reported by [~jincheng] in a comment in SPARK-18085:
> {noformat}
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
> Unrecognized field "metadata" (class 
> org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 
> known properties: "simpleString", "nodeName", "children", "metrics"])
>  at [Source: 
> {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
>  at 
> NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
>  
> Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
>  Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
> array\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- 
> LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange 
> RoundRobinPartitioning(200)\n+- Scan 
> ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
>  
> RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
>  
> ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
>  of output 
> rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
>  size total (min, med, 
> max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
> 1, column: 1622] (through reference chain: 
> org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
>   at 
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
> {noformat}
> This was caused by SPARK-17701 (which at this moment is still open even 
> though the patch has been committed).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21987:


Assignee: Apache Spark

> Spark 2.3 cannot read 2.2 event logs
> 
>
> Key: SPARK-21987
> URL: https://issues.apache.org/jira/browse/SPARK-21987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Blocker
>
> Reported by [~jincheng] in a comment in SPARK-18085:
> {noformat}
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
> Unrecognized field "metadata" (class 
> org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 
> known properties: "simpleString", "nodeName", "children", "metrics"])
>  at [Source: 
> {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
>  at 
> NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
>  
> Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
>  Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
> array\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- 
> LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange 
> RoundRobinPartitioning(200)\n+- Scan 
> ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
>  
> RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
>  
> ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
>  of output 
> rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
>  size total (min, med, 
> max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
> 1, column: 1622] (through reference chain: 
> org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
>   at 
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
> {noformat}
> This was caused by SPARK-17701 (which at this moment is still open even 
> though the patch has been committed).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22015:


Assignee: Apache Spark

> Remove usage of non-used private field isAuthenticated from 
> org.apache.spark.network.sasl.SaslRpcHandler.java
> -
>
> Key: SPARK-22015
> URL: https://issues.apache.org/jira/browse/SPARK-22015
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Nikhil Bhide
>Assignee: Apache Spark
>Priority: Trivial
>
> Remove usage of non-used private field isAuthenticated from 
> org.apache.spark.network.sasl.SaslRpcHandler.java. It does not adhere to code 
> convention.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166561#comment-16166561
 ] 

Apache Spark commented on SPARK-22015:
--

User 'nikhilbhide' has created a pull request for this issue:
https://github.com/apache/spark/pull/19236

> Remove usage of non-used private field isAuthenticated from 
> org.apache.spark.network.sasl.SaslRpcHandler.java
> -
>
> Key: SPARK-22015
> URL: https://issues.apache.org/jira/browse/SPARK-22015
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Nikhil Bhide
>Priority: Trivial
>
> Remove usage of non-used private field isAuthenticated from 
> org.apache.spark.network.sasl.SaslRpcHandler.java. It does not adhere to code 
> convention.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22015:


Assignee: (was: Apache Spark)

> Remove usage of non-used private field isAuthenticated from 
> org.apache.spark.network.sasl.SaslRpcHandler.java
> -
>
> Key: SPARK-22015
> URL: https://issues.apache.org/jira/browse/SPARK-22015
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Nikhil Bhide
>Priority: Trivial
>
> Remove usage of non-used private field isAuthenticated from 
> org.apache.spark.network.sasl.SaslRpcHandler.java. It does not adhere to code 
> convention.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22015) Remove usage of non-used private field isAuthenticated from org.apache.spark.network.sasl.SaslRpcHandler.java

2017-09-14 Thread Nikhil Bhide (JIRA)
Nikhil Bhide created SPARK-22015:


 Summary: Remove usage of non-used private field isAuthenticated 
from org.apache.spark.network.sasl.SaslRpcHandler.java
 Key: SPARK-22015
 URL: https://issues.apache.org/jira/browse/SPARK-22015
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Nikhil Bhide
Priority: Trivial


Remove the unused private field isAuthenticated from 
org.apache.spark.network.sasl.SaslRpcHandler.java; keeping it does not adhere to 
code conventions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet

2017-09-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21997:
--
Description: 
SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different 
results according to the SQL configuration, 
*spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the 
default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is 
wrong by default.

For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
turn that option on (`true`).

{code}
scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
scala> sql("SELECT * FROM t_char").show
+---+---+
|  a|  b|
+---+---+
|  a|  b|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")

scala> sql("SELECT * FROM t_char").show
+--+---+
| a|  b|
+--+---+
|a |  b|
+--+---+
{code}

  was:
SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different 
results according to the SQL configuration, 
*spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the 
default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is 
wrong by default.

For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
turn that option on (`true`).

{code}
scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
scala> sql("SELECT * FROM t_char").show
+---+---+
|  a|  b|
+---+---+
|  a|  b|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")

scala> sql("SELECT * FROM t_char").show
+--+---+
| a|  b|
+--+---+
|a |  b|
+--+---+

scala> spark.version
res3: String = 2.2.0
{code}


> Spark shows different results on char/varchar columns on Parquet
> 
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
> SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
> turn that option on (`true`).
> {code}
> scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
> scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21997) Spark shows different results on char/varchar columns on Parquet

2017-09-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21997:
--
Summary: Spark shows different results on char/varchar columns on Parquet  
(was: Spark shows different results on Hive char/varchar columns on Parquet)

> Spark shows different results on char/varchar columns on Parquet
> 
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
> SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
> turn that option on (`true`).
> {code}
> scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
> scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> scala> spark.version
> res3: String = 2.2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet

2017-09-14 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21997:
--
Description: 
SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different 
results according to the SQL configuration, 
*spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the 
default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is 
wrong by default.

For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
turn that option on (`true`).

{code}
scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
scala> sql("SELECT * FROM t_char").show
+---+---+
|  a|  b|
+---+---+
|  a|  b|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")

scala> sql("SELECT * FROM t_char").show
+--+---+
| a|  b|
+--+---+
|a |  b|
+--+---+

scala> spark.version
res3: String = 2.2.0
{code}

  was:
SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows different 
results according to the SQL configuration, 
*spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, the 
default of `spark.sql.hive.convertMetastoreParquet` is true, so the result is 
wrong by default.

For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
turn that option on (`true`).

{code}
hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet;

hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t;

scala> sql("SELECT * FROM t_char").show
+---+---+
|  a|  b|
+---+---+
|  a|  b|
+---+---+

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")

scala> sql("SELECT * FROM t_char").show
+--+---+
| a|  b|
+--+---+
|a |  b|
+--+---+

scala> spark.version
res3: String = 2.2.0
{code}


> Spark shows different results on Hive char/varchar columns on Parquet
> -
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
> SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
> turn that option on (`true`).
> {code}
> scala> sql("CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet")
> scala> sql("INSERT INTO TABLE t_char SELECT 'a', 'b'")
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> scala> spark.version
> res3: String = 2.2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166539#comment-16166539
 ] 

Apache Spark commented on SPARK-21997:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/19235

> Spark shows different results on Hive char/varchar columns on Parquet
> -
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
> SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
> turn that option on (`true`).
> {code}
> hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet;
> hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t;
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> scala> spark.version
> res3: String = 2.2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21997:


Assignee: (was: Apache Spark)

> Spark shows different results on Hive char/varchar columns on Parquet
> -
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
> SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
> turn that option on (`true`).
> {code}
> hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet;
> hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t;
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> scala> spark.version
> res3: String = 2.2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21997) Spark shows different results on Hive char/varchar columns on Parquet

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21997:


Assignee: Apache Spark

> Spark shows different results on Hive char/varchar columns on Parquet
> -
>
> Key: SPARK-21997
> URL: https://issues.apache.org/jira/browse/SPARK-21997
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> SPARK-19459 resolves CHAR/VARCHAR issues in general, but Spark shows 
> different results according to the SQL configuration, 
> *spark.sql.hive.convertMetastoreParquet*. We had better fix this. Actually, 
> the default of `spark.sql.hive.convertMetastoreParquet` is true, so the 
> result is wrong by default.
> For ORC, the default of `spark.sql.hive.convertMetastoreOrc` is false, so 
> SPARK-19459 did not cover it there. The same problem will appear for ORC if we 
> turn that option on (`true`).
> {code}
> hive> CREATE TABLE t_char(a CHAR(10), b VARCHAR(10)) STORED AS parquet;
> hive> INSERT INTO TABLE t_char SELECT 'a', 'b' FROM (SELECT 1) t;
> scala> sql("SELECT * FROM t_char").show
> +---+---+
> |  a|  b|
> +---+---+
> |  a|  b|
> +---+---+
> scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
> scala> sql("SELECT * FROM t_char").show
> +--+---+
> | a|  b|
> +--+---+
> |a |  b|
> +--+---+
> scala> spark.version
> res3: String = 2.2.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22010:
---
Description: 
To convert timestamp type to Python we are using the following code:
{code}datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000){code}

{code}
In [34]: %%timeit
...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
...:
4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

It's slow, because:
# we're trying to get TZ on every conversion
# we're using replace method

Proposed solution: custom datetime conversion and move calculation of TZ to 
module

  was:
To convert timestamp type to Python we are using the following code:
{code}datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000){code}

{code}
In [34]: %%timeit
...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
...:
4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

It's slow, because:
# we're trying to get TZ on every conversion
# we're using replace method

Proposed solution: custom datetime conversion and move calculation of TZ to 
module


> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to Python we are using the following code:
> {code}datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000){code}
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22010:
---
Priority: Minor  (was: Major)

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to Python we are using the following code:
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)`
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22010:
---
Issue Type: Improvement  (was: Bug)

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to Python we are using the following code:
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)`
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-22010:
---
Description: 
To convert timestamp type to Python we are using the following code:
{code}datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000){code}

{code}
In [34]: %%timeit
...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
...:
4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

It's slow, because:
# we're trying to get TZ on every conversion
# we're using replace method

Proposed solution: custom datetime conversion and move calculation of TZ to 
module

  was:
To convert timestamp type to Python we are using the following code:
`datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)`

{code}
In [34]: %%timeit
...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
...:
4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

It's slow, because:
# we're trying to get TZ on every conversion
# we're using replace method

Proposed solution: custom datetime conversion and move calculation of TZ to 
module


> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Priority: Minor
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to Python we are using the following code:
> {code}datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000){code}
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-14 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166516#comment-16166516
 ] 

Michael N commented on SPARK-21999:
---

My app was not asynchronously modifying any object that the streaming job 
accessed on the driver. Since we have ruled that out, and given that
- this issue occurred on the master,
- other people have submitted tickets for various occurrences of 
ConcurrentModificationException, and
- the line numbers in the Spark code were captured in the stack trace,

could you have it tracked down and resolved?

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. Instead, Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 

[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166493#comment-16166493
 ] 

Sean Owen commented on SPARK-21999:
---

I'm suggesting it could happen if your app was asynchronously modifying an 
object that your streaming job accessed on the driver. Maybe not, but that's 
the first thing to rule out. It's hard to guess what it could be otherwise, 
unfortunately.
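
For reference, a minimal sketch of the failure mode being suggested (hypothetical 
names, not taken from the reporter's application): a driver-side java.util.ArrayList 
captured in a DStream closure while another thread keeps mutating it. Closure 
serialization during job generation then walks the list concurrently with the 
writes, which is where ArrayList throws ConcurrentModificationException, matching 
the ClosureCleaner.ensureSerializable frame in the stack trace in the issue 
description.

{code:scala}
import java.util.{ArrayList => JArrayList}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ConcurrentModDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("cme-demo")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Driver-side mutable state captured by the closure below.
    val lookup = new JArrayList[String]()

    // A background thread keeps mutating it while the job generator runs.
    new Thread(new Runnable {
      override def run(): Unit = while (true) { lookup.add("x"); lookup.remove(0) }
    }).start()

    val lines = ssc.socketTextStream("localhost", 9999)
    // Serializing this closure on the driver iterates over `lookup`; if the
    // writer thread modifies it at the same time, the serialization fails with
    // ConcurrentModificationException and the batch is aborted.
    lines.mapPartitions(iter => iter.filter(s => lookup.contains(s))).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}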

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. Instead, Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
> 

[jira] [Commented] (SPARK-21858) Make Spark grouping_id() compatible with Hive grouping__id

2017-09-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166467#comment-16166467
 ] 

Dongjoon Hyun commented on SPARK-21858:
---

Thank you for conclusion, [~cloud_fan]!

> Make Spark grouping_id() compatible with Hive grouping__id
> --
>
> Key: SPARK-21858
> URL: https://issues.apache.org/jira/browse/SPARK-21858
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yann Byron
>
> If you want to migrate some ETLs using `grouping__id` in Hive to Spark and 
> use Spark `grouping_id()` instead of Hive `grouping__id`, you will find 
> differences between their evaluations.
> Here is an example.
> {code:java}
> select A, B, grouping__id/grouping_id() from t group by A, B grouping 
> sets((), (A), (B), (A,B))
> {code}
> Running it on Hive and Spark separately, you'll find this: (the selected 
> attribute in selected grouping set is represented by (/) and  otherwise by 
> (x))
> ||A B||Binary Expression in Spark||Spark||Hive||Binary Expression in Hive||B 
> A||
> |(x) (x)|11|3|0|00|(x) (x)|
> |(x) (/)|10|2|2|10|(/) (x)|
> |(/) (x)|01|1|1|01|(x) (/)|
> |(/) (/)|00|0|3|11|(/) (/)|
> As shown above, in Hive (/) is set to 0 and (x) to 1, while in Spark it is the 
> opposite.
> Moreover, the attributes in `group by` are reversed first in Hive, whereas Spark 
> evaluates them in the given order.
> I suggest modifying the behavior of `grouping_id()` to make it compatible with 
> Hive `grouping__id`.
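
Based solely on the mapping table above, the two encodings are related by reversing 
the bit order of the group-by columns and then flipping every bit. A hypothetical 
helper to bridge the two during migration (my own sketch; not something Spark or 
Hive provides):

{code:scala}
// Convert Spark's grouping_id() value into the value Hive's grouping__id would
// report for the same grouping set, per the table above.
def sparkToHiveGroupingId(sparkId: Long, numGroupByCols: Int): Long = {
  // Reverse the bit order over the n group-by columns...
  val reversed = (0 until numGroupByCols).foldLeft(0L) { (acc, i) =>
    (acc << 1) | ((sparkId >> i) & 1L)
  }
  // ...then flip every bit.
  ((1L << numGroupByCols) - 1L) ^ reversed
}

// With two columns (A, B): Spark 3, 2, 1, 0 map to Hive 0, 2, 1, 3 as in the table.
assert((0L to 3L).map(sparkToHiveGroupingId(_, 2)) == Seq(3L, 1L, 2L, 0L))
{code}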



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL

2017-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22014:
--
Target Version/s:   (was: 2.2.0)

> Sample windows in Spark SQL
> ---
>
> Key: SPARK-22014
> URL: https://issues.apache.org/jira/browse/SPARK-22014
> Project: Spark
>  Issue Type: Wish
>  Components: DStreams, SQL
>Affects Versions: 2.2.0
>Reporter: Simon Schiff
>Priority: Minor
>
> Hello,
> I am using Spark to process measurement data. It is possible to create sample 
> windows in Spark Streaming, where the duration of the window is smaller than 
> the slide. But when I try to do the same with Spark SQL (the measurement data 
> has a timestamp column), I get an analysis exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'timewindow(timestamp, 6000, 18000, 0)' due to data type 
> mismatch: The slide duration (18000) must be less than or equal to the 
> windowDuration (6000)
> {code}
> Here is a example:
> {code:java}
> import java.sql.Timestamp;
> import java.text.SimpleDateFormat;
> import java.util.ArrayList;
> import java.util.Date;
> import java.util.List;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class App {
>   public static Timestamp createTimestamp(String in) throws Exception {
>   SimpleDateFormat dateFormat = new SimpleDateFormat("-MM-dd 
> hh:mm:ss");
>   Date parsedDate = dateFormat.parse(in);
>   return new Timestamp(parsedDate.getTime());
>   }
>   
>   public static void main(String[] args) {
>   SparkSession spark = SparkSession.builder().appName("Window 
> Sampling Example").getOrCreate();
>   
>   List sensorData = new ArrayList();
>   sensorData.add("2017-08-04 00:00:00, 22.75");
>   sensorData.add("2017-08-04 00:01:00, 23.82");
>   sensorData.add("2017-08-04 00:02:00, 24.15");
>   sensorData.add("2017-08-04 00:03:00, 23.16");
>   sensorData.add("2017-08-04 00:04:00, 22.62");
>   sensorData.add("2017-08-04 00:05:00, 22.89");
>   sensorData.add("2017-08-04 00:06:00, 23.21");
>   sensorData.add("2017-08-04 00:07:00, 24.59");
>   sensorData.add("2017-08-04 00:08:00, 24.44");
>   
>   Dataset in = spark.createDataset(sensorData, 
> Encoders.STRING());
>   
>   StructType sensorSchema = DataTypes.createStructType(new 
> StructField[] { 
>   DataTypes.createStructField("timestamp", 
> DataTypes.TimestampType, false),
>   DataTypes.createStructField("value", 
> DataTypes.DoubleType, false),
>   });
>   
>   Dataset data = 
> spark.createDataFrame(in.toJavaRDD().map(new Function() {
>   public Row call(String line) throws Exception {
>   return 
> RowFactory.create(createTimestamp(line.split(",")[0]), 
> Double.parseDouble(line.split(",")[1]));
>   }
>   }), sensorSchema);
>   
>   data.groupBy(functions.window(data.col("timestamp"), "1 
> minutes", "3 minutes")).avg("value").orderBy("window").show(false);
>   }
> }
> {code}
> I think there should be no difference (in duration and slide) between a "Spark 
> Streaming window" and a "Spark SQL window" function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error

2017-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22011.
---
Resolution: Not A Problem

The error pretty much says it all -- set SPARK_HOME. See the SparkR docs.

> model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error
> --
>
> Key: SPARK-22011
> URL: https://issues.apache.org/jira/browse/SPARK-22011
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.2.0
> Environment: Error showing on SparkR 
>Reporter: Atul Khairnar
>
> Sys.setenv(SPARK_HOME="C:/spark")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/")
> library(SparkR)
> sc <- sparkR.session(master = "local")
> sqlContext <- sparkRSQL.init(sc)
> Output: error shown in RStudio
> Warning message:
> 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated") 
> Can you tell me what exactly this error/warning means? And next:
> model <- spark.logit(training, Survived ~ ., regParam = 0.5)
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 35.0 (TID 31, localhost, executor driver): org.apache.spark.SparkException: 
> SPARK_HOME not set. Can't locate SparkR package.
>   at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
>   at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87)
>   at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339)
>   at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391)
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22010:


Assignee: (was: Apache Spark)

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module
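A small standalone benchmark (plain Python, not Spark code; the numbers and the `direct` variant are illustrative) showing the two costs called out above, namely the per-call timezone lookup in fromtimestamp and the extra object built by replace():

{code:python}
import datetime
import timeit
from time import gmtime

ts = 1505383647012344  # microseconds since the epoch, as Spark stores TimestampType

def current(ts):
    # The pattern quoted above: fromtimestamp() consults the local timezone
    # on every call, and replace() allocates a second datetime object.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

def direct(ts):
    # Building the datetime from pre-split UTC fields skips the per-call TZ
    # lookup; only an apples-to-apples comparison when the session TZ is UTC.
    y, m, d, hh, mm, ss, _, _, _ = gmtime(ts // 1000000)
    return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)

print(timeit.timeit(lambda: current(ts), number=100000))
print(timeit.timeit(lambda: direct(ts), number=100000))
{code}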



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22010:


Assignee: Apache Spark

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166430#comment-16166430
 ] 

Apache Spark commented on SPARK-22010:
--

User 'maver1ck' has created a pull request for this issue:
https://github.com/apache/spark/pull/19234

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL

2017-09-14 Thread Simon Schiff (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Schiff updated SPARK-22014:
-
Description: 
Hello,
I am using spark to process measurement data. It is possible to create sample 
windows in Spark Streaming, where the duration of the window is smaller than 
the slide. But when I try to do the same with Spark SQL (The measurement data 
has a time stamp column) then I got an analysis exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve 'timewindow(timestamp, 60000000, 180000000, 0)' due to data type 
mismatch: The slide duration (180000000) must be less than or equal to the 
windowDuration (60000000)
{code}

Here is an example:

{code:java}
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class App {
public static Timestamp createTimestamp(String in) throws Exception {
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
hh:mm:ss");
Date parsedDate = dateFormat.parse(in);
return new Timestamp(parsedDate.getTime());
}

public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("Window 
Sampling Example").getOrCreate();

List<String> sensorData = new ArrayList<String>();
sensorData.add("2017-08-04 00:00:00, 22.75");
sensorData.add("2017-08-04 00:01:00, 23.82");
sensorData.add("2017-08-04 00:02:00, 24.15");
sensorData.add("2017-08-04 00:03:00, 23.16");
sensorData.add("2017-08-04 00:04:00, 22.62");
sensorData.add("2017-08-04 00:05:00, 22.89");
sensorData.add("2017-08-04 00:06:00, 23.21");
sensorData.add("2017-08-04 00:07:00, 24.59");
sensorData.add("2017-08-04 00:08:00, 24.44");

Dataset<String> in = spark.createDataset(sensorData, 
Encoders.STRING());

StructType sensorSchema = DataTypes.createStructType(new 
StructField[] { 
DataTypes.createStructField("timestamp", 
DataTypes.TimestampType, false),
DataTypes.createStructField("value", 
DataTypes.DoubleType, false),
});

Dataset<Row> data = 
spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
public Row call(String line) throws Exception {
return 
RowFactory.create(createTimestamp(line.split(",")[0]), 
Double.parseDouble(line.split(",")[1]));
}
}), sensorSchema);

data.groupBy(functions.window(data.col("timestamp"), "1 
minutes", "3 minutes")).avg("value").orderBy("window").show(false);
}
}
{code}

I think there should be no difference (duration and slide) in a "Spark 
Streaming window" and a "Spark SQL window" function.

  was:
Hello,
I am using spark to process measurement data. It is possible to create sample 
windows in Spark Streaming, where the duration of the window is smaller than 
the slide. But when I try to do the same with Spark SQL (The measurement data 
has a time stamp column) then i got an analysis exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve 'timewindow(timestamp, 60000000, 180000000, 0)' due to data type 
mismatch: The slide duration (180000000) must be less than or equal to the 
windowDuration (60000000)
{code}

Here is a example:

{code:java}
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class App {
public static Timestamp createTimestamp(String in) throws Exception {
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
hh:mm:ss");
Date parsedDate = 

[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL

2017-09-14 Thread Simon Schiff (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Schiff updated SPARK-22014:
-
Description: 
Hello,
I am using spark to process measurement data. It is possible to create sample 
windows in Spark Streaming, where the duration of the window is smaller than 
the slide. But when I try to do the same with Spark SQL (The measurement data 
has a time stamp column) then i got an analysis exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve 'timewindow(timestamp, 60000000, 180000000, 0)' due to data type 
mismatch: The slide duration (180000000) must be less than or equal to the 
windowDuration (60000000)
{code}

Here is a example:

{code:java}
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class App {
public static Timestamp createTimestamp(String in) throws Exception {
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
hh:mm:ss");
Date parsedDate = dateFormat.parse(in);
return new Timestamp(parsedDate.getTime());
}

public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("Window 
Sampling Example").getOrCreate();

List<String> sensorData = new ArrayList<String>();
sensorData.add("2017-08-04 00:00:00, 22.75");
sensorData.add("2017-08-04 00:01:00, 23.82");
sensorData.add("2017-08-04 00:02:00, 24.15");
sensorData.add("2017-08-04 00:03:00, 23.16");
sensorData.add("2017-08-04 00:04:00, 22.62");
sensorData.add("2017-08-04 00:05:00, 22.89");
sensorData.add("2017-08-04 00:06:00, 23.21");
sensorData.add("2017-08-04 00:07:00, 24.59");
sensorData.add("2017-08-04 00:08:00, 24.44");

Dataset<String> in = spark.createDataset(sensorData, 
Encoders.STRING());

StructType sensorSchema = DataTypes.createStructType(new 
StructField[] { 
DataTypes.createStructField("timestamp", 
DataTypes.TimestampType, false),
DataTypes.createStructField("value", 
DataTypes.DoubleType, false),
});

Dataset<Row> data = 
spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
public Row call(String line) throws Exception {
return 
RowFactory.create(createTimestamp(line.split(",")[0]), 
Double.parseDouble(line.split(",")[1]));
}
}), sensorSchema);

data.groupBy(functions.window(data.col("timestamp"), "1 
minutes", "3 minutes")).avg("value").orderBy("window").show(false);
}
}
{code}

I think there should be no difference (duration and slide) in a "Spark 
Streaming window" and a "Spark SQL window" function.

  was:
Hello,
i am using spark to process measurement data. It is possible to create sample 
windows in Spark Streaming, where the duration of the window is smaller than 
the slide. But when I try to do the same with Spark SQL (The measurement data 
has a time stamp column) then i got a analysis exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve 'timewindow(timestamp, 60000000, 180000000, 0)' due to data type 
mismatch: The slide duration (180000000) must be less than or equal to the 
windowDuration (60000000)
{code}

Here is a example:

{code:java}
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class App {
public static Timestamp createTimestamp(String in) throws Exception {
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
hh:mm:ss");
Date parsedDate = 

[jira] [Updated] (SPARK-22014) Sample windows in Spark SQL

2017-09-14 Thread Simon Schiff (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simon Schiff updated SPARK-22014:
-
Issue Type: Wish  (was: Improvement)

> Sample windows in Spark SQL
> ---
>
> Key: SPARK-22014
> URL: https://issues.apache.org/jira/browse/SPARK-22014
> Project: Spark
>  Issue Type: Wish
>  Components: DStreams, SQL
>Affects Versions: 2.2.0
>Reporter: Simon Schiff
>Priority: Minor
>
> Hello,
> i am using spark to process measurement data. It is possible to create sample 
> windows in Spark Streaming, where the duration of the window is smaller than 
> the slide. But when I try to do the same with Spark SQL (The measurement data 
> has a time stamp column) then i got a analysis exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve 'timewindow(timestamp, 60000000, 180000000, 0)' due to data type 
> mismatch: The slide duration (180000000) must be less than or equal to the 
> windowDuration (60000000)
> {code}
> Here is a example:
> {code:java}
> import java.sql.Timestamp;
> import java.text.SimpleDateFormat;
> import java.util.ArrayList;
> import java.util.Date;
> import java.util.List;
> import org.apache.spark.api.java.function.Function;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Encoders;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.RowFactory;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.functions;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> public class App {
>   public static Timestamp createTimestamp(String in) throws Exception {
>   SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
> hh:mm:ss");
>   Date parsedDate = dateFormat.parse(in);
>   return new Timestamp(parsedDate.getTime());
>   }
>   
>   public static void main(String[] args) {
>   SparkSession spark = SparkSession.builder().appName("Window 
> Sampling Example").getOrCreate();
>   
>   List<String> sensorData = new ArrayList<String>();
>   sensorData.add("2017-08-04 00:00:00, 22.75");
>   sensorData.add("2017-08-04 00:01:00, 23.82");
>   sensorData.add("2017-08-04 00:02:00, 24.15");
>   sensorData.add("2017-08-04 00:03:00, 23.16");
>   sensorData.add("2017-08-04 00:04:00, 22.62");
>   sensorData.add("2017-08-04 00:05:00, 22.89");
>   sensorData.add("2017-08-04 00:06:00, 23.21");
>   sensorData.add("2017-08-04 00:07:00, 24.59");
>   sensorData.add("2017-08-04 00:08:00, 24.44");
>   
>   Dataset<String> in = spark.createDataset(sensorData, 
> Encoders.STRING());
>   
>   StructType sensorSchema = DataTypes.createStructType(new 
> StructField[] { 
>   DataTypes.createStructField("timestamp", 
> DataTypes.TimestampType, false),
>   DataTypes.createStructField("value", 
> DataTypes.DoubleType, false),
>   });
>   
>   Dataset<Row> data = 
> spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
>   public Row call(String line) throws Exception {
>   return 
> RowFactory.create(createTimestamp(line.split(",")[0]), 
> Double.parseDouble(line.split(",")[1]));
>   }
>   }), sensorSchema);
>   
>   data.groupBy(functions.window(data.col("timestamp"), "1 
> minutes", "3 minutes")).avg("value").orderBy("window").show(false);
>   }
> }
> {code}
> I think there should be no difference (duration and slide) in a "Spark 
> Streaming window" and a "Spark SQL window" function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error

2017-09-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22011:
-
Fix Version/s: (was: 0.8.2)

> model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error
> --
>
> Key: SPARK-22011
> URL: https://issues.apache.org/jira/browse/SPARK-22011
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.2.0
> Environment: Error showing on SparkR 
>Reporter: Atul Khairnar
>
> Sys.setenv(SPARK_HOME="C:/spark")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/")
> library(SparkR)
> sc <- sparkR.session(master = "local")
> sqlContext <- sparkRSQL.init(sc)
> o/p: showing error in Rstudio
> Warning message:
> 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated") 
> Can you help me what exactly error/warning...and next
> model <- spark.logit(training, Survived ~ ., regParam = 0.5)
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 35.0 (TID 31, localhost, executor driver): org.apache.spark.SparkException: 
> SPARK_HOME not set. Can't locate SparkR package.
>   at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
>   at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87)
>   at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339)
>   at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391)
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error

2017-09-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22011:
-
Target Version/s:   (was: 2.2.0)

> model <- spark.logit(training, Survived ~ ., regParam = 0.5) shwoing error
> --
>
> Key: SPARK-22011
> URL: https://issues.apache.org/jira/browse/SPARK-22011
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 2.2.0
> Environment: Error showing on SparkR 
>Reporter: Atul Khairnar
>
> Sys.setenv(SPARK_HOME="C:/spark")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/")
> library(SparkR)
> sc <- sparkR.session(master = "local")
> sqlContext <- sparkRSQL.init(sc)
> o/p: showing error in Rstudio
> Warning message:
> 'sparkRSQL.init' is deprecated.
> Use 'sparkR.session' instead.
> See help("Deprecated") 
> Can you help me what exactly error/warning...and next
> model <- spark.logit(training, Survived ~ ., regParam = 0.5)
> Error in handleErrors(returnStatus, conn) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 35.0 (TID 31, localhost, executor driver): org.apache.spark.SparkException: 
> SPARK_HOME not set. Can't locate SparkR package.
>   at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
>   at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87)
>   at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339)
>   at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391)
>   at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
>   at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22014) Sample windows in Spark SQL

2017-09-14 Thread Simon Schiff (JIRA)
Simon Schiff created SPARK-22014:


 Summary: Sample windows in Spark SQL
 Key: SPARK-22014
 URL: https://issues.apache.org/jira/browse/SPARK-22014
 Project: Spark
  Issue Type: Improvement
  Components: DStreams, SQL
Affects Versions: 2.2.0
Reporter: Simon Schiff
Priority: Minor


Hello,
i am using spark to process measurement data. It is possible to create sample 
windows in Spark Streaming, where the duration of the window is smaller than 
the slide. But when I try to do the same with Spark SQL (The measurement data 
has a time stamp column) then i got a analysis exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
resolve 'timewindow(timestamp, 60000000, 180000000, 0)' due to data type 
mismatch: The slide duration (180000000) must be less than or equal to the 
windowDuration (60000000)
{code}

Here is a example:

{code:java}
import java.sql.Timestamp;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class App {
public static Timestamp createTimestamp(String in) throws Exception {
SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd 
hh:mm:ss");
Date parsedDate = dateFormat.parse(in);
return new Timestamp(parsedDate.getTime());
}

public static void main(String[] args) {
SparkSession spark = SparkSession.builder().appName("Window 
Sampling Example").getOrCreate();

List<String> sensorData = new ArrayList<String>();
sensorData.add("2017-08-04 00:00:00, 22.75");
sensorData.add("2017-08-04 00:01:00, 23.82");
sensorData.add("2017-08-04 00:02:00, 24.15");
sensorData.add("2017-08-04 00:03:00, 23.16");
sensorData.add("2017-08-04 00:04:00, 22.62");
sensorData.add("2017-08-04 00:05:00, 22.89");
sensorData.add("2017-08-04 00:06:00, 23.21");
sensorData.add("2017-08-04 00:07:00, 24.59");
sensorData.add("2017-08-04 00:08:00, 24.44");

Dataset<String> in = spark.createDataset(sensorData, 
Encoders.STRING());

StructType sensorSchema = DataTypes.createStructType(new 
StructField[] { 
DataTypes.createStructField("timestamp", 
DataTypes.TimestampType, false),
DataTypes.createStructField("value", 
DataTypes.DoubleType, false),
});

Dataset<Row> data = 
spark.createDataFrame(in.toJavaRDD().map(new Function<String, Row>() {
public Row call(String line) throws Exception {
return 
RowFactory.create(createTimestamp(line.split(",")[0]), 
Double.parseDouble(line.split(",")[1]));
}
}), sensorSchema);

data.groupBy(functions.window(data.col("timestamp"), "1 
minutes", "3 minutes")).avg("value").orderBy("window").show(false);
}
}
{code}

I think there should be no difference (duration and slide) in a "Spark 
Streaming window" and a "Spark SQL window" function.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-14 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166316#comment-16166316
 ] 

Michael N commented on SPARK-21999:
---

Sean, this issue occurred on the master as well. I updated the ticket to 
show only its stack trace to clarify that. Which closure did you refer to 
about my app modifying some list? My app is not using accumulators or 
broadcast variables. 

Thanks.

Michael,


> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at 

[jira] [Updated] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-14 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N updated SPARK-21999:
--
Description: 
Hi,

I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
ConcurrentModificationException intermittently.  When it occurs, Spark does not 
honor the specified value of spark.task.maxFailures. So Spark aborts the 
current batch and fetches the next batch, which results in lost data. Its 
exception stack is listed below. 

This instance of ConcurrentModificationException is similar to the issue at 
https://issues.apache.org/jira/browse/SPARK-17463, which was about 
Serialization of accumulators in heartbeats.  However, my Spark stream app does 
not use accumulators. 

The stack trace listed below occurred on the Spark master in Spark streaming 
driver at the time of data loss.   

From the line of code in the first stack trace, can you tell which object 
Spark was trying to serialize? What is the root cause of this issue?

Because this issue results in lost data as described above, could you have this 
issue fixed ASAP?

Thanks.

Michael N.,



Stack trace of Spark Streaming driver
ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
at 
org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
at 
org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
at scala.Option.orElse(Option.scala:289)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
at 
org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
at 
org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:42)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
at 

[jira] [Updated] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-14 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N updated SPARK-21999:
--
Summary: ConcurrentModificationException - Spark Streaming  (was: 
ConcurrentModificationException - Master )

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> There are two stack traces below.  The first one is from the Spark streaming 
> driver at the time of data loss.   The second one is from the Spark slave 
> server that runs the application several hours later.  It may or may not be 
> related. They are listed here because they involve the same type of 
> ConcurrentModificationException, so they may be permutations of the same issue 
> and occurred at different times.
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 

[jira] [Updated] (SPARK-21999) ConcurrentModificationException - Master

2017-09-14 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N updated SPARK-21999:
--
Summary: ConcurrentModificationException - Master   (was: 
ConcurrentModificationException - Error sending message [message = Heartbeat)

> ConcurrentModificationException - Master 
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> There are two stack traces below.  The first one is from the Spark streaming 
> driver at the time of data loss.   The second one is from the Spark slave 
> server that runs the application several hours later.  It may or may not be 
> related. They are listed here because they involve the same type of 
> ConcurrentModificationException, so they may be permutations of the same issue 
> and occurred at different times.
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>  

[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166206#comment-16166206
 ] 

Maciej Bryński commented on SPARK-22010:


I'll open PR.

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-14 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-21922.
-
  Resolution: Fixed
Assignee: zhoukang
Target Version/s: 2.3.0

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, below is an example:
> !notfixed01.png|Before fixed!
> !notfixed02.png|Before fixed!
> We can fix the duration time by using the modification time of the event log:
> !fixed01.png|After fixed!
> !fixed02.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-14 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-21922:

Fix Version/s: 2.3.0

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
> Fix For: 2.3.0
>
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, below is an example:
> !notfixed01.png|Before fixed!
> !notfixed02.png|Before fixed!
> We can fix the duration time by using the modification time of the event log:
> !fixed01.png|After fixed!
> !fixed02.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21922) When executor failed and task metrics have not send to driver,the status will always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'

2017-09-14 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-21922:

Target Version/s:   (was: 2.3.0)

> When executor failed and task metrics have not send to driver,the status will 
> always be 'RUNNING' and the duration will be 'CurrentTime - launchTime'
> -
>
> Key: SPARK-21922
> URL: https://issues.apache.org/jira/browse/SPARK-21922
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
> Fix For: 2.3.0
>
> Attachments: fixed01.png, fixed02.png, notfixed01.png, notfixed02.png
>
>
> As the title describes, below is an example:
> !notfixed01.png|Before fixed!
> !notfixed02.png|Before fixed!
> We can fix the duration time by using the modification time of the event log:
> !fixed01.png|After fixed!
> !fixed02.png|After fixed!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166187#comment-16166187
 ] 

Hyukjin Kwon commented on SPARK-22010:
--

I'd also put some details in the PR description, for example, fromtimestamp 
implementation in Python - 
https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458.
I still think it is trivial, but it sounds like a valid improvement.

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166181#comment-16166181
 ] 

Hyukjin Kwon commented on SPARK-22010:
--

Sounds good. Would you like to go ahead and open a PR with some small perf 
tests?

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166136#comment-16166136
 ] 

Maciej Bryński edited comment on SPARK-22010 at 9/14/17 12:05 PM:
--

The reason for this Jira is this profiling result (attached).
!profile_fact_dok.png|thumbnail!
As you can see, about 80% of the PySpark time is spent in Spark internals.



was (Author: maver1ck):
The reason of this Jira is this profiling (attachment).
!profile_fact_dok.jpg|thumbnail!
As you can see about 80% of pyspark time is spent in Spark internals.
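For anyone without the attachment, here is a small driver-side sketch (the value count and sort order are arbitrary) that reproduces this kind of profile by exercising TimestampType.fromInternal directly:

{code:python}
import cProfile
import pstats

from pyspark.sql.types import TimestampType

ts_type = TimestampType()
values = [1505383647000000 + i for i in range(200000)]  # microseconds since the epoch

profiler = cProfile.Profile()
profiler.enable()
converted = [ts_type.fromInternal(v) for v in values]
profiler.disable()

# fromtimestamp/replace should show up near the top, as in the attached profile.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
{code}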


> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
> Attachments: profile_fact_dok.png
>
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 
> 1000000)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166134#comment-16166134
 ] 

Maciej Bryński commented on SPARK-22010:


Proposition: create a constant
{code}
utc = datetime.datetime.now(tzlocal()).tzname() == 'UTC'
{code}

Then change this code to: ( 
https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L196 )
{code}
y, m, d, hh, mm, ss, _, _, _ = gmtime(ts // 1000000) if utc else localtime(ts 
// 1000000)
datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)
{code}

This is running 30% faster if TZ != UTC and 3x faster if TZ == UTC

What do you think about such a solution ?
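
For reference, here is a self-contained sketch of the idea (my reconstruction, not the actual patch). It assumes the internal value {{ts}} is microseconds since the Unix epoch, which is how Spark stores TimestampType, so the divisor and modulus are written out as 1000000; it also assumes {{tzlocal}} comes from the third-party python-dateutil package:
{code}
# Standalone sketch only; not the proposed patch itself.
import datetime
import time
from time import gmtime, localtime

from dateutil.tz import tzlocal  # assumption: python-dateutil is available

# computed once at module import instead of on every conversion
UTC = datetime.datetime.now(tzlocal()).tzname() == 'UTC'


def from_internal_current(ts):
    # current approach: fromtimestamp() resolves the time zone on every call
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)


def from_internal_proposed(ts):
    # proposed approach: one struct_time lookup, then a direct constructor call
    y, m, d, hh, mm, ss, _, _, _ = gmtime(ts // 1000000) if UTC else localtime(ts // 1000000)
    return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)


if __name__ == "__main__":
    ts = 1505383647012344  # the timestamp from the description, in microseconds
    assert from_internal_current(ts) == from_internal_proposed(ts)
    for fn in (from_internal_current, from_internal_proposed):
        start = time.perf_counter()
        for _ in range(100000):
            fn(ts)
        print(fn.__name__, time.perf_counter() - start)
{code}
Timing both functions like this makes it easy to compare the two approaches on a given machine and time zone.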

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 
> 100)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-22012.
-

> CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after 
> polling for 512"
> --
>
> Key: SPARK-22012
> URL: https://issues.apache.org/jira/browse/SPARK-22012
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8
>Reporter: Karan Singh
>Priority: Blocker
>
> My Spark Streaming duration is 5 seconds (5000) and kafka is all at its 
> default properties , i am still facing this issue , Can anyone tell me how to 
> resolve it what i am doing wrong ?
>   JavaStreamingContext ssc = new JavaStreamingContext(sc, 
> new Duration(5000));
> Exception in Spark Streamings
> Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-xxx pulse 1 163684030 after polling for 512
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22012.
---
Resolution: Invalid

> CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after 
> polling for 512"
> --
>
> Key: SPARK-22012
> URL: https://issues.apache.org/jira/browse/SPARK-22012
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8
>Reporter: Karan Singh
>Priority: Blocker
>
> My Spark Streaming duration is 5 seconds (5000) and kafka is all at its 
> default properties , i am still facing this issue , Can anyone tell me how to 
> resolve it what i am doing wrong ?
>   JavaStreamingContext ssc = new JavaStreamingContext(sc, 
> new Duration(5000));
> Exception in Spark Streamings
> Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-xxx pulse 1 163684030 after polling for 512
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-14 Thread Karan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan Singh updated SPARK-22012:

Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8  (was: 
Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11)

> CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after 
> polling for 512"
> --
>
> Key: SPARK-22012
> URL: https://issues.apache.org/jira/browse/SPARK-22012
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.1.0.2.6.0.3-8, Kafka 0.10 for Java 8
>Reporter: Karan Singh
>Priority: Blocker
>
> My Spark Streaming duration is 5 seconds (5000) and kafka is all at its 
> default properties , i am still facing this issue , Can anyone tell me how to 
> resolve it what i am doing wrong ?
>   JavaStreamingContext ssc = new JavaStreamingContext(sc, 
> new Duration(5000));
> Exception in Spark Streamings
> Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-xxx pulse 1 163684030 after polling for 512
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12216) Spark failed to delete temp directory

2017-09-14 Thread Michel Lemay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166117#comment-16166117
 ] 

Michel Lemay commented on SPARK-12216:
--

In my case, what prevents the temp folder from being deleted is a lock on the jar I 
used in spark-submit.  That jar is copied into the `spark-guid\userFiles-guid\` 
folder and loaded in the process.  I'm not sure if it is caused by the JVM 
itself or by some handle that is left open (MutableURLClassLoader does not seem to be 
closed).

> Spark failed to delete temp directory 
> --
>
> Key: SPARK-12216
> URL: https://issues.apache.org/jira/browse/SPARK-12216
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Environment: windows 7 64 bit
> Spark 1.52
> Java 1.8.0.65
> PATH includes:
> C:\Users\Stefan\spark-1.5.2-bin-hadoop2.6\bin
> C:\ProgramData\Oracle\Java\javapath
> C:\Users\Stefan\scala\bin
> SYSTEM variables set are:
> JAVA_HOME=C:\Program Files\Java\jre1.8.0_65
> HADOOP_HOME=C:\Users\Stefan\hadoop-2.6.0\bin
> (where the bin\winutils resides)
> both \tmp and \tmp\hive have permissions
> drwxrwxrwx as detected by winutils ls
>Reporter: stefan
>Priority: Minor
>
> The mailing list archives have no obvious solution to this:
> scala> :q
> Stopping spark context.
> 15/12/08 16:24:22 ERROR ShutdownHookManager: Exception while deleting Spark 
> temp dir: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> java.io.IOException: Failed to delete: 
> C:\Users\Stefan\AppData\Local\Temp\spark-18f2a418-e02f-458b-8325-60642868fdff
> at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:884)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:63)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1$$anonfun$apply$mcV$sp$3.apply(ShutdownHookManager.scala:60)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> org.apache.spark.util.ShutdownHookManager$$anonfun$1.apply$mcV$sp(ShutdownHookManager.scala:60)
> at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:264)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:234)
> at scala.util.Try$.apply(Try.scala:161)
> at 
> org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:234)
> at 
> org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:216)
> at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-14 Thread Karan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan Singh updated SPARK-22012:

Description: 
My Spark Streaming batch duration is 5 seconds (5000 ms) and Kafka is at its default 
properties, yet I am still facing this issue. Can anyone tell me how to resolve 
it, or what I am doing wrong?
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000));


Exception in Spark Streaming:
Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-xxx pulse 1 163684030 after polling for 512
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

  was:
We have a Spark Streaming application reading records from Kafka 0.10.

Some tasks are failed because of the following error:
"java.lang.AssertionError: assertion failed: Failed to get records for (...) 
after polling for 512"

The first attempt fails and the second attempt (retry) completes successfully, 
- this is the pattern that we see for many tasks in our logs. These fails and 
retries consume resources.

A similar case with a stack trace are described here:
https://www.mail-archive.com/user@spark.apache.org/msg56564.html
https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767

Here is the line from the stack trace where the error is raised:
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)

We tried several values for "spark.streaming.kafka.consumer.poll.ms", - 2, 5, 
10, 30 and 60 seconds, but the error appeared in all the cases except the last 
one. Moreover, increasing the threshold led to increasing total Spark stage 
duration.
In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to 
fewer task failures but with cost of total stage duration. So, it is bad for 
performance when processing data streams.

We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other 
related classes) which inhibits the reading process.



> CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after 
> polling for 512"
> --
>
> Key: SPARK-22012
> URL: https://issues.apache.org/jira/browse/SPARK-22012
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11
>Reporter: Karan Singh
>
> My Spark Streaming duration is 5 seconds (5000) and kafka is all at its 
> default properties , i am still facing this issue , Can anyone tell me how to 
> resolve it what i am doing wrong ?
>   JavaStreamingContext ssc = new JavaStreamingContext(sc, 
> new Duration(5000));
> Exception in Spark Streamings
> Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-xxx pulse 1 163684030 after polling for 512
>   at 

[jira] [Updated] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-14 Thread Karan Singh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karan Singh updated SPARK-22012:

Priority: Blocker  (was: Major)

> CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after 
> polling for 512"
> --
>
> Key: SPARK-22012
> URL: https://issues.apache.org/jira/browse/SPARK-22012
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11
>Reporter: Karan Singh
>Priority: Blocker
>
> My Spark Streaming duration is 5 seconds (5000) and kafka is all at its 
> default properties , i am still facing this issue , Can anyone tell me how to 
> resolve it what i am doing wrong ?
>   JavaStreamingContext ssc = new JavaStreamingContext(sc, 
> new Duration(5000));
> Exception in Spark Streamings
> Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
> recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
> java.lang.AssertionError: assertion failed: Failed to get records for 
> spark-executor-xxx pulse 1 163684030 after polling for 512
>   at scala.Predef$.assert(Predef.scala:170)
>   at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
>   at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
>   at 
> scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
>   at 
> com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22012) CLONE - Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-14 Thread Karan Singh (JIRA)
Karan Singh created SPARK-22012:
---

 Summary: CLONE - Spark Streaming, Kafka receiver, "Failed to get 
records for ... after polling for 512"
 Key: SPARK-22012
 URL: https://issues.apache.org/jira/browse/SPARK-22012
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.0.0
 Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11
Reporter: Karan Singh


We have a Spark Streaming application reading records from Kafka 0.10.

Some tasks fail with the following error:
"java.lang.AssertionError: assertion failed: Failed to get records for (...) 
after polling for 512"

The first attempt fails and the second attempt (retry) completes successfully; 
this is the pattern we see for many tasks in our logs. These failures and 
retries consume resources.

A similar case with a stack trace is described here:
https://www.mail-archive.com/user@spark.apache.org/msg56564.html
https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767

Here is the line from the stack trace where the error is raised:
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)

We tried several values for "spark.streaming.kafka.consumer.poll.ms" (2, 5, 
10, 30 and 60 seconds), but the error appeared in all cases except the last 
one. Moreover, increasing the threshold increased the total Spark stage 
duration.
In other words, raising "spark.streaming.kafka.consumer.poll.ms" led to 
fewer task failures but at the cost of longer stages, which is bad for 
performance when processing data streams.

We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other 
related classes) which inhibits the reading process.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-09-14 Thread Karan Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166112#comment-16166112
 ] 

Karan Singh commented on SPARK-19275:
-

Hi Team,
My Spark Streaming batch duration is 5 seconds (5000 ms) and Kafka is at its default 
properties, yet I am still facing this issue. Can anyone tell me how to resolve 
it, or what I am doing wrong?
JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(5000));


Exception in Spark Streaming:
Job aborted due to stage failure: Task 6 in stage 289.0 failed 4 times, most 
recent failure: Lost task 6.3 in stage 289.0 (xx, executor 2): 
java.lang.AssertionError: assertion failed: Failed to get records for 
spark-executor-xxx pulse 1 163684030 after polling for 512
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
at 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:165)
at 
com.olacabs.analytics.engine.EngineManager$1$1.call(EngineManager.java:132)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:219)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

> Spark Streaming, Kafka receiver, "Failed to get records for ... after polling 
> for 512"
> --
>
> Key: SPARK-19275
> URL: https://issues.apache.org/jira/browse/SPARK-19275
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11
>Reporter: Dmitry Ochnev
>
> We have a Spark Streaming application reading records from Kafka 0.10.
> Some tasks are failed because of the following error:
> "java.lang.AssertionError: assertion failed: Failed to get records for (...) 
> after polling for 512"
> The first attempt fails and the second attempt (retry) completes 
> successfully, - this is the pattern that we see for many tasks in our logs. 
> These fails and retries consume resources.
> A similar case with a stack trace are described here:
> https://www.mail-archive.com/user@spark.apache.org/msg56564.html
> https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767
> Here is the line from the stack trace where the error is raised:
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
> We tried several values for "spark.streaming.kafka.consumer.poll.ms", - 2, 5, 
> 10, 30 and 60 seconds, but the error appeared in all the cases except the 
> last one. Moreover, increasing the threshold led to increasing total Spark 
> stage duration.
> In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to 
> fewer task failures but with cost of total stage duration. So, it is bad for 
> performance when processing data streams.
> We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other 
> related classes) which inhibits the reading process.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22011) model <- spark.logit(training, Survived ~ ., regParam = 0.5) showing error

2017-09-14 Thread Atul Khairnar (JIRA)
Atul Khairnar created SPARK-22011:
-

 Summary: model <- spark.logit(training, Survived ~ ., regParam = 
0.5) showing error
 Key: SPARK-22011
 URL: https://issues.apache.org/jira/browse/SPARK-22011
 Project: Spark
  Issue Type: Bug
  Components: Examples
Affects Versions: 2.2.0
 Environment: Error showing on SparkR 
Reporter: Atul Khairnar
 Fix For: 0.8.2


Sys.setenv(SPARK_HOME="C:/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

Sys.setenv(JAVA_HOME="C:/Program Files/Java/jdk1.8.0_144/")

library(SparkR)
sc <- sparkR.session(master = "local")
sqlContext <- sparkRSQL.init(sc)

Output: the following warning is shown in RStudio:
Warning message:
'sparkRSQL.init' is deprecated.
Use 'sparkR.session' instead.
See help("Deprecated") 

Can you help me understand what exactly this error/warning means? And next:

model <- spark.logit(training, Survived ~ ., regParam = 0.5)

Error in handleErrors(returnStatus, conn) : 
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 35.0 failed 1 times, most recent failure: Lost task 0.0 in stage 35.0 
(TID 31, localhost, executor driver): org.apache.spark.SparkException: 
SPARK_HOME not set. Can't locate SparkR package.
at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
at org.apache.spark.api.r.RUtils$$anonfun$2.apply(RUtils.scala:88)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.api.r.RUtils$.sparkRPackagePath(RUtils.scala:87)
at org.apache.spark.api.r.RRunner$.createRProcess(RRunner.scala:339)
at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:391)
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22008:


Assignee: Apache Spark

> Spark Streaming Dynamic Allocation auto fix maxNumExecutors
> ---
>
> Key: SPARK-22008
> URL: https://issues.apache.org/jira/browse/SPARK-22008
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Yue Ma
>Assignee: Apache Spark
>Priority: Minor
>
> In SparkStreaming DRA .The metric we use to add or remove executor is the 
> ratio of batch processing time / batch duration (R). And we use the parameter 
> "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of 
> executor .Currently it doesn't work well with Spark streaming because of 
> several reasons:
> (1) For example if the max nums of executor we need is 10 and we set 
> "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 
> executors.
> (2) If  the number of topic partition changes ,then the partition of KafkaRDD 
> or  the num of tasks in a stage changes too.And the max executor we need will 
> also change,so the num of maxExecutors should change with the nums of Task .
> The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner 
> when Stage Submitted ,first figure out the num executor we need  , then 
> update the maxNumExecutor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166106#comment-16166106
 ] 

Apache Spark commented on SPARK-22008:
--

User 'mayuehappy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19233

> Spark Streaming Dynamic Allocation auto fix maxNumExecutors
> ---
>
> Key: SPARK-22008
> URL: https://issues.apache.org/jira/browse/SPARK-22008
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Yue Ma
>Priority: Minor
>
> In SparkStreaming DRA .The metric we use to add or remove executor is the 
> ratio of batch processing time / batch duration (R). And we use the parameter 
> "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of 
> executor .Currently it doesn't work well with Spark streaming because of 
> several reasons:
> (1) For example if the max nums of executor we need is 10 and we set 
> "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 
> executors.
> (2) If  the number of topic partition changes ,then the partition of KafkaRDD 
> or  the num of tasks in a stage changes too.And the max executor we need will 
> also change,so the num of maxExecutors should change with the nums of Task .
> The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner 
> when Stage Submitted ,first figure out the num executor we need  , then 
> update the maxNumExecutor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22008:


Assignee: (was: Apache Spark)

> Spark Streaming Dynamic Allocation auto fix maxNumExecutors
> ---
>
> Key: SPARK-22008
> URL: https://issues.apache.org/jira/browse/SPARK-22008
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Yue Ma
>Priority: Minor
>
> In SparkStreaming DRA .The metric we use to add or remove executor is the 
> ratio of batch processing time / batch duration (R). And we use the parameter 
> "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of 
> executor .Currently it doesn't work well with Spark streaming because of 
> several reasons:
> (1) For example if the max nums of executor we need is 10 and we set 
> "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 
> executors.
> (2) If  the number of topic partition changes ,then the partition of KafkaRDD 
> or  the num of tasks in a stage changes too.And the max executor we need will 
> also change,so the num of maxExecutors should change with the nums of Task .
> The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner 
> when Stage Submitted ,first figure out the num executor we need  , then 
> update the maxNumExecutor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166097#comment-16166097
 ] 

Hyukjin Kwon commented on SPARK-22010:
--

I don't think this is worth fixing for now. The improvement looks quite trivial, 
but it sounds like we would have to reinvent the wheel. Do you know a simple and 
well-known workaround, or any measurement comparing the custom fix with the 
current implementation? Otherwise, I'd close this as {{Won't Fix}}.

> Slow fromInternal conversion for TimestampType
> --
>
> Key: SPARK-22010
> URL: https://issues.apache.org/jira/browse/SPARK-22010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> To convert timestamp type to python we are using 
> `datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 
> 100)`
> code.
> {code}
> In [34]: %%timeit
> ...: 
> datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
> ...:
> 4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> {code}
> It's slow, because:
> # we're trying to get TZ on every conversion
> # we're using replace method
> Proposed solution: custom datetime conversion and move calculation of TZ to 
> module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22006) date/datetime comparisons should avoid casting

2017-09-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22006.
--
Resolution: Invalid

Please see 
https://github.com/apache/spark/blob/183d4cb71fbcbf484fc85d8621e1fe04cbbc8195/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L119-L121:

{code}
// We should cast all relative timestamp/date/string comparison into string comparisons
// This behaves as a user would expect because timestamp strings sort lexicographically.
// i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
{code}

I think direct comparison between {{datetime}} and {{date}} is not even allowed 
in Python itself:

{code}
>>> import datetime
>>> datetime.date(2017, 1, 1) > datetime.datetime(2017, 1, 1)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: can't compare datetime.datetime to datetime.date
{code}
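
As a quick illustration of the plan difference reported in this issue, here is a minimal PySpark sketch (the path and the {{local_read_at}} column are placeholders taken from the issue description; {{explain()}} prints the physical plan, where the string cast vs. the pushed filter is visible):
{code}
# Minimal sketch: compare the plans produced by a date literal vs a datetime literal.
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.load("s3a://some/parquet/dir")  # hypothetical Parquet location

# date literal: per TypeCoercion, the comparison is widened to string, so the
# plan contains cast(local_read_at as string) and no pushed Parquet filter.
df.filter(df["local_read_at"] > datetime.date(2017, 1, 1)).explain()

# datetime literal: the comparison stays on TimestampType and shows up under
# PushedFilters in the FileScan node.
df.filter(df["local_read_at"] > datetime.datetime(2017, 1, 1)).explain()
{code}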

> date/datetime comparisons should avoid casting
> --
>
> Key: SPARK-22006
> URL: https://issues.apache.org/jira/browse/SPARK-22006
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Adrian Bridgett
>Priority: Minor
>  Labels: performance
>
> I believe there's a relatively simple optimisation that can be done here - 
> comparing timestamps with dates involves a cast whereas comparing with 
> datetimes avoids this (and pushes the query down into parquet:
> {code}
> df.filter(df['local_read_at'] > datetime.date('2017-01-01')).count()
> {code}
> Results in a plan of:
> {code}
>  +- *Filter (isnotnull(local_read_at#324) && (cast(local_read_at#324 
> as string) > 2017-01-01))
> +- *FileScan parquet [local_read_at#324] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [], 
> PushedFilters: [IsNotNull(local_read_at)], ReadSchema: 
> struct
> {code}
> Whereas:
> {code}
> df.filter(df['local_read_at'] > datetime.datetime(2017,1,1)).count()
> {code}
> Results in:
> {code}
>  +- *Filter (isnotnull(local_read_at#324) && (local_read_at#324 > 
> 14832288))
> +- *FileScan parquet [local_read_at#324] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[s3a://...], PartitionFilters: [], 
> PushedFilters: [IsNotNull(local_read_at), 
> GreaterThan(local_read_at,2017-01-01 00:00:00.0)], ReadSchema: 
> struct
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors

2017-09-14 Thread Yue Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yue Ma updated SPARK-22008:
---
Description: 
In Spark Streaming dynamic resource allocation (DRA), the metric used to add or 
remove executors is the ratio of batch processing time to batch duration (R), and 
the parameter "spark.streaming.dynamicAllocation.maxExecutors" sets the maximum 
number of executors. Currently this doesn't work well with Spark Streaming, for 
several reasons:
(1) For example, if the maximum number of executors we need is 10 and we set 
"spark.streaming.dynamicAllocation.maxExecutors" to 15, we obviously waste 5 
executors.
(2) If the number of topic partitions changes, then the number of KafkaRDD 
partitions, and hence the number of tasks in a stage, changes too, and so does 
the maximum number of executors we need; maxExecutors should therefore change 
with the number of tasks.

The goal of this JIRA is to adjust maxNumExecutors automatically: using a 
SparkListener, when a stage is submitted, first figure out the number of 
executors we need, then update maxNumExecutors.

  was:
In SparkStreaming DRA .The metric we use to add or remove executor is the ratio 
of batch processing time / batch duration (R). And we use the parameter 
"spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of executor 
.Currently it doesn't work well with Spark streaming because of several reasons:
(1) For example if the max nums of executor we need is 10 and we set 
"spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 
executors.
(2) If  the number of topic partition changes ,then the partition of KafkaRDD 
or  the num of tasks in a stage changes too.And the max executor we need will 
also change,so the num of maxExecutors should change with the nums of Task .


The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner 
when Stage Submitted ,first figure out the num executor we need  , then update 
the maxNumExecutor


> Spark Streaming Dynamic Allocation auto fix maxNumExecutors
> ---
>
> Key: SPARK-22008
> URL: https://issues.apache.org/jira/browse/SPARK-22008
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Yue Ma
>Priority: Minor
>
> In SparkStreaming DRA .The metric we use to add or remove executor is the 
> ratio of batch processing time / batch duration (R). And we use the parameter 
> "spark.streaming.dynamicAllocation.maxExecutors" to set the max Num of 
> executor .Currently it doesn't work well with Spark streaming because of 
> several reasons:
> (1) For example if the max nums of executor we need is 10 and we set 
> "spark.streaming.dynamicAllocation.maxExecutors" to 15,Obviously ,We wasted 5 
> executors.
> (2) If  the number of topic partition changes ,then the partition of KafkaRDD 
> or  the num of tasks in a stage changes too.And the max executor we need will 
> also change,so the num of maxExecutors should change with the nums of Task .
> The goal of this JIRA is to auto fix maxNumExecutors . Using a SparkListerner 
> when Stage Submitted ,first figure out the num executor we need  , then 
> update the maxNumExecutor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-14 Thread Nikolaos Tsipas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166086#comment-16166086
 ] 

Nikolaos Tsipas commented on SPARK-21935:
-

Thanks for the response [~srowen]! The problem only appears when a Python UDF 
is used; the same UDF written in Scala doesn't cause any memory issues.

However, if you still think the issue is somewhere else in the app, what would 
be the best way to debug it? Should we focus on Spark or YARN? Also, if you 
can think of any more specific debugging steps, please make suggestions.

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
> delays the errors but eventually I get them before the end of the job. The 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> Cpu utilisation is very bad with pyspark
> !cpu.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166069#comment-16166069
 ] 

Sean Owen commented on SPARK-21935:
---

The error is just running out of memory. Generally it is not true that UDFs 
cause the whole thing to fail; it is likely an issue elsewhere in your app.
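
For anyone hitting the same YARN container kill, here is a minimal sketch of what the "Consider boosting spark.yarn.executor.memoryOverhead" hint translates to (the property names come from the error message and the reporter's configuration; the values are illustrative, not a recommendation, and in practice they are usually set in spark-defaults.conf or via --conf on spark-submit):
{code}
# Illustrative only: raise the per-executor YARN overhead, which is where the
# PySpark worker processes live, and keep the container total within the node.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("udf-memory-overhead-sketch")
    .config("spark.yarn.executor.memoryOverhead", "3072")  # MB of off-heap room
    .config("spark.executor.memory", "8g")                 # smaller JVM heap
    .getOrCreate()
)

# equivalent spark-submit form:
#   spark-submit --conf spark.yarn.executor.memoryOverhead=3072 \
#                --conf spark.executor.memory=8g app.py
{code}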

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
> delays the errors but eventually I get them before the end of the job. The 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> Cpu utilisation is very bad with pyspark
> !cpu.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21935) Pyspark UDF causing ExecutorLostFailure

2017-09-14 Thread Alistair Wooldrige (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166063#comment-16166063
 ] 

Alistair Wooldrige commented on SPARK-21935:


I get this a lot, and see it *only* when introducing a Pyspark UDF into a Spark 
SQL job. The job will run perfectly fine without the UDF, but as soon as one is 
introduced these {{ExecutorLostFailure}} errors start. This is the case even if 
the UDF does no significant work and just returns a static string such as:
{noformat}
def _test_udf(_):
return "apple"
{noformat}

Any suggestions would be greatly appreciated as I've also tried many of the 
tuning fixes that [~nicktgr15] mentioned, to no avail.

> Pyspark UDF causing ExecutorLostFailure 
> 
>
> Key: SPARK-21935
> URL: https://issues.apache.org/jira/browse/SPARK-21935
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Nikolaos Tsipas
>  Labels: pyspark, udf
> Attachments: cpu.png, Screen Shot 2017-09-06 at 11.30.28.png, Screen 
> Shot 2017-09-06 at 11.31.13.png, Screen Shot 2017-09-06 at 11.31.31.png
>
>
> Hi,
> I'm using spark 2.1.0 on AWS EMR (Yarn) and trying to use a UDF in python as 
> follows:
> {code}
> from pyspark.sql.functions import col, udf
> from pyspark.sql.types import StringType
> path = 's3://some/parquet/dir/myfile.parquet'
> df = spark.read.load(path)
> def _test_udf(useragent):
> return useragent.upper()
> test_udf = udf(_test_udf, StringType())
> df = df.withColumn('test_field', test_udf(col('std_useragent')))
> df.write.parquet('/output.parquet')
> {code}
> The following config is used in {{spark-defaults.conf}} (using 
> {{maximizeResourceAllocation}} in EMR)
> {code}
> ...
> spark.executor.instances 4
> spark.executor.cores 8
> spark.driver.memory  8G
> spark.executor.memory9658M
> spark.default.parallelism64
> spark.driver.maxResultSize   3G
> ...
> {code}
> The cluster has 4 worker nodes (+1 master) with the following specs: 8 vCPU, 
> 15 GiB memory, 160 SSD GB storage
> The above example fails every single time with errors like the following:
> {code}
> 17/09/06 09:58:08 WARN TaskSetManager: Lost task 26.1 in stage 1.0 (TID 50, 
> ip-172-31-7-125.eu-west-1.compute.internal, executor 10): ExecutorLostFailure 
> (executor 10 exited caused by one of the running tasks) Reason: Container 
> killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical 
> memory used. Consider boosting spark.yarn.executor.memoryOverhead.
> {code}
> I tried to increase the  {{spark.yarn.executor.memoryOverhead}} to 3000 which 
> delays the errors but eventually I get them before the end of the job. The 
> job eventually fails.
> !Screen Shot 2017-09-06 at 11.31.31.png|width=800!
> If I run the above job in scala everything works as expected (without having 
> to adjust the memoryOverhead)
> {code}
> import org.apache.spark.sql.functions.udf
> val upper: String => String = _.toUpperCase
> val df = spark.read.load("s3://some/parquet/dir/myfile.parquet")
> val upperUDF = udf(upper)
> val newdf = df.withColumn("test_field", upperUDF(col("std_useragent")))
> newdf.write.parquet("/output.parquet")
> {code}
> !Screen Shot 2017-09-06 at 11.31.13.png|width=800!
> Cpu utilisation is very bad with pyspark
> !cpu.png|width=800!
> Is this a known bug with pyspark and udfs or is it a matter of bad 
> configuration? 
> Looking forward to suggestions. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22010) Slow fromInternal conversion for TimestampType

2017-09-14 Thread Maciej Bryński (JIRA)
Maciej Bryński created SPARK-22010:
--

 Summary: Slow fromInternal conversion for TimestampType
 Key: SPARK-22010
 URL: https://issues.apache.org/jira/browse/SPARK-22010
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Maciej Bryński


To convert timestamp type to python we are using 
`datetime.datetime.fromtimestamp(ts // 100).replace(microsecond=ts % 
100)`
code.

{code}
In [34]: %%timeit
...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
...:
4.2 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}

It's slow, because:
# we're trying to get TZ on every conversion
# we're using replace method

Proposed solution: custom datetime conversion and move calculation of TZ to 
module



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22009) Using treeAggregate improve some algs

2017-09-14 Thread zhengruifeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-22009:
-
Description: 
I tested on a dataset of about 13M instances and found that using 
`treeAggregate` gives a speedup in the following algorithms:

OneHotEncoder ~ 5%
StatFunctions.calculateCov ~ 7%
StatFunctions.multipleApproxQuantiles ~ 9% 
RegressionEvaluator ~ 8% 

  was:
I test on a dataset of about 13M instances, and found that using 
`treeAggregate` give a speedup in following algs:

OneHotEncoder ~ 5%
StatFunctions.calculateCov ~ 13%
StatFunctions.multipleApproxQuantiles ~ 9% 
RegressionEvaluator ~ 8% 


> Using treeAggregate improve some algs
> -
>
> Key: SPARK-22009
> URL: https://issues.apache.org/jira/browse/SPARK-22009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I test on a dataset of about 13M instances, and found that using 
> `treeAggregate` give a speedup in following algs:
> OneHotEncoder ~ 5%
> StatFunctions.calculateCov ~ 7%
> StatFunctions.multipleApproxQuantiles ~ 9% 
> RegressionEvaluator ~ 8% 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22009) Using treeAggregate improve some algs

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22009:


Assignee: Apache Spark

> Using treeAggregate improve some algs
> -
>
> Key: SPARK-22009
> URL: https://issues.apache.org/jira/browse/SPARK-22009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Minor
>
> I tested on a dataset of about 13M instances and found that using 
> `treeAggregate` gives a speedup in the following algorithms:
> OneHotEncoder ~ 5%
> StatFunctions.calculateCov ~ 13%
> StatFunctions.multipleApproxQuantiles ~ 9% 
> RegressionEvaluator ~ 8% 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22009) Using treeAggregate improve some algs

2017-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166003#comment-16166003
 ] 

Apache Spark commented on SPARK-22009:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/19232

> Using treeAggregate improve some algs
> -
>
> Key: SPARK-22009
> URL: https://issues.apache.org/jira/browse/SPARK-22009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I tested on a dataset of about 13M instances and found that using 
> `treeAggregate` gives a speedup in the following algorithms:
> OneHotEncoder ~ 5%
> StatFunctions.calculateCov ~ 13%
> StatFunctions.multipleApproxQuantiles ~ 9% 
> RegressionEvaluator ~ 8% 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22009) Using treeAggregate improve some algs

2017-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22009:


Assignee: (was: Apache Spark)

> Using treeAggregate improve some algs
> -
>
> Key: SPARK-22009
> URL: https://issues.apache.org/jira/browse/SPARK-22009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: zhengruifeng
>Priority: Minor
>
> I tested on a dataset of about 13M instances and found that using 
> `treeAggregate` gives a speedup in the following algorithms:
> OneHotEncoder ~ 5%
> StatFunctions.calculateCov ~ 13%
> StatFunctions.multipleApproxQuantiles ~ 9% 
> RegressionEvaluator ~ 8% 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22009) Using treeAggregate improve some algs

2017-09-14 Thread zhengruifeng (JIRA)
zhengruifeng created SPARK-22009:


 Summary: Using treeAggregate improve some algs
 Key: SPARK-22009
 URL: https://issues.apache.org/jira/browse/SPARK-22009
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: zhengruifeng
Priority: Minor


I tested on a dataset of about 13M instances and found that using 
`treeAggregate` gives a speedup in the following algorithms:

OneHotEncoder ~ 5%
StatFunctions.calculateCov ~ 13%
StatFunctions.multipleApproxQuantiles ~ 9% 
RegressionEvaluator ~ 8% 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22008) Spark Streaming Dynamic Allocation auto fix maxNumExecutors

2017-09-14 Thread Yue Ma (JIRA)
Yue Ma created SPARK-22008:
--

 Summary: Spark Streaming Dynamic Allocation auto fix 
maxNumExecutors
 Key: SPARK-22008
 URL: https://issues.apache.org/jira/browse/SPARK-22008
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Yue Ma
Priority: Minor


In Spark Streaming DRA, the metric used to decide whether to add or remove executors 
is the ratio of batch processing time to batch duration (R), and the parameter 
"spark.streaming.dynamicAllocation.maxExecutors" sets the maximum number of executors.
Currently this doesn't work well with Spark Streaming, for several reasons:
(1) If the maximum number of executors we actually need is 10 but we set 
"spark.streaming.dynamicAllocation.maxExecutors" to 15, we obviously waste 5 
executors.
(2) If the number of topic partitions changes, the partitioning of the KafkaRDD, 
and therefore the number of tasks in a stage, changes too. The maximum number of 
executors we need changes with it, so maxExecutors should track the number of tasks.


The goal of this JIRA is to adjust maxNumExecutors automatically: using a SparkListener, 
when a stage is submitted, first figure out the number of executors needed, then update 
maxNumExecutors.
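
A hedged sketch of the per-stage calculation the proposal describes (names are hypothetical, not Spark Streaming DRA code): derive the executor count needed to run a stage's tasks in one wave and use it as the new cap.

{code}
import math

def required_max_executors(num_tasks, cores_per_executor):
    # Hypothetical helper: executors needed to run all of a stage's tasks in a single wave.
    return math.ceil(num_tasks / cores_per_executor)

# e.g. a stage of 37 tasks on 4-core executors would cap dynamic allocation at 10 executors
assert required_max_executors(37, 4) == 10
{code}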



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22007) spark-submit on yarn or local , got different result

2017-09-14 Thread xinzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xinzhang resolved SPARK-22007.
--
Resolution: Won't Fix

> spark-submit on yarn or local , got different result
> 
>
> Key: SPARK-22007
> URL: https://issues.apache.org/jira/browse/SPARK-22007
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: xinzhang
>
> submit the py script on local.
> /opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
> result:
> ++
> |databaseName|
> ++
> | default|
> | |
> |   x|
> ++
> submit the py script on yarn.
> /opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster 
> test_hive.py
> result:
> ++
> |databaseName|
> ++
> | default|
> ++
> the py script :
> [yangtt@dc-gateway119 test]$ cat test_hive.py 
> #!/usr/bin/env python
> #coding=utf-8
> from os.path import expanduser, join, abspath
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> from pyspark.conf import SparkConf
> def squared(s):
>   return s * s
> warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')
> spark = SparkSession \
> .builder \
> .appName("Python_Spark_SQL_Hive") \
> .config("spark.sql.warehouse.dir", warehouse_location) \
> .config(conf=SparkConf()) \
> .enableHiveSupport() \
> .getOrCreate()
> spark.udf.register("squared",squared)
> spark.sql("show databases").show()
> Q: why does Spark load a different Hive metastore, and why does the YARN job always use Derby?
> 17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
> DERBY
> My current metastore is in MySQL.
> Any suggestions would be helpful.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22007) spark-submit on yarn or local , got different result

2017-09-14 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165972#comment-16165972
 ] 

xinzhang commented on SPARK-22007:
--

Yes, I figured it out. Add this when building the SparkSession:
.config("hive.metastore.uris", "thrift://11.11.11.11:9083") \

Maybe the docs here should describe this in more detail:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.SparkSession.Builder
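
For completeness, a minimal sketch of the builder with that setting added (the thrift host/port is the placeholder from the comment above; substitute your own metastore address and warehouse path):

{code}
from pyspark.sql import SparkSession

# Pointing the session explicitly at the remote Hive metastore keeps the
# YARN cluster-mode driver from falling back to a local Derby metastore.
spark = (SparkSession.builder
         .appName("Python_Spark_SQL_Hive")
         .config("spark.sql.warehouse.dir", "/group/user/yangtt/meta/hive-temp-table")
         .config("hive.metastore.uris", "thrift://11.11.11.11:9083")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("show databases").show()
{code}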


> spark-submit on yarn or local , got different result
> 
>
> Key: SPARK-22007
> URL: https://issues.apache.org/jira/browse/SPARK-22007
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: xinzhang
>
> submit the py script on local.
> /opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
> result:
> ++
> |databaseName|
> ++
> | default|
> | |
> |   x|
> ++
> submit the py script on yarn.
> /opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster 
> test_hive.py
> result:
> ++
> |databaseName|
> ++
> | default|
> ++
> the py script :
> [yangtt@dc-gateway119 test]$ cat test_hive.py 
> #!/usr/bin/env python
> #coding=utf-8
> from os.path import expanduser, join, abspath
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> from pyspark.conf import SparkConf
> def squared(s):
>   return s * s
> warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')
> spark = SparkSession \
> .builder \
> .appName("Python_Spark_SQL_Hive") \
> .config("spark.sql.warehouse.dir", warehouse_location) \
> .config(conf=SparkConf()) \
> .enableHiveSupport() \
> .getOrCreate()
> spark.udf.register("squared",squared)
> spark.sql("show databases").show()
> Q: why does Spark load a different Hive metastore, and why does the YARN job always use Derby?
> 17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
> DERBY
> My current metastore is in MySQL.
> Any suggestions would be helpful.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true

2017-09-14 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165946#comment-16165946
 ] 

Daniel Darabos commented on SPARK-21418:


Sean's fix should cover you no matter what triggers the unexpected {{toString}} 
call. You could try building from {{master}} (or taking a nightly from 
https://spark.apache.org/developer-tools.html#nightly-builds) to confirm that 
this is the case.

> NoSuchElementException: None.get in DataSourceScanExec with 
> sun.io.serialization.extendedDebugInfo=true
> ---
>
> Key: SPARK-21418
> URL: https://issues.apache.org/jira/browse/SPARK-21418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Daniel Darabos
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> I don't have a minimal reproducible example yet, sorry. I have the following 
> lines in a unit test for our Spark application:
> {code}
> val df = mySparkSession.read.format("jdbc")
>   .options(Map("url" -> url, "dbtable" -> "test_table"))
>   .load()
> df.show
> println(df.rdd.collect)
> {code}
> The output shows the DataFrame contents from {{df.show}}. But the {{collect}} 
> fails:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> serialization failed: java.util.NoSuchElementException: None.get
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   

[jira] [Updated] (SPARK-22007) spark-submit on yarn or local , got different result

2017-09-14 Thread xinzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xinzhang updated SPARK-22007:
-
Description: 
submit the py script on local.
/opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
result:
++
|databaseName|
++
| default|
| |
|   x|
++

submit the py script on yarn.
/opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster 
test_hive.py
result:
++
|databaseName|
++
| default|
++

the py script :

[yangtt@dc-gateway119 test]$ cat test_hive.py 
#!/usr/bin/env python
#coding=utf-8

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.conf import SparkConf

def squared(s):
  return s * s

warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')

spark = SparkSession \
.builder \
.appName("Python_Spark_SQL_Hive") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config(conf=SparkConf()) \
.enableHiveSupport() \
.getOrCreate()

spark.udf.register("squared",squared)

spark.sql("show databases").show()



Q: why does Spark load a different Hive metastore, and why does the YARN job always use Derby?
17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
DERBY
My current metastore is in MySQL.
Any suggestions would be helpful.
Thanks.

  was:
submit the py script on local.
/opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
result:
++
|databaseName|
++
| default|
| |
|   x|
++

submit the py script on yarn.
/opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster 
test_hive.py
result:
++
|databaseName|
++
| default|
++

the py script :

[yangtt@dc-gateway119 test]$ cat test_hive.py 
#!/usr/bin/env python
#coding=utf-8

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.conf import SparkConf

def squared(s):
  return s * s

# warehouse_location points to the default location for managed databases and 
tables
warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')

spark = SparkSession \
.builder \
.appName("Python_Spark_SQL_Hive") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config(conf=SparkConf()) \
.enableHiveSupport() \
.getOrCreate()

spark.udf.register("squared",squared)

spark.sql("show databases").show()



Q: why does Spark load a different Hive metastore, and why does the YARN job always use Derby?
17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
DERBY
My current metastore is in MySQL.
Any suggestions would be helpful.
Thanks.


> spark-submit on yarn or local , got different result
> 
>
> Key: SPARK-22007
> URL: https://issues.apache.org/jira/browse/SPARK-22007
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: xinzhang
>
> submit the py script on local.
> /opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
> result:
> ++
> |databaseName|
> ++
> | default|
> | |
> |   x|
> ++
> submit the py script on yarn.
> /opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster 
> test_hive.py
> result:
> ++
> |databaseName|
> ++
> | default|
> ++
> the py script :
> [yangtt@dc-gateway119 test]$ cat test_hive.py 
> #!/usr/bin/env python
> #coding=utf-8
> from os.path import expanduser, join, abspath
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> from pyspark.conf import SparkConf
> def squared(s):
>   return s * s
> warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')
> spark = SparkSession \
> .builder \
> .appName("Python_Spark_SQL_Hive") \
> .config("spark.sql.warehouse.dir", warehouse_location) \
> .config(conf=SparkConf()) \
> .enableHiveSupport() \
> .getOrCreate()
> spark.udf.register("squared",squared)
> spark.sql("show databases").show()
> Q: why does Spark load a different Hive metastore, and why does the YARN job always use Derby?
> 17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
> DERBY
> My current metastore is in MySQL.
> Any suggestions would be helpful.
> Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22007) spark-submit on yarn or local , got different result

2017-09-14 Thread xinzhang (JIRA)
xinzhang created SPARK-22007:


 Summary: spark-submit on yarn or local , got different result
 Key: SPARK-22007
 URL: https://issues.apache.org/jira/browse/SPARK-22007
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Shell, Spark Submit
Affects Versions: 2.1.0
Reporter: xinzhang


submit the py script on local.
/opt/spark/spark-bin/bin/spark-submit --master local cluster test_hive.py
result:
++
|databaseName|
++
| default|
| |
|   x|
++

submit the py script on yarn.
/opt/spark/spark-bin/bin/spark-submit --master yarn --deploy-mode cluster 
test_hive.py
result:
++
|databaseName|
++
| default|
++

the py script :

[yangtt@dc-gateway119 test]$ cat test_hive.py 
#!/usr/bin/env python
#coding=utf-8

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.conf import SparkConf

def squared(s):
  return s * s

# warehouse_location points to the default location for managed databases and 
tables
warehouse_location = abspath('/group/user/yangtt/meta/hive-temp-table')

spark = SparkSession \
.builder \
.appName("Python_Spark_SQL_Hive") \
.config("spark.sql.warehouse.dir", warehouse_location) \
.config(conf=SparkConf()) \
.enableHiveSupport() \
.getOrCreate()

spark.udf.register("squared",squared)

spark.sql("show databases").show()



Q: why does Spark load a different Hive metastore, and why does the YARN job always use Derby?
17/09/14 16:10:55 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
DERBY
My current metastore is in MySQL.
Any suggestions would be helpful.
Thanks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


