[jira] [Created] (SPARK-25098) UDF ‘Cast’ returns a wrong result when the input string starts/ends with a special character

2018-08-13 Thread ice bai (JIRA)
ice bai created SPARK-25098:
---

 Summary: UDF ‘Cast’ returns a wrong result when the input string 
starts/ends with a special character 
 Key: SPARK-25098
 URL: https://issues.apache.org/jira/browse/SPARK-25098
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: ice bai






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25098) UDF ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Summary: UDF ‘Cast’ will return NULL when input string starts/ends with 
special character   (was: UDF ‘Cast’ returns a wrong result when the input string 
starts/ends with a special character )

> UDF ‘Cast’ will return NULL when input string starts/ends with special 
> character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character.






[jira] [Updated] (SPARK-25098) UDF ‘Cast’ returns a wrong result when the input string starts/ends with a special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: UDF ‘Cast’ will return NULL when the input string starts/ends with 
a special character.

> UDF ‘Cast’ returns a wrong result when the input string starts/ends with a 
> special character 
> 
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character.






[jira] [Updated] (SPARK-25098) UDF ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```

  was:UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character.


> UDF ‘Cast’ will return NULL when input string starts/ends with special 
> character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, getting the hour from a string that ends with a blank:
> hive:
> ```
> hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) UDF ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```


> UDF ‘Cast’ will return NULL when input string starts/ends with special 
> character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, getting the hour from a string that ends with a blank:
> hive:
> ```
> hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) UDF ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```


> UDF ‘Cast’ will return NULL when input string starts/ends with special 
> character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, getting the hour from a string that ends with a blank:
> hive:
> ```
> hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) UDF ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```


> UDF ‘Cast’ will return NULL when input string starts/ends with special 
> character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, getting the hour from a string that ends with a blank:
> hive:
> ```
> hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Summary: ‘Cast’ will return NULL when input string starts/ends with special 
character   (was: UDF ‘Cast’ will return NULL when input string starts/ends 
with special character )

> ‘Cast’ will return NULL when input string starts/ends with special character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, getting the hour from a string that ends with a blank:
> hive:
> ```
> hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting to a date a string that starts with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, getting the hour from a string that ends with a blank:

hive:

```
hive> select hour('2018-07-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> select hour('2018-07-27 17:20:07 ');//ends with a blank
NULL
```


> ‘Cast’ will return NULL when input string starts/ends with special character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting to a date a string that starts with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting to a date a string that starts with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting to a date a string that starts with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```


> ‘Cast’ will return NULL when input string starts/ends with special character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting to a date a string that starts with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting to a date a string that starts with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting to a date a string that starts with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```


> ‘Cast’ will return NULL when input string starts/ends with special character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting to a date a string that starts with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting a string that starts with a blank, or getting the hour from a 
string that ends with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13

hive> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting to a date a string that starts with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```


> ‘Cast’ will return NULL when input string starts/ends with special character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting a string that starts with a blank, or getting the hour from 
> a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting a string that starts with a blank, or getting the hour from a 
string that ends with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13

hive> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL

spark-sql> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
NULL
```

  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting a string that starts with a blank, or getting the hour from a 
string that ends with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13

hive> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL
```


> ‘Cast’ will return NULL when input string starts/ends with special character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting a string that starts with a blank, or getting the hour from 
> a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
> NULL
> ```






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Description: 
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting a string that starts with a blank, or getting the hour from a 
string that ends with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13

hive> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL

spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
NULL
```

All of the following UDFs will be affected:
```
year
month
day
hour
minute
second
date_add
date_sub
```



  was:
UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
character, but Hive does not.

For example, casting a string that starts with a blank, or getting the hour from a 
string that ends with a blank:

hive:

```
hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
OK
2018-08-13

hive> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
OK
17
```

spark-sql:

```
spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
NULL

spark-sql> SELECT HOUR('2018-04-27 17:20:07 ');//ends with a blank
NULL
```


> ‘Cast’ will return NULL when input string starts/ends with special character 
> -
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting a string that starts with a blank, or getting the hour from 
> a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```
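For anyone hitting this on the affected versions, a minimal spark-shell sketch of the usual workaround is to trim the string before casting; the session setup below is only illustrative, and this is not the fix proposed in the pull request:

```
import org.apache.spark.sql.SparkSession

// Illustrative local session; in spark-shell the `spark` value already exists.
val spark = SparkSession.builder().appName("cast-trim-demo").master("local[*]").getOrCreate()

// Returns NULL on the affected versions because of the leading blank.
spark.sql("SELECT CAST(' 2018-08-13' AS DATE)").show()

// Trimming first gives the Hive-compatible result.
spark.sql("SELECT CAST(trim(' 2018-08-13') AS DATE)").show()
spark.sql("SELECT HOUR(trim('2018-08-13 17:20:07 '))").show()
```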






[jira] [Updated] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character(s)

2018-08-13 Thread ice bai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ice bai updated SPARK-25098:

Summary: ‘Cast’ will return NULL when input string starts/ends with special 
character(s)  (was: ‘Cast’ will return NULL when input string starts/ends with 
special character )

> ‘Cast’ will return NULL when input string starts/ends with special 
> character(s)
> ---
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting a string that starts with a blank, or getting the hour from 
> a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```






[jira] [Commented] (SPARK-24931) CoarseGrainedExecutorBackend sends wrong 'Reason' when executor exits, which leads to job failure.

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577948#comment-16577948
 ] 

Apache Spark commented on SPARK-24931:
--

User 'bingbai0912' has created a pull request for this issue:
https://github.com/apache/spark/pull/22088

> CoarseGrainedExecutorBackend sends wrong 'Reason' when executor exits, which 
> leads to job failure.
> -
>
> Key: SPARK-24931
> URL: https://issues.apache.org/jira/browse/SPARK-24931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> When an executor is lost for some reason (e.g. it is unable to register with the 
> external shuffle server), CoarseGrainedExecutorBackend will send a RemoveExecutor 
> event with an 'ExecutorLossReason'. But this causes TaskSetManager to handle the 
> failure in handleFailedTask with exitCausedByApp=true, which is not correct.
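To make the reported interaction easier to follow, here is a small self-contained Scala sketch; the class shapes below are simplified stand-ins, not Spark's real internal signatures:

```
// Simplified illustration of the report above (not Spark's actual internals).
// The loss reason carries a flag saying whether the failure was caused by the app.
sealed trait ExecutorLossReason { def exitCausedByApp: Boolean }
final case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean, message: String)
  extends ExecutorLossReason

// Scheduler side: if an infrastructure failure (e.g. "Unable to register with
// external shuffle server") arrives with exitCausedByApp = true, the failed tasks
// are charged to the job, which is what this ticket considers incorrect.
def handleFailedTask(reason: ExecutorLossReason): Unit =
  if (reason.exitCausedByApp) println("failure counted against the job")
  else println("infrastructure failure: tasks will be resubmitted without penalty")
```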






[jira] [Assigned] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character(s)

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25098:


Assignee: Apache Spark

> ‘Cast’ will return NULL when input string starts/ends with special 
> character(s)
> ---
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Assignee: Apache Spark
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting a string that starts with a blank, or getting the hour from 
> a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```






[jira] [Commented] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character(s)

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577954#comment-16577954
 ] 

Apache Spark commented on SPARK-25098:
--

User 'bingbai0912' has created a pull request for this issue:
https://github.com/apache/spark/pull/22089

> ‘Cast’ will return NULL when input string starts/ends with special 
> character(s)
> ---
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting a string that starts with a blank, or getting the hour from 
> a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```






[jira] [Assigned] (SPARK-25098) ‘Cast’ will return NULL when input string starts/ends with special character(s)

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25098:


Assignee: (was: Apache Spark)

> ‘Cast’ will return NULL when input string starts/ends with special 
> character(s)
> ---
>
> Key: SPARK-25098
> URL: https://issues.apache.org/jira/browse/SPARK-25098
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: ice bai
>Priority: Major
>
> UDF ‘Cast’ will return NULL when the input string starts/ends with a special 
> character, but Hive does not.
> For example, casting a string that starts with a blank, or getting the hour from 
> a string that ends with a blank:
> hive:
> ```
> hive> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> OK
> 2018-08-13
> hive> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> OK
> 17
> ```
> spark-sql:
> ```
> spark-sql> SELECT CAST(' 2018-08-13' AS DATE);//starts with a blank
> NULL
> spark-sql> SELECT HOUR('2018-08-13 17:20:07 ');//ends with a blank
> NULL
> ```
> All of the following UDFs will be affected:
> ```
> year
> month
> day
> hour
> minute
> second
> date_add
> date_sub
> ```






[jira] [Created] (SPARK-25099) Generate Avro Binary files in test suite

2018-08-13 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25099:
--

 Summary: Generate Avro Binary files in test suite
 Key: SPARK-25099
 URL: https://issues.apache.org/jira/browse/SPARK-25099
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


In PRs [https://github.com/apache/spark/pull/21984] and 
[https://github.com/apache/spark/pull/21935], the related test cases are created 
with Python scripts.

Generate the binary files in the test suite to make it more transparent.

Also move the related test cases to a new file, AvroLogicalTypeSuite.scala.
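As a rough sketch of what generating the binary inside the test suite could look like, here is a minimal Scala example using the Avro Java API; the schema, field name, and output path are made up for illustration and are not taken from the pull requests:

```
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

// Build a tiny record schema with a `date` logical type and write one row to a file,
// instead of checking in a binary produced by an external Python script.
val schemaJson =
  """{"type": "record", "name": "LogicalTypeTest", "fields":
    | [{"name": "d", "type": {"type": "int", "logicalType": "date"}}]}""".stripMargin
val schema = new Schema.Parser().parse(schemaJson)

val record = new GenericData.Record(schema)
record.put("d", 17756)  // days since the epoch, i.e. 2018-08-13

val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, new File("/tmp/logical_type_test.avro"))
writer.append(record)
writer.close()
```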






[jira] [Updated] (SPARK-25099) Generate Avro Binary files in test suite

2018-08-13 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-25099:
---
Description: 
In PRs [https://github.com/apache/spark/pull/21984] and 
[https://github.com/apache/spark/pull/21935], the related test cases are using 
binary files created by Python scripts.

Generate the binary files in the test suite to make it more transparent.

Also move the related test cases to a new file, AvroLogicalTypeSuite.scala.

  was:
In PRs [https://github.com/apache/spark/pull/21984] and 
[https://github.com/apache/spark/pull/21935], the related test cases are created 
with Python scripts.

Generate the binary files in the test suite to make it more transparent.

Also move the related test cases to a new file, AvroLogicalTypeSuite.scala.


> Generate Avro Binary files in test suite
> 
>
> Key: SPARK-25099
> URL: https://issues.apache.org/jira/browse/SPARK-25099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In PRs [https://github.com/apache/spark/pull/21984] and 
> [https://github.com/apache/spark/pull/21935], the related test cases are using 
> binary files created by Python scripts.
> Generate the binary files in the test suite to make it more transparent.
> Also move the related test cases to a new file, AvroLogicalTypeSuite.scala.






[jira] [Assigned] (SPARK-25099) Generate Avro Binary files in test suite

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25099:


Assignee: (was: Apache Spark)

> Generate Avro Binary files in test suite
> 
>
> Key: SPARK-25099
> URL: https://issues.apache.org/jira/browse/SPARK-25099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In PRs [https://github.com/apache/spark/pull/21984] and 
> [https://github.com/apache/spark/pull/21935], the related test cases are using 
> binary files created by Python scripts.
> Generate the binary files in the test suite to make it more transparent.
> Also move the related test cases to a new file, AvroLogicalTypeSuite.scala.






[jira] [Assigned] (SPARK-25099) Generate Avro Binary files in test suite

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25099:


Assignee: Apache Spark

> Generate Avro Binary files in test suite
> 
>
> Key: SPARK-25099
> URL: https://issues.apache.org/jira/browse/SPARK-25099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In PRs [https://github.com/apache/spark/pull/21984] and 
> [https://github.com/apache/spark/pull/21935], the related test cases are using 
> binary files created by Python scripts.
> Generate the binary files in the test suite to make it more transparent.
> Also move the related test cases to a new file, AvroLogicalTypeSuite.scala.






[jira] [Commented] (SPARK-25099) Generate Avro Binary files in test suite

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577976#comment-16577976
 ] 

Apache Spark commented on SPARK-25099:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22091

> Generate Avro Binary files in test suite
> 
>
> Key: SPARK-25099
> URL: https://issues.apache.org/jira/browse/SPARK-25099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In PRs [https://github.com/apache/spark/pull/21984] and 
> [https://github.com/apache/spark/pull/21935], the related test cases are using 
> binary files created by Python scripts.
> Generate the binary files in the test suite to make it more transparent.
> Also move the related test cases to a new file, AvroLogicalTypeSuite.scala.






[jira] [Updated] (SPARK-25093) CodeFormatter could avoid creating regex object again and again

2018-08-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-25093:

Priority: Minor  (was: Major)

> CodeFormatter could avoid creating regex object again and again
> ---
>
> Key: SPARK-25093
> URL: https://issues.apache.org/jira/browse/SPARK-25093
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Minor
>
> In the class `CodeFormatter`, the method `stripExtraNewLinesAndComments` could be 
> refactored to:
> {code:scala}
> val commentReg =
>   ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +  // strip /*comment*/
>    """([ |\t]*?\/\/[\s\S]*?\n)""").r             // strip //comment
> val emptyRowsReg = """\n\s*\n""".r
>
> def stripExtraNewLinesAndComments(input: String): String = {
>   val codeWithoutComment = commentReg.replaceAllIn(input, "")
>   emptyRowsReg.replaceAllIn(codeWithoutComment, "\n")  // strip extra new lines
> }
> {code}
> so the regexes would be compiled only once.






[jira] [Commented] (SPARK-25093) CodeFormatter could avoid creating regex object again and again

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578038#comment-16578038
 ] 

Marco Gaido commented on SPARK-25093:
-

I just marked this as a minor-priority ticket; anyway, I agree with the proposed 
improvement. Are you submitting a PR for it? Thanks.

> CodeFormatter could avoid creating regex object again and again
> ---
>
> Key: SPARK-25093
> URL: https://issues.apache.org/jira/browse/SPARK-25093
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Minor
>
> In the class `CodeFormatter`, the method `stripExtraNewLinesAndComments` could be 
> refactored to:
> {code:scala}
> val commentReg =
>   ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +  // strip /*comment*/
>    """([ |\t]*?\/\/[\s\S]*?\n)""").r             // strip //comment
> val emptyRowsReg = """\n\s*\n""".r
>
> def stripExtraNewLinesAndComments(input: String): String = {
>   val codeWithoutComment = commentReg.replaceAllIn(input, "")
>   emptyRowsReg.replaceAllIn(codeWithoutComment, "\n")  // strip extra new lines
> }
> {code}
> so the regexes would be compiled only once.






[jira] [Commented] (SPARK-25094) processNext() failed to compile: size is over 64 KB

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578037#comment-16578037
 ] 

Izek Greenfield commented on SPARK-25094:
-

[~hyukjin.kwon] Is the full plan OK?

> processNext() failed to compile: size is over 64 KB
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.






[jira] [Commented] (SPARK-24710) Information Gain Ratio for decision trees

2018-08-13 Thread Aleksey Zinoviev (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578045#comment-16578045
 ] 

Aleksey Zinoviev commented on SPARK-24710:
--

Could I take this ticket? 

> Information Gain Ratio for decision trees
> -
>
> Key: SPARK-24710
> URL: https://issues.apache.org/jira/browse/SPARK-24710
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Pablo J. Villacorta
>Priority: Minor
>  Labels: features
>
> Spark currently uses Information Gain (IG) to decide the next feature to 
> branch on when building a decision tree. In case of categorical features, IG 
> is known to be biased towards features with a large number of categories. 
> [Information Gain Ratio|https://en.wikipedia.org/wiki/Information_gain_ratio] 
> solves this problem by dividing the IG by a number that characterizes the 
> intrinsic information of a feature.
> As far as I know, Spark has IG but not IGR. It would be nice to have the 
> possibility to choose IGR instead of IG.
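For reference, the quantity being requested, written out with the standard definitions (not taken from Spark's code):

```
\mathrm{IGR}(S, A) = \frac{\mathrm{IG}(S, A)}{\mathrm{IV}(S, A)},
\qquad
\mathrm{IV}(S, A) = -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
```

Here S_1, ..., S_k are the subsets of S induced by the k values of the categorical feature A; a feature with many categories has a large intrinsic value IV, which damps its information gain.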






[jira] [Commented] (SPARK-25068) High-order function: exists(array, function) → boolean

2018-08-13 Thread Marek Novotny (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578087#comment-16578087
 ] 

Marek Novotny commented on SPARK-25068:
---

I am curious about the motivation for adding this function. :) Was it added 
just to get Spark aligned with some other SQL engine or standard? Or is the 
motivation to offer users a function with better performance in {{true}} cases 
compared to {{aggregate}}? If so, what about adding a function like Scala's 
{{forAll}} that would stop iterating when the lambda returns {{false}}?

> High-order function: exists(array, function) → boolean
> -
>
> Key: SPARK-25068
> URL: https://issues.apache.org/jira/browse/SPARK-25068
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> Tests whether the array contains at least one element for which the function returns true.
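For context, a quick spark-shell sketch of the function (Spark 2.4 higher-order function syntax; the literals are arbitrary):

```
// exists(array, predicate) is true if at least one element satisfies the predicate,
// so in principle it can stop at the first match instead of folding the whole array.
spark.sql("SELECT exists(array(1, 2, 3), x -> x % 2 = 0)").show()  // true
spark.sql("SELECT exists(array(1, 3, 5), x -> x % 2 = 0)").show()  // false
```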






[jira] [Created] (SPARK-25100) Using KryoSerializer and setting registrationRequired to true can lead to job failure

2018-08-13 Thread deshanxiao (JIRA)
deshanxiao created SPARK-25100:
--

 Summary: Using KryoSerializer and setting registrationRequired to 
true can lead to job failure
 Key: SPARK-25100
 URL: https://issues.apache.org/jira/browse/SPARK-25100
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: deshanxiao


When spark.serializer is org.apache.spark.serializer.KryoSerializer and 
spark.kryo.registrationRequired is true in SparkConf, and I invoke 
saveAsNewAPIHadoopDataset to store data in HDFS, the job fails because the 
class TaskCommitMessage hasn't been registered.

 
{code:java}
java.lang.IllegalArgumentException: Class is not registered: 
org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
Note: To register this class use: 
kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:488)
at com.twitter.chill.KryoBase.getRegistration(KryoBase.scala:52)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:97)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:347)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
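A minimal Scala sketch of the usual workaround, assuming it is acceptable to register the internal class by name (the app name and master URL are only placeholders):

```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-registration-demo")
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // Register the internal class reported as missing above.
  .set("spark.kryo.classesToRegister",
    "org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage")

val sc = new SparkContext(conf)
// saveAsNewAPIHadoopDataset should now serialize the commit message without the
// IllegalArgumentException shown in the stack trace.
```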
 






[jira] [Commented] (SPARK-25094) processNext() failed to compile: size is over 64 KB

2018-08-13 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578114#comment-16578114
 ] 

Hyukjin Kwon commented on SPARK-25094:
--

It's more about code generation. It would be nicer if we knew what input produces 
the output described in the JIRA. Otherwise I would rather resolve this as Cannot 
Reproduce, since strictly speaking no one knows how to reproduce it.

> processNext() failed to compile: size is over 64 KB
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.






[jira] [Updated] (SPARK-25094) processNext() failed to compile: size is over 64 KB

2018-08-13 Thread Izek Greenfield (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Izek Greenfield updated SPARK-25094:

Attachment: generated_code.txt

> processNext() failed to compile: size is over 64 KB
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.






[jira] [Commented] (SPARK-25094) processNext() failed to compile: size is over 64 KB

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578127#comment-16578127
 ] 

Izek Greenfield commented on SPARK-25094:
-

The code that creates this plan is very complex. 
I will try to reproduce it with simpler code; in the meanwhile I have attached the 
generated code so you can see the problem: the generated code does not split the 
work into separate functions and inlines the whole plan into the processNext() method. 
[^generated_code.txt]
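For reference, the two Spark SQL settings that govern this code path, shown as a spark-shell sketch (whether tuning them helps for this particular plan is untested):

```
// spark.sql.codegen.hugeMethodLimit: if a compiled generated method exceeds this
// bytecode size, Spark abandons whole-stage codegen for the plan and falls back to
// interpreted execution (similar to the fallback behind the WARN quoted in this ticket).
spark.conf.set("spark.sql.codegen.hugeMethodLimit", 65535)

// Whole-stage codegen can also be switched off entirely while investigating.
spark.conf.set("spark.sql.codegen.wholeStage", false)
```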

> processNext() failed to compile: size is over 64 KB
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.






[jira] [Resolved] (SPARK-25096) Loosen nullability if the cast is force-nullable.

2018-08-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25096.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22086
[https://github.com/apache/spark/pull/22086]

> Loosen nullability if the cast is force-nullable.
> -
>
> Key: SPARK-25096
> URL: https://issues.apache.org/jira/browse/SPARK-25096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> In type coercion for complex types, if the found type is force-nullable to 
> cast, we should loosen the nullability to be able to cast. Also for map key 
> type, we can't use the type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25096) Loosen nullability if the cast is force-nullable.

2018-08-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25096:


Assignee: Takuya Ueshin

> Loosen nullability if the cast is force-nullable.
> -
>
> Key: SPARK-25096
> URL: https://issues.apache.org/jira/browse/SPARK-25096
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.0
>
>
> In type coercion for complex types, if the found type is force-nullable to 
> cast, we should loosen the nullability to be able to cast. Also for map key 
> type, we can't use the type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25101) Creating leaderLatch with id for getting info about spark master nodes from zk

2018-08-13 Thread Yuliya (JIRA)
Yuliya created SPARK-25101:
--

 Summary: Creating leaderLatch with id for getting info about spark 
master nodes from zk
 Key: SPARK-25101
 URL: https://issues.apache.org/jira/browse/SPARK-25101
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Yuliya


Sometimes the Spark master nodes start faster than ZooKeeper, so all of my masters 
end up in STANDBY status. To work around this I check ZooKeeper for the registered 
leaderLatch nodes, but Spark currently creates its LeaderLatch without an id 
(ZooKeeperLeaderElectionAgent:41), so there is no way to tell which registered 
leaderLatch belongs to which Spark master node.
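
A minimal sketch of the proposed change, assuming a Curator {{CuratorFramework}} client and an election path (the method name and the id value here are illustrative, not the actual patch):

{code:scala}
import org.apache.curator.framework.CuratorFramework
import org.apache.curator.framework.recipes.leader.LeaderLatch

// Passing an id to the LeaderLatch constructor lets an external observer call
// getParticipants()/getLeader() on the latch path and tell the masters apart.
def createLatch(zk: CuratorFramework, electionPath: String, masterId: String): LeaderLatch =
  new LeaderLatch(zk, electionPath, masterId)
{code}

With an id set, {{LeaderLatch.getParticipants}} returns {{Participant}} entries whose ids identify the registered masters, which is exactly the information described above as missing.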



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25101) Creating leaderLatch with id for getting info about spark master nodes from zk

2018-08-13 Thread Yuliya (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578185#comment-16578185
 ] 

Yuliya commented on SPARK-25101:


https://github.com/apache/spark/pull/22092

> Creating leaderLatch with id for getting info about spark master nodes from zk
> --
>
> Key: SPARK-25101
> URL: https://issues.apache.org/jira/browse/SPARK-25101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuliya
>Priority: Major
>
> Sometimes spark master nodes start faster than zookeeper => all my masters 
> are in STANDBY status. For solve this problem I check zk for getting info 
> about registering leaderLatch objects, but currently spark creates 
> leaderLatch without id (ZooKeeperLeaderElectionAgent:41) and I have no 
> possibilities getting info about registered leaderLatch for spark master node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24391) to_json/from_json should support arrays of primitives, and more generally all JSON

2018-08-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24391.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21439
[https://github.com/apache/spark/pull/21439]

> to_json/from_json should support arrays of primitives, and more generally all 
> JSON 
> ---
>
> Key: SPARK-24391
> URL: https://issues.apache.org/jira/browse/SPARK-24391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sam Kitajima-Kimbrel
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/SPARK-19849 and 
> https://issues.apache.org/jira/browse/SPARK-21513 brought support for more 
> column types to functions.to_json/from_json, but I also have cases where I'd 
> like to simply (de)serialize an array of primitives to/from JSON when 
> outputting to certain destinations, which does not work:
> {code:java}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq("[1, 2, 3]").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> val schema = new ArrayType(IntegerType, false)
> schema: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,false)
> scala> df.select(from_json($"a", schema))
> org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`a`)' 
> due to data type mismatch: Input schema array must be a struct or an 
> array of structs.;;
> 'Project [jsontostructs(ArrayType(IntegerType,false), a#3, 
> Some(America/Los_Angeles)) AS jsontostructs(a)#10]
> scala> val arrayDf = Seq(Array(1, 2, 3)).toDF("arr")
> arrayDf: org.apache.spark.sql.DataFrame = [arr: array]
> scala> arrayDf.select(to_json($"arr"))
> org.apache.spark.sql.AnalysisException: cannot resolve 'structstojson(`arr`)' 
> due to data type mismatch: Input type array must be a struct, array of 
> structs or a map or array of map.;;
> 'Project [structstojson(arr#19, Some(America/Los_Angeles)) AS 
> structstojson(arr)#26]
> {code}
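
With the fix merged for 2.4.0, the calls above are expected to work; a minimal usage sketch (a SparkSession {{spark}} and its implicits are assumed, and the printed output is omitted):

{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

// Parse a JSON array of primitives directly, without wrapping it in a struct.
Seq("[1, 2, 3]").toDF("a")
  .select(from_json($"a", ArrayType(IntegerType)))
  .show()

// And serialize an array column straight back to a JSON string.
Seq(Array(1, 2, 3)).toDF("arr")
  .select(to_json($"arr"))
  .show()
{code}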



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25101) Creating leaderLatch with id for getting info about spark master nodes from zk

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25101:


Assignee: Apache Spark

> Creating leaderLatch with id for getting info about spark master nodes from zk
> --
>
> Key: SPARK-25101
> URL: https://issues.apache.org/jira/browse/SPARK-25101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuliya
>Assignee: Apache Spark
>Priority: Major
>
> Sometimes spark master nodes start faster than zookeeper => all my masters 
> are in STANDBY status. For solve this problem I check zk for getting info 
> about registering leaderLatch objects, but currently spark creates 
> leaderLatch without id (ZooKeeperLeaderElectionAgent:41) and I have no 
> possibilities getting info about registered leaderLatch for spark master node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24391) to_json/from_json should support arrays of primitives, and more generally all JSON

2018-08-13 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-24391:


Assignee: Maxim Gekk

> to_json/from_json should support arrays of primitives, and more generally all 
> JSON 
> ---
>
> Key: SPARK-24391
> URL: https://issues.apache.org/jira/browse/SPARK-24391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Sam Kitajima-Kimbrel
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> https://issues.apache.org/jira/browse/SPARK-19849 and 
> https://issues.apache.org/jira/browse/SPARK-21513 brought support for more 
> column types to functions.to_json/from_json, but I also have cases where I'd 
> like to simply (de)serialize an array of primitives to/from JSON when 
> outputting to certain destinations, which does not work:
> {code:java}
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq("[1, 2, 3]").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> val schema = new ArrayType(IntegerType, false)
> schema: org.apache.spark.sql.types.ArrayType = ArrayType(IntegerType,false)
> scala> df.select(from_json($"a", schema))
> org.apache.spark.sql.AnalysisException: cannot resolve 'jsontostructs(`a`)' 
> due to data type mismatch: Input schema array must be a struct or an 
> array of structs.;;
> 'Project [jsontostructs(ArrayType(IntegerType,false), a#3, 
> Some(America/Los_Angeles)) AS jsontostructs(a)#10]
> scala> val arrayDf = Seq(Array(1, 2, 3)).toDF("arr")
> arrayDf: org.apache.spark.sql.DataFrame = [arr: array]
> scala> arrayDf.select(to_json($"arr"))
> org.apache.spark.sql.AnalysisException: cannot resolve 'structstojson(`arr`)' 
> due to data type mismatch: Input type array must be a struct, array of 
> structs or a map or array of map.;;
> 'Project [structstojson(arr#19, Some(America/Los_Angeles)) AS 
> structstojson(arr)#26]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25101) Creating leaderLatch with id for getting info about spark master nodes from zk

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578186#comment-16578186
 ] 

Apache Spark commented on SPARK-25101:
--

User 'yuliyaZharnosek' has created a pull request for this issue:
https://github.com/apache/spark/pull/22092

> Creating leaderLatch with id for getting info about spark master nodes from zk
> --
>
> Key: SPARK-25101
> URL: https://issues.apache.org/jira/browse/SPARK-25101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuliya
>Priority: Major
>
> Sometimes spark master nodes start faster than zookeeper => all my masters 
> are in STANDBY status. For solve this problem I check zk for getting info 
> about registering leaderLatch objects, but currently spark creates 
> leaderLatch without id (ZooKeeperLeaderElectionAgent:41) and I have no 
> possibilities getting info about registered leaderLatch for spark master node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25101) Creating leaderLatch with id for getting info about spark master nodes from zk

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25101:


Assignee: (was: Apache Spark)

> Creating leaderLatch with id for getting info about spark master nodes from zk
> --
>
> Key: SPARK-25101
> URL: https://issues.apache.org/jira/browse/SPARK-25101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Yuliya
>Priority: Major
>
> Sometimes spark master nodes start faster than zookeeper => all my masters 
> are in STANDBY status. For solve this problem I check zk for getting info 
> about registering leaderLatch objects, but currently spark creates 
> leaderLatch without id (ZooKeeperLeaderElectionAgent:41) and I have no 
> possibilities getting info about registered leaderLatch for spark master node.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25099) Generate Avro Binary files in test suite

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25099.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22091
[https://github.com/apache/spark/pull/22091]

> Generate Avro Binary files in test suite
> 
>
> Key: SPARK-25099
> URL: https://issues.apache.org/jira/browse/SPARK-25099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> In PR [https://github.com/apache/spark/pull/21984] and 
> [https://github.com/apache/spark/pull/21935] , the related test cases are 
> using binary files created by Python scripts.
> Generate the binary files in test suite to make it more transparent. 
> Also move the related test cases to a new file AvroLogicalTypeSuite.scala.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25099) Generate Avro Binary files in test suite

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25099:
---

Assignee: Gengliang Wang

> Generate Avro Binary files in test suite
> 
>
> Key: SPARK-25099
> URL: https://issues.apache.org/jira/browse/SPARK-25099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> In PR [https://github.com/apache/spark/pull/21984] and 
> [https://github.com/apache/spark/pull/21935] , the related test cases are 
> using binary files created by Python scripts.
> Generate the binary files in test suite to make it more transparent. 
> Also move the related test cases to a new file AvroLogicalTypeSuite.scala.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22713:
---

Assignee: Eyal Farago

> OOM caused by the memory contention and memory leak in TaskMemoryManager
> 
>
> Key: SPARK-22713
> URL: https://issues.apache.org/jira/browse/SPARK-22713
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.1, 2.1.2
>Reporter: Lijie Xu
>Assignee: Eyal Farago
>Priority: Critical
> Fix For: 2.4.0
>
>
> The pdf version of this issue with high-quality figures is available at 
> https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/report/OOM-TaskMemoryManager.pdf.
> *[Abstract]* 
> I recently encountered an OOM error in a PageRank application 
> (_org.apache.spark.examples.SparkPageRank_). After profiling the application, 
> I found the OOM error is related to the memory contention in shuffle spill 
> phase. Here, the memory contention means that a task tries to release some 
> old memory consumers from memory for keeping the new memory consumers. After 
> analyzing the OOM heap dump, I found the root cause is a memory leak in 
> _TaskMemoryManager_. Since memory contention is common in shuffle phase, this 
> is a critical bug/defect. In the following sections, I will use the 
> application dataflow, execution log, heap dump, and source code to identify 
> the root cause.
> *[Application]* 
> This is a PageRank application from Spark’s example library. The following 
> figure shows the application dataflow. The source code is available at \[1\].
> !https://raw.githubusercontent.com/JerryLead/Misc/master/OOM-TasksMemoryManager/figures/PageRankDataflow.png|width=100%!
> *[Failure symptoms]*
> This application has a map stage and many iterative reduce stages. An OOM 
> error occurs in a reduce task (Task-28) as follows.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/Stage.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/task.png?raw=true|width=100%!
>  
> *[OOM root cause identification]*
> Each executor has 1 CPU core and 6.5GB memory, so it only runs one task at a 
> time. After analyzing the application dataflow, error log, heap dump, and 
> source code, I found the following steps lead to the OOM error. 
> => The MemoryManager found that there is not enough memory to cache the 
> _links:ShuffledRDD_ (rdd-5-28, red circles in the dataflow figure).
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/ShuffledRDD.png?raw=true|width=100%!
> => The task needs to shuffle twice (1st shuffle and 2nd shuffle in the 
> dataflow figure).
> => The task needs to generate two _ExternalAppendOnlyMap_ (E1 for 1st shuffle 
> and E2 for 2nd shuffle) in sequence.
> => The 1st shuffle begins and ends. E1 aggregates all the shuffled data of 
> 1st shuffle and achieves 3.3 GB.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/FirstShuffle.png?raw=true|width=100%!
> => The 2nd shuffle begins. E2 is aggregating the shuffled data of 2nd 
> shuffle, and finding that there is not enough memory left. This triggers the 
> memory contention.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/SecondShuffle.png?raw=true|width=100%!
> => To handle the memory contention, the _TaskMemoryManager_ releases E1 
> (spills it onto disk) and assumes that the 3.3GB space is free now.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/MemoryContention.png?raw=true|width=100%!
> => E2 continues to aggregates the shuffled records of 2nd shuffle. However, 
> E2 encounters an OOM error while shuffling.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMbefore.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMError.png?raw=true|width=100%!
> *[Guess]* 
> The task memory usage below reveals that there is not memory drop down. So, 
> the cause may be that the 3.3GB _ExternalAppendOnlyMap_ (E1) is not actually 
> released by the _TaskMemoryManger_. 
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/GCFigure.png?raw=true|width=100%!
> *[Root cause]* 
> After analyzing the heap dump, I found the guess is right (the 3.3GB 
> _ExternalAppendOnlyMap_ is actually not released). The 1.6GB object is 
> _ExternalAppendOnlyMap (E2)_.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/heapdump.png?raw=true|width=100%!
> *[Question]* 
> Why the released _ExternalAppendOnlyMap_ is still in memory?
> The source code of _ExternalAppendOnlyMap_ shows that the _curre

[jira] [Resolved] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22713.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21369
[https://github.com/apache/spark/pull/21369]

> OOM caused by the memory contention and memory leak in TaskMemoryManager
> 
>
> Key: SPARK-22713
> URL: https://issues.apache.org/jira/browse/SPARK-22713
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.1, 2.1.2
>Reporter: Lijie Xu
>Assignee: Eyal Farago
>Priority: Critical
> Fix For: 2.4.0
>
>
> The pdf version of this issue with high-quality figures is available at 
> https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/report/OOM-TaskMemoryManager.pdf.
> *[Abstract]* 
> I recently encountered an OOM error in a PageRank application 
> (_org.apache.spark.examples.SparkPageRank_). After profiling the application, 
> I found the OOM error is related to the memory contention in shuffle spill 
> phase. Here, the memory contention means that a task tries to release some 
> old memory consumers from memory for keeping the new memory consumers. After 
> analyzing the OOM heap dump, I found the root cause is a memory leak in 
> _TaskMemoryManager_. Since memory contention is common in shuffle phase, this 
> is a critical bug/defect. In the following sections, I will use the 
> application dataflow, execution log, heap dump, and source code to identify 
> the root cause.
> *[Application]* 
> This is a PageRank application from Spark’s example library. The following 
> figure shows the application dataflow. The source code is available at \[1\].
> !https://raw.githubusercontent.com/JerryLead/Misc/master/OOM-TasksMemoryManager/figures/PageRankDataflow.png|width=100%!
> *[Failure symptoms]*
> This application has a map stage and many iterative reduce stages. An OOM 
> error occurs in a reduce task (Task-28) as follows.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/Stage.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/task.png?raw=true|width=100%!
>  
> *[OOM root cause identification]*
> Each executor has 1 CPU core and 6.5GB memory, so it only runs one task at a 
> time. After analyzing the application dataflow, error log, heap dump, and 
> source code, I found the following steps lead to the OOM error. 
> => The MemoryManager found that there is not enough memory to cache the 
> _links:ShuffledRDD_ (rdd-5-28, red circles in the dataflow figure).
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/ShuffledRDD.png?raw=true|width=100%!
> => The task needs to shuffle twice (1st shuffle and 2nd shuffle in the 
> dataflow figure).
> => The task needs to generate two _ExternalAppendOnlyMap_ (E1 for 1st shuffle 
> and E2 for 2nd shuffle) in sequence.
> => The 1st shuffle begins and ends. E1 aggregates all the shuffled data of 
> 1st shuffle and achieves 3.3 GB.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/FirstShuffle.png?raw=true|width=100%!
> => The 2nd shuffle begins. E2 is aggregating the shuffled data of 2nd 
> shuffle, and finding that there is not enough memory left. This triggers the 
> memory contention.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/SecondShuffle.png?raw=true|width=100%!
> => To handle the memory contention, the _TaskMemoryManager_ releases E1 
> (spills it onto disk) and assumes that the 3.3GB space is free now.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/MemoryContention.png?raw=true|width=100%!
> => E2 continues to aggregates the shuffled records of 2nd shuffle. However, 
> E2 encounters an OOM error while shuffling.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMbefore.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMError.png?raw=true|width=100%!
> *[Guess]* 
> The task memory usage below reveals that there is not memory drop down. So, 
> the cause may be that the 3.3GB _ExternalAppendOnlyMap_ (E1) is not actually 
> released by the _TaskMemoryManger_. 
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/GCFigure.png?raw=true|width=100%!
> *[Root cause]* 
> After analyzing the heap dump, I found the guess is right (the 3.3GB 
> _ExternalAppendOnlyMap_ is actually not released). The 1.6GB object is 
> _ExternalAppendOnlyMap (E2)_.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/heapdump.png?raw=true|width=100%!
> *[Question]* 
> Why the released _Externa

[jira] [Comment Edited] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578127#comment-16578127
 ] 

Izek Greenfield edited comment on SPARK-25094 at 8/13/18 1:03 PM:
--

The code that creates this plan is very complex. 
I will try to reproduce it with simpler code; in the meanwhile I have attached the 
generated code, so you can see that the problem is that the generated code does not 
split the work into functions and instead inlines the whole plan into the processNext() method. 
[^generated_code.txt]  

It contains 2 DataFrames with 80 columns, 10 of them built from `case when` 
expressions like this one:
CASE WHEN (`predefined_hc` IS NOT NULL) THEN '/Predefined_hc/' WHEN 
(`zero_volatility_adj_ind` = 'Y') THEN '/Zero_Haircuts_cases/' WHEN 
(`collateral_allocation_method` = 'FCSM') THEN '/FCSM_Collaterals/' WHEN 
`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND 
((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 4 AND (`residual_maturity_instrument` <= 12.0D)) 
THEN '/Debt/Central_Government_Issuer/Eligible/res_mat_1Y/' WHEN 
`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND 
((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 4 AND (`residual_maturity_instrument` <= 60.0D)) 
THEN '/Debt/Central_Government_Issuer/Eligible/res_mat_5Y/' WHEN 
(((`underlying_type` = 'DEBT') AND (`issuer_type` = 'CGVT')) AND 
((`instrument_cqs_st` <= 4) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 4 THEN 
'/Debt/Central_Government_Issuer/Eligible/res_mat_G5/' WHEN ((`underlying_type` 
= 'DEBT') AND (`issuer_type` = 'CGVT')) THEN 
'/Debt/Central_Government_Issuer/Non_Eligible/' WHEN `underlying_type` = 
'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) 
AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 12.0D)) 
THEN '/Debt/Other_Issuers/Eligible/res_mat_1Y/' WHEN `underlying_type` = 
'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) 
AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 60.0D)) 
THEN '/Debt/Other_Issuers/Eligible/res_mat_5Y/' WHEN (((`underlying_type` = 
'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 'PSE', 'RGLA', 'IO_LISTED'))) 
AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) AND 
(`instrument_cqs_lt` <= 3 THEN '/Debt/Other_Issuers/Eligible/res_mat_G5/' 
WHEN ((`underlying_type` = 'DEBT') AND (`issuer_type` IN ('INST', 'CORP', 
'PSE', 'RGLA', 'IO_LISTED'))) THEN '/Debt/Other_Issuers/Non_Eligible/' WHEN 
(((`underlying_type` = 'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR 
((`instrument_cqs_st` = 7) AND (`instrument_cqs_lt` <= 3 AND 
(`residual_maturity_instrument` <= 12.0D)) THEN 
'/Securitisation/Eligible/res_mat_1Y/' WHEN (((`underlying_type` = 
'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) 
AND (`instrument_cqs_lt` <= 3 AND (`residual_maturity_instrument` <= 
60.0D)) THEN '/Securitisation/Eligible/res_mat_5Y/' WHEN ((`underlying_type` = 
'SECURITISATION') AND ((`instrument_cqs_st` <= 3) OR ((`instrument_cqs_st` = 7) 
AND (`instrument_cqs_lt` <= 3 THEN '/Securitisation/Eligible/res_mat_G5/' 
WHEN (`underlying_type` = 'SECURITISATION') THEN 
'/Securitisation/Non_Eligible/' WHEN ((`underlying_type` IN ('EQUITY', 
'MAIN_INDEX_EQUITY', 'COMMODITY', 'NON_ELIGIBLE_SECURITY')) AND 
(`underlying_type` = 'MAIN_INDEX_EQUITY')) THEN '/Other_Securities/Main_index/' 
WHEN (`underlying_type` IN ('EQUITY', 'MAIN_INDEX_EQUITY', 'COMMODITY', 
'NON_ELIGIBLE_SECURITY')) THEN '/Other_Securities/Others/' WHEN 
(`underlying_type` = 'CASH') THEN '/Cash/' WHEN (`underlying_type` = 'GOLD') 
THEN '/Gold/' WHEN (`underlying_type` = 'CIU') THEN '/CIU/' WHEN true THEN 
'/Others/' END AS 
`108_0___Portfolio_CRD4_Art_224_Volatility_Adjustments_Codespath_CRD4_Art_224_Volatil`
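
For anyone trying to approximate the shape of this plan without the original job, a rough, purely illustrative sketch along these lines (a SparkSession {{spark}} is assumed; it builds many multi-branch CASE WHEN columns over a join and may or may not push the generated processNext() past 64 KB):

{code:scala}
import org.apache.spark.sql.functions._

val base = spark.range(1000).toDF("id")

// Add many string columns, each built from a multi-branch CASE WHEN.
val wide = (1 to 80).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i",
    when(col("id") % 7 === 0, lit(s"/bucket_$i/a/"))
      .when(col("id") % 5 === 0, lit(s"/bucket_$i/b/"))
      .when(col("id") % 3 === 0, lit(s"/bucket_$i/c/"))
      .otherwise(lit(s"/bucket_$i/other/")))
}

// Put the wide projection on top of a join, as in the reported plan.
wide.join(wide.withColumnRenamed("id", "id2"), col("id") === col("id2")).count()
{code}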


was (Author: igreenfi):
the code that creates this plan is very complex. 
I will try to reproduce it in simple code in the meanwhile I can attach the 
generated code so you can see the problem is that the code does not create 
functions and inline all the Plan into the processNext method. 
[^generated_code.txt]  

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCo

[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578275#comment-16578275
 ] 

Izek Greenfield commented on SPARK-25094:
-

Looking at the code, the problem is here:

def splitExpressionsWithCurrentInputs(
    expressions: Seq[String],
    funcName: String = "apply",
    extraArguments: Seq[(String, String)] = Nil,
    returnType: String = "void",
    makeSplitFunction: String => String = identity,
    foldFunctions: Seq[String] => String = _.mkString("", ";\n", ";")): String = {
  // TODO: support whole stage codegen
  if (INPUT_ROW == null || currentVars != null) {
    expressions.mkString("\n")
  } else {
    splitExpressions(
      expressions,
      funcName,
      ("InternalRow", INPUT_ROW) +: extraArguments,
      returnType,
      makeSplitFunction,
      foldFunctions)
  }
}

Note the TODO: when whole-stage codegen is active ({{currentVars != null}}), the expressions 
are simply concatenated instead of being split into separate methods, so nothing prevents 
processNext() from growing past 64 KB.

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25102) Write Spark version information to Parquet file footers

2018-08-13 Thread Zoltan Ivanfi (JIRA)
Zoltan Ivanfi created SPARK-25102:
-

 Summary: Write Spark version information to Parquet file footers
 Key: SPARK-25102
 URL: https://issues.apache.org/jira/browse/SPARK-25102
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Zoltan Ivanfi


-PARQUET-352- added support for the "writer.model.name" property in the Parquet 
metadata to identify the object model (application) that wrote the file.

The easiest way to write this property is by overriding getName() of 
org.apache.parquet.hadoop.api.WriteSupport. In Spark, this would mean adding 
getName() to the 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport class.
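
A minimal, self-contained sketch of that hook (a toy WriteSupport over plain strings, not Spark's ParquetWriteSupport; the schema and the "spark" value are illustrative):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.hadoop.api.WriteSupport
import org.apache.parquet.hadoop.api.WriteSupport.WriteContext
import org.apache.parquet.io.api.{Binary, RecordConsumer}
import org.apache.parquet.schema.MessageTypeParser

class NamedWriteSupport extends WriteSupport[String] {
  private var consumer: RecordConsumer = _
  private val schema = MessageTypeParser.parseMessageType(
    "message example { required binary value (UTF8); }")

  // The hook added by PARQUET-352: parquet-mr records this value in the file
  // footer under "writer.model.name".
  override def getName: String = "spark"

  override def init(conf: Configuration): WriteContext =
    new WriteContext(schema, new java.util.HashMap[String, String]())

  override def prepareForWrite(rc: RecordConsumer): Unit = consumer = rc

  override def write(record: String): Unit = {
    consumer.startMessage()
    consumer.startField("value", 0)
    consumer.addBinary(Binary.fromString(record))
    consumer.endField("value", 0)
    consumer.endMessage()
  }
}
{code}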



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager

2018-08-13 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578322#comment-16578322
 ] 

Eyal Farago commented on SPARK-22713:
-

[~jerrylead], can you please test this?

> OOM caused by the memory contention and memory leak in TaskMemoryManager
> 
>
> Key: SPARK-22713
> URL: https://issues.apache.org/jira/browse/SPARK-22713
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.1, 2.1.2
>Reporter: Lijie Xu
>Assignee: Eyal Farago
>Priority: Critical
> Fix For: 2.4.0
>
>
> The pdf version of this issue with high-quality figures is available at 
> https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/report/OOM-TaskMemoryManager.pdf.
> *[Abstract]* 
> I recently encountered an OOM error in a PageRank application 
> (_org.apache.spark.examples.SparkPageRank_). After profiling the application, 
> I found the OOM error is related to the memory contention in shuffle spill 
> phase. Here, the memory contention means that a task tries to release some 
> old memory consumers from memory for keeping the new memory consumers. After 
> analyzing the OOM heap dump, I found the root cause is a memory leak in 
> _TaskMemoryManager_. Since memory contention is common in shuffle phase, this 
> is a critical bug/defect. In the following sections, I will use the 
> application dataflow, execution log, heap dump, and source code to identify 
> the root cause.
> *[Application]* 
> This is a PageRank application from Spark’s example library. The following 
> figure shows the application dataflow. The source code is available at \[1\].
> !https://raw.githubusercontent.com/JerryLead/Misc/master/OOM-TasksMemoryManager/figures/PageRankDataflow.png|width=100%!
> *[Failure symptoms]*
> This application has a map stage and many iterative reduce stages. An OOM 
> error occurs in a reduce task (Task-28) as follows.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/Stage.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/task.png?raw=true|width=100%!
>  
> *[OOM root cause identification]*
> Each executor has 1 CPU core and 6.5GB memory, so it only runs one task at a 
> time. After analyzing the application dataflow, error log, heap dump, and 
> source code, I found the following steps lead to the OOM error. 
> => The MemoryManager found that there is not enough memory to cache the 
> _links:ShuffledRDD_ (rdd-5-28, red circles in the dataflow figure).
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/ShuffledRDD.png?raw=true|width=100%!
> => The task needs to shuffle twice (1st shuffle and 2nd shuffle in the 
> dataflow figure).
> => The task needs to generate two _ExternalAppendOnlyMap_ (E1 for 1st shuffle 
> and E2 for 2nd shuffle) in sequence.
> => The 1st shuffle begins and ends. E1 aggregates all the shuffled data of 
> 1st shuffle and achieves 3.3 GB.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/FirstShuffle.png?raw=true|width=100%!
> => The 2nd shuffle begins. E2 is aggregating the shuffled data of 2nd 
> shuffle, and finding that there is not enough memory left. This triggers the 
> memory contention.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/SecondShuffle.png?raw=true|width=100%!
> => To handle the memory contention, the _TaskMemoryManager_ releases E1 
> (spills it onto disk) and assumes that the 3.3GB space is free now.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/MemoryContention.png?raw=true|width=100%!
> => E2 continues to aggregates the shuffled records of 2nd shuffle. However, 
> E2 encounters an OOM error while shuffling.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMbefore.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMError.png?raw=true|width=100%!
> *[Guess]* 
> The task memory usage below reveals that there is not memory drop down. So, 
> the cause may be that the 3.3GB _ExternalAppendOnlyMap_ (E1) is not actually 
> released by the _TaskMemoryManger_. 
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/GCFigure.png?raw=true|width=100%!
> *[Root cause]* 
> After analyzing the heap dump, I found the guess is right (the 3.3GB 
> _ExternalAppendOnlyMap_ is actually not released). The 1.6GB object is 
> _ExternalAppendOnlyMap (E2)_.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/heapdump.png?raw=true|width=100%!
> *[Question]* 
> Why the released _ExternalAppendOnlyMap_ is still in memory?
>

[jira] [Commented] (SPARK-25094) proccesNext() failed to compile size is over 64kb

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578348#comment-16578348
 ] 

Marco Gaido commented on SPARK-25094:
-

[~igreenfi] as I mentioned to you, this is a known issue. You found a TODO because 
it is currently not possible to implement it. There is an ongoing effort to make 
that happen, but it is a huge one, so it will take time. Thanks.
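
In the meantime the practical mitigations are configuration-level; a hedged sketch (both keys exist in Spark 2.3+, a SparkSession {{spark}} is assumed, and the values are only examples):

{code:scala}
// Not a fix: Spark already falls back to non-codegen execution when the
// compiled method is too large (the WARN in the description), but the fallback
// can be forced up front to skip the failed compile attempt.
spark.conf.set("spark.sql.codegen.wholeStage", "false")

// Or lower the bytecode-size threshold at which whole-stage codegen is
// abandoned for a plan subtree (default 65535).
spark.conf.set("spark.sql.codegen.hugeMethodLimit", "8000")
{code}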

> proccesNext() failed to compile size is over 64kb
> -
>
> Key: SPARK-25094
> URL: https://issues.apache.org/jira/browse/SPARK-25094
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Izek Greenfield
>Priority: Major
> Attachments: generated_code.txt
>
>
> I have this tree:
> 2018-08-12T07:14:31,289 WARN  [] 
> org.apache.spark.sql.execution.WholeStageCodegenExec - Whole-stage codegen 
> disabled for plan (id=1):
>  *(1) Project [, ... 10 more fields]
> +- *(1) Filter NOT exposure_calc_method#10141 IN 
> (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)
>+- InMemoryTableScan [, ... 11 more fields], [NOT 
> exposure_calc_method#10141 IN (UNSETTLED_TRANSACTIONS,FREE_DELIVERIES)]
>  +- InMemoryRelation [, ... 80 more fields], StorageLevel(memory, 
> deserialized, 1 replicas)
>+- *(5) SortMergeJoin [unique_id#8506], [unique_id#8722], Inner
>   :- *(2) Sort [unique_id#8506 ASC NULLS FIRST], false, 0
>   :  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
>   : +- *(1) Project [, ... 6 more fields]
>   :+- *(1) Filter (isnotnull(v#49) && 
> isnotnull(run_id#52)) && (asof_date#48 <=> 17531)) && (run_id#52 = DATA_REG)) 
> && (v#49 = DATA_REG)) && isnotnull(unique_id#39))
>   :   +- InMemoryTableScan [, ... 6 more fields], [, 
> ... 6 more fields]
>   : +- InMemoryRelation [, ... 6 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   :   +- *(1) FileScan csv [,... 6 more 
> fields] , ... 6 more fields
>   +- *(4) Sort [unique_id#8722 ASC NULLS FIRST], false, 0
>  +- Exchange(coordinator id: 1456511137) 
> UnknownPartitioning(9), coordinator[target post-shuffle partition size: 
> 67108864]
> +- *(3) Project [, ... 74 more fields]
>+- *(3) Filter (((isnotnull(v#51) && (asof_date#42 
> <=> 17531)) && (v#51 = DATA_REG)) && isnotnull(unique_id#54))
>   +- InMemoryTableScan [, ... 74 more fields], [, 
> ... 4 more fields]
> +- InMemoryRelation [, ... 74 more 
> fields], StorageLevel(memory, deserialized, 1 replicas)
>   +- *(1) FileScan csv [,... 74 more 
> fields] , ... 6 more fields
> Compiling "GeneratedClass": Code of method "processNext()V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1"
>  grows beyond 64 KB
> and the generated code failed to compile.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25100) Using KryoSerializer and setting registrationRequired true can lead job failed

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25100:


Assignee: Apache Spark

> Using KryoSerializer and setting registrationRequired true can lead job failed
> --
>
> Key: SPARK-25100
> URL: https://issues.apache.org/jira/browse/SPARK-25100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Assignee: Apache Spark
>Priority: Major
>
> When spark.serializer is org.apache.spark.serializer.KryoSerializer and  
> spark.kryo.registrationRequired is true in SparkCOnf. I invoked  
> saveAsNewAPIHadoopDataset to store data in hdfs. The job will fail because 
> the class TaskCommitMessage hasn't be registered.
>  
> {code:java}
> java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
> Note: To register this class use: 
> kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:488)
> at com.twitter.chill.KryoBase.getRegistration(KryoBase.scala:52)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:97)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:347)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  
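
A hedged workaround sketch for the failure quoted above (not the eventual fix): register the class named in the error explicitly when building the SparkConf, keeping the rest of the job unchanged:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  // Class name taken from the error message; Class.forName avoids compiling
  // against the private FileCommitProtocol API.
  .registerKryoClasses(Array(
    Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage")))
{code}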



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25100) Using KryoSerializer and setting registrationRequired true can lead job failed

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578376#comment-16578376
 ] 

Apache Spark commented on SPARK-25100:
--

User 'deshanxiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/22093

> Using KryoSerializer and setting registrationRequired true can lead job failed
> --
>
> Key: SPARK-25100
> URL: https://issues.apache.org/jira/browse/SPARK-25100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Priority: Major
>
> When spark.serializer is org.apache.spark.serializer.KryoSerializer and  
> spark.kryo.registrationRequired is true in SparkCOnf. I invoked  
> saveAsNewAPIHadoopDataset to store data in hdfs. The job will fail because 
> the class TaskCommitMessage hasn't be registered.
>  
> {code:java}
> java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
> Note: To register this class use: 
> kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:488)
> at com.twitter.chill.KryoBase.getRegistration(KryoBase.scala:52)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:97)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:347)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25100) Using KryoSerializer and setting registrationRequired true can lead job failed

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25100:


Assignee: (was: Apache Spark)

> Using KryoSerializer and setting registrationRequired true can lead job failed
> --
>
> Key: SPARK-25100
> URL: https://issues.apache.org/jira/browse/SPARK-25100
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: deshanxiao
>Priority: Major
>
> When spark.serializer is org.apache.spark.serializer.KryoSerializer and  
> spark.kryo.registrationRequired is true in SparkCOnf. I invoked  
> saveAsNewAPIHadoopDataset to store data in hdfs. The job will fail because 
> the class TaskCommitMessage hasn't be registered.
>  
> {code:java}
> java.lang.IllegalArgumentException: Class is not registered: 
> org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage
> Note: To register this class use: 
> kryo.register(org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage.class);
> at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:488)
> at com.twitter.chill.KryoBase.getRegistration(KryoBase.scala:52)
> at 
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:97)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:517)
> at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:622)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:347)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24954) Fail fast on job submit if run a barrier stage with dynamic resource allocation enabled

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578419#comment-16578419
 ] 

Imran Rashid commented on SPARK-24954:
--

I was thinking more about this, and realized you'll also run into issues with 
static allocation on a multi-tenant cluster with preemption.  You might keep 
gaining resources and then having them taken away from you.

I'm not sure there is anything we can really do about that, other than telling 
users not to run with preemption, though it's hard to tell whether they have 
things configured in a way that they'll never get preempted.

> Fail fast on job submit if run a barrier stage with dynamic resource 
> allocation enabled
> ---
>
> Key: SPARK-24954
> URL: https://issues.apache.org/jira/browse/SPARK-24954
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Since we explicitly listed "Support running barrier stage with dynamic 
> resource allocation" a Non-Goal in the design doc, we shall fail fast on job 
> submit if running a barrier stage with dynamic resource allocation enabled, 
> to avoid some confusing behaviors (can refer to SPARK-24942 for some 
> examples).
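
For reference, a minimal sketch of launching a barrier stage with dynamic allocation explicitly disabled, which is the only configuration this issue intends to allow (app name and job body are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("barrier-example")
  .config("spark.dynamicAllocation.enabled", "false")  // required for barrier stages
  .getOrCreate()

spark.sparkContext
  .parallelize(1 to 100, numSlices = 4)
  .barrier()
  .mapPartitions(iter => iter)  // all 4 barrier tasks are launched together or not at all
  .count()
{code}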



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24941) Add RDDBarrier.coalesce() function

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578430#comment-16578430
 ] 

Imran Rashid commented on SPARK-24941:
--

I don't think the answer here should be having the user specify 
{{numPartitions}}; that seems extremely hard for a user to get right given the 
way the rest of Spark works.

It's hard enough to choose the number of partitions normally with Spark, but 
users are currently trained to tune to the amount of data, and to choose a *large* 
number.  Now, though, that number needs to be tightly linked to the compute 
resources available at *runtime*.  E.g. you might run your job with 50 
executors one time, and so you choose 50 tasks.  Then you run the job again a 
little later, and now you only get 49 executors (cluster busy, maintenance, 
node failure, budget constraints, etc.), and your job doesn't get anywhere.

I'd rather the user had some way to let the repartitioning happen automatically 
based on the available resources.
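
For comparison, a rough sketch of what users have to do today: pick a partition count up front and shrink to it before entering the barrier stage (the count and path are illustrative, and as argued above the number only works if it still matches the slots available at runtime):

{code:scala}
// Illustrative only: numSlots must not exceed the task slots actually available
// when the barrier stage runs, otherwise the stage cannot be scheduled.
val numSlots = 50
sc.textFile("hdfs://host:port/input")
  .coalesce(numSlots)
  .barrier()
  .mapPartitions(iter => iter)
  .count()
{code}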

> Add RDDBarrier.coalesce() function
> --
>
> Key: SPARK-24941
> URL: https://issues.apache.org/jira/browse/SPARK-24941
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Jiang Xingbo
>Priority: Major
>
> https://github.com/apache/spark/pull/21758#discussion_r204917245
> The number of partitions from the input data can be unexpectedly large, eg. 
> if you do
> {code}
> sc.textFile(...).barrier().mapPartitions()
> {code}
> The number of input partitions is based on the hdfs input splits. We shall 
> provide a way in RDDBarrier to enable users to specify the number of tasks in 
> a barrier stage. Maybe something like RDDBarrier.coalesce(numPartitions: Int) 
> .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24928) spark sql cross join running time too long

2018-08-13 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-24928.
-
Resolution: Duplicate

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578447#comment-16578447
 ] 

Marco Gaido commented on SPARK-24928:
-

Actually this is a duplicate of SPARK-11982, which solved the issue for the SQL 
API. For the RDD API, please be careful choosing the right side of the 
cartesian. I am closing this as a duplicate. Feel free to reopen if you think 
anything else can be done.
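
For reference, a sketch of the RDD-side point using the public API, with the operand order the reporter found fast (a SparkContext {{sc}} and the reporter's paths are assumed; the timings are the reporter's, not re-verified here):

{code:scala}
val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  // 1 row, 1 column
val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  // 49 rows, 3 columns

// RDD.cartesian builds the same CartesianRDD as in the report; per the report,
// this operand order finished in seconds while the reverse ran for minutes.
ones.cartesian(twos).count()
{code}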

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 mins,while use CartesianRDD(sc, ones, twos) , it only use 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24928) spark sql cross join running time too long

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578446#comment-16578446
 ] 

Apache Spark commented on SPARK-24928:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/22008

> spark sql cross join running time too long
> --
>
> Key: SPARK-24928
> URL: https://issues.apache.org/jira/browse/SPARK-24928
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 1.6.2
>Reporter: LIFULONG
>Priority: Minor
>
> spark sql running time is too long while input left table and right table is 
> small hdfs text format data,
> the sql is:  select * from t1 cross join t2  
> the line of t1 is 49, three column
> the line of t2 is 1, one column only
> running more than 30mins and then failed
>  
>  
> spark CartesianRDD also has the same problem, example test code is:
> val ones = sc.textFile("hdfs://host:port/data/cartesian_data/t1b")  //1 line 
> 1 column
>  val twos = sc.textFile("hdfs://host:port/data/cartesian_data/t2b")  //49 
> line 3 column
>  val cartesian = new CartesianRDD(sc, twos, ones)
> cartesian.count()
> running more than 5 minutes, while using CartesianRDD(sc, ones, twos) takes 
> less than 10 seconds



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-08-13 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578489#comment-16578489
 ] 

Eyal Farago commented on SPARK-24410:
-

[~viirya], I think your conclusion about co-partitioning is wrong. The 
following code segment from your comment:
{code:java}
val df1 = spark.table("a1").select(spark_partition_id(), $"key")
val df2 = spark.table("a2").select(spark_partition_id(), $"key")
df1.union(df2).select(spark_partition_id(), $"key").collect
{code}
prints the partition ids as assigned by the union. Assuming union simply 
concatenates the partitions from df1 and df2 and assigns them a running-number 
id, it makes sense that you'd get two partitions per key: one coming from df1 
and the other from df2.

Applying this select on each dataframe separately, you'd get the exact same 
results, meaning a given key has the same partition id in both dataframes.

I think this code fragment basically shows what's wrong with the current 
implementation of Union, not that we can't optimize unions of co-partitioned 
relations.

If Union were a bit more 'partitioning aware', it would be able to identify 
that both children have the same partitioning scheme and 'inherit' it. As you 
actually showed, this might be a bit tricky because the same logical attribute 
from different children has a different expression id, but Union eventually 
maps these child attributes into a single output attribute, so this 
information can be used to resolve the partitioning columns and determine 
their equality.

Furthermore, Union being smarter about its output partitioning won't cut it on 
its own; a few rules have to be added/modified:

1. applying an exchange on a union should sometimes be pushed to the children 
(the children can be partitioned into those supporting the required 
partitioning and those that don't; the exchange can be applied to a union of 
the non-supporting children, which is then unioned with the rest of the 
children)
 2. partial aggregation also has to be pushed to the children, resulting in a 
union of partial aggregations; again it's possible to partition the children 
according to their support of the required partitioning.
 3. final aggregation over a union introduces an exchange which will then be 
pushed to the children; the aggregation is then applied on top of the 
partitioning-aware union (think of the way PartitionerAwareUnionRDD handles 
partitioning).
 * partition children = partitioning an array by a predicate 
(scala.collection.TraversableLike#partition)
 * other operators like join may require additional rules.
 * some of these ideas were discussed offline with [~hvanhovell]
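A PySpark restatement of the partition-id check discussed above; this is only a sketch, assuming the bucketed tables a1/a2 from the issue description below and an active SparkSession `spark`:

{code:lang=python}
from pyspark.sql.functions import spark_partition_id, col

# Record, per key, which physical partition each side reads from
# before any union renumbers the partitions.
df1 = spark.table("a1").select(spark_partition_id().alias("pid"), col("key"))
df2 = spark.table("a2").select(spark_partition_id().alias("pid"), col("key"))

# If the buckets are really co-partitioned, every key should get the same
# pid on both sides, even though union() itself assigns fresh partition ids.
mismatches = (df1.alias("l").join(df2.alias("r"), "key")
              .where(col("l.pid") != col("r.pid")))
print(mismatches.count())  # 0 supports the co-partitioning argument
{code}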

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is of a partially aggregated table and daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> we tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator).
> for example, for two bucketed tables a1,a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> for the join query we get the "SortMergeJoin"
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: 

[jira] [Created] (SPARK-25103) CompletionIterator may delay GC of completed resources

2018-08-13 Thread Eyal Farago (JIRA)
Eyal Farago created SPARK-25103:
---

 Summary: CompletionIterator may delay GC of completed resources
 Key: SPARK-25103
 URL: https://issues.apache.org/jira/browse/SPARK-25103
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.1
Reporter: Eyal Farago


While working on SPARK-22713, I found (and partially fixed) a scenario in 
which an iterator is already exhausted but still holds a reference to some 
resources that could be GCed at this point.

However, these resources can not be GCed because of this reference.

The specific fix applied in SPARK-22713 was to wrap the iterator with a 
CompletionIterator that cleans it up when exhausted. The thing is that it's 
quite easy to get this wrong by closing over local variables or the _this_ 
reference in the cleanup function itself.

I propose solving this by modifying CompletionIterator to discard its 
references to the wrapped iterator and the cleanup function once exhausted.

 
 * a dive into the code showed that most CompletionIterators are eventually 
used by 
{code:java}
org.apache.spark.scheduler.ShuffleMapTask#runTask{code}
which does:

{code:java}
writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: 
Product2[Any, Any]]]){code}

looking at 
{code:java}
org.apache.spark.shuffle.ShuffleWriter#write{code}
implementations, it seems all of them first exhaust the iterator and then 
perform some kind of post-processing, e.g. merging spills, sorting, writing 
partition files and then concatenating them into a single file... bottom line, 
the iterator may actually be 'sitting' for some time after being exhausted.
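A conceptual sketch of the proposed change, written here in Python rather than Spark's Scala internals, just to illustrate the idea of dropping the references once the wrapped iterator is exhausted:

{code:lang=python}
class CompletionIterator:
    """Illustrative only: wraps an iterator, runs a completion callback
    exactly once, and then drops its references so that both the wrapped
    iterator and the callback's closure become eligible for GC."""

    def __init__(self, it, on_complete):
        self._it = it
        self._on_complete = on_complete

    def __iter__(self):
        return self

    def __next__(self):
        if self._it is None:
            raise StopIteration
        try:
            return next(self._it)
        except StopIteration:
            callback = self._on_complete
            self._it = None            # discard wrapped iterator
            self._on_complete = None   # discard cleanup closure
            if callback is not None:
                callback()
            raise
{code}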

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25103) CompletionIterator may delay GC of completed resources

2018-08-13 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578519#comment-16578519
 ] 

Eyal Farago commented on SPARK-25103:
-

CC: [~cloud_fan], [~hvanhovell]

> CompletionIterator may delay GC of completed resources
> --
>
> Key: SPARK-25103
> URL: https://issues.apache.org/jira/browse/SPARK-25103
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1, 2.1.0, 2.2.0, 2.3.0
>Reporter: Eyal Farago
>Priority: Major
>
> while working on SPARK-22713, I found (and partially fixed) a scenario in 
> which an iterator is already exhausted but still holds a reference to some 
> resources that can be GCed at this point.
> However, these resources can not be GCed because of this reference.
> the specific fix applied in SPARK-22713 was to wrap the iterator with a 
> CompletionIterator that cleans it up when exhausted; the thing is that it's quite 
> easy to get this wrong by closing over local variables or _this_ reference in 
> the cleanup function itself.
> I propose solving this by modifying CompletionIterator to discard references 
> to the wrapped iterator and cleanup function once exhausted.
>  
>  * a dive into the code showed that most CompletionIterators are eventually 
> used by 
> {code:java}
> org.apache.spark.scheduler.ShuffleMapTask#runTask{code}
> which does:
> {code:java}
> writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: 
> Product2[Any, Any]]]){code}
> looking at 
> {code:java}
> org.apache.spark.shuffle.ShuffleWriter#write{code}
> implementations, it seems all of them first exhaust the iterator and then 
> perform some kind of post-processing: i.e. merging spills, sorting, writing 
> partitions files and then concatenating them into a single file... bottom 
> line the Iterator may actually be 'sitting' for some time after being 
> exhausted.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support SPARK-23555
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported in as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported in as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support SPARK-23555
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported in as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578549#comment-16578549
 ] 

Ryan Blue commented on SPARK-22347:
---

[~viirya], [~cloud_fan]: Is there any objection to changing the resolution of 
this issue to "Won't Fix" instead of "Fixed"? Just documenting the behavior is 
not a fix.

If I don't hear anything in the next day or so, I'll update it.

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25060) PySpark UDF in case statement is always run

2018-08-13 Thread Ryan Blue (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578557#comment-16578557
 ] 

Ryan Blue commented on SPARK-25060:
---

Thanks, [~hyukjin.kwon], you're right that this is a duplicate. I've closed it.

> PySpark UDF in case statement is always run
> ---
>
> Key: SPARK-25060
> URL: https://issues.apache.org/jira/browse/SPARK-25060
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ryan Blue
>Priority: Major
>
> When evaluating a case statement with a python UDF, Spark will always run the 
> UDF even if the case doesn't use the branch with the UDF call. Here's a repro 
> case:
> {code:lang=python}
> from pyspark.sql.types import StringType
> def fail_if_x(s):
> assert s != 'x'
> return s
> spark.udf.register("fail_if_x", fail_if_x, StringType())
> df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str'])
> df.registerTempTable("data")
> spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end 
> from data").show()
> {code}
> This produces the following error:
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last): 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 189, in main 
> process() 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 184, in process 
> serializer.dump_stream(func(split_index, iterator), outfile) 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 104, in  
> func = lambda _, it: map(mapper, it) 
>   File "", line 1, in  
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 71, in  
> return lambda *a: f(*a) 
>   File "", line 4, in fail_if_x 
> AssertionError
> {code}
> This is because Python UDFs are extracted from expressions and run in the 
> BatchEvalPython node inserted as the child of the expression node:
> {code}
> == Physical Plan ==
> CollectLimit 21
> +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null 
> END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS 
> STRING) END#6]
>+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14]
>   +- Scan ExistingRDD[id#0L,str#1]
> {code}
> This doesn't affect correctness, but the behavior doesn't match the Scala API 
> where case can be used to avoid passing data that will cause a UDF to fail 
> into the UDF.
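A defensive workaround sketch, assuming the repro's `spark` session and `data` temp table: since BatchEvalPython may evaluate the Python UDF on every row, the UDF itself can tolerate the values the CASE branch was meant to exclude.

{code:lang=python}
from pyspark.sql.types import StringType

def safe_fail_if_x(s):
    # Handle the value the CASE WHEN branch was supposed to filter out,
    # because the UDF may still be invoked for those rows.
    if s == 'x':
        return None
    return s

spark.udf.register("safe_fail_if_x", safe_fail_if_x, StringType())
spark.sql("""
    select id,
           case when str <> 'x' then safe_fail_if_x(str) else null end
    from data
""").show()
{code}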



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25060) PySpark UDF in case statement is always run

2018-08-13 Thread Ryan Blue (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved SPARK-25060.
---
Resolution: Won't Fix

I'm closing this issue as "Won't Fix", the same as the issue this duplicates, 
SPARK-22347.

> PySpark UDF in case statement is always run
> ---
>
> Key: SPARK-25060
> URL: https://issues.apache.org/jira/browse/SPARK-25060
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Ryan Blue
>Priority: Major
>
> When evaluating a case statement with a python UDF, Spark will always run the 
> UDF even if the case doesn't use the branch with the UDF call. Here's a repro 
> case:
> {code:lang=python}
> from pyspark.sql.types import StringType
> def fail_if_x(s):
> assert s != 'x'
> return s
> spark.udf.register("fail_if_x", fail_if_x, StringType())
> df = spark.createDataFrame([(1, 'x'), (2, 'y')], ['id', 'str'])
> df.registerTempTable("data")
> spark.sql("select id, case when str <> 'x' then fail_if_x(str) else null end 
> from data").show()
> {code}
> This produces the following error:
> {code}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last): 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 189, in main 
> process() 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 184, in process 
> serializer.dump_stream(func(split_index, iterator), outfile) 
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 104, in  
> func = lambda _, it: map(mapper, it) 
>   File "", line 1, in  
>   File 
> "/mnt3/yarn/usercache/rblue/appcache/application_1533057049763_100912/container_1533057049763_100912_01_02/pyspark.zip/pyspark/worker.py",
>  line 71, in  
> return lambda *a: f(*a) 
>   File "", line 4, in fail_if_x 
> AssertionError
> {code}
> This is because Python UDFs are extracted from expressions and run in the 
> BatchEvalPython node inserted as the child of the expression node:
> {code}
> == Physical Plan ==
> CollectLimit 21
> +- *Project [id#0L, CASE WHEN NOT (str#1 = x) THEN pythonUDF0#14 ELSE null 
> END AS CASE WHEN (NOT (str = x)) THEN fail_if_x(str) ELSE CAST(NULL AS 
> STRING) END#6]
>+- BatchEvalPython [fail_if_x(str#1)], [id#0L, str#1, pythonUDF0#14]
>   +- Scan ExistingRDD[id#0L,str#1]
> {code}
> This doesn't affect correctness, but the behavior doesn't match the Scala API 
> where case can be used to avoid passing data that will cause a UDF to fail 
> into the UDF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578566#comment-16578566
 ] 

Wenchen Fan commented on SPARK-22347:
-

We changed our mind during code review and this JIRA is no longer valid; we 
should mark it as Won't Fix. [~rdblue] thanks for pointing it out!

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-22347:
-

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22347.
-
Resolution: Won't Fix

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
> Fix For: 2.3.0
>
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-22347:

Fix Version/s: (was: 2.3.0)

> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-08-13 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-23874:
-
Description: 
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support ARROW-2141
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported in as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 

  was:
Version 0.10.0 will allow for the following improvements and bug fixes:
 * Allow for adding BinaryType support SPARK-23555
 * Bug fix related to array serialization ARROW-1973
 * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
 * Python bytearrays are supported in as input to pyarrow ARROW-2141
 * Java has common interface for reset to cleanup complex vectors in Spark 
ArrowWriter ARROW-1962
 * Cleanup pyarrow type equality checks ARROW-2423
 * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
ARROW-2645
 * Improved low level handling of messages for RecordBatch ARROW-2704

 

 


> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.10.0 will allow for the following improvements and bug fixes:
>  * Allow for adding BinaryType support ARROW-2141
>  * Bug fix related to array serialization ARROW-1973
>  * Python2 str will be made into an Arrow string instead of bytes ARROW-2101
>  * Python bytearrays are supported in as input to pyarrow ARROW-2141
>  * Java has common interface for reset to cleanup complex vectors in Spark 
> ArrowWriter ARROW-1962
>  * Cleanup pyarrow type equality checks ARROW-2423
>  * ArrowStreamWriter should not hold references to ArrowBlocks ARROW-2632, 
> ARROW-2645
>  * Improved low level handling of messages for RecordBatch ARROW-2704
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-25104:
--

 Summary: Validate user specified output schema
 Key: SPARK-25104
 URL: https://issues.apache.org/jira/browse/SPARK-25104
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Gengliang Wang


With the code changes in 
[https://github.com/apache/spark/pull/21847], Spark can write out data as per 
a user-provided output schema.

To make it more robust and user friendly, we should validate the Avro schema 
before tasks are launched.

Also, we should support outputting the logical decimal type as BYTES (by 
default we output it as FIXED).
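For reference, a minimal sketch of supplying a user-specified Avro schema on write, assuming the built-in Avro data source and its `avroSchema` option, plus an existing DataFrame `df` whose columns match the schema (the schema string and path are placeholders):

{code:lang=python}
avro_schema = """
{
  "type": "record",
  "name": "example",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": ["null", "string"], "default": null}
  ]
}
"""

(df.write
   .format("avro")
   .option("avroSchema", avro_schema)   # user-provided output schema
   .save("/tmp/avro_out"))
{code}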



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24736) --py-files not functional for non local URLs. It appears to pass non-local URL's into PYTHONPATH directly.

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578624#comment-16578624
 ] 

holdenk commented on SPARK-24736:
-

cc [~ifilonenko]

> --py-files not functional for non local URLs. It appears to pass non-local 
> URL's into PYTHONPATH directly.
> --
>
> Key: SPARK-24736
> URL: https://issues.apache.org/jira/browse/SPARK-24736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.0
> Environment: Recent 2.4.0 from master branch, submitted on Linux to a 
> KOPS Kubernetes cluster created on AWS.
>  
>Reporter: Jonathan A Weaver
>Priority: Minor
>
> My spark-submit
> bin/spark-submit \
>         --master 
> k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]
>  \
>         --deploy-mode cluster \
>         --name pytest \
>         --conf 
> spark.kubernetes.container.image=[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest]
>  \
>         --conf 
> [spark.kubernetes.driver.pod.name|http://spark.kubernetes.driver.pod.name/]=spark-pi-driver
>  \
>         --conf 
> spark.kubernetes.authenticate.submission.caCertFile=[cluster.ca|http://cluster.ca/]
>  \
>         --conf spark.kubernetes.authenticate.submission.oauthToken=$TOK \
>         --conf spark.kubernetes.authenticate.driver.oauthToken=$TOK \
> --py-files "[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]" \
> [https://s3.amazonaws.com/maxar-ids-fids/it.py]
>  
> *screw.zip is successfully downloaded and placed in SparkFIles.getRootPath()*
> 2018-07-01 07:33:43 INFO  SparkContext:54 - Added file 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] at 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] with timestamp 
> 1530430423297
> 2018-07-01 07:33:43 INFO  Utils:54 - Fetching 
> [https://s3.amazonaws.com/maxar-ids-fids/screw.zip] to 
> /var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240/fetchFileTemp1549645948768432992.tmp
> *I print out the  PYTHONPATH and PYSPARK_FILES environment variables from the 
> driver script:*
>      PYTHONPATH 
> /opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.7-src.zip:/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar:/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-*.zip:*[https://s3.amazonaws.com/maxar-ids-fids/screw.zip]*
>     PYSPARK_FILES [https://s3.amazonaws.com/maxar-ids-fids/screw.zip]
>  
> *I print out sys.path*
> ['/tmp/spark-fec3684b-8b63-4f43-91a4-2f2fa41a1914', 
> u'/var/data/spark-7aba748d-2bba-4015-b388-c2ba9adba81e/spark-0ed5a100-6efa-45ca-ad4c-d1e57af76ffd/userFiles-a053206e-33d9-4245-b587-f8ac26d4c240',
>  '/opt/spark/python/lib/pyspark.zip', 
> '/opt/spark/python/lib/py4j-0.10.7-src.zip', 
> '/opt/spark/jars/spark-core_2.11-2.4.0-SNAPSHOT.jar', 
> '/opt/spark/python/lib/py4j-*.zip', *'/opt/spark/work-dir/https', 
> '//[s3.amazonaws.com/maxar-ids-fids/screw.zip|http://s3.amazonaws.com/maxar-ids-fids/screw.zip]',*
>  '/usr/lib/python27.zip', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-linux2', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/lib/python2.7/site-packages']
>  
> *URL from PYTHONFILES gets placed in sys.path verbatim with obvious results.*
>  
> *Dump of spark config from container.*
> Spark config dumped:
> [(u'spark.master', 
> u'k8s://[https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com|https://internal-api-test-k8s-local-7afed8-796273878.us-east-1.elb.amazonaws.com/]'),
>  (u'spark.kubernetes.authenticate.submission.oauthToken', 
> u''), 
> (u'spark.kubernetes.authenticate.driver.oauthToken', 
> u''), (u'spark.kubernetes.executor.podNamePrefix', 
> u'pytest-1530430411996'), (u'spark.kubernetes.memoryOverheadFactor', u'0.4'), 
> (u'spark.driver.blockManager.port', u'7079'), 
> (u'[spark.app.id|http://spark.app.id/]', u'spark-application-1530430424433'), 
> (u'[spark.app.name|http://spark.app.name/]', u'pytest'), 
> (u'[spark.executor.id|http://spark.executor.id/]', u'driver'), 
> (u'spark.driver.host', u'pytest-1530430411996-driver-svc.default.svc'), 
> (u'spark.kubernetes.container.image', 
> u'[412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest'|http://412834075398.dkr.ecr.us-east-1.amazonaws.com/fids/pyspark-k8s:latest']),
>  (u'spark.driver.port', u'7078'), 
> (u'spark.kubernetes.python.mainAppResource', 
> u'[https://s3.amazonaws.com/maxar-ids-fids/it.py']), 
> (u'spark.kub

[jira] [Assigned] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25104:


Assignee: (was: Apache Spark)

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> With code changes in 
> [https://github.com/apache/spark/pull/21847|https://github.com/apache/spark/pull/21847,]
>  , Spark can write out data as per user provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> Also we should support output logical decimal type as BYTES (By default we 
> output as FIXED)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578637#comment-16578637
 ] 

Apache Spark commented on SPARK-25104:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/22094

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Priority: Major
>
> With code changes in 
> [https://github.com/apache/spark/pull/21847|https://github.com/apache/spark/pull/21847,]
>  , Spark can write out data as per user provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> Also we should support output logical decimal type as BYTES (By default we 
> output as FIXED)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25104) Validate user specified output schema

2018-08-13 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25104:


Assignee: Apache Spark

> Validate user specified output schema
> -
>
> Key: SPARK-25104
> URL: https://issues.apache.org/jira/browse/SPARK-25104
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> With code changes in 
> [https://github.com/apache/spark/pull/21847|https://github.com/apache/spark/pull/21847,]
>  , Spark can write out data as per user provided output schema.
> To make it more robust and user friendly, we should validate the Avro schema 
> before tasks are launched.
> Also we should support output logical decimal type as BYTES (By default we 
> output as FIXED)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-24735:

Summary: Improve exception when mixing up pandas_udf types  (was: Improve 
exception when mixing pandas_udf types)

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-13 Thread holdenk (JIRA)
holdenk created SPARK-25105:
---

 Summary: Importing all of pyspark.sql.functions should bring 
PandasUDFType in as well
 Key: SPARK-25105
 URL: https://issues.apache.org/jira/browse/SPARK-25105
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.4.0
Reporter: holdenk


 
{code:java}
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
Traceback (most recent call last):
 File "", line 1, in 
NameError: name 'PandasUDFType' is not defined
 
{code}
When explicitly imported it works fine:
{code:java}
 
>>> from pyspark.sql.functions import PandasUDFType
>>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
{code}
 

We just need to make sure it's included in __all__.
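A hypothetical sketch of that fix (not the actual layout of pyspark/sql/functions.py; everything other than the pandas_udf/PandasUDFType names is a placeholder):

{code:lang=python}
# Whatever builds the module's __all__ just needs to export the enum as well,
# so that `from pyspark.sql.functions import *` also brings in PandasUDFType.
__all__ = [
    "pandas_udf",
    "PandasUDFType",   # add this so the wildcard import picks it up
    # ... the module's other public names ...
]
{code}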



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22347) UDF is evaluated when 'F.when' condition is false

2018-08-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578709#comment-16578709
 ] 

Liang-Chi Hsieh commented on SPARK-22347:
-

Agreed. Thanks [~rdblue]


> UDF is evaluated when 'F.when' condition is false
> -
>
> Key: SPARK-22347
> URL: https://issues.apache.org/jira/browse/SPARK-22347
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Nicolas Porter
>Assignee: Liang-Chi Hsieh
>Priority: Minor
>
> Here's a simple example on how to reproduce this:
> {code}
> from pyspark.sql import functions as F, Row, types
> def Divide10():
> def fn(value): return 10 / int(value)
> return F.udf(fn, types.IntegerType())
> df = sc.parallelize([Row(x=5), Row(x=0)]).toDF()
> x = F.col('x')
> df2 = df.select(F.when((x > 0), Divide10()(x)))
> df2.show(200)
> {code}
> This raises a division by zero error, even if `F.when` is trying to filter 
> out all cases where `x <= 0`. I believe the correct behavior should be not to 
> evaluate the UDF when the `F.when` condition is false.
> Interestingly enough, when the `F.when` condition is set to `F.lit(False)`, 
> then the error is not raised and all rows resolve to `null`, which is the 
> expected result.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578710#comment-16578710
 ] 

holdenk commented on SPARK-24735:
-

I think we could do better than just improving the exception. If we look at 
the other aggregates in PySpark, when we call them with select it does the 
grouping for us:

 
{code:java}
>>> df.select(sumDistinct(df._1)).show()
++
|sum(DISTINCT _1)|
++
| 4950   |
++{code}
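For contrast, a grouped aggregate pandas UDF currently has to go through groupBy().agg(); a minimal sketch, assuming a DataFrame `df` with columns `id` and `v` and pyarrow installed:

{code:lang=python}
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_v(v):
    # Receives a pandas Series per group and returns a single scalar.
    return v.mean()

df.groupBy("id").agg(mean_v(df["v"])).show()
{code}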

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24735) Improve exception when mixing up pandas_udf types

2018-08-13 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578719#comment-16578719
 ] 

holdenk commented on SPARK-24735:
-

So [~bryanc], what do you think about adding an AggregatePythonUDF and using 
it for grouped_map / grouped_agg, so that they get treated the correct way by 
the Scala SQL engine?

> Improve exception when mixing up pandas_udf types
> -
>
> Key: SPARK-24735
> URL: https://issues.apache.org/jira/browse/SPARK-24735
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>Priority: Major
>
> From the discussion here 
> https://github.com/apache/spark/pull/21650#discussion_r199203674, mixing up 
> Pandas UDF types, like using GROUPED_MAP as a SCALAR {{foo = 
> pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)}} produces an 
> exception which is hard to understand.  It should tell the user that the UDF 
> type is wrong.  This is the full output:
> {code}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> >>> df.select(foo(df['v'])).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/dataframe.py", 
> line 353, in show
> print(self._jdf.showString(n, 20, vertical))
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/Users/icexelloss/workspace/upstream/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o257.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> (input[0, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
>   at 
> org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
>   at scala.Option.getOrElse(Option.scala:121)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23654) Cut jets3t and bouncy castle as dependencies of spark-core

2018-08-13 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23654:
---
Summary: Cut jets3t and bouncy castle as dependencies of spark-core  (was: 
Cut jets3t as a dependency of spark-core)

> Cut jets3t and bouncy castle as dependencies of spark-core
> --
>
> Key: SPARK-23654
> URL: https://issues.apache.org/jira/browse/SPARK-23654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> Spark core declares a dependency on Jets3t, which pulls in other cruft:
> # the hadoop-cloud module pulls in the hadoop-aws module with the 
> jets3t-compatible connectors, and the relevant dependencies: the spark-core 
> dependency is incomplete if that module isn't built, and superfluous or 
> inconsistent if it is.
> # We've cut out s3n/s3 and all dependencies on jets3t entirely from Hadoop 
> 3.x, in favour of connectors we're willing to maintain.
> JetS3t was wonderful when it came out, but now the Amazon SDKs massively 
> exceed it in functionality, albeit at the expense of week-to-week stability 
> and JAR binary compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23654) Cut jets3t as a dependency of spark-core

2018-08-13 Thread Steve Loughran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23654:
---
Summary: Cut jets3t as a dependency of spark-core  (was: Cut jets3t and 
bouncy castle as dependencies of spark-core)

> Cut jets3t as a dependency of spark-core
> 
>
> Key: SPARK-23654
> URL: https://issues.apache.org/jira/browse/SPARK-23654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
>
> Spark core declares a dependency on Jets3t, which pulls in other cruft:
> # the hadoop-cloud module pulls in the hadoop-aws module with the 
> jets3t-compatible connectors, and the relevant dependencies: the spark-core 
> dependency is incomplete if that module isn't built, and superfluous or 
> inconsistent if it is.
> # We've cut out s3n/s3 and all dependencies on jets3t entirely from Hadoop 
> 3.x, in favour of connectors we're willing to maintain.
> JetS3t was wonderful when it came out, but now the Amazon SDKs massively 
> exceed it in functionality, albeit at the expense of week-to-week stability 
> and JAR binary compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-13 Thread MIK (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578752#comment-16578752
 ] 

MIK edited comment on SPARK-25051 at 8/13/18 6:21 PM:
--

Thanks [~yumwang] , with 2.3.2-rc4 the error is gone now but the result is not 
correct (getting 0 records), 
+---++
| id|name|
+---++
+---++

The sample program should return 2 records.
+---+-+
| id| name|
+---+-+
|  1|  one|
|  3|three|
+---+-+


was (Author: mik1007):
Thanks [~yumwang] , with 2.3.2-rc4 the error is gone now but the result is not 
correct (getting 0 records), 
+---++
| id|name|
+---++
+---++

The sample program should return 2 records.
+---+-+
| id| name|
+---+-+
|  1|  one|
|  3|three|
+---+-+

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in spark 2.2.2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25051) where clause on dataset gives AnalysisException

2018-08-13 Thread MIK (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578752#comment-16578752
 ] 

MIK commented on SPARK-25051:
-

Thanks [~yumwang], with 2.3.2-rc4 the error is gone now, but the result is not 
correct (getting 0 records):
+---++
| id|name|
+---++
+---++

The sample program should return 2 records.
+---+-+
| id| name|
+---+-+
|  1|  one|
|  3|three|
+---+-+

> where clause on dataset gives AnalysisException
> ---
>
> Key: SPARK-25051
> URL: https://issues.apache.org/jira/browse/SPARK-25051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: MIK
>Priority: Major
>
> *schemas :*
> df1
> => id ts
> df2
> => id name country
> *code:*
> val df = df1.join(df2, Seq("id"), "left_outer").where(df2("id").isNull)
> *error*:
> org.apache.spark.sql.AnalysisException:Resolved attribute(s) id#0 missing 
> from xx#15,xx#9L,id#5,xx#6,xx#11,xx#14,xx#13,xx#12,xx#7,xx#16,xx#10,xx#8L in 
> operator !Filter isnull(id#0). Attribute(s) with the same name appear in the 
> operation: id. Please check if the right attribute(s) are used.;;
>  at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:289)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
>     at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:80)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
>     at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:104)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
>     at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
>     at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:172)
>     at org.apache.spark.sql.Dataset.<init>(Dataset.scala:178)
>     at org.apache.spark.sql.Dataset$.apply(Dataset.scala:65)
>     at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:3300)
>     at org.apache.spark.sql.Dataset.filter(Dataset.scala:1458)
>     at org.apache.spark.sql.Dataset.where(Dataset.scala:1486)
> This works fine in Spark 2.2.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-08-13 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578763#comment-16578763
 ] 

Liang-Chi Hsieh commented on SPARK-24410:
-

The above code shows that the two tables in the union result are located in 
logically different partitions, even if you know they might be physically 
co-partitioned. So we can't just get rid of the shuffle and expect correct 
results, given `SparkContext.union`'s current implementation.

That is why cloud-fan suggested implementing Union with RDD.zip for certain 
cases, to preserve the children's output partitioning.

Although we can make Union smarter about its output partitioning, from the 
discussion you can see we might also need to consider parallelism and locality.
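
To make the RDD.zip idea concrete, here is a rough sketch (not an actual Spark change; rdd1 and 
rdd2 are placeholders assumed to share the same partitioner and number of partitions):

{code}
// Sketch: a partition-wise "union" of two co-partitioned RDDs that preserves
// the children's partitioning (assumes same partitioner and partition count).
val unioned = rdd1.zipPartitions(rdd2, preservesPartitioning = true) {
  (left, right) => left ++ right  // concatenate the two co-located partitions
}
{code}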

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> For the join query we get the "SortMergeJoin":
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> {code}
> But for the aggregation after the union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25105) Importing all of pyspark.sql.functions should bring PandasUDFType in as well

2018-08-13 Thread kevin yu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578766#comment-16578766
 ] 

kevin yu commented on SPARK-25105:
--

I will try to fix it. Thanks. Kevin

> Importing all of pyspark.sql.functions should bring PandasUDFType in as well
> 
>
> Key: SPARK-25105
> URL: https://issues.apache.org/jira/browse/SPARK-25105
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: holdenk
>Priority: Trivial
>
>  
> {code:java}
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
> NameError: name 'PandasUDFType' is not defined
>  
> {code}
> When explicitly imported it works fine:
> {code:java}
>  
> >>> from pyspark.sql.functions import PandasUDFType
> >>> foo = pandas_udf(lambda x: x, 'v int', PandasUDFType.GROUPED_MAP)
> {code}
>  
> We just need to make sure it's included in __all__.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578767#comment-16578767
 ] 

Marcelo Vanzin commented on SPARK-24918:


For reference: this looks kinda similar to SPARK-650.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).
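
To make the "base class with no-op implementations" idea concrete, a rough sketch (hypothetical 
names, not a committed API) could look like:

{code}
// Hypothetical sketch only: no-op defaults so new events can be added later
// without breaking existing plugins.
trait ExecutorPlugin {
  def init(): Unit = {}                     // called once when the executor starts
  def shutdown(): Unit = {}                 // called when the executor shuts down
  def onTaskStart(taskId: Long): Unit = {}  // possible finer-grained events
  def onTaskEnd(taskId: Long): Unit = {}
}
{code}

Executors would then instantiate whatever plugin classes are named in some configuration entry 
(whatever that ends up being called) before running the first task.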



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24410) Missing optimization for Union on bucketed tables

2018-08-13 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578789#comment-16578789
 ] 

Eyal Farago commented on SPARK-24410:
-

[~viirya], my bad :)

It seems there are two distinct issues here: one is the general behavior of 
join/aggregate over unions, the other is the guarantees of bucketed 
partitioning.

Looking more carefully at the results of your query, it seems that the two DFs 
are not co-partitioned (which is a bit surprising), so my apologies.

Having said that, there's a more general issue with pushing shuffle-related 
operations down through a union. Do you guys think this deserves a separate 
issue?
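
For reference, one way to check whether the two bucketed tables are actually seen as 
co-partitioned (a sketch only, using the a1/a2 tables from the description and the 
Dataset.queryExecution accessor):

{code}
// Sketch: look at the physical output partitioning of each bucketed table scan,
// and check whether a per-table aggregate on the bucket column needs an Exchange.
val d1 = spark.table("a1")
val d2 = spark.table("a2")
println(d1.queryExecution.executedPlan.outputPartitioning)
println(d2.queryExecution.executedPlan.outputPartitioning)
d1.groupBy("key").count().explain()  // no Exchange expected if bucketing is picked up
{code}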

 

> Missing optimization for Union on bucketed tables
> -
>
> Key: SPARK-24410
> URL: https://issues.apache.org/jira/browse/SPARK-24410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Major
>
> A common use-case we have is a partially aggregated table plus daily 
> increments that we need to further aggregate. We do this by unioning the two 
> tables and aggregating again.
> We tried to optimize this process by bucketing the tables, but currently it 
> seems that the union operator doesn't leverage the tables being bucketed 
> (like the join operator does).
> For example, for two bucketed tables a1, a2:
> {code}
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
> .repartition(col("key"))
> .write
>   .mode(SaveMode.Overwrite)
> .bucketBy(3, "key")
> .sortBy("t1")
> .saveAsTable("a1")
> sparkSession.range(N).selectExpr(
>   "id as key",
>   "id % 2 as t1",
>   "id % 3 as t2")
>   .repartition(col("key"))
>   .write.mode(SaveMode.Overwrite)
>   .bucketBy(3, "key")
>   .sortBy("t1")
>   .saveAsTable("a2")
> {code}
> For the join query we get the "SortMergeJoin":
> {code}
> select * from a1 join a2 on (a1.key=a2.key)
> == Physical Plan ==
> *(3) SortMergeJoin [key#24L], [key#27L], Inner
> :- *(1) Sort [key#24L ASC NULLS FIRST], false, 0
> :  +- *(1) Project [key#24L, t1#25L, t2#26L]
> : +- *(1) Filter isnotnull(key#24L)
> :+- *(1) FileScan parquet default.a1[key#24L,t1#25L,t2#26L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> +- *(2) Sort [key#27L ASC NULLS FIRST], false, 0
>+- *(2) Project [key#27L, t1#28L, t2#29L]
>   +- *(2) Filter isnotnull(key#27L)
>  +- *(2) FileScan parquet default.a2[key#27L,t1#28L,t2#29L] Batched: 
> true, Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct<key:bigint,t1:bigint,t2:bigint>
> {code}
> But for the aggregation after the union we get a shuffle:
> {code}
> select key,count(*) from (select * from a1 union all select * from a2)z group 
> by key
> == Physical Plan ==
> *(4) HashAggregate(keys=[key#25L], functions=[count(1)], output=[key#25L, 
> count(1)#36L])
> +- Exchange hashpartitioning(key#25L, 1)
>+- *(3) HashAggregate(keys=[key#25L], functions=[partial_count(1)], 
> output=[key#25L, count#38L])
>   +- Union
>  :- *(1) Project [key#25L]
>  :  +- *(1) FileScan parquet default.a1[key#25L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a1], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
>  +- *(2) Project [key#28L]
> +- *(2) FileScan parquet default.a2[key#28L] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/some/where/spark-warehouse/a2], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct<key:bigint>
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-25106:
--
Attachment: console.txt

> A new Kafka consumer gets created for every batch
> -
>
> Key: SPARK-25106
> URL: https://issues.apache.org/jira/browse/SPARK-25106
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: console.txt
>
>
> I have a fairly simple piece of code that reads from Kafka, applies some 
> transformations - including applying a UDF - and writes the result to the 
> console. Every time a batch is created, a new consumer is created (and not 
> closed), eventually leading to a "too many open files" error.
> I created a test case, with the code available here: 
> [https://github.com/aseigneurin/spark-kafka-issue]
> To reproduce:
>  # Start Kafka and create a topic called "persons"
>  # Run "Producer" to generate data
>  # Run "Consumer"
> I am attaching the log where you can see a new consumer being initialized 
> between every batch.
> Please note this issue does *not* appear with Spark 2.2.2.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)
Alexis Seigneurin created SPARK-25106:
-

 Summary: A new Kafka consumer gets created for every batch
 Key: SPARK-25106
 URL: https://issues.apache.org/jira/browse/SPARK-25106
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Alexis Seigneurin
 Attachments: console.txt

I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2.
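
For context, a minimal sketch of the kind of pipeline described above (the broker address and 
the UDF are placeholders; this is not the code from the linked repository):

{code}
import org.apache.spark.sql.functions._

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder broker
  .option("subscribe", "persons")
  .load()

val shout = udf((s: String) => if (s == null) null else s.toUpperCase)  // any UDF

val query = input
  .selectExpr("CAST(value AS STRING) AS value")
  .withColumn("value", shout(col("value")))
  .writeStream
  .format("console")
  .start()

query.awaitTermination()
{code}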



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-25106:
--
Description: 
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2, and it does not 
appear either when I don't apply the UDF.

  was:
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2.


> A new Kafka consumer gets created for every batch
> -
>
> Key: SPARK-25106
> URL: https://issues.apache.org/jira/browse/SPARK-25106
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: console.txt
>
>
> I have a fairly simple piece of code that reads from Kafka, applies some 
> transformations - including applying a UDF - and writes the result to the 
> console. Every time a batch is created, a new consumer is created (and not 
> closed), eventually leading to a "too many open files" error.
> I created a test case, with the code available here: 
> [https://github.com/aseigneurin/spark-kafka-issue]
> To reproduce:
>  # Start Kafka and create a topic called "persons"
>  # Run "Producer" to generate data
>  # Run "Consumer"
> I am attaching the log where you can see a new consumer being initialized 
> between every batch.
> Please note this issue does *not* appear with Spark 2.2.2, and it does not 
> appear either when I don't apply the UDF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25106) A new Kafka consumer gets created for every batch

2018-08-13 Thread Alexis Seigneurin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexis Seigneurin updated SPARK-25106:
--
Description: 
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2, and it does not 
appear either when I don't apply the UDF.

I suspect - although I did not go far enough to confirm - that this issue is 
related to the improvement made in SPARK-23623.

  was:
I have a fairly simple piece of code that reads from Kafka, applies some 
transformations - including applying a UDF - and writes the result to the 
console. Every time a batch is created, a new consumer is created (and not 
closed), eventually leading to a "too many open files" error.

I created a test case, with the code available here: 
[https://github.com/aseigneurin/spark-kafka-issue]

To reproduce:
 # Start Kafka and create a topic called "persons"
 # Run "Producer" to generate data
 # Run "Consumer"

I am attaching the log where you can see a new consumer being initialized 
between every batch.

Please note this issue does *not* appear with Spark 2.2.2, and it does not 
appear either when I don't apply the UDF.


> A new Kafka consumer gets created for every batch
> -
>
> Key: SPARK-25106
> URL: https://issues.apache.org/jira/browse/SPARK-25106
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: console.txt
>
>
> I have a fairly simple piece of code that reads from Kafka, applies some 
> transformations - including applying a UDF - and writes the result to the 
> console. Every time a batch is created, a new consumer is created (and not 
> closed), eventually leading to a "too many open files" error.
> I created a test case, with the code available here: 
> [https://github.com/aseigneurin/spark-kafka-issue]
> To reproduce:
>  # Start Kafka and create a topic called "persons"
>  # Run "Producer" to generate data
>  # Run "Consumer"
> I am attaching the log where you can see a new consumer being initialized 
> between every batch.
> Please note this issue does *not* appear with Spark 2.2.2, and it does not 
> appear either when I don't apply the UDF.
> I suspect - although I did not go far enough to confirm - that this issue is 
> related to the improvement made in SPARK-23623.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22905) Fix ChiSqSelectorModel, GaussianMixtureModel save implementation for Row order issues

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578850#comment-16578850
 ] 

Apache Spark commented on SPARK-22905:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/22079

> Fix ChiSqSelectorModel, GaussianMixtureModel save implementation for Row 
> order issues
> -
>
> Key: SPARK-22905
> URL: https://issues.apache.org/jira/browse/SPARK-22905
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.1
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, in `ChiSqSelectorModel`, save:
> {code}
> spark.createDataFrame(dataArray).repartition(1).write...
> {code}
> The default partition number used by createDataFrame is "defaultParallelism", 
> and the current RoundRobinPartitioning won't guarantee that "repartition" 
> generates the same row order as the local array. We need to fix it.
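
One way to sidestep the ordering problem (a hedged sketch, not necessarily what the linked pull 
request does; assumes dataArray is a local Seq of case-class rows and outputPath is a 
placeholder) is to build the DataFrame from a single-partition RDD so no repartition is needed:

{code}
// Sketch: a single-partition RDD keeps the local array order end to end,
// so there is no round-robin repartition that could reorder the rows.
val df = spark.createDataFrame(spark.sparkContext.parallelize(dataArray, numSlices = 1))
df.write.parquet(outputPath)
{code}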



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25079) [PYTHON] upgrade python 3.4 -> 3.5

2018-08-13 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578851#comment-16578851
 ] 

shane knapp commented on SPARK-25079:
-

question:  do we want to upgrade to 3.6 instead?

> [PYTHON] upgrade python 3.4 -> 3.5
> --
>
> Key: SPARK-25079
> URL: https://issues.apache.org/jira/browse/SPARK-25079
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> for the impending arrow upgrade 
> (https://issues.apache.org/jira/browse/SPARK-23874) we need to bump python 
> 3.4 -> 3.5.
> i have been testing this here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/|https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69]
> my methodology:
> 1) upgrade python + arrow to 3.5 and 0.10.0
> 2) run python tests
> 3) when i'm happy that Things Won't Explode Spectacularly, pause jenkins and 
> upgrade centos workers to python3.5
> 4) simultaneously do the following: 
>   - create a symlink in /home/anaconda/envs/py3k/bin for python3.4 that 
> points to python3.5 (this is currently being tested here:  
> [https://amplab.cs.berkeley.edu/jenkins/view/RISELab%20Infra/job/ubuntuSparkPRB/69)]
>   - push a change to python/run-tests.py replacing 3.4 with 3.5
> 5) once the python3.5 change to run-tests.py is merged, we will need to 
> back-port this to all existing branches
> 6) then and only then can i remove the python3.4 -> python3.5 symlink



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578853#comment-16578853
 ] 

Imran Rashid commented on SPARK-24918:
--

Ah, right, thanks [~vanzin], I knew I had seen this before.

[~srowen], you argued the most against SPARK-650 -- have I made the case here?  
I did indeed at first do exactly what you suggested, using a static 
initializer, but realized it was not great for a few very important 
reasons:

* dynamic allocation
* turning on a "debug" mode without any code changes (you'd be surprised how 
big a hurdle this is for something in production)
* "sql only" apps, where the end user barely knows anything about calling a 
mapPartitions function

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578856#comment-16578856
 ] 

Imran Rashid commented on SPARK-650:


Folks may be interested in SPARK-24918.  Perhaps one should be closed as a 
duplicate of the other, but for now there is some discussion on both, so I'll 
leave them open for the time being.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578858#comment-16578858
 ] 

Imran Rashid commented on SPARK-24918:
--

[~lucacanali] OK I see the case for what you're proposing -- its hard too setup 
that communication between the driver & executors without *some* initial setup 
message.

Still ... I'm a bit reluctant to include that now, until we see someone 
actually builds something that uses it.  I realizes you might be hesitant to do 
that until you know it can be built on a stable api, but I don't think we can 
get around that.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24918) Executor Plugin API

2018-08-13 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578858#comment-16578858
 ] 

Imran Rashid edited comment on SPARK-24918 at 8/13/18 8:15 PM:
---

[~lucacanali] OK I see the case for what you're proposing -- it's hard to set up 
that communication between the driver & executors without *some* initial setup 
message.

Still ... I'm a bit reluctant to include that now, until we see someone 
actually build something that uses it.  I realize you might be hesitant to do 
that until you know it can be built on a stable API, but I don't think we can 
get around that.


was (Author: irashid):
[~lucacanali] OK I see the case for what you're proposing -- its hard too setup 
that communication between the driver & executors without *some* initial setup 
message.

Still ... I'm a bit reluctant to include that now, until we see someone 
actually builds something that uses it.  I realizes you might be hesitant to do 
that until you know it can be built on a stable api, but I don't think we can 
get around that.

> Executor Plugin API
> ---
>
> Key: SPARK-24918
> URL: https://issues.apache.org/jira/browse/SPARK-24918
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
>  Labels: SPIP, memory-analysis
>
> It would be nice if we could specify an arbitrary class to run within each 
> executor for debugging and instrumentation.  It's hard to do this currently 
> because:
> a) you have no idea when executors will come and go with DynamicAllocation, 
> so don't have a chance to run custom code before the first task
> b) even with static allocation, you'd have to change the code of your spark 
> app itself to run a special task to "install" the plugin, which is often 
> tough in production cases when those maintaining regularly running 
> applications might not even know how to make changes to the application.
> For example, https://github.com/squito/spark-memory could be used in a 
> debugging context to understand memory use, just by re-running an application 
> with extra command line arguments (as opposed to rebuilding spark).
> I think one tricky part here is just deciding the API, and how it's versioned. 
>  Does it just get created when the executor starts, and that's it?  Or does it 
> get more specific events, like task start, task end, etc?  Would we ever add 
> more events?  It should definitely be a {{DeveloperApi}}, so breaking 
> compatibility would be allowed ... but still should be avoided.  We could 
> create a base class that has no-op implementations, or explicitly version 
> everything.
> Note that this is not needed in the driver as we already have SparkListeners 
> (even if you don't care about the SparkListenerEvents and just want to 
> inspect objects in the JVM, it's still good enough).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23984) PySpark Bindings for K8S

2018-08-13 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578871#comment-16578871
 ] 

Apache Spark commented on SPARK-23984:
--

User 'ifilonenko' has created a pull request for this issue:
https://github.com/apache/spark/pull/22095

> PySpark Bindings for K8S
> 
>
> Key: SPARK-23984
> URL: https://issues.apache.org/jira/browse/SPARK-23984
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, PySpark
>Affects Versions: 2.3.0
>Reporter: Ilan Filonenko
>Priority: Major
> Fix For: 2.4.0
>
>
> This ticket is tracking the ongoing work of moving the upstream work from 
> [https://github.com/apache-spark-on-k8s/spark] specifically regarding Python 
> bindings for Spark on Kubernetes. 
> The points of focus are: dependency management, increased non-JVM memory 
> overhead default values, and modified Docker images to include Python 
> Support. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


