[jira] [Created] (SPARK-27670) Add high availability for Spark Hive Thrift Server.

2019-05-10 Thread jiaan.geng (JIRA)
jiaan.geng created SPARK-27670:
--

 Summary: Add high availability for Spark Hive Thrift Server.
 Key: SPARK-27670
 URL: https://issues.apache.org/jira/browse/SPARK-27670
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: jiaan.geng


`HiveThriftServer2` is a multi-session managed service based on `HiveServer2`,
which provides centralized management services for Hive.

Since `HiveThriftServer2` itself inherits from `HiveServer2`, `HiveServer2`'s
own HA solution could also support `HiveThriftServer2`.

Yet `HiveThriftServer2` does not support HA today. Why is that?

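For illustration only (not a proposed implementation), this is roughly how the Thrift server could be started with HiveServer2's own ZooKeeper-based service-discovery settings forwarded to it; whether `HiveThriftServer2` actually honors these `hive.server2.*` keys is exactly what this ticket asks about, so treat them as assumptions:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerHaSketch {
  def main(args: Array[String]): Unit = {
    // HiveServer2's HA uses ZooKeeper service discovery; these are its standard
    // settings, passed to the embedded Hive conf via the spark.hadoop.* prefix.
    val spark = SparkSession.builder()
      .appName("spark-thrift-server-ha-sketch")
      .enableHiveSupport()
      .config("spark.hadoop.hive.server2.support.dynamic.service.discovery", "true")
      .config("spark.hadoop.hive.zookeeper.quorum", "zk1:2181,zk2:2181,zk3:2181")
      .config("spark.hadoop.hive.server2.zookeeper.namespace", "sparkthriftserver")
      .getOrCreate()

    // Start the Spark Thrift server on top of this session.
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}
{code}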


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27670) Add high availability for Spark Hive Thrift Server.

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27670:


Assignee: Apache Spark

> Add High available for Spark Hive thrift server.
> 
>
> Key: SPARK-27670
> URL: https://issues.apache.org/jira/browse/SPARK-27670
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> `HiveThriftServer2` is a multi-session managed service based on 
> `HiveServer2`, which provides centralized management services for Hive.
> Since `HiveThriftServer2` itself inherits from `HiveServer2`, `HiveServer2`'s 
> own HA solution can also support `HiveThriftServer2`.
> `HiveThriftServer2` does not support HA. Why is this?






[jira] [Assigned] (SPARK-27670) Add high availability for Spark Hive Thrift Server.

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27670:


Assignee: (was: Apache Spark)

> Add High available for Spark Hive thrift server.
> 
>
> Key: SPARK-27670
> URL: https://issues.apache.org/jira/browse/SPARK-27670
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> `HiveThriftServer2` is a multi-session managed service based on 
> `HiveServer2`, which provides centralized management services for Hive.
> Since `HiveThriftServer2` itself inherits from `HiveServer2`, `HiveServer2`'s 
> own HA solution can also support `HiveThriftServer2`.
> `HiveThriftServer2` does not support HA. Why is this?






[jira] [Updated] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Attachment: houragg_filter.csv

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> image-2019-05-09-17-51-14-036.png
>
>
> *Spark Program Business Logic:*
>  Read a topic from Kafka, aggregate the streaming data, and then write the
> result to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory
> overflow problems often occurred (because too many versions of state are kept
> in memory; this bug was fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occurred when running with these submit resources
> under Spark 2.2, and the job normally could not run for more than one day.
> The workaround was to set the executor memory larger than before. My
> spark-submit script was as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the Spark program runs stably for a long time, and the executor
> storage memory stays below 10 MB (it has been running stably for more than 20
> days).
> *2) From the Spark 2.4 upgrade notes, we can see that the problem of large
> memory consumption by state storage was solved in Spark 2.4.*
>  So we upgraded Spark to 2.4 under CDH, ran the program again, and found that
> memory usage was indeed reduced.
>  But a new problem arises: as the running time increases, the executor storage
> memory keeps growing (see Executors -> Storage Memory in the Spark UI on the
> YARN Resource Manager).
>  This program has now been running for 14 days (under Spark 2.2, with these
> submit resources, it normally could not run for more than one day before
> executor memory problems occurred).
>  The spark-submit script under Spark 2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G
> {code}
> Under Spark 2.4, I tracked the executor storage memory over time while the
> program was running:
> |Run time (hours)|Storage Memory (used/total)|Memory growth rate (MB/hour)|
> |23.5|41.6MB/1.5GB|1.770212766|
> |108.4|460.2MB/1.5GB|4.245387454|
> |131.7|559.1MB/1.5GB|4.245254366|
> |135.4|575MB/1.5GB|4.246676514|
> |153.6|641.2MB/1.5GB|4.174479167|
> |219|888.1MB/1.5GB|4.055251142|
> |263|1126.4MB/1.5GB|4.282889734|
> |309|1228.8MB/1.5GB|3.976699029|

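The business flow described above (read a Kafka topic, aggregate it with streaming state, write the result back to Kafka) corresponds roughly to the following minimal sketch. Topic names, servers, schema and the window/watermark values are placeholders of mine, not details from the report:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HourAggSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("houragg-sketch").getOrCreate()
    import spark.implicits._

    // Read the source topic (placeholder broker/topic names).
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "input_topic")
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // Stateful hourly aggregation; the state kept for this is what shows up as
    // memoryUsedBytes in the streaming progress log.
    val agg = input
      .withWatermark("timestamp", "1 hour")
      .groupBy(window($"timestamp", "1 hour"), $"value")
      .count()

    // Write the aggregated rows back to another topic.
    val query = agg
      .selectExpr("CAST(value AS STRING) AS key", "CAST(`count` AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("topic", "output_topic")
      .option("checkpointLocation", "/tmp/houragg-checkpoint")
      .outputMode("update")
      .start()

    query.awaitTermination()
  }
}
{code}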





[jira] [Commented] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837078#comment-16837078
 ] 

tommy duan commented on SPARK-27648:


Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields
(*timestamp, numRowsTotal, numRowsUpdated, memoryUsedBytes*) out of this
progress log for details.

Please check the attachment ([^houragg_filter.csv]).

Then, by timestamp, I charted the state memory usage (*memoryUsedBytes*), the
number of updated records (*numRowsUpdated*), and the total number of stored
records (*numRowsTotal*).

The resulting charts are shown below:

!image-2019-05-10-17-18-25-051.png!
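
For context (this is an illustration of mine, not the script actually used to produce [^houragg_filter.csv]), these per-trigger fields can be pulled out of the streaming progress updates with a listener like this:

{code:java}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Prints timestamp, numRowsTotal, numRowsUpdated and memoryUsedBytes for every
// state operator on each progress update, i.e. the same fields filtered into the CSV.
class StateMemoryLogger extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    p.stateOperators.foreach { op =>
      println(s"${p.timestamp},${op.numRowsTotal},${op.numRowsUpdated},${op.memoryUsedBytes}")
    }
  }
}

// Registered once on the session, e.g.:
// spark.streams.addListener(new StateMemoryLogger)
{code}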

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-18-25-051.png
>
>
> *Spark Program Business Logic:*
>  Read a topic from Kafka, aggregate the streaming data, and then write the
> result to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory
> overflow problems often occurred (because too many versions of state are kept
> in memory; this bug was fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occurred when running with these submit resources
> under Spark 2.2, and the job normally could not run for more than one day.
> The workaround was to set the executor memory larger than before. My
> spark-submit script was as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the Spark program runs stably for a long time, and the executor
> storage memory stays below 10 MB (it has been running stably for more than 20
> days).
> *2) From the Spark 2.4 upgrade notes, we can see that the problem of large
> memory consumption by state storage was solved in Spark 2.4.*
>  So we upgraded Spark to 2.4 under CDH, ran the program again, and found that
> memory usage was indeed reduced.
>  But a new problem arises: as the running time increases, the executor storage
> memory keeps growing (see Executors -> Storage Memory in the Spark UI on the
> YARN Resource Manager).
>  This program has now been running for 14 days (under Spark 2.2, with these
> submit resources, it normally could not run for more than one day before
> executor memory problems occurred).
>  The spark-submit script under Spark 2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G
> {code}
> Under Spark 2.4, I tracked the executor storage memory over time while the
> program was running:
> |Run time (hours)|Storage Memory (used/total)|Memory growth rate (MB/hour)|
> |23.5|41.6MB/1.5GB|1.770212766|
> |108.4|460.2MB/1.5GB|4.245387454|
> |131.7|559.1MB/1.5GB|4.245254366|
> |135.4|575MB/1.5GB|4.246676514|
> |153.6|641.2MB/1.5GB|4.174479167|
> |219|888.1MB/1.5GB|4.055251142|
> |263|1126.4MB/1.5GB|4.282889734|
> |309|1228.8MB/1.5GB|3.976699029|






[jira] [Commented] (SPARK-27600) Unable to start Spark Hive Thrift Server when multiple Hive servers share the same metastore

2019-05-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837081#comment-16837081
 ] 

Hyukjin Kwon commented on SPARK-27600:
--

Spark is upgrading Hive to 2.3.5; this will be fixed together with that upgrade soon anyway.

> Unable to start Spark Hive Thrift Server when multiple hive server server 
> share the same metastore
> --
>
> Key: SPARK-27600
> URL: https://issues.apache.org/jira/browse/SPARK-27600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: pin_zhang
>Priority: Major
>
> When ten or more Spark Hive Thrift Servers are started at the same time, more
> than one version row is saved to the VERSION table when the following warning
> occurs: WARN [DataNucleus.Query] (main:) Query for candidates of
> org.apache.hadoop.hive.metastore.model.MVersionTable and subclasses resulted
> in no possible candidates
> Exception thrown obtaining schema column information from datastore
> org.datanucleus.exceptions.NucleusDataStoreException: Exception thrown 
> obtaining schema column information from datastore
> Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 
> 'via_ms.deleteme1556239494724' doesn't exist
>  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>  at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>  at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
>  at com.mysql.jdbc.Util.getInstance(Util.java:408)
>  at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:944)
>  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3978)
>  at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3914)
>  at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2530)
>  at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2683)
>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2491)
>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2449)
>  at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1381)
>  at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2441)
>  at com.mysql.jdbc.DatabaseMetaData$2.forEach(DatabaseMetaData.java:2339)
>  at com.mysql.jdbc.IterateBlock.doForAll(IterateBlock.java:50)
>  at com.mysql.jdbc.DatabaseMetaData.getColumns(DatabaseMetaData.java:2337)
>  at 
> org.apache.commons.dbcp.DelegatingDatabaseMetaData.getColumns(DelegatingDatabaseMetaData.java:218)
>  at 
> org.datanucleus.store.rdbms.adapter.BaseDatastoreAdapter.getColumns(BaseDatastoreAdapter.java:1532)
>  at 
> org.datanucleus.store.rdbms.schema.RDBMSSchemaHandler.refreshTableData(RDBMSSchemaHandler.java:921)
> After that, the Hive Thrift Server can no longer be started because of
> MetaException(message:Metastore contains multiple versions (2)






[jira] [Updated] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Attachment: image-2019-05-10-17-18-25-051.png

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-18-25-051.png
>
>
> *Spark Program Business Logic:*
>  Read a topic from Kafka, aggregate the streaming data, and then write the
> result to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory
> overflow problems often occurred (because too many versions of state are kept
> in memory; this bug was fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occurred when running with these submit resources
> under Spark 2.2, and the job normally could not run for more than one day.
> The workaround was to set the executor memory larger than before. My
> spark-submit script was as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the Spark program runs stably for a long time, and the executor
> storage memory stays below 10 MB (it has been running stably for more than 20
> days).
> *2) From the Spark 2.4 upgrade notes, we can see that the problem of large
> memory consumption by state storage was solved in Spark 2.4.*
>  So we upgraded Spark to 2.4 under CDH, ran the program again, and found that
> memory usage was indeed reduced.
>  But a new problem arises: as the running time increases, the executor storage
> memory keeps growing (see Executors -> Storage Memory in the Spark UI on the
> YARN Resource Manager).
>  This program has now been running for 14 days (under Spark 2.2, with these
> submit resources, it normally could not run for more than one day before
> executor memory problems occurred).
>  The spark-submit script under Spark 2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G
> {code}
> Under Spark 2.4, I tracked the executor storage memory over time while the
> program was running:
> |Run time (hours)|Storage Memory (used/total)|Memory growth rate (MB/hour)|
> |23.5|41.6MB/1.5GB|1.770212766|
> |108.4|460.2MB/1.5GB|4.245387454|
> |131.7|559.1MB/1.5GB|4.245254366|
> |135.4|575MB/1.5GB|4.246676514|
> |153.6|641.2MB/1.5GB|4.174479167|
> |219|888.1MB/1.5GB|4.055251142|
> |263|1126.4MB/1.5GB|4.282889734|
> |309|1228.8MB/1.5GB|3.976699029|






[jira] [Comment Edited] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837078#comment-16837078
 ] 

tommy duan edited comment on SPARK-27648 at 5/10/19 9:41 AM:
-

Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields
(*timestamp, numRowsTotal, numRowsUpdated, memoryUsedBytes*) out of this
progress log for details.

Please check the attachment ([^houragg_filter.csv]).

Then, by timestamp, I charted the state memory usage (*memoryUsedBytes*), the
number of updated records (*numRowsUpdated*), and the total number of stored
records (*numRowsTotal*).

The resulting charts are shown below:

!image-2019-05-10-17-18-25-051.png!

From this analysis we can see that the state memory usage (*memoryUsedBytes*,
from the progress log) does not fluctuate much (it is basically stable), but in
the *Spark UI* -> *Executors* tab, *Storage Memory* does keep increasing over
time.

{color:#ff}*Please note that:*{color}

{color:#59afe1}1) The log files ([^houragg(1).out] & [^houragg_filter.csv])
above cover 2019-04-23 to 2019-04-29. In fact, the job ran from 2019-04-23 to
2019-05-10, but the log after 2019-04-29 was lost.{color}

{color:#59afe1}2) What is certain is that "Storage Memory" keeps increasing in
the Spark UI -> Executors tab.{color}
|{color:#59afe1}*TimeStamp*{color}|{color:#59afe1}*Run-time(hour)*{color}|{color:#59afe1}*Storage
 Memory size(MB)*{color}|{color:#59afe1}*Memory growth rate(MB/hour)*{color}|
|{color:#59afe1}2019-04-23{color}|{color:#59afe1}0H{color}|{color:#59afe1}0MB/1.5GB{color}|{color:#59afe1}0{color}|
|{color:#59afe1}2019-04-24{color}|{color:#59afe1}23.5H{color}|{color:#59afe1}41.6MB/1.5GB{color}|{color:#59afe1}1.770212766{color}|
|{color:#59afe1}2019-04-28{color}|{color:#59afe1}108.4H{color}|{color:#59afe1}460.2MB/1.5GB{color}|{color:#59afe1}4.245387454{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}131.7H{color}|{color:#59afe1}559.1MB/1.5GB{color}|{color:#59afe1}4.245254366{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}135.4H{color}|{color:#59afe1}575MB/1.5GB{color}|{color:#59afe1}4.246676514{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}153.6H{color}|{color:#59afe1}641.2MB/1.5GB{color}|{color:#59afe1}4.174479167{color}|
|{color:#59afe1}2019-05-02{color}|{color:#59afe1}219H{color}|{color:#59afe1}888.1MB/1.5GB{color}|{color:#59afe1}4.055251142{color}|
|{color:#59afe1}..{color}|{color:#59afe1}263H{color}|{color:#59afe1}1126.4MB/1.5GB{color}|{color:#59afe1}4.282889734{color}|
|{color:#59afe1}..{color}|{color:#59afe1}309H{color}|{color:#59afe1}1228.8MB/1.5GB{color}|{color:#59afe1}3.976699029{color}|
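
As a quick sanity check (my own arithmetic, not from the report), the growth-rate column is simply the used storage memory divided by the elapsed hours, e.g. for the 108.4-hour row:

{code:java}
// 460.2 MB accumulated over 108.4 hours:
val ratePerHour = 460.2 / 108.4   // ≈ 4.2453874, matching the table
{code}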


was (Author: yy3b2007com):
Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields 
(*timestamp,numRowsTotal,numRowsUpdated,memoryUsedBytes*) from this progress 
log for details.

Please check Attachments(([^houragg_filter.csv])

Then, according to the time stamp, the state occupied memory 
(*memoryUsedBytes*), the number of updated records (*numRowsUpdated*), and the 
total number of stored records (*numRowsTotal*) are iconized.

The processed graphics are shown as follows:

!image-2019-05-10-17-18-25-051.png!

After this analysis, we can known the state memory occupancy 
(*memoryUsedBytes*: from the progresses log) does not fluctuate much (basically 
stable), but from *"SPARK UI*" - > "*Executors Tab*", we can see that "*Storage 
Memory*" does increase over time.

{color:#FF}*Please note that:*{color}

{color:#59afe1}1) The log file above is from 2019-04-23 to 2019-04-29. In fact, 
I ran from 2019-04-23 to 2019-05-10, but the log of 2019-04-29 was lost.{color}

{color:#59afe1}2) The positive thing is that "Storage Memory" has been 
increasing from spark UI - > "executor-tab".{color}
|{color:#59afe1}*TimeStamp*{color}|{color:#59afe1}*Run-time(hour)*{color}|{color:#59afe1}*Storage
 Memory size(MB)*{color}|{color:#59afe1}*Memory growth rate(MB/hour)*{color}|
|{color:#59afe1}2019-04-23{color}|{color:#59afe1}0H{color}|{color:#59afe1}0MB/1.5GB{color}|{color:#59afe1}0{color}|
|{color:#59afe1}2019-04-24{color}|{color:#59afe1}23.5H{color}|{color:#59afe1}41.6MB/1.5GB{color}|{color:#59afe1}1.770212766{color}|
|{color:#59afe1}2019-04-28{color}|{color:#59afe1}108.4H{color}|{color:#59afe1}460.2MB/1.5GB{color}|{color:#59afe1}4.245387454{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}131.7H{color}|{color:#59afe1}559.1MB/1.5GB{color}|{color:#59afe1}4.245254366{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}135.4H{color}|{color:#59afe1}575MB/1.5GB{color}|{color:#59afe1}4.246676514{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}153.6H{color}|{color:#59afe1}641.2MB/1.5GB{color}|{color:#59afe1}4.174479167{color}|
|{color:#59afe1}2019-05-02{color}|{color:#59afe1}219H{color}|{color:#59afe1}888.1MB/1.5GB{color}|{color:#59afe1}4.055251142{color}|
|{color:#59afe1}..{color}|{color:#59afe

[jira] [Comment Edited] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837078#comment-16837078
 ] 

tommy duan edited comment on SPARK-27648 at 5/10/19 9:40 AM:
-

Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields
(*timestamp, numRowsTotal, numRowsUpdated, memoryUsedBytes*) out of this
progress log for details.

Please check the attachment ([^houragg_filter.csv]).

Then, by timestamp, I charted the state memory usage (*memoryUsedBytes*), the
number of updated records (*numRowsUpdated*), and the total number of stored
records (*numRowsTotal*).

The resulting charts are shown below:

!image-2019-05-10-17-18-25-051.png!

From this analysis we can see that the state memory usage (*memoryUsedBytes*,
from the progress log) does not fluctuate much (it is basically stable), but in
the *Spark UI* -> *Executors* tab, *Storage Memory* does keep increasing over
time.

{color:#FF}*Please note that:*{color}

{color:#59afe1}1) The log files ([^houragg(1).out] & [^houragg_filter.csv])
above cover 2019-04-23 to 2019-04-29. In fact, the job ran from 2019-04-23 to
2019-05-10, but the log after 2019-04-29 was lost.{color}

{color:#59afe1}2) What is certain is that "Storage Memory" keeps increasing in
the Spark UI -> Executors tab.{color}
|{color:#59afe1}*TimeStamp*{color}|{color:#59afe1}*Run-time(hour)*{color}|{color:#59afe1}*Storage
 Memory size(MB)*{color}|{color:#59afe1}*Memory growth rate(MB/hour)*{color}|
|{color:#59afe1}2019-04-23{color}|{color:#59afe1}0H{color}|{color:#59afe1}0MB/1.5GB{color}|{color:#59afe1}0{color}|
|{color:#59afe1}2019-04-24{color}|{color:#59afe1}23.5H{color}|{color:#59afe1}41.6MB/1.5GB{color}|{color:#59afe1}1.770212766{color}|
|{color:#59afe1}2019-04-28{color}|{color:#59afe1}108.4H{color}|{color:#59afe1}460.2MB/1.5GB{color}|{color:#59afe1}4.245387454{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}131.7H{color}|{color:#59afe1}559.1MB/1.5GB{color}|{color:#59afe1}4.245254366{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}135.4H{color}|{color:#59afe1}575MB/1.5GB{color}|{color:#59afe1}4.246676514{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}153.6H{color}|{color:#59afe1}641.2MB/1.5GB{color}|{color:#59afe1}4.174479167{color}|
|{color:#59afe1}2019-05-02{color}|{color:#59afe1}219H{color}|{color:#59afe1}888.1MB/1.5GB{color}|{color:#59afe1}4.055251142{color}|
|{color:#59afe1}..{color}|{color:#59afe1}263H{color}|{color:#59afe1}1126.4MB/1.5GB{color}|{color:#59afe1}4.282889734{color}|
|{color:#59afe1}..{color}|{color:#59afe1}309H{color}|{color:#59afe1}1228.8MB/1.5GB{color}|{color:#59afe1}3.976699029{color}|


was (Author: yy3b2007com):
Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields 
(*timestamp,numRowsTotal,numRowsUpdated,memoryUsedBytes*) from this progress 
log for details.

Please check Attachments(([^houragg_filter.csv])

Then, according to the time stamp, the state occupied memory 
(*memoryUsedBytes*), the number of updated records (*numRowsUpdated*), and the 
total number of stored records (*numRowsTotal*) are iconized.

The processed graphics are shown as follows:

!image-2019-05-10-17-18-25-051.png!

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-18-25-051.png
>
>
> *Spark Program Business Logic:*
>  Read a topic from Kafka, aggregate the streaming data, and then write the
> result to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory
> overflow problems often occurred (because too many versions of state are kept
> in memory; this bug was fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occurred when running with these submit resources
> under Spark 2.2, and the job normally could not run for more than one day.
> The workaround was to set the executor memory larger than before. My
> spark-submit script was as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the Spark program runs stably for a long time, and the executor
> storage memory stays below 10 MB

[jira] [Comment Edited] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837078#comment-16837078
 ] 

tommy duan edited comment on SPARK-27648 at 5/10/19 9:43 AM:
-

Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields
(*timestamp, numRowsTotal, numRowsUpdated, memoryUsedBytes*) out of this
progress log for details.

Please check the attachment ([^houragg_filter.csv]).

Then, by timestamp, I charted the state memory usage (*memoryUsedBytes*), the
number of updated records (*numRowsUpdated*), and the total number of stored
records (*numRowsTotal*).

The resulting charts are shown below:

!image-2019-05-10-17-18-25-051.png!

From this analysis we can see that the state memory usage (*memoryUsedBytes*,
from the progress log) does not fluctuate much (it is basically stable), but in
the *Spark UI* -> *Executors* tab, *Storage Memory* does keep increasing over
time.

{color:#ff}*Please note that:*{color}

{color:#d04437}1) The log files ([^houragg(1).out] & [^houragg_filter.csv])
above cover 2019-04-23 to 2019-04-29. In fact, the job ran from 2019-04-23 to
2019-05-10, but the log after 2019-04-29 was lost.{color}

{color:#d04437}2) What is certain is that "Storage Memory" keeps increasing in
the Spark UI -> Executors tab.{color}
|{color:#59afe1}*TimeStamp*{color}|{color:#59afe1}*Run-time(hour)*{color}|{color:#59afe1}*Storage
 Memory size(MB)*{color}|{color:#59afe1}*Memory growth rate(MB/hour)*{color}|
|{color:#59afe1}2019-04-23{color}|{color:#59afe1}0H{color}|{color:#59afe1}0MB/1.5GB{color}|{color:#59afe1}0{color}|
|{color:#59afe1}2019-04-24{color}|{color:#59afe1}23.5H{color}|{color:#59afe1}41.6MB/1.5GB{color}|{color:#59afe1}1.770212766{color}|
|{color:#59afe1}2019-04-28{color}|{color:#59afe1}108.4H{color}|{color:#59afe1}460.2MB/1.5GB{color}|{color:#59afe1}4.245387454{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}131.7H{color}|{color:#59afe1}559.1MB/1.5GB{color}|{color:#59afe1}4.245254366{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}135.4H{color}|{color:#59afe1}575MB/1.5GB{color}|{color:#59afe1}4.246676514{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}153.6H{color}|{color:#59afe1}641.2MB/1.5GB{color}|{color:#59afe1}4.174479167{color}|
|{color:#59afe1}2019-05-02{color}|{color:#59afe1}219H{color}|{color:#59afe1}888.1MB/1.5GB{color}|{color:#59afe1}4.055251142{color}|
|{color:#59afe1}..{color}|{color:#59afe1}263H{color}|{color:#59afe1}1126.4MB/1.5GB{color}|{color:#59afe1}4.282889734{color}|
|{color:#59afe1}..{color}|{color:#59afe1}309H{color}|{color:#59afe1}1228.8MB/1.5GB{color}|{color:#59afe1}3.976699029{color}|


was (Author: yy3b2007com):
Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields 
(*timestamp,numRowsTotal,numRowsUpdated,memoryUsedBytes*) from this progress 
log for details.

Please check Attachments(([^houragg_filter.csv])

Then, according to the time stamp, the state occupied memory 
(*memoryUsedBytes*), the number of updated records (*numRowsUpdated*), and the 
total number of stored records (*numRowsTotal*) are iconized.

The processed graphics are shown as follows:

!image-2019-05-10-17-18-25-051.png!

After this analysis, we can known the state memory occupancy 
(*memoryUsedBytes*: from the progresses log) does not fluctuate much (basically 
stable), but from *"SPARK UI*" - > "*Executors Tab*", we can see that "*Storage 
Memory*" does increase over time.

{color:#ff}*Please note that:*{color}

{color:#59afe1}1) The log file([^houragg(1).out]&[^houragg_filter.csv]) above 
is from 2019-04-23 to 2019-04-29. In fact, I ran from 2019-04-23 to 2019-05-10, 
but the log of 2019-04-29 was lost.{color}

{color:#59afe1}2) The positive thing is that "Storage Memory" has been 
increasing from spark UI - > "executor-tab".{color}
|{color:#59afe1}*TimeStamp*{color}|{color:#59afe1}*Run-time(hour)*{color}|{color:#59afe1}*Storage
 Memory size(MB)*{color}|{color:#59afe1}*Memory growth rate(MB/hour)*{color}|
|{color:#59afe1}2019-04-23{color}|{color:#59afe1}0H{color}|{color:#59afe1}0MB/1.5GB{color}|{color:#59afe1}0{color}|
|{color:#59afe1}2019-04-24{color}|{color:#59afe1}23.5H{color}|{color:#59afe1}41.6MB/1.5GB{color}|{color:#59afe1}1.770212766{color}|
|{color:#59afe1}2019-04-28{color}|{color:#59afe1}108.4H{color}|{color:#59afe1}460.2MB/1.5GB{color}|{color:#59afe1}4.245387454{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}131.7H{color}|{color:#59afe1}559.1MB/1.5GB{color}|{color:#59afe1}4.245254366{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}135.4H{color}|{color:#59afe1}575MB/1.5GB{color}|{color:#59afe1}4.246676514{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}153.6H{color}|{color:#59afe1}641.2MB/1.5GB{color}|{color:#59afe1}4.174479167{color}|
|{color:#59afe1}2019-05-02{color}|{color:#59afe1}219H{color}|{color:#59afe1}888.1MB/1.5GB{color}|{color:#59afe1}4.055251142{color}

[jira] [Comment Edited] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837078#comment-16837078
 ] 

tommy duan edited comment on SPARK-27648 at 5/10/19 9:49 AM:
-

Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields
(*timestamp, numRowsTotal, numRowsUpdated, memoryUsedBytes*) out of this
progress log for details.

Please check the attachment ([^houragg_filter.csv]).

Then, by timestamp, I charted the state memory usage (*memoryUsedBytes*), the
number of updated records (*numRowsUpdated*), and the total number of stored
records (*numRowsTotal*).

The resulting charts are shown below:

!image-2019-05-10-17-49-42-034.png!

!image-2019-05-10-17-18-25-051.png!

From this analysis we can see that the state memory usage (*memoryUsedBytes*,
from the progress log) does not fluctuate much (it is basically stable), but in
the *Spark UI* -> *Executors* tab, *Storage Memory* does keep increasing over
time.

{color:#ff}*Please note that:*{color}

{color:#d04437}1) The log files ([^houragg(1).out] & [^houragg_filter.csv])
above cover 2019-04-23 to 2019-04-29. In fact, the job ran from 2019-04-23 to
2019-05-10, but the log after 2019-04-29 was lost.{color}

{color:#d04437}2) What is certain is that "Storage Memory" keeps increasing in
the Spark UI -> Executors tab.{color}
|{color:#59afe1}*TimeStamp*{color}|{color:#59afe1}*Run-time(hour)*{color}|{color:#59afe1}*Storage
 Memory size(MB)*{color}|{color:#59afe1}*Memory growth rate(MB/hour)*{color}|
|{color:#59afe1}2019-04-23{color}|{color:#59afe1}0H{color}|{color:#59afe1}0MB/1.5GB{color}|{color:#59afe1}0{color}|
|{color:#59afe1}2019-04-24{color}|{color:#59afe1}23.5H{color}|{color:#59afe1}41.6MB/1.5GB{color}|{color:#59afe1}1.770212766{color}|
|{color:#59afe1}2019-04-28{color}|{color:#59afe1}108.4H{color}|{color:#59afe1}460.2MB/1.5GB{color}|{color:#59afe1}4.245387454{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}131.7H{color}|{color:#59afe1}559.1MB/1.5GB{color}|{color:#59afe1}4.245254366{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}135.4H{color}|{color:#59afe1}575MB/1.5GB{color}|{color:#59afe1}4.246676514{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}153.6H{color}|{color:#59afe1}641.2MB/1.5GB{color}|{color:#59afe1}4.174479167{color}|
|{color:#59afe1}2019-05-02{color}|{color:#59afe1}219H{color}|{color:#59afe1}888.1MB/1.5GB{color}|{color:#59afe1}4.055251142{color}|
|{color:#59afe1}..{color}|{color:#59afe1}263H{color}|{color:#59afe1}1126.4MB/1.5GB{color}|{color:#59afe1}4.282889734{color}|
|{color:#59afe1}..{color}|{color:#59afe1}309H{color}|{color:#59afe1}1228.8MB/1.5GB{color}|{color:#59afe1}3.976699029{color}|


was (Author: yy3b2007com):
Hi [~gsomogyi] & [~kabhwan]

I have filtered the fields 
(*timestamp,numRowsTotal,numRowsUpdated,memoryUsedBytes*) from this progress 
log for details.

Please check Attachments(([^houragg_filter.csv])

Then, according to the time stamp, the state occupied memory 
(*memoryUsedBytes*), the number of updated records (*numRowsUpdated*), and the 
total number of stored records (*numRowsTotal*) are iconized.

The processed graphics are shown as follows:

!image-2019-05-10-17-18-25-051.png!

After this analysis, we can known the state memory occupancy 
(*memoryUsedBytes*: from the progresses log) does not fluctuate much (basically 
stable), but from *"SPARK UI*" - > "*Executors Tab*", we can see that "*Storage 
Memory*" does increase over time.

{color:#ff}*Please note that:*{color}

{color:#d04437}1) The log file([^houragg(1).out]&[^houragg_filter.csv]) above 
is from 2019-04-23 to 2019-04-29. In fact, I ran from 2019-04-23 to 2019-05-10, 
but the log lost after 2019-04-29.{color}

{color:#d04437}2) The positive thing is that "Storage Memory" has been 
increasing from spark UI - > "executor-tab".{color}
|{color:#59afe1}*TimeStamp*{color}|{color:#59afe1}*Run-time(hour)*{color}|{color:#59afe1}*Storage
 Memory size(MB)*{color}|{color:#59afe1}*Memory growth rate(MB/hour)*{color}|
|{color:#59afe1}2019-04-23{color}|{color:#59afe1}0H{color}|{color:#59afe1}0MB/1.5GB{color}|{color:#59afe1}0{color}|
|{color:#59afe1}2019-04-24{color}|{color:#59afe1}23.5H{color}|{color:#59afe1}41.6MB/1.5GB{color}|{color:#59afe1}1.770212766{color}|
|{color:#59afe1}2019-04-28{color}|{color:#59afe1}108.4H{color}|{color:#59afe1}460.2MB/1.5GB{color}|{color:#59afe1}4.245387454{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}131.7H{color}|{color:#59afe1}559.1MB/1.5GB{color}|{color:#59afe1}4.245254366{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}135.4H{color}|{color:#59afe1}575MB/1.5GB{color}|{color:#59afe1}4.246676514{color}|
|{color:#59afe1}2019-04-29{color}|{color:#59afe1}153.6H{color}|{color:#59afe1}641.2MB/1.5GB{color}|{color:#59afe1}4.174479167{color}|
|{color:#59afe1}2019-05-02{color}|{color:#59afe1}219H{color}|{color:#59afe1}888.1MB/1.5GB{colo

[jira] [Updated] (SPARK-27648) In Spark 2.4 Structured Streaming: the executor storage memory increases over time

2019-05-10 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Attachment: (was: image-2019-05-10-17-18-25-051.png)

> In Spark2.4 Structured Streaming:The executor storage memory increasing over 
> time
> -
>
> Key: SPARK-27648
> URL: https://issues.apache.org/jira/browse/SPARK-27648
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: tommy duan
>Priority: Major
> Attachments: houragg(1).out, houragg_filter.csv, 
> image-2019-05-09-17-51-14-036.png, image-2019-05-10-17-49-42-034.png
>
>
> *Spark Program Business Logic:*
>  Read a topic from Kafka, aggregate the streaming data, and then write the
> result to another Kafka topic.
> *Problem Description:*
>  *1) Using Spark Structured Streaming in a CDH environment (Spark 2.2)*, memory
> overflow problems often occurred (because too many versions of state are kept
> in memory; this bug was fixed in Spark 2.4).
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 3G \
> --executor-cores 2 \
> --driver-memory 6G
> {code}
> Executor memory exceptions occurred when running with these submit resources
> under Spark 2.2, and the job normally could not run for more than one day.
> The workaround was to set the executor memory larger than before. My
> spark-submit script was as follows:
> {code:java}
> /spark-submit \
> --conf "spark.yarn.executor.memoryOverhead=4096M" \
> --num-executors 15 \
> --executor-memory 46G \
> --executor-cores 3 \
> --driver-memory 6G \
> ...
> {code}
> In this case, the Spark program runs stably for a long time, and the executor
> storage memory stays below 10 MB (it has been running stably for more than 20
> days).
> *2) From the Spark 2.4 upgrade notes, we can see that the problem of large
> memory consumption by state storage was solved in Spark 2.4.*
>  So we upgraded Spark to 2.4 under CDH, ran the program again, and found that
> memory usage was indeed reduced.
>  But a new problem arises: as the running time increases, the executor storage
> memory keeps growing (see Executors -> Storage Memory in the Spark UI on the
> YARN Resource Manager).
>  This program has now been running for 14 days (under Spark 2.2, with these
> submit resources, it normally could not run for more than one day before
> executor memory problems occurred).
>  The spark-submit script under Spark 2.4 is as follows:
> {code:java}
> /spark-submit \
>  --conf "spark.yarn.executor.memoryOverhead=4096M" \
>  --num-executors 15 \
>  --executor-memory 3G \
>  --executor-cores 2 \
>  --driver-memory 6G
> {code}
> Under Spark 2.4, I tracked the executor storage memory over time while the
> program was running:
> |Run time (hours)|Storage Memory (used/total)|Memory growth rate (MB/hour)|
> |23.5|41.6MB/1.5GB|1.770212766|
> |108.4|460.2MB/1.5GB|4.245387454|
> |131.7|559.1MB/1.5GB|4.245254366|
> |135.4|575MB/1.5GB|4.246676514|
> |153.6|641.2MB/1.5GB|4.174479167|
> |219|888.1MB/1.5GB|4.055251142|
> |263|1126.4MB/1.5GB|4.282889734|
> |309|1228.8MB/1.5GB|3.976699029|






[jira] [Updated] (SPARK-27664) Performance issue with FileStatusCache, while reading from object stores.

2019-05-10 Thread Prashant Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-27664:

Description: 
In short,

This issue (i.e. degraded performance) surfaces when the number of files is
large (> 100K) and the data is stored on an object store or any other remote
storage. The actual issue is that everything is inserted as a single entry in
the FileStatusCache (a Guava cache), and a single entry does not fit unless the
cache is configured to be roughly 4x larger than expected. Reason:
[https://github.com/google/guava/issues/3462].

 

Full story, with possible solutions,

When we read a directory in spark by,
{code:java}
spark.read.parquet("/dir/data/test").limit(1).show()
{code}
behind the scenes, it fetches the FileStatus objects and caches them, inside a 
FileStatusCache, so that it does not need to refetch these objects. Internally, 
it scans using listLeafFiles function at driver. 
 Inside the cache, the entire content of the listing as array of FileStatus 
objects is inserted as a single entry, with key as "/dir/data/test", in the 
FileStatusCache. The default size of this cache is 250MB and it is 
configurable. This underlying cache uses guava cache.

The Guava cache has one interesting property: a single entry can only be as
large as
{code:java}
maximumSize / concurrencyLevel{code}
see [https://github.com/google/guava/issues/3462] for details. So for a cache
size of 250MB, a single entry can be at most 250MB/4, since the default
concurrency level in Guava is 4. That is around 62MB, which is good enough for
most datasets, but it does not work well for directories with larger listings.
The effect is especially evident when such listings come from object stores
like Amazon S3 or IBM Cloud Object Storage.

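To make the per-segment limit concrete, here is a small self-contained sketch of mine (plain Guava, not Spark's actual FileStatusCache code) showing that a single entry heavier than maximumWeight/concurrencyLevel is not retained:

{code:java}
import com.google.common.cache.{CacheBuilder, Weigher}

object GuavaSegmentLimitSketch {
  def main(args: Array[String]): Unit = {
    val maxWeight = 250L * 1024 * 1024 // a "250MB" cache, like the FileStatusCache default
    val cache = CacheBuilder.newBuilder()
      .concurrencyLevel(4)             // Guava's default; each segment gets ~maxWeight/4
      .maximumWeight(maxWeight)
      .weigher(new Weigher[String, Array[Byte]] {
        override def weigh(key: String, value: Array[Byte]): Int = value.length
      })
      .build[String, Array[Byte]]()

    // A single ~100MB listing exceeds the ~62MB per-segment budget and is evicted,
    // which is the behaviour described in the linked Guava issue 3462.
    cache.put("/dir/data/test", new Array[Byte](100 * 1024 * 1024))
    println(cache.getIfPresent("/dir/data/test") == null) // expected: true
  }
}
{code}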
So, currently one can work around this problem by setting the cache size (i.e.
`spark.sql.hive.filesourcePartitionFileCacheSize`) very high, as it needs to be
much more than 4x of what is actually required.

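For example (my own illustration, with an arbitrary value), the setting has to be applied before the session is created and sized well above 4x the expected listing size:

{code:java}
import org.apache.spark.sql.SparkSession

// Hypothetical value: large enough that the listing fits in cacheSize / concurrencyLevel (4).
val spark = SparkSession.builder()
  .config("spark.sql.hive.filesourcePartitionFileCacheSize", (1024L * 1024 * 1024).toString) // 1 GB
  .getOrCreate()
{code}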
In order to fix this issue, we can take 3 different approaches,

1) A stop-gap fix: reduce the concurrency level of the Guava cache to 1;
because everything is a single entry per listing anyway, concurrency is not
helpful here.

2) Alternatively, divide the input array into multiple cache entries instead of
inserting everything under a single key. This could be done using directories
as keys when there are multiple nested directories, but if a user has
everything under a single directory this does not help either, and we cannot
depend on users creating partitions in their Hive/SQL tables.

3) One more alternative: make the concurrency level configurable for those who
want to change it, and when inserting an entry, split it into `concurrencyLevel`
(or even 2x or 3x that many) parts before inserting. This way the cache
performs optimally, and if there is an eviction, only a part of the entries is
evicted instead of all of them as in the current implementation. How many
entries are evicted due to size depends on the configured concurrency level.
This approach can be taken even without making `concurrencyLevel` configurable.

The problem with this approach is that the partitions in the cache are of no
use on their own: if one partition is evicted, all the partitions of that key
must also be evicted, otherwise the results would be wrong.

  was:
In short,

This issue(i.e. degraded performance ) surfaces when the number of files are 
large > 100K, and is stored on an object store, or any remote storage. The 
actual issue is due to,

Everything is inserted as a single entry in the FileStatusCache i.e. guava 
cache, which does not fit unless the cache is configured to be very very large 
or 4X. Reason: [https://github.com/google/guava/issues/3462].

 

Full story, with possible solutions,

When we read a directory in spark by,
{code:java}
spark.read.parquet("/dir/data/test").limit(1).show()
{code}
behind the scenes, it fetches the FileStatus objects and caches them, inside a 
FileStatusCache, so that it does not need to refetch these objects. Internally, 
it scans using listLeafFiles function at driver. 
 Inside the cache, the entire content of the listing as array of FileStatus 
objects is inserted as a single entry, with key as "/dir/data/test", in the 
FileStatusCache. The default size of this cache is 250MB and it is 
configurable. This underlying cache uses guava cache.

The guava cache has one interesting property, i.e. a single entry can only be 
as large as
{code:java}
maximumSize/concurrencyLevel{code}
see [https://github.com/google/guava/issues/3462], for details. So for a cache 
size of 250MB, a single entry can be as large as only 250MB/4, since the 
default concurrency level is 4 in guava. This size 

[jira] [Updated] (SPARK-27664) Performance issue with FileStatusCache, while reading from object stores.

2019-05-10 Thread Prashant Sharma (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Sharma updated SPARK-27664:

Description: 
In short,

This issue (i.e. degraded performance) surfaces when the number of files is
large (> 100K) and the data is stored on an object store or any other remote
storage. The actual issue is that everything is inserted as a single entry in
the FileStatusCache (a Guava cache), and a single entry does not fit unless the
cache is configured to be roughly 4x larger than expected. Reason:
[https://github.com/google/guava/issues/3462].

 

Full story, with possible solutions,

When we read a directory in spark by,
{code:java}
spark.read.parquet("/dir/data/test").limit(1).show()
{code}
behind the scenes, it fetches the FileStatus objects and caches them, inside a 
FileStatusCache, so that it does not need to refetch these objects. Internally, 
it scans using listLeafFiles function at driver. 
 Inside the cache, the entire content of the listing as array of FileStatus 
objects is inserted as a single entry, with key as "/dir/data/test", in the 
FileStatusCache. The default size of this cache is 250MB and it is 
configurable. This underlying cache uses guava cache.

The Guava cache has one interesting property: a single entry can only be as
large as
{code:java}
maximumSize / concurrencyLevel{code}
see [https://github.com/google/guava/issues/3462] for details. So for a cache
size of 250MB, a single entry can be at most 250MB/4, since the default
concurrency level in Guava is 4. That is around 62MB, which is good enough for
most datasets, but it does not work well for directories with larger listings.
The effect is especially evident when such listings come from object stores
like Amazon S3 or IBM Cloud Object Storage.

So, currently one can work around this problem by setting the cache size (i.e.
`spark.sql.hive.filesourcePartitionFileCacheSize`) very high, as it needs to be
much more than 4x of what is actually required. But this has a drawback: either
the driver has to be started with more memory than it really needs, or one
risks an OOM when the cache does not evict older entries because the size is
configured to be 4x.

In order to fix this issue, we can take 3 different approaches,

1) A stop-gap fix: reduce the concurrency level of the Guava cache to 1; with
only a few very large entries, we do not lose much by doing this.

2) Alternatively, divide the input array into multiple cache entries instead of
inserting everything under a single key. This could be done using directories
as keys when there are multiple nested directories, but if a user has
everything under a single directory this does not help either, and we cannot
depend on users creating partitions in their Hive/SQL tables.

3) One more alternative: make the concurrency level configurable for those who
want to change it, and when inserting an entry, split it into `concurrencyLevel`
(or even 2x or 3x that many) parts before inserting. This way the cache
performs optimally, and if there is an eviction, only a part of the entries is
evicted instead of all of them as in the current implementation. How many
entries are evicted due to size depends on the configured concurrency level.
This approach can be taken even without making `concurrencyLevel` configurable.

The problem with this approach is that the partitions in the cache are of no
use on their own: if one partition is evicted, all the partitions of that key
must also be evicted, otherwise the results would be wrong.

  was:
In short,

This issue(i.e. degraded performance ) surfaces when the number of files are 
large > 100K, and is stored on an object store, or any remote storage. The 
actual issue is due to,

Everything is inserted as a single entry in the FileStatusCache i.e. guava 
cache, which does not fit unless the cache is configured to be very very large 
or 4X. Reason: [https://github.com/google/guava/issues/3462].

 

Full story, with possible solutions,

When we read a directory in spark by,
{code:java}
spark.read.parquet("/dir/data/test").limit(1).show()
{code}
behind the scenes, it fetches the FileStatus objects and caches them, inside a 
FileStatusCache, so that it does not need to refetch these objects. Internally, 
it scans using listLeafFiles function at driver. 
 Inside the cache, the entire content of the listing as array of FileStatus 
objects is inserted as a single entry, with key as "/dir/data/test", in the 
FileStatusCache. The default size of this cache is 250MB and it is 
configurable. This underlying cache uses guava cache.

The guava cache has one interesting property, i.e. a single entry can only be 
as large as
{code:java}
maximumSize/concurrencyLevel{code}
see [https://github.com/google/gua

[jira] [Created] (SPARK-27671) Analysis exception thrown when casting from a nested null in a struct

2019-05-10 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-27671:
---

 Summary: Analysis exception thrown when casting from a nested null 
in a struct
 Key: SPARK-27671
 URL: https://issues.apache.org/jira/browse/SPARK-27671
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Liang-Chi Hsieh


Currently, when there is a null in a nested field of a struct, casting the
struct throws an error.

{code}
scala> sql("select cast(struct(1, null) as struct)").show
scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)  
   
  at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447)   
   
  at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)
   
  at 
org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
   
{code}

{code}
scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), (('c', 
null)) AS tab(x, y)").show 
org.apache.spark.sql.AnalysisException: failed to evaluate expression 
named_struct('col1', 10, 'col2', NULL): NullType (of class 
org.apache.spark.sql.t
ypes.NullType$); line 1 pos 14  
   
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
 
  at 
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)

{code}
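
A possible workaround until this is fixed (my suggestion, not part of this report) is to give the null an explicit type so that the struct field is not left as NullType:

{code}
scala> sql("select cast(struct(1, cast(null as int)) as struct<col1:int,col2:int>)").show
{code}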






[jira] [Created] (SPARK-27672) Add since info to string expressions

2019-05-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-27672:


 Summary: Add since info to string expressions
 Key: SPARK-27672
 URL: https://issues.apache.org/jira/browse/SPARK-27672
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


{{since}} is missing for the following string expressions (in their {{ExpressionDescription}}); a sketch of what adding it might look like follows the list.

SPARK-8241 ConcatWs
SPARK-16276 Elt
SPARK-1995 Upper / Lower
SPARK-20750 StringReplace
SPARK-8266 StringTranslate
SPARK-8244 FindInSet
SPARK-8253 StringTrimLeft
SPARK-8260 StringTrimRight
SPARK-8267 StringTrim
SPARK-8247 StringInstr
SPARK-8264 SubstringIndex
SPARK-8249 StringLocate
SPARK-8252 StringLPad
SPARK-8259 StringRPad
SPARK-16281 ParseUrl
SPARK-9154 FormatString
SPARK-8269 Initcap
SPARK-8257 StringRepeat
SPARK-8261 StringSpace
SPARK-8263 Substring
SPARK-21007 Right
SPARK-21007 Left
SPARK-8248 Length
SPARK-20749 BitLength
SPARK-20749 OctetLength
SPARK-8270 Levenshtein
SPARK-8271 SoundEx
SPARK-8238 Ascii
SPARK-20748 Chr
SPARK-8239 Base64
SPARK-8268 UnBase64
SPARK-8242 Decode
SPARK-8243 Encode
SPARK-8245 format_number
SPARK-16285 Sentences
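
For illustration, a rough sketch of what populating it could look like on one of the expressions above (the exact {{since}} values have to come from the linked JIRAs; the annotation fields shown here are an assumption on my part):

{code:java}
// Hypothetical sketch: record the release the expression first appeared in.
@ExpressionDescription(
  usage = "_FUNC_(str) - Returns `str` with all characters changed to uppercase.",
  since = "1.0.1")
case class Upper(child: Expression) extends UnaryExpression with String2StringExpression {
  // ... existing implementation unchanged ...
}
{code}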






[jira] [Assigned] (SPARK-27671) Analysis exception thrown when casting from a nested null in a struct

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27671:


Assignee: (was: Apache Spark)

> Analysis exception thrown when casting from a nested null in a struct
> -
>
> Key: SPARK-27671
> URL: https://issues.apache.org/jira/browse/SPARK-27671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> When a null in a nested field in struct, casting from the struct throws 
> error, currently.
> {code}
> scala> sql("select cast(struct(1, null) as struct)").show
> scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
>  
>   at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447) 
>  
>   at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)  
>  
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
>
> {code}
> {code}
> scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), 
> (('c', null)) AS tab(x, y)").show 
> org.apache.spark.sql.AnalysisException: failed to evaluate expression 
> named_struct('col1', 10, 'col2', NULL): NullType (of class 
> org.apache.spark.sql.t
> ypes.NullType$); line 1 pos 14
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)
> 
> {code}






[jira] [Commented] (SPARK-27664) Performance issue with FileStatusCache, while reading from object stores.

2019-05-10 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837182#comment-16837182
 ] 

Apache Spark commented on SPARK-27664:
--

User 'ScrapCodes' has created a pull request for this issue:
https://github.com/apache/spark/pull/24577

> Performance issue with FileStatusCache, while reading from object stores.
> -
>
> Key: SPARK-27664
> URL: https://issues.apache.org/jira/browse/SPARK-27664
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Prashant Sharma
>Priority: Major
>
> In short,
> This issue (i.e. degraded performance) surfaces when the number of files is 
> large (> 100K) and the data is stored on an object store or other remote 
> storage. The actual cause is:
> everything is inserted as a single entry in the FileStatusCache (a Guava 
> cache), and that entry does not fit unless the cache is configured to be 
> very large, roughly 4x what is needed. Reason: 
> [https://github.com/google/guava/issues/3462].
>  
> Full story, with possible solutions:
> When we read a directory in Spark by
> {code:java}
> spark.read.parquet("/dir/data/test").limit(1).show()
> {code}
> behind the scenes it fetches the FileStatus objects and caches them inside a 
> FileStatusCache, so that it does not need to refetch them. Internally, it 
> scans using the listLeafFiles function at the driver.
> Inside the cache, the entire listing, as an array of FileStatus objects, is 
> inserted as a single entry with the key "/dir/data/test". The default size of 
> this cache is 250MB and it is configurable. The underlying cache is a Guava 
> cache.
> The Guava cache has one interesting property: a single entry can only be as 
> large as
> {code:java}
> maximumSize/concurrencyLevel{code}
> see [https://github.com/google/guava/issues/3462] for details. So for a cache 
> size of 250MB, a single entry can be at most 250MB/4, since the default 
> concurrency level in Guava is 4. That is around 62MB, which is good enough 
> for most datasets, but it does not work well for directories with larger 
> listings. The effect is especially evident when such listings come from 
> object stores like Amazon S3 or IBM Cloud Object Storage.
> Currently one can work around this problem by setting the size of the cache 
> (i.e. `spark.sql.hive.filesourcePartitionFileCacheSize`) very high, as it 
> needs to be much more than 4x of what is actually required. But this has a 
> drawback: one either has to start the driver with more memory than required, 
> or risk an OOM when the cache does not evict older entries because the size 
> is configured to be 4x.
> To fix this issue, we can take 3 different approaches:
> 1) A stop-gap fix: reduce the concurrency level of the Guava cache to just 1. 
> For a few entries of very large size, we do not lose much by doing this.
> 2) Alternatively, divide the input array into multiple entries in the cache 
> instead of inserting everything against a single key. This can be done using 
> directories as keys if there are multiple nested directories under a 
> directory, but if a user has everything listed under a single dir, this 
> solution does not help either, and we cannot depend on users creating 
> partitions in their hive/sql table.
> 3) One more alternative: make the concurrency level configurable for those 
> who want to change it, and while inserting an entry into the cache, divide it 
> into `concurrencyLevel` (or even 2x or 3x of it) parts before inserting. This 
> way the cache performs optimally, and even if there is an eviction, only a 
> part of the entries is evicted, as opposed to all of them in the current 
> implementation. How many entries are evicted due to size depends on the 
> configured concurrencyLevel. This approach can be taken even without making 
> `concurrencyLevel` configurable.
> The problem with this approach is that the parts in the cache are of no use 
> on their own: if one part is evicted, then all the parts for that key should 
> also be evicted, otherwise the results would be wrong.
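
For illustration, a standalone Scala sketch (not the actual FileStatusCache code) of how Guava's per-segment entry limit and the approach-1 idea interact; the key/value types and the weights below are placeholders:

{code:scala}
import com.google.common.cache.{Cache, CacheBuilder, Weigher}

// Placeholder types: a path-like key mapping to an array standing in for the
// FileStatus listing; the weight below is a rough stand-in, not Spark's estimate.
val weigher = new Weigher[String, Array[Long]] {
  override def weigh(key: String, value: Array[Long]): Int = value.length * 8
}

val cache: Cache[String, Array[Long]] = CacheBuilder.newBuilder()
  .weigher(weigher)
  .concurrencyLevel(1)                // approach 1: a single segment, so one entry may use the whole budget
  .maximumWeight(250L * 1024 * 1024)  // same 250MB budget as the default cache size
  .build[String, Array[Long]]()

// With the default concurrencyLevel of 4, an entry heavier than roughly
// maximumWeight/4 is evicted almost immediately after insertion.
cache.put("/dir/data/test", Array.fill(1000)(0L))
{code}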



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27664) Performance issue with FileStatusCache, while reading from object stores.

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27664:


Assignee: (was: Apache Spark)

> Performance issue with FileStatusCache, while reading from object stores.
> -
>
> Key: SPARK-27664
> URL: https://issues.apache.org/jira/browse/SPARK-27664
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Prashant Sharma
>Priority: Major
>
> In short,
> This issue (i.e. degraded performance) surfaces when the number of files is 
> large (> 100K) and the data is stored on an object store or other remote 
> storage. The actual cause is:
> everything is inserted as a single entry in the FileStatusCache (a Guava 
> cache), and that entry does not fit unless the cache is configured to be 
> very large, roughly 4x what is needed. Reason: 
> [https://github.com/google/guava/issues/3462].
>  
> Full story, with possible solutions:
> When we read a directory in Spark by
> {code:java}
> spark.read.parquet("/dir/data/test").limit(1).show()
> {code}
> behind the scenes it fetches the FileStatus objects and caches them inside a 
> FileStatusCache, so that it does not need to refetch them. Internally, it 
> scans using the listLeafFiles function at the driver.
> Inside the cache, the entire listing, as an array of FileStatus objects, is 
> inserted as a single entry with the key "/dir/data/test". The default size of 
> this cache is 250MB and it is configurable. The underlying cache is a Guava 
> cache.
> The Guava cache has one interesting property: a single entry can only be as 
> large as
> {code:java}
> maximumSize/concurrencyLevel{code}
> see [https://github.com/google/guava/issues/3462] for details. So for a cache 
> size of 250MB, a single entry can be at most 250MB/4, since the default 
> concurrency level in Guava is 4. That is around 62MB, which is good enough 
> for most datasets, but it does not work well for directories with larger 
> listings. The effect is especially evident when such listings come from 
> object stores like Amazon S3 or IBM Cloud Object Storage.
> Currently one can work around this problem by setting the size of the cache 
> (i.e. `spark.sql.hive.filesourcePartitionFileCacheSize`) very high, as it 
> needs to be much more than 4x of what is actually required. But this has a 
> drawback: one either has to start the driver with more memory than required, 
> or risk an OOM when the cache does not evict older entries because the size 
> is configured to be 4x.
> To fix this issue, we can take 3 different approaches:
> 1) A stop-gap fix: reduce the concurrency level of the Guava cache to just 1. 
> For a few entries of very large size, we do not lose much by doing this.
> 2) Alternatively, divide the input array into multiple entries in the cache 
> instead of inserting everything against a single key. This can be done using 
> directories as keys if there are multiple nested directories under a 
> directory, but if a user has everything listed under a single dir, this 
> solution does not help either, and we cannot depend on users creating 
> partitions in their hive/sql table.
> 3) One more alternative: make the concurrency level configurable for those 
> who want to change it, and while inserting an entry into the cache, divide it 
> into `concurrencyLevel` (or even 2x or 3x of it) parts before inserting. This 
> way the cache performs optimally, and even if there is an eviction, only a 
> part of the entries is evicted, as opposed to all of them in the current 
> implementation. How many entries are evicted due to size depends on the 
> configured concurrencyLevel. This approach can be taken even without making 
> `concurrencyLevel` configurable.
> The problem with this approach is that the parts in the cache are of no use 
> on their own: if one part is evicted, then all the parts for that key should 
> also be evicted, otherwise the results would be wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27671) Analysis exception thrown when casting from a nested null in a struct

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27671:


Assignee: Apache Spark

> Analysis exception thrown when casting from a nested null in a struct
> -
>
> Key: SPARK-27671
> URL: https://issues.apache.org/jira/browse/SPARK-27671
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> When there is a null in a nested field of a struct, casting the struct 
> currently throws an error.
> {code}
> scala> sql("select cast(struct(1, null) as struct)").show
> scala.MatchError: NullType (of class org.apache.spark.sql.types.NullType$)
>  
>   at org.apache.spark.sql.catalyst.expressions.Cast.castToInt(Cast.scala:447) 
>  
>   at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:635)  
>  
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.$anonfun$castStruct$1(Cast.scala:603)
>
> {code}
> {code}
> scala> sql("select * FROM VALUES (('a', (10, null))), (('b', (10, 50))), 
> (('c', null)) AS tab(x, y)").show 
> org.apache.spark.sql.AnalysisException: failed to evaluate expression 
> named_struct('col1', 10, 'col2', NULL): NullType (of class 
> org.apache.spark.sql.t
> ypes.NullType$); line 1 pos 14
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
>  
>   at 
> org.apache.spark.sql.catalyst.analysis.ResolveInlineTables.$anonfun$convert$6(ResolveInlineTables.scala:106)
> 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27664) Performance issue with FileStatusCache, while reading from object stores.

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27664:


Assignee: Apache Spark

> Performance issue with FileStatusCache, while reading from object stores.
> -
>
> Key: SPARK-27664
> URL: https://issues.apache.org/jira/browse/SPARK-27664
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 2.4.3
>Reporter: Prashant Sharma
>Assignee: Apache Spark
>Priority: Major
>
> In short,
> This issue (i.e. degraded performance) surfaces when the number of files is 
> large (> 100K) and the data is stored on an object store or other remote 
> storage. The actual cause is:
> everything is inserted as a single entry in the FileStatusCache (a Guava 
> cache), and that entry does not fit unless the cache is configured to be 
> very large, roughly 4x what is needed. Reason: 
> [https://github.com/google/guava/issues/3462].
>  
> Full story, with possible solutions:
> When we read a directory in Spark by
> {code:java}
> spark.read.parquet("/dir/data/test").limit(1).show()
> {code}
> behind the scenes it fetches the FileStatus objects and caches them inside a 
> FileStatusCache, so that it does not need to refetch them. Internally, it 
> scans using the listLeafFiles function at the driver.
> Inside the cache, the entire listing, as an array of FileStatus objects, is 
> inserted as a single entry with the key "/dir/data/test". The default size of 
> this cache is 250MB and it is configurable. The underlying cache is a Guava 
> cache.
> The Guava cache has one interesting property: a single entry can only be as 
> large as
> {code:java}
> maximumSize/concurrencyLevel{code}
> see [https://github.com/google/guava/issues/3462] for details. So for a cache 
> size of 250MB, a single entry can be at most 250MB/4, since the default 
> concurrency level in Guava is 4. That is around 62MB, which is good enough 
> for most datasets, but it does not work well for directories with larger 
> listings. The effect is especially evident when such listings come from 
> object stores like Amazon S3 or IBM Cloud Object Storage.
> Currently one can work around this problem by setting the size of the cache 
> (i.e. `spark.sql.hive.filesourcePartitionFileCacheSize`) very high, as it 
> needs to be much more than 4x of what is actually required. But this has a 
> drawback: one either has to start the driver with more memory than required, 
> or risk an OOM when the cache does not evict older entries because the size 
> is configured to be 4x.
> To fix this issue, we can take 3 different approaches:
> 1) A stop-gap fix: reduce the concurrency level of the Guava cache to just 1. 
> For a few entries of very large size, we do not lose much by doing this.
> 2) Alternatively, divide the input array into multiple entries in the cache 
> instead of inserting everything against a single key. This can be done using 
> directories as keys if there are multiple nested directories under a 
> directory, but if a user has everything listed under a single dir, this 
> solution does not help either, and we cannot depend on users creating 
> partitions in their hive/sql table.
> 3) One more alternative: make the concurrency level configurable for those 
> who want to change it, and while inserting an entry into the cache, divide it 
> into `concurrencyLevel` (or even 2x or 3x of it) parts before inserting. This 
> way the cache performs optimally, and even if there is an eviction, only a 
> part of the entries is evicted, as opposed to all of them in the current 
> implementation. How many entries are evicted due to size depends on the 
> configured concurrencyLevel. This approach can be taken even without making 
> `concurrencyLevel` configurable.
> The problem with this approach is that the parts in the cache are of no use 
> on their own: if one part is evicted, then all the parts for that key should 
> also be evicted, otherwise the results would be wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27672) Add since info to string expressions

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27672:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Add since info to string expressions
> 
>
> Key: SPARK-27672
> URL: https://issues.apache.org/jira/browse/SPARK-27672
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> {{since}} is missing for string expressions (at ExpressionDescription).
> SPARK-8241 ConcatWs
> SPARK-16276 Elt
> SPARK-1995 Upper / Lower
> SPARK-20750 StringReplace
> SPARK-8266 StringTranslate
> SPARK-8244 FindInSet
> SPARK-8253 StringTrimLeft
> SPARK-8260 StringTrimRight
> SPARK-8267 StringTrim
> SPARK-8247 StringInstr
> SPARK-8264 SubstringIndex
> SPARK-8249 StringLocate
> SPARK-8252 StringLPad
> SPARK-8259 StringRPad
> SPARK-16281 ParseUrl
> SPARK-9154 FormatString
> SPARK-8269 Initcap
> SPARK-8257 StringRepeat
> SPARK-8261 StringSpace
> SPARK-8263 Substring
> SPARK-21007 Right
> SPARK-21007 Left
> SPARK-8248 Length
> SPARK-20749 BitLength
> SPARK-20749 OctetLength
> SPARK-8270 Levenshtein
> SPARK-8271 SoundEx
> SPARK-8238 Ascii
> SPARK-20748 Chr
> SPARK-8239 Base64
> SPARK-8268 UnBase64
> SPARK-8242 Decode
> SPARK-8243 Encode
> SPARK-8245 format_number
> SPARK-16285 Sentences



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27672) Add since info to string expressions

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27672:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Add since info to string expressions
> 
>
> Key: SPARK-27672
> URL: https://issues.apache.org/jira/browse/SPARK-27672
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> {{since}} is missing for string expressions (at ExpressionDescription).
> SPARK-8241 ConcatWs
> SPARK-16276 Elt
> SPARK-1995 Upper / Lower
> SPARK-20750 StringReplace
> SPARK-8266 StringTranslate
> SPARK-8244 FindInSet
> SPARK-8253 StringTrimLeft
> SPARK-8260 StringTrimRight
> SPARK-8267 StringTrim
> SPARK-8247 StringInstr
> SPARK-8264 SubstringIndex
> SPARK-8249 StringLocate
> SPARK-8252 StringLPad
> SPARK-8259 StringRPad
> SPARK-16281 ParseUrl
> SPARK-9154 FormatString
> SPARK-8269 Initcap
> SPARK-8257 StringRepeat
> SPARK-8261 StringSpace
> SPARK-8263 Substring
> SPARK-21007 Right
> SPARK-21007 Left
> SPARK-8248 Length
> SPARK-20749 BitLength
> SPARK-20749 OctetLength
> SPARK-8270 Levenshtein
> SPARK-8271 SoundEx
> SPARK-8238 Ascii
> SPARK-20748 Chr
> SPARK-8239 Base64
> SPARK-8268 UnBase64
> SPARK-8242 Decode
> SPARK-8243 Encode
> SPARK-8245 format_number
> SPARK-16285 Sentences



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27673) Add since info to random, regex, null expressions

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27673:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Add since info to random, regex, null expressions
> -
>
> Key: SPARK-27673
> URL: https://issues.apache.org/jira/browse/SPARK-27673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> We should add since info to all expressions.
> SPARK-7886 Rand / Randn
> https://github.com/apache/spark/commit/af3746ce0d724dc624658a2187bde188ab26d084
>  RLike, Like (I manually checked that it exists from 1.0.0)
> SPARK-8262 Split
> SPARK-8256 RegExpReplace
> SPARK-8255 RegExpExtract
> https://github.com/apache/spark/commit/9aadcffabd226557174f3ff566927f873c71672e
>  Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0)
> SPARK-14541 IfNull / NullIf / Nvl / Nvl2
> SPARK-9080 IsNaN
> SPARK-9168 NaNvl



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27673) Add since info to random, regex, null expressions

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27673:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Add since info to random, regex, null expressions
> -
>
> Key: SPARK-27673
> URL: https://issues.apache.org/jira/browse/SPARK-27673
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We should add since info to all expressions.
> SPARK-7886 Rand / Randn
> https://github.com/apache/spark/commit/af3746ce0d724dc624658a2187bde188ab26d084
>  RLike, Like (I manually checked that it exists from 1.0.0)
> SPARK-8262 Split
> SPARK-8256 RegExpReplace
> SPARK-8255 RegExpExtract
> https://github.com/apache/spark/commit/9aadcffabd226557174f3ff566927f873c71672e
>  Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0)
> SPARK-14541 IfNull / NullIf / Nvl / Nvl2
> SPARK-9080 IsNaN
> SPARK-9168 NaNvl



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27673) Add since info to random, regex, null expressions

2019-05-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-27673:


 Summary: Add since info to random, regex, null expressions
 Key: SPARK-27673
 URL: https://issues.apache.org/jira/browse/SPARK-27673
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


We should add since info to all expressions.

SPARK-7886 Rand / Randn
https://github.com/apache/spark/commit/af3746ce0d724dc624658a2187bde188ab26d084 
RLike, Like (I manually checked that it exists from 1.0.0)
SPARK-8262 Split
SPARK-8256 RegExpReplace
SPARK-8255 RegExpExtract
https://github.com/apache/spark/commit/9aadcffabd226557174f3ff566927f873c71672e 
Coalesce / IsNull / IsNotNull (I manually checked that it exists from 1.0.0)
SPARK-14541 IfNull / NullIf / Nvl / Nvl2
SPARK-9080 IsNaN
SPARK-9168 NaNvl



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837234#comment-16837234
 ] 

Stavros Kontopoulos commented on SPARK-25075:
-

[~srowen] since Scala 2.11 has been removed from master 
([https://github.com/apache/spark/pull/23098]), should we try to add 2.13 support 
as experimental in Spark 3.0.0, or should it be Spark 3.1.0? Scala 2.13 is 
at RC1 at the moment, but will be out soon.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837260#comment-16837260
 ] 

Guillaume Massé commented on SPARK-25075:
-

Supporting both Scala 2.12 and Scala 2.13 will be an interesting challenge. You 
can take a look at our work in Akka: [https://github.com/akka/akka/pull/26043.]

 

To make the transition easier, there is a library and a bunch of scalafix 
migration rules. Take a look at: 
[https://github.com/scala/scala-collection-compat.] See also the migration 
guide at: 
[https://docs.scala-lang.org/overviews/core/collections-migration-213.html]

 

I took a quick look, and there are a few Traversable instances in the public 
API. There is also a bunch of scala.Seq. This will make things harder to 
migrate. Based on the versioning policy 
(https://spark.apache.org/versioning-policy.html), this would require a bump to 
Spark 4.X.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18484) case class datasets - ability to specify decimal precision and scale

2019-05-10 Thread Bill Schneider (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-18484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837269#comment-16837269
 ] 

Bill Schneider commented on SPARK-18484:


I agree with [~bonazzaf]. Where this becomes a real problem is when you read 
from an existing Parquet file with a specific decimal precision/scale, cast to 
a case class for a typed Dataset, then write back to Parquet.

If I have a Parquet file with decimal(10,2), I should be able to do something 
like

 
{code:java}
case class WithDecimal(x: BigDecimal)

val input = spark.read.parquet("file_with_decimal")
> // today would have to add .withColumn("x", $"x".cast("decimal(38,18)"))

val typedDs = input.as[WithDecimal]

val output = doSomeStuffWith(typedDs)

> output.write.parquet(...){code}
If I want to do this, I have to cast `x` to decimal(38,18) before I can use a 
typed dataset, then I have to remember to cast it back if I don't want to write 
38,18 to Parquet.  This is at best annoying and at worst a real problem if my 
underlying type is something like decimal(22,0) and I really can't cast to 
(38,18) without truncating. 

In this case, I'm not inferring a schema from a case class.  I already have a 
DataFrame, it already has a schema from Parquet, and further, it's compatible 
with the target case class even without a cast as BigDecimal can hold arbitrary 
precision/scale.  

> case class datasets - ability to specify decimal precision and scale
> 
>
> Key: SPARK-18484
> URL: https://issues.apache.org/jira/browse/SPARK-18484
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Damian Momot
>Priority: Major
>
> Currently, when using a decimal type (BigDecimal in a Scala case class), there 
> is no way to enforce precision and scale. This is quite critical when saving 
> data - regarding space usage and compatibility with external systems (for 
> example a Hive table) - because Spark saves the data as Decimal(38,18).
> {code}
> case class TestClass(id: String, money: BigDecimal)
> val testDs = spark.createDataset(Seq(
>   TestClass("1", BigDecimal("22.50")),
>   TestClass("2", BigDecimal("500.66"))
> ))
> testDs.printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(38,18) (nullable = true)
> {code}
> Workaround is to convert dataset to dataframe before saving and manually cast 
> to specific decimal scale/precision:
> {code}
> import org.apache.spark.sql.types.DecimalType
> val testDf = testDs.toDF()
> testDf
>   .withColumn("money", testDf("money").cast(DecimalType(10,2)))
>   .printSchema()
> {code}
> {code}
> root
>  |-- id: string (nullable = true)
>  |-- money: decimal(10,2) (nullable = true)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27674) the hint should not be dropped after cache lookup

2019-05-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27674:
---

 Summary: the hint should not be dropped after cache lookup
 Key: SPARK-27674
 URL: https://issues.apache.org/jira/browse/SPARK-27674
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27674) the hint should not be dropped after cache lookup

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27674:


Assignee: Apache Spark  (was: Wenchen Fan)

> the hint should not be dropped after cache lookup
> -
>
> Key: SPARK-27674
> URL: https://issues.apache.org/jira/browse/SPARK-27674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27674) the hint should not be dropped after cache lookup

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27674:


Assignee: Wenchen Fan  (was: Apache Spark)

> the hint should not be dropped after cache lookup
> -
>
> Key: SPARK-27674
> URL: https://issues.apache.org/jira/browse/SPARK-27674
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837299#comment-16837299
 ] 

Sean Owen commented on SPARK-25075:
---

Given that 2.13 isn't out yet and most dependencies won't cross-publish for a 
while, I doubt we'd try it soon. However, any breaking changes we need to make 
to support it later, we should try to make now.

Uh oh, does the collections API change mean returning Seq is a problem? Hoo boy.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27675) do not use MutableColumnarRow in ColumnarBatch

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27675:


Assignee: Wenchen Fan  (was: Apache Spark)

> do not use MutableColumnarRow in ColumnarBatch
> --
>
> Key: SPARK-27675
> URL: https://issues.apache.org/jira/browse/SPARK-27675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27675) do not use MutableColumnarRow in ColumnarBatch

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27675:


Assignee: Apache Spark  (was: Wenchen Fan)

> do not use MutableColumnarRow in ColumnarBatch
> --
>
> Key: SPARK-27675
> URL: https://issues.apache.org/jira/browse/SPARK-27675
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27650) separate the row iterator functionality from ColumnarBatch

2019-05-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27650.
-
Resolution: Won't Do

> separate the row iterator functionality from ColumnarBatch
> 
>
> Key: SPARK-27650
> URL: https://issues.apache.org/jira/browse/SPARK-27650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27675) do not use MutableColumnarRow in ColumnarBatch

2019-05-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27675:
---

 Summary: do not use MutableColumnarRow in ColumnarBatch
 Key: SPARK-27675
 URL: https://issues.apache.org/jira/browse/SPARK-27675
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types

2019-05-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27625.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24564
[https://github.com/apache/spark/pull/24564]

> ScalaReflection.serializerFor fails for annotated types
> ---
>
> Key: SPARK-27625
> URL: https://issues.apache.org/jira/browse/SPARK-27625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.2
>Reporter: Patrick Grandjean
>Priority: Major
> Fix For: 3.0.0
>
>
> ScalaReflection.serializerFor fails for annotated types. Example:
> {code:java}
> case class Foo(
>   field1: String,
>   field2: Option[String] @Bar
> )
> val rdd: RDD[Foo] = ...
> val ds = rdd.toDS // fails at runtime{code}
> The stack trace:
> {code:java}
> // code placeholder
> User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
> class scala.reflect.internal.Types$AnnotatedType)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
> at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
> at ...{code}
> I believe that it would be safe to ignore the annotation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27625) ScalaReflection.serializerFor fails for annotated types

2019-05-10 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27625:
---

Assignee: Marco Gaido

> ScalaReflection.serializerFor fails for annotated types
> ---
>
> Key: SPARK-27625
> URL: https://issues.apache.org/jira/browse/SPARK-27625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.2
>Reporter: Patrick Grandjean
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 3.0.0
>
>
> ScalaReflection.serializerFor fails for annotated types. Example:
> {code:java}
> case class Foo(
>   field1: String,
>   field2: Option[String] @Bar
> )
> val rdd: RDD[Foo] = ...
> val ds = rdd.toDS // fails at runtime{code}
> The stack trace:
> {code:java}
> // code placeholder
> User class threw exception: scala.MatchError: scala.Option[String] @Bar (of 
> class scala.reflect.internal.Types$AnnotatedType)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:483)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
> at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
> at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
> at ...{code}
> I believe that it would be safe to ignore the annotation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2019-05-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-26412:
--
Summary: Allow Pandas UDF to take an iterator of pd.DataFrames  (was: Allow 
Pandas UDF to take an iterator of pd.DataFrames or Arrow batches)

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workloads. However, the user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to batch scope, the user needs to repeatedly 
> load the same model for every batch in the same Python worker process, which 
> is inefficient. I created this JIRA to discuss possible solutions.
> Essentially we need to support "start()" and "finish()" besides "apply". We 
> can either provide those interfaces or simply provide users with the iterator 
> of batches in pd.DataFrame or an Arrow table and let user code handle it.
> Another benefit is that with the iterator interface and asyncio from Python, 
> it is flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14948) Exception when joining DataFrames derived form the same DataFrame

2019-05-10 Thread Graton M Gathright (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837346#comment-16837346
 ] 

Graton M Gathright commented on SPARK-14948:


Looking forward to a fix for this bug.

> Exception when joining DataFrames derived form the same DataFrame
> -
>
> Key: SPARK-14948
> URL: https://issues.apache.org/jira/browse/SPARK-14948
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Saurabh Santhosh
>Priority: Major
>
> h2. The Spark analyzer throws the following exception in a specific scenario:
> h2. Exception :
> org.apache.spark.sql.AnalysisException: resolved attribute(s) F1#3 missing 
> from asd#5,F2#4,F1#6,F2#7 in operator !Project [asd#5,F1#3];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, 
> Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, 
> Metadata.empty());
> JavaRDD<Row> rdd =
> 
> sparkClient.getJavaSparkContext().parallelize(Arrays.asList(RowFactory.create("a",
>  "b")));
> DataFrame df = sparkClient.getSparkHiveContext().createDataFrame(rdd, new 
> StructType(fields));
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkClient.getSparkHiveContext().sql("select F1 as 
> asd, F2 from t1");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(aliasedDf, 
> "t2");
> sparkClient.getSparkHiveContext().registerDataFrameAsTable(df, "t3");
> 
> DataFrame join = aliasedDf.join(df, 
> aliasedDf.col("F2").equalTo(df.col("F2")), "inner");
> DataFrame select = join.select(aliasedDf.col("asd"), df.col("F1"));
> select.collect();
> {code}
> h2. Observations :
> * This issue is related to the data type of the fields of the initial 
> DataFrame. (If the data type is not String, it will work.)
> * It works fine if the DataFrame is registered as a temporary table and a SQL 
> query (select a.asd,b.F1 from t2 a inner join t3 b on a.F2=b.F2) is written.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2019-05-10 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837353#comment-16837353
 ] 

Xiangrui Meng commented on SPARK-26412:
---

[~WeichenXu123] I updated the description.

> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workloads. However, the user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, the user needs to repeatedly 
> load the same model for every batch in the same Python worker process, which 
> is inefficient.
> We can provide users with an iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
>     yield model.predict(batch)
> {code}
> Another benefit is that with the iterator interface and asyncio from Python, 
> it is flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2019-05-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-26412:
--
Description: 
Pandas UDF is the ideal connection between PySpark and DL model inference 
workloads. However, the user needs to load the model file first to make 
predictions. It is common to see models of size ~100MB or bigger. If the Pandas 
UDF execution is limited to each batch, the user needs to repeatedly load the 
same model for every batch in the same Python worker process, which is 
inefficient.

We can provide users with an iterator of batches in pd.DataFrame and let user 
code handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
    yield model.predict(batch)
{code}

We might add a contract that each yield must match the corresponding batch size.

Another benefit is that with the iterator interface and asyncio from Python, it 
is flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]

  was:
Pandas UDF is the ideal connection between PySpark and DL model inference 
workloads. However, the user needs to load the model file first to make 
predictions. It is common to see models of size ~100MB or bigger. If the Pandas 
UDF execution is limited to each batch, the user needs to repeatedly load the 
same model for every batch in the same Python worker process, which is 
inefficient.

We can provide users with an iterator of batches in pd.DataFrame and let user 
code handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
    yield model.predict(batch)
{code}

Another benefit is that with the iterator interface and asyncio from Python, it 
is flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]


> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workloads. However, the user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, the user needs to repeatedly 
> load the same model for every batch in the same Python worker process, which 
> is inefficient.
> We can provide users with an iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
>     yield model.predict(batch)
> {code}
> We might add a contract that each yield must match the corresponding batch 
> size.
> Another benefit is that with the iterator interface and asyncio from Python, 
> it is flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26412) Allow Pandas UDF to take an iterator of pd.DataFrames

2019-05-10 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-26412:
--
Description: 
Pandas UDF is the ideal connection between PySpark and DL model inference 
workloads. However, the user needs to load the model file first to make 
predictions. It is common to see models of size ~100MB or bigger. If the Pandas 
UDF execution is limited to each batch, the user needs to repeatedly load the 
same model for every batch in the same Python worker process, which is 
inefficient.

We can provide users with an iterator of batches in pd.DataFrame and let user 
code handle it:

{code}
@pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR)
def predict(batch_iter):
  model = ... # load model
  for batch in batch_iter:
    yield model.predict(batch)
{code}

Another benefit is that with the iterator interface and asyncio from Python, it 
is flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]

  was:
Pandas UDF is the ideal connection between PySpark and DL model inference 
workloads. However, the user needs to load the model file first to make 
predictions. It is common to see models of size ~100MB or bigger. If the Pandas 
UDF execution is limited to batch scope, the user needs to repeatedly load the 
same model for every batch in the same Python worker process, which is 
inefficient. I created this JIRA to discuss possible solutions.

Essentially we need to support "start()" and "finish()" besides "apply". We can 
either provide those interfaces or simply provide users with the iterator of 
batches in pd.DataFrame or an Arrow table and let user code handle it.

Another benefit is that with the iterator interface and asyncio from Python, it 
is flexible for users to implement data pipelining.

cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]


> Allow Pandas UDF to take an iterator of pd.DataFrames
> -
>
> Key: SPARK-26412
> URL: https://issues.apache.org/jira/browse/SPARK-26412
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Weichen Xu
>Priority: Major
>
> Pandas UDF is the ideal connection between PySpark and DL model inference 
> workloads. However, the user needs to load the model file first to make 
> predictions. It is common to see models of size ~100MB or bigger. If the 
> Pandas UDF execution is limited to each batch, the user needs to repeatedly 
> load the same model for every batch in the same Python worker process, which 
> is inefficient.
> We can provide users with an iterator of batches in pd.DataFrame and let user 
> code handle it:
> {code}
> @pandas_udf(DoubleType(), PandasUDFType.SCALAR_ITERATOR)
> def predict(batch_iter):
>   model = ... # load model
>   for batch in batch_iter:
>     yield model.predict(batch)
> {code}
> Another benefit is that with the iterator interface and asyncio from Python, 
> it is flexible for users to implement data pipelining.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837403#comment-16837403
 ] 

Stavros Kontopoulos commented on SPARK-25075:
-

Yes, it will take some time; one of the blockers is here: 
[https://github.com/twitter/chill/issues/316.]

Thanks [~MasseGuillaume], I haven't checked the Spark side yet. `scala.Seq` is 
an alias, right? So it should work out of the box, no?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837403#comment-16837403
 ] 

Stavros Kontopoulos edited comment on SPARK-25075 at 5/10/19 3:53 PM:
--

Yes, it will take some time; one of the blockers is here: 
[https://github.com/twitter/chill/issues/316.]

Thanks [~MasseGuillaume], I haven't checked the Spark side yet. `scala.Seq` is 
an alias, right? So it should work out of the box, no?


was (Author: skonto):
Yes, it will take some time; one of the blockers is here: 
[https://github.com/twitter/chill/issues/316.]

Thanks [~MasseGuillaume], I haven't checked the Spark side yet. `scala.Seq` is 
an alias, right? So it should work out of the box, no?

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Stefan Zeiger (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837417#comment-16837417
 ] 

Stefan Zeiger commented on SPARK-25075:
---

Returning a Seq is not a problem but if you want it to be a scala.Seq it needs 
to be immutable now. Nothing changes if you always use scala.collection.Seq or 
scala.collection.immutable.Seq explicitly. This is recommended for 
cross-building.

The main problem with Seq is the interaction with varargs. They are of type 
scala.Seq so you cannot pass an existing mutable Seq to a varargs method with 
`_*` anymore.
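
For illustration, a small standalone sketch of that varargs interaction (my own example, not from the Spark codebase), which compiles on 2.12 but needs an explicit conversion to cross-build with 2.13:

{code:scala}
import scala.collection.mutable.ArrayBuffer

def firstOrZero(xs: Int*): Int = xs.headOption.getOrElse(0)

val buf = ArrayBuffer(1, 2, 3)

// Compiles on 2.12, where scala.Seq is scala.collection.Seq, but not on 2.13,
// where varargs expect an immutable Seq:
// firstOrZero(buf: _*)

// Cross-buildable form: convert explicitly before expanding.
firstOrZero(buf.toSeq: _*)
{code}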

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-24980) add support for pandas/arrow etc for python2.7 and pypy builds

2019-05-10 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp closed SPARK-24980.
---

> add support for pandas/arrow etc for python2.7 and pypy builds
> --
>
> Key: SPARK-24980
> URL: https://issues.apache.org/jira/browse/SPARK-24980
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> Since we have full support for Python 3.4 via Anaconda, it's time to create 
> similar environments for Python 2.7 and PyPy 2.5.1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24980) add support for pandas/arrow etc for python2.7 and pypy builds

2019-05-10 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-24980.
-
Resolution: Fixed

> add support for pandas/arrow etc for python2.7 and pypy builds
> --
>
> Key: SPARK-24980
> URL: https://issues.apache.org/jira/browse/SPARK-24980
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, PySpark
>Affects Versions: 2.3.1
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> Since we have full support for Python 3.4 via Anaconda, it's time to create 
> similar environments for Python 2.7 and PyPy 2.5.1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21367) R older version of Roxygen2 on Jenkins

2019-05-10 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837440#comment-16837440
 ] 

shane knapp commented on SPARK-21367:
-

After deciding to clear out my JIRA queue this morning, I found this unresolved 
ticket and just installed roxygen2 from source on all of the Jenkins workers:


{noformat}
$ pssh -h jenkins_workers.txt -i "/home/eecs/sknapp/r-wtf.sh"
[1] 09:54:30 [SUCCESS] amp-jenkins-worker-02
[1] ‘5.0.1’
[2] 09:54:30 [SUCCESS] amp-jenkins-worker-05
[1] ‘5.0.1’
[3] 09:54:30 [SUCCESS] amp-jenkins-worker-06
[1] ‘5.0.1’
[4] 09:54:30 [SUCCESS] amp-jenkins-worker-04
[1] ‘5.0.1’
[5] 09:54:31 [SUCCESS] amp-jenkins-worker-03
[1] ‘5.0.1’
[6] 09:54:31 [SUCCESS] amp-jenkins-worker-01
[1] ‘5.0.1’
{noformat}


> R older version of Roxygen2 on Jenkins
> --
>
> Key: SPARK-21367
> URL: https://issues.apache.org/jira/browse/SPARK-21367
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Major
> Attachments: R.paks
>
>
> Getting this message from a recent build.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console
> Warning messages:
> 1: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> 2: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> * installing *source* package 'SparkR' ...
> ** R
> We have been running with 5.0.1 and haven't changed for a year.
> NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21367) R older version of Roxygen2 on Jenkins

2019-05-10 Thread shane knapp (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp resolved SPARK-21367.
-
Resolution: Fixed

> R older version of Roxygen2 on Jenkins
> --
>
> Key: SPARK-21367
> URL: https://issues.apache.org/jira/browse/SPARK-21367
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Major
> Attachments: R.paks
>
>
> Getting this message from a recent build.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/79461/console
> Warning messages:
> 1: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> 2: In check_dep_version(pkg, version, compare) :
>   Need roxygen2 >= 5.0.0 but loaded version is 4.1.1
> * installing *source* package 'SparkR' ...
> ** R
> We have been running with 5.0.1 and haven't changed for a year.
> NOTE: Roxygen 6.x has some big changes and IMO we should not move to that yet.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26632) Separate Thread Configurations of Driver and Executor

2019-05-10 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26632:
--

Assignee: jiafu zhang

> Separate Thread Configurations of Driver and Executor
> -
>
> Key: SPARK-26632
> URL: https://issues.apache.org/jira/browse/SPARK-26632
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiafu zhang
>Assignee: jiafu zhang
>Priority: Minor
> Fix For: 3.0.0
>
>
> During the benchmark of Spark 2.4.0 on HPC (High Performance Computing), we 
> identified an area that can be optimized to improve RPC performance on a 
> large number of HPC nodes with Omni-Path NICs: the same thread configurations 
> are used for both the driver and the executors. From the tests, we find that 
> the driver and the executors should have different thread configurations, 
> because the driver handles far more RPC messages than a single executor.
> These configurations are, 
> ||Config Key||for Driver||for Executor||
> |spark.rpc.io.serverThreads|spark.driver.rpc.io.serverThreads|spark.executor.rpc.io.serverThreads|
> |spark.rpc.io.clientThreads|spark.driver.rpc.io.clientThreads|spark.executor.rpc.io.clientThreads|
> |spark.rpc.netty.dispatcher.numThreads|spark.driver.rpc.netty.dispatcher.numThreads|spark.executor.rpc.netty.dispatcher.numThreads|
> When Spark reads thread configurations, it tries the driver's or the 
> executor's configurations first, then falls back to the common thread 
> configurations.
> After the separation, performance improves considerably at 256 and 512 
> nodes; see the SimpleMapTask test results below.
> || 
> ||spark.driver.rpc.io.serverThreads||spark.driver.rpc.io.clientThreads||spark.driver.rpc.netty.dispatcher.numThreads||spark.executor.rpc.netty.dispatcher.numThreads||Overall
>  Time (s)||Overall Time without Separation (s)||Improvement||
> |128 nodes|15|15|10|30|107|108|0.9%|
> |256 nodes|12|15|10|30|159|196|18.8%|
> |512 nodes|12|15|10|30|283|377|24.9%|
>  
> The implementation is almost done. We are working on the code merge.
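
A minimal sketch of the fallback lookup described above, assuming a plain 
SparkConf; the helper name and default value are illustrative, not the merged 
implementation:

{code:scala}
import org.apache.spark.SparkConf

object RpcThreadConfs {
  // Try the role-specific key first (role is "driver" or "executor"), then
  // fall back to the shared spark.rpc.* key, then to a default.
  def ioServerThreads(conf: SparkConf, role: String, default: Int = 8): Int =
    conf.getOption(s"spark.$role.rpc.io.serverThreads")
      .orElse(conf.getOption("spark.rpc.io.serverThreads"))
      .map(_.toInt)
      .getOrElse(default)
}

// Usage on the driver side, preferring spark.driver.rpc.io.serverThreads:
// val threads = RpcThreadConfs.ioServerThreads(sparkConf, "driver")
{code}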



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26632) Separate Thread Configurations of Driver and Executor

2019-05-10 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26632.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23560
[https://github.com/apache/spark/pull/23560]

> Separate Thread Configurations of Driver and Executor
> -
>
> Key: SPARK-26632
> URL: https://issues.apache.org/jira/browse/SPARK-26632
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiafu zhang
>Priority: Minor
> Fix For: 3.0.0
>
>
> During the benchmark of Spark 2.4.0 on HPC (High Performance Computing), we 
> identified an area that can be optimized to improve RPC performance on a 
> large number of HPC nodes with Omni-Path NICs: the same thread configurations 
> are used for both the driver and the executors. From the tests, we find that 
> the driver and the executors should have different thread configurations, 
> because the driver handles far more RPC messages than a single executor.
> These configurations are, 
> ||Config Key||for Driver||for Executor||
> |spark.rpc.io.serverThreads|spark.driver.rpc.io.serverThreads|spark.executor.rpc.io.serverThreads|
> |spark.rpc.io.clientThreads|spark.driver.rpc.io.clientThreads|spark.executor.rpc.io.clientThreads|
> |spark.rpc.netty.dispatcher.numThreads|spark.driver.rpc.netty.dispatcher.numThreads|spark.executor.rpc.netty.dispatcher.numThreads|
> When Spark reads thread configurations, it tries the driver's or the 
> executor's configurations first, then falls back to the common thread 
> configurations.
> After the separation, performance improves considerably at 256 and 512 
> nodes; see the SimpleMapTask test results below.
> || 
> ||spark.driver.rpc.io.serverThreads||spark.driver.rpc.io.clientThreads||spark.driver.rpc.netty.dispatcher.numThreads||spark.executor.rpc.netty.dispatcher.numThreads||Overall
>  Time (s)||Overall Time without Separation (s)||Improvement||
> |128 nodes|15|15|10|30|107|108|0.9%|
> |256 nodes|12|15|10|30|159|196|18.8%|
> |512 nodes|12|15|10|30|283|377|24.9%|
>  
> The implementation is almost done. We are working on the code merge.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27347) Fix supervised driver retry logic when agent crashes/restarts

2019-05-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27347:
-

Assignee: Dongjoon Hyun

> Fix supervised driver retry logic when agent crashes/restarts
> -
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.1, 2.3.2, 2.4.0
>Reporter: Sam Tran
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple 
> times when an agent would crash, come back, and re-register even when those 
> jobs had already relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as `-retry-1`
>  * agent1 comes back online and re-registers with scheduler
>  * spark relaunches the same job as `-retry-2`
>  * now there are two jobs running simultaneously and the first retry job is 
> effectively orphaned within Zookeeper
> This is because when an agent comes back and re-registers, it sends a 
> {{TASK_FAILED}} status update for its old driver task. The previous logic 
> would indiscriminately remove the {{submissionId}} from Zookeeper's 
> {{launchedDrivers}} node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another -retry task even 
> though one was already running.
> Sample log looks something like this: 
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27347) Fix supervised driver retry logic when agent crashes/restarts

2019-05-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27347.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.4.4
   2.3.4

This is resolved via https://github.com/apache/spark/pull/24276

> Fix supervised driver retry logic when agent crashes/restarts
> -
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.1, 2.3.2, 2.4.0
>Reporter: Sam Tran
>Priority: Major
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple 
> times when an agent would crash, come back, and re-register even when those 
> jobs had already relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as `-retry-1`
>  * agent1 comes back online and re-registers with scheduler
>  * spark relaunches the same job as `-retry-2`
>  * now there are two jobs running simultaneously and the first retry job is 
> effectively orphaned within Zookeeper
> This is because when an agent comes back and re-registers, it sends a 
> {{TASK_FAILED}} status update for its old driver task. The previous logic 
> would indiscriminately remove the {{submissionId}} from Zookeeper's 
> {{launchedDrivers}} node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another -retry task even 
> though one was already running.
> Sample log looks something like this: 
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27347) Fix supervised driver retry logic when agent crashes/restarts

2019-05-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27347:
-

Assignee: (was: Dongjoon Hyun)

> Fix supervised driver retry logic when agent crashes/restarts
> -
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.1, 2.3.2, 2.4.0
>Reporter: Sam Tran
>Priority: Major
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple 
> times when an agent would crash, come back, and re-register even when those 
> jobs had already relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as `-retry-1`
>  * agent1 comes back online and re-registers with scheduler
>  * spark relaunches the same job as `-retry-2`
>  * now there are two jobs running simultaneously and the first retry job is 
> effectively orphaned within Zookeeper
> This is because when an agent comes back and re-registers, it sends a 
> {{TASK_FAILED}} status update for its old driver task. The previous logic 
> would indiscriminately remove the {{submissionId}} from Zookeeper's 
> {{launchedDrivers}} node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another -retry task even 
> though one was already running.
> Sample log looks something like this: 
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27347) Fix supervised driver retry logic when agent crashes/restarts

2019-05-10 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-27347:
-

Assignee: Sam Tran

> Fix supervised driver retry logic when agent crashes/restarts
> -
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.1, 2.3.2, 2.4.0
>Reporter: Sam Tran
>Assignee: Sam Tran
>Priority: Major
> Fix For: 2.3.4, 2.4.4, 3.0.0
>
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple 
> times when an agent would crash, come back, and re-register even when those 
> jobs had already relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as `-retry-1`
>  * agent1 comes back online and re-registers with scheduler
>  * spark relaunches the same job as `-retry-2`
>  * now there are two jobs running simultaneously and the first retry job is 
> effectively orphaned within Zookeeper
> This is because when an agent comes back and re-registers, it sends a 
> {{TASK_FAILED}} status update for its old driver task. The previous logic 
> would indiscriminately remove the {{submissionId}} from Zookeeper's 
> {{launchedDrivers}} node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another -retry task even 
> though one was already running.
> Sample log looks something like this: 
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837516#comment-16837516
 ] 

Sean Owen commented on SPARK-25075:
---

OK, so we can possibly defer this until we work on 2.13 support, and then 
explicitly import scala.collection.Seq everywhere rather than use the alias, in 
order to keep the type the same. I mean, we could get going on it now too; 
that's a lot of imports!

Traversable: one usage is in a public API. We'll have to deprecate that and 
replace it with Iterable, then remove it when 2.13 support is added. Likewise 
there are some usages of removed collection classes, though so far they all 
look internal.

Right now I'm mostly interested in identifying changes we can't solve by 
deprecating something in 3.x and then removing it in 3.(x+1) to allow for 2.13 
support. Probably nothing I see so far will truly only be solvable in 4.x, but 
it's maybe best to start on these changes before 3.0.

Anyone's welcome to start; I may start pushing some minor proactive changes for 
2.13.
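
As a small illustration of the explicit-import style mentioned above (the 
class and method are made up for the example, not Spark code):

{code:scala}
import scala.collection.Seq // pin the pre-2.13 meaning of the Seq alias

object SeqCompatExample {
  // With the explicit import, the signature keeps accepting both mutable and
  // immutable sequences on 2.12 and 2.13 alike, instead of silently narrowing
  // to immutable.Seq when compiled against 2.13.
  def firstOrZero(xs: Seq[Int]): Int = xs.headOption.getOrElse(0)

  def main(args: Array[String]): Unit = {
    println(firstOrZero(scala.collection.mutable.ArrayBuffer(5, 6))) // 5
    println(firstOrZero(List.empty[Int]))                            // 0
  }
}
{code}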

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-27660) Allow PySpark toLocalIterator to pre-fetch data

2019-05-10 Thread Holden Karau (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-27660:
-
Comment: was deleted

(was: Which issue is this a duplicate of?)

> Allow PySpark toLocalIterator to pre-fetch data
> ---
>
> Key: SPARK-27660
> URL: https://issues.apache.org/jira/browse/SPARK-27660
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance gain is worth the additional complexity and extra memory usage. 
> The option to prefetch could be added as a user conf, and it's possible this 
> could improve the Scala toLocalIterator as well.
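
A minimal sketch of the prefetching idea from the description above, written 
as a plain iterator wrapper; the class name and the one-element buffer policy 
are illustrative, not Spark's API:

{code:scala}
import java.util.concurrent.{ArrayBlockingQueue, Executors}

// Fetch the next element on a background thread while the caller is still
// processing the current one (a one-element prefetch).
class PrefetchIterator[T](underlying: Iterator[T]) extends Iterator[T] {
  private val queue = new ArrayBlockingQueue[Option[T]](1)
  private val executor = Executors.newSingleThreadExecutor()
  executor.submit(new Runnable {
    override def run(): Unit = {
      underlying.foreach(x => queue.put(Some(x)))
      queue.put(None) // sentinel: no more elements
      executor.shutdown()
    }
  })

  private var nextElem: Option[T] = queue.take()

  override def hasNext: Boolean = nextElem.isDefined
  override def next(): T = {
    val result = nextElem.get
    nextElem = queue.take() // wait for the prefetcher to hand over the next one
    result
  }
}
// Note: if the caller stops consuming early, the prefetch thread stays blocked,
// which is the clean-close problem that led SPARK-23961 to remove prefetching.
// Example: new PrefetchIterator(Iterator(1, 2, 3)).toList == List(1, 2, 3)
{code}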



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27660) Allow PySpark toLocalIterator to pre-fetch data

2019-05-10 Thread Holden Karau (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837529#comment-16837529
 ] 

Holden Karau commented on SPARK-27660:
--

Which issue is this a duplicate of?

> Allow PySpark toLocalIterator to pre-fetch data
> ---
>
> Key: SPARK-27660
> URL: https://issues.apache.org/jira/browse/SPARK-27660
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance gain is worth the additional complexity and extra memory usage. 
> The option to prefetch could be added as a user conf, and it's possible this 
> could improve the Scala toLocalIterator as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27660) Allow PySpark toLocalIterator to pre-fetch data

2019-05-10 Thread Holden Karau (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837530#comment-16837530
 ] 

Holden Karau commented on SPARK-27660:
--

Related to https://issues.apache.org/jira/browse/SPARK-27659

> Allow PySpark toLocalIterator to pre-fetch data
> ---
>
> Key: SPARK-27660
> URL: https://issues.apache.org/jira/browse/SPARK-27660
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Minor
>
> In SPARK-23961, data was no longer prefetched so that the local iterator 
> could close cleanly in the case of not consuming all of the data. If the user 
> intends to iterate over all elements, then prefetching data could bring back 
> any lost performance. We would need to run some tests to see if the 
> performance gain is worth the additional complexity and extra memory usage. 
> The option to prefetch could be added as a user conf, and it's possible this 
> could improve the Scala toLocalIterator as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing

2019-05-10 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-27676:
--

 Summary: InMemoryFileIndex should hard-fail on missing files 
instead of logging and continuing
 Key: SPARK-27676
 URL: https://issues.apache.org/jira/browse/SPARK-27676
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Josh Rosen


Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
exceptions are caught and logged as warnings. I think that this is a dangerous 
default behavior and would prefer that Spark hard-fails by default (with the 
ignore-and-continue behavior guarded by a SQL session configuration).

In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
Quoting from the PR for SPARK-17599:
{quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If 
a folder is deleted at any time between the paths were resolved and the file 
catalog can check for the folder, the Spark job fails. This may abruptly stop 
long running StructuredStreaming jobs for example.

Folders may be deleted by users or automatically by retention policies. These 
cases should not prevent jobs from successfully completing.
{quote}
Let's say that I'm *not* expecting to ever delete input files for my job. In 
that case, this behavior can mask bugs.

One straightforward masked bug class is accidental file deletion: if I'm never 
expecting to delete files then I'd prefer to fail my job if Spark sees deleted 
files.

A more subtle bug can occur when using a S3 filesystem. Say I'm running a Spark 
job against a partitioned Parquet dataset which is laid out like this:
{code:java}
data/
  date=1/
region=west/
   0.parquet
   1.parquet
region=east/
   0.parquet
   1.parquet{code}
If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform 
multiple rounds of file listing, first listing {{/data/date=1}} to discover the 
partitions for that date, then listing within each partition to discover the 
leaf files. Due to the eventual consistency of S3 ListObjects, it's possible 
that the first listing will show the {{region=west}} and {{region=east}} 
partitions existing and then the next-level listing fails to return any objects 
(e.g. {{/data/date=1/}} returns files but {{/data/date=1/region=west/}} throws 
a {{FileNotFoundException}} in S3A due to ListObjects inconsistency).

If Spark propagated the {{FileNotFoundException}} and hard-failed in this case 
then I'd be able to fail the job in this case where we _definitely_ know that 
the S3 listing is inconsistent (failing here doesn't guard against _all_ 
potential S3 list inconsistency issues (e.g. back-to-back listings which both 
return a subset of the true set of objects), but I think it's still an 
improvement to fail for the subset of cases that we _can_ detect even if that's 
not a surefire failsafe against the more general problem).

Finally, I'm unsure if the original patch will have the desired effect: if a 
file is deleted once a Spark job expects to read it then that can cause 
problems at multiple layers, both in the driver (multiple rounds of file 
listing) and in executors (if the deletion occurs after the construction of the 
catalog but before the scheduling of the read tasks); I think the original 
patch only resolved the problem for the driver (unless I'm missing similar 
executor-side code specific to the original streaming use-case).

Given all of these reasons, I think that the "ignore potentially deleted files 
during file index listing" behavior should be guarded behind a feature flag 
which defaults to {{false}}, consistent with the existing 
{{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} 
flags (which both default to false).
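
For illustration, a hedged sketch of flag-guarded listing against a Hadoop 
FileSystem; the helper and its wiring are hypothetical and not the actual 
InMemoryFileIndex code:

{code:scala}
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object ListingSketch {
  // With ignoreMissingFiles = false (the proposed default), a path that
  // disappears between resolution and listing fails loudly instead of being
  // logged as a warning and silently dropped.
  def listLeafFiles(
      fs: FileSystem,
      path: Path,
      ignoreMissingFiles: Boolean): Seq[FileStatus] = {
    try {
      fs.listStatus(path).toSeq
    } catch {
      case _: FileNotFoundException if ignoreMissingFiles =>
        Console.err.println(s"Ignoring missing path: $path")
        Seq.empty
      case e: FileNotFoundException =>
        throw e // hard-fail: surface the possible listing inconsistency
    }
  }
}
{code}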



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing

2019-05-10 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27676:
---
Description: 
Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
exceptions are caught and logged as warnings (during [directory 
listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
 and . I think that this is a dangerous default behavior and would prefer that 
Spark hard-fails by default (with the ignore-and-continue behavior guarded by a 
SQL session configuration).

In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
Quoting from the PR for SPARK-17599:
{quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If 
a folder is deleted at any time between the paths were resolved and the file 
catalog can check for the folder, the Spark job fails. This may abruptly stop 
long running StructuredStreaming jobs for example.

Folders may be deleted by users or automatically by retention policies. These 
cases should not prevent jobs from successfully completing.
{quote}
Let's say that I'm *not* expecting to ever delete input files for my job. In 
that case, this behavior can mask bugs.

One straightforward masked bug class is accidental file deletion: if I'm never 
expecting to delete files then I'd prefer to fail my job if Spark sees deleted 
files.

A more subtle bug can occur when using a S3 filesystem. Say I'm running a Spark 
job against a partitioned Parquet dataset which is laid out like this:
{code:java}
data/
  date=1/
region=west/
   0.parquet
   1.parquet
region=east/
   0.parquet
   1.parquet{code}
If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform 
multiple rounds of file listing, first listing {{/data/date=1}} to discover the 
partitions for that date, then listing within each partition to discover the 
leaf files. Due to the eventual consistency of S3 ListObjects, it's possible 
that the first listing will show the {{region=west}} and {{region=east}} 
partitions existing and then the next-level listing fails to return any objects 
(e.g. {{/data/date=1/}} returns files but {{/data/date=1/region=west/}} throws 
a {{FileNotFoundException}} in S3A due to ListObjects inconsistency).

If Spark propagated the {{FileNotFoundException}} and hard-failed in this case 
then I'd be able to fail the job in this case where we _definitely_ know that 
the S3 listing is inconsistent (failing here doesn't guard against _all_ 
potential S3 list inconsistency issues (e.g. back-to-back listings which both 
return a subset of the true set of objects), but I think it's still an 
improvement to fail for the subset of cases that we _can_ detect even if that's 
not a surefire failsafe against the more general problem).

Finally, I'm unsure if the original patch will have the desired effect: if a 
file is deleted once a Spark job expects to read it then that can cause 
problems at multiple layers, both in the driver (multiple rounds of file 
listing) and in executors (if the deletion occurs after the construction of the 
catalog but before the scheduling of the read tasks); I think the original 
patch only resolved the problem for the driver (unless I'm missing similar 
executor-side code specific to the original streaming use-case).

Given all of these reasons, I think that the "ignore potentially deleted files 
during file index listing" behavior should be guarded behind a feature flag 
which defaults to {{false}}, consistent with the existing 
{{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} 
flags (which both default to false).

  was:
Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
exceptions are caught and logged as warnings. I think that this is a dangerous 
default behavior and would prefer that Spark hard-fails by default (with the 
ignore-and-continue behavior guarded by a SQL session configuration).

In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
Quoting from the PR for SPARK-17599:
{quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If 
a folder is deleted at any time between the paths were resolved and the file 
catalog can check for the folder, the Spark job fails. This may abruptly stop 
long running StructuredStreaming jobs for example.

Folders may be deleted by users or automatically by retention policies. These 
cases should not prevent jobs from successfully completing.
{quote}
Let's say that I'm *not* expecting to ever delete input files for my job. In 
that case, this behavior can mask bugs.

One straightforward masked bug class is accidental file deletion: if I'm never 
expecting to delete files then I'd prefer to fail my job if Spark sees deleted 
files.

A mor

[jira] [Updated] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing

2019-05-10 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27676:
---
Description: 
Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
exceptions are caught and logged as warnings (during [directory 
listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
 and [block location 
lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]).
 I think that this is a dangerous default behavior and would prefer that Spark 
hard-fails by default (with the ignore-and-continue behavior guarded by a SQL 
session configuration).

In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
Quoting from the PR for SPARK-17599:
{quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If 
a folder is deleted at any time between the paths were resolved and the file 
catalog can check for the folder, the Spark job fails. This may abruptly stop 
long running StructuredStreaming jobs for example.

Folders may be deleted by users or automatically by retention policies. These 
cases should not prevent jobs from successfully completing.
{quote}
Let's say that I'm *not* expecting to ever delete input files for my job. In 
that case, this behavior can mask bugs.

One straightforward masked bug class is accidental file deletion: if I'm never 
expecting to delete files then I'd prefer to fail my job if Spark sees deleted 
files.

A more subtle bug can occur when using a S3 filesystem. Say I'm running a Spark 
job against a partitioned Parquet dataset which is laid out like this:
{code:java}
data/
  date=1/
region=west/
   0.parquet
   1.parquet
region=east/
   0.parquet
   1.parquet{code}
If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform 
multiple rounds of file listing, first listing {{/data/date=1}} to discover the 
partitions for that date, then listing within each partition to discover the 
leaf files. Due to the eventual consistency of S3 ListObjects, it's possible 
that the first listing will show the {{region=west}} and {{region=east}} 
partitions existing and then the next-level listing fails to return any objects 
(e.g. {{/data/date=1/}} returns files but {{/data/date=1/region=west/}} throws 
a {{FileNotFoundException}} in S3A due to ListObjects inconsistency).

If Spark propagated the {{FileNotFoundException}} and hard-failed in this case 
then I'd be able to fail the job in this case where we _definitely_ know that 
the S3 listing is inconsistent (failing here doesn't guard against _all_ 
potential S3 list inconsistency issues (e.g. back-to-back listings which both 
return a subset of the true set of objects), but I think it's still an 
improvement to fail for the subset of cases that we _can_ detect even if that's 
not a surefire failsafe against the more general problem).

Finally, I'm unsure if the original patch will have the desired effect: if a 
file is deleted once a Spark job expects to read it then that can cause 
problems at multiple layers, both in the driver (multiple rounds of file 
listing) and in executors (if the deletion occurs after the construction of the 
catalog but before the scheduling of the read tasks); I think the original 
patch only resolved the problem for the driver (unless I'm missing similar 
executor-side code specific to the original streaming use-case).

Given all of these reasons, I think that the "ignore potentially deleted files 
during file index listing" behavior should be guarded behind a feature flag 
which defaults to {{false}}, consistent with the existing 
{{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} 
flags (which both default to false).

  was:
Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
exceptions are caught and logged as warnings (during [directory 
listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
 and . I think that this is a dangerous default behavior and would prefer that 
Spark hard-fails by default (with the ignore-and-continue behavior guarded by a 
SQL session configuration).

In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
Quoting from the PR for SPARK-17599:
{quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If 
a folder is deleted at any time between the paths were resolved and the file 
catalog can check for the folder, the Spark job fails. This may abruptly stop 
long running StructuredStreaming jobs for example.

Folders may be deleted by user

[jira] [Updated] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing

2019-05-10 Thread Josh Rosen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-27676:
---
Description: 
Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
exceptions are caught and logged as warnings (during [directory 
listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
 and [block location 
lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]).
 I think that this is a dangerous default behavior and would prefer that Spark 
hard-fails by default (with the ignore-and-continue behavior guarded by a SQL 
session configuration).

In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
Quoting from the PR for SPARK-17599:
{quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If 
a folder is deleted at any time between the paths were resolved and the file 
catalog can check for the folder, the Spark job fails. This may abruptly stop 
long running StructuredStreaming jobs for example.

Folders may be deleted by users or automatically by retention policies. These 
cases should not prevent jobs from successfully completing.
{quote}
Let's say that I'm *not* expecting to ever delete input files for my job. In 
that case, this behavior can mask bugs.

One straightforward masked bug class is accidental file deletion: if I'm never 
expecting to delete files then I'd prefer to fail my job if Spark sees deleted 
files.

A more subtle bug can occur when using a S3 filesystem. Say I'm running a Spark 
job against a partitioned Parquet dataset which is laid out like this:
{code:java}
data/
  date=1/
region=west/
   0.parquet
   1.parquet
region=east/
   0.parquet
   1.parquet{code}
If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform 
multiple rounds of file listing, first listing {{/data/date=1}} to discover the 
partitions for that date, then listing within each partition to discover the 
leaf files. Due to the eventual consistency of S3 ListObjects, it's possible 
that the first listing will show the {{region=west}} and {{region=east}} 
partitions existing and then the next-level listing fails to return any for 
some of the directories (e.g. {{/data/date=1/}} returns files but 
{{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A due to 
ListObjects inconsistency).

If Spark propagated the {{FileNotFoundException}} and hard-failed in this case 
then I'd be able to fail the job in this case where we _definitely_ know that 
the S3 listing is inconsistent (failing here doesn't guard against _all_ 
potential S3 list inconsistency issues (e.g. back-to-back listings which both 
return a subset of the true set of objects), but I think it's still an 
improvement to fail for the subset of cases that we _can_ detect even if that's 
not a surefire failsafe against the more general problem).

Finally, I'm unsure if the original patch will have the desired effect: if a 
file is deleted once a Spark job expects to read it then that can cause 
problems at multiple layers, both in the driver (multiple rounds of file 
listing) and in executors (if the deletion occurs after the construction of the 
catalog but before the scheduling of the read tasks); I think the original 
patch only resolved the problem for the driver (unless I'm missing similar 
executor-side code specific to the original streaming use-case).

Given all of these reasons, I think that the "ignore potentially deleted files 
during file index listing" behavior should be guarded behind a feature flag 
which defaults to {{false}}, consistent with the existing 
{{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} 
flags (which both default to false).

  was:
Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
exceptions are caught and logged as warnings (during [directory 
listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
 and [block location 
lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]).
 I think that this is a dangerous default behavior and would prefer that Spark 
hard-fails by default (with the ignore-and-continue behavior guarded by a SQL 
session configuration).

In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
Quoting from the PR for SPARK-17599:
{quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. If 
a folder is delete

[jira] [Commented] (SPARK-27676) InMemoryFileIndex should hard-fail on missing files instead of logging and continuing

2019-05-10 Thread Michael Armbrust (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837552#comment-16837552
 ] 

Michael Armbrust commented on SPARK-27676:
--

I tend to agree that all cases where we chose to ignore missing files should be 
hidden behind the existing {{spark.sql.files.ignoreMissingFiles}} flag.

> InMemoryFileIndex should hard-fail on missing files instead of logging and 
> continuing
> -
>
> Key: SPARK-27676
> URL: https://issues.apache.org/jira/browse/SPARK-27676
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> Spark's {{InMemoryFileIndex}} contains two places where {{FileNotFound}} 
> exceptions are caught and logged as warnings (during [directory 
> listing|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L274]
>  and [block location 
> lookup|https://github.com/apache/spark/blob/bcd3b61c4be98565352491a108e6394670a0f413/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L333]).
>  I think that this is a dangerous default behavior and would prefer that 
> Spark hard-fails by default (with the ignore-and-continue behavior guarded by 
> a SQL session configuration).
> In SPARK-17599 and SPARK-24364, logic was added to ignore missing files. 
> Quoting from the PR for SPARK-17599:
> {quote}The {{ListingFileCatalog}} lists files given a set of resolved paths. 
> If a folder is deleted at any time between the paths were resolved and the 
> file catalog can check for the folder, the Spark job fails. This may abruptly 
> stop long running StructuredStreaming jobs for example.
> Folders may be deleted by users or automatically by retention policies. These 
> cases should not prevent jobs from successfully completing.
> {quote}
> Let's say that I'm *not* expecting to ever delete input files for my job. In 
> that case, this behavior can mask bugs.
> One straightforward masked bug class is accidental file deletion: if I'm 
> never expecting to delete files then I'd prefer to fail my job if Spark sees 
> deleted files.
> A more subtle bug can occur when using a S3 filesystem. Say I'm running a 
> Spark job against a partitioned Parquet dataset which is laid out like this:
> {code:java}
> data/
>   date=1/
> region=west/
>0.parquet
>1.parquet
> region=east/
>0.parquet
>1.parquet{code}
> If I do {{spark.read.parquet("/data/date=1/")}} then Spark needs to perform 
> multiple rounds of file listing, first listing {{/data/date=1}} to discover 
> the partitions for that date, then listing within each partition to discover 
> the leaf files. Due to the eventual consistency of S3 ListObjects, it's 
> possible that the first listing will show the {{region=west}} and 
> {{region=east}} partitions existing and then the next-level listing fails to 
> return any for some of the directories (e.g. {{/data/date=1/}} returns files 
> but {{/data/date=1/region=west/}} throws a {{FileNotFoundException}} in S3A 
> due to ListObjects inconsistency).
> If Spark propagated the {{FileNotFoundException}} and hard-failed in this 
> case then I'd be able to fail the job in this case where we _definitely_ know 
> that the S3 listing is inconsistent (failing here doesn't guard against _all_ 
> potential S3 list inconsistency issues (e.g. back-to-back listings which both 
> return a subset of the true set of objects), but I think it's still an 
> improvement to fail for the subset of cases that we _can_ detect even if 
> that's not a surefire failsafe against the more general problem).
> Finally, I'm unsure if the original patch will have the desired effect: if a 
> file is deleted once a Spark job expects to read it then that can cause 
> problems at multiple layers, both in the driver (multiple rounds of file 
> listing) and in executors (if the deletion occurs after the construction of 
> the catalog but before the scheduling of the read tasks); I think the 
> original patch only resolved the problem for the driver (unless I'm missing 
> similar executor-side code specific to the original streaming use-case).
> Given all of these reasons, I think that the "ignore potentially deleted 
> files during file index listing" behavior should be guarded behind a feature 
> flag which defaults to {{false}}, consistent with the existing 
> {{spark.files.ignoreMissingFiles}} and {{spark.sql.files.ignoreMissingFiles}} 
> flags (which both default to false).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscr

[jira] [Commented] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837579#comment-16837579
 ] 

Stavros Kontopoulos commented on SPARK-25075:
-

Scala team added some more info to explain this in detail: 
https://github.com/scala/docs.scala-lang/pull/1326/files 

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837579#comment-16837579
 ] 

Stavros Kontopoulos edited comment on SPARK-25075 at 5/10/19 8:24 PM:
--

Scala team added some more info to explain this in detail: 
[https://github.com/scala/docs.scala-lang/pull/1326/files] 

I might be able to help too with the changes. First some dependencies need to 
be fixed outside Spark which requires some work.


was (Author: skonto):
Scala team added some more info to explain this in detail: 
https://github.com/scala/docs.scala-lang/pull/1326/files 

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25075) Build and test Spark against Scala 2.13

2019-05-10 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837579#comment-16837579
 ] 

Stavros Kontopoulos edited comment on SPARK-25075 at 5/10/19 8:24 PM:
--

Scala team added some more info to explain this in detail: 
[https://github.com/scala/docs.scala-lang/pull/1326/files] 

I might be able to help too with the changes. First some dependencies need to 
be fixed outside Spark which requires a certain amount of work.


was (Author: skonto):
Scala team added some more info to explain this in detail: 
[https://github.com/scala/docs.scala-lang/pull/1326/files] 

I might be able to help too with the changes. First some dependencies need to 
be fixed outside Spark which requires some work.

> Build and test Spark against Scala 2.13
> ---
>
> Key: SPARK-25075
> URL: https://issues.apache.org/jira/browse/SPARK-25075
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.0.0
>Reporter: Guillaume Massé
>Priority: Major
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.13 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27677) Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

2019-05-10 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-27677:


 Summary: Disk-persisted RDD blocks served by shuffle service, and 
ignored for Dynamic Allocation
 Key: SPARK-27677
 URL: https://issues.apache.org/jira/browse/SPARK-27677
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Spark Core
Affects Versions: 2.4.3
Reporter: Imran Rashid


Disk-cached RDD blocks are currently unavailable after an executor is removed.  
However, when there is an external shuffle service, the data remains available 
on disk and could be served by the shuffle service.  This would allow dynamic 
allocation to reclaim executors with only disk-cached blocks more rapidly, but 
still keep the cached data available.
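
As a rough sketch of the setup this targets (the config keys are existing Spark settings; the input path is made up):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("disk-cache-with-dynamic-allocation")
  .config("spark.shuffle.service.enabled", "true")    // external shuffle service on each node
  .config("spark.dynamicAllocation.enabled", "true")  // executors can be released when idle
  .getOrCreate()

// Disk-only cache: today, executors holding these blocks are held back (see
// spark.dynamicAllocation.cachedExecutorIdleTimeout) or the blocks become
// unavailable once the executor is removed; the proposal is to let the external
// shuffle service keep serving them after the executor goes away.
val events = spark.sparkContext
  .textFile("hdfs:///data/events")   // made-up path
  .persist(StorageLevel.DISK_ONLY)

events.count()
{code}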



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27677) Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation

2019-05-10 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837629#comment-16837629
 ] 

Imran Rashid commented on SPARK-27677:
--

I'm breaking this out of SPARK-25888.  [~attilapiros] is already working on a 
fix for this; the current PR is https://github.com/apache/spark/pull/24499

> Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic 
> Allocation
> ---
>
> Key: SPARK-27677
> URL: https://issues.apache.org/jira/browse/SPARK-27677
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Spark Core
>Affects Versions: 2.4.3
>Reporter: Imran Rashid
>Priority: Major
>
> Disk-cached RDD blocks are currently unavailable after an executor is 
> removed.  However, when there is an external shuffle service, the data 
> remains available on disk and could be served by the shuffle service.  This 
> would allow dynamic allocation to reclaim executors with only disk-cached 
> blocks more rapidly, but still keep the cached data available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25888) Service requests for persist() blocks via external service after dynamic deallocation

2019-05-10 Thread Imran Rashid (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837632#comment-16837632
 ] 

Imran Rashid commented on SPARK-25888:
--

There is a lot of stuff listed here, so I think it makes sense to break it out 
into smaller chunks.  I created SPARK-27677 for the first part (which 
corresponds to the PR attila already created).  I think all of the asks here 
are reasonable long-term goals, but some are significantly more complicated.  
I'll keep thinking about ways we can break this down to get incremental 
improvements in.

> Service requests for persist() blocks via external service after dynamic 
> deallocation
> -
>
> Key: SPARK-25888
> URL: https://issues.apache.org/jira/browse/SPARK-25888
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle, YARN
>Affects Versions: 2.3.2
>Reporter: Adam Kennedy
>Priority: Major
>
> Large and highly multi-tenant Spark on YARN clusters with diverse job 
> execution often display terrible utilization rates (we have observed as low 
> as 3-7% CPU at max container allocation, but 50% CPU utilization on even a 
> well policed cluster is not uncommon).
> As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 
> users and 50,000 runs of 1,000 distinct applications per week, with 
> predominantly Spark including a mixture of ETL, Ad Hoc tasks and PySpark 
> Notebook jobs (no streaming)
> Utilization problems appear to be due in large part to difficulties with 
> persist() blocks (DISK or DISK+MEMORY) preventing dynamic deallocation.
> In situations where an external shuffle service is present (which is typical 
> on clusters of this type) we already solve this for the shuffle block case by 
> offloading the IO handling of shuffle blocks to the external service, 
> allowing dynamic deallocation to proceed.
> Allowing Executors to transfer persist() blocks to some external "shuffle" 
> service in a similar manner would be an enormous win for Spark multi-tenancy 
> as it would limit deallocation blocking scenarios to only MEMORY-only cache() 
> scenarios.
> I'm not sure if I'm correct, but I seem to recall from the original external 
> shuffle service commits that this may have been considered at the time, but 
> that getting shuffle blocks moved to the external shuffle service was the 
> first priority.
> With support for external persist() DISK blocks in place, we could also then 
> handle deallocation of DISK+MEMORY, as the memory instance could first be 
> dropped, changing the block to DISK only, and then further transferred to the 
> shuffle service.
> We have tried to resolve the persist() issue via extensive user training, but 
> that has typically only allowed us to improve utilization of the worst 
> offenders (10% utilization) up to around 40-60% utilization, as the need for 
> persist() is often legitimate and occurs during the middle stages of a job.
> In a healthy multi-tenant scenario, a large job might spool up to say 10,000 
> cores, persist() data, release executors across a long tail down to 100 
> cores, and then spool back up to 10,000 cores for the following stage without 
> impact on the persist() data.
> In an ideal world, if a new executor started up on a node on which blocks 
> had been transferred to the shuffle service, the new executor might even be 
> able to "recapture" control of those blocks (if that would help with 
> performance in some way).
> And the behavior of gradually expanding up and down several times over the 
> course of a job would not just improve utilization, but would allow resources 
> to more easily be redistributed to other jobs which start on the cluster 
> during the long-tail periods, which would improve multi-tenancy and bring us 
> closer to optimal "envy free" YARN scheduling.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27678) Support Knox user impersonation in UI

2019-05-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-27678:
--

 Summary: Support Knox user impersonation in UI
 Key: SPARK-27678
 URL: https://issues.apache.org/jira/browse/SPARK-27678
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Marcelo Vanzin


When Spark is running behind a Knox proxy server, it would be useful to support 
impersonation, so that Knox can properly identify the user who's actually 
making the request.

This way, Knox authenticates to Spark, but Spark makes access control checks 
against the user that Knox identifies as the actual requesting user, so that 
proper access checks are performed, instead of performing the checks against 
the more privileged Knox user.
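
For context, a rough sketch of the existing access-control knobs this would build on (the ACL keys below are existing Spark settings; treating the Knox service account as the trusted proxy, and the doAs-style handoff itself, are assumptions about how the feature might look):

{code:scala}
import org.apache.spark.SparkConf

// Existing behaviour: the UI checks the *authenticated* user against these ACLs.
// Behind Knox, that authenticated user is the Knox service account, so the checks
// end up being made against the proxy rather than the person making the request.
val conf = new SparkConf()
  .set("spark.acls.enable", "true")
  .set("spark.admin.acls", "knox")        // assumption: Knox's service user
  .set("spark.ui.view.acls", "alice,bob") // end users who should be able to view the UI

// The missing piece this issue asks for: Knox forwards the real requesting user
// (e.g. via a doAs-style parameter) and the UI runs its view/modify ACL checks
// against that user instead of the "knox" account.
{code}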



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27678) Support Knox user impersonation in UI

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27678:


Assignee: (was: Apache Spark)

> Support Knox user impersonation in UI
> -
>
> Key: SPARK-27678
> URL: https://issues.apache.org/jira/browse/SPARK-27678
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> When Spark is running behind a Knox proxy server, it would be useful to 
> support impersonation, so that Knox can properly identify the user who's 
> actually making the request.
> This way, Knox authenticates to Spark, but Spark makes access control checks 
> against the user that Knox identifies as the actual requesting user, so that 
> proper access checks are performed, instead of performing the checks against 
> the more privileged Knox user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27678) Support Knox user impersonation in UI

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27678:


Assignee: Apache Spark

> Support Knox user impersonation in UI
> -
>
> Key: SPARK-27678
> URL: https://issues.apache.org/jira/browse/SPARK-27678
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> When Spark is running behind a Knox proxy server, it would be useful to 
> support impersonation, so that Knox can properly identify the user who's 
> actually making the request.
> This way, Knox authenticates to Spark, but Spark makes access control checks 
> against the user that Knox identifies as the actual requesting user, so that 
> proper access checks are performed, instead of performing the checks against 
> the more privileged Knox user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27679) Improve queries with LIKE expression

2019-05-10 Thread Achuth Narayan Rajagopal (JIRA)
Achuth Narayan Rajagopal created SPARK-27679:


 Summary: Improve queries with LIKE expression
 Key: SPARK-27679
 URL: https://issues.apache.org/jira/browse/SPARK-27679
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3
Reporter: Achuth Narayan Rajagopal


The following improvements are possible when simplifying `LIKE` expressions:
 # In the case where the pattern is `null`, replace the LIKE expression with 
Literal(`false`), so that where applicable we can replace the underlying 
relation with a `LocalRelation` (see the sketch after this list).
 # Add support for optimizing the underscore ('_') case, similar to '%'.
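
A small illustration of case 1 (just a sketch; the column name is made up):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("like-null-sketch").getOrCreate()
import spark.implicits._

val df = Seq("foo", "bar").toDF("s")

// `s LIKE NULL` can never evaluate to true, so the filter keeps no rows; folding
// it to a false literal would let the optimizer prune the scan down to an empty
// LocalRelation instead of reading the underlying data.
df.where("s LIKE NULL").explain(true)
{code}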



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27679) Improve queries with LIKE expression

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27679:


Assignee: Apache Spark

> Improve queries with LIKE expression
> 
>
> Key: SPARK-27679
> URL: https://issues.apache.org/jira/browse/SPARK-27679
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Achuth Narayan Rajagopal
>Assignee: Apache Spark
>Priority: Major
>
> The following improvements are possible when simplifying `LIKE` expressions:
>  # In the case where the pattern is `null`, replace the LIKE expression with 
> Literal(`false`), so that where applicable we can replace the underlying 
> relation with a `LocalRelation`.
>  # Add support for optimizing the underscore ('_') case, similar to '%'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27679) Improve queries with LIKE expression

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27679:


Assignee: (was: Apache Spark)

> Improve queries with LIKE expression
> 
>
> Key: SPARK-27679
> URL: https://issues.apache.org/jira/browse/SPARK-27679
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Achuth Narayan Rajagopal
>Priority: Major
>
> The following improvements are possible when simplifying `LIKE` expressions:
>  # In the case where the pattern is `null`, replace the LIKE expression with 
> Literal(`false`), so that where applicable we can replace the underlying 
> relation with a `LocalRelation`.
>  # Add support for optimizing the underscore ('_') case, similar to '%'.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-05-10 Thread Sean Owen (JIRA)
Sean Owen created SPARK-27681:
-

 Summary: Use scala.collection.Seq explicitly instead of scala.Seq 
alias
 Key: SPARK-27681
 URL: https://issues.apache.org/jira/browse/SPARK-27681
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


{{scala.Seq}} is widely used in the code, and is an alias for 
{{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
{{scala.collection.immutable.Seq}} in Scala 2.13. To avoid API changes, we 
should simply import and use {{scala.collection.Seq}} explicitly.
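
A sketch of what that looks like in practice (the helper is made up; only the import matters):

{code:scala}
// Pin the meaning of Seq explicitly: on Scala 2.12 this is what plain `Seq`
// already means, and on Scala 2.13 it keeps the signature from silently
// changing to scala.collection.immutable.Seq.
import scala.collection.Seq

object SeqCompat {
  def firstOrNone[T](xs: Seq[T]): Option[T] = xs.headOption  // made-up helper
}
{code}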



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27682) Avoid use of Scala collection classes that are removed in 2.13

2019-05-10 Thread Sean Owen (JIRA)
Sean Owen created SPARK-27682:
-

 Summary: Avoid use of Scala collection classes that are removed in 
2.13
 Key: SPARK-27682
 URL: https://issues.apache.org/jira/browse/SPARK-27682
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


Scala 2.13 will remove several collection classes like {{MutableList}}. We 
should avoid using them and proactively replace them with similar classes.
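
For illustration, the kind of mechanical replacement involved (names are made up): {{MutableList}} disappears in 2.13, while {{ListBuffer}} exists and behaves the same way in both versions.

{code:scala}
import scala.collection.mutable.ListBuffer

// Before (gone in 2.13): val acc = new scala.collection.mutable.MutableList[Int]()
// After: ListBuffer supports the same append-style usage on both 2.12 and 2.13.
val acc = ListBuffer.empty[Int]
acc += 1
acc += 2
val result: List[Int] = acc.toList
{code}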



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27680) Remove usage of Traversable

2019-05-10 Thread Sean Owen (JIRA)
Sean Owen created SPARK-27680:
-

 Summary: Remove usage of Traversable
 Key: SPARK-27680
 URL: https://issues.apache.org/jira/browse/SPARK-27680
 Project: Spark
  Issue Type: Sub-task
  Components: GraphX, Spark Core, SQL
Affects Versions: 3.0.0
Reporter: Sean Owen
Assignee: Sean Owen


Scala's {{Traversable}} is removed in Scala 2.13. We have some usages of this 
interface that almost certainly can just be references to {{Iterable}}, which 
all the usual collection classes also extend.
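
For illustration, the kind of signature change this implies (the method name is made up):

{code:scala}
// Before (Traversable is removed in 2.13):
// def sumAll(xs: Traversable[Int]): Int = xs.sum

// After: Iterable is extended by all the usual collection classes on both versions.
def sumAll(xs: Iterable[Int]): Int = xs.sum

sumAll(List(1, 2, 3))  // 6
sumAll(Set(4, 5))      // 9
{code}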



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27680) Remove usage of Traversable

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27680:


Assignee: Sean Owen  (was: Apache Spark)

> Remove usage of Traversable
> ---
>
> Key: SPARK-27680
> URL: https://issues.apache.org/jira/browse/SPARK-27680
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> Scala's {{Traversable}} is removed in Scala 2.13. We have some usages of this 
> interface that almost certainly can just be references to {{Iterable}}, which 
> all the usual collection classes also extend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27680) Remove usage of Traversable

2019-05-10 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27680:


Assignee: Apache Spark  (was: Sean Owen)

> Remove usage of Traversable
> ---
>
> Key: SPARK-27680
> URL: https://issues.apache.org/jira/browse/SPARK-27680
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX, Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> Scala's {{Traversable}} is removed in Scala 2.13. We have some usages of this 
> interface that almost certainly can just be references to {{Iterable}}, which 
> all the usual collection classes also extend.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-05-10 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837749#comment-16837749
 ] 

Marcelo Vanzin commented on SPARK-27681:


OOC, what happens if we do nothing?

As far as I can see, Scala 2.12 builds will have scala.Seq in the API, Scala 
2.13 builds will have scala.collection.immutable.Seq, and that should be fine, 
right, since those builds are not binary compatible with each other anyway?

> Use scala.collection.Seq explicitly instead of scala.Seq alias
> --
>
> Key: SPARK-27681
> URL: https://issues.apache.org/jira/browse/SPARK-27681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> {{scala.Seq}} is widely used in the code, and is an alias for 
> {{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
> {{scala.collection.immutable.Seq}} in Scala 2.13. To avoid API changes, we 
> should simply import and use {{scala.collection.Seq}} explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-05-10 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837752#comment-16837752
 ] 

Marcelo Vanzin commented on SPARK-27681:


Also the change you're proposing likely would break source compatibility, which 
would make upgrading to Spark 3 harder than just a recompile...

> Use scala.collection.Seq explicitly instead of scala.Seq alias
> --
>
> Key: SPARK-27681
> URL: https://issues.apache.org/jira/browse/SPARK-27681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> {{scala.Seq}} is widely used in the code, and is an alias for 
> {{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
> {{scala.collection.immutable.Seq}} in Scala 2.13. To avoid API changes, we 
> should simply import and use {{scala.collection.Seq}} explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-05-10 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837751#comment-16837751
 ] 

Sean Owen commented on SPARK-27681:
---

To my understanding, we don't need to do this until we support Scala 2.13. We 
_can_ do it now, as it just makes explicit the type of {{Seq}} that the 
{{scala.Seq}} alias already refers to. We don't actually want or need to change 
to {{scala.collection.immutable.Seq}} over {{scala.collection.Seq}}, in theory 
or in practice. But indeed, that change wouldn't be so bad anyway, as 2.13 
builds aren't expected to be binary compatible with 2.12. I'd only argue for it 
because the API already doesn't promise an immutable or mutable {{Seq}}, and 
this change would maintain that; but yeah, it's not even strictly required.

> Use scala.collection.Seq explicitly instead of scala.Seq alias
> --
>
> Key: SPARK-27681
> URL: https://issues.apache.org/jira/browse/SPARK-27681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> {{scala.Seq}} is widely used in the code, and is an alias for 
> {{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
> {{scala.collection.immutable.Seq}} in Scala 2.13. To avoid API changes, we 
> should simply import and use {{scala.collection.Seq}} explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-05-10 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837753#comment-16837753
 ] 

Marcelo Vanzin commented on SPARK-27681:


bq. we don't need to do this until we support Scala 2.13

I guess my question is why do we need to do it? Your explanation above 
indicates that the {{Seq}} alias in 2.13 will just be a different type. That's 
fine since 2.12 and 2.13 builds are not binary compatible anyway.

I'd understand this change if the {{Seq}} alias was being removed altogether 
and you'd need an explicit import, but that doesn't seem to be the case.

> Use scala.collection.Seq explicitly instead of scala.Seq alias
> --
>
> Key: SPARK-27681
> URL: https://issues.apache.org/jira/browse/SPARK-27681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> {{scala.Seq}} is widely used in the code, and is an alias for 
> {{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
> {{scala.collection.immutable.Seq}} in Scala 2.13. To avoid API changes, we 
> should simply import and use {{scala.collection.Seq}} explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-05-10 Thread Marcelo Vanzin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837752#comment-16837752
 ] 

Marcelo Vanzin edited comment on SPARK-27681 at 5/11/19 4:28 AM:
-

Also the change you're proposing likely would break source compatibility, which 
would make upgrading to Spark + Scala 2.13 harder than just a recompile...


was (Author: vanzin):
Also the change you're proposing likely would break source compatibility, which 
would make upgrading to Spark 3 harder than just a recompile...

> Use scala.collection.Seq explicitly instead of scala.Seq alias
> --
>
> Key: SPARK-27681
> URL: https://issues.apache.org/jira/browse/SPARK-27681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Major
>
> {{scala.Seq}} is widely used in the code, and is an alias for 
> {{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
> {{scala.collection.immutable.Seq}} in Scala 2.13. To avoid API changes, we 
> should simply import and use {{scala.collection.Seq}} explicitly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23191) Workers registration fails in case of network drop

2019-05-10 Thread Neeraj Gupta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837756#comment-16837756
 ] 

Neeraj Gupta commented on SPARK-23191:
--

Hi [~Ngone51],

Yes, in my case multiple drivers were running concurrently for one app.

> Workers registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues with multiple drivers running in 
> some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3 
> and 2.2.1 with the following steps:
>  # Set up a 3-node cluster. Start the master and the slaves.
>  # On any node where the worker process is running, block the connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 secs we get an error on the node that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do so. It 
> gives the following error:
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to :7077
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
>     at 
> org.apache.spark.rpc.ne