[jira] [Resolved] (SPARK-33215) Speed up event log download by skipping UI rebuild

2020-10-26 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-33215.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30126
[https://github.com/apache/spark/pull/30126]

> Speed up event log download by skipping UI rebuild
> --
>
> Key: SPARK-33215
> URL: https://issues.apache.org/jira/browse/SPARK-33215
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
> Fix For: 3.1.0
>
>
> Right now, when we want to download event logs from the Spark history
> server (SHS), the SHS needs to parse the entire event log to rebuild the UI,
> and this is done just for view permission checks. UI rebuilding is a
> time-consuming and memory-intensive task, especially for large logs. However,
> this process is unnecessary for event log download.
> This patch enables the SHS to check the UI view permissions of a given
> app/attempt for a given user without rebuilding the UI. This is achieved by
> adding a method "checkUIViewPermissions(appId: String, attemptId:
> Option[String], user: String): Boolean" to several layers of the history
> server components.
> With this patch, the UI rebuild can be skipped when downloading event logs
> from the history server. The time to download a GB-scale event log can thus
> be reduced from several minutes to several seconds, and the memory cost of UI
> rebuilding is avoided entirely.
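For illustration only, here is a minimal Scala sketch of what such a check could look like when
the view ACLs are kept as lightweight metadata instead of being recovered by replaying the whole
log. The AttemptAcls class and the acls map are hypothetical stand-ins for whatever the history
provider actually stores; only the checkUIViewPermissions signature comes from the issue.

{code:scala}
// Hypothetical sketch: ACL metadata kept per app/attempt, not the actual Spark code.
case class AttemptAcls(adminAcls: Set[String], viewAcls: Set[String])

class PermissionChecker(acls: Map[(String, Option[String]), AttemptAcls]) {
  // Same signature as in the issue: answer the ACL question without rebuilding the UI.
  def checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean =
    acls.get((appId, attemptId))
      .exists(a => a.adminAcls.contains(user) || a.viewAcls.contains(user))
}

// The event log download endpoint can call the check directly instead of parsing the log.
val acls = Map[(String, Option[String]), AttemptAcls](
  ("app-20201026-0001", None) -> AttemptAcls(Set("admin"), Set("alice")))
val checker = new PermissionChecker(acls)
assert(checker.checkUIViewPermissions("app-20201026-0001", None, "alice"))
{code}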






[jira] [Assigned] (SPARK-33215) Speed up event log download by skipping UI rebuild

2020-10-26 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-33215:


Assignee: Baohe Zhang

> Speed up event log download by skipping UI rebuild
> --
>
> Key: SPARK-33215
> URL: https://issues.apache.org/jira/browse/SPARK-33215
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
>
> Right now, when we want to download event logs from the Spark history
> server (SHS), the SHS needs to parse the entire event log to rebuild the UI,
> and this is done just for view permission checks. UI rebuilding is a
> time-consuming and memory-intensive task, especially for large logs. However,
> this process is unnecessary for event log download.
> This patch enables the SHS to check the UI view permissions of a given
> app/attempt for a given user without rebuilding the UI. This is achieved by
> adding a method "checkUIViewPermissions(appId: String, attemptId:
> Option[String], user: String): Boolean" to several layers of the history
> server components.
> With this patch, the UI rebuild can be skipped when downloading event logs
> from the history server. The time to download a GB-scale event log can thus
> be reduced from several minutes to several seconds, and the memory cost of UI
> rebuilding is avoided entirely.






[jira] [Resolved] (SPARK-33243) Add numpydoc into documentation dependency

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33243.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30149
[https://github.com/apache/spark/pull/30149]

> Add numpydoc into documentation dependency
> --
>
> Key: SPARK-33243
> URL: https://issues.apache.org/jira/browse/SPARK-33243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> To switch the docstring format, we should add the numpydoc package to Sphinx.






[jira] [Assigned] (SPARK-33243) Add numpydoc into documentation dependency

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33243:


Assignee: Hyukjin Kwon

> Add numpydoc into documentation dependency
> --
>
> Key: SPARK-33243
> URL: https://issues.apache.org/jira/browse/SPARK-33243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> To switch the docstring format, we should add the numpydoc package to Sphinx.






[jira] [Created] (SPARK-33256) Update contribution guide about NumPy documentation style

2020-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33256:


 Summary: Update contribution guide about NumPy documentation style
 Key: SPARK-33256
 URL: https://issues.apache.org/jira/browse/SPARK-33256
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


We should document that PySpark uses NumPy documentation style.






[jira] [Updated] (SPARK-33249) Add status plugin for live application

2020-10-26 Thread Weiyi Kong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyi Kong updated SPARK-33249:
---
Remaining Estimate: (was: 24h)
 Original Estimate: (was: 24h)

> Add status plugin for live application
> --
>
> Key: SPARK-33249
> URL: https://issues.apache.org/jira/browse/SPARK-33249
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Weiyi Kong
>Priority: Minor
>
> There are cases where a developer may want to extend the current REST API of
> the Web UI. In most cases, adding an external module is a better option than
> directly editing the original Spark code.
> For an external module to extend the REST API of the Web UI, two things may
> need to be done:
>  * Add extra API endpoints to provide extra status info. This can be done
> simply by implementing another ApiRequestContext, which is loaded
> automatically.
>  * If the info cannot be computed from the data already in the store, add
> extra listeners to generate it.
> For the history server, there is an interface called AppHistoryServerPlugin,
> loaded via SPI, that provides a method to create listeners. In a live
> application, the only option is spark.extraListeners, which is based on
> Utils.loadExtensions. But this is not enough for these cases.
> For the API to get the status info, the data needs to be written to the
> AppStatusStore, which is the only store an API handler can reach (via
> "ui.store" or "ui.sc.statusStore"). But listeners created by
> Utils.loadExtensions only receive a SparkConf at construction time, and so
> they cannot write to the AppStatusStore.
> So I think we still need a plugin like AppHistoryServerPlugin for the live
> UI. To address concerns like SPARK-22786, the plugin for live applications
> can be kept separate from the history server one and also loaded via
> Utils.loadExtensions, guarded by an extra configuration, so that nothing is
> loaded by default.
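As a rough sketch of the proposal, a live-application counterpart of AppHistoryServerPlugin
might look like the trait below. The name AppLiveStatusPlugin and the use of the public KVStore
type are assumptions made for illustration; this is not an existing Spark API.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.SparkListener
import org.apache.spark.util.kvstore.KVStore

// Hypothetical plugin trait, mirroring AppHistoryServerPlugin but for live applications.
// Unlike listeners registered via spark.extraListeners, implementations also receive the
// live status store, so their listeners can write custom status data that an extra REST
// endpoint can later read back through AppStatusStore.
trait AppLiveStatusPlugin {
  def createListeners(conf: SparkConf, store: KVStore): Seq[SparkListener]
}
{code}

Implementations could then be discovered with Utils.loadExtensions behind a dedicated
configuration key, so that nothing is loaded by default, as the description suggests.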






[jira] [Commented] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221126#comment-17221126
 ] 

Yang Jie commented on SPARK-33255:
--

[~hyukjin.kwon] Got it ~

> Use new API to construct ParquetFileReader and read Parquet footer
> --
>
> Key: SPARK-33255
> URL: https://issues.apache.org/jira/browse/SPARK-33255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> /**
>  * @param configuration the Hadoop conf
>  * @param fileMetaData fileMetaData for parquet file
>  * @param filePath Path for the parquet file
>  * @param blocks the blocks to read
>  * @param columns the columns to read (their path)
>  * @throws IOException if the file can not be opened
>  * @deprecated will be removed in 2.0.0.
>  */
> @Deprecated
> public ParquetFileReader(
> Configuration configuration, FileMetaData fileMetaData,
> Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns)
> throws IOException {
>  {code}
> {code:java}
> /**
>  * Reads the meta data block in the footer of the file
>  * @param configuration a configuration
>  * @param file the parquet File
>  * @param filter the filter to apply to row groups
>  * @return the metadata blocks in the footer
>  * @throws IOException if an error occurs while reading the file
>  * @deprecated will be removed in 2.0.0;
>  * use {@link ParquetFileReader#open(InputFile, 
> ParquetReadOptions)}
>  */
> @Deprecated
> public static final ParquetMetadata readFooter(Configuration configuration, 
> FileStatus file, MetadataFilter filter) throws IOException
> {code}
> {code:java}
> /**
>  * Reads the meta data in the footer of the file.
>  * Skipping row groups (or not) based on the provided filter
>  * @param configuration a configuration
>  * @param file the Parquet File
>  * @param filter the filter to apply to row groups
>  * @return the metadata with row groups filtered.
>  * @throws IOException  if an error occurs while reading the file
>  * @deprecated will be removed in 2.0.0;
>  * use {@link ParquetFileReader#open(InputFile, 
> ParquetReadOptions)}
>  */
> public static ParquetMetadata readFooter(Configuration configuration, Path 
> file, MetadataFilter filter) throws IOException{code}
>  in ParquetFileReader were marked as deprecated, use 
> {code:java}
> public ParquetFileReader(InputFile file, ParquetReadOptions options) throws 
> IOException
> {code}
> {code:java}
> public ParquetMetadata getFooter()
> {code}
> to replace them.
>  
>  
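For reference, reading a footer through the non-deprecated API named above could look like the
following Scala sketch, using parquet-hadoop's HadoopInputFile and HadoopReadOptions helpers;
the file path is a placeholder.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
// Placeholder path; in Spark this would be the Parquet file being scanned.
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), conf)
val options = HadoopReadOptions.builder(conf).build()

val reader = ParquetFileReader.open(inputFile, options)
try {
  val footer = reader.getFooter  // ParquetMetadata, replacing the deprecated readFooter(...)
  println(s"row groups: ${footer.getBlocks.size()}")
} finally {
  reader.close()
}
{code}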






[jira] [Updated] (SPARK-33249) Add status plugin for live application

2020-10-26 Thread Weiyi Kong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyi Kong updated SPARK-33249:
---
Remaining Estimate: 24h
 Original Estimate: 24h

> Add status plugin for live application
> --
>
> Key: SPARK-33249
> URL: https://issues.apache.org/jira/browse/SPARK-33249
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Weiyi Kong
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> There are cases where a developer may want to extend the current REST API of
> the Web UI. In most cases, adding an external module is a better option than
> directly editing the original Spark code.
> For an external module to extend the REST API of the Web UI, two things may
> need to be done:
>  * Add extra API endpoints to provide extra status info. This can be done
> simply by implementing another ApiRequestContext, which is loaded
> automatically.
>  * If the info cannot be computed from the data already in the store, add
> extra listeners to generate it.
> For the history server, there is an interface called AppHistoryServerPlugin,
> loaded via SPI, that provides a method to create listeners. In a live
> application, the only option is spark.extraListeners, which is based on
> Utils.loadExtensions. But this is not enough for these cases.
> For the API to get the status info, the data needs to be written to the
> AppStatusStore, which is the only store an API handler can reach (via
> "ui.store" or "ui.sc.statusStore"). But listeners created by
> Utils.loadExtensions only receive a SparkConf at construction time, and so
> they cannot write to the AppStatusStore.
> So I think we still need a plugin like AppHistoryServerPlugin for the live
> UI. To address concerns like SPARK-22786, the plugin for live applications
> can be kept separate from the history server one and also loaded via
> Utils.loadExtensions, guarded by an extra configuration, so that nothing is
> loaded by default.






[jira] [Commented] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221122#comment-17221122
 ] 

Hyukjin Kwon commented on SPARK-33250:
--

I'll work on this one.

> Migration to NumPy documentation style in SQL (pyspark.sql.*)
> -
>
> Key: SPARK-33250
> URL: https://issues.apache.org/jira/browse/SPARK-33250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Migration to NumPy documentation style in SQL (pyspark.sql.*)






[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-33255:
-
Description: 
{code:java}
/**
 * @param configuration the Hadoop conf
 * @param fileMetaData fileMetaData for parquet file
 * @param filePath Path for the parquet file
 * @param blocks the blocks to read
 * @param columns the columns to read (their path)
 * @throws IOException if the file can not be opened
 * @deprecated will be removed in 2.0.0.
 */
@Deprecated
public ParquetFileReader(
Configuration configuration, FileMetaData fileMetaData,
Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns) 
throws IOException {
 {code}
{code:java}
/**
 * Reads the meta data block in the footer of the file
 * @param configuration a configuration
 * @param file the parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata blocks in the footer
 * @throws IOException if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
@Deprecated
public static final ParquetMetadata readFooter(Configuration configuration, 
FileStatus file, MetadataFilter filter) throws IOException
{code}
{code:java}
/**
 * Reads the meta data in the footer of the file.
 * Skipping row groups (or not) based on the provided filter
 * @param configuration a configuration
 * @param file the Parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata with row groups filtered.
 * @throws IOException  if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
public static ParquetMetadata readFooter(Configuration configuration, Path 
file, MetadataFilter filter) throws IOException{code}
 in ParquetFileReader were marked as deprecated, use 
{code:java}
public ParquetFileReader(InputFile file, ParquetReadOptions options) throws 
IOException
{code}
{code:java}
public ParquetMetadata getFooter()
{code}
 to replace them.

 

 

  was:
{code:java}
/**
 * @param configuration the Hadoop conf
 * @param fileMetaData fileMetaData for parquet file
 * @param filePath Path for the parquet file
 * @param blocks the blocks to read
 * @param columns the columns to read (their path)
 * @throws IOException if the file can not be opened
 * @deprecated will be removed in 2.0.0.
 */
@Deprecated
public ParquetFileReader(
Configuration configuration, FileMetaData fileMetaData,
Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns) 
throws IOException {
 {code}
{code:java}
/**
 * Reads the meta data block in the footer of the file
 * @param configuration a configuration
 * @param file the parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata blocks in the footer
 * @throws IOException if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
@Deprecated
public static final ParquetMetadata readFooter(Configuration configuration, 
FileStatus file, MetadataFilter filter) throws IOException
{code}
{code:java}
/**
 * Reads the meta data in the footer of the file.
 * Skipping row groups (or not) based on the provided filter
 * @param configuration a configuration
 * @param file the Parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata with row groups filtered.
 * @throws IOException  if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
public static ParquetMetadata readFooter(Configuration configuration, Path 
file, MetadataFilter filter) throws IOException{code}
 in ParquetFileReader were marked as deprecated, use 
{code:java}
public ParquetFileReader(InputFile file, ParquetReadOptions options) throws 
IOException
{code}
{code:java}
/**
 * Open a {@link InputFile file} with {@link ParquetReadOptions options}.
 *
 * @param file an input file
 * @param options parquet read options
 * @return an open ParquetFileReader
 * @throws IOException if there is an error while opening the file
 */
public static ParquetFileReader open(InputFile file, ParquetReadOptions 
options) throws IOException 
{code}
{code:java}
public ParquetMetadata getFooter()
{code}
 to instead of them.

 

 


> Use new API to construct ParquetFileReader and read Parquet footer
> --
>
> Key: SPARK-33255
> URL: https://issues.apache.org/jira/browse/SPARK-33255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> /**
>  * @param configuration 

[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-33255:
-
Description: 
{code:java}
/**
 * @param configuration the Hadoop conf
 * @param fileMetaData fileMetaData for parquet file
 * @param filePath Path for the parquet file
 * @param blocks the blocks to read
 * @param columns the columns to read (their path)
 * @throws IOException if the file can not be opened
 * @deprecated will be removed in 2.0.0.
 */
@Deprecated
public ParquetFileReader(
Configuration configuration, FileMetaData fileMetaData,
Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns) 
throws IOException {
 {code}
{code:java}
/**
 * Reads the meta data block in the footer of the file
 * @param configuration a configuration
 * @param file the parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata blocks in the footer
 * @throws IOException if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
@Deprecated
public static final ParquetMetadata readFooter(Configuration configuration, 
FileStatus file, MetadataFilter filter) throws IOException
{code}
{code:java}
/**
 * Reads the meta data in the footer of the file.
 * Skipping row groups (or not) based on the provided filter
 * @param configuration a configuration
 * @param file the Parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata with row groups filtered.
 * @throws IOException  if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
public static ParquetMetadata readFooter(Configuration configuration, Path 
file, MetadataFilter filter) throws IOException{code}
 in ParquetFileReader were marked as deprecated, use 
{code:java}
public ParquetFileReader(InputFile file, ParquetReadOptions options) throws 
IOException
{code}
{code:java}
/**
 * Open a {@link InputFile file} with {@link ParquetReadOptions options}.
 *
 * @param file an input file
 * @param options parquet read options
 * @return an open ParquetFileReader
 * @throws IOException if there is an error while opening the file
 */
public static ParquetFileReader open(InputFile file, ParquetReadOptions 
options) throws IOException 
{code}
{code:java}
public ParquetMetadata getFooter()
{code}
 to replace them.

 

 

  was:
{code:java}
/**
 * @param configuration the Hadoop conf
 * @param fileMetaData fileMetaData for parquet file
 * @param filePath Path for the parquet file
 * @param blocks the blocks to read
 * @param columns the columns to read (their path)
 * @throws IOException if the file can not be opened
 * @deprecated will be removed in 2.0.0.
 */
@Deprecated
public ParquetFileReader(
Configuration configuration, FileMetaData fileMetaData,
Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns) 
throws IOException {
{code}
 

,

 
{code:java}
/**
 * Reads the meta data block in the footer of the file
 * @param configuration a configuration
 * @param file the parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata blocks in the footer
 * @throws IOException if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
@Deprecated
public static final ParquetMetadata readFooter(Configuration configuration, 
FileStatus file, MetadataFilter filter) throws IOException
{code}
 

 
{code:java}
/**
 * Reads the meta data in the footer of the file.
 * Skipping row groups (or not) based on the provided filter
 * @param configuration a configuration
 * @param file the Parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata with row groups filtered.
 * @throws IOException  if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
public static ParquetMetadata readFooter(Configuration configuration, Path 
file, MetadataFilter filter) throws IOException{code}
 

 

in ParquetFileReader were marked as deprecated, use 

 

 
{code:java}
public ParquetFileReader(InputFile file, ParquetReadOptions options) throws 
IOException
{code}
 
{code:java}
/**
 * Open a {@link InputFile file} with {@link ParquetReadOptions options}.
 *
 * @param file an input file
 * @param options parquet read options
 * @return an open ParquetFileReader
 * @throws IOException if there is an error while opening the file
 */
public static ParquetFileReader open(InputFile file, ParquetReadOptions 
options) throws IOException 
{code}
{code:java}
public ParquetMetadata getFooter()
{code}
 to instead of them.

 

 


> Use new API to construct ParquetFileReader and read Parquet footer
> ---

[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-33255:
-
Description: 
{code:java}
/**
 * @param configuration the Hadoop conf
 * @param fileMetaData fileMetaData for parquet file
 * @param filePath Path for the parquet file
 * @param blocks the blocks to read
 * @param columns the columns to read (their path)
 * @throws IOException if the file can not be opened
 * @deprecated will be removed in 2.0.0.
 */
@Deprecated
public ParquetFileReader(
Configuration configuration, FileMetaData fileMetaData,
Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns) 
throws IOException {
{code}
 

,

 
{code:java}
/**
 * Reads the meta data block in the footer of the file
 * @param configuration a configuration
 * @param file the parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata blocks in the footer
 * @throws IOException if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
@Deprecated
public static final ParquetMetadata readFooter(Configuration configuration, 
FileStatus file, MetadataFilter filter) throws IOException
{code}
 

 
{code:java}
/**
 * Reads the meta data in the footer of the file.
 * Skipping row groups (or not) based on the provided filter
 * @param configuration a configuration
 * @param file the Parquet File
 * @param filter the filter to apply to row groups
 * @return the metadata with row groups filtered.
 * @throws IOException  if an error occurs while reading the file
 * @deprecated will be removed in 2.0.0;
 * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)}
 */
public static ParquetMetadata readFooter(Configuration configuration, Path 
file, MetadataFilter filter) throws IOException{code}
 

 

in ParquetFileReader were marked as deprecated, use 

 

 
{code:java}
public ParquetFileReader(InputFile file, ParquetReadOptions options) throws 
IOException
{code}
 
{code:java}
/**
 * Open a {@link InputFile file} with {@link ParquetReadOptions options}.
 *
 * @param file an input file
 * @param options parquet read options
 * @return an open ParquetFileReader
 * @throws IOException if there is an error while opening the file
 */
public static ParquetFileReader open(InputFile file, ParquetReadOptions 
options) throws IOException 
{code}
{code:java}
public ParquetMetadata getFooter()
{code}
 to replace them.

 

 

  was:
{code:java}
/**
 * @param configuration the Hadoop conf
 * @param fileMetaData fileMetaData for parquet file
 * @param filePath Path for the parquet file
 * @param blocks the blocks to read
 * @param columns the columns to read (their path)
 * @throws IOException if the file can not be opened
 * @deprecated will be removed in 2.0.0.
 */
@Deprecated
public ParquetFileReader(
Configuration configuration, FileMetaData fileMetaData,
Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns) 
throws IOException {
{code}
and 

 

 


> Use new API to construct ParquetFileReader and read Parquet footer
> --
>
> Key: SPARK-33255
> URL: https://issues.apache.org/jira/browse/SPARK-33255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> /**
>  * @param configuration the Hadoop conf
>  * @param fileMetaData fileMetaData for parquet file
>  * @param filePath Path for the parquet file
>  * @param blocks the blocks to read
>  * @param columns the columns to read (their path)
>  * @throws IOException if the file can not be opened
>  * @deprecated will be removed in 2.0.0.
>  */
> @Deprecated
> public ParquetFileReader(
> Configuration configuration, FileMetaData fileMetaData,
> Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns)
> throws IOException {
> {code}
>  
> ,
>  
> {code:java}
> /**
>  * Reads the meta data block in the footer of the file
>  * @param configuration a configuration
>  * @param file the parquet File
>  * @param filter the filter to apply to row groups
>  * @return the metadata blocks in the footer
>  * @throws IOException if an error occurs while reading the file
>  * @deprecated will be removed in 2.0.0;
>  * use {@link ParquetFileReader#open(InputFile, 
> ParquetReadOptions)}
>  */
> @Deprecated
> public static final ParquetMetadata readFooter(Configuration configuration, 
> FileStatus file, MetadataFilter filter) throws IOException
> {code}
>  
>  
> {code:java}
> /**
>  * Reads the meta data in the footer of the file.
>  * Skipping row groups (or not) based on the provided filter
>  * @param configuration a configuration
>  * @param file the Parquet File
>  * @param filter the filter to apply to row groups
>  * @return th

[jira] [Resolved] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33255.
--
Resolution: Duplicate

> Use new API to construct ParquetFileReader and read Parquet footer
> --
>
> Key: SPARK-33255
> URL: https://issues.apache.org/jira/browse/SPARK-33255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> /**
>  * @param configuration the Hadoop conf
>  * @param fileMetaData fileMetaData for parquet file
>  * @param filePath Path for the parquet file
>  * @param blocks the blocks to read
>  * @param columns the columns to read (their path)
>  * @throws IOException if the file can not be opened
>  * @deprecated will be removed in 2.0.0.
>  */
> @Deprecated
> public ParquetFileReader(
> Configuration configuration, FileMetaData fileMetaData,
> Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns)
> throws IOException {
> {code}
> and 
>  
>  






[jira] [Commented] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221119#comment-17221119
 ] 

Hyukjin Kwon commented on SPARK-33255:
--

We can't replace this now. See also 
https://github.com/apache/spark/pull/29542#pullrequestreview-478269264

> Use new API to construct ParquetFileReader and read Parquet footer
> --
>
> Key: SPARK-33255
> URL: https://issues.apache.org/jira/browse/SPARK-33255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> /**
>  * @param configuration the Hadoop conf
>  * @param fileMetaData fileMetaData for parquet file
>  * @param filePath Path for the parquet file
>  * @param blocks the blocks to read
>  * @param columns the columns to read (their path)
>  * @throws IOException if the file can not be opened
>  * @deprecated will be removed in 2.0.0.
>  */
> @Deprecated
> public ParquetFileReader(
> Configuration configuration, FileMetaData fileMetaData,
> Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns)
> throws IOException {
> {code}
> and 
>  
>  






[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-33255:
-
Description: 
{code:java}
/**
 * @param configuration the Hadoop conf
 * @param fileMetaData fileMetaData for parquet file
 * @param filePath Path for the parquet file
 * @param blocks the blocks to read
 * @param columns the columns to read (their path)
 * @throws IOException if the file can not be opened
 * @deprecated will be removed in 2.0.0.
 */
@Deprecated
public ParquetFileReader(
Configuration configuration, FileMetaData fileMetaData,
Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns) 
throws IOException {
{code}
and 

 

 

> Use new API to construct ParquetFileReader and read Parquet footer
> --
>
> Key: SPARK-33255
> URL: https://issues.apache.org/jira/browse/SPARK-33255
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Priority: Major
>
> {code:java}
> /**
>  * @param configuration the Hadoop conf
>  * @param fileMetaData fileMetaData for parquet file
>  * @param filePath Path for the parquet file
>  * @param blocks the blocks to read
>  * @param columns the columns to read (their path)
>  * @throws IOException if the file can not be opened
>  * @deprecated will be removed in 2.0.0.
>  */
> @Deprecated
> public ParquetFileReader(
> Configuration configuration, FileMetaData fileMetaData,
> Path filePath, List<BlockMetaData> blocks, List<ColumnDescriptor> columns)
> throws IOException {
> {code}
> and 
>  
>  






[jira] [Updated] (SPARK-32085) Migrate to NumPy documentation style

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32085:
-
Description: 
https://github.com/numpy/numpydoc

For example,

Before: 
https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318
 After: 
https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176

We can incrementally start to switch.

NOTE that this JIRA targets only switching the style. It does not aim to add
additional information or other fixes at the same time.

  was:
https://github.com/numpy/numpydoc

For example,

Before: 
https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318
 After: 
https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176

We can incrementally start to switch.


> Migrate to NumPy documentation style
> 
>
> Key: SPARK-32085
> URL: https://issues.apache.org/jira/browse/SPARK-32085
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/numpy/numpydoc
> For example,
> Before: 
> https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318
>  After: 
> https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176
> We can incrementally start to switch.
> NOTE that this JIRA targets only switching the style. It does not aim to add
> additional information or other fixes at the same time.






[jira] [Updated] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33254:
-
Description: This JIRA targets to migrate to NumPy documentation style in 

> Migration to NumPy documentation style in Core (pyspark.*, 
> pyspark.resource.*, etc.)
> 
>
> Key: SPARK-33254
> URL: https://issues.apache.org/jira/browse/SPARK-33254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets to migrate to NumPy documentation style in 






[jira] [Updated] (SPARK-33251) Migration to NumPy documentation style in ML (pyspark.ml.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33251:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
MLlib (pyspark.mllib.*). Please also see the parent JIRA.

> Migration to NumPy documentation style in ML (pyspark.ml.*)
> ---
>
> Key: SPARK-33251
> URL: https://issues.apache.org/jira/browse/SPARK-33251
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in MLlib 
> (pyspark.mllib.*). Please also see the parent JIRA.






[jira] [Updated] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33250:
-
Description: Migration to NumPy documentation style in ML (pyspark.ml.*)

> Migration to NumPy documentation style in SQL (pyspark.sql.*)
> -
>
> Key: SPARK-33250
> URL: https://issues.apache.org/jira/browse/SPARK-33250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Migration to NumPy documentation style in ML (pyspark.ml.*)






[jira] [Updated] (SPARK-33251) Migration to NumPy documentation style in ML (pyspark.ml.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33251:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
ML (pyspark.ml.*). Please also see the parent JIRA.  (was:  This JIRA targets 
to migrate to NumPy documentation style in MLlib (pyspark.mllib.*). Please also 
see the parent JIRA.)

> Migration to NumPy documentation style in ML (pyspark.ml.*)
> ---
>
> Key: SPARK-33251
> URL: https://issues.apache.org/jira/browse/SPARK-33251
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in ML 
> (pyspark.ml.*). Please also see the parent JIRA.






[jira] [Commented] (SPARK-32085) Migrate to NumPy documentation style

2020-10-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221099#comment-17221099
 ] 

Hyukjin Kwon commented on SPARK-32085:
--

cc [~zero323] in case you're interested in some of the sub-tasks.

> Migrate to NumPy documentation style
> 
>
> Key: SPARK-32085
> URL: https://issues.apache.org/jira/browse/SPARK-32085
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/numpy/numpydoc
> For example,
> Before: 
> https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318
>  After: 
> https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176
> We can incrementally start to switch.






[jira] [Updated] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33250:
-
Description: Migration to NumPy documentation style in SQL (pyspark.sql.*)  
(was: Migration to NumPy documentation style in ML (pyspark.ml.*))

> Migration to NumPy documentation style in SQL (pyspark.sql.*)
> -
>
> Key: SPARK-33250
> URL: https://issues.apache.org/jira/browse/SPARK-33250
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Migration to NumPy documentation style in SQL (pyspark.sql.*)






[jira] [Updated] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33254:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
Core (pyspark.\*, pyspark.resource.\*, etc.). Please also see the parent JIRA.  
(was:  This JIRA targets to migrate to NumPy documentation style in Core 
(pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA.)

> Migration to NumPy documentation style in Core (pyspark.*, 
> pyspark.resource.*, etc.)
> 
>
> Key: SPARK-33254
> URL: https://issues.apache.org/jira/browse/SPARK-33254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Core 
> (pyspark.\*, pyspark.resource.\*, etc.). Please also see the parent JIRA.






[jira] [Commented] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221116#comment-17221116
 ] 

Hyukjin Kwon commented on SPARK-33246:
--

[~stwhit] Apache Spark uses pull requests on GitHub to apply patches. See also 
https://spark.apache.org/contributing.html

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Priority: Trivial
> Attachments: null-semantics.patch
>
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|  <- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}






[jira] [Updated] (SPARK-33253) Migration to NumPy documentation style in Streaming (pyspark.streaming.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33253:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
Streaming (pyspark.streaming.*). Please also see the parent JIRA.  (was:  This 
JIRA targets to migrate to NumPy documentation style in Core (pyspark.*, 
pyspark.resource.*, etc.). Please also see the parent JIRA.)

> Migration to NumPy documentation style in Streaming (pyspark.streaming.*)
> -
>
> Key: SPARK-33253
> URL: https://issues.apache.org/jira/browse/SPARK-33253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Streaming 
> (pyspark.streaming.*). Please also see the parent JIRA.






[jira] [Updated] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33254:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
Core (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA.  
(was: This JIRA targets to migrate to NumPy documentation style in )

> Migration to NumPy documentation style in Core (pyspark.*, 
> pyspark.resource.*, etc.)
> 
>
> Key: SPARK-33254
> URL: https://issues.apache.org/jira/browse/SPARK-33254
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Core 
> (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA.






[jira] [Updated] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33252:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
Streaming (pyspark.streaming.*). Please also see the parent JIRA.

> Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
> -
>
> Key: SPARK-33252
> URL: https://issues.apache.org/jira/browse/SPARK-33252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Streaming 
> (pyspark.streaming.*). Please also see the parent JIRA.






[jira] [Updated] (SPARK-33253) Migration to NumPy documentation style in Streaming (pyspark.streaming.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33253:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
Core (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA.

> Migration to NumPy documentation style in Streaming (pyspark.streaming.*)
> -
>
> Key: SPARK-33253
> URL: https://issues.apache.org/jira/browse/SPARK-33253
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in Core 
> (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA.






[jira] [Updated] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33252:
-
Description:  This JIRA targets to migrate to NumPy documentation style in 
MLlib (pyspark.mllib.*). Please also see the parent JIRA.  (was:  This JIRA 
targets to migrate to NumPy documentation style in Streaming 
(pyspark.streaming.*). Please also see the parent JIRA.)

> Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
> -
>
> Key: SPARK-33252
> URL: https://issues.apache.org/jira/browse/SPARK-33252
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
>  This JIRA targets to migrate to NumPy documentation style in MLlib 
> (pyspark.mllib.*). Please also see the parent JIRA.






[jira] [Created] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)

2020-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33254:


 Summary: Migration to NumPy documentation style in Core 
(pyspark.*, pyspark.resource.*, etc.)
 Key: SPARK-33254
 URL: https://issues.apache.org/jira/browse/SPARK-33254
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon









[jira] [Created] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)

2020-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33252:


 Summary: Migration to NumPy documentation style in MLlib 
(pyspark.mllib.*)
 Key: SPARK-33252
 URL: https://issues.apache.org/jira/browse/SPARK-33252
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon









[jira] [Created] (SPARK-33253) Migration to NumPy documentation style in Streaming (pyspark.streaming.*)

2020-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33253:


 Summary: Migration to NumPy documentation style in Streaming 
(pyspark.streaming.*)
 Key: SPARK-33253
 URL: https://issues.apache.org/jira/browse/SPARK-33253
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon









[jira] [Created] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)

2020-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33250:


 Summary: Migration to NumPy documentation style in SQL 
(pyspark.sql.*)
 Key: SPARK-33250
 URL: https://issues.apache.org/jira/browse/SPARK-33250
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon









[jira] [Created] (SPARK-33251) Migration to NumPy documentation style in ML (pyspark.ml.*)

2020-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33251:


 Summary: Migration to NumPy documentation style in ML 
(pyspark.ml.*)
 Key: SPARK-33251
 URL: https://issues.apache.org/jira/browse/SPARK-33251
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon









[jira] [Created] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer

2020-10-26 Thread Yang Jie (Jira)
Yang Jie created SPARK-33255:


 Summary: Use new API to construct ParquetFileReader and read 
Parquet footer
 Key: SPARK-33255
 URL: https://issues.apache.org/jira/browse/SPARK-33255
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yang Jie









[jira] [Updated] (SPARK-33249) Add status plugin for live application

2020-10-26 Thread Weiyi Kong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyi Kong updated SPARK-33249:
---
Description: 
There are cases where a developer may want to extend the current REST API of the
Web UI. In most cases, adding an external module is a better option than directly
editing the original Spark code.

For an external module to extend the REST API of the Web UI, two things may need
to be done:
 * Add extra API endpoints to provide extra status info. This can be done simply
by implementing another ApiRequestContext, which is loaded automatically.
 * If the info cannot be computed from the data already in the store, add extra
listeners to generate it.

For the history server, there is an interface called AppHistoryServerPlugin, which
is loaded via SPI and provides a method to create listeners. In a live
application, the only option is spark.extraListeners, which is based on
Utils.loadExtensions. But this is not enough for these cases.

For the API to get the status info, the data needs to be written to the
AppStatusStore, which is the only store an API handler can reach (via
"ui.store" or "ui.sc.statusStore"). But listeners created by
Utils.loadExtensions only receive a SparkConf at construction time, and so they
cannot write to the AppStatusStore.

So I think we still need a plugin like AppHistoryServerPlugin for the live UI. To
address concerns like [#SPARK-22786], the plugin for live applications can be kept
separate from the history server one and also loaded via Utils.loadExtensions,
guarded by an extra configuration, so that nothing is loaded by default.

  was:
There are cases where a developer may want to extend the current REST API of the 
Web UI. In most cases, adding an external module is a better option than directly 
editing the original Spark code.

For an external module to extend the REST API of the Web UI, 2 things may need 
to be done:
 * Add an extra API to provide extra status info. This can simply be done by 
implementing another ApiRequestContext, which will be loaded automatically.
Add extra listeners to generate the status info if it cannot be calculated 
from the original data. This brings the issue.

For the history server, there is an interface called AppHistoryServerPlugin, which 
is loaded based on SPI and provides a method to create listeners. In a live 
application, the only way is spark.extraListeners, based on 
Utils.loadExtensions. But this is not enough for these cases.

To let the API get the status info, the data needs to be written to the 
AppStatusStore, which is the only store an API can reach by accessing 
"ui.store" or "ui.sc.statusStore". But listeners created by 
Utils.loadExtensions only get a SparkConf at construction, and are unable to 
write to the AppStatusStore.

So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For 
concerns like [#SPARK-22786], the plugin for live applications can be separated 
from the history server one, and also loaded using Utils.loadExtensions with an 
extra configuration. So by default, nothing will be loaded.


> Add status plugin for live application
> --
>
> Key: SPARK-33249
> URL: https://issues.apache.org/jira/browse/SPARK-33249
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Weiyi Kong
>Priority: Minor
>
> There are cases where a developer may want to extend the current REST API of 
> the Web UI. In most cases, adding an external module is a better option than 
> directly editing the original Spark code.
> For an external module to extend the REST API of the Web UI, 2 things may 
> need to be done:
>  * Add an extra API to provide extra status info. This can simply be done by 
> implementing another ApiRequestContext, which will be loaded automatically.
>  * If the info cannot be calculated from the original data in the store, add 
> extra listeners to generate it.
> For the history server, there is an interface called AppHistoryServerPlugin, 
> which is loaded based on SPI and provides a method to create listeners. In a 
> live application, the only way is spark.extraListeners, based on 
> Utils.loadExtensions. But this is not enough for these cases.
> To let the API get the status info, the data needs to be written to the 
> AppStatusStore, which is the only store an API can reach by accessing 
> "ui.store" or "ui.sc.statusStore". But listeners created by 
> Utils.loadExtensions only get a SparkConf at construction, and are unable to 
> write to the AppStatusStore.
> So I think we still need a plugin like AppHistoryServerPlugin for the live 
> UI. For concerns like [#SPARK-22786], the plugin for live applications can be 
> separated from the history server one, and also loaded using 
> Utils.loadExtensions with an extra configuration. So by default, nothing will 
> be loaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-

[jira] [Updated] (SPARK-33249) Add status plugin for live application

2020-10-26 Thread Weiyi Kong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weiyi Kong updated SPARK-33249:
---
Description: 
There are cases where a developer may want to extend the current REST API of the 
Web UI. In most cases, adding an external module is a better option than directly 
editing the original Spark code.

For an external module to extend the REST API of the Web UI, 2 things may need 
to be done:
 * Add an extra API to provide extra status info. This can simply be done by 
implementing another ApiRequestContext, which will be loaded automatically.
 * If the info cannot be calculated from the original data in the store, add 
extra listeners to generate it.

For the history server, there is an interface called AppHistoryServerPlugin, which 
is loaded based on SPI and provides a method to create listeners. In a live 
application, the only way is spark.extraListeners, based on 
Utils.loadExtensions. But this is not enough for these cases.

To let the API get the status info, the data needs to be written to the 
AppStatusStore, which is the only store an API can reach by accessing 
"ui.store" or "ui.sc.statusStore". But listeners created by 
Utils.loadExtensions only get a SparkConf at construction, and are unable to 
write to the AppStatusStore.

So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For 
concerns like SPARK-22786, the plugin for live applications can be separated 
from the history server one, and also loaded using Utils.loadExtensions with an 
extra configuration. So by default, nothing will be loaded.

  was:
There are cases where a developer may want to extend the current REST API of the 
Web UI. In most cases, adding an external module is a better option than directly 
editing the original Spark code.

For an external module to extend the REST API of the Web UI, 2 things may need 
to be done:
 * Add an extra API to provide extra status info. This can simply be done by 
implementing another ApiRequestContext, which will be loaded automatically.
 * If the info cannot be calculated from the original data in the store, add 
extra listeners to generate it.

For the history server, there is an interface called AppHistoryServerPlugin, which 
is loaded based on SPI and provides a method to create listeners. In a live 
application, the only way is spark.extraListeners, based on 
Utils.loadExtensions. But this is not enough for these cases.

To let the API get the status info, the data needs to be written to the 
AppStatusStore, which is the only store an API can reach by accessing 
"ui.store" or "ui.sc.statusStore". But listeners created by 
Utils.loadExtensions only get a SparkConf at construction, and are unable to 
write to the AppStatusStore.

So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For 
concerns like [#SPARK-22786], the plugin for live applications can be separated 
from the history server one, and also loaded using Utils.loadExtensions with an 
extra configuration. So by default, nothing will be loaded.


> Add status plugin for live application
> --
>
> Key: SPARK-33249
> URL: https://issues.apache.org/jira/browse/SPARK-33249
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Weiyi Kong
>Priority: Minor
>
> There are cases where a developer may want to extend the current REST API of 
> the Web UI. In most cases, adding an external module is a better option than 
> directly editing the original Spark code.
> For an external module to extend the REST API of the Web UI, 2 things may 
> need to be done:
>  * Add an extra API to provide extra status info. This can simply be done by 
> implementing another ApiRequestContext, which will be loaded automatically.
>  * If the info cannot be calculated from the original data in the store, add 
> extra listeners to generate it.
> For the history server, there is an interface called AppHistoryServerPlugin, 
> which is loaded based on SPI and provides a method to create listeners. In a 
> live application, the only way is spark.extraListeners, based on 
> Utils.loadExtensions. But this is not enough for these cases.
> To let the API get the status info, the data needs to be written to the 
> AppStatusStore, which is the only store an API can reach by accessing 
> "ui.store" or "ui.sc.statusStore". But listeners created by 
> Utils.loadExtensions only get a SparkConf at construction, and are unable to 
> write to the AppStatusStore.
> So I think we still need a plugin like AppHistoryServerPlugin for the live 
> UI. For concerns like SPARK-22786, the plugin for live applications can be 
> separated from the history server one, and also loaded using 
> Utils.loadExtensions with an extra configuration. So by default, nothing will 
> be loaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

--

[jira] [Commented] (SPARK-33248) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221087#comment-17221087
 ] 

Apache Spark commented on SPARK-33248:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30156

> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
> --
>
> Key: SPARK-33248
> URL: https://issues.apache.org/jira/browse/SPARK-33248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
>  
> FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33248) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33248:


Assignee: Apache Spark

> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
> --
>
> Key: SPARK-33248
> URL: https://issues.apache.org/jira/browse/SPARK-33248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
>  
> FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33248) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221084#comment-17221084
 ] 

Apache Spark commented on SPARK-33248:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/30156

> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
> --
>
> Key: SPARK-33248
> URL: https://issues.apache.org/jira/browse/SPARK-33248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
>  
> FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33248) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33248:


Assignee: (was: Apache Spark)

> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
> --
>
> Key: SPARK-33248
> URL: https://issues.apache.org/jira/browse/SPARK-33248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
>  
> FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32085) Migrate to NumPy documentation style

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32085:
-
Affects Version/s: (was: 3.0.0)
   3.1.0

> Migrate to NumPy documentation style
> 
>
> Key: SPARK-32085
> URL: https://issues.apache.org/jira/browse/SPARK-32085
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://github.com/numpy/numpydoc
> For example,
> Before: 
> https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318
>  After: 
> https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176
> We can incrementally start to switch.
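
For a rough sense of what the switch looks like on a single method, here is a 
minimal before/after sketch; the method and its docstring wording are 
illustrative, not copied from the linked files:
{code:python}
# Before: reST-style field lists
def sample(self, fraction, seed=None):
    """Returns a sampled subset of this DataFrame.

    :param fraction: fraction of rows to generate, in [0.0, 1.0].
    :param seed: seed for sampling; a random seed is used if not set.
    :return: a new sampled DataFrame.
    """

# After: NumPy (numpydoc) style sections
def sample(self, fraction, seed=None):
    """Returns a sampled subset of this DataFrame.

    Parameters
    ----------
    fraction : float
        Fraction of rows to generate, in [0.0, 1.0].
    seed : int, optional
        Seed for sampling; a random seed is used if not set.

    Returns
    -------
    DataFrame
        A new sampled DataFrame.
    """
{code}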



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33249) Add status plugin for live application

2020-10-26 Thread Weiyi Kong (Jira)
Weiyi Kong created SPARK-33249:
--

 Summary: Add status plugin for live application
 Key: SPARK-33249
 URL: https://issues.apache.org/jira/browse/SPARK-33249
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Affects Versions: 3.0.1, 2.4.7
Reporter: Weiyi Kong


There are cases where a developer may want to extend the current REST API of the 
Web UI. In most cases, adding an external module is a better option than directly 
editing the original Spark code.

For an external module to extend the REST API of the Web UI, 2 things may need 
to be done:
 * Add an extra API to provide extra status info. This can simply be done by 
implementing another ApiRequestContext, which will be loaded automatically.
Add extra listeners to generate the status info if it cannot be calculated 
from the original data. This brings the issue.

For the history server, there is an interface called AppHistoryServerPlugin, which 
is loaded based on SPI and provides a method to create listeners. In a live 
application, the only way is spark.extraListeners, based on 
Utils.loadExtensions. But this is not enough for these cases.

To let the API get the status info, the data needs to be written to the 
AppStatusStore, which is the only store an API can reach by accessing 
"ui.store" or "ui.sc.statusStore". But listeners created by 
Utils.loadExtensions only get a SparkConf at construction, and are unable to 
write to the AppStatusStore.

So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For 
concerns like [#SPARK-22786], the plugin for live applications can be separated 
from the history server one, and also loaded using Utils.loadExtensions with an 
extra configuration. So by default, nothing will be loaded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32084:


Assignee: Maciej Szymkiewicz

> Replace dictionary-based function definitions to proper functions in 
> functions.py
> -
>
> Key: SPARK-32084
> URL: https://issues.apache.org/jira/browse/SPARK-32084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Currently some functions in {{functions.py}} are defined by a dictionary. It 
> programmatically defines the functions in the module; however, this keeps some 
> IDEs such as PyCharm from detecting them.
> Also, it makes it hard to add proper examples in the docstrings.
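
For context, a self-contained toy sketch of the two styles; the helper and 
function bodies here are illustrative stand-ins, not the module's real internals:
{code:python}
# Dictionary-driven style: functions are generated in a loop and injected into
# the module namespace, so IDEs such as PyCharm cannot see them statically.
def _make_function(name, doc):  # illustrative stand-in for the real helper
    def _(col):
        raise NotImplementedError("stand-in for the JVM-backed implementation")
    _.__name__, _.__doc__ = name, doc
    return _

_functions = {
    "sqrt": "Computes the square root of the given column.",
    "abs": "Computes the absolute value of the given column.",
}
for _name, _doc in _functions.items():
    globals()[_name] = _make_function(_name, _doc)

# Explicit style the change moves to: statically visible to IDEs, and the
# docstring has room for proper examples.
def sqrt(col):
    """Computes the square root of the given column.

    Examples
    --------
    >>> df.select(sqrt(df.value)).show()  # doctest: +SKIP
    """
    raise NotImplementedError("stand-in for the JVM-backed implementation")
{code}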



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32084.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30143
[https://github.com/apache/spark/pull/30143]

> Replace dictionary-based function definitions to proper functions in 
> functions.py
> -
>
> Key: SPARK-32084
> URL: https://issues.apache.org/jira/browse/SPARK-32084
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently some functions in {{functions.py}} are defined by a dictionary. It 
> programmatically defines the functions in the module; however, this keeps some 
> IDEs such as PyCharm from detecting them.
> Also, it makes it hard to add proper examples in the docstrings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33248) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-26 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33248:
--
Description: 
Add a configuration to control the legacy behavior of whether to pad null 
values when the value size is less than the schema size

 

FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691]

  was:Add a configuration to control the legacy behavior of whether to pad 
null values when the value size is less than the schema size


> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
> --
>
> Key: SPARK-33248
> URL: https://issues.apache.org/jira/browse/SPARK-33248
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
>  
> FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33238) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-26 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-33238.
---
Resolution: Duplicate

> Add a configuration to control the legacy behavior of whether need to pad 
> null value when value size less then schema size
> --
>
> Key: SPARK-33238
> URL: https://issues.apache.org/jira/browse/SPARK-33238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> FOR comment https://github.com/apache/spark/pull/29421#discussion_r511684691



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33248) Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size

2020-10-26 Thread angerszhu (Jira)
angerszhu created SPARK-33248:
-

 Summary: Add a configuration to control the legacy behavior of 
whether need to pad null value when value size less then schema size
 Key: SPARK-33248
 URL: https://issues.apache.org/jira/browse/SPARK-33248
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.1
Reporter: angerszhu


Add a configuration to control the legacy behavior of whether to pad null 
values when the value size is less than the schema size



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33247) Improve examples and scenarios in docstrings

2020-10-26 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-33247:


 Summary: Improve examples and scenarios in docstrings 
 Key: SPARK-33247
 URL: https://issues.apache.org/jira/browse/SPARK-33247
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


Currently, PySpark documentation does not have a lot of examples and scenarios. 
See also https://github.com/apache/spark/pull/30149#issuecomment-716490037.

We should add/improve examples, especially in the commonly used APIs, for 
example {{Column}}, {{DataFrame}}, {{RDD}}, {{SparkContext}}, etc.

This umbrella JIRA aims to improve them in the commonly used APIs.

NOTE that we'll have to convert the docstrings into numpydoc style first in a 
separate PR (at SPARK-32085), and then add examples. In this way, we can manage 
migration to numpydoc and example improvement here separately (e.g., reverting 
numpydoc migration only).
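
As a sketch of the kind of addition meant here (a hypothetical docstring, not 
one taken from the codebase; the doctest assumes a local session):
{code:python}
class Column:
    def alias(self, *names):
        """Returns this column renamed with the given name(s).

        Examples
        --------
        >>> from pyspark.sql import SparkSession
        >>> spark = SparkSession.builder.getOrCreate()
        >>> df = spark.createDataFrame([(2, "Alice")], ["age", "name"])
        >>> df.select(df.age.alias("age2")).columns
        ['age2']
        """
{code}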




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32388) TRANSFORM when schema less should keep same with hive

2020-10-26 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32388.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29421
[https://github.com/apache/spark/pull/29421]

> TRANSFORM when schema less should keep same with hive
> -
>
> Key: SPARK-32388
> URL: https://issues.apache.org/jira/browse/SPARK-32388
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.1.0
>
>
> Hive transform without schema
>  
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int);
> hive> INSERT INTO t VALUES (1, 1, 1);
> hive> INSERT INTO t VALUES (2, 2, 2);
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
> hive> DESCRIBE v;
> key   string  
> value string  
> hive> SELECT * FROM v;
> 1 1   1
> 2 2   2
> hive> SELECT key FROM v;
> 1
> 2
> hive> SELECT value FROM v;
> 1 1
> 2 2{code}
> Spark
> {code:java}
> hive> create table t (c0 int, c1 int, c2 int); 
> hive> INSERT INTO t VALUES (1, 1, 1); 
> hive> INSERT INTO t VALUES (2, 2, 2); 
> hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; 
> hive> SELECT * FROM v; 
> 1   11
> 2   22 {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32185) User Guide - Monitoring

2020-10-26 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221047#comment-17221047
 ] 

Hyukjin Kwon commented on SPARK-32185:
--

Thanks!

> User Guide - Monitoring
> ---
>
> Key: SPARK-32185
> URL: https://issues.apache.org/jira/browse/SPARK-32185
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Abhijeet Prasad
>Priority: Major
>
> Monitoring. We should focus on how to monitor PySpark jobs.
> - Custom Worker, see also 
> https://github.com/apache/spark/tree/master/python/test_coverage to enable 
> test coverage that include worker sides too.
> - Sentry Support \(?\) 
> https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark
> - Link back https://spark.apache.org/docs/latest/monitoring.html . 
> - ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32789) Wildcards not working in get_json_object

2020-10-26 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221044#comment-17221044
 ] 

Aoyuan Liao commented on SPARK-32789:
-

[~tuhren] Not sure if Hive supports wildcards for dictionaries (JSON objects). 
From the documentation, the star only works for arrays.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object
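
A minimal check of the two cases against a local session; the commented output 
reflects the object-vs-array distinction described above:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# Wildcard over a JSON object (dictionary): not supported, returns null.
spark.sql(
    """SELECT get_json_object('{"k":{"value":"abc"}}', '$.*.value') AS j"""
).show()
# +----+
# |   j|
# +----+
# |null|
# +----+

# Wildcard over a JSON array: supported, matches come back as a JSON array.
spark.sql(
    """SELECT get_json_object('{"k":[{"value":"a"},{"value":"b"}]}', '$.k[*].value') AS j"""
).show()
# +---------+
# |        j|
# +---------+
# |["a","b"]|
# +---------+
{code}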

> Wildcards not working in get_json_object
> 
>
> Key: SPARK-32789
> URL: https://issues.apache.org/jira/browse/SPARK-32789
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Thomas Uhren
>Priority: Major
> Attachments: image-2020-09-03-13-22-38-569.png
>
>
> It seems that wildcards (star) are not supported when using 
> {{get_json_object}}:
> {code:java}
> spark.sql("""select get_json_object('{"k":{"value":"abc"}}', '$.*.value') as 
> j""").show()
> {code}
> This results in {{null}} while it should return 'abc'. It works if I replace 
> * with 'k'.
>   !image-2020-09-03-13-22-38-569.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32789) Wildcards not working in get_json_object

2020-10-26 Thread Aoyuan Liao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aoyuan Liao resolved SPARK-32789.
-
Resolution: Not A Problem

> Wildcards not working in get_json_object
> 
>
> Key: SPARK-32789
> URL: https://issues.apache.org/jira/browse/SPARK-32789
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Thomas Uhren
>Priority: Major
> Attachments: image-2020-09-03-13-22-38-569.png
>
>
> It seems that wildcards (star) are not supported when using 
> {{get_json_object}}:
> {code:java}
> spark.sql("""select get_json_object('{"k":{"value":"abc"}}', '$.*.value') as 
> j""").show()
> {code}
> This results in {{null}} while it should return 'abc'. It works if I replace 
> * with 'k'.
>   !image-2020-09-03-13-22-38-569.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33231) Make podCreationTimeout configurable

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33231:


Assignee: Apache Spark

> Make podCreationTimeout configurable
> 
>
> Key: SPARK-33231
> URL: https://issues.apache.org/jira/browse/SPARK-33231
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>
> Execution Monitor & Pod Allocator have differing views of the world which can 
> lead to pod thrashing.
> The executor monitor can be notified of an executor coming up before a 
> snapshot is delivered to the PodAllocator. This can cause the executor 
> monitor to believe it needs to delete a pod, and the pod allocator to believe 
> that it needs to create a new pod. This happens if the podCreationTimeout is 
> too low for the cluster. Currently podCreationTimeout can only be configured 
> by increasing the batch delay but that has additional consequences leading to 
> slower spin up.
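
For reference, the batch-delay workaround mentioned above is just a conf bump; 
a minimal sketch, with the trade-off being the slower executor spin-up noted in 
the description:
{code:python}
from pyspark.sql import SparkSession

# Stretching the allocation batch delay is currently the only way to widen the
# effective pod creation timeout, at the cost of slower executor spin-up.
spark = (SparkSession.builder
         .config("spark.kubernetes.allocation.batch.delay", "10s")
         .getOrCreate())
{code}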



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33231) Make podCreationTimeout configurable

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221031#comment-17221031
 ] 

Apache Spark commented on SPARK-33231:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30155

> Make podCreationTimeout configurable
> 
>
> Key: SPARK-33231
> URL: https://issues.apache.org/jira/browse/SPARK-33231
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> Execution Monitor & Pod Allocator have differing views of the world which can 
> lead to pod thrashing.
> The executor monitor can be notified of an executor coming up before a 
> snapshot is delivered to the PodAllocator. This can cause the executor 
> monitor to believe it needs to delete a pod, and the pod allocator to believe 
> that it needs to create a new pod. This happens if the podCreationTimeout is 
> too low for the cluster. Currently podCreationTimeout can only be configured 
> by increasing the batch delay but that has additional consequences leading to 
> slower spin up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33231) Make podCreationTimeout configurable

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33231:


Assignee: (was: Apache Spark)

> Make podCreationTimeout configurable
> 
>
> Key: SPARK-33231
> URL: https://issues.apache.org/jira/browse/SPARK-33231
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> Execution Monitor & Pod Allocator have differing views of the world which can 
> lead to pod thrashing.
> The executor monitor can be notified of an executor coming up before a 
> snapshot is delivered to the PodAllocator. This can cause the executor 
> monitor to believe it needs to delete a pod, and the pod allocator to believe 
> that it needs to create a new pod. This happens if the podCreationTimeout is 
> too low for the cluster. Currently podCreationTimeout can only be configured 
> by increasing the batch delay but that has additional consequences leading to 
> slower spin up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33231) Make podCreationTimeout configurable

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221032#comment-17221032
 ] 

Apache Spark commented on SPARK-33231:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30155

> Make podCreationTimeout configurable
> 
>
> Key: SPARK-33231
> URL: https://issues.apache.org/jira/browse/SPARK-33231
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.1.0
>Reporter: Holden Karau
>Priority: Major
>
> Execution Monitor & Pod Allocator have differing views of the world which can 
> lead to pod thrashing.
> The executor monitor can be notified of an executor coming up before a 
> snapshot is delivered to the PodAllocator. This can cause the executor 
> monitor to believe it needs to delete a pod, and the pod allocator to believe 
> that it needs to create a new pod. This happens if the podCreationTimeout is 
> too low for the cluster. Currently podCreationTimeout can only be configured 
> by increasing the batch delay but that has additional consequences leading to 
> slower spin up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221029#comment-17221029
 ] 

Apache Spark commented on SPARK-32405:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30154

> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32405:


Assignee: Apache Spark

> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221028#comment-17221028
 ] 

Apache Spark commented on SPARK-32405:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30154

> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32405:


Assignee: (was: Apache Spark)

> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33237:
-

Assignee: Dongjoon Hyun

> Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT 
> Jenkins job
> --
>
> Key: SPARK-33237
> URL: https://issues.apache.org/jira/browse/SPARK-33237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33237.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30153
[https://github.com/apache/spark/pull/30153]

> Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT 
> Jenkins job
> --
>
> Key: SPARK-33237
> URL: https://issues.apache.org/jira/browse/SPARK-33237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33150) Groupby key may not be unique when using window

2020-10-26 Thread Aoyuan Liao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220984#comment-17220984
 ] 

Aoyuan Liao commented on SPARK-33150:
-

[~DieterDP] After I looked deeper into the code, the issue is not within Spark. 
Spark creates the pandas DataFrame with pd.DataFrame.from_records. However, it 
ignores the fold attribute of the datetime object, which leads to the same 
window value, as shown below:
{code:java}
>>> import pandas as pd
>>> from datetime import datetime
>>> test = pd.DataFrame.from_records([(datetime(2019, 10, 27, 2, 54), 1),
...                                    (datetime(2019, 10, 27, 2, 54, fold=1), 3)])
>>> test
                    0  1
0 2019-10-27 02:54:00  1
1 2019-10-27 02:54:00  3
{code}
IMHO, there is not much Spark can do here.

If you enable Arrow in Spark (set spark.sql.execution.arrow.pyspark.enabled to 
true), the two UTC timestamps of the DataFrame will be distinguished.
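
For reference, enabling that setting is a one-line config on the session 
builder (a minimal sketch):
{code:python}
from pyspark.sql import SparkSession

# Per the comment above, with Arrow enabled the pandas conversion keeps the two
# UTC instants distinct instead of collapsing them onto one ambiguous local time.
spark = (SparkSession.builder
         .master("local[1]")
         .config("spark.sql.session.timeZone", "UTC")
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")
         .getOrCreate())
{code}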

> Groupby key may not be unique when using window
> ---
>
> Key: SPARK-33150
> URL: https://issues.apache.org/jira/browse/SPARK-33150
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.3, 3.0.0
>Reporter: Dieter De Paepe
>Priority: Major
>
>  
> Due to the way spark converts dates to local times, it may end up losing 
> details that allow it to differentiate instants when those times fall in the 
> transition for daylight savings time. Setting the spark timezone to UTC does 
> not resolve the issue.
> This issue is somewhat related to SPARK-32123, but seems independent enough 
> to consider this a separate issue.
> A minimal example is below. I tested these on Spark 3.0.0 and 2.3.3 (I could 
> not get 2.4.x to work on my system). My machine is located in timezone 
> "Europe/Brussels".
>  
> {code:java}
> import pyspark
> import pyspark.sql.functions as f
> spark = (pyspark
>  .sql
>  .SparkSession
>  .builder
>  .master('local[1]')
>  .config("spark.sql.session.timeZone", "UTC")
>  .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC') \
>  .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC')
>  .getOrCreate()
> )
> debug_df = spark.createDataFrame([
>  (1572137640, 1),
>  (1572137640, 2),
>  (1572141240, 3),
>  (1572141240, 4)
> ],['epochtime', 'value'])
> debug_df \
>  .withColumn('time', f.from_unixtime('epochtime')) \
>  .withColumn('window', f.window('time', '1 minute').start) \
>  .collect()
> {code}
>  
> Output, here we see the window function internally transforms the times to 
> local time, and as such has to disambiguate between the Belgian winter and 
> summer hour transition by setting the "fold" attribute:
>  
> {code:java}
> [Row(epochtime=1572137640, value=1, time='2019-10-27 00:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54)),
>  Row(epochtime=1572137640, value=2, time='2019-10-27 00:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54)),
>  Row(epochtime=1572141240, value=3, time='2019-10-27 01:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54, fold=1)),
>  Row(epochtime=1572141240, value=4, time='2019-10-27 01:54:00', 
> window=datetime.datetime(2019, 10, 27, 2, 54, fold=1))]{code}
>  
> Now, this has severe implications when we use the window function for a 
> groupby operation:
>  
> {code:java}
> output = debug_df \
>  .withColumn('time', f.from_unixtime('epochtime')) \
>  .groupby(f.window('time', '1 minute').start.alias('window')).agg(
>f.min('value').alias('min_value')
>  )
> output_collect = output.collect()
> output_pandas = output.toPandas()
> print(output_collect)
> print(output_pandas)
> {code}
> Output:
>  
> {code:java}
> [Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1), 
> Row(window=datetime.datetime(2019, 10, 27, 2, 54, fold=1), min_value=3)]
>   window  min_value
> 0 2019-10-27 00:54:00 1
> 1 2019-10-27 00:54:00 3
> {code}
>  
> While the output using collect() outputs Belgian local time, it allows us to 
> differentiate between the two different keys visually using the fold 
> attribute. However, due to the way the fold attribute is defined, [it is 
> ignored for|https://www.python.org/dev/peps/pep-0495/#the-fold-attribute] 
> equality comparison.
> On the other hand, the pandas output uses the UTC output (due to the setting 
> of spark.sql.session.timeZone), but it has lost the disambiguating fold 
> attribute in the pandas datatype conversion.
> In both cases, the column on which was grouped is not unique.
>  
> {code:java}
> print(output_collect[0].window == output_collect[1].window)  # True
> print(output_collect[0].window.fold == output_collect[1].window.fold)  # False
> print(output_pandas.window[0] == output_pandas.window[1])  # True
> print(output_pandas.window[0].fold == output_pandas.window[1].fold)  # True
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---

[jira] [Updated] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33228:
--
Fix Version/s: (was: 2.4.8)

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. But this change drops the cache even when 
> replacing a view having the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>  +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>  +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>  +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the 
> cache's swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second

2020-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33230.
---
Fix Version/s: 2.4.8
   3.0.2
   3.1.0
   Resolution: Fixed

Issue resolved by pull request 30141
[https://github.com/apache/spark/pull/30141]

> FileOutputWriter jobs have duplicate JobIDs if launched in same second
> --
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
> Fix For: 3.1.0, 3.0.2, 2.4.8
>
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second

2020-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33230:
-

Assignee: Steve Loughran

> FileOutputWriter jobs have duplicate JobIDs if launched in same second
> --
>
> Key: SPARK-33230
> URL: https://issues.apache.org/jira/browse/SPARK-33230
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>
> The Hadoop S3A staging committer has problems with >1 spark sql query being 
> launched simultaneously, as it uses the jobID for its path in the clusterFS 
> to pass the commit information from tasks to job committer. 
> If two queries are launched in the same second, they conflict and the output 
> of job 1 includes that of all job2 files written so far; job 2 will fail with 
> FNFE.
> Proposed:
> job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of 
> {{WriteJobDescription.uuid}}
> That was the property name which used to serve this purpose; any committers 
> already written which use this property will pick it up without needing any 
> changes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-26 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33237:
--
Component/s: Kubernetes

> Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT 
> Jenkins job
> --
>
> Key: SPARK-33237
> URL: https://issues.apache.org/jira/browse/SPARK-33237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33237:


Assignee: Apache Spark

> Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT 
> Jenkins job
> --
>
> Key: SPARK-33237
> URL: https://issues.apache.org/jira/browse/SPARK-33237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220943#comment-17220943
 ] 

Apache Spark commented on SPARK-33237:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30153

> Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT 
> Jenkins job
> --
>
> Key: SPARK-33237
> URL: https://issues.apache.org/jira/browse/SPARK-33237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33237:


Assignee: (was: Apache Spark)

> Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT 
> Jenkins job
> --
>
> Key: SPARK-33237
> URL: https://issues.apache.org/jira/browse/SPARK-33237
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0.
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32185) User Guide - Monitoring

2020-10-26 Thread Abhijeet Prasad (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220902#comment-17220902
 ] 

Abhijeet Prasad commented on SPARK-32185:
-

Hey, sorry for not updating this issue. I have been very busy with school these 
past few months, but I will try to get a PR out within the next week or so. 

> User Guide - Monitoring
> ---
>
> Key: SPARK-32185
> URL: https://issues.apache.org/jira/browse/SPARK-32185
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Abhijeet Prasad
>Priority: Major
>
> Monitoring. We should focus on how to monitor PySpark jobs.
> - Custom Worker, see also 
> https://github.com/apache/spark/tree/master/python/test_coverage to enable 
> test coverage that include worker sides too.
> - Sentry Support \(?\) 
> https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark
> - Link back https://spark.apache.org/docs/latest/monitoring.html . 
> - ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33197) Changes to spark.sql.analyzer.maxIterations do not take effect at runtime

2020-10-26 Thread Yuning Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuning Zhang closed SPARK-33197.


> Changes to spark.sql.analyzer.maxIterations do not take effect at runtime
> -
>
> Key: SPARK-33197
> URL: https://issues.apache.org/jira/browse/SPARK-33197
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.0
>Reporter: Yuning Zhang
>Assignee: Yuning Zhang
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> `spark.sql.analyzer.maxIterations` is not a static conf. However, changes to 
> it do not take effect at runtime.
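
For reference, the runtime path in question is the ordinary conf setter on an 
existing session (a minimal sketch):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# spark.sql.analyzer.maxIterations is not a static conf, so setting it on an
# existing session should take effect for subsequent queries once this is fixed.
spark.conf.set("spark.sql.analyzer.maxIterations", "50")
print(spark.conf.get("spark.sql.analyzer.maxIterations"))
{code}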



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2020-10-26 Thread Denise Mauldin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220856#comment-17220856
 ] 

Denise Mauldin commented on SPARK-19335:


[~kevinyu98] Using AWS Glue to copy/update data between two databases.  We do 
not want to TRUNCATE the tables.  We need to update every row in a table 
without modifying tables that have foreign keys to this table.

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert, is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data. 
> Spark DataFrames/DataSets do not currently support an Update feature via the 
> JDBC Writer, which allows only Overwrite or Append.
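
For context, a minimal sketch of what the JDBC writer offers today, assuming an existing DataFrame `df` and hypothetical Postgres connection details; neither mode updates existing rows in place:

{code}
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")   // hypothetical connection details
  .option("dbtable", "people")                            // hypothetical target table
  .option("user", "spark")
  .option("password", "secret")
  .mode("append")   // only append/overwrite/ignore/error exist; there is no upsert/update mode
  .save()
{code}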



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2020-10-26 Thread Denise Mauldin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220855#comment-17220855
 ] 

Denise Mauldin commented on SPARK-19335:


+1 This is a major deficiency for using Spark in ETL jobs.

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert, is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data. 
> Spark DataFrames/DataSets do not currently support an Update feature via the 
> JDBC Writer, which allows only Overwrite or Append.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-26 Thread Stuart White (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart White updated SPARK-33246:
-
Attachment: null-semantics.patch

> Spark SQL null semantics documentation is incorrect
> ---
>
> Key: SPARK-33246
> URL: https://issues.apache.org/jira/browse/SPARK-33246
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 3.0.1
>Reporter: Stuart White
>Priority: Trivial
> Attachments: null-semantics.patch
>
>
> The documentation of Spark SQL's null semantics is (I believe) incorrect.
> The documentation states that "NULL AND False" yields NULL, when in fact it 
> yields False.
> {noformat}
> Seq[(java.lang.Boolean, java.lang.Boolean)](
>   (true, null),
>   (false, null),
>   (null, true),
>   (null, false),
>   (null, null)
> )
>   .toDF("left_operand", "right_operand")
>   .withColumn("OR", 'left_operand || 'right_operand)
>   .withColumn("AND", 'left_operand && 'right_operand)
>   .show(truncate = false)
> +------------+-------------+----+-----+
> |left_operand|right_operand|OR  |AND  |
> +------------+-------------+----+-----+
> |true        |null         |true|null |
> |false       |null         |null|false|
> |null        |true         |true|null |
> |null        |false        |null|false|   <--- this line is incorrect in the docs
> |null        |null         |null|null |
> +------------+-------------+----+-----+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33246) Spark SQL null semantics documentation is incorrect

2020-10-26 Thread Stuart White (Jira)
Stuart White created SPARK-33246:


 Summary: Spark SQL null semantics documentation is incorrect
 Key: SPARK-33246
 URL: https://issues.apache.org/jira/browse/SPARK-33246
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.0.1
Reporter: Stuart White
 Attachments: null-semantics.patch

The documentation of Spark SQL's null semantics is (I believe) incorrect.

The documentation states that "NULL AND False" yields NULL, when in fact it 
yields False.

{noformat}
Seq[(java.lang.Boolean, java.lang.Boolean)](
  (true, null),
  (false, null),
  (null, true),
  (null, false),
  (null, null)
)
  .toDF("left_operand", "right_operand")
  .withColumn("OR", 'left_operand || 'right_operand)
  .withColumn("AND", 'left_operand && 'right_operand)
  .show(truncate = false)

+------------+-------------+----+-----+
|left_operand|right_operand|OR  |AND  |
+------------+-------------+----+-----+
|true        |null         |true|null |
|false       |null         |null|false|
|null        |true         |true|null |
|null        |false        |null|false|   <--- this line is incorrect in the docs
|null        |null         |null|null |
+------------+-------------+----+-----+
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming

2020-10-26 Thread Calvin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220813#comment-17220813
 ] 

Calvin commented on SPARK-23539:


[~dongjin]/[~kabhwan] Apologies for reviving this long-closed ticket, but I was 
wondering if there are any plans to backport this feature to any of the Spark 
2.x.x versions or if this feature will only be available from 3.0.0 onward?

> Add support for Kafka headers in Structured Streaming
> -
>
> Key: SPARK-23539
> URL: https://issues.apache.org/jira/browse/SPARK-23539
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Dongjin Lee
>Priority: Major
> Fix For: 3.0.0
>
>
> Kafka headers were added in 0.11. We should expose them through our Kafka 
> data source in both batch and streaming queries. 
> This is currently blocked on upgrading the Kafka version in Spark from 0.10.1 
> to 1.0+ (SPARK-18057).
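
For reference, a minimal sketch of how the feature is consumed in 3.0.0, assuming the `includeHeaders` option and the `headers` column shape that the linked work added (broker and topic names below are hypothetical):

{code}
val withHeaders = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker address
  .option("subscribe", "events")                      // hypothetical topic
  .option("includeHeaders", "true")
  .load()
  // headers is exposed as array<struct<key: string, value: binary>>
  .selectExpr("key", "value", "headers")
{code}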



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220807#comment-17220807
 ] 

Apache Spark commented on SPARK-33228:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30152

> Don't uncache data when replacing an existing view having the same plan
> ---
>
> Key: SPARK-33228
> URL: https://issues.apache.org/jira/browse/SPARK-33228
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache 
> when replacing an existing view. However, this change drops the cache even when 
> replacing a view that has the same logical plan. A sequence of queries to 
> reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
>             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>                +- *(1) Range (0, 1, step=1, splits=4)
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
>             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>                +- *(1) Range (0, 1, step=1, splits=4)
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache is swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2020-10-26 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220752#comment-17220752
 ] 

Sean R. Owen commented on SPARK-26043:
--

I don't have a strong opinion on it. [~vanzin] says it would take some work to 
make a proper API, and it perhaps isn't widely used.
Yes, you can just use the same code in your project, access it directly from 
Java, or use a shim class you put in the same Spark package, if you really 
wanted to. (Reflection works too, but it is a bit messier.)
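
For illustration, a minimal sketch of the same-package shim approach, assuming the shim is compiled as part of the user's own build; the shim class and method names are hypothetical, and the `SparkHadoopUtil` members used here are part of the private API and may change:

{code}
package org.apache.spark.deploy

import org.apache.hadoop.conf.Configuration

// Hypothetical shim: because it lives under an org.apache.spark package,
// private[spark] members such as SparkHadoopUtil are visible to it, and it can
// re-expose just the small piece that user code outside Spark actually needs.
object SparkHadoopUtilShim {
  def hadoopConf(): Configuration = SparkHadoopUtil.get.conf
}
{code}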

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark

2020-10-26 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220748#comment-17220748
 ] 

Wenchen Fan commented on SPARK-26043:
-

A quick way is to copy-paste the code into your repo so that it compiles, or use 
Java to write a proxy, as `private[spark]` is not enforced for Java.

It seems like this util is still useful. [~srowen], shall we consider making it 
semi-public, like the Catalyst rules? We don't document it and don't guarantee 
compatibility, but people can access it freely and take the risk on their own.

> Make SparkHadoopUtil private to Spark
> -
>
> Key: SPARK-26043
> URL: https://issues.apache.org/jira/browse/SPARK-26043
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Masiero Vanzin
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> This API contains a few small helper methods used internally by Spark, mostly 
> related to Hadoop configs and kerberos.
> It's been historically marked as "DeveloperApi". But in reality it's not very 
> useful for others, and changes a lot to be considered a stable API. Better to 
> just make it private to Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-26 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33233:
--
Description: Spark currently supports GROUP BY ordinal, but 
CUBE/ROLLUP/GROUPING SETS do not. This PR makes CUBE/ROLLUP/GROUPING SETS 
support GROUP BY ordinal as well.

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>
> Spark currently supports GROUP BY ordinal, but CUBE/ROLLUP/GROUPING SETS do 
> not. This PR makes CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal as well.
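
As a rough illustration of the gap (table and column names below are hypothetical):

{code}
// GROUP BY ordinal is already resolved to the matching select item ...
spark.sql("SELECT dept, SUM(salary) FROM emp GROUP BY 1")
// ... but the same ordinal inside ROLLUP/CUBE/GROUPING SETS is not treated as an
// ordinal today, which is what this ticket proposes to support.
spark.sql("SELECT dept, SUM(salary) FROM emp GROUP BY ROLLUP(1)")
{code}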



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33245) Add built-in UDF - GETBIT

2020-10-26 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33245:
---

 Summary: Add built-in UDF - GETBIT 
 Key: SPARK-33245
 URL: https://issues.apache.org/jira/browse/SPARK-33245
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


Teradata, Impala, Snowflake and Yellowbrick support this function:

https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w
https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit
https://docs.snowflake.com/en/sql-reference/functions/getbit.html
https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html
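
For reference, GETBIT(value, position) returns the bit of `value` at the given position, typically counted from the least-significant bit in the systems above. Until a built-in exists, a rough equivalent with existing Spark operators looks like this (illustration only):

{code}
// 11 = 0b1011, so bit 0 is 1 and bit 2 is 0
spark.sql("SELECT shiftright(11, 0) & 1 AS bit0, shiftright(11, 2) & 1 AS bit2").show()
{code}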



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts

2020-10-26 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33183:
-
Affects Version/s: (was: 3.0.1)
   (was: 3.0.0)
   3.1.0
   3.0.2
   2.4.8

> Bug in optimizer rule EliminateSorts
> 
>
> Key: SPARK-33183
> URL: https://issues.apache.org/jira/browse/SPARK-33183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the rule {{EliminateSorts}} removes a global sort node if its 
> child plan already satisfies the required sort order without checking if the 
> child plan's ordering is local or global. For example, in the following 
> scenario, the first sort shouldn't be removed because it has a stronger 
> guarantee than the second sort even if the sort orders are the same for both 
> sorts. 
> {code:java}
> Sort(orders, global = True, ...)
>   Sort(orders, global = False, ...){code}
>  
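
For illustration, one way such a plan can arise from the DataFrame API (a sketch, not taken from the ticket):

{code}
// The inner sort is per-partition (global = false); the outer orderBy is a global sort.
// Removing the outer sort here would silently weaken the ordering guarantee.
val df = spark.range(100).toDF("a")
df.sortWithinPartitions("a")
  .orderBy("a")
{code}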



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened

2020-10-26 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-33204.

Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30119
[https://github.com/apache/spark/pull/30119]

> `Event Timeline`  in Spark Job UI sometimes cannot be opened
> 
>
> Key: SPARK-33204
> URL: https://issues.apache.org/jira/browse/SPARK-33204
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: reproduce.gif
>
>
> The Event Timeline area cannot be expanded when a Spark application has some 
> failed jobs, as shown in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-26 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220648#comment-17220648
 ] 

Takeshi Yamamuro commented on SPARK-33233:
--

Please fill in the description.

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal

2020-10-26 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-33233:
-
Issue Type: Improvement  (was: Bug)

> CUBE/ROLLUP can't support UnresolvedOrdinal
> ---
>
> Key: SPARK-33233
> URL: https://issues.apache.org/jira/browse/SPARK-33233
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33223) Expose state information on SS UI

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220637#comment-17220637
 ] 

Apache Spark commented on SPARK-33223:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/30151

> Expose state information on SS UI
> -
>
> Key: SPARK-33223
> URL: https://issues.apache.org/jira/browse/SPARK-33223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33223) Expose state information on SS UI

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33223:


Assignee: Apache Spark

> Expose state information on SS UI
> -
>
> Key: SPARK-33223
> URL: https://issues.apache.org/jira/browse/SPARK-33223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33223) Expose state information on SS UI

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33223:


Assignee: (was: Apache Spark)

> Expose state information on SS UI
> -
>
> Key: SPARK-33223
> URL: https://issues.apache.org/jira/browse/SPARK-33223
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.0.1
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33075) Only disable auto bucketed scan for cached query

2020-10-26 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33075.
--
Fix Version/s: 3.1.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30138

> Only disable auto bucketed scan for cached query
> 
>
> Key: SPARK-33075
> URL: https://issues.apache.org/jira/browse/SPARK-33075
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.1.0
>
>
> As a follow-up from the discussion in 
> [https://github.com/apache/spark/pull/29804#discussion_r500033528], auto 
> bucketed scan is disabled by default due to a regression for cached queries. 
> As suggested by [~cloud_fan], we can enable auto bucketed scan globally with 
> special handling for cached queries, similar to adaptive execution.
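
For reference, a sketch of the knob involved, assuming the `spark.sql.sources.bucketing.autoBucketedScan.enabled` flag name used on the 3.1 branch:

{code}
// Disabled by default at the time of this ticket because of the cached-query regression;
// the proposal is to enable it globally and special-case cached plans instead.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
{code}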



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32188) API Reference

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220625#comment-17220625
 ] 

Apache Spark commented on SPARK-32188:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30150

> API Reference
> -
>
> Key: SPARK-32188
> URL: https://issues.apache.org/jira/browse/SPARK-32188
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Example: https://hyukjin-spark.readthedocs.io/en/latest/reference/index.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32188) API Reference

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220624#comment-17220624
 ] 

Apache Spark commented on SPARK-32188:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30150

> API Reference
> -
>
> Key: SPARK-32188
> URL: https://issues.apache.org/jira/browse/SPARK-32188
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Example: https://hyukjin-spark.readthedocs.io/en/latest/reference/index.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33243) Add numpydoc into documentation dependency

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33243:


Assignee: Apache Spark

> Add numpydoc into documentation dependency
> --
>
> Key: SPARK-33243
> URL: https://issues.apache.org/jira/browse/SPARK-33243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> To switch the docstring format, we should add the numpydoc package to the Sphinx dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33243) Add numpydoc into documentation dependency

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33243:


Assignee: (was: Apache Spark)

> Add numpydoc into documentation dependency
> --
>
> Key: SPARK-33243
> URL: https://issues.apache.org/jira/browse/SPARK-33243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> To switch the docstring format, we should add the numpydoc package to the Sphinx dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33243) Add numpydoc into documentation dependency

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220617#comment-17220617
 ] 

Apache Spark commented on SPARK-33243:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30149

> Add numpydoc into documentation dependency
> --
>
> Key: SPARK-33243
> URL: https://issues.apache.org/jira/browse/SPARK-33243
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> To switch the docstring format, we should add the numpydoc package to the Sphinx dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33244:


Assignee: (was: Apache Spark)

> Unify the code paths for spark.table and spark.read.table
> -
>
> Key: SPARK-33244
> URL: https://issues.apache.org/jira/browse/SPARK-33244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The code paths of `spark.table` and `spark.read.table` should be the same. 
> This behavior was broken in SPARK-32592, since we need to respect options in 
> the `spark.read.table` API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-10-26 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33244:


Assignee: Apache Spark

> Unify the code paths for spark.table and spark.read.table
> -
>
> Key: SPARK-33244
> URL: https://issues.apache.org/jira/browse/SPARK-33244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> The code paths of `spark.table` and `spark.read.table` should be the same. 
> This behavior was broken in SPARK-32592, since we need to respect options in 
> the `spark.read.table` API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-10-26 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220615#comment-17220615
 ] 

Apache Spark commented on SPARK-33244:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/30148

> Unify the code paths for spark.table and spark.read.table
> -
>
> Key: SPARK-33244
> URL: https://issues.apache.org/jira/browse/SPARK-33244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The code paths of `spark.table` and `spark.read.table` should be the same. 
> This behavior was broken in SPARK-32592, since we need to respect options in 
> the `spark.read.table` API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-10-26 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-33244:
---

 Summary: Unify the code paths for spark.table and spark.read.table
 Key: SPARK-33244
 URL: https://issues.apache.org/jira/browse/SPARK-33244
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuanjian Li


The code paths of `spark.table` and `spark.read.table` should be the same. This 
behavior was broken in SPARK-32592, since we need to respect options in the 
`spark.read.table` API.
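
A minimal sketch of the two entry points in question (table name and read option below are hypothetical):

{code}
// Both calls should resolve the table through the same code path;
// the reader variant additionally carries per-read options.
val t1 = spark.table("db.events")
val t2 = spark.read.option("mergeSchema", "true").table("db.events")
{code}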



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


