[jira] [Resolved] (SPARK-33215) Speed up event log download by skipping UI rebuild
[ https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-33215. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30126 [https://github.com/apache/spark/pull/30126] > Speed up event log download by skipping UI rebuild > -- > > Key: SPARK-33215 > URL: https://issues.apache.org/jira/browse/SPARK-33215 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > Fix For: 3.1.0 > > > Right now, when we want to download the event logs from the Spark history > server (SHS), the SHS needs to parse the entire event log to rebuild the UI, and > this is done only for view permission checks. UI rebuilding is a time-consuming > and memory-intensive task, especially for large logs. However, this process > is unnecessary for event log download. > This patch enables the SHS to check UI view permissions of a given app/attempt > for a given user without rebuilding the UI. This is achieved by adding a > method "checkUIViewPermissions(appId: String, attemptId: Option[String], > user: String): Boolean" to many layers of history server components. > With this patch, the UI rebuild can be skipped when downloading event logs from > the history server. Thus the time to download a GB-scale event log can be > reduced from several minutes to several seconds, and the memory consumption > of UI rebuilding can be avoided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
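For context, the check the patch adds can short-circuit the download path roughly as sketched below. Only the checkUIViewPermissions signature is quoted from the issue; the surrounding trait and handler names are illustrative assumptions, not Spark's actual internals.

{code:scala}
// Sketch: a history provider exposes an ACL check that does not require
// replaying the whole event log to rebuild a SparkUI.
trait HistoryProviderWithAcls {
  def checkUIViewPermissions(appId: String, attemptId: Option[String], user: String): Boolean
}

// Hypothetical download handler: consult the ACL check directly, then stream
// the raw event log files, instead of rebuilding the UI just to read its ACLs.
class EventLogDownloadHandler(provider: HistoryProviderWithAcls) {
  def download(appId: String, attemptId: Option[String], user: String): Unit = {
    if (!provider.checkUIViewPermissions(appId, attemptId, user)) {
      throw new SecurityException(s"User $user is not allowed to view $appId")
    }
    // ... stream the event log files to the client; no UI rebuild, no full replay ...
  }
}
{code}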
[jira] [Assigned] (SPARK-33215) Speed up event log download by skipping UI rebuild
[ https://issues.apache.org/jira/browse/SPARK-33215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-33215: Assignee: Baohe Zhang > Speed up event log download by skipping UI rebuild > -- > > Key: SPARK-33215 > URL: https://issues.apache.org/jira/browse/SPARK-33215 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Baohe Zhang >Assignee: Baohe Zhang >Priority: Major > > Right now, when we want to download the event logs from the Spark history > server (SHS), the SHS needs to parse the entire event log to rebuild the UI, and > this is done only for view permission checks. UI rebuilding is a time-consuming > and memory-intensive task, especially for large logs. However, this process > is unnecessary for event log download. > This patch enables the SHS to check UI view permissions of a given app/attempt > for a given user without rebuilding the UI. This is achieved by adding a > method "checkUIViewPermissions(appId: String, attemptId: Option[String], > user: String): Boolean" to many layers of history server components. > With this patch, the UI rebuild can be skipped when downloading event logs from > the history server. Thus the time to download a GB-scale event log can be > reduced from several minutes to several seconds, and the memory consumption > of UI rebuilding can be avoided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33243) Add numpydoc into documentation dependency
[ https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33243. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30149 [https://github.com/apache/spark/pull/30149] > Add numpydoc into documentation dependency > -- > > Key: SPARK-33243 > URL: https://issues.apache.org/jira/browse/SPARK-33243 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > To switch the docstring formats, we should add numpydoc package into Sphinx. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
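For reference, wiring numpydoc into a Sphinx build is typically a one-line extension change. The sketch below is hedged: the conf.py path and the surrounding extension list are assumptions about Spark's docs build, not taken from the pull request.

{code:python}
# python/docs/source/conf.py (path assumed)
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.viewcode',
    'numpydoc',  # parses NumPy-style "Parameters" / "Returns" docstring sections
]

# A commonly used numpydoc option; the value is illustrative, not Spark's setting.
numpydoc_show_class_members = False
{code}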
[jira] [Assigned] (SPARK-33243) Add numpydoc into documentation dependency
[ https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33243: Assignee: Hyukjin Kwon > Add numpydoc into documentation dependency > -- > > Key: SPARK-33243 > URL: https://issues.apache.org/jira/browse/SPARK-33243 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > To switch the docstring formats, we should add numpydoc package into Sphinx. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33256) Update contribution guide about NumPy documentation style
Hyukjin Kwon created SPARK-33256: Summary: Update contribution guide about NumPy documentation style Key: SPARK-33256 URL: https://issues.apache.org/jira/browse/SPARK-33256 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon We should document that PySpark uses NumPy documentation style. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33249) Add status plugin for live application
[ https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiyi Kong updated SPARK-33249: --- Remaining Estimate: (was: 24h) Original Estimate: (was: 24h) > Add status plugin for live application > -- > > Key: SPARK-33249 > URL: https://issues.apache.org/jira/browse/SPARK-33249 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Weiyi Kong >Priority: Minor > > There are cases where a developer may want to extend the current REST API of the Web > UI. In most cases, adding an external module is a better option than directly > editing the original Spark code. > For an external module, to extend the REST API of the Web UI, 2 things may > need to be done: > * Add an extra API to provide extra status info. This can simply be done by > implementing another ApiRequestContext, which will be automatically loaded. > * If the info cannot be calculated from the original data in the store, add > extra listeners to generate it. > For the history server, there is an interface called AppHistoryServerPlugin, > which is loaded based on SPI and provides a method to create listeners. In a live > application, the only way is spark.extraListeners, based on > Utils.loadExtensions. But this is not enough for these cases. > To let the API get the status info, the data needs to be written to the > AppStatusStore, which is the only store that an API can get by accessing > "ui.store" or "ui.sc.statusStore". But listeners created by > Utils.loadExtensions only get a SparkConf at construction, and are unable to > write to the AppStatusStore. > So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For > concerns like SPARK-22786, the plugin for live apps can be separated from the > history server one, and also loaded using Utils.loadExtensions with an extra > configuration. So by default, nothing will be loaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
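To make the proposal concrete, below is a hedged sketch of what a live-application status plugin could look like, modeled on the AppHistoryServerPlugin interface mentioned in the description. The trait name, the store type in the signature, and the configuration key are illustrative assumptions, not an existing Spark API (AppHistoryServerPlugin itself takes Spark's internal ElementTrackingStore).

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.scheduler.SparkListener
import org.apache.spark.util.kvstore.KVStore

// Unlike spark.extraListeners, the plugin receives the live app's status store,
// so its listeners can persist custom data for an ApiRequestContext to serve.
trait LiveAppStatusPlugin {
  def createListeners(conf: SparkConf, store: KVStore): Seq[SparkListener]
}

// Loading could mirror spark.extraListeners via Utils.loadExtensions, keyed off
// a new, separate setting (name hypothetical) so nothing is loaded by default:
//   spark.ui.liveStatusPlugins=com.example.MyStatusPlugin
{code}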
[jira] [Commented] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
[ https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221126#comment-17221126 ] Yang Jie commented on SPARK-33255: -- [~hyukjin.kwon] Got it ~ > Use new API to construct ParquetFileReader and read Parquet footer > -- > > Key: SPARK-33255 > URL: https://issues.apache.org/jira/browse/SPARK-33255 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > /** > * @param configuration the Hadoop conf > * @param fileMetaData fileMetaData for parquet file > * @param filePath Path for the parquet file > * @param blocks the blocks to read > * @param columns the columns to read (their path) > * @throws IOException if the file can not be opened > * @deprecated will be removed in 2.0.0. > */ > @Deprecated > public ParquetFileReader( > Configuration configuration, FileMetaData fileMetaData, > Path filePath, List blocks, List > columns) throws IOException { > {code} > {code:java} > /** > * Reads the meta data block in the footer of the file > * @param configuration a configuration > * @param file the parquet File > * @param filter the filter to apply to row groups > * @return the metadata blocks in the footer > * @throws IOException if an error occurs while reading the file > * @deprecated will be removed in 2.0.0; > * use {@link ParquetFileReader#open(InputFile, > ParquetReadOptions)} > */ > @Deprecated > public static final ParquetMetadata readFooter(Configuration configuration, > FileStatus file, MetadataFilter filter) throws IOException > {code} > {code:java} > /** > * Reads the meta data in the footer of the file. > * Skipping row groups (or not) based on the provided filter > * @param configuration a configuration > * @param file the Parquet File > * @param filter the filter to apply to row groups > * @return the metadata with row groups filtered. > * @throws IOException if an error occurs while reading the file > * @deprecated will be removed in 2.0.0; > * use {@link ParquetFileReader#open(InputFile, > ParquetReadOptions)} > */ > public static ParquetMetadata readFooter(Configuration configuration, Path > file, MetadataFilter filter) throws IOException{code} > in ParquetFileReader were marked as deprecated, use > {code:java} > public ParquetFileReader(InputFile file, ParquetReadOptions options) throws > IOException > {code} > {code:java} > public ParquetMetadata getFooter() > {code} > to instead of them. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
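For reference, a hedged sketch of reading a footer through the non-deprecated parquet-mr entry points named above (ParquetFileReader.open plus getFooter). The file path is a placeholder, and note the later comment on this issue that Spark could not switch to these APIs at the time.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.HadoopReadOptions
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

val conf = new Configuration()
// Placeholder path; in Spark this would come from the data source's file listing.
val inputFile = HadoopInputFile.fromPath(new Path("/tmp/example.parquet"), conf)
val options = HadoopReadOptions.builder(conf).build()

val reader = ParquetFileReader.open(inputFile, options)
try {
  val footer = reader.getFooter  // replaces the deprecated readFooter(...) overloads
  println(footer.getFileMetaData.getSchema)
} finally {
  reader.close()
}
{code}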
[jira] [Updated] (SPARK-33249) Add status plugin for live application
[ https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiyi Kong updated SPARK-33249: --- Remaining Estimate: 24h Original Estimate: 24h > Add status plugin for live application > -- > > Key: SPARK-33249 > URL: https://issues.apache.org/jira/browse/SPARK-33249 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Weiyi Kong >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > There are cases where a developer may want to extend the current REST API of the Web > UI. In most cases, adding an external module is a better option than directly > editing the original Spark code. > For an external module, to extend the REST API of the Web UI, 2 things may > need to be done: > * Add an extra API to provide extra status info. This can simply be done by > implementing another ApiRequestContext, which will be automatically loaded. > * If the info cannot be calculated from the original data in the store, add > extra listeners to generate it. > For the history server, there is an interface called AppHistoryServerPlugin, > which is loaded based on SPI and provides a method to create listeners. In a live > application, the only way is spark.extraListeners, based on > Utils.loadExtensions. But this is not enough for these cases. > To let the API get the status info, the data needs to be written to the > AppStatusStore, which is the only store that an API can get by accessing > "ui.store" or "ui.sc.statusStore". But listeners created by > Utils.loadExtensions only get a SparkConf at construction, and are unable to > write to the AppStatusStore. > So I think we still need a plugin like AppHistoryServerPlugin for the live UI. For > concerns like SPARK-22786, the plugin for live apps can be separated from the > history server one, and also loaded using Utils.loadExtensions with an extra > configuration. So by default, nothing will be loaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)
[ https://issues.apache.org/jira/browse/SPARK-33250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221122#comment-17221122 ] Hyukjin Kwon commented on SPARK-33250: -- I'll work on this one. > Migration to NumPy documentation style in SQL (pyspark.sql.*) > - > > Key: SPARK-33250 > URL: https://issues.apache.org/jira/browse/SPARK-33250 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Migration to NumPy documentation style in SQL (pyspark.sql.*) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
[ https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-33255: - Description: {code:java} /** * @param configuration the Hadoop conf * @param fileMetaData fileMetaData for parquet file * @param filePath Path for the parquet file * @param blocks the blocks to read * @param columns the columns to read (their path) * @throws IOException if the file can not be opened * @deprecated will be removed in 2.0.0. */ @Deprecated public ParquetFileReader( Configuration configuration, FileMetaData fileMetaData, Path filePath, List blocks, List columns) throws IOException { {code} {code:java} /** * Reads the meta data block in the footer of the file * @param configuration a configuration * @param file the parquet File * @param filter the filter to apply to row groups * @return the metadata blocks in the footer * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ @Deprecated public static final ParquetMetadata readFooter(Configuration configuration, FileStatus file, MetadataFilter filter) throws IOException {code} {code:java} /** * Reads the meta data in the footer of the file. * Skipping row groups (or not) based on the provided filter * @param configuration a configuration * @param file the Parquet File * @param filter the filter to apply to row groups * @return the metadata with row groups filtered. * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ public static ParquetMetadata readFooter(Configuration configuration, Path file, MetadataFilter filter) throws IOException{code} in ParquetFileReader were marked as deprecated, use {code:java} public ParquetFileReader(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} public ParquetMetadata getFooter() {code} to instead of them. was: {code:java} /** * @param configuration the Hadoop conf * @param fileMetaData fileMetaData for parquet file * @param filePath Path for the parquet file * @param blocks the blocks to read * @param columns the columns to read (their path) * @throws IOException if the file can not be opened * @deprecated will be removed in 2.0.0. */ @Deprecated public ParquetFileReader( Configuration configuration, FileMetaData fileMetaData, Path filePath, List blocks, List columns) throws IOException { {code} {code:java} /** * Reads the meta data block in the footer of the file * @param configuration a configuration * @param file the parquet File * @param filter the filter to apply to row groups * @return the metadata blocks in the footer * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ @Deprecated public static final ParquetMetadata readFooter(Configuration configuration, FileStatus file, MetadataFilter filter) throws IOException {code} {code:java} /** * Reads the meta data in the footer of the file. * Skipping row groups (or not) based on the provided filter * @param configuration a configuration * @param file the Parquet File * @param filter the filter to apply to row groups * @return the metadata with row groups filtered. 
* @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ public static ParquetMetadata readFooter(Configuration configuration, Path file, MetadataFilter filter) throws IOException{code} in ParquetFileReader were marked as deprecated, use {code:java} public ParquetFileReader(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} /** * Open a {@link InputFile file} with {@link ParquetReadOptions options}. * * @param file an input file * @param options parquet read options * @return an open ParquetFileReader * @throws IOException if there is an error while opening the file */ public static ParquetFileReader open(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} public ParquetMetadata getFooter() {code} to instead of them. > Use new API to construct ParquetFileReader and read Parquet footer > -- > > Key: SPARK-33255 > URL: https://issues.apache.org/jira/browse/SPARK-33255 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > /** > * @param configuration
[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
[ https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-33255: - Description: {code:java} /** * @param configuration the Hadoop conf * @param fileMetaData fileMetaData for parquet file * @param filePath Path for the parquet file * @param blocks the blocks to read * @param columns the columns to read (their path) * @throws IOException if the file can not be opened * @deprecated will be removed in 2.0.0. */ @Deprecated public ParquetFileReader( Configuration configuration, FileMetaData fileMetaData, Path filePath, List blocks, List columns) throws IOException { {code} {code:java} /** * Reads the meta data block in the footer of the file * @param configuration a configuration * @param file the parquet File * @param filter the filter to apply to row groups * @return the metadata blocks in the footer * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ @Deprecated public static final ParquetMetadata readFooter(Configuration configuration, FileStatus file, MetadataFilter filter) throws IOException {code} {code:java} /** * Reads the meta data in the footer of the file. * Skipping row groups (or not) based on the provided filter * @param configuration a configuration * @param file the Parquet File * @param filter the filter to apply to row groups * @return the metadata with row groups filtered. * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ public static ParquetMetadata readFooter(Configuration configuration, Path file, MetadataFilter filter) throws IOException{code} in ParquetFileReader were marked as deprecated, use {code:java} public ParquetFileReader(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} /** * Open a {@link InputFile file} with {@link ParquetReadOptions options}. * * @param file an input file * @param options parquet read options * @return an open ParquetFileReader * @throws IOException if there is an error while opening the file */ public static ParquetFileReader open(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} public ParquetMetadata getFooter() {code} to instead of them. was: {code:java} /** * @param configuration the Hadoop conf * @param fileMetaData fileMetaData for parquet file * @param filePath Path for the parquet file * @param blocks the blocks to read * @param columns the columns to read (their path) * @throws IOException if the file can not be opened * @deprecated will be removed in 2.0.0. */ @Deprecated public ParquetFileReader( Configuration configuration, FileMetaData fileMetaData, Path filePath, List blocks, List columns) throws IOException { {code} , {code:java} /** * Reads the meta data block in the footer of the file * @param configuration a configuration * @param file the parquet File * @param filter the filter to apply to row groups * @return the metadata blocks in the footer * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ @Deprecated public static final ParquetMetadata readFooter(Configuration configuration, FileStatus file, MetadataFilter filter) throws IOException {code} {code:java} /** * Reads the meta data in the footer of the file. 
* Skipping row groups (or not) based on the provided filter * @param configuration a configuration * @param file the Parquet File * @param filter the filter to apply to row groups * @return the metadata with row groups filtered. * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ public static ParquetMetadata readFooter(Configuration configuration, Path file, MetadataFilter filter) throws IOException{code} in ParquetFileReader were marked as deprecated, use {code:java} public ParquetFileReader(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} /** * Open a {@link InputFile file} with {@link ParquetReadOptions options}. * * @param file an input file * @param options parquet read options * @return an open ParquetFileReader * @throws IOException if there is an error while opening the file */ public static ParquetFileReader open(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} public ParquetMetadata getFooter() {code} to instead of them. > Use new API to construct ParquetFileReader and read Parquet footer > ---
[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
[ https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-33255: - Description: {code:java} /** * @param configuration the Hadoop conf * @param fileMetaData fileMetaData for parquet file * @param filePath Path for the parquet file * @param blocks the blocks to read * @param columns the columns to read (their path) * @throws IOException if the file can not be opened * @deprecated will be removed in 2.0.0. */ @Deprecated public ParquetFileReader( Configuration configuration, FileMetaData fileMetaData, Path filePath, List blocks, List columns) throws IOException { {code} , {code:java} /** * Reads the meta data block in the footer of the file * @param configuration a configuration * @param file the parquet File * @param filter the filter to apply to row groups * @return the metadata blocks in the footer * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ @Deprecated public static final ParquetMetadata readFooter(Configuration configuration, FileStatus file, MetadataFilter filter) throws IOException {code} {code:java} /** * Reads the meta data in the footer of the file. * Skipping row groups (or not) based on the provided filter * @param configuration a configuration * @param file the Parquet File * @param filter the filter to apply to row groups * @return the metadata with row groups filtered. * @throws IOException if an error occurs while reading the file * @deprecated will be removed in 2.0.0; * use {@link ParquetFileReader#open(InputFile, ParquetReadOptions)} */ public static ParquetMetadata readFooter(Configuration configuration, Path file, MetadataFilter filter) throws IOException{code} in ParquetFileReader were marked as deprecated, use {code:java} public ParquetFileReader(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} /** * Open a {@link InputFile file} with {@link ParquetReadOptions options}. * * @param file an input file * @param options parquet read options * @return an open ParquetFileReader * @throws IOException if there is an error while opening the file */ public static ParquetFileReader open(InputFile file, ParquetReadOptions options) throws IOException {code} {code:java} public ParquetMetadata getFooter() {code} to instead of them. was: {code:java} /** * @param configuration the Hadoop conf * @param fileMetaData fileMetaData for parquet file * @param filePath Path for the parquet file * @param blocks the blocks to read * @param columns the columns to read (their path) * @throws IOException if the file can not be opened * @deprecated will be removed in 2.0.0. */ @Deprecated public ParquetFileReader( Configuration configuration, FileMetaData fileMetaData, Path filePath, List blocks, List columns) throws IOException { {code} and > Use new API to construct ParquetFileReader and read Parquet footer > -- > > Key: SPARK-33255 > URL: https://issues.apache.org/jira/browse/SPARK-33255 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > /** > * @param configuration the Hadoop conf > * @param fileMetaData fileMetaData for parquet file > * @param filePath Path for the parquet file > * @param blocks the blocks to read > * @param columns the columns to read (their path) > * @throws IOException if the file can not be opened > * @deprecated will be removed in 2.0.0. 
> */ > @Deprecated > public ParquetFileReader( > Configuration configuration, FileMetaData fileMetaData, > Path filePath, List blocks, List > columns) throws IOException { > {code} > > , > > {code:java} > /** > * Reads the meta data block in the footer of the file > * @param configuration a configuration > * @param file the parquet File > * @param filter the filter to apply to row groups > * @return the metadata blocks in the footer > * @throws IOException if an error occurs while reading the file > * @deprecated will be removed in 2.0.0; > * use {@link ParquetFileReader#open(InputFile, > ParquetReadOptions)} > */ > @Deprecated > public static final ParquetMetadata readFooter(Configuration configuration, > FileStatus file, MetadataFilter filter) throws IOException > {code} > > > {code:java} > /** > * Reads the meta data in the footer of the file. > * Skipping row groups (or not) based on the provided filter > * @param configuration a configuration > * @param file the Parquet File > * @param filter the filter to apply to row groups > * @return th
[jira] [Resolved] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
[ https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33255. -- Resolution: Duplicate > Use new API to construct ParquetFileReader and read Parquet footer > -- > > Key: SPARK-33255 > URL: https://issues.apache.org/jira/browse/SPARK-33255 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > /** > * @param configuration the Hadoop conf > * @param fileMetaData fileMetaData for parquet file > * @param filePath Path for the parquet file > * @param blocks the blocks to read > * @param columns the columns to read (their path) > * @throws IOException if the file can not be opened > * @deprecated will be removed in 2.0.0. > */ > @Deprecated > public ParquetFileReader( > Configuration configuration, FileMetaData fileMetaData, > Path filePath, List blocks, List > columns) throws IOException { > {code} > and > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
[ https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221119#comment-17221119 ] Hyukjin Kwon commented on SPARK-33255: -- We can't replace this now. See also https://github.com/apache/spark/pull/29542#pullrequestreview-478269264 > Use new API to construct ParquetFileReader and read Parquet footer > -- > > Key: SPARK-33255 > URL: https://issues.apache.org/jira/browse/SPARK-33255 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > /** > * @param configuration the Hadoop conf > * @param fileMetaData fileMetaData for parquet file > * @param filePath Path for the parquet file > * @param blocks the blocks to read > * @param columns the columns to read (their path) > * @throws IOException if the file can not be opened > * @deprecated will be removed in 2.0.0. > */ > @Deprecated > public ParquetFileReader( > Configuration configuration, FileMetaData fileMetaData, > Path filePath, List blocks, List > columns) throws IOException { > {code} > and > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
[ https://issues.apache.org/jira/browse/SPARK-33255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-33255: - Description: {code:java} /** * @param configuration the Hadoop conf * @param fileMetaData fileMetaData for parquet file * @param filePath Path for the parquet file * @param blocks the blocks to read * @param columns the columns to read (their path) * @throws IOException if the file can not be opened * @deprecated will be removed in 2.0.0. */ @Deprecated public ParquetFileReader( Configuration configuration, FileMetaData fileMetaData, Path filePath, List blocks, List columns) throws IOException { {code} and > Use new API to construct ParquetFileReader and read Parquet footer > -- > > Key: SPARK-33255 > URL: https://issues.apache.org/jira/browse/SPARK-33255 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > {code:java} > /** > * @param configuration the Hadoop conf > * @param fileMetaData fileMetaData for parquet file > * @param filePath Path for the parquet file > * @param blocks the blocks to read > * @param columns the columns to read (their path) > * @throws IOException if the file can not be opened > * @deprecated will be removed in 2.0.0. > */ > @Deprecated > public ParquetFileReader( > Configuration configuration, FileMetaData fileMetaData, > Path filePath, List blocks, List > columns) throws IOException { > {code} > and > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32085) Migrate to NumPy documentation style
[ https://issues.apache.org/jira/browse/SPARK-32085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32085: - Description: https://github.com/numpy/numpydoc For example, Before: https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318 After: https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176 We can incrementally start to switch. NOTE that this JIRA targets only to switch the style. It does not target to add additional information or fixes together. was: https://github.com/numpy/numpydoc For example, Before: https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318 After: https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176 We can incrementally start to switch. > Migrate to NumPy documentation style > > > Key: SPARK-32085 > URL: https://issues.apache.org/jira/browse/SPARK-32085 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/numpy/numpydoc > For example, > Before: > https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318 > After: > https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176 > We can incrementally start to switch. > NOTE that this JIRA targets only to switch the style. It does not target to > add additional information or fixes together. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
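As a concrete illustration of the style switch, a hedged before/after example follows; the function and its parameters are hypothetical, not the dataframe.py / frame.py code linked above.

{code:python}
# Before: reST-style fields, as currently used in PySpark docstrings.
def sample(fraction, seed=None):
    """Returns a sampled subset of rows.

    :param fraction: fraction of rows to generate, in [0, 1].
    :param seed: seed for sampling (default: a random seed).
    :return: a new, sampled object.
    """

# After: NumPy documentation style, rendered by numpydoc.
def sample(fraction, seed=None):
    """Returns a sampled subset of rows.

    Parameters
    ----------
    fraction : float
        Fraction of rows to generate, in [0, 1].
    seed : int, optional
        Seed for sampling (default: a random seed).

    Returns
    -------
    object
        A new, sampled subset of rows.
    """
{code}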
[jira] [Updated] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)
[ https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33254: - Description: This JIRA targets to migrate to NumPy documentation style in > Migration to NumPy documentation style in Core (pyspark.*, > pyspark.resource.*, etc.) > > > Key: SPARK-33254 > URL: https://issues.apache.org/jira/browse/SPARK-33254 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33251) Migration to NumPy documentation style in ML (pyspark.ml.*)
[ https://issues.apache.org/jira/browse/SPARK-33251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33251: - Description: This JIRA targets to migrate to NumPy documentation style in MLlib (pyspark.mllib.*). Please also see the parent JIRA. > Migration to NumPy documentation style in ML (pyspark.ml.*) > --- > > Key: SPARK-33251 > URL: https://issues.apache.org/jira/browse/SPARK-33251 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in MLlib > (pyspark.mllib.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)
[ https://issues.apache.org/jira/browse/SPARK-33250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33250: - Description: Migration to NumPy documentation style in ML (pyspark.ml.*) > Migration to NumPy documentation style in SQL (pyspark.sql.*) > - > > Key: SPARK-33250 > URL: https://issues.apache.org/jira/browse/SPARK-33250 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Migration to NumPy documentation style in ML (pyspark.ml.*) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33251) Migration to NumPy documentation style in ML (pyspark.ml.*)
[ https://issues.apache.org/jira/browse/SPARK-33251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33251: - Description: This JIRA targets to migrate to NumPy documentation style in ML (pyspark.ml.*). Please also see the parent JIRA. (was: This JIRA targets to migrate to NumPy documentation style in MLlib (pyspark.mllib.*). Please also see the parent JIRA.) > Migration to NumPy documentation style in ML (pyspark.ml.*) > --- > > Key: SPARK-33251 > URL: https://issues.apache.org/jira/browse/SPARK-33251 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in ML > (pyspark.ml.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32085) Migrate to NumPy documentation style
[ https://issues.apache.org/jira/browse/SPARK-32085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221099#comment-17221099 ] Hyukjin Kwon commented on SPARK-32085: -- cc [~zero323] in case you're interested in some of the sub-tasks. > Migrate to NumPy documentation style > > > Key: SPARK-32085 > URL: https://issues.apache.org/jira/browse/SPARK-32085 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/numpy/numpydoc > For example, > Before: > https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318 > After: > https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176 > We can incrementally start to switch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)
[ https://issues.apache.org/jira/browse/SPARK-33250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33250: - Description: Migration to NumPy documentation style in SQL (pyspark.sql.*) (was: Migration to NumPy documentation style in ML (pyspark.ml.*)) > Migration to NumPy documentation style in SQL (pyspark.sql.*) > - > > Key: SPARK-33250 > URL: https://issues.apache.org/jira/browse/SPARK-33250 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Migration to NumPy documentation style in SQL (pyspark.sql.*) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)
[ https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33254: - Description: This JIRA targets to migrate to NumPy documentation style in Core (pyspark.\*, pyspark.resource.\*, etc.). Please also see the parent JIRA. (was: This JIRA targets to migrate to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA.) > Migration to NumPy documentation style in Core (pyspark.*, > pyspark.resource.*, etc.) > > > Key: SPARK-33254 > URL: https://issues.apache.org/jira/browse/SPARK-33254 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in Core > (pyspark.\*, pyspark.resource.\*, etc.). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221116#comment-17221116 ] Hyukjin Kwon commented on SPARK-33246: -- [~stwhit] Apache Spark uses pull requests in GitHub to apply a patch. See also https://spark.apache.org/contributing.html > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Stuart White >Priority: Trivial > Attachments: null-semantics.patch > > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False. > {noformat} > Seq[(java.lang.Boolean, java.lang.Boolean)]( > (true, null), > (false, null), > (null, true), > (null, false), > (null, null) > ) > .toDF("left_operand", "right_operand") > .withColumn("OR", 'left_operand || 'right_operand) > .withColumn("AND", 'left_operand && 'right_operand) > .show(truncate = false) > +------------+-------------+----+-----+ > |left_operand|right_operand|OR  |AND  | > +------------+-------------+----+-----+ > |true        |null         |true|null | > |false       |null         |null|false| > |null        |true         |true|null | > |null        |false        |null|false| < this line is incorrect in the docs > |null        |null         |null|null | > +------------+-------------+----+-----+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33253) Migration to NumPy documentation style in Streaming (pyspark.streaming.*)
[ https://issues.apache.org/jira/browse/SPARK-33253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33253: - Description: This JIRA targets to migrate to NumPy documentation style in Streaming (pyspark.streaming.*). Please also see the parent JIRA. (was: This JIRA targets to migrate to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA.) > Migration to NumPy documentation style in Streaming (pyspark.streaming.*) > - > > Key: SPARK-33253 > URL: https://issues.apache.org/jira/browse/SPARK-33253 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in Streaming > (pyspark.streaming.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)
[ https://issues.apache.org/jira/browse/SPARK-33254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33254: - Description: This JIRA targets to migrate to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA. (was: This JIRA targets to migrate to NumPy documentation style in ) > Migration to NumPy documentation style in Core (pyspark.*, > pyspark.resource.*, etc.) > > > Key: SPARK-33254 > URL: https://issues.apache.org/jira/browse/SPARK-33254 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in Core > (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
[ https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33252: - Description: This JIRA targets to migrate to NumPy documentation style in Streaming (pyspark.streaming.*). Please also see the parent JIRA. > Migration to NumPy documentation style in MLlib (pyspark.mllib.*) > - > > Key: SPARK-33252 > URL: https://issues.apache.org/jira/browse/SPARK-33252 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in Streaming > (pyspark.streaming.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33253) Migration to NumPy documentation style in Streaming (pyspark.streaming.*)
[ https://issues.apache.org/jira/browse/SPARK-33253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33253: - Description: This JIRA targets to migrate to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA. > Migration to NumPy documentation style in Streaming (pyspark.streaming.*) > - > > Key: SPARK-33253 > URL: https://issues.apache.org/jira/browse/SPARK-33253 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in Core > (pyspark.*, pyspark.resource.*, etc.). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
[ https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33252: - Description: This JIRA targets to migrate to NumPy documentation style in MLlib (pyspark.mllib.*). Please also see the parent JIRA. (was: This JIRA targets to migrate to NumPy documentation style in Streaming (pyspark.streaming.*). Please also see the parent JIRA.) > Migration to NumPy documentation style in MLlib (pyspark.mllib.*) > - > > Key: SPARK-33252 > URL: https://issues.apache.org/jira/browse/SPARK-33252 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA targets to migrate to NumPy documentation style in MLlib > (pyspark.mllib.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33254) Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.)
Hyukjin Kwon created SPARK-33254: Summary: Migration to NumPy documentation style in Core (pyspark.*, pyspark.resource.*, etc.) Key: SPARK-33254 URL: https://issues.apache.org/jira/browse/SPARK-33254 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
Hyukjin Kwon created SPARK-33252: Summary: Migration to NumPy documentation style in MLlib (pyspark.mllib.*) Key: SPARK-33252 URL: https://issues.apache.org/jira/browse/SPARK-33252 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33253) Migration to NumPy documentation style in Streaming (pyspark.streaming.*)
Hyukjin Kwon created SPARK-33253: Summary: Migration to NumPy documentation style in Streaming (pyspark.streaming.*) Key: SPARK-33253 URL: https://issues.apache.org/jira/browse/SPARK-33253 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33250) Migration to NumPy documentation style in SQL (pyspark.sql.*)
Hyukjin Kwon created SPARK-33250: Summary: Migration to NumPy documentation style in SQL (pyspark.sql.*) Key: SPARK-33250 URL: https://issues.apache.org/jira/browse/SPARK-33250 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33251) Migration to NumPy documentation style in ML (pyspark.ml.*)
Hyukjin Kwon created SPARK-33251: Summary: Migration to NumPy documentation style in ML (pyspark.ml.*) Key: SPARK-33251 URL: https://issues.apache.org/jira/browse/SPARK-33251 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33255) Use new API to construct ParquetFileReader and read Parquet footer
Yang Jie created SPARK-33255: Summary: Use new API to construct ParquetFileReader and read Parquet footer Key: SPARK-33255 URL: https://issues.apache.org/jira/browse/SPARK-33255 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33249) Add status plugin for live application
[ https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiyi Kong updated SPARK-33249: --- Description: There are cases that developer may want to extend the current REST API of Web UI. In most cases, adding external module is a better option than directly editing the original Spark code. For an external module, to extend the REST API of the Web UI, 2 things may need to be done: * Add extra API to provide extra status info. This can be simply done by implementing another ApiRequestContext which will be automatically loaded. * If the info can not be calculated from the original data in the store, add extra listeners to generate them. For history server, there is an interface called AppHistoryServerPlugin, which is loaded based on SPI, providing a method to create listeners. In live application, the only way is spark.extraListeners based on Utils.loadExtensions. But this is not enough for the cases. To let the API get the status info, the data need to be written to the AppStatusStore, which is the only store that an API can get by accessing "ui.store" or "ui.sc.statusStore". But listeners created by Utils.loadExtensions only get a SparkConf in construction, and are unable to write the AppStatusStore. So I think we still need plugin like AppHistorySever for live UI. For concerns like [#SPARK-22786], the plugin for live app can be separated from the history server one, and also loaded using Utils.loadExtensions with an extra configurations. So by default, nothing will be loaded. was: There are cases that developer may want to extend the current REST API of Web UI. In most cases, adding external module is a better option than directly editing the original Spark code. For an external module, to extend the REST API of the Web UI, 2 things may need to be done: * Add extra API to provide extra status info. This can be simply done by implementing another ApiRequestContext which will be automatically loaded. Add extra listeners to generate the status info if it can not be calculated from the original data. This brings the issue. For history server, there is an interface called AppHistoryServerPlugin, which is loaded based on SPI, providing a method to create listeners. In live application, the only way is spark.extraListeners based on Utils.loadExtensions. But this is not enough for the cases. To let the API get the status info, the data need to be written to the AppStatusStore, which is the only store that an API can get by accessing "ui.store" or "ui.sc.statusStore". But listeners created by Utils.loadExtensions only get a SparkConf in construction, and are unable to write the AppStatusStore. So I think we still need plugin like AppHistorySever for live UI. For concerns like [#SPARK-22786], the plugin for live app can be separated from the history server one, and also loaded using Utils.loadExtensions with an extra configurations. So by default, nothing will be loaded. > Add status plugin for live application > -- > > Key: SPARK-33249 > URL: https://issues.apache.org/jira/browse/SPARK-33249 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Weiyi Kong >Priority: Minor > > There are cases that developer may want to extend the current REST API of Web > UI. In most cases, adding external module is a better option than directly > editing the original Spark code. 
> For an external module, to extend the REST API of the Web UI, 2 things may > need to be done: > * Add extra API to provide extra status info. This can be simply done by > implementing another ApiRequestContext which will be automatically loaded. > * If the info can not be calculated from the original data in the store, add > extra listeners to generate them. > For history server, there is an interface called AppHistoryServerPlugin, > which is loaded based on SPI, providing a method to create listeners. In live > application, the only way is spark.extraListeners based on > Utils.loadExtensions. But this is not enough for the cases. > To let the API get the status info, the data need to be written to the > AppStatusStore, which is the only store that an API can get by accessing > "ui.store" or "ui.sc.statusStore". But listeners created by > Utils.loadExtensions only get a SparkConf in construction, and are unable to > write the AppStatusStore. > So I think we still need plugin like AppHistorySever for live UI. For > concerns like [#SPARK-22786], the plugin for live app can be separated from > the history server one, and also loaded using Utils.loadExtensions with an > extra configurations. So by default, nothing will be loaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) -
[jira] [Updated] (SPARK-33249) Add status plugin for live application
[ https://issues.apache.org/jira/browse/SPARK-33249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weiyi Kong updated SPARK-33249: --- Description: There are cases that developer may want to extend the current REST API of Web UI. In most cases, adding external module is a better option than directly editing the original Spark code. For an external module, to extend the REST API of the Web UI, 2 things may need to be done: * Add extra API to provide extra status info. This can be simply done by implementing another ApiRequestContext which will be automatically loaded. * If the info can not be calculated from the original data in the store, add extra listeners to generate them. For history server, there is an interface called AppHistoryServerPlugin, which is loaded based on SPI, providing a method to create listeners. In live application, the only way is spark.extraListeners based on Utils.loadExtensions. But this is not enough for the cases. To let the API get the status info, the data need to be written to the AppStatusStore, which is the only store that an API can get by accessing "ui.store" or "ui.sc.statusStore". But listeners created by Utils.loadExtensions only get a SparkConf in construction, and are unable to write the AppStatusStore. So I think we still need plugin like AppHistorySever for live UI. For concerns like SPARK-22786, the plugin for live app can be separated from the history server one, and also loaded using Utils.loadExtensions with an extra configurations. So by default, nothing will be loaded. was: There are cases that developer may want to extend the current REST API of Web UI. In most cases, adding external module is a better option than directly editing the original Spark code. For an external module, to extend the REST API of the Web UI, 2 things may need to be done: * Add extra API to provide extra status info. This can be simply done by implementing another ApiRequestContext which will be automatically loaded. * If the info can not be calculated from the original data in the store, add extra listeners to generate them. For history server, there is an interface called AppHistoryServerPlugin, which is loaded based on SPI, providing a method to create listeners. In live application, the only way is spark.extraListeners based on Utils.loadExtensions. But this is not enough for the cases. To let the API get the status info, the data need to be written to the AppStatusStore, which is the only store that an API can get by accessing "ui.store" or "ui.sc.statusStore". But listeners created by Utils.loadExtensions only get a SparkConf in construction, and are unable to write the AppStatusStore. So I think we still need plugin like AppHistorySever for live UI. For concerns like [#SPARK-22786], the plugin for live app can be separated from the history server one, and also loaded using Utils.loadExtensions with an extra configurations. So by default, nothing will be loaded. > Add status plugin for live application > -- > > Key: SPARK-33249 > URL: https://issues.apache.org/jira/browse/SPARK-33249 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Affects Versions: 2.4.7, 3.0.1 >Reporter: Weiyi Kong >Priority: Minor > > There are cases that developer may want to extend the current REST API of Web > UI. In most cases, adding external module is a better option than directly > editing the original Spark code. 
> For an external module, to extend the REST API of the Web UI, 2 things may > need to be done: > * Add extra API to provide extra status info. This can be simply done by > implementing another ApiRequestContext which will be automatically loaded. > * If the info can not be calculated from the original data in the store, add > extra listeners to generate them. > For history server, there is an interface called AppHistoryServerPlugin, > which is loaded based on SPI, providing a method to create listeners. In live > application, the only way is spark.extraListeners based on > Utils.loadExtensions. But this is not enough for the cases. > To let the API get the status info, the data need to be written to the > AppStatusStore, which is the only store that an API can get by accessing > "ui.store" or "ui.sc.statusStore". But listeners created by > Utils.loadExtensions only get a SparkConf in construction, and are unable to > write the AppStatusStore. > So I think we still need plugin like AppHistorySever for live UI. For > concerns like SPARK-22786, the plugin for live app can be separated from the > history server one, and also loaded using Utils.loadExtensions with an extra > configurations. So by default, nothing will be loaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) --
[jira] [Commented] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size
[ https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221087#comment-17221087 ] Apache Spark commented on SPARK-33248: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30156 > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > -- > > Key: SPARK-33248 > URL: https://issues.apache.org/jira/browse/SPARK-33248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > > FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size
[ https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33248: Assignee: Apache Spark > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > -- > > Key: SPARK-33248 > URL: https://issues.apache.org/jira/browse/SPARK-33248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > > FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size
[ https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221084#comment-17221084 ] Apache Spark commented on SPARK-33248: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/30156 > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > -- > > Key: SPARK-33248 > URL: https://issues.apache.org/jira/browse/SPARK-33248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > > FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size
[ https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33248: Assignee: (was: Apache Spark) > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > -- > > Key: SPARK-33248 > URL: https://issues.apache.org/jira/browse/SPARK-33248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > > FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32085) Migrate to NumPy documentation style
[ https://issues.apache.org/jira/browse/SPARK-32085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-32085: - Affects Version/s: (was: 3.0.0) 3.1.0 > Migrate to NumPy documentation style > > > Key: SPARK-32085 > URL: https://issues.apache.org/jira/browse/SPARK-32085 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/numpy/numpydoc > For example, > Before: > https://github.com/apache/spark/blob/f0e6d0ec13d9cdadf341d1b976623345bcdb1028/python/pyspark/sql/dataframe.py#L276-L318 > After: > https://github.com/databricks/koalas/blob/6711e9c0f50c79dd57eeedb530da6c4ea3298de2/databricks/koalas/frame.py#L1122-L1176 > We can incrementally start to switch. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33249) Add status plugin for live application
Weiyi Kong created SPARK-33249: -- Summary: Add status plugin for live application Key: SPARK-33249 URL: https://issues.apache.org/jira/browse/SPARK-33249 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Affects Versions: 3.0.1, 2.4.7 Reporter: Weiyi Kong There are cases where a developer may want to extend the current REST API of the Web UI. In most cases, adding an external module is a better option than directly editing the original Spark code. For an external module, extending the REST API of the Web UI may require two things: * Add an extra API to provide extra status info. This can simply be done by implementing another ApiRequestContext, which will be loaded automatically. * Add extra listeners to generate the status info if it cannot be calculated from the original data. This is where the issue lies. For the history server, there is an interface called AppHistoryServerPlugin, which is loaded via SPI and provides a method to create listeners. In a live application, the only way is spark.extraListeners, based on Utils.loadExtensions. But this is not enough for these cases. For the API to get the status info, the data needs to be written to the AppStatusStore, which is the only store an API can reach by accessing "ui.store" or "ui.sc.statusStore". But listeners created by Utils.loadExtensions only get a SparkConf at construction, and are unable to write to the AppStatusStore. So I think we still need a plugin like AppHistoryServerPlugin for the live UI. To address concerns like [#SPARK-22786], the plugin for a live app can be separated from the history server one, and also loaded using Utils.loadExtensions with an extra configuration. So by default, nothing would be loaded. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
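To make the proposal above concrete, a minimal sketch of what a live-application status plugin could look like is shown below, mirroring the AppHistoryServerPlugin idea. The trait LiveStatusPlugin, its untyped store parameter, and the package name are hypothetical illustrations of the proposal, not an existing Spark API; only SparkConf, SparkListener, and SparkListenerTaskEnd are real Spark classes.
{code:scala}
package org.example.sparkext  // hypothetical package for an external module

import org.apache.spark.SparkConf
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical counterpart of AppHistoryServerPlugin for live applications:
// the plugin receives the conf and a handle to the app status store so that
// its listeners can write data an extra ApiRequestContext could later serve.
trait LiveStatusPlugin {
  def createListeners(conf: SparkConf, store: AnyRef): Seq[SparkListener]
}

// A toy plugin that counts finished tasks; in the proposed design the counter
// would be persisted into the status store instead of a local field.
class TaskCountingPlugin extends LiveStatusPlugin {
  override def createListeners(conf: SparkConf, store: AnyRef): Seq[SparkListener] = {
    Seq(new SparkListener {
      private var finishedTasks = 0L
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        finishedTasks += 1
      }
    })
  }
}
{code}
In the proposed design such a plugin would be discovered through Utils.loadExtensions from an extra configuration, so nothing is loaded unless the user opts in.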
[jira] [Assigned] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py
[ https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32084: Assignee: Maciej Szymkiewicz > Replace dictionary-based function definitions to proper functions in > functions.py > - > > Key: SPARK-32084 > URL: https://issues.apache.org/jira/browse/SPARK-32084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > > Currently some functions in {{functions.py}} are defined by a dictionary. It > programmatically defines the functions to the module; however, it makes some > IDEs such as PyCharm don't detect. > Also, it makes hard to add proper examples into the docstrings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32084) Replace dictionary-based function definitions to proper functions in functions.py
[ https://issues.apache.org/jira/browse/SPARK-32084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32084. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30143 [https://github.com/apache/spark/pull/30143] > Replace dictionary-based function definitions to proper functions in > functions.py > - > > Key: SPARK-32084 > URL: https://issues.apache.org/jira/browse/SPARK-32084 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > Currently some functions in {{functions.py}} are defined by a dictionary. It > programmatically defines the functions to the module; however, it makes some > IDEs such as PyCharm don't detect. > Also, it makes hard to add proper examples into the docstrings. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size
[ https://issues.apache.org/jira/browse/SPARK-33248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-33248: -- Description: Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691] was:Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > -- > > Key: SPARK-33248 > URL: https://issues.apache.org/jira/browse/SPARK-33248 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > > FOR comment [https://github.com/apache/spark/pull/29421#discussion_r511684691] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33238) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size
[ https://issues.apache.org/jira/browse/SPARK-33238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-33238. --- Resolution: Duplicate > Add a configuration to control the legacy behavior of whether to pad > null values when the value size is less than the schema size > -- > > Key: SPARK-33238 > URL: https://issues.apache.org/jira/browse/SPARK-33238 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > FOR comment https://github.com/apache/spark/pull/29421#discussion_r511684691 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33248) Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size
angerszhu created SPARK-33248: - Summary: Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size Key: SPARK-33248 URL: https://issues.apache.org/jira/browse/SPARK-33248 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.1 Reporter: angerszhu Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
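The behavior the proposed flag would govern is easiest to see as a small sketch: when a transform script emits fewer fields than the declared output schema, the legacy behavior pads the missing columns with null. The function below is purely illustrative and is not Spark's internal code; the parameter legacyPadNulls stands in for the configuration value.
{code:scala}
// Illustration of the padding decision the proposed configuration controls.
def padToSchema(fields: Array[String], schemaSize: Int, legacyPadNulls: Boolean): Array[String] = {
  if (legacyPadNulls && fields.length < schemaSize) {
    // Legacy behavior: fill the trailing, missing columns with null.
    fields ++ Array.fill[String](schemaSize - fields.length)(null)
  } else {
    // Otherwise the row is left exactly as the script produced it.
    fields
  }
}
{code}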
[jira] [Created] (SPARK-33247) Improve examples and scenarios in docstrings
Hyukjin Kwon created SPARK-33247: Summary: Improve examples and scenarios in docstrings Key: SPARK-33247 URL: https://issues.apache.org/jira/browse/SPARK-33247 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 3.1.0 Reporter: Hyukjin Kwon Currently, PySpark documentation does not have a lot of examples and scenarios. See also https://github.com/apache/spark/pull/30149#issuecomment-716490037. We should add/improve examples especially in the commonly used APIs. For example, {{Column}}, {{DataFrame}}. {{RDD}}, {{SparkContext}}, etc. This umbrella JIRA targets to improve them in commonly used APIs. NOTE that we'll have to convert the docstrings into numpydoc style first in a separate PR (at SPARK-32085), and then add examples. In this way, we can manage migration to numpydoc and example improvement here separately (e.g., reverting numpydoc migration only). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32388) TRANSFORM when schema less should keep same with hive
[ https://issues.apache.org/jira/browse/SPARK-32388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32388. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29421 [https://github.com/apache/spark/pull/29421] > TRANSFORM when schema less should keep same with hive > - > > Key: SPARK-32388 > URL: https://issues.apache.org/jira/browse/SPARK-32388 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.1.0 > > > Hive transform without schema > > {code:java} > hive> create table t (c0 int, c1 int, c2 int); > hive> INSERT INTO t VALUES (1, 1, 1); > hive> INSERT INTO t VALUES (2, 2, 2); > hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; > hive> DESCRIBE v; > key string > value string > hive> SELECT * FROM v; > 1 1 1 > 2 2 2 > hive> SELECT key FROM v; > 1 > 2 > hive> SELECT value FROM v; > 1 1 > 2 2{code} > Spark > {code:java} > hive> create table t (c0 int, c1 int, c2 int); > hive> INSERT INTO t VALUES (1, 1, 1); > hive> INSERT INTO t VALUES (2, 2, 2); > hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t; > hive> SELECT * FROM v; > 1 11 > 2 22 {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32185) User Guide - Monitoring
[ https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221047#comment-17221047 ] Hyukjin Kwon commented on SPARK-32185: -- Thanks! > User Guide - Monitoring > --- > > Key: SPARK-32185 > URL: https://issues.apache.org/jira/browse/SPARK-32185 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Abhijeet Prasad >Priority: Major > > Monitoring. We should focus on how to monitor PySpark jobs. > - Custom Worker, see also > https://github.com/apache/spark/tree/master/python/test_coverage to enable > test coverage that include worker sides too. > - Sentry Support \(?\) > https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark > - Link back https://spark.apache.org/docs/latest/monitoring.html . > - ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32789) Wildcards not working in get_json_object
[ https://issues.apache.org/jira/browse/SPARK-32789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221044#comment-17221044 ] Aoyuan Liao commented on SPARK-32789: - [~tuhren] Not sure if HIve supports wildcard for dictionary. From documentation, star only works for array. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-get_json_object > Wildcards not working in get_json_object > > > Key: SPARK-32789 > URL: https://issues.apache.org/jira/browse/SPARK-32789 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Thomas Uhren >Priority: Major > Attachments: image-2020-09-03-13-22-38-569.png > > > It seems that wildcards (star) are not supported when using > {{get_json_object}}: > {code:java} > spark.sql("""select get_json_object('{"k":{"value":"abc"}}', '$.*.value') as > j""").show() > {code} > This results in {{null}} while it should return 'abc'. It works if I replace > * with 'k'. > !image-2020-09-03-13-22-38-569.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
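Following the comment above, the star wildcard is documented for arrays rather than objects, so the closest supported variant of the reporter's query targets an array. This is only a sketch assuming an active SparkSession named spark; the exact string returned may differ between Hive and Spark versions.
{code:scala}
// The reporter's path '$.*.value' walks object keys; the documented wildcard
// form walks array elements instead, e.g. '$.k[*].value' over an array "k".
spark.sql(
  """SELECT get_json_object('{"k":[{"value":"a"},{"value":"b"}]}', '$.k[*].value') AS j"""
).show(false)
{code}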
[jira] [Resolved] (SPARK-32789) Wildcards not working in get_json_object
[ https://issues.apache.org/jira/browse/SPARK-32789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aoyuan Liao resolved SPARK-32789. - Resolution: Not A Problem > Wildcards not working in get_json_object > > > Key: SPARK-32789 > URL: https://issues.apache.org/jira/browse/SPARK-32789 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Thomas Uhren >Priority: Major > Attachments: image-2020-09-03-13-22-38-569.png > > > It seems that wildcards (star) are not supported when using > {{get_json_object}}: > {code:java} > spark.sql("""select get_json_object('{"k":{"value":"abc"}}', '$.*.value') as > j""").show() > {code} > This results in {{null}} while it should return 'abc'. It works if I replace > * with 'k'. > !image-2020-09-03-13-22-38-569.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33231) Make podCreationTimeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33231: Assignee: Apache Spark > Make podCreationTimeout configurable > > > Key: SPARK-33231 > URL: https://issues.apache.org/jira/browse/SPARK-33231 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Major > > Execution Monitor & Pod Allocator have differing views of the world which can > lead to pod trashing. > The executor monitor can be notified of an executor coming up before a > snapshot is delivered to the PodAllocator. This can cause the executor > monitor to believe it needs to delete a pod, and the pod allocator to believe > that it needs to create a new pod. This happens if the podCreationTimeout is > too low for the cluster. Currently podCreationTimeout can only be configured > by increasing the batch delay but that has additional consequences leading to > slower spin up. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33231) Make podCreationTimeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221031#comment-17221031 ] Apache Spark commented on SPARK-33231: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/30155 > Make podCreationTimeout configurable > > > Key: SPARK-33231 > URL: https://issues.apache.org/jira/browse/SPARK-33231 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Holden Karau >Priority: Major > > Execution Monitor & Pod Allocator have differing views of the world which can > lead to pod trashing. > The executor monitor can be notified of an executor coming up before a > snapshot is delivered to the PodAllocator. This can cause the executor > monitor to believe it needs to delete a pod, and the pod allocator to believe > that it needs to create a new pod. This happens if the podCreationTimeout is > too low for the cluster. Currently podCreationTimeout can only be configured > by increasing the batch delay but that has additional consequences leading to > slower spin up. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33231) Make podCreationTimeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33231: Assignee: (was: Apache Spark) > Make podCreationTimeout configurable > > > Key: SPARK-33231 > URL: https://issues.apache.org/jira/browse/SPARK-33231 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Holden Karau >Priority: Major > > Execution Monitor & Pod Allocator have differing views of the world which can > lead to pod trashing. > The executor monitor can be notified of an executor coming up before a > snapshot is delivered to the PodAllocator. This can cause the executor > monitor to believe it needs to delete a pod, and the pod allocator to believe > that it needs to create a new pod. This happens if the podCreationTimeout is > too low for the cluster. Currently podCreationTimeout can only be configured > by increasing the batch delay but that has additional consequences leading to > slower spin up. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33231) Make podCreationTimeout configurable
[ https://issues.apache.org/jira/browse/SPARK-33231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221032#comment-17221032 ] Apache Spark commented on SPARK-33231: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/30155 > Make podCreationTimeout configurable > > > Key: SPARK-33231 > URL: https://issues.apache.org/jira/browse/SPARK-33231 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: Holden Karau >Priority: Major > > Execution Monitor & Pod Allocator have differing views of the world which can > lead to pod trashing. > The executor monitor can be notified of an executor coming up before a > snapshot is delivered to the PodAllocator. This can cause the executor > monitor to believe it needs to delete a pod, and the pod allocator to believe > that it needs to create a new pod. This happens if the podCreationTimeout is > too low for the cluster. Currently podCreationTimeout can only be configured > by increasing the batch delay but that has additional consequences leading to > slower spin up. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221029#comment-17221029 ] Apache Spark commented on SPARK-32405: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/30154 > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
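A rough sketch of the kind of dialect hook this sub-task asks for is shown below: turning user-supplied table options into the trailing clause of a generated CREATE TABLE statement. The method createTableOptionsClause is hypothetical and not part of the JdbcDialect API; only JdbcDialect and its canHandle method are real.
{code:scala}
import org.apache.spark.sql.jdbc.JdbcDialect

// Hypothetical sketch: a dialect that knows how to render table options such
// as ENGINE=InnoDB into the tail of a CREATE TABLE statement.
object MySQLishDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mysql")

  // Not a real JdbcDialect method; illustrates the API the ticket proposes.
  def createTableOptionsClause(options: Map[String, String]): String =
    options.map { case (k, v) => s"$k=$v" }.mkString(" ")
}

// e.g. createTableOptionsClause(Map("ENGINE" -> "InnoDB")) returns "ENGINE=InnoDB",
// which would be appended after the column list of the CREATE TABLE statement.
{code}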
[jira] [Assigned] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32405: Assignee: Apache Spark > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221028#comment-17221028 ] Apache Spark commented on SPARK-32405: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/30154 > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog
[ https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32405: Assignee: (was: Apache Spark) > Apply table options while creating tables in JDBC Table Catalog > --- > > Key: SPARK-32405 > URL: https://issues.apache.org/jira/browse/SPARK-32405 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > We need to add an API to `JdbcDialect` to generate the SQL statement to > specify table options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job
[ https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33237: - Assignee: Dongjoon Hyun > Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT > Jenkins job > -- > > Key: SPARK-33237 > URL: https://issues.apache.org/jira/browse/SPARK-33237 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job
[ https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33237. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30153 [https://github.com/apache/spark/pull/30153] > Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT > Jenkins job > -- > > Key: SPARK-33237 > URL: https://issues.apache.org/jira/browse/SPARK-33237 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > > Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33150) Groupby key may not be unique when using window
[ https://issues.apache.org/jira/browse/SPARK-33150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220984#comment-17220984 ] Aoyuan Liao commented on SPARK-33150: - [~DieterDP] After I looked deeper into code, the issue is not within spark. Spark creates pandas.dataframe from pd.DataFrame.from_records. However. it ignores the fold attribute of datetime object, which leads to the same window, as: {code:java} >>> from datetime import datetime >>> test = pd.DataFrame.from_records([(datetime(2019, 10, 27, 2, 54), 1), >>> (datetime(2019, 10, 27, 2, 54, fold=1), 3)]) >>> test 0 1 0 2019-10-27 02:54:00 1 1 2019-10-27 02:54:00 3 {code} IMHO, there is nothing much spark can do. If you enable arrow in spark(config spark.sql.execution.arrow.pyspark.enabled as true), the two UTC timestamps of dataframe will be distiguished. > Groupby key may not be unique when using window > --- > > Key: SPARK-33150 > URL: https://issues.apache.org/jira/browse/SPARK-33150 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.3, 3.0.0 >Reporter: Dieter De Paepe >Priority: Major > > > Due to the way spark converts dates to local times, it may end up losing > details that allow it to differentiate instants when those times fall in the > transition for daylight savings time. Setting the spark timezone to UTC does > not resolve the issue. > This issue is somewhat related to SPARK-32123, but seems independent enough > to consider this a separate issue. > A minimal example is below. I tested these on Spark 3.0.0 and 2.3.3 (I could > not get 2.4.x to work on my system). My machine is located in timezone > "Europe/Brussels". > > {code:java} > import pyspark > import pyspark.sql.functions as f > spark = (pyspark > .sql > .SparkSession > .builder > .master('local[1]') > .config("spark.sql.session.timeZone", "UTC") > .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC') \ > .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC') > .getOrCreate() > ) > debug_df = spark.createDataFrame([ > (1572137640, 1), > (1572137640, 2), > (1572141240, 3), > (1572141240, 4) > ],['epochtime', 'value']) > debug_df \ > .withColumn('time', f.from_unixtime('epochtime')) \ > .withColumn('window', f.window('time', '1 minute').start) \ > .collect() > {code} > > Output, here we see the window function internally transforms the times to > local time, and as such has to disambiguate between the Belgian winter and > summer hour transition by setting the "fold" attribute: > > {code:java} > [Row(epochtime=1572137640, value=1, time='2019-10-27 00:54:00', > window=datetime.datetime(2019, 10, 27, 2, 54)), > Row(epochtime=1572137640, value=2, time='2019-10-27 00:54:00', > window=datetime.datetime(2019, 10, 27, 2, 54)), > Row(epochtime=1572141240, value=3, time='2019-10-27 01:54:00', > window=datetime.datetime(2019, 10, 27, 2, 54, fold=1)), > Row(epochtime=1572141240, value=4, time='2019-10-27 01:54:00', > window=datetime.datetime(2019, 10, 27, 2, 54, fold=1))]{code} > > Now, this has severe implications when we use the window function for a > groupby operation: > > {code:java} > output = debug_df \ > .withColumn('time', f.from_unixtime('epochtime')) \ > .groupby(f.window('time', '1 minute').start.alias('window')).agg( >f.min('value').alias('min_value') > ) > output_collect = output.collect() > output_pandas = output.toPandas() > print(output_collect) > print(output_pandas) > {code} > Output: > > {code:java} > [Row(window=datetime.datetime(2019, 10, 27, 2, 54), min_value=1), > 
Row(window=datetime.datetime(2019, 10, 27, 2, 54, fold=1), min_value=3)] > window min_value > 0 2019-10-27 00:54:00 1 > 1 2019-10-27 00:54:00 3 > {code} > > While the output using collect() outputs Belgian local time, it allows us to > differentiate between the two different keys visually using the fold > attribute. However, due to the way the fold attribute is defined, [it is > ignored for|https://www.python.org/dev/peps/pep-0495/#the-fold-attribute] > equality comparison. > On the other hand, the pandas output uses the UTC output (due to the setting > of spark.sql.session.timeZone), but it has lost the disambiguating fold > attribute in the pandas datatype conversion. > In both cases, the column on which was grouped is not unique. > > {code:java} > print(output_collect[0].window == output_collect[1].window) # True > print(output_collect[0].window.fold == output_collect[1].window.fold) # False > print(output_pandas.window[0] == output_pandas.window[1]) # True > print(output_pandas.window[0].fold == output_pandas.window[1].fold) # True > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) ---
[jira] [Updated] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan
[ https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33228: -- Fix Version/s: (was: 2.4.8) > Don't uncache data when replacing an existing view having the same plan > --- > > Key: SPARK-33228 > URL: https://issues.apache.org/jira/browse/SPARK-33228 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > SPARK-30494's updated the `CreateViewCommand` code to implicitly drop cache > when replacing an existing view. But, this change drops cache even when > replacing a view having the same logical plan. A sequence of queries to > reproduce this as follows; > {code} > scala> val df = spark.range(1).selectExpr("id a", "id b") > scala> df.cache() > scala> df.explain() > == Physical Plan == > *(1) ColumnarToRow > +- InMemoryTableScan [a#2L, b#3L] > +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] > +- *(1) Range (0, 1, step=1, splits=4) > scala> df.createOrReplaceTempView("t") > scala> sql("select * from t").explain() > == Physical Plan == > *(1) ColumnarToRow > +- InMemoryTableScan [a#2L, b#3L] > +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 > replicas) > +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L] > +- *(1) Range (0, 1, step=1, splits=4) > // If one re-runs the same query `df.createOrReplaceTempView("t")`, the > cache's swept away > scala> df.createOrReplaceTempView("t") > scala> sql("select * from t").explain() > == Physical Plan == > *(1) Project [id#0L AS a#2L, id#0L AS b#3L] > +- *(1) Range (0, 1, step=1, splits=4) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second
[ https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33230. --- Fix Version/s: 2.4.8 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30141 [https://github.com/apache/spark/pull/30141] > FileOutputWriter jobs have duplicate JobIDs if launched in same second > -- > > Key: SPARK-33230 > URL: https://issues.apache.org/jira/browse/SPARK-33230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Fix For: 3.1.0, 3.0.2, 2.4.8 > > > The Hadoop S3A staging committer has problems with >1 spark sql query being > launched simultaneously, as it uses the jobID for its path in the clusterFS > to pass the commit information from tasks to job committer. > If two queries are launched in the same second, they conflict and the output > of job 1 includes that of all job2 files written so far; job 2 will fail with > FNFE. > Proposed: > job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of > {{WriteJobDescription.uuid}} > That was the property name which used to serve this purpose; any committers > already written which use this property will pick it up without needing any > changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
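For readers unfamiliar with the committer side, a minimal sketch of how a committer could consume the proposed property follows. The helper name is made up for illustration; only the property key spark.sql.sources.writeJobUUID comes from this ticket, and the calls used are standard Hadoop JobContext methods.
{code:scala}
import org.apache.hadoop.mapreduce.JobContext

// Prefer the per-query UUID when Spark provides it; fall back to the job ID,
// which only has second resolution and is what causes the clashes described above.
def uniqueJobDirName(context: JobContext): String = {
  val uuid = context.getConfiguration.get("spark.sql.sources.writeJobUUID")
  if (uuid != null) uuid else context.getJobID.toString
}
{code}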
[jira] [Assigned] (SPARK-33230) FileOutputWriter jobs have duplicate JobIDs if launched in same second
[ https://issues.apache.org/jira/browse/SPARK-33230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33230: - Assignee: Steve Loughran > FileOutputWriter jobs have duplicate JobIDs if launched in same second > -- > > Key: SPARK-33230 > URL: https://issues.apache.org/jira/browse/SPARK-33230 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.1 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > > The Hadoop S3A staging committer has problems with >1 spark sql query being > launched simultaneously, as it uses the jobID for its path in the clusterFS > to pass the commit information from tasks to job committer. > If two queries are launched in the same second, they conflict and the output > of job 1 includes that of all job2 files written so far; job 2 will fail with > FNFE. > Proposed: > job conf to set {{"spark.sql.sources.writeJobUUID"}} to the value of > {{WriteJobDescription.uuid}} > That was the property name which used to serve this purpose; any committers > already written which use this property will pick it up without needing any > changes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job
[ https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-33237: -- Component/s: Kubernetes > Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT > Jenkins job > -- > > Key: SPARK-33237 > URL: https://issues.apache.org/jira/browse/SPARK-33237 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job
[ https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33237: Assignee: Apache Spark > Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT > Jenkins job > -- > > Key: SPARK-33237 > URL: https://issues.apache.org/jira/browse/SPARK-33237 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job
[ https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220943#comment-17220943 ] Apache Spark commented on SPARK-33237: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30153 > Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT > Jenkins job > -- > > Key: SPARK-33237 > URL: https://issues.apache.org/jira/browse/SPARK-33237 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33237) Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT Jenkins job
[ https://issues.apache.org/jira/browse/SPARK-33237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33237: Assignee: (was: Apache Spark) > Use default Hadoop profile by removing explicit `-Phadoop-2.7` from K8s IT > Jenkins job > -- > > Key: SPARK-33237 > URL: https://issues.apache.org/jira/browse/SPARK-33237 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > Since Apache Spark 3.1.0, the default Hadoop version is 3.1.0. > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32185) User Guide - Monitoring
[ https://issues.apache.org/jira/browse/SPARK-32185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220902#comment-17220902 ] Abhijeet Prasad commented on SPARK-32185: - Hey, sorry for not updating this issue. I have been very busy with school these past few months, but I will try to get a PR out within the next week or so. > User Guide - Monitoring > --- > > Key: SPARK-32185 > URL: https://issues.apache.org/jira/browse/SPARK-32185 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Abhijeet Prasad >Priority: Major > > Monitoring. We should focus on how to monitor PySpark jobs. > - Custom Worker, see also > https://github.com/apache/spark/tree/master/python/test_coverage to enable > test coverage that include worker sides too. > - Sentry Support \(?\) > https://blog.sentry.io/2019/11/12/sentry-for-data-error-monitoring-with-pyspark > - Link back https://spark.apache.org/docs/latest/monitoring.html . > - ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-33197) Changes to spark.sql.analyzer.maxIterations do not take effect at runtime
[ https://issues.apache.org/jira/browse/SPARK-33197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuning Zhang closed SPARK-33197. > Changes to spark.sql.analyzer.maxIterations do not take effect at runtime > - > > Key: SPARK-33197 > URL: https://issues.apache.org/jira/browse/SPARK-33197 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.1.0 >Reporter: Yuning Zhang >Assignee: Yuning Zhang >Priority: Major > Fix For: 3.0.2, 3.1.0 > > > `spark.sql.analyzer.maxIterations` is not a static conf. However, changes to > it do not take effect at runtime. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
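The reported symptom, in one line: a runtime change such as the following was ignored by the analyzer until this fix. This assumes an active SparkSession named spark.
{code:scala}
// spark.sql.analyzer.maxIterations is not a static conf, so this should take
// effect immediately; per this ticket it previously did not.
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
{code}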
[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC
[ https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220856#comment-17220856 ] Denise Mauldin commented on SPARK-19335: [~kevinyu98] Using AWS Glue to copy/update data between two databases. We do not want to TRUNCATE the tables. We need to update every row in a table without modifying tables that have foreign keys to this table. > Spark should support doing an efficient DataFrame Upsert via JDBC > - > > Key: SPARK-19335 > URL: https://issues.apache.org/jira/browse/SPARK-19335 > Project: Spark > Issue Type: Improvement >Reporter: Ilya Ganelin >Priority: Minor > > Doing a database update, as opposed to an insert is useful, particularly when > working with streaming applications which may require revisions to previously > stored data. > Spark DataFrames/DataSets do not currently support an Update feature via the > JDBC Writer allowing only Overwrite or Append. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC
[ https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220855#comment-17220855 ] Denise Mauldin commented on SPARK-19335: +1 This is a major deficiency for using Spark in ETL jobs. > Spark should support doing an efficient DataFrame Upsert via JDBC > - > > Key: SPARK-19335 > URL: https://issues.apache.org/jira/browse/SPARK-19335 > Project: Spark > Issue Type: Improvement >Reporter: Ilya Ganelin >Priority: Minor > > Doing a database update, as opposed to an insert is useful, particularly when > working with streaming applications which may require revisions to previously > stored data. > Spark DataFrames/DataSets do not currently support an Update feature via the > JDBC Writer allowing only Overwrite or Append. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
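For context on the gap being discussed, the JDBC writer today only offers whole-table save modes, as sketched below; there is no mode that issues UPDATE or UPSERT statements. The connection details and the DataFrame df are placeholders.
{code:scala}
import java.util.Properties

val props = new Properties()
props.setProperty("user", "spark")      // placeholder credentials
props.setProperty("password", "secret")

// Only append/overwrite (plus error/ignore) are available; there is no upsert mode.
df.write
  .mode("append")
  .jdbc("jdbc:postgresql://host:5432/db", "public.target_table", props)
{code}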
[jira] [Updated] (SPARK-33246) Spark SQL null semantics documentation is incorrect
[ https://issues.apache.org/jira/browse/SPARK-33246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stuart White updated SPARK-33246: - Attachment: null-semantics.patch > Spark SQL null semantics documentation is incorrect > --- > > Key: SPARK-33246 > URL: https://issues.apache.org/jira/browse/SPARK-33246 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.0.1 >Reporter: Stuart White >Priority: Trivial > Attachments: null-semantics.patch > > > The documentation of Spark SQL's null semantics is (I believe) incorrect. > The documentation states that "NULL AND False" yields NULL, when in fact it > yields False. > {noformat} > Seq[(java.lang.Boolean, java.lang.Boolean)]( > (true, null), > (false, null), > (null, true), > (null, false), > (null, null) > ) > .toDF("left_operand", "right_operand") > .withColumn("OR", 'left_operand || 'right_operand) > .withColumn("AND", 'left_operand && 'right_operand) > .show(truncate = false) > ++-++-+ > |left_operand|right_operand|OR |AND | > ++-++-+ > |true|null |true|null | > |false |null |null|false| > |null|true |true|null | > |null|false|null|false| < this line is incorrect in the > docs > |null|null |null|null | > ++-++-+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33246) Spark SQL null semantics documentation is incorrect
Stuart White created SPARK-33246: Summary: Spark SQL null semantics documentation is incorrect Key: SPARK-33246 URL: https://issues.apache.org/jira/browse/SPARK-33246 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.0.1 Reporter: Stuart White Attachments: null-semantics.patch The documentation of Spark SQL's null semantics is (I believe) incorrect. The documentation states that "NULL AND False" yields NULL, when in fact it yields False. {noformat} Seq[(java.lang.Boolean, java.lang.Boolean)]( (true, null), (false, null), (null, true), (null, false), (null, null) ) .toDF("left_operand", "right_operand") .withColumn("OR", 'left_operand || 'right_operand) .withColumn("AND", 'left_operand && 'right_operand) .show(truncate = false) ++-++-+ |left_operand|right_operand|OR |AND | ++-++-+ |true|null |true|null | |false |null |null|false| |null|true |true|null | |null|false|null|false| < this line is incorrect in the docs |null|null |null|null | ++-++-+ {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23539) Add support for Kafka headers in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-23539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220813#comment-17220813 ] Calvin commented on SPARK-23539: [~dongjin]/[~kabhwan] Apologies for reviving this long-closed ticket, but I was wondering if there are any plans to backport this feature to any of the Spark 2.x.x versions or if this feature will only be available from 3.0.0 onward? > Add support for Kafka headers in Structured Streaming > - > > Key: SPARK-23539 > URL: https://issues.apache.org/jira/browse/SPARK-23539 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Dongjin Lee >Priority: Major > Fix For: 3.0.0 > > > Kafka headers were added in 0.11. We should expose them through our kafka > data source in both batch and streaming queries. > This is currently blocked on version of Kafka in Spark from 0.10.1 to 1.0+ > SPARK-18057 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33228) Don't uncache data when replacing an existing view having the same plan
[ https://issues.apache.org/jira/browse/SPARK-33228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220807#comment-17220807 ] Apache Spark commented on SPARK-33228: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/30152 > Don't uncache data when replacing an existing view having the same plan > --- > > Key: SPARK-33228 > URL: https://issues.apache.org/jira/browse/SPARK-33228 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache when replacing an existing view. But this change drops the cache even when replacing a view with the same logical plan. A sequence of queries to reproduce this is as follows:
> {code}
> scala> val df = spark.range(1).selectExpr("id a", "id b")
> scala> df.cache()
> scala> df.explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
>             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>                +- *(1) Range (0, 1, step=1, splits=4)
>
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) ColumnarToRow
> +- InMemoryTableScan [a#2L, b#3L]
>       +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
>             +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
>                +- *(1) Range (0, 1, step=1, splits=4)
>
> // If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache is swept away
> scala> df.createOrReplaceTempView("t")
> scala> sql("select * from t").explain()
> == Physical Plan ==
> *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
> +- *(1) Range (0, 1, step=1, splits=4)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220752#comment-17220752 ] Sean R. Owen commented on SPARK-26043: -- I don't have a strong opinion on it. [~vanzin] says it would take some work to make a proper API and perhaps isn't widely used. Yes you can just use the same code in your project and/or access it directly from Java or with a shim class you put in the same Spark package, if you really wanted to. (Reflection too, but a bit messier) > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and kerberos. > It's been historically marked as "DeveloperApi". But in reality it's not very > useful for others, and changes a lot to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
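A hedged sketch of the shim-class workaround mentioned above: a small forwarder compiled into Spark's package namespace can see `private[spark]` members. The forwarded method (`newConfiguration`) is used here for illustration and should be checked against the Spark version actually on the classpath:
{code:scala}
// Lives in your own project, but is declared under Spark's package so private[spark] is visible.
package org.apache.spark.deploy

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

object SparkHadoopUtilShim {
  // Forwards to the now-internal utility; assumes newConfiguration(SparkConf) exists with this shape.
  def hadoopConf(conf: SparkConf): Configuration =
    SparkHadoopUtil.get.newConfiguration(conf)
}
{code}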
[jira] [Commented] (SPARK-26043) Make SparkHadoopUtil private to Spark
[ https://issues.apache.org/jira/browse/SPARK-26043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220748#comment-17220748 ] Wenchen Fan commented on SPARK-26043: - A quick way is to copy-paste the code to your repo so that it compiles, or use java to write a proxy, as `private[spark]` doesn't work for java. Seems like this util is still useful. [~srowen] shall we consider making it semi-public like the catalyst rules? We don't document it, and don't guarantee compatibility, but people can access it freely, and take risks on their own. > Make SparkHadoopUtil private to Spark > - > > Key: SPARK-26043 > URL: https://issues.apache.org/jira/browse/SPARK-26043 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Marcelo Masiero Vanzin >Assignee: Sean R. Owen >Priority: Minor > Fix For: 3.0.0 > > > This API contains a few small helper methods used internally by Spark, mostly > related to Hadoop configs and kerberos. > It's been historically marked as "DeveloperApi". But in reality it's not very > useful for others, and changes a lot to be considered a stable API. Better to > just make it private to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal
[ https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-33233: -- Description: Currently Spark supports GROUP BY ordinal, but CUBE/ROLLUP/GROUPING SETS do not. This PR makes CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal. > CUBE/ROLLUP can't support UnresolvedOrdinal > --- > > Key: SPARK-33233 > URL: https://issues.apache.org/jira/browse/SPARK-33233 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Currently Spark supports GROUP BY ordinal, but CUBE/ROLLUP/GROUPING SETS do not. This PR makes CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
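A hedged sketch of the gap described here, against a made-up temp view; per this ticket, ordinals resolve for a plain GROUP BY but not (before this change) for CUBE/ROLLUP/GROUPING SETS:
{code:scala}
// Made-up data, just to show the two query shapes.
spark.range(12).selectExpr("id % 2 AS a", "id % 3 AS b").createOrReplaceTempView("t")

// Ordinals resolve for a plain GROUP BY.
spark.sql("SELECT a, b, count(*) FROM t GROUP BY 1, 2").show()

// Before this change the same ordinals are not resolved for CUBE/ROLLUP,
// so the grouping columns have to be spelled out by name.
spark.sql("SELECT a, b, count(*) FROM t GROUP BY a, b WITH CUBE").show()
{code}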
[jira] [Created] (SPARK-33245) Add built-in UDF - GETBIT
Yuming Wang created SPARK-33245: --- Summary: Add built-in UDF - GETBIT Key: SPARK-33245 URL: https://issues.apache.org/jira/browse/SPARK-33245 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang Teradata, Impala, Snowflake and Yellowbrick support this function: https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/PK1oV1b2jqvG~ohRnOro9w https://docs.cloudera.com/runtime/7.2.0/impala-sql-reference/topics/impala-bit-functions.html#bit_functions__getbit https://docs.snowflake.com/en/sql-reference/functions/getbit.html https://www.yellowbrick.com/docs/2.2/ybd_sqlref/getbit.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
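For context, GETBIT(target, position) in those systems returns the value of the bit at `position`, with 0 as the least significant bit. A plain-Scala illustration of the semantics, not the proposed Spark implementation:
{code:scala}
// 11 = 0b1011, so bits 0, 1 and 3 are set and bit 2 is not.
def getbit(target: Long, position: Int): Int = ((target >> position) & 1L).toInt

getbit(11L, 0)  // 1
getbit(11L, 2)  // 0
getbit(11L, 3)  // 1
{code}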
[jira] [Updated] (SPARK-33183) Bug in optimizer rule EliminateSorts
[ https://issues.apache.org/jira/browse/SPARK-33183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33183: - Affects Version/s: (was: 3.0.1) (was: 3.0.0) 3.1.0 3.0.2 2.4.8 > Bug in optimizer rule EliminateSorts > > > Key: SPARK-33183 > URL: https://issues.apache.org/jira/browse/SPARK-33183 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.2, 3.1.0 >Reporter: Allison Wang >Priority: Major > > Currently, the rule {{EliminateSorts}} removes a global sort node if its > child plan already satisfies the required sort order without checking if the > child plan's ordering is local or global. For example, in the following > scenario, the first sort shouldn't be removed because it has a stronger > guarantee than the second sort even if the sort orders are the same for both > sorts. > {code:java} > Sort(orders, global = True, ...) > Sort(orders, global = False, ...){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
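A hedged DataFrame sketch of that scenario (the column name is made up): the outer global sort must be kept, because the child's guarantee is only per-partition:
{code:scala}
val df = spark.range(100).selectExpr("id % 10 AS a")

// Child: local (per-partition) sort. Parent: global sort on the same order.
// The buggy rule drops the outer orderBy because the child already "satisfies"
// the ordering, losing the global-ordering guarantee.
val result = df.sortWithinPartitions("a").orderBy("a")
result.explain()
{code}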
[jira] [Resolved] (SPARK-33204) `Event Timeline` in Spark Job UI sometimes cannot be opened
[ https://issues.apache.org/jira/browse/SPARK-33204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-33204. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30119 [https://github.com/apache/spark/pull/30119] > `Event Timeline` in Spark Job UI sometimes cannot be opened > > > Key: SPARK-33204 > URL: https://issues.apache.org/jira/browse/SPARK-33204 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Assignee: Apache Spark >Priority: Minor > Fix For: 3.1.0 > > Attachments: reproduce.gif > > > The Event Timeline area cannot be expanded when a Spark application has some failed jobs, as shown in the attachment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal
[ https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220648#comment-17220648 ] Takeshi Yamamuro commented on SPARK-33233: -- Please fill the description. > CUBE/ROLLUP can't support UnresolvedOrdinal > --- > > Key: SPARK-33233 > URL: https://issues.apache.org/jira/browse/SPARK-33233 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33233) CUBE/ROLLUP can't support UnresolvedOrdinal
[ https://issues.apache.org/jira/browse/SPARK-33233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33233: - Issue Type: Improvement (was: Bug) > CUBE/ROLLUP can't support UnresolvedOrdinal > --- > > Key: SPARK-33233 > URL: https://issues.apache.org/jira/browse/SPARK-33233 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33223) Expose state information on SS UI
[ https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220637#comment-17220637 ] Apache Spark commented on SPARK-33223: -- User 'gaborgsomogyi' has created a pull request for this issue: https://github.com/apache/spark/pull/30151 > Expose state information on SS UI > - > > Key: SPARK-33223 > URL: https://issues.apache.org/jira/browse/SPARK-33223 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming, Web UI >Affects Versions: 3.0.1 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33223) Expose state information on SS UI
[ https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33223: Assignee: Apache Spark > Expose state information on SS UI > - > > Key: SPARK-33223 > URL: https://issues.apache.org/jira/browse/SPARK-33223 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming, Web UI >Affects Versions: 3.0.1 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33223) Expose state information on SS UI
[ https://issues.apache.org/jira/browse/SPARK-33223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33223: Assignee: (was: Apache Spark) > Expose state information on SS UI > - > > Key: SPARK-33223 > URL: https://issues.apache.org/jira/browse/SPARK-33223 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming, Web UI >Affects Versions: 3.0.1 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33075) Only disable auto bucketed scan for cached query
[ https://issues.apache.org/jira/browse/SPARK-33075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-33075. -- Fix Version/s: 3.1.0 Assignee: Cheng Su Resolution: Fixed Resolved by https://github.com/apache/spark/pull/30138 > Only disable auto bucketed scan for cached query > > > Key: SPARK-33075 > URL: https://issues.apache.org/jira/browse/SPARK-33075 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.1.0 > > > As a follow-up to the discussion in [https://github.com/apache/spark/pull/29804#discussion_r500033528,] auto bucketed scan is disabled by default due to a regression for cached queries. As suggested by [~cloud_fan], we can enable auto bucketed scan globally with special handling for cached queries, similar to adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
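A hedged sketch of what the change enables; the config key below comes from the related auto-bucketed-scan work and should be treated as an assumption, and the table name is made up:
{code:scala}
// With this change the feature can stay enabled globally; cached queries are
// special-cased rather than forcing the flag off for everyone.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")

val cached = spark.table("bucketed_tbl").where("id > 0").cache()  // hypothetical bucketed table
cached.count()
{code}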
[jira] [Commented] (SPARK-32188) API Reference
[ https://issues.apache.org/jira/browse/SPARK-32188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220625#comment-17220625 ] Apache Spark commented on SPARK-32188: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30150 > API Reference > - > > Key: SPARK-32188 > URL: https://issues.apache.org/jira/browse/SPARK-32188 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Example: https://hyukjin-spark.readthedocs.io/en/latest/reference/index.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32188) API Reference
[ https://issues.apache.org/jira/browse/SPARK-32188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220624#comment-17220624 ] Apache Spark commented on SPARK-32188: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30150 > API Reference > - > > Key: SPARK-32188 > URL: https://issues.apache.org/jira/browse/SPARK-32188 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > Example: https://hyukjin-spark.readthedocs.io/en/latest/reference/index.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33243) Add numpydoc into documentation dependency
[ https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33243: Assignee: Apache Spark > Add numpydoc into documentation dependency > -- > > Key: SPARK-33243 > URL: https://issues.apache.org/jira/browse/SPARK-33243 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > To switch the docstring formats, we should add numpydoc package into Sphinx. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33243) Add numpydoc into documentation dependency
[ https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33243: Assignee: (was: Apache Spark) > Add numpydoc into documentation dependency > -- > > Key: SPARK-33243 > URL: https://issues.apache.org/jira/browse/SPARK-33243 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > To switch the docstring formats, we should add numpydoc package into Sphinx. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33243) Add numpydoc into documentation dependency
[ https://issues.apache.org/jira/browse/SPARK-33243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220617#comment-17220617 ] Apache Spark commented on SPARK-33243: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30149 > Add numpydoc into documentation dependency > -- > > Key: SPARK-33243 > URL: https://issues.apache.org/jira/browse/SPARK-33243 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > To switch the docstring formats, we should add numpydoc package into Sphinx. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33244) Unify the code paths for spark.table and spark.read.table
[ https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33244: Assignee: (was: Apache Spark) > Unify the code paths for spark.table and spark.read.table > - > > Key: SPARK-33244 > URL: https://issues.apache.org/jira/browse/SPARK-33244 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > The code paths of `spark.table` and `spark.read.table` should be the same. > This behavior was broken in SPARK-32592, since we need to respect options in the `spark.read.table` API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33244) Unify the code paths for spark.table and spark.read.table
[ https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33244: Assignee: Apache Spark > Unify the code paths for spark.table and spark.read.table > - > > Key: SPARK-33244 > URL: https://issues.apache.org/jira/browse/SPARK-33244 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > The code paths of `spark.table` and `spark.read.table` should be the same. > This behavior was broken in SPARK-32592, since we need to respect options in the `spark.read.table` API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33244) Unify the code paths for spark.table and spark.read.table
[ https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17220615#comment-17220615 ] Apache Spark commented on SPARK-33244: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/30148 > Unify the code paths for spark.table and spark.read.table > - > > Key: SPARK-33244 > URL: https://issues.apache.org/jira/browse/SPARK-33244 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > The code paths of `spark.table` and `spark.read.table` should be the same. > This behavior was broken in SPARK-32592, since we need to respect options in the `spark.read.table` API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33244) Unify the code paths for spark.table and spark.read.table
Yuanjian Li created SPARK-33244: --- Summary: Unify the code paths for spark.table and spark.read.table Key: SPARK-33244 URL: https://issues.apache.org/jira/browse/SPARK-33244 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Yuanjian Li The code paths of `spark.table` and `spark.read.table` should be the same. This behavior was broken in SPARK-32592, since we need to respect options in the `spark.read.table` API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
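For reference, the two entry points this ticket wants to route through one code path; the table and option names below are placeholders:
{code:scala}
// Both should resolve the same relation; since SPARK-32592, the reader form
// must also carry its options through to the scan.
val t1 = spark.table("some_table")
val t2 = spark.read.option("someOption", "x").table("some_table")
{code}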