[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943118#comment-16943118 ] Mihaly Toth commented on SPARK-29078: - But if the user has access to that directory (which is the hive warehouse directory), they can see what databases are there, regardless of whether they have access to those databases. This is not the worst security gap, so if we believe this is acceptable I don't mind closing this jira. > Spark shell fails if read permission is not granted to hive warehouse > directory > --- > > Key: SPARK-29078 > URL: https://issues.apache.org/jira/browse/SPARK-29078 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mihaly Toth >Priority: Major > > Similarly to SPARK-20256, in {{SharedState}}, when {{GlobalTempViewManager}} > is created, it is verified that no database exists with the same name as the > global temp database (the name is configurable with > {{spark.sql.globalTempDatabase}}), because that is a special database which > should not exist in the metastore. At the moment this check requires read > permission on the warehouse directory, which on the other hand would allow > listing the databases of all users. > When such read access is not granted for security reasons, an access > violation exception should be ignored during this initial validation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
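The suggested handling can be sketched outside of Spark like this (the helper name and the generic check function are hypothetical; in Spark the real call is {{externalCatalog.databaseExists}} in {{SharedState}}, and the access failure arrives wrapped by Hive rather than as a bare exception):

{code:scala}
import java.security.AccessControlException

// Hypothetical helper: during the one-time global-temp-database validation,
// treat "permission denied on the warehouse directory" as "no conflicting
// database is visible" instead of failing the whole session. Any other
// failure still propagates.
def existsIgnoringAcl(check: () => Boolean): Boolean =
  try check()
  catch {
    // Cannot list databases without READ on the warehouse dir;
    // optimistically assume no clash with the global temp database.
    case _: AccessControlException => false
  }

// Stand-in for externalCatalog.databaseExists when the user lacks
// READ permission on the warehouse directory.
val databaseExists: String => Boolean =
  _ => throw new AccessControlException("Permission denied: user=user1, access=READ")

val clash = existsIgnoringAcl(() => databaseExists("global_temp"))
// clash is false: validation is skipped rather than crashing the shell
{code}

Whether silently treating the database as absent is acceptable is exactly the security trade-off discussed in the comment above.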
[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1692#comment-1692 ] Mihaly Toth commented on SPARK-29078: - I get AccessControlException from Hive pointing to {code:scala} externalCatalog.databaseExists(globalTempDB) {code} in {{SharedState}}: The codebase is a modification of 2.3.0. Please find the stack trace here: {noformat} hiveContext.sql("select * from db.t") org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.security.AccessControlException: Permission denied: user=user1, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx-- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:252) at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkDefaultEnforcer(RangerHdfsAuthorizer.java:427) at org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:303) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1950) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1934) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1908) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8800) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:2089) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1466) at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347) ); at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194) at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114) at org.apache.spark.sql.internal.SharedState.externalCatalog(Shar
[jira] [Created] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory
Mihaly Toth created SPARK-29078: --- Summary: Spark shell fails if read permission is not granted to hive warehouse directory Key: SPARK-29078 URL: https://issues.apache.org/jira/browse/SPARK-29078 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Mihaly Toth Similarly to SPARK-20256, in {{SharedState}}, when {{GlobalTempViewManager}} is created, it is verified that no database exists with the same name as the global temp database (the name is configurable with {{spark.sql.globalTempDatabase}}), because that is a special database which should not exist in the metastore. At the moment this check requires read permission on the warehouse directory, which on the other hand would allow listing the databases of all users. When such read access is not granted for security reasons, an access violation exception should be ignored during this initial validation.
[jira] [Created] (SPARK-27704) Change default garbage collector to ParallelGC
Mihaly Toth created SPARK-27704: --- Summary: Change default garbage collector to ParallelGC Key: SPARK-27704 URL: https://issues.apache.org/jira/browse/SPARK-27704 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.0.0 Reporter: Mihaly Toth In JDK 11 the default garbage collector changed from ParallelGC to G1GC. Even though this GC performs better on pause times and interactivity, most of the tasks that need to be processed are more sensitive to throughput and to the amount of memory. G1 sacrifices these to some extent to avoid the big pauses. As a result the user may perceive a regression compared to JDK 8. Even worse, the regression may not be limited to performance: some jobs may start failing in case they do not fit into the memory they used to be happy with on the previous JDK. Some other kinds of apps, like streaming ones, may rather use G1 because of their more interactive, more realtime needs. With this jira it is proposed to have a configurable default GC for all spark applications, overridable by the user through command line parameters. The default value of the default GC (in case it is not provided in spark-defaults.conf) could be ParallelGC. I do not see this change as required, but I think it would benefit the user experience.
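The proposed behavior could look roughly like this (a sketch with made-up names, not Spark's configuration API): prepend a default GC flag to the JVM options unless the user has already picked a collector in their extra Java options.

{code:scala}
// JVM flags that select a garbage collector; if the user passed any of
// these, their choice wins over the proposed default.
val gcFlags = Set(
  "-XX:+UseParallelGC", "-XX:+UseG1GC", "-XX:+UseZGC",
  "-XX:+UseConcMarkSweepGC", "-XX:+UseSerialGC")

// Hypothetical helper: apply ParallelGC as the default unless the user
// options already contain an explicit GC selection.
def withDefaultGc(userOpts: Seq[String],
    default: String = "-XX:+UseParallelGC"): Seq[String] =
  if (userOpts.exists(gcFlags.contains)) userOpts // user chose a GC, respect it
  else default +: userOpts                        // otherwise prepend the default

val defaulted = withDefaultGc(Seq("-Xmx4g"))
val explicit = withDefaultGc(Seq("-Xmx4g", "-XX:+UseG1GC"))
{code}

In `defaulted` the ParallelGC flag is prepended; `explicit` is returned unchanged because the user already selected G1.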
[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833784#comment-16833784 ] Mihaly Toth commented on SPARK-26839: - Hmm, sorry, I overlooked something. I only have NucleusException all over the test run. So I guess that needs to be resolved first. As I understood, HIVE-17632 (especially the Datanucleus upgrade) is a dependency here. > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-26839 > URL: https://issues.apache.org/jira/browse/SPARK-26839 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql, e.g. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}.
I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.)
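The quoted suggestion of letting the sharing check accept any class starting with "java." can be sketched as a standalone predicate (simplified and hypothetical; the real logic in {{IsolatedClientLoader}} also shares other prefixes and configured patterns):

{code:scala}
// Sketch of the relaxed sharing rule: every JDK "java." class is delegated
// to the parent loader, not just java.lang and java.net, so classes like
// java.sql.SQLTransientException resolve inside the isolated Hive client.
def isSharedClass(name: String): Boolean =
  name.startsWith("java.") ||   // proposed relaxation: all JDK core packages
    name.startsWith("scala.") ||
    name.startsWith("org.slf4j")

val sqlIsShared = isSharedClass("java.sql.SQLTransientException")
// sqlIsShared is true under the relaxed rule, while Hive's own classes
// (e.g. org.apache.hadoop.hive.*) still load in the isolated loader
{code}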
[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833763#comment-16833763 ] Mihaly Toth commented on SPARK-26839: - [~srowen], I was facing CNFE and I have a potential fix for it on my fork. When I reproduced it on master, the CNFE goes away with the change but the {{NucleusException: The java type java.lang.Long ... cant be mapped for this datastore.}} stays. The problem I saw is that in some cases {{HiveUtils}} assembles a jar list comprising only the application jar, and this same jar list is considered by {{IsolatedClientLoader}} as the source of the hive classes. Shall I submit my change as a PR directly here? I am not fully sure it matches the scope of this issue. Regarding Datanucleus it may deserve a new subtask in SPARK-24417.
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713030#comment-16713030 ] Mihaly Toth commented on SPARK-25331: - I have closed my PR. I guess it should be documented that we expect the user to read only files whose names are written to the manifest files. > Structured Streaming File Sink duplicates records in case of driver failure > --- > > Key: SPARK-25331 > URL: https://issues.apache.org/jira/browse/SPARK-25331 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mihaly Toth >Priority: Major > > Lets assume {{FileStreamSink.addBatch}} is called and an appropriate job has > been started by {{FileFormatWriter.write}} and then the resulting task sets > are completed but in the meantime the driver dies. In such a case repeating > {{FileStreamSink.addBatch}} will result in duplicate writing of the data. > In the event the driver fails after the executors start processing the job > the processed batch will be written twice. > Steps needed: > # call {{FileStreamSink.addBatch}} > # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} > # call {{FileStreamSink.addBatch}} with the same data > # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} > successfully > # Verify file output - according to {{Sink.addBatch}} documentation the rdd > should be written only once > I have created a wip PR with a unit test: > https://github.com/apache/spark/pull/22331
[jira] [Commented] (SPARK-21548) Support insert into serial columns of table
[ https://issues.apache.org/jira/browse/SPARK-21548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625583#comment-16625583 ] Mihaly Toth commented on SPARK-21548: - You may want to look at the PR on this very similar jira: SPARK-20845 > Support insert into serial columns of table > --- > > Key: SPARK-21548 > URL: https://issues.apache.org/jira/browse/SPARK-21548 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: LvDongrong >Priority: Major > > When we use the 'insert into ...' statement we can only insert all of the > columns into the table. But in some cases our table has many columns and we > are only interested in some of them. So we want to support the statement > "insert into table tbl (column1, column2, ...) values (value1, value2, > value3, ...)".
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620454#comment-16620454 ] Mihaly Toth commented on SPARK-25331: - I have updated the PR with a potential solution and removed the WIP flag from it.
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611244#comment-16611244 ] Mihaly Toth commented on SPARK-25331: - I will try to make it idempotent then.
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608933#comment-16608933 ] Mihaly Toth commented on SPARK-25331: - I was thinking about how to make FileStreamSink idempotent even in failure cases using the first approach (deterministic file names). As a starting point we need to put the partition id (and bucket id if it exists) into the file name and remove the UUID from it. If the same batch is rewritten and the file already exists we can simply skip writing it again, assuming the same partition of the same batch will generate the same data again. There are a few special cases though: If the file is half written when the writing executor stops, there will be missing records at the end of the file. We can eliminate this by first writing the data into a temp file and then moving it to its intended location. The other problematic area is around the {{maxRecordsPerFile}} limit. With it, the same batch+partition pair may generate multiple files, and it may happen that some of these files were created while some are missing. Generating only the missing files works only if the ordering of the items is exactly the same in each run, which may or may not be true. If the order differs between the two runs and we simply skip generating the files that already exist, there may be missing or duplicated items in the resulting files. We could subtract the records in the already existing files from the input RDD. I feel that would make the writing logic quite complex and would put unexpected computational load on the executor, but it would work in all cases. Another solution for partially generated file sets would be to read and generate the file at the same time, comparing the records one by one. If the already existing file is the same as the file to be generated we can skip creating the file. If it is different we can create a file with some mark in its name like "-v2". With this the receiver can achieve exactly-once semantics in the following ways: # Do not limit the maximum records per file # Limit the number of records but apply strict ordering on the resulting rdd # Limit the number of records without applying strict ordering but compensate for files that have newer versions appearing in the output directory [~rxin], [~Gengliang.Wang] what is your opinion on this?
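The deterministic-name idea discussed above can be illustrated with a small sketch (the naming scheme is made up for illustration and is not Spark's actual output file format): deriving the name purely from batch id, partition id and a per-partition file counter makes a re-run of the same batch target the same paths, so an already existing file can be detected and skipped.

{code:scala}
// Hypothetical deterministic naming: no UUID, so re-running a batch
// produces the same file name for the same (batch, partition, counter).
def deterministicFileName(batchId: Long, partitionId: Int,
    fileCounter: Int, ext: String = ".parquet"): String =
  f"part-$batchId%05d-$partitionId%05d-$fileCounter%03d$ext"

val firstRun = deterministicFileName(batchId = 7, partitionId = 3, fileCounter = 0)
val rerun = deterministicFileName(batchId = 7, partitionId = 3, fileCounter = 0)
// firstRun == rerun, so the writer can check for the existing file and
// skip (or atomically replace) it instead of duplicating the data
{code}

The caveats from the comment still apply: with {{maxRecordsPerFile}} the counter only lines up across runs if the record ordering is stable.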
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603143#comment-16603143 ] Mihaly Toth commented on SPARK-25331: - After looking into how this could be solved, there are a few potential ways I could think of: # Make the resulting file names deterministic based on the input. Currently the name contains a UUID, which is by nature different in each run. The questions here are whether partitioning of the data can always be done the same way, and what else was the motivation for adding a UUID to the name. # Create a "write ahead manifest file" which contains the generated file names. This could be used in {{ManifestFileCommitProtocol.setupJob}}, which is currently a noop. We may need to store some additional data, like partitioning, in order to generate the same file contents again. # Document and mandate the use of the manifest file for the consumer of the file output. Currently this file is not mentioned in the docs. Even if it were documented, that would make the life of the consumer more difficult, not to mention that it would be somewhat counter-intuitive. Before rushing into the implementation it would make sense to discuss the direction. I would pick the first if that is possible.
[jira] [Updated] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihaly Toth updated SPARK-25331: Description: Lets assume {{FileStreamSink.addBatch}} is called and an appropriate job has been started by {{FileFormatWriter.write}} and then the resulting task sets are completed but in the meantime the driver dies. In such a case repeating {{FileStreamSink.addBatch}} will result in duplicate writing of the data. In the event the driver fails after the executors start processing the job the processed batch will be written twice. Steps needed: # call {{FileStreamSink.addBatch}} # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} # call {{FileStreamSink.addBatch}} with the same data # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully # Verify file output - according to {{Sink.addBatch}} documentation the rdd should be written only once I have created a wip PR with a unit test: https://github.com/apache/spark/pull/22331 was: Lets assume {{FileStreamSink.addBatch}} is called and an appropriate job has been started by {{FileFormatWriter.write}} and then the resulting task sets are completed but in the meantime the driver dies. In such a case repeating {{FileStreamSink.addBatch}} will result in duplicate writing of the data. In the event the driver fails after the executors start processing the job the processed batch will be written twice. Steps needed: 1. call {{FileStreamSink.addBatch}} 2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} 3. call {{FileStreamSink.addBatch}} with the same data 4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully 5. Verify file output - according to {{Sink.addBatch}} documentation the rdd should be written only once I have created a wip PR with a unit test: https://github.com/apache/spark/pull/22331
[jira] [Created] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
Mihaly Toth created SPARK-25331: --- Summary: Structured Streaming File Sink duplicates records in case of driver failure Key: SPARK-25331 URL: https://issues.apache.org/jira/browse/SPARK-25331 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Mihaly Toth Lets assume {{FileStreamSink.addBatch}} is called and an appropriate job has been started by {{FileFormatWriter.write}} and then the resulting task sets are completed but in the meantime the driver dies. In such a case repeating {{FileStreamSink.addBatch}} will result in duplicate writing of the data. In the event the driver fails after the executors start processing the job the processed batch will be written twice. Steps needed: 1. call {{FileStreamSink.addBatch}} 2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} 3. call {{FileStreamSink.addBatch}} with the same data 4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully 5. Verify file output - according to {{Sink.addBatch}} documentation the rdd should be written only once I have created a wip PR with a unit test: https://github.com/apache/spark/pull/22331
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513897#comment-16513897 ] Mihaly Toth commented on SPARK-22918: - I managed to reproduce the problem in a unit test. When using a security manager (with derby) one needs to apply a security policy using {{Policy.setPolicy()}}. In its {{.getPermissions.implies}} one is tempted to use {{new SystemPermission("engine", "usederbyinternals")}}. This works fine, but when you run a spark session it is seemingly ignored. This is caused by {{IsolatedClientLoader}}: {{SystemPermission}} does not work across class loaders, meaning the permission being checked must be loaded by the same class loader as the one defined in the Policy. Otherwise their classes will not be equal and the check gets rejected. One solution is to use another permission in the policy that only compares names and class names, wrapping the original {{SystemPermission}} (here {{delegate}}) like:
{code:scala}
// `delegate` is the original new SystemPermission("engine", "usederbyinternals").
// reflectionHashCode/reflectionEquals come from
// org.apache.commons.lang3.builder.{HashCodeBuilder, EqualsBuilder}.
new Permission(delegate.getName) {
  override def getActions: String = delegate.getActions

  // Compare by class name and permission name only, so the check also
  // succeeds when the checked permission was loaded by another class loader.
  override def implies(permission: Permission): Boolean =
    delegate.getClass.getCanonicalName == permission.getClass.getCanonicalName &&
      delegate.getName == permission.getName

  override def hashCode(): Int = reflectionHashCode(this)
  override def equals(obj: scala.Any): Boolean = reflectionEquals(this, obj)
}
{code}
At least this one worked for me. It also works with {{new AllPermission()}} in case one is not really into using fine grained access control.
> sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot >Priority: Major > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > 
org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.c
[jira] [Commented] (SPARK-20845) Support specification of column names in INSERT INTO
[ https://issues.apache.org/jira/browse/SPARK-20845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507032#comment-16507032 ] Mihaly Toth commented on SPARK-20845: - I am working on this. I will post a _work in progress_ PR shortly. > Support specification of column names in INSERT INTO > > > Key: SPARK-20845 > URL: https://issues.apache.org/jira/browse/SPARK-20845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Minor > > Some databases allow you to specify column names when specifying the target > of an INSERT INTO. For example, in SQLite: > {code} > sqlite> CREATE TABLE twocolumn (x INT, y INT); INSERT INTO twocolumn(x, y) > VALUES (44,51), (NULL,52), (42,53), (45,45) >...> ; > sqlite> select * from twocolumn; > 44|51 > |52 > 42|53 > 45|45 > {code} > I have a corpus of existing queries of this form which I would like to run on > Spark SQL, so I think we should extend our dialect to support this syntax. > When implementing this, we should make sure to test the following behaviors > and corner-cases: > - Number of columns specified is greater than or less than the number of > columns in the table. > - Specification of repeated columns. > - Specification of columns which do not exist in the target table. > - Permute column order instead of using the default order in the table. > For each of these, we should check how SQLite behaves and should also compare > against another database. It looks like T-SQL supports this; see > https://technet.microsoft.com/en-us/library/dd776381(v=sql.105).aspx under > the "Inserting data that is not in the same order as the table columns" > header. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",
[ https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472298#comment-16472298 ] Mihaly Toth commented on SPARK-22918: - Yep, probably anybody who introduces a SecurityManager needs to grant the above permission. I am just wondering why such implementation details propagate up through the architecture. Is there any added level of security in addition to the file system level security below derby? Shouldn't there at least be a configuration option to disable such security checks? > sbt test (spark - local) fail after upgrading to 2.2.1 with: > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > > > Key: SPARK-22918 > URL: https://issues.apache.org/jira/browse/SPARK-22918 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Damian Momot >Priority: Major > > After upgrading 2.2.0 -> 2.2.1 sbt test command in one of my projects started > to fail with following exception: > {noformat} > java.security.AccessControlException: access denied > org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" ) > at > java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) > at > java.security.AccessController.checkPermission(AccessController.java:884) > at > org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown > Source) > at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown > Source) > at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source) > at java.security.AccessController.doPrivileged(Native Method) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source) > at org.apache.derby.jdbc.EmbeddedDriver.<init>(Unknown Source) > at 
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at java.lang.Class.newInstance(Class.java:442) > at > org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47) > at > org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131) > at > org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325) > at > org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282) > at > org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187
[jira] [Resolved] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one
[ https://issues.apache.org/jira/browse/SPARK-23465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihaly Toth resolved SPARK-23465. - Resolution: Won't Fix Based on the PR feedback I would conclude that this functionality is not much needed. > Dataset.withAllColumnsRenamed should map all column names to a new one > -- > > Key: SPARK-23465 > URL: https://issues.apache.org/jira/browse/SPARK-23465 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Mihaly Toth >Priority: Minor > > Currently one can rename columns only one by one, using the > {{withColumnRenamed()}} function. When one would like to rename all or most > of the columns, it would be easier to specify an algorithm for mapping from > the old name to the new one (like prefixing) than to iterate over all the fields. > An example usage is joining to a Dataset with the same or a similar schema > (a special case is self joining) where the names are the same or overlapping. > Such a joined Dataset would fail at {{saveAsTable()}}. > With the new function, usage would be as easy as: > {code:java} > ds.withAllColumnsRenamed("prefix" + _) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23729) Glob resolution breaks remote naming of files/archives
[ https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403876#comment-16403876 ] Mihaly Toth commented on SPARK-23729: - Already working on this. Will submit a PR shortly. > Glob resolution breaks remote naming of files/archives > -- > > Key: SPARK-23729 > URL: https://issues.apache.org/jira/browse/SPARK-23729 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.3.0 >Reporter: Mihaly Toth >Priority: Major > > When one uses {{spark-submit}} with either the {{\-\-archives}} or the > {{\-\-files}} parameter and the file name contains glob patterns, the rename > part ({{...#nameAs}}) of the file name is silently ignored. > Thinking over the resolution cases: if the resolution results in multiple > files, it does not make sense to send all of them under the same remote name, > so this should result in an error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23729) Glob resolution breaks remote naming of files/archives
Mihaly Toth created SPARK-23729: --- Summary: Glob resolution breaks remote naming of files/archives Key: SPARK-23729 URL: https://issues.apache.org/jira/browse/SPARK-23729 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Reporter: Mihaly Toth When one uses {{spark-submit}} with either the {{\-\-archives}} or the {{\-\-files}} parameter and the file name contains glob patterns, the rename part ({{...#nameAs}}) of the file name is silently ignored. Thinking over the resolution cases: if the resolution results in multiple files, it does not make sense to send all of them under the same remote name, so this should result in an error. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
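The proposed behavior can be sketched as a resolution-time check. Everything here is hypothetical (the method name, the way the rename fragment is represented); the real change lives in the linked PR:

```scala
// Hypothetical sketch: after glob expansion, a '#nameAs' rename fragment only
// makes sense if the glob resolved to exactly one file.
def validateRename(resolved: Seq[String], nameAs: Option[String]): Unit =
  if (nameAs.isDefined && resolved.size > 1)
    throw new IllegalArgumentException(
      s"Cannot upload ${resolved.size} glob-resolved files under the single remote name ${nameAs.get}")
```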
[jira] [Commented] (SPARK-20845) Support specification of column names in INSERT INTO
[ https://issues.apache.org/jira/browse/SPARK-20845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370056#comment-16370056 ] Mihaly Toth commented on SPARK-20845: - As shown in the links this duplicates SPARK-21548. However, the wording of this issue sounds more descriptive to me. On the other hand, the other one has a PR linked to it. To reduce duplication I would propose closing SPARK-21548 and adding the prior pull request https://github.com/apache/spark/pull/18756 to the links section of this issue. > Support specification of column names in INSERT INTO > > > Key: SPARK-20845 > URL: https://issues.apache.org/jira/browse/SPARK-20845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Minor > > Some databases allow you to specify column names when specifying the target > of an INSERT INTO. For example, in SQLite: > {code} > sqlite> CREATE TABLE twocolumn (x INT, y INT); INSERT INTO twocolumn(x, y) > VALUES (44,51), (NULL,52), (42,53), (45,45) >...> ; > sqlite> select * from twocolumn; > 44|51 > |52 > 42|53 > 45|45 > {code} > I have a corpus of existing queries of this form which I would like to run on > Spark SQL, so I think we should extend our dialect to support this syntax. > When implementing this, we should make sure to test the following behaviors > and corner-cases: > - Number of columns specified is greater than or less than the number of > columns in the table. > - Specification of repeated columns. > - Specification of columns which do not exist in the target table. > - Permute column order instead of using the default order in the table. > For each of these, we should check how SQLite behaves and should also compare > against another database. It looks like T-SQL supports this; see > https://technet.microsoft.com/en-us/library/dd776381(v=sql.105).aspx under > the "Inserting data that is not in the same order as the table columns" > header. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
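Until such syntax is supported, the positional mapping a column list implies can be expressed with the public DataFrame API. A hedged workaround sketch, reusing the table and column names from the SQLite example above (this is not the proposed feature itself):

```scala
// DataFrameWriter.insertInto matches columns by position, not by name, so an
// explicit select reorders the projection to the table's declared (x, y) order.
val data = Seq((51, 44), (52, 42)).toDF("y", "x")   // values arrive in (y, x) order
data.select("x", "y").write.insertInto("twocolumn")
```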
[jira] [Commented] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one
[ https://issues.apache.org/jira/browse/SPARK-23465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369275#comment-16369275 ] Mihaly Toth commented on SPARK-23465: - I have started working on this. > Dataset.withAllColumnsRenamed should map all column names to a new one > -- > > Key: SPARK-23465 > URL: https://issues.apache.org/jira/browse/SPARK-23465 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Mihaly Toth >Priority: Minor > > Currently one can rename columns only one by one, using the > {{withColumnRenamed()}} function. When one would like to rename all or most > of the columns, it would be easier to specify an algorithm for mapping from > the old name to the new one (like prefixing) than to iterate over all the fields. > An example usage is joining to a Dataset with the same or a similar schema > (a special case is self joining) where the names are the same or overlapping. > Such a joined Dataset would fail at {{saveAsTable()}}. > With the new function, usage would be as easy as: > {code:java} > ds.withAllColumnsRenamed("prefix" + _) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one
Mihaly Toth created SPARK-23465: --- Summary: Dataset.withAllColumnsRenamed should map all column names to a new one Key: SPARK-23465 URL: https://issues.apache.org/jira/browse/SPARK-23465 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.1 Reporter: Mihaly Toth Currently one can rename columns only one by one, using the {{withColumnRenamed()}} function. When one would like to rename all or most of the columns, it would be easier to specify an algorithm for mapping from the old name to the new one (like prefixing) than to iterate over all the fields. An example usage is joining to a Dataset with the same or a similar schema (a special case is self joining) where the names are the same or overlapping. Such a joined Dataset would fail at {{saveAsTable()}}. With the new function, usage would be as easy as: {code:java} ds.withAllColumnsRenamed("prefix" + _) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
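What the proposed method could reduce to with today's public API can be sketched as follows. The helper mirrors the proposal but does not exist in Spark, and the {{l_id}}/{{r_id}} join columns are illustrative only:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Rename every column through a mapping function, one withColumnRenamed at a time.
def withAllColumnsRenamed(ds: DataFrame)(rename: String => String): DataFrame =
  ds.columns.foldLeft(ds)((acc, name) => acc.withColumnRenamed(name, rename(name)))

// e.g. disambiguating both sides of a self join before saveAsTable:
val left  = withAllColumnsRenamed(ds)("l_" + _)
val right = withAllColumnsRenamed(ds)("r_" + _)
left.join(right, col("l_id") === col("r_id")).write.saveAsTable("joined")
```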
[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355155#comment-16355155 ] Mihaly Toth commented on SPARK-23329: - Nice. I like that the redundant description part is simply omitted. The only issue I see is that {{e}} is actually not an angle; we could at least put it in the plural, like: {code} /** * @param e angles in radians * @return sines of the angles, as if computed by [[java.lang.Math.sin]] * ... */ {code} I was also thinking that mentioning it is actually a {{Column}} would not inflate the lines very much, while making them more precise: {code} /** * @param e [[Column]] of angles in radians * @return [[Column]] of sines of the angles, as if computed by [[java.lang.Math.sin]] * ... */ {code} Now, looking at this, the first one seems better because it tells the truth, and one can easily figure out that the angles are stored in a {{Column}}. > Update the function descriptions with the arguments and returned values of > the trigonometric functions > -- > > Key: SPARK-23329 > URL: https://issues.apache.org/jira/browse/SPARK-23329 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Minor > Labels: starter > > We need an update on the function descriptions for all the trigonometric > functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the > implementation is based on java.lang.Math. We need a clear description > about the units of the input arguments and the returned values. > For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555 > https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354166#comment-16354166 ] Mihaly Toth commented on SPARK-23329: - How about this approach? {code:scala} /** * Computes the sine of the given column. Works the same as [[java.lang.Math.sin]]. * * @param e Column of angles, in radians. * @return new Column comprising the sine value of each `e` element. * * @group math_funcs * @since 1.4.0 */ def sin(e: Column): Column = withExpr { Sin(e.expr) } {code} I am in a bit of trouble with the wording. The original doc stated {{Computes the sine of the given}} *{{value}}*, which is not really the case. Even _calculating the sine of a Column_ is not 100% precise, but I guess it is not misleading given the context and is condensed enough on the other hand. The unit of measurement can possibly be moved to the param description, I believe. Another question is that the majority of the javadocs in {{functions.scala}} are lacking return-value and parameter descriptions. Does this Jira aim to fix all of them (there are 334 ' def ' expressions in the file), just the math_funcs group, or only the trigonometric ones, as the title suggests? > Update the function descriptions with the arguments and returned values of > the trigonometric functions > -- > > Key: SPARK-23329 > URL: https://issues.apache.org/jira/browse/SPARK-23329 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Minor > Labels: starter > > We need an update on the function descriptions for all the trigonometric > functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the > implementation is based on java.lang.Math. We need a clear description > about the units of the input arguments and the returned values. > For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555 > https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org