[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-10-02 Thread Mihaly Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943118#comment-16943118
 ] 

Mihaly Toth commented on SPARK-29078:
-

But if the user has read access to that directory (which is the hive warehouse 
directory), they can see which databases exist there regardless of whether they 
have access to those databases. This is not the worst security gap, so if we 
believe it is acceptable I don't mind closing this jira.

> Spark shell fails if read permission is not granted to hive warehouse 
> directory
> ---
>
> Key: SPARK-29078
> URL: https://issues.apache.org/jira/browse/SPARK-29078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mihaly Toth
>Priority: Major
>
> Similarly to SPARK-20256, in {{SharedState}}, when {{GlobalTempViewManager}} 
> is created, it is checked that no database exists with the same name as the 
> global temp database (the name is configurable with 
> {{spark.sql.globalTempDatabase}}), because that is a special database which 
> should not exist in the metastore. At the moment this check requires read 
> permission on the warehouse directory, which would also allow listing all the 
> databases of all users.
> When such read access is not granted for security reasons, an access 
> violation exception should be ignored during this initial validation.






[jira] [Commented] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-09-19 Thread Mihaly Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1692#comment-1692
 ] 

Mihaly Toth commented on SPARK-29078:
-

I get AccessControlException from Hive pointing to
{code:scala}
externalCatalog.databaseExists(globalTempDB)
{code}
in {{SharedState}}.

The codebase is a modification of 2.3.0. Please find the stack trace here:
{noformat}
hiveContext.sql("select * from db.t")   
   
org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.security.AccessControlException: Permission denied: 
user=user1, access=READ, inode="/apps/hive/warehouse":hive:hdfs:drwx--  
  
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:353)

at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:252)
  
at 
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkDefaultEnforcer(RangerHdfsAuthorizer.java:427)
   
at 
org.apache.ranger.authorization.hadoop.RangerHdfsAuthorizer$RangerAccessControlEnforcer.checkPermission(RangerHdfsAuthorizer.java:303)

at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
  
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1950)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1934)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:1908)
 
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:8800)
   
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:2089)
 
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1466)

 
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

  
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)

at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)  
 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) 
 
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) 
 
at java.security.AccessController.doPrivileged(Native Method)   
 
at javax.security.auth.Subject.doAs(Subject.java:422)   
 
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
  
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)   
 
);  
 
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
 
  at 
org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:194)
 
  at 
org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:114)
 
  at 
org.apache.spark.sql.internal.SharedState.externalCatalog(Shar

[jira] [Created] (SPARK-29078) Spark shell fails if read permission is not granted to hive warehouse directory

2019-09-13 Thread Mihaly Toth (Jira)
Mihaly Toth created SPARK-29078:
---

 Summary: Spark shell fails if read permission is not granted to 
hive warehouse directory
 Key: SPARK-29078
 URL: https://issues.apache.org/jira/browse/SPARK-29078
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Mihaly Toth


Similarly to SPARK-20256, in {{SharedState}}, when {{GlobalTempViewManager}} is 
created, it is checked that no database exists with the same name as the global 
temp database (the name is configurable with {{spark.sql.globalTempDatabase}}), 
because that is a special database which should not exist in the metastore. At 
the moment this check requires read permission on the warehouse directory, which 
would also allow listing all the databases of all users.

When such read access is not granted for security reasons, an access violation 
exception should be ignored during this initial validation.
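
A rough sketch of the proposed behaviour (not the actual {{SharedState}} code; 
{{isAccessDenied}} is a hypothetical helper that walks the cause chain looking 
for an {{AccessControlException}}):

{code:scala}
import scala.util.control.NonFatal

// Sketch only: if the existence check itself is not permitted, treat the reserved
// database name as unused instead of failing the whole session initialization.
val conflictingDbExists =
  try {
    externalCatalog.databaseExists(globalTempDB)
  } catch {
    case NonFatal(e) if isAccessDenied(e) => false
  }
require(!conflictingDbExists,
  s"$globalTempDB is a reserved database name; please rename the existing database " +
    "or choose a different value for spark.sql.globalTempDatabase")
{code}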






[jira] [Created] (SPARK-27704) Change default garbage collector to ParallelGC

2019-05-14 Thread Mihaly Toth (JIRA)
Mihaly Toth created SPARK-27704:
---

 Summary: Change default garbage collector to ParallelGC
 Key: SPARK-27704
 URL: https://issues.apache.org/jira/browse/SPARK-27704
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.0.0
Reporter: Mihaly Toth


In JDK 11 the default garbage collector changed from ParallelGC to G1GC. Even 
though this GC performs better on pause times and interactivity, most of the 
tasks that need to be processed are more sensitive to throughput and to the 
amount of memory. G1 sacrifices these to some extent in order to avoid the big 
pauses. As a result the user may perceive a regression compared to JDK 8. Even 
worse, the regression may not be limited to performance: some jobs may start 
failing because they no longer fit into the memory they were happy with on the 
previous JDK.

Some other kinds of applications, like streaming ones, may prefer G1 because of 
their more interactive, more real-time needs.

With this jira it is proposed to have a configurable default GC for all Spark 
applications, overridable by the user through command line parameters. The 
default value of the default GC (in case it is not provided in 
spark-defaults.conf) could be ParallelGC.
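
For reference, the same effect can already be requested per installation with 
existing configuration keys, e.g. in spark-defaults.conf (illustrative snippet, 
not the proposed mechanism itself):

{noformat}
spark.driver.extraJavaOptions    -XX:+UseParallelGC
spark.executor.extraJavaOptions  -XX:+UseParallelGC
{noformat}

A streaming job that prefers G1 could then override these with {{\-\-conf}} on 
the spark-submit command line.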

I do not see this change as strictly required, but I think it would benefit the 
user experience.






[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-05-06 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833784#comment-16833784
 ] 

Mihaly Toth commented on SPARK-26839:
-

Hmm, sorry, I overlooked something. I only have NucleusException all over the 
test run, so I guess that needs to be resolved first. As I understand it, 
HIVE-17632 (especially the Datanucleus upgrade) is a dependency here.

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-26839
> URL: https://issues.apache.org/jira/browse/SPARK-26839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm 
> working on, I haven't completely confirmed yet, but I wanted to report it 
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on java11, I 
> immediately get errors from IsolatedClientLoader that it can't load anything 
> in java.sql.  eg.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)






[jira] [Commented] (SPARK-26839) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-05-06 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833763#comment-16833763
 ] 

Mihaly Toth commented on SPARK-26839:
-

[~srowen], I was facing the CNFE and I have a potential fix for it on my fork. 
When I reproduce it on master, the CNFE goes away with the change but the 
{{NucleusException: The java type java.lang.Long ... cant be mapped for this 
datastore.}} stays. The problem I saw is that in some cases {{HiveUtils}} 
assembles a jar list comprising only the application jar, and this same jar 
list is taken by {{IsolatedClientLoader}} as the source of the Hive classes.

Shall I submit my change as a PR directly here? I am not fully sure it matches 
the scope of this issue.

Regarding Datanucleus it may deserve a new subtask in SPARK-24417.

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-26839
> URL: https://issues.apache.org/jira/browse/SPARK-26839
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm 
> working on, I haven't completely confirmed yet, but I wanted to report it 
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on java11, I 
> immediately get errors from IsolatedClientLoader that it can't load anything 
> in java.sql.  eg.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)






[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-12-07 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16713030#comment-16713030
 ] 

Mihaly Toth commented on SPARK-25331:
-

I have closed my PR. I guess it should be documented that we expect users to 
read only the files whose names are recorded in the manifest files.

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331






[jira] [Commented] (SPARK-21548) Support insert into serial columns of table

2018-09-24 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625583#comment-16625583
 ] 

Mihaly Toth commented on SPARK-21548:
-

You may want to look at the PR on this very similar jira: SPARK-20845

> Support insert into serial columns of table
> ---
>
> Key: SPARK-21548
> URL: https://issues.apache.org/jira/browse/SPARK-21548
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: LvDongrong
>Priority: Major
>
> When we use the 'insert into ...' statement we can only insert all the 
> columns into the table. But in some cases our table has many columns and we 
> are only interested in some of them, so we want to support the statement 
> "insert into table tbl (column1, column2, ...) values (value1, value2, ...)".






[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-19 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620454#comment-16620454
 ] 

Mihaly Toth commented on SPARK-25331:
-

I have updated the PR with a potential solution and removed the WIP flag from 
it.

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331






[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-11 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611244#comment-16611244
 ] 

Mihaly Toth commented on SPARK-25331:
-

I will try to make it idempotent then.

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331






[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-10 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608933#comment-16608933
 ] 

Mihaly Toth commented on SPARK-25331:
-

I was thinking about how to make FileStreamSink idempotent even in failure 
cases using the first approach (deterministic file names).

As a starting point we need to put the partition id (and bucket id if it 
exists) into the file name and remove the UUID from it. If the same batch is 
rewritten and the file already exists we can simply skip writing it again 
assuming the same partition of the same batch will generate the same data 
again. There are a few special cases though:

If the file is half written when the writing executor stops there will be 
missing records from the end of the file. We can eliminate this by first 
writing the data into a temp file and then moving it to its intended location.
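
As an illustration of the temp file idea, a sketch only, using the Hadoop 
FileSystem API directly ({{writeAtomically}} is a hypothetical helper, not 
Spark's actual writer):

{code:scala}
import org.apache.hadoop.fs.{FileSystem, Path}

// Write the partition to a temporary name first and publish it with a rename,
// so a half-written file never becomes visible under its deterministic name.
def writeAtomically(fs: FileSystem, finalPath: Path)(write: Path => Unit): Unit = {
  val tempPath = new Path(finalPath.getParent, s".${finalPath.getName}.inprogress")
  write(tempPath)
  if (!fs.rename(tempPath, finalPath)) {
    // the final file may already exist from an earlier attempt of the same
    // batch/partition, in which case the temp copy can simply be dropped
    fs.delete(tempPath, false)
  }
}
{code}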

The other problematic area is around the {{maxRecordsPerFile}} limit. With it 
the same batch+partition pair may generate multiple files, and it may happen 
that some of these files were created while others are missing. Generating the 
missing ones only works if the ordering of the items is exactly the same in 
each run, which may or may not be true. If the order differs between the two 
runs and we simply skip generating the files that already exist, there may be 
missing or duplicated items in the resulting files.

We could subtract the records in the already existing files from the input RDD. 
I feel that would make the writing logic quite complex, and it would put 
unexpected computational load onto the executors. But it would work in all 
cases.

Another solution for partially generated file sets would be to start reading 
and generating the file at the same time and compare the records one by one. If 
the already existing file is the same as the file to be generated we can skip 
creating the file. If it is different we can create a file with some mark in 
its name like "-v2". With this the receiver can achieve exactly once semantics 
in the following ways:
 # Do not limit the maximum records per file
 # Limit the number of records but apply strict ordering on the resulting rdd
 # Limit the number of records without applying strict ordering but compensate 
for those files that have newer versions appearing in the output directory

[~rxin], [~Gengliang.Wang] what is your opinion on this?

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331






[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16603143#comment-16603143
 ] 

Mihaly Toth commented on SPARK-25331:
-

After looking into how this could be solved, there are a few potential 
approaches I can think of:
# Make the resulting file names deterministic based on the input. Currently the 
name contains a UUID, which is by nature different in each run. The question 
here is whether partitioning of the data can always be done the same way, and 
what else was the motivation for adding a UUID to the name.
# Create a "write-ahead manifest file" which contains the generated file names. 
This could be used in {{ManifestFileCommitProtocol.setupJob}}, which is 
currently a no-op. We may need to store some additional data, like partitioning, 
in order to generate the same file contents again.
# Document and mandate the use of the manifest file for consumers of the file 
output. Currently this file is not mentioned in the docs. Even if it were 
documented, that would make the life of the consumer more difficult, not to 
mention that it would be somewhat counter-intuitive.

Before rushing into the implementation it would make sense to discuss the 
direction, I guess. I would pick the first approach if that is possible.
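
To illustrate what the first option could look like, a minimal sketch (the 
naming below is hypothetical, not Spark's actual scheme):

{code:scala}
// Derive the part file name from stable inputs (batch, partition, optional bucket)
// instead of a per-run random UUID, so re-running the same batch yields the same names.
def deterministicFileName(batchId: Long, partitionId: Int, bucketId: Option[Int]): String = {
  val bucketSuffix = bucketId.map(b => f"-b$b%05d").getOrElse("")
  f"part-$partitionId%05d$bucketSuffix-batch-$batchId.parquet"
}
{code}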

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331






[jira] [Updated] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Mihaly Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihaly Toth updated SPARK-25331:

Description: 
Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
been started by {{FileFormatWriter.write}}, and then the resulting task sets are 
completed but in the meantime the driver dies. In such a case, repeating 
{{FileStreamSink.addBatch}} will result in duplicate writing of the data.

In the event the driver fails after the executors start processing the job, the 
processed batch will be written twice.

Steps needed:
# call {{FileStreamSink.addBatch}}
# make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
# call {{FileStreamSink.addBatch}} with the same data
# make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
# Verify file output - according to {{Sink.addBatch}} documentation the rdd 
should be written only once

I have created a wip PR with a unit test:
https://github.com/apache/spark/pull/22331


  was:
Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
been started by {{FileFormatWriter.write}}, and then the resulting task sets are 
completed but in the meantime the driver dies. In such a case, repeating 
{{FileStreamSink.addBatch}} will result in duplicate writing of the data.

In the event the driver fails after the executors start processing the job, the 
processed batch will be written twice.

Steps needed:
1. call {{FileStreamSink.addBatch}}
2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
3. call {{FileStreamSink.addBatch}} with the same data
4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
5. Verify file output - according to {{Sink.addBatch}} documentation the rdd 
should be written only once

I have created a wip PR with a unit test:
https://github.com/apache/spark/pull/22331



> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331






[jira] [Created] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Mihaly Toth (JIRA)
Mihaly Toth created SPARK-25331:
---

 Summary: Structured Streaming File Sink duplicates records in case 
of driver failure
 Key: SPARK-25331
 URL: https://issues.apache.org/jira/browse/SPARK-25331
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Mihaly Toth


Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
been started by {{FileFormatWriter.write}}, and then the resulting task sets are 
completed but in the meantime the driver dies. In such a case, repeating 
{{FileStreamSink.addBatch}} will result in duplicate writing of the data.

In the event the driver fails after the executors start processing the job, the 
processed batch will be written twice.

Steps needed:
1. call {{FileStreamSink.addBatch}}
2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
3. call {{FileStreamSink.addBatch}} with the same data
4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
5. Verify file output - according to {{Sink.addBatch}} documentation the rdd 
should be written only once

I have created a wip PR with a unit test:
https://github.com/apache/spark/pull/22331







[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",

2018-06-15 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513897#comment-16513897
 ] 

Mihaly Toth commented on SPARK-22918:
-

I managed to reproduce the problem in a unit test. When using a security 
manager (with Derby) one needs to apply a security policy using 
{{Policy.setPolicy()}}. In its {{.getPermissions.implies}} one is tempted to 
use {{new SystemPermission("engine", "usederbyinternals")}}. This works fine, 
but when you run a Spark session it is seemingly ignored. This is caused by 
{{IsolatedClientLoader}}: {{SystemPermission}} does not work across class 
loaders, meaning the permission that is checked needs to be loaded by the same 
class loader as the one defined in the Policy. Otherwise their classes will not 
be equal and the call gets rejected.

One solution is to wrap the original {{SystemPermission}} in another permission 
in the policy that only compares names and class names, like:

{code:scala}
new Permission(delegate.getName) {
  override def getActions: String = delegate.getActions

  override def implies(permission: Permission): Boolean =
    delegate.getClass.getCanonicalName == permission.getClass.getCanonicalName &&
      delegate.getName == permission.getName

  override def hashCode(): Int = reflectionHashCode(this)

  override def equals(obj: scala.Any): Boolean = reflectionEquals(this, obj)
}
{code}

At least this one worked for me. It also works with {{new AllPermission()}} in 
case one is not really into using fine-grained access control.
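
For completeness, a minimal sketch of wiring this into a test with {{new AllPermission()}} 
(test-only code, names assumed):

{code:scala}
import java.security.{AllPermission, CodeSource, PermissionCollection, Permissions, Policy}

// Install a permissive Policy before the SparkSession is created, so the
// derby "usederbyinternals" check passes regardless of which class loader
// performs it. Use only in tests.
val allPermissions = new Permissions()
allPermissions.add(new AllPermission())

Policy.setPolicy(new Policy {
  override def getPermissions(codesource: CodeSource): PermissionCollection = allPermissions
})
System.setSecurityManager(new SecurityManager)
{code}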

> sbt test (spark - local) fail after upgrading to 2.2.1 with: 
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
> 
>
> Key: SPARK-22918
> URL: https://issues.apache.org/jira/browse/SPARK-22918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Damian Momot
>Priority: Major
>
> After upgrading 2.2.0 -> 2.2.1, the sbt test command in one of my projects 
> started to fail with the following exception:
> {noformat}
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
>   at 
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   at 
> java.security.AccessController.checkPermission(AccessController.java:884)
>   at 
> org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown
>  Source)
>   at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown 
> Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>   at 
> org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.c

[jira] [Commented] (SPARK-20845) Support specification of column names in INSERT INTO

2018-06-09 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16507032#comment-16507032
 ] 

Mihaly Toth commented on SPARK-20845:
-

I am working on this. I will post a _work in progress_ PR shortly.

> Support specification of column names in INSERT INTO
> 
>
> Key: SPARK-20845
> URL: https://issues.apache.org/jira/browse/SPARK-20845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> Some databases allow you to specify column names when specifying the target 
> of an INSERT INTO. For example, in SQLite:
> {code}
> sqlite> CREATE TABLE twocolumn (x INT, y INT); INSERT INTO twocolumn(x, y) 
> VALUES (44,51), (NULL,52), (42,53), (45,45)
>...> ;
> sqlite> select * from twocolumn;
> 44|51
> |52
> 42|53
> 45|45
> {code}
> I have a corpus of existing queries of this form which I would like to run on 
> Spark SQL, so I think we should extend our dialect to support this syntax.
> When implementing this, we should make sure to test the following behaviors 
> and corner-cases:
> - Number of columns specified is greater than or less than the number of 
> columns in the table.
> - Specification of repeated columns.
> - Specification of columns which do not exist in the target table.
> - Permute column order instead of using the default order in the table.
> For each of these, we should check how SQLite behaves and should also compare 
> against another database. It looks like T-SQL supports this; see 
> https://technet.microsoft.com/en-us/library/dd776381(v=sql.105).aspx under 
> the "Inserting data that is not in the same order as the table columns" 
> header.






[jira] [Commented] (SPARK-22918) sbt test (spark - local) fail after upgrading to 2.2.1 with: java.security.AccessControlException: access denied org.apache.derby.security.SystemPermission( "engine",

2018-05-11 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16472298#comment-16472298
 ] 

Mihaly Toth commented on SPARK-22918:
-

Yep, probably anybody who introduces a SecurityManager needs to grant the above 
permission. I am just wondering why such implementation details propagate up 
through the architecture. Is there any added level of security on top of the 
file system level security below Derby? Shouldn't there be at least a 
configuration option to disable such security checks?
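
For reference, the grant entry such a test policy needs looks roughly like this 
(standard Java policy file syntax, with the codeBase clause omitted):

{noformat}
grant {
  permission org.apache.derby.security.SystemPermission "engine", "usederbyinternals";
};
{noformat}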

> sbt test (spark - local) fail after upgrading to 2.2.1 with: 
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
> 
>
> Key: SPARK-22918
> URL: https://issues.apache.org/jira/browse/SPARK-22918
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Damian Momot
>Priority: Major
>
> After upgrading 2.2.0 -> 2.2.1, the sbt test command in one of my projects 
> started to fail with the following exception:
> {noformat}
> java.security.AccessControlException: access denied 
> org.apache.derby.security.SystemPermission( "engine", "usederbyinternals" )
>   at 
> java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
>   at 
> java.security.AccessController.checkPermission(AccessController.java:884)
>   at 
> org.apache.derby.iapi.security.SecurityUtil.checkDerbyInternalsPrivilege(Unknown
>  Source)
>   at org.apache.derby.iapi.services.monitor.Monitor.startMonitor(Unknown 
> Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot$1.run(Unknown Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.iapi.jdbc.JDBCBoot.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.boot(Unknown Source)
>   at org.apache.derby.jdbc.EmbeddedDriver.(Unknown Source)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at java.lang.Class.newInstance(Class.java:442)
>   at 
> org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:47)
>   at 
> org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.initialiseDataSources(ConnectionFactoryImpl.java:131)
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl.(ConnectionFactoryImpl.java:85)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:325)
>   at 
> org.datanucleus.store.AbstractStoreManager.registerConnectionFactory(AbstractStoreManager.java:282)
>   at 
> org.datanucleus.store.AbstractStoreManager.(AbstractStoreManager.java:240)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:286)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187

[jira] [Resolved] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one

2018-04-05 Thread Mihaly Toth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihaly Toth resolved SPARK-23465.
-
Resolution: Won't Fix

Based on the PR feedback I would conclude that this functionality is not widely 
needed.

> Dataset.withAllColumnsRenamed should map all column names to a new one
> --
>
> Key: SPARK-23465
> URL: https://issues.apache.org/jira/browse/SPARK-23465
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Minor
>
> Currently one can rename a column only one by one using the 
> {{withColumnRenamed()}} function. When one would like to rename all or most 
> of the columns, it would be easier to specify an algorithm for mapping from 
> the old name to the new one (like prefixing) than to iterate over all the 
> fields.
> Example usage is joining to a Dataset with the same or similar schema 
> (a special case is self joining) where the names are the same or overlapping. 
> Such a joined Dataset would fail at {{saveAsTable()}}.
> With the new function usage would be easy like that:
> {code:java}
> ds.withAllColumnsRenamed("prefix" + _)
> {code}






[jira] [Commented] (SPARK-23729) Glob resolution breaks remote naming of files/archives

2018-03-17 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16403876#comment-16403876
 ] 

Mihaly Toth commented on SPARK-23729:
-

Already working on this. Will submit a PR shortly.

> Glob resolution breaks remote naming of files/archives
> --
>
> Key: SPARK-23729
> URL: https://issues.apache.org/jira/browse/SPARK-23729
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Mihaly Toth
>Priority: Major
>
> When one uses {{spark-submit}} with either the {{\-\-archives}} or the 
> {{\-\-files}} parameter and the file name contains glob patterns, the rename 
> part ({{...#nameAs}}) of the file name ends up being ignored.
> Thinking over the resolution cases: if the resolution results in multiple 
> files, it does not make sense to send all of them under the same remote name, 
> so this should result in an error.






[jira] [Created] (SPARK-23729) Glob resolution breaks remote naming of files/archives

2018-03-17 Thread Mihaly Toth (JIRA)
Mihaly Toth created SPARK-23729:
---

 Summary: Glob resolution breaks remote naming of files/archives
 Key: SPARK-23729
 URL: https://issues.apache.org/jira/browse/SPARK-23729
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Mihaly Toth


When one uses {{spark-submit}} with either the {{\-\-archives}} or the 
{{\-\-files}} parameter and the file name contains glob patterns, the rename 
part ({{...#nameAs}}) of the file name ends up being ignored.

Thinking over the resolution cases: if the resolution results in multiple files, 
it does not make sense to send all of them under the same remote name, so this 
should result in an error.






[jira] [Commented] (SPARK-20845) Support specification of column names in INSERT INTO

2018-02-20 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370056#comment-16370056
 ] 

Mihaly Toth commented on SPARK-20845:
-

As shown in the links this duplicates SPARK-21548. However, the wording of this 
issue sounds more descriptive to me. On the other hand, the other one has a PR 
linked to it.

To reduce duplication I would propose closing SPARK-21548 and adding the prior 
pull request https://github.com/apache/spark/pull/18756 to the links section of 
this issue.

> Support specification of column names in INSERT INTO
> 
>
> Key: SPARK-20845
> URL: https://issues.apache.org/jira/browse/SPARK-20845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> Some databases allow you to specify column names when specifying the target 
> of an INSERT INTO. For example, in SQLite:
> {code}
> sqlite> CREATE TABLE twocolumn (x INT, y INT); INSERT INTO twocolumn(x, y) 
> VALUES (44,51), (NULL,52), (42,53), (45,45)
>...> ;
> sqlite> select * from twocolumn;
> 44|51
> |52
> 42|53
> 45|45
> {code}
> I have a corpus of existing queries of this form which I would like to run on 
> Spark SQL, so I think we should extend our dialect to support this syntax.
> When implementing this, we should make sure to test the following behaviors 
> and corner-cases:
> - Number of columns specified is greater than or less than the number of 
> columns in the table.
> - Specification of repeated columns.
> - Specification of columns which do not exist in the target table.
> - Permute column order instead of using the default order in the table.
> For each of these, we should check how SQLite behaves and should also compare 
> against another database. It looks like T-SQL supports this; see 
> https://technet.microsoft.com/en-us/library/dd776381(v=sql.105).aspx under 
> the "Inserting data that is not in the same order as the table columns" 
> header.






[jira] [Commented] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one

2018-02-19 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369275#comment-16369275
 ] 

Mihaly Toth commented on SPARK-23465:
-

I have started working on this.

> Dataset.withAllColumnsRenamed should map all column names to a new one
> --
>
> Key: SPARK-23465
> URL: https://issues.apache.org/jira/browse/SPARK-23465
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Minor
>
> Currently one can rename a column only one by one using the 
> {{withColumnRenamed()}} function. When one would like to rename all or most 
> of the columns, it would be easier to specify an algorithm for mapping from 
> the old name to the new one (like prefixing) than to iterate over all the 
> fields.
> Example usage is joining to a Dataset with the same or similar schema 
> (a special case is self joining) where the names are the same or overlapping. 
> Such a joined Dataset would fail at {{saveAsTable()}}.
> With the new function usage would be easy like that:
> {code:java}
> ds.withAllColumnsRenamed("prefix" + _)
> {code}






[jira] [Created] (SPARK-23465) Dataset.withAllColumnsRenamed should map all column names to a new one

2018-02-19 Thread Mihaly Toth (JIRA)
Mihaly Toth created SPARK-23465:
---

 Summary: Dataset.withAllColumnsRenamed should map all column names 
to a new one
 Key: SPARK-23465
 URL: https://issues.apache.org/jira/browse/SPARK-23465
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Mihaly Toth


Currently one can rename a column only one by one using the 
{{withColumnRenamed()}} function. When one would like to rename all or most of 
the columns, it would be easier to specify an algorithm for mapping from the old 
name to the new one (like prefixing) than to iterate over all the fields.

Example usage is joining to a Dataset with the same or similar schema (a special 
case is self joining) where the names are the same or overlapping. Such a joined 
Dataset would fail at {{saveAsTable()}}.

With the new function usage would be easy like that:
{code:java}
ds.withAllColumnsRenamed("prefix" + _)
{code}
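
For comparison, the effect can be approximated today with the existing API; a 
small sketch assuming a simple prefixing function:

{code:scala}
// Not the proposed withAllColumnsRenamed, just today's workaround.
val renamed = ds.toDF(ds.columns.map("prefix" + _): _*)

// Or, folding the existing single-column rename over all columns:
val renamed2 = ds.columns.foldLeft(ds.toDF()) { (df, c) =>
  df.withColumnRenamed(c, "prefix" + c)
}
{code}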







[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-07 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16355155#comment-16355155
 ] 

Mihaly Toth commented on SPARK-23329:
-

Nice. I like that the redundant description part is simply omitted.

The only issue I see is that {{e}} is actually not an angle. We could at least 
make it plural, like:
{code}
/**
 * @param e angles in radians
 * @return sines of the angles, as if computed by [[java.lang.Math.sin]]
 * ...
 */
{code}
 
I was also thinking that mentioning it is actually a Column would not inflate 
the lines very much, while at the same time being more precise:

{code}
/**
 * @param e [[Column]] of angles in radians
 * @return [[Column]] of sines of the angles, as if computed by 
[[java.lang.Math.sin]]
 * ...
 */
{code}

Now looking at this, the first one seems better because it tells the truth, and 
one can easily figure out that the angles are stored in a Column.

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978






[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-06 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354166#comment-16354166
 ] 

Mihaly Toth commented on SPARK-23329:
-

How about this approach?

{code:scala}
  /**
   * Computes the sine of the given column. Works same as [[java.lang.Math.sin]]
   *
   * @param  e Column of angles, in radians.
   * @return new Column comprising sine value of each `e` element.
   *
   * @group math_funcs
   * @since 1.4.0
   */
  def sin(e: Column): Column = withExpr { Sin(e.expr) }
{code}

I am in a bit of trouble with the wording. The original doc stated {{Computes 
the sine of the given}} *{{value}}*, which is not really the case. Even 
_calculating the sine of a Column_ is not 100% precise, but I guess it is not 
misleading given the context and it is concise enough on the other hand.

The unit of measurement could possibly be moved to the param description, I 
believe.

Another question: the majority of the javadocs in {{functions.scala}} lack 
return value and parameter descriptions. Does this Jira aim to fix all of them 
(there are 334 ' def ' expressions in the file), just the math_funcs group, or 
only the trigonometric ones, as the title suggests?

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978


