[jira] [Commented] (SPARK-14279) Improve the spark build to pick the version information from the pom file and add git commit information

2016-05-23 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296577#comment-15296577
 ] 

Sanket Reddy commented on SPARK-14279:
--

@tgraves Please assign this issue to dhruve

> Improve the spark build to pick the version information from the pom file and 
> add git commit information
> 
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from pom.version, probably stored inside a properties file. It would 
> also be nice to have other details like branch and build information in the 
> output of spark-submit --version.
> Note, the motivation is to more easily tie this to automated continuous 
> integration and deployment and to have easy traceability.
> Part of the problem is that right now you have to manually change a source 
> file to change the version that comes out when you run spark-submit --version. 
> With continuous integration the build numbers could be something like 1.6.1.X 
> (where X increments on each change), and I want to see the exact version 
> easily. Having to manually change a source file makes that hard. It should 
> also make Apache Spark releases easier, since you don't have to manually 
> change this file either.
> The other important part for me is the git information. This easily lets me 
> trace a build back to exact commits. We have a multi-tenant YARN cluster and 
> users can run many different versions at once. I want to be able to see 
> exactly which version they are running. The reasons for knowing the exact 
> version range from helping debug a problem to making sure someone didn't hack 
> something into Spark to cause bad behavior (generally they should use an 
> approved version), etc.
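
A rough sketch of how this could work at runtime, assuming the build is changed
to write the pom version and git details into a properties resource on the
classpath (the resource name and property keys below are illustrative
assumptions, not the actual implementation):

// Hedged sketch: read build metadata assumed to be generated by the build,
// instead of a hard-coded SPARK_VERSION constant.
import java.util.Properties

object BuildInfo {
  private val props = new Properties()
  // resource name and property keys are assumptions for illustration
  private val in = getClass.getResourceAsStream("/spark-version-info.properties")
  if (in != null) {
    try props.load(in) finally in.close()
  }

  val version: String   = props.getProperty("version", "<unknown>")  // from pom.version
  val branch: String    = props.getProperty("branch", "<unknown>")
  val revision: String  = props.getProperty("revision", "<unknown>") // git commit SHA
  val buildDate: String = props.getProperty("date", "<unknown>")

  def versionString: String =
    s"Spark $version (branch $branch, revision $revision, built $buildDate)"
}

spark-submit --version could then print something like BuildInfo.versionString
instead of the static constant.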






[jira] [Created] (SPARK-20355) Display Spark version on history page

2017-04-17 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-20355:


 Summary: Display Spark version on history page
 Key: SPARK-20355
 URL: https://issues.apache.org/jira/browse/SPARK-20355
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 2.1.0
Reporter: Sanket Reddy
Priority: Minor


The Spark version for a specific application is not displayed on the history 
page today. It would be nice to show the Spark version on the UI when we click 
on a specific application.

Currently there seems to be a way, as SparkListenerLogStart records the 
application's Spark version. So it should be trivial to listen to this event 
and surface the version on the UI.
{"Event":"SparkListenerLogStart","Spark Version":"1.6.2.0_2.7.2.7.1604210306_161643"}






[jira] [Created] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala

2016-03-30 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-14279:


 Summary: Improve the spark build to pick the version information 
from the pom file instead of package.scala
 Key: SPARK-14279
 URL: https://issues.apache.org/jira/browse/SPARK-14279
 Project: Spark
  Issue Type: Story
Reporter: Sanket Reddy
Priority: Minor









[jira] [Updated] (SPARK-14279) Improve the spark build to pick the version information from the pom file instead of package.scala

2016-03-30 Thread Sanket Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-14279:
-
Description: Right now spark-submit --version and other parts of the code pick 
up version information from a static SPARK_VERSION. We would want to pick the 
version from pom.version, probably stored inside a properties file. It would 
also be nice to have other details like branch and build information in the 
output of spark-submit --version.

> Improve the spark build to pick the version information from the pom file 
> instead of package.scala
> --
>
> Key: SPARK-14279
> URL: https://issues.apache.org/jira/browse/SPARK-14279
> Project: Spark
>  Issue Type: Story
>  Components: Build
>Reporter: Sanket Reddy
>Assignee: Sanket Reddy
>Priority: Minor
>
> Right now spark-submit --version and other parts of the code pick up 
> version information from a static SPARK_VERSION. We would want to pick the 
> version from pom.version, probably stored inside a properties file. It would 
> also be nice to have other details like branch and build information in the 
> output of spark-submit --version.






[jira] [Commented] (SPARK-10436) spark-submit overwrites spark.files defaults with the job script filename

2015-09-09 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14737544#comment-14737544
 ] 

Sanket Reddy commented on SPARK-10436:
--

I am a newbie and interested in this; I will take a look at it.

> spark-submit overwrites spark.files defaults with the job script filename
> -
>
> Key: SPARK-10436
> URL: https://issues.apache.org/jira/browse/SPARK-10436
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.4.0
> Environment: Ubuntu, Spark 1.4.0 Standalone
>Reporter: axel dahl
>Priority: Minor
>  Labels: easyfix, feature
>
> In my spark-defaults.conf I have configured a set of libraries to be 
> uploaded to my Spark 1.4.0 Standalone cluster.  The entry appears as:
> spark.files  library.zip,file1.py,file2.py
> When I execute spark-submit -v test.py
> I see that spark-submit reads the defaults correctly, but that it overwrites 
> the "spark.files" default entry and replaces it with the name of the job 
> script, i.e. "test.py".
> This behavior doesn't seem intuitive.  test.py should be added to the Spark 
> working folder, but it should not overwrite the "spark.files" defaults.
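
A minimal sketch of the merge behavior the reporter expects, under the
assumption that spark-submit should append the job script to the configured
spark.files rather than replace it (the helper and variable names are
illustrative, not the actual SparkSubmit code):

// Hedged sketch: combine the spark.files default with the submitted script.
def mergeFileLists(lists: String*): String =
  lists.filter(l => l != null && l.trim.nonEmpty).mkString(",")

val defaultSparkFiles = "library.zip,file1.py,file2.py" // from spark-defaults.conf
val primaryResource   = "test.py"                       // the script given to spark-submit

val effectiveSparkFiles = mergeFileLists(defaultSparkFiles, primaryResource)
// effectiveSparkFiles == "library.zip,file1.py,file2.py,test.py"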






[jira] [Commented] (SPARK-10401) spark-submit --unsupervise

2015-09-12 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742252#comment-14742252
 ] 

Sanket Reddy commented on SPARK-10401:
--

I would like to work on it

> spark-submit --unsupervise 
> ---
>
> Key: SPARK-10401
> URL: https://issues.apache.org/jira/browse/SPARK-10401
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos
>Affects Versions: 1.5.0
>Reporter: Alberto Miorin
>
> When I submit a streaming job with the option --supervise to the new mesos 
> spark dispatcher, I cannot decommission the job.
> I tried spark-submit --kill, but dispatcher always restarts it.
> Driver and Executors are both Docker containers.
> I think there should be a subcommand spark-submit --unsupervise 






[jira] [Issue Comment Deleted] (SPARK-10401) spark-submit --unsupervise

2015-09-12 Thread Sanket Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-10401:
-
Comment: was deleted

(was: I would like to work on it)

> spark-submit --unsupervise 
> ---
>
> Key: SPARK-10401
> URL: https://issues.apache.org/jira/browse/SPARK-10401
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos
>Affects Versions: 1.5.0
>Reporter: Alberto Miorin
>
> When I submit a streaming job with the option --supervise to the new mesos 
> spark dispatcher, I cannot decommission the job.
> I tried spark-submit --kill, but dispatcher always restarts it.
> Driver and Executors are both Docker containers.
> I think there should be a subcommand spark-submit --unsupervise 






[jira] [Commented] (SPARK-6166) Add config to limit number of concurrent outbound connections for shuffle fetch

2016-01-15 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101995#comment-15101995
 ] 

Sanket Reddy commented on SPARK-6166:
-

Hi, I modified the code to fit the latest Spark build; I will have the patch up 
soon.

> Add config to limit number of concurrent outbound connections for shuffle 
> fetch
> ---
>
> Key: SPARK-6166
> URL: https://issues.apache.org/jira/browse/SPARK-6166
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Mridul Muralidharan
>Assignee: Shixiong Zhu
>Priority: Minor
>
> spark.reducer.maxMbInFlight puts a bound on the in-flight data in terms of 
> size.
> But this is not always sufficient: when the number of hosts in the cluster 
> increases, this can lead to a very large number of inbound connections to one 
> or more nodes, causing workers to fail under the load.
> I propose we also add a spark.reducer.maxReqsInFlight, which puts a bound on 
> the number of outstanding outbound connections.
> This might still cause hotspots in the cluster, but in our tests this has 
> significantly reduced the occurrence of worker failures.
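
A minimal sketch of the idea, assuming a simple throttle that bounds both the
bytes and the number of outstanding fetch requests and defers the rest (this is
illustrative only, not the actual ShuffleBlockFetcherIterator logic):

// Hedged sketch: admit a fetch request only while both limits hold.
import scala.collection.mutable

case class FetchRequest(sizeBytes: Long)

class FetchThrottle(maxBytesInFlight: Long, maxReqsInFlight: Int) {
  private var bytesInFlight = 0L
  private var reqsInFlight = 0
  private val deferred = mutable.Queue[FetchRequest]()

  def submit(req: FetchRequest)(send: FetchRequest => Unit): Unit = synchronized {
    if (reqsInFlight < maxReqsInFlight && bytesInFlight + req.sizeBytes <= maxBytesInFlight) {
      reqsInFlight += 1; bytesInFlight += req.sizeBytes; send(req)
    } else {
      deferred.enqueue(req)  // wait until an in-flight request completes
    }
  }

  def onComplete(req: FetchRequest)(send: FetchRequest => Unit): Unit = synchronized {
    reqsInFlight -= 1; bytesInFlight -= req.sizeBytes
    while (deferred.nonEmpty && reqsInFlight < maxReqsInFlight &&
           bytesInFlight + deferred.head.sizeBytes <= maxBytesInFlight) {
      val next = deferred.dequeue()
      reqsInFlight += 1; bytesInFlight += next.sizeBytes; send(next)
    }
  }
}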






[jira] [Commented] (SPARK-21501) Spark shuffle index cache size should be memory based

2017-07-25 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100658#comment-16100658
 ] 

Sanket Reddy commented on SPARK-21501:
--

Hi, I am working on this issue; noting it here to avoid any duplicated effort. Thanks.

> Spark shuffle index cache size should be memory based
> -
>
> Key: SPARK-21501
> URL: https://issues.apache.org/jira/browse/SPARK-21501
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: Thomas Graves
>
> Right now the Spark shuffle service has a cache for index files. It is based 
> on the number of files cached (spark.shuffle.service.index.cache.entries). 
> This can cause issues if people have a lot of reducers, because the size of 
> each entry can fluctuate based on the number of reducers.
> We saw an issue with a job that had 17 reducers; it caused the NM with the 
> Spark shuffle service to use 700-800MB of memory in the NM by itself.
> We should change this cache to be memory based and only allow a certain 
> amount of memory to be used. When I say memory based I mean the cache should 
> have a limit of, say, 100MB.
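
A small sketch of what a memory-bounded cache could look like, assuming a Guava
cache keyed by index file with a weigher that charges each entry its
approximate size in bytes (the entry type and loader below are placeholders):

// Hedged sketch: bound the shuffle index cache by total bytes, not entry count.
import com.google.common.cache.{CacheBuilder, CacheLoader, Weigher}

case class ShuffleIndexInfo(offsets: Array[Long]) {
  def sizeInBytes: Int = 8 * offsets.length
}

val maxCacheBytes = 100L * 1024 * 1024 // e.g. a 100MB cap instead of an entry count

val indexCache = CacheBuilder.newBuilder()
  .maximumWeight(maxCacheBytes)
  .weigher[String, ShuffleIndexInfo](new Weigher[String, ShuffleIndexInfo] {
    override def weigh(file: String, info: ShuffleIndexInfo): Int = info.sizeInBytes
  })
  .build[String, ShuffleIndexInfo](new CacheLoader[String, ShuffleIndexInfo] {
    override def load(file: String): ShuffleIndexInfo =
      ShuffleIndexInfo(Array.fill(1024)(0L)) // placeholder for reading the real index file
  })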






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-28 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754406#comment-16754406
 ] 

Sanket Reddy commented on SPARK-25692:
--

I had a few observations regarding this test suite.

When I run it on a Mac:

$ sysctl hw.physicalcpu hw.logicalcpu
hw.physicalcpu: 4
hw.logicalcpu: 8 

I don't see the issue. I think this has to do with io.serverThreads and the 
number of threads available for handling chunk blocks: since there are multiple 
tests in the suite fetching chunked blocks, a sufficient number of threads is 
needed to handle the requests.

On a VM I was able to reproduce this consistently:

-bash-4.1$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
CPU(s): 4
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4

The root cause of the failure, as [~zsxwing] pointed out, appears to be the 
sharing of worker threads in 
https://github.com/apache/spark/blob/c00186f90cfcc33492d760f874ead34f0e3da6ed/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java#L88

When I remove the static modifier I no longer see the test failure.

So do we really need it to be static?

I don't think this requires a global declaration: these threads are only 
required on the shuffle server end, and the client-side TransportContext 
initialization does not create them. I assume there would be only one 
TransportContext object for the shuffle server, so I think this is fine as an 
instance variable and I see no harm. I will do some more testing, and if 
everything is fine I will put up the PR.
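
A minimal illustration of the trade-off being discussed, using plain thread
pools rather than the actual Netty EventLoopGroup in TransportContext (class
and field names here are made up for the example):

// Hedged sketch: a JVM-wide shared ("static") pool vs a per-instance pool.
import java.util.concurrent.Executors

object SharedPools {
  // static-style: one pool shared by every context in the JVM; a long-running
  // handler in one test suite can starve this pool for all the others
  val sharedChunkHandlers = Executors.newFixedThreadPool(4)
}

class TransportContextLike(useShared: Boolean) {
  // instance-style: each server context owns its pool and shuts it down on close()
  private val chunkHandlers =
    if (useShared) SharedPools.sharedChunkHandlers else Executors.newFixedThreadPool(4)

  def close(): Unit = if (!useShared) chunkHandlers.shutdown()
}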

 

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-30 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756210#comment-16756210
 ] 

Sanket Reddy commented on SPARK-25692:
--

Did some further digging.

How to reproduce:
./build/mvn test 
-Dtest=org.apache.spark.network.RequestTimeoutIntegrationSuite,org.apache.spark.network.ChunkFetchIntegrationSuite
 -DwildcardSuites=None test

The furtherRequestsDelay test within RequestTimeoutIntegrationSuite was holding 
onto worker references. The test does close the server context, but since the 
threads are global and the test sleeps for 60 seconds before fetching a 
specific chunk, the workers grab the request and wait for the client to consume 
the buffer. However, the test is checking for a request timeout and times out 
after 10 seconds, so the workers are just left waiting for the buffer to be 
consumed, as per my understanding. I don't think these threads need to be 
static, as the server initializes the TransportContext object only once. I did 
some manual tests and it looks good.

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks

2019-01-30 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756226#comment-16756226
 ] 

Sanket Reddy commented on SPARK-25692:
--

Created a PR: [https://github.com/apache/spark/pull/23700]. Please take a look 
and let me know your thoughts. Thanks.

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Created] (SPARK-21798) No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server

2017-08-21 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-21798:


 Summary: No config to replace deprecated SPARK_CLASSPATH config 
for launching daemons like History Server
 Key: SPARK-21798
 URL: https://issues.apache.org/jira/browse/SPARK-21798
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Sanket Reddy
Priority: Minor


The History Server launch uses SparkClassCommandBuilder for launching the 
server. It is observed that SPARK_CLASSPATH has been removed and deprecated. 
For spark-submit this takes a different route, and spark.driver.extraClassPath 
takes care of specifying additional jars in the classpath that were previously 
specified in SPARK_CLASSPATH. Right now the only way to specify additional jars 
for launching daemons such as the history server is SPARK_DIST_CLASSPATH 
(https://spark.apache.org/docs/latest/hadoop-provided.html), but this I presume 
is a distribution classpath. It would be nice to have a config similar to 
spark.driver.extraClassPath for launching daemons such as the history server. 





[jira] [Created] (SPARK-21890) ObtainCredentials does not pass creds to addDelegationTokens

2017-08-31 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-21890:


 Summary: ObtainCredentials does not pass creds to 
addDelegationTokens
 Key: SPARK-21890
 URL: https://issues.apache.org/jira/browse/SPARK-21890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Sanket Reddy


I observed this while running an Oozie job trying to connect to HBase via Spark.
It looks like the creds are not being passed in 
https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53
 for the 2.2 release.

Stack trace:
Warning: Skip remote jar 
hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/schintap/spark_oozie/apps/lib/spark-starter-2.0-SNAPSHOT-jar-with-dependencies.jar.
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], 
main() threw exception, Delegation Token can be issued only with kerberos or 
web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)

org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)

at org.apache.hadoop.ipc.Client.call(Client.java:1471)
at org.apache.hadoop.ipc.Client.call(Client.java:1408)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy10.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy11.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1038)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1543)
at 
org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:531)
at 
org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:509)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegation

[jira] [Updated] (SPARK-21890) ObtainCredentials does not pass creds to addDelegationTokens

2017-09-01 Thread Sanket Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-21890:
-
Description: 
I observed this while running an Oozie job trying to connect to HBase via Spark.
It looks like the creds are not being passed in 
https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53
 for the 2.2 release.

More info as to why it fails on a secure grid:
Oozie launches the MapReduce job with a designated TGT to retrieve the tokens 
from the NameNode.
In this case, in order to talk to HBase, it gets an HBase token.
After that, the Spark client is launched by the Oozie launcher, which talks to 
HBase via tokens.
In the current scenario it uses new creds to talk to HBase, which will be 
missing the tokens that have already been acquired, and hence we
see the stack trace exception below.
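
A minimal sketch of the intended fix, assuming the provider should reuse the
credentials already populated by the launcher so that filesystems with an
existing token are not asked for a new one (the method shape is illustrative,
not the actual HadoopFSCredentialProvider code); the stack trace that results
without this follows after the sketch:

// Hedged sketch: start from the credentials the current UGI already holds.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

def obtainTokens(paths: Set[Path], conf: Configuration, renewer: String): Credentials = {
  // reuse the creds passed along by the launcher instead of a fresh Credentials object
  val creds = UserGroupInformation.getCurrentUser.getCredentials
  paths.foreach { p =>
    val fs = p.getFileSystem(conf)
    fs.addDelegationTokens(renewer, creds) // no-op for filesystems whose token is already there
  }
  creds
}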

Stack trace:
Warning: Skip remote jar 
hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/schintap/spark_oozie/apps/lib/spark-starter-2.0-SNAPSHOT-jar-with-dependencies.jar.
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], 
main() threw exception, Delegation Token can be issued only with kerberos or 
web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)

org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
can be issued only with kerberos or web authentication
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)

at org.apache.hadoop.ipc.Client.call(Client.java:1471)
at org.apache.hadoop.ipc.Client.call(Client.java:1408)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy10.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:933)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy11.getDelegationToken(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1038)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getDelegation

[jira] [Commented] (SPARK-21890) ObtainCredentials does not pass creds to addDelegationTokens

2017-09-01 Thread Sanket Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150762#comment-16150762
 ] 

Sanket Reddy commented on SPARK-21890:
--

Will put up a PR for master too. Thanks.

> ObtainCredentials does not pass creds to addDelegationTokens
> 
>
> Key: SPARK-21890
> URL: https://issues.apache.org/jira/browse/SPARK-21890
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sanket Reddy
>
> I observed this while running an Oozie job trying to connect to HBase via 
> Spark.
> It looks like the creds are not being passed in 
> https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53
>  for the 2.2 release.
> More Info as to why it fails on secure grid:
> Oozie client gets the necessary tokens the application needs before 
> launching.  It passes those tokens along to the oozie launcher job (MR job) 
> which will then actually call the Spark client to launch the spark app and 
> pass the tokens along.
> The oozie launcher job cannot get anymore tokens because all it has is tokens 
> ( you can't get tokens with tokens, you need tgt or keytab).  
> The error here is because the launcher job runs the Spark Client to submit 
> the spark job but the spark client doesn't see that it already has the hdfs 
> tokens so it tries to get more, which ends with the exception.
> There was a change with SPARK-19021 to generalize the HDFS credentials 
> provider; it changed the code so we don't pass the existing credentials into 
> the call to get tokens, so it doesn't realize it already has the necessary tokens.
> Stack trace:
> Warning: Skip remote jar 
> hdfs://axonitered-nn1.red.ygrid.yahoo.com:8020/user/schintap/spark_oozie/apps/lib/spark-starter-2.0-SNAPSHOT-jar-with-dependencies.jar.
> Failing Oozie Launcher, Main class 
> [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, 
> Delegation Token can be issued only with kerberos or web authentication
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
> can be issued only with kerberos or web authentication
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:5858)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:687)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:1003)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:448)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:999)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:881)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1936)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2523)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1471)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1408)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>   at com.sun.proxy

[jira] [Created] (SPARK-24533) typesafe has rebranded to lightbend. change the build/mvn endpoint from downloads.typesafe.com to downloads.lightbend.com

2018-06-12 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-24533:


 Summary: typesafe has rebranded to lightbend. change the build/mvn 
endpoint from downloads.typesafe.com to downloads.lightbend.com
 Key: SPARK-24533
 URL: https://issues.apache.org/jira/browse/SPARK-24533
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.3.1
Reporter: Sanket Reddy


Typesafe has rebranded to Lightbend. Change the build/mvn endpoint from 
downloads.typesafe.com to downloads.lightbend.com. Redirection works for now, 
but it is nice to update the endpoint to stay up to date.






[jira] [Commented] (SPARK-24533) typesafe has rebranded to lightbend. change the build/mvn endpoint from downloads.typesafe.com to downloads.lightbend.com

2018-06-12 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509850#comment-16509850
 ] 

Sanket Reddy commented on SPARK-24533:
--

I will put up a PR shortly. Thanks.

> typesafe has rebranded to lightbend. change the build/mvn endpoint from 
> downloads.typesafe.com to downloads.lightbend.com
> -
>
> Key: SPARK-24533
> URL: https://issues.apache.org/jira/browse/SPARK-24533
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Sanket Reddy
>Priority: Trivial
>
> Typesafe has rebranded to Lightbend. Change the build/mvn endpoint from 
> downloads.typesafe.com to downloads.lightbend.com. Redirection works for 
> now, but it is nice to update the endpoint to stay up to date.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-07-11 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-24787:


 Summary: Events being dropped at an alarming rate due to hsync 
being slow for eventLogging
 Key: SPARK-24787
 URL: https://issues.apache.org/jira/browse/SPARK-24787
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 2.3.1, 2.3.0
Reporter: Sanket Reddy


[https://github.com/apache/spark/pull/16924/files] updates the length of the 
in-progress files, allowing the history server to stay responsive.

However, we have a production job that has 6 tasks per stage, and because 
hsync is slow it starts dropping events, so the history server shows wrong 
stats due to the dropped events.

A viable solution is to not sync so frequently, or to make the sync behavior 
configurable.
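
A minimal sketch of the idea, assuming the event logger only forces a durable
sync after a configurable interval and does a cheap flush otherwise (names are
illustrative, not the actual EventLoggingListener code):

// Hedged sketch: throttle the expensive hsync, flush cheaply on every event.
import java.io.OutputStream

class ThrottledSyncWriter(out: OutputStream,
                          hsync: () => Unit,        // e.g. the hadoop stream's hsync()
                          minSyncIntervalMs: Long = 10000L) {
  private var lastSyncMs = 0L

  def writeEvent(json: String): Unit = {
    out.write((json + "\n").getBytes("UTF-8"))
    out.flush()                                      // cheap flush on every event
    val now = System.currentTimeMillis()
    if (now - lastSyncMs >= minSyncIntervalMs) {     // durable sync only periodically
      hsync()
      lastSyncMs = now
    }
  }
}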






[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-07-11 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540638#comment-16540638
 ] 

Sanket Reddy commented on SPARK-24787:
--

I am happy to work on this. I will have a potential solution and a PR up.

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-07-16 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545436#comment-16545436
 ] 

Sanket Reddy commented on SPARK-24787:
--

[~vanzin] do you have any suggestions regarding this issue?

[~olegd] I would rather make this configurable or trigger it periodically.

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Created] (SPARK-24986) OOM in BufferHolder during writes to a stream

2018-07-31 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-24986:


 Summary: OOM in BufferHolder during writes to a stream
 Key: SPARK-24986
 URL: https://issues.apache.org/jira/browse/SPARK-24986
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0, 2.2.0, 2.1.0
Reporter: Sanket Reddy


We have seen an out-of-memory exception while running one of our production 
jobs. We expect the memory allocation to be managed by the unified memory 
manager at run time.

The buffer that grows during writes behaves somewhat like this: if the row 
length is constant, the buffer does not grow; it keeps resetting and writing 
the values into the buffer. If the rows are variable-length and skewed, with 
very large values to be written, this failure happens, and I think the 
estimator that requests the initial execution memory does not account for it. 
Checking the underlying heap before growing the global buffer might be a 
viable option.
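
A minimal sketch of the suggestion above, assuming a grow() that checks the
requested size against the remaining heap before allocating (this is
illustrative, not the actual BufferHolder code):

// Hedged sketch: fail fast with a clear message instead of a raw heap OOM.
class GrowableRowBuffer(initialSize: Int = 64) {
  private var buffer = new Array[Byte](initialSize)
  private var cursor = 0

  def grow(neededSize: Int): Unit = {
    val required = cursor.toLong + neededSize
    if (required > Int.MaxValue) {
      throw new UnsupportedOperationException(
        s"Cannot grow buffer beyond ${Int.MaxValue} bytes (needs $required)")
    }
    if (required > buffer.length) {
      val rt = Runtime.getRuntime
      val freeHeap = rt.maxMemory - (rt.totalMemory - rt.freeMemory)
      val newLength = math.min(required * 2, Int.MaxValue.toLong).toInt
      if (newLength > freeHeap) {
        throw new OutOfMemoryError(
          s"Not enough heap to grow row buffer to $newLength bytes (free: $freeHeap)")
      }
      val tmp = new Array[Byte](newLength)
      System.arraycopy(buffer, 0, tmp, 0, cursor)
      buffer = tmp
    }
  }
  // write methods that advance cursor are omitted from this sketch
}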

java.lang.OutOfMemoryError: Java heap space
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:73)
at 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter.initialize(UnsafeArrayWriter.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_1$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:232)
at 
org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$1.apply(AggregationIterator.scala:221)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:159)
at 
org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.next(SortBasedAggregationIterator.scala:29)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1075)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1091)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1129)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:513)
at 
org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:329)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1966)
at 
org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:270)
18/06/11 21:18:41 ERROR SparkUncaughtExceptionHandler: [Container in shutdown] 
Uncaught exception in thread Thread[stdout writer for 
Python/bin/python3.6,5,main]






[jira] [Commented] (SPARK-24787) Events being dropped at an alarming rate due to hsync being slow for eventLogging

2018-08-17 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583964#comment-16583964
 ] 

Sanket Reddy commented on SPARK-24787:
--

Thanks [~ste...@apache.org] [~vanzin] [~tgraves]. It seems we might have to 
stick with hflush, but think of another potential solution for updating the 
file status, similar to YARN ATS.

Even if I update it periodically, I think the dropped-events issue might 
persist, as it is hard to have proper flow control.

> Events being dropped at an alarming rate due to hsync being slow for 
> eventLogging
> -
>
> Key: SPARK-24787
> URL: https://issues.apache.org/jira/browse/SPARK-24787
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Sanket Reddy
>Priority: Minor
>
> [https://github.com/apache/spark/pull/16924/files] updates the length of the 
> inprogress files allowing history server being responsive.
> Although we have a production job that has 6 tasks per stage and due to 
> hsync being slow it starts dropping events and the history server has wrong 
> stats due to events being dropped.
> A viable solution is not to make it sync very frequently or make it 
> configurable.






[jira] [Created] (SPARK-24416) Update configuration definition for spark.blacklist.killBlacklistedExecutors

2018-05-29 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-24416:


 Summary: Update configuration definition for 
spark.blacklist.killBlacklistedExecutors
 Key: SPARK-24416
 URL: https://issues.apache.org/jira/browse/SPARK-24416
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Sanket Reddy


spark.blacklist.killBlacklistedExecutors is defined as 

(Experimental) If set to "true", allow Spark to automatically kill, and attempt 
to re-create, executors when they are blacklisted. Note that, when an entire 
node is added to the blacklist, all of the executors on that node will be 
killed.

I presume the killing of blacklisted executors only happens after the stage 
completes successfully and all tasks have completed, or on fetch failures 
(updateBlacklistForFetchFailure/updateBlacklistForSuccessfulTaskSet). This is 
confusing because the definition states that the executor will be re-created 
as soon as it is blacklisted. That is not true: while a stage is in progress 
and an executor is blacklisted, Spark will not attempt to clean it up until 
the stage finishes.






[jira] [Created] (SPARK-25641) Change the spark.shuffle.server.chunkFetchHandlerThreadsPercent default to 100

2018-10-04 Thread Sanket Reddy (JIRA)
Sanket Reddy created SPARK-25641:


 Summary: Change the 
spark.shuffle.server.chunkFetchHandlerThreadsPercent default to 100
 Key: SPARK-25641
 URL: https://issues.apache.org/jira/browse/SPARK-25641
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Sanket Reddy


We want to change the default percentage for 
spark.shuffle.server.chunkFetchHandlerThreadsPercent to 100. The reason is 
that this currently defaults to 0, which means that even when 
server.ioThreads > 0 the number of handler threads would be 2 * #cores instead 
of server.ioThreads. We want the default to be server.ioThreads when the 
percentage is not set at all; as it stands, a default of 0 also means 2 * #cores.
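
A small sketch of the thread-count computation being described, with the
proposed default of using all of the server threads when the percentage is not
set (this is illustrative, not the exact shuffle server code):

// Hedged sketch: derive the chunk-fetch handler pool size from the percent.
def chunkFetchHandlerThreads(ioServerThreads: Int, handlerThreadsPercent: Option[Int]): Int = {
  val availableCores = Runtime.getRuntime.availableProcessors()
  val serverThreads = if (ioServerThreads > 0) ioServerThreads else 2 * availableCores
  handlerThreadsPercent match {
    case None          => serverThreads // proposed default: 100% of the server threads
    case Some(percent) => math.max(1, (serverThreads * percent / 100.0).ceil.toInt)
  }
}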






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-22 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658985#comment-16658985
 ] 

Sanket Reddy commented on SPARK-25692:
--

Sure [~tgraves] [~zsxwing], I am taking a look. Thanks for reporting.

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Updated] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-22 Thread Sanket Reddy (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-25692:
-
Attachment: Screen Shot 2018-10-22 at 4.12.41 PM.png

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-22 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659644#comment-16659644
 ] 

Sanket Reddy commented on SPARK-25692:
--

[~zsxwing] I haven't been able to reproduce this, but the number of threads 
used for these tests is 2 * the number of cores, or 
spark.shuffle.io.serverThreads. The test server seems to be down. I would 
appreciate any help reproducing this.

!Screen Shot 2018-10-22 at 4.12.41 PM.png!

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Commented] (SPARK-25692) Flaky test: ChunkFetchIntegrationSuite

2018-10-24 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662409#comment-16662409
 ] 

Sanket Reddy commented on SPARK-25692:
--

[https://github.com/apache/spark/pull/22628/files] went in after 
[https://github.com/apache/spark/pull/22173]; I am not sure you are still 
seeing issues. I ran the tests in a loop 100 times and could not reproduce this.

> Flaky test: ChunkFetchIntegrationSuite
> --
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Shixiong Zhu
>Priority: Blocker
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.






[jira] [Created] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-02 Thread Sanket Reddy (Jira)
Sanket Reddy created SPARK-30411:


 Summary: saveAsTable does not honor 
spark.hadoop.hive.warehouse.subdir.inherit.perms
 Key: SPARK-30411
 URL: https://issues.apache.org/jira/browse/SPARK-30411
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Sanket Reddy


-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
drwxr-x--T   - redsanket users 0 2019-12-04 
20:15 /tmp/my_databases

>>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>>AS orc");

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - redsanket users  0 2019-12-04 20:20 
/tmp/my_databases/example


Now after saveAsTable
>>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 
>>> 5)]
>>> df = spark.createDataFrame(data)
>>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwx--   - redsanket users  0 2019-12-04 20:23 
/tmp/my_databases/example
This overwrites the permissions.

Insert into, by contrast, preserves the parent directory permissions.
>>> spark.sql("DROP table redsanket_db.example");
DataFrame[]
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
DataFrame[]
>>> df.write.format("orc").insertInto('redsanket_db.example')

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - schintap users  0 2019-12-04 20:43 
/tmp/my_databases/example
This is either a limitation of the API based on the write mode, in which case 
the behavior has to be documented, or it needs to be fixed.






[jira] [Updated] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-02 Thread Sanket Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-30411:
-
Description: 
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases

>>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>>AS orc");

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example

Now after saveAsTable
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
 -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
 Overwrites the permissions

Insert into honors preserving parent directory permissions.
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
 It is either limitation of the API based on the mode and the behavior has to 
be documented or needs to be fixed

  was:
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
drwxr-x--T   - redsanket users 0 2019-12-04 
20:15 /tmp/my_databases

>>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>>AS orc");

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - redsanket users  0 2019-12-04 20:20 
/tmp/my_databases/example


Now after saveAsTable
>>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 
>>> 5)]
>>> df = spark.createDataFrame(data)
>>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwx--   - redsanket users  0 2019-12-04 20:23 
/tmp/my_databases/example
Overwrites the permissions

Insert into honors preserving parent directory permissions.
>>> spark.sql("DROP table redsanket_db.example");
DataFrame[]
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
DataFrame[]
>>> df.write.format("orc").insertInto('redsanket_db.example')

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - schintap users  0 2019-12-04 20:43 
/tmp/my_databases/example
It is either limitation of the API based on the mode and the behavior has to be 
documented or needs to be fixed


> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> >>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>>STORED AS orc");
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> Now after saveAsTable
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
>  -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-06 Thread Sanket Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009381#comment-17009381
 ] 

Sanket Reddy commented on SPARK-30411:
--

[~yumwang] [PR-22078|https://github.com/apache/spark/pull/22078#issuecomment-458851287] makes sense; however, it targets Hive 3.0.0 and is not a backward-compatible change, AFAIK.

It would be useful if users did not have to manually change permissions on the file system, or rely on umask, as a workaround.

[~hyukjin.kwon] Sure, I will try the Hive implementation and get back to you, though I doubt it will work. Thanks for the quick reply.
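
For context, a sketch of the umask-style workaround mentioned above (an assumption on my part, not something validated here): pass a Hadoop client umask through the Spark session so new warehouse subdirectories come out group-accessible instead of having to chmod them by hand.

{code:python}
# Hedged sketch of the umask workaround referenced above (untested).
# spark.hadoop.* settings are forwarded into the Hadoop Configuration;
# fs.permissions.umask-mode is the standard HDFS client umask knob.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.permissions.umask-mode", "027")  # directories -> 750
    .enableHiveSupport()
    .getOrCreate()
)

# With the umask in place, a saveAsTable overwrite should at least leave the
# new directory group-readable, though it still will not restore the sticky bit.
{code}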

> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> {code}
> {code}
> >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>> STORED AS orc");
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> {code}
> Now after {{saveAsTable}}
> {code}
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
> {code}
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
> {code}
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
> {code}
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-06 Thread Sanket Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17009381#comment-17009381
 ] 

Sanket Reddy edited comment on SPARK-30411 at 1/7/20 5:39 AM:
--

[~yumwang] [PR-22078|https://github.com/apache/spark/pull/22078#issuecomment-458851287] makes sense; however, it targets Hive 3.0.0 and is not a backward-compatible change, AFAIK.

My concern is the inconsistency between the APIs: they should either preserve permissions or not, and the behavior needs to be documented for DDL and DML operations (saveAsTable/insertInto), IMHO.

It would be useful if users did not have to manually change permissions on the file system, or rely on umask, as a workaround.

[~hyukjin.kwon] Sure, I will try the Hive implementation and get back to you, though I doubt it will work. Thanks for the quick reply.


was (Author: sanket991):
[~yumwang]  
[PR-22078|https://github.com/apache/spark/pull/22078#issuecomment-458851287] 
makes sense however and it fixes Hive 3.0.0 and it is not backward compatible 
change afaik.

Would be useful for users to not go ahead and manually change permissions on 
the File systems/use umask as a work around.

[~hyukjin.kwon] sure will try the hive implementation and get back but I doubt 
it would work, will give a try thanks for the quick reply

> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> {code}
> {code}
> >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>> STORED AS orc");
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> {code}
> Now after {{saveAsTable}}
> {code}
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
> {code}
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
> {code}
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
> {code}
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-08 Thread Sanket Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010799#comment-17010799
 ] 

Sanket Reddy commented on SPARK-30411:
--

spark.sql.orc.impl=hive does not work. [~yumwang] [~hyukjin.kwon] Do we want to revisit [PR-22078|https://github.com/apache/spark/pull/22078#issuecomment-458851287], or do we have a good reason not to proceed, given my concerns above? I could make an internal patch, but I am wondering about your thoughts.

> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> {code}
> {code}
> >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>> STORED AS orc");
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> {code}
> Now after {{saveAsTable}}
> {code}
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
> {code}
>  Overwrites the permissions
> Insert into honors preserving parent directory permissions.
> {code}
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> {code}
> {code}
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
> {code}
>  It is either limitation of the API based on the mode and the behavior has to 
> be documented or needs to be fixed



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33741) Add minimum threshold speculation config

2020-12-10 Thread Sanket Reddy (Jira)
Sanket Reddy created SPARK-33741:


 Summary: Add minimum threshold speculation config
 Key: SPARK-33741
 URL: https://issues.apache.org/jira/browse/SPARK-33741
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Sanket Reddy


When speculation is turned on with the default configs, the last 10% of the tasks in a stage are subject to speculation. Many stages run for only a few seconds to a few minutes, and in general we do not want to speculate tasks that complete within a short interval. Setting a minimum runtime threshold for speculation gives us better control (see the sketch below).
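
A minimal sketch of how such a knob could sit alongside the existing speculation configs. The property name spark.speculation.minTaskRuntime is illustrative only, since this ticket is the proposal for it; spark.speculation, spark.speculation.quantile, and spark.speculation.multiplier are existing configs.

{code:python}
# Sketch only: the minimum-runtime threshold name below is a placeholder for
# whatever this ticket ends up adding; the other three configs already exist.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.speculation", "true")
    .config("spark.speculation.quantile", "0.9")        # consider only the slowest 10% of tasks
    .config("spark.speculation.multiplier", "1.5")      # ...that are 1.5x slower than the median
    .config("spark.speculation.minTaskRuntime", "30s")  # proposed: never speculate tasks younger than this
    .getOrCreate()
)
{code}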



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34545) PySpark Python UDF return inconsistent results when applying 2 UDFs with different return type to 2 columns together

2021-02-25 Thread Sanket Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17291164#comment-17291164
 ] 

Sanket Reddy commented on SPARK-34545:
--

cc [~hyukjin.kwon]

> PySpark Python UDF return inconsistent results when applying 2 UDFs with 
> different return type to 2 columns together
> 
>
> Key: SPARK-34545
> URL: https://issues.apache.org/jira/browse/SPARK-34545
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Baohe Zhang
>Priority: Critical
>
> Python UDF returns inconsistent results between evaluating 2 columns together 
> and evaluating one by one.
> The issue occurs after we upgraded to Spark 3, so it seems it doesn't exist in Spark 2.
> How to reproduce it?
> {code:python}
> df = spark.createDataFrame([([(1.0, "1"), (1.0, "2"), (1.0, "3")], [(1, "1"), 
> (1, "2"), (1, "3")]), ([(2.0, "1"), (2.0, "2"), (2.0, "3")], [(2, "1"), (2, 
> "2"), (2, "3")]), ([(3.1, "1"), (3.1, "2"), (3.1, "3")], [(3, "1"), (3, "2"), 
> (3, "3")])], ['c1', 'c2'])
> from pyspark.sql.functions import udf
> from pyspark.sql.types import *
> def getLastElementWithTimeMaster(data_type):
> def getLastElementWithTime(list_elm):
> # x should be a list of (val, time)
> y = sorted(list_elm, key=lambda x: x[1]) # default is ascending
> return y[-1][0]
> return udf(getLastElementWithTime, data_type)
> # Add 2 columns which apply the Python UDFs
> df = df.withColumn("c3", getLastElementWithTimeMaster(DoubleType())("c1"))
> df = df.withColumn("c4", getLastElementWithTimeMaster(IntegerType())("c2"))
> # Show the results
> df.select("c3").show()
> df.select("c4").show()
> df.select("c3", "c4").show()
> {code}
> Results:
> {noformat}
> >>> df.select("c3").show()
> +---+
> | c3|
> +---+
> |1.0|
> |2.0|
> |3.1|
> +---+
> >>> df.select("c4").show()
> +---+
> | c4|
> +---+
> |  1|
> |  2|
> |  3|
> +---+
> >>> df.select("c3", "c4").show()
> +---++
> | c3|  c4|
> +---++
> |1.0|null|
> |2.0|null|
> |3.1|   3|
> +---++
> {noformat}
> The test was done in branch-3.1 local mode.
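
A small, hedged check built from the reproduction quoted above (it assumes the df and the c3/c4 UDF columns from that snippet have already been created): compare the per-column results with the combined select, so the mismatch shows up as an assertion failure instead of having to eyeball three show() outputs.

{code:python}
# Assumes the df/c3/c4 setup from the quoted reproduction has already been run.
c3_alone = [r.c3 for r in df.select("c3").collect()]
c4_alone = [r.c4 for r in df.select("c4").collect()]
together = df.select("c3", "c4").collect()

for i, row in enumerate(together):
    # On an affected build, the c4 comparison fails because the combined select
    # returns null for rows where the standalone select returned a value.
    assert row.c3 == c3_alone[i], "c3 mismatch at row %d: %s != %s" % (i, row.c3, c3_alone[i])
    assert row.c4 == c4_alone[i], "c4 mismatch at row %d: %s != %s" % (i, row.c4, c4_alone[i])
{code}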



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31788) Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD

2020-05-21 Thread Sanket Reddy (Jira)
Sanket Reddy created SPARK-31788:


 Summary: Cannot convert org.apache.spark.api.java.JavaPairRDD to 
org.apache.spark.api.java.JavaRDD
 Key: SPARK-31788
 URL: https://issues.apache.org/jira/browse/SPARK-31788
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0, 3.0.1
Reporter: Sanket Reddy


Pair RDD conversion seems to have issues.

SparkSession available as 'spark'.

>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
        at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
        at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
        at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

>>> rdd3 = sc.parallelize([11,12,13,14,15])
>>> pairRDD2 = rdd3.zip(rdd3)
>>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
        at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
        at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
        at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

>>> rdd4 = sc.parallelize(range(5))
>>> pairRDD3 = rdd4.zip(rdd4)
>>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
>>> unionRDD3.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
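
A possible workaround sketch (untested, and only a guess at why it helps): pass each zipped RDD through an identity map() before the union, so the object handed to the JVM side is a plain JavaRDD rather than a JavaPairRDD.

{code:python}
# Hedged workaround sketch, not the fix: the identity map() creates a
# PipelinedRDD whose backing Java object is a JavaRDD, which sidesteps the
# JavaPairRDD-to-JavaRDD conversion error in the traceback above.
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
rdd2 = sc.parallelize([6, 7, 8, 9, 10])
pairRDD1 = rdd1.zip(rdd2)

unionRDD1 = sc.union([pairRDD1.map(lambda x: x), pairRDD1.map(lambda x: x)])
print(unionRDD1.collect())
{code}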



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31788) Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD

2020-05-21 Thread Sanket Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-31788:
-
Description: 
Pair RDD conversion seems to have issues.

SparkSession available as 'spark'.

>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
        at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
        at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
        at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

>>> rdd3 = sc.parallelize([11,12,13,14,15])
>>> pairRDD2 = rdd3.zip(rdd3)
>>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
        at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
        at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
        at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

>>> rdd4 = sc.parallelize(range(5))
>>> pairRDD3 = rdd4.zip(rdd4)
>>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
>>> unionRDD3.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

2.4.5 does not have this regression.

  was:
Pair RDD conversion seems to have issues

SparkSession available as 'spark'.

>>> rdd1 = sc.parallelize([1,2,3,4,5])

>>> rdd2 = sc.parallelize([6,7,8,9,10])

>>> pairRDD1 = rdd1.zip(rdd2)

>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])

Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870,

in union jrdds[i] = rdds[i]._jrdd

File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221,

in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)

>>> rdd3 = sc.parallelize([11,12,13,14,15])

>>> pairRDD2 = rdd3.zip(rdd3)

>>> unionRDD2 = sc.union([pairRDD1, pairRDD2])

Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] 
= rdds[i]._jrdd File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221, in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCom

[jira] [Updated] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-21 Thread Sanket Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-31788:
-
Summary: Error when creating UnionRDD of PairRDDs  (was: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD)

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Priority: Major
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 
> >>> 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31788) Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD

2020-05-21 Thread Sanket Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-31788:
-
Description: 
Union RDD of Pair RDDs seems to have issues.

SparkSession available as 'spark'.

>>> rdd1 = sc.parallelize([1,2,3,4,5])
>>> rdd2 = sc.parallelize([6,7,8,9,10])
>>> pairRDD1 = rdd1.zip(rdd2)
>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
        at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
        at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
        at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

>>> rdd3 = sc.parallelize([11,12,13,14,15])
>>> pairRDD2 = rdd3.zip(rdd3)
>>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
        at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
        at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
        at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Thread.java:748)

>>> rdd4 = sc.parallelize(range(5))
>>> pairRDD3 = rdd4.zip(rdd4)
>>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
>>> unionRDD3.collect()
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]

2.4.5 does not have this regression.

  was:
Pair RDD conversion seems to have issues

SparkSession available as 'spark'.

>>> rdd1 = sc.parallelize([1,2,3,4,5])

>>> rdd2 = sc.parallelize([6,7,8,9,10])

>>> pairRDD1 = rdd1.zip(rdd2)

>>> unionRDD1 = sc.union([pairRDD1, pairRDD1])

Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870,

in union jrdds[i] = rdds[i]._jrdd

File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221,

in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
py4j.GatewayConnection.run(GatewayConnection.java:238) at 
java.lang.Thread.run(Thread.java:748)

>>> rdd3 = sc.parallelize([11,12,13,14,15])

>>> pairRDD2 = rdd3.zip(rdd3)

>>> unionRDD2 = sc.union([pairRDD1, pairRDD2])

Traceback (most recent call last): File "", line 1, in  File 
"/home/gs/spark/latest/python/pyspark/context.py", line 870, in union jrdds[i] 
= rdds[i]._jrdd File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 238, in _setitem_ File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
 line 221, in __set_item File 
"/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
calling None.None. Trace: py4j.Py4JException: Cannot convert 
org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
py4j.commands.ArrayCommand.execute(Arra

[jira] [Commented] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-21 Thread Sanket Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113566#comment-17113566
 ] 

Sanket Reddy commented on SPARK-31788:
--

[https://git.ouroath.com/hadoop/spark/commit/f83fedc9f20869ab4c62bb07bac50113d921207f]: it looks like union does not check for the PairRDD type in PySpark.
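
To illustrate the missing check (this is only a sketch of the idea, not the actual patch): a union helper could look at the class of each RDD's backing Java object and push JavaPairRDD-backed RDDs through an identity map first.

{code:python}
# Hedged sketch of the kind of guard alluded to above; a real change would
# live inside SparkContext.union in pyspark/context.py.
def union_pair_safe(sc, rdds):
    def as_plain(rdd):
        # Py4J exposes the Java object's class, so JavaPairRDD can be told apart.
        if rdd._jrdd.getClass().getSimpleName() == "JavaPairRDD":
            return rdd.map(lambda x: x)  # identity map rebuilds it on a JavaRDD
        return rdd
    return sc.union([as_plain(r) for r in rdds])
{code}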

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Priority: Major
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 
> >>> 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-21 Thread Sanket Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113566#comment-17113566
 ] 

Sanket Reddy edited comment on SPARK-31788 at 5/21/20, 11:37 PM:
-

[https://github.com/apache/spark/commit/f83fedc9f20869ab4c62bb07bac50113d921207f]: it looks like union does not check for the PairRDD type in PySpark.


was (Author: sanket991):
[https://git.ouroath.com/hadoop/spark/commit/f83fedc9f20869ab4c62bb07bac50113d921207f]
 looks like it does not check for PairRDD type in pyspark

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Priority: Major
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 
> >>> 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31788) Error when creating UnionRDD of PairRDDs

2020-05-21 Thread Sanket Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113623#comment-17113623
 ] 

Sanket Reddy commented on SPARK-31788:
--

Took a naive dig at it: [https://github.com/apache/spark/pull/28603] seems to work; looking for reviews and improvement suggestions.

> Error when creating UnionRDD of PairRDDs
> 
>
> Key: SPARK-31788
> URL: https://issues.apache.org/jira/browse/SPARK-31788
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Sanket Reddy
>Priority: Major
>
> Union RDD of Pair RDD's seems to have issues
> SparkSession available as 'spark'.
> >>> rdd1 = sc.parallelize([1,2,3,4,5])
> >>> rdd2 = sc.parallelize([6,7,8,9,10])
> >>> pairRDD1 = rdd1.zip(rdd2)
> >>> unionRDD1 = sc.union([pairRDD1, pairRDD1])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870,
> in union jrdds[i] = rdds[i]._jrdd
> File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221,
> in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd3 = sc.parallelize([11,12,13,14,15])
> >>> pairRDD2 = rdd3.zip(rdd3)
> >>> unionRDD2 = sc.union([pairRDD1, pairRDD2])
> Traceback (most recent call last): File "", line 1, in  File 
> "/home/gs/spark/latest/python/pyspark/context.py", line 870, in union 
> jrdds[i] = rdds[i]._jrdd File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 238, in _setitem_ File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py",
>  line 221, in __set_item File 
> "/home/gs/spark/latest/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 
> 332, in get_return_value py4j.protocol.Py4JError: An error occurred while 
> calling None.None. Trace: py4j.Py4JException: Cannot convert 
> org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD at 
> py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166) at 
> py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144) at 
> py4j.commands.ArrayCommand.execute(ArrayCommand.java:97) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748)
> >>> rdd4 = sc.parallelize(range(5))
> >>> pairRDD3 = rdd4.zip(rdd4)
> >>> unionRDD3 = sc.union([pairRDD1, pairRDD3])
> >>> unionRDD3.collect() [(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (0, 0), (1, 
> >>> 1), (2, 2), (3, 3), (4, 4)]
>  
> 2.4.5 does not have this regression



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26201) python broadcast.value on driver fails with disk encryption enabled

2018-11-28 Thread Sanket Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16702046#comment-16702046
 ] 

Sanket Reddy commented on SPARK-26201:
--

Thanks [~tgraves] will put up the patch shortly

> python broadcast.value on driver fails with disk encryption enabled
> ---
>
> Key: SPARK-26201
> URL: https://issues.apache.org/jira/browse/SPARK-26201
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2
>Reporter: Thomas Graves
>Priority: Major
>
> I was trying python with rpc and disk encryption enabled and when I tried a 
> python broadcast variable and just read the value back on the driver side the 
> job failed with:
>  
> Traceback (most recent call last):
>   File "broadcast.py", line 37, in <module>
>     words_new.value
>   File "/pyspark.zip/pyspark/broadcast.py", line 137, in value
>   File "pyspark.zip/pyspark/broadcast.py", line 122, in load_from_path
>   File "pyspark.zip/pyspark/broadcast.py", line 128, in load
> EOFError: Ran out of input
> To reproduce use configs: --conf spark.network.crypto.enabled=true --conf 
> spark.io.encryption.enabled=true
>  
> Code:
> words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
>  words_new.value
>  print(words_new.value)
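
For convenience, the quoted reproduction packaged as a standalone script; nothing here goes beyond the configs and code the ticket already names, and the file name is arbitrary.

{code:python}
# Repro sketch for the report above. Submit with the two encryption configs, e.g.:
#   spark-submit --conf spark.network.crypto.enabled=true \
#                --conf spark.io.encryption.enabled=true broadcast_repro.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-repro").getOrCreate()
sc = spark.sparkContext

words_new = sc.broadcast(["scala", "java", "hadoop", "spark", "akka"])
# Reading the broadcast value back on the driver is what hits the EOFError
# when RPC/disk encryption is enabled.
print(words_new.value)
{code}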



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36685) Fix wrong assert statement

2021-09-07 Thread Sanket Reddy (Jira)
Sanket Reddy created SPARK-36685:


 Summary: Fix wrong assert statement
 Key: SPARK-36685
 URL: https://issues.apache.org/jira/browse/SPARK-36685
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.2
Reporter: Sanket Reddy


{code:scala}
require(numCols == mat.numCols, "The number of rows of the matrices in this sequence, " + "don't match!")
{code}
Should the error message be "The number of columns..."?

This issue also appears in open-source Spark:
[https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala#L1266]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org