Re: Turning off Jetty Http Options Method

2019-04-30 Thread Ankit Jain
I agree that for an OSS project, all endpoints that can be called are already publicly available. https://security.stackexchange.com/questions/138567/why-should-the-options-method-not-be-allowed-on-an-http-server has a couple of good reasons though. "An essential part of security is to reduce the

spark on kubernetes driver pod phase changed from running to pending and starts another container in pod

2019-04-30 Thread zyfo2
I'm using spark-on-kubernetes to submit a Spark app to Kubernetes. Most of the time it runs smoothly, but sometimes I see in the logs after submitting that the driver pod phase changed from running to pending and another container started in the pod, though the first container exited successfully. The driver

RE: Turning off Jetty Http Options Method

2019-04-30 Thread email
If this is correct, “This method exposes what all methods are supported by the end point”, I really don’t understand how that is a security vulnerability, considering the OSS nature of this project. Are you adding new endpoints to this webserver? More info about info/other methods :

RE: [EXT] handling skewness issues

2019-04-30 Thread email
Please share the links if they are publicly available. Otherwise, please share the names of the talks. Thank you From: Jules Damji Sent: Monday, April 29, 2019 8:04 PM To: Michael Mansour Cc: rajat kumar ; user@spark.apache.org Subject: Re: [EXT] handling skewness issues Yes, indeed! A

Re: Turning off Jetty Http Options Method

2019-04-30 Thread Ankit Jain
+ d...@spark.apache.org On Tue, Apr 30, 2019 at 4:23 PM Ankit Jain wrote: > Aah - actually found https://issues.apache.org/jira/browse/SPARK-18664 - > "Don't respond to HTTP OPTIONS in HTTP-based UIs" > > Does anyone know if this can be prioritized? > > Thanks > Ankit > > On Tue, Apr 30, 2019

Re: Turning off Jetty Http Options Method

2019-04-30 Thread Ankit Jain
Aah - actually found https://issues.apache.org/jira/browse/SPARK-18664 - "Don't respond to HTTP OPTIONS in HTTP-based UIs" Does anyone know if this can be prioritized? Thanks Ankit On Tue, Apr 30, 2019 at 1:31 PM Ankit Jain wrote: > Hi Fellow Spark users, > We are using Spark 2.3.0 and

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
I recommend using Structured Streaming, as it has a patch that can work around this issue: https://issues.apache.org/jira/browse/SPARK-26267 Best Regards, Ryan On Tue, Apr 30, 2019 at 3:34 PM Shixiong(Ryan) Zhu wrote: > There is a known issue that Kafka may return a wrong offset even if

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Shixiong(Ryan) Zhu
There is a known issue that Kafka may return a wrong offset even if there is no reset happening: https://issues.apache.org/jira/browse/KAFKA-7703 Best Regards, Ryan On Tue, Apr 30, 2019 at 10:41 AM Austin Weaver wrote: > @deng - There was a short erroneous period where 2 streams were reading

Best notebook for developing for apache spark using scala on Amazon EMR Cluster

2019-04-30 Thread V0lleyBallJunki3
Hello. I am using Zeppelin on an Amazon EMR cluster while developing Apache Spark programs in Scala. The problem is that once the cluster is destroyed, I lose all the notebooks on it. So over a period of time I have a lot of notebooks that have to be manually exported to my local disk and from
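One way around losing notebooks with the cluster (assuming an S3 bucket you control and an EMR instance role that can read/write it) is Zeppelin's S3 notebook storage, so notebooks outlive any single cluster. A sketch of the relevant zeppelin-env.sh settings; the bucket name is a placeholder:

```shell
# zeppelin-env.sh -- persist notebooks to S3 so they survive cluster teardown
# (bucket name is hypothetical; the EMR instance role needs S3 access)
export ZEPPELIN_NOTEBOOK_S3_BUCKET=my-zeppelin-notebooks
export ZEPPELIN_NOTEBOOK_S3_USER=zeppelin
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
```

A new cluster pointed at the same bucket then picks up the existing notebooks on startup.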

Turning off Jetty Http Options Method

2019-04-30 Thread Ankit Jain
Hi Fellow Spark users, We are using Spark 2.3.0, and our security team is reporting a violation that Spark allows the HTTP OPTIONS method to work (this method exposes which methods an endpoint supports, which could be exploited by a hacker). This method is on the Jetty web server; I see Spark uses
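For anyone wanting to reproduce what such a scan flags, here is a small self-contained sketch (not Spark or Jetty itself, just a stand-in stdlib server) showing how an OPTIONS probe reveals the supported methods via the Allow header:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class AllowingHandler(BaseHTTPRequestHandler):
    # Minimal stand-in for a server that answers OPTIONS, as the scanner reports.
    def do_GET(self):
        self.send_response(200)
        self.end_headers()

    def do_OPTIONS(self):
        self.send_response(200)
        self.send_header("Allow", "GET, HEAD, OPTIONS")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

def probe_options(port):
    """Send an OPTIONS request and return (status, Allow header)."""
    conn = http.client.HTTPConnection("localhost", port)
    conn.request("OPTIONS", "/")
    resp = conn.getresponse()
    allow = resp.getheader("Allow")
    conn.close()
    return resp.status, allow

server = HTTPServer(("localhost", 0), AllowingHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
status, allow = probe_options(port)
server.shutdown()
print(status, allow)  # a 200 with an Allow header is what scanners flag
```

The same probe run with `curl -X OPTIONS -i` against a Spark UI port would show whether the violation still reproduces after any mitigation.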

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Austin Weaver
@deng - There was a short erroneous period where two streams reading from the same topic with the same group id were running at the same time. We saw errors during this and stopped the extra stream. That being said, I would think that regardless, auto.offset.reset would kick in, since the documentation says
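Worth noting for this thread: auto.offset.reset only applies when the group has no committed offset, or the committed offset is out of range; it does not correct a wrong-but-still-valid offset such as the one described in KAFKA-7703. A sketch of the relevant consumer settings (broker address and group id are placeholders):

```python
# Kafka consumer settings relevant to this thread (values are illustrative).
kafka_params = {
    "bootstrap.servers": "broker:9092",
    "group.id": "my-stream",
    # Applies ONLY when there is no committed offset for the group, or the
    # committed offset is out of range. A wrong-but-still-valid offset
    # (as in KAFKA-7703) is NOT affected by this setting.
    "auto.offset.reset": "earliest",
    # If offsets are checkpointed/managed by the streaming job itself,
    # disable Kafka's own auto commit:
    "enable.auto.commit": "false",
}
```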

Spark Structured Streaming | Highly reliable de-duplication strategy

2019-04-30 Thread Akshay Bhardwaj
Hi Experts, I am using Spark Structured Streaming to read messages from Kafka, with a producer that provides an at-least-once guarantee. This streaming job is running on a YARN cluster with Hadoop 2.7 and Spark 2.3. What is the most reliable strategy for avoiding duplicate data within the stream in the
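Independent of Spark's API, the trade-off behind dropDuplicates with a watermark can be sketched in plain Python: state is kept per key and evicted once it falls behind the watermark, so a duplicate arriving later than the watermark slips through. A minimal sketch (event/key layout assumed):

```python
def dedupe_with_watermark(events, watermark):
    """events: (event_time, key) pairs in roughly ascending time order.
    Keeps the first occurrence of each key. State older than `watermark`
    behind the max event time seen so far is evicted, mirroring the
    bounded-state trade-off of Spark's dropDuplicates + withWatermark:
    a duplicate arriving after its state was evicted is NOT caught."""
    seen = {}       # key -> event time of first occurrence
    max_time = 0
    kept = []
    for event_time, key in events:
        max_time = max(max_time, event_time)
        threshold = max_time - watermark
        # evict per-key state that has fallen behind the watermark
        for k in [k for k, t in seen.items() if t < threshold]:
            del seen[k]
        if key not in seen:
            seen[key] = event_time
            kept.append((event_time, key))
    return kept

events = [(1, "a"), (2, "a"), (3, "b"), (20, "a")]
print(dedupe_with_watermark(events, watermark=10))
# -> [(1, 'a'), (3, 'b'), (20, 'a')]  -- the late duplicate of 'a' gets through
```

This is why "highly reliable" de-duplication needs either a watermark generous enough to cover producer retries, or an idempotent sink keyed on a message id.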

Re: How to specify number of Partition using newAPIHadoopFile()

2019-04-30 Thread Prateek Rajput
On Tue, Apr 30, 2019 at 6:48 PM Vatsal Patel wrote: > *Issue: * > > When I am reading a sequence file in Spark, I can specify the number of > partitions as an argument to the API; below is the signature: > *public <K, V> JavaPairRDD<K, V> sequenceFile(String path, Class<K> > keyClass, Class<V> valueClass, int

Re: Anaconda installation with Pyspark/Pyarrow (2.3.0+) on cloudera managed server

2019-04-30 Thread Patrick McCarthy
Hi Rishi, I've had success using the approach outlined here: https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html Does this work for you? On Tue, Apr 30, 2019 at 12:32 AM Rishi Shah wrote: > modified the subject & would like to clarify that I am looking to

Fwd: How to specify number of Partition using newAPIHadoopFile()

2019-04-30 Thread Vatsal Patel
*Issue: * When I am reading a sequence file in Spark, I can specify the number of partitions as an argument to the API; below is the signature: *public <K, V> JavaPairRDD<K, V> sequenceFile(String path, Class<K> keyClass, Class<V> valueClass, int minPartitions)* *In newAPIHadoopFile(), this support has been removed. below
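Since newAPIHadoopFile() has no minPartitions argument, the usual lever is the input split size, passed through the Hadoop Configuration the method accepts: a smaller maximum split size yields more splits, and hence more input partitions. A sketch of the relevant keys (sizes are illustrative):

```python
# Hadoop input-split settings to pass via newAPIHadoopFile's Configuration
# argument -- controlling split size stands in for the removed minPartitions
# parameter (the 64 MB cap below is illustrative):
hadoop_conf = {
    # cap each split at 64 MB => more splits => more input partitions
    "mapreduce.input.fileinputformat.split.maxsize": str(64 * 1024 * 1024),
    "mapreduce.input.fileinputformat.split.minsize": "1",
}
```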

Re: Issue with offset management using Spark on Dataproc

2019-04-30 Thread Akshay Bhardwaj
Hi Austin, Are you using Spark Streaming or Structured Streaming? For better understanding, could you also provide sample code/config params for your spark-kafka connector for the said streaming job? Akshay Bhardwaj +91-97111-33849 On Mon, Apr 29, 2019 at 10:34 PM Austin Weaver wrote: >

Re: Handle Null Columns in Spark Structured Streaming Kafka

2019-04-30 Thread SNEHASISH DUTTA
Hi, the NA functions will replace null with some default value, and not all my columns are of type string, so for some other data types (long/int etc.) I would have to provide some default value. But ideally those values should stay null. Actually, this null-column drop is happening in this step: df.selectExpr(
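Since to_json omits fields whose value is null, one workaround on the consumer side is to re-insert the missing keys against a known schema when parsing, so they come back as null rather than a type-specific default. A minimal sketch in plain Python (schema and field names are hypothetical):

```python
import json

SCHEMA = ["id", "name", "score"]  # hypothetical column names

def parse_with_schema(payload):
    """Re-insert keys that a null-dropping serializer removed, as None."""
    obj = json.loads(payload)
    return {k: obj.get(k) for k in SCHEMA}

# A row whose null fields were dropped during serialization:
payload = '{"id": 7}'
print(parse_with_schema(payload))  # -> {'id': 7, 'name': None, 'score': None}
```

The same idea applies on the Spark side by supplying the full schema to from_json when reading the payload back, so absent fields surface as nulls of the right type.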

Re: unsubscribe

2019-04-30 Thread Arne Zachlod
Please read this to unsubscribe: https://spark.apache.org/community.html TL;DR: send mail to user-unsubscr...@spark.apache.org, not to the list. On 4/30/19 6:38 AM, Amrit Jangid wrote: - To unsubscribe e-mail:

Re: Koalas show data in IDE or pyspark

2019-04-30 Thread Manu Zhang
Hi, It seems koalas.DataFrame can't be displayed in a terminal yet, as described in https://github.com/databricks/koalas/issues/150, and the workaround is to convert it to a pandas DataFrame. Thanks, Manu Zhang On Tue, Apr 30, 2019 at 2:46 PM Achilleus 003 wrote: > Hello Everyone, > > I have been trying to

Koalas show data in IDE or pyspark

2019-04-30 Thread Achilleus 003
Hello Everyone, I have been trying to run *koalas* on both pyspark and the PyCharm IDE. When I run df = koalas.DataFrame({'x': [1, 2], 'y': [3, 4], 'z': [5, 6]}) df.head(5) I don't get the data back; instead, I get an object. I thought df.head could be used to achieve this. Can anyone guide me on