Re: Spark production scenario

2018-03-08 Thread yncxcw
hi, Passion, I don't know of an exact solution. But yes, the port each executor chooses to communicate with the driver is random. I am wondering whether you could give each node two ethernet cards, configure one card for the intranet for Spark, and configure the other for the WAN. Then connect the
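If the goal is to pin Spark's traffic to the intranet card, a hedged sketch of the standard Spark 2.x settings involved (my suggestion, not confirmed in this thread; the address and ports are placeholders) in spark-defaults.conf:

    # Bind the driver to the intranet interface (placeholder address)
    spark.driver.bindAddress   10.0.0.5
    spark.driver.host          10.0.0.5
    # Pin the otherwise-random ports so they can be firewalled per interface
    spark.driver.port          40000
    spark.blockManager.port    40010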

DataSet save to parquet partition problem

2018-03-08 Thread Junfeng Chen
I am trying to save a DataSet object to a parquet file via > df.write().partitionBy("...").parquet(path) while this dataset contains the following struct: time: struct -dayOfMonth -monthOfYear ... Can I use a child field like time.monthOfYear as a partition column, as above? If yes, how?
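A common workaround (my sketch, not confirmed in this thread) is to promote the nested field to a top-level column before writing, since partitionBy only accepts top-level column names; in Scala, with df, time.monthOfYear, and path taken from the question:

    import org.apache.spark.sql.functions.col

    // `time` is a struct column with a `monthOfYear` child field.
    // partitionBy cannot reference time.monthOfYear directly, so
    // promote it to its own column first.
    val withMonth = df.withColumn("monthOfYear", col("time.monthOfYear"))

    withMonth.write
      .partitionBy("monthOfYear")
      .parquet(path)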

Spark production scenario

2018-03-08 Thread रविशंकर नायर
Hi all, We are going to move to production with an 8-node Spark cluster and request some help with the below. We are running on the YARN cluster manager; that means YARN is installed with SSH between the nodes. When we run a standalone Spark program with spark-submit, YARN initializes a resource manager

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Tathagata Das
This doc is unrelated to the stream-stream join we added in Structured Streaming. :) That said, we added append mode first because it is easier to reason about the semantics of append mode, especially in the context of outer joins: you output a row only when you know it won't ever be changed. The
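For context, the shape of a 2.3 stream-stream outer join under these append semantics looks roughly like the following Scala sketch (adapted from the pattern in the Spark documentation; the impressions/clicks names and the rate sources are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder.appName("join-sketch").getOrCreate()

    // Illustrative sources; in practice these would be Kafka topics etc.
    val impressions = spark.readStream.format("rate").load()
      .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
    val clicks = spark.readStream.format("rate").load()
      .selectExpr("value AS clickAdId", "timestamp AS clickTime")

    // Outer joins require watermarks on both sides plus a time-range
    // condition, so Spark knows when an unmatched row can never match
    // and may safely be emitted under append semantics.
    val joined = impressions
      .withWatermark("impressionTime", "2 hours")
      .join(
        clicks.withWatermark("clickTime", "3 hours"),
        expr("clickAdId = impressionAdId AND " +
             "clickTime >= impressionTime AND " +
             "clickTime <= impressionTime + interval 1 hour"),
        "leftOuter")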

Re: handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

2018-03-08 Thread Yinan Li
One thing to note is that you may need to have the S3 credentials in the init-container unless you use a publicly accessible URL. If this is the case, you can either create a Kubernetes secret and use the Spark config option for mounting secrets (secrets will be mounted into the init-container as well
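For reference, the Spark 2.3-on-Kubernetes secret-mounting options mentioned here take the form spark.kubernetes.driver.secrets.[SecretName]=<mount path>, with an executor equivalent; a hedged spark-submit fragment, where aws-creds is a hypothetical Kubernetes secret holding the S3 credentials:

    --conf spark.kubernetes.driver.secrets.aws-creds=/etc/secrets \
    --conf spark.kubernetes.executor.secrets.aws-creds=/etc/secrets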

Upgrades of streaming jobs

2018-03-08 Thread Georg Heiler
Hi, What is the state of Spark Structured Streaming jobs and upgrades? Can checkpoints written by version 1 of a job be read by version 2 of that job? Is downtime required to upgrade the job? Thanks

Re: handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

2018-03-08 Thread Anirudh Ramanathan
You don't need to create the init-container; it's an implementation detail. If you provide a remote URI and specify spark.kubernetes.container.image=, Spark will *internally* add the init-container to the pod spec for you. *If*, for some reason, you want to customize the init-container image, you
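To make the remote-URI path concrete, a hedged spark-submit sketch (every host, image, and S3 path below is a placeholder, and fetching s3a:// URIs assumes the S3A filesystem classes and credentials are available to the init-container):

    bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<port> \
      --deploy-mode cluster \
      --name my-app \
      --class com.example.MainApp \
      --conf spark.kubernetes.container.image=<registry>/<spark-image>:<tag> \
      s3a://<bucket>/jars/mainapplication.jar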

Incompatibility in LZ4 dependencies

2018-03-08 Thread Lalwani, Jayesh
There is an incompatibility in the LZ4 dependencies imported by Spark 2.3.0: org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 imports org.apache.kafka:kafka-clients:0.11.0.0, which imports net.jpountz.lz4:lz4:1.3.0. OTOH, org.apache.spark:spark-core_2.11:2.3.0 imports org.lz4:lz4-java:1.4.0
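If this causes classpath conflicts, one hedged workaround sketch (my suggestion, not an official fix from this thread) is to exclude the old net.jpountz coordinate in build.sbt and rely on org.lz4:lz4-java, which ships the same net.jpountz.lz4 classes under the new coordinate:

    // build.sbt (sketch): keep a single LZ4 implementation on the classpath
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.3.0",
      ("org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0")
        .exclude("net.jpountz.lz4", "lz4")
    )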

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Gourav Sengupta
super interesting. On Wed, Mar 7, 2018 at 11:44 AM, kant kodali wrote: > It looks to me that the StateStore described in this doc > actually has full outer join and every other join

Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Thanks, Vadim! That helps and makes sense. I don't think we have so large a number of keys that we have to worry about it. If we do, I think I would go with an approach similar to what you suggested. Thanks again, Subhash > On Mar 8, 2018, at 11:56 AM, Vadim Semenov

Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Vadim Semenov
You need to put the randomness at the beginning of the key; if you put it anywhere other than the beginning, it's not guaranteed that you're going to get good performance. The way we achieved this is by writing to HDFS first and then having a custom DistCp, implemented using Spark, that copies parquet
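A minimal sketch of the prefixing idea itself (my illustration, not Vadim's actual DistCp code): derive a short hash from each file name and use it as the leading path component, so objects spread across S3's index partitions:

    import java.security.MessageDigest

    // Hypothetical helper; the key layout is illustrative.
    def randomizedKey(fileName: String): String = {
      val digest = MessageDigest.getInstance("MD5").digest(fileName.getBytes("UTF-8"))
      val prefix = digest.take(2).map("%02x".format(_)).mkString
      s"$prefix/$fileName"  // e.g. "3fa2/part-00000.parquet"
    }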

Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Hey Spark user community, I am writing Parquet files from Spark to S3 using S3a. I was reading this article about improving S3 bucket performance, specifically about how it can help to introduce randomness to your key names so that data is written to different partitions.

Re: Issues with large schema tables

2018-03-08 Thread Gourav Sengupta
Hi Ballas, in data science terms you have 4,500 variables which are uncorrelated or independent of each other. In data modelling terms you have an entity with 4,500 properties. I have worked on hair-splitting financial products; even they do not have properties of a financial product with

handling Remote dependencies for spark-submit in spark 2.3 with kubernetes

2018-03-08 Thread purna pradeep
I'm trying to run spark-submit against a Kubernetes cluster with a Spark 2.3 Docker container image. The challenge I'm facing is that the application has a mainapplication.jar and other dependency files & jars which are located in a remote location like AWS S3, but as per the Spark 2.3 documentation there is something

Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I am running in Spark standalone mode. No YARN. Anyway, yarn application -kill is a manual process; I do not want that. I want to properly kill the driver/application programmatically.
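For the streaming case, one hedged option using standard Spark APIs (a sketch, not a solution confirmed in this thread) is to stop the StreamingContext gracefully from a monitoring thread once an external health check fails:

    import org.apache.spark.streaming.StreamingContext

    // `ssc` is an already-started StreamingContext; `kafkaLooksHealthy`
    // is a hypothetical health check for the external resource.
    def stopIfUnhealthy(ssc: StreamingContext, kafkaLooksHealthy: () => Boolean): Unit = {
      if (!kafkaLooksHealthy()) {
        // Finish in-flight batches, then also stop the SparkContext,
        // which terminates the whole application.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }
    }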

Re: Properly stop applications or jobs within the application

2018-03-08 Thread sagar grover
I am assuming you are running in YARN cluster mode. Have you tried yarn application -kill application_id? With regards, Sagar Grover On Thu, Mar 8, 2018 at 4:03 PM, bsikander wrote: > I have scenarios for both. > So, I want to kill both batch and

Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I have scenarios for both, so I want to kill both batch and streaming jobs midway, if required. Use case: normally, if everything is okay, we don't kill the application, but sometimes something can go wrong while accessing external resources (like Kafka). In that case, the application can become useless

Re: Properly stop applications or jobs within the application

2018-03-08 Thread sagar grover
What do you mean by stopping applications? Do you want to kill a batch application midway, or are you running streaming jobs that you want to kill? With regards, Sagar Grover On Thu, Mar 8, 2018 at 1:45 PM, bsikander wrote: > Any help would be much appreciated. This seems

Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
Any help would be much appreciated. This seems to be a common problem.