Task - Id : Status Failed

2019-06-06 Thread dimitris plakas
Hello everyone, I am trying to set up a YARN cluster with three nodes (one master and two workers). I followed this tutorial: https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/ I also tried to execute the YARN example at the end of this tutorial with the
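For reference, the tutorial's final step runs one of the stock Hadoop example jobs on YARN. A hedged sketch of that invocation, plus the commands usually used to diagnose a `Status Failed` task (the jar version in the path and the `books/*` input are assumptions taken from the tutorial's layout, not confirmed by the thread):

```shell
# Submit the stock wordcount example to YARN (jar path/version assumed)
yarn jar ~/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar \
    wordcount "books/*" output

# When a task reports Status Failed, list applications and pull the logs
yarn application -list -appStates ALL
yarn logs -applicationId <application_id>
```

The application logs are typically where the actual failure cause (missing input path, container memory limits, connection refused to a worker) shows up.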

Spark on Kubernetes Authentication error

2019-06-06 Thread Nick Dawes
Hi there, I'm trying to run Spark on EKS. I created an EKS cluster, added nodes, and am now trying to submit a Spark job from an EC2 instance. I ran the following commands for access. kubectl create serviceaccount spark kubectl create clusterrolebinding spark-role --clusterrole=admin
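A hedged sketch of the usual submission setup this thread is describing: create the service account and role binding, then tell `spark-submit` to run the driver under that account. The API endpoint, image name, and jar path are placeholders, not values from the original message:

```shell
# RBAC for the driver pod ("edit" is the commonly documented minimum;
# the thread used "admin")
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
    --clusterrole=edit --serviceaccount=default:spark

# Point spark-submit at the EKS API server and name the service account
spark-submit \
    --master k8s://https://<eks-api-endpoint>:443 \
    --deploy-mode cluster \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --conf spark.kubernetes.container.image=<your-spark-image> \
    --class org.apache.spark.examples.SparkPi \
    local:///opt/spark/examples/jars/spark-examples_2.11-2.4.3.jar
```

Authentication errors at submit time usually mean the kubeconfig on the EC2 instance cannot reach or authenticate to the API server, rather than a problem with the service account itself.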

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Bruno Nassivet
Hi Marcelo, Maybe spark.sql.functions.explode gives what you need? // Bruno > On 6 Jun 2019 at 16:02, Marcelo Valle wrote: > > Generating the city id (child) is easy, monotonically increasing id worked > for me. > > The problem is the country (parent) which has to be in both
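A minimal sketch of what `explode` does here, assuming an active `SparkSession` named `spark`; the column names and sample data are illustrative, not from the thread:

```scala
import org.apache.spark.sql.functions.{explode, monotonically_increasing_id}
import spark.implicits._ // assumes an active SparkSession named `spark`

// Parent rows each carry an array of children
val countries = Seq(
  ("NL", Seq("Amsterdam", "Rotterdam")),
  ("UK", Seq("London"))
).toDF("country", "cities")

// explode yields one row per array element, keeping the parent column,
// so the country travels with every city row
val cities = countries
  .select($"country", explode($"cities").as("city"))
  .withColumn("city_id", monotonically_increasing_id())
```

Because the parent column survives the `explode`, both the countries and cities outputs can be derived from the same frame without a rejoin.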

Fwd: [Spark SQL Thrift Server] Persistence errors with PostgreSQL and MySQL in 2.4.3

2019-06-06 Thread Ricardo Martinelli de Oliveira
Hello, I'm running Thrift server with PostgreSQL persistence for the Hive metastore. I'm using Postgres 9.6 and Spark 2.4.3 in this environment. When I start Thrift server I get lots of errors while creating the schema, and it happens every time I reach Postgres, like: 19/06/06 15:51:59 WARN
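One commonly suggested remedy for schema-creation warnings in this setup is to create the metastore schema up front with Hive's `schematool` and stop DataNucleus from auto-creating tables at startup. A hedged `hive-site.xml` fragment along those lines (the JDBC URL is a placeholder, and this is a general pattern, not a confirmed fix for this report):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:postgresql://localhost:5432/metastore</value>
</property>
<!-- Pre-create the schema (e.g. `schematool -dbType postgres -initSchema`),
     then disable auto-creation and enable verification -->
<property>
  <name>datanucleus.schema.autoCreateAll</name>
  <value>false</value>
</property>
<property>
  <name>hive.metastore.schema.verification</name>
  <value>true</value>
</property>
```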

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
Hi Magnus, Thanks for replying. I didn't get the partition solution, tbh, but indeed, I was trying to figure a way of solving only with data frames without rejoining. I can't have a global list of countries in my real scenario, as the real scenario is not reference data, countries was just an

Multi-dimensional aggregations in Structured Streaming

2019-06-06 Thread Symeon Meichanetzoglou
Hi all, We are facing a challenge where a simple use case seems non-trivial to implement in structured streaming: an aggregation should be calculated, and then some other aggregations should further aggregate on the first aggregation. Something like: 1st aggregation: val df =
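A sketch of the shape being described, with illustrative column names (`events`, `city`, `ts`, `amount` are assumptions). Note that chaining a second aggregation on top of a streaming aggregate is not supported in a single Spark 2.x structured streaming query, which is likely why this feels non-trivial:

```scala
import org.apache.spark.sql.functions.{sum, window}

// 1st aggregation: per-city totals over a time window
val first = events
  .groupBy($"city", window($"ts", "5 minutes"))
  .agg(sum($"amount").as("cityTotal"))

// In a batch job the 2nd aggregation would simply be:
//   first.groupBy($"window").agg(sum($"cityTotal").as("grandTotal"))
// In streaming, a common workaround is to write `first` to a sink
// (e.g. Kafka or files) and run the second aggregation as a
// downstream query reading from that sink.
```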

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
Generating the city id (child) is easy, monotonically increasing id worked for me. The problem is the country (parent) which has to be in both countries and cities data frames. On Thu, 6 Jun 2019 at 14:57, Magnus Nilsson wrote: > Well, you could do a repartition on cityname/nrOfCities and

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Magnus Nilsson
Well, you could do a repartition on cityname/nrOfCities and use the spark_partition_id function or the mapPartitionsWithIndex dataframe method to add a city Id column. Then just split the dataframe into two subsets. Be careful of hash collisions on the repartition key though, or more than one city
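A hedged sketch of this suggestion, assuming a `cities` DataFrame with a `cityName` column (names are illustrative): repartition by city name, then tag each row with its partition id as a candidate city id. As the reply warns, two city names hashed to the same partition would end up sharing an id:

```scala
import org.apache.spark.sql.functions.spark_partition_id

// All rows for a given city land in one partition; the partition id
// then serves as a (collision-prone) surrogate city id
val tagged = cities
  .repartition($"cityName")
  .withColumn("cityId", spark_partition_id())
```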

Re: Spark on K8S - --packages not working for cluster mode?

2019-06-06 Thread pacuna
Great! Thanks a lot. Best, Pablo.

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Marcelo Valle
Akshay, First of all, thanks for the answer. I *am* using monotonically increasing id, but that's not my problem. My problem is I want to output 2 tables from 1 data frame, 1 parent table with ID for the group by and 1 child table with the parent id without the group by. I was able to solve this
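One way to produce the two outputs described, sketched with assumed column names (`country`, `city`) since the thread truncates before Marcelo's actual solution: build the parent table with generated ids, then join it back so child rows carry the parent id:

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// Parent table: one row per group-by key, with a generated id
val parents = df.select($"country").distinct()
  .withColumn("country_id", monotonically_increasing_id())

// Child table: original rows, un-grouped, tagged with the parent id
val children = df.join(parents, Seq("country"))
  .select($"country_id", $"city")
```

The join back to `parents` is the part Marcelo was hoping to avoid; without it, there is no single-pass way to share the generated parent id between both outputs.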

[no subject]

2019-06-06 Thread Shi Tyshanchn

Re: Spark on K8S - --packages not working for cluster mode?

2019-06-06 Thread Stavros Kontopoulos
Hi, This has been fixed here: https://github.com/apache/spark/pull/23546. Will be available with Spark 3.0.0 Best, Stavros On Wed, Jun 5, 2019 at 11:18 PM pacuna wrote: > I'm trying to run a sample code that reads a file from s3 so I need the aws > sdk and aws hadoop dependencies. > If I

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Akshay Bhardwaj
Additionally, there is a "uuid" function available as well, if that helps your use case. Akshay Bhardwaj +91-97111-33849 On Thu, Jun 6, 2019 at 3:18 PM Akshay Bhardwaj < akshay.bhardwaj1...@gmail.com> wrote: > Hi Marcelo, > > If you are using spark 2.3+ and dataset API/SparkSQL, you can use this >

Re: adding a column to a groupBy (dataframe)

2019-06-06 Thread Akshay Bhardwaj
Hi Marcelo, If you are using Spark 2.3+ and the dataset API/SparkSQL, you can use the inbuilt function "monotonically_increasing_id" in Spark. A little tweaking using Spark SQL inbuilt functions can enable you to achieve this without having to write code or define RDDs with map/reduce functions.
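A minimal illustration of the suggested function, assuming an active `SparkSession` named `spark`. The generated ids are guaranteed unique and monotonically increasing, but not consecutive: the upper bits encode the partition id, so values jump between partitions:

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id

// Tag each row with a unique 64-bit id; no shuffle or RDD code needed
val withIds = spark.range(5).toDF("n")
  .withColumn("id", monotonically_increasing_id())
```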

sparksql in sparkR?

2019-06-06 Thread ya
Dear list, I am trying to use Spark SQL within R, and I have the following questions; could you give me some advice please? Thank you very much. 1. I connect R and Spark using the SparkR library; probably some of the members here are also R users? Do I understand correctly that SparkSQL