Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread yohann jardin
showed us. Yohann Jardin On 7/8/2018 at 6:11 PM, kant kodali wrote: @yohann sorry, I am assuming you meant the application master; if so, I believe Spark is the one that provides the application master. Is there any way to look at how many resources are being requested and how much YARN is allowed
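One way to inspect how many resources an application requested and was granted is the YARN CLI; the per-queue totals are also visible in the ResourceManager web UI (port 8088 by default). The application id below is hypothetical:

    yarn application -status application_1530000000000_0001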

Re: spark-shell gets stuck in ACCEPTED state forever when run in YARN client mode.

2018-07-08 Thread yohann jardin
-am-resource-percent. Regards, Yohann Jardin On 7/8/2018 at 4:40 PM, kant kodali wrote: Hi, it's on a local MacBook Pro machine that has 16 GB RAM, a 512 GB disk, and 8 vCPUs! I am not running any code since I can't even spawn spark-shell with YARN as master, as described in my previous email. I just
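The truncated property name above presumably refers to yarn.scheduler.capacity.maximum-am-resource-percent (an inference from the fragment); it caps the cluster share that application masters may occupy, and a value that is too low leaves applications stuck in ACCEPTED. A sketch of raising it in capacity-scheduler.xml:

    <property>
      <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
      <value>0.5</value>
    </property>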

Re: Spark LOCAL mode and external jar (extraClassPath)

2018-04-13 Thread yohann jardin
that you correctly provide the jar based on its location. I have found it tricky in some cases. As a debugging step, if the jar is not on HDFS, you can copy it there and then specify the full path in the extraClassPath property. Regards, Yohann Jardin On 4/13/2018 at 5:38 PM, Jason Boorn wrote: I do
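For reference, a sketch of the relevant spark-defaults.conf entries; the jar path is hypothetical, and in local mode both settings must point at a path readable on the local filesystem:

    spark.driver.extraClassPath    /opt/libs/my-lib.jar
    spark.executor.extraClassPath  /opt/libs/my-lib.jar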

Re: learning Spark

2017-12-04 Thread yohann jardin
Plenty of documentation is available on the Spark website itself: http://spark.apache.org/docs/latest/#where-to-go-from-here You'll find deployment guides, tuning guides, etc. Yohann Jardin On 05-Dec-17 at 1:38 AM, Somasundaram Sekar wrote: Learning Spark - O'Reilly publication as a starter and official

Re: DataFrame multiple agg on the same column

2017-10-07 Thread yohann jardin
t(1)), sum('amount), max('amount), min('create_time), max('created_time)).show Yohann Jardin On 10/7/2017 at 7:12 PM, Somasundaram Sekar wrote: Hi, I have a GroupedData object on which I perform aggregation of a few columns; since GroupedData takes in a map, I cannot perform multiple aggregates on the
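A fuller sketch of the varargs agg(...) form hinted at above; the DataFrame df and its column names are illustrative. Passing Column expressions instead of a Map is what allows several aggregates over the same column:

    import org.apache.spark.sql.functions._
    import spark.implicits._

    df.groupBy($"user_id")
      .agg(
        count($"amount"),
        sum($"amount"),
        max($"amount"),
        min($"create_time"),
        max($"create_time"))
      .show()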

Re: add arraylist to dataframe

2017-08-29 Thread yohann jardin
Hello Asmath, Your list exists inside the driver, but you are trying to add elements to it from the executors. They are in different processes, on different nodes; they do not communicate just like that. https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions There exists an action
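A minimal sketch of the pitfall and the fix, assuming a SparkSession named spark:

    import scala.collection.mutable.ArrayBuffer

    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3))

    // Wrong: the buffer lives in the driver process; foreach runs on
    // executors, so their additions never reach this copy.
    val buffer = ArrayBuffer[Int]()
    rdd.foreach(x => buffer += x)   // buffer stays empty on the driver

    // Right: use an action such as collect() to ship the data back.
    val collected: Array[Int] = rdd.collect()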

Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
For YARN, I'm speaking about the file fairscheduler.xml (if you kept the default scheduling of YARN): https://hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/FairScheduler.html#Allocation_file_format Yohann Jardin On 7/28/2017 at 8:00 PM, jeff saremi wrote: The only relevant
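A sketch of such an allocation file, following the format documented at the link above; the queue name and limits are illustrative:

    <?xml version="1.0"?>
    <allocations>
      <queue name="spark">
        <minResources>10000 mb,10 vcores</minResources>
        <maxResources>90000 mb,100 vcores</maxResources>
        <weight>2.0</weight>
      </queue>
    </allocations>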

Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
-10-4-gb-of-10-4-gb-physic Regards, Yohann Jardin On 7/28/2017 at 6:05 PM, jeff saremi wrote: Thanks so much, Yohann. I checked the Storage/Memory column on the Executors status page. Well below where I wanted to be. I will try the suggestion on smaller data sets. I am also well within the YARN

Re: How to configure spark on Yarn cluster

2017-07-28 Thread yohann jardin
. Yohann Jardin On 7/28/2017 at 8:03 AM, jeff saremi wrote: I have the simplest job, which I'm running against 100 TB of data. The job keeps failing with ExecutorLostFailures on containers killed by YARN for exceeding memory limits. I have varied the executor-memory from 32 GB to 96 GB
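When YARN kills containers for exceeding limits, the usual first lever is the off-heap overhead rather than the heap itself. A hedged sketch for Spark 2.x on YARN; the class name, jar, and values are illustrative:

    spark-submit \
      --master yarn \
      --executor-memory 32g \
      --conf spark.yarn.executor.memoryOverhead=4096 \
      --class com.example.MyJob my-job.jar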

RE: Is there a difference between these aggregations

2017-07-24 Thread yohann jardin
Seen directly in the code:

    /**
     * Aggregate function: returns the average of the values in a group.
     * Alias for avg.
     *
     * @group agg_funcs
     * @since 1.4.0
     */
    def mean(e: Column): Column = avg(e)

That's the same when the argument is the column name. So there is no difference between
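To see the equivalence in action, a small sketch; df is a hypothetical DataFrame with a numeric column named value:

    import org.apache.spark.sql.functions.{avg, mean}
    import spark.implicits._

    df.agg(mean($"value")).show()
    df.agg(avg($"value")).show()   // same plan and result: mean is an alias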

RE: [Spark] Working with JavaPairRDD from Scala

2017-07-22 Thread yohann jardin
Hello Lukasz, You can just: val pairRdd = javapairrdd.rdd() Then pairRdd will be of type RDD[(K, V)], with K being com.vividsolutions.jts.geom.Polygon, and V being java.util.HashSet[com.vividsolutions.jts.geom.Polygon]. If you really want to continue with Java objects: val
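As a generic sketch of that conversion (names are illustrative):

    import org.apache.spark.api.java.JavaPairRDD
    import org.apache.spark.rdd.RDD

    // .rdd() unwraps a JavaPairRDD into the underlying Scala RDD of tuples.
    def toScalaRdd[K, V](javaPairRdd: JavaPairRDD[K, V]): RDD[(K, V)] =
      javaPairRdd.rdd()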

Re: Spark 2.1.1 and Hadoop version 2.2 or 2.7?

2017-06-21 Thread yohann jardin
https://spark.apache.org/docs/2.1.0/building-spark.html#specifying-the-hadoop-version Hadoop 2.2.0 is only the default build version; other versions can still be built. The package you downloaded is prebuilt for Hadoop 2.7, as stated on the download page, so don't worry. Yohann Jardin
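From the linked build documentation, the general shape of a build against Hadoop 2.7 (the point version is adjustable):

    ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package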

Re: Flume DStream produces 0 records after HDFS node killed

2017-06-21 Thread yohann jardin
Which version of Hadoop are you running on? Yohann Jardin On 6/21/2017 at 1:06 AM, N B wrote: OK, some more info about this issue to see if someone can shed light on what could be going on. I turned on debug logging for org.apache.spark.streaming.scheduler in the driver process
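For anyone wanting to reproduce that logging setup, a sketch of the corresponding log4j.properties line, assuming the stock log4j 1.x configuration Spark ships with:

    log4j.logger.org.apache.spark.streaming.scheduler=DEBUG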

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-11 Thread yohann jardin
to argue on this topic. Yohann Jardin On 6/11/2017 at 7:08 PM, vaquar khan wrote: Hi Kant, Kafka is a message broker used with producers and consumers, and Spark Streaming is used for real-time processing; Kafka and Spark Streaming work together, they are not competitors. Spark Streaming

Spark SQL, formatting timezone in UTC

2017-06-02 Thread yohann jardin
Hello everyone, I'm having a hard time with time zones. I have a Long representing a timestamp: 149636160, and I want the output to be 2017-06-02 00:00:00. Based on https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html the only function that helps formatting a
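One way to force a UTC rendering, assuming the intended epoch value is 1496361600 seconds (the figure above appears to have lost a digit) and Spark 2.2+ for the session time zone setting:

    import org.apache.spark.sql.functions.from_unixtime
    import spark.implicits._

    // from_unixtime renders in the session time zone, so pin it to UTC.
    spark.conf.set("spark.sql.session.timeZone", "UTC")

    Seq(1496361600L).toDF("ts")
      .select(from_unixtime($"ts", "yyyy-MM-dd HH:mm:ss").as("utc"))
      .show()   // expect 2017-06-02 00:00:00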

Re: Recommended cluster parameters

2017-04-30 Thread yohann jardin
but the whole set of resources that you (can) have. Then, using this information and the indications on the Spark website (http://spark.apache.org/docs/latest/hardware-provisioning.html), you will be able to specify the hardware of one node and how many nodes you need (at least 3). Yohann Jardin On 4

Writing dataframe to a final path using another temporary path

2017-03-28 Thread yohann jardin
Hello, I'm using Spark 2.1. Once a job completes, I want to write a Parquet file to, let's say, the folder /user/my_user/final_path/ However, I have other jobs reading files in that specific folder, so I need those files to be completely written when they are in that folder. So while the
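A sketch of the write-then-move pattern the question is after, assuming a SparkSession named spark and a DataFrame df; the run_1 subfolder is illustrative:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Write to a temporary location first, then rename into the final
    // folder once the write has fully completed; on HDFS the rename is
    // a cheap metadata operation, so readers never see partial files.
    val tmp = new Path("/user/my_user/tmp_path/run_1")
    val dst = new Path("/user/my_user/final_path/run_1")

    df.write.parquet(tmp.toString)

    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.rename(tmp, dst)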

RE: RE: Fast write datastore...

2017-03-16 Thread yohann jardin
Hello everyone, I'm also really interested in the answers, as I will be facing the same issue soon. Muthu, if you evaluate Apache Ignite again, can you share your results? I also noticed Alluxio, which stores Spark results in memory, and you might want to investigate it. In my case I want to use them

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread yohann jardin
/spark-examples*.jar After --class you specify the path, in your provided jar, to the main class you want to run. You finish by specifying the jar that contains your main class. Yohann Jardin On 2/25/2017 at 9:50 PM, Raymond Xie wrote: I am doing Spark Streaming on a Hortonworks sandbox and am stuck
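A sketch of the full invocation shape being described, using the stock SparkPi example class; the jar path is abbreviated as in the fragment above:

    spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master local[2] \
      /path/to/spark-examples*.jar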

Executor links in Job History

2017-02-22 Thread yohann jardin
Hello, I'm using Spark 2.1.0 and Hadoop 2.2.0. When I launch jobs on YARN, I can retrieve their information on the Spark History Server, except that the links to the executors' stdout/stderr are wrong -> they point to the URLs the executors had while the job was running. We have the flag

Issues launching job dynamically in Java

2017-02-08 Thread yohann jardin
Hello everyone, I'm trying to develop a web service that launches jobs. The web service is based on Tomcat, and I'm working with Spark 2.1.0. The SparkLauncher provides two methods to launch the job: first SparkLauncher.launch(), and SparkLauncher.startApplication(SparkAppHandle.Listener...
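A minimal sketch of the second variant, using Spark 2.1's launcher API; the jar path and main class are hypothetical:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // startApplication returns a handle and reports state transitions
    // to the supplied listener.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")
      .setMainClass("com.example.MyJob")
      .setMaster("yarn")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"State changed: ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })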