Spark UI crashes on Large Workloads

2017-07-17 Thread saatvikshah1994
Hi, I have a PySpark app which, when given a huge amount of data as input, sometimes throws the error explained here: https://stackoverflow.com/questions/32340639/unable-to-understand-error-sparklistenerbus-has-already-stopped-dropping-event. All my code runs inside the main function, and

Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
Hi Josh, As you say, I also recognize the problem; I recall getting a warning when specifying a huge data set. We also adjust the partition size, but we do it through command-line options rather than the defaults or in code. Regards, Takashi 2017-07-18 6:48 GMT+09:00 Josh Holbrook
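For reference, "command-line options" here usually means passing partition-related Spark properties straight to spark-submit rather than hard-coding them; a minimal sketch, where the property values and jar name are placeholders, not recommendations:

    # partition counts below are illustrative; tune them for the data size
    spark-submit \
      --conf spark.default.parallelism=400 \
      --conf spark.sql.shuffle.partitions=400 \
      your-app.jar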

Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Josh Holbrook
I just ran into this issue! Small world. As far as I can tell, by default Spark on EMR is completely untuned, but EMR comes with a flag that you can set to tell it to autotune Spark. In your configuration.json file, you can add something like: { "Classification": "spark", "Properties":
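The JSON above is cut off; assuming the flag Josh means is EMR's maximizeResourceAllocation setting, a hedged sketch of the configuration.json entry might look like:

    [
      {
        "Classification": "spark",
        "Properties": {
          "maximizeResourceAllocation": "true"
        }
      }
    ]

With this set, EMR derives executor memory and cores from the cluster's instance types rather than leaving Spark's defaults in place.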

Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Pascal Stammer
Hi Takashi, thanks for your help. After further investigation, I figured out that the killed container was the driver process. After setting spark.yarn.driver.memoryOverhead instead of spark.yarn.executor.memoryOverhead the error was gone and the application executed without error. Maybe it
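For anyone hitting the same exit code, these are ordinary Spark conf properties in Spark 2.1 on YARN and are passed like any other; a minimal sketch, where the overhead values (in megabytes) and the jar name are illustrative only:

    # give the driver (and executors) more off-heap headroom on YARN
    spark-submit \
      --conf spark.yarn.driver.memoryOverhead=1024 \
      --conf spark.yarn.executor.memoryOverhead=1024 \
      your-app.jar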

Spark Streaming handling Kafka exceptions

2017-07-17 Thread Jean-Francois Gosselin
How can I handle an error with Kafka with my DirectStream (network issue, ZooKeeper or broker going down)? For example, when the consumer fails to connect to Kafka (at startup) I only get a DEBUG log (not even an ERROR) and no exception is thrown ... I'm using Spark 2.1.1 and

Re: Running Spark and YARN on AWS EMR

2017-07-17 Thread Takashi Sasaki
Hi Pascal, The error also occurred frequently in our project. As a solution, it was effective to specify the memory size directly with the spark-submit command, e.g. spark-submit --executor-memory 2g. Regards, Takashi > 2017-07-18 5:18 GMT+09:00 Pascal Stammer : >> Hi, >> >>

Re: Slowness of Spark Thrift Server

2017-07-17 Thread Maciej Bryński
I did the test on Spark 2.2.0 and problem still exists. Any ideas how to fix it ? Regards, Maciek 2017-07-11 11:52 GMT+02:00 Maciej Bryński : > Hi, > I have following issue. > I'm trying to use Spark as a proxy to Cassandra. > The problem is the thrift server overhead. > >

Running Spark and YARN on AWS EMR

2017-07-17 Thread Pascal Stammer
Hi, I am running a Spark 2.1.x application on AWS EMR with YARN and get the following error that kills my application: AM Container for appattempt_1500320286695_0001_01 exited with exitCode: -104 For more detailed output, check application tracking

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread Sam Elamin
Well done! This is amazing news :) Congrats and really can't wait to spread the structured streaming love! On Mon, Jul 17, 2017 at 5:25 PM, kant kodali wrote: > +1 > > On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin wrote: > >> Awesome! Congrats! Can't

[ANNOUNCE] Apache Bahir 2.1.1 Released

2017-07-17 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic platforms, extending their reach with a diversity of streaming connectors and SQL data sources. The Apache Bahir community is pleased to announce the release of Apache Bahir 2.1.1 which provides the following extensions for Apache

Re: running spark job with fat jar file

2017-07-17 Thread ayan guha
Hi Mitch - YARN uses a specific folder convention comprising application id, container id, attempt number and so on. Once you run a spark-submit on YARN, you can see your application in the YARN RM UI page. Once the app finishes, you can see all logs using yarn logs -applicationId <application id>. In this log,
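For completeness, a sketch of the command Ayan describes; the application id is a placeholder taken from the RM UI, and redirecting to a file is optional:

    # aggregate the logs of all containers for one YARN application
    yarn logs -applicationId <application_id> > app.log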

Re: running spark job with fat jar file

2017-07-17 Thread Mich Talebzadeh
great Ayan. Is that local folder on HDFS? Will that be a hidden folder specific to the user executing the spark job? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: splitting columns into new columns

2017-07-17 Thread ayan guha
Hi Please use explode, which is written to solve exactly your problem. Consider below: >>> s = ["ERN~58XX7~^EPN~5X551~|1000"] >>> df = sc.parallelize(s).map(lambda t: t.split('|')).toDF(['phone','id']) >>> df.registerTempTable("t") >>> resDF = sqlContext.sql("select id,explode(phone)

Re: running spark job with fat jar file

2017-07-17 Thread ayan guha
Hi Here is my understanding: 1. For each container, a local folder will be created and the application jar will be copied over there 2. Jars mentioned in the --jars switch will be copied over to the container, onto the class path of the application. So to your question, --jars is not required to be
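As an illustration of point 2, a hedged sketch of a submit that ships extra jars; the paths, class name, and jar name are hypothetical:

    # dep1.jar and dep2.jar are copied to each container and put on its classpath
    spark-submit \
      --master yarn \
      --deploy-mode client \
      --jars /local/path/dep1.jar,/local/path/dep2.jar \
      --class com.example.Main \
      app-fat.jar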

Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
Yes. On Mon, Jul 17, 2017 at 10:47 AM, Mich Talebzadeh wrote: > thanks Marcelo. > > are these files distributed through hdfs? > > Dr Mich Talebzadeh > > > > LinkedIn > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > > >

Re: running spark job with fat jar file

2017-07-17 Thread Mich Talebzadeh
thanks Marcelo. are these files distributed through hdfs? Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw http://talebzadehmich.wordpress.com

Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
The YARN backend distributes all files and jars you submit with your application. On Mon, Jul 17, 2017 at 10:45 AM, Mich Talebzadeh wrote: > thanks guys. > > just to clarify let us assume i am doing spark-submit as below: > > ${SPARK_HOME}/bin/spark-submit \ >

Re: running spark job with fat jar file

2017-07-17 Thread Mich Talebzadeh
thanks guys. just to clarify let us assume i am doing spark-submit as below: ${SPARK_HOME}/bin/spark-submit \ --packages ${PACKAGES} \ --driver-memory 2G \ --num-executors 2 \ --executor-memory 2G \ --executor-cores
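The command is cut off above; a hedged reconstruction of how such a submit typically continues, where the master, class name, and jar path are placeholders rather than values from the original mail:

    ${SPARK_HOME}/bin/spark-submit \
      --packages ${PACKAGES} \
      --driver-memory 2G \
      --num-executors 2 \
      --executor-memory 2G \
      --executor-cores 2 \
      --master yarn \
      --deploy-mode client \
      --class com.example.MyApp \
      /home/user/myapp-assembly-1.0.jar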

Re: running spark job with fat jar file

2017-07-17 Thread ayan guha
Hi Mitch, your jar file can be anywhere in the file system, including HDFS. If using YARN, preferably use cluster mode for deployment. YARN will distribute the jar to each container. Best Ayan On Tue, 18 Jul 2017 at 2:17 am, Marcelo Vanzin wrote: > Spark

Re: splitting columns into new columns

2017-07-17 Thread nayan sharma
Hi Pralabh, Thanks for your help. val xx = columnList.map(x => x->0).toMap val opMap = dataFrame.rdd.flatMap { row => columnList.foldLeft(xx) { case (y, col) => val s = row.getAs[String](col).split("\\^").length if (y(col) < s) y.updated(col, s) else y }.toList } val colMaxSizeMap =
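The snippet is cut off at colMaxSizeMap; a hedged sketch of how that last step could be finished (taking the per-column maximum with reduceByKey is an assumption, not necessarily Nayan's original code):

    // opMap is an RDD of (columnName, tokenCount) pairs, one set per row;
    // keep the largest count seen for each column
    val colMaxSizeMap = opMap
      .reduceByKey((a, b) => math.max(a, b))
      .collectAsMap()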

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread kant kodali
+1 On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin wrote: > Awesome! Congrats! Can't wait!! > > jg > > > On Jul 11, 2017, at 18:48, Michael Armbrust > wrote: > > Hi all, > > Apache Spark 2.2.0 is the third release of the Spark 2.x line. This > release

Re: running spark job with fat jar file

2017-07-17 Thread Marcelo Vanzin
Spark distributes your application jar for you. On Mon, Jul 17, 2017 at 8:41 AM, Mich Talebzadeh wrote: > hi guys, > > > an uber/fat jar file has been created to run with spark in CDH yarn client > mode. > > As usual job is submitted to the edge node. > > does the jar

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Fretz Nuson
I was getting a NullPointerException when trying to call Spark SQL from foreach. After debugging, I got to know the Spark session is not available in the executors and I could not successfully pass it. What I am doing is tablesRDD.foreach.collect() and it works but runs sequentially On Mon, Jul 17, 2017 at

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Fretz Nuson
I did threading but got many failed tasks and they were not reprocessed. I am guessing the driver lost track of the threaded tasks. I had also tried Future and .par of Scala with the same result as above On Mon, Jul 17, 2017 at 5:56 PM, Pralabh Kumar wrote: > Run the spark context in

running spark job with fat jar file

2017-07-17 Thread Mich Talebzadeh
hi guys, an uber/fat jar file has been created to run with spark in CDH yarn client mode. As usual the job is submitted to the edge node. Does the jar file have to be placed in the same directory where spark is running in the cluster to make it work? Also what will happen if, say, out of 9 nodes

Re: Spark 2.1.1 Error:java.lang.NoSuchMethodError: org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;

2017-07-17 Thread zzcclp
Thanks for your reply. Can you describe it in more detail? Which dependency mismatch? It works well sometimes, but sometimes fails because of the error 'NoSuchMethodError'. Thanks. ------ Original message ------ From: "vaquar khan"

Re: Spark 2.1.1 Error:java.lang.NoSuchMethodError: org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;

2017-07-17 Thread vaquar khan
We are getting the following error because of a dependency mismatch. Regards, vaquar khan On Jul 17, 2017 3:50 AM, "zzcclp" <441586...@qq.com> wrote: Hi guys: I am using Spark 2.1.1 to test on CDH 5.7.1, when I run on YARN with the following command, the error 'NoSuchMethodError:

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Rick Moritz
Put your jobs into a parallel collection using .par -- then you can submit them very easily to Spark, using .foreach. The jobs will then run using the FIFO scheduler in Spark. The advantage over the prior approaches is that you won't have to deal with threads, and that you can leave parallelism
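A minimal sketch of the approach Rick describes, assuming a SparkSession named spark and hypothetical table names and output path:

    // a parallel collection: foreach submits the jobs concurrently,
    // and Spark's scheduler (FIFO by default) interleaves them
    val tables = List("table1", "table2", "table3").par
    tables.foreach { t =>
      spark.table(t).write.mode("overwrite").parquet(s"/archive/parquet/$t")
    }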

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Simon Kitching
Have you tried simply making a list with your tables in it, then using SparkContext.makeRDD(Seq)? ie val tablenames = List("table1", "table2", "table3", ...) val tablesRDD = sc.makeRDD(tablenames, nParallelTasks) tablesRDD.foreach() > Am 17.07.2017 um 14:12 schrieb FN

Re: how to identify the alive master spark via Zookeeper ?

2017-07-17 Thread Alonso Isidoro Roman
Not sure if this can help, but a quick search on Stack Overflow returns this and this one

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Pralabh Kumar
Run the spark context in multithreaded way . Something like this val spark = SparkSession.builder() .appName("practice") .config("spark.scheduler.mode","FAIR") .enableHiveSupport().getOrCreate() val sc = spark.sparkContext val hc = spark.sqlContext val thread1 = new Thread {
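The snippet is cut off at the first thread; a hedged sketch of how the pattern usually continues, where the queries and output paths are placeholders:

    val thread1 = new Thread {
      override def run(): Unit = {
        // each thread submits its own job; FAIR scheduling lets them share executors
        hc.sql("select * from db.table1").write.parquet("/tmp/out/table1")
      }
    }
    val thread2 = new Thread {
      override def run(): Unit = {
        hc.sql("select * from db.table2").write.parquet("/tmp/out/table2")
      }
    }
    thread1.start(); thread2.start()
    thread1.join(); thread2.join()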

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Matteo Cossu
Hello, have you tried to use threads instead of the loop? On 17 July 2017 at 14:12, FN wrote: > Hi > I am currently trying to parallelize reading multiple tables from Hive . As > part of an archival framework, i need to convert few hundred tables which > are in txt format

how to identify the alive master spark via Zookeeper ?

2017-07-17 Thread marina.brunel
Hello, In our project, we have a Spark cluster with 2 masters and 4 workers, and Zookeeper decides which master is alive. We have a problem with our reverse proxy to display the Spark Web UI. The RP redirects to a master with the IP address configured in the initial configuration, but if Zookeeper

Reading Hive tables Parallel in Spark

2017-07-17 Thread FN
Hi I am currently trying to parallelize reading multiple tables from Hive. As part of an archival framework, I need to convert a few hundred tables which are in txt format to Parquet. For now I am calling Spark SQL inside a for loop for the conversion, but this executes sequentially and the entire process

Re: splitting columns into new columns

2017-07-17 Thread Pralabh Kumar
Hi Nayan Please find the solution of your problem which work on spark 2. val spark = SparkSession.builder().appName("practice").enableHiveSupport().getOrCreate() val sc = spark.sparkContext val sqlContext = spark.sqlContext import spark.implicits._ val dataFrame =
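The snippet above is truncated at the DataFrame definition; a hedged sketch of how the dynamic-column part could look, reusing the sample value from earlier in the thread (the phone_N column names are made up for illustration):

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val dataFrame = Seq(("ERN~58XX7~^EPN~5X551~", "1000")).toDF("phone", "id")

    // the widest row decides how many new columns to generate
    val maxTokens = dataFrame
      .select(size(split(col("phone"), "\\^")).as("n"))
      .agg(max("n")).first().getInt(0)

    // add one column per '^'-separated token
    val result = (0 until maxTokens).foldLeft(dataFrame) { (df, i) =>
      df.withColumn(s"phone_$i", split(col("phone"), "\\^")(i))
    }
    result.show(false)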

Spark 2.1.1 Error:java.lang.NoSuchMethodError: org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;

2017-07-17 Thread zzcclp
Hi guys: I am using Spark 2.1.1 to test on CDH 5.7.1, when I run on YARN with the following command, the error 'NoSuchMethodError: org.apache.spark.network.client.TransportClient.getChannel()Lio/netty/channel/Channel;' appears sometimes: command: su cloudera-scm -s "/bin/sh" -c

RE: how to identify the alive master spark via Zookeeper ?

2017-07-17 Thread marina.brunel
Hello, I am sending this mail again, as there has been no response to my question. Regards Marina From: BRUNEL Marina OBS/OAB Sent: Thursday, 13 July 2017 10:43 To: user@spark.apache.org Cc: DL PINK MALIMA Subject: how to identify the alive master spark via Zookeeper? Hello, In our project, we have

Re: splitting columns into new columns

2017-07-17 Thread nayan sharma
If I have 2-3 values in a column then I can easily separate them and create new columns with the withColumn option, but I am trying to achieve it in a loop and dynamically generate the new columns as many times as '^' occurs in the column values. Can it be achieved in this way? > On 17-Jul-2017, at