how to kill application

2018-03-26 Thread Shuxin Yang
Hi, I apologize if this question was asked before; I tried to find the answer, but in vain. I'm running PySpark on Google Cloud Platform with Spark 2.2.0 and the YARN resource manager. Let S1 be the set of application-ids collected via 'curl
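For reference, a minimal sketch (not from the thread; the ResourceManager address and application id below are placeholders) of killing a YARN application through the ResourceManager REST API, the same endpoint a curl call would hit:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Placeholder ResourceManager address and application id -- replace with real values.
val rm = "http://resourcemanager:8088"
val appId = "application_0000000000000_0001"

// PUT {"state":"KILLED"} to the application's state resource.
val conn = new URL(s"$rm/ws/v1/cluster/apps/$appId/state").openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("PUT")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write("""{"state":"KILLED"}""".getBytes(StandardCharsets.UTF_8))
println(s"ResourceManager responded with HTTP ${conn.getResponseCode}")
conn.disconnect()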

Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Felix Cheung
If your data can be split into groups and you can call into your favorite R package on each group of data (in parallel): https://spark.apache.org/docs/latest/sparkr.html#run-a-given-function-on-a-large-dataset-grouping-by-input-columns-and-using-gapply-or-gapplycollect

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Fawze Abujaber
Thanks for the update. What about cores per executor? On Tue, 27 Mar 2018 at 6:45 Rohit Karlupia wrote: > Thanks Fawze! > > On the memory front, I am currently working on GC and CPU aware task > scheduling. I see wonderful results based on my tests so far. Once the >

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Rohit Karlupia
Thanks Fawze! On the memory front, I am currently working on GC and CPU aware task scheduling. I see wonderful results based on my tests so far. Once the feature is complete and available, spark will work with whatever memory is provided (at least enough for the largest possible task). It will

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Fawze Abujaber
Hi Rohit, I would like to thank you for the unlimited patience and support that you are providing here and behind the scenes for all of us. The tool is amazing and easy to use, and to understand most of the metrics ... Thinking about whether we need to run it in cluster mode and all the time, I think we can

Re: Class cast exception while using Data Frames

2018-03-26 Thread Nikhil Goyal
|-- myMap: map (nullable = true)
|    |-- key: struct
|    |-- value: double (valueContainsNull = true)
|    |    |-- _1: string (nullable = true)
|    |    |-- _2: string (nullable = true)
|-- count: long (nullable = true)
On Mon, Mar 26, 2018 at 1:41 PM, Gauthier Feuillen

Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Nisha Muktewar
Look at LinkedIn's Photon ML package: https://github.com/linkedin/photon-ml One of the caveats is/was that the input data has to be in Avro in a specific format. On Mon, Mar 26, 2018 at 1:46 PM, Josh Goldsborough < joshgoldsboroughs...@gmail.com> wrote: > The company I work for is trying to do

Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Jörn Franke
SparkR does not mean that all R libraries are magically executed in a distributed fashion that scales with the data. In fact, that is similar to many other analytical software packages: they offer the possibility to run things in parallel, but the libraries themselves do not use it. The reason is that it is

[Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Josh Goldsborough
The company I work for is trying to do some mixed-effects regression modeling in our new big data platform, including SparkR. We can run it via SparkR's support for native R and use lme4, but it runs single-threaded. So we're looking for tricks/techniques to process large data sets. This was asked a

Re: Class cast exception while using Data Frames

2018-03-26 Thread Gauthier Feuillen
Can you give the output of “printSchema” ? > On 26 Mar 2018, at 22:39, Nikhil Goyal wrote: > > Hi guys, > > I have a Map[(String, String), Double] as one of my columns. Using > input.getAs[Map[(String, String), Double]](0) > throws exception: Caused by:

Class cast exception while using Data Frames

2018-03-26 Thread Nikhil Goyal
Hi guys, I have a Map[(String, String), Double] as one of my columns. Using input.getAs[Map[(String, String), Double]](0) throws exception: Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2 Even the schema
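A possible workaround, not taken from the thread itself: Catalyst hands the struct keys back as Row objects rather than scala.Tuple2, so read the column as Map[Row, Double] and rebuild the tuples by hand. A sketch, assuming input is the DataFrame from the question:

import org.apache.spark.sql.Row

val converted = input.collect().map { row =>
  // The keys arrive as GenericRowWithSchema, which is why the direct
  // getAs[Map[(String, String), Double]] cast blows up at runtime.
  val raw = row.getAs[Map[Row, Double]]("myMap")
  raw.map { case (k, v) => (k.getString(0), k.getString(1)) -> v }
}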

Re: Local dirs

2018-03-26 Thread Gauthier Feuillen
Thanks > On 26 Mar 2018, at 22:09, Marcelo Vanzin wrote: > > On Mon, Mar 26, 2018 at 1:08 PM, Gauthier Feuillen > wrote: >> Is there a way to change this value without changing yarn-site.xml ? > > No. Local dirs are defined by the NodeManager, and

Re: Local dirs

2018-03-26 Thread Marcelo Vanzin
On Mon, Mar 26, 2018 at 1:08 PM, Gauthier Feuillen wrote: > Is there a way to change this value without changing yarn-site.xml ? No. Local dirs are defined by the NodeManager, and Spark cannot override them. -- Marcelo
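As a side note (an assumption, not part of Marcelo's reply): spark.local.dir does take effect when no cluster manager overrides it, e.g. in local mode. A minimal sketch with a placeholder path:

import org.apache.spark.sql.SparkSession

// Effective in local mode; on YARN the NodeManager's
// yarn.nodemanager.local-dirs setting wins.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-dir-example")
  .config("spark.local.dir", "/data/spark-scratch")
  .getOrCreate()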

Local dirs

2018-03-26 Thread Gauthier Feuillen
Hi, I am trying to change the spark.local.dir property. I am running Spark on YARN and have already tried the following properties: export LOCAL_DIRS= spark.yarn.appMasterEnv.LOCAL_DIRS= spark.yarn.appMasterEnv.SPARK_LOCAL_DIRS= spark.yarn.nodemanager.local-dirs=/ spark.local.dir= But still it

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Michael Shtelma
Hi Keith, Thanks for the suggestion! I have solved this already. The problem was that the YARN process was not responding to start/stop commands and had not applied my configuration changes. I killed it and restarted my cluster, and after that YARN started using

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Keith Chapman
Hi Michael, sorry for the late reply. I guess you may have to set it through the hdfs core-site.xml file. The property you need to set is "hadoop.tmp.dir" which defaults to "/tmp/hadoop-${user.name}" Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 1:05 PM, Michael Shtelma

Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
On Mon, Mar 26, 2018 at 11:01 AM, Fawze Abujaber wrote: > Weird, I just ran spark-shell and its log is compressed, but my Spark jobs > that are scheduled using Oozie are not getting compressed. Ah, then it's probably a problem with how Oozie is generating the config for the Spark

Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
Hi Marcelo, Weird, I just ran spark-shell and its log is compressed, but my Spark jobs that are scheduled using Oozie are not getting compressed. On Mon, Mar 26, 2018 at 8:56 PM, Marcelo Vanzin wrote: > You're either doing something wrong, or talking about different logs. > I

Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
I distributed this config to all the nodes across the cluster, with no success; new Spark logs are still uncompressed. On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin wrote: > Spark should be using the gateway's configuration. Unless you're > launching the application from a

Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
You're either doing something wrong, or talking about different logs. I just added that to my config and ran spark-shell. $ hdfs dfs -ls /user/spark/applicationHistory | grep application_1522085988298_0002 -rwxrwx--- 3 blah blah 9844 2018-03-26 10:54

Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
Spark should be using the gateway's configuration. Unless you're launching the application from a different node, if the setting is there, Spark should be using it. You can also look in the UI's environment page to see the configuration that the app is using. On Mon, Mar 26, 2018 at 10:10 AM,

Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
I see this configuration only on the Spark gateway server, and my Spark is running on YARN, so I think I am missing something ... I'm using Cloudera Manager to set this parameter; maybe I need to add it in another configuration. On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin

Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
If the spark-defaults.conf file in the machine where you're starting the Spark app has that config, then that's all that should be needed. On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber wrote: > Thanks Marcelo, > > Yes I was was expecting to see the new apps compressed but I

Re: Spark logs compression

2018-03-26 Thread Fawze Abujaber
Thanks Marcelo, Yes, I was expecting to see the new apps compressed, but I don't. Do I need to restart Spark or YARN? On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin wrote: > Log compression is a client setting. Doing that will make new apps > write event logs in

Re: Spark logs compression

2018-03-26 Thread Marcelo Vanzin
Log compression is a client setting. Doing that will make new apps write event logs in compressed format. The SHS doesn't compress existing logs. On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber wrote: > Hi All, > > I'm trying to compress the logs at SPark history server, i

Spark logs compression

2018-03-26 Thread Fawze Abujaber
Hi All, I'm trying to compress the logs at the Spark history server. I added spark.eventLog.compress=true to spark-defaults.conf via the Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-defaults.conf, which I see applied only to the Spark gateway servers' spark conf.
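For reference, a sketch of the setting being discussed, expressed programmatically (the event log directory is a placeholder; putting spark.eventLog.compress=true in the client's spark-defaults.conf is equivalent). As noted earlier in the thread, it only affects applications launched after the change; existing event logs stay uncompressed:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("event-log-compression-example")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.compress", "true")
  .config("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")  // placeholder
  .getOrCreate()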

What do I need to set to see the number of records and processing time for each batch in SPARK UI?

2018-03-26 Thread kant kodali
Hi All, I am using Spark 2.3.0 and I am wondering what I need to set to see the number of records and the processing time for each batch in the Spark UI? The default UI doesn't seem to show this. Thanks
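If this is Structured Streaming (an assumption; the question does not say), one way to get per-batch record counts and durations without relying on the UI is a StreamingQueryListener. A sketch, assuming spark is the active SparkSession:

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

// Logs the record count and per-phase durations of every micro-batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} numInputRows=${p.numInputRows} durationMs=${p.durationMs}")
  }
})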

spark 2.3 dataframe join bug

2018-03-26 Thread 李斌松
Hi Spark folks, I'm using Spark 2.3 and have found a bug in the Spark DataFrame API; here is my code: sc = sparkSession.sparkContext tmp = sparkSession.createDataFrame(sc.parallelize([[1, 2, 3, 4], [1, 2, 5, 6], [2, 3, 4, 5], [2, 3, 5, 6]])).toDF('a', 'b', 'c', 'd')

Re: Re: the issue about the + in column,can we support the string please?

2018-03-26 Thread Shmuel Blitz
I agree. Just pointed out the option, in case you missed it. Cheers, Shmuel On Mon, Mar 26, 2018 at 10:57 AM, 1427357...@qq.com <1427357...@qq.com> wrote: > Hi, > > Using concat is one of the way. > But the + is more intuitive and easy to understand. > > -- >

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-26 Thread Rohit Karlupia
Hi Shmuel, In general it is hard to pinpoint the exact code that is responsible for a specific stage. For example, when using Spark SQL, depending upon the kinds of joins and aggregations used in a single line of query, we will have multiple stages in the Spark application. I usually try to
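One common trick for tracing stages back to code (a general suggestion, not necessarily what Rohit was about to describe) is to label sections of the job so that the corresponding jobs and stages carry that label in the Spark UI. A sketch with made-up paths, assuming spark is the active SparkSession:

val sc = spark.sparkContext

// Everything triggered while this description is set is labelled in the UI,
// which makes it easier to map a stage back to the query that produced it.
sc.setJobDescription("load and aggregate orders")
val orders = spark.read.parquet("/data/orders")
val daily = orders.groupBy("order_date").count()
daily.write.mode("overwrite").parquet("/data/daily_counts")
sc.setJobDescription(null)  // clear the label afterwards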

Re: Re: the issue about the + in column,can we support the string please?

2018-03-26 Thread 1427357...@qq.com
Hi, Using concat is one way, but + is more intuitive and easier to understand. 1427357...@qq.com From: Shmuel Blitz Date: 2018-03-26 15:31 To: 1427357...@qq.com CC: spark users; dev Subject: Re: the issue about the + in column, can we support the string please? Hi, you can get the

Re: the issue about the + in column,can we support the string please?

2018-03-26 Thread Shmuel Blitz
Hi, you can get the same with:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(Array(StructField("name", StringType),
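The reply above is cut off in the archive; a small self-contained sketch of the concat-based approach it points at (column names are made up for illustration), assuming spark is the active SparkSession:

import org.apache.spark.sql.functions.{col, concat, lit}
import spark.implicits._

// Hypothetical DataFrame with two string columns.
val df = Seq(("Jane", "Doe"), ("John", "Smith")).toDF("first_name", "last_name")

// The string concatenation the original poster wanted from "+", done with concat.
val withFull = df.withColumn("full_name", concat(col("first_name"), lit(" "), col("last_name")))
withFull.show()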