Hi,
I apologize if this question was asked before. I tried to find the
answer, but in vain.
I'm running PySpark on Google Cloud Platform with Spark 2.2.0 and
YARN resource manager.
Let S1 be the set of application-ids collected via 'curl
If your data can be split into groups and you can call into your favorite R
package on each group of data (in parallel):
https://spark.apache.org/docs/latest/sparkr.html#run-a-given-function-on-a-large-dataset-grouping-by-input-columns-and-using-gapply-or-gapplycollect
Thanks for the update.
What about cores per executor?
On Tue, 27 Mar 2018 at 6:45 Rohit Karlupia wrote:
> Thanks Fawze!
>
> On the memory front, I am currently working on GC and CPU aware task
> scheduling. I see wonderful results based on my tests so far. Once the
>
Thanks Fawze!
On the memory front, I am currently working on GC and CPU aware task
scheduling. I see wonderful results based on my tests so far. Once the
feature is complete and available, spark will work with whatever memory is
provided (at least enough for the largest possible task). It will
Hi Rohit,
I would like to thank you for the unlimited patience and support that you
are providing here and behind the scenes for all of us.
The tool is amazing and easy to use, and most of the metrics are easy to understand ...
Thinking about whether we need to run it in cluster mode and all the time, I think we
can
|-- myMap: map (nullable = true)
|    |-- key: struct
|    |    |-- _1: string (nullable = true)
|    |    |-- _2: string (nullable = true)
|    |-- value: double (valueContainsNull = true)
|-- count: long (nullable = true)
On Mon, Mar 26, 2018 at 1:41 PM, Gauthier Feuillen
Look at LinkedIn's Photon ML package: https://github.com/linkedin/photon-ml
One of the caveats is/was that the input data has to be in Avro in a
specific format.
On Mon, Mar 26, 2018 at 1:46 PM, Josh Goldsborough <
joshgoldsboroughs...@gmail.com> wrote:
> The company I work for is trying to do
SparkR does not mean that all R libraries magically execute in a distributed
fashion that scales with the data. In fact, this is similar to many other
analytical software packages: they offer the possibility of running things in parallel, but
the libraries themselves are not using it. The reason is that it is
The company I work for is trying to do some mixed-effects regression
modeling in our new big data platform including SparkR.
We can run it via SparkR's support for native R and use lme4, but it runs
single-threaded. So we're looking for tricks/techniques to process large data
sets.
This was asked a
Can you give the output of “printSchema” ?
> On 26 Mar 2018, at 22:39, Nikhil Goyal wrote:
>
> Hi guys,
>
> I have a Map[(String, String), Double] as one of my columns. Using
> input.getAs[Map[(String, String), Double]](0)
> throws exception: Caused by:
Hi guys,
I have a Map[(String, String), Double] as one of my columns. Using
input.getAs[Map[(String, String), Double]](0)
throws exception: Caused by: java.lang.ClassCastException:
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot
be cast to scala.Tuple2
Even the schema
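(Editor's note, not from the thread: a minimal sketch of one common workaround.
Struct keys in a map column come back as Row objects rather than tuples, so read
them as Row and convert by hand; "input" and the column position are taken from
the message above.)

import org.apache.spark.sql.Row

// Read the struct keys as Row, then build the Tuple2 keys manually.
val myMap: Map[(String, String), Double] =
  input.getMap[Row, Double](0).map {
    case (k, v) => (k.getString(0), k.getString(1)) -> v
  }.toMap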
Thanks
> On 26 Mar 2018, at 22:09, Marcelo Vanzin wrote:
>
> On Mon, Mar 26, 2018 at 1:08 PM, Gauthier Feuillen
> wrote:
>> Is there a way to change this value without changing yarn-site.xml ?
>
> No. Local dirs are defined by the NodeManager, and
On Mon, Mar 26, 2018 at 1:08 PM, Gauthier Feuillen
wrote:
> Is there a way to change this value without changing yarn-site.xml ?
No. Local dirs are defined by the NodeManager, and Spark cannot override them.
--
Marcelo
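(Editor's note: for reference, a minimal sketch of where the local directories are
actually configured. The property name is the standard YARN one; the paths are only
illustrative, and this is a cluster-level NodeManager change, not a per-application
setting.)

<!-- yarn-site.xml on each NodeManager host -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/local,/data2/yarn/local</value>
</property>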
Hi
I am trying to change the spark.local.dir property. I am running spark on yarn
and have already tried the following properties:
export LOCAL_DIRS=
spark.yarn.appMasterEnv.LOCAL_DIRS=
spark.yarn.appMasterEnv.SPARK_LOCAL_DIRS=
spark.yarn.nodemanager.local-dirs=/
spark.local.dir=
But still it
Hi Keith,
Thanks for the suggestion!
I have solved this already.
The problem was that the YARN process was not responding to
start/stop commands and had not applied my configuration changes.
I killed it and restarted my cluster, and after that YARN has
started using
Hi Michael,
sorry for the late reply. I guess you may have to set it through the hdfs
core-site.xml file. The property you need to set is "hadoop.tmp.dir" which
defaults to "/tmp/hadoop-${user.name}"
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 1:05 PM, Michael Shtelma
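(Editor's note: a minimal sketch of the setting Keith refers to; the value is only
illustrative.)

<!-- core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-tmp</value>
</property>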
On Mon, Mar 26, 2018 at 11:01 AM, Fawze Abujaber wrote:
> Weird, I just ran spark-shell and its log is compressed, but my spark jobs
> that are scheduled using Oozie are not getting compressed.
Ah, then it's probably a problem with how Oozie is generating the
config for the Spark
Hi Marcelo,
Weird, I just ran spark-shell and its log is compressed, but my spark jobs
that are scheduled using Oozie are not getting compressed.
On Mon, Mar 26, 2018 at 8:56 PM, Marcelo Vanzin wrote:
> You're either doing something wrong, or talking about different logs.
> I
I distributed this config to all the nodes across the cluster with no
success; new Spark logs are still uncompressed.
On Mon, Mar 26, 2018 at 8:12 PM, Marcelo Vanzin wrote:
> Spark should be using the gateway's configuration. Unless you're
> launching the application from a
You're either doing something wrong, or talking about different logs.
I just added that to my config and ran spark-shell.
$ hdfs dfs -ls /user/spark/applicationHistory | grep
application_1522085988298_0002
-rwxrwx--- 3 blah blah 9844 2018-03-26 10:54
Spark should be using the gateway's configuration. Unless you're
launching the application from a different node, if the setting is
there, Spark should be using it.
You can also look in the UI's environment page to see the
configuration that the app is using.
On Mon, Mar 26, 2018 at 10:10 AM,
I see this configuration only on the Spark gateway server, and my Spark is
running on YARN, so I think I'm missing something ...
I'm using Cloudera Manager to set this parameter; maybe I need to add this
parameter in another configuration
On Mon, 26 Mar 2018 at 20:05 Marcelo Vanzin
If the spark-defaults.conf file in the machine where you're starting
the Spark app has that config, then that's all that should be needed.
On Mon, Mar 26, 2018 at 10:02 AM, Fawze Abujaber wrote:
> Thanks Marcelo,
>
> Yes, I was expecting to see the new apps compressed but I
Thanks Marcelo,
Yes, I was expecting to see the new apps compressed but I don't; do I
need to restart Spark or YARN?
On Mon, 26 Mar 2018 at 19:53 Marcelo Vanzin wrote:
> Log compression is a client setting. Doing that will make new apps
> write event logs in
Log compression is a client setting. Doing that will make new apps
write event logs in compressed format.
The SHS doesn't compress existing logs.
On Mon, Mar 26, 2018 at 9:17 AM, Fawze Abujaber wrote:
> Hi All,
>
> I'm trying to compress the logs at the Spark history server, I
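(Editor's note: a minimal sketch of the client-side settings involved. The property
names are standard Spark settings; the event log directory is only illustrative.)

# spark-defaults.conf on the host that launches the application (client setting)
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs:///user/spark/applicationHistory
spark.eventLog.compress  true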
Hi All,
I'm trying to compress the logs at the Spark history server. I
added spark.eventLog.compress=true to spark-defaults.conf via the Spark
Client Advanced Configuration Snippet (Safety Valve) for
spark-conf/spark-defaults.conf,
which I see applied only to the Spark gateway servers' spark conf.
Hi All,
I am using Spark 2.3.0 and I am wondering what I need to set to see the
number of records and the processing time for each batch in the Spark UI. The
default UI doesn't seem to show this.
Thanks!
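(Editor's note, since no answer appears in this digest: a hedged sketch, assuming
Structured Streaming and an existing SparkSession named spark, that logs per-batch
record counts and durations through the query progress listener instead of the UI.)

import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Prints numInputRows and the duration breakdown for every micro-batch.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    val p = event.progress
    println(s"batch=${p.batchId} rows=${p.numInputRows} durationsMs=${p.durationMs}")
  }
})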
Hi Spark users,
I'm using Spark 2.3 and have found a bug in the Spark DataFrame API; here is my
code:
sc = sparkSession.sparkContext
tmp = sparkSession.createDataFrame(
    sc.parallelize([[1, 2, 3, 4], [1, 2, 5, 6], [2, 3, 4, 5], [2, 3, 5, 6]])
).toDF('a', 'b', 'c', 'd')
I agree.
Just pointed out the option, in case you missed it.
Cheers,
Shmuel
On Mon, Mar 26, 2018 at 10:57 AM, 1427357...@qq.com <1427357...@qq.com>
wrote:
> Hi,
>
> Using concat is one of the ways.
> But the + is more intuitive and easier to understand.
>
> --
>
Hi Shmuel,
In general it is hard to pinpoint the exact code responsible for a
specific stage. For example, when using Spark SQL, depending on the kinds
of joins and aggregations used in a single line of query, we will have
multiple stages in the Spark application. I usually try to
Hi,
Using concat is one of the ways.
But the + is more intuitive and easier to understand.
1427357...@qq.com
From: Shmuel Blitz
Date: 2018-03-26 15:31
To: 1427357...@qq.com
CC: spark users; dev
Subject: Re: the issue about the + in column,can we support the string please?
Hi,
you can get the
Hi,
you can get the same with:
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField,
StructType}
val schema = StructType(Array(StructField("name", StringType),
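(Editor's note: a minimal runnable sketch of the concat approach described above; the
column names and data are only illustrative.)

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, concat_ws, lit}

val spark = SparkSession.builder().appName("concat-example").getOrCreate()
import spark.implicits._

// "+" on string columns resolves to numeric addition and yields null;
// use concat (or concat_ws for a separator) instead.
val df = Seq(("foo", "bar")).toDF("a", "b")
df.select(concat($"a", lit("-"), $"b").as("a_dash_b"),
          concat_ws("-", $"a", $"b").as("a_ws_b")).show()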