Hi Alex,
Shuffle files in Spark are deleted when the object holding a reference to
the shuffle file on disk goes out of scope (i.e., is garbage collected by
the JVM). Could it be that you are keeping these objects alive?
Regards,
Keith.
http://keith-chapman.com
On Sun, Jul 21, 2019 at 12
of execution and memory. I would
rather use the DataFrame sort operation if performance is key.
Regards,
Keith.
http://keith-chapman.com
On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <
supun.kamburugam...@gmail.com> wrote:
> Hi all,
>
> We are trying to measure the sorting performan
traclasspath the jar file needs to be present on all the
executors.
Regards,
Keith.
http://keith-chapman.com
On Wed, Jun 19, 2019 at 8:57 PM naresh Goud
wrote:
> Hello All,
>
> How can we override jars in spark submit?
> We have hive-exec-spark jar which is available as part of de
Yes, that is correct: that would cause the computation to run twice. If you
want the computation to happen only once, you can cache the DataFrame and
call count and write on the cached DataFrame.
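A minimal sketch of the cache-then-count-then-write pattern described above (df and the output path are placeholders, not from the original thread):

```scala
// Hypothetical sketch: cache the DataFrame so the underlying
// computation runs once and both actions reuse the cached data.
val cached = df.cache()            // marks df for caching; materialized on the first action
val n = cached.count()             // first action: computes the data and populates the cache
cached.write.parquet("/tmp/out")   // second action: served from the cache, no recomputation
cached.unpersist()                 // release the cached blocks when done
```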
Regards,
Keith.
http://keith-chapman.com
On Mon, May 20, 2019 at 6:43 PM Rishi Shah wrote:
> Hi All,
>
Hello,
I think you can try the below; the reason is that only yarn-client mode is
supported for your scenario.
master("yarn-client")
Thanks very much.
Keith
From: 张万新
Sent: Thursday, November 1, 2018 11:36 PM
To: Cui Miao (Data & AI Product Development Dept.) <0049003...@znv.com>
Cc: user
Subject: Re: ho
t %r in type %s" % (dataType, obj,
type(obj)))
TypeError: TimestampType can not accept object '2018-03-21 08:06:17' in
type
Regards,
Keith.
http://keith-chapman.com
-XX:OnOutOfMemoryError='kill -9 %p'
Regards,
Keith.
http://keith-chapman.com
On Mon, Jun 11, 2018 at 8:22 PM, ankit jain wrote:
> Hi,
> Does anybody know if Yarn uses a different Garbage Collector from Spark
> standalone?
>
> We migrated our application recently from EMR to K8(not using native sp
Hi Michael,
Sorry for the late reply. I guess you may have to set it through the HDFS
core-site.xml file. The property you need to set is "hadoop.tmp.dir", which
defaults to "/tmp/hadoop-${user.name}".
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 1:05
Can you try setting spark.executor.extraJavaOptions to include
-Djava.io.tmpdir=someValue
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma <mshte...@gmail.com>
wrote:
> Hi Keith,
>
> Thank you for your answer!
> I have done this,
Hi Michael,
You could either set spark.local.dir through spark conf or java.io.tmpdir
system property.
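A hedged sketch of the two options above (the path is a placeholder; note that spark.local.dir must be set before the SparkContext starts, and on YARN it is overridden by the NodeManager's local dirs):

```scala
import org.apache.spark.sql.SparkSession

// Option 1: set spark.local.dir through the Spark conf.
val spark = SparkSession.builder()
  .config("spark.local.dir", "/data/spark-tmp") // placeholder scratch path
  .getOrCreate()

// Option 2: set the JVM system property before the JVM starts, e.g.
// by passing -Djava.io.tmpdir=/data/spark-tmp in extraJavaOptions.
```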
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma <mshte...@gmail.com> wrote:
> Hi everybody,
>
> I am running spark job on yarn,
Hi,
I'd like to write a custom Spark strategy that runs after all the existing
Spark strategies are run. Looking through the Spark code it seems like the
custom strategies are prepended to the list of strategies in Spark. Is
there a way I could get it to run last?
Regards,
Keith.
http://keith
My issue is that there is not enough pressure on GC, hence GC is not
kicking in fast enough to delete the shuffle files of previous iterations.
Regards,
Keith.
http://keith-chapman.com
On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud <nareshgoud.du...@gmail.com>
wrote:
> It would be very
kicking in more often and the size of /tmp stays under control. Is there
any way I could configure spark to handle this issue?
One option that I have is to have GC run more often by
setting spark.cleaner.periodicGC.interval to a much lower value. Is there a
cleaner solution?
Regards,
Keith.
http
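For reference, the periodic-GC workaround mentioned in the message above could be configured like this (the interval value is illustrative; the default is 30min):

```scala
import org.apache.spark.sql.SparkSession

// Run the ContextCleaner's periodic GC more often than the default
// so shuffle files from earlier iterations are reclaimed sooner.
val spark = SparkSession.builder()
  .config("spark.cleaner.periodicGC.interval", "5min") // illustrative value
  .getOrCreate()
```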
Hi Manuel,
You could use the following to add a path to the library search path,
--conf spark.driver.extraLibraryPath=PathToLibFolder
--conf spark.executor.extraLibraryPath=PathToLibFolder
Thanks,
Keith.
Regards,
Keith.
http://keith-chapman.com
On Wed, Jan 17, 2018 at 5:39 PM, Manuel Sopena
the columns while not the ddl.
Thanks very much.
Keith
From: Anastasios Zouzias [mailto:zouz...@gmail.com]
Sent: Sunday, October 1, 2017 3:05 PM
To: Kanagha Kumar <kpra...@salesforce.com>
Cc: user @spark <user@spark.apache.org>
Subject: Re: Error - Spark reading from HDFS via dataframe
.builder()
.master("yarn-client")
//"yarn-client", "local"
.config(sc)
.appName(SparkEAZDebug.class.getName())
.enableHiveSupport()
Finally found the root cause and raised a bug report at
https://issues.apache.org/jira/browse/SPARK-21819
Thanks very much.
Keith
From: Sun, Keith
Sent: August 22, 2017 8:48
To: user@spark.apache.org
Subject: A bug in spark or hadoop RPC with kerberos authentication?
Hello ,
I met this very weird
System.out.println(sc.toDebugString());
SparkSession sparkSession = SparkSession
    .builder()
    .master("yarn-client") // "yarn-client", "local"
    .config(sc)
    .appName(SparkEAZDebug.class.getName())
    .enableHiveSupport()
    .getOrCreate();
Thanks very much.
Keith
Here is an example of a window lead function,
select *, lead(someColumn1) over ( partition by someColumn2 order by
someColumn13 asc nulls first) as someName from someTable
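The same lead() window can be expressed through the DataFrame API; a sketch, assuming a DataFrame df with the placeholder column names from the SQL above:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

// Equivalent of: lead(someColumn1) over (partition by someColumn2
//                order by someColumn13 asc nulls first)
val w = Window
  .partitionBy("someColumn2")
  .orderBy(col("someColumn13").asc_nulls_first)

val result = df.withColumn("someName", lead("someColumn1", 1).over(w))
```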
Regards,
Keith.
http://keith-chapman.com
On Tue, Jul 25, 2017 at 9:15 AM, kant kodali <kanth...@gmail.com> wrote:
,
Keith.
http://keith-chapman.com
On Tue, Jul 25, 2017 at 12:50 AM, kant kodali <kanth...@gmail.com> wrote:
> HI All,
>
> I just want to run some spark structured streaming Job similar to this
>
> DS.filter(col("name").equalTo("john"))
> .groupB
You could also enable it with --conf spark.logLineage=true if you do not
want to change any code.
Regards,
Keith.
http://keith-chapman.com
On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman <keithgchap...@gmail.com>
wrote:
> Hi Ron,
>
> You can try using the toDebugString me
Hi Ron,
You can try using the toDebugString method on the RDD, this will print the
RDD lineage.
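A minimal illustration of toDebugString, assuming an existing SparkContext sc (the transformations are placeholders):

```scala
// Build a small RDD chain and print its lineage; the output lists
// each RDD in the chain (e.g. ParallelCollectionRDD -> MapPartitionsRDD).
val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
println(rdd.toDebugString)
```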
Regards,
Keith.
http://keith-chapman.com
On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez <zlgonza...@yahoo.com.invalid
> wrote:
> Hi,
> Can someone point me to a test case or share
Hi Nguyen,
This looks promising and seems like I could achieve it using cluster by.
Thanks for the pointer.
Regards,
Keith.
http://keith-chapman.com
On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan <newvalu...@gmail.com>
wrote:
> Hi Chapman,
> You can use "cluster by"
Thanks for the pointer Saliya. I'm looking for an equivalent API in
Dataset/DataFrame for repartitionAndSortWithinPartitions; I've already
converted most of the RDDs to DataFrames.
Regards,
Keith.
http://keith-chapman.com
On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake <esal...@gmail.
/Dataframe instead of RDDs, so my question is:
Is there custom partitioning of Dataset/Dataframe implemented in Spark?
Can I accomplish the partial sort using mapPartitions on the resulting
partitioned Dataset/Dataframe?
Any thoughts?
Regards,
Keith.
http://keith-chapman.com
As Paul said, it really depends on what you want to do with your data;
perhaps writing it to a file would be a better option, but again that
depends on what you want to do with the data you collect.
Regards,
Keith.
http://keith-chapman.com
On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern
x = spark.read.format("csv").load("/home/user/data.csv")
>
> x.show()
>
> }
>
> }
>
>
> hope this helps.
>
> Diego
>
> On 22 Mar 2017 7:18 pm, "Keith Chapman" <keithgchap...@gmail.com> wrote:
>
> Hi,
>
> I'm
how
}
}
Compiling the above program gives the error below. I'd expect it to work as
it's a simple case class; changing it to as[String] works, but I would like
to get the case class to work.
[error] /home/keith/dataset/DataSetTest.scala:13: Unable to find encoder
for type stored in a Dataset. Primitive typ
Hi Jacek,
I've looked at SparkListener and tried it, I see it getting fired on the
master but I don't see it getting fired on the workers in a cluster.
Regards,
Keith.
http://keith-chapman.com
On Fri, Jan 20, 2017 at 11:09 AM, Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
&g
I recently wrote a blog post[1] sharing my experiences with using
Apache Spark to load data into Apache Fluo. One of the things I cover
in this blog post is late binding of dependencies and exclusion of
provided dependencies when building a shaded jar. When writing the
post, I was unsure about
-production data.
Yong, that's a good point about the web content. I had forgotten to mention
that when I first saw this a few months ago, on another project, I could
sometimes trigger the OOM by trying to view the web ui for the job. That's
another case I'll try to reproduce.
Thanks again!
Keith
were
hoping that someone had seen this before and it rang a bell. Maybe there's
a setting to clean up info from old jobs that we can adjust.
Cheers,
Keith.
On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <aseigneu...@ipponusa.com>
wrote:
> Hi Irina,
>
> I would question the
There is no Java shell in Spark.
> On May 25, 2016, at 1:11 AM, Ashok Kumar wrote:
>
> Hello,
>
> A newbie question.
>
> Is it possible to use java code directly in spark shell without using maven
> to build a jar file?
>
> How can I switch from scala to java
The spark docs section for "JDBC to Other Databases"
(https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
describes the partitioning as "... Notice that lowerBound and upperBound
are just used to decide the partition stride, not for filtering the rows
in
I'm not a python expert, so I'm wondering if anybody has a working
example of a partitioner for the "partitionFunc" argument (default
"portable_hash") to rdd.partitionBy()?
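For comparison, in the Scala RDD API the same idea is expressed with a custom Partitioner rather than a partition function; a sketch (the class name is hypothetical):

```scala
import org.apache.spark.Partitioner

// Hypothetical custom partitioner: route keys by hash, mirroring
// what a partitionFunc like portable_hash does in the Python API.
class ModPartitioner(parts: Int) extends Partitioner {
  def numPartitions: Int = parts
  def getPartition(key: Any): Int = {
    val h = key.hashCode % parts
    if (h < 0) h + parts else h // keep the partition index non-negative
  }
}

// usage: pairRdd.partitionBy(new ModPartitioner(8))
```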
into an RDD with context.textFile(),
flatMap that, and union these RDDs.
also see
http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
On 1 December 2014 at 16:50, Keith Simmons ke...@pulse.io wrote:
This is a long shot, but...
I'm trying to load a bunch
Yep, that's definitely possible. It's one of the workarounds I was
considering. I was just curious if there was a simpler (and perhaps more
efficient) approach.
Keith
On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg andy.tw...@gmail.com wrote:
Could you modify your function so that it streams
unless
you have a ton (like thousands) of concurrently running applications
connecting to it there's little likelihood for it to OOM. At least that's
my understanding.
-Andrew
2014-10-22 15:51 GMT-07:00 Sameer Farooqui same...@databricks.com:
Hi Keith,
Would be helpful if you could post
We've been getting some OOMs from the spark master since upgrading to Spark
1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the
worker heap, which as far as I know is fine. Is there any setting which
*only* increases the master heap size?
Keith
Maybe I should put this another way. If spark has two jobs, A and B, both
of which consume the entire allocated memory pool, is it expected that
spark can launch B before the executor processes tied to A are completely
terminated?
On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons ke...@pulse.io
me know if there's any additional information I can
provide.
Keith
P.S. We're running spark 1.0.2
job...
14/10/09 20:51:17 INFO Worker: Executor app-20141009204127-0029/1 finished
with state KILLED
As you can see, the first app didn't actually shutdown until two minutes
after the new job launched. During that time, I was at double the worker
memory limit.
Keith
On Thu, Oct 9, 2014 at 5:06
into the individual record types without a problem.
The immediate cause seems to be a task trying to deserialize one or more
SQL case classes before loading the spark uber jar, but I have no idea why
this is happening, or why it only happens when I do a join. Ideas?
Keith
P.S. If it's relevant, we're using
locally.
On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com
wrote:
Nope. All of them are registered from the driver program.
However, I think we've found the culprit. If the join column between two
tables is not in the same column position in both tables, it triggers
)
On Tue, Jul 15, 2014 at 1:05 PM, Michael Armbrust mich...@databricks.com
wrote:
Can you print out the queryExecution?
(i.e. println(sql().queryExecution))
On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons keith.simm...@gmail.com
wrote:
To give a few more details of my environment in case
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently
using a tarball build, but I'll do a spark build with the patch as soon as
I have a chance and test it out.
Keith
On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang zonghen...@gmail.com wrote:
Hi Keith gorenuru
Good point. Shows how personal use cases color how we interpret products.
On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote:
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote:
Impala is *not* built on map/reduce, though it was built to replace
Hive, which
registering any custom
serializers).
Keith
On Tue, Jul 8, 2014 at 2:58 PM, Robert James srobertja...@gmail.com wrote:
As a new user, I can definitely say that my experience with Spark has
been rather raw. The appeal of interactive, batch, and in between all
using more or less straight Scala
a pretty good handle on the overall RDD contribution.
Thanks for all the help.
Keith
On Wed, May 28, 2014 at 6:43 AM, Christopher Nguyen c...@adatao.com wrote:
Keith, please see inline.
--
Christopher T. Nguyen
Co-founder CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen
On Tue, May 27
each task
is processing a single partition and there are a bounded number of tasks in
flight, my memory use has a rough upper limit.
Keith
? Specifically, once a key/value pair is serialized
in the shuffle stage of a task, are the references to the raw java objects
released before the next task is started.
On Tue, May 27, 2014 at 6:21 PM, Christopher Nguyen c...@adatao.com wrote:
Keith, do you mean bound as in (a) strictly control to some
, but they looked generally right. Not sure if this is the
failure you are talking about or not.
As far as shortest path, the programming guide had an example that worked
well for me under
https://spark.incubator.apache.org/docs/latest/graphx-programming-guide.html#pregel-api
.
Keith
On Sun, Mar 9, 2014