Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Keith Chapman
Hi Alex, Shuffle files in spark are deleted when the object holding a reference to the shuffle file on disk goes out of scope (is garbage collected by the JVM). Could it be the case that you are keeping these objects alive? Regards, Keith. http://keith-chapman.com On Sun, Jul 21, 2019 at 12
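
A minimal sketch of the lifecycle Keith describes, with hypothetical names and paths, assuming an existing SparkSession named spark: reassigning the variable drops the old reference, after which a JVM GC lets Spark's ContextCleaner delete the backing shuffle files.

    import spark.implicits._
    var df = spark.read.parquet("/data/input")         // hypothetical input path
    for (i <- 1 to 10) {
      df = df.repartition(200).filter($"score" > i)    // old plan becomes unreachable here
    }
    // Once the old objects are collected, their shuffle files are eligible for cleanup.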

Re: Sorting tuples with byte key and byte value

2019-07-15 Thread Keith Chapman
of execution and memory. I would rather use Dataframe sort operation if performance is key. Regards, Keith. http://keith-chapman.com On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve < supun.kamburugam...@gmail.com> wrote: > Hi all, > > We are trying to measure the sorting performan
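
A minimal sketch of the DataFrame sort Keith recommends, using hypothetical tuple data and assuming an existing SparkSession named spark:

    import spark.implicits._
    val df = Seq((3, "c"), (1, "a"), (2, "b")).toDF("key", "value")  // hypothetical key/value tuples
    val sorted = df.sort($"key")   // planned and executed by Catalyst/Tungsten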

Re: Override jars in spark submit

2019-06-19 Thread Keith Chapman
extraClassPath the jar file needs to be present on all the executors. Regards, Keith. http://keith-chapman.com On Wed, Jun 19, 2019 at 8:57 PM naresh Goud wrote: > Hello All, > > How can we override jars in spark submit? > We have hive-exec-spark jar which is available as part of de
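
A sketch of how that looks at submit time; the jar path and application names are hypothetical. extraClassPath entries are prepended to the JVM classpath, which is what lets them override jars shipped with the distribution, and the file must already exist at that path on every node:

    spark-submit \
      --conf spark.driver.extraClassPath=/opt/jars/hive-exec.jar \
      --conf spark.executor.extraClassPath=/opt/jars/hive-exec.jar \
      --class com.example.Main app.jar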

Re: [pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Keith Chapman
Yes that is correct, that would cause computation twice. If you want the computation to happen only once you can cache the dataframe and call count and write on the cached dataframe. Regards, Keith. http://keith-chapman.com On Mon, May 20, 2019 at 6:43 PM Rishi Shah wrote: > Hi All, >
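
A minimal sketch of the cache-then-reuse pattern (shown in Scala for consistency with the rest of this page; the PySpark calls are analogous, and the paths are hypothetical):

    val df = spark.range(1000).toDF("id")   // stand-in for the real DataFrame
    val cached = df.cache()                 // marks it for caching; nothing happens yet
    cached.count()                          // first action materializes the cache
    cached.write.parquet("/out/result")     // served from the cache instead of recomputing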

RE: how to use cluster sparkSession like localSession

2018-11-04 Thread Sun, Keith
Hello, I think you can try with the below; the reason is that only yarn-client mode is supported for your scenario. master("yarn-client") Thanks very much. Keith From: 张万新 Sent: Thursday, November 1, 2018 11:36 PM To: 崔苗 (Data and AI Product Development Department) <0049003...@znv.com> Cc: user Subject: Re: ho

Pyspark error when converting string to timestamp in map function

2018-08-17 Thread Keith Chapman
t %r in type %s" % (dataType, obj, type(obj))) TypeError: TimestampType can not accept object '2018-03-21 08:06:17' in type Regards, Keith. http://keith-chapman.com

Re: GC- Yarn vs Standalone K8

2018-06-11 Thread Keith Chapman
-XX:OnOutOfMemoryError='kill -9 %p' Regards, Keith. http://keith-chapman.com On Mon, Jun 11, 2018 at 8:22 PM, ankit jain wrote: > Hi, > Does anybody know if Yarn uses a different Garbage Collector from Spark > standalone? > > We migrated our application recently from EMR to K8(not using native sp
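
A sketch of how such a JVM flag is typically passed to executors; the flag is the one quoted above, everything else is hypothetical:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.extraJavaOptions", "-XX:OnOutOfMemoryError='kill -9 %p'")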

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Keith Chapman
Hi Michael, sorry for the late reply. I guess you may have to set it through the hdfs core-site.xml file. The property you need to set is "hadoop.tmp.dir" which defaults to "/tmp/hadoop-${user.name}" Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 1:05
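
The corresponding core-site.xml entry might look like this; the directory value is hypothetical:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/tmp/hadoop-${user.name}</value>
    </property>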

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
Can you try setting spark.executor.extraJavaOptions to have -Djava.io.tmpdir=someValue Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma <mshte...@gmail.com> wrote: > Hi Keith, > > Thank you for your answer! > I have done this,

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
Hi Michael, You could either set spark.local.dir through spark conf or java.io.tmpdir system property. Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma <mshte...@gmail.com> wrote: > Hi everybody, > > I am running spark job on yarn,
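
A minimal sketch of the first option with a hypothetical directory; note that on YARN the node manager's configured local directories generally take precedence over spark.local.dir:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.local.dir", "/data/spark-tmp")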

Can I get my custom spark strategy to run last?

2018-03-01 Thread Keith Chapman
Hi, I'd like to write a custom Spark strategy that runs after all the existing Spark strategies are run. Looking through the Spark code it seems like the custom strategies are prepended to the list of strategies in Spark. Is there a way I could get it to run last? Regards, Keith. http://keith
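
For context, a sketch of how a custom strategy is injected; the strategy here is a hypothetical no-op. Entries in extraStrategies are tried before the built-in ones, which is the prepending behavior the question is about:

    import org.apache.spark.sql.{SparkSession, Strategy}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    object MyStrategy extends Strategy {  // hypothetical custom strategy
      override def apply(plan: LogicalPlan): Seq[SparkPlan] =
        Nil  // Nil means "no match here, fall through to the next strategy"
    }

    val spark = SparkSession.builder().getOrCreate()
    spark.experimental.extraStrategies = Seq(MyStrategy)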

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
My issue is that there is not enough pressure on GC, hence GC is not kicking in fast enough to delete the shuffle files of previous iterations. Regards, Keith. http://keith-chapman.com On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud <nareshgoud.du...@gmail.com> wrote: > It would be very

Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
kicking in more often and the size of /tmp stays under control. Is there any way I could configure spark to handle this issue? One option that I have is to have GC run more often by setting spark.cleaner.periodicGC.interval to a much lower value. Is there a cleaner solution? Regards, Keith. http
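
A sketch of the option mentioned in the message; the 5min value is illustrative (the default is 30min):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.cleaner.periodicGC.interval", "5min")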

Re: update LD_LIBRARY_PATH when running apache job in a YARN cluster

2018-01-17 Thread Keith Chapman
Hi Manuel, You could use the following to add a path to the library search path, --conf spark.driver.extraLibraryPath=PathToLibFolder --conf spark.executor.extraLibraryPath=PathToLibFolder Regards, Keith. http://keith-chapman.com On Wed, Jan 17, 2018 at 5:39 PM, Manuel Sopena

How to find the temporary views' DDL

2017-10-01 Thread Sun, Keith
the columns while not the ddl. Thanks very much. Keith From: Anastasios Zouzias [mailto:zouz...@gmail.com] Sent: Sunday, October 1, 2017 3:05 PM To: Kanagha Kumar <kpra...@salesforce.com> Cc: user @spark <user@spark.apache.org> Subject: Re: Error - Spark reading from HDFS via dataframe

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
.builder()
    .master("yarn-client") // "yarn-client", "local"
    .config(sc)
    .appName(SparkEAZDebug.class.getName())
    .enableHiveSupport()

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
Finally found the root cause and raised a bug issue: https://issues.apache.org/jira/browse/SPARK-21819 Thanks very much. Keith From: Sun, Keith Sent: August 22, 2017 8:48 To: user@spark.apache.org Subject: A bug in spark or hadoop RPC with kerberos authentication? Hello, I met this very weird

A bug in spark or hadoop RPC with kerberos authentication?

2017-08-22 Thread Sun, Keith
System.out.println(sc.toDebugString());
SparkSession sparkSession = SparkSession
    .builder()
    .master("yarn-client") // "yarn-client", "local"
    .config(sc)
    .appName(SparkEAZDebug.class.getName())
    .enableHiveSupport()
    .getOrCreate();
Thanks very much. Keith

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
Here is an example of a window lead function, select *, lead(someColumn1) over ( partition by someColumn2 order by someColumn13 asc nulls first) as someName from someTable Regards, Keith. http://keith-chapman.com On Tue, Jul 25, 2017 at 9:15 AM, kant kodali <kanth...@gmail.com> wrote:
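
The same query expressed with the DataFrame window API; someTable stands in for a hypothetical source DataFrame, with the column names carried over from the SQL above:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lead}

    val w = Window.partitionBy(col("someColumn2"))
                  .orderBy(col("someColumn13").asc_nulls_first)
    val result = someTable.withColumn("someName", lead(col("someColumn1"), 1).over(w))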

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
, Keith. http://keith-chapman.com On Tue, Jul 25, 2017 at 12:50 AM, kant kodali <kanth...@gmail.com> wrote: > HI All, > > I just want to run some spark structured streaming Job similar to this > > DS.filter(col("name").equalTo("john")) > .groupB

Re: Get full RDD lineage for a spark job

2017-07-21 Thread Keith Chapman
You could also enable it with --conf spark.logLineage=true if you do not want to change any code. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman <keithgchap...@gmail.com> wrote: > Hi Ron, > > You can try using the toDebugString me

Re: Get full RDD lineage for a spark job

2017-07-21 Thread Keith Chapman
Hi Ron, You can try using the toDebugString method on the RDD, this will print the RDD lineage. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez <zlgonza...@yahoo.com.invalid > wrote: > Hi, > Can someone point me to a test case or share
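
A minimal usage sketch, where rdd is any hypothetical RDD:

    println(rdd.toDebugString)   // prints the lineage, one indented line per dependency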

Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Hi Nguyen, This looks promising and seems like I could achieve it using cluster by. Thanks for the pointer. Regards, Keith. http://keith-chapman.com On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan <newvalu...@gmail.com> wrote: > Hi Chapman, > You can use "cluster by"
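
A sketch of the Dataset/DataFrame equivalent of repartitionAndSortWithinPartitions, which is what CLUSTER BY compiles down to; df and the key column are hypothetical:

    import org.apache.spark.sql.functions.col
    val clustered = df
      .repartition(col("key"))           // hash-partition by key (DISTRIBUTE BY)
      .sortWithinPartitions(col("key"))  // per-partition sort, no global order (SORT BY)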

Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Thanks for the pointer Saliya, I'm looking for an equivalent API in Dataset/Dataframe for repartitionAndSortWithinPartitions; I've already converted most of the RDDs to Dataframes. Regards, Keith. http://keith-chapman.com On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake <esal...@gmail.

Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-23 Thread Keith Chapman
/Dataframe instead of RDDs, so my question is: Is there custom partitioning of Dataset/Dataframe implemented in Spark? Can I accomplish the partial sort using mapPartitions on the resulting partitioned Dataset/Dataframe? Any thoughts? Regards, Keith. http://keith-chapman.com

Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread Keith Chapman
As Paul said, it really depends on what you want to do with your data; writing it to a file may be a better option than collecting it all to the driver. Regards, Keith. http://keith-chapman.com On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern

Re: Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
x = spark.read.format("csv").load("/home/user/data.csv") > > x.show() > > } > > } > > > hope this helps. > > Diego > > On 22 Mar 2017 7:18 pm, "Keith Chapman" <keithgchap...@gmail.com> wrote: > > Hi, > > I'm

Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
how } } Compiling the above program gives, I'd expect it to work as its a simple case class, changing it to as[String] works, but I would like to get the case class to work. [error] /home/keith/dataset/DataSetTest.scala:13: Unable to find encoder for type stored in a Dataset. Primitive typ
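
A sketch of a variant that should compile, assuming a hypothetical Person case class matching the CSV columns: the encoder is found when the case class is defined at the top level (not inside a method) and spark.implicits._ is imported.

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)  // must be top-level, not nested in main

    object DataSetTest {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-to-dataset").getOrCreate()
        import spark.implicits._               // brings Encoder[Person] into scope
        val ds = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/home/user/data.csv")
          .as[Person]
        ds.show()
      }
    }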

Re:

2017-01-20 Thread Keith Chapman
Hi Jacek, I've looked at SparkListener and tried it, I see it getting fired on the master but I don't see it getting fired on the workers in a cluster. Regards, Keith. http://keith-chapman.com On Fri, Jan 20, 2017 at 11:09 AM, Jacek Laskowski <ja...@japila.pl> wrote: > Hi,
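
For reference, a sketch of registering a listener with a hypothetical handler; listener callbacks are delivered in the driver process, which matches the behavior described above:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println(s"task ${taskEnd.taskInfo.taskId} finished")  // runs on the driver, not on workers
    })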

Library dependencies in Spark

2017-01-10 Thread Keith Turner
I recently wrote a blog post[1] sharing my experiences with using Apache Spark to load data into Apache Fluo. One of the things I cover in this blog post is late binding of dependencies and exclusion of provided dependencies when building a shaded jar. When writing the post, I was unsure about

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
-production data. Yong, that's a good point about the web content. I had forgotten to mention that when I first saw this a few months ago, on another project, I could sometimes trigger the OOM by trying to view the web ui for the job. That's another case I'll try to reproduce. Thanks again! Keith

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
were hoping that someone had seen this before and it rang a bell. Maybe there's a setting to clean up info from old jobs that we can adjust. Cheers, Keith. On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin <aseigneu...@ipponusa.com> wrote: > Hi Irina, > > I would question the
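
One such knob, sketched with illustrative values: the UI keeps per-job and per-stage bookkeeping in driver memory, bounded by these settings (both default to 1000):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.ui.retainedJobs", "100")
      .set("spark.ui.retainedStages", "100")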

Re: Using Java in Spark shell

2016-05-25 Thread Keith
There is no java shell in spark. > On May 25, 2016, at 1:11 AM, Ashok Kumar wrote: > > Hello, > > A newbie question. > > Is it possible to use java code directly in spark shell without using maven > to build a jar file? > > How can I switch from scala to java

Spark SQL "partition stride"?

2016-01-11 Thread Keith Freeman
The spark docs section for "JDBC to Other Databases" (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) describes the partitioning as "... Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in
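
A sketch of a partitioned JDBC read, assuming a SparkSession named spark and hypothetical URL, table, and bounds. Each of the numPartitions tasks gets a WHERE clause covering one stride of the partition column; the first and last strides are left open-ended, so rows outside [lowerBound, upperBound] are still read:

    val props = new java.util.Properties()  // plus any credentials the database needs
    val df = spark.read.jdbc(
      "jdbc:postgresql://dbhost/mydb",      // hypothetical connection URL
      "my_table",
      "id",       // partitionColumn
      1L,         // lowerBound
      1000000L,   // upperBound
      10,         // numPartitions
      props)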

python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Keith Freeman
I'm not a python expert, so I'm wondering if anybody has a working example of a partitioner for the "partitionFunc" argument (default "portable_hash") to rdd.partitionBy()?

Spark 1.4.0 SQL JDBC partition stride?

2015-06-21 Thread Keith Freeman
The spark docs section for JDBC to Other Databases (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) describes the partitioning as ... Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
into an rdd with context.textFile(), flatmap that and union these rdds. also see http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files On 1 December 2014 at 16:50, Keith Simmons ke...@pulse.io wrote: This is a long shot, but... I'm trying to load a bunch
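
A minimal sketch of the workaround under discussion, assuming a SparkContext named sc and hypothetical paths: one RDD per file, unioned lazily so nothing loads until an action runs.

    val paths = Seq("/data/part1.txt", "/data/part2.txt")
    val combined = sc.union(paths.map(p => sc.textFile(p)))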

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Yep, that's definitely possible. It's one of the workarounds I was considering. I was just curious if there was a simpler (and perhaps more efficient) approach. Keith On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg andy.tw...@gmail.com wrote: Could you modify your function so that it streams

Re: Setting only master heap

2014-10-26 Thread Keith Simmons
unless you have a ton (like thousands) of concurrently running applications connecting to it there's little likelihood for it to OOM. At least that's my understanding. -Andrew 2014-10-22 15:51 GMT-07:00 Sameer Farooqui same...@databricks.com: Hi Keith, Would be helpful if you could post

Setting only master heap

2014-10-22 Thread Keith Simmons
We've been getting some OOMs from the spark master since upgrading to Spark 1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the worker heap, which as far as I know is fine. Is there any setting which *only* increases the master heap size? Keith

Re: Hung spark executors don't count toward worker memory limit

2014-10-13 Thread Keith Simmons
Maybe I should put this another way. If spark has two jobs, A and B, both of which consume the entire allocated memory pool, is it expected that spark can launch B before the executor processes tied to A are completely terminated? On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons ke...@pulse.io

Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
me know if there's any additional information I can provide. Keith P.S. We're running spark 1.0.2

Re: Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
job... 14/10/09 20:51:17 INFO Worker: Executor app-20141009204127-0029/1 finished with state KILLED As you can see, the first app didn't actually shutdown until two minutes after the new job launched. During that time, I was at double the worker memory limit. Keith On Thu, Oct 9, 2014 at 5:06

Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
into the individual record types without a problem. The immediate cause seems to be a task trying to deserialize one or more SQL case classes before loading the spark uber jar, but I have no idea why this is happening, or why it only happens when I do a join. Ideas? Keith P.S. If it's relevant, we're using

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
locally. On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons keith.simm...@gmail.com wrote: Nope. All of them are registered from the driver program. However, I think we've found the culprit. If the join column between two tables is not in the same column position in both tables, it triggers

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
) On Tue, Jul 15, 2014 at 1:05 PM, Michael Armbrust mich...@databricks.com wrote: Can you print out the queryExecution? (i.e. println(sql().queryExecution)) On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons keith.simm...@gmail.com wrote: To give a few more details of my environment in case

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently using a tarball build, but I'll do a spark build with the patch as soon as I have a chance and test it out. Keith On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang zonghen...@gmail.com wrote: Hi Keith gorenuru

Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote: On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which

Re: Comparative study

2014-07-08 Thread Keith Simmons
registering any custom serializers). Keith On Tue, Jul 8, 2014 at 2:58 PM, Robert James srobertja...@gmail.com wrote: As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in between all using more or less straight Scala

Re: Spark Memory Bounds

2014-05-28 Thread Keith Simmons
a pretty good handle on the overall RDD contribution. Thanks for all the help. Keith On Wed, May 28, 2014 at 6:43 AM, Christopher Nguyen c...@adatao.com wrote: Keith, please see inline. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, May 27

Spark Memory Bounds

2014-05-27 Thread Keith Simmons
each task is processing a single partition and there are a bounded number of tasks in flight, my memory use has a rough upper limit. Keith

Re: Spark Memory Bounds

2014-05-27 Thread Keith Simmons
? Specifically, once a key/value pair is serialized in the shuffle stage of a task, are the references to the raw java objects released before the next task is started. On Tue, May 27, 2014 at 6:21 PM, Christopher Nguyen c...@adatao.com wrote: Keith, do you mean bound as in (a) strictly control to some

Re: TriangleCount Shortest Path under Spark

2014-03-13 Thread Keith Massey
, but they looked generally right. Not sure if this is the failure you are talking about or not. As far as shortest path, the programming guide had an example that worked well for me under https://spark.incubator.apache.org/docs/latest/graphx-programming-guide.html#pregel-api . Keith On Sun, Mar 9, 2014
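
A sketch of the Pregel-based shortest-paths call referenced above, assuming a SparkContext named sc, a hypothetical edge list, and hypothetical landmark vertices; note that ShortestPaths computes unweighted hop counts to the landmarks:

    import org.apache.spark.graphx.{GraphLoader, VertexId}
    import org.apache.spark.graphx.lib.ShortestPaths

    val graph = GraphLoader.edgeListFile(sc, "/data/edges.txt")
    val landmarks: Seq[VertexId] = Seq(1L, 42L)
    val result = ShortestPaths.run(graph, landmarks)  // vertex attribute becomes Map(landmark -> hops)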