spark single PROCESS_LOCAL task

2016-07-15 Thread Matt K
Hi all, I'm seeing some curious behavior that I'm having a hard time interpreting. I have a job which does a "groupByKey" and results in 300 executors. 299 run in NODE_LOCAL mode, and 1 runs in PROCESS_LOCAL mode. The single PROCESS_LOCAL executor gets about 10x as much
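
One thing worth ruling out before digging into locality levels is key skew. A minimal sketch, assuming the job's input can be viewed as an RDD of key/value pairs (the thread does not say what the keys or values actually are): count records per key before the groupByKey, since a single hot key would explain one task pulling roughly 10x the shuffle data of its peers.

    import org.apache.spark.rdd.RDD

    // Hedged sketch: report the heaviest keys feeding the groupByKey.
    // If one key holds ~10x the records of the others, the task handling it
    // will receive ~10x the shuffle data, whatever its locality level says.
    def topKeys(pairs: RDD[(String, String)], n: Int = 10): Array[(String, Long)] =
      pairs
        .mapValues(_ => 1L)
        .reduceByKey(_ + _)
        .top(n)(Ordering.by[(String, Long), Long](_._2))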

Re: spark metrics question

2016-02-07 Thread Matt K
Thanks Takeshi, that's exactly what I was looking for. On Fri, Feb 5, 2016 at 12:32 PM, Takeshi Yamamuro <linguin@gmail.com> wrote: > How about using `spark.jars` to send jars into a cluster? > > On Sat, Feb 6, 2016 at 12:00 AM, Matt K <matvey1...@gmail.com> wrote:
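
For context on the suggestion: spark.jars takes a comma-separated list of jars that Spark ships to the driver and executors, and is the configuration counterpart of passing --jars to spark-submit. A minimal sketch, with the application name and jar path as placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder jar path; spark.jars accepts a comma-separated list of
    // local paths or hdfs:// / http:// / s3 URIs to distribute to executors.
    val conf = new SparkConf()
      .setAppName("custom-metrics-app")
      .set("spark.jars", "hdfs:///libs/my-custom-sink.jar")

    val sc = new SparkContext(conf)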

Re: spark metrics question

2016-02-05 Thread Matt K
reports metrics of each Executor? > > Thanks > > On 3 February 2016 at 15:56, Matt K <matvey1...@gmail.com> wrote: > >> Thanks for sharing Yiannis, looks very promising! >> >> Do you know if I can package a custom class with my application, or does >> it

spark metrics question

2016-02-03 Thread Matt K
Hi guys, I'm looking to create a custom sink based on Spark's Metrics System: https://github.com/apache/spark/blob/9f603fce78fcc997926e9a72dec44d48cbc396fc/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala If I want to collect metrics from the Driver, Master, and Executor nodes,
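
For readers following along: Spark instantiates sinks reflectively from metrics.properties, and the Sink trait is package-private, so custom sinks are usually declared under org.apache.spark.metrics.sink. A rough sketch modeled on the built-in sinks; the class name and the choice of a console reporter are illustrative, not taken from this thread:

    package org.apache.spark.metrics.sink

    import java.util.Properties
    import java.util.concurrent.TimeUnit

    import com.codahale.metrics.{ConsoleReporter, MetricRegistry}
    import org.apache.spark.SecurityManager

    // The constructor signature matches what MetricsSystem looks for
    // reflectively when it loads a sink class from metrics.properties.
    class MyConsoleSink(
        val property: Properties,
        val registry: MetricRegistry,
        securityMgr: SecurityManager) extends Sink {

      private val reporter = ConsoleReporter.forRegistry(registry)
        .convertRatesTo(TimeUnit.SECONDS)
        .convertDurationsTo(TimeUnit.MILLISECONDS)
        .build()

      override def start(): Unit = reporter.start(10, TimeUnit.SECONDS)
      override def stop(): Unit = reporter.stop()
      override def report(): Unit = reporter.report()
    }

It would then be registered with a line such as *.sink.console2.class=org.apache.spark.metrics.sink.MyConsoleSink in a metrics.properties file visible to every node, which is where the packaging question raised in this thread comes from.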

Re: spark metrics question

2016-02-03 Thread Matt K
deeper look to > it: https://github.com/ibm-research-ireland/sparkoscope > > Thanks, > Yiannis > > On 3 February 2016 at 13:32, Matt K <matvey1...@gmail.com> wrote: > >> Hi guys, >> >> I'm looking to create a custom sink bas

Re: memory leak when saving Parquet files in Spark

2015-12-14 Thread Matt K
onCols: _*) > .mode(saveMode) > .save(targetPath) > > In 1.5, we've disabled schema merging by default. > > Cheng > > > On 12/11/15 5:33 AM, Matt K wrote: > > Hi all, > > I have a process that's continuously saving data as Parquet w
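
For readers without the earlier messages, the quoted fragment is the usual DataFrameWriter pattern, sketched below with placeholder column names, mode, and paths; the point Cheng makes is that since Spark 1.5 Parquet schema merging is off by default and has to be requested at read time:

    import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}

    // Write, partitioned by some columns (placeholders for the thread's
    // actual partition columns, save mode, and target path).
    def writePartitioned(df: DataFrame, partitionCols: Seq[String], targetPath: String): Unit =
      df.write
        .partitionBy(partitionCols: _*)
        .mode(SaveMode.Append)
        .save(targetPath)

    // Schema merging must now be asked for explicitly when reading back;
    // with many small files the merge itself can get expensive.
    def readMerged(sqlContext: SQLContext, targetPath: String): DataFrame =
      sqlContext.read.option("mergeSchema", "true").parquet(targetPath)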

Re: extracting file path using dataframes

2015-09-01 Thread Matt K
Just want to add - I'm looking to partition the resulting Parquet files by customer-id, which is why I'm looking to extract the customer-id from the path. On Tue, Sep 1, 2015 at 7:00 PM, Matt K <matvey1...@gmail.com> wrote: > Hi all, > > TL;DR - is there a way to extract the s

extracting file path using dataframes

2015-09-01 Thread Matt K
Hi all, TL;DR - is there a way to extract the source path from an RDD via the Scala API? I have sequence files on S3 that look something like this: s3://data/customer=123/... s3://data/customer=456/... I am using Spark Dataframes to convert these sequence files to Parquet. As part of the
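
Without the rest of the thread, one common way to do this for sequence files is to drop down to the HadoopRDD and read the path off each input split. A sketch, assuming the old mapred SequenceFileInputFormat with Text keys and values; the customer= parsing is purely illustrative:

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{FileSplit, InputSplit, SequenceFileInputFormat}
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.{HadoopRDD, RDD}

    // Read sequence files through hadoopFile so the underlying HadoopRDD is
    // available, then pull the file path (and hence the customer=NNN part)
    // out of each partition's input split.
    def withCustomerId(sc: SparkContext, path: String): RDD[(String, String, String)] = {
      val raw = sc.hadoopFile(
        path,
        classOf[SequenceFileInputFormat[Text, Text]],
        classOf[Text],
        classOf[Text])

      raw.asInstanceOf[HadoopRDD[Text, Text]]
        .mapPartitionsWithInputSplit { (split: InputSplit, iter: Iterator[(Text, Text)]) =>
          val file = split.asInstanceOf[FileSplit].getPath.toString
          // e.g. "s3://data/customer=123/part-00000" -> "123"
          val customerId = file.split("/")
            .find(_.startsWith("customer="))
            .map(_.stripPrefix("customer="))
            .getOrElse("unknown")
          iter.map { case (k, v) => (customerId, k.toString, v.toString) }
        }
    }

From there the records can be turned into a DataFrame and written with partitionBy on the customer-id column, which is the Parquet layout the follow-up message above is after.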