Re: GroupBy and Spark Performance issue

2017-01-17 Thread Andy Dang
Repartitioning wouldn't save you from skewed data, unfortunately. The way Spark works now is that it pulls all data for the same key into a single partition, and Spark, AFAIK, retains the mapping from key to data in memory. You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid this.
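A minimal Scala sketch of that difference, runnable in spark-shell (the toy data and names are illustrative): both jobs compute per-key counts, but reduceByKey combines values on the map side before the shuffle, so a hot key contributes far less shuffled data than with groupByKey.

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 1), ("b", 1)))

    // groupByKey ships every record for a key to a single partition first
    val viaGroup = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey pre-aggregates within each partition before shuffling
    val viaReduce = pairs.reduceByKey(_ + _)

    viaReduce.collect().foreach(println)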

Re: groupByKey vs mapPartitions for efficient grouping within a Partition

2017-01-16 Thread Andy Dang
groupByKey() is a wide dependency and will cause a full shuffle. Avoid this transformation unless your keys are balanced (well-distributed) and you genuinely need a full shuffle. Otherwise, what you want is aggregateByKey() or reduceByKey() (depending on the output). These transformations are
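A short spark-shell sketch of the aggregateByKey alternative (toy data; a per-key average is just one example of an output that doesn't need the full list of values): the (sum, count) accumulator is merged within and across partitions, so no key's values are ever collected into one buffer.

    val scores = sc.parallelize(Seq(("a", 10.0), ("a", 30.0), ("b", 5.0)))

    val sumCount = scores.aggregateByKey((0.0, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1L),    // fold a value into the per-partition accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)      // merge accumulators across partitions
    )

    val averages = sumCount.mapValues { case (sum, count) => sum / count }
    averages.collect().foreach(println)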

Re: How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
gregate/AggUtils.scala#L38
>
> So, I'm not sure about your query though; it seems the types of aggregated data in your query are not supported for hash-based aggregates.
>
> // maropu
>
> On Mon, Jan 9, 2017 at 10:52 PM, Andy Dang <nam...@gmail.com> wrote
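A quick way to see which operator the planner actually chose is to print the physical plan (sketch only; ds, groupCols and myUdaf are placeholders for your own dataset, grouping columns and UDAF instance):

    import org.apache.spark.sql.functions.col

    ds.groupBy(groupCols: _*)
      .agg(myUdaf(col("value")))
      .explain()   // look for HashAggregate vs SortAggregate in the output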

How to hint Spark to use HashAggregate() for UDAF

2017-01-09 Thread Andy Dang
Hi all, It appears to me that Dataset.groupBy().agg(udaf) requires a full sort, which is very inefficient for certain aggregations. The code is very simple: - I have a UDAF - What I want to do is: dataset.groupBy(cols).agg(udaf).count() The physical plan I got was: *HashAggregate(keys=[],
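For reference, a minimal UDAF sketch (Spark 2.x UserDefinedAggregateFunction API, names illustrative) whose aggregation buffer uses only fixed-width types such as DoubleType and LongType; as the reply above suggests, the types involved in the aggregation are typically what decide whether a hash-based aggregate can be used at all.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    // A simple "average" UDAF whose buffer holds only fixed-width fields.
    class AverageUdaf extends UserDefinedAggregateFunction {
      override def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
      override def bufferSchema: StructType =
        StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
      override def dataType: DataType = DoubleType
      override def deterministic: Boolean = true

      override def initialize(buffer: MutableAggregationBuffer): Unit = {
        buffer(0) = 0.0
        buffer(1) = 0L
      }

      override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
        if (!input.isNullAt(0)) {
          buffer(0) = buffer.getDouble(0) + input.getDouble(0)
          buffer(1) = buffer.getLong(1) + 1L
        }
      }

      override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
        buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
        buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
      }

      override def evaluate(buffer: Row): Any =
        if (buffer.getLong(1) == 0L) null else buffer.getDouble(0) / buffer.getLong(1)
    }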

Converting an InternalRow to a Row

2017-01-04 Thread Andy Dang
Hi all, (cc-ing dev since I've hit a developer-API corner) What's the best way to convert an InternalRow to a Row if I've got an InternalRow and the corresponding schema? Code snippet: @Test public void foo() throws Exception { Row row = RowFactory.create(1);
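A sketch of one way to do this on Spark 2.x (both options rely on internal/developer APIs, so method names can shift between versions; schema and internalRow are assumed to be in scope, and the example is in Scala even though the snippet above is Java):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.CatalystTypeConverters
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    // Option 1: Catalyst's type converters turn an InternalRow into an external Row.
    val toExternal = CatalystTypeConverters.createToScalaConverter(schema)
    val row1 = toExternal(internalRow).asInstanceOf[Row]

    // Option 2: a resolved-and-bound RowEncoder can deserialize an InternalRow.
    val encoder = RowEncoder(schema).resolveAndBind()
    val row2 = encoder.fromRow(internalRow)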

Re: top-k function for Window

2017-01-03 Thread Andy Dang
t and order by:

> SELECT time_bucket,
>        identifier1,
>        identifier2,
>        count(identifier2) as myCount
> FROM table
> GROUP BY time_bucket,
>          identifier1,
>          identifier2
> OR

top-k function for Window

2017-01-03 Thread Andy Dang
Hi all, What's the best way to do top-k with Windowing in the Dataset world? I have a snippet of code that filters the data to the top-k, but with skewed keys: val windowSpec = Window.partitionBy(skewedKeys).orderBy(dateTime) val rank = row_number().over(windowSpec) input.withColumn("rank",
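For completeness, a sketch of where that snippet is heading (Spark 2.x, column names illustrative): rank each row within its window and keep the top k. Note that a heavily skewed partition key still funnels that key's whole window into a single task, which is the problem described above.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val k = 10
    val windowSpec = Window.partitionBy(col("skewedKey")).orderBy(col("dateTime").desc)

    val topK = input
      .withColumn("rank", row_number().over(windowSpec))
      .filter(col("rank") <= k)
      .drop("rank")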

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Andy Dang
com> wrote:

> Andy, thanks for the reply.
>
> If we download all the dependencies to a separate location and link them with the Spark job jar on the Spark cluster, is that the best way to execute a Spark job?
>
> Thanks.
>
> On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang <nam...@gmail.com> wrote:

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Andy Dang
I used to use an uber jar in Spark 1.x because of classpath issues (we couldn't re-model our dependencies based on our code, and thus the cluster's runtime dependencies could be very different from those when running Spark directly in the IDE). We had to use the userClassPathFirst "hack" to work around this. With Spark 2,
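For context, the userClassPathFirst workaround mentioned above boils down to two configuration flags (sketch only; they are marked experimental and can introduce their own conflicts, which is why it's called a "hack"):

    import org.apache.spark.SparkConf

    // Prefer the user's jars over Spark's own classpath on both driver and executors.
    val conf = new SparkConf()
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")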

Re: Do you use spark 2.0 in work?

2016-10-31 Thread Andy Dang
This is my personal email, so I can't exactly discuss work-related topics. But yes, many teams in my company use Spark 2.0 in a production environment. What are the challenges that prevent you from adopting it (besides the migration from Spark 1.x)? --- Regards, Andy On Mon, Oct 31, 2016 at 8:16

Re: an OOM while persist as DISK_ONLY

2016-03-03 Thread Andy Dang
Spark's shuffling algorithm is very aggressive about storing everything in RAM, and the behavior is worse in 1.6 with the UnifiedMemoryManager. At least in previous versions you could limit the shuffle memory, but Spark 1.6 will use as much memory as it can get. What I see is that Spark seems to
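For reference, a sketch of the knobs involved (Spark 1.6; the values are illustrative): legacy mode restores the separately capped shuffle region, while spark.memory.fraction shrinks the unified execution-plus-storage pool instead.

    import org.apache.spark.SparkConf

    // Option A: fall back to the pre-1.6 memory model with a capped shuffle fraction.
    val legacyConf = new SparkConf()
      .set("spark.memory.useLegacyMode", "true")
      .set("spark.shuffle.memoryFraction", "0.2")
      .set("spark.storage.memoryFraction", "0.4")

    // Option B: stay on the unified manager but give it less of the heap.
    val unifiedConf = new SparkConf()
      .set("spark.memory.fraction", "0.5")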

Re: [SPARK STREAMING] Concurrent operations in spark streaming

2015-10-24 Thread Andy Dang
If you execute the collect step (foreach in 1, possibly reduce in 2) in two threads in the driver, then both of them will be executed in parallel. Whichever gets submitted to Spark first gets executed first - you can use a semaphore if you need to ensure the ordering of execution, though I would
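A sketch of that pattern with Scala Futures (rdd1, rdd2 and the operations are placeholders): each action is submitted from its own driver thread, so Spark can run the two jobs concurrently, subject to available executor cores.

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val job1 = Future { rdd1.foreach(println) }   // action submitted from thread 1
    val job2 = Future { rdd2.reduce(_ + _) }      // action submitted from thread 2

    Await.result(job1, 10.minutes)
    val reduced = Await.result(job2, 10.minutes)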

Re: sc.parallelize with defaultParallelism=1

2015-09-30 Thread Andy Dang
Can't you just load the data from HBase first, and then call sc.parallelize on your dataset? -Andy --- Regards, Andy (Nam) Dang On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <nicolae.maras...@adswizz.com> wrote: > Hi, > When calling sc.parallelize(data, 1), is there a preference
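A sketch of that suggestion (fetchFromHBase is a hypothetical helper that returns a local collection on the driver; this only makes sense when the data comfortably fits in driver memory):

    // Pull the rows to the driver first, then distribute them with an explicit
    // number of slices instead of relying on defaultParallelism.
    val localRows: Seq[String] = fetchFromHBase()   // hypothetical helper
    val rdd = sc.parallelize(localRows, numSlices = 8)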

Re: Combine key-value pair in spark java

2015-09-30 Thread Andy Dang
You should be able to use a simple mapping: rdd.map(tuple -> tuple._1() + tuple._2()) --- Regards, Andy (Nam) Dang On Wed, Sep 30, 2015 at 10:34 AM, Ramkumar V wrote: > Hi, > I have a key-value pair JavaRDD (JavaPairRDD rdd) but I want to
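In Scala the same mapping looks like this (assuming a pair RDD of strings and that plain concatenation is the goal, since the original question is cut off):

    val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
    val combined = pairs.map { case (k, v) => k + v }   // one string per pair
    combined.collect().foreach(println)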