If you execute the collect step (foreach in 1, possibly reduce in 2) in two
threads in the driver, then both of them will be executed in parallel.
Whichever gets submitted to Spark first gets executed first - you can use a
semaphore if you need to ensure the ordering of execution, though I would
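Concretely, the two actions can simply be kicked off from separate driver threads. A minimal sketch (the RDD contents and the local[*] master are placeholders, not from the original thread):

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("two-actions").setMaster("local[*]"))
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(2))

val rdd1 = sc.parallelize(1 to 100)
val rdd2 = sc.parallelize(1 to 100)

// Each Future runs on its own driver thread, so both actions are submitted
// to the scheduler and their jobs can run concurrently.
val job1 = Future { rdd1.foreach(x => println(x)) }   // the foreach step
val job2 = Future { rdd2.reduce(_ + _) }              // the reduce step

Await.result(job1, Duration.Inf)
Await.result(job2, Duration.Inf)
sc.stop()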
Can't you just load the data from HBase first, and then call sc.parallelize
on your dataset?
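If you go that route, a hedged sketch of the driver-side read (the table name and row handling are placeholders; sc is an existing SparkContext):

import scala.collection.JavaConverters._
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes

// Read the rows on the driver with the plain HBase client, then hand the
// resulting local collection to Spark.
val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("my_table"))   // placeholder table name
val rows  = table.getScanner(new Scan()).iterator().asScala
  .map(result => Bytes.toString(result.getRow))            // keep just the row keys here
  .toList
conn.close()

val rdd = sc.parallelize(rows, 1)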
-Andy
---
Regards,
Andy (Nam) Dang
On Wed, Sep 30, 2015 at 12:52 PM, Nicolae Marasoiu <
nicolae.maras...@adswizz.com> wrote:
> Hi,
>
>
> When calling sc.parallelize(data,1), is there a preference
You should be able to use a simple mapping:
rdd.map(tuple -> tuple._1() + tuple._2())
---
Regards,
Andy (Nam) Dang
On Wed, Sep 30, 2015 at 10:34 AM, Ramkumar V
wrote:
> Hi,
>
> I have a key value pair of JavaRDD (JavaPairRDD rdd) but I
> want to
Spark's shuffling algorithm is very aggressive about storing everything in RAM,
and the behavior is worse in 1.6 with the UnifiedMemoryManagement. At least
in previous versions you could limit the shuffle memory, but Spark 1.6 will
use as much memory as it can get. What I see is that Spark seems to
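For reference, the memory knobs under discussion are ordinary Spark settings. A hedged sketch (the values are illustrative only, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-memory-tuning")
  // Spark 1.6+: one unified pool shared by execution (shuffle) and storage.
  .set("spark.memory.fraction", "0.6")
  .set("spark.memory.storageFraction", "0.5")
  // Opting back into the pre-1.6 behaviour restores a capped shuffle share:
  .set("spark.memory.useLegacyMode", "true")
  .set("spark.shuffle.memoryFraction", "0.2")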
Repartitioning wouldn't save you from skewed data, unfortunately. The way Spark
works now is that it pulls data of the same key into a single partition,
and Spark, AFAIK, retains the mapping from key to data in memory.
You can use aggregateByKey(), combineByKey(), or reduceByKey() to avoid
this.
groupByKey() is a wide dependency and will cause a full shuffle. It's best to
avoid this transformation unless your keys are balanced (well-distributed)
and you genuinely need the full shuffle.
Otherwise, what you want is aggregateByKey() or reduceByKey() (depending on
the output). These actions are
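A quick sketch of the contrast (pairs is a hypothetical RDD[(String, Int)]):

import org.apache.spark.rdd.RDD

// groupByKey ships every value for a key across the network before anything
// is combined, so a hot key means one huge partition.
val viaGroup: RDD[(String, Int)] = pairs.groupByKey().mapValues(_.sum)

// reduceByKey / aggregateByKey combine map-side first, so only partial sums
// cross the shuffle boundary.
val viaReduce: RDD[(String, Int)]    = pairs.reduceByKey(_ + _)
val viaAggregate: RDD[(String, Int)] = pairs.aggregateByKey(0)(_ + _, _ + _)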
This is my personal email so I can't exactly discuss work-related topics.
But yes, many teams in my company use Spark 2.0 in production.
What are the challenges that prevent you from adopting it (besides
migration from Spark 1.x)?
---
Regards,
Andy
On Mon, Oct 31, 2016 at 8:16
Hi all,
What's the best way to do top-k with windowing in the Dataset world?
I have a snippet of code that filters the data to the top-k, but with
skewed keys:
val windowSpec = Window.partitionBy(skewedKeys).orderBy(dateTime)
val rank = row_number().over(windowSpec)
input.withColumn("rank",
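A hedged sketch of how the complete filter typically looks (assuming skewedKeys is a Seq[Column]; k and the dateTime column are placeholders):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val k = 10
val windowSpec = Window.partitionBy(skewedKeys: _*).orderBy(col("dateTime").desc)

val topK = input
  .withColumn("rank", row_number().over(windowSpec))   // rank rows within each key
  .where(col("rank") <= k)                             // keep only the top-k per key
  .drop("rank")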
t and order by:
>
> SELECT time_bucket,
>        identifier1,
>        identifier2,
>        count(identifier2) as myCount
> FROM table
> GROUP BY time_bucket,
>          identifier1,
>          identifier2
>
> OR
Hi all,
It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
which is very inefficient for certain aggregations:
The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count()
The physical plan I got was:
*HashAggregate(keys=[],
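For context, the pattern being described looks roughly like this (a hedged sketch; the UDAF below is a hypothetical stand-in, since the original one isn't shown, and dataset, key, value are placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// A trivial long-sum UDAF, only to make the groupBy().agg(udaf) shape concrete.
class SumUdaf extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getLong(0) + b2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

val udaf = new SumUdaf()
val counted = dataset.groupBy(col("key")).agg(udaf(col("value"))).count()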
gregate/AggUtils.scala#L38
>
> So, I'm not sure about your query though, it seems the types of aggregated
> data in your query
> are not supported for hash-based aggregates.
>
> // maropu
>
> On Mon, Jan 9, 2017 at 10:52 PM, Andy Dang <nam...@gmail.com> wrote
Hi all,
(cc-ing dev since I've hit some developer API corner)
What's the best way to convert an InternalRow to a Row if I've got an
InternalRow and the corresponding schema?
Code snippet:
@Test
public void foo() throws Exception {
Row row = RowFactory.create(1);
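One possible conversion, sketched in Scala against Spark 2.x internals (RowEncoder and ExpressionEncoder.fromRow are developer APIs whose shape has changed across versions, so treat this as an assumption):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType(StructField("id", IntegerType) :: Nil)
val internal: InternalRow = InternalRow(1)

// Bind a Row encoder to the schema, then deserialize the InternalRow back
// into an external Row.
val externalRow: Row = RowEncoder(schema).resolveAndBind().fromRow(internal)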
I used to use an uber jar in Spark 1.x because of classpath issues (we
couldn't re-model our dependencies based on our code, and thus the cluster's
runtime dependencies could be very different from running Spark directly in
the IDE). We had to use the userClasspathFirst "hack" to work around this.
With Spark 2,
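For reference, the classpath-first workaround mentioned above maps to standard Spark settings; a minimal sketch:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Prefer the application's jars over Spark's own copies when classes clash.
  .set("spark.driver.userClassPathFirst", "true")     // driver side (cluster deploy mode)
  .set("spark.executor.userClassPathFirst", "true")   // executor side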
com> wrote:
> Andy, thanks for the reply.
>
> If we download all the dependencies to a separate location and link them
> with the Spark job jar on the Spark cluster, is that the best way to
> execute a Spark job?
>
> Thanks.
>
> On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang <nam...@gmail.com> wrote: