Looks like repartitioning was my friend, seems to be distributed across the
cluster now.

All good. Thanks!
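For anyone who lands on this thread later: the reason the repartition helped is that groupByKey hash-partitions by key, so a crawl with only a handful of distinct groups leaves most partitions empty and one or two executors doing all the work. Here's a rough, Spark-free sketch of that hash arithmetic in plain Scala (the group names and partition count are illustrative, and `partitionFor` just mimics what a hash partitioner does, it isn't Spark's API):

```scala
object PartitionSketch {
  // Mimic hash partitioning: key.hashCode modulo numPartitions, forced non-negative.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  def main(args: Array[String]): Unit = {
    val groups = Seq("news", "blogs", "forums") // few distinct crawl groups
    val numPartitions = 50
    val used = groups.map(partitionFor(_, numPartitions)).toSet
    // With 3 distinct keys, at most 3 of the 50 partitions hold any data;
    // the rest of the cluster sits idle until a repartition() reshuffles rows evenly.
    println(s"non-empty partitions: ${used.size} of $numPartitions")
  }
}
```

That's why the task view showed one busy node before the repartition(50), and a spread-out cluster after.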


On Wed, Jun 23, 2021 at 2:18 PM Tom Barber <t...@spicule.co.uk> wrote:

> Okay, so I tried another idea, which was to use a really simple class to drive
> a mapPartitions... because the logic in my head suggests that I want to
> map my partitions...
>
> @SerialVersionUID(100L)
> class RunCrawl extends Serializable {
>   def mapCrawl(x: Iterator[(String, Iterable[Resource])],
>                job: SparklerJob): Iterator[CrawlData] = {
>     val m = 1000
>     x.flatMap({ case (grp, rs) =>
>       new FairFetcher(job, rs.iterator, m,
>         FetchFunction, ParseFunction, OutLinkFilterFunction,
>         StatusUpdateSolrTransformer)
>     })
>   }
>
>   def runCrawl(f: RDD[(String, Iterable[Resource])],
>                job: SparklerJob): RDD[CrawlData] = {
>     f.mapPartitions(x => mapCrawl(x, job))
>   }
> }
>
> That is what it looks like. But the task execution window in the cluster
> looks the same:
>
> https://pasteboard.co/K7WrBnV.png
>
> 1 task on a single node.
>
> I feel like I'm missing something obvious here about either
>
> a) how spark works
> b) how it divides up partitions to tasks
> c) the fact that it's a POJO and not a file of stuff.
>
> Or probably some of all 3.
>
> Tom
>
> On Wed, Jun 23, 2021 at 11:44 AM Tom Barber <t...@spicule.co.uk> wrote:
>
>> (I should point out that I'm diagnosing this by looking at the active
>> tasks https://pasteboard.co/K7VryDJ.png, if I'm reading it incorrectly,
>> let me know)
>>
>> On Wed, Jun 23, 2021 at 11:38 AM Tom Barber <t...@spicule.co.uk> wrote:
>>
>>> Uff.... hello fine people.
>>>
>>> So the cause of the above issue was, unsurprisingly, human error. I
>>> found a local[*] Spark master config which gazumped my own one... so
>>> mystery solved. But I have another question, which is still the crux of this
>>> problem:
>>>
>>> Here's a bit of trimmed code that I'm currently testing with. I
>>> deliberately stuck in a repartition(50) just to force it to chunk the
>>> work up and distribute it. Which is all good.
>>>
>>> override def run(): Unit = {
>>>     ...
>>>
>>>     val rdd = new MemexCrawlDbRDD(sc, job, maxGroups = topG, topN = topN)
>>>     val f = rdd.map(r => (r.getGroup, r))
>>>       .groupByKey().repartition(50)
>>>
>>>     val c = f.getNumPartitions
>>>     val fetchedRdd = f.flatMap({ case (grp, rs) =>
>>>       new FairFetcher(job, rs.iterator, localFetchDelay,
>>>         FetchFunction, ParseFunction, OutLinkFilterFunction,
>>>         StatusUpdateSolrTransformer)
>>>     }).persist()
>>>
>>>     val d = fetchedRdd.getNumPartitions
>>>
>>>     ...
>>>
>>>     val scoredRdd = score(fetchedRdd)
>>>
>>>     ...
>>> }
>>>
>>> def score(fetchedRdd: RDD[CrawlData]): RDD[CrawlData] = {
>>>   val job = this.job.asInstanceOf[SparklerJob]
>>>
>>>   val scoredRdd = fetchedRdd.map(d => ScoreFunction(job, d))
>>>
>>>   val scoreUpdateRdd: RDD[SolrInputDocument] =
>>>     scoredRdd.map(d => ScoreUpdateSolrTransformer(d))
>>>   val scoreUpdateFunc = new SolrStatusUpdate(job)
>>>   sc.runJob(scoreUpdateRdd, scoreUpdateFunc)
>>>
>>>   scoredRdd // runJob doesn't return an RDD; return the scored RDD explicitly
>>> }
>>>
>>>
>>> Basically, for anyone new to this, the business logic lives inside the
>>> FairFetcher and I need that distributed over all the nodes in the Spark cluster.
>>>
>>> Here's a quick illustration of what I'm seeing:
>>> https://pasteboard.co/K7VovBO.png
>>>
>>> It chunks up the code and distributes the tasks across the cluster, but
>>> that occurs _prior_ to the business logic in the flatMap being executed.
>>>
>>> So specifically, has anyone got any ideas about how to split that
>>> flatMap operation up so the RDD processing runs across the nodes, not
>>> limited to a single node?
>>>
>>> Thanks for all your help so far,
>>>
>>> Tom
>>>
>>> On Wed, Jun 9, 2021 at 8:08 PM Tom Barber <t...@spicule.co.uk> wrote:
>>>
>>>> Ah no sorry, so in the load image the crawl has just kicked off on the
>>>> driver node, which is why it's flagged red and the load is spiking.
>>>> https://pasteboard.co/K5QHOJN.png here's the cluster now it's been
>>>> running a while. The red node is still (and was, every time I tested
>>>> it) the driver node.
>>>>
>>>> Tom
>>>>
>>>>
>>>>
>>>> On Wed, Jun 9, 2021 at 8:03 PM Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> Where do you see that... I see 3 executors busy at first. If that's
>>>>> the crawl, then?
>>>>>
>>>>> On Wed, Jun 9, 2021 at 1:59 PM Tom Barber <t...@spicule.co.uk> wrote:
>>>>>
>>>>>> Yeah :)
>>>>>>
>>>>>> But it's all running through the same node. So I can run multiple
>>>>>> tasks of the same type on the same node (the driver), but I can't run
>>>>>> multiple tasks on multiple nodes.
>>>>>>
>>>>>> On Wed, Jun 9, 2021 at 7:57 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> Wait. Isn't that what you were trying to parallelize in the first
>>>>>>> place?
>>>>>>>
>>>>>>> On Wed, Jun 9, 2021 at 1:49 PM Tom Barber <t...@spicule.co.uk> wrote:
>>>>>>>
>>>>>>>> Yeah but that something else is the crawl being run, which is
>>>>>>>> triggered from inside the RDDs, because the log output is slowly 
>>>>>>>> outputting
>>>>>>>> crawl data.
>>>>>>>>
>>>>>>>>
>>>>>> Spicule Limited is registered in England & Wales. Company Number:
>>>>>> 09954122. Registered office: First Floor, Telecom House, 125-135 Preston
>>>>>> Road, Brighton, England, BN1 6AF. VAT No. 251478891.
>>>>>>
>>>>>>
>>>>>> All engagements are subject to Spicule Terms and Conditions of
>>>>>> Business. This email and its contents are intended solely for the
>>>>>> individual to whom it is addressed and may contain information that is
>>>>>> confidential, privileged or otherwise protected from disclosure,
>>>>>> distributing or copying. Any views or opinions presented in this email 
>>>>>> are
>>>>>> solely those of the author and do not necessarily represent those of
>>>>>> Spicule Limited. The company accepts no liability for any damage caused 
>>>>>> by
>>>>>> any virus transmitted by this email. If you have received this message in
>>>>>> error, please notify us immediately by reply email before deleting it 
>>>>>> from
>>>>>> your system. Service of legal notice cannot be effected on Spicule 
>>>>>> Limited
>>>>>> by email.
>>>>>>
>>>>>

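The thread's other takeaway, for anyone skimming: a Spark stage launches one task per partition, so an RDD that has collapsed into a single partition runs its flatMap or mapPartitions on a single executor no matter how big the cluster is. A purely illustrative, Spark-free model of that relationship (`numTasks` is a stand-in for what the scheduler does, not a Spark API):

```scala
object TaskModel {
  // Model an RDD as its list of partitions; a narrow transform like
  // mapPartitions spawns exactly one task per partition.
  def numTasks[T](partitions: Seq[Seq[T]]): Int = partitions.size

  def main(args: Array[String]): Unit = {
    val onePartition  = Seq(Seq("a", "b", "c"))            // everything in one partition
    val repartitioned = Seq(Seq("a"), Seq("b"), Seq("c"))  // after a repartition(3)
    println(numTasks(onePartition))   // 1 task: one node does all the work
    println(numTasks(repartitioned))  // 3 tasks: the work can spread across nodes
  }
}
```

Hence the fix above: repartition before the flatMap so the FairFetcher work lands in many partitions, and therefore many tasks.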