Okay, that was some caching issue. Now there is a shared mount point between
the place the PySpark code is executed and the Spark nodes it runs on. Hrmph,
I was hoping that wouldn't be the case. Fair enough!
On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote:
> Okay interesting, maybe my assumpt
/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0
So what is /data/hive even referring to, when I print out the Spark conf
values and neither now refers to /data/hive/?
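For reference, a minimal way to check which warehouse locations the running
session actually resolves (sketched in Scala to match the other snippets in
this thread; it assumes an existing SparkSession called spark, and uses the
standard Spark/Hive config keys):

// Print the warehouse paths the session believes it is using.
println(spark.conf.get("spark.sql.warehouse.dir"))
println(spark.sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir"))
println(spark.sparkContext.getConf.get("spark.sql.catalogImplementation", "in-memory"))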
On Thu, Mar 7, 2024 at 9:49 PM Tom Barber wrote:
> Wonder if anyone can just sort my brain out h
Wonder if anyone can just sort my brain out here as to what's possible or
not.
I have a container running Spark, with Hive and a ThriftServer. I want to
run code against it remotely.
If I take something simple like this
from pyspark.sql import SparkSession
from pyspark.sql.types import
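(The snippet above is cut off in the archive. For orientation, here is a
minimal sketch of a remote session with Hive support, written in Scala to
match the later snippets in this thread even though the question uses PySpark;
the master URL and warehouse path are placeholders, not values from the
thread.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("remote-client")
  .master("spark://spark-host:7077")                // assumed standalone master in the container
  .config("spark.sql.warehouse.dir", "/data/hive")  // assumed warehouse location
  .enableHiveSupport()                              // needed to see the Hive metastore tables
  .getOrCreate()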
Looks like repartitioning was my friend; the work seems to be distributed
across the cluster now.
All good. Thanks!
On Wed, Jun 23, 2021 at 2:18 PM Tom Barber wrote:
> Okay so I tried another idea which was to use a real simple class to drive
> a mapPartitions... because logic in my head
b) how it divides up partitions to tasks
c) the fact it's a POJO and not a file of stuff.
Or probably some of all 3.
Tom
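For context, a hedged sketch of the pattern being described: a small
serializable class driving the per-partition work through mapPartitions. The
class and RDD names here are made up, not the actual Sparkler code.

import org.apache.spark.rdd.RDD

// Illustrative only: a plain serializable class doing the per-partition work.
class PartitionCrawler extends Serializable {
  def crawl(records: Iterator[String]): Iterator[String] =
    records.map(url => s"fetched $url")  // stand-in for the real fetch logic
}

def runCrawl(seeds: RDD[String]): RDD[String] = {
  val crawler = new PartitionCrawler
  seeds
    .repartition(50)                             // spread the seeds across the cluster (tuning assumed)
    .mapPartitions(part => crawler.crawl(part))  // one call per partition, on that partition's executor
}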
On Wed, Jun 23, 2021 at 11:44 AM Tom Barber wrote:
> (I should point out that I'm diagnosing this by looking at the active
> tasks https://pasteboard.co/K7VryDJ.png, if
(I should point out that I'm diagnosing this by looking at the active tasks
https://pasteboard.co/K7VryDJ.png, if I'm reading it incorrectly, let me
know)
On Wed, Jun 23, 2021 at 11:38 AM Tom Barber wrote:
> Uff hello fine people.
>
> So the cause of the above issue was, unsur
how to split that flatMap
operation up so the RDD processing runs across the nodes rather than being
limited to a single node?
Thanks for all your help so far,
Tom
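For reference, one common way to get a flatMap like this spread out is to
repartition the input first, so the expensive per-record work lands on every
executor. A hedged sketch with made-up names (fetchAndParse stands in for the
FairFetcher-style call, and the partition count is an assumption):

import org.apache.spark.rdd.RDD

case class CrawlData(url: String, body: String)   // illustrative type
def fetchAndParse(url: String): Seq[CrawlData] =  // stand-in for the real fetch
  Seq(CrawlData(url, ""))

def crawlAcrossCluster(urls: RDD[String]): RDD[CrawlData] =
  urls
    .repartition(50)                    // pick something >= total executor cores
    .flatMap(url => fetchAndParse(url)) // each partition's records are now fetched where they live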
On Wed, Jun 9, 2021 at 8:08 PM Tom Barber wrote:
> Ah no sorry, so in the load image, the crawl has just kicked off on the
> driver node which
.
Tom
On Wed, Jun 9, 2021 at 8:03 PM Sean Owen wrote:
> Where do you see that ... I see 3 executors busy at first. If that's the
> crawl then?
>
> On Wed, Jun 9, 2021 at 1:59 PM Tom Barber wrote:
>
>> Yeah :)
>>
>> But it's all running through the same
rst place?
>
> On Wed, Jun 9, 2021 at 1:49 PM Tom Barber wrote:
>
>> Yeah but that something else is the crawl being run, which is triggered
>> from inside the RDDs, because the log output is slowly outputting crawl
>> data.
>>
>>
g else on the driver - not doing everything on 1 machine.
>
> On Wed, Jun 9, 2021 at 12:43 PM Tom Barber wrote:
>
>> And also as this morning: https://pasteboard.co/K5Q9aEf.png
>>
>> Removing the cpu pins gives me more tasks but as you can see here:
>>
>> https://pas
> On Wed, 9 Jun 2021 at 18:43, Tom Barber wrote:
>
>> And also as this morning: https://pasteboard.co/K5Q9aEf.png
>>
>> Removing the
And also as this morning: https://pasteboard.co/K5Q9aEf.png
Removing the CPU pins gives me more tasks, but as you can see here:
https://pasteboard.co/K5Q9GO0.png
It just loads up a single server.
On Wed, Jun 9, 2021 at 6:32 PM Tom Barber wrote:
> Thanks Chris
>
> All the co
ys and see if there's any weird pattern to them.
> 5. See if the same thing happens in spark local.
>
> If you have a reproducible example you can post publicly then I'm happy
> to take a look.
>
> Chris
>
> On Wed, Jun 9, 2021 at 5:17 PM Tom Barber wrote:
>
>>
.getGroup, r))
>
> how many distinct groups did you end up with? If there's just one then I
> think you might see the behaviour you observe.
>
> Chris
>
>
> On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I di
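A quick way to answer Chris's question, as a hedged one-liner (getGroup mirrors
the fragment below; scoredRdd is assumed from the thread):

// One distinct group means effectively one task doing the downstream work.
val distinctGroups = scoredRdd.map(r => r.getGroup).distinct().count()
println(s"distinct groups: $distinctGroups")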
ent] = repRdd.map(d => ScoreUpdateSolrTransformer(d))
I did that, but the crawl is executed in that repartition executor (which, I
should have pointed out, I already knew).
Tom
On Wed, Jun 9, 2021 at 4:37 PM Tom Barber wrote:
> Sorry Sam, I missed that earlier, I'll give it a spin.
>
>
ache()
> repRdd.take(1)
> Then map operation on repRdd here.
>
> I’ve done similar map operations in the past and this works.
>
> Thanks.
>
> On Wed, Jun 9, 2021 at 11:17 AM Tom Barber wrote:
>
>> Also just to follow up on that slightly, I did also try off the back
RDD[SolrInputDocument] = scoredRdd.repartition(50).map(d => ScoreUpdateSolrTransformer(d))
Where I repartitioned that scoredRdd map out of interest, it then triggers
the FairFetcher function there, instead of in the runJob(), but still on a
single executor.
Tom
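For reference, the pattern suggested above (repartition, cache, force it with
take(1), then map) looks roughly like this; a hedged sketch reusing the names
from the thread rather than the actual Sparkler code:

// Repartition first, materialise the shuffle, then run the heavy map.
val repRdd = scoredRdd.repartition(50).cache()
repRdd.take(1)  // forces the repartition to actually happen
val solrDocs = repRdd.map(d => ScoreUpdateSolrTransformer(d))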
On Wed, Jun 9, 2021 at 4:11 PM Tom Barber wrote:
e part of how this can do something unusual.
> How do you trigger any action? What happens after persist()?
>
> On Wed, Jun 9, 2021 at 9:48 AM Tom Barber wrote:
>
>> Thanks Mich,
>>
>> The key on the first iteration is just a string that says "seed", so it
>&
>
>
; I think we need more info about what else is happening in the code.
>
> On Wed, Jun 9, 2021 at 6:30 AM Tom Barber wrote:
>
>> Yeah so if I update the FairFetcher to return a seq it makes no real
>> difference.
>>
>> Here's an image of what I'm seeing just for r
bfs:/FileStore/bcf/sparkler7.jar","crawl","-id","mytestcrawl11",
"-tn", "5000", "-co",
"{\"plugins.active\":[\"urlfilter-regex\",\"urlfilter-samehost\",\"fetcher-chrome\"],\"plugins\&
I've not run it yet, but I've stuck a toSeq on the end. In reality, though, a
Seq just inherits Iterator, right?
flatMap does return an RDD[CrawlData] unless my IDE is lying to me.
Tom
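For what it's worth, flatMap accepts either shape because both can be iterated
over; a hedged toy example (not the crawl code, and assuming a SparkSession
named spark), where the Seq is built eagerly per record and the Iterator stays
lazy:

import org.apache.spark.rdd.RDD

val lines: RDD[String] = spark.sparkContext.parallelize(Seq("a b", "c d"))
val eager: RDD[String]  = lines.flatMap(line => line.split(" ").toSeq)     // per-record Seq
val lazier: RDD[String] = lines.flatMap(line => line.split(" ").iterator)  // per-record Iterator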
On Wed, Jun 9, 2021 at 10:54 AM Tom Barber wrote:
> Interesting Jayesh, thanks, I will test.
>
> All
o call toSeq on
> FairFetcher.
>
> On 6/8/21, 10:10 PM, "Tom Barber" wrote:
>
> F
For anyone interested, here are the execution logs up until the point where it
actually kicks off the workload in question:
https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473
On 2021/06/09 01:52:39, Tom Barber wrote:
> ExecutorID says driver, and looking at the IP addresses
> How many partitions does the groupByKey produce? That would limit your
> parallelism no matter what if it's a small number.
>
> On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote:
>
> > Hi folks,
> >
> > Hopefully someone with more Spark experience than me can ex
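A hedged sketch of the point above (names are illustrative, not the Sparkler
code): both the partition count coming out of groupByKey and the number of
distinct keys going in cap how many tasks can run in parallel downstream.

val grouped = scoredRdd
  .map(r => (r.getGroup, r))  // key by group
  .groupByKey(50)             // explicit partition count; with one distinct key,
                              // all records still land in a single partition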
Hi folks,
Hopefully someone with more Spark experience than me can explain this a bit.
I don't know if this is possible, impossible or just an old design that could
be better.
I'm running Sparkler as a spark-submit job on a Databricks Spark cluster and
it's getting to this point in the
Hi guys,
Hopefully someone can help me, or at least explain stuff to me.
I use a tool that requires JDBC metadata (tables/columns etc.).
So using Spark 1.3.1 I try stuff like:
registerTempTable()
or saveAsTable()
on my parquet file.
The former doesn't show any table metadata for JDBC
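For reference, a hedged sketch against the Spark 1.3 API (paths and table
names are placeholders): registerTempTable only creates a session-scoped name,
so a JDBC client talking to the ThriftServer won't see its metadata, while
saveAsTable writes through the Hive metastore, which JDBC metadata calls can
read (it needs a HiveContext rather than a plain SQLContext).

val df = sqlContext.parquetFile("/path/to/data.parquet")
df.registerTempTable("my_temp_table")  // visible only inside this SQLContext
df.saveAsTable("my_hive_table")        // persisted in the Hive metastore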