Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Okay that was some caching issue. Now there is a shared mount point between the place the pyspark code is executed and the spark nodes it runs on. Hrmph, I was hoping that wouldn't be the case. Fair enough! On Thu, Mar 7, 2024 at 11:23 PM Tom Barber wrote: > Okay interesting, maybe my assumpt

Re: Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
/accounts_20240307_232110_1_0_6_post21_g4fdc321_d20240307/_temporary/0 so what is /data/hive even referring to when I print out the spark conf values and neither now refer to /data/hive/ On Thu, Mar 7, 2024 at 9:49 PM Tom Barber wrote: > Wonder if anyone can just sort my brain out h

Creating remote tables using PySpark

2024-03-07 Thread Tom Barber
Wonder if anyone can just sort my brain out here as to what's possible or not. I have a container running Spark, with Hive and a ThriftServer. I want to run code against it remotely. If I take something simple like this from pyspark.sql import SparkSession from pyspark.sql.types import

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
Looks like repartitioning was my friend, seems to be distributed across the cluster now. All good. Thanks! On Wed, Jun 23, 2021 at 2:18 PM Tom Barber wrote: > Okay so I tried another idea which was to use a real simple class to drive > a mapPartitions... because logic in my head

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
b) how it divides up partitions to tasks c) the fact it's a POJO and not a file of stuff. Or probably some of all 3. Tom On Wed, Jun 23, 2021 at 11:44 AM Tom Barber wrote: > (I should point out that I'm diagnosing this by looking at the active > tasks https://pasteboard.co/K7VryDJ.png, if

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
(I should point out that I'm diagnosing this by looking at the active tasks https://pasteboard.co/K7VryDJ.png, if I'm reading it incorrectly, let me know) On Wed, Jun 23, 2021 at 11:38 AM Tom Barber wrote: > Uff hello fine people. > > So the cause of the above issue was, unsur

Re: Distributing a FlatMap across a Spark Cluster

2021-06-23 Thread Tom Barber
how to split that flatmap operation up so the RDD processing runs across the nodes, not limited to a single node? Thanks for all your help so far, Tom On Wed, Jun 9, 2021 at 8:08 PM Tom Barber wrote: > Ah no sorry, so in the load image, the crawl has just kicked off on the > driver node which

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
. Tom On Wed, Jun 9, 2021 at 8:03 PM Sean Owen wrote: > Where do you see that ... I see 3 executors busy at first. If that's the > crawl then ? > > On Wed, Jun 9, 2021 at 1:59 PM Tom Barber wrote: > >> Yeah :) >> >> But it's all running through the same

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
rst place? > > On Wed, Jun 9, 2021 at 1:49 PM Tom Barber wrote: > >> Yeah but that something else is the crawl being run, which is triggered >> from inside the RDDs, because the log output is slowly outputting crawl >> data. >> >> -- Spicule Limited is reg

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
g else on the driver - not doing everything on 1 machine. > > On Wed, Jun 9, 2021 at 12:43 PM Tom Barber wrote: > >> And also as this morning: https://pasteboard.co/K5Q9aEf.png >> >> Removing the cpu pins gives me more tasks but as you can see here: >> >> https://pas

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
med. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Wed, 9 Jun 2021 at 18:43, Tom Barber wrote: > >> And also as this morning: https://pasteboard.co/K5Q9aEf.png >> >> Removing the

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
And also as this morning: https://pasteboard.co/K5Q9aEf.png Removing the cpu pins gives me more tasks but as you can see here: https://pasteboard.co/K5Q9GO0.png It just loads up a single server. On Wed, Jun 9, 2021 at 6:32 PM Tom Barber wrote: > Thanks Chris > > All the co

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
ys and see if there's any weird pattern to them. > 5. See if the same thing happens in spark local. > > If you have a reproducible example you can post publicly then I'm happy > to take a look. > > Chris > > On Wed, Jun 9, 2021 at 5:17 PM Tom Barber wrote: > >>

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
.getGroup, r)) > > how many distinct groups did you end up with? If there's just one then I > think you might see the behaviour you observe. > > Chris > > > On Wed, Jun 9, 2021 at 4:17 PM Tom Barber wrote: > >> Also just to follow up on that slightly, I di
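
A minimal sketch of the point about distinct groups, written in plain Python rather than Spark so it runs anywhere. It assumes a simple hash-modulo partitioner as a stand-in for how keys are assigned to partitions; `partition_for`, `group_by_key`, and the example URLs are hypothetical names for illustration and are not taken from Sparkler or from the thread.

```python
from collections import defaultdict


def partition_for(key, num_partitions):
    # Stand-in for a hash partitioner (an assumption for illustration):
    # the same key always maps to the same partition index.
    return hash(key) % num_partitions


def group_by_key(records, num_partitions):
    """Simulate a group-by-key: returns {partition_index: {key: [values]}}."""
    partitions = defaultdict(lambda: defaultdict(list))
    for key, value in records:
        partitions[partition_for(key, num_partitions)][key].append(value)
    return partitions


urls = [f"http://example.com/page{i}" for i in range(100)]

# Every record shares one key, so there is one group in one partition,
# however many partitions were requested.
single_key = group_by_key([("seed", u) for u in urls], num_partitions=50)

# Hypothetical re-keying by record index: distinct keys spread the work out.
many_keys = group_by_key(list(enumerate(urls)), num_partitions=50)

print(len(single_key), "partition(s) hold data with a single key")
print(len(many_keys), "partition(s) hold data with 100 distinct keys")
```

If the real job groups everything under a single key, work that runs per group can only ever occupy one task, which would match the single busy executor described in the thread. Whether that is the actual cause here is not confirmed by the truncated messages.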

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
ent] = repRdd.map(d => ScoreUpdateSolrTransformer(d)) I did that, but the crawl is executed in that repartition executor (which I should have pointed out I already know). Tom On Wed, Jun 9, 2021 at 4:37 PM Tom Barber wrote: > Sorry Sam, I missed that earlier, I'll give it a spin. > >

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
ache() > repRdd.take(1) > Then map operation on repRdd here. > > I’ve done similar map operations in the past and this works. > > Thanks. > > On Wed, Jun 9, 2021 at 11:17 AM Tom Barber wrote: > >> Also just to follow up on that slightly, I did also try off the back

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
RDD[SolrInputDocument] = scoredRdd.repartition(50).map(d => ScoreUpdateSolrTransformer(d)) Where I repartitioned that scoredRdd map out of interest, it then triggers the FairFetcher function there, instead of in the runJob(), but still on a single executor  Tom On Wed, Jun 9, 2021 at 4:11 PM Tom Barber

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
e part of how this can do something unusual. > How do you trigger any action? what happens after persist() > > On Wed, Jun 9, 2021 at 9:48 AM Tom Barber wrote: > >> Thanks Mich, >> >> The key on the first iteration is just a string that says "seed", so it >

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
> > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable fo

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
> I think we need more info about what else is happening in the code. > > On Wed, Jun 9, 2021 at 6:30 AM Tom Barber wrote: > >> Yeah so if I update the FairFetcher to return a seq it makes no real >> difference. >> >> Here's an image of what I'm seeing just for r

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
bfs:/FileStore/bcf/sparkler7.jar","crawl","-id","mytestcrawl11", "-tn", "5000", "-co", "{\"plugins.active\":[\"urlfilter-regex\",\"urlfilter-samehost\",\"fetcher-chrome\"],\"plugins\&

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
I've not run it yet, but I've stuck a toSeq on the end, but in reality a Seq just inherits Iterator, right? Flatmap does return a RDD[CrawlData] unless my IDE is lying to me. Tom On Wed, Jun 9, 2021 at 10:54 AM Tom Barber wrote: > Interesting Jayesh, thanks, I will test. > > All

Re: Distributing a FlatMap across a Spark Cluster

2021-06-09 Thread Tom Barber
o call toSeq on > FairFetcher. > > On 6/8/21, 10:10 PM, "Tom Barber" wrote: > > CAUTION: This email originated from outside of the organization. Do > not click links or open attachments unless you can confirm the sender and > know the content is safe. > > > > F

Re: Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Tom Barber
For anyone interested here's the execution logs up until the point where it actually kicks off the workload in question: https://gist.github.com/buggtb/a9e0445f24182bc8eedfe26c0f07a473 On 2021/06/09 01:52:39, Tom Barber wrote: > ExecutorID says driver, and looking at the IP addresses

Re: Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Tom Barber
> how many partitions does the groupByKey produce? that would limit your > parallelism no matter what if it's a small number. > > On Tue, Jun 8, 2021 at 8:07 PM Tom Barber wrote: > > > Hi folks, > > > > Hopefully someone with more Spark experience than me can ex

Distributing a FlatMap across a Spark Cluster

2021-06-08 Thread Tom Barber
Hi folks, Hopefully someone with more Spark experience than me can explain this a bit. I don't know if this is possible, impossible or just an old design that could be better. I'm running Sparkler as a spark-submit job on a databricks spark cluster and it's getting to this point in the

Help getting Spark JDBC metadata

2015-09-09 Thread Tom Barber
Hi guys, Hopefully someone can help me, or at least explain stuff to me. I use a tool that requires JDBC metadata (tables/columns etc.). So using Spark 1.3.1 I try stuff like: registerTempTable() or saveAsTable() on my parquet file. The former doesn't show any table metadata for JDBC