Accumulators and other important metrics for your job

2021-05-27 Thread Hamish Whittal
Hi folks, I have a problematic dataset I'm working with and am trying to find ways of "debugging" the data. For example, the simplest thing I would like to do is to know how many rows of data I've read and compare that to a simple count of the lines in the file. I could do: df.count()
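
A minimal sketch of the accumulator approach for this kind of row-count debugging, assuming a plain-text source; the path is a placeholder and the helper is illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("row-count-debug").getOrCreate()
    sc = spark.sparkContext

    # Counts every line the executors actually read. Note: accumulator
    # updates made inside a transformation can be re-applied if a task
    # is retried, so treat this as a debugging signal, not an exact count.
    rows_read = sc.accumulator(0)

    def tag(line):
        rows_read.add(1)
        return line

    lines = sc.textFile("s3://bucket/data.csv").map(tag)  # placeholder path
    n = lines.count()  # an action must run before the accumulator fills

    print("count():", n, "accumulator:", rows_read.value)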

Re: Keeping track of how long something has been in a queue

2020-09-04 Thread Hamish Whittal
Sorry, I moved a paragraph. (2) If Ms green.th was first seen at 13:04:04, then at 13:04:05, and finally at 13:04:17, she's been in the queue for 13 seconds (ignoring the ms).

Keeping track of how long something has been in a queue

2020-09-04 Thread Hamish Whittal
Hi folks, I have a stream coming from Kafka. It has this schema: { "id": 4, "account_id": 1070998, "uid": "green.th", "last_activity_time": "2020-09-03 13:04:04.520129" } Another event arrives a few milliseconds/seconds later: { "id": 9, "account_id": 1070998,
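
A hedged sketch of computing the time-in-queue described in the reply above, assuming the Kafka JSON has already been parsed into a streaming DataFrame named events with uid and last_activity_time (timestamp) columns; the watermark value is illustrative:

    from pyspark.sql import functions as F

    in_queue = (
        events
        .withWatermark("last_activity_time", "10 minutes")
        .groupBy("uid")
        .agg(
            F.min("last_activity_time").alias("first_seen"),
            F.max("last_activity_time").alias("last_seen"),
        )
        # Casting a timestamp to long yields epoch seconds, so the
        # difference is the whole seconds spent in the queue (ms ignored).
        .withColumn(
            "seconds_in_queue",
            F.col("last_seen").cast("long") - F.col("first_seen").cast("long"),
        )
    )

For green.th above, first_seen 13:04:04 and last_seen 13:04:17 give 13 seconds.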

Stream to Stream joins

2020-08-24 Thread Hamish Whittal
Hi folks, I've got a stream coming from Kafka. It has the following schema: userdata : { id: INT, acctid: INT, uid: STRING, logintm: datetime } I'm trying to count the number of logins by acctid. I can do the count fine, but the table only has the acctid and the count. Now I wish to get all
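
One hedged way to keep the other columns, given that Spark restricts joins after a streaming aggregation: carry them through the aggregate itself rather than joining the counts back. logins is assumed to be the parsed stream; names are placeholders:

    from pyspark.sql import functions as F

    counts = (
        logins
        .withWatermark("logintm", "10 minutes")
        .groupBy("acctid")
        .agg(
            F.count("*").alias("login_count"),
            F.last("uid").alias("uid"),          # carry the extra columns along
            F.max("logintm").alias("last_login"),
        )
    )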

Spark Streaming with Kafka and Python

2020-08-12 Thread Hamish Whittal
Hi folks, Thought I would ask here because it's somewhat confusing. I'm using Spark 2.4.5 on EMR 5.30.1 with Amazon MSK. The version of Scala used is 2.11.12. I'm using this version of the libraries spark-streaming-kafka-0-8_2.11-2.4.5.jar Now I'm wanting to read from Kafka topics using Python
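
For comparison, the Structured Streaming route works from Python without the 0-8 DStream jar (the 0-8 API was deprecated); this assumes the spark-sql-kafka-0-10 package is on the classpath, and the broker and topic names are placeholders:

    from pyspark.sql import SparkSession

    # Submit with, e.g.:
    #   spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 app.py
    spark = SparkSession.builder.appName("kafka-read").getOrCreate()

    df = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")  # MSK bootstrap servers
        .option("subscribe", "my-topic")
        .load()
    )

    # Kafka keys/values arrive as binary; cast before parsing.
    messages = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    messages.writeStream.format("console").start().awaitTermination()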

Re: MySQL query continually adds IS NOT NULL onto a query even though I don't request it

2020-04-02 Thread Hamish Whittal
e in lockdown like me!) Hamish On Wed, Apr 1, 2020 at 7:47 AM Hamish Whittal wrote: Hi folks, 1) First Problem: I'm querying MySQL. I submit a query like this: out = wam.select('message_id', 'business_id', 'info', 'entered_system_date', 'auto_update_time').filte

MySQL query continually adds IS NOT NULL onto a query even though I don't request it

2020-03-31 Thread Hamish Whittal
Hi folks, 1) First Problem: I'm querying MySQL. I submit a query like this: out = wam.select('message_id', 'business_id', 'info', 'entered_system_date', 'auto_update_time').filter("auto_update_time >= '2020-04-01 05:27'").dropDuplicates(['message_id', 'auto_update_time']) But what I see in the
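
The short explanation is that Catalyst infers an IsNotNull constraint for any column used in a pushed-down filter, so the JDBC source sends IS NOT NULL to MySQL alongside the requested predicate. A sketch of confirming that, assuming wam was read via spark.read.jdbc:

    out = (
        wam.select('message_id', 'business_id', 'info',
                   'entered_system_date', 'auto_update_time')
           .filter("auto_update_time >= '2020-04-01 05:27'")
           .dropDuplicates(['message_id', 'auto_update_time'])
    )

    # The PushedFilters entry in the physical plan lists the inferred
    # IsNotNull predicates next to the GreaterThanOrEqual filter.
    out.explain(True)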

Re: Still incompatible schemas

2020-03-09 Thread Hamish Whittal
"s3://path/file") df.show(3, false) // this displays the results. On Mon, 9

Still incompatible schemas

2020-03-09 Thread Hamish Whittal
Hi folks, Thanks for the help thus far. I'm trying to track down the source of this error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary when doing a message.show() Basically I'm reading in a single Parquet file (to try to narrow
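
This particular exception often comes from a mismatch between a file's Parquet encoding/types and what the vectorized reader expects. A commonly suggested workaround (an assumption here, not a confirmed fix for this exact file) is to fall back to the non-vectorized reader and compare schemas:

    # Falls back to the slower row-based parquet-mr reader, which
    # tolerates some encodings the vectorized reader rejects.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    # Placeholder path for the single file being narrowed down.
    message = spark.read.parquet("s3://bucket/part-00000.parquet")
    message.printSchema()  # compare against the schema the job expects
    message.show()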

Re:

2020-03-02 Thread Hamish Whittal
} finally { reader.close() } } .toDF("schema name", "fields") .show(false) .binaryFiles provides you all filenames that match the given pattern as an RDD, so the following .map is executed on the Spark executors. The
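
A Python sketch of the same idea for a first pass, reading each file's schema on the driver rather than on the executors as the Scala snippet does (slower, but simple to run); the glob pattern is a placeholder:

    from collections import defaultdict

    files = spark.sparkContext.binaryFiles("hdfs:///data/*.parquet").keys().collect()

    # Group files by their schema so the odd ones out stand out at a glance.
    by_schema = defaultdict(list)
    for path in files:
        by_schema[spark.read.parquet(path).schema.simpleString()].append(path)

    for schema_str, paths in by_schema.items():
        print(len(paths), "file(s) with schema:", schema_str)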

[no subject]

2020-03-01 Thread Hamish Whittal
Hi there, I have an hdfs directory with thousands of files. It seems that some of them - and I don't know which ones - have a problem with their schema and it's causing my Spark application to fail with this error: Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column