Re: Record count query parallel processing in Databricks Spark Delta Lake

2020-01-20 Thread anbutech
Thank you so much for the help, Farhan. Please help me with the design approach for this problem: what is the best way to structure this code to get better results? I have some clarification questions on the code. I want to take a daily record count of the ingestion source vs. the Databricks Delta Lake table vs.
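
The archived reply is truncated, but the daily comparison it asks about can be illustrated. A minimal sketch, assuming a per-topic S3 layout, a Delta table at s3://my-bucket/delta/topic_a, and a partition column named ingest_date (all hypothetical names, not from the thread):

# Compare the raw source count against the Delta table count for one day.
# Paths, table location, and the "ingest_date" column are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-count-check").getOrCreate()

# Count the raw JSON landed for one day of one topic.
source_count = spark.read.json("s3://my-bucket/topic-a/2020/01/19/*").count()

# Count the same day's rows in the Delta Lake table.
delta_count = (
    spark.read.format("delta")
    .load("s3://my-bucket/delta/topic_a")
    .where("ingest_date = '2020-01-19'")
    .count()
)

print(source_count, delta_count, source_count == delta_count)
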

Re: Record count query parallel processing in Databricks Spark Delta Lake

2020-01-19 Thread Farhan Misarwala
Hi Anbutech, If I am not mistaken, you are trying to read multiple DataFrames from around 150 different paths (in your case, the Kafka topics) to count their records. You have all these paths stored in a CSV with columns year, month, day, and hour. Here is what I came up with; I have
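
Farhan's code is cut off in the archive. A minimal sketch of the approach he describes, assuming the CSV also carries a topic column and a hypothetical S3 path layout; driver-side threads let Spark's scheduler run the ~150 count jobs concurrently across the cluster:

# Count records under ~150 paths in parallel. The CSV location, the extra
# "topic" column, and the path layout are assumptions; the archived message
# is truncated before the actual code.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("record-counts").getOrCreate()

# One row per partition to count: topic, year, month, day, hour.
paths_df = spark.read.option("header", "true").csv("s3://my-bucket/paths.csv")
rows = paths_df.collect()

def count_path(row):
    # Each thread submits an independent Spark job from the driver.
    path = f"s3://my-bucket/{row.topic}/{row.year}/{row.month}/{row.day}/{row.hour}"
    return path, spark.read.json(path).count()

with ThreadPoolExecutor(max_workers=16) as pool:
    for path, n in pool.map(count_path, rows):
        print(path, n)
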

Record count query parallel processing in Databricks Spark Delta Lake

2020-01-17 Thread anbutech
Hi, I have a question on the design of a monitoring PySpark script for a large volume of source JSON data coming from more than 100 Kafka topics. These topics are stored under separate buckets in AWS S3. Each of the Kafka topics holds multiple terabytes of JSON data with respect to the
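
The original question is truncated, but at terabyte scale one detail matters for any counting script: pass spark.read.json an explicit schema so Spark does not scan the files just to infer one. A minimal sketch, with a hypothetical bucket layout and field names not taken from the thread:

# Read one topic's daily JSON with an explicit schema to skip schema
# inference. Bucket path and fields are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("payload", StringType()),
    StructField("event_ts", TimestampType()),
])

df = spark.read.schema(schema).json("s3://topic-a-bucket/2020/01/17/*")
print(df.count())
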