The context is different here.
The file paths are coming as messages on a Kafka topic.
Spark Structured Streaming consumes from this topic.
It then has to get the value out of each message, i.e. the path to the file,
and read the JSON stored at that file location into another DataFrame.
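In case it helps, here is a minimal sketch of one way to wire this up, assuming
Spark 2.4+ (for foreachBatch); the broker address, topic name, and column names
below are hypothetical, and each Kafka record's value is taken to be a plain-text
HDFS path:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("paths-from-kafka").getOrCreate()
import spark.implicits._

// Each Kafka record's value is assumed to be the HDFS path of a JSON file.
val paths = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
  .option("subscribe", "file-paths")                   // hypothetical topic
  .load()
  .select(col("value").cast("string").as("path"))

// For each micro-batch, collect the (small) list of paths on the driver and
// read the JSON files they point to into a regular DataFrame.
val processBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
  val filePaths = batch.as[String].collect()
  if (filePaths.nonEmpty) {
    val jsonDf = spark.read.json(filePaths: _*)
    jsonDf.show()   // replace with the real processing
  }
}

val query = paths.writeStream.foreachBatch(processBatch).start()
query.awaitTermination()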
Thanks
Deepak
Hi All,
Just wanted to check back regarding the best way to perform deduplication. Is
using dropDuplicates the optimal way to get rid of duplicates, or would it be
better to run the operation on the RDD directly?
Also, what if we want to keep the last value of the group while
performing deduplication?
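For what it's worth, here is a minimal sketch of both options on a toy DataFrame
with hypothetical columns key, ts, and value; the window-function route is one
common way to keep the last row per group:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", 1, "old"), ("a", 2, "new"), ("b", 5, "only")
).toDF("key", "ts", "value")

// 1) dropDuplicates keeps an arbitrary row per key -- fine when any row will do.
val anyRow = df.dropDuplicates("key")

// 2) To keep the *last* value per group, order each group by ts descending and
//    keep only the first-ranked row.
val w = Window.partitionBy($"key").orderBy($"ts".desc)
val lastRow = df
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")

lastRow.show()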
Hi All,
I have a table with 3 TB of data, stored as Parquet with Snappy compression, about
100 columns. However, when I filter the DataFrame on the date column (date between
20190501 and 20190530), select only 20 columns, and count, the operation
takes about 45 minutes!
Shouldn't Parquet do the predicate pushdown?
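One thing worth checking is whether the filter actually reaches the Parquet scan.
A minimal sketch, assuming a hypothetical location /data/events and a date column
named dt stored as a string:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-check").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/data/events")               // hypothetical location
  .filter($"dt" >= "20190501" && $"dt" <= "20190530")
  .select("dt", "user_id", "amount")                      // only the needed columns

// The physical plan lists PushedFilters (and PartitionFilters when the table is
// partitioned by dt); if the filter does not appear there, pushdown is not happening.
df.explain(true)
println(df.count())

If the data is partitioned by the date column, the filter prunes whole directories;
otherwise Parquet can still skip row groups using its min/max statistics, provided
the filter is pushed down.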
Hi Deepak,
You can use textFileStream.
https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
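For reference, a minimal sketch of what textFileStream looks like with the DStream
API, assuming new files land in a hypothetical HDFS directory that the job monitors:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("text-file-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(30))

// Picks up every new text file that appears under this directory.
val lines = ssc.textFileStream("hdfs:///data/incoming")   // hypothetical path
lines.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}

ssc.start()
ssc.awaitTermination()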
Please also consider asking questions on Stack Overflow, so that other people can
benefit from the answers.
Regards,
Vaquar khan
On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote:
> I am using a Spark Streaming application to read from Kafka.
Depending on what accuracy is needed, hyperloglogs can be an interesting
alternative
https://en.m.wikipedia.org/wiki/HyperLogLog
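For what it's worth, Spark exposes this through approx_count_distinct, which is
backed by HyperLogLog++. A minimal sketch, with hypothetical column names dt and
user_id and a hypothetical input location:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("hll-sketch").getOrCreate()

val events = spark.read.parquet("/data/events")   // hypothetical location

// Roughly 1% relative error; trades a little accuracy for speed and memory.
val dailyActive = events
  .groupBy("dt")
  .agg(approx_count_distinct("user_id", 0.01).as("active_users"))

dailyActive.show()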
> On 09.06.2019 at 15:59, big data wrote:
>
> In my opinion, a bitmap is the best solution for calculating active users.
> Other solutions are mostly based on a count(distinct) calculation...
In my opinion, a bitmap is the best solution for calculating active users. Other
solutions are mostly based on a count(distinct) calculation, which is
slower.
If you've implemented a bitmap solution, including how to build the bitmaps and how
to load them, then a bitmap is the best choice.
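A minimal sketch of the idea using the RoaringBitmap library
(org.roaringbitmap:RoaringBitmap), assuming user ids fit in an Int; the ids below
are made up:

import org.roaringbitmap.RoaringBitmap

// Build one bitmap per day from that day's user ids.
val day1 = RoaringBitmap.bitmapOf(1, 2, 3, 42)
val day2 = RoaringBitmap.bitmapOf(2, 3, 99)

// Distinct active users for one day is just the cardinality of its bitmap.
println(s"day 1 active users: ${day1.getCardinality}")

// Active users across both days: OR the bitmaps, then take the cardinality.
val both = RoaringBitmap.or(day1, day2)
println(s"two-day active users: ${both.getCardinality}")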
I am using a Spark Streaming application to read from Kafka.
The value in each Kafka message is the path to an HDFS file.
I am using Spark 2.x, spark.read.stream.
What is the best way to read this path in Spark Streaming and then read the
JSON stored at that HDFS path, maybe using spark.read.json?