Re: Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
The context is different here. The file paths come as messages in a Kafka topic, and Spark (Structured) Streaming consumes from this topic. It then has to get the value from each message, i.e. the path to the file, and read the JSON stored at that file location into another DF. Thanks Deepak On Sun, Jun
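The pattern described here (paths arrive as Kafka messages, each pointing at a JSON file) is usually handled with `foreachBatch`: inside the handler each micro-batch is a static DataFrame, so a plain `spark.read.json` call is possible. A minimal sketch; the broker address, topic name, and output path are hypothetical:

```python
from typing import List


def paths_from_values(values: List[str]) -> List[str]:
    """Turn raw Kafka message payloads into a clean list of file paths,
    dropping empty or whitespace-only messages."""
    return [v.strip() for v in values if v and v.strip()]


def start_stream(spark):
    """Consume file paths from Kafka and load each referenced JSON file.
    Inside foreachBatch the batch is a *static* DataFrame, so calling
    spark.read.json() there is legal."""

    def handle_batch(batch_df, batch_id):
        rows = batch_df.selectExpr("CAST(value AS STRING) AS path").collect()
        paths = paths_from_values([r["path"] for r in rows])
        if paths:
            json_df = spark.read.json(paths)                  # static read per micro-batch
            json_df.write.mode("append").parquet("/tmp/out")  # hypothetical sink

    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
            .option("subscribe", "file-paths")                 # hypothetical topic
            .load()
            .writeStream
            .foreachBatch(handle_batch)
            .start())
```

Note that `foreachBatch` requires Spark 2.4+; on 2.3 a custom `ForeachWriter` would be needed instead.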

Re: High level explanation of dropDuplicates

2019-06-09 Thread Rishi Shah
Hi All, Just wanted to check back regarding the best way to perform deduplication. Is dropDuplicates the optimal way to get rid of duplicates? Would it be better if we ran the operations on the RDD directly? Also, what about when we want to keep the last value of the group while performing deduplication

[pyspark 2.3+] Querying non-partitioned 3TB data table is too slow

2019-06-09 Thread Rishi Shah
Hi All, I have a table with 3TB of data, stored as parquet with snappy compression and 100 columns. However, when I filter the DataFrame on the date column (date between 20190501-20190530), select only 20 columns, and count, the operation takes about 45 mins!! Shouldn't parquet do the predicate

Re: Read hdfs files in spark streaming

2019-06-09 Thread vaquar khan
Hi Deepak, You can use textFileStream. https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html Please start using Stack Overflow to ask questions, so that other people can also benefit from the answers. Regards, Vaquar khan On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote: > I am using spark
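For completeness: `textFileStream` belongs to the older DStream API and watches a directory for newly created files; it does not read file paths out of Kafka messages, so it fits a different ingestion pattern than the one described in this thread. A minimal sketch, with a made-up directory and batch interval:

```python
def start_file_stream():
    """Watch an HDFS directory and count the lines of each new file per
    batch. Note: this reads files dropped into the directory, not paths
    sent through Kafka."""
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "file-stream")
    ssc = StreamingContext(sc, batchDuration=10)     # 10 s micro-batches
    lines = ssc.textFileStream("hdfs:///incoming")   # hypothetical directory
    lines.count().pprint()
    ssc.start()
    ssc.awaitTermination()
```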

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread Jörn Franke
Depending on what accuracy is needed, HyperLogLogs can be an interesting alternative: https://en.m.wikipedia.org/wiki/HyperLogLog > On 09.06.2019 at 15:59, big data wrote: > > In my opinion, Bitmap is the best solution for active-user calculation. > Other solutions are almost all based on
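HyperLogLog is also what backs `approx_count_distinct` in Spark SQL. To make the idea concrete, here is a minimal pure-Python sketch of the estimator; the md5 hash and the standard bias constant are implementation choices for illustration, not anything Spark-specific:

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: m = 2^b registers, each holding the maximum
    'position of first set bit' seen among hashes routed to it."""

    def __init__(self, b=12):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m

    def add(self, item):
        # 64 deterministic hash bits from md5
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)                      # low b bits choose a register
        w = h >> self.b                             # remaining 64-b bits
        rank = (64 - self.b) - w.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant for m >= 128
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            est = self.m * math.log(self.m / zeros)  # small-range correction
        return int(round(est))
```

With b=12 (4096 registers, ~4 KB of state) the standard error is about 1.6%, which is the accuracy/memory trade-off Jörn is referring to: exact `count(distinct)` must shuffle every distinct value, while the sketch is tiny and mergeable.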

Re: [Pyspark 2.4] Best way to define activity within different time window

2019-06-09 Thread big data
In my opinion, Bitmap is the best solution for active-user calculation. Other solutions are almost all based on a count(distinct) calculation process, which is slower. If you've implemented a Bitmap solution, including how to build the Bitmap and how to load it, then Bitmap is the best choice. On
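The bitmap approach can be illustrated without Spark at all. In practice one would use a compressed structure such as RoaringBitmap, but a Python arbitrary-precision int works as a bitset for the idea; user ids are assumed to be small dense integers:

```python
class ActivityBitmap:
    """One bit per user per day; an arbitrary-precision int is the bitmap.
    Counting active users is a popcount; 'active every day' is bitwise AND;
    'active at least once' (e.g. MAU) is bitwise OR."""

    def __init__(self):
        self.days = {}  # date string -> int bitmap

    def mark_active(self, date, user_id):
        self.days[date] = self.days.get(date, 0) | (1 << user_id)

    @staticmethod
    def _popcount(bm):
        return bin(bm).count("1")

    def daily_active(self, date):
        return self._popcount(self.days.get(date, 0))

    def active_on_all(self, dates):
        bm = self.days.get(dates[0], 0)
        for d in dates[1:]:
            bm &= self.days.get(d, 0)   # intersection of active sets
        return self._popcount(bm)

    def active_on_any(self, dates):
        bm = 0
        for d in dates:
            bm |= self.days.get(d, 0)   # union of active sets
        return self._popcount(bm)
```

This is why the bitmap beats count(distinct) here: once the per-day bitmaps are built, any window (DAU, WAU, MAU, retention) is a handful of bitwise operations rather than a fresh distinct aggregation.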

Read hdfs files in spark streaming

2019-06-09 Thread Deepak Sharma
I am using a Spark Streaming application to read from Kafka. The value coming in each Kafka message is a path to an HDFS file. I am using Spark 2.x with spark.readStream. What is the best way to read this path in Spark Streaming and then read the JSON stored at the HDFS path, maybe using spark.read.json,
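The first half of this, extracting the path from each message, can be sketched as below; the broker, topic, and the assumption that valid payloads are absolute HDFS URIs ending in `.json` are all hypothetical. The second half, actually loading the referenced file, needs a static read per micro-batch (e.g. via `foreachBatch`), because `spark.read.json` takes paths, not a streaming column:

```python
def looks_like_json_path(p):
    """Sanity filter for message payloads (assumption: messages carry
    absolute HDFS URIs ending in .json)."""
    return bool(p) and p.startswith("hdfs://") and p.endswith(".json")


def path_stream(spark):
    """Streaming DataFrame with one column, 'path', holding the HDFS
    location carried in each Kafka message value, with malformed
    payloads filtered out."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType

    is_json_path = udf(looks_like_json_path, BooleanType())
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
            .option("subscribe", "hdfs-paths")                 # hypothetical topic
            .load()
            .select(col("value").cast("string").alias("path"))
            .where(is_json_path(col("path"))))
```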