The context is different here.
The file paths are coming as messages on a Kafka topic.
Spark Structured Streaming consumes from this topic.
It then has to get the value out of each message, i.e. the path to the file,
and read the JSON stored at that file location into another DataFrame.
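In case it helps, here is a minimal sketch of one way to wire this up, assuming
Spark 2.4+ (for foreachBatch); the broker address, topic name, and column names
below are hypothetical, and each Kafka record's value is taken to be a plain-text
HDFS path:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("paths-from-kafka").getOrCreate()
import spark.implicits._

// Each Kafka record's value is assumed to be the HDFS path of a JSON file.
val paths = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
  .option("subscribe", "file-paths")                   // hypothetical topic
  .load()
  .select(col("value").cast("string").as("path"))

// For each micro-batch, collect the (small) list of paths on the driver and
// read the JSON files they point to into a regular DataFrame.
val processBatch: (DataFrame, Long) => Unit = (batch, batchId) => {
  val filePaths = batch.as[String].collect()
  if (filePaths.nonEmpty) {
    val jsonDf = spark.read.json(filePaths: _*)
    jsonDf.show()   // replace with the real processing
  }
}

val query = paths.writeStream.foreachBatch(processBatch).start()
query.awaitTermination()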
Thanks
Deepak
Hi All,
Just wanted to check back regarding the best way to perform deduplication. Is
using dropDuplicates the optimal way to get rid of duplicates, or would it be
better to run the operation on the RDD directly?
Also, what if we want to keep the last value of the group while
performing deduplication?
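For what it's worth, here is a minimal sketch of both options on a toy DataFrame
with hypothetical columns key, ts, and value; the window-function route is one
common way to keep the last row per group:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", 1, "old"), ("a", 2, "new"), ("b", 5, "only")
).toDF("key", "ts", "value")

// 1) dropDuplicates keeps an arbitrary row per key -- fine when any row will do.
val anyRow = df.dropDuplicates("key")

// 2) To keep the *last* value per group, order each group by ts descending and
//    keep only the first-ranked row.
val w = Window.partitionBy($"key").orderBy($"ts".desc)
val lastRow = df
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 1)
  .drop("rn")

lastRow.show()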
Hi All,
I have a table with 3 TB of data, stored as Parquet with Snappy compression, about
100 columns. However, when I filter the DataFrame on the date column (date between
20190501 and 20190530), select only 20 columns, and count, the operation
takes about 45 minutes!
Shouldn't Parquet do the predicate pushdown?
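One thing worth checking is whether the filter actually reaches the Parquet scan.
A minimal sketch, assuming a hypothetical location /data/events and a date column
named dt stored as a string:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-check").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/data/events")               // hypothetical location
  .filter($"dt" >= "20190501" && $"dt" <= "20190530")
  .select("dt", "user_id", "amount")                      // only the needed columns

// The physical plan lists PushedFilters (and PartitionFilters when the table is
// partitioned by dt); if the filter does not appear there, pushdown is not happening.
df.explain(true)
println(df.count())

If the data is partitioned by the date column, the filter prunes whole directories;
otherwise Parquet can still skip row groups using its min/max statistics, provided
the filter is pushed down.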
Hi Deepak,
You can use textFileStream.
https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
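For reference, a minimal sketch of what textFileStream looks like with the DStream
API, assuming new files land in a hypothetical HDFS directory that the job monitors:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("text-file-stream-sketch")
val ssc = new StreamingContext(conf, Seconds(30))

// Picks up every new text file that appears under this directory.
val lines = ssc.textFileStream("hdfs:///data/incoming")   // hypothetical path
lines.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}

ssc.start()
ssc.awaitTermination()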
Please also consider asking questions on Stack Overflow, so that other people can
benefit from the answers.
Regards,
Vaquar khan
On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote:
> I am using a Spark Streaming application to read from Kafka.
Depending on what accuracy is needed, hyperloglogs can be an interesting
alternative
https://en.m.wikipedia.org/wiki/HyperLogLog
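For what it's worth, Spark exposes this through approx_count_distinct, which is
backed by HyperLogLog++. A minimal sketch, with hypothetical column names dt and
user_id and a hypothetical input location:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("hll-sketch").getOrCreate()

val events = spark.read.parquet("/data/events")   // hypothetical location

// Roughly 1% relative error; trades a little accuracy for speed and memory.
val dailyActive = events
  .groupBy("dt")
  .agg(approx_count_distinct("user_id", 0.01).as("active_users"))

dailyActive.show()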
> On 09.06.2019 at 15:59, big data wrote:
>
> In my opinion, a bitmap is the best solution for calculating active users.
> Other solutions are mostly based on a count(distinct) calculation...
In my opinion, a bitmap is the best solution for calculating active users. Other
solutions are mostly based on a count(distinct) calculation, which is
slower.
If you've implemented a bitmap solution, including how to build the bitmaps and how
to load them, then a bitmap is the best choice.
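A minimal sketch of the idea using the RoaringBitmap library
(org.roaringbitmap:RoaringBitmap), assuming user ids fit in an Int; the ids below
are made up:

import org.roaringbitmap.RoaringBitmap

// Build one bitmap per day from that day's user ids.
val day1 = RoaringBitmap.bitmapOf(1, 2, 3, 42)
val day2 = RoaringBitmap.bitmapOf(2, 3, 99)

// Distinct active users for one day is just the cardinality of its bitmap.
println(s"day 1 active users: ${day1.getCardinality}")

// Active users across both days: OR the bitmaps, then take the cardinality.
val both = RoaringBitmap.or(day1, day2)
println(s"two-day active users: ${both.getCardinality}")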
I am using a Spark Streaming application to read from Kafka.
The value in each Kafka message is the path to an HDFS file.
I am using Spark 2.x, spark.read.stream.
What is the best way to read this path in Spark Streaming and then read the
JSON stored at that HDFS path, maybe using spark.read.json?