Re: Read hdfs files in spark streaming
The context is different here. The file paths arrive as messages on a Kafka topic, and Spark Structured Streaming consumes from this topic. The application then has to extract the value from each message, which is the path to a file, and read the JSON stored at that location into another DataFrame.

Thanks
Deepak

On Sun, Jun 9, 2019 at 11:03 PM vaquar khan wrote:
> Hi Deepak,
>
> You can use textFileStream.
>
> https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
>
> Please start using Stack Overflow to ask questions so other people can benefit from the answers as well.
>
> Regards,
> Vaquar khan
>
> On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote:
>
>> I am using a Spark Streaming application to read from Kafka.
>> The value in each Kafka message is a path to an HDFS file.
>> I am using Spark 2.x, spark.readStream.
>> What is the best way to read this path in Spark Streaming and then read
>> the JSON stored at that HDFS path, maybe using spark.read.json, into a DataFrame
>> inside the Spark Streaming app?
>> Thanks a lot in advance
>>
>> --
>> Thanks
>> Deepak

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net
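For reference, a minimal sketch of the pattern described above, assuming Spark 2.4+ (foreachBatch is not available earlier) and hypothetical broker, topic, checkpoint, and output locations:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("paths-from-kafka").getOrCreate()

# The Kafka message value is assumed to be a plain HDFS path string.
paths = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
         .option("subscribe", "file-paths")                  # hypothetical topic
         .load()
         .select(F.col("value").cast("string").alias("path")))

def read_json_batch(batch_df, batch_id):
    # Collect the (small) set of paths in this micro-batch on the driver,
    # then read the referenced JSON files with the regular batch reader.
    file_paths = [r["path"] for r in batch_df.select("path").distinct().collect()]
    if file_paths:
        json_df = spark.read.json(file_paths)
        json_df.write.mode("append").parquet("/tmp/json_output")   # hypothetical sink

query = (paths.writeStream
         .foreachBatch(read_json_batch)
         .option("checkpointLocation", "/tmp/checkpoints/paths")   # hypothetical
         .start())
query.awaitTermination()

foreachBatch hands every micro-batch to ordinary batch code, so spark.read.json works unchanged; collecting the paths on the driver assumes each batch carries only a modest number of file paths.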
Re: High level explanation of dropDuplicates
Hi All,

Just wanted to check back regarding the best way to perform deduplication. Is dropDuplicates the optimal way to get rid of duplicates, or would it be better to run the operation on the underlying RDD directly? Also, what if we want to keep the last value of the group while deduplicating (based on some sorting criteria)?

Thanks,
Rishi

On Mon, May 20, 2019 at 3:33 PM Nicholas Hakobian <nicholas.hakob...@rallyhealth.com> wrote:
> From doing some searching around in the Spark codebase, I found the following:
>
> https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474
>
> So it appears there is no direct operation called dropDuplicates or Deduplicate; instead, there is an optimizer rule that converts this logical operation into a physical operation equivalent to grouping by all the columns you want to deduplicate across (or all columns if you are doing something like distinct) and taking the First() value. So (using a PySpark code example):
>
> df = input_df.dropDuplicates(['col1', 'col2'])
>
> is effectively shorthand for saying something like:
>
> df = input_df.groupBy('col1', 'col2').agg(first(struct(input_df.columns)).alias('data')).select('data.*')
>
> except I assume it has some internal optimization so it doesn't need to pack/unpack the column data and just returns the whole Row.
>
> Nicholas Szandor Hakobian, Ph.D.
> Principal Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com
>
> On Mon, May 20, 2019 at 11:38 AM Yeikel wrote:
>
>> Hi,
>>
>> I am looking for a high level explanation (overview) of how dropDuplicates [1] works.
>>
>> [1]
>> https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>>
>> Could someone please explain?
>>
>> Thank you
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Regards,
Rishi Shah
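On the follow-up about keeping the last row of each group: one common approach is a window with row_number over the grouping keys, ordered by the sorting criteria. A minimal sketch, with a hypothetical source path and an assumed updated_at column as the ordering key:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
input_df = spark.read.parquet("/data/input")   # hypothetical source

# Keep the "last" row per (col1, col2), where "last" is defined by updated_at;
# dropDuplicates alone gives no guarantee about which duplicate survives.
w = Window.partitionBy("col1", "col2").orderBy(F.col("updated_at").desc())

deduped = (input_df
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") == 1)
           .drop("rn"))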
[pyspark 2.3+] Querying non-partitioned 3TB data table is too slow
Hi All,

I have a table with 3TB of data, stored as Parquet with Snappy compression, about 100 columns. I am filtering the DataFrame on a date column (date between 20190501 and 20190530), selecting only 20 columns, and counting, and this operation takes about 45 minutes! Shouldn't Parquet do the predicate pushdown and filter without scanning the entire dataset?

--
Regards,
Rishi Shah
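Not an answer to the timing itself, but a sketch of the two levers that usually matter here, assuming hypothetical paths and a date column stored as a yyyyMMdd string: keep the filter directly on the column so it can be pushed into the Parquet reader, and consider rewriting the table partitioned by date so whole files can be pruned.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/warehouse/big_table")      # hypothetical path

# Pushdown only skips row groups whose min/max statistics rule out the predicate;
# on unsorted, non-partitioned data most row groups may still be read.
wanted_cols = ["col%d" % i for i in range(1, 21)]    # stand-ins for the 20 columns
count = (df
         .filter(F.col("date").between("20190501", "20190530"))
         .select(*wanted_cols)
         .count())

# Rewriting the table partitioned by date lets Spark prune whole directories
# instead of relying on row-group statistics alone:
# df.write.partitionBy("date").parquet("/warehouse/big_table_by_date")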
Re: Read hdfs files in spark streaming
Hi Deepak,

You can use textFileStream.

https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html

Please start using Stack Overflow to ask questions so other people can benefit from the answers as well.

Regards,
Vaquar khan

On Sun, Jun 9, 2019, 8:08 AM Deepak Sharma wrote:
> I am using a Spark Streaming application to read from Kafka.
> The value in each Kafka message is a path to an HDFS file.
> I am using Spark 2.x, spark.readStream.
> What is the best way to read this path in Spark Streaming and then read
> the JSON stored at that HDFS path, maybe using spark.read.json, into a DataFrame
> inside the Spark Streaming app?
> Thanks a lot in advance
>
> --
> Thanks
> Deepak
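For completeness, textFileStream belongs to the older DStream API and watches a directory for newly arriving files; it does not read file paths delivered through Kafka messages. A minimal sketch, with a hypothetical directory:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="text-file-stream")
ssc = StreamingContext(sc, batchDuration=10)

# Picks up new files that land in the monitored HDFS directory each batch interval.
lines = ssc.textFileStream("hdfs:///data/incoming")   # hypothetical directory
lines.pprint()

ssc.start()
ssc.awaitTermination()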
Re: [Pyspark 2.4] Best way to define activity within different time window
Depending on what accuracy is needed, HyperLogLogs can be an interesting alternative: https://en.m.wikipedia.org/wiki/HyperLogLog

> On 09.06.2019 at 15:59, big data wrote:
>
> In my opinion, a bitmap is the best solution for active-user calculation. Other solutions are mostly based on a count(distinct) calculation, which is slower.
>
> If you have implemented a bitmap solution, including how to build and load the bitmap, then a bitmap is the best choice.
>
>> On 2019/6/5 6:49 PM, Rishi Shah wrote:
>> Hi All,
>>
>> Is there a best practice around calculating daily, weekly, monthly, quarterly, and yearly active users?
>>
>> One approach is to create a window of daily bitmaps and aggregate them per period later. However, I was wondering if anyone has a better approach to tackling this problem.
>>
>> --
>> Regards,
>>
>> Rishi Shah
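Spark exposes HyperLogLog++ through approx_count_distinct, so approximate active-user counts need no extra library. A minimal sketch, with a hypothetical events table of user_id and event_date; rsd is the tolerated relative standard deviation:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
events = spark.read.parquet("/data/user_events")   # hypothetical: user_id, event_date

# Approximate monthly active users; rsd=0.02 trades ~2% error for speed and memory.
mau_approx = (events
              .withColumn("month", F.trunc("event_date", "month"))
              .groupBy("month")
              .agg(F.approx_count_distinct("user_id", rsd=0.02).alias("approx_mau")))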
Re: [Pyspark 2.4] Best way to define activity within different time window
In my opinion, a bitmap is the best solution for active-user calculation. Other solutions are mostly based on a count(distinct) calculation, which is slower.

If you have implemented a bitmap solution, including how to build and load the bitmap, then a bitmap is the best choice.

On 2019/6/5 6:49 PM, Rishi Shah wrote:
> Hi All,
>
> Is there a best practice around calculating daily, weekly, monthly, quarterly, and yearly active users?
>
> One approach is to create a window of daily bitmaps and aggregate them per period later. However, I was wondering if anyone has a better approach to tackling this problem.
>
> --
> Regards,
>
> Rishi Shah
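To illustrate the idea only (not a production design), here is a toy sketch that uses plain Python integers as per-day bitmaps after mapping users to dense bit positions; real implementations would use compressed structures such as Roaring bitmaps, and all column and path names below are hypothetical:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# One row per (user_id, event_date) on which the user was active.
daily = (spark.read.parquet("/data/daily_activity")   # hypothetical input
         .select("user_id", "event_date").distinct())

# Assign each user a dense integer index to serve as its bit position.
users = (daily.select("user_id").distinct()
         .withColumn("user_index", F.row_number().over(Window.orderBy("user_id")) - 1))
indexed = daily.join(users, "user_id")

# OR together the bit of every user active on a given day.
daily_bitmaps = (indexed.rdd
                 .map(lambda r: (str(r["event_date"]), 1 << r["user_index"]))
                 .reduceByKey(lambda a, b: a | b))

# Weekly active users = popcount of the OR of that week's daily bitmaps.
week = {"2019-06-03", "2019-06-04", "2019-06-05", "2019-06-06",
        "2019-06-07", "2019-06-08", "2019-06-09"}
wau = bin(daily_bitmaps
          .filter(lambda kv: kv[0] in week)
          .map(lambda kv: kv[1])
          .reduce(lambda a, b: a | b)).count("1")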
Read hdfs files in spark streaming
I am using a Spark Streaming application to read from Kafka. The value in each Kafka message is a path to an HDFS file. I am using Spark 2.x, spark.readStream.

What is the best way to read this path in Spark Streaming and then read the JSON stored at that HDFS path, maybe using spark.read.json, into a DataFrame inside the Spark Streaming app?

Thanks a lot in advance

--
Thanks
Deepak