Re: Merge multiple different s3 logs using pyspark 2.4.3
Unless I am reading this wrong, this can be achieved with aws s3 sync:

aws s3 sync s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12

Thanks,
-Shraddha

On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta wrote:
> why s3a?
>
> On Thu, Jan 9, 2020 at 2:20 AM anbutech wrote:
>> Hello,
>>
>> version = spark 2.4.3
>>
>> I have 3 different sources of JSON log data with the same schema (same
>> column order) in the raw data. I want to add one new column,
>> "src_category", to each of the 3 sources to distinguish the source
>> category, then merge all 3 sources into a single dataframe for
>> processing. What is the best way to handle this case?
>>
>> df = spark.read.json(merged_3sourcesraw_data)
>>
>> Input:
>>
>> s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json
>>
>> Output:
>>
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
>>
>> Thanks
Re: [pyspark 2.4] maxrecordsperfile option
After digging in a bit more, it looks like maxRecordsPerFile does not provide
full parallelism as expected. Any thoughts on this would be really helpful.

On Sat, Nov 23, 2019 at 11:36 PM Rishi Shah wrote:
> Hi All,
>
> Version 2.2 introduced the maxRecordsPerFile option for writing data.
> Could someone help me understand the performance impact of using
> maxRecordsPerFile (a single pass at writing data with this option) vs
> repartitioning (a 2-stage process where we write the data down and then
> consolidate later)?
>
> --
> Regards,
> Rishi Shah
Re: [pyspark 2.3+] repartition followed by window function
Any suggestions?

On Wed, May 22, 2019 at 6:32 AM Rishi Shah wrote:
> Hi All,
>
> If a dataframe is repartitioned in memory by (date, id) columns and then I
> use multiple window functions whose partition-by clause uses the same
> (date, id) columns, we can avoid the shuffle/sort again, I believe. Can
> someone confirm this?
>
> However, what happens when the dataframe repartition was done using
> (date, id) columns, but the window function that follows the repartition
> needs a partition-by clause with (date, id, col3, col4) columns? Would
> Spark reshuffle the data, or would it know to utilize the initially
> partitioned/shuffled data by date/id (as date & id are the common
> partition keys)?
>
> --
> Regards,
> Rishi Shah
Re: Use derived column for other derived column in the same statement
Also, the same question for a groupBy/agg operation: how can we use one
aggregated result (say min(amount)) to derive another aggregated column?

On Sun, Apr 21, 2019 at 11:24 PM Rishi Shah wrote:
> Hello All,
>
> How can we use a derived column1 for deriving another column in the same
> dataframe operation statement?
>
> Something like:
>
> df = df.withColumn('derived1', lit('something')) \
>        .withColumn('derived2', col('derived1') == 'something')
>
> --
> Regards,
> Rishi Shah