Re: Merge multiple different s3 logs using pyspark 2.4.3

2020-01-09 Thread Shraddha Shah
Unless I am reading this wrong, this can be achieved with aws s3 sync:

aws s3 sync \
  s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 \
  s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12
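
If you also need src_category as an actual column in the merged dataframe
(as the question below asks), a minimal PySpark sketch would be something
like the following. Note this is only a sketch: the source-to-category
mapping is a guess, and a SparkSession named spark is assumed.

from functools import reduce
from pyspark.sql import functions as F

# source paths from the original mail; which category each source maps to
# is an assumption for illustration only
sources = {
    "other":       "s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/",
    "windows-new": "s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/",
    "windows":     "s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/",
}

# read each source, tag it with its category, then union them (schemas match)
dfs = [spark.read.json(path).withColumn("src_category", F.lit(cat))
       for cat, path in sources.items()]
merged = reduce(lambda a, b: a.unionByName(b), dfs)

# write back out partitioned by the new column, which produces the
# .../processed/y=2019/m=12/d=12/src_category=<...> layout from the mail
(merged.write
    .partitionBy("src_category")
    .json("s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/"))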

Thanks,
-Shraddha



On Thu, Jan 9, 2020 at 7:05 AM Gourav Sengupta wrote:

> Why s3a?
>
> On Thu, Jan 9, 2020 at 2:20 AM anbutech  wrote:
>
>> Hello,
>>
>> version = spark 2.4.3
>>
>> I have 3 different sources of JSON log data that share the same schema
>> (same column order) in the raw data. I want to add one new column,
>> "src_category", to each of the 3 sources to distinguish the source
>> category, and then merge all 3 sources into a single dataframe so the JSON
>> data can be read for processing. What is the best way to handle this case?
>>
>> df = spark.read.json(merged_3sourcesraw_data)
>>
>> Input:
>>
>> s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
>> s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json
>>
>> Output:
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
>> s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
>>
>>
>> Thanks
>>


Re: [pyspark 2.4] maxrecordsperfile option

2019-11-30 Thread Shraddha Shah
After digging into this a bit more, it looks like maxRecordsPerFile does not
provide full parallelism as expected. Any thoughts on this would be really
helpful.
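
For anyone comparing, the two write paths in question look roughly like this
(a sketch only -- df is assumed to be an existing dataframe with a date
column, and the paths, row limit, and partition count are placeholders;
option B is just one common form of the repartition alternative):

# Option A: single write pass, letting Spark start a new output file
# every N rows within each write task
(df.write
   .option("maxRecordsPerFile", 1000000)
   .partitionBy("date")
   .parquet("s3a://my-bucket/out-a"))

# Option B: explicit repartition before the write, so the number of write
# tasks (and files) is controlled up front
(df.repartition(200, "date")
   .write
   .partitionBy("date")
   .parquet("s3a://my-bucket/out-b"))

With option A the parallelism of the write stage stays whatever the upstream
plan produced, since maxRecordsPerFile only splits files within a task --
which is presumably why it does not give the full parallelism mentioned above.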

On Sat, Nov 23, 2019 at 11:36 PM Rishi Shah wrote:

> Hi All,
>
> Version 2.2 introduced the maxRecordsPerFile option for writing data. Could
> someone help me understand the performance impact of using maxRecordsPerFile
> (a single pass at writing the data with this option) vs. repartitioning (a
> two-stage process where we write the data down and then consolidate it
> later)?
>
> --
> Regards,
>
> Rishi Shah
>


Re: [pyspark 2.3+] repartition followed by window function

2019-05-22 Thread Shraddha Shah
Any suggestions?
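
For reference, the pattern from the mail below looks roughly like this (a
sketch only -- df is assumed to be an existing dataframe, ts and amount are
placeholder columns, and everything else is named as in the question):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# pre-shuffle by the keys the windows will partition on
df2 = df.repartition("date", "id")

# window keyed by the same (date, id) -- expected to reuse that shuffle
w1 = Window.partitionBy("date", "id").orderBy("ts")
df2 = df2.withColumn("rn", F.row_number().over(w1))

# window keyed by (date, id, col3, col4) -- the open question is whether
# this forces another exchange
w2 = Window.partitionBy("date", "id", "col3", "col4")
df2 = df2.withColumn("min_amt", F.min("amount").over(w2))

Running df2.explain() and checking whether an extra Exchange shows up before
the second Window node is one way to see what Spark actually does here.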

On Wed, May 22, 2019 at 6:32 AM Rishi Shah  wrote:

> Hi All,
>
> If the dataframe is repartitioned in memory by the (date, id) columns and I
> then use multiple window functions whose partition-by clause uses the same
> (date, id) columns, we can avoid shuffling/sorting again, I believe. Can
> someone confirm this?
>
> However, what happens when the dataframe repartition was done using the
> (date, id) columns, but the window function that follows the repartition
> needs a partition-by clause with (date, id, col3, col4) columns? Would Spark
> reshuffle the data, or would it know to utilize the data already
> partitioned/shuffled by date/id (since date & id are the common partition
> keys)?
>
> --
> Regards,
>
> Rishi Shah
>


Re: Use derived column for other derived column in the same statement

2019-04-21 Thread Shraddha Shah
Also, the same question for a groupby/agg operation: how can we use one
aggregated result (say min(amount)) to derive another aggregated column?
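
For concreteness, a minimal sketch of both cases (the grouping key and any
column names other than amount are placeholders); on the agg side the usual
workaround is simply to repeat the aggregate expression:

from pyspark.sql import functions as F

# 1) derived column feeding another derived column: chaining withColumn works,
#    because 'derived1' already exists on the intermediate dataframe
df = (df.withColumn("derived1", F.lit("something"))
        .withColumn("derived2", F.col("derived1") == "something"))

# 2) groupBy/agg: an alias defined in agg() cannot be referenced inside the
#    same agg(), but the aggregate expression itself can be repeated/composed
agg_df = (df.groupBy("id")
            .agg(F.min("amount").alias("min_amount"),
                 (F.max("amount") - F.min("amount")).alias("amount_range")))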

On Sun, Apr 21, 2019 at 11:24 PM Rishi Shah wrote:

> Hello All,
>
> How can we use a derived column (column1) to derive another column in the
> same dataframe operation statement?
>
> something like:
>
> df = (df.withColumn('derived1', lit('something'))
>          .withColumn('derived2', col('derived1') == 'something'))
>
> --
> Regards,
>
> Rishi Shah
>