Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Parsian, Mahmoud
Good idea. Will be useful. +1 From: ashok34...@yahoo.com.INVALID Date: Monday, March 18, 2024 at 6:36 AM To: user @spark, Spark dev list, Mich Talebzadeh Cc: Matei Zaharia Subject: Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

Question on writing a dataframe without metadata column names

2020-10-26 Thread Parsian, Mahmoud
Let’s say that I have a Spark dataframe with 3 columns: id, name, age. When I save it into HDFS/S3 using “partitionBy(id, name)”, it is saved as: /id=1/name=Alex/.parquet /id=2/name=Bob/.parquet If I do not want to include “id=” and “name=” in the directory structure, what should I do?
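Spark’s DataFrameWriter.partitionBy always writes Hive-style “column=value” directories and offers no option to drop the column-name prefix, so one workaround is to write each group to a hand-built path. A minimal spark-shell-style sketch (assuming `spark` is in scope; the base path /output and the tiny sample data are placeholders):

```scala
import spark.implicits._

val df = Seq((1, "Alex", 30), (2, "Bob", 25)).toDF("id", "name", "age")

// Collect the distinct partition keys (fine when their count is small).
val keys = df.select("id", "name").distinct.collect()

keys.foreach { row =>
  val (id, name) = (row.getInt(0), row.getString(1))
  df.filter($"id" === id && $"name" === name)
    .drop("id", "name")                 // already encoded in the path
    .write
    .mode("overwrite")
    .parquet(s"/output/$id/$name")      // "/1/Alex" instead of "/id=1/name=Alex"
}
```

Note that this launches one write job per distinct (id, name) pair, so it is only practical when the number of partition values is small.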

How to improve performance of saveAsTextFile()

2017-03-10 Thread Parsian, Mahmoud
How can I improve the performance of JavaRDD.saveAsTextFile(“hdfs://…“)? It is taking over 30 minutes on a cluster of 10 nodes running Spark on YARN. The JavaRDD has 120 million entries. Thank you, best regards, Mahmoud
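Two levers usually dominate saveAsTextFile time: the number of output partitions (each partition is one write task and one output file) and output compression. A spark-shell-style sketch (`sc` assumed in scope; the paths, partition count, and codec are illustrative, not tuned values):

```scala
import org.apache.hadoop.io.compress.GzipCodec

val rdd = sc.textFile("hdfs:///input")  // stand-in for the 120M-entry RDD

rdd
  .repartition(200)                     // aim for a few tasks per core across 10 nodes
  .saveAsTextFile("hdfs:///output", classOf[GzipCodec]) // compress: less I/O per file
```

One caveat: gzip output is not splittable, so if the files are read back by Spark or MapReduce, a splittable codec may be the better trade-off.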

RE: Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Parsian, Mahmoud
…Spark does not sort partitions after shuffle, so you have to sort by yourself. Thanks, Saisai. From: Parsian, Mahmoud [mailto:mpars...@illumina.com] Sent: Monday, June 30, 2014 11:08 AM To: user@spark.apache.org Subject: RE: Sorting Reduced/Grouped Values without Explicit Sorting Hi Jerry, Thank you for replying to my question…

RE: Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Parsian, Mahmoud
…without Explicit Sorting Hi Mahmoud, I think you cannot achieve this in the current Spark framework, because current Spark’s shuffle is hash-based, which is different from MapReduce’s sort-based shuffle, so you should implement the sorting explicitly using an RDD operator. Thanks, Jerry. From: Parsian, Mahmoud…
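A minimal sketch of the explicit sort this reply suggests: group by key, then sort each group’s values by time in memory (spark-shell style, `sc` assumed in scope; this assumes every key’s value list fits in executor memory):

```scala
val records = sc.parallelize(Seq(
  ("x", (2, 9)), ("x", (1, 3)), ("x", (3, 6)),
  ("y", (2, 5)), ("y", (1, 7)), ("y", (3, 1)),
  ("z", (3, 7)), ("z", (4, 0)), ("z", (1, 4)), ("z", (2, 8))
))

val grouped = records
  .groupByKey()                        // name -> Iterable[(time, value)]
  .mapValues(_.toSeq.sortBy(_._1))     // sort each group by time, in memory

grouped.collect().foreach { case (k, vs) => println(s"$k => ${vs.mkString(", ")}") }
```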

Sorting Reduced/Grouped Values without Explicit Sorting

2014-06-29 Thread Parsian, Mahmoud
Given the following time-series data (name, time, value): x,2,9 x,1,3 x,3,6 y,2,5 y,1,7 y,3,1 z,3,7 z,4,0 z,1,4 z,2,8 we want to generate the following, where the reduced/grouped values are sorted by time: x => [(1,3), (2,9), (3,6)] y => [(1,7), (2,5), (3,1)] z => [(1,4), (2,8), (3,7), (4,0)] One obvious…
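For reference, Spark 1.2 (released after this thread) added repartitionAndSortWithinPartitions, which enables a MapReduce-style secondary sort without buffering whole groups in memory. A sketch under that assumption, where the composite key NameTime and the NamePartitioner are illustrative names (spark-shell style, `sc` in scope):

```scala
import org.apache.spark.Partitioner

// Composite key: partition by name only, but sort by (name, time).
case class NameTime(name: String, time: Int)

class NamePartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case NameTime(name, _) => math.abs(name.hashCode % partitions)
  }
}

// Ordering used during the shuffle: by name, then time.
implicit val nameTimeOrdering: Ordering[NameTime] =
  Ordering.by(k => (k.name, k.time))

val data = sc.parallelize(Seq(
  ("x", 2, 9), ("x", 1, 3), ("x", 3, 6),
  ("z", 3, 7), ("z", 4, 0), ("z", 1, 4), ("z", 2, 8)
))

val sorted = data
  .map { case (name, time, value) => (NameTime(name, time), value) }
  .repartitionAndSortWithinPartitions(new NamePartitioner(2))

// Within each partition, all records for a given name now arrive in time order.
sorted.mapPartitions(it => Iterator(it.toList)).collect().foreach(println)
```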

Re: question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Parsian, Mahmoud
…effectively an input split. Ameet On Mon, Apr 28, 2014 at 9:22 PM, Parsian, Mahmoud <mpars...@illumina.com> wrote: In classic MapReduce/Hadoop, you may optionally define setup() and cleanup() methods. They (setup() and cleanup()) are called for each task, so if you have 20 mappers running…

question on setup() and cleanup() methods for map() and reduce()

2014-04-28 Thread Parsian, Mahmoud
In classic MapReduce/Hadoop, you may optionally define setup() and cleanup() methods. They (setup() and cleanup()) are called for each task, so if you have 20 mappers running, setup()/cleanup() will be called for each one. What is the equivalent of these in Spark? Thanks, best regards, Mahmoud
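The closest Spark analogue is mapPartitions: code run before the partition’s iterator is consumed plays the role of setup(), and code run after it is consumed plays the role of cleanup(), once per partition (i.e., per task). A spark-shell-style sketch (`sc` in scope; Resource is a hypothetical stand-in for a per-task object such as a DB client or parser):

```scala
// Hypothetical per-partition resource standing in for a DB client, parser, etc.
class Resource {
  def use(x: Int): Int = x * 2
  def close(): Unit = ()
}

val result = sc.parallelize(1 to 100, 4).mapPartitions { iter =>
  val res = new Resource()            // setup(): once per partition/task
  val out = iter.map(res.use).toList  // process all records eagerly
  res.close()                         // cleanup(): once per partition/task
  out.iterator
}

result.take(5).foreach(println)
```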