Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Oh, sure, that was the reason. You can keep using `foreachPartition` and get the partition ID from the `TaskContext`: scala> import org.apache.spark.TaskContext scala> myRDD.foreachPartition( e => { println(TaskContext.getPartitionId + ":" +
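A minimal sketch of what the full snippet presumably looked like (the archive cuts it off after `":" +`; the lambda body's tail is a guess, and myRDD is the RDD from the original question):

    scala> import org.apache.spark.TaskContext
    import org.apache.spark.TaskContext

    // Reconstructed tail: print each partition's ID together with its
    // contents. This runs on executor threads, so lines may interleave,
    // but each line carries its own partition ID.
    scala> myRDD.foreachPartition( e => { println(TaskContext.getPartitionId + ":" + e.mkString(",")) } )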

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
That's a very good idea, thanks for sharing, German! On Tue, Mar 16, 2021 at 7:08 PM German Schiavon wrote: > Hi all, > > I guess you could do something like this too: > > [image: Captura de pantalla 2021-03-16 a las 14.35.46.png] > > On Tue, 16 Mar 2021 at 13:31, Renganathan Mutthiah < >

Re: How default partitioning in spark is deployed

2021-03-16 Thread German Schiavon
Hi all, I guess you could do something like this too: [image: Captura de pantalla 2021-03-16 a las 14.35.46.png] On Tue, 16 Mar 2021 at 13:31, Renganathan Mutthiah wrote: > Hi Attila, > > Thanks for looking into this! > > I actually found the issue and it turned out to be that the print >

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Attila, Thanks for looking into this! I actually found the issue, and it turned out to be that the print statements misled me. The records are indeed stored in different partitions. What happened is that, since the foreachPartition method is run in parallel by different threads, they all printed the
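A way to verify where records actually live without relying on interleaved executor prints is to tag each record with its partition index and collect to the driver, where output is sequential (a sketch; myRDD is the RDD from the original post):

    // Pair every record with its partition index on the executors, then
    // collect and print on the driver so threads cannot interleave output.
    myRDD.mapPartitionsWithIndex((idx, it) => it.map(a => (idx, a)))
      .collect()
      .foreach(println)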

Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
Hi! This is weird. The code of foreachPartition leads to ParallelCollectionRDD

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Hi Mich, Thanks for taking the time to look into my query. Yes, when we increase the number of objects, all partitions start having data. I actually tried to understand what happens in my particular case. Thanks! On Tue, Mar 16, 2021 at 2:10 PM Mich Talebzadeh wrote: > Hi, > > Well as

Re: How default partitioning in spark is deployed

2021-03-16 Thread Mich Talebzadeh
Hi, Well, as it appears, you have 5 entries in your data and 12 cores. The theory is that you run multiple tasks in parallel across multiple cores on a desktop, which applies to your case. There are not enough statistics to give a meaningful interpretation of why Spark decided to put all data in one

How default partitioning in spark is deployed

2021-03-15 Thread Renganathan Mutthiah
Hi, I have a question with respect to default partitioning in RDD. case class Animal(id:Int, name:String) val myRDD = session.sparkContext.parallelize( (Array( Animal(1, "Lion"), Animal(2,"Elephant"), Animal(3,"Jaguar"), Animal(4,"Tiger"), Animal(5, "Chetah") ) )) Console println
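A runnable reconstruction of the truncated example (everything after "Console println" is an assumption; session is taken to be a SparkSession, and the thread mentions a 12-core desktop):

    case class Animal(id: Int, name: String)

    val myRDD = session.sparkContext.parallelize(Array(
      Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
      Animal(4, "Tiger"), Animal(5, "Chetah")))

    // With 12 cores the default parallelism is 12, so the 5 records are
    // spread across 5 of the 12 partitions and the other 7 stay empty.
    Console println ("Number of partitions: " + myRDD.getNumPartitions)
    myRDD.foreachPartition(it => println(it.mkString(",")))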

Partitioning in spark while reading from RDBMS via JDBC

2017-03-31 Thread Devender Yadav
Hi All, I am running Spark in cluster mode and reading data from an RDBMS via JDBC. As per the Spark docs, these partitioning parameters describe how to partition the table when reading in parallel from multiple
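The parameters in question are partitionColumn, lowerBound, upperBound and numPartitions. A sketch of a partitioned JDBC read (URL, table and credentials are placeholders; spark is assumed to be a SparkSession):

    // Spark opens numPartitions parallel JDBC connections; each one reads
    // a stride of the numeric partitionColumn range [lowerBound, upperBound].
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")  // placeholder
      .option("dbtable", "schema.tablename")            // placeholder
      .option("user", "user")
      .option("password", "password")
      .option("partitionColumn", "id")                  // must be a numeric column
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "10")
      .load()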

Re: Partitioning in spark

2016-06-24 Thread Darshan Singh
Thanks, but the whole point is not to set it explicitly; it should be derived from its parent RDDs. Thanks On Fri, Jun 24, 2016 at 6:09 AM, ayan guha wrote: > You can change parallelism as follows: > > conf = SparkConf() > conf.set('spark.sql.shuffle.partitions',10)

Re: Partitioning in spark

2016-06-23 Thread ayan guha
You can change parallelism as follows: conf = SparkConf() conf.set('spark.sql.shuffle.partitions',10) sc = SparkContext(conf=conf) On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh wrote: > Hi, > > My default parallelism is 100. Now I join 2 dataframes with 20
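The snippet above is PySpark; a rough Scala equivalent for that era (sqlContext is assumed to be an existing SQLContext, and the join columns are hypothetical):

    // spark.sql.shuffle.partitions controls how many partitions every
    // DataFrame shuffle (join, aggregation) produces; the default is 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "20")

    val joined = df1.join(df2, Seq("a", "b", "c", "d"))  // hypothetical columns
    println(joined.rdd.partitions.length)                // 20 after the setting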

Partitioning in spark

2016-06-23 Thread Darshan Singh
Hi, My default parallelism is 100. Now I join 2 dataframes with 20 partitions each; the joined dataframe has 100 partitions. I want to know the way to keep it at 20 (other than repartition and coalesce). Also, when I join these 2 dataframes I am using 4 columns as join columns. The dataframes

Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
My current version of Spark is 1.3.0 and my question is the following: I have large data frames where the main field is a user id. I need to do many group-bys and joins using that field. Will the performance increase if, before doing any group by or join operation, I first convert to rdd to

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Cesar Flores
Thanks, Michael, for your input. By 1) do you mean: - Caching the partitioned_rdd - Caching the partitioned_df - Or just caching unpartitioned_df, without the need of creating the partitioned_rdd variable? Can you expand a little bit more on 2)? Thanks! On Wed, Oct 14, 2015 at 12:11

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
This won't help, for two reasons: 1) It's all still just creating lineage, since you aren't caching the partitioned data. It will still fetch the shuffled blocks for each query. 2) The query optimizer is not aware of RDD-level partitioning, since it's mostly a black box. 1) could be fixed by

Re: Question about data frame partitioning in Spark 1.3.0

2015-10-14 Thread Michael Armbrust
Caching the partitioned_df <- this one, but you have to do the partitioning using something like sql("SELECT * FROM ... CLUSTER BY a") as there is no such operation exposed on dataframes. 2) Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-5354
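Putting the two answers together, a sketch using the variable names from earlier in the thread (the table and column names are hypothetical; Spark 1.3-era SQLContext API):

    // Repartition by the join/group-by key with CLUSTER BY, then cache the
    // *partitioned* result so later queries reuse it instead of re-fetching
    // shuffled blocks (point 1 in the previous message).
    unpartitioned_df.registerTempTable("users")
    val partitioned_df = sqlContext.sql("SELECT * FROM users CLUSTER BY user_id")
    partitioned_df.cache()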

Re: Partitioning in spark streaming

2015-08-12 Thread Tathagata Das
to repartition based on a key, a shuffle is needed. On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How does partitioning in spark work when it comes to streaming? What's the best way to partition a time series data grouped by a certain tag like categories of product video
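A sketch of that key-based repartitioning inside a stream (the input DStream and its key are hypothetical; partitionBy is the pair-RDD API, hence the shuffle):

    import org.apache.spark.HashPartitioner

    // categoryStream: DStream[(String, String)] of (category, event) pairs,
    // a hypothetical input. Each batch interval yields one RDD; partitioning
    // it by key incurs the shuffle mentioned above but co-locates each
    // category (e.g. video, music) in its own partition.
    val partitioned = categoryStream.transform(rdd =>
      rdd.partitionBy(new HashPartitioner(8)))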

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
. Best Ayan On Wed, Aug 12, 2015 at 9:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How does partitioning in spark work when it comes to streaming? What's the best way to partition a time series data grouped by a certain tag like categories of product video, music etc. -- Best

Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
How does partitioning in spark work when it comes to streaming? What's the best way to partition a time series data grouped by a certain tag like categories of product video, music etc.

Re: Partitioning in spark streaming

2015-08-11 Thread ayan guha
partition scheme should not skew data into one partition too much. Best Ayan On Wed, Aug 12, 2015 at 9:06 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How does partitioning in spark work when it comes to streaming? What's the best way to partition a time series data grouped by a certain tag like

Re: Partitioning in spark streaming

2015-08-11 Thread Hemant Bhanawat
partitioning in spark work when it comes to streaming? What's the best way to partition a time series data grouped by a certain tag like categories of product video, music etc.

Re: Partitioning in spark streaming

2015-08-11 Thread Mohit Anchlia
during the batchInterval are partitions of the RDD. Now if you want to repartition based on a key, a shuffle is needed. On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia mohitanch...@gmail.com wrote: How does partitioning in spark work when it comes to streaming? What's the best way to partition

Partitioning under Spark 1.0.x

2014-08-19 Thread losmi83
in the meantime? Or how to get corresponding partitions on one node? Thanks, Milos -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Partitioning-under-Spark-1-0-x-tp12375.html