S3 committer for dynamic partitioning

2024-03-05 Thread Nikhil Goyal
Hi folks, We have been following this doc for writing data from a Spark job to S3. However, it fails when writing to dynamic partitions. Any suggestions on what config should be used to avoid the cost of renaming in S3?
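
A minimal configuration sketch of the approach usually suggested for this: combine dynamic partition overwrite with one of the Hadoop S3A committers (from the spark-hadoop-cloud module) so commits do not rely on S3 renames. The property names below follow the Spark cloud-integration and Hadoop S3A committer docs; the bucket, path and column names are illustrative, and the exact combination should be checked against your Spark/Hadoop versions.

```scala
import org.apache.spark.sql.SparkSession

// Assumes spark-hadoop-cloud and hadoop-aws are on the classpath.
val spark = SparkSession.builder()
  .appName("s3-dynamic-partition-write")
  // only overwrite the partitions present in the incoming data
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  // S3A committer that writes task output straight under the partition paths
  .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
  // route Spark's commit protocol through the Hadoop path-output committers
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter")
  .getOrCreate()

import spark.implicits._
val df = Seq((1, "2024-03-01"), (2, "2024-03-02")).toDF("id", "dt")

df.write.mode("overwrite").partitionBy("dt").parquet("s3a://my-bucket/events")
```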

dataset partitioning algorithm implementation help

2021-12-23 Thread sam smith
Hello All, I am replicating a paper's algorithm for a partitioning approach to anonymizing datasets with Spark/Java, and would like some help reviewing my 150 lines of code. My GitHub repo, linked below, contains both my Java class and the related paper: https://github.com

Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
what happens when you paste the above method and call it like: scala> def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = { | (0 until numSlices).iterator.map { i => | val start = ((i * length) /

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
Animal(5,Chetah) scala> Console println myRDD.getNumPartitions 12 3) Can you please check in spark-shell what happens when you paste the above method and call it like: scala>

Re: How default partitioning in spark is deployed

2021-03-16 Thread German Schiavon
method and call it like: scala> def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = { | (0 until numSlices).iterator.map { i => | val start = ((i * length) / numSlices).toInt | val end = (((i + 1) * length

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
val end = (((i + 1) * length) / numSlices).toInt | (start, end) | } | } positions: (length: Long, numSlices: Int)Iterator[(Int, Int)] scala> positions(5, 12).foreach(println) (0,0) (0,0) (0,1) (1,1) (1,2

Re: How default partitioning in spark is deployed

2021-03-16 Thread Attila Zsolt Piros
foreach(println) (0,0) (0,0) (0,1) (1,1) (1,2) (2,2) (2,2) (2,3) (3,3) (3,4) (4,4) (4,5) As you can see, in my case the `positions` result is consistent with the `foreachPartition` output, and this should be deterministic. Best regards, Attila On Tue, Mar 16, 2021 at 5:34 AM Renganathan Mutthiah <renganath

Re: How default partitioning in spark is deployed

2021-03-16 Thread Renganathan Mutthiah
On Tue, 16 Mar 2021 at 04:35, Renganathan Mutthiah <renganatha...@gmail.com> wrote: Hi, I have a question with respect to default partitioning in RDD. case class Animal(id:

Re: How default partitioning in spark is deployed

2021-03-16 Thread Mich Talebzadeh
question with respect to default partitioning in RDD. case class Animal(id:Int, name:String) val myRDD = session.sparkContext.parallelize( (Array( Animal(1, "Lion"), Animal(2,"Elephant"), Animal(3,"Jaguar"), Animal(4,"Tiger"), Animal(5

How default partitioning in spark is deployed

2021-03-15 Thread Renganathan Mutthiah
Hi, I have a question with respect to default partitioning in RDD. *case class Animal(id:Int, name:String) val myRDD = session.sparkContext.parallelize( (Array( Animal(1, "Lion"), Animal(2,"Elephant"), Animal(3,"Jaguar"), Animal(4,"Tiger"),
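
The code discussed in this thread, reassembled into a runnable spark-shell sketch: `positions` mirrors the slicing arithmetic ParallelCollectionRDD uses for a parallelized collection, which is why 5 elements spread over 12 slices (the poster's default parallelism) leave several empty partitions.

```scala
case class Animal(id: Int, name: String)

val myRDD = spark.sparkContext.parallelize(Array(
  Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
  Animal(4, "Tiger"), Animal(5, "Chetah")))

println(myRDD.getNumPartitions)  // 12 in the thread's run (default parallelism)

// Slice i of a parallelized collection covers indices [start, end)
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
}

positions(5, 12).foreach(println)  // (0,0) (0,0) (0,1) (1,1) (1,2) (2,2) (2,2) (2,3) (3,3) (3,4) (4,4) (4,5)
```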

Re: Renaming a DataFrame column makes Spark lose partitioning information

2020-08-05 Thread Antoine Wendlinger
Well that's great ! Thank you very much :) Antoine On Tue, Aug 4, 2020 at 11:22 PM Terry Kim wrote: > This is fixed in Spark 3.0 by https://github.com/apache/spark/pull/26943: > > scala> :paste > // Entering paste mode (ctrl-D to finish) > > Seq((1, 2)) > .toDF("a", "b") >

Re: Renaming a DataFrame column makes Spark lose partitioning information

2020-08-04 Thread Terry Kim
This is fixed in Spark 3.0 by https://github.com/apache/spark/pull/26943: scala> :paste // Entering paste mode (ctrl-D to finish) Seq((1, 2)) .toDF("a", "b") .repartition($"b") .withColumnRenamed("b", "c") .repartition($"c") .explain() // Exiting paste mode, now

Renaming a DataFrame column makes Spark lose partitioning information

2020-08-04 Thread Antoine Wendlinger
Hi, When renaming a DataFrame column, it looks like Spark is forgetting the partition information: Seq((1, 2)) .toDF("a", "b") .repartition($"b") .withColumnRenamed("b", "c") .repartition($"c") .explain() Gives the following plan: == Physical Plan ==
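
The snippet from this thread as a runnable spark-shell example: before Spark 3.0 the renamed column no longer matches the output partitioning and the plan shows a second Exchange, while from 3.0 (apache/spark#26943) the extra shuffle disappears.

```scala
import spark.implicits._

Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")
  .explain()  // look for the number of Exchange nodes in the physical plan
```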

Partitioning query

2019-09-13 Thread ☼ R Nair
Hi, We are running a Spark JDBC job to pull data from Oracle, with some 200 partitions. Sometimes we are seeing that some tasks are failing or not moving forward. Is there any way we can see/find the queries responsible for each partition or task? How do we enable this? Thanks Best, Ravion

Re: Static partitioning in partitionBy()

2019-05-08 Thread Gourav Sengupta
some data skew problem but might work for you. -- From: Burak Yavuz Sent: Tuesday, May 7, 2019 9:35:10 AM To: Shubham Chaurasia Cc: dev; user@spark.apache.org Subject: Re: Static partitioning in partitionBy()

Re: Static partitioning in partitionBy()

2019-05-08 Thread Shubham Chaurasia
Sent: Tuesday, May 7, 2019 9:35:10 AM To: Shubham Chaurasia Cc: dev; user@spark.apache.org Subject: Re: Static partitioning in partitionBy() It depends on the data source. Delta Lake (https://delta.io) allows you to do it with the .option("replaceWhere",

Re: Static partitioning in partitionBy()

2019-05-07 Thread Felix Cheung
partitioning in partitionBy() It depends on the data source. Delta Lake (https://delta.io) allows you to do it with the .option("replaceWhere", "c = c1"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity. On Tu

Re: Static partitioning in partitionBy()

2019-05-07 Thread Burak Yavuz
It depends on the data source. Delta Lake (https://delta.io) allows you to do it with the .option("replaceWhere", "c = c1"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity. On Tue, May 7, 2019, 6:36 AM Shubham Chaurasia

Static partitioning in partitionBy()

2019-05-07 Thread Shubham Chaurasia
Hi All, Is there a way I can provide static partitions in partitionBy()? Like: df.write.mode("overwrite").format("MyDataSource").partitionBy("c=c1").save The above code gives the following error, as it tries to find column `c=c1` in df. org.apache.spark.sql.AnalysisException: Partition column `c=c1`
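
A sketch of the workaround suggested later in the thread, assuming the target is an existing Delta Lake table partitioned by `c` (requires the Delta Lake dependency); the path and the `c1` literal are illustrative.

```scala
import spark.implicits._

// Rows being written must all satisfy the replaceWhere predicate.
val df = Seq(("c1", 1), ("c1", 2)).toDF("c", "value")

df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "c = 'c1'")  // overwrite only the c = c1 partition
  .save("/tmp/my_delta_table")
```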

Re: Howto force spark to honor parquet partitioning

2019-05-03 Thread Gourav Sengupta
partition (1 file per hour). So far my only solution is to use repartition: df.repartition(col("event_hour")) But there is a lot of overhead with unnecessary shuffle. I'd like to force spark to "pickup" the parquet partitioning. In my invest

Howto force spark to honor parquet partitioning

2019-05-03 Thread Tomas Bartalos
my goal is to write 1 file per parquet partition (1 file per hour). So far my only solution is to use repartition: df.repartition(col("event_hour")) But there is a lot of overhead with unnecessary shuffle. I'd like to force spark to "pickup" the parquet partitioning. In my inve
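
A sketch of the repartition-before-write pattern described here, assuming the goal is roughly one output file per event_hour value; the paths are illustrative, and the repartition itself is the shuffle whose overhead the poster is asking how to avoid.

```scala
import org.apache.spark.sql.functions.col

val events = spark.read.parquet("/data/events")  // assumed to contain an event_hour column

events
  .repartition(col("event_hour"))  // co-locate each hour in a single partition (shuffle)
  .write
  .partitionBy("event_hour")
  .parquet("/data/events_by_hour")
```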

Re: Pyspark Partitioning

2018-10-04 Thread Vitaliy Pisarev
in the resulting partitions. On Thu, Oct 4, 2018, 23:27 dimitris plakas wrote: Hello everyone, Here is an issue that I am facing in partitioning a dataframe. I have a dataframe called data_df. It looks like: Group_Id | Object_Id | Trajectory 1

Pyspark Partitioning

2018-10-04 Thread dimitris plakas
Hello everyone, Here is an issue that I am facing in partitioning a dataframe. I have a dataframe called data_df. It looks like: Group_Id | Object_Id | Trajectory 1 | obj1| Traj1 2 | obj2| Traj2 1 | obj3| Traj3 3

Re: Pyspark Partitioning

2018-10-01 Thread Gourav Sengupta
Hi, the simplest option is to create UDFs for these different functions and then use a case statement (or similar) in SQL and pass it on. But this is low tech; in case you have conditions based on record values which are even more granular, why not use a single UDF, and then let conditions handle

Re: Pyspark Partitioning

2018-09-30 Thread ayan guha
Hi, There is a set of functions which can be used with the construct Over (partition by col order by col). Search for rank and window functions in the Spark documentation. On Mon, 1 Oct 2018 at 5:29 am, Riccardo Ferrari wrote: Hi Dimitris, I believe the methods partitionBy
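
A sketch of the over (partition by ... order by ...) construct mentioned here, written against the Scala API with the column names from the example dataframe earlier in this list; the ordering column and sample rows are illustrative.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import spark.implicits._

val data_df = Seq((1, "obj1", "Traj1"), (2, "obj2", "Traj2"), (1, "obj3", "Traj3"))
  .toDF("Group_Id", "Object_Id", "Trajectory")

// One window per Group_Id, ranked by Object_Id within the group.
val w = Window.partitionBy("Group_Id").orderBy("Object_Id")
data_df.withColumn("rank_in_group", rank().over(w)).show()
```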

Re: Pyspark Partitioning

2018-09-30 Thread Riccardo Ferrari
Hi Dimitris, I believe the methods partitionBy and mapPartitions are specific to RDDs while you're talking about DataFrames

Pyspark Partitioning

2018-09-30 Thread dimitris plakas
Hello everyone, I am trying to split a dataframe into partitions and I want to apply a custom function on every partition. More precisely, I have a dataframe like the one below Group_Id | Id | Points 1| id1| Point1 2| id2| Point2 I want to have a partition for every

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
here that helps you beyond repartition(number of partitions) or calling your udf on foreachPartition? If your data is on disk, Spark is already partitioning it for you by rows. How is adding the host info helping? Thanks, Sonal Nube Technologies

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
partitions) or calling your udf on foreachPartition? If your data is on disk, Spark is already partitioning it for you by rows. How is adding the host info helping? Thanks, Sonal Nube Technologies <http://www.nubetech.co> <http://in.linkedin.com/in/sonalg

Re: [External Sender] Pitfalls of partitioning by host?

2018-08-28 Thread Jayesh Lalwani
If you group by the host that you have computed using the UDF, Spark is always going to shuffle your dataset, even if the end result is that all the new partitions look exactly like the old partitions, just placed on different nodes. Remember the hostname will probably hash differently than the

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Sonal Goyal
Hi Patrick, Sorry, is there something here that helps you beyond repartition(number of partitions) or calling your udf on foreachPartition? If your data is on disk, Spark is already partitioning it for you by rows. How is adding the host info helping? Thanks, Sonal Nube Technologies

Re: Pitfalls of partitioning by host?

2018-08-28 Thread Patrick McCarthy
Mostly I'm guessing that it adds efficiency to a job where partitioning is required but shuffling is not. For example, if I want to apply a UDF to 1tb of records on disk, I might need to repartition(5) to get the task size down to an acceptable size for my cluster. If I don't care that it's

Re: Pitfalls of partitioning by host?

2018-08-27 Thread Michael Artz
Well, if we think of shuffling as a necessity to perform an operation, then the problem would be that you are adding an aggregation stage to a job that is going to get shuffled anyway. Like if you need to join two datasets, then Spark will still shuffle the data, whether they are grouped by

Pitfalls of partitioning by host?

2018-08-27 Thread Patrick McCarthy
When debugging some behavior on my YARN cluster I wrote the following PySpark UDF to figure out what host was operating on what row of data: @F.udf(T.StringType()) def add_hostname(x): import socket return str(socket.gethostname()) It occurred to me that I could use this to enforce

Dynamic partitioning weird behavior

2018-08-07 Thread Nikolay Skovpin
data into this table with SaveMode.Overwrite. What i did: import org.apache.spark.sql.{SaveMode, SparkSession} val spark = SparkSession.builder() .appName("Test for dynamic partitioning") .config("spark.sql.sources.partitionOverwriteMode", "dynamic") .getOrCreate()

Overwrite only specific partition with hive dynamic partitioning

2018-08-01 Thread Nirav Patel
Hi, I have a hive partitioned table created using sparkSession. I would like to insert/overwrite Dataframe data into a specific set of partitions without losing any other partition. In each run I have to update a set of partitions, not just one. e.g. I have a dataframe with bid=1, bid=2, bid=3 in the first
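
A sketch of the dynamic partition overwrite setting shown in the previous entry (available from Spark 2.3): only the partitions present in the incoming dataframe, e.g. bid=1,2,3, are replaced and the rest are left in place. The path and columns are illustrative; for a Hive-format table the write would typically go through insertInto on the table name instead of a path.

```scala
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

import spark.implicits._
val df = Seq((1, "x"), (2, "y"), (3, "z")).toDF("bid", "payload")

df.write
  .mode("overwrite")
  .partitionBy("bid")
  .parquet("/warehouse/my_table")  // only the bid=1,2,3 directories are rewritten
```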

[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information

2018-07-20 Thread Daniel Mateus Pires
I've been trying to figure this one out for some time now. I have JSONs representing Products coming in (physically) partitioned by Brand, and I would like to create a DataFrame from the JSON but also keep the partitioning information (Brand) ``` case class Product(brand: String, value: String
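
One common way to keep that information, sketched under the assumption that the files are laid out as .../brand=<value>/...: Spark's partition discovery then exposes brand as a column, and the basePath option preserves it when reading a subset of directories. Paths are illustrative.

```scala
import org.apache.spark.sql.functions.col

// Directories named brand=<value> become a `brand` column via partition discovery.
val products = spark.read
  .option("basePath", "s3a://bucket/products")
  .json("s3a://bucket/products/brand=*")

products.printSchema()                        // includes `brand` alongside the JSON fields
products.where(col("brand") === "acme").show()
```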

Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread ayan guha
help in the below please? Thanks, Aakash. -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Tue, Oct 31, 2017 at 9:17 PM Subject: Regarding column partitioning IDs and names as per hierarchical level

Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread Jean Georges Perrin
From: Aakash Basu <aakash.spark@gmail.com> Date: Tue, Oct 31, 2017 at 9:17 PM Subject: Regarding column partitioning IDs and names as per hierarchical level SparkSQL To: user <user@spark.apache.org

Fwd: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hey all, Any help in the below please? Thanks, Aakash. -- Forwarded message -- From: Aakash Basu <aakash.spark@gmail.com> Date: Tue, Oct 31, 2017 at 9:17 PM Subject: Regarding column partitioning IDs and names as per hierarchical level SparkSQL To: user

Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hi all, I have to generate a table with Spark-SQL with the following columns - Level One Id: VARCHAR(20) NULL Level One Name: VARCHAR(50) NOT NULL Level Two Id: VARCHAR(20) NULL Level Two Name: VARCHAR(50) NULL Level Three Id: VARCHAR(20) NULL Level Three Name: VARCHAR(50) NULL Level Four

Design aspects of Data partitioning for Window functions

2017-08-30 Thread Vasu Gourabathina
All: If this question was already discussed, please let me know. I can try to look into the archive. Data Characteristics: entity_id date fact_1 fact_2 fact_N derived_1 derived_2 derived_X a) There are 1000s of such entities in the system b) Each one has various Fact attributes per

Informing Spark about specific Partitioning scheme to avoid shuffles

2017-07-22 Thread saatvikshah1994
). This was done when loading the data into a RDD of lists and before the .toDF() call. So I assume Spark would not already know that such a partitioning exists and might trigger a shuffle if I call a shuffling transform using field1 or field2 as keys and then cache that information. Is it possible

Re: How does partitioning happen for binary files in spark ?

2017-04-06 Thread Jay
ignored. Thanks, Jayadeep On Thu, Apr 6, 2017 at 3:43 PM, ashwini anand <aayan...@gmail.com> wrote: By looking into the source code, I found that for textFile(), the partitioning is computed by the computeSplitSize() function in FileInputFormat class. This function takes int

How does partitioning happen for binary files in spark ?

2017-04-06 Thread ashwini anand
By looking into the source code, I found that for textFile(), the partitioning is computed by the computeSplitSize() function in the FileInputFormat class. This function takes into consideration the minPartitions value passed by the user. As per my understanding, the same thing for binaryFiles

Re: Partitioning strategy

2017-04-02 Thread Jörn Franke
time selection RDD is being filtered and on filtered RDD further transformations and actions are performed. And, as spark says, child RDD get partitions from parent RDD. Therefore, is there any way to decide partitioning strategy after filter operations?

Partitioning strategy

2017-04-02 Thread jasbir.sing
get partitions from parent RDD. Therefore, is there any way to decide partitioning strategy after filter operations? Regards, Jasbir Singh

Partitioning in spark while reading from RDBMS via JDBC

2017-03-31 Thread Devender Yadav
Hi All, I am running spark in cluster mode and reading data from RDBMS via JDBC. As per spark docs<http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases>, these partitioning parameters describe how to partition the table when reading in parallel from mu
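
A sketch of those partitioning parameters on a JDBC read; the connection URL, table and bounds are illustrative. Spark issues numPartitions range queries on partitionColumn, one per task.

```scala
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  // illustrative
  .option("dbtable", "schema.my_table")
  .option("user", "app_user")
  .option("password", "secret")
  .option("partitionColumn", "id")   // numeric, date or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "200")
  .load()
```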

RE: Huge partitioning job takes longer to close after all tasks finished

2017-03-09 Thread PSwain
To: user@spark.apache.org Subject: Re: Huge partitioning job takes longer to close after all tasks finished Thank you liu. Can you please explain what you mean by enabling the spark fault tolerant mechanism? I observed that after all tasks finish, spark is working on concatenating the same partitions from all

Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-09 Thread Gourav Sengupta
Hi, you are definitely not using SPARK 2.1 in the way it should be used. Try using sessions, and follow their guidelines, this issue has been specifically resolved as a part of Spark 2.1 release. Regards, Gourav On Wed, Mar 8, 2017 at 8:00 PM, Swapnil Shinde wrote:

Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-08 Thread Swapnil Shinde
Thank you liu. Can you please explain what you mean by enabling the spark fault tolerant mechanism? I observed that after all tasks finish, spark is working on concatenating the same partitions from all tasks on the file system. eg, task1 - partition1, partition2, partition3 task2 - partition1,

Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-07 Thread cht liu
Did you enable the spark fault tolerance mechanism? When the RDD is checkpointed, a separate job is started at the end of the main job to write the checkpoint data to the file system before it is persisted for high availability.

Huge partitioning job takes longer to close after all tasks finished

2017-03-07 Thread Swapnil Shinde
Hello all, I have a spark job that reads parquet data and partitions it based on one of the columns. I made sure the partitions are equally distributed and not skewed. My code looks like this - datasetA.write.partitionBy("column1").parquet(outputPath) Execution plan - (inline image) All

Re: Kafka Streaming and partitioning

2017-02-26 Thread tonyye
Hi Dave, I had the same question and was wondering if you had found a way to do the join without causing a shuffle? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Streaming-and-partitioning-tp25955p28425.html Sent from the Apache Spark User List

RE: forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Could you explain why this would work? Assaf. From: Haviv, Daniel [mailto:dha...@amazon.com] Sent: Sunday, January 29, 2017 7:09 PM To: Mendelson, Assaf Cc: user@spark.apache.org Subject: Re: forcing dataframe groupby partitioning If there's no built in local groupBy, You could do something like

Re: forcing dataframe groupby partitioning

2017-01-29 Thread Haviv, Daniel
If there's no built in local groupBy, You could do something like that: df.groupby(C1,C2).agg(...).flatmap(x=>x.groupBy(C1)).agg Thank you. Daniel On 29 Jan 2017, at 18:33, Mendelson, Assaf > wrote: Hi, Consider the following example:

forcing dataframe groupby partitioning

2017-01-29 Thread Mendelson, Assaf
Hi, Consider the following example: df.groupby(C1,C2).agg(some agg).groupby(C1).agg(some more agg) The default way spark would behave would be to shuffle according to a combination of C1 and C2 and then shuffle again by C1 only. This behavior makes sense when one uses C2 to salt C1 for
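
A runnable version of that example, with explain() to make the two shuffles visible as Exchange nodes (hash-partitioned on (C1, C2), then on C1 alone); the sample data and aggregates are illustrative.

```scala
import org.apache.spark.sql.functions.{max, sum}
import spark.implicits._

val df = Seq(("a", "x", 1), ("a", "y", 2), ("b", "x", 3)).toDF("C1", "C2", "v")

df.groupBy("C1", "C2").agg(sum("v").as("s"))
  .groupBy("C1").agg(max("s").as("m"))
  .explain()  // expect hashpartitioning(C1, C2) followed by hashpartitioning(C1)
```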

Re: Spark Partitioning Strategy with Parquet

2016-12-30 Thread titli batali
decrease the processing time We are converting the csv files to parquet and partitioning it with userid, df.write.format("parquet").partitionBy("useridcol").save("hdfs://path"). So that while reading the parquet files, we read a particular user in a particular partition and create a Cartesian product of (date X transaction) and work on the tuple in each partition, to achieve the above level of nesting. Partitioning on 8 million users, is it a bad option? What could be a better way to achieve this? Thanks

Parallel dynamic partitioning producing duplicated data

2016-11-30 Thread Mehdi Ben Haj Abbes
Hi Folks, I have a spark job reading a csv file into a dataframe. I register that dataframe as a tempTable then I’m writing that dataframe/tempTable to hive external table (using parquet format for storage) I’m using this kind of command : hiveContext.sql(*"INSERT INTO TABLE t

RE: CSV to parquet preserving partitioning

2016-11-23 Thread benoitdr
dataframe to parquet table partitioned by dirs. It requires writing one's own parser. I could not find a solution to preserve the partitioning using sc.textFile or the databricks csv parser. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CSV-to-parquet-preserving

RE: CSV to parquet preserving partitioning

2016-11-18 Thread benoitdr
the ETL phase. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CSV-to-parquet-preserving-partitioning-tp28078p28103.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread ayan guha
decrease the processing time We are converting the csv files to parquet and partitioning it with userid, df.write.format("parquet").partitionBy("useridcol").save("hdfs://path"). So that while reading the parquet files, we read a particular user in a particular partition and create a Cartesian product of (date X transaction) and work on the tuple in each partition, to achieve the above level of nesting. Partitioning on 8 million users is it a bad option. What could be a better way to achieve this? Thanks

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread titli batali
while reading the parquet files, we read a particular user in a particular partition and create a Cartesian product of (date X transaction) and work on the tuple in each partition, to achieve the above level of nesting. Partitioning on 8 million users is it a bad option. What could be a better way to achieve this? Thanks

Re: Spark Partitioning Strategy with Parquet

2016-11-17 Thread Xiaomeng Wan
Cartesian product of (date X transaction) and work on the tuple in each partition, to achieve the above level of nesting. Partitioning on 8 million users is it a bad option. What could be a better way to achieve this? Thanks

Fwd: Spark Partitioning Strategy with Parquet

2016-11-17 Thread titli batali
user in a particular partition and create a Cartesian product of (date X transaction) and work on the tuple in each partition, to achieve the above level of nesting. Partitioning on 8 million users is it a bad option. What could be a better way to achieve this? Thanks

RE: CSV to parquet preserving partitioning

2016-11-16 Thread neil90
- hdfs://path/dir=dir1/part-r-xxx.gz.parquet hdfs://path/dir=dir2/part-r-yyy.gz.parquet hdfs://path/dir=dir3/part-r-zzz.gz.parquet -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CSV-to-parquet-preserving-partitioning-tp28078p28087.html Sent from the Apach

RE: CSV to parquet preserving partitioning

2016-11-16 Thread benoitdr
Subject: Re: CSV to parquet preserving partitioning Is there anything in the files to let you know which directory they should be in? If you reply to this email, your message will be added to the discussion below: http://apache-spark-user-list.1001560.n3.nabble.c

Re: CSV to parquet preserving partitioning

2016-11-16 Thread neil90
Is there anything in the files to let you know which directory they should be in? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CSV-to-parquet-preserving-partitioning-tp28078p28083.html Sent from the Apache Spark User List mailing list archive

RE: CSV to parquet preserving partitioning

2016-11-16 Thread Drooghaag, Benoit (Nokia - BE)
the partitioning. As I can see, it creates one partition per csv file, so the data from one input directory can be scattered across the nodes ... From: Daniel Siegmann [mailto:dsiegm...@securityscorecard.io] Sent: Tuesday, 15 November 2016 18:57 To: Drooghaag, Benoit (Nokia - BE) <benoit.dro

Re: CSV to parquet preserving partitioning

2016-11-15 Thread Daniel Siegmann
.csv /path/dir3/file4.csv /path/dir3/file5.csv /path/dir3/file6.csv I'd like to read those files and write their data to a parquet table in hdfs, preserving the partitioning (partitioned by input directory), and such that there is a single output file per part

CSV to parquet preserving partitioning

2016-11-15 Thread benoitdr
/path/dir2/file3.csv /path/dir3/file4.csv /path/dir3/file5.csv /path/dir3/file6.csv I'd like to read those files and write their data to a parquet table in hdfs, preserving the partitioning (partitioned by input directory), and such that there is a single output file per partition. The output files
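
One possible approach, not necessarily the one the thread settled on: derive the input directory from input_file_name() and partition the parquet output by it, which yields the dir=dir1, dir=dir2, ... layout shown in the later replies. The regex and paths are illustrative; the repartition addresses the one-file-per-partition requirement at the cost of a shuffle.

```scala
import org.apache.spark.sql.functions.{col, input_file_name, regexp_extract}

val csv = spark.read.option("header", "true").csv("/path/dir1", "/path/dir2", "/path/dir3")

csv
  .withColumn("dir", regexp_extract(input_file_name(), "/(dir[^/]+)/[^/]+$", 1))
  .repartition(col("dir"))   // one task, and thus roughly one file, per directory
  .write
  .partitionBy("dir")
  .parquet("hdfs://path")
```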

Re: MapWithState partitioning

2016-10-31 Thread Andrii Biletskyi
<andrb...@gmail.com> wrote: Thanks for the response. So as I understand there is no way to "tell" mapWithState to leave the partitioning schema as any other transformation would normally do. Then I would like to clarify if there is a simpl

Re: MapWithState partitioning

2016-10-31 Thread Cody Koeninger
as I understand there is no way to "tell" mapWithState to leave the partitioning schema as any other transformation would normally do. Then I would like to clarify if there is a simple way to do a transformation on a key-value stream and specify somehow the Partitioner that effe

Re: MapWithState partitioning

2016-10-31 Thread Cody Koeninger
If you call a transformation on an rdd using the same partitioner as that rdd, no shuffle will occur. KafkaRDD doesn't have a partitioner, there's no consistent partitioning scheme that works for all kafka uses. You can wrap each kafkardd with an rdd that has a custom partitioner that you write

MapWithState partitioning

2016-10-31 Thread Andrii Biletskyi
Hi all, I'm using Spark Streaming mapWithState operation to do a stateful operation on my Kafka stream (though I think similar arguments would apply for any source). Trying to understand a way to control mapWithState's partitioning schema. My transformations are simple: 1) create KafkaDStream

Re: Re-partitioning mapwithstateDstream

2016-10-13 Thread manasdebashiskar
StateSpec has a method numPartitions to set the initial number of partitions. That should do the trick. ...Manas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Re-partitioning-mapwithstateDstream-tp27880p27899.html Sent from the Apache Spark User List
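
A sketch of that hint; the mapping function, key/value types and partition count are illustrative. StateSpec also accepts a custom Partitioner via .partitioner(...) when a partition count alone is not enough.

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Running sum per key; types are illustrative.
def trackSum(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (key, sum)
}

val spec = StateSpec.function(trackSum _).numPartitions(16)
// keyedDStream.mapWithState(spec)  // keyedDStream: DStream[(String, Int)], not shown here
```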

Re: Partitioning in spark

2016-06-24 Thread Darshan Singh
Thanks, but the whole point is not setting it explicitly; it should be derived from its parent RDDs. Thanks On Fri, Jun 24, 2016 at 6:09 AM, ayan guha wrote: You can change parallelism like the following: conf = SparkConf() conf.set('spark.sql.shuffle.partitions',10)

Re: Partitioning in spark

2016-06-23 Thread ayan guha
You can change parallelism like the following: conf = SparkConf() conf.set('spark.sql.shuffle.partitions',10) sc = SparkContext(conf=conf) On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh wrote: Hi, My default parallelism is 100. Now I join 2 dataframes with 20

Partitioning in spark

2016-06-23 Thread Darshan Singh
Hi, My default parallelism is 100. Now I join 2 dataframes with 20 partitions each; the joined dataframe has 100 partitions. I want to know what is the way to keep it at 20 (except re-partition and coalesce). Also, when I join these 2 dataframes I am using 4 columns as join columns. The dataframes

Re: Custom positioning/partitioning Dataframes

2016-06-03 Thread Takeshi Yamamuro
Hi, I'm afraid spark has no explicit api to set custom partitioners in df for now. // maropu On Sat, Jun 4, 2016 at 1:09 AM, Nilesh Chakraborty <nil...@nileshc.com> wrote: > Hi, > > I have a domain-specific schema (RDF data with vertical partitioning, ie. > one table per pr

Re: Strategies for properly load-balanced partitioning

2016-06-03 Thread Takeshi Yamamuro
..@wellsfargo.com" <saif.a.ell...@wellsfargo.com> Date: Friday, June 3, 2016 at 8:31 AM To: "user@spark.apache.org" Subject: Strategies for properly load-balanced partitioning Hello everyone!

Re: Strategies for properly load-balanced partitioning

2016-06-03 Thread Silvio Fiorito
<saif.a.ell...@wellsfargo.com> Date: Friday, June 3, 2016 at 8:31 AM To: "user@spark.apache.org" Subject: Strategies for properly load-balanced partitioning Hello everyone! I was noticing that, when reading parquet files or actually any kind of source data frame

RE: Strategies for properly load-balanced partitioning

2016-06-03 Thread Saif.A.Ellafi
To: user; Reynold Xin; mich...@databricks.com Subject: Re: Strategies for properly load-balanced partitioning I suppose you are running on 1.6. I guess you need some solution based on [1], [2] features which are coming in 2.0. [1] https://issues.apache.org/jira/browse/SPARK-12538 / https

Re: Strategies for properly load-balanced partitioning

2016-06-03 Thread Ovidiu-Cristian MARCU
I suppose you are running on 1.6. I guess you need some solution based on [1], [2] features which are coming in 2.0. [1] https://issues.apache.org/jira/browse/SPARK-12538 / https://issues.apache.org/jira/browse/SPARK-12394

Custom positioning/partitioning Dataframes

2016-06-03 Thread Nilesh Chakraborty
Hi, I have a domain-specific schema (RDF data with vertical partitioning, ie. one table per property) and I want to instruct SparkSQL to keep semantically closer property tables closer together, that is, group dataframes together into different nodes (or at least encourage it somehow) so

Strategies for properly load-balanced partitioning

2016-06-03 Thread Saif.A.Ellafi
Hello everyone! I was noticing that, when reading parquet files or actually any kind of source data frame data (spark-csv, etc), default partitioning is not fair. Action tasks usually act very fast on some partitions and very slow on some others, and frequently, even fast on all but last

Partitioning Data to optimize combineByKey

2016-06-02 Thread Nathan Case
Hello, I am trying to process a dataset that is approximately 2 tb using a cluster with 4.5 tb of ram. The data is in parquet format and is initially loaded into a dataframe. A subset of the data is then queried for and converted to RDD for more complicated processing. The first stage of that

GraphFrame graph partitioning

2016-05-25 Thread rohit13k
How to do graph partition in GraphFrames similar to the partitionBy feature in GraphX? Can we use the Dataframe's repartition feature in 1.6 to provide a graph partitioning in graphFrames? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/GraphFrame-graph

Re: Is there Graph Partitioning impl for Scala/Spark?

2016-03-11 Thread Alexander Pivovarov
, Alexander Pivovarov <apivova...@gmail.com> wrote: > Is there Graph Partitioning impl (e.g. Spectral ) which can be used in > Spark? > I guess it should be at least java/scala lib > Maybe even tuned to work with GraphX >

RE: Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
Grouping is applied in the aggregation. From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau Sent: Thu, Mar 10, 2016 13:56 To: Gerhard Fiedler Cc: user@spark.apache.org Subject: Re: Partitioning to speed up processing? Are they entire data set aggregates

Re: Partitioning to speed up processing?

2016-03-10 Thread Holden Karau
Are they entire data set aggregates or is there some grouping applied? On Thursday, March 10, 2016, Gerhard Fiedler <gfied...@algebraixdata.com> wrote: I have a number of queries that result in a sequence Filter > Project > Aggregate. I wonder whether partitioning the input

Partitioning to speed up processing?

2016-03-10 Thread Gerhard Fiedler
I have a number of queries that result in a sequence Filter > Project > Aggregate. I wonder whether partitioning the input table makes sense. Does Aggregate benefit from a partitioned input? If so, what partitions would be most useful (related to the aggregations)? Do Filter and P

Kafka Streaming and partitioning

2016-01-13 Thread ddav
Hi, I have the following use case: 1. Reference data stored in an RDD that is persisted and partitioned using a simple custom partitioner. 2. Input stream from kafka that uses the same partitioner algorithm as the ref data RDD - this partitioning is done in kafka. I am using kafka direct
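
A minimal sketch of the kind of simple custom partitioner described here, applied to the reference-data RDD; the key type, hashing and partition count are illustrative.

```scala
import org.apache.spark.Partitioner

class RefDataPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % partitions
    if (mod < 0) mod + partitions else mod  // keep the bucket index non-negative
  }
}

val refData = sc
  .parallelize(Seq((1, "ref-a"), (2, "ref-b"), (3, "ref-c")))
  .partitionBy(new RefDataPartitioner(8))
  .persist()
```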

Re: Kafka Streaming and partitioning

2016-01-13 Thread Cody Koeninger
believe others have... On Wed, Jan 13, 2016 at 3:40 AM, ddav <dave.davo...@gmail.com> wrote: Hi, I have the following use case: 1. Reference dat
