Re: CATALYST rule join

2018-02-27 Thread tan shai
operation `Generate explode` appears many times in the physical plan. Do you have any other ideas? Maybe rewriting the code. Thank you
2018-02-25 23:08 GMT+01:00 tan shai <tan.shai...@gmail.com>:
> Hi,
>
> I need to write a rule to customize the join function using Spark C

CATALYST rule join

2018-02-25 Thread tan shai
Hi, I need to write a rule to customize the join function using the Spark Catalyst optimizer. The objective is to duplicate the second dataset using this process:
- Execute a udf on the column called x; this udf returns an array
- Execute an explode function on the new column
Using SQL terms, my
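
The two steps described in this message (a udf that returns an array, then an explode on the new column) can be modelled in plain Python to make the intended row duplication concrete. This is only a sketch of the semantics, not Spark or Catalyst code: `explode_rows`, `neighbours`, the `payload` field and the `x_item` output column are hypothetical names chosen for illustration, and the real udf is not shown in the message.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def explode_rows(rows: Iterable[Row], column: str,
                 to_array: Callable[[Any], List[Any]],
                 out_column: str) -> Iterator[Row]:
    """Apply to_array to row[column], then emit one output row per array element.

    A row whose array is empty produces no output rows.
    """
    for row in rows:
        for item in to_array(row[column]):
            # Copy the original row and attach one element of the array.
            yield {**row, out_column: item}


def neighbours(x: int) -> List[int]:
    """Hypothetical stand-in for the udf: returns an array derived from x."""
    return [x - 1, x, x + 1]


# Hypothetical "second dataset" with a column called x.
second = [{"x": 10, "payload": "a"}, {"x": 20, "payload": "b"}]
duplicated = list(explode_rows(second, "x", neighbours, "x_item"))

for row in duplicated:
    print(row)
```

Each input row is repeated once per element of the array, which is why an explode step can show up repeatedly in a plan if the rewritten relation is referenced more than once.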

Tuning Spark memory

2016-09-23 Thread tan shai
Hi, I am working with Spark 2.0. The job starts by sorting the input data and storing the output on HDFS. I was getting out-of-memory errors; increasing the value of spark.shuffle.memoryFraction from 0.2 to 0.8 solved the problem. But in the documentation I have found

Total memory of workers

2016-09-06 Thread tan shai
Hello, Can anyone explain to me the behavior of Spark if the size of the processed file is greater than the total memory available on the workers? Many thanks.

RangePartitioning

2016-07-08 Thread tan shai
Hi, Can anyone explain to me the class RangePartitioning? "https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala" case class RangePartitioning(ordering: Seq[SortOrder],

[no subject]

2016-07-08 Thread tan shai
Hi, Can anyone explain to me the class RangePartitioning? "https://github.com/apache/spark/blob/d5911d1173fe0872f21cae6c47abf8ff479345a4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala" case class RangePartitioning(ordering: Seq[SortOrder],

Re: Extend Dataframe API

2016-07-07 Thread tan shai
7, 2016 at 9:31 AM, tan shai <tan.shai...@gmail.com> wrote:
>
>> Hi,
>>
>> I need to add new operations to the dataframe API.
>> Can any one explain to me how to extend the plans of query execution?
>>
>> Many thanks.
>>

Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
as such i think?
> you could however do dataFrame.rdd, to force it to create a physical plan
> that results in an actual rdd, and then query the rdd for partition info.
>
> On Thu, Jul 7, 2016 at 4:24 AM, tan shai <tan.shai...@gmail.com> wrote:
>
>> Using partitioning with data

Extend Dataframe API

2016-07-07 Thread tan shai
Hi, I need to add new operations to the dataframe API. Can anyone explain to me how to extend the query execution plans? Many thanks.

Re: Optimize filter operations with sorted data

2016-07-07 Thread tan shai
ideration operate on sorted column(s) ?
> >
> > Cheers
> >
> >> On Jul 7, 2016, at 2:25 AM, tan shai <tan.shai...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I have a sorted dataframe, I need to optimize the filter operations.
>

Re: Optimize filter operations with sorted data

2016-07-07 Thread tan shai
Yes, it is operating on the sorted column.
2016-07-07 11:43 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> Does the filter under consideration operate on sorted column(s) ?
>
> Cheers
>
> > On Jul 7, 2016, at 2:25 AM, tan shai <tan.shai...@gmail.com> wrote:
> >
>

Optimize filter operations with sorted data

2016-07-07 Thread tan shai
Hi, I have a sorted dataframe and I need to optimize the filter operations. How does Spark perform filter operations on a sorted dataframe? Does it scan all the data? Many thanks.
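
To make the question concrete, here is a plain-Python sketch of the difference between a filter that scans every value and one that exploits sort order with binary search. It says nothing about what Spark does internally; `range_filter_scan` and `range_filter_sorted` are hypothetical helper names, and the data is made up for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import List


def range_filter_scan(values: List[int], low: int, high: int) -> List[int]:
    """Full scan: test every element, regardless of ordering."""
    return [v for v in values if low <= v <= high]


def range_filter_sorted(values: List[int], low: int, high: int) -> List[int]:
    """Binary search: valid only if values is sorted in ascending order."""
    start = bisect_left(values, low)    # first index with value >= low
    end = bisect_right(values, high)    # first index with value > high
    return values[start:end]


data = [1, 3, 3, 7, 9, 12, 15]          # already sorted
print(range_filter_scan(data, 3, 9))
print(range_filter_sorted(data, 3, 9))
```

Both functions return the same rows on sorted input; the second only needs two binary searches to locate the matching slice, which is the kind of saving the question is asking about.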

Re: Question regarding structured data and partitions

2016-07-07 Thread tan shai
Using partitioning with dataframes, how can we retrieve information about partitions? Partition bounds, for example.
Thanks, Shaira
2016-07-07 6:30 GMT+02:00 Koert Kuipers:
> spark does keep some information on the partitions of an RDD, namely the
> partitioning/partitioner.

Dataframe sort

2016-07-05 Thread tan shai
Hi, I need to sort a dataframe and retrieve the bounds of each partition. dataframe.sort() uses range partitioning in the physical plan. I need to retrieve the partition bounds. Many thanks for your help.
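
As a rough illustration of what "partition bounds" means under range partitioning, the following plain-Python sketch picks split points from the keys, assigns each key to a partition, and reports the minimum and maximum key per partition. It is a simplified model, not Spark's implementation: `choose_bounds` and `partition_ranges` are hypothetical helpers, and the bounds are taken from the full sorted key list purely to keep the example short.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def choose_bounds(keys: List[int], num_partitions: int) -> List[int]:
    """Pick num_partitions - 1 upper bounds from the sorted keys.

    Assumes len(keys) >= num_partitions.
    """
    ordered = sorted(keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i) - 1] for i in range(1, num_partitions)]


def partition_ranges(keys: List[int],
                     bounds: List[int]) -> List[Optional[Tuple[int, int]]]:
    """Assign each key to a partition and return (min, max) per partition."""
    parts: List[List[int]] = [[] for _ in range(len(bounds) + 1)]
    for key in keys:
        # Partition i holds keys <= bounds[i]; the last one holds the rest.
        parts[bisect_left(bounds, key)].append(key)
    return [(min(p), max(p)) if p else None for p in parts]


keys = [5, 1, 9, 3, 7, 2, 8, 4, 6]
bounds = choose_bounds(keys, 3)
print(bounds)
print(partition_ranges(keys, bounds))
```

In this model the split points and the per-partition (min, max) pairs are two different things: the former define the partitioning, the latter describe the data that actually landed in each partition.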