Re: Enabling push-based shuffle in Spark

2020-01-27 Thread Long, Andrew
The easiest would be to create a fork of the code on GitHub. I can also accept diffs. Cheers, Andrew. (Quoting Min Shen, Monday, January 27, 2020 at 12:48 PM:) Hi Andrew, We

Re: How to implement a "saveAsBinaryFile" function?

2020-01-16 Thread Long, Andrew
Hey Bing, There are a couple of different approaches you could take. The quickest and easiest would be to use the existing APIs: val bytes = spark.range(1000); bytes.foreachPartition(bytes => { // WARNING: anything used in here will need to be serializable. // There's some magic to serializing the
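A minimal sketch of that foreachPartition approach, with the truncated snippet filled out; the saveAsBinaryFile helper name and output layout are assumptions for illustration, not an existing Spark API:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.TaskContext
    import org.apache.spark.sql.Dataset

    def saveAsBinaryFile(ds: Dataset[Array[Byte]], dir: String): Unit = {
      ds.foreachPartition { (records: Iterator[Array[Byte]]) =>
        // WARNING: anything captured here must be serializable; `dir` is just a String, so it's safe
        val fs = FileSystem.get(URI.create(dir), new Configuration())
        val out = fs.create(new Path(dir, f"part-${TaskContext.getPartitionId()}%05d.bin"))
        try records.foreach(out.write) finally out.close()
      }
    }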

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-07 Thread Long, Andrew
“Are performance degradations to existing queries that are fixable by new equivalent queries not allowed for a new major Spark version?” The general rule of thumb for my group (which is NOT Databricks) is that as long as the geomean of TPC-DS improves you’re fine, provided you don’t break any

Re: SortMergeJoinExec: Utilizing child partitioning when joining

2020-01-02 Thread Long, Andrew
improvements in other areas. “Could someone help explain why the different join types have different output partitionings?” Long story short, when a join happens the join exec zips together the partitions of the left and right side, so that one partition of the join has the matching elements of both the left and right
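An illustrative sketch of that zipping idea at the RDD level (assuming a SparkContext sc); this is a toy per-partition merge of co-partitioned sides, not the actual SortMergeJoinExec code:

    import org.apache.spark.HashPartitioner

    // both sides are partitioned the same way, so partition i of each holds the same keys
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(new HashPartitioner(4))
    val right = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(new HashPartitioner(4))

    // zipPartitions pairs partition i of `left` with partition i of `right`,
    // so the join can emit one output partition per input partition, with no shuffle
    val joined = left.zipPartitions(right) { (l, r) =>
      val byKey = r.toSeq.groupBy(_._1)
      l.flatMap { case (k, v) => byKey.getOrElse(k, Nil).map { case (_, w) => k -> (v, w) } }
    }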

CR for adding bucket join support to V2 Datasources

2019-11-18 Thread Long, Andrew
Hey Friends, I recently created a pull request to add optional support for bucket joins to V2 Datasources, via a concrete class representing the Spark-style hash partitioning. If anyone has some free time I’d appreciate a code review. This also adds a concrete implementation of V2
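For reference, a rough sketch of the interface shape such a concrete class targets, assuming the Spark 3.0-era DSv2 API (the partitioning interfaces were reworked in later versions): a Scan that implements SupportsReportPartitioning returns a Partitioning like this from outputPartitioning(), and the planner can then avoid shuffling already-bucketed sides.

    import org.apache.spark.sql.connector.read.partitioning.{Distribution, Partitioning}

    class BucketHashPartitioning(buckets: Int) extends Partitioning {
      override def numPartitions(): Int = buckets
      // placeholder: a real implementation would check that the distribution's
      // clustered columns line up with the bucketing columns
      override def satisfy(distribution: Distribution): Boolean = false
    }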

Timeline for Spark 3.0

2019-06-28 Thread Long, Andrew
Hey Friends, Is there a timeline for Spark 3.0 in terms of the first RC and the final release? Cheers Andrew

Bucketing and catalyst

2019-05-02 Thread Long, Andrew
Hey Friends, How aware of bucketing is Catalyst? I’ve been trying to piece together how Catalyst knows that it can remove a sort and shuffle given that both tables are bucketed and sorted the same way. Are there any classes in particular I should look at? Cheers Andrew
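For anyone following along, the physical-planning rule EnsureRequirements is a good place to start: it inserts exchanges and sorts only where a child's output partitioning/ordering doesn't already satisfy the operator's requirements. A small sketch of the behavior in question, with made-up table names:

    // both tables bucketed and sorted the same way on the join key
    spark.range(100).write.bucketBy(8, "id").sortBy("id").saveAsTable("t1")
    spark.range(100).write.bucketBy(8, "id").sortBy("id").saveAsTable("t2")

    // with matching bucketing and sorting, the plan should show a SortMergeJoin
    // with no Exchange (shuffle) and no extra Sort on either side
    spark.table("t1").join(spark.table("t2"), "id").explain()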

Re: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB

2019-05-01 Thread Long, Andrew
(Quoting a reply from Thursday, April 25, 2019 at 8:47 AM:) I usually only see that in regards to folks parallelizing very large objects. From what I know, it's real

FW: Stage 152 contains a task of very large size (12747 KB). The maximum recommended task size is 100 KB

2019-04-23 Thread Long, Andrew
Hey Friends, Is there an easy way of figuring out what’s being pulled into the task context? I’ve been getting the following message, which I suspect means I’ve unintentionally captured some large objects, but figuring out what those objects are is stumping me. 19/04/23 13:52:13 WARN
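A hedged illustration of the usual cause and fix, assuming a SparkContext sc: the warning typically fires when the closure (or a parallelized local collection) drags a large driver-side object into every task.

    // a large driver-side object
    val hugeLookup: Map[Long, String] = (1L to 1000000L).map(i => i -> i.toString).toMap

    val rdd = sc.parallelize(1L to 100L)

    // bad: the closure captures hugeLookup, so it is serialized into every task
    val bad = rdd.map(x => hugeLookup.getOrElse(x, "?"))

    // better: broadcast it once; each task then carries only a small handle
    val bc = sc.broadcast(hugeLookup)
    val good = rdd.map(x => bc.value.getOrElse(x, "?"))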

Sort order in bucketing in a custom datasource

2019-04-16 Thread Long, Andrew
Hey Friends, Is it possible to specify the sort order or bucketing in a way that can be used by the optimizer in Spark? Cheers Andrew

Which parts of a parquet read happen on the driver vs the executor?

2019-04-11 Thread Long, Andrew
Hey Friends, I’m working on a POC that involves reading and writing Parquet files mid-DAG. Writes are working, but I’m struggling to get reads working due to serialization issues. I’ve got code that works with master=local but not on YARN. So here are my questions. 1. Is there an
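On the driver-vs-executor split: roughly, file listing and schema/footer work for planning happen on the driver, while the per-file row reads happen in tasks on the executors. A sketch of one common fix for the local-vs-YARN gap, assuming the usual culprit (Hadoop's Configuration is not serializable, which only bites once tasks really ship over the wire): snapshot the needed entries, broadcast them, and rebuild the conf per partition.

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration

    // snapshot the driver-side hadoop conf into a plain, serializable Map
    val confEntries: Map[String, String] =
      spark.sparkContext.hadoopConfiguration.asScala.map(e => e.getKey -> e.getValue).toMap
    val bcConf = spark.sparkContext.broadcast(confEntries)

    val result = ds.rdd.mapPartitions { rows =>
      // rebuild a Configuration on the executor from the broadcast entries
      val conf = new Configuration(false)
      bcConf.value.foreach { case (k, v) => conf.set(k, v) }
      // ...open the parquet reader with `conf` here...
      rows
    }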

Re: Manually reading parquet files.

2019-03-21 Thread Long, Andrew
la:305) at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100) at com.amazon.horizon.azulene.ParquetReadTests$$anonfun$2.apply(ParquetReadTests.scala:100) (Replying to Ryan Blue, Thursday, March 21, 2019 at 3:32 PM)

Manually reading parquet files.

2019-03-21 Thread Long, Andrew
Hello Friends, I’m working on a performance improvement that reads additional Parquet files in the middle of a lambda, and I’m running into some issues. This is what I’d like to do: ds.mapPartitions(x => { // read a parquet file in and perform an operation with x }). Here’s my current POC code, but
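A small sketch of one way to do that read inside the lambda, using the parquet-avro reader; the path and the combine step are placeholders, and this assumes the side file is small enough to hold in memory per partition:

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetReader

    val out = ds.rdd.mapPartitions { rows =>
      // open and drain the side parquet file on the executor
      val reader = AvroParquetReader.builder[GenericRecord](new Path("/tmp/side-input.parquet")).build()
      val side = Iterator.continually(reader.read()).takeWhile(_ != null).toList
      reader.close()
      rows.map { r => /* combine r with `side` here */ r }
    }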

Re: Spark data quality bug when reading parquet files from hive metastore

2018-09-07 Thread Long, Andrew
Thanks Fokko, I will definitely take a look at this. Cheers, Andrew. (Quoting “Driesprong, Fokko”, Friday, August 24, 2018 at 2:39 AM:) Hi Andrew,

Spark data quality bug when reading parquet files from hive metastore

2018-08-22 Thread Long, Andrew
Hello Friends, I’ve encountered a bug where Spark silently corrupts data when reading from a Parquet Hive table whose table schema does not match the file schema. I’d like to take a shot at adding some extra validations to the code to handle this corner case, and I was wondering if anyone
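A hedged repro sketch of the kind of mismatch described (path and table name made up): the files hold longs, the metastore claims ints, and depending on the Spark version and reader path the values can come back silently wrong rather than failing loudly.

    // write longs that don't fit in an int
    spark.range(10).selectExpr("id * 10000000000 AS value").write.parquet("/tmp/mismatch")

    // declare the same column as INT in the metastore
    spark.sql("CREATE EXTERNAL TABLE mismatch (value INT) STORED AS PARQUET LOCATION '/tmp/mismatch'")

    // depending on version and reader, this can return corrupted values instead of an error
    spark.table("mismatch").show()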

Feedback on first commit + jira issue I opened

2018-05-31 Thread Long, Andrew
Hello Friends, I’m a new committer and I’ve submitted my first patch, and I had some questions about documentation standards. In my patch (JIRA below) I’ve added a config parameter to adjust the number of records shown when a user calls .show() on a DataFrame. I was hoping someone could double

Re: Can I add a new method to RDD class?

2016-12-07 Thread Teng Long
mvn versions:set -DnewVersion=your_new_version > On Wed, Dec 7, 2016 at 11:31 AM, Teng Long <[hidden email]> wrote: > Hi Holden, > Can you please tell me how to edit version numbers efficiently? The correct way? I'm really struggling with this and don't know where to look

Re: Can I add a new method to RDD class?

2016-12-06 Thread Teng Long
I think changing the property (line 29) in Spark's root pom.xml should be sufficient. However, keep in mind that you'll also need to publish Spark locally before you can access it in your test application. > On Tue, Dec 6, 2016 at 2:50 AM, Teng Long <[hidden email]
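(For reference, publishing a modified Spark locally is roughly ./build/sbt publishLocal from the source root, or ./build/mvn -DskipTests clean install; exact flags vary by version.)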

Re: Can I add a new method to RDD class?

2016-12-06 Thread Teng Long
Thank you Jakob for clearing things up for me. Before, I thought my application was compiled against my local build, since I could see all the logs I had just added in spark-core. But it was all along using Spark downloaded from the remote Maven repository, and that’s why I “cannot” add new RDD methods
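The usual fix, sketched (the version string is an example): give the local build a distinct version, publish it locally, and pin the application to that exact version so sbt resolves it from the local repository rather than Maven Central.

    // in the application's build.sbt, after publishing the modified build locally
    libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2-custom-SNAPSHOT"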

Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
k.rdd”. It seems like my import statement is wrong, but I don’t know how. Thanks! > On Dec 5, 2016, at 5:14 PM, Teng Long <longteng...@gmail.com> wrote: > I’m trying to implement a transformation that can merge partitions (to align with GPU specs) and move them onto GPU memory

Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
cassandraTable[SomeType]("ks", "not_existing_table")) > And here you will see an example of "extending" RDD: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md

Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
org.apache.spark.SparkContext <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD@sparkContext:org.apache.spark.SparkContext> > On Mon, Dec 5, 2016 at 1:04 PM, Teng Long <longteng...@gmail.com> w

Re: Can I add a new method to RDD class?

2016-12-05 Thread Teng Long
suggested are probably better, but also > does your method need to be defined on the RDD class? Could you instead make a helper object or class to expose whatever functionality you need? > On Mon, Dec 5, 2016 at 6:06 PM, long <longteng...@gmail.com>

Re: Can I add a new method to RDD class?

2016-12-05 Thread long
example of using this pattern. >> If you want I can quickly cook up a short complete example with RDD (although there is nothing really more than my example in the earlier mail)? >> Thanks, Tarun Kumar >> On Mon, 5 Dec 2016 at 7:15 AM, long <[

Re: Can I add a new method to RDD class?

2016-12-04 Thread long
So is there documentation of this I can refer to? > On Dec 5, 2016, at 1:07 AM, Tarun Kumar [via Apache Spark Developers List] wrote: > Hi Tenglong, In addition to trsell's reply, you can add any method to an RDD without making changes to
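The pattern being referred to is Scala's enrich-my-library idiom; a minimal self-contained sketch (the method name is just an example):

    import org.apache.spark.rdd.RDD

    object RDDExtensions {
      // implicitly wraps any RDD, adding a method without touching Spark's source
      implicit class RichRDD[T](val rdd: RDD[T]) extends AnyVal {
        def countPerPartition(): RDD[Int] = rdd.mapPartitions(it => Iterator(it.size))
      }
    }

    // usage: import RDDExtensions._ then sc.parallelize(1 to 10, 2).countPerPartition().collect()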

Re: Can I add a new method to RDD class?

2016-12-04 Thread long
So in my sbt build script, I have the same line as instructed in the quickstart guide: libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.2". And since I was able to see all the other logs I added into the Spark source code,