Can you try the lastest 1.6.0 RC which includes SPARK-11111 ? Cheers
On Fri, Dec 18, 2015 at 7:38 AM, Prasad Ravilla <pras...@slalom.com> wrote: > Hi, > > I am running into performance issue when joining data frames created from > avro files using spark-avro library. > > The data frames are created from 120K avro files and the total size is > around 1.5 TB. > The two data frames are very huge with billions of records. > > *The join for these two DataFrames runs forever.* > This process runs on a yarn cluster with 300 executors with 4 executor > cores and 8GB memory. > > Any insights on this join will help. I have posted the explain plan below. > I notice a CartesianProduct in the Physical Plan. I am wondering if this > is causing the performance issue. > > > Below is the logical plan and the physical plan. ( Due to the confidential > nature, I am unable to post any of the column names or the file names here ) > > == Optimized Logical Plan == > Limit 21 > Join Inner, [ Join Conditions ] > Join Inner, [ Join Conditions ] > Project [ List of columns ] > Relation [ List of columns ] AvroRelation[ fileName1 ] -- Another > large file > InMemoryRelation [List of columsn ], true, 10000, StorageLevel(true, > true, false, true, 1), (Repartition 1, false), None > Project [ List of Columns ] > Relation[ List of Columns] AvroRelation[ filename2 ] -- This is a very > large file > > == Physical Plan == > Limit 21 > Filter (filter conditions) > CartesianProduct > Filter (more filter conditions) > CartesianProduct > Project (selecting a few columns and applying a UDF to one column) > Scan AvroRelation[avro file][ columns in Avro File ] > InMemoryColumnarTableScan [List of columns ], true, 10000, > StorageLevel(true, true, false, true, 1), (Repartition 1, false), None) > Project [ List of Columns ] > Scan AvroRelation[Avro File][List of Columns] > > Code Generation: true > > > Thanks, > Prasad. >