A question about skew join hint
Hi all,

I saw the skew join hint optimization at https://docs.azuredatabricks.net/delta/join-performance/skew-join.html. It is a great feature that helps users avoid the problems caused by skewed data. My questions:

1. Which version will include this? I have not found the feature in the master branch by searching for the keyword "+SKEW".
2. The article is under Documentation -> Delta Lake -> Optimizations -> Optimize Join Performance -> Skew Join Optimization. Is this feature only available in Delta Lake?

Best Regards,
Kelly Zhang
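For context, when no skew hint is available in open-source Spark, a common manual workaround is key salting. The sketch below is illustrative plain Python (the function names and salt count are hypothetical, not Spark API), just to show the idea: spread a hot key on the large side across several salted sub-keys, and replicate the small side once per salt so every sub-key still finds its match.

```python
import random

# Hypothetical sketch of key salting, a common workaround for skewed joins
# when no skew hint is available. Names and N_SALTS are illustrative.

N_SALTS = 4  # number of sub-partitions to spread a hot key across

def salt_key(key):
    # Skewed (large) side: append a random salt so rows for one hot key
    # are spread across N_SALTS join partitions instead of one.
    return (key, random.randrange(N_SALTS))

def explode_keys(key):
    # Small side: replicate each key once per salt value so every salted
    # partition of the large side can still join against it.
    return [(key, s) for s in range(N_SALTS)]

print(explode_keys("hot_customer"))
# [('hot_customer', 0), ('hot_customer', 1), ('hot_customer', 2), ('hot_customer', 3)]
```

The trade-off is that the small side is replicated N_SALTS times, so this only pays off when the skew is severe.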
[DISCUSS] Remove sorting of fields in PySpark SQL Row construction
Currently, when a PySpark Row is created with keyword arguments, the fields are sorted alphabetically. This has created a lot of confusion with users because it is not obvious (although it is stated in the pydocs) that they will be sorted alphabetically, and later, when applying a schema whose field order does not match, an error occurs. Here is a list of some of the JIRAs I have been tracking that are all related to this issue: SPARK-24915, SPARK-22232, SPARK-27939, SPARK-27712, and relevant discussion of the issue [1].

The original reason for sorting fields is that kwargs in Python < 3.6 are not guaranteed to be in the order they were entered [2]. Sorting alphabetically ensures a consistent order. Matters are further complicated by the flag __from_dict__ that allows the Row fields to be referenced by name when made by kwargs, but this flag is not serialized with the Row and leads to inconsistent behavior. For instance:

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
Row(B='2', A='1')
>>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]), "B string, A string").first()
Row(B='1', A='2')

I think the best way to fix this is to remove the sorting of fields when constructing a Row. For users on Python 3.6+, nothing would change, because these versions of Python ensure that kwargs stay in the order entered. For users on Python < 3.6, using kwargs would check a conf to either raise an error or fall back to a LegacyRow that sorts the fields as before. With Python < 3.6 being deprecated now, this LegacyRow could also be removed at the same time. There are also other ways to create Rows that would not be affected.

I have opened a JIRA [3] to capture this, but I am wondering what others think about fixing this for Spark 3.0?

[1] https://github.com/apache/spark/pull/20280
[2] https://www.python.org/dev/peps/pep-0468/
[3] https://issues.apache.org/jira/browse/SPARK-29748
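To illustrate the core of the proposal, here is a minimal plain-Python sketch (not the actual PySpark implementation; both function names are hypothetical) of why alphabetical sorting breaks positional schema alignment, and why relying on PEP 468 kwargs ordering fixes it:

```python
# Illustrative sketch, not PySpark code. A schema like "B string, A string"
# matches Row values positionally, so field order matters.

def legacy_row(**kwargs):
    # Pre-fix behavior: fields sorted alphabetically to get a deterministic
    # order out of kwargs, which were unordered before Python 3.6.
    return tuple(kwargs[name] for name in sorted(kwargs))

def proposed_row(**kwargs):
    # Python 3.6+ (PEP 468): kwargs preserve insertion order, so fields can
    # keep the order the user wrote them in.
    return tuple(kwargs.values())

# Against a schema declared as "B string, A string":
print(legacy_row(B="2", A="1"))    # ('1', '2')  -- A sorted first: mismatch
print(proposed_row(B="2", A="1"))  # ('2', '1')  -- order as entered: matches
```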
Re: Avro file question
Assuming you always read the data together, one large file is good and a basic HDFS use case.

On Tue, 5 Nov 2019 at 4:28 am, Yaniv Harpaz wrote:
> It depends on your usage (when and how you read).
> Are the smaller files you were thinking about also larger than the HDFS block size?
> I would not go for something smaller than a block.
>
> Usually (if relevant to the way you read the data) the partitioning helps determine that.
>
> Yaniv Harpaz
> [ yaniv.harpaz at gmail.com ]
>
> On Mon, Nov 4, 2019 at 7:03 PM Sam wrote:
>> Hi,
>>
>> How do we choose between a single large avro file (size much larger than the HDFS block size) vs multiple smaller avro files (close to the HDFS block size)?
>>
>> Since avro is splittable, is there even a need to split a very large avro file into smaller files?
>>
>> I'm assuming that a single large avro file can also be split into multiple mappers/reducers/executors during processing.
>>
>> Thanks.

--
Best Regards,
Ayan Guha
Re: Avro file question
It depends on your usage (when and how you read). Are the smaller files you were thinking about also larger than the HDFS block size? I would not go for something smaller than a block.

Usually (if relevant to the way you read the data) the partitioning helps determine that.

Yaniv Harpaz
[ yaniv.harpaz at gmail.com ]

On Mon, Nov 4, 2019 at 7:03 PM Sam wrote:
> Hi,
>
> How do we choose between a single large avro file (size much larger than the HDFS block size) vs multiple smaller avro files (close to the HDFS block size)?
>
> Since avro is splittable, is there even a need to split a very large avro file into smaller files?
>
> I'm assuming that a single large avro file can also be split into multiple mappers/reducers/executors during processing.
>
> Thanks.
Avro file question
Hi,

How do we choose between a single large avro file (size much larger than the HDFS block size) vs multiple smaller avro files (close to the HDFS block size)?

Since avro is splittable, is there even a need to split a very large avro file into smaller files?

I'm assuming that a single large avro file can also be split into multiple mappers/reducers/executors during processing.

Thanks.
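To make the splittability point concrete, here is a rough plain-Python sketch (the helper name is hypothetical, not a Hadoop API) of how a splittable format like Avro maps onto input splits: one large file still yields roughly one split per HDFS block, so parallelism is not lost by keeping it as a single file.

```python
import math

# Rough sketch: splittable formats such as Avro yield approximately one
# input split (and hence one map task) per HDFS block, regardless of
# whether the data is one large file or many block-sized files.

def approx_num_splits(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    # At least one split, even for files smaller than a block.
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# One 10 GB Avro file on 128 MB blocks -> ~80 splits (~80 parallel tasks),
# about the same parallelism as eighty separate 128 MB files.
print(approx_num_splits(10 * 1024**3))  # 80
```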