Re: preferredlocations for hadoopfsrelations based baseRelations

2020-06-04 Thread ZHANG Wei
AFAICT, `FileScanRDD` invokes the `FilePartition::preferredLocations()` method, which orders hosts by data size, to get a partition's preferred locations. If there are other criteria to sort by, I wonder whether here[1] could be the place to add them. Or inheriting class `FilePartition` with overridden `preferredL
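
The size-based ordering described in this message can be pictured with a minimal sketch. This is not Spark's actual code: the `FileSplit` class, the `maxHosts` parameter, and the tie-break on host name are hypothetical, introduced only for illustration. It sums the bytes each host holds and returns the hosts with the most data first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PreferredLocationsSketch {

    // Hypothetical stand-in for one file split: its byte length and the hosts storing it.
    public static final class FileSplit {
        final long length;
        final List<String> hosts;

        public FileSplit(long length, List<String> hosts) {
            this.length = length;
            this.hosts = hosts;
        }
    }

    // Returns up to maxHosts host names, ordered by total bytes held (largest first).
    public static List<String> preferredLocations(List<FileSplit> files, int maxHosts) {
        // Sum the bytes that could be read locally from each host.
        Map<String, Long> hostToBytes = new HashMap<>();
        for (FileSplit file : files) {
            for (String host : file.hosts) {
                hostToBytes.merge(host, file.length, Long::sum);
            }
        }
        // Sort by byte count descending; break ties by host name (an assumption of this sketch).
        List<Map.Entry<String, Long>> entries = new ArrayList<>(hostToBytes.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries) {
            if (result.size() == maxHosts) {
                break;
            }
            result.add(entry.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<FileSplit> files = Arrays.asList(
                new FileSplit(100L, Arrays.asList("hostA", "hostB")),
                new FileSplit(300L, Arrays.asList("hostB", "hostC")),
                new FileSplit(50L, Arrays.asList("hostA")),
                new FileSplit(10L, Arrays.asList("hostD")));
        System.out.println(preferredLocations(files, 3));
    }
}
```

A different sort criterion, as the message suggests, would only change the comparator in this sketch; the per-host aggregation stays the same.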

Re: Applying schema dynamically in dataframe

2020-05-17 Thread ZHANG Wei
May I get a sample scenario to understand the requirement? -- Cheers, -z On Sat, 16 May 2020 11:45:03 +0530 rahul c wrote: > Hi dev, > > Currently I have a scenario where I am reading the data from Kafka using > spark dataframe. > > Multiple data sources ingest the data into kafka same topic

Executor exceptions stacktrace omitted by HotSpot in long running application

2020-05-07 Thread ZHANG Wei
Hi, I'm considering improving the experience when exception stack traces are omitted in long-running applications[1], which is caused by a JVM HotSpot optimization, as Shixiong(Ryan) commented[2]. There might be 2 options: 1. Add `-XX:-OmitStackTraceInFastThrow` as a common Executor JVM option.
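
For option 1, executor JVM flags are normally passed through the `spark.executor.extraJavaOptions` setting. A minimal sketch follows for confirming, from inside a running JVM, that the flag was actually passed. The class and method names are hypothetical, and the check only looks for the literal flag among the startup arguments; it does not account for a later argument re-enabling the optimization.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

public class FastThrowFlagCheck {

    // The HotSpot flag named in the message; the "-" after "XX:" turns the optimization off.
    static final String FLAG = "-XX:-OmitStackTraceInFastThrow";

    // Returns true if the given JVM argument list contains the flag verbatim.
    public static boolean isFastThrowOmissionDisabled(List<String> jvmArgs) {
        return jvmArgs.contains(FLAG);
    }

    public static void main(String[] args) {
        // Arguments the current JVM was started with (excludes arguments to main).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Stack trace omission disabled: "
                + isFastThrowOmissionDisabled(jvmArgs));
    }
}
```

Running this with `java -XX:-OmitStackTraceInFastThrow FastThrowFlagCheck` should report that omission is disabled; the same check could be logged once at executor startup.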

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-30 Thread ZHANG Wei
>> > >> Spark has targeted to have a unified API set rather than having separate > >> Java classes to reduce the maintenance cost, > >> e.g.) JavaRDD <> RDD vs DataFrame. These JavaXXX are more about the > >> legacy. > >> > >>

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-28 Thread ZHANG Wei
discouraged in > general up to my best knowledge. > A Java user won't likely know asJava in Scala but a Scala user will likely > know both asScala and asJava. > > > On Tue, Apr 28, 2020 at 11:35 AM, ZHANG Wei wrote: > > > How about making a small change on option

Re: [DISCUSS] Java specific APIs design concern and choice

2020-04-27 Thread ZHANG Wei
How about making a small change on option 4: keep the Scala API returning a Scala type instance, while providing an `asJava` method to return a Java type instance. Scala 2.13 provides CollectionConverter [1][2][3], so this can be supported natively in a later Spark dependency upgrade. For cu

Re: ShuffleMapStage and pendingPartitions vs isAvailable or findMissingPartitions?

2020-04-27 Thread ZHANG Wei
AFAICT, `pendingPartitions` is not a must-have: `mapOutputTrackerMaster` was added by a later change, so `pendingPartitions` can be cleaned up. -- Cheers, -z On Sun, 26 Apr 2020 11:53:09 +0200 Jacek Laskowski wrote: > Hi, > > I found that ShuffleMapStage has this (apparently superfluous) > pendingPart