Re: [DISCUSS] Multiple columns adding/replacing support in PySpark DataFrame API

2021-04-29 Thread Saurabh Chawla
Hi All,

I also had a scenario where, at runtime, I needed to loop over a
dataframe and call withColumn many times.

To be on the safer side, I used reflection to access withColumns and
prevent any java.lang.StackOverflowError.

import org.apache.spark.sql.{Column, DataFrame}

// Dataset.withColumns(Seq[String], Seq[Column]) is private[spark], but Scala
// package-private members are public in bytecode, so getMethod can find it.
val dataSetClass = Class.forName("org.apache.spark.sql.Dataset")
val withColumnsMethod =
  dataSetClass.getMethod("withColumns", classOf[Seq[String]], classOf[Seq[Column]])
withColumnsMethod.invoke(
  baseDataFrame, columnName, columnValue).asInstanceOf[DataFrame]
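
Here baseDataFrame is the input DataFrame, and the two argument lists might
be built up like this (illustrative values only; columnName and columnValue
just mirror the names used in the snippet above):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

val columnName: Seq[String] = Seq("key1", "key2")
val columnValue: Seq[Column] = Seq(col("key1") + 1, lit(0))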

It would be great if we could use "withColumns" directly rather than
resorting to reflection code like this,
or
change the code to merge the new project with the existing project in the
plan, instead of adding a new project every time we call "withColumn".
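
In the meantime, a user-land approximation that keeps the single-Project
behaviour is to route everything through one select. A minimal sketch,
assuming only Spark's public API (withColumnsCompat and its parameter names
are made up here, not Spark API):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

def withColumnsCompat(df: DataFrame, colNames: Seq[String], cols: Seq[Column]): DataFrame = {
  require(colNames.length == cols.length, "colNames and cols must have the same length")
  val newCols = colNames.zip(cols).toMap
  // Replace existing columns in place, keeping the original column order...
  val kept = df.columns.map(n => newCols.getOrElse(n, col(n)).as(n))
  // ...and append genuinely new columns at the end, as withColumn would.
  val added = colNames.filterNot(df.columns.contains).map(n => newCols(n).as(n))
  df.select(kept ++ added: _*)
}

Because everything goes through a single select, the plan gains one Project
node no matter how many columns are added or replaced.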

+1 for exposing the *withColumns*

Regards
Saurabh Chawla

On Thu, Apr 22, 2021 at 1:03 PM Yikun Jiang  wrote:

> Hi, all
>
> *Background:*
>
> Currently, there is a withColumns
> <https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402>[1]
> method to help users/devs add/replace multiple columns at once.
> But this method is private and not exposed as a public API, which means
> users cannot call it directly, and it is not supported in the PySpark API.
>
> As a DataFrame user, I can only call withColumn() multiple times:
>
> df.withColumn("key1", col("key1")).withColumn("key2", 
> col("key2")).withColumn("key3", col("key3"))
>
> rather than:
>
> df.withColumn(["key1", "key2", "key3"], [col("key1"), col("key2"), 
> col("key3")])
>
> Multiple calls bring a higher cost in developer experience and
> performance, especially in PySpark, where every withColumn call is a
> separate py4j round trip.
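>
> To make the JVM-side cost concrete, here is a minimal local sketch
> (synthetic column names, assuming nothing beyond the public API); each
> withColumn call wraps the plan in one more Project node, and in PySpark
> each call additionally crosses py4j:
>
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions.lit
>
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> val base = spark.range(1).toDF("id")
> // 200 chained withColumn calls build a deeply nested analyzed plan;
> // the optimizer collapses the Projects later, but a deep enough chain
> // can overflow the stack during analysis.
> val wide = (1 to 200).foldLeft(base)((df, i) => df.withColumn(s"c$i", lit(i)))
> wide.explain(true) // the analyzed plan shows a long chain of nested Projects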
>
> As @Hyukjin mentioned
> <https://github.com/apache/spark/pull/32276#issuecomment-824461143>,
> there were some previous discussions in SPARK-12225
> <https://issues.apache.org/jira/browse/SPARK-12225> [2].
>
> [1]
> https://github.com/apache/spark/blob/b5241c97b17a1139a4ff719bfce7f68aef094d95/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2402
> [2] https://issues.apache.org/jira/browse/SPARK-12225
>
> *Potential solution:*
> It looks like there are two potential solutions if we want to support this:
>
> 1. Introduce a *withColumns* API for Scala/Python (see the signature
> sketch after this list).
> A separate public withColumns API would be added to the Scala and Python
> APIs.
>
> 2. Make withColumn accept either a *single col* or a *list of cols*.
> I made an experimental attempt in PySpark at
> https://github.com/apache/spark/pull/32276
> As Maciej said
> <https://github.com/apache/spark/pull/32276#pullrequestreview-641280217>,
> it would bring some confusion with naming.
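>
> For comparison, here are the two options' shapes side by side
> (hypothetical signatures to illustrate the naming question only;
> neither exists in Spark today):
>
> import org.apache.spark.sql.{Column, DataFrame}
>
> trait OptionOne { // a separate, explicitly plural API
>   def withColumns(colNames: Seq[String], cols: Seq[Column]): DataFrame
> }
>
> trait OptionTwo { // overload the existing singular name with plural arguments
>   def withColumn(colName: String, col: Column): DataFrame
>   def withColumn(colNames: Seq[String], cols: Seq[Column]): DataFrame
> }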
>
>
> Thanks for reading; feel free to reply if you have any other concerns
> or suggestions!
>
>
> Regards,
> Yikun
>


Re: Spark master build hangs using parallel build option in maven

2020-01-17 Thread Saurabh Chawla
Hi Sean,

Thanks for checking this.

I can see the parallel build info in the README file:
https://github.com/apache/spark#building-spark

"
You can build Spark using more than one thread by using the -T option with
Maven, see "Parallel builds in Maven 3"
<https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3>.
More detailed documentation is available from the project site, at "Building
Spark" <https://spark.apache.org/docs/latest/building-spark.html>.
"

This used to work when building older versions of Spark (2.4.3, 2.3.2, etc.):
build/mvn -Duse.zinc.server=false -DuseZincForJdk8=false -Dmaven.javadoc.skip=true -DskipSource=true -Phive -Phive-thriftserver -Phive-provided -Pyarn -Phadoop-provided -Dhadoop.version=2.8.5 -DskipTests=true -T 4 clean package

Also, I see that the Maven version changed from 3.5.4 to 3.6.3 on the master
branch compared to Spark 2.4.3. I am not sure whether this is due to a bug in
the Maven version used on master or some new change on the master branch that
prevents the parallel build.
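
One way to confirm which Maven the build/mvn wrapper actually resolves is the
standard Maven version flag (nothing Spark-specific):

build/mvn --version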

Regards
Saurabh Chawla


On Sat, Jan 18, 2020 at 2:19 AM Sean Owen  wrote:

> Indeed, I don't believe you can use a parallel build. Some things
> collide with each other. Some of the suites are already run in parallel
> inside the build, though.
>
> On Fri, Jan 17, 2020 at 1:23 PM Saurabh Chawla  wrote:
> >
> > Hi All,
> >
> > Spark master build hangs using the parallel build option in Maven. When
> > running the build sequentially on Spark master using Maven, the build
> > did not hang. This issue occurs when the hadoop-provided option
> > (-Phadoop-provided -Dhadoop.version=2.8.5) is given. The same command
> > works fine when building spark-2.4.3 in parallel.
> >
> > Command to build Spark master sequentially - Spark build works fine
> > build/mvn -Duse.zinc.server=false -DuseZincForJdk8=false -Dmaven.javadoc.skip=true -DskipSource=true -Phive -Phive-thriftserver -Phive-provided -Pyarn -Phadoop-provided -Dhadoop.version=2.8.5 -DskipTests=true clean package
> >
> > Command to build Spark master in parallel - Spark build hangs
> > build/mvn -X -Duse.zinc.server=false -DuseZincForJdk8=false -Dmaven.javadoc.skip=true -DskipSource=true -Phive -Phive-thriftserver -Phive-provided -Pyarn -Phadoop-provided -Dhadoop.version=2.8.5 -DskipTests=true -T 4 clean package
> >
> > This is the trace that keeps repeating in the Maven logs:
> >
> > [DEBUG] building maven31 dependency graph for org.apache.spark:spark-network-yarn_2.12:jar:3.0.0-SNAPSHOT with Maven31DependencyGraphBuilder
> > [DEBUG] Dependency collection stats: {ConflictMarker.analyzeTime=60583, ConflictMarker.markTime=23750, ConflictMarker.nodeCount=419, ConflictIdSorter.graphTime=41262, ConflictIdSorter.topsortTime=9704, ConflictIdSorter.conflictIdCount=105, ConflictIdSorter.conflictIdCycleCount=0, ConflictResolver.totalTime=632542, ConflictResolver.conflictItemCount=193, DefaultDependencyCollector.collectTime=1020759, DefaultDependencyCollector.transformTime=775495}
> > [DEBUG] org.apache.spark:spark-network-yarn_2.12:jar:3.0.0-SNAPSHOT
> > [DEBUG]    org.apache.spark:spark-network-shuffle_2.12:jar:3.0.0-SNAPSHOT:compile
> > [DEBUG]       org.apache.spark:spark-network-common_2.12:jar:3.0.0-SNAPSHOT:compile
> > [DEBUG]          io.netty:netty-all:jar:4.1.42.Final:compile (version managed from 4.1.42.Final)
> > [DEBUG]          org.apache.commons:commons-lang3:jar:3.9:compile (version managed from 3.9)
> > [DEBUG]          org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile (version managed from 1.8)
> > [DEBUG]          com.fasterxml.jackson.core:jackson-databind:jar:2.10.0:compile (version managed from 2.10.0)
> > [DEBUG]             com.fasterxml.jackson.core:jackson-core:jar:2.10.0:compile (version managed from 2.10.0)
> > [DEBUG]          com.fasterxml.jackson.core:jackson-annotations:jar:2.10.0:compile (version managed from 2.10.0)
> > [DEBUG]          com.google.code.findbugs:jsr305:jar:3.0.0:compile (version managed from 3.0.0)
> > [DEBUG]          com.google.guava:guava:jar:14.0.1:provided (scope managed from compile) (version managed from 14.0.1)
> > [DEBUG]          org.apache.commons:commons-crypto:jar:1.0.0:compile (version managed from 1.0.0) (exclusions managed from [net.java.dev.jna:jna:*:*])
> > [DEBUG]       io.dropwizard.metrics:metrics-core:jar:4.1.1:compile (version managed from 4.1.1)
> > [DEBUG]    org.apache.spark:spark-tags_2.12:jar:3.0.0-SNAPSHOT:test
> > [DEBUG]       org.scala-lang:scala-library:jar:2.12.10:compile (version managed from 2.12.10)
> > [DEBUG]    org.apache.spark:spark-tags_2.12:jar:tests:3.0.0-SNAPSHOT:test
> > [DEBUG]    or

Spark master build hangs using parallel build option in maven

2020-01-17 Thread Saurabh Chawla
-annotations:jar:2.10.0:provided (version managed from 2.7.8)
[DEBUG]    jakarta.xml.bind:jakarta.xml.bind-api:jar:2.3.2:provided
[DEBUG]    jakarta.activation:jakarta.activation-api:jar:1.2.1:provided
[DEBUG] com.fasterxml.jackson.jaxrs:jackson-jaxrs-json-provider:jar:2.7.8:provided
[DEBUG]    com.fasterxml.jackson.jaxrs:jackson-jaxrs-base:jar:2.7.8:provided
[DEBUG] org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.8.5:provided
[DEBUG]    org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.8.5:provided
[DEBUG]    org.apache.hadoop:hadoop-annotations:jar:2.8.5:provided
[DEBUG] org.slf4j:slf4j-api:jar:1.7.16:provided
[DEBUG] org.spark-project.spark:unused:jar:1.0.0:compile
[DEBUG] org.scalatest:scalatest_2.12:jar:3.0.8:test
[DEBUG]    org.scalactic:scalactic_2.12:jar:3.0.8:test
[DEBUG]    org.scala-lang:scala-reflect:jar:2.12.10:test (version managed from 2.12.8)
[DEBUG]    org.scala-lang.modules:scala-xml_2.12:jar:1.2.0:test (version managed from 1.2.0)
[DEBUG] junit:junit:jar:4.12:test
[DEBUG]    org.hamcrest:hamcrest-core:jar:1.3:test (scope managed from compile) (version managed from 1.3)
[DEBUG] com.novocode:junit-interface:jar:0.11:test
[DEBUG]    org.scala-sbt:test-interface:jar:1.0:test
[DEBUG] updateExcludesInDeps()

I'll be very grateful if someone could help.

Regards
Saurabh Chawla