Re: Interested in contributing to SPARK-24815
Hi Pavan,

Per the ASF Source Header and Copyright Notice Policy [1], code submitted directly to the ASF should include the Apache license header without any additional copyright notice.

Kent Yao

[1] https://www.apache.org/legal/src-headers.html#headers

Sean Owen wrote on Tue, Jul 25, 2023 at 07:22:
>
> When contributing to an ASF project, contributions are governed by the terms of the ASF
> ICLA: https://www.apache.org/licenses/icla.pdf or CCLA:
> https://www.apache.org/licenses/cla-corporate.pdf
>
> I don't believe ASF projects ever retain an original author copyright
> statement; rather, source files carry a statement like:
>
> ...
> * Licensed to the Apache Software Foundation (ASF) under one or more
> * contributor license agreements. See the NOTICE file distributed with
> * this work for additional information regarding copyright ownership.
> ...
>
> While it's conceivable that such a statement could live in a NOTICE file, I
> don't believe that's been done for any of the thousands of other
> contributors. That file is really for noting the license of
> non-Apache-licensed code. Code contributed directly to the project is assumed
> to have been licensed per the above already.
>
> It might be wise to review the CCLA with Twilio and consider establishing
> that to govern contributions.
>
> On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi wrote:
>>
>> Hi Spark Dev,
>>
>> My name is Pavan Kotikalapudi; I work at Twilio.
>>
>> I am looking to contribute to this Spark issue:
>> https://issues.apache.org/jira/browse/SPARK-24815
>>
>> There is a clause from the company's OSS policy saying:
>>
>> - The proposed contribution is about 100 lines of code modification in the
>> Spark project, involving two files - this is considered a large
>> contribution. An appropriate Twilio copyright notice needs to be added for
>> the portion of code that is newly added.
>>
>> Please let me know if that is acceptable.
>>
>> Thank you,
>>
>> Pavan

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
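[For reference, the full standard ASF source header that the policy page [1] prescribes is reproduced below from memory of that page; verify against the canonical text at the link before use:]

```
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements.  See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied.  See the License for the specific language governing
permissions and limitations under the License.
```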
mapPartitions is Called Multiple Times
I am trying to run a Spark job which performs some database operations and saves the successfully processed records in one table and the failed ones in another. Here is the code:

```
log.info("Starting the spark job");
String sparkAppName = generateSparkAppName("reading-graph");
SparkConf sparkConf = getSparkConf(sparkAppName);
SparkSession sparkSession = SparkSession.builder()
    .config(sparkConf)
    .enableHiveSupport()
    .getOrCreate();

LocalDate startDate = LocalDate.of(2022, 1, 1);
final LongAccumulator accumulator =
    sparkSession.sparkContext().longAccumulator("Successful Processed Count");
LocalDate endDate = startDate.plusDays(90);

Dataset<Row> rows = sparkSession.table("db.graph_details")
    .select(new Column("_mm_dd"), new Column("timestamp"), new Column("email_id"))
    .filter("_mm_dd >= '" + startDate + "' AND _mm_dd < '" + endDate + "'");

Dataset<Tuple2<Boolean, String>> tuple2Dataset = rows.mapPartitions(
    new GetPaymentsGraphFeatures(accumulator),
    Encoders.tuple(Encoders.BOOLEAN(), Encoders.STRING()));
tuple2Dataset.persist();

Dataset<Row> successfulRows = tuple2Dataset
    .filter((FilterFunction<Tuple2<Boolean, String>>) booleanRowTuple2 -> booleanRowTuple2._1)
    .map((MapFunction<Tuple2<Boolean, String>, Row>) booleanRowTuple2 -> mapToRow(booleanRowTuple2._2),
        RowEncoder.apply(getSchema()));

Dataset<Row> failedRows = tuple2Dataset
    .filter((FilterFunction<Tuple2<Boolean, String>>) booleanRowTuple2 -> !booleanRowTuple2._1)
    .map((MapFunction<Tuple2<Boolean, String>, Row>) booleanRowTuple2 -> mapToRow(booleanRowTuple2._2),
        RowEncoder.apply(getFailureSchema()));

successfulRows.write().mode("overwrite").saveAsTable("db.deepak_jan_result");
failedRows.write().mode("overwrite").saveAsTable("db.deepak_jan_result_failures");
tuple2Dataset.unpersist();
log.info("Completed the spark job");
```

The Spark job runs the mapPartitions twice: once to produce successfulRows and once to produce failedRows. But ideally mapPartitions should run only once, right? Processing the successful records alone takes more than an hour. Can that be causing this behaviour? How can I ensure that mapPartitions runs only once?
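[Background on the behaviour being asked about: in Spark, each action (here, each saveAsTable) re-executes the lineage of transformations unless the intermediate result has actually been materialized, and persist() is lazy, so the cache is only populated by the first action. A minimal plain-Java sketch, with no Spark dependency and with illustrative names like expensiveTransform, models why two actions on an unmaterialized recipe run it twice while a materialized result is computed once:]

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Plain-Java model of Spark's lazy evaluation: a "dataset" is a recipe
// (Supplier) that every "action" re-executes unless the result is cached.
public class LazyEvalDemo {
    static AtomicInteger transformCalls = new AtomicInteger();

    // Stands in for the expensive mapPartitions transformation.
    static List<Integer> expensiveTransform(List<Integer> input) {
        transformCalls.incrementAndGet();
        return input.stream().map(x -> x * 2).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> source = List.of(1, 2, 3);

        // Unmaterialized: each "action" re-runs the recipe from scratch.
        Supplier<List<Integer>> lazy = () -> expensiveTransform(source);
        lazy.get(); // action 1, like the first saveAsTable
        lazy.get(); // action 2, like the second saveAsTable
        System.out.println("unmaterialized calls = " + transformCalls.get()); // 2

        // Materialized: compute once eagerly, then both actions reuse it.
        transformCalls.set(0);
        List<Integer> cached = expensiveTransform(source); // eager materialization
        Supplier<List<Integer>> fromCache = () -> cached;
        fromCache.get();
        fromCache.get();
        System.out.println("materialized calls = " + transformCalls.get()); // 1
    }
}
```

[In the real job this corresponds to the persist() call: the first write populates the cache and the second write should read from it, but recomputation can still happen if cached partitions don't fit in memory and are evicted.]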