Re: Interested in contributing to SPARK-24815

2023-07-25 Thread Kent Yao
Hi Pavan,

Refer to the ASF Source Header and Copyright Notice Policy[1], code
directly submitted to ASF should include the Apache license header
without any additional copyright notice.


Kent Yao

[1] https://www.apache.org/legal/src-headers.html#headers
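For reference, the standard ASF source header described on that policy page looks like this at the top of a Java source file (the comment syntax varies by language, but the wording is fixed):

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
```

Note that it deliberately contains no per-author or per-company copyright line; copyright ownership is covered by the NOTICE file and the contributor license agreements.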

Sean Owen wrote on Tue, Jul 25, 2023 at 07:22:

>
> When contributing to an ASF project, it's governed by the terms of the ASF 
> ICLA: https://www.apache.org/licenses/icla.pdf or CCLA: 
> https://www.apache.org/licenses/cla-corporate.pdf
>
> I don't believe ASF projects ever retain an original author copyright 
> statement, but rather source files have a statement like:
>
> ...
>  * Licensed to the Apache Software Foundation (ASF) under one or more
>  * contributor license agreements.  See the NOTICE file distributed with
>  * this work for additional information regarding copyright ownership.
> ...
>
> While it's conceivable that such a statement could live in a NOTICE file, I 
> don't believe that's been done for any of the thousands of other 
> contributors. That's really more for noting the license of 
> non-Apache-licensed code. Code directly contributed to the project is assumed 
> to have been licensed per above already.
>
> It might be wise to review the CCLA with Twilio and consider establishing 
> that to govern contributions.
>
> On Mon, Jul 24, 2023 at 6:10 PM Pavan Kotikalapudi 
>  wrote:
>>
>> Hi Spark Dev,
>>
>> My name is Pavan Kotikalapudi, I work at Twilio.
>>
>> I am looking to contribute to this spark issue 
>> https://issues.apache.org/jira/browse/SPARK-24815.
>>
>> There is a clause in the company's OSS policy saying:
>>
>> - The proposed contribution is about 100 lines of code modification in the 
>> Spark project, involving two files - this is considered a large 
>> contribution. An appropriate Twilio copyright notice needs to be added for 
>> the portion of code that is newly added.
>>
>> Please let me know whether that is acceptable.
>>
>> Thank you,
>>
>> Pavan
>>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Map Partition is called Multiple Times

2023-07-25 Thread Deepak Patankar
I am trying to run a Spark job which performs some database operations
and saves the records that pass in one table and the ones that fail in another.

Here is the code for the same:

```
log.info("Starting the spark job");

String sparkAppName = generateSparkAppName("reading-graph");
SparkConf sparkConf = getSparkConf(sparkAppName);
SparkSession sparkSession =
    SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();

LocalDate startDate = LocalDate.of(2022, 1, 1);
final LongAccumulator accumulator =
    sparkSession.sparkContext().longAccumulator("Successful Processed Count");
LocalDate endDate = startDate.plusDays(90);

Dataset<Row> rows = sparkSession.table("db.graph_details")
    .select(new Column("_mm_dd"), new Column("timestamp"), new Column("email_id"))
    .filter("_mm_dd >= '" + startDate + "' AND _mm_dd < '" + endDate + "'");

Dataset<Tuple2<Boolean, String>> tuple2Dataset =
    rows.mapPartitions(new GetPaymentsGraphFeatures(accumulator),
        Encoders.tuple(Encoders.BOOLEAN(), Encoders.STRING()));
tuple2Dataset.persist();

Dataset<Row> successfulRows =
    tuple2Dataset.filter((FilterFunction<Tuple2<Boolean, String>>)
        booleanRowTuple2 -> booleanRowTuple2._1).map(
        (MapFunction<Tuple2<Boolean, String>, Row>) booleanRowTuple2 ->
            mapToRow(booleanRowTuple2._2), RowEncoder.apply(getSchema()));

Dataset<Row> failedRows =
    tuple2Dataset.filter((FilterFunction<Tuple2<Boolean, String>>)
        booleanRowTuple2 -> !booleanRowTuple2._1).map(
        (MapFunction<Tuple2<Boolean, String>, Row>) booleanRowTuple2 ->
            mapToRow(booleanRowTuple2._2), RowEncoder.apply(getFailureSchema()));

successfulRows.write().mode("overwrite").saveAsTable("db.deepak_jan_result");
failedRows.write().mode("overwrite").saveAsTable("db.deepak_jan_result_failures");
tuple2Dataset.unpersist();
log.info("Completed the spark job");
```

The Spark job is running mapPartitions twice, once to compute the
successfulRows and once to compute the failedRows. But since
tuple2Dataset is persisted, shouldn't mapPartitions run only once?

The action that processes the successful rows takes more than an hour.
Could that be causing this behaviour?

How can I ensure that mapPartitions runs only once?
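(For anyone hitting this in the archive: persist() is lazy, so nothing is cached until an action runs; the second action only reuses the cache if the first one fully populated it and nothing was evicted. A common suggestion is to force materialization, e.g. tuple2Dataset.count() right after persist(). The general effect can be sketched in plain Java, with no Spark required, using a memoizing supplier; LazyDemo, memoize, and expensive are illustrative names, not Spark APIs:)

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class LazyDemo {
    // Counts how many times the "expensive" computation actually runs.
    static final AtomicInteger runs = new AtomicInteger();

    // Stands in for mapPartitions: recomputed on every call unless cached.
    static final Supplier<String> expensive = () -> {
        runs.incrementAndGet();
        return "partition-result";
    };

    // Memoize: compute once, then serve the cached value on later calls
    // (analogous to persist() followed by an eager materializing action).
    static <T> Supplier<T> memoize(Supplier<T> s) {
        return new Supplier<T>() {
            T value;
            boolean done;
            @Override public synchronized T get() {
                if (!done) { value = s.get(); done = true; }
                return value;
            }
        };
    }

    public static void main(String[] args) {
        // Without caching: two downstream consumers -> two computations,
        // like the successfulRows and failedRows writes each recomputing.
        expensive.get();
        expensive.get();
        System.out.println("uncached runs = " + runs.get());   // 2

        // With caching: materialize once, both consumers reuse the result.
        runs.set(0);
        Supplier<String> cached = memoize(expensive);
        cached.get();   // eager materialization (like count() after persist())
        cached.get();   // served from cache
        cached.get();   // served from cache
        System.out.println("cached runs = " + runs.get());     // 1
    }
}
```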
