Contribute to Apache Spark

2020-06-29 Thread ????????
Hi,
I want to contribute to Apache Spark.
Would you please give me the contributor permission?
My JIRA ID is suizhe007.

Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread wuyi
This could be a sub-task of https://issues.apache.org/jira/browse/SPARK-25299
(Use remote storage for persisting shuffle data)?

It would be good if we could get the whole of SPARK-25299 into Spark 3.1.



Holden Karau wrote
> Should we also consider the shuffle service refactoring to support
> pluggable storage engines as targeting the 3.1 release?
> 
> On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk  wrote:
> 
>> Hi Dongjoon,
>>
>> I would add:
>> - Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
>> - Filters pushdown to other datasources like Avro
>> - Support nested attributes of filters pushed down to JSON
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun  wrote:
>>
>>> Hi, All.
>>>
>>> After a short celebration of Apache Spark 3.0, I'd like to ask you the
>>> community opinion on Apache Spark 3.1 feature expectations.
>>>
>>> First of all, Apache Spark 3.1 is scheduled for December 2020.
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> I'm expecting the following items:
>>>
>>> 1. Support Scala 2.13
>>> 2. Use Apache Hadoop 3.2 by default for better cloud support
>>> 3. Declaring Kubernetes Scheduler GA
>>> In my perspective, the last main missing piece was Dynamic
>>> allocation
>>> and
>>> - Dynamic allocation with shuffle tracking is already shipped at
>>> 3.0.
>>> - Dynamic allocation with worker decommission/data migration is
>>> targeting 3.1. (Thanks, Holden)
>>> 4. DSv2 Stabilization
>>>
>>> I'm aware of some more features which are on the way currently, but I
>>> love to hear the opinions from the main developers and more over the
>>> main
>>> users who need those features.
>>>
>>> Thank you in advance. Welcome for any comments.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
> 
> -- 
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Jungtaek Lim
Does this count only "new features" (probably major ones), or also
"improvements"? I'm aware of a couple of improvements which should ideally be
included in the next release, but if this counts only major new features then
I don't feel they should be listed.

On Tue, Jun 30, 2020 at 1:32 AM Holden Karau  wrote:

> Should we also consider the shuffle service refactoring to support
> pluggable storage engines as targeting the 3.1 release?
>
> On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk 
> wrote:
>
>> Hi Dongjoon,
>>
>> I would add:
>> - Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
>> - Filters pushdown to other datasources like Avro
>> - Support nested attributes of filters pushed down to JSON
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>>
>> On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> After a short celebration of Apache Spark 3.0, I'd like to ask you the
>>> community opinion on Apache Spark 3.1 feature expectations.
>>>
>>> First of all, Apache Spark 3.1 is scheduled for December 2020.
>>> - https://spark.apache.org/versioning-policy.html
>>>
>>> I'm expecting the following items:
>>>
>>> 1. Support Scala 2.13
>>> 2. Use Apache Hadoop 3.2 by default for better cloud support
>>> 3. Declaring Kubernetes Scheduler GA
>>> In my perspective, the last main missing piece was Dynamic
>>> allocation and
>>> - Dynamic allocation with shuffle tracking is already shipped at 3.0.
>>> - Dynamic allocation with worker decommission/data migration is
>>> targeting 3.1. (Thanks, Holden)
>>> 4. DSv2 Stabilization
>>>
>>> I'm aware of some more features which are on the way currently, but I
>>> love to hear the opinions from the main developers and more over the main
>>> users who need those features.
>>>
>>> Thank you in advance. Welcome for any comments.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Spark 3 pod template for the driver

2020-06-29 Thread edeesis
If I could hazard a guess, you still need to specify the executor image. As
is, this will only specify the driver image.

You can specify it with --conf spark.kubernetes.container.image or --conf
spark.kubernetes.executor.container.image.
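
For example, alongside the pod template option it would look something like
this (the image name is just the one from the original template; adjust as
needed, this is only a sketch):

spark-submit \
  --conf spark.kubernetes.driver.podTemplateFile=/data/k8s/podtemplate_driver3.yaml \
  --conf spark.kubernetes.container.image=mydockerregistry.example.com/images/dev/spark3:latest \
  ...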



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][SPIP] Graceful Decommissioning

2020-06-29 Thread Holden Karau
Ah, I had thought there was a larger issue given the scope of the comments.
Excited to hear that is not the case. I'll respond in the doc :)

On Mon, Jun 29, 2020 at 8:03 AM wuyi  wrote:

> I've left the comments in SPIP, so let's discuss there.
>
>
> Holden Karau wrote
> > So from the template I believe the SPIP is supposed to be more high level
> > and then design goes into the linked “design sketch.” What sort of detail
> > would you like to see added?
> >
> > On Mon, Jun 29, 2020 at 1:38 AM wuyi  wrote:
> >
> >> Thank you for your effort, Holden.
> >>
> >> I left a few comments in SPIP. I asked for some details, though I know
> >> some
> >> contents have been include in the design doc. I'm not very clear about
> >> difference between the design doc and SPIP. But from what I saw at the
> >> SPIP
> >> template questions, I think some details maybe still needed.
> >>
> >>
> >> --
> >> Yi
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >> --
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.):
> > https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Holden Karau
Should we also consider the shuffle service refactoring to support
pluggable storage engines as targeting the 3.1 release?

On Mon, Jun 29, 2020 at 9:31 AM Maxim Gekk 
wrote:

> Hi Dongjoon,
>
> I would add:
> - Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
> - Filters pushdown to other datasources like Avro
> - Support nested attributes of filters pushed down to JSON
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> After a short celebration of Apache Spark 3.0, I'd like to ask you the
>> community opinion on Apache Spark 3.1 feature expectations.
>>
>> First of all, Apache Spark 3.1 is scheduled for December 2020.
>> - https://spark.apache.org/versioning-policy.html
>>
>> I'm expecting the following items:
>>
>> 1. Support Scala 2.13
>> 2. Use Apache Hadoop 3.2 by default for better cloud support
>> 3. Declaring Kubernetes Scheduler GA
>> In my perspective, the last main missing piece was Dynamic allocation
>> and
>> - Dynamic allocation with shuffle tracking is already shipped at 3.0.
>> - Dynamic allocation with worker decommission/data migration is
>> targeting 3.1. (Thanks, Holden)
>> 4. DSv2 Stabilization
>>
>> I'm aware of some more features which are on the way currently, but I
>> love to hear the opinions from the main developers and more over the main
>> users who need those features.
>>
>> Thank you in advance. Welcome for any comments.
>>
>> Bests,
>> Dongjoon.
>>
>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Maxim Gekk
Hi Dongjoon,

I would add:
- Filters pushdown to JSON (https://github.com/apache/spark/pull/27366)
- Filters pushdown to other datasources like Avro
- Support nested attributes of filters pushed down to JSON

Maxim Gekk

Software Engineer

Databricks, Inc.


On Mon, Jun 29, 2020 at 7:07 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> After a short celebration of Apache Spark 3.0, I'd like to ask you the
> community opinion on Apache Spark 3.1 feature expectations.
>
> First of all, Apache Spark 3.1 is scheduled for December 2020.
> - https://spark.apache.org/versioning-policy.html
>
> I'm expecting the following items:
>
> 1. Support Scala 2.13
> 2. Use Apache Hadoop 3.2 by default for better cloud support
> 3. Declaring Kubernetes Scheduler GA
> In my perspective, the last main missing piece was Dynamic allocation
> and
> - Dynamic allocation with shuffle tracking is already shipped at 3.0.
> - Dynamic allocation with worker decommission/data migration is
> targeting 3.1. (Thanks, Holden)
> 4. DSv2 Stabilization
>
> I'm aware of some more features which are on the way currently, but I love
> to hear the opinions from the main developers and more over the main users
> who need those features.
>
> Thank you in advance. Welcome for any comments.
>
> Bests,
> Dongjoon.
>


Re: Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread JackyLee
Thank you for putting this forward.
Can we include support for view and partition catalogs in version 3.1?
AFAICT, these are great features in DSv2 and the catalog API. With these, we
can work well with warehouses such as Delta or Hive.

https://github.com/apache/spark/pull/28147
https://github.com/apache/spark/pull/28617

Thanks.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Apache Spark 3.1 Feature Expectation (Dec. 2020)

2020-06-29 Thread Dongjoon Hyun
Hi, All.

After a short celebration of Apache Spark 3.0, I'd like to ask for the
community's opinion on Apache Spark 3.1 feature expectations.

First of all, Apache Spark 3.1 is scheduled for December 2020.
- https://spark.apache.org/versioning-policy.html

I'm expecting the following items:

1. Support Scala 2.13
2. Use Apache Hadoop 3.2 by default for better cloud support
3. Declaring Kubernetes Scheduler GA
From my perspective, the last main missing piece was dynamic allocation:
- Dynamic allocation with shuffle tracking already shipped in 3.0.
- Dynamic allocation with worker decommission/data migration is
targeting 3.1. (Thanks, Holden)
4. DSv2 Stabilization

I'm aware of some more features currently on the way, but I would love to hear
opinions from the main developers and, even more, from the main users who need
those features.

Thank you in advance. Any comments are welcome.

Bests,
Dongjoon.


Re: [DISCUSS][SPIP] Graceful Decommissioning

2020-06-29 Thread wuyi
I've left the comments in SPIP, so let's discuss there.


Holden Karau wrote
> So from the template I believe the SPIP is supposed to be more high level
> and then design goes into the linked “design sketch.” What sort of detail
> would you like to see added?
> 
> On Mon, Jun 29, 2020 at 1:38 AM wuyi  wrote:
> 
>> Thank you for your effort, Holden.
>>
>> I left a few comments in SPIP. I asked for some details, though I know
>> some
>> contents have been include in the design doc. I'm not very clear about
>> difference between the design doc and SPIP. But from what I saw at the
>> SPIP
>> template questions, I think some details maybe still needed.
>>
>>
>> --
>> Yi
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][SPIP] Graceful Decommissioning

2020-06-29 Thread Holden Karau
So from the template I believe the SPIP is supposed to be more high level
and then design goes into the linked “design sketch.” What sort of detail
would you like to see added?

On Mon, Jun 29, 2020 at 1:38 AM wuyi  wrote:

> Thank you for your effort, Holden.
>
> I left a few comments in SPIP. I asked for some details, though I know some
> contents have been include in the design doc. I'm not very clear about
> difference between the design doc and SPIP. But from what I saw at the SPIP
> template questions, I think some details maybe still needed.
>
>
> --
> Yi
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Contract for PartitionReader/InputPartition for ColumnarBatch?

2020-06-29 Thread Bobby Evans
Micah,

You are correct. The contract for processing ColumnarBatches is that the
code that produced the batch is responsible for closing it and
anything downstream of it cannot keep any references to it. This is just
like with UnsafeRow.  If an UnsafeRow is cached, like for aggregates or
sorts, it must be copied into a separate memory buffer. This does not lend
itself to efficient memory management when doing columnar processing, but
for the intended purpose of loading columnar data and then instantly
turning it into rows, it works fine.
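
To make that concrete, here is roughly what a consumer has to do under the
current contract (a hypothetical sketch, not actual Spark internals; the
reader type is the DSv2 PartitionReader[ColumnarBatch]):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader
import org.apache.spark.sql.vectorized.ColumnarBatch

def drainReader(reader: PartitionReader[ColumnarBatch]): Seq[InternalRow] = {
  val kept = new ArrayBuffer[InternalRow]()
  while (reader.next()) {
    // The producer may reuse this batch (and its memory) on the next call
    // to next(), and the producer is the one that eventually closes it.
    val batch = reader.get()
    val rows = batch.rowIterator()
    while (rows.hasNext) {
      // Anything retained past this iteration must be copied, just as
      // UnsafeRows are copied before being cached for sorts or aggregates.
      kept += rows.next().copy()
    }
  }
  kept.toSeq
}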

Any change to this contract would require performance testing.  This is
because several of the input formats are written to reuse the batch/memory
buffer. Spark is configured by default to keep the batch size small so that
the batch can fit in the CPU cache.  A change to the contract would
potentially mean a lot of object churn for GC to handle, or some possibly
complex code to do reference counting and memory reuse.

I personally would prefer to see this change because we are doing columnar
processing with a lot of transformations that we don't want to keep memory
statically allocated for. In our plugin, we have the consumer be
responsible for closing the incoming batch, but our batch sizes are a lot
larger so GC pressure is less of an issue. The only thing for us is that we
have to manage the transition between the spark columnar model and our
plugin's internal columnar model.  Not a big deal though.

Thanks,

Bobby

On Sat, Jun 27, 2020 at 11:28 PM Micah Kornfield 
wrote:

> Hello spark-dev,
>
> Looking at ColumnarBatch [1] it seems to indicate a single object is meant
> to be used for the entire loading process.
>
> Does this imply that Spark assumes the ColumnarBatch and any direct
> references to ColumnarBatch (e.g. UTF8Strings) returned by
> InputPartitionReader/PartitionReader [2][3] get invalidated after "next()"
> is called on the Reader?
>
> Does the same apply for InternalRow?
>
> Does it make sense to update the contracts one way or another (I'm happy
> to make a PR).?
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/spark/blob/c341de8b3e1f1d3327bd4ae3b0d2ec048f64d306/sql/catalyst/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java
> [2]
> https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java
> [3]
> https://github.com/apache/spark/blob/a5efbb284e29b1d879490a4ee2c9fa08acec42b0/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/PartitionReader.java
>


Announcing ApacheCon @Home 2020

2020-06-29 Thread Rich Bowen

Hi, Apache enthusiast!

(You’re receiving this because you’re subscribed to one or more dev or 
user mailing lists for an Apache Software Foundation project.)


The ApacheCon Planners and the Apache Software Foundation are pleased to 
announce that ApacheCon @Home will be held online, September 29th 
through October 1st, 2020. We’ll be featuring content from dozens of our 
projects, as well as content about community, how Apache works, business 
models around Apache software, the legal aspects of open source, and 
many other topics.


Full details about the event, and registration, are available at 
https://apachecon.com/acah2020


Due to the confusion around how and where this event was going to be 
held, and in order to open up to presenters from around the world who 
may previously have been unable or unwilling to travel, we’ve reopened 
the Call For Presentations until July 13th. Submit your talks today at 
https://acna2020.jamhosted.net/


We hope to see you at the event!
Rich Bowen, VP Conferences, The Apache Software Foundation

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: preferredlocations for hadoopfsrelations based baseRelations

2020-06-29 Thread Steve Loughran
Here's a class which lets you provide a function on a row-by-row basis to
declare locations:

https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/org/apache/spark/cloudera/ParallelizedWithLocalityRDD.scala

It needs to be in o.a.spark, as something you need is scoped to the Spark
packages only.

I used it for a PoC of a distcp replacement: each row was a filename, so
the location of each row was the server with the first block of the file:
https://github.com/hortonworks-spark/cloud-integration/blob/master/spark-cloud-integration/src/main/scala/com/cloudera/spark/cloud/applications/CloudCp.scala#L137

It would be convenient if either the bits of the API I needed were public or
the extra RDD code just went in somewhere. It's nothing complicated.
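
The core of it is small. A stripped-down, illustrative version (class and
parameter names here are hypothetical, and it is simplified from the class
linked above) looks roughly like this:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One element per partition, carrying preferred hosts computed up front
// (e.g. the hosts of the first block of the file that the element names).
case class LocatedPartition[T](index: Int, value: T, hosts: Seq[String])
  extends Partition

class LocatedSeqRDD[T: ClassTag](
    sc: SparkContext,
    data: Seq[T],
    locate: T => Seq[String]) extends RDD[T](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    data.zipWithIndex
      .map { case (v, i) => LocatedPartition(i, v, locate(v)) }
      .toArray

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    Iterator.single(split.asInstanceOf[LocatedPartition[T]].value)

  // The scheduler consults this when placing the task for each partition.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    split.asInstanceOf[LocatedPartition[T]].hosts
}

Each partition here holds a single element; a real version would batch
elements, but the scheduler hook is the same getPreferredLocations override.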

On Thu, 4 Jun 2020 at 09:31, ZHANG Wei  wrote:

> AFAICT, `FileScanRDD` invokes`FilePartition::preferredLocations()`
> method, which is ordered by the data size, to get the partition
> preferred locations. If there are other vectors to sort, I'm wondering
> if here[1] can be a place to add. Or inheriting class `FilePartition`
> with overridden `preferredLocations()` might also work.
>
> --
> Cheers,
> -z
> [1]
> https://github.com/apache/spark/blob/a4195d28ae94793b793641f121e21982bf3880d1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L43
>
> On Thu, 4 Jun 2020 06:40:43 +
> Nasrulla Khan Haris  wrote:
>
> > HI Spark developers,
> >
> > I have created new format extending fileformat. I see
> getPrefferedLocations is available if newCustomRDD is created. Since
> fileformat is based off FileScanRDD which uses readfile method to read
> partitioned file, Is there a way to add desired preferredLocations ?
> >
> > Appreciate your responses.
> >
> > Thanks,
> > NKH
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-29 Thread Steve Loughran
you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the
same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is
doomed. using a different aws sdk jar is a bit risky, though more recent
upgrades have all be fairly low stress
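
For reference: the org.apache.spark.internal.io.cloud classes live in the
optional spark-hadoop-cloud module, so that JAR needs to be on the classpath
as well. From memory (treat this as a sketch and check the cloud-integration
docs for your build), the usual bindings look like:

  spark.hadoop.fs.s3a.committer.name directory
  spark.sql.sources.commitProtocolClass org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
  spark.sql.parquet.output.committer.class org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter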

On Fri, 19 Jun 2020 at 05:39, murat migdisoglu 
wrote:

> Hi all
> I've upgraded my test cluster to spark 3 and change my comitter to
> directory and I still get this error.. The documentations are somehow
> obscure on that.
> Do I need to add a third party jar to support new comitters?
>
> java.lang.ClassNotFoundException:
> org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
>
>
> On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu <
> murat.migdiso...@gmail.com> wrote:
>
>> Hello all,
>> we have a hadoop cluster (using yarn) using  s3 as filesystem with
>> s3guard is enabled.
>> We are using hadoop 3.2.1 with spark 2.4.5.
>>
>> When I try to save a dataframe in parquet format, I get the following
>> exception:
>> java.lang.ClassNotFoundException:
>> com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
>>
>> My relevant spark configurations are as following:
>>
>> "hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
>> "fs.s3a.committer.name": "magic",
>> "fs.s3a.committer.magic.enabled": true,
>> "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>
>> While spark streaming fails with the exception above, apache beam
>> succeeds writing parquet files.
>> What might be the problem?
>>
>> Thanks in advance
>>
>>
>> --
>> "Talkers aren’t good doers. Rest assured that we’re going there to use
>> our hands, not our tongues."
>> W. Shakespeare
>>
>
>
> --
> "Talkers aren’t good doers. Rest assured that we’re going there to use
> our hands, not our tongues."
> W. Shakespeare
>


Re: Setting spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 and Doc issue

2020-06-29 Thread Steve Loughran
v2 does a file-by-file copy to the dest dir in task commit; v1 promotes
task attempts to the job attempt dir by directory rename, and job commit lists
those and moves the contents.

If the worker fails during task commit, the next task attempt has to
replace every file, so it had better use the same filenames.

The really scary issue is a network partition: if the first worker goes
off-line long enough for a second attempt to commit (if speculation is
enabled that may not be very long at all, as one could already be waiting),
then if the first worker comes back online it may continue with its commit
and partially overwrite some but not all of the output.

That task commit is not atomic, even though Spark requires it to be. It is
worse on Amazon S3 because rename is O(data), so the window for failure is a
lot longer.

The S3A committers don't commit their work until job commit; while that is
non-atomic (nor is MR v1, BTW), its time is |files| / min(|threads|,
max-http-pool-size).

The EMR Spark committer does actually commit its work in task commit, so it is
also vulnerable. I wish they copied more of our ASF-licensed code :). Or
some of IBM's Stocator work.


Presumably their algorithm is:

- pre-task-reporting ready-to-commit: upload files from the local task
attempt staging dir to the dest dir, without completing the upload. You could
actually do this with a scanning thread uploading as you go along.
- task commit: POST all the uploads.
- job commit: touch _SUCCESS.

That scales better (no need to load & commit uploads in job commit), does
not require a consistent cluster FS, and is faster.

But again: the failure semantics of task commit aren't what Spark expects.

Bonus fun: Google GCS dir commit is file-by-file, so non-atomic; v1 task
commit does expect an atomic dir rename. So you may as well use v2.

They could add a committer which didn't do that rename but just wrote a
manifest file to the job attempt dir pointing to the successful task
attempt, and commit that with their atomic file rename. The committer plugin
point in MR lets you declare a committer factory for each FS, so it could
be done without any further changes to Spark.
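
As an aside: if you do want to pin v1 explicitly rather than rely on whatever
default your Hadoop version ships with, something like this in
spark-defaults.conf (spark.hadoop.* properties are simply passed through to
the Hadoop configuration) should do it:

  spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 1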

On Thu, 25 Jun 2020 at 22:38, Waleed Fateem  wrote:

> I was trying to make my email short and concise, but the rationale behind
> setting that as 1 by default is because it's safer. With algorithm version
> 2 you run the risk of having bad data in cases where tasks fail or even
> duplicate data if a task fails and succeeds on a reattempt (I don't know if
> this is true for all OutputCommitters that extend the FileOutputCommitter
> or not).
>
> Imran and Marcelo also discussed this here:
>
> https://issues.apache.org/jira/browse/SPARK-20107?focusedCommentId=15945177=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15945177
>
> I also did discuss this a bit with Steve Loughran and his opinion was that
> v2 should just be deprecated all together. I believe he was going to bring
> that up with the Hadoop developers.
>
>
> On Thu, Jun 25, 2020 at 3:56 PM Sean Owen  wrote:
>
>> I think is a Hadoop property that is just passed through? if the
>> default is different in Hadoop 3 we could mention that in the docs. i
>> don't know if we want to always set it to 1 as a Spark default, even
>> in Hadoop 3 right?
>>
>> On Thu, Jun 25, 2020 at 2:43 PM Waleed Fateem 
>> wrote:
>> >
>> > Hello!
>> >
>> > I noticed that in the documentation starting with 2.2.0 it states that
>> the parameter spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
>> is 1 by default:
>> > https://issues.apache.org/jira/browse/SPARK-20107
>> >
>> > I don't actually see this being set anywhere explicitly in the Spark
>> code and so the documentation isn't entirely accurate in case you run on an
>> environment that has MAPREDUCE-6406 implemented (starting with Hadoop 3.0).
>> >
>> > The default version was explicitly set to 2 in the FileOutputCommitter
>> class, so any output committer that inherits from this class
>> (ParquetOutputCommitter for example) would use v2 in a Hadoop 3.0
>> environment and v1 in the older Hadoop environments.
>> >
>> > Would it make sense for us to consider setting v1 as the default in
>> code in case the configuration was not set by a user?
>> >
>> > Regards,
>> >
>> > Waleed
>>
>


Re: Spark 3 pod template for the driver

2020-06-29 Thread Michel Sumbul
Hello,

Adding the dev mailing list; maybe there is someone here who can help by
sharing a valid/accepted pod template for Spark 3?

Thanks in advance,
Michel


On Fri, 26 Jun 2020 at 14:03, Michel Sumbul  wrote:

> Hi Jorge,
> If I set that in the spark submit command it works but I want it only in
> the pod template file.
>
> Best regards,
> Michel
>
>> On Fri, 26 Jun 2020 at 14:01, Jorge Machado  wrote:
>
>> Try to set spark.kubernetes.container.image
>>
>> On 26. Jun 2020, at 14:58, Michel Sumbul  wrote:
>>
>> Hi guys,
>>
>> I try to use Spark 3 on top of Kubernetes and to specify a pod template
>> for the driver.
>>
>> Here is my pod manifest for the driver, and when I do a spark-submit with
>> the option:
>> --conf
>> spark.kubernetes.driver.podTemplateFile=/data/k8s/podtemplate_driver3.yaml
>>
>> I got the error message that I need to specify an image, but it's in the
>> manifest.
>> Does my manifest file is wrong, How should it look like?
>>
>> Thanks for your help,
>> Michel
>>
>> 
>> The pod manifest:
>>
>> apiVersion: v1
>> kind: Pod
>> metadata:
>>   name: mySpark3App
>>   labels:
>> app: mySpark3App
>> customlabel/app-id: "1"
>> spec:
>>   securityContext:
>> runAsUser: 1000
>>   volumes:
>> - name: "test-volume"
>>   emptyDir: {}
>>   containers:
>> - name: spark3driver
>>   image: mydockerregistry.example.com/images/dev/spark3:latest
>>   instances: 1
>>   resources:
>> requests:
>>   cpu: "1000m"
>>   memory: "512Mi"
>> limits:
>>   cpu: "1000m"
>>   memory: "512Mi"
>>   volumeMounts:
>>- name: "test-volume"
>>  mountPath: "/tmp"
>>
>>
>>


Re: UnknownSource NullPointerException in CodeGen. with Custom Strategy

2020-06-29 Thread wuyi
Hi Nasrulla,

Could you give a complete demo to reproduce the issue?



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS][SPIP] Graceful Decommissioning

2020-06-29 Thread wuyi
Thank you for your effort, Holden.

I left a few comments in the SPIP. I asked for some details, though I know some
of the content has been included in the design doc. I'm not very clear about the
difference between the design doc and the SPIP. But from what I saw in the SPIP
template questions, I think some details may still be needed.


--
Yi





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org