RE: [SPARK-23207] Repro

2019-08-09 Thread tcondie
Hi Sean,

To finish the job, I did need to set spark.stage.maxConsecutiveAttempts to a 
large number, e.g., 100 (a suggestion from Jiang Xingbo).
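
For reference, a minimal sketch of applying that setting, assuming the job is built
through SparkSession (the same value can also be passed to spark-submit as
--conf spark.stage.maxConsecutiveAttempts=100); the value 100 is just the large
number used here, not a recommended default:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-23207-repro")
  // tolerate many consecutive stage attempts so repeated fetch failures don't abort the job
  .config("spark.stage.maxConsecutiveAttempts", "100")
  .getOrCreate()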

I haven't seen any recent movement/PRs on this issue, but I'll see if we can 
repro with a more recent version of Spark. 

Best regards,
Tyson

-----Original Message-----
From: Sean Owen  
Sent: Friday, August 9, 2019 7:49 AM
To: tcon...@gmail.com
Cc: dev 
Subject: Re: [SPARK-23207] Repro

Interesting but I'd put this on the JIRA, and also test vs master first. It's 
entirely possible this is something else that was subsequently fixed, and maybe 
even backported for 2.4.4.
(I can't quite reproduce it - just makes the second job fail, which is also 
puzzling)

On Fri, Aug 9, 2019 at 8:11 AM  wrote:
>
> Hi,
>
> We are able to reproduce this bug in Spark 2.4 using the following program:
>
> import scala.sys.process._
> import org.apache.spark.TaskContext
>
> val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}.repartition(20)
> res.distinct.count
>
> // kill an executor in the stage that performs repartition(239)
> val df = res.repartition(113).cache.repartition(239).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1) {
>     throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> df.distinct.count()
>
> The first df.distinct.count correctly produces 1
> The second df.distinct.count incorrectly produces 9769
>
> If the cache step is removed, then the bug does not reproduce.
>
> Best regards,
> Tyson





[SPARK-23207] Repro

2019-08-09 Thread tcondie
Hi,

 

We are able to reproduce this bug in Spark 2.4 using the following program:

import scala.sys.process._
import org.apache.spark.TaskContext

val res = spark.range(0, 1 * 1, 1).map{ x => (x % 1000, x)}.repartition(20)
res.distinct.count

// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).cache.repartition(239).map { x =>
  if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1) {
    throw new Exception("pkill -f java".!!)
  }
  x
}
df.distinct.count()

The first df.distinct.count correctly produces 1
The second df.distinct.count incorrectly produces 9769

If the cache step is removed, then the bug does not reproduce.
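
As a quick sanity check (a sketch only, assuming the res and df values from the
program above are still in scope): because the final map passes every element
through unchanged, df should contain exactly the rows of res, so the two distinct
counts should match whenever no data is lost or duplicated.

// The counts should agree if the retried stage recomputed the cached
// partitions consistently; a mismatch indicates lost or duplicated rows.
val expected = res.distinct.count()
val actual   = df.distinct.count()
println(s"expected=$expected actual=$actual match=${expected == actual}")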

 

Best regards,

Tyson

 



RE: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-12 Thread tcondie
+1 (non-binding)

 

Tyson Condie

 

From: Kazuaki Ishizaki  
Sent: Thursday, May 9, 2019 9:17 AM
To: Bryan Cutler 
Cc: Bobby Evans ; Spark dev list ;
Thomas graves 
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar
Processing Support

 

+1 (non-binding)

Kazuaki Ishizaki



From: Bryan Cutler <cutl...@gmail.com>
To: Bobby Evans <bo...@apache.org>
Cc: Thomas graves <tgra...@apache.org>, Spark dev list <dev@spark.apache.org>
Date: 2019/05/09 03:20
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support




+1 (non-binding)

On Tue, May 7, 2019 at 12:04 PM Bobby Evans <bo...@apache.org> wrote:
I am +1

On Tue, May 7, 2019 at 1:37 PM Thomas graves <tgra...@apache.org> wrote:
Hi everyone,

I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
for extended Columnar Processing Support.  The proposal is to extend
the support to allow for more columnar processing.  We had previous
vote and discussion threads and have updated the SPIP based on the
comments to clarify a few things and reduce the scope.

You can find the updated proposal in the jira at:
 
https://issues.apache.org/jira/browse/SPARK-27396.

Please vote as early as you can, I will leave the vote open until next
Monday (May 13th), 2pm CST to give people plenty of time.

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thanks!
Tom Graves






RE: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread tcondie
+1 (non-binding) for better columnar data processing support.

 

From: Jules Damji  
Sent: Friday, April 19, 2019 12:21 PM
To: Bryan Cutler 
Cc: Dev 
Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar 
Processing Support

 

+1 (non-binding)

Sent from my iPhone

Pardon the dumb thumb typos :)


On Apr 19, 2019, at 10:30 AM, Bryan Cutler <cutl...@gmail.com> wrote:

+1 (non-binding)

 

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe <jl...@apache.org> wrote:

+1 (non-binding).  Looking forward to seeing better support for processing 
columnar data.

 

Jason

 

On Tue, Apr 16, 2019 at 10:38 AM Tom Graves <tgraves...@yahoo.com.invalid> wrote:

Hi everyone,

 

I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for extended 
Columnar Processing Support.  The proposal is to extend the support to allow 
for more columnar processing.

 

You can find the full proposal in the jira at: 
https://issues.apache.org/jira/browse/SPARK-27396. There was also a DISCUSS 
thread in the dev mailing list.

 

Please vote as early as you can, I will leave the vote open until next Monday 
(the 22nd), 2pm CST to give people plenty of time.

 

[ ] +1: Accept the proposal as an official SPIP

[ ] +0

[ ] -1: I don't think this is a good idea because ...

 

 

Thanks!

Tom Graves



RE: Hive Hash in Spark

2019-03-07 Thread tcondie
Thanks Ryan and Reynold for the information!

 

Cheers,

Tyson

 

From: Ryan Blue  
Sent: Wednesday, March 6, 2019 3:47 PM
To: Reynold Xin 
Cc: tcon...@gmail.com; Spark Dev List 
Subject: Re: Hive Hash in Spark

 

I think this was needed to add support for bucketed Hive tables. Like Tyson 
noted, if the other side of a join can be bucketed the same way, then Spark can 
use a bucketed join. I have long-term plans to support this in the DataSourceV2 
API, but I don't think we are very close to implementing it yet.
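
As a rough illustration of the bucketed-join idea (a sketch using Spark's own
Murmur3-based bucketing, not Hive-hash bucketing; the table names are invented,
and this assumes a spark-shell session with a catalog that supports bucketed tables):

val a = spark.range(0, 1000000).withColumnRenamed("id", "key")
val b = spark.range(0, 1000000).withColumnRenamed("id", "key")

// Write both sides bucketed by the join key into the same number of buckets.
a.write.bucketBy(16, "key").sortBy("key").saveAsTable("bucketed_a")
b.write.bucketBy(16, "key").saveAsTable("bucketed_b")

// With matching bucketing on the join key, the join can be planned without a
// full shuffle of either side (check the plan for the absence of Exchange).
spark.table("bucketed_a").join(spark.table("bucketed_b"), "key").explain()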

 

rb

 

On Wed, Mar 6, 2019 at 1:57 PM Reynold Xin <r...@databricks.com> wrote:

  

 

I think they might be used in bucketing? Not 100% sure.

 

 

On Wed, Mar 06, 2019 at 1:40 PM, <tcon...@gmail.com> wrote:

Hi,

 

I noticed the existence of a Hive Hash partitioning implementation in Spark, 
but also noticed that it’s not being used, and that the Spark hash partitioning 
function is presently hardcoded to Murmur3. My question is whether Hive Hash is 
dead code or whether there are future plans to support reading and understanding 
data that has been partitioned using Hive Hash. By understanding, I mean that I’m 
able to avoid a full shuffle join on Table A (partitioned by Hive Hash) when 
joining with a Table B that I can shuffle via Hive Hash to Table A.

 

Thank you,

Tyson

 




 

-- 

Ryan Blue

Software Engineer

Netflix



Hive Hash in Spark

2019-03-06 Thread tcondie
Hi,

 

I noticed the existence of a Hive Hash partitioning implementation in Spark,
but also noticed that it's not being used, and that the Spark hash
partitioning function is presently hardcoded to Murmur3. My question is
whether Hive Hash is dead code or whether there are future plans to support
reading and understanding data that has been partitioned using Hive Hash. By
understanding, I mean that I'm able to avoid a full shuffle join on Table A
(partitioned by Hive Hash) when joining with a Table B that I can shuffle
via Hive Hash to Table A.
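
To make the mismatch concrete, a small sketch (assuming a spark-shell session):
the built-in hash() function and Spark's HashPartitioning both use Murmur3, so
partition placement from a Spark-side repartition will not line up with buckets
produced by Hive's hash function.

import org.apache.spark.sql.functions.hash

val df = spark.range(0, 100).withColumnRenamed("id", "key")

// Murmur3-based hash values, as used by Spark's HashPartitioning.
df.select(df("key"), hash(df("key")).as("murmur3_hash")).show(5)

// Partition placement here is decided by Murmur3, not by Hive's hash function.
val byKey = df.repartition(8, df("key"))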

 

Thank you,

Tyson



[DISCUSS] SPIP SPARK-26257

2019-01-14 Thread tcondie
Dear Spark Community,

 

I have posted a SPIP to JIRA:
https://issues.apache.org/jira/browse/SPARK-26257

 

I look forward to your feedback on the JIRA ticket.

 

Best regards,

Tyson



[Discuss] Language Interop for Apache Spark

2018-09-25 Thread tcondie
There seems to be some desire for third party language extensions for Apache
Spark. Some notable examples include:

*   C#/F# from project Mobius https://github.com/Microsoft/Mobius
*   Haskell from project sparkle https://github.com/tweag/sparkle
*   Julia from project Spark.jl https://github.com/dfdx/Spark.jl

 

Presently, Apache Spark supports Python and R via a tightly integrated
interop layer. It would seem that much of that existing interop layer could
be refactored into a clean surface for general (third-party) language
bindings, such as those mentioned above. More specifically, could we
generalize the following modules:

1.  Deploy runners (e.g., PythonRunner and RRunner) 
2.  DataFrame Executors
3.  RDD operations? 

 

The last item is questionable: integrating third-party language extensions at
the RDD level may be too heavyweight and unnecessary, given the preference
for the DataFrame abstraction.
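
Purely as a sketch of what such a generalized surface might cover (these names
are hypothetical and do not exist in Spark today; they only mirror, at a high
level, what PythonRunner and RRunner do internally):

// Hypothetical contract for a third-party language binding; illustrative only.
trait GuestLanguageRunner {
  // Launch the guest-language driver process (e.g., dotnet, julia).
  def startDriver(userArgs: Seq[String]): Unit

  // Hand a serialized DataFrame/UDF payload to the guest runtime and return its result.
  def execute(serializedPayload: Array[Byte]): Array[Byte]

  // Tear down the guest process and any per-executor worker daemons.
  def shutdown(): Unit
}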

 

The main goals of this effort would be:

1.  Provide a clean abstraction for third-party language extensions, making it
easier to maintain a language extension as Apache Spark evolves
2.  Provide guidance to third-party language authors on how a language
extension should be implemented
3.  Provide general reusable libraries that are not specific to any language
extension
4.  Open the door to developers who prefer alternative languages

 

-Tyson Condie