Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I could be wrong, but… just start it. Reducing an entire large dataset takes a lot of time. If you have the capacity and the resources, start combining and reducing on partial map results: as soon as you’ve got one record out of the map, it has a reduce key in the plan,

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Yes, we can get reduce tasks started when there are enough resources in the cluster. As you point out, reduce tasks cannot produce their output while map tasks are still running, but they can prefetch the output of map tasks. In our prototype implementation of pipelined execution, everything works

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
Oops, it has been a long time since Russell labored on Hadoop; speculative execution isn’t the right term, that is something else. Cascading has a declarative interface so you can plan more, whereas Spark is more imperative. Point remains :) On Wed, Sep 7, 2022 at 3:56 PM Russell Jurney wrote: > You

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
You want to talk to Chris Wensel, creator of Cascading, a system that did speculative execution for a large volume of enterprise workloads. It was the first approachable way to scale workloads using Hadoop. He could write a book about this topic. Happy to introduce you if you’d like, or you could

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sean Owen
Wait, how do you start reduce tasks before maps are finished? Is the idea that some reduce tasks don't depend on all the maps, or at least can get started? You can already execute unrelated DAGs in parallel, of course. On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park wrote: > You are right --
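The "unrelated DAGs in parallel" point can be sketched without a cluster: submitting two independent actions from separate driver threads lets the scheduler run their stages concurrently. A minimal plain-Scala sketch, where `jobA` and `jobB` are hypothetical stand-ins for Spark actions on unrelated DAGs:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Two independent "jobs" (stand-ins for actions like df1.count() and
// df2.count()) submitted concurrently from the driver thread pool.
def jobA(): Int = (1 to 100).sum
def jobB(): Int = (1 to 100).count(_ % 2 == 0)

val results: Seq[Int] =
  Await.result(Future.sequence(Seq(Future(jobA()), Future(jobB()))), 10.seconds)
```

In real Spark the bodies of the futures would call actions; Spark's FAIR scheduler pool can then interleave the resulting jobs.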

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
You are right -- Spark can't do this with its current architecture. My question was: if there was a new implementation supporting pipelined execution, what kind of Spark jobs would benefit (a lot) from it? Thanks, --- Sungwoo On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney wrote: > I don't

Spark SQL

2022-09-07 Thread Mayur Benodekar
I am new to both Scala and Spark. I have Scala code which executes queries in a while loop, one after the other. What we need is: if a particular query takes more than a certain time, for example 10 mins, we should be able to stop the query execution for that particular query and move
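One common approach to bounding each iteration is to run the query body in a `Future` and wait on it with a timeout. A minimal sketch, where the body passed in is a hypothetical stand-in for `spark.sql(...).collect()`:

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Run `body` asynchronously and wait at most `timeout` for its result.
// Returns None if the deadline passes.
def runWithTimeout[T](timeout: Duration)(body: => T): Option[T] =
  try Some(Await.result(Future(body), timeout))
  catch { case _: TimeoutException => None }

val fast = runWithTimeout(2.seconds)(42)                       // completes
val slow = runWithTimeout(100.millis) { Thread.sleep(2000); 1 } // times out
```

Caveat: the timeout only stops the *wait*; the underlying computation keeps running. With Spark you would additionally tag each query with `spark.sparkContext.setJobGroup(...)` and call `cancelJobGroup` on timeout so the cluster actually stops working on it.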

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I don't think Spark can do this with its current architecture. It has to wait for the step to be done, speculative execution isn't possible. Others probably know more about why that is. Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Many thanks, Sean. - Original message - From: "Sean Owen" To: phi...@free.fr Cc: "User" Sent: Wednesday 7 September 2022 17:05:55 Subject: Re: Spark equivalent to hdfs groups No, because this is a storage concept, and Spark is not a storage system. You would appeal to tools and interfaces

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
No, because this is a storage concept, and Spark is not a storage system. You would appeal to tools and interfaces that the storage system provides, like hdfs. Where or how the hdfs binary is available depends on how you deploy Spark where; it would be available on a Hadoop cluster. It's just not

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hi Sean, I'm talking about HDFS groups. On Linux, you can type "hdfs groups " to get the list of the groups user1 belongs to. In Zeppelin/Spark, the hdfs executable is not accessible. As a result, I wondered if there was a class in Spark (e.g. Security or ACL) which would let you access a
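Where the `hdfs` binary *is* on the PATH (e.g. on an edge node), one workaround is to shell out from the driver and parse the CLI output. A sketch under that assumption; the `user : group1 group2` output format and the commented `groupsOf` helper are assumptions, not a Spark API:

```scala
import scala.sys.process._

// Parse the assumed "user : group1 group2 ..." output of `hdfs groups <user>`.
def parseHdfsGroups(output: String): Seq[String] =
  output.split(":").lift(1)
    .map(_.trim.split("\\s+").toSeq)
    .getOrElse(Seq.empty)

// Hypothetical helper; requires the hdfs binary on the PATH:
// def groupsOf(user: String): Seq[String] =
//   parseHdfsGroups(Seq("hdfs", "groups", user).!!)
```

This is a workaround, not a Spark feature: as noted in the thread, group membership is a storage-layer concept, so the authoritative interface is the storage system's own tooling.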

Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Hello Spark users, I have a question on the architecture of Spark (which could lead to a research problem). In its current implementation, Spark finishes executing all the tasks in a stage before proceeding to child stages. For example, given a two-stage map-reduce DAG, Spark finishes executing
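The two-stage map-reduce shape in question can be illustrated in plain Scala (no Spark needed), with the `groupBy` standing in for the shuffle that marks the stage boundary:

```scala
// Word count as a two-stage map-reduce, simulated in plain Scala.
// Stage 1 (map): emit (word, 1) pairs.
// Stage boundary: the shuffle (groupBy) that routes pairs by key.
// Stage 2 (reduce): sum the counts per key.
// Spark runs every stage-1 task to completion before any stage-2 task
// starts; the pipelined execution proposed here would let reduce tasks
// begin consuming map output earlier.
val lines   = Seq("a b a", "b c")
val mapped  = lines.flatMap(_.split(" ")).map(w => (w, 1))            // map
val shuffled = mapped.groupBy(_._1)                                   // shuffle
val reduced = shuffled.map { case (w, ps) => (w, ps.map(_._2).sum) }  // reduce
```

The input strings here are arbitrary illustration data, not from the thread.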

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
Spark isn't a storage system or user management system; no, there is no notion of groups (groups for what?) On Wed, Sep 7, 2022 at 8:36 AM wrote: > Hello, > is there a Spark equivalent to "hdfs groups "? > Many thanks. > Philippe

Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hello, is there a Spark equivalent to "hdfs groups "? Many thanks. Philippe

Spark Structured Streaming - unable to change max.poll.records (showing as 1)

2022-09-07 Thread karan alang
Hello All, I have a Spark Structured Streaming job which reads from Kafka, does processing, and writes data to Mongo/Kafka/GCP buckets (i.e. it is processing-heavy). I'm consistently seeing the following warnings: ``` 22/09/06 16:55:03 INFO
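For context on where such consumer settings go: in the Structured Streaming Kafka source, Kafka consumer properties are passed as options with a `kafka.` prefix, while Spark-level rate controls are separate, unprefixed options. A configuration sketch (the broker address and topic name are placeholders); note that Spark's Kafka source manages several consumer parameters internally, so a user-supplied `kafka.max.poll.records` may not take effect, and throughput is usually tuned with `maxOffsetsPerTrigger` or `minPartitions` instead:

```scala
// Configuration sketch, assuming a SparkSession `spark` with the
// spark-sql-kafka connector on the classpath.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
  .option("subscribe", "events")                    // placeholder topic
  .option("kafka.max.poll.records", "500")          // may be overridden by the source
  .option("maxOffsetsPerTrigger", "10000")          // Spark-side per-batch limit
  .load()
```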