Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I could be wrong, but… just start it. Reducing an entire large dataset takes a lot of time. If you have the capacity and the resources, start combining and reducing on partial map results: as soon as you’ve got one record out of the map, it has a reduce key in the plan,

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Yes, we can get reduce tasks started when there are enough resources in the cluster. As you point out, reduce tasks cannot produce their output while map tasks are still running, but they can prefetch the output of map tasks. In our prototype implementation of pipelined execution, everything works

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
Oops, it has been a long time since Russell labored on Hadoop; speculative execution isn’t the right term, that is something else. Cascading has a declarative interface so you can plan more, whereas Spark is more imperative. Point remains :) On Wed, Sep 7, 2022 at 3:56 PM Russell Jurney wrote: > You

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
You want to talk to Chris Wensel, creator of Cascading, a system that did speculative execution for a large volume of enterprise workloads. It was the first approachable way to scale workloads using Hadoop. He could write a book about this topic. Happy to introduce you if you’d like, or you could

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sean Owen
Wait, how do you start reduce tasks before maps are finished? Is the idea that some reduce tasks don't depend on all the maps, or at least can get started? You can already execute unrelated DAGs in parallel, of course. On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park wrote: > You are right --
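The "unrelated DAGs in parallel" point can be sketched without a cluster: submitting two independent actions from separate driver threads lets the scheduler run their stages concurrently. A minimal plain-Scala sketch, where `jobA` and `jobB` are hypothetical stand-ins for Spark actions on unrelated DAGs:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Two independent "jobs" (stand-ins for actions like df1.count() and
// df2.count()) submitted concurrently from the driver thread pool.
def jobA(): Int = (1 to 100).sum
def jobB(): Int = (1 to 100).count(_ % 2 == 0)

val results: Seq[Int] =
  Await.result(Future.sequence(Seq(Future(jobA()), Future(jobB()))), 10.seconds)
```

In real Spark the bodies of the futures would call actions; Spark's FAIR scheduler pool can then interleave the resulting jobs.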

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
You are right -- Spark can't do this with its current architecture. My question was: if there was a new implementation supporting pipelined execution, what kind of Spark jobs would benefit (a lot) from it? Thanks, --- Sungwoo On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney wrote: > I don't

Spark SQL

2022-09-07 Thread Mayur Benodekar
I am new to both Scala and Spark. I have Scala code which executes queries in a while loop, one after the other. What we need is: if a particular query takes more than a certain time, for example 10 mins, we should be able to stop the query execution for that particular query and move
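One common approach to bounding each iteration is to run the query body in a `Future` and wait on it with a timeout. A minimal sketch, where the body passed in is a hypothetical stand-in for `spark.sql(...).collect()`:

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Run `body` asynchronously and wait at most `timeout` for its result.
// Returns None if the deadline passes.
def runWithTimeout[T](timeout: Duration)(body: => T): Option[T] =
  try Some(Await.result(Future(body), timeout))
  catch { case _: TimeoutException => None }

val fast = runWithTimeout(2.seconds)(42)                       // completes
val slow = runWithTimeout(100.millis) { Thread.sleep(2000); 1 } // times out
```

Caveat: the timeout only stops the *wait*; the underlying computation keeps running. With Spark you would additionally tag each query with `spark.sparkContext.setJobGroup(...)` and call `cancelJobGroup` on timeout so the cluster actually stops working on it.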

Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I don't think Spark can do this with its current architecture. It has to wait for the step to be done, speculative execution isn't possible. Others probably know more about why that is. Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Many thanks, Sean. - Original message - From: "Sean Owen" To: phi...@free.fr Cc: "User" Sent: Wednesday 7 September 2022 17:05:55 Subject: Re: Spark equivalent to hdfs groups No, because this is a storage concept, and Spark is not a storage system. You would appeal to tools and interfaces

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
No, because this is a storage concept, and Spark is not a storage system. You would appeal to tools and interfaces that the storage system provides, like hdfs. Where or how the hdfs binary is available depends on how you deploy Spark where; it would be available on a Hadoop cluster. It's just not

Re: Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hi Sean, I'm talking about HDFS groups. On Linux, you can type "hdfs groups " to get the list of the groups user1 belongs to. In Zeppelin/Spark, the hdfs executable is not accessible. As a result, I wondered if there was a class in Spark (e.g. Security or ACL) which would let you access a
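Where the `hdfs` binary *is* on the PATH (e.g. on an edge node), one workaround is to shell out from the driver and parse the CLI output. A sketch under that assumption; the `user : group1 group2` output format and the commented `groupsOf` helper are assumptions, not a Spark API:

```scala
import scala.sys.process._

// Parse the assumed "user : group1 group2 ..." output of `hdfs groups <user>`.
def parseHdfsGroups(output: String): Seq[String] =
  output.split(":").lift(1)
    .map(_.trim.split("\\s+").toSeq)
    .getOrElse(Seq.empty)

// Hypothetical helper; requires the hdfs binary on the PATH:
// def groupsOf(user: String): Seq[String] =
//   parseHdfsGroups(Seq("hdfs", "groups", user).!!)
```

This is a workaround, not a Spark feature: as noted in the thread, group membership is a storage-layer concept, so the authoritative interface is the storage system's own tooling.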

Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Hello Spark users, I have a question on the architecture of Spark (which could lead to a research problem). In its current implementation, Spark finishes executing all the tasks in a stage before proceeding to child stages. For example, given a two-stage map-reduce DAG, Spark finishes executing
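The two-stage map-reduce shape in question can be illustrated in plain Scala (no Spark needed), with the `groupBy` standing in for the shuffle that marks the stage boundary:

```scala
// Word count as a two-stage map-reduce, simulated in plain Scala.
// Stage 1 (map): emit (word, 1) pairs.
// Stage boundary: the shuffle (groupBy) that routes pairs by key.
// Stage 2 (reduce): sum the counts per key.
// Spark runs every stage-1 task to completion before any stage-2 task
// starts; the pipelined execution proposed here would let reduce tasks
// begin consuming map output earlier.
val lines   = Seq("a b a", "b c")
val mapped  = lines.flatMap(_.split(" ")).map(w => (w, 1))            // map
val shuffled = mapped.groupBy(_._1)                                   // shuffle
val reduced = shuffled.map { case (w, ps) => (w, ps.map(_._2).sum) }  // reduce
```

The input strings here are arbitrary illustration data, not from the thread.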

Re: Spark equivalent to hdfs groups

2022-09-07 Thread Sean Owen
Spark isn't a storage system or user management system; no, there is no notion of groups (groups for what?) On Wed, Sep 7, 2022 at 8:36 AM wrote: > Hello, > is there a Spark equivalent to "hdfs groups "? > Many thanks. > Philippe

Spark equivalent to hdfs groups

2022-09-07 Thread phiroc
Hello, is there a Spark equivalent to "hdfs groups "? Many thanks. Philippe

Spark Structured Streaming - unable to change max.poll.records (showing as 1)

2022-09-07 Thread karan alang
Hello All, I have a Spark Structured Streaming job which reads from Kafka, does processing, and writes data to Mongo/Kafka/GCP buckets (i.e. it is processing-heavy). I'm consistently seeing the following warnings: ``` 22/09/06 16:55:03 INFO
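For context on where such consumer settings go: in the Structured Streaming Kafka source, Kafka consumer properties are passed as options with a `kafka.` prefix, while Spark-level rate controls are separate, unprefixed options. A configuration sketch (the broker address and topic name are placeholders); note that Spark's Kafka source manages several consumer parameters internally, so a user-supplied `kafka.max.poll.records` may not take effect, and throughput is usually tuned with `maxOffsetsPerTrigger` or `minPartitions` instead:

```scala
// Configuration sketch, assuming a SparkSession `spark` with the
// spark-sql-kafka connector on the classpath.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
  .option("subscribe", "events")                    // placeholder topic
  .option("kafka.max.poll.records", "500")          // may be overridden by the source
  .option("maxOffsetsPerTrigger", "10000")          // Spark-side per-batch limit
  .load()
```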