error: object functions is not a member of package org.apache.spark.sql.avro

2020-08-08 Thread dwgw
Hi
I am getting the following error while trying to import the package
org.apache.spark.sql.avro.functions._ in the Scala shell:

scala> import org.apache.spark.sql.avro.functions._
:23: error: object functions is not a member of package org.apache.spark.sql.avro
       import org.apache.spark.sql.avro.functions._

and I have invoked spark-shell with the following command:

# spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0,org.apache.spark:spark-streaming-kafka-0-10_2.11:2.3.0,com.databricks:spark-avro_2.11:4.0.0,org.apache.spark:spark-avro_2.11:2.4.0

Which package do I have to pass as a parameter to spark-shell? I am trying to
implement a few examples from here:
https://docs.databricks.com/spark/latest/structured-streaming/avro-dataframe.html
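
For reference, a minimal sketch of an invocation where this import would be expected to
resolve, assuming a Spark 3.x shell built against Scala 2.12 (the version numbers below
are assumptions and have to match the Spark and Scala build actually in use; note that
the command above mixes Scala 2.11 / Spark 2.3-2.4 artifacts with the older
com.databricks:spark-avro package, which can conflict):

# spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.0,org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0

scala> import org.apache.spark.sql.avro.functions._
import org.apache.spark.sql.avro.functions._

On a Spark 2.4 shell, as far as I know, the equivalent from_avro/to_avro helpers are
imported with org.apache.spark.sql.avro._ rather than org.apache.spark.sql.avro.functions._.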

Regards






Spark streaming receivers

2020-08-08 Thread Dark Crusader
Hi,

I'm having some trouble figuring out how receivers tie into the Spark
driver-executor structure.
Do all executors have a receiver that is blocked as soon as it
receives some stream data?
Or can multiple streams of data be taken as input into a single executor?

I have stream data coming in every second from 5 different sources. I want
to aggregate the data from each of them. Does this mean I need 5 executors,
or does it have to do with threads on the executor?

I might be mixing in a few concepts here. Any help would be appreciated.
Thank you.


Re: Spark streaming receivers

2020-08-08 Thread Russell Spitzer
Note, none of this applies to Direct streaming approaches, only to
receiver-based DStreams.

You can think of a receiver as a long-running task that never finishes.
Each receiver is submitted to an executor slot somewhere; it then runs
indefinitely and internally has a method which passes records over to a
block management system. There is an interval that you set which decides when
each block is "done", and records that arrive after that time has passed go
into the next block (see the parameter spark.streaming.blockInterval). Once a
block is done it can be processed in the next Spark batch. The gap between a
block starting and a block being finished is why you can lose data in
receiver-based streaming without Write Ahead Logging. Usually your block
interval divides evenly into your batch interval, so you'll get X blocks per
batch. Each block becomes one partition of the job being done in a streaming
batch. Multiple receivers can be unified into a single DStream, which just
means the blocks produced by all of those receivers are handled in the same
streaming batch.

So if you have 5 different receivers, you need at minimum 6 executor cores:
1 core for each receiver, and 1 core to actually do your processing work.
In a real-world case you probably want significantly more cores on the
processing side than just 1. Without repartitioning you will never have more
partitions per batch, and hence more parallelism, than the number of blocks
produced for that batch.

A quick example

I run 5 receivers with a block interval of 100 ms and a Spark batch interval of
1 second, and I use union to group them all together. I will most likely end up
with one Spark job for each batch every second, running with 50 partitions
((1000 ms / 100 ms per block per receiver) * 5 receivers). If I have a total
of 10 cores in the system, 5 of them are running receivers, and the remaining 5
must process the 50 partitions of data generated by the last second of work.
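
A minimal sketch of that setup in code, for receiver-based DStreams only (the host
names, port, and local[10] master are hypothetical placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// 10 cores total: 5 will be pinned by receivers, 5 are left for processing.
val conf = new SparkConf()
  .setMaster("local[10]")
  .setAppName("five-receiver-example")
  .set("spark.streaming.blockInterval", "100ms") // one block per 100 ms per receiver

val ssc = new StreamingContext(conf, Seconds(1)) // 1-second batch interval

// Five socket receivers; each occupies one executor core for its whole lifetime.
val streams = (1 to 5).map(i => ssc.socketTextStream(s"source-host-$i", 9999))

// Union them so all of their blocks land in the same batch:
// roughly (1000 ms / 100 ms) * 5 receivers = 50 partitions per batch.
val unioned = ssc.union(streams)

unioned.count().print()

ssc.start()
ssc.awaitTermination()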

And again, just to reiterate, if you are doing a direct streaming approach
or structured streaming, none of this applies.

On Sat, Aug 8, 2020 at 10:03 AM Dark Crusader wrote:

> Hi,
>
> I'm having some trouble figuring out how receivers tie into the Spark
> driver-executor structure.
> Do all executors have a receiver that is blocked as soon as it
> receives some stream data?
> Or can multiple streams of data be taken as input into a single executor?
>
> I have stream data coming in every second from 5 different sources. I want
> to aggregate the data from each of them. Does this mean I need 5 executors,
> or does it have to do with threads on the executor?
>
> I might be mixing in a few concepts here. Any help would be appreciated.
> Thank you.
>


Re: Spark batch job chaining

2020-08-08 Thread Amit Sharma
Any help is appreciated. I have a Spark batch job, and based on a condition I
would like to start another batch job by invoking a .sh file. I just want to
know whether we can achieve that.
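
A minimal sketch of one way to do this from Scala, using scala.sys.process to run the
script and branch on its exit code (the condition and the script path are hypothetical
placeholders):

import scala.sys.process._

val shouldRunNextJob: Boolean = true // replace with the condition your batch job computes

if (shouldRunNextJob) {
  // Runs the script, blocks until it exits, and returns its exit code.
  val exitCode = Seq("bash", "/path/to/next-batch-job.sh").!
  if (exitCode != 0) {
    sys.error(s"Downstream job failed with exit code $exitCode")
  }
}

Submitting the second job programmatically via org.apache.spark.launcher.SparkLauncher
can also be an option instead of going through a shell script.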

Thanks
Amit

On Fri, Aug 7, 2020 at 3:58 PM Amit Sharma  wrote:

> Hi, I want to write a batch job which would call another batch job based
> on a condition. Can I call one batch job from another in Scala, or can I
> only do it with a Python script? An example would be really helpful.
>
>
> Thanks
> Amit
>
>
>


Re: Spark batch job chaining

2020-08-08 Thread Jun Zhu
Hi
I am using Airflow in such a scenario.