[pyspark 2.4.3] small input csv ~3.4GB gets 40K tasks created

2019-08-29 Thread Rishi Shah
Hi All, I am scratching my head over this weird behavior: a df (read from .csv) of size ~3.4GB gets cross-joined with itself and creates 50K tasks! How do I correlate input size with the number of tasks in this case? -- Regards, Rishi Shah
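For a plain cartesian product the task count is roughly the product of the two inputs' partition counts, not a direct function of input size alone, so a few hundred partitions on each side already explains tens of thousands of tasks. A minimal sketch (Scala API rather than PySpark; the path and header option are placeholders) of checking and reducing the partition counts before the cross join:

```scala
val df = spark.read.option("header", "true").csv("/path/to/input.csv")
println(df.rdd.getNumPartitions)            // partitions produced by the CSV file splits

// A cartesian product creates roughly (#left partitions) * (#right partitions) tasks.
val joined = df.as("a").crossJoin(df.as("b"))

// Coalescing both sides first shrinks the task count of the product.
val small = df.coalesce(32)
val fewerTasks = small.as("a").crossJoin(small.as("b"))   // ~32 * 32 = 1024 tasks
```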

Re: Structured Streaming Dataframe Size

2019-08-29 Thread Tathagata Das
Responses inline. On Wed, Aug 28, 2019 at 8:42 AM Nick Dawes wrote: > Thank you, TD. Couple of follow up questions please. > > 1) "It only keeps around the minimal intermediate state data" > > How do you define "minimal" here? Is there a configuration property to > control the time or size of
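A minimal sketch of the kind of bound being asked about: with event-time aggregation, a watermark limits how long intermediate state is retained (the Kafka options and schema below are placeholders):

```scala
import org.apache.spark.sql.functions.{col, window}

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "events")                    // placeholder topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

// State for windows older than the watermark becomes eligible for cleanup,
// which is what keeps the intermediate state small.
val counts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()
```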

Re: Can this use-case be handled with spark-sql streaming and Cassandra?

2019-08-29 Thread Jörn Franke
1) This is not a use case, but a technical solution. Hence nobody can tell you whether it makes sense or not. 2) Do an upsert in Cassandra. However, keep in mind that the application submitting to the Kafka topic and the one consuming from the Kafka topic need to ensure that they process messages in

Re: Can this use-case be handled with spark-sql streaming and Cassandra?

2019-08-29 Thread Aayush Ranaut
What exactly is your requirement? Is the read-before-write mandatory? Are you maintaining state in Cassandra? Regards Prathmesh Ranaut https://linkedin.com/in/prathmeshranaut > On Aug 29, 2019, at 3:35 PM, Shyam P wrote: > > > thanks Aayush. For every record I need to get the data

Re: Control Sqoop job from Spark job

2019-08-29 Thread Chetan Khatri
Sorry, I call the Sqoop job from the above function. Can you help me resolve this? Thanks. On Fri, Aug 30, 2019 at 1:31 AM Chetan Khatri wrote: > Hi Users, > I am launching a Sqoop job from a Spark job and would like to FAIL the Spark job > if the Sqoop job fails. > > def executeSqoopOriginal(serverName:

Control Sqoop job from Spark job

2019-08-29 Thread Chetan Khatri
Hi Users, I am launching a Sqoop job from a Spark job and would like to FAIL the Spark job if the Sqoop job fails. def executeSqoopOriginal(serverName: String, schemaName: String, username: String, password: String, query: String, splitBy: String, fetchSize: Int, numMappers: Int,
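One hedged way to get that behavior (a sketch, not the original executeSqoopOriginal, whose body is truncated above): launch Sqoop as an external process and throw when its exit code is non-zero, so the exception propagates and fails the Spark application:

```scala
import scala.sys.process._

def executeSqoop(sqoopArgs: Seq[String]): Unit = {
  // Blocks until the external sqoop process finishes and returns its exit code.
  val exitCode = Process("sqoop" +: sqoopArgs).!
  if (exitCode != 0) {
    // Letting this exception escape the driver marks the Spark job as failed.
    throw new RuntimeException(s"Sqoop job failed with exit code $exitCode")
  }
}

// Example call (arguments are placeholders):
// executeSqoop(Seq("import", "--connect", "jdbc:...", "--table", "my_table"))
```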

Re: Can this use-case be handled with spark-sql streaming and Cassandra?

2019-08-29 Thread Shyam P
Thanks Aayush. For every record, do I need to get the data from the Cassandra table and update it? Otherwise it may not update the existing record. What is this datastax-spark-connector? Is that not a "Cassandra connector library written for Spark"? If not, how do we write one ourselves? Where and how to
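The DataStax spark-cassandra-connector does expose Cassandra tables to Spark SQL. A minimal sketch (the keyspace, table, and the merged DataFrame updatedDF are placeholders) of reading the existing rows and writing the merged result back; the write behaves as an upsert on the primary key:

```scala
// Read the current Cassandra rows (requires the spark-cassandra-connector on the classpath).
val existing = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

// After comparing/merging incoming records with `existing`, write back;
// Cassandra treats the insert as an update when the primary key already exists.
updatedDF.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .mode("append")
  .save()
```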

Re: Can this use-case be handled with spark-sql streaming and Cassandra?

2019-08-29 Thread Aayush Ranaut
Cassandra writes are upserts, so you should be able to do what you need with a single statement unless you're looking to maintain counters. I'm not sure if there is a Cassandra connector library written for Spark Streaming, because we wrote one ourselves when we wanted to do the same. Regards Prathmesh

Can this use-case be handled with spark-sql streaming and Cassandra?

2019-08-29 Thread Shyam P
Hi, I need to do a PoC for a business use-case. *Use case:* I need to update a record in a Cassandra table if it exists. Will Spark Streaming support comparing each record and updating the existing Cassandra record? For each record received from the Kafka topic, if I want to check and compare each record
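A hedged sketch of one way this is commonly wired up in Spark 2.4: write each micro-batch through the Cassandra connector with foreachBatch and rely on Cassandra's upsert semantics, so a row whose primary key already exists is updated in place (kafkaDF, keyspace, and table names are placeholders):

```scala
import org.apache.spark.sql.DataFrame

val query = kafkaDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    // Each micro-batch is written as a normal batch job; rows whose primary key
    // already exists in Cassandra are overwritten (upsert), others are inserted.
    batch.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
      .mode("append")
      .save()
  }
  .start()
```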

Re: [Spark SQL] failure in query

2019-08-29 Thread Subash Prabakar
What is the number of part files in that big table? And what is the distribution of request ID? Is the variance of the column small or huge? Because the partitionBy clause will move data with the same request ID to one executor. If the data is huge, it might put load on that executor. On Sun, 25 Aug 2019 at
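A quick way to check the request ID skew being asked about (the column name request_id is assumed):

```scala
import org.apache.spark.sql.functions.desc

// The top counts reveal whether a few request IDs dominate and would overload one executor.
df.groupBy("request_id")
  .count()
  .orderBy(desc("count"))
  .show(20)
```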

In Catalyst expressions, when is it appropriate to use codegen

2019-08-29 Thread Arwin Tio
Hi, I am exploring the usage of Catalyst expression functions to avoid the performance issues associated with UDFs. One thing that I noticed is that there is a trait called CodegenFallback and there are some Catalyst expressions in Spark that inherit from it [0]. My question is, is there a
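For reference, a minimal sketch of the pattern the question refers to (Catalyst is an internal, unstable API, so this is illustrative only): an expression that mixes in CodegenFallback needs only an interpreted eval, because the trait's doGenCode emits code that calls back into eval instead of generating specialized Java:

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Illustrative expression, not from the thread: upper-cases a string column.
case class UpperFallback(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = StringType
  override protected def nullSafeEval(input: Any): Any =
    UTF8String.fromString(input.asInstanceOf[UTF8String].toString.toUpperCase)
}
```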

How to Load a Graphx Graph from a parquet file?

2019-08-29 Thread Alexander Czech
Hey all, I want to load a parquet file containing my edges into a Graph. My code so far looks like this: val edgesDF = spark.read.parquet("/path/to/edges/parquet/") val edgesRDD = edgesDF.rdd val graph = Graph.fromEdgeTuples(edgesRDD, 1) But this simply produces an error: [error] found :
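The truncated compile error is most likely a type mismatch: Graph.fromEdgeTuples expects an RDD[(VertexId, VertexId)] (pairs of Longs), while edgesDF.rdd is an RDD[Row]. A sketch of the fix, assuming the parquet file has two Long columns named src and dst:

```scala
import org.apache.spark.graphx.Graph

val edgesDF  = spark.read.parquet("/path/to/edges/parquet/")
// Convert each Row into a (srcId, dstId) tuple of Longs before building the graph.
val edgesRDD = edgesDF.rdd.map(row => (row.getAs[Long]("src"), row.getAs[Long]("dst")))
val graph    = Graph.fromEdgeTuples(edgesRDD, defaultValue = 1)
```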