UDFs in Spark

2019-03-27 Thread Achilleus 003
A couple of questions regarding UDFs: 1) Is there a way to get all the registered UDFs in Spark Scala? I couldn't find any straightforward API, but I found a pattern to get all the registered UDFs: spark.catalog.listFunctions.filter(_.className == null).collect. This does the trick, but not sure it
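A minimal sketch of that pattern, assuming an active SparkSession named spark; the plusOne UDF is registered here only so there is something user-defined to list:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("list-udfs").getOrCreate()

    // Register a sample UDF so the catalog contains a user-defined function.
    spark.udf.register("plusOne", (x: Int) => x + 1)

    // User-registered UDFs show up with a null className, while built-in
    // functions carry the name of their implementing class.
    val udfs = spark.catalog.listFunctions()
      .filter(f => f.className == null && f.isTemporary)
      .collect()

    udfs.foreach(f => println(f.name))

As the post says, this does the trick, but it relies on an undocumented convention of the catalog rather than a stable API.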

How to extract data in parallel from RDBMS tables

2019-03-27 Thread Surendra , Manchikanti
Hi All, Is there any way to copy all the tables in parallel from an RDBMS using Spark? We are looking for functionality similar to Sqoop. Thanks, Surendra
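There is no single built-in equivalent of Sqoop's import-all-tables, but each table can be read in parallel via the JDBC source, and the per-table copies can themselves run concurrently from the driver. A rough sketch, in which the URL, credentials, table list, and partitioning bounds are all assumptions:

    // Assumes an active SparkSession `spark`; connection details are hypothetical.
    val url = "jdbc:postgresql://dbhost:5432/mydb"

    def copyTable(table: String, partitionCol: String, lower: Long, upper: Long): Unit = {
      val df = spark.read.format("jdbc")
        .option("url", url)
        .option("user", "etl")
        .option("password", "secret")
        .option("dbtable", table)
        .option("partitionColumn", partitionCol) // must be numeric, date, or timestamp
        .option("lowerBound", lower.toString)
        .option("upperBound", upper.toString)
        .option("numPartitions", "8")            // 8 concurrent JDBC connections per table
        .load()
      df.write.mode("overwrite").parquet(s"/warehouse/$table")
    }

    // Kick off several table copies concurrently from the driver.
    val tables = Seq(("orders", "order_id", 1L, 1000000L), ("customers", "id", 1L, 50000L))
    tables.par.foreach { case (t, col, lo, hi) => copyTable(t, col, lo, hi) }

Note that lowerBound and upperBound only control how the partition ranges are strided; they do not filter rows.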

How Spark coordinates a multi-contender race on writing to ZooKeeper? (Also on StackOverflow)

2019-03-27 Thread Zili Chen
Hi guys, Recently I opened a question [1] on StackOverflow about leader election with the ZooKeeper high-availability backend. It has puzzled me for some days, and it would really help if you could take a look or share some thoughts. Copying the content to the mailing list: Spark uses Curator#LeaderLatch for
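For readers who have not used it, a bare-bones sketch of Curator's LeaderLatch recipe, which is what the question refers to; the connect string and latch path here are made up:

    import org.apache.curator.framework.CuratorFrameworkFactory
    import org.apache.curator.framework.recipes.leader.LeaderLatch
    import org.apache.curator.retry.ExponentialBackoffRetry

    val client = CuratorFrameworkFactory.newClient(
      "zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    // Every contender creates a latch on the same path; ZooKeeper's sequential
    // ephemeral nodes determine which contender wins.
    val latch = new LeaderLatch(client, "/myapp/leader-election")
    latch.start()
    latch.await() // blocks until this contender becomes leader

    try {
      // ... leader-only work ...
    } finally {
      latch.close() // gives up leadership so another contender can take over
      client.close()
    }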

Spark migration to Kubernetes

2019-03-27 Thread thrisha
Hi All, We have Spark Streaming pipelines (written in Java) currently running on YARN in production. We are evaluating moving these streaming pipelines onto Kubernetes. We have set up a working Kubernetes cluster. I have been reading the Spark documentation and a few other blogs on migrating them to
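For reference, a cluster-mode submission to Kubernetes in Spark 2.4 has roughly the shape below; the API server address, image name, and application class are placeholders:

    bin/spark-submit \
      --master k8s://https://<k8s-apiserver-host>:<port> \
      --deploy-mode cluster \
      --name my-streaming-app \
      --class com.example.StreamingJob \
      --conf spark.executor.instances=3 \
      --conf spark.kubernetes.container.image=<spark-image> \
      local:///opt/spark/jars/my-streaming-app.jar

The local:// scheme refers to a path inside the container image, not on the submitting machine.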

Re: Streaming data out of spark to a Kafka topic

2019-03-27 Thread Gabor Somogyi
Hi Mich, Please take a look at how to write data into Kafka topic with DStreams: https://github.com/gaborgsomogyi/spark-dstream-secure-kafka-sink-app/blob/62d64ce368bc07b385261f85f44971b32fe41327/src/main/scala/com/cloudera/spark/examples/DirectKafkaSinkWordCount.scala#L77 (DStreams has no native
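Since DStreams ship no Kafka sink, the usual pattern is to create a producer inside foreachPartition, as the linked example does (it additionally caches the producer instead of recreating it for every partition). A simplified sketch, with broker and topic names assumed:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.streaming.dstream.DStream

    def writeToKafka(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // One producer per partition per batch: simple but not cheap;
          // a pooled or lazily initialized producer is preferable in production.
          val props = new Properties()
          props.put("bootstrap.servers", "broker:9092")
          props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer")
          val producer = new KafkaProducer[String, String](props)
          records.foreach { r =>
            producer.send(new ProducerRecord[String, String]("output-topic", r))
          }
          producer.close()
        }
      }
    }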

Re: Spark Profiler

2019-03-27 Thread Jack Kolokasis
Thanks for your reply. Your help is very valuable and all these links are helpful (especially your example). Best Regards, --Iacovos On 3/27/19 10:42 PM, Luca Canali wrote: I find that the Spark metrics system is quite useful to gather resource utilization metrics of Spark applications,

RE: Spark Profiler

2019-03-27 Thread Luca Canali
I find that the Spark metrics system is quite useful for gathering resource utilization metrics of Spark applications, including CPU, memory, and I/O. If you are interested, there is an example of how this works for us at: https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark If
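To illustrate how the metrics system is wired up, a minimal conf/metrics.properties that ships all metrics to a Graphite endpoint might look like the sketch below; the host and port are invented:

    # conf/metrics.properties -- the Graphite endpoint is hypothetical
    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=graphite.example.com
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds

    # Optionally expose JVM metrics (heap, GC) from the driver and executors.
    driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
    executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource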

Streaming data out of spark to a Kafka topic

2019-03-27 Thread Mich Talebzadeh
Hi, In a traditional setup we get data via Kafka into Spark Streaming, do some work, and write to a NoSQL database like MongoDB, HBase, or Aerospike. That part can be done as below and is best explained by the code that follows: Once a high-value DF of lookups is created, I want to send the data to a new topic for

Re: Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Matt Kuiper
Thanks Gabor - your comment helps me clarify my question. Yes, I have maxFilesPerTrigger set to 1 on the Read Stream call. I am also seeing the Streaming Query process the single input file; however, a single file on input does not appear to result in the Streaming Query writing the output to

Re: Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Gabor Somogyi
Hi Matt, Maybe you could set maxFilesPerTrigger to 1. BR, G On Wed, Mar 27, 2019 at 4:45 PM Matt Kuiper wrote: > Hello, > > I am new to Spark and Structured Streaming and have the following File > Output Sink question: > > Wondering what (and how to modify) triggers a Spark Structured
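Putting the two halves of the thread together, a rough sketch of a file-source-to-Parquet-sink query that combines maxFilesPerTrigger with an explicit processing-time trigger; the paths and schema are invented:

    import org.apache.spark.sql.streaming.Trigger
    import org.apache.spark.sql.types._

    // Assumes an active SparkSession `spark`; file sources require a schema.
    val schema = new StructType().add("id", LongType).add("payload", StringType)

    val input = spark.readStream
      .schema(schema)
      .option("maxFilesPerTrigger", "1") // pick up at most one new file per micro-batch
      .json("/data/incoming")

    val query = input.writeStream
      .format("parquet")
      .option("path", "/data/parquet-out")
      .option("checkpointLocation", "/data/checkpoints/parquet-out")
      .trigger(Trigger.ProcessingTime("30 seconds")) // fire a micro-batch every 30s
      .start()

    query.awaitTermination()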

Parquet File Output Sink - Spark Structured Streaming

2019-03-27 Thread Matt Kuiper
Hello, I am new to Spark and Structured Streaming and have the following File Output Sink question: I am wondering what triggers a Spark Structured Streaming Query (with a Parquet file output sink configured) to write data to the Parquet files, and how to modify that. I periodically feed the Stream

Spark Kafka Batch Write guarantees

2019-03-27 Thread hemant singh
We are using Spark batch to write a DataFrame to a Kafka topic, using the write function with write.format("kafka"). Does Spark provide a guarantee similar to the one it provides when saving a DataFrame to disk, i.e. that partial data is not written to Kafka: either the full DataFrame is saved or, if the job fails, no
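For context, the batch write the question describes looks roughly like the sketch below; the broker and topic are assumptions. As far as I understand, the Kafka sink is at-least-once: a failed and retried task can leave duplicate records in the topic, so there is no all-or-nothing guarantee comparable to a committed file output.

    // Assumes an existing DataFrame `df`; key and value must be string or binary columns.
    df.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "output-topic")
      .save()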

Re: writing a small csv to HDFS is super slow

2019-03-27 Thread Gezim Sejdiu
Hi Lian, many thanks for the detailed information and for sharing the solution with us. I will forward this to a student, and hopefully it will resolve the issue. Best regards, On Wed, Mar 27, 2019 at 1:55 AM Lian Jiang wrote: > Hi Gezim, > > My execution plan of the data frame to write into HDFS is

RE: RPC timeout error for AES based encryption between driver and executor

2019-03-27 Thread Sinha, Breeta (Nokia - IN/Bangalore)
Hi Vanzin, "spark.authenticate" is working properly in our environment (Spark 2.4 on Kubernetes). We have made a few code changes through which secure communication between driver and executor works fine using a shared spark.authenticate.secret. Even SASL encryption works, but when we set,
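For reference, the AES-based (non-SASL) encryption path is enabled with the spark.network.crypto.* settings on top of spark.authenticate; a minimal sketch, with the secret left as a placeholder:

    spark.authenticate                true
    spark.authenticate.secret         <shared-secret>
    # AES-based RPC encryption (used instead of SASL encryption when enabled)
    spark.network.crypto.enabled      true
    # Optional: refuse to fall back to SASL when the peer lacks AES support
    spark.network.crypto.saslFallback false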