Re: Structured Streaming : Custom Source and Sink Development and PySpark.

2018-08-30 Thread Russell Spitzer
Yes, Scala or Java. No: once you have written the implementation, it is valid in all DataFrame APIs. As for examples, there are many; check the Kafka source in-tree or one of the many sources listed on the Spark Packages website. On Thu, Aug 30, 2018, 8:23 PM Ramaswamy, Muthuraman <

Structured Streaming : Custom Source and Sink Development and PySpark.

2018-08-30 Thread Ramaswamy, Muthuraman
I would like to develop a custom Source and Sink. So, I have a couple of questions: 1. Do I have to use Scala or Java to develop these custom Source/Sink? 2. Also, once the source/sink has been developed, to use it from PySpark/Python, do I have to develop any Py4J modules? Any pointers or

Re: CSV parser - how to parse column containing json data

2018-08-30 Thread Brandon Geise
If you know your JSON schema you can create a struct and then apply it using from_json: val json_schema = StructType(Array(StructField("x", StringType, true), StructField("y", StringType, true), StructField("z", IntegerType, true))) .withColumn("_c3",
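Brandon's snippet is cut off mid-expression; below is a fuller sketch of the same `from_json` approach, assuming a hypothetical DataFrame whose `_c3` column carries the JSON string (and swapping `y` to `BooleanType` to match the `"y":false` in Nirav's sample row):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

object FromJsonExample {
  // Parse the JSON text held in column _c3 into a typed struct column.
  def parse(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Schema of the JSON embedded in the CSV column.
    val json_schema = StructType(Array(
      StructField("x", StringType, true),
      StructField("y", BooleanType, true), // BooleanType: the sample has "y":false
      StructField("z", IntegerType, true)))

    // Stand-in for a DataFrame produced by spark.read.csv; _c3 holds the JSON text.
    val df = Seq("""{"x":"xx","y":false,"z":123}""").toDF("_c3")

    // Replace the raw string column with a parsed struct.
    df.withColumn("_c3", from_json($"_c3", json_schema))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("from_json-example").getOrCreate()
    // Struct fields can then be selected with dot notation.
    parse(spark).select("_c3.x", "_c3.y", "_c3.z").show()
    spark.stop()
  }
}
```

Once parsed, the nested fields behave like ordinary columns (`_c3.x`, `_c3.y`, `_c3.z`).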

CSV parser - how to parse column containing json data

2018-08-30 Thread Nirav Patel
Is there a way to parse a CSV file with a column in the middle containing a JSON data structure? "a",102,"c","{"x":"xx","y":false,"z":123}","d","e",102.2 Thanks, Nirav

Local mode vs client mode with one executor

2018-08-30 Thread Guillermo Ortiz
I have many Spark processes. Some of them are pretty simple and have almost no messages to process, but they were developed from the same archetype and they use Spark. Some of them are executed with many executors, but for a few it doesn't make sense to use more than 2-4 cores in only
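For a job that small, the two deployment shapes being compared might look like the following spark-submit invocations (master URL, class, and jar names are hypothetical):

```shell
# Local mode: driver and executors share a single JVM with 2 threads;
# no cluster resources are held at all.
spark-submit --master "local[2]" --class com.example.SmallJob app.jar

# Client mode against a standalone master, capped at a single small
# executor (2 cores total) so the job doesn't grab the whole cluster.
spark-submit --master spark://master:7077 --deploy-mode client \
  --total-executor-cores 2 --executor-cores 2 --executor-memory 1g \
  --class com.example.SmallJob app.jar
```

`--total-executor-cores` is the standalone-mode cap; without it, a standalone app takes every available core by default.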

Re: spark structured streaming jobs working in HDP2.6 fail in HDP3.0

2018-08-30 Thread Lian Jiang
I solved this issue by using spark 2.3.1 jars copied from the HDP3.0 cluster. Thanks. On Thu, Aug 30, 2018 at 10:18 AM Lian Jiang wrote: > Per https://spark.apache.org/docs/latest/building-spark.html, spark 2.3.1 > is built with hadoop 2.6.X by default. This is why I see my fat jar > includes

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-30 Thread Cody Koeninger
You're using an older version of Spark, with what looks like a manually included kafka-clients jar (1.0) that differs from the version that release of the Spark connector was written to depend on (0.10.0.1), so there's no telling what's going on. On Wed, Aug 29, 2018 at 3:40 PM, Guillermo

Re: ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-30 Thread Cody Koeninger
I doubt that fix will get backported to 2.3.x. Are you able to test against master? 2.4, with the fix you linked to, is likely to hit code freeze soon. From a quick look at your code, I'm not sure why you're mapping over an array of brokers. It seems like that would result in different streams
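The alternative Cody is pointing at can be sketched as a single direct stream whose consumer config carries the whole broker list, rather than one stream per broker (broker addresses, topic, and group id here are made up):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SingleStreamExample {
  // The full broker list goes into one bootstrap.servers entry; there is
  // no need to map over brokers and build a separate stream per broker.
  val kafkaParams: Map[String, Object] = Map(
    "bootstrap.servers" -> "broker1:9092,broker2:9092,broker3:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "example-group",
    "auto.offset.reset" -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("single-stream")
    val ssc = new StreamingContext(conf, Seconds(5))

    // One stream, one consumer group, all partitions of the topic.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.map(record => record.value).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With a single stream, the spark-streaming-kafka-0-10 connector's cached consumers are shared per partition, instead of competing across per-broker streams.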

ConcurrentModificationExceptions with CachedKafkaConsumers

2018-08-30 Thread Bryan Jeffrey
Hello, Spark Users. We have an application using Spark 2.3.0 and the 0.8 Kafka client. We have a Spark streaming job, and we're reading a reasonable amount of data from Kafka (40 GB / minute or so). We would like to move to the Kafka 0.10 client to avoid requiring our (0.10.2.1) Kafka

Re: Default Java Opts Standalone

2018-08-30 Thread Sonal Goyal
Hi Eevee, For the executor, have you tried a. passing --conf "spark.executor.extraJavaOptions=-XX" as part of the spark-submit command line, if you want it application-specific, OR b. setting spark.executor.extraJavaOptions in conf/spark-defaults.conf for all jobs. Thanks, Sonal Nube Technologies
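Spelled out, the two options look like this (class and jar names are hypothetical; the flag itself is the one from the original question):

```shell
# (a) Per-application, on the submit command line:
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+ExitOnOutOfMemoryError" \
  --class com.example.MyApp app.jar

# (b) For every job, as a line in conf/spark-defaults.conf:
#   spark.executor.extraJavaOptions  -XX:+ExitOnOutOfMemoryError
```

Note that spark.executor.extraJavaOptions only affects executor JVMs, not the standalone worker daemons.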

Re: spark structured streaming jobs working in HDP2.6 fail in HDP3.0

2018-08-30 Thread Lian Jiang
Per https://spark.apache.org/docs/latest/building-spark.html, spark 2.3.1 is built with hadoop 2.6.X by default. This is why I see my fat jar includes hadoop 2.6.5 (instead of 3.1.0) jars. HftpFileSystem has been removed in hadoop 3. On https://spark.apache.org/downloads.html, I only see spark
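One way to keep the wrong Hadoop jars out of a fat jar in the first place is to mark Spark as a provided dependency, so its transitive Hadoop 2.6.x artifacts are not bundled (a build.sbt sketch with assumed module names; the actual fix in the follow-up message was copying the cluster's Spark 2.3.1 jars):

```scala
// build.sbt sketch: "provided" keeps spark-sql and its transitive
// Hadoop 2.6.x jars out of the assembly, so the cluster's own
// Spark/Hadoop 3.1.0 classes are the ones on the runtime classpath.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.3.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided")
```

This only helps when the cluster supplies compatible Spark jars at runtime, which is exactly what copying the HDP 3.0 jars achieved.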

spark structured streaming jobs working in HDP2.6 fail in HDP3.0

2018-08-30 Thread Lian Jiang
Hi, I am using HDP3.0, which uses Hadoop 3.1.0 and Spark 2.3.1. My Spark streaming jobs that run fine in HDP2.6.4 (Hadoop 2.7.3, Spark 2.2.0) fail in HDP3: java.lang.IllegalAccessError: class org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface

Unsubscribe

2018-08-30 Thread Michael Styles

Default Java Opts Standalone

2018-08-30 Thread Evelyn Bayes
Hey all, I'm stuck trying to set a parameter in spark-env.sh and I'm hoping someone here knows how. I want to set the JVM option -XX:+ExitOnOutOfMemoryError for both Spark executors and Spark workers in standalone mode. So far my best guess is: Worker
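A sketch of the two knobs involved, assuming a standalone deployment (option names are from Spark's configuration and standalone-mode docs; worth verifying against your version, since worker daemons and executors are separate JVMs configured in different places):

```shell
# conf/spark-env.sh -- JVM options for the standalone daemons
# (master and worker processes):
SPARK_DAEMON_JAVA_OPTS="-XX:+ExitOnOutOfMemoryError"

# Executor JVMs are launched per-application; set their flags via
# spark.executor.extraJavaOptions, e.g. in conf/spark-defaults.conf:
#   spark.executor.extraJavaOptions  -XX:+ExitOnOutOfMemoryError
```

Setting only one of the two leaves the other kind of JVM without the flag.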

Spark Structured Streaming checkpointing with S3 data source

2018-08-30 Thread sherif98
I have data that is continuously pushed to multiple S3 buckets. I want to set up a structured streaming application that uses the S3 buckets as the data source and does stream-stream joins. My question is: if the application is down for some reason, would restarting the application continue
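A minimal sketch of such a query, with hypothetical bucket names and made-up schemas; the point it illustrates is that resume-on-restart hinges on reusing the same checkpointLocation across runs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object S3JoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-stream-join").getOrCreate()

    // File sources require an explicit schema; these fields are invented.
    val leftSchema  = new StructType().add("id", StringType).add("clickTs", TimestampType)
    val rightSchema = new StructType().add("id", StringType).add("impressionTs", TimestampType)

    val left  = spark.readStream.schema(leftSchema).json("s3a://bucket-a/events/")
    val right = spark.readStream.schema(rightSchema).json("s3a://bucket-b/events/")

    // Which input files have already been processed (plus the join state)
    // is recorded under checkpointLocation; restarting the application with
    // the same path resumes from where the query left off.
    left.join(right, "id")
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "s3a://bucket-c/checkpoints/join-query/")
      .option("path", "s3a://bucket-c/output/")
      .start()
      .awaitTermination()
  }
}
```

If the checkpoint directory is deleted or changed between runs, the query starts over from scratch instead of resuming.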