RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
Have a look at this guide: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html You should be able to send your sensor data to a Kafka topic, which Spark will subscribe to. You may need to use an Input DStream to connect

Re: CSV parser - how to parse column containing json data

2018-10-02 Thread Brandon Geise
Do your schema inference and then apply the JSON schema using withColumn, overwriting the String representation. From: Nirav Patel Date: Tuesday, October 2, 2018 at 5:00 PM To: Cc: spark users Subject: Re: CSV parser - how to parse column containing json data I need to inferSchema from

RE: Unable to read multiple JSON.Gz File.

2018-10-02 Thread Jyoti Ranjan Mahapatra
Hi Mahender, Which version of Spark and Hadoop are you using? I tried it on Spark 2.3.1 with Hadoop 2.7.3, and it works for a folder containing multiple .gz files. From: Mahender Sarangam Sent: Monday, October 1, 2018 2:00 AM To: user@spark.apache.org Subject: Unable to read multiple JSON.Gz

Re: CSV parser - how to parse column containing json data

2018-10-02 Thread Nirav Patel
I need to inferSchema from the CSV as well. Per your solution, I am creating a StructType only for the JSON field. So how am I going to mix and match here? That is, do type inference for all fields except the JSON field, and use the custom json_schema for the JSON field. On Thu, Aug 30, 2018 at 5:29 PM Brandon Geise

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread zakhavan
Thank you, Taylor, for your reply. The second solution doesn't work for my case since my text files are updated every second. My input data is live: I'm getting two streams of data from two seismic sensors, and then I write them into two text files for simplicity, and this is

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
Hey Zeinab, We may have to take a small step back here. The sliding window approach (i.e., the window() operation) is unique to data-stream mining, so it makes sense that window() is restricted to DStreams. It looks like you're not using a stream-mining approach. From what I can see in your code,

How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread zakhavan
Hello, I have two text files of the following form, and my goal is to calculate the Pearson correlation between them using a sliding window in PySpark: 123.00 -12.00 334.00 . . . First I read these two text files and store them as RDDs, and then I apply the window operation on each RDD, but I keep
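Whatever the streaming mechanics turn out to be, the per-window Pearson computation itself is plain arithmetic, so it can be prototyped without window() at all. The sketch below uses ordinary Python lists standing in for the two files' readings; the values and the 4-element/2-step window are made up. On paired RDDs, `pyspark.mllib.stat.Statistics.corr(rdd1, rdd2)` computes the same coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def sliding(seq, size, step):
    """Yield successive windows of `size` elements, advancing by `step`."""
    for i in range(0, len(seq) - size + 1, step):
        yield seq[i:i + size]

# Hypothetical readings standing in for the two sensors' text files.
a = [123.0, -12.0, 334.0, 45.0, 67.0, 89.0]
b = [120.0, -10.0, 330.0, 50.0, 60.0, 95.0]

# One correlation per aligned pair of windows.
for wa, wb in zip(sliding(a, 4, 2), sliding(b, 4, 2)):
    print(round(pearson(wa, wb), 4))
```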