RE: How to avoid long-running jobs blocking short-running jobs

2018-11-05 Thread Taylor Cox
Hi Conner, What is preventing you from using a cluster model? I wonder if Docker containers could help you here? A quick internet search yielded Mist: https://github.com/Hydrospheredata/mist Could be useful? Taylor

RE: Shuffle write explosion

2018-11-05 Thread Taylor Cox
At first glance, I wonder if your tables are partitioned? There may not be enough parallelism happening. You can also pass in the number of partitions and/or a custom partitioner to help Spark “guess” how to organize the shuffle. Have you seen any of these docs?
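To make the partitioning suggestion concrete, here is a minimal plain-Python sketch (no Spark dependency; the function names are illustrative, not Spark's API) of what a hash partitioner does: it maps each key to one of N partitions, which is what you control in Spark when you pass a partition count or a custom partition function to `partitionBy`.

```python
# Plain-Python sketch of hash partitioning -- illustrative names only.
# In Spark, the equivalent knob is rdd.partitionBy(numPartitions, partitionFunc).

def hash_partitioner(key, num_partitions):
    """Assign a key to a partition by hashing, as Spark's default partitioner does."""
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    """Group (key, value) pairs into the partition each key hashes to."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash_partitioner(key, num_partitions)].append((key, value))
    return partitions

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition_records(records, 4)
# Every record with the same key lands in the same partition, so a shuffle
# only has to move each key's records to one place. More partitions means
# more parallelism, at the cost of more (smaller) shuffle files.
```

The point of the sketch: if your tables arrive in very few partitions, the shuffle write concentrates on a handful of tasks; raising the partition count (or supplying a partitioner that matches your key distribution) spreads that work out.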

RE: CSV parser - is there a way to find malformed csv record

2018-10-09 Thread Taylor Cox
Hey Nirav, Here’s an idea: Suppose your file.csv has N records, one for each line. Read the csv line-by-line (without spark) and attempt to parse each line. If a record is malformed, catch the exception and rethrow it with the line number. That should show you where the problematic record(s) are.
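A minimal sketch of that line-by-line check using Python's standard `csv` module (the function name and the "wrong column count" rule are assumptions for illustration; adapt the check to whatever makes a record malformed in your data):

```python
import csv

def find_bad_rows(lines, expected_cols):
    """Parse CSV text line by line and return the 1-based line numbers of
    records that fail to parse or have the wrong number of columns."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        try:
            row = next(csv.reader([line]))
        except csv.Error:
            # Parser-level failure (e.g. a NUL byte in the line).
            bad.append(lineno)
            continue
        if len(row) != expected_cols:
            bad.append(lineno)
    return bad

lines = ["a,b,c", "d,e", "f,g,h"]
print(find_bad_rows(lines, expected_cols=3))  # -> [2]
```

Once you know the offending line numbers, you can inspect or drop those records before handing the file to Spark.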

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
Have a look at this guide here: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html You should be able to send your sensor data to a Kafka topic, which Spark will subscribe to. You may need to use an Input DStream to connect

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
Hey Zeinab, We may have to take a small step back here. The sliding window approach (i.e., the window operation) is unique to data-stream mining. So it makes sense that window() is restricted to DStream. It looks like you're not using a stream mining approach. From what I can see in your code,
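If the data is a plain (batch) RDD or list rather than a stream, a sliding window is just an iteration pattern. A minimal plain-Python analogue, with no Spark dependency (the function name, `size`, and `step` parameters are illustrative):

```python
from collections import deque

def sliding_windows(items, size, step=1):
    """Collect successive fixed-size windows over a sequence, advancing by
    `step` elements each time -- a batch analogue of a stream's window()."""
    window = deque(maxlen=size)  # oldest element falls out automatically
    out = []
    for i, item in enumerate(items):
        window.append(item)
        # Emit once the window is full, at every `step`-th position.
        if len(window) == size and (i - size + 1) % step == 0:
            out.append(list(window))
    return out

readings = [1, 2, 3, 4, 5]
print(sliding_windows(readings, size=3, step=1))
# -> [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

On an actual RDD the same effect can be had by mapping each element to the window indices it belongs to and grouping, but for sensor data arriving continuously, the Kafka + Structured Streaming route from the earlier reply is the more natural fit.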