Challenges with Datasource V2 API

2019-06-25 Thread Sunita Arvind
Hello Spark Experts, I am having challenges using the DataSource V2 API. I created a mock data source. The input partitions seem to be created correctly; the output below confirms that:

    19/06/23 16:00:21 INFO root: createInputPartitions
    19/06/23 16:00:21 INFO root: Create a partition for abc

The …
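For orientation, a minimal mock read path against the Spark 2.4-era DataSource V2 API (the API was reworked again in 3.0) might look like the sketch below; the class names and the "abc" partition are illustrative, matching the log lines above.

    import java.util.{List => JList}
    import scala.collection.JavaConverters._
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}
    import org.apache.spark.unsafe.types.UTF8String

    // Entry point: loaded via spark.read.format(classOf[MockSource].getName).load()
    class MockSource extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader = new MockReader
    }

    class MockReader extends DataSourceReader {
      override def readSchema(): StructType = StructType(Seq(StructField("value", StringType)))
      // Planning side (driver): this is where "createInputPartitions"-style logging runs.
      override def planInputPartitions(): JList[InputPartition[InternalRow]] =
        Seq[InputPartition[InternalRow]](new MockPartition("abc")).asJava
    }

    // Serializable description of one partition; shipped to executors.
    class MockPartition(name: String) extends InputPartition[InternalRow] {
      override def createPartitionReader(): InputPartitionReader[InternalRow] =
        new MockPartitionReader(name)
    }

    // Executor side: emits a single row containing the partition name.
    class MockPartitionReader(name: String) extends InputPartitionReader[InternalRow] {
      private var consumed = false
      override def next(): Boolean = if (consumed) false else { consumed = true; true }
      override def get(): InternalRow = InternalRow(UTF8String.fromString(name))
      override def close(): Unit = ()
    }

One frequent stumbling block: the InputPartition objects are serialized out to executors, so non-serializable state captured in them fails at run time even though planning appears to succeed.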

Problem with the ML ALS algorithm

2019-06-25 Thread Steve Pruitt
I get an inexplicable exception when trying to build an ALSModel with implicitPrefs set to true. I can’t find any help online. Thanks in advance. My code is:

    ALS als = new ALS()
        .setMaxIter(5)
        .setRegParam(0.01)
        .setUserCol("customer")
        …
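A Scala sketch of the same pipeline with the implicit-feedback switch spelled out; the item/rating column names and the toy data are assumptions, since the original snippet is cut off:

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("als-implicit").getOrCreate()
    import spark.implicits._

    // Toy interaction data: (customer, item, strength of interaction).
    val training = Seq((1, 10, 1.0), (1, 20, 3.0), (2, 10, 2.0))
      .toDF("customer", "item", "rating")

    val als = new ALS()
      .setMaxIter(5)
      .setRegParam(0.01)
      .setImplicitPrefs(true)   // rating column is treated as confidence, not a score
      .setUserCol("customer")
      .setItemCol("item")       // assumed column name
      .setRatingCol("rating")   // assumed column name

    val model = als.fit(training)

Two frequent causes of opaque fit-time exceptions, with or without implicitPrefs: null values in the user/item/rating columns, and user or item ids that are not representable in Integer range.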

Spark Structured Streaming Custom Sources confusion

2019-06-25 Thread Lars Francke
Hi, I'm a bit confused about the current state and the future plans of custom data sources in Structured Streaming. For DStreams we could write a Receiver, as documented. Can this be used with Structured Streaming? Then we had the DataSource API with DefaultSource et al., which was (in my …
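On the first question: a Receiver plugs into the DStream machinery only; Structured Streaming has no hook for it and instead resolves sources by provider name through the DataSource APIs. A sketch of the contrast (class names illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.receiver.Receiver

    // DStream-only: pushes records into Spark Streaming via store().
    class ConstantReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      override def onStart(): Unit = new Thread("constant-receiver") {
        override def run(): Unit = while (!isStopped()) { store("tick"); Thread.sleep(1000) }
      }.start()
      override def onStop(): Unit = ()  // the worker thread polls isStopped()
    }

    object Contrast {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("contrast").getOrCreate()

        // DStream world: receiverStream accepts the Receiver above.
        val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
        ssc.receiverStream(new ConstantReceiver).print()

        // Structured Streaming world: sources are looked up by provider name
        // (here the built-in "rate" source); custom ones implement the
        // DataSource interfaces instead of Receiver.
        val df = spark.readStream.format("rate").option("rowsPerSecond", "1").load()
        df.printSchema()
      }
    }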

Distinguishing between field missing and null in individual record?

2019-06-25 Thread Jeff Evans
Suppose we have the following JSON, which we parse into a DataFrame (using the multiline option).

    [{
      "id": 8541,
      "value": "8541 changed again value"
    },{
      "id": 51109,
      "name": "newest bob",
      "value": "51109 changed again"
    }]

Regardless of whether we explicitly define a schema or allow it …
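A sketch of the behaviour in question, with the second record altered to carry an explicit null for contrast: after parsing, a missing "name" and "name": null both surface as null, so the distinction has to be recovered from the raw text (the rlike test below is one crude way):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().master("local[*]").appName("json-null").getOrCreate()
    import spark.implicits._

    val raw = Seq(
      """{"id": 8541, "value": "8541 changed again value"}""",           // "name" absent
      """{"id": 51109, "name": null, "value": "51109 changed again"}"""  // "name" explicitly null
    ).toDS()

    // Parsed view: both rows show name = null, indistinguishably.
    spark.read.json(raw).show()

    // Raw view: test for the key's presence before parsing.
    raw.toDF("json")
      .withColumn("has_name_key", col("json").rlike("\"name\"\\s*:"))
      .show(false)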

Implementing Upsert logic Through Streaming

2019-06-25 Thread Sachit Murarka
Hi All, I will receive records continuously as text files (streaming). Each record also carries a timestamp field. The target is an Oracle database. My goal is to maintain the latest record per key in Oracle. Could you please suggest how this can be implemented efficiently? Kind Regards, Sachit Murarka
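One common pattern, assuming Spark 2.4+ so that foreachBatch is available: deduplicate each micro-batch down to the newest record per key, then MERGE it into Oracle over JDBC. Everything below (the input schema and path, the table and column names, the connection string) is an assumption for illustration, and the Oracle JDBC driver must be on the classpath:

    import java.sql.DriverManager
    import org.apache.spark.sql.{DataFrame, Row, SparkSession}
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    object StreamingUpsert {
      def upsertBatch(batch: DataFrame, batchId: Long): Unit = {
        // Keep only the newest record per key within this micro-batch.
        val latest = batch
          .withColumn("rn", row_number().over(
            Window.partitionBy("id").orderBy(col("event_ts").desc)))
          .filter(col("rn") === 1)
          .drop("rn")

        latest.foreachPartition { rows: Iterator[Row] =>
          val conn = DriverManager.getConnection(
            "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password")  // assumed
          val merge = conn.prepareStatement(
            "MERGE INTO target t " +
            "USING (SELECT ? AS id, ? AS event_ts, ? AS payload FROM dual) s " +
            "ON (t.id = s.id) " +
            "WHEN MATCHED THEN UPDATE SET t.event_ts = s.event_ts, t.payload = s.payload " +
            "WHERE s.event_ts > t.event_ts " +  // only overwrite with newer data
            "WHEN NOT MATCHED THEN INSERT (id, event_ts, payload) " +
            "VALUES (s.id, s.event_ts, s.payload)")
          try {
            rows.foreach { r =>
              merge.setString(1, r.getAs[String]("id"))
              merge.setTimestamp(2, r.getAs[java.sql.Timestamp]("event_ts"))
              merge.setString(3, r.getAs[String]("payload"))
              merge.addBatch()
            }
            merge.executeBatch()
          } finally conn.close()
        }
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("upsert-to-oracle").getOrCreate()
        val stream = spark.readStream
          .schema("id STRING, event_ts TIMESTAMP, payload STRING")  // assumed layout
          .csv("/data/incoming")                                    // assumed input path
        stream.writeStream
          .foreachBatch(upsertBatch _)
          .option("checkpointLocation", "/data/checkpoints/upsert")  // assumed
          .start()
          .awaitTermination()
      }
    }

The MERGE's WHERE guard keeps a late-arriving older record from overwriting a newer one already in Oracle, which per-batch deduplication alone cannot guarantee.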

Potential Problem: Dropping malformed tables from CSV (PySpark)

2019-06-25 Thread Conor Begley
Hey, I'm currently working with CSV data and I've come across an unusual case. In Databricks, if I use the following code:

    GA_pages = sqlContext.read.format('com.databricks.spark.csv') \
        .options(header='true', inferschema='true', comment='#', mode='DROPMALFORMED') \
        .load(ga_sessions_path)

…
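For reference, com.databricks.spark.csv was folded into Spark 2.x as the built-in csv format, so an equivalent read looks like this (Scala here; the path stands in for ga_sessions_path):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("csv-dropmalformed").getOrCreate()

    val gaPages = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("comment", "#")
      .option("mode", "DROPMALFORMED")  // rows that don't fit the schema are silently discarded
      .csv("/mnt/ga_sessions")          // stands in for ga_sessions_path

Because DROPMALFORMED discards rows silently, row counts can shift without warning; re-reading with mode PERMISSIVE plus a columnNameOfCorruptRecord column is a quick way to inspect which rows are being dropped and why.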