[Structured Streaming][Parquet] How to specify partition and data when saving to Parquet

2018-03-02 Thread karthikjay
My DataFrame has the following schema:

root
 |-- data: struct (nullable = true)
 |    |-- zoneId: string (nullable = true)
 |    |-- deviceId: string (nullable = true)
 |    |-- timeSinceLast: long (nullable = true)
 |-- date: date (nullable = true)

How can I do a writeStream with Parquet format?
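
A minimal sketch of one way to do it, assuming the schema above; the paths are placeholders, and partitionBy on the DataStreamWriter controls the output directory layout:

val query = df
  .select("data.zoneId", "data.deviceId", "data.timeSinceLast", "date")
  .writeStream
  .format("parquet")
  .option("path", "/path/to/output")             // placeholder output path
  .option("checkpointLocation", "/path/to/ckpt") // placeholder checkpoint path
  .partitionBy("date")                           // one subdirectory per date value
  .start()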

Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
For pyspark specifically, IMO it should be very high on the list to port back... As for the roadmap - we should be sharing more soon.

Re: Question on Spark-kubernetes integration

2018-03-02 Thread lucas.g...@gmail.com
Oh interesting, given that pyspark was working in spark on kub 2.2 I assumed it would be part of what got merged. Is there a roadmap for when that may get merged up? Thanks!

Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
That's in the plan. We should be sharing a bit more about the roadmap for future releases shortly. In the meantime, the official documentation covers what is coming: https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work This support started as a fork of the Apache

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Tathagata Das
Structured Streaming's file sink solves these problems by writing a log/manifest of all the authoritative files written out (for any format). So if you run batch or interactive queries on the output directory with Spark, it will automatically read the manifest and only process files that are
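
A minimal sketch of what that looks like end to end, with placeholder paths and an assumed streaming DataFrame inputDf:

// The file sink records every committed file in a _spark_metadata
// log under the output path.
inputDf.writeStream
  .format("parquet")
  .option("path", "/data/out")                // placeholder
  .option("checkpointLocation", "/data/ckpt") // placeholder
  .start()

// A later batch read of the same path consults that log and skips
// partial or uncommitted files automatically.
val df = spark.read.parquet("/data/out")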

Re: Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread ayan guha
Hi, Couple of questions:
1. It seems the error is due to number format:

Caused by: java.util.concurrent.ExecutionException: java.lang.NumberFormatException: For input string: "0003024_"
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Sunil Parmar
Is there a way to get finer control over file writing in the Parquet file writer? We have a streaming application using Apache Apex (on a path of migration to Spark ... story for a different thread). The existing streaming application reads JSON from Kafka and writes Parquet to HDFS. We're trying to
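
For reference, a minimal sketch of that pipeline in Structured Streaming; the broker address, topic name, JSON schema, and paths are all assumptions:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Assumed schema for the incoming JSON records
val schema = new StructType()
  .add("id", StringType)
  .add("value", LongType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumption
  .option("subscribe", "events")                    // assumption
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")

parsed.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")             // placeholder
  .option("checkpointLocation", "hdfs:///data/ckpt") // placeholder
  .start()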

Re: Can I get my custom spark strategy to run last?

2018-03-02 Thread Vadim Semenov
Something like this?

sparkSession.experimental.extraStrategies = Seq(Strategy)
val logicalPlan = df.logicalPlan
val newPlan: LogicalPlan = Strategy(logicalPlan)
Dataset.ofRows(sparkSession, newPlan)
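
For context, a minimal sketch of how a custom strategy is defined and registered; MyStrategy here is a hypothetical placeholder. Since extraStrategies are tried before the built-in strategies, the snippet above applies the strategy to the plan by hand to make it effectively run last:

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical strategy: returning Nil makes the planner fall through
// to the next strategy in the list.
object MyStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    case _ => Nil
  }
}

val spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.experimental.extraStrategies = Seq(MyStrategy)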

Pyspark Error: Unable to read a hive table with transactional property set as 'True'

2018-03-02 Thread Debabrata Ghosh
Hi All, Greetings! I need some help reading a Hive table via PySpark for which the transactional property is set to 'True' (in other words, the ACID property is enabled). Following is the entire stacktrace and the description of the Hive table. Would you please be able to help

Spark Streaming reading many topics with Avro

2018-03-02 Thread Guillermo Ortiz
Hello, I want to read several topics with a single Spark Streaming process. I'm using Avro, and the data in the different topics has different schemas. Ideally, if I had only one topic I could implement a deserializer, but I don't know if it's possible with many different schemas. val
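
One possible approach, sketched below: a Kafka Deserializer that keys the Avro schema off the topic name. The class name and the schema map are assumptions; a schema registry lookup would serve the same purpose:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.kafka.common.serialization.Deserializer

// Hypothetical deserializer: Kafka passes the topic name to
// deserialize(), so the schema can be chosen per topic.
class PerTopicAvroDeserializer(schemasByTopic: Map[String, Schema])
    extends Deserializer[GenericRecord] {

  override def configure(configs: java.util.Map[String, _], isKey: Boolean): Unit = ()

  override def deserialize(topic: String, data: Array[Byte]): GenericRecord = {
    val reader = new GenericDatumReader[GenericRecord](schemasByTopic(topic))
    reader.read(null, DecoderFactory.get.binaryDecoder(data, null))
  }

  override def close(): Unit = ()
}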

Question on Spark-kubernetes integration

2018-03-02 Thread Lalwani, Jayesh
Does the Resource scheduler support dynamic resource allocation? Are there any plans to add it in the future?

Re: K Means Clustering Explanation

2018-03-02 Thread Matt Hicks
Thanks Alessandro and Christoph. I appreciate the feedback, but I'm still having issues determining how to actually accomplish this with the API. Can anyone point me to an example in code showing how to accomplish this?

Re: K Means Clustering Explanation

2018-03-02 Thread Alessandro Solimando
Hi Matt, similarly to what Christoph does, I first derive the cluster id for the elements of my original dataset, and then I use a classification algorithm (cluster ids being the classes here). For this method to be useful you need a "human-readable" model; tree-based models are generally a good
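
A minimal sketch of that recipe with Spark ML, assuming a DataFrame df that has a "features" vector column; k and the column names are placeholders:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.sql.functions.col

// Step 1: cluster, then attach the cluster id to every row
// (KMeans writes it to the "prediction" column)
val kmeans = new KMeans().setK(3).setFeaturesCol("features")
val clustered = kmeans.fit(df).transform(df)
  .withColumn("cluster", col("prediction").cast("double"))

// Step 2: fit an interpretable classifier with cluster ids as labels
val tree = new DecisionTreeClassifier()
  .setLabelCol("cluster")
  .setFeaturesCol("features")
  .fit(clustered)

// The splits in the printed tree describe what characterizes each cluster
println(tree.toDebugString)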