Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-28 Thread Tathagata Das
I made a JIRA for it - https://issues.apache.org/jira/browse/SPARK-23539 Unfortunately, it is blocked by the Kafka version upgrade, which has a few nasty issues related to Kafka bugs - https://issues.apache.org/jira/browse/SPARK-18057

[Structured Streaming] Handling Kafka Stream messages with different JSON Schemas.

2018-02-28 Thread karthikjay
Hi all, I have the following code to stream Kafka data, apply a schema called "hbSchema" to it, and then act on the data.

  val df = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "10.102.255.241:9092")
    .option("subscribe", "mabr_avro_json1")
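A minimal sketch of how such a stream is typically parsed, assuming the JSON payload is applied with from_json; the field names in hbSchema below are hypothetical (the real schema comes from the poster's code), only the broker address and topic above are from the post:

  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types._

  // hypothetical heartbeat schema; substitute the real hbSchema fields
  val hbSchema = new StructType()
    .add("deviceId", StringType)
    .add("ts", LongType)

  val parsed = df
    .selectExpr("CAST(value AS STRING) AS json")          // Kafka value arrives as bytes
    .select(from_json(col("json"), hbSchema).as("data"))  // apply the schema
    .select("data.*")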

Using Thrift with Dataframe

2018-02-28 Thread Nikhil Goyal
Hi guys, I have an RDD of thrift structs. I want to convert it into a DataFrame. Can someone suggest how I can do this? Thanks, Nikhil
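One common approach (a sketch, not the only way): since Spark has no built-in encoder for thrift types, map each struct to a Row against an explicit schema. The Person struct, its getters, and thriftRdd here are all hypothetical:

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder().getOrCreate()

  // schema mirroring the thrift struct's fields
  val schema = StructType(Seq(
    StructField("name", StringType),
    StructField("age", IntegerType)))

  // extract plain values from each thrift object
  val rowRdd = thriftRdd.map(p => Row(p.getName, p.getAge))
  val df = spark.createDataFrame(rowRdd, schema)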

Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-28 Thread karthikus
TD, Thanks for your response.

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread Tathagata Das
There is no good way to save to parquet without causing downstream consistency issues. You could use foreachRDD to get each RDD, convert it to a DataFrame/Dataset, and write it out as parquet files. But you will later run into issues with partial files caused by failures, etc.
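A rough sketch of the foreachRDD approach described above, assuming string-valued Kafka records and a hypothetical output path:

  import org.apache.spark.sql.SparkSession

  kafkaStream.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      val spark = SparkSession.builder().getOrCreate()
      import spark.implicits._
      // pull the message payload out of each ConsumerRecord
      val df = rdd.map(_.value()).toDF("value")
      df.write.mode("append").parquet("/data/kafka-parquet")  // hypothetical path
    }
  }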

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread Vadim Semenov
Yeah, without actually seeing what's happening on that line, it'd be difficult to say for sure. You can check what patches HortonWorks applied, and/or ask them. And yeah, a segfault is totally possible on any size of data, but you should've seen it in the `stdout`.

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread Patrick Alwell
I don’t think sql context is “deprecated” in this sense. It’s still accessible from earlier versions of Spark. But yes, at first glance it looks like you are correct: I don’t see a recordWriter method for parquet outside of the SQL package.

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi Vadim, thanks. I use the HortonWorks package. I don't think there are any segfaults; the dataframe I am trying to write is very small in size. Can it still cause a segfault?

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread Vadim Semenov
Who's your Spark provider? EMR, Azure, Databricks, etc.? Maybe contact them, since they've probably applied some patches. Also, have you checked `stdout` for segfaults? I vaguely remember getting `Task failed while writing rows at` and seeing some segfaults that caused it.

[Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread karthikus
Hi all, I have Kafka stream data and I need to save it in parquet format without using Structured Streaming (due to the lack of Kafka message header support).

  val kafkaStream = KafkaUtils.createDirectStream(
    streamingContext,
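For reference, the createDirectStream call typically continues along these lines (a sketch against the kafka-0-10 integration; the broker address, group id, and topic are placeholders):

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",            // placeholder broker
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "example-group")

  val kafkaStream = KafkaUtils.createDirectStream[String, String](
    streamingContext,
    PreferConsistent,
    Subscribe[String, String](Seq("some-topic"), kafkaParams))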

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks Vadim, you are right. I already looked at line 468; I don't see any code there, it is just a comment. Yes, I am sure I am using all spark-* jars built for Spark 2.2.0 and Scala 2.11. I am unfortunately stuck with these errors and not sure how to solve them.

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread Vadim Semenov
I'm sorry, I didn't see `Caused by: java.lang.NullPointerException at org.apache.spark.sql.SparkSession$$anonfun$3.apply(SparkSession.scala:468)`. Are you sure that you use 2.2.0? I don't see any code on that line.

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, thanks for the reply. I only see the NPE and "Task failed while writing rows" all over the place; I don't see any other errors except the SparkException "job aborted", followed by the two exceptions I pasted earlier.

Re: org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread Vadim Semenov
There should be another exception trace (basically, the actual cause) after this one; could you post it?

org.apache.spark.SparkException: Task failed while writing rows

2018-02-28 Thread unk1102
Hi, I am getting the following exception when I try to write a DataFrame using the following code. Please guide. I am using Spark 2.2.0.

  df.write.format("parquet").mode(SaveMode.Append);

  org.apache.spark.SparkException: Task failed while writing rows
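Note that the pasted snippet only configures the writer and never triggers it; a complete call needs a terminal action such as save(), which presumably appears in the full code the truncated trace came from. A minimal sketch (the output path is hypothetical):

  import org.apache.spark.sql.SaveMode

  df.write
    .format("parquet")
    .mode(SaveMode.Append)
    .save("/data/output")  // hypothetical path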

Joins in spark for large tables

2018-02-28 Thread KhajaAsmath Mohammed
Hi, Is there a best approach to reduce shuffling in Spark joins? I have two tables and both of them are large. Any suggestions? I have only seen broadcast joins mentioned, but that will not work in my case. Thanks, Asmath
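When both sides are too large to broadcast, one common mitigation (a sketch, not a universal fix, and not something discussed in this thread) is to bucket both tables on the join key once, so later sort-merge joins can skip the shuffle. The table names, join key, and bucket count below are all hypothetical:

  // write both tables bucketed and sorted on the join key
  left.write.bucketBy(200, "id").sortBy("id").saveAsTable("left_bucketed")
  right.write.bucketBy(200, "id").sortBy("id").saveAsTable("right_bucketed")

  // a later join on "id" can then avoid a full shuffle of either side
  val joined = spark.table("left_bucketed")
    .join(spark.table("right_bucketed"), "id")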

Re: how to add columns to row when column has a different encoder?

2018-02-28 Thread David Capwell
Anyone know a way to do this right now? As best as I can tell, I need a custom Expression to pass to udf to do this. I just finished a protobuf encoder, and it feels like Expression is not meant to be public (a good amount of it is private[sql]). Am I wrong about this?