Re: Infer JSON schema in structured streaming Kafka.

2017-12-10 Thread satyajit vegesna
Hi Jacek, Thank you for responding. I have tried the memory sink, and below is what I did: val fetchValue = debeziumRecords.selectExpr("value").withColumn("tableName", functions.get_json_object($"value".cast(StringType), "$.schema.name")) .withColumn("operation",
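The snippet above is cut off, but a minimal sketch of the memory-sink experiment it describes might look like the following. The `$.payload.op` path, the query name, and the Debezium field layout are assumptions, not confirmed by the thread:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

// Extract routing fields from the Debezium payload. The "$.payload.op"
// path is an assumption about where the operation type lives.
val fetchValue = debeziumRecords
  .selectExpr("value")
  .withColumn("tableName",
    get_json_object($"value".cast(StringType), "$.schema.name"))
  .withColumn("operation",
    get_json_object($"value".cast(StringType), "$.payload.op"))

// Memory sink: rows are collected into an in-memory table that can be
// queried interactively while the stream runs.
val query = fetchValue.writeStream
  .format("memory")
  .queryName("debezium_sample")
  .outputMode("append")
  .start()

spark.sql("SELECT tableName, operation FROM debezium_sample").show()
```

The memory sink is intended for debugging on small volumes, since all results are held in driver memory.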

Why Spark 2.2.1 still bundles old Hive jars?

2017-12-10 Thread An Qin
Hi, all, I want to include Sentry 2.0.0 in my Spark project. However, it bundles Hive 2.3.2. I find that the newest Spark 2.2.1 still bundles old Hive jars, for example hive-exec-1.2.1.spark2.jar. Why doesn't it upgrade to the newer Hive? Are they compatible? Regards, Qin An.

Loading a spark dataframe column into T-Digest using java

2017-12-10 Thread Himasha de Silva
Hi, I want to load a Spark dataframe column into a T-Digest using Java to calculate quantile values. I wrote the code below to do this, but it gives zero for the size of the t-digest; values are not being added to the tDigest. My code: https://gist.github.com/anonymous/1f2e382fdda002580154b5c43fbe9b3a Thank you.
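A size of zero often means the digest was populated inside a `foreach`/`map` closure, so the additions happened on serialized copies on the executors while the driver's digest stayed empty. One simple fix (a sketch, not the poster's code; the column name `amount` and the use of Scala rather than Java are assumptions) is to bring the column back to the driver before adding values:

```scala
import com.tdunning.math.stats.TDigest

// Collect the numeric column to the driver; reasonable only when the
// column fits in driver memory.
val values: Array[Double] = df
  .select($"amount".cast("double"))
  .na.drop()
  .as[Double]
  .collect()

// Build the digest in one place, on the driver.
val digest = TDigest.createDigest(100) // compression parameter
values.foreach(digest.add)

println(s"size = ${digest.size()}, median = ${digest.quantile(0.5)}")
```

For large columns, an aggregation that builds partial digests per partition and merges them on the driver avoids the `collect`.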

Re: Infer JSON schema in structured streaming Kafka.

2017-12-10 Thread Jacek Laskowski
Hi, What about the memory sink? That could work. Regards, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On

Re: pyspark + from_json(col("col_name"), schema) returns all null

2017-12-10 Thread Jacek Laskowski
Hi, Not that I'm aware of, but in your case, checking whether a JSON message fits your schema and the pipeline could have been done with pyspark alone, using JSON files on disk, couldn't it? Regards, Jacek Laskowski https://about.me/JacekLaskowski Spark Structured Streaming

Infer JSON schema in structured streaming Kafka.

2017-12-10 Thread satyajit vegesna
Hi All, I would like to infer the JSON schema from a sample of the data I receive from Kafka streams (a specific topic). I have to infer the schema because I am going to receive arbitrary JSON strings with a different schema for each topic, so I chose to go ahead with the steps below: a. readStream from
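The steps in the message are cut off, but one common workaround for this problem (a sketch of an assumed pipeline, not the poster's actual steps; broker address and topic name are placeholders) is to read a small batch sample from the topic, let `spark.read.json` infer the schema, and then reuse that schema in the streaming query:

```scala
import org.apache.spark.sql.functions.from_json

// Batch read a sample of the topic (non-streaming), as strings.
val sample = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "mytopic")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .as[String]

// Let Spark infer a schema from the sampled JSON strings.
val inferredSchema = spark.read.json(sample).schema

// Apply the inferred schema to the streaming read.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "mytopic")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")
  .select(from_json($"value", inferredSchema).as("data"))
```

Note the caveat: the schema is fixed when the streaming query starts, so messages whose shape drifts from the sample will not be picked up without restarting the query.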

Re: pyspark + from_json(col("col_name"), schema) returns all null

2017-12-10 Thread salemi
I found the root cause! There was a mismatch between the StructField type and the JSON message. Is there a good write-up / wiki out there that describes how to debug Spark jobs? Thanks -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
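The all-null symptom in this thread's title is what `from_json` typically produces when a declared field type cannot be parsed from the JSON. A minimal sketch of the failure mode (in Scala rather than the poster's pyspark; the field name `id` is an illustration):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val df = Seq("""{"id": "abc"}""").toDF("value")

// Mismatch: id is declared as Long while the JSON holds a string,
// so the parsed struct typically comes back null.
val bad = new StructType().add("id", LongType)
df.select(from_json($"value", bad)).show()

// Matching type: the struct parses as expected.
val good = new StructType().add("id", StringType)
df.select(from_json($"value", good)).show()
```

Comparing the two `show()` outputs side by side is a quick way to spot which StructField is the culprit.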

Re: Save hive table from spark in hive 2.1.0

2017-12-10 Thread Alejandro Reina
I did what you said and I was finally able to update the schema. But you're right, it's very dirty; I had to modify almost all the scripts. The problem with the scripts comes from already having a previous table in that version: many of the tables or columns that I try to add already exist, and it

Re: Save hive table from spark in hive 2.1.0

2017-12-10 Thread रविशंकर नायर
Hi, Good try. As you can see, when you run the upgrade using schematool, there is a duplicate column error. Can you please look at the generated script and edit it to avoid the duplicate column? Not sure why the Hive developers made it this complicated; I faced the same issues as you. Can anyone else give a clean and

Re: Row Encoder For DataSet

2017-12-10 Thread Tomasz Dudek
Hello Sandeep, you can pass a Row to a UDAF. Just provide a proper inputSchema to your UDAF. Check out this example: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html Yours, Tomasz 2017-12-10 11:55 GMT+01:00 Sandip Mehta : > Thanks Georg. I have looked
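A minimal sketch of what the advice above looks like in the Spark 2.x UDAF API: `inputSchema` declares the columns the aggregate receives, and `update()` reads them from a `Row`. This example just sums a long column; the column name `value` is an illustration, not from the thread:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class SumLong extends UserDefinedAggregateFunction {
  // Schema of the input Row passed to update().
  override def inputSchema: StructType = new StructType().add("value", LongType)
  // Schema of the intermediate aggregation buffer.
  override def bufferSchema: StructType = new StructType().add("sum", LongType)
  override def dataType: DataType = LongType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = 0L

  // input is a Row shaped by inputSchema.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)

  override def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getLong(0) + b2.getLong(0)

  override def evaluate(buffer: Row): Any = buffer.getLong(0)
}
```

A multi-column Row works the same way: add more fields to `inputSchema` and read them by index in `update()`.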

Re: Save hive table from spark in hive 2.1.0

2017-12-10 Thread Alejandro Reina
I have tried what you propose and added the property to hive-site.xml, and although with this option I can run Hive, it does not solve my problem. I'm sorry if I explained myself badly. I need to save a dataframe transformed in Spark into Hive, with Hive's schema version 2.1.1 (last

Re: UDF issues with spark

2017-12-10 Thread Daniel Haviv
Some code would help to debug the issue On Fri, 8 Dec 2017 at 21:54 Afshin, Bardia < bardia.afs...@changehealthcare.com> wrote: > Using pyspark cli on spark 2.1.1 I’m getting out of memory issues when > running the udf function on a recordset count of 10 with a mapping of the > same value