Updating schemas

2017-05-08 Thread Jorge Magallón
Hello, I'm trying to update a Parquet schema. The only way I know is to read the file, change the schema, and write the file again. The problem is that with large data this is too slow and inefficient. Is there another way to update the schema? Do you know of another solution? Thanks in advance

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Ok, great. Well, I haven't provided a good example of what I'm doing. Let's assume that my case class is case class A(tons of fields, with sub classes); val df = sqlContext.sql("select * from a").as[A]; val df2 = spark.emptyDataset[A]; df.union(df2). This code will throw the exception. Is this

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
Yes, unfortunately. This should actually be fixed, and the union's schema should take the less restrictive of the two DataFrames' schemas. On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho < dirceu.semigh...@gmail.com> wrote: > HI Burak, > By nullability you mean that if I have the exactly the same

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Hi Burak, By nullability do you mean that if I have exactly the same schema, but one side supports null and the other doesn't, this exception (in dataset union) will be thrown? 2017-05-08 16:41 GMT-03:00 Burak Yavuz : > I also want to add that generally these may be caused by

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
I also want to add that generally these may be caused by the `nullability` field in the schema. On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu wrote: > This is because RDD.union doesn't check the schema, so you won't see the > problem unless you run RDD and hit

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Shixiong(Ryan) Zhu
This is because RDD.union doesn't check the schema, so you won't see the problem unless you run the RDD and hit the incompatible column problem. For RDDs, you may not see any error if you don't use the incompatible column. Dataset.union requires compatible schemas. You can print ds.schema and

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Bruce Packer
> On May 8, 2017, at 11:07 AM, Dirceu Semighini Filho > wrote: > > Hello, > I've a very complex case class structure, with a lot of fields. > When I try to union two datasets of this class, it doesn't work with the > following error : > ds.union(ds1) > Exception in

Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Hello, I have a very complex case class structure, with a lot of fields. When I try to union two datasets of this class, it doesn't work, with the following error: ds.union(ds1) Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the

Re: Spark Shell issue on HDInsight

2017-05-08 Thread Denny Lee
This appears to be an issue with the Spark to DocumentDB connector, specifically version 0.0.1. Could you run the 0.0.3 version of the jar and see if you're still getting the same error? i.e. spark-shell --master yarn --jars
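For reference, the suggested invocation would look roughly like the following; the jar file names are hypothetical and should be replaced with the actual 0.0.3 artifacts and their dependencies:

```
spark-shell --master yarn \
  --jars azure-documentdb-spark-0.0.3.jar,azure-documentdb-1.10.0.jar
```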

Re: Join streams Apache Spark

2017-05-08 Thread saulshanabrook
I actually just ran it in a Docker image. But the point is, it doesn't need to run in the JVM, because it just runs as a separate process. Then your Java (or any other client) code sends messages to it over TCP and it relays them to Spark. On Mon, May 8, 2017 at 4:07 AM tencas [via Apache Spark
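The relay side can be plain sockets: Spark's `socketTextStream` treats each newline-terminated chunk as one record, so any process that writes lines to the port works as a feed. A small sketch (the host and port are placeholders):

```python
import socket

def frame(lines):
    """Newline-terminate each record, the framing socketTextStream expects."""
    return "".join(line.rstrip("\n") + "\n" for line in lines).encode("utf-8")

def send_lines(lines, host="localhost", port=9999):
    """Relay records over TCP to a socket Spark is listening on."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(frame(lines))
```

On the Spark side, `ssc.socketTextStream(host, port)` would consume these records one per line.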

Spark Shell issue on HDInsight

2017-05-08 Thread ayan guha
Hi, I am facing an issue while trying to use the azure-documentdb connector from Microsoft (instructions on GitHub). Error while trying to add the jar in spark-shell: spark-shell --jars

how to set up h2o sparkling water on jupyter notebook on a windows machine

2017-05-08 Thread Zeming Yu
Hi, I'm a newbie, so please bear with me. *I'm using a Windows 10 machine. I installed Spark here:* C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7 *I also installed H2O Sparkling Water here:* C:\sparkling-water-2.1.1 *I use this code on the command line to launch a Jupyter notebook for
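For reference, a common way to make PySpark (or Sparkling Water's pysparkling) start inside Jupyter on Windows is to point the PySpark driver at the notebook before launching. The paths below are the ones from the message; the `pysparkling.cmd` entry point is an assumption about the Sparkling Water layout:

```
:: Windows cmd; run from a regular command prompt.
set SPARK_HOME=C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7
set PYSPARK_DRIVER_PYTHON=jupyter
set PYSPARK_DRIVER_PYTHON_OPTS=notebook
C:\sparkling-water-2.1.1\bin\pysparkling.cmd
```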

Re: Join streams Apache Spark

2017-05-08 Thread tencas
Yep, I mean the first script you posted. So you can compile it to Java binaries, for example? Ok, I have no idea about Go. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Join-streams-Apache-Spark-tp28603p28662.html Sent from the Apache Spark User List

hbase + spark + hdfs

2017-05-08 Thread mathieu ferlay
Hi everybody. I'm totally new to Spark and I want to know one thing that I haven't managed to find. I have a full Ambari install with HBase, Hadoop and Spark. My code reads and writes in HDFS via HBase. Thus, as I understand it, all stored data is in byte format in HDFS. Now, I know that it's possible
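HBase itself stores only raw byte arrays, so what those bytes mean is a convention between writer and reader: on the Java side, `Bytes.toBytes` encodes strings as UTF-8 and longs as 8-byte big-endian values. Assuming the data was written that way, decoding it back is straightforward (a sketch, not tied to any particular HBase client):

```python
import struct

# HBase cells are raw byte arrays. Java's Bytes.toBytes writes strings
# as UTF-8 and longs as 8-byte big-endian, so the same conventions
# decode them back; adjust if your writer used a different encoding.

def decode_string(cell: bytes) -> str:
    return cell.decode("utf-8")

def decode_long(cell: bytes) -> int:
    (value,) = struct.unpack(">q", cell)  # big-endian signed 64-bit
    return value
```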

Spark 2.1.0 with Hive 2.1.1?

2017-05-08 Thread Lohith Samaga M
Hi, Good day. My setup: 1. Single node Hadoop 2.7.3 on Ubuntu 16.04. 2. Hive 2.1.1 with metastore in MySQL. 3. Spark 2.1.0 configured using hive-site.xml to use MySQL metastore. 4. The VERSION table contains SCHEMA_VERSION = 2.1.0 Hive
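For reference, the usual wiring is a hive-site.xml in $SPARK_HOME/conf pointing at the MySQL metastore (the values below are illustrative). One caveat worth checking: Spark 2.1.0 talks to the metastore through a built-in Hive 1.2.1 client by default, so connecting to a Hive 2.1.1 metastore may also require setting spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars.

```xml
<!-- hive-site.xml in $SPARK_HOME/conf; values are illustrative. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
```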

Re: Join streams Apache Spark

2017-05-08 Thread Gourav Sengupta
On another note, you might want to first try Flume in case you are just at the exploration phase. The advantage of Flume (using push) is that you do not need to write any additional program in order to sink or write your data to any target system. I am not quite sure how well Flume works with Spark

Re: take the difference between two columns of a dataframe in pyspark

2017-05-08 Thread Gourav Sengupta
Hi, convert them to a temporary table and write a SQL query; that will also work. Regards, Gourav On Sun, May 7, 2017 at 2:49 AM, Zeming Yu wrote: > Say I have the following dataframe with two numeric columns A and B, > what's the best way to add a column showing the difference