Filtering DateType column with Timestamp
Hi. I have a DateType column and I want to filter for all values greater than or equal to a certain Timestamp. This works: for example, df.col(columnName).geq(value) evaluates to a column of DateTypes greater than or equal to value. Except in one case: if the Timestamp is initialized to "1/1/2018 00:00:00", it only returns rows strictly greater than "1/1/2018 00:00:00". Rows set to that exact date and time are not included in the results, that is, the "equal" part is not working. If I change the column type to Timestamp, this works fine. Is this a bug, or is it known behaviour when comparing DateTypes to Timestamps? Thanks!
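A minimal repro sketch of the comparison described above (the DataFrame, column name, and value are illustrative assumptions, not from a real job); casting explicitly keeps both sides of geq in one type:
```
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit}

val ts = Timestamp.valueOf("2018-01-01 00:00:00")

// DateType column compared against a Timestamp value: the boundary
// row at exactly midnight may be excluded, as described above.
df.filter(col("eventDate").geq(lit(ts)))

// Casting the DateType column to TimestampType keeps the comparison
// in a single type, so the "equal" part behaves as expected.
df.filter(col("eventDate").cast("timestamp").geq(lit(ts)))
```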
[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information
I've been trying to figure this one out for some time now. I have JSONs representing Products arriving (physically) partitioned by Brand, and I would like to create a DataFrame from the JSON while also keeping the partitioning information (Brand):
```
case class Product(brand: String, value: String)

val df = spark.createDataFrame(Seq(Product("something", """{"a": "b", "c": "d"}""")))
df.write.partitionBy("brand").mode("overwrite").json("/tmp/products5/")

val df2 = spark.read.json("/tmp/products5/")
df2.show
/*
+--------------------+---------+
|               value|    brand|
+--------------------+---------+
|{"a": "b", "c": "d"}|something|
+--------------------+---------+
*/

// This is simple and effective, but it gets rid of the brand!
spark.read.json(df2.select("value").as[String]).show
/*
+---+---+
|  a|  c|
+---+---+
|  b|  d|
+---+---+
*/
```
Ideally I'd like something similar to spark.read.json that keeps the partitioning values and merges them with the rest of the DataFrame. The end result I would like:
```
/*
+---+---+---------+
|  a|  c|    brand|
+---+---+---------+
|  b|  d|something|
+---+---+---------+
*/
```
Best regards,
Daniel Mateus Pires
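One hedged sketch of a possible workaround, not a confirmed solution (it assumes the JSON schema is known up front rather than inferred): parse the JSON column in place with from_json, so the partition column never leaves the DataFrame:
```
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Assumed schema for the JSON payload; in practice it could be
// built from a sample of the value column.
val schema = new StructType().add("a", StringType).add("c", StringType)

val parsed = df2
  .withColumn("parsed", from_json(col("value"), schema))
  .select("parsed.*", "brand")

parsed.show()
/*
+---+---+---------+
|  a|  c|    brand|
+---+---+---------+
|  b|  d|something|
+---+---+---------+
*/
```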
Re: Re: spark sql data skew
try divide and conquer: create a column x for the first character of userid, and group by company+x. If that is still too large, try the first two characters. A minimal sketch of this idea follows the thread below.

On 17 July 2018 at 02:25, 崔苗 wrote:
> 30G of user data; how do we get the distinct user count after creating a
> composite key based on company and userid?
>
> On 2018-07-13 18:24:52, Jean Georges Perrin wrote:
>
> Just thinking out loud… repartition by key? create a composite key based
> on company and userid?
>
> How big is your dataset?
>
> On Jul 13, 2018, at 06:20, 崔苗 wrote:
>
> Hi,
> when I want to count(distinct userId) by company, I hit data skew and the
> task takes too long; how can I count distinct by key on skewed data in
> Spark SQL?
>
> thanks for any reply
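A minimal sketch of that divide-and-conquer idea (the DataFrame and column names are assumptions for illustration): each userId lands in exactly one prefix bucket, so summing the per-bucket distinct counts per company gives the exact total:
```
import org.apache.spark.sql.functions.{col, countDistinct, substring, sum}

// Bucket by company plus the first character of userId, and count
// distinct userIds inside each (much smaller) bucket...
val buckets = users
  .withColumn("x", substring(col("userId"), 1, 1))
  .groupBy(col("company"), col("x"))
  .agg(countDistinct(col("userId")).as("partial"))

// ...then add the bucket counts back up per company. Distinct users
// never span two buckets, so the sums are exact.
val perCompany = buckets
  .groupBy(col("company"))
  .agg(sum(col("partial")).as("distinct_users"))
```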
Query on Spark Hive with Kerberos Enabled on Kubernetes
Hi All, I am trying to use the Spark 2.2.0 Kubernetes fork (https://github.com/apache-spark-on-k8s/spark/tree/v2.2.0-kubernetes-0.5.0) to run Hive queries on a Kerberos-enabled cluster. Spark-submits fail for the Hive queries but pass when I am accessing HDFS. Is this a known limitation, or am I doing something wrong? Please let me know. If this is supposed to work, can you please point to an example of running Hive queries? Thanks.

Regards,
Surya
Re: Parquet
I generally write to Parquet when I want to repeat the operation of reading the data and perform different operations on it each time. That saves db time for me.

Thanks,
Muthu

On Thu, Jul 19, 2018, 18:34 amin mohebbi wrote:
> We have two big tables, each with 5 billion rows, so my question here is:
> should we partition/sort the data and convert it to Parquet before doing
> any join?
>
> Best Regards ... Amin Mohebbi
> PhD candidate in Software Engineering at University of Malaysia
> Tel : +60 18 2040 017
> E-Mail : tp025...@ex.apiit.edu.my
> amin_...@me.com
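A hedged sketch of one way to do that pre-join preparation (the table names, join key, and bucket count are illustrative assumptions): bucketing and sorting both sides by the join key while converting to Parquet lets later joins between the two tables skip the full shuffle:
```
// Bucket and sort each big table by the join key while writing it
// out as Parquet; bucketBy/sortBy require saveAsTable.
dfA.write
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "join_key")
  .sortBy("join_key")
  .saveAsTable("table_a_bucketed")

dfB.write
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "join_key")
  .sortBy("join_key")
  .saveAsTable("table_b_bucketed")

// Joining the bucketed tables on join_key can then avoid shuffling
// 5 billion rows on each side for every repeated join.
val joined = spark.table("table_a_bucketed")
  .join(spark.table("table_b_bucketed"), "join_key")
```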