Filtering DateType column with Timestamp
Hi. I have a DateType column and I want to filter for all values greater than or equal to a certain Timestamp. This works: for example, df.col(columnName).geq(value) evaluates to a column of DateTypes greater than or equal to value. Except in one case: if the Timestamp is initialized to "1/1/2018 00:00:00", it only returns rows strictly greater than "1/1/2018 00:00:00". Rows set to that exact date and time are not included in the results, that is, the "equal" part is not working. If I change the column type to Timestamp, this works fine. Is this a bug, or is it known behaviour when comparing DateTypes to Timestamps? Thanks!
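A minimal repro sketch of the comparison described above (the DataFrame, column name, and value are illustrative assumptions, not from a real job); casting explicitly keeps both sides of geq in one type:
```
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit}

val ts = Timestamp.valueOf("2018-01-01 00:00:00")

// DateType column compared against a Timestamp value: the boundary
// row at exactly midnight may be excluded, as described above.
df.filter(col("eventDate").geq(lit(ts)))

// Casting the DateType column to TimestampType keeps the comparison
// in a single type, so the "equal" part behaves as expected.
df.filter(col("eventDate").cast("timestamp").geq(lit(ts)))
```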
[SPARK-SQL] Reading JSON column as a DataFrame and keeping partitioning information
I've been trying to figure this one out for some time now. I have JSONs representing Products arriving (physically) partitioned by Brand, and I would like to create a DataFrame from the JSON while also keeping the partitioning information (Brand):
```
case class Product(brand: String, value: String)

val df = spark.createDataFrame(Seq(Product("something", """{"a": "b", "c": "d"}""")))
df.write.partitionBy("brand").mode("overwrite").json("/tmp/products5/")

val df2 = spark.read.json("/tmp/products5/")
df2.show
/*
+--------------------+---------+
|               value|    brand|
+--------------------+---------+
|{"a": "b", "c": "d"}|something|
+--------------------+---------+
*/

// This is simple and effective, but it gets rid of the brand!
spark.read.json(df2.select("value").as[String]).show
/*
+---+---+
|  a|  c|
+---+---+
|  b|  d|
+---+---+
*/
```
Ideally I'd like something similar to spark.read.json that keeps the partitioning values and merges them with the rest of the DataFrame. The end result I would like:
```
/*
+---+---+---------+
|  a|  c|    brand|
+---+---+---------+
|  b|  d|something|
+---+---+---------+
*/
```
Best regards,
Daniel Mateus Pires
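One hedged sketch of a possible workaround, not a confirmed solution (it assumes the JSON schema is known up front rather than inferred): parse the JSON column in place with from_json, so the partition column never leaves the DataFrame:
```
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Assumed schema for the JSON payload; in practice it could be
// built from a sample of the value column.
val schema = new StructType().add("a", StringType).add("c", StringType)

val parsed = df2
  .withColumn("parsed", from_json(col("value"), schema))
  .select("parsed.*", "brand")

parsed.show()
/*
+---+---+---------+
|  a|  c|    brand|
+---+---+---------+
|  b|  d|something|
+---+---+---------+
*/
```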
Re: Re: spark sql data skew
try divide and conquer: create a column x for the first character of userid, and group by company+x. If that is still too large, try the first two characters. A minimal sketch of this idea follows the thread below.

On 17 July 2018 at 02:25, 崔苗 wrote:
> 30G of user data; how do we get the distinct user count after creating a
> composite key based on company and userid?
>
> On 2018-07-13 18:24:52, Jean Georges Perrin wrote:
>
> Just thinking out loud… repartition by key? create a composite key based
> on company and userid?
>
> How big is your dataset?
>
> On Jul 13, 2018, at 06:20, 崔苗 wrote:
>
> Hi,
> when I want to count(distinct userId) by company, I hit data skew and the
> task takes too long; how can I count distinct by key on skewed data in
> Spark SQL?
>
> thanks for any reply
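A minimal sketch of that divide-and-conquer idea (the DataFrame and column names are assumptions for illustration): each userId lands in exactly one prefix bucket, so summing the per-bucket distinct counts per company gives the exact total:
```
import org.apache.spark.sql.functions.{col, countDistinct, substring, sum}

// Bucket by company plus the first character of userId, and count
// distinct userIds inside each (much smaller) bucket...
val buckets = users
  .withColumn("x", substring(col("userId"), 1, 1))
  .groupBy(col("company"), col("x"))
  .agg(countDistinct(col("userId")).as("partial"))

// ...then add the bucket counts back up per company. Distinct users
// never span two buckets, so the sums are exact.
val perCompany = buckets
  .groupBy(col("company"))
  .agg(sum(col("partial")).as("distinct_users"))
```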
Query on Spark Hive with Kerberos Enabled on Kubernetes
Hi All, I am trying to use the Spark 2.2.0 Kubernetes fork (https://github.com/apache-spark-on-k8s/spark/tree/v2.2.0-kubernetes-0.5.0) to run Hive queries on a Kerberos-enabled cluster. Spark-submits fail for the Hive queries but pass when I am accessing HDFS. Is this a known limitation, or am I doing something wrong? Please let me know. If this is supposed to work, can you please point to an example of running Hive queries? Thanks.

Regards,
Surya
Re: Parquet
I generally write to Parquet when I want to repeat the operation of reading the data and perform different operations on it each time. That saves db time for me.

Thanks,
Muthu

On Thu, Jul 19, 2018, 18:34 amin mohebbi wrote:
> We have two big tables, each with 5 billion rows, so my question here is:
> should we partition/sort the data and convert it to Parquet before doing
> any join?
>
> Best Regards ... Amin Mohebbi
> PhD candidate in Software Engineering at University of Malaysia
> Tel : +60 18 2040 017
> E-Mail : tp025...@ex.apiit.edu.my
> amin_...@me.com
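A hedged sketch of one way to do that pre-join preparation (the table names, join key, and bucket count are illustrative assumptions): bucketing and sorting both sides by the join key while converting to Parquet lets later joins between the two tables skip the full shuffle:
```
// Bucket and sort each big table by the join key while writing it
// out as Parquet; bucketBy/sortBy require saveAsTable.
dfA.write
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "join_key")
  .sortBy("join_key")
  .saveAsTable("table_a_bucketed")

dfB.write
  .mode("overwrite")
  .format("parquet")
  .bucketBy(200, "join_key")
  .sortBy("join_key")
  .saveAsTable("table_b_bucketed")

// Joining the bucketed tables on join_key can then avoid shuffling
// 5 billion rows on each side for every repeated join.
val joined = spark.table("table_a_bucketed")
  .join(spark.table("table_b_bucketed"), "join_key")
```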