Flatten JSON to multiple columns in Spark

2017-07-17 Thread Chetan Khatri
Hello Spark Devs, can you please guide me on how to flatten JSON into multiple columns in Spark? *Example:* Sr No Title ISBN Info 1 Calculus Theory 1234567890 [{"cert":[{ "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa", 009415da-c8cd-418d-869e-0a19601d79fa
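
A minimal sketch of the flattening idea in plain Python, not the Spark API: nested JSON in the Info column is walked recursively and every leaf becomes its own column named by its path. Only the `cert` and `authSbmtr` keys appear in the message above; the rest of the record is unknown, so the sample is deliberately cut down, and the `flatten` helper and the path-style column names are assumptions made for illustration. In Spark itself, functions such as `from_json` and `explode` are commonly used for this kind of task; the sketch does not call them.

```python
import json


def flatten(value, prefix=""):
    """Return a flat dict that maps a path string to each leaf value."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            flat.update(flatten(child, f"{prefix}[{index}]"))
    else:
        flat[prefix] = value
    return flat


# Cut-down sample: only the keys visible in the original message are used.
info = '[{"cert": [{"authSbmtr": "009415da-c8cd-418d-869e-0a19601d79fa"}]}]'

# The scalar columns stay as they are; the Info column is expanded in place.
row = {"Sr No": 1, "Title": "Calculus Theory", "ISBN": "1234567890"}
row.update(flatten(json.loads(info), "Info"))

for column, value in row.items():
    print(column, "=", value)
```

Each nested leaf ends up under a column such as `Info[0].cert[0].authSbmtr`, which is one possible naming scheme; a real job would more likely pick explicit column aliases.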

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
Sorry, I didn't express myself clearly. I think the evaluation order doesn't matter in the context of the join implementation (sort- or hash-based); it should only refer to the join key. Thanks Chang On Tue, Jul 18, 2017 at 7:57 AM, Liang-Chi Hsieh wrote: > > Evaluation order does matter. A

Unsubscribe

2017-07-17 Thread Praveen Kumar

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
Evaluation order does matter. A non-deterministic expression can change its output due to internal state, which may depend on input order. MonotonicallyIncreasingID is an example of a stateful expression. Once you change the row order, the evaluation results are different. Chang Chen wrote
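
The point about stateful expressions can be sketched in plain Python. This is an illustration under assumptions, not Spark's implementation of MonotonicallyIncreasingID: the hypothetical `assign_ids` helper evaluates a counter once per row in whatever order the rows arrive, so the same row gets a different value when the order changes.

```python
import itertools


def assign_ids(rows):
    """Evaluate a stateful 'next id' expression once per row, in arrival order."""
    counter = itertools.count()  # internal state shared across rows
    return {row: next(counter) for row in rows}


rows = ["a", "b", "c"]
forward = assign_ids(rows)
backward = assign_ids(list(reversed(rows)))

print(forward)
print(backward)
```

Row `a` receives id 0 in the first run and id 2 in the second, although the expression and the data are identical.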

Re: Slowness of Spark Thrift Server

2017-07-17 Thread Maciej Bryński
I ran the test on Spark 2.2.0 and the problem still exists. Any ideas on how to fix it? Regards, Maciek 2017-07-11 11:52 GMT+02:00 Maciej Bryński : > Hi, > I have the following issue. > I'm trying to use Spark as a proxy to Cassandra. > The problem is the thrift server overhead. > >

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread Sam Elamin
Well done! This is amazing news :) Congrats and really can't wait to spread the structured streaming love! On Mon, Jul 17, 2017 at 5:25 PM, kant kodali wrote: > +1 > > On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin wrote: > >> Awesome! Congrats! Can't

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Xiao Li
When users call rand(seed) with a specific seed, they expect the results to be deterministic whether or not the expression is pushed down. rand(seed) is stateful, so even the order of predicates within the same join condition matters. For example, in the same join condition, if the

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
I see. Actually, it isn't about evaluation order, which the user can't specify. It's about how many times we evaluate the non-deterministic expression for the same row. For example, given the SQL: SELECT a.col1 FROM tbl1 a LEFT OUTER JOIN tbl2 b ON CASE WHEN a.col2 IS NULL THEN cast(rand(9)*1000 -
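
A rough plain-Python sketch of the "how many times" concern, under stated assumptions: the SQL above is cut off, so the hypothetical `NullSafeKey` class keeps only the visible part (a NULL key replaced by a seeded random integer below 1000) and counts its own evaluations. It does not model how Spark plans or executes the join.

```python
import random


class NullSafeKey:
    """Join-key expression: pass non-null keys through, randomize null keys."""

    def __init__(self, seed):
        self._rng = random.Random(seed)  # state that advances on every draw
        self.evaluations = 0

    def __call__(self, col2):
        if col2 is not None:
            return col2
        self.evaluations += 1
        return int(self._rng.random() * 1000)


key = NullSafeKey(9)

# Evaluated once for the row, then reused wherever the key is needed.
once = key(None)
reused = (once, once)

# Evaluated separately at each use: every call draws a fresh random number.
twice = (key(None), key(None))

print("reused:", reused)
print("re-evaluated:", twice)
print("random draws so far:", key.evaluations)
```

When the value is computed once and reused, both uses are guaranteed to agree; when the expression is re-evaluated, each use draws again from the generator, so agreement is no longer guaranteed for the same row.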

Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-17 Thread kant kodali
+1 On Tue, Jul 11, 2017 at 3:56 PM, Jean Georges Perrin wrote: > Awesome! Congrats! Can't wait!! > > jg > > > On Jul 11, 2017, at 18:48, Michael Armbrust > wrote: > > Hi all, > > Apache Spark 2.2.0 is the third release of the Spark 2.x line. This > release

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
IIUC, the evaluation order of rows in Join can be different in different physical operators, e.g., Sort-based and Hash-based. But for non-deterministic expressions, different evaluation orders change results. Chang Chen wrote > I see the issue. I will try

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
I see the issue. I will try https://github.com/apache/spark/pull/18652. I think: 1. For the Join operator, the left and right plans can't be non-deterministic. 2. If Filter can support non-deterministic expressions, why not the join condition? 3. We can't push down or project a non-deterministic expression, since it may

Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread Pralabh Kumar
Hi, to read files in parallel, you can follow the code below. case class readData (fileName : String , spark : SparkSession) extends Callable[Dataset[Row]]{ override def call(): Dataset[Row] = { spark.read.parquet(fileName) // spark.read.csv(fileName) } } val spark =
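
The snippet above is cut off before the tasks are submitted, so the following is only a plain-Python analogue of the same pattern, one read task per file handed to a thread pool. The `read_data` function is a stand-in that reads small text files created on the spot; it is an assumption made so the sketch runs without Spark, where the original calls `spark.read.parquet(fileName)`.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_data(file_name):
    """Stand-in for the per-file read task in the original snippet."""
    return Path(file_name).read_text()


with tempfile.TemporaryDirectory() as tmp:
    # Create three small input files so the example is self-contained.
    paths = []
    for i in range(3):
        path = Path(tmp) / f"part-{i}.txt"
        path.write_text(f"rows-{i}")
        paths.append(str(path))

    # Submit one read per file; map returns results in input order.
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(read_data, paths))

print(results)
```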

Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread vaquar khan
Verify your configuration; the following link covers all Spark tuning points. https://spark.apache.org/docs/latest/tuning.html Regards, Vaquar khan On Jul 17, 2017 6:56 AM, "何文婷" wrote: 2.1.1 Sent from NetEase Mail Master On 07/17/2017 20:55, vaquar khan wrote: Could

Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread 何文婷
2.1.1 Sent from NetEase Mail Master On 07/17/2017 20:55, vaquar khan wrote: Could you please let us know your Spark version? Regards, vaquar khan On Jul 17, 2017 12:18 AM, "163" wrote: I changed the UDF but the performance still seems slow. What else can I do? On

Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread vaquar khan
Could you please let us know your Spark version? Regards, vaquar khan On Jul 17, 2017 12:18 AM, "163" wrote: > I changed the UDF but the performance still seems slow. What else can I do? > > > On July 14, 2017, at 8:34 PM, Wenchen Fan wrote: > > Try to replace

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
I created a draft pull request for explaining the cases: https://github.com/apache/spark/pull/18652 Chang Chen wrote > Hi All > > I don't understand the difference between the semantics, I found Spark > does > the same thing for GroupBy non-deterministic. From Map-Reduce point of > view, Join

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread 蒋星博
FYI there has been a related discussion here: https://github.com/apache/spark/pull/15417#discussion_r85295977 2017-07-17 15:44 GMT+08:00 Chang Chen : > Hi All > > I don't understand the difference between the semantics, I found Spark > does the same thing for GroupBy

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Chang Chen
Hi All, I don't understand the difference between the semantics. I found that Spark does the same thing for GroupBy with non-deterministic expressions. From a Map-Reduce point of view, Join is also GroupBy in essence. @Liang Chi Hsieh in which situation,

Re: [SQL] Syntax "case when" doesn't be supported in JOIN

2017-07-17 Thread Liang-Chi Hsieh
Thinking about it more, I think it changes the semantics only under certain scenarios. For the example SQL query shown in the previous discussion, the semantics look the same. Xiao Li wrote > If the join condition is non-deterministic, pushing it down to the > underlying project will change the

Re: How to tune the performance of Tpch query5 within Spark

2017-07-17 Thread 163
I changed the UDF but the performance still seems slow. What else can I do? > On July 14, 2017, at 8:34 PM, Wenchen Fan wrote: > > Try to replace your UDF with Spark built-in expressions, it should be as > simple as `$"x" * (lit(1) - $"y")`. > >> On 14 Jul 2017, at 5:46 PM, 163