RE: Not able to write output to local filesystem from Standalone mode.

2016-05-27 Thread Yong Zhang
he less things "just make sense to me". > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On F

RE: Not able to write output to local filesystem from Standalone mode.

2016-05-26 Thread Yong Zhang
That just makes sense, doesn't it? The only place would be the driver. If not, the executors would have contention over who should create the directory in this case. Only the coordinator (the driver in this case) is the best place for doing it. Yong From: math...@closetwork.org Date: Wed, 25 May 2016

RE: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Yong Zhang
If you are using Spark > 1.5, the best way is to use the DataFrame API directly, instead of SQL. In a dataframe, you can specify the broadcast join hint in the dataframe API, which will force the broadcast join. Yong From: mich.talebza...@gmail.com Date: Mon, 20 Jun 2016 13:09:17 +0100 Subject: Re:
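A minimal sketch of that hint in the DataFrame API (the DataFrame and column names here are hypothetical, not from the thread):

  import org.apache.spark.sql.functions.broadcast

  // Wrapping the small side in broadcast() hints the planner to broadcast it
  // and use a BroadcastHashJoin instead of a SortMergeJoin.
  val joined = largeDf.join(broadcast(smallDf), Seq("id"))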

Spark 1.5.2, DataFrame broadcast join, OOM

2016-02-23 Thread Yong Zhang
Hi, I am testing Spark 1.5.2 using a 4-node cluster, with 64G memory each; one is master and 3 are workers. I am using Standalone mode. Here is my spark-env.sh for settings: export SPARK_LOCAL_DIRS=/data1/spark/local,/data2/spark/local,/data3/spark/local export

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
me know if you need further information. On Tue, Apr 5, 2016 at 6:33 PM, Yong Zhang <java8...@hotmail.com> wrote: You need to show us the execution plan, so we can understand what is your issue. Use the spark shell code to show how your DF is built, how you partition them, then use expl

RE: Plan issue with spark 1.5.2

2016-04-05 Thread Yong Zhang
You need to show us the execution plan, so we can understand what is your issue. Use the spark shell code to show how your DF is built, how you partition them, then use explain(true) on your join DF, and show the output here, so we can better help you. Yong > Date: Tue, 5 Apr 2016 09:46:59

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
that each partition from DF 1 joins with the partition with the same key of DF 2 on the worker node, without shuffling the data. In other words, do as much work within the worker node as possible before shuffling the data. Thanks Darshan Singh On Wed, Apr 6, 2016 at 10:06 PM, Yong Zhang <java8...@hotmail.com>

RE: Plan issue with spark 1.5.2

2016-04-06 Thread Yong Zhang
Thanks On Wed, Apr 6, 2016 at 8:37 PM, Yong Zhang <java8...@hotmail.com> wrote: What you are looking for is https://issues.apache.org/jira/browse/SPARK-4849 This feature is available in Spark 1.6.0, so the DataFrame can reuse the partitioned data in the join. For your case in 1.5.x, you have

RE: ordering over structs

2016-04-06 Thread Yong Zhang
1) Is a struct in Spark like a struct in C++? Kinda. It's an ordered collection of data with known names/types. 2) What is an alias in this context? It assigns a name to the column, similar to doing AS in SQL. 3) How does this code even work? Ordering
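A short sketch of what struct ordering and alias buy you (the column names are hypothetical): structs compare field by field in declaration order, so taking max over a struct behaves like a row-wise "argmax", and alias just names the resulting column.

  import org.apache.spark.sql.functions.{max, struct}

  // For each user, keep the (score, item) pair with the highest score;
  // the struct is ordered by its first field, then the second.
  val best = df.groupBy("user")
    .agg(max(struct("score", "item")).alias("best"))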

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
If they do that, they must provide a customized input format, instead of through JDBC. Yong Date: Wed, 6 Apr 2016 23:56:54 +0100 Subject: Re: Sqoop on Spark From: mich.talebza...@gmail.com To: mohaj...@gmail.com CC: jornfra...@gmail.com; msegel_had...@hotmail.com; guha.a...@gmail.com;

RE: Sqoop on Spark

2016-04-06 Thread Yong Zhang
pr 6, 2016 at 4:05 PM, Yong Zhang <java8...@hotmail.com> wrote: If they do that, they must provide a customized input format, instead of through JDBC. Yong Date: Wed, 6 Apr 2016 23:56:54 +0100 Subject: Re: Sqoop on Spark From: mich.talebza...@gmail.com To: mohaj...@gmail.com CC: jorn

RE: Partition pruning in spark 1.5.2

2016-04-05 Thread Yong Zhang
Hi, Michael: I would like to ask the same question: if the DF is hash partitioned, then cached, and we now query/filter by the column it was hash partitioned on, will Spark be smart enough to do the partition pruning in this case, instead of depending on Parquet's partition pruning? I think that is the

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
> The broadcast hint does not work as expected in this case, could you > also show the logical plan by 'explain(true)'? > > On Wed, Mar 23, 2016 at 8:39 AM, Yong Zhang <java8...@hotmail.com> wrote: > > > > So I am testing this code to understand "broadcast"

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-23 Thread Yong Zhang
From: dav...@databricks.com > To: java8...@hotmail.com > CC: user@spark.apache.org > > On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang <java8...@hotmail.com> wrote: > > Here is the output: > > > > == Parsed Logical Plan == > > Project [400+ columns] >

RE: Spark 1.5.2, why the broadcast join shuffle so much data in the last step

2016-03-24 Thread Yong Zhang
.com > CC: user@spark.apache.org > > On Wed, Mar 23, 2016 at 10:35 AM, Yong Zhang <java8...@hotmail.com> wrote: > > Here is the output: > > > > == Parsed Logical Plan == > > Project [400+ columns] > > +- Project [400+ columns] > >+- Project [400+ columns] >

Does Spark 1.5.x really still support Hive 0.12?

2016-03-04 Thread Yong Zhang
When I tried to compile Spark 1.5.2 with -Phive-0.12.0, maven gave me back an error that the profile doesn't exist any more. But when I read the Spark SQL programming guide here: http://spark.apache.org/docs/1.5.2/sql-programming-guide.html It keeps mentioning Spark 1.5.2 still can work with

RE: spark.driver.memory meaning

2016-04-03 Thread Yong Zhang
In the standalone mode, it applies to the Driver JVM process heap size. You should consider giving enough memory space to it in standalone mode, due to: 1) Any data you bring back to the driver will be stored in it, like RDD.collect or DF.show 2) The Driver also hosts a web UI for the application

Spark 1.5.2 Master OOM

2016-03-30 Thread Yong Zhang
Hi, Sparkers, Our cluster is running Spark 1.5.2 in Standalone mode. It runs fine for weeks, but today I found the master crashed due to OOM. We have several ETL jobs running daily on Spark, plus ad-hoc jobs. I can see the "Completed Applications" table grow in the master UI. Originally I set

RE: SPARK-13900 - Join with simple OR conditions take too long

2016-03-31 Thread Yong Zhang
I agree that there won't be a generic solution for these kinds of cases. Without CBO from Spark or the Hadoop ecosystem in the near future, maybe Spark DataFrame/SQL should support more hints from the end user, as in these cases end users will be smart enough to tell the engine what is the correct

RE: Eliminating shuffle write and spill disk IO reads/writes in Spark

2016-04-01 Thread Yong Zhang
Is there a configuration in Spark for the location of "shuffle spilling"? I don't recall ever seeing that one. Can you share it? It will be good for a test writing to a RAM disk if that configuration is available. Thanks Yong From: r...@databricks.com Date: Fri, 1 Apr 2016 15:32:23 -0700

RE: How this unit test passed on master trunk?

2016-04-24 Thread Yong Zhang
put itself does not have any ordering. I am not sure why the unit test and the real env have different environment. Xiao, I do see the difference between unit test and local cluster run. Do you know the reason? Thanks. Zhan Zhang On Apr 22, 2016, at 11:23 AM, Yong Zhan

RE: Java exception when showing join

2016-04-25 Thread Yong Zhang
r 2016 07:45:12 -0500 > > I get an invalid syntax error when I do that. > > On Fri, 2016-04-22 at 20:06 -0400, Yong Zhang wrote: > > use "dispute_df.join(comments_df, dispute_df.COMMENTID === > > comments_df.COMMENTID).first()" instead. > > > > Yong > &g

RE: Java exception when showing join

2016-04-25 Thread Yong Zhang
get an invalid syntax error when I do that. > > On Fri, 2016-04-22 at 20:06 -0400, Yong Zhang wrote: > > use "dispute_df.join(comments_df, dispute_df.COMMENTID === > > comments_df.COMMENTID).first()" instead. > > > > Yong > > > > Date: Fri, 22 Apr 201

RE: Java exception when showing join

2016-04-22 Thread Yong Zhang
use "dispute_df.join(comments_df, dispute_df.COMMENTID === comments_df.COMMENTID).first()" instead. Yong Date: Fri, 22 Apr 2016 17:42:26 -0400 From: webe...@aim.com To: user@spark.apache.org Subject: Java exception when showing join I am using pyspark with netezza. I am getting a java

How this unit test passed on master trunk?

2016-04-22 Thread Yong Zhang
Hi, I was trying to find out why this unit test can pass in the Spark code, in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala for this unit test: test("Star Expansion - CreateStruct and CreateArray") { val structDf =

RE: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Yong Zhang
Not sure what you mean. You want to have the exact same query run fine in both sqlContext and HiveContext? The query parsers are different, so why do you want this feature? Do I understand your question correctly? Yong Date: Thu, 12 May 2016 13:09:34 +0200 Subject: SQLContext and

RE: Weird results with Spark SQL Outer joins

2016-05-02 Thread Yong Zhang
We are still not sure what the problem is, since you cannot show us some example data. For dps with 42632 rows and swig with 42034 rows: if dps full outer joined with swig on 3 columns, with additional filters, gets the same result-set row count as dps left outer joined with swig on 3 columns,

Spark Standalone with SPARK_CLASSPATH in spark-env.sh and "spark.driver.userClassPathFirst"

2016-04-15 Thread Yong Zhang
Hi, I found out one problem of using "spark.driver.userClassPathFirst" and SPARK_CLASSPATH in spark-env.sh on Standalone environment, and want to confirm this in fact has no good solution. We are running Spark 1.5.2 in standalone mode on a cluster. Since the cluster doesn't have the direct

RE: Logging in executors

2016-04-13 Thread Yong Zhang
Is the env/dev/log4j-executor.properties file within your jar file? Does the path match what you specified as env/dev/log4j-executor.properties? If you read the log4j document here: https://logging.apache.org/log4j/1.2/manual.html When you specify the

RE: how does sc.textFile translate regex in the input.

2016-04-13 Thread Yong Zhang
It is described in "Hadoop: The Definitive Guide", Chapter 3, FilePatterns: https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch03.html#FilePatterns Yong From: pradeep1...@gmail.com Date: Wed, 13 Apr 2016 18:56:58 + Subject: how does sc.textFile translate regex in
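For example, a quick sketch of the glob patterns that sc.textFile accepts through Hadoop's FileSystem globbing (the paths are made up):

  // [] ranges, * wildcards and {a,b} alternation are all resolved by Hadoop,
  // not by Spark itself.
  val q1Logs = sc.textFile("hdfs:///logs/2016-0[1-3]-*/part-*.gz")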

Re: Outer Explode needed

2016-07-26 Thread Yong Zhang
The reason for no response is that this feature is not available yet. You can vote for and follow this JIRA, https://issues.apache.org/jira/browse/SPARK-13721, if you really need this feature. Yong From: Don Drake Sent: Monday, July 25,

Re: UDF returning generic Seq

2016-07-26 Thread Yong Zhang
I don't know if "ANY" will work or not, but take a look at how the "map_values" UDF is implemented in Spark, which returns the map values of an array/seq of arbitrary type. https://issues.apache.org/jira/browse/SPARK-16279 Yong From: Chris Beavers

Re: Extracting key word from a textual column

2016-08-02 Thread Yong Zhang
Well, if you still want to use window functions for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic. Maybe you can use a regex to derive this new column, as in the sketch below. The implementation needs to depend on your data in
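A minimal sketch of deriving such a column with a regex (the column name and pattern are hypothetical):

  import org.apache.spark.sql.functions.{col, regexp_extract}

  // Pull a keyword out of the free-text column first, then group on it
  // like any ordinary column.
  val withCatalog = df.withColumn(
    "catalog", regexp_extract(col("description"), "(book|music|video)", 1))
  val counts = withCatalog.groupBy("catalog").count()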

Re: Spark SQL and number of task

2016-08-04 Thread Yong Zhang
The 2 plans look similar, but there is a big difference, if you also consider that your source is in fact a NoSQL DB, like C*. The OR plan has "Filter ((id#0L = 94) || (id#0L = 2))", which means the filter is indeed happening on the Spark side, instead of on the C* side. Which means, to fulfill

RE: Processing json document

2016-07-07 Thread Yong Zhang
The problem is for the Hadoop input format to identify the record delimiter. If the whole json record is in one line, then the natural record delimiter will be the newline character. Keep in mind that in a distributed file system, the file split position most likely is NOT on the record delimiter. The

Re: Spark join and large temp files

2016-08-08 Thread Yong Zhang
Join requires shuffling. The problem is that you have to shuffle 1.5T of data, which causes problems with your disk usage. Another way is to broadcast the 1.5G small dataset, so there is no shuffle requirement for the 1.5T dataset. But you need to make sure you have enough memory. Can you try to

RE: Possible to broadcast a function?

2016-06-30 Thread Yong Zhang
How about this old discussion related to similar problem as yours. http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-td3203.html Yong From: aper...@timerazor.com Date: Wed, 29 Jun 2016 14:00:07 + Subject: Possible to broadcast a function? To:

Re: [SparkSQL] pre-check syntex before running spark job?

2017-02-21 Thread Yong Zhang
You can always use explain method to validate your DF or SQL, before any action. Yong From: Jacek Laskowski Sent: Tuesday, February 21, 2017 4:34 AM To: Linyuxin Cc: user Subject: Re: [SparkSQL] pre-check syntex before running spark job? Hi,

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
: Yong Zhang <java8...@hotmail.com> Sent: Tuesday, February 21, 2017 1:17 PM To: Sidney Feiner; Chanh Le; user @spark Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective? Sorry, didn't pay attention to the originally requirement. Did y

Re: Spark SQL : Join operation failure

2017-02-22 Thread Yong Zhang
Your error message is not clear about what really happens. Is your container killed by Yarn, or does it indeed run OOM? When I run a Spark job with big data, here is normally what I will do: 1) Enable GC output. You need to monitor the GC output in the executor to understand the GC pressure.
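A sketch of step 1, enabling GC output on the executors (the exact JVM flags are common choices, not taken from the thread):

  import org.apache.spark.SparkConf

  // GC activity then shows up in each executor's stdout/stderr log.
  val conf = new SparkConf().set(
    "spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")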

Re: Efficient Spark-Sql queries when only nth Column changes

2017-02-18 Thread Yong Zhang
If you only need the group by within the same hierarchy logic, then you can group by at the lowest level and cache it, then use the cached DF to derive the higher levels, so Spark will only scan the original table once and reuse the cache for the following. val df_base =
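Continuing the idea as a minimal sketch (the table and column names are hypothetical):

  import org.apache.spark.sql.functions.sum

  // Aggregate once at the lowest level of the hierarchy and cache it ...
  val df_base = spark.table("sales")
    .groupBy("country", "state", "city")
    .agg(sum("amount").as("amount"))
    .cache()

  // ... then derive the higher levels from the cached result, so the
  // source table is only scanned once.
  val df_state   = df_base.groupBy("country", "state").agg(sum("amount").as("amount"))
  val df_country = df_base.groupBy("country").agg(sum("amount").as("amount"))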

Re: Serialization error - sql UDF related

2017-02-18 Thread Yong Zhang
You define "getNewColumnName" as method, which requires the class/object holding it has to be serializable. >From the stack trace, it looks like this method defined in >ProductDimensionSFFConverterRealApp, but it is not serializable. In fact, your method only uses String and Boolean, which

Re: Duplicate Rank for within same partitions

2017-02-24 Thread Yong Zhang
What you described is not clear here. Do you want to rank your data based on (date, hour, language, item_type, time_zone), and sort by score; or you want to rank your data based on (date, hour) and sort by language, item_type, time_zone and score? If you mean the first one, then your Spark

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
rom data where url like '%sell%')").explain(true) Yong From: Sidney Feiner <sidney.fei...@startapp.com> Sent: Tuesday, February 21, 2017 10:46 AM To: Yong Zhang; Chanh Le; user @spark Subject: RE: How to query a query with not contain, not start_wit

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Not sure if I misunderstand your question, but what's wrong doing it this way? scala> spark.version res6: String = 2.0.2 scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url") df: org.apache.spark.sql.DataFrame = [user_id: int, url: string] scala>

Re: How to specify default value for StructField?

2017-02-14 Thread Yong Zhang
You may be looking for something like "spark.sql.parquet.mergeSchema" for ORC. Unfortunately, I don't think it is available, unless someone tells me I am wrong. You can create a JIRA to request this feature, but we all know that Parquet is the first-class citizen format. Yong

Re: skewed data in join

2017-02-16 Thread Yong Zhang
Yes. You have to change your key, or as the big-data term goes, by "adding salt". Yong From: Gourav Sengupta Sent: Thursday, February 16, 2017 11:11 AM To: user Subject: skewed data in join Hi, Is there a way to do multiple reducers for joining
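A rough sketch of the salting idea (the DataFrames and column names are hypothetical): spread the hot key of the large side across N sub-keys, and replicate the small side once per sub-key so the join still matches.

  import org.apache.spark.sql.functions._

  val saltBuckets = 16
  // Append a random salt to the large, skewed side ...
  val bigSalted   = bigDf.withColumn("salt", (rand() * saltBuckets).cast("int"))
  // ... and replicate each row of the small side once per salt value.
  val smallSalted = smallDf.withColumn(
    "salt", explode(array((0 until saltBuckets).map(lit): _*)))

  val joined = bigSalted.join(smallSalted, Seq("key", "salt")).drop("salt")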

Re: How to specify default value for StructField?

2017-02-15 Thread Yong Zhang
If it works under Hive, did you try just creating the DF from the Hive table directly in Spark? That should work, right? Yong From: Begar, Veena <veena.be...@hpe.com> Sent: Wednesday, February 15, 2017 10:16 AM To: Yong Zhang; smartzjp; user@spark.apache.org S

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Yong Zhang
What DB are you using for your Hive metastore, and what types are your partition columns? You may want to read the discussion in SPARK-6910, and especially the comments in the PR. There are some limitations with partition pruning in Hive/Spark; maybe yours is one of them. Yong

Re: Spark #cores

2017-01-18 Thread Yong Zhang
spark.sql.shuffle.partitions controls not only Spark SQL, but also anything implemented on top of Spark DataFrame. If you are using the "spark.ml" package, then most ML libraries in it are based on DataFrame. So you shouldn't use "spark.default.parallelism", instead of
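For example (Spark 2.x API; on 1.x the equivalent is sqlContext.setConf):

  // Post-shuffle parallelism for DataFrame/Dataset and spark.ml jobs.
  spark.conf.set("spark.sql.shuffle.partitions", "400")
  // Only raw RDD operations fall back to spark.default.parallelism.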

Re: Spark #cores

2017-01-18 Thread Yong Zhang
sal...@gmail.com> Sent: Wednesday, January 18, 2017 3:21 PM To: Yong Zhang Cc: spline_pal...@yahoo.com; jasbir.s...@accenture.com; User Subject: Re: Spark #cores So, I should be using spark.sql.shuffle.partitions to control the parallelism? Is there there a guide to how to tune this? Tha

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Yong Zhang
Data Locality is part of job/task scheduling responsibility. So both links you specified originally are correct, one is for the standalone mode comes with Spark, another is for the YARN. Both have this ability. But YARN, as a very popular scheduling component, comes with MUCH, MUCH more

Why Spark cannot get the derived field of case class in Dataset?

2017-02-28 Thread Yong Zhang
In the following example, the "day" value is in the case class, but I cannot get that in the Spark dataset, which I would like to use at runtime? Any idea? Do I have to force it to be present in the case class constructor? I like to derive it out automatically and used in the dataset or

Re: Spark driver CPU usage

2017-03-01 Thread Yong Zhang
It won't control the CPU usage of the Driver. You should check what the CPUs are doing on your driver side. But I just want to make sure that you do know the full CPU usage on a 4-core Linux box will be 400%, so 100% really just makes one core busy. The Driver does maintain the application web UI,

Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-06 Thread Yong Zhang
This is an interesting point. I tested with the original data on the Spark 2.0 release, and I can get the same statistic output as in the original email, like the following: 50 1.77695393562 51 0.695149898529 52 0.638142108917 53 0.647341966629 54 0.663456916809 55 0.629166126251 56 0.644149065018 57

Re: distribute work (files)

2016-09-07 Thread Yong Zhang
What error do you get? FileNotFoundException? Please paste the stacktrace here. Yong From: Peter Figliozzi Sent: Wednesday, September 7, 2016 10:18 AM To: ayan guha Cc: Lydia Ickler; user.spark Subject: Re: distribute work (files)

Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Yong Zhang
Did you take a look at this -> https://github.com/databricks/spark-avro/issues/54 Yong spark-avro fails to save DF with nested records having the

Re: With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Yong Zhang
df.write.format(source).mode("overwrite").save(path) Yong From: Dan Bikle Sent: Friday, September 23, 2016 6:45 PM To: user@spark.apache.org Subject: With spark DataFrame, how to write to existing folder? spark-world, I am walking through

Great performance improvement of Spark 1.6.2 on our production cluster

2016-08-29 Thread Yong Zhang
Today I deployed Spark 1.6.2 on our production cluster. There is one huge job we run daily using Spark SQL, and it is the biggest Spark job running daily on our cluster. I was impressed by the speed improvement. Here are the history statistics of this daily job: 1) 11 to 12 hours

Re: find outliers within data

2016-11-22 Thread Yong Zhang
Spark DataFrame window functions? https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html (Introducing Window Functions in Spark SQL - Databricks)

Re: Dataframe broadcast join hint not working

2016-11-28 Thread Yong Zhang
If your query plan has "Project" in it, there is a bug in Spark preventing the "broadcast" hint from working in pre-2.0 releases. https://issues.apache.org/jira/browse/SPARK-13383 Unfortunately, the fix was not ported to 1.x. Yong From: Anton Okolnychyi

Re: Long-running job OOMs driver process

2016-11-18 Thread Yong Zhang
Just wondering, is it possible the memory usage keeps going up due to the web UI content? Yong From: Alexis Seigneurin Sent: Friday, November 18, 2016 10:17 AM To: Nathan Lande Cc: Keith Bourgoin; Irina Truong;

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Yong Zhang
That's correct, as long as you don't change the StorageLevel. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166 Yong From: Rabin Banerjee Sent: Friday, November 18, 2016 10:36 AM

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread Yong Zhang
Read the document on https://github.com/datastax/spark-cassandra-connector Yong From: kant kodali Sent: Friday, November 11, 2016 11:04 AM To: user @spark Subject: How to use Spark SQL to connect to Cassandra from Spark-Shell? How to use

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-20 Thread Yong Zhang
, and will be cached individually. Yong From: Taotao.Li <charles.up...@gmail.com> Sent: Sunday, November 20, 2016 6:18 AM To: Rabin Banerjee Cc: Yong Zhang; user; Mich Talebzadeh; Tathagata Das Subject: Re: Will spark cache table once even if I call read

Re: Java to show struct field from a Dataframe

2016-12-17 Thread Yong Zhang
: Richard Xin <richardxin...@yahoo.com> Sent: Saturday, December 17, 2016 8:53 PM To: Yong Zhang; zjp_j...@163.com; user Subject: Re: Java to show struct field from a Dataframe I tried to transform root |-- latitude: double (nullable = false) |-- longitude: double (nullable = false) |-- name:

Re: Java to show struct field from a Dataframe

2016-12-17 Thread Yong Zhang
"[D" type means a double array type. So this error simple means you have double[] data, but Spark needs to cast it to Double, as your schema defined. The error message clearly indicates the data doesn't match with the type specified in the schema. I wonder how you are so sure about your

Re: null values returned by max() over a window function

2016-11-29 Thread Yong Zhang
This is not a bug, but the intention of window functions. When you use max + rowsBetween, it is kind of a strange requirement. rowsBetween is more likely to be used to calculate a moving sum or avg, which will handle null as 0. But in your case, you want your grouping window as 2 rows before +
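A small sketch of that kind of frame (the column names are hypothetical): with rowsBetween(-2, 0) the frame shrinks at the start of a partition but is never empty, while a frame like rowsBetween(-2, -1) is empty on the first row, so the aggregate there is null by design.

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.max

  val w = Window.partitionBy("group").orderBy("ts").rowsBetween(-2, 0)
  val result = df.withColumn("max_last_3", max("value").over(w))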

Re: Spark SQL 1.6.3 ORDER BY and partitions

2017-01-09 Thread Yong Zhang
I am not sure what you mean by the "table" being comprised of 200/1200 partitions. A partition could mean the dataset (RDD/DataFrame) is chunked within Spark, then processed; or it could mean you defined partition metadata for the table in Hive. If you mean the first one, so

Re: Converting dataframe to dataset question

2017-03-23 Thread Yong Zhang
Not sure I understand this problem, why I cannot reproduce it? scala> spark.version res22: String = 2.1.0 scala> case class Teamuser(teamid: String, userid: String, role: String) defined class Teamuser scala> val df = Seq(Teamuser("t1", "u1", "role1")).toDF df: org.apache.spark.sql.DataFrame =

Re: how to read object field within json file

2017-03-23 Thread Yong Zhang
That's why your "source" should be defined as an Array[Struct] type (which makes sense in this case, it has an undetermined length , so you can explode it and get the description easily. Now you need write your own UDF, maybe can do what you want. Yong From:

Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread Yong Zhang
Change: val arrayinput = input.getAs[Array[String]](0) to: val arrayinput = input.getAs[Seq[String]](0) Yong From: shyla deshpande Sent: Thursday, March 23, 2017 8:18 PM To: user Subject: Spark dataframe,

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
I am not 100% sure, but normally "dispatcher-event-loop" OOM means the driver OOM. Are you sure your workers OOM? Yong From: bsikander Sent: Friday, March 24, 2017 5:48 AM To: user@spark.apache.org Subject: [Worker Crashing]

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
I never experienced worker OOM, or very rarely see it reported online. So my guess is that you have to generate the heap dump file and analyze it. Yong From: Behroz Sikander <behro...@gmail.com> Sent: Friday, March 24, 2017 9:15 AM To: Yong Zhang Cc: user@spark.apac

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
set the worker to 2g, and never experienced any OOM from workers. Our cluster is live for more than 1 year, and we also use Spark 1.6.2 on production. Yong From: Behroz Sikander <behro...@gmail.com> Sent: Friday, March 24, 2017 9:29 AM To: Yong Zhang Cc

Re: spark-submit config via file

2017-03-24 Thread Yong Zhang
Of course it is possible. You can always set any configuration in your application using the API, instead of passing it through the CLI. val sparkConf = new SparkConf().setAppName(properties.get("appName")).set("master", properties.get("master")).set(xxx, properties.get("xxx")) Your error is
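A sketch of the file-driven variant (the file name and keys are hypothetical; spark-submit also accepts --properties-file for spark.* settings):

  import java.io.FileInputStream
  import java.util.Properties
  import org.apache.spark.SparkConf

  // Load the application's .properties file and push the values into SparkConf.
  val props = new Properties()
  props.load(new FileInputStream("app.properties"))

  val conf = new SparkConf()
    .setAppName(props.getProperty("appName"))
    .setMaster(props.getProperty("master"))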

Re: how to read object field within json file

2017-03-24 Thread Yong Zhang
I missed the part to pass in a schema to force the "struct" to a Map, then use explode. Good option. Yong From: Michael Armbrust <mich...@databricks.com> Sent: Friday, March 24, 2017 3:02 PM To: Yong Zhang Cc: Selvam Raman; user Subject: Re: ho

Re: Need help for RDD/DF transformation.

2017-03-30 Thread Yong Zhang
you can just pick the first element out from the Array "keys" of DF2, to join. Otherwise, I don't see any way to avoid a cartesian join. Yong From: Mungeol Heo <mungeol@gmail.com> Sent: Thursday, March 30, 2017 3:05 AM To: ayan guha Cc: Yong Zhang

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
((id, iter) => (id, median(iter.map(_._2).toSeq))).show +---+-+ | _1| _2| +---+-+ |101|0.355| |100| 0.43| +---+-+ Yong From: ayan guha <guha.a...@gmail.com> Sent: Wednesday, March 22, 2017 7:23 PM To: Craig Ching Cc: Yong Zhang; user@spa

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
Is the element count per group big? If not, you can group them and use the code to calculate the median and diff. Yong From: Craig Ching Sent: Wednesday, March 22, 2017 3:17 PM To: user@spark.apache.org Subject: calculate diff of value

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Yong Zhang
The error message indeed is not very clear. What you did wrong is that the repartitionAndSortWithinPartitions not only requires PairRDD, but also OrderedRDD. Your case class as key is NOT Ordered. Either you extends it from Ordered, or provide a companion object to do the implicit Ordering.
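A minimal sketch of the companion-object route (the key fields are hypothetical): the implicit Ordering lives in the key's companion object, which is in the implicit scope that repartitionAndSortWithinPartitions resolves against.

  import org.apache.spark.HashPartitioner

  case class FlightKey(airlineId: String, departureTime: Long)

  object FlightKey {
    implicit val ordering: Ordering[FlightKey] =
      Ordering.by(k => (k.airlineId, k.departureTime))
  }

  // rdd is a hypothetical RDD[(FlightKey, String)]; for a true secondary sort
  // you would partition on airlineId only, with a custom partitioner.
  val sorted = rdd.repartitionAndSortWithinPartitions(new HashPartitioner(8))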

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Yong Zhang
You don't need to repartition your data just for join purposes. But if either party of the join is already partitioned, Spark will use this advantage as part of the join optimization. Whether you should reduceByKey before the join really depends on your join logic. ReduceByKey will shuffle, and following

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Yong Zhang
What is the desired result for RDD/DF 1 1, a 3, c 5, b RDD/DF 2 [1, 2, 3] [4, 5] Yong From: Mungeol Heo Sent: Wednesday, March 29, 2017 5:37 AM To: user@spark.apache.org Subject: Need help for RDD/DF transformation. Hello, Suppose,

If TypedColumn is a subclass of Column, why I cannot apply function on it in Dataset?

2017-03-18 Thread Yong Zhang
In the following example, after I used "typed.avg" to generate a TypedColumn, I want to apply round on top of it. But why does Spark complain about it? Because it doesn't know that it is a TypedColumn? Thanks Yong scala> spark.version res20: String = 2.1.0 scala> case

Re: apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread Yong Zhang
Is the answer here good for your case? http://stackoverflow.com/questions/33151866/spark-udf-with-varargs (scala - Spark UDF with varargs)

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Yong Zhang
Not really sure what root problem you are trying to address. The number of tasks to be run in Spark depends on the number of partitions in your job. Let's use a simple word count example: if your spark job reads 128G of data from HDFS (assume the default block size is 128M), then the mapper
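For example, 128 GB of input at 128 MB per block gives 1024 input splits, so that stage launches 1024 map tasks; the executor-instance and core settings only decide how many of those 1024 tasks run concurrently.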

Re: Dataset : Issue with Save

2017-03-16 Thread Yong Zhang
You can take a look at https://issues.apache.org/jira/browse/SPARK-12837 (Spark driver requires large memory space for serialized ...: executing a SQL statement with a large number of partitions requires a high memory). Yong

Re: spark streaming exectors memory increasing and executor killed by yarn

2017-03-16 Thread Yong Zhang
In this kind of question, you always want to tell us the spark version. Yong From: darin Sent: Thursday, March 16, 2017 9:59 PM To: user@spark.apache.org Subject: spark streaming exectors memory increasing and executor killed by yarn Hi,

Re: Dataset : Issue with Save

2017-03-17 Thread Yong Zhang
From: Bahubali Jain <bahub...@gmail.com> Sent: Thursday, March 16, 2017 11:41 PM To: Yong Zhang Cc: user@spark.apache.org Subject: Re: Dataset : Issue with Save I am using SPARK 2.0 . There are comments in the ticket since Oct-2016 which clearly mention that issue

Re: RDD can not convert to df, thanks

2017-03-17 Thread Yong Zhang
You also need the import the sqlContext implicits import sqlContext.implicits._ Yong From: 萝卜丝炒饭 <1427357...@qq.com> Sent: Friday, March 17, 2017 1:52 AM To: user-return-68576-1427357147=qq.com; user Subject: Re: RDD can not convert to df, thanks More info,I

Re: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

2017-03-17 Thread Yong Zhang
Starting from Spark 2, this kind of operation is implemented as a left anti join, instead of using the RDD operation directly. The same issue also exists on sqlContext. scala> spark.version res25: String = 2.0.2 spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true) ==

Re: Dataset : Issue with Save

2017-03-16 Thread Yong Zhang
; Sent: Thursday, March 16, 2017 10:34 PM To: Yong Zhang Cc: user@spark.apache.org Subject: Re: Dataset : Issue with Save Hi, Was this not yet resolved? Its a very common requirement to save a dataframe, is there a better way to save a dataframe by avoiding data being sent to driver?. "

Re: Reading ASN.1 files in Spark

2017-04-06 Thread Yong Zhang
Spark can read any file, as long as you can provide it the Hadoop InputFormat implementation. Did you try this guy's example? http://awcoleman.blogspot.com/2014/07/processing-asn1-call-detail-records.html

Re: Spark failing while persisting sorted columns.

2017-03-09 Thread Yong Zhang
My guess is that your executor already crashed, maybe due to OOM. You should check the executor log; it may tell you more. Yong From: Rohit Verma Sent: Thursday, March 9, 2017 4:41 AM To: user Subject: Spark failing while

Re: keep or remove sc.stop() coz of RpcEnv already stopped error

2017-03-13 Thread Yong Zhang
What version of Spark you are using? Based on Spark-12967, it is fixed on Spark 2.0 and later. If you are using Spark 1.x, you can ignore this Warning. It shouldn't affect any functions. Yong From: nancy henry Sent: Monday, March

Re: org.apache.spark.SparkException: Task not serializable

2017-03-13 Thread Yong Zhang
In fact, I will suggest a different way to handle the original problem. The example listed originally comes with a Java Function that doesn't use any instance fields/methods, so serializing the whole class is an overkill solution. Instead, you can/should make the Function static, which will work in

Re: Sorted partition ranges without overlap

2017-03-13 Thread Yong Zhang
You can implement your own partitioner based on your own logic. Yong From: Kristoffer Sjögren Sent: Monday, March 13, 2017 9:34 AM To: user Subject: Sorted partition ranges without overlap Hi I have a RDD that needs to be sorted
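A minimal sketch of such a partitioner (the split points and pairRdd are hypothetical; deriving the splits by sampling is essentially what Spark's own RangePartitioner does):

  import org.apache.spark.Partitioner

  // Routes integer keys into non-overlapping, already-ordered ranges.
  class FixedRangePartitioner(splits: Array[Int]) extends Partitioner {
    override def numPartitions: Int = splits.length + 1
    override def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[Int]
      val idx = splits.indexWhere(k <= _)
      if (idx < 0) splits.length else idx
    }
  }

  val partitioned = pairRdd.partitionBy(new FixedRangePartitioner(Array(100, 200, 300)))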

Re: can spark take advantage of ordered data?

2017-03-10 Thread Yong Zhang
I think it is an interesting requirement, but I am not familiar enough with Spark to say whether it can be done with the latest Spark version or not. From my understanding, you are looking for some API from Spark to read the source directly into a ShuffledRDD, which indeed needs (K, V and a

Re: finding Spark Master

2017-03-07 Thread Yong Zhang
This website explains it very clearly, if you are using Yarn: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_running_spark_on_yarn.html (Running Spark Applications on YARN)
