Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
ve code won't work in Spark SQL. * As I said, I am NOT running in either a Scala or PySpark session, but in pure Spark SQL. * Is it possible to do the above logic in Spark SQL, without using "exploding"? Thanks ____ From: Mich Talebzadeh Sent: Saturda

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
in Spark SQL. * As I said, I am NOT running in either a Scala or PySpark session, but in pure Spark SQL. * Is it possible to do the above logic in Spark SQL, without using "exploding"? Thanks ____ From: Mich Talebzadeh Sent: Saturday, May 6, 2023

Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-05 Thread Yong Zhang
Hi, This is on Spark 3.1 environment. For some reason, I can ONLY do this in Spark SQL, instead of either Scala or PySpark environment. I want to aggregate an array into a Map of element count, within that array, but in Spark SQL. I know that there is an aggregate function available like
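A minimal sketch of one way to do this in pure Spark SQL (the table name t and array column arr are assumptions), using the higher-order functions available since Spark 2.4 so that no explode is needed; the statement inside spark.sql can be run as-is in a SQL-only session:

// Counts each distinct element of the array column "arr" into a map, without explode.
spark.sql("""
  SELECT map_from_arrays(
           array_distinct(arr),
           transform(array_distinct(arr), x -> size(filter(arr, y -> y = x)))
         ) AS element_counts
  FROM t
""").show(false)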

Re: Spark 3.1 with spark AVRO

2022-03-10 Thread Yong Zhang
r file you downloaded in the jars directory of Spark and run spark-shell again without the packages option. This will guarantee that the jar file is on the classpath of the Spark driver and executors. On 3/10/22 1:24 PM, Yong Zhang wrote: Hi, I am puzzled with this issue of Spark 3.1 version to read

Spark 3.1 with spark AVRO

2022-03-10 Thread Yong Zhang
Hi, I am puzzled with this issue of Spark 3.1 version to read avro file. Everything is done on my local mac laptop so far, and I really don't know where the issue comes from, and I googled a lot and cannot find any clue. I am always using Spark 2.4 version, as it is really mature. But for a
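For reference, a minimal sketch of how the external spark-avro module is usually wired up with Spark 3.1 (the exact version coordinate below is an assumption and must match your Spark/Scala build):

// Start the shell with the matching module, e.g.:
//   spark-shell --packages org.apache.spark:spark-avro_2.12:3.1.2
// or place that same jar in Spark's jars/ directory, as suggested in the reply above.
val df = spark.read.format("avro").load("/path/to/data.avro")   // hypothetical path
df.printSchema()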

Re: Why Spark JDBC Writing in a sequential order

2018-05-25 Thread Yong Zhang
this and coordinating all tasks and executors as I observed? Yong From: Jörn Franke <jornfra...@gmail.com> Sent: Friday, May 25, 2018 10:50 AM To: Yong Zhang Cc: user@spark.apache.org Subject: Re: Why Spark JDBC Writing in a sequential order Can your database receive the

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Yong Zhang
What version of Spark are you using? You can search for "spark.sql.parquet.mergeSchema" on https://spark.apache.org/docs/latest/sql-programming-guide.html Starting from Spark 1.5, the default is already "false", which means Spark shouldn't scan all the Parquet files to generate the schema.
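A minimal sketch of controlling this explicitly on a partitioned Parquet path (the path is an assumption):

// Schema merging is off by default since Spark 1.5; turn it on per read only when
// you really need the union of all partition schemas.
val df = spark.read
  .option("mergeSchema", "false")            // keep the cheap default behavior
  .parquet("/warehouse/events")              // hypothetical partitioned path
// or globally: spark.conf.set("spark.sql.parquet.mergeSchema", "true")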

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Yong Zhang
What's wrong just using a UDF doing for loop in scala? You can change the for loop logic for what combination you want. scala> spark.version res4: String = 2.2.1 scala> aggDS.printSchema root |-- name: string (nullable = true) |-- colors: array (nullable = true) ||-- element: string

Re: java.lang.UnsupportedOperationException: CSV data source does not support struct/ERROR RetryingBlockFetcher

2018-03-28 Thread Yong Zhang
Your dataframe has an array data type, which is NOT supported by CSV. How can a CSV file include an array or other nested structure? If you want your data to be human-readable text, write it out as JSON in your case. Yong From: Mina Aslani
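A minimal sketch of that workaround, assuming df is the DataFrame and its array column is hypothetically named "tags":

import org.apache.spark.sql.functions.{col, concat_ws}

// CSV cannot represent array/struct columns; JSON can.
df.write.mode("overwrite").json("/tmp/output_json")          // hypothetical output path

// If CSV is mandatory, flatten the array into a single string first.
df.withColumn("tags", concat_ws("|", col("tags")))
  .write.mode("overwrite").csv("/tmp/output_csv")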

Re: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Case class

2018-03-23 Thread Yong Zhang
From: Yong Zhang <java8...@hotmail.com> Sent: Thursday, March 22, 2018 10:08 PM To: user@spark.apache.org Subject: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Case class I am trying to research a

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Case class

2018-03-22 Thread Yong Zhang
I am trying to research a custom Aggregator implementation, and following the example in the Spark sample code here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedTypedAggregation.scala But I cannot use it in the agg function, and

Re: CATALYST rule join

2018-02-27 Thread Yong Zhang
I don't fully understand your question, but maybe you want to check out this JIRA https://issues.apache.org/jira/browse/SPARK-17728, especially the comments area. There is some discussion about why a UDF could be executed multiple times by Spark. Yong From:

Re: Parquet files from spark not readable in Cascading

2017-11-16 Thread Yong Zhang
I don't have experience with Cascading, but we saw a similar issue when importing data generated by Spark into Hive. Did you try setting "spark.sql.parquet.writeLegacyFormat" to true? https://stackoverflow.com/questions/44279870/why-cant-impala-read-parquet-files-after-spark-sqls-write

Re: DataFrameReader read from S3 org.apache.spark.sql.AnalysisException: Path does not exist

2017-07-12 Thread Yong Zhang
Can't you just catch that exception and return an empty dataframe? Yong From: Sumona Routh Sent: Wednesday, July 12, 2017 4:36 PM To: user Subject: DataFrameReader read from S3 org.apache.spark.sql.AnalysisException: Path does not exist
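A minimal sketch of that suggestion (assuming a SparkSession named spark and Parquet input; adjust the format and schema to your case):

import org.apache.spark.sql.{AnalysisException, DataFrame}

// Fall back to an empty DataFrame when the S3 prefix does not exist.
def readOrEmpty(path: String): DataFrame =
  try {
    spark.read.parquet(path)
  } catch {
    case _: AnalysisException => spark.emptyDataFrame   // no rows, no columns
  }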

Re: about broadcast join of base table in spark sql

2017-07-02 Thread Yong Zhang
Then you need to tell us the spark version, and post the execution plan here, so we can help you better. Yong From: Paley Louie <paley2...@gmail.com> Sent: Sunday, July 2, 2017 12:36 AM To: Yong Zhang Cc: Bryan Jeffrey; d...@spark.org; user@spark.apac

Re: about broadcast join of base table in spark sql

2017-06-30 Thread Yong Zhang
Or since you already use the DataFrame API, instead of SQL, you can add the broadcast function to force it. https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#broadcast(org.apache.spark.sql.DataFrame) Yong functions - Apache
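A minimal sketch of forcing it with the DataFrame API (DataFrame and column names are assumptions):

import org.apache.spark.sql.functions.broadcast

// Hints Spark to broadcast the small dimension table regardless of
// spark.sql.autoBroadcastJoinThreshold.
val joined = factDf.join(broadcast(dimDf), Seq("key"))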

Re: SparkSQL to read XML Blob data to create multiple rows

2017-06-29 Thread Yong Zhang
scala>spark.version res6: String = 2.1.1 scala>val rdd = sc.parallelize(Seq("""Title1.1Description_1.1 Title1.2Description_1.2 Title1.3Description_1.3 """)) rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at :24 scala>import com.databricks.spark.xml.XmlReader

Re: how to call udf with parameters

2017-06-18 Thread Yong Zhang
What version of Spark are you using? I cannot reproduce your error: scala> spark.version res9: String = 2.1.1 scala> val dataset = Seq((0, "hello"), (1, "world")).toDF("id", "text") dataset: org.apache.spark.sql.DataFrame = [id: int, text: string] scala> import org.apache.spark.sql.functions.udf

Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-18 Thread Yong Zhang
I assume you use Scala to implement your UDFs. In this case, the Scala language itself already provides some options for you. If you want more control over the logic when a UDF initializes, you can define a Scala object and define your UDF as part of it; the object in Scala will behave like the Singleton pattern
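A minimal sketch of that pattern (all names are illustrative):

import org.apache.spark.sql.functions.udf

// A Scala object is initialized once per JVM (so once per executor); the
// expensive setup below runs once and is shared by every UDF invocation.
object Normalizer {
  private val lookup: Map[String, String] = loadLookupTable()
  private def loadLookupTable(): Map[String, String] = Map("nyc" -> "New York")  // stand-in for expensive init
  def normalize(s: String): String = lookup.getOrElse(s, s)
}

val normalizeUdf = udf((s: String) => Normalizer.normalize(s))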

Re: Parquet file generated by Spark, but not compatible read by Hive

2017-06-13 Thread Yong Zhang
a.a...@gmail.com> Sent: Tuesday, June 13, 2017 1:54 AM To: Angel Francisco Orta Cc: Yong Zhang; user@spark.apache.org Subject: Re: Parquet file generated by Spark, but not compatible read by Hive Try setting following Param: conf.set("spark.sql.hive.convertMetastoreParquet","false")

Parquet file generated by Spark, but not compatible read by Hive

2017-06-12 Thread Yong Zhang
We are using Spark 1.6.2 as ETL to generate parquet file for one dataset, and partitioned by "brand" (which is a string to represent brand in this dataset). After the partition files generated in HDFS like "brand=a" folder, we add the partitions in the Hive. The hive version is 1.2.1 (In

Re: Why spark.sql.autoBroadcastJoinThreshold not available

2017-05-15 Thread Yong Zhang
You should post the execution plan here, so we can provide more accurate support. Since you are building your feature table with a projection ("where "), my guess is that the following JIRA (SPARK-13383) stops the broadcast join.

Re: Reading ASN.1 files in Spark

2017-04-06 Thread Yong Zhang
Spark can read any file, as long as you can provide it the Hadoop InputFormat implementation. Did you try this guy's example? http://awcoleman.blogspot.com/2014/07/processing-asn1-call-detail-records.html

Re: Need help for RDD/DF transformation.

2017-03-30 Thread Yong Zhang
you can just pick the first element out from the Array "keys" of DF2, to join. Otherwise, I don't see any way to avoid a cartesian join. Yong From: Mungeol Heo <mungeol@gmail.com> Sent: Thursday, March 30, 2017 3:05 AM To: ayan guha Cc: Yong Zhang

Re: Spark SQL, dataframe join questions.

2017-03-29 Thread Yong Zhang
You don't need to repartition your data just for the join. But if either side of the join is already partitioned, Spark will use that as part of the join optimization. Whether you should reduceByKey before the join really depends on your join logic. ReduceByKey will shuffle, and following

Re: Secondary Sort using Apache Spark 1.6

2017-03-29 Thread Yong Zhang
The error message indeed is not very clear. What went wrong is that repartitionAndSortWithinPartitions not only requires a PairRDD, but also an OrderedRDD. Your case class used as the key is NOT Ordered. Either extend it from Ordered, or provide a companion object to supply the implicit Ordering.
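A minimal sketch of the companion-object approach for a secondary sort (the key class and partitioner are illustrative):

import org.apache.spark.Partitioner

// Secondary-sort key: partition by id, sort by (id, ts).
case class SortKey(id: String, ts: Long)

object SortKey {
  // Implicit Ordering picked up by repartitionAndSortWithinPartitions.
  implicit val ordering: Ordering[SortKey] = Ordering.by((k: SortKey) => (k.id, k.ts))
}

// Partition on id only, so all records of one id land in the same partition
// while the Ordering above sorts them by ts within it.
class IdPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val h = key.asInstanceOf[SortKey].id.hashCode % parts
    if (h < 0) h + parts else h
  }
}

// pairRdd: RDD[(SortKey, V)], assumed to already exist
val sorted = pairRdd.repartitionAndSortWithinPartitions(new IdPartitioner(200))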

Re: Need help for RDD/DF transformation.

2017-03-29 Thread Yong Zhang
What is the desired result for RDD/DF 1 1, a 3, c 5, b RDD/DF 2 [1, 2, 3] [4, 5] Yong From: Mungeol Heo Sent: Wednesday, March 29, 2017 5:37 AM To: user@spark.apache.org Subject: Need help for RDD/DF transformation. Hello, Suppose,

Re: how to read object field within json file

2017-03-24 Thread Yong Zhang
I missed the part to pass in a schema to force the "struct" to a Map, then use explode. Good option. Yong From: Michael Armbrust <mich...@databricks.com> Sent: Friday, March 24, 2017 3:02 PM To: Yong Zhang Cc: Selvam Raman; user Subject: Re: ho

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
set the worker to 2g, and never experienced any OOM from workers. Our cluster is live for more than 1 year, and we also use Spark 1.6.2 on production. Yong From: Behroz Sikander <behro...@gmail.com> Sent: Friday, March 24, 2017 9:29 AM To: Yong Zhang Cc

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
I never experienced worker OOM or very rarely see this online. So my guess that you have to generate the heap dump file to analyze it. Yong From: Behroz Sikander <behro...@gmail.com> Sent: Friday, March 24, 2017 9:15 AM To: Yong Zhang Cc: user@spark.apac

Re: spark-submit config via file

2017-03-24 Thread Yong Zhang
Of course it is possible. You can always set any configuration in your application using the API, instead of passing it in through the CLI. val sparkConf = new SparkConf().setAppName(properties.get("appName")).setMaster(properties.get("master")).set(xxx, properties.get("xxx")) Your error is
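A minimal sketch of that approach (the file name and property keys are assumptions):

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val props = new Properties()
props.load(new FileInputStream("app.properties"))    // hypothetical config file

val conf = new SparkConf()
  .setAppName(props.getProperty("appName"))
  .setMaster(props.getProperty("master"))
  .set("spark.executor.memory", props.getProperty("spark.executor.memory"))

val spark = SparkSession.builder().config(conf).getOrCreate()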

Re: [Worker Crashing] OutOfMemoryError: GC overhead limit execeeded

2017-03-24 Thread Yong Zhang
I am not 100% sure, but normally "dispatcher-event-loop" OOM means the driver OOM. Are you sure your workers OOM? Yong From: bsikander Sent: Friday, March 24, 2017 5:48 AM To: user@spark.apache.org Subject: [Worker Crashing]

Re: Spark dataframe, UserDefinedAggregateFunction(UDAF) help!!

2017-03-23 Thread Yong Zhang
Change: val arrayinput = input.getAs[Array[String]](0) to: val arrayinput = input.getAs[Seq[String]](0) Yong From: shyla deshpande Sent: Thursday, March 23, 2017 8:18 PM To: user Subject: Spark dataframe,

Re: how to read object field within json file

2017-03-23 Thread Yong Zhang
That's why your "source" should be defined as an Array[Struct] type (which makes sense in this case, since it has an undetermined length), so you can explode it and get the description easily. Now you need to write your own UDF, which maybe can do what you want. Yong From:

Re: Converting dataframe to dataset question

2017-03-23 Thread Yong Zhang
Not sure I understand this problem, why I cannot reproduce it? scala> spark.version res22: String = 2.1.0 scala> case class Teamuser(teamid: String, userid: String, role: String) defined class Teamuser scala> val df = Seq(Teamuser("t1", "u1", "role1")).toDF df: org.apache.spark.sql.DataFrame =

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
((id, iter) => (id, median(iter.map(_._2).toSeq))).show +---+-+ | _1| _2| +---+-+ |101|0.355| |100| 0.43| +---+-+ Yong From: ayan guha <guha.a...@gmail.com> Sent: Wednesday, March 22, 2017 7:23 PM To: Craig Ching Cc: Yong Zhang; user@spa

Re: calculate diff of value and median in a group

2017-03-22 Thread Yong Zhang
Is the element count per group big? If not, you can group them and use the code to calculate the median and diff. Yong From: Craig Ching Sent: Wednesday, March 22, 2017 3:17 PM To: user@spark.apache.org Subject: calculate diff of value

If TypedColumn is a subclass of Column, why I cannot apply function on it in Dataset?

2017-03-18 Thread Yong Zhang
In the following example, after I used "typed.avg" to generate a TypedColumn, and I want to apply round on top of it? But why Spark complains about it? Because it doesn't know that it is a TypedColumn? Thanks Yong scala> spark.version res20: String = 2.1.0 scala> case

Re: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

2017-03-17 Thread Yong Zhang
Starting from Spark 2, this kind of operation is implemented as a left anti join, instead of using an RDD operation directly. The same issue also exists on sqlContext. scala> spark.version res25: String = 2.0.2 spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true) ==
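For illustration, a rough DataFrame equivalent of what the planner produces (dfA/dfB are assumptions; except is EXCEPT DISTINCT, hence the dropDuplicates):

// except() ...
val diff1 = dfA.except(dfB)

// ... is planned roughly as a null-safe left anti join plus a distinct:
val joinCond = dfA.columns.map(c => dfA(c) <=> dfB(c)).reduce(_ && _)
val diff2 = dfA.join(dfB, joinCond, "left_anti").dropDuplicates()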

Re: Dataset : Issue with Save

2017-03-17 Thread Yong Zhang
From: Bahubali Jain <bahub...@gmail.com> Sent: Thursday, March 16, 2017 11:41 PM To: Yong Zhang Cc: user@spark.apache.org Subject: Re: Dataset : Issue with Save I am using SPARK 2.0 . There are comments in the ticket since Oct-2016 which clearly mention that issue

Re: RDD can not convert to df, thanks

2017-03-17 Thread Yong Zhang
You also need the import the sqlContext implicits import sqlContext.implicits._ Yong From: 萝卜丝炒饭 <1427357...@qq.com> Sent: Friday, March 17, 2017 1:52 AM To: user-return-68576-1427357147=qq.com; user Subject: Re: RDD can not convert to df, thanks More info,I

Re: Dataset : Issue with Save

2017-03-16 Thread Yong Zhang
; Sent: Thursday, March 16, 2017 10:34 PM To: Yong Zhang Cc: user@spark.apache.org Subject: Re: Dataset : Issue with Save Hi, Was this not yet resolved? Its a very common requirement to save a dataframe, is there a better way to save a dataframe by avoiding data being sent to driver?. "

Re: spark streaming exectors memory increasing and executor killed by yarn

2017-03-16 Thread Yong Zhang
For this kind of question, you always want to tell us the Spark version. Yong From: darin Sent: Thursday, March 16, 2017 9:59 PM To: user@spark.apache.org Subject: spark streaming exectors memory increasing and executor killed by yarn Hi,

Re: Dataset : Issue with Save

2017-03-16 Thread Yong Zhang
You can take a look of https://issues.apache.org/jira/browse/SPARK-12837 Yong Spark driver requires large memory space for serialized ... issues.apache.org Executing a sql statement with a large number of partitions requires a high memory

Re: apply UDFs to N columns dynamically in dataframe

2017-03-15 Thread Yong Zhang
Is the answer here good for your case? http://stackoverflow.com/questions/33151866/spark-udf-with-varargs scala - Spark UDF with varargs -

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Yong Zhang
Not really sure what root problem you are trying to address. The number of tasks that need to run in Spark depends on the number of partitions in your job. Let's use a simple word count example: if your Spark job reads 128G of data from HDFS (assume the default block size is 128M), then the mapper

Re: Sorted partition ranges without overlap

2017-03-13 Thread Yong Zhang
You can implement your own partitioner based on your own logic. Yong From: Kristoffer Sjögren Sent: Monday, March 13, 2017 9:34 AM To: user Subject: Sorted partition ranges without overlap Hi I have a RDD that needs to be sorted
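A minimal sketch of such a partitioner for non-overlapping key ranges (the range width and number of partitions are assumptions):

import org.apache.spark.Partitioner

// Routes integer keys into fixed, non-overlapping ranges:
// [0, width) -> partition 0, [width, 2*width) -> partition 1, ...
class FixedRangePartitioner(parts: Int, width: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[Int]
    math.min(math.max(k, 0) / width, parts - 1)
  }
}

// pairRdd: RDD[(Int, V)], assumed to exist
val sorted = pairRdd.repartitionAndSortWithinPartitions(new FixedRangePartitioner(10, 100))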

Re: keep or remove sc.stop() coz of RpcEnv already stopped error

2017-03-13 Thread Yong Zhang
What version of Spark are you using? Based on SPARK-12967, it is fixed in Spark 2.0 and later. If you are using Spark 1.x, you can ignore this warning. It shouldn't affect any functionality. Yong From: nancy henry Sent: Monday, March

Re: org.apache.spark.SparkException: Task not serializable

2017-03-13 Thread Yong Zhang
In fact, I will suggest a different way to handle the original problem. The example listed originally comes with a Java Function that doesn't use any instance fields/methods, so serializing the whole class is an overkill solution. Instead, you can/should make the Function static, which will work in
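The original example is Java, but the same idea in Scala (a sketch, with illustrative names) is to keep the function in a standalone object so the closure captures nothing from the enclosing, possibly non-serializable, class:

import org.apache.spark.rdd.RDD

// Nothing from the surrounding class is captured when the function lives here.
object Transforms {
  val toUpper: String => String = _.toUpperCase
}

class MyJob {                                  // MyJob itself need not be Serializable
  def run(rdd: RDD[String]): RDD[String] =
    rdd.map(Transforms.toUpper)                // only the small function value is serialized
}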

Re: can spark take advantage of ordered data?

2017-03-10 Thread Yong Zhang
I think it is an interesting requirement, but I am not familiar enough with Spark to say whether it can be done in the latest Spark version or not. From my understanding, you are looking for some API from Spark to read the source directly into a ShuffledRDD, which indeed needs (K, V and a

Re: Spark failing while persisting sorted columns.

2017-03-09 Thread Yong Zhang
My guess is that your executor already crashed, due to OOM. You should check the executor log; it may tell you more information. Yong From: Rohit Verma Sent: Thursday, March 9, 2017 4:41 AM To: user Subject: Spark failing while

Re: finding Spark Master

2017-03-07 Thread Yong Zhang
This website explains it very clear, if you are using Yarn. https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_running_spark_on_yarn.html Running Spark Applications on YARN -

Re: Spark driver CPU usage

2017-03-01 Thread Yong Zhang
It won't control the CPU usage of the driver. You should check out what the CPUs are doing on your driver side. But I just want to make sure that you know the full CPU usage on a 4-core Linux box will be 400%, so 100% really just makes one core busy. The driver does maintain the application web UI,

Why Spark cannot get the derived field of case class in Dataset?

2017-02-28 Thread Yong Zhang
In the following example, the "day" value is in the case class, but I cannot get it in the Spark dataset, which I would like to use at runtime. Any idea? Do I have to force it to be present in the case class constructor? I'd like to derive it automatically and use it in the dataset or

Re: Duplicate Rank for within same partitions

2017-02-24 Thread Yong Zhang
What you described is not clear here. Do you want to rank your data based on (date, hour, language, item_type, time_zone) and sort by score; or do you want to rank your data based on (date, hour) and sort by language, item_type, time_zone and score? If you mean the first one, then your Spark

Re: Spark SQL : Join operation failure

2017-02-22 Thread Yong Zhang
Your error message is not clear about what really happens. Is your container killed by Yarn, or does it indeed run OOM? When I run a Spark job with big data, here is normally what I will do: 1) Enable GC output. You need to monitor the GC output in the executor, to understand the GC pressure.

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
: Yong Zhang <java8...@hotmail.com> Sent: Tuesday, February 21, 2017 1:17 PM To: Sidney Feiner; Chanh Le; user @spark Subject: Re: How to query a query with not contain, not start_with, not end_with condition effective? Sorry, didn't pay attention to the originally requirement. Did y

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
rom data where url like '%sell%')").explain(true) Yong From: Sidney Feiner <sidney.fei...@startapp.com> Sent: Tuesday, February 21, 2017 10:46 AM To: Yong Zhang; Chanh Le; user @spark Subject: RE: How to query a query with not contain, not start_wit

Re: How to query a query with not contain, not start_with, not end_with condition effective?

2017-02-21 Thread Yong Zhang
Not sure if I misunderstand your question, but what's wrong doing it this way? scala> spark.version res6: String = 2.0.2 scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", "url") df: org.apache.spark.sql.DataFrame = [user_id: int, url: string] scala>

Re: [SparkSQL] pre-check syntex before running spark job?

2017-02-21 Thread Yong Zhang
You can always use the explain method to validate your DF or SQL before any action. Yong From: Jacek Laskowski Sent: Tuesday, February 21, 2017 4:34 AM To: Linyuxin Cc: user Subject: Re: [SparkSQL] pre-check syntex before running spark job? Hi,
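A minimal sketch (the query is hypothetical); parsing and analysis run when the Dataset is built, so a bad table or column name fails before any job is launched:

import org.apache.spark.sql.AnalysisException

try {
  spark.sql("SELECT user_id, count(*) FROM events GROUP BY user_id").explain(true)
} catch {
  case e: AnalysisException => println(s"Query rejected: ${e.getMessage}")
}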

Re: Serialization error - sql UDF related

2017-02-18 Thread Yong Zhang
You define "getNewColumnName" as a method, which requires the class/object holding it to be serializable. From the stack trace, it looks like this method is defined in ProductDimensionSFFConverterRealApp, but that is not serializable. In fact, your method only uses String and Boolean, which

Re: Efficient Spark-Sql queries when only nth Column changes

2017-02-18 Thread Yong Zhang
If you only need group-bys within the same hierarchy, then you can group by at the lowest level and cache it, then use the cached DF to derive the higher levels, so Spark will only scan the original table once and reuse the cache afterwards. val df_base =
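A fuller sketch of that roll-up pattern (table and column names are assumptions):

import org.apache.spark.sql.functions.sum

// Aggregate once at the finest granularity and cache it ...
val dfBase = spark.table("sales")
  .groupBy("country", "state", "city")
  .agg(sum("amount").as("amount"))
  .cache()

// ... then derive the coarser levels from the cached result, not from the raw table.
val byState   = dfBase.groupBy("country", "state").agg(sum("amount").as("amount"))
val byCountry = dfBase.groupBy("country").agg(sum("amount").as("amount"))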

Re: skewed data in join

2017-02-16 Thread Yong Zhang
Yes. You have to change your key, or in big-data terms, "add salt". Yong From: Gourav Sengupta Sent: Thursday, February 16, 2017 11:11 AM To: user Subject: skewed data in join Hi, Is there a way to do multiple reducers for joining
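A rough sketch of the salting idea with the DataFrame API (the salt factor and all names are assumptions):

import org.apache.spark.sql.functions.{array, col, concat_ws, explode, floor, lit, rand}

val salts = 16   // spread each hot key over 16 sub-keys

// Big, skewed side: append a random salt to the join key.
val bigSalted = bigDf.withColumn("salted_key",
  concat_ws("_", col("key"), floor(rand() * salts).cast("string")))

// Small side: replicate every row once per salt value.
val smallSalted = smallDf
  .withColumn("salt", explode(array((0 until salts).map(lit): _*)))
  .withColumn("salted_key", concat_ws("_", col("key"), col("salt").cast("string")))

val joined = bigSalted.join(smallSalted, "salted_key")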

Re: How to specify default value for StructField?

2017-02-15 Thread Yong Zhang
If it works under hive, do you try just create the DF from Hive table directly in Spark? That should work, right? Yong From: Begar, Veena <veena.be...@hpe.com> Sent: Wednesday, February 15, 2017 10:16 AM To: Yong Zhang; smartzjp; user@spark.apache.org S

Re: How to specify default value for StructField?

2017-02-14 Thread Yong Zhang
You are maybe looking for something like "spark.sql.parquet.mergeSchema" for ORC. Unfortunately, I don't think it is available, unless someone tells me I am wrong. You can create a JIRA to request this feature, but we all know that Parquet is the first-class citizen format. Yong

Re: Spark #cores

2017-01-18 Thread Yong Zhang
sal...@gmail.com> Sent: Wednesday, January 18, 2017 3:21 PM To: Yong Zhang Cc: spline_pal...@yahoo.com; jasbir.s...@accenture.com; User Subject: Re: Spark #cores So, I should be using spark.sql.shuffle.partitions to control the parallelism? Is there there a guide to how to tune this? Tha

Re: Spark #cores

2017-01-18 Thread Yong Zhang
spark.sql.shuffle.partitions not only controls Spark SQL, but also any implementation based on the Spark DataFrame. If you are using the "spark.ml" package, then most ML libraries in it are based on DataFrame. So you shouldn't use "spark.default.parallelism", instead of
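A minimal sketch of setting it (the value 400 is only an example; tune it to your data volume and core count):

// Governs the number of partitions after any DataFrame/Spark SQL shuffle
// (joins, aggregations); spark.default.parallelism only covers the raw RDD APIs.
spark.conf.set("spark.sql.shuffle.partitions", "400")
// or at submit time: --conf spark.sql.shuffle.partitions=400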

Re: Spark sql query plan contains all the partitions from hive table even though filtering of partitions is provided

2017-01-17 Thread Yong Zhang
What DB are you using for your Hive metastore, and what types are your partition columns? You may want to read the discussion in SPARK-6910, especially the comments in the PR. There are some limitations around partition pruning in Hive/Spark; maybe yours is one of them. Yong

Re: Spark SQL 1.6.3 ORDER BY and partitions

2017-01-09 Thread Yong Zhang
I am not sure what you mean by the "table" being comprised of 200/1200 partitions. A partition could mean the dataset (RDD/DataFrame) will be chunked within Spark, then processed; or it could mean you defined the partition metadata of the table in Hive. If you mean the first one, so

Re: Java to show struct field from a Dataframe

2016-12-17 Thread Yong Zhang
: Richard Xin <richardxin...@yahoo.com> Sent: Saturday, December 17, 2016 8:53 PM To: Yong Zhang; zjp_j...@163.com; user Subject: Re: Java to show struct field from a Dataframe I tried to transform root |-- latitude: double (nullable = false) |-- longitude: double (nullable = false) |-- name:

Re: Java to show struct field from a Dataframe

2016-12-17 Thread Yong Zhang
"[D" type means a double array type. So this error simple means you have double[] data, but Spark needs to cast it to Double, as your schema defined. The error message clearly indicates the data doesn't match with the type specified in the schema. I wonder how you are so sure about your

Re: null values returned by max() over a window function

2016-11-29 Thread Yong Zhang
This is not a bug, but the intention of window functions. When you use max + rowsBetween, it is a kind of strange requirement. rowsBetween is more likely to be used to calculate a moving sum or avg, which will handle null as 0. But in your case, you want your grouping window as 2 rows before +

Re: Dataframe broadcast join hint not working

2016-11-28 Thread Yong Zhang
If your query plan has "Project" in it, there is a bug in Spark preventing the "broadcast" hint from working in pre-2.0 releases. https://issues.apache.org/jira/browse/SPARK-13383 Unfortunately, there is no backported fix in 1.x. Yong From: Anton Okolnychyi

Re: find outliers within data

2016-11-22 Thread Yong Zhang
Spark Dataframe window functions? https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html Introducing Window Functions in Spark SQL - Databricks databricks.com To use window

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-20 Thread Yong Zhang
, and will be cached individually. Yong From: Taotao.Li <charles.up...@gmail.com> Sent: Sunday, November 20, 2016 6:18 AM To: Rabin Banerjee Cc: Yong Zhang; user; Mich Talebzadeh; Tathagata Das Subject: Re: Will spark cache table once even if I call read

Re: Will spark cache table once even if I call read/cache on the same table multiple times

2016-11-18 Thread Yong Zhang
That's correct, as long as you don't change the StorageLevel. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L166 Yong From: Rabin Banerjee Sent: Friday, November 18, 2016 10:36 AM

Re: Long-running job OOMs driver process

2016-11-18 Thread Yong Zhang
Just wondering, is it possible the memory usage keeps going up due to the web UI content? Yong From: Alexis Seigneurin Sent: Friday, November 18, 2016 10:17 AM To: Nathan Lande Cc: Keith Bourgoin; Irina Truong;

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread Yong Zhang
Read the document on https://github.com/datastax/spark-cassandra-connector Yong From: kant kodali Sent: Friday, November 11, 2016 11:04 AM To: user @spark Subject: How to use Spark SQL to connect to Cassandra from Spark-Shell? How to use

Re: With spark DataFrame, how to write to existing folder?

2016-09-23 Thread Yong Zhang
df.write.format(source).mode("overwrite").save(path) Yong From: Dan Bikle Sent: Friday, September 23, 2016 6:45 PM To: user@spark.apache.org Subject: With spark DataFrame, how to write to existing folder? spark-world, I am walking through

Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Yong Zhang
Did you take a look at this -> https://github.com/databricks/spark-avro/issues/54 Yong spark-avro fails to save DF with nested records having the

Re: distribute work (files)

2016-09-07 Thread Yong Zhang
What error do you get? FileNotFoundException? Please paste the stacktrace here. Yong From: Peter Figliozzi Sent: Wednesday, September 7, 2016 10:18 AM To: ayan guha Cc: Lydia Ickler; user.spark Subject: Re: distribute work (files)

Re: Re[8]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-06 Thread Yong Zhang
This is an interesting point. I tested with the original data on the Spark 2.0 release, and I get the same statistics output as in the original email, like the following: 50 1.77695393562 51 0.695149898529 52 0.638142108917 53 0.647341966629 54 0.663456916809 55 0.629166126251 56 0.644149065018 57

Great performance improvement of Spark 1.6.2 on our production cluster

2016-08-29 Thread Yong Zhang
Today I deployed Spark 1.6.2 on our production cluster. There is one huge job we run every day using Spark SQL, and it is the biggest Spark job running daily on our cluster. I was impressed by the speed improvement. Here are the historical statistics of this daily job: 1) 11 to 12 hours

Re: Spark join and large temp files

2016-08-08 Thread Yong Zhang
Join requires shuffling. The problem is that you have to shuffle 1.5T of data, which causes problems with your disk usage. Another way is to broadcast the 1.5G small dataset, so there is no shuffle requirement for the 1.5T dataset. But you need to make sure you have enough memory. Can you try to

Re: Spark SQL and number of task

2016-08-04 Thread Yong Zhang
The 2 plans look similar, but there is a big difference, if you also consider that your source is in fact a NoSQL DB, like C*. The OR plan has "Filter ((id#0L = 94) || (id#0L = 2))", which means the filter is indeed happening on the Spark side, instead of on the C* side. Which means, to fulfill

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Yong Zhang
Data locality is part of the job/task scheduling responsibility. So both links you specified originally are correct: one is for the standalone mode that comes with Spark, the other is for YARN. Both have this ability. But YARN, as a very popular scheduling component, comes with MUCH, MUCH more

Re: Extracting key word from a textual column

2016-08-02 Thread Yong Zhang
Well, if you still want to use window functions for your logic, then you need to derive a new column, like "catalog", and use it as part of the grouping logic. Maybe you can use a regex to derive this new column. The implementation needs to depend on your data in

Re: UDF returning generic Seq

2016-07-26 Thread Yong Zhang
I don't know if "ANY" will work or not, but did you take a look at how the "map_values" UDF is implemented in Spark, which returns the map values of an array/seq of arbitrary type? https://issues.apache.org/jira/browse/SPARK-16279 Yong From: Chris Beavers

Re: Outer Explode needed

2016-07-26 Thread Yong Zhang
The reason for the lack of response is that this feature is not available yet. You can vote for and follow this JIRA https://issues.apache.org/jira/browse/SPARK-13721, if you really need this feature. Yong From: Don Drake Sent: Monday, July 25,

RE: Processing json document

2016-07-07 Thread Yong Zhang
The problem is for the Hadoop InputFormat to identify the record delimiter. If the whole JSON record is on one line, then the natural record delimiter will be the newline character. Keep in mind that in a distributed file system, the file split position is most likely NOT on the record delimiter. The

RE: Possible to broadcast a function?

2016-06-30 Thread Yong Zhang
How about this old discussion, related to a similar problem to yours? http://apache-spark-user-list.1001560.n3.nabble.com/Running-a-task-once-on-each-executor-td3203.html Yong From: aper...@timerazor.com Date: Wed, 29 Jun 2016 14:00:07 + Subject: Possible to broadcast a function? To:

RE: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Yong Zhang
If you are using Spark > 1.5, the best way is to use the DataFrame API directly, instead of SQL. You can specify the broadcast join hint in the DataFrame API, which will force the broadcast join. Yong From: mich.talebza...@gmail.com Date: Mon, 20 Jun 2016 13:09:17 +0100 Subject: Re:

RE: Not able to write output to local filsystem from Standalone mode.

2016-05-27 Thread Yong Zhang
he less things "just make sense to me". > > Regards, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-apache-spark > Follow me at https://twitter.com/jaceklaskowski > > > On F

RE: Not able to write output to local filsystem from Standalone mode.

2016-05-26 Thread Yong Zhang
That just makes sense, doesn't it? The only place will be the driver. If not, the executors would have contention over who should create the directory in this case. The coordinator (the driver in this case) is the best place for doing it. Yong From: math...@closetwork.org Date: Wed, 25 May 2016

RE: SQLContext and HiveContext parse a query string differently ?

2016-05-12 Thread Yong Zhang
Not sure what you mean? You want to have exactly one query running fine in both sqlContext and HiveContext? The query parsers are different; why do you want to have this feature? Do I understand your question correctly? Yong Date: Thu, 12 May 2016 13:09:34 +0200 Subject: SQLContext and

RE: Weird results with Spark SQL Outer joins

2016-05-02 Thread Yong Zhang
We are still not sure what the problem is, if you cannot show us some example data. For dps with 42632 rows, and swig with 42034 rows, if dps full outer joined with swig on 3 columns, with additional filters, gets the same resultSet row count as dps left outer joined with swig on 3 columns,

RE: Java exception when showing join

2016-04-25 Thread Yong Zhang
get an invalid syntax error when I do that. > > On Fri, 2016-04-22 at 20:06 -0400, Yong Zhang wrote: > > use "dispute_df.join(comments_df, dispute_df.COMMENTID === > > comments_df.COMMENTID).first()" instead. > > > > Yong > > > > Date: Fri, 22 Apr 201

RE: Java exception when showing join

2016-04-25 Thread Yong Zhang
r 2016 07:45:12 -0500 > > I get an invalid syntax error when I do that. > > On Fri, 2016-04-22 at 20:06 -0400, Yong Zhang wrote: > > use "dispute_df.join(comments_df, dispute_df.COMMENTID === > > comments_df.COMMENTID).first()" instead. > > > > Yong > &g

RE: How this unit test passed on master trunk?

2016-04-24 Thread Yong Zhang
put itself does not have any ordering. I am not sure why the unit test and the real env have different environment. Xiao, I do see the difference between unit test and local cluster run. Do you know the reason? Thanks. Zhan Zhang On Apr 22, 2016, at 11:23 AM, Yong Zhan

RE: Java exception when showing join

2016-04-22 Thread Yong Zhang
use "dispute_df.join(comments_df, dispute_df.COMMENTID === comments_df.COMMENTID).first()" instead. Yong Date: Fri, 22 Apr 2016 17:42:26 -0400 From: webe...@aim.com To: user@spark.apache.org Subject: Java exception when showing join I am using pyspark with netezza. I am getting a java
