RE: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Cheng, Hao
Congratulations!! Jerry, you really deserve it. Hao -Original Message- From: Mridul Muralidharan [mailto:mri...@gmail.com] Sent: Tuesday, August 29, 2017 12:04 PM To: Matei Zaharia Cc: dev ; Saisai Shao Subject:

RE: [VOTE] Release Apache Spark 2.0.0 (RC4)

2016-07-14 Thread Cheng, Hao
-1. This breaks existing applications that use Script Transformation in Spark SQL: the default Record/Column delimiter class changed because we no longer get the default conf value from HiveConf; see SPARK-16515. This is a regression. From: Reynold Xin [mailto:r...@databricks.com]

RE: new datasource

2015-11-19 Thread Cheng, Hao
I think you will probably need to write some code, as you need to support ES. There are 2 options, per my understanding: create a new Data Source from scratch, but you will probably need to override the interface at:

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
I am not sure what the best practice is for this specific problem, but it’s really worth thinking about in 2.0, as it is a painful issue for lots of users. By the way, is it also an opportunity to deprecate the RDD API (or internal API only?)? As lots of its functionality overlaps with

RE: A proposal for Spark 2.0

2015-11-12 Thread Cheng, Hao
DataFrames or DataSets don't fully fit. On Thu, Nov 12, 2015 at 5:17 PM, Cheng, Hao <hao.ch...@intel.com> wrote: I am not sure what the best practice is for this specific problem, but it’s really worth thinking about in 2.0, as it is a painful is

RE: Sort Merge Join from the filesystem

2015-11-09 Thread Cheng, Hao
Yes, we definitely need to think about how to handle this case, which is probably even more common than the case where both tables are sorted/partitioned. Can you jump to the JIRA and leave a comment there? From: Alex Nastetsky [mailto:alex.nastet...@vervemobile.com] Sent: Tuesday, November 10, 2015 3:03 AM To: Cheng, Hao Cc

RE: dataframe slow down with tungsten turn on

2015-11-05 Thread Cheng, Hao
turn on -- Forwarded message -- From: gen tang <gen.tan...@gmail.com> Date: Fri, Nov 6, 2015 at 12:14 AM Subject: Re: dataframe slow down with tungsten turn on To: "Cheng, Hao" <hao.ch...@intel.com<mailto:hao.ch...@inte

RE: Sort Merge Join from the filesystem

2015-11-04 Thread Cheng, Hao
Yes, we probably need more changes to the data source API if we want to implement it in a generic way. BTW, I created the JIRA by copying most of the words from Alex. ☺ https://issues.apache.org/jira/browse/SPARK-11512 From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, November 5, 2015

RE: dataframe slow down with tungsten turn on

2015-11-04 Thread Cheng, Hao
BTW, 1 min vs. 2 hours seems quite weird. Can you provide more information on the ETL work? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Thursday, November 5, 2015 12:56 PM To: gen tang; dev@spark.apache.org Subject: RE: dataframe slow down with tungsten turn on 1.5 has critical

RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
Probably 2 reasons: 1. HadoopFsRelation was introduced in 1.4, but it seems CsvRelation was created based on 1.3. 2. HadoopFsRelation introduces the concept of Partition, which is probably not necessary for LibSVMRelation. But I think it will be easy to change as extending from

RE: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation ?

2015-11-04 Thread Cheng, Hao
problem as you described; probably we can add an additional checking / reporting rule for the abuse. From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Thursday, November 5, 2015 1:55 PM To: Cheng, Hao Cc: dev@spark.apache.org Subject: Re: Why LibSVMRelation and CsvRelation don't extends HadoopFsRelation

RE: Unable to acquire memory errors in HiveCompatibilitySuite

2015-09-16 Thread Cheng, Hao
We actually hit a similar problem in a real case (see https://issues.apache.org/jira/browse/SPARK-10474). After checking the source code, the external sort memory management strategy seems to be the root cause of the issue. Currently, we allocate a 4MB (page size) buffer initially in the

RE: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread Cheng, Hao
Not sure if it’s too late, but we found a critical bug (https://issues.apache.org/jira/browse/SPARK-10466): UnsafeRow ser/de will cause an assertion error, particularly for sort-based shuffle with data spill. This is not acceptable, as it’s very common in large table joins. From: Reynold Xin

RE: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Cheng, Hao
I found that https://spark-prs.appspot.com/ has been super slow recently when opened in a new window. Not sure if it is just me or if everybody experiences the same; is there any way to speed it up? From: Josh Rosen [mailto:rosenvi...@gmail.com] Sent: Friday, August 14, 2015 10:21 AM To: dev Subject: Re:

RE: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Cheng, Hao
OK, thanks, probably just me… From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, August 14, 2015 11:04 AM To: Cheng, Hao Cc: Josh Rosen; dev Subject: Re: Automatically deleting pull request comments left by AmplabJenkins I tried accessing just now. It took several seconds before

RE: Potential bug broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread Cheng, Hao
Firstly, spark.sql.autoBroadcastJoinThreshold only applies to equi-joins. Currently, for a non-equi join, an INNER join is executed as a CartesianProduct join, while BroadcastNestedLoopJoin handles the outer joins. In BroadcastNestedLoopJoin, the table
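
The rule described in the message above can be restated as a small decision function. This is only a toy sketch in plain Python, assuming nothing beyond what the message says about the Spark version under discussion; the function name `choose_join_operator` and its return strings are hypothetical and are not part of Spark's planner or API.

```python
def choose_join_operator(is_equi_join: bool, join_type: str) -> str:
    """Restate the rule from the message; hypothetical helper, not Spark code.

    is_equi_join: True when the join condition is an equality between columns.
    join_type:    "inner" or one of the outer join types (e.g. "left_outer").
    """
    if is_equi_join:
        # Per the message, the broadcast threshold is only consulted here.
        return "equi-join planning (spark.sql.autoBroadcastJoinThreshold applies)"
    if join_type.lower() == "inner":
        # Non-equi inner join, as described in the message.
        return "CartesianProduct"
    # Non-equi outer joins, as described in the message.
    return "BroadcastNestedLoopJoin"


if __name__ == "__main__":
    for equi, kind in [(True, "inner"), (False, "inner"), (False, "left_outer")]:
        print(equi, kind, "->", choose_join_operator(equi, kind))
```

Read this way, the threshold setting has no influence on which operator is picked for a non-equi join, which is the point the message makes.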

RE: [SparkSQL ] What is Exchange in physical plan for ?

2015-06-08 Thread Cheng, Hao
It means data shuffling, and its arguments also show the partitioning strategy. -Original Message- From: invkrh [mailto:inv...@gmail.com] Sent: Monday, June 8, 2015 9:34 PM To: dev@spark.apache.org Subject: [SparkSQL ] What is Exchange in physical plan for ? Hi,

RE: [VOTE] Release Apache Spark 1.4.0 (RC2)

2015-05-25 Thread Cheng, Hao
Adding another Blocker issue, just created! It seems to be a regression. https://issues.apache.org/jira/browse/SPARK-7853 -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, May 25, 2015 3:37 PM To: Patrick Wendell Cc: dev@spark.apache.org Subject: Re: [VOTE] Release

RE: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Cheng, Hao
Thanks for reporting this. We intend to support multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, but you’re probably hitting a bug; please file a JIRA issue for this. I will keep investigating this as well. Hao From: Mark Hamstra

RE: Does Spark SQL (JDBC) support nest select with current version

2015-05-15 Thread Cheng, Hao
Spark SQL just loads the query result as a new source (via JDBC), so DO NOT confuse it with the Spark SQL tables. They are totally independent database systems. From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 1:59 PM To: Cheng, Hao; Dev Subject: Re: Does Spark SQL

RE: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Cheng, Hao
You need to register the “dataFrame” as a table first and then do queries on it? Do you mean that also failed? From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 1:10 PM To: Yi Zhang; Dev Subject: Re: Does Spark SQL (JDBC) support nest select with current version If

RE: Add Char support in SQL dataTypes

2015-03-19 Thread Cheng, Hao
Can you use Varchar or String instead? Currently, Spark SQL converts varchar into string type internally (without max length limitation). However, the char type is not supported yet. -Original Message- From: A.M.Chan [mailto:kaka_1...@163.com] Sent: Friday, March 20, 2015 9:56

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Cheng, Hao
I am not so sure whether Hive supports changing the metastore after it is initialized; I guess not. Spark SQL relies totally on the Hive Metastore in HiveContext, which is probably why it doesn't work as expected for Q1. BTW, in most cases, people configure the metastore settings in hive-site.xml, and will not

RE: Join implementation in SparkSQL

2015-01-15 Thread Cheng, Hao
Not so sure about your question, but SparkStrategies.scala and Optimizer.scala are a good start if you want to get details of the join implementation or optimization. -Original Message- From: Andrew Ash [mailto:and...@andrewash.com] Sent: Friday, January 16, 2015 4:52 AM To: Reynold

RE: Question on saveAsTextFile with overwrite option

2014-12-24 Thread Cheng, Hao
I am wondering if we can provide a more friendly API, rather than a configuration, for this purpose. What do you think, Patrick? Cheng Hao -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Thursday, December 25, 2014 3:22 PM To: Shao, Saisai Cc: u...@spark.apache.org

RE: Where are the docs for the SparkSQL DataTypes?

2014-12-11 Thread Cheng, Hao
Part of it can be found at: https://github.com/apache/spark/pull/3429/files#diff-f88c3e731fcb17b1323b778807c35b38R34 Sorry, it's a PR that is still to be reviewed, but it should still be informative. Cheng Hao -Original Message- From: Alessandro Baretta [mailto:alexbare...@gmail.com] Sent: Friday

RE: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-06 Thread Cheng, Hao
I've created (reused) the PR https://github.com/apache/spark/pull/3336 and hopefully we can fix this regression. Thanks for reporting. Cheng Hao -Original Message- From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Saturday, December 6, 2014 4:51 AM To: kb Cc: d

RE: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Cheng, Hao
+1, that will definitely speed up PR reviewing / merging. -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Thursday, November 6, 2014 12:46 PM To: dev Subject: Re: [VOTE] Designating maintainers for some Spark components +1 since this is already the de facto

RE: Build with Hive 0.13.1 doesn't have datanucleus and parquet dependencies.

2014-10-27 Thread Cheng, Hao
The hive-thriftserver module is not included when specifying the profile hive-0.13.1. -Original Message- From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 4:48 PM To: dev@spark.apache.org Subject: Build with Hive 0.13.1 doesn't have datanucleus and

RE: HiveContext, schemaRDD.printSchema get different dataTypes, feature or a bug? really strange and surprised...

2014-08-31 Thread Cheng, Hao
Yes, the root cause is that the output ObjectInspector in the SerDe implementation doesn't reflect the real typeinfo. Hive actually provides an API like TypeInfoUtils.getStandardJavaObjectInspectorFromTypeInfo(TypeInfo) for the mapping. You probably need to update the code at

RE: [sql]enable spark sql cli support spark sql

2014-08-15 Thread Cheng, Hao
If so, we probably need to add SQL dialect switching support for SparkSQLCLI, as Fei suggested. What do you think the priority for this is? -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Friday, August 15, 2014 1:57 PM To: Cheng, Hao Cc: scwf; dev

RE: [sql]enable spark sql cli support spark sql

2014-08-14 Thread Cheng, Hao
Actually, the SQL Parser (another SQL dialect in SparkSQL) is quite weak and only supports some basic queries; not sure what the plan is for its enhancement. -Original Message- From: scwf [mailto:wangf...@huawei.com] Sent: Friday, August 15, 2014 11:22 AM To: dev@spark.apache.org Subject: