RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
. From: Todd [mailto:bit1...@163.com] Sent: Friday, September 11, 2015 2:17 PM To: Cheng, Hao Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL Thanks Hao for the reply. I turn the merge sort join off

RE: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
, September 11, 2015 3:39 PM To: Todd Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL I add the following two options: spark.sql.planner.sortMergeJoin=false

RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Cheng, Hao
This is not a big surprise that the SMJ is slower than the HashJoin, as we do not fully utilize the sorting yet; more details can be found at https://issues.apache.org/jira/browse/SPARK-2926 . Anyway, can you disable the sort merge join by “spark.sql.planner.sortMergeJoin=false;” in Spark 1.5, and
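A minimal sketch of applying the suggested setting in a Spark 1.5 spark-shell (where sqlContext is already defined); the table and column names in the sample query are placeholders:
{code}
// Disable sort-merge join so the planner falls back to the hash-based join (Spark 1.5).
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "false")

// Equivalent SQL form:
sqlContext.sql("SET spark.sql.planner.sortMergeJoin=false")

// Re-run the join and inspect the physical plan to confirm SortMergeJoin is gone.
sqlContext.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
{code}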

RE: Driver OOM after upgrading to 1.5

2015-09-09 Thread Cheng, Hao
Would it help to add JVM options like: -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled? From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, September 10, 2015 5:31 AM To: Sandy Ryza Cc: user@spark.apache.org Subject: Re: Driver OOM after upgrading to 1.5 It's

[jira] [Commented] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734395#comment-14734395 ] Cheng Hao commented on SPARK-10484: --- In the cartesian product implementation, there are 2 levels of nested loops

[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-08 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14736016#comment-14736016 ] Cheng Hao commented on SPARK-10466: --- Sorry, [~davies], I found the spark conf doesn't take effect when

[jira] [Created] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-06 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10466: - Summary: UnsafeRow exception in Sort-Based Shuffle with data spill Key: SPARK-10466 URL: https://issues.apache.org/jira/browse/SPARK-10466 Project: Spark Issue

RE: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-06 Thread Cheng, Hao
Not sure if it’s too late, but we found a critical bug at https://issues.apache.org/jira/browse/SPARK-10466 UnsafeRow ser/de will cause an assert error, particularly for sort-based shuffle with data spill; this is not acceptable as it’s very common in large table joins. From: Reynold Xin

RE: Re: Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-30 Thread Cheng, Hao
Hi, can you try something like: val rowRDD = sc.textFile("/user/spark/short_model").map { line => val p = line.split("\\t") if (p.length >= 72) { Row(p(0), p(1)…) } else { throw new RuntimeException(s"failed in parsing $line") } } From the log
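A cleaned-up, self-contained sketch of the snippet above as it likely read before the archive mangled it (the ">=" on the length check and the tab delimiter are reconstructions; the path and 72-field layout come from the thread):
{code}
import org.apache.spark.sql.Row

// Parse tab-delimited lines defensively so a short line fails with a clear message
// instead of a StringIndexOutOfBoundsException further down the pipeline.
val rowRDD = sc.textFile("/user/spark/short_model").map { line =>
  val p = line.split("\\t")
  if (p.length >= 72) {
    Row(p(0), p(1))   // ... continue with the remaining fields as needed
  } else {
    throw new RuntimeException(s"failed in parsing $line")
  }
}
{code}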

[jira] [Created] (SPARK-10327) Cache Table is not working while subquery has alias in its project list

2015-08-27 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10327: - Summary: Cache Table is not working while subquery has alias in its project list Key: SPARK-10327 URL: https://issues.apache.org/jira/browse/SPARK-10327 Project: Spark

[jira] [Created] (SPARK-10270) Add/Replace some Java friendly DataFrame API

2015-08-25 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10270: - Summary: Add/Replace some Java friendly DataFrame API Key: SPARK-10270 URL: https://issues.apache.org/jira/browse/SPARK-10270 Project: Spark Issue Type

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
Ok, I see, thanks for the correction, but this should be optimized. From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, August 25, 2015 2:08 PM To: Cheng, Hao Cc: Jeff Zhang; user@spark.apache.org Subject: Re: DataFrame#show cost 2 Spark Jobs ? That's two jobs. `SparkPlan.executeTake

[jira] [Commented] (SPARK-10215) Div of Decimal returns null

2015-08-25 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14710719#comment-14710719 ] Cheng Hao commented on SPARK-10215: --- Yes, that's a blocker issue for our customer, I

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
Oh, sorry, I missed reading your reply! I know the minimum number of tasks will be 2 for scanning, but Jeff is talking about 2 jobs, not 2 tasks. From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, August 25, 2015 1:29 PM To: Cheng, Hao Cc: Jeff Zhang; user@spark.apache.org Subject: Re: DataFrame

RE: Spark thrift server on yarn

2015-08-25 Thread Cheng, Hao
Did you register the temp table via beeline or in a new Spark SQL CLI? As far as I know, a temp table cannot cross HiveContext instances. Hao From: Udit Mehta [mailto:ume...@groupon.com] Sent: Wednesday, August 26, 2015 8:19 AM To: user Subject: Spark thrift server on yarn Hi, I am trying to start a

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
The first job is to infer the json schema, and the second one is the actual query. You can provide the schema while loading the json file, like below: sqlContext.read.schema(xxx).json(“…”)? Hao From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Monday, August 24, 2015 6:20 PM To:
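A sketch of passing the schema explicitly so the first (inference) job is skipped; the field names and path are hypothetical:
{code}
import org.apache.spark.sql.types._

// Supplying the schema up front avoids the job that scans the JSON just to infer it.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

val df = sqlContext.read.schema(schema).json("/path/to/data.json")
df.show()   // only the query itself runs as a job now
{code}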

RE: Loading already existing tables in spark shell

2015-08-24 Thread Cheng, Hao
And be sure hive-site.xml is on the classpath or under $SPARK_HOME/conf. Hao From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in] Sent: Monday, August 24, 2015 8:57 PM To: user Subject: Re: Loading already existing tables in spark shell Hi Jeetendra, I faced

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
loading the data for JSON, it probably causes a longer ramp-up time with a large number of files/partitions. From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Tuesday, August 25, 2015 8:11 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: DataFrame#show cost 2 Spark Jobs ? Hi Cheng, I

[jira] [Created] (SPARK-10215) Div of Decimal returns null

2015-08-24 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10215: - Summary: Div of Decimal returns null Key: SPARK-10215 URL: https://issues.apache.org/jira/browse/SPARK-10215 Project: Spark Issue Type: Bug Components

RE: Test case for the spark sql catalyst

2015-08-24 Thread Cheng, Hao
Yes, check the source code under: https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst From: Todd [mailto:bit1...@163.com] Sent: Tuesday, August 25, 2015 1:01 PM To: user@spark.apache.org Subject: Test case for the spark sql catalyst Hi, Are

[jira] [Updated] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-23 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-10134: -- Priority: Minor (was: Major) Improve the performance of Binary Comparison

[jira] [Updated] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-23 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-10134: -- Fix Version/s: (was: 1.6.0) Improve the performance of Binary Comparison

[jira] [Commented] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-23 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14708766#comment-14708766 ] Cheng Hao commented on SPARK-10134: --- We can improve that by enabling the comparison every

[jira] [Commented] (SPARK-10130) type coercion for IF should have children resolved first

2015-08-20 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704513#comment-14704513 ] Cheng Hao commented on SPARK-10130: --- Can you change the fix version to 1.5? Lots

[jira] [Created] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-20 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10134: - Summary: Improve the performance of Binary Comparison Key: SPARK-10134 URL: https://issues.apache.org/jira/browse/SPARK-10134 Project: Spark Issue Type

[jira] [Commented] (SPARK-9357) Remove JoinedRow

2015-08-19 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704311#comment-14704311 ] Cheng Hao commented on SPARK-9357: -- JoinedRow does increase the overhead by adding layer

[jira] [Comment Edited] (SPARK-9357) Remove JoinedRow

2015-08-19 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704311#comment-14704311 ] Cheng Hao edited comment on SPARK-9357 at 8/20/15 5:28 AM

[jira] [Comment Edited] (SPARK-9357) Remove JoinedRow

2015-08-19 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704311#comment-14704311 ] Cheng Hao edited comment on SPARK-9357 at 8/20/15 5:29 AM

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false. BTW, which version are you using? Hao From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 12:16 PM To: Philip Weaver Cc: user Subject: Re: Spark Sql behaves strangely with tables with
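A sketch of how the setting could be applied in a spark-shell session of that era:
{code}
// Skip automatic partition discovery when loading data-source tables with many partitions.
sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")

// Or as a SQL statement:
sqlContext.sql("SET spark.sql.sources.partitionDiscovery.enabled=false")
{code}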

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
20, 2015 1:46 PM To: Cheng, Hao Cc: Philip Weaver; user Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs trying to speed up spark sql with tables with a huge number of partitions, I've made

[jira] [Commented] (SPARK-9357) Remove JoinedRow

2015-08-19 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702603#comment-14702603 ] Cheng Hao commented on SPARK-9357: -- JoinedRow is probably highly efficient for the case

[jira] [Commented] (SPARK-7218) Create a real iterator with open/close for Spark SQL

2015-08-19 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702584#comment-14702584 ] Cheng Hao commented on SPARK-7218: -- Can you give some BKM for this task? Create a real

[jira] [Created] (SPARK-10044) AnalysisException in resolving reference for sorting with aggregation

2015-08-16 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10044: - Summary: AnalysisException in resolving reference for sorting with aggregation Key: SPARK-10044 URL: https://issues.apache.org/jira/browse/SPARK-10044 Project: Spark

RE: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Cheng, Hao
I found https://spark-prs.appspot.com/ has been super slow when opening it in a new window recently; not sure if it's just me or everybody experiences the same. Is there any way to speed it up? From: Josh Rosen [mailto:rosenvi...@gmail.com] Sent: Friday, August 14, 2015 10:21 AM To: dev Subject: Re:

RE: Automatically deleting pull request comments left by AmplabJenkins

2015-08-13 Thread Cheng, Hao
OK, thanks, probably just myself… From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Friday, August 14, 2015 11:04 AM To: Cheng, Hao Cc: Josh Rosen; dev Subject: Re: Automatically deleting pull request comments left by AmplabJenkins I tried accessing just now. It took several seconds before

[jira] [Commented] (SPARK-8240) string function: concat

2015-08-13 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696471#comment-14696471 ] Cheng Hao commented on SPARK-8240: -- It's probably very difficult to define the function

[jira] [Commented] (SPARK-8240) string function: concat

2015-08-13 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696469#comment-14696469 ] Cheng Hao commented on SPARK-8240: -- It works for me like: {code} sql(select concat
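The {code} block above is cut off in the archive; a working query of the kind it appears to show, against a hypothetical table:
{code}
// concat over string columns, following Hive semantics (a NULL input yields a NULL result).
sqlContext.sql("SELECT concat(first_name, ' ', last_name) AS full_name FROM people").show()
{code}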

[jira] [Commented] (SPARK-9879) OOM in LIMIT clause with large number

2015-08-13 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14696338#comment-14696338 ] Cheng Hao commented on SPARK-9879: -- I created a new physical operator called LargeLimit

[jira] [Created] (SPARK-9879) OOM in CTAS with LIMIT

2015-08-12 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-9879: Summary: OOM in CTAS with LIMIT Key: SPARK-9879 URL: https://issues.apache.org/jira/browse/SPARK-9879 Project: Spark Issue Type: Bug Components: SQL

[jira] [Updated] (SPARK-9879) OOM in LIMIT clause with large number

2015-08-12 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-9879: - Summary: OOM in LIMIT clause with large number (was: OOM in CTAS with LIMIT) OOM in LIMIT clause

[jira] [Updated] (SPARK-9879) OOM in CTAS with LIMIT

2015-08-12 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-9879: - Description: {code} create table spark.tablsetest as select * from dpa_ord_bill_tf order by member_id

[jira] [Updated] (SPARK-9879) OOM in CTAS with LIMIT

2015-08-12 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-9879: - Description: {code} create table spark.tablsetest as select * from dpa_ord_bill_tf order by member_id

RE: Refresh table

2015-08-11 Thread Cheng, Hao
Refreshing a table only works for Spark SQL DataSource tables in my understanding; apparently here you’re running against a Hive table. Can you try to create a table like: |CREATE TEMPORARY TABLE parquetTable (a int, b string) |USING org.apache.spark.sql.parquet.DefaultSource
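A sketch of the suggested data-source table definition (the leading pipes in the quoted snippet are stripMargin characters from the original mail); the parquet path is a placeholder:
{code}
sqlContext.sql("""
  CREATE TEMPORARY TABLE parquetTable (a int, b string)
  USING org.apache.spark.sql.parquet.DefaultSource
  OPTIONS (path '/path/to/parquet')
""")

// Refreshing is meaningful for data-source tables like this one, not for plain Hive tables.
sqlContext.sql("REFRESH TABLE parquetTable")
{code}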

RE: Spark DataFrames uses too many partition

2015-08-11 Thread Cheng, Hao
That's a good question. We don't support reading small files in a single partition yet, but it's definitely an issue we need to optimize. Do you mind creating a jira issue for this? Hopefully we can merge that in the 1.6 release. 200 is the default partition number for parallel tasks after the

RE: Potential bug broadcastNestedLoopJoin or default value of spark.sql.autoBroadcastJoinThreshold

2015-08-11 Thread Cheng, Hao
Firstly, spark.sql.autoBroadcastJoinThreshold only works for equi-joins. Currently, for a non-equi join, if the join type is INNER it will be done by a CartesianProduct join, and BroadcastNestedLoopJoin handles the outer joins. In the BroadcastNestedLoopJoin, the table
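A small sketch contrasting the two cases described above; the table names are placeholders, and explain() just prints which physical join the planner picked:
{code}
// Equi-join: eligible for a broadcast join when the small side is below
// spark.sql.autoBroadcastJoinThreshold.
sqlContext.sql("SELECT * FROM big b JOIN small s ON b.id = s.id").explain()

// Non-equi INNER join: planned as a CartesianProduct;
// non-equi OUTER joins fall back to BroadcastNestedLoopJoin.
sqlContext.sql("SELECT * FROM big b JOIN small s ON b.id > s.id").explain()
{code}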

RE: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-11 Thread Cheng, Hao
Definitely worth a try. And you can sort the records before writing them out; then you will get parquet files without overlapping keys. Let us know if that helps. Hao From: Philip Weaver [mailto:philip.wea...@gmail.com] Sent: Wednesday, August 12, 2015 4:05 AM To: Cheng Lian Cc: user
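A sketch of the suggested write path, sorting on the lookup key first so each output file covers a disjoint key range; df, the key column, and the path are placeholders:
{code}
// Sorting by the key before writing keeps Parquet row-group min/max statistics
// non-overlapping, so reads that filter on the key can skip most files.
df.sort("key")
  .write
  .parquet("/path/to/output")
{code}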

[jira] [Created] (SPARK-9735) Auto infer partition schema of HadoopFsRelation should respect the user specified one

2015-08-07 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-9735: Summary: Auto infer partition schema of HadoopFsRelation should respect the user specified one Key: SPARK-9735 URL: https://issues.apache.org/jira/browse/SPARK-9735

[jira] [Commented] (SPARK-9689) Cache doesn't refresh for HadoopFsRelation based table

2015-08-06 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14661359#comment-14661359 ] Cheng Hao commented on SPARK-9689: -- After investigation, the root cause for the failure

[jira] [Updated] (SPARK-9689) Cache doesn't refresh for HadoopFsRelation based table

2015-08-06 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-9689: - Description: {code:title=example|borderStyle=solid} // create a HadoopFsRelation based table sql

[jira] [Created] (SPARK-9689) Cache doesn't refresh for HadoopFsRelation based table

2015-08-06 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-9689: Summary: Cache doesn't refresh for HadoopFsRelation based table Key: SPARK-9689 URL: https://issues.apache.org/jira/browse/SPARK-9689 Project: Spark Issue Type: Bug

[jira] [Commented] (SPARK-7119) ScriptTransform doesn't consider the output data type

2015-08-03 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14652892#comment-14652892 ] Cheng Hao commented on SPARK-7119: -- [~marmbrus] This is actually a bug fixing

[jira] [Created] (SPARK-9381) Migrate JSON data source to the new partitioning data source

2015-07-27 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-9381: Summary: Migrate JSON data source to the new partitioning data source Key: SPARK-9381 URL: https://issues.apache.org/jira/browse/SPARK-9381 Project: Spark Issue

[jira] [Commented] (SPARK-9374) [Spark SQL] Throw out error of AnalysisException: nondeterministic expressions are only allowed in Project or Filter during the spark sql parse phase

2015-07-27 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642706#comment-14642706 ] Cheng Hao commented on SPARK-9374: -- [~cloud_fan] Can you also take a look at this failure

[jira] [Commented] (SPARK-9239) HiveUDAF support for AggregateFunction2

2015-07-23 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-9239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14638276#comment-14638276 ] Cheng Hao commented on SPARK-9239: -- [~yhuai] are you working on this now? Or I can take

[jira] [Commented] (SPARK-8230) complex function: size

2015-07-15 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14629107#comment-14629107 ] Cheng Hao commented on SPARK-8230: -- [~pedrorodriguez], actually [~TarekAuel] set a good

RE: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-15 Thread Cheng, Hao
Have you ever tried querying “select * from temp_table” from the spark shell? Or can you try the --jars option while starting the spark shell? From: Srikanth [mailto:srikanth...@gmail.com] Sent: Thursday, July 16, 2015 9:36 AM To: user Subject: Re: HiveThriftServer2.startWithContext error with

RE: Python DataFrames: length of ArrayType

2015-07-15 Thread Cheng, Hao
Actually it's supposed to be part of the Spark 1.5 release, see https://issues.apache.org/jira/browse/SPARK-8230 You're definitely welcome to contribute to it; let me know if you have any questions about implementing it. Cheng Hao -Original Message- From: pedro [mailto:ski.rodrig...@gmail.com

RE: How do you access a cached Spark SQL Table from a JBDC connection?

2015-07-14 Thread Cheng, Hao
Can you describe how you cached the tables? In another HiveContext? AFAIK, a cached table is only visible within the same HiveContext, so you probably need to execute a sql query like “cache table mytable as SELECT xxx” in the JDBC connection as well. Cheng Hao From: Brandon White
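A sketch of the suggested workaround issued through the same JDBC connection that will later read the cache (it assumes the Hive JDBC driver is on the classpath; URL, credentials, and table names are placeholders):
{code}
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()

// Cache inside the Thrift Server's HiveContext so subsequent JDBC queries can see it.
stmt.execute("CACHE TABLE mytable AS SELECT * FROM some_hive_table")
val rs = stmt.executeQuery("SELECT count(*) FROM mytable")
{code}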

RE: How do you access a cached Spark SQL Table from a JBDC connection?

2015-07-14 Thread Cheng, Hao
So you’re using different HiveContext instances for the caching. We are not expected to see tables cached with another HiveContext instance. From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Wednesday, July 15, 2015 8:48 AM To: Cheng, Hao Cc: user Subject: Re: How do you

[jira] [Commented] (SPARK-8956) Rollup produces incorrect result when group by contains expressions

2015-07-12 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624121#comment-14624121 ] Cheng Hao commented on SPARK-8956: -- Sorry, I didn't notice this jira issue when I created

[jira] [Updated] (SPARK-8972) Incorrect result for rollup

2015-07-12 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-8972: - Description: {code:java} import sqlContext.implicits._ case class KeyValue(key: Int, value: String) val

[jira] [Created] (SPARK-8972) Wrong result for rollup

2015-07-09 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-8972: Summary: Wrong result for rollup Key: SPARK-8972 URL: https://issues.apache.org/jira/browse/SPARK-8972 Project: Spark Issue Type: Bug Components: SQL

[jira] [Updated] (SPARK-8972) Incorrect result for rollup

2015-07-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-8972: - Summary: Incorrect result for rollup (was: Wrong result for rollup) Incorrect result for rollup

RE: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Cheng, Hao
Never mind, I’ve created the jira issue at https://issues.apache.org/jira/browse/SPARK-8972. From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, July 10, 2015 9:15 AM To: yana.kadiy...@gmail.com; ayan guha Cc: user Subject: RE: [SparkSQL] Incorrect ROLLUP results Yes, this is a bug, do

RE: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Cheng, Hao
Yes, this is a bug; do you mind creating a jira issue for this? I will fix it asap. BTW, what’s your spark version? From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Friday, July 10, 2015 12:16 AM To: ayan guha Cc: user Subject: Re: [SparkSQL] Incorrect ROLLUP results

[jira] [Issue Comment Deleted] (SPARK-8864) Date/time function and data type design

2015-07-08 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-8864: - Comment: was deleted (was: Thanks for explanation. The design looks good to me now.) Date/time function

[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-08 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618200#comment-14618200 ] Cheng Hao commented on SPARK-8864: -- Thanks for the explanation. The design looks good to me

[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-08 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14618201#comment-14618201 ] Cheng Hao commented on SPARK-8864: -- Thanks for the explanation. The design looks good to me

[jira] [Updated] (SPARK-7119) ScriptTransform doesn't consider the output data type

2015-07-08 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Hao updated SPARK-7119: - Priority: Blocker (was: Major) ScriptTransform doesn't consider the output data type

[jira] [Created] (SPARK-8867) Show the UDF usage for user.

2015-07-07 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-8867: Summary: Show the UDF usage for user. Key: SPARK-8867 URL: https://issues.apache.org/jira/browse/SPARK-8867 Project: Spark Issue Type: Task Components

[jira] [Created] (SPARK-8883) Remove the class OverrideFunctionRegistry

2015-07-07 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-8883: Summary: Remove the class OverrideFunctionRegistry Key: SPARK-8883 URL: https://issues.apache.org/jira/browse/SPARK-8883 Project: Spark Issue Type: Improvement

RE: Hive UDFs

2015-07-07 Thread Cheng, Hao
dataframe.limit(1).selectExpr(xxx).collect()? -Original Message- From: chrish2312 [mailto:c...@palantir.com] Sent: Wednesday, July 8, 2015 6:20 AM To: user@spark.apache.org Subject: Hive UDFs I know the typical way to apply a hive UDF to a dataframe is basically something like:
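Spelled out, the suggestion above looks roughly like this; df, the UDF name, and the column are placeholders, and the Hive UDF is assumed to be already registered:
{code}
// Restrict to a single row first, then apply the Hive UDF through selectExpr.
val firstRowResult = df.limit(1).selectExpr("myUDF(someColumn)").collect()
{code}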

[jira] [Commented] (SPARK-8864) Date/time function and data type design

2015-07-07 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14617846#comment-14617846 ] Cheng Hao commented on SPARK-8864: -- A Long holds up to 2 ^ 63 - 1 ≈ 9.2E18, and the timestamp is in us

RE: HiveContext throws org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2015-07-07 Thread Cheng, Hao
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver It will usually be included in the assembly jar, so not sure what's wrong. But can you try adding the derby jar to the driver classpath and try again? -Original Message- From: bdev

[jira] [Created] (SPARK-8791) Make a better hashcode for InternalRow

2015-07-02 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-8791: Summary: Make a better hashcode for InternalRow Key: SPARK-8791 URL: https://issues.apache.org/jira/browse/SPARK-8791 Project: Spark Issue Type: Improvement

[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-07-02 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14612728#comment-14612728 ] Cheng Hao commented on SPARK-8159: -- Will it be possible to add all of the expressions

[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-07-01 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609627#comment-14609627 ] Cheng Hao commented on SPARK-8653: -- Yes, I agree that we cannot make a clear cut

[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-07-01 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609609#comment-14609609 ] Cheng Hao commented on SPARK-8653: -- For most of the Mathematical expressions, we can get

[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-07-01 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609629#comment-14609629 ] Cheng Hao commented on SPARK-8653: -- What do you think [~rxin]? Add constraint

[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-06-29 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14607195#comment-14607195 ] Cheng Hao commented on SPARK-8653: -- [~rxin] I'll agree that we need to rename the trait

[jira] [Created] (SPARK-8653) Add constraint for Children expression for data type

2015-06-26 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-8653: Summary: Add constraint for Children expression for data type Key: SPARK-8653 URL: https://issues.apache.org/jira/browse/SPARK-8653 Project: Spark Issue Type: Sub

RE: Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread Cheng, Hao
Yes, it should be with HiveContext, not SQLContext. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Tuesday, June 23, 2015 2:51 AM To: smazumder Cc: user Subject: Re: Support for Windowing and Analytics functions in Spark SQL 1.4 supports it On 23 Jun 2015 02:59, Sourav Mazumder
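A sketch of a windowing query in Spark 1.4+, which needs HiveContext rather than plain SQLContext; the table and columns are hypothetical:
{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("""
  SELECT name, dept, salary,
         rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS salary_rank
  FROM employees
""").show()
{code}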

RE: Question about SPARK_WORKER_CORES and spark.task.cpus

2015-06-22 Thread Cheng, Hao
It’s actually not that tricky. SPARK_WORKER_CORES is the max task thread pool size of the executor; in other words, one executor with 32 cores could execute 32 tasks simultaneously. Spark doesn’t care how many real physical CPUs/cores you have (the OS does), so

RE: Is HiveContext Thread Safe?

2015-06-17 Thread Cheng, Hao
Yes, it is thread safe. That’s how Spark SQL JDBC Server works. Cheng Hao From: V Dineshkumar [mailto:developer.dines...@gmail.com] Sent: Wednesday, June 17, 2015 9:44 PM To: user@spark.apache.org Subject: Is HiveContext Thread Safe? Hi, I have a HiveContext which I am using in multiple
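A minimal sketch of sharing one HiveContext across threads, which is essentially what the JDBC server does; the table name is a placeholder:
{code}
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)

// Several threads issuing queries against the same HiveContext instance.
val threads = (1 to 4).map { i =>
  new Thread(new Runnable {
    def run(): Unit = hc.sql(s"SELECT count(*) FROM some_table WHERE part = $i").collect()
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
{code}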

RE: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Cheng, Hao
Seems you're hitting the self-join case; currently Spark SQL won't cache any result/logical tree for further analysis or computation for self-joins. Since the logical tree is huge, it's reasonable that generating its tree string recursively takes a long time. And I also doubt the computation can finish

RE: 回复: Re: 回复: Re: 回复: Re: 回复: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if Spark RDD will provide an API to fetch records one by one from the final result set, instead of pulling them all (or whole partition data) into driver memory. Seems a big change. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To:

RE: 回复: Re: 回复: Re: 回复: Re: 回复: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if Spark Core will provide an API to fetch records one by one from the block manager, instead of pulling them all into driver memory. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To: 姜超才; Hester wang; user@spark.apache.org Subject: Re: 回复:

[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore

2015-06-10 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14581315#comment-14581315 ] Cheng Hao commented on SPARK-7550: -- I will start working on this today, sorry

[jira] [Commented] (SPARK-8159) Improve expression coverage

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578503#comment-14578503 ] Cheng Hao commented on SPARK-8159: -- One more question: is it possible to assign

[jira] [Commented] (SPARK-8267) string function: trim

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578559#comment-14578559 ] Cheng Hao commented on SPARK-8267: -- I am working on this. string function: trim

[jira] [Commented] (SPARK-8159) Improve expression coverage

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14578502#comment-14578502 ] Cheng Hao commented on SPARK-8159: -- Agree, it would be easier to track the progress

RE: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-09 Thread Cheng, Hao
Is it a large result set returned from the Thrift Server? And can you paste the SQL and physical plan? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, June 9, 2015 12:01 PM To: Sourav Mazumder Cc: user Subject: Re: Spark SQL with Thrift Server is very very slow and finally failing

[jira] [Commented] (SPARK-8248) string function: length

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579127#comment-14579127 ] Cheng Hao commented on SPARK-8248: -- I am working on this. string function: length

[jira] [Commented] (SPARK-8228) conditional function: isnull

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579903#comment-14579903 ] Cheng Hao commented on SPARK-8228: -- I'll take this one. conditional function: isnull

[jira] [Commented] (SPARK-8242) string function: decode

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579914#comment-14579914 ] Cheng Hao commented on SPARK-8242: -- I'll take this one. string function: decode

[jira] [Commented] (SPARK-8244) string function: find_in_set

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579916#comment-14579916 ] Cheng Hao commented on SPARK-8244: -- I'll take this one. string function: find_in_set

[jira] [Commented] (SPARK-8246) string function: get_json_object

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579918#comment-14579918 ] Cheng Hao commented on SPARK-8246: -- I'll take this one. string function

[jira] [Commented] (SPARK-8251) string function: alias upper / ucase

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579922#comment-14579922 ] Cheng Hao commented on SPARK-8251: -- I'll take this one. string function: alias upper

[jira] [Commented] (SPARK-8259) string function: rpad

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579931#comment-14579931 ] Cheng Hao commented on SPARK-8259: -- I'll take this one. string function: rpad

[jira] [Commented] (SPARK-8261) string function: space

2015-06-09 Thread Cheng Hao (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14579933#comment-14579933 ] Cheng Hao commented on SPARK-8261: -- I'll take this one. string function: space
