RE: Spark Streaming - graceful shutdown when stream has no more data

2016-02-24 Thread Cheng, Hao
This is very interesting: how to shut down a streaming job gracefully once there has been no input data for some time. One doable solution: count the input records with an Accumulator, and have another thread (on the driver node) keep polling the latest accumulator value; if there is no value
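A rough sketch of that idea against the Spark 1.x streaming APIs; the socket source, poll interval, and the idleTimeoutMs threshold are illustrative, not from the original thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc = new SparkContext("local[2]", "graceful-shutdown-demo")
    val ssc = new StreamingContext(sc, Seconds(5))
    val inputCount = sc.accumulator(0L, "inputRecords")

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD { rdd => inputCount += rdd.count() }

    ssc.start()

    // Watchdog thread: stop the context gracefully once the count stops growing.
    val idleTimeoutMs = 60000L
    new Thread {
      override def run(): Unit = {
        var last = -1L
        var idleSince = System.currentTimeMillis()
        while (true) {
          Thread.sleep(5000)
          val now = inputCount.value
          if (now != last) { last = now; idleSince = System.currentTimeMillis() }
          else if (System.currentTimeMillis() - idleSince > idleTimeoutMs) {
            ssc.stop(stopSparkContext = true, stopGracefully = true)
            return
          }
        }
      }
    }.start()

    ssc.awaitTermination()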

RE: Spark SQL joins taking too long

2016-01-27 Thread Cheng, Hao
Another possibility is the parallelism: it is probably 1 or some other small value, since the input data size is not that big. If that is the case, you can try something like: df1.repartition(10).registerTempTable("hospitals"); df2.repartition(10).registerTempTable("counties"); … And

RE: JSON to SQL

2016-01-27 Thread Cheng, Hao
Have you tried the DataFrame API, e.g. sqlContext.read.json("/path/to/file.json")? Spark SQL will auto-infer the type/schema for you. And LATERAL VIEW will help with the flattening issues, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView, as well as the "a.b[0].c"
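A minimal sketch of that combination, assuming sqlContext is actually a HiveContext (the simple SQL parser does not handle LATERAL VIEW) and an illustrative array field b of structs with a field c:

    val df = sqlContext.read.json("/path/to/file.json") // schema auto-inferred
    df.registerTempTable("t")
    // Flatten the array column b into one row per element.
    sqlContext.sql("SELECT a, item.c FROM t LATERAL VIEW explode(b) tmp AS item").show()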

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Which version are you using? Have you tried 1.6? From: Vadim Tkachenko [mailto:apache...@gmail.com] Sent: Wednesday, December 30, 2015 10:17 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Problem with WINDOW functions? When I allocate 200g to the executor, it is able to make better

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Can you try writing the result into a file instead? Let's see if there is any issue on the executor side: sqlContext.sql("SELECT day, page, dense_rank() OVER (PARTITION BY day ORDER BY pageviews DESC) as rank FROM d1").filter("rank <= 20").sort($"day", $"rank").write.parquet("/path/to/file")

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
Is there any improvement if you give the executors more memory? -Original Message- From: va...@percona.com [mailto:va...@percona.com] On Behalf Of Vadim Tkachenko Sent: Wednesday, December 30, 2015 9:51 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Problem with WINDOW

RE: Problem with WINDOW functions?

2015-12-29 Thread Cheng, Hao
s etc. will be more helpful in understanding your problem. From: Vadim Tkachenko [mailto:apache...@gmail.com] Sent: Wednesday, December 30, 2015 10:49 AM To: Cheng, Hao Subject: Re: Problem with WINDOW functions? I use 1.5.2. Where can I get 1.6? I do not see it on http://spark.apache.org/downloads.html T

RE: Does Spark SQL support rollup like HQL

2015-12-29 Thread Cheng, Hao
Hi, currently the simple SQL parser of SQLContext is quite weak and doesn't support ROLLUP, but you can check the code at https://github.com/apache/spark/pull/5080/, which aimed to add that support, in case you want to patch it into your own branch. In Spark 2.0, the simple SQL parser will
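In the meantime, a hedged workaround sketch: HiveContext's HiveQL parser does accept ROLLUP (table and column names are illustrative):

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("SELECT dept, role, count(*) FROM emp GROUP BY dept, role WITH ROLLUP").show()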

RE: Rule Engine for Spark

2015-11-04 Thread Cheng, Hao
Or try Streaming SQL, which is a simple layer on top of Spark Streaming. ☺ https://github.com/Intel-bigdata/spark-streamingsql From: Cassa L [mailto:lcas...@gmail.com] Sent: Thursday, November 5, 2015 8:09 AM To: Adrian Tanase Cc: Stefano Baghino; user Subject: Re: Rule Engine for Spark

RE: Sort Merge Join

2015-11-02 Thread Cheng, Hao
Not as far as I can tell. @Michael @YinHuai @Reynold, any comments on this optimization? From: Jonathan Coveney [mailto:jcove...@gmail.com] Sent: Tuesday, November 3, 2015 4:17 AM To: Alex Nastetsky Cc: Cheng, Hao; user Subject: Re: Sort Merge Join Additionally, I'm curious if there are any

RE: Sort Merge Join

2015-11-01 Thread Cheng, Hao
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two datasets have different numbers of partitions, but it still does a SortMerge join after a "hashpartitioning". [Hao:] A distributed JOIN operation (either HashBased or SortBased Join)

RE: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-28 Thread Cheng, Hao
Hi Jerry, I've filed a bug in JIRA, along with a fix: https://issues.apache.org/jira/browse/SPARK-11364 It would be greatly appreciated if you could verify the PR with your case. Thanks, Hao From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Wednesday, October 28, 2015 8:51 AM To: Jerry Lam

RE: SparkSQL on hive error

2015-10-27 Thread Cheng, Hao
Hi Anand, can you paste the table creation statement? I'd like to reproduce this locally first. BTW, which version are you using? Hao From: Anand Nalya [mailto:anand.na...@gmail.com] Sent: Tuesday, October 27, 2015 11:35 PM To: spark users Subject: SparkSQL on hive error Hi, I've a

RE: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Cheng, Hao
At a first glance, this seems to be a bug in Spark SQL; do you mind creating a JIRA for it? Then I can start to fix it. Thanks, Hao From: Jerry Lam [mailto:chiling...@gmail.com] Sent: Wednesday, October 28, 2015 3:13 AM To: Marcelo Vanzin Cc: user@spark.apache.org Subject: Re: [Spark-SQL]:

RE: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Cheng, Hao
I am not sure we really want to support that in HiveContext, but a workaround is to use the Spark package at https://github.com/databricks/spark-csv From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Tuesday, October 27, 2015 10:54 AM To: Daniel Haviv; user Subject: RE: HiveContext
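A hedged sketch with spark-csv, assuming the package is on the classpath (e.g. via --packages com.databricks:spark-csv_2.10:1.2.0) and an illustrative path; the header option consumes the first line as column names instead of data:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true") // skip the header line, use it for column names
      .load("/path/to/data.csv")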

RE: Hive with apache spark

2015-10-11 Thread Cheng, Hao
One option is to read the data via JDBC; however, it's probably the worst option, as you likely need some hacky work to enable parallel reading in Spark SQL. Another option is to copy the hive-site.xml of your Hive server to $SPARK_HOME/conf; then Spark SQL will see everything that Hive

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
A join B join C === (A join B) join C. Semantically they are equivalent, right? From: Richard Eggert [mailto:richard.egg...@gmail.com] Sent: Monday, October 12, 2015 5:12 AM To: Subhajit Purkayastha Cc: User Subject: Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? It's the same as joining 2.

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
Spark SQL supports a very basic join reordering optimization based on the raw table data size; this was added a couple of major releases back. And the "EXPLAIN EXTENDED <query>" command is a very informative tool for verifying whether the optimization takes effect. From: Raajay

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
hih...@gmail.com] Sent: Monday, October 12, 2015 8:37 AM To: Cheng, Hao Cc: Richard Eggert; Subhajit Purkayastha; User Subject: Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? Some weekend reading: http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative Cheers On Sun, Oct 11, 2015 a

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
You probably have to read the source code; I am not sure whether there are any .ppt files or slides. Hao From: VJ Anand [mailto:vjan...@sankia.com] Sent: Monday, October 12, 2015 11:43 AM To: Cheng, Hao Cc: Raajay; user@spark.apache.org Subject: Re: Join Order Optimization Hi - Is there a design document

RE: Join Order Optimization

2015-10-11 Thread Cheng, Hao
, October 12, 2015 10:17 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Join Order Optimization Hi Cheng, Could you point me to the JIRA that introduced this change ? Also, is this SPARK-2211 the right issue to follow for cost-based optimization? Thanks Raajay On Sun, Oct 11, 2015 at 7

RE: Insert via HiveContext is slow

2015-10-09 Thread Cheng, Hao
I think the DataFrame API performs the same as the SQL API in multi-inserts if you don't use the cached table. Hao From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: Friday, October 9, 2015 3:09 PM To: Cheng, Hao Cc: user Subject: Re: Insert via HiveContext is slow Thanks Hao

RE: Insert via HiveContext is slow

2015-10-08 Thread Cheng, Hao
I think that's a known performance issue (compared to Hive) of Spark SQL in multi-inserts. A workaround is to create a cached temp table for the projection first, and then do the multiple inserts based on that cached table. We are actually working on a POC for some similar cases; hopefully it comes
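A minimal sketch of that workaround, assuming a HiveContext and illustrative table/column names:

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    // Materialize the shared projection once, then run each insert from the cache.
    hc.sql("SELECT key, value FROM src WHERE value IS NOT NULL").registerTempTable("src_projected")
    hc.cacheTable("src_projected")
    hc.sql("INSERT INTO TABLE t1 SELECT key, value FROM src_projected WHERE key < 100")
    hc.sql("INSERT INTO TABLE t2 SELECT key, value FROM src_projected WHERE key >= 100")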

RE: Performance Spark SQL vs Dataframe API faster

2015-09-22 Thread Cheng, Hao
Yes, they should be the same: they are just different frontends, with the same optimization / execution underneath. -Original Message- From: sanderg [mailto:s.gee...@wimionline.be] Sent: Tuesday, September 22, 2015 10:06 PM To: user@spark.apache.org Subject: Performance Spark SQL vs Dataframe

RE: spark sql hook

2015-09-16 Thread Cheng, Hao
Probably a workable solution is to create your own SQLContext by extending the class HiveContext, overriding the `analyzer`, and adding your own rule to do the hacking. From: r7raul1...@163.com [mailto:r7raul1...@163.com] Sent: Thursday, September 17, 2015 11:08 AM To: Cheng, Hao; user Subject: Re
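A rough sketch against Spark 1.5-era internals; `analyzer`, `catalog`, and `functionRegistry` are protected[sql], so the subclass is declared inside an org.apache.spark.sql.* package to get access, and overriding extendedResolutionRules this way replaces the Hive defaults (all names are illustrative):

    package org.apache.spark.sql.hive

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.catalyst.analysis.Analyzer
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    object MyHookRule extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan // plug the custom rewrite in here
    }

    class MySQLContext(sc: SparkContext) extends HiveContext(sc) {
      override lazy val analyzer: Analyzer =
        new Analyzer(catalog, functionRegistry, conf) {
          override val extendedResolutionRules = MyHookRule :: Nil
        }
    }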

RE: spark sql hook

2015-09-16 Thread Cheng, Hao
Catalyst's TreeNode is a very fundamental API; I am not sure what kind of hook you need. A concrete example would be more helpful for understanding your requirement. Hao From: r7raul1...@163.com [mailto:r7raul1...@163.com] Sent: Thursday, September 17, 2015 10:54 AM To: user Subject: spark sql hook I

RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
. From: Todd [mailto:bit1...@163.com] Sent: Friday, September 11, 2015 2:17 PM To: Cheng, Hao Cc: Jesse F Chen; Michael Armbrust; user@spark.apache.org Subject: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL Thanks Hao for the reply. I turn the merge sort join off

RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-11 Thread Cheng, Hao
, September 11, 2015 3:39 PM To: Todd Cc: Cheng, Hao; Jesse F Chen; Michael Armbrust; user@spark.apache.org Subject: Re:Re:RE: Re:RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL I add the following two options: spark.sql.planner.sortMergeJoin=false

RE: spark 1.5 SQL slows down dramatically by 50%+ compared with spark 1.4.1 SQL

2015-09-10 Thread Cheng, Hao
It is not a big surprise that SMJ is slower than the HashJoin, as we do not fully utilize the sorting yet; more details can be found at https://issues.apache.org/jira/browse/SPARK-2926 . Anyway, can you disable sort merge join with "spark.sql.planner.sortMergeJoin=false" in Spark 1.5, and

RE: Driver OOM after upgrading to 1.5

2015-09-09 Thread Cheng, Hao
Would it help to add JVM options like -XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled? From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, September 10, 2015 5:31 AM To: Sandy Ryza Cc: user@spark.apache.org Subject: Re: Driver OOM after upgrading to 1.5 It's

RE: Re: Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-30 Thread Cheng, Hao
Hi, can you try something like:

    val rowRDD = sc.textFile("/user/spark/short_model").map { line =>
      val p = line.split("\t")
      if (p.length >= 72) {
        Row(p(0), p(1) /* … */)
      } else {
        throw new RuntimeException(s"failed in parsing $line")
      }
    }

From the log

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
OK, I see; thanks for the correction, but this should be optimized. From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, August 25, 2015 2:08 PM To: Cheng, Hao Cc: Jeff Zhang; user@spark.apache.org Subject: Re: DataFrame#show cost 2 Spark Jobs ? That's two jobs. `SparkPlan.executeTake

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-25 Thread Cheng, Hao
Oh, sorry, I misread your reply! I know the minimum number of tasks will be 2 for scanning, but Jeff is talking about 2 jobs, not 2 tasks. From: Shixiong Zhu [mailto:zsxw...@gmail.com] Sent: Tuesday, August 25, 2015 1:29 PM To: Cheng, Hao Cc: Jeff Zhang; user@spark.apache.org Subject: Re: DataFrame

RE: Spark thrift server on yarn

2015-08-25 Thread Cheng, Hao
Did you register the temp table via beeline or in a new Spark SQL CLI? As far as I know, a temp table cannot cross HiveContext instances. Hao From: Udit Mehta [mailto:ume...@groupon.com] Sent: Wednesday, August 26, 2015 8:19 AM To: user Subject: Spark thrift server on yarn Hi, I am trying to start a

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
The first job is to infer the JSON schema, and the second one is the query you actually mean. You can provide the schema while loading the JSON file, like below: sqlContext.read.schema(xxx).json("…") Hao From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Monday, August 24, 2015 6:20 PM To:
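A minimal sketch of providing the schema explicitly, which avoids the extra inference job (the field names are illustrative):

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))
    val df = sqlContext.read.schema(schema).json("/path/to/file.json") // one job only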

RE: Loading already existing tables in spark shell

2015-08-24 Thread Cheng, Hao
And be sure hive-site.xml is on the classpath or under $SPARK_HOME/conf. Hao From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in] Sent: Monday, August 24, 2015 8:57 PM To: user Subject: Re: Loading already existing tables in spark shell Hi Jeetendra, I faced

RE: DataFrame#show cost 2 Spark Jobs ?

2015-08-24 Thread Cheng, Hao
loading the data for JSON probably causes a longer ramp-up time with a large number of files/partitions. From: Jeff Zhang [mailto:zjf...@gmail.com] Sent: Tuesday, August 25, 2015 8:11 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: DataFrame#show cost 2 Spark Jobs ? Hi Cheng, I

RE: Test case for the spark sql catalyst

2015-08-24 Thread Cheng, Hao
Yes, check the source code under: https://github.com/apache/spark/tree/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst From: Todd [mailto:bit1...@163.com] Sent: Tuesday, August 25, 2015 1:01 PM To: user@spark.apache.org Subject: Test case for the spark sql catalyst Hi, Are

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false. BTW, which version are you using? Hao From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 12:16 PM To: Philip Weaver Cc: user Subject: Re: Spark Sql behaves strangely with tables with

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
20, 2015 1:46 PM To: Cheng, Hao Cc: Philip Weaver; user Subject: Re: Spark Sql behaves strangely with tables with a lot of partitions I cloned from TOT after 1.5.0 cut off. I noticed there were a couple of CLs trying to speed up spark sql with tables with a huge number of partitions, I've made

RE: Refresh table

2015-08-11 Thread Cheng, Hao
Refreshing a table only works for Spark SQL data sources, in my understanding; apparently you're running against a Hive table here. Can you try creating a table like: CREATE TEMPORARY TABLE parquetTable (a int, b string) USING org.apache.spark.sql.parquet.DefaultSource
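A hedged completion of that statement under the 1.3+ data source syntax (the path is illustrative):

    sqlContext.sql("""CREATE TEMPORARY TABLE parquetTable (a int, b string)
                      USING org.apache.spark.sql.parquet.DefaultSource
                      OPTIONS (path "/path/to/data.parquet")""")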

RE: Spark DataFrames uses too many partition

2015-08-11 Thread Cheng, Hao
That's a good question. We don't support reading many small files in a single partition yet, but it's definitely an issue we need to optimize; do you mind creating a JIRA issue for this? Hopefully we can merge it in the 1.6 release. 200 is the default partition number for parallel tasks after the

RE: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-11 Thread Cheng, Hao
Definitely worth a try. And you can sort the records before writing them out; then you will get Parquet files without overlapping keys. Let us know if that helps. Hao From: Philip Weaver [mailto:philip.wea...@gmail.com] Sent: Wednesday, August 12, 2015 4:05 AM To: Cheng Lian Cc: user

RE: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-15 Thread Cheng, Hao
Have you ever tried querying "select * from temp_table" from the Spark shell? Or can you try the --jars option while starting the Spark shell? From: Srikanth [mailto:srikanth...@gmail.com] Sent: Thursday, July 16, 2015 9:36 AM To: user Subject: Re: HiveThriftServer2.startWithContext error with

RE: Python DataFrames: length of ArrayType

2015-07-15 Thread Cheng, Hao
Actually it's supposed to be part of the Spark 1.5 release; see https://issues.apache.org/jira/browse/SPARK-8230 You're definitely welcome to contribute to it; let me know if you have any questions about implementing it. Cheng Hao -Original Message- From: pedro [mailto:ski.rodrig...@gmail.com

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
Can you describe how you cached the tables? In another HiveContext? AFAIK, a cached table is only visible within the same HiveContext; you probably need to execute a statement like "CACHE TABLE mytable AS SELECT xxx" over the JDBC connection as well. Cheng Hao From: Brandon White

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
So you're using different HiveContext instances for the caching. We do not expect tables cached with one HiveContext instance to be visible from another. From: Brandon White [mailto:bwwintheho...@gmail.com] Sent: Wednesday, July 15, 2015 8:48 AM To: Cheng, Hao Cc: user Subject: Re: How do you

RE: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Cheng, Hao
Never mind, I've created the JIRA issue at https://issues.apache.org/jira/browse/SPARK-8972. From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, July 10, 2015 9:15 AM To: yana.kadiy...@gmail.com; ayan guha Cc: user Subject: RE: [SparkSQL] Incorrect ROLLUP results Yes, this is a bug, do

RE: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Cheng, Hao
Yes, this is a bug; do you mind creating a JIRA issue for it? I will fix it ASAP. BTW, what's your Spark version? From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Friday, July 10, 2015 12:16 AM To: ayan guha Cc: user Subject: Re: [SparkSQL] Incorrect ROLLUP results

RE: Hive UDFs

2015-07-07 Thread Cheng, Hao
dataframe.limit(1).selectExpr(xxx).collect()? -Original Message- From: chrish2312 [mailto:c...@palantir.com] Sent: Wednesday, July 8, 2015 6:20 AM To: user@spark.apache.org Subject: Hive UDFs I know the typical way to apply a hive UDF to a dataframe is basically something like:
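A hedged expansion of that one-liner, assuming a HiveContext `hc`, a DataFrame `df`, and a hypothetical UDF class com.example.MyUDF:

    hc.sql("CREATE TEMPORARY FUNCTION myUdf AS 'com.example.MyUDF'") // hypothetical UDF class
    df.limit(1).selectExpr("myUdf(someColumn)").collect() // apply the Hive UDF to a single row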

RE: HiveContext throws org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

2015-07-07 Thread Cheng, Hao
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.derby.jdbc.EmbeddedDriver It is usually included in the assembly jar, so I am not sure what's wrong. But can you try adding the Derby jar to the driver classpath and trying again? -Original Message- From: bdev

RE: Support for Windowing and Analytics functions in Spark SQL

2015-06-22 Thread Cheng, Hao
Yes, it should be used with HiveContext, not SQLContext. From: ayan guha [mailto:guha.a...@gmail.com] Sent: Tuesday, June 23, 2015 2:51 AM To: smazumder Cc: user Subject: Re: Support for Windowing and Analytics functions in Spark SQL 1.4 supports it On 23 Jun 2015 02:59, Sourav Mazumder

RE: Question about SPARK_WORKER_CORES and spark.task.cpus

2015-06-22 Thread Cheng, Hao
It's actually not that tricky. SPARK_WORKER_CORES is the max size of the executor's task thread pool; in other words, one executor with 32 cores can execute 32 tasks simultaneously. Spark doesn't care how many real physical CPUs/cores you have (the OS does), so
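A hedged illustration of how the two settings interact (the values are illustrative):

    // With SPARK_WORKER_CORES=32 in conf/spark-env.sh and spark.task.cpus=2,
    // the scheduler runs at most 32 / 2 = 16 tasks concurrently per executor.
    val conf = new org.apache.spark.SparkConf().set("spark.task.cpus", "2")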

RE: Is HiveContext Thread Safe?

2015-06-17 Thread Cheng, Hao
Yes, it is thread safe. That’s how Spark SQL JDBC Server works. Cheng Hao From: V Dineshkumar [mailto:developer.dines...@gmail.com] Sent: Wednesday, June 17, 2015 9:44 PM To: user@spark.apache.org Subject: Is HiveContext Thread Safe? Hi, I have a HiveContext which I am using in multiple

RE: generateTreeString causes huge performance problems on dataframe persistence

2015-06-17 Thread Cheng, Hao
It seems you're hitting the self-join; currently Spark SQL won't cache any result/logical tree for further analysis or computation of a self-join. Since the logical tree is huge, it's reasonable that generating its tree string recursively takes a long time. And I also doubt the computation can finish

RE: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if the Spark RDD will provide an API to fetch records one by one from the final result set, instead of pulling them all (or whole partitions) into the driver memory. Seems like a big change. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To:

RE: Met OOM when fetching more than 1,000,000 rows.

2015-06-12 Thread Cheng, Hao
Not sure if Spark Core will provide an API to fetch records one by one from the block manager, instead of pulling them all into the driver memory. From: Cheng Lian [mailto:l...@databricks.com] Sent: Friday, June 12, 2015 3:51 PM To: 姜超才; Hester wang; user@spark.apache.org Subject: Re:

RE: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-09 Thread Cheng, Hao
Is it the large result set returned from the Thrift Server? And can you paste the SQL and the physical plan? From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Tuesday, June 9, 2015 12:01 PM To: Sourav Mazumder Cc: user Subject: Re: Spark SQL with Thrift Server is very very slow and finally failing

RE: SparkSQL : using Hive UDF returning Map throws Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Cheng, Hao
Confirmed: with the latest master, we don't support complex data types for simple Hive UDFs. Do you mind filing an issue in JIRA? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, June 5, 2015 12:35 PM To: ogoh; user@spark.apache.org Subject: RE: SparkSQL : using

RE: SparkSQL : using Hive UDF returning Map throws Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-04 Thread Cheng, Hao
Which version of the Hive jar are you using? Hive 0.13.1 or Hive 0.12.0? -Original Message- From: ogoh [mailto:oke...@gmail.com] Sent: Friday, June 5, 2015 10:10 AM To: user@spark.apache.org Subject: SparkSQL : using Hive UDF returning Map throws Error: scala.MatchError: interface

RE: Pointing SparkSQL to existing Hive Metadata with data file locations in HDFS

2015-05-27 Thread Cheng, Hao
Yes, but be sure you put hive-site.xml on your classpath. Did you run into any problem? Cheng Hao From: Sanjay Subramanian [mailto:sanjaysubraman...@yahoo.com.INVALID] Sent: Thursday, May 28, 2015 8:53 AM To: user Subject: Pointing SparkSQL to existing Hive Metadata with data file locations

RE: SparkSQL errors in 1.4 rc when using with Hive 0.12 metastore

2015-05-24 Thread Cheng, Hao
Thanks for reporting this. We intend to support multiple metastore versions in a single build (hive-0.13.1) by introducing the IsolatedClientLoader, but you're probably hitting a bug; please file a JIRA issue for it. I will keep investigating this as well. Hao From: Mark Hamstra

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Did you forget to import the implicit functions/classes? import sqlContext.implicits._ From: Rajdeep Dua [mailto:rajdeep@gmail.com] Sent: Monday, May 18, 2015 8:08 AM To: user@spark.apache.org Subject: InferredSchema Example in Spark-SQL Hi All, Was trying the inferred schema Spark example

RE: InferredSchema Example in Spark-SQL

2015-05-17 Thread Cheng, Hao
Typo? Should be .toDF(), not .toRD() From: Ram Sriharsha [mailto:sriharsha@gmail.com] Sent: Monday, May 18, 2015 8:31 AM To: Rajdeep Dua Cc: user Subject: Re: InferredSchema Example in Spark-SQL you mean toDF() ? (toDF converts the RDD to a DataFrame, in this case inferring schema from the

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Cheng, Hao
Spark SQL just takes JDBC as a new data source, the same way it supports loading data from a .csv or .json file. From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 2:30 PM To: User Subject: What's the advantage features of Spark SQL(JDBC) Hi All, Comparing

RE: question about sparksql caching

2015-05-15 Thread Cheng, Hao
You probably can try something like:

    val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key = T2.key group by c1")
    df.cache() // cache the result, but it's a lazy execution
    df.registerTempTable("my_result")
    sqlContext.sql("select * from my_result where c1 = 1").collect // the cache

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-15 Thread Cheng, Hao
Yes. From: Yi Zhang [mailto:zhangy...@yahoo.com] Sent: Friday, May 15, 2015 2:51 PM To: Cheng, Hao; User Subject: Re: What's the advantage features of Spark SQL(JDBC) @Hao, As you said, there is no advantage feature for JDBC, it just provides unified api to support different data sources

RE: sparksql running slow while joining 2 tables.

2015-05-05 Thread Cheng, Hao
, Hao; Wang, Daoyuan; Olivier Girardot; user Subject: Re: sparksql running slow while joining 2 tables. Hi guys, attached the pictures of the physical plan and logs. Thanks. Thanks & Best regards! 罗辉 San.Luo - Original Message - From: Cheng, Hao hao.ch

RE: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Can you print out the physical plan? EXPLAIN SELECT xxx… From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Monday, May 4, 2015 9:08 PM To: Olivier Girardot; user Subject: Re: sparksql running slow while joining 2 tables. hi Olivier, spark 1.3.1 with java 1.8.0.45, and added 2 pics

Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
I assume you're using the DataFrame API within your application: sql("SELECT…").explain(true) From: Wang, Daoyuan Sent: Tuesday, May 5, 2015 10:16 AM To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user Subject: RE: sparksql running slow while joining 2 tables. You can use

RE: 回复:Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Or, have you ever tried a broadcast join? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, May 5, 2015 8:33 AM To: luohui20...@sina.com; Olivier Girardot; user Subject: RE: sparksql running slow while joining 2 tables. Can you print out the physical plan? EXPLAIN SELECT xxx

RE: Re: problem with spark thrift server

2015-04-23 Thread Cheng, Hao
Hi, can you describe a little bit how the ThriftServer crashed, or the steps to reproduce it? It's probably a bug in the ThriftServer. Thanks, From: guoqing0...@yahoo.com.hk [mailto:guoqing0...@yahoo.com.hk] Sent: Friday, April 24, 2015 9:55 AM To: Arush Kharbanda Cc: user Subject: Re: Re: problem

RE: Spark Average

2015-04-06 Thread Cheng, Hao
The DataFrame API should be perfectly helpful in this case. https://spark.apache.org/docs/1.3.0/sql-programming-guide.html A code snippet would look like: val sqlContext = new org.apache.spark.sql.SQLContext(sc) // this is used to implicitly convert an RDD to a DataFrame. import
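A minimal sketch of computing a per-key average with the 1.3-era DataFrame API; the Record case class and the sample values are illustrative:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.avg

    case class Record(name: String, score: Double)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()

    val df = sc.parallelize(Seq(Record("a", 1.0), Record("a", 3.0), Record("b", 2.0))).toDF()
    df.groupBy("name").agg(avg("score")).show() // average score per name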

RE: Spark SQL. Memory consumption

2015-04-02 Thread Cheng, Hao
, but that's still ongoing. Cheng Hao From: Masf [mailto:masfwo...@gmail.com] Sent: Thursday, April 2, 2015 11:47 PM To: user@spark.apache.org Subject: Spark SQL. Memory consumption Hi. I'm using Spark SQL 1.2. I have this query: CREATE TABLE test_MA STORED AS PARQUET AS SELECT

RE: Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread Cheng, Hao
This is a very interesting issue. The root cause of the lower performance is probably that, in a Scala UDF, Spark SQL converts the data from the internal representation to the Scala representation via Scala reflection, recursively. Can you create a JIRA issue for tracking this? I can start to work on

RE: Spark SQL Self join with aggregate

2015-03-19 Thread Cheng, Hao
Not so sure about your intention, but something like: SELECT sum(val1), sum(val2) FROM table GROUP BY src, dest? -Original Message- From: Shailesh Birari [mailto:sbirar...@gmail.com] Sent: Friday, March 20, 2015 9:31 AM To: user@spark.apache.org Subject: Spark SQL Self join with aggregate

RE: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-18 Thread Cheng, Hao
It seems the elasticsearch-hadoop project was built with an old version of Spark and you then upgraded the Spark version in the execution env. As far as I know, StructField changed its definition in Spark 1.2; can you confirm the version problem first? From: Todd Nist [mailto:tsind...@gmail.com] Sent:

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread Cheng, Hao
Or you need to specify the jars, either in the configuration or via bin/spark-sql --jars mysql-connector-xx.jar From: fightf...@163.com [mailto:fightf...@163.com] Sent: Monday, March 16, 2015 2:04 PM To: sandeep vura; Ted Yu Cc: user Subject: Re: Re: Unable to instantiate

RE: Re: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-03-16 Thread Cheng, Hao
It doesn't take effect if you just put the jar files under the lib-managed/jars folder; you need to put them on the classpath explicitly. From: sandeep vura [mailto:sandeepv...@gmail.com] Sent: Monday, March 16, 2015 2:21 PM To: Cheng, Hao Cc: fightf...@163.com; Ted Yu; user Subject: Re: Re: Unable

RE: Spark SQL using Hive metastore

2015-03-11 Thread Cheng, Hao
Check the configuration file $SPARK_HOME/conf/spark-xxx.conf? Cheng Hao From: Grandl Robert [mailto:rgra...@yahoo.com.INVALID] Sent: Thursday, March 12, 2015 5:07 AM To: user@spark.apache.org Subject: Spark SQL using Hive metastore Hi guys, I am a newbie in running Spark SQL / Spark. My goal

RE: Does any one know how to deploy a custom UDAF jar file in SparkSQL?

2015-03-10 Thread Cheng, Hao
You can add the additional jar when submitting your job, something like: ./bin/spark-submit --jars xx.jar … More options can be listed by just typing ./bin/spark-submit From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 10, 2015 8:48 PM To: user@spark.apache.org Subject: Does

RE: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread Cheng, Hao
Currently, Spark SQL doesn't provide an interface for developing custom UDTFs, but it works seamlessly with Hive UDTFs. I am working on the UDTF refactoring for Spark SQL, and hopefully will provide a Hive-independent UDTF soon after that. From: shahab [mailto:shahab.mok...@gmail.com] Sent:

RE: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread Cheng, Hao
/pull/3247 From: shahab [mailto:shahab.mok...@gmail.com] Sent: Wednesday, March 11, 2015 1:44 AM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re: Registering custom UDAFs with HiveContext in SparkSQL, how? Thanks Hao, But my question concerns UDAF (user defined aggregation function), not UDTF

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Cheng, Hao
I am not so sure Hive supports changing the metastore after initialization; I guess not. Spark SQL relies entirely on the Hive metastore in HiveContext, which is probably why it doesn't work as expected for Q1. BTW, in most cases, people configure the metastore settings in hive-site.xml, and will not

RE: SQL with Spark Streaming

2015-03-10 Thread Cheng, Hao
Intel has a prototype for doing this; SaiSai and Jason are the authors. You can probably ask them for some materials. From: Mohit Anchlia [mailto:mohitanch...@gmail.com] Sent: Wednesday, March 11, 2015 8:12 AM To: user@spark.apache.org Subject: SQL with Spark Streaming Does Spark Streaming also

RE: Connection PHP application to Spark Sql thrift server

2015-03-05 Thread Cheng, Hao
Can you run the query against Hive itself? Let's first confirm whether it's a SparkSQL bug exposed by your PHP code. -Original Message- From: fanooos [mailto:dev.fano...@gmail.com] Sent: Thursday, March 5, 2015 4:57 PM To: user@spark.apache.org Subject: Connection PHP application to Spark Sql thrift server We

RE: Does SparkSQL support ..... having count (fieldname) in SQL statement?

2015-03-04 Thread Cheng, Hao
I've tried with the latest code, and it seems to work; which version are you using, Shahab? From: yana [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, March 4, 2015 8:47 PM To: shahab; user@spark.apache.org Subject: RE: Does SparkSQL support . having count (fieldname) in SQL statement? I think the

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Can you provide the detailed failure call stack? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 3:52 PM To: user@spark.apache.org Subject: Supporting Hive features in Spark SQL Thrift JDBC server Hi, According to Spark SQL documentation, Spark SQL supports the

RE: SparkSQL, executing an OR

2015-03-03 Thread Cheng, Hao
Using where('age >= 10 || 'age <= 4) instead. -Original Message- From: Guillermo Ortiz [mailto:konstt2...@gmail.com] Sent: Tuesday, March 3, 2015 5:14 PM To: user Subject: SparkSQL, executing an OR I'm trying to execute a query with Spark. (Example from the Spark Documentation) val teenagers

RE: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread Cheng, Hao
Hive UDFs are only applicable to HiveContext and its subclass instances; is the CassandraAwareSQLContext a direct subclass of HiveContext or of SQLContext? From: shahab [mailto:shahab.mok...@gmail.com] Sent: Tuesday, March 3, 2015 5:10 PM To: Cheng, Hao Cc: user@spark.apache.org Subject: Re

RE: insert Hive table with RDD

2015-03-03 Thread Cheng, Hao
Use the SchemaRDD / DataFrame API via HiveContext. Assuming you're using the latest code, something probably like:

    val hc = new HiveContext(sc)
    import hc.implicits._
    existedRdd.toDF().insertInto("hivetable")

or

    existedRdd.toDF().registerTempTable("mydata")
    hc.sql("insert into hivetable as select xxx

RE: java.lang.IncompatibleClassChangeError when using PrunedFilteredScan

2015-03-03 Thread Cheng, Hao
As the call stack shows, the MongoDB connector is not compatible with the Spark SQL Data Source interface. The Data Source API changed in 1.2; you probably need to confirm which Spark version the MongoDB connector was built against. By the way, a well-formatted call stack would be more

RE: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-03 Thread Cheng, Hao
” while starting the spark shell. From: Anusha Shamanur [mailto:anushas...@gmail.com] Sent: Wednesday, March 4, 2015 5:07 AM To: Cheng, Hao Subject: Re: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory Hi, I am getting

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
instance. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, March 3, 2015 7:56 AM To: Cheng, Hao; user Subject: RE: Is SQLContext thread-safe? Thanks for the response. Then I have another question: when will we want to create multiple SQLContext instances

RE: Spark SQL Thrift Server start exception : java.lang.ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory

2015-03-02 Thread Cheng, Hao
Copy these jars into $SPARK_HOME/lib/: datanucleus-api-jdo-3.2.6.jar, datanucleus-core-3.2.10.jar, datanucleus-rdbms-3.2.9.jar. See https://github.com/apache/spark/blob/master/bin/compute-classpath.sh#L120 -Original Message- From: fanooos [mailto:dev.fano...@gmail.com] Sent: Tuesday,

RE: Executing hive query from Spark code

2015-03-02 Thread Cheng, Hao
I am not so sure how Spark SQL is compiled in CDH, but if the -Phive and -Phive-thriftserver flags were not specified during the build, most likely it will not work just by providing the Hive lib jars later on. For example, does the HiveContext class exist in the assembly jar? I am also quite

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
https://issues.apache.org/jira/browse/SPARK-2087 https://github.com/apache/spark/pull/4382 I am working on the prototype; it will be updated soon. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Tuesday, March 3, 2015 8:32 AM To: Cheng, Hao; user Subject: RE

RE: Performance tuning in Spark SQL.

2015-03-02 Thread Cheng, Hao
This is actually a quite open question. From my understanding, there are probably several ways to tune, like:

* SQL configurations, e.g.:

    Configuration Key                         Default Value
    spark.sql.autoBroadcastJoinThreshold      10 * 1024 * 1024
    spark.sql.defaultSizeInBytes              10 * 1024 * 1024 + 1
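A hedged example of adjusting one of these knobs from code (the 20 MB value is illustrative):

    // Broadcast tables up to 20 MB instead of the 10 MB default.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)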

RE: Is SQLContext thread-safe?

2015-03-02 Thread Cheng, Hao
Yes it is thread safe, at least it's supposed to be. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4:43 PM To: user Subject: Is SQLContext thread-safe? Hi, is it safe to use the same SQLContext to do Select operations in different threads

JLine hangs under Windows8

2015-02-27 Thread Cheng, Hao
$.main(SparkSQLCLIDriver.scala:202) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) Thanks, Cheng Hao
