Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
What’s the command line you used to build Spark? Notice that you need to add |-Phive-thriftserver| to build the JDBC Thrift server. This profile was once removed in v1.1.0, but added back in v1.2.0 because of a dependency issue introduced by Scala 2.11 support. On 11/27/14 12:53 AM,

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
What version are you trying to build? I was at first assuming you're using the most recent master, but from your first mail it seems that you were trying to build Spark v1.1.0? On 11/27/14 12:57 PM, vdiwakar.malladi wrote: Thanks for your response. I'm using the following command. mvn

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
Hm, then the command line you used should be fine. Actually, I just tried it locally and it’s fine. Make sure to run it in the root directory of the Spark source tree (don’t |cd| into assembly). On 11/27/14 1:35 PM, vdiwakar.malladi wrote: Yes, I'm building it from Spark 1.1.0 Thanks in advance.

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
I see. As the exception states, Maven can’t find |unzip| to help build PySpark. So you need a Windows version of |unzip| (probably from MinGW or Cygwin?) On 11/27/14 2:10 PM, vdiwakar.malladi wrote: Thanks for your prompt responses. I'm generating assembly jar file from windows 7

Re: Spark SQL table Join, one task is taking long

2014-12-03 Thread Cheng Lian
Hey Venkat, This behavior seems reasonable. According to the table names, I guess |DAgents| should be the fact table and |ContactDetails| the dimension table. Below is an explanation of a similar query; you may read |src| as |DAgents| and |src1| as |ContactDetails|. |0:

Re: SQL query in scala API

2014-12-03 Thread Cheng Lian
You may do this: |table("users").groupBy('zip)('zip, count('user), countDistinct('user)) | On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API,

Re: Window function by Spark SQL

2014-12-04 Thread Cheng Lian
Window functions are not supported yet, but there is a PR for it: https://github.com/apache/spark/pull/2953 On 12/5/14 12:22 PM, Dai, Kevin wrote: Hi, ALL How can I group by one column and order by another one, then select the first row for each group (which is just like window function

Re: SQL query in scala API

2014-12-05 Thread Cheng Lian
++ seen1) }).mapValues { case (count, seen) => (count, seen.size) } | On 12/5/14 3:47 AM, Arun Luthra wrote: Is that Spark SQL? I'm wondering if it's possible without Spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote: You may do
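
The snippet above is truncated by the archive; a minimal, self-contained sketch of the same idea with plain PairRDDFunctions (no Spark SQL) might look like the following, where the `users` RDD of `(zip, user)` pairs is a hypothetical example input:

```scala
// Hedged sketch: per-zip COUNT(user) and COUNT(DISTINCT user) without Spark SQL.
val users = sc.parallelize(Seq(
  ("94103", "alice"), ("94103", "bob"), ("94103", "alice"), ("10001", "carol")))

val counts = users
  .aggregateByKey((0L, Set.empty[String]))(
    // Within a partition: bump the row count and remember distinct users.
    { case ((count, seen), user) => (count + 1, seen + user) },
    // Across partitions: add counts and union the distinct-user sets.
    { case ((count1, seen1), (count2, seen2)) => (count1 + count2, seen1 ++ seen2) })
  .mapValues { case (count, seen) => (count, seen.size) }

counts.collect().foreach { case (zip, (cnt, distinct)) =>
  println(s"$zip\tcount=$cnt\tdistinct=$distinct")
}
```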

Re: Spark SQL: How to get the hierarchical element with SQL?

2014-12-07 Thread Cheng Lian
You may access it via something like |SELECT filterIp.element FROM tb|, just like in Hive. Or if you’re using the Spark SQL DSL, you can use |tb.select("filterIp.element".attr)|. On 12/8/14 1:08 PM, Xuelin Cao wrote: Hi, I'm generating a Spark SQL table from an offline Json file. The
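
A minimal sketch of the SQL route, assuming a spark-shell-style `sc`/`sqlContext` and a hypothetical JSON layout with a nested `filterIp.element` field:

```scala
// Hedged sketch: dot notation reaches into nested fields, just like in Hive.
val json = sc.parallelize(Seq(
  """{"filterIp": {"element": "10.0.0.1"}, "host": "web-1"}""",
  """{"filterIp": {"element": "10.0.0.2"}, "host": "web-2"}"""))

val tb = sqlContext.jsonRDD(json)   // schema is inferred from the JSON documents
tb.registerTempTable("tb")

sqlContext.sql("SELECT filterIp.element FROM tb").collect().foreach(println)
```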

Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
Essentially, the Spark SQL JDBC Thrift server is just a Spark port of HiveServer2. You don't need to run Hive, but you do need a working Metastore. On 12/9/14 3:59 PM, Anas Mosaad wrote: Thanks Judy, this is exactly what I'm looking for. However, and plz forgive me if it's a dump question is:

Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
(0.106 seconds)/ /0: jdbc:hive2://localhost:1 / Kindly advice, what am I missing? I want to read the RDD using SQL from outside spark-shell (i.e. like any other relational database) On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote

Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
) On Tue, Dec 9, 2014 at 11:44 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: How did you register the table under spark-shell? Two things to notice: 1. To interact with Hive, HiveContext instead of SQLContext must be used. 2. `registerTempTable` doesn't

Re: Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

2014-12-11 Thread Cheng Lian
There are several overloaded versions of both |jsonFile| and |jsonRDD|. Schema inferring is kinda expensive since it requires an extra Spark job. You can avoid schema inferring by storing the inferred schema and then use it together with the following two methods: * |def jsonFile(path:
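
A hedged sketch of the reuse pattern described above (the path is a placeholder): infer the schema once, keep it, and pass it back in so later loads skip the inference job:

```scala
val path = "hdfs:///data/events.json"   // hypothetical input path

// One schema-inferring pass (runs an extra Spark job).
val inferredSchema = sqlContext.jsonFile(path).schema

// Later loads reuse the stored schema, so no inference job is triggered.
val events = sqlContext.jsonFile(path, inferredSchema)
events.registerTempTable("events")
```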

Re: Spark SQL API Doc IsCached as SQL command

2014-12-12 Thread Cheng Lian
There isn’t a SQL statement that directly maps to |SQLContext.isCached|, but you can use |EXPLAIN EXTENDED| to check whether the underlying physical plan is an |InMemoryColumnarTableScan|. On 12/13/14 7:14 AM, Judy Nash wrote: Hello, Few questions on Spark SQL: 1) Does Spark SQL support
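
A minimal sketch of that check, assuming a HiveContext named `hiveContext` and a registered table `logs` (both hypothetical):

```scala
hiveContext.cacheTable("logs")

// When the table is cached, the printed physical plan contains InMemoryColumnarTableScan.
hiveContext.sql("EXPLAIN EXTENDED SELECT * FROM logs").collect().foreach(println)

// The programmatic check remains the most direct one.
println(hiveContext.isCached("logs"))
```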

Re: [SPARK-SQL]how to run cache command with Running the Thrift JDBC/ODBC server

2014-12-19 Thread Cheng Lian
It seems that the Thrift server you connected to is the original HiveServer2 rather than Spark SQL HiveThriftServer2. On 12/19/14 4:08 PM, jeanlyn92 wrote: when i run the *cache table as *in the beeline which communicate with the thrift server i got the follow error: 14/12/19 15:57:05 ERROR

Re: spark-sql with join terribly slow.

2014-12-21 Thread Cheng Lian
secs. Hari On Wed, Dec 17, 2014 at 10:09 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: What kinds are the tables underlying the SchemaRDDs? Could you please provide the DDL of the tables and the query you executed? On 12/18/14 6:15 AM

Re: Querying registered RDD (AsTable) using JDBC

2014-12-21 Thread Cheng Lian
Evert - Thanks for the instructions, this is generally useful in other scenarios, but I think this isn’t what Shahab needs, because |saveAsTable| actually saves the contents of the SchemaRDD into Hive. Shahab - As Michael has answered in another thread, you may try

Re: Spark SQL DSL for joins?

2014-12-21 Thread Cheng Lian
On 12/17/14 1:43 PM, Jerry Raj wrote: Hi, I'm using the Scala DSL for Spark SQL, but I'm not able to do joins. I have two tables (backed by Parquet files) and I need to do a join across them using a common field (user_id). This works fine using standard SQL but not using the

Re: SparkSQL 1.2.1-snapshot Left Join problem

2014-12-21 Thread Cheng Lian
Could you please file a JIRA together with the Git commit you're using? Thanks! On 12/18/14 2:32 AM, Hao Ren wrote: Hi, When running SparkSQL branch 1.2.1 on EC2 standalone cluster, the following query does not work: create table debug as select v1.* from t1 as v1 left join t2 as v2 on

Re: integrating long-running Spark jobs with Thriftserver

2014-12-21 Thread Cheng Lian
Hi Schweichler, This is an interesting and practical question. I'm not familiar with how Tableau works, but would like to share some thoughts. In general, big data analytics frameworks like MR and Spark tend to perform immutable functional transformations over immutable data. Whilst in your

Re: Spark SQL 1.2 with CDH 4, Hive UDF is not working.

2014-12-22 Thread Cheng Lian
Hi Ji, Spark SQL 1.2 only works with either Hive 0.12.0 or 0.13.1 due to Hive API/protocol compatibility issues. When interacting with Hive 0.11.x, connections and simple queries may succeed, but things may go crazy in unexpected corners (like UDF). Cheng On 12/22/14 4:15 PM, Ji ZHANG

Re: How to export data from hive into hdfs in spark program?

2014-12-23 Thread Cheng Lian
This depends on which output format you want. For Parquet, you can simply do this: |hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file") | On 12/23/14 5:22 PM, LinQili wrote: Hi Leo: Thanks for your reply. I am talking about using hive from spark to export data from

Re: SchemaRDD.sample problem

2014-12-23 Thread Cheng Lian
Here is a more cleaned-up version, which can be used in |./sbt/sbt hive/console| to easily reproduce this issue: |sql("SELECT * FROM src WHERE key % 2 = 0"). sample(withReplacement = false, fraction = 0.05). registerTempTable("sampled") println(table("sampled").queryExecution) val query = sql("SELECT

Re: Spark SQL job block when use hive udf from_unixtime

2014-12-23 Thread Cheng Lian
Could you please provide a complete stacktrace? Also it would be good if you can share your hive-site.xml as well. On 12/23/14 4:42 PM, Dai, Kevin wrote: Hi, there When I use hive udf from_unixtime with the HiveContext, the job block and the log is as follow: sun.misc.Unsafe.park(Native

Re: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Cheng Lian
Generally you can use |-Dsun.io.serialization.extendedDebugInfo=true| to enable serialization debugging information when serialization exceptions are raised. On 12/24/14 1:32 PM, bigdata4u wrote: I am trying to use sql over Spark streaming using Java. But i am getting Serialization
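
A hedged sketch of wiring that flag in (the app name is hypothetical). Note that for the driver JVM the flag must usually be set before launch, e.g. via `spark-submit --driver-java-options`, since a SparkConf set in code is read after the driver has already started:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-sql-app")
  // Executors launched by this application get extended serialization debug info.
  .set("spark.executor.extraJavaOptions",
       "-Dsun.io.serialization.extendedDebugInfo=true")
```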

Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-24 Thread Cheng Lian
Hao and Lam - I think the issue here is that |registerRDDAsTable| only creates a temporary table, which is not seen by Hive metastore. And Michael had once given a workaround for creating external Parquet table:

Re: got ”org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80“ from hive metastroe service when I use show tables command in spark-sql shell

2014-12-24 Thread Cheng Lian
Hi Roc, Spark SQL 1.2.0 can only work with Hive 0.12.0 or Hive 0.13.1 (controlled by compilation flags); versions prior to 1.2.0 only work with Hive 0.12.0. So Hive 0.15.0-SNAPSHOT is not an option. I would like to add that this is due to a backward compatibility issue of the Hive metastore, AFAIK

Re: SparkSQL 1.2.0 sources API error

2015-01-02 Thread Cheng Lian
Most of the time a NoSuchMethodError means wrong classpath settings, and some jar file is overridden by a wrong version. In your case it could be Netty. On 1/3/15 1:36 PM, Niranda Perera wrote: Hi all, I am evaluating the spark sources API released with Spark 1.2.0. But I'm getting a

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Cheng Lian
Currently no, if you don't want to use Spark SQL's HiveContext. But we're working on adding partitioning support to the external data sources API, with which you can create, for example, partitioned Parquet tables without using Hive. Cheng On 1/26/15 8:47 AM, Danny Yates wrote: Thanks

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-05 Thread Cheng Lian
pseudo distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian lian.cs@gmail.com

Re: Can't access remote Hive table from spark

2015-02-05 Thread Cheng Lian
Please note that Spark 1.2.0 only supports Hive 0.13.1 or 0.12.0; no other versions are supported. Best, Cheng On 1/25/15 12:18 AM, guxiaobo1982 wrote: Hi, I built and started a single node standalone Spark 1.2.0 cluster along with a single node Hive 0.14.0 instance installed by

Re: Parquet compression codecs not applied

2015-02-05 Thread Cheng Lian
Hi Ayoub, The doc page isn’t wrong, but it’s indeed confusing. |spark.sql.parquet.compression.codec| is used when you’re writing Parquet files with something like |data.saveAsParquetFile(...)|. However, you are using Hive DDL in the example code. All Hive DDLs and commands like |SET| are
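
A small sketch of the case the setting does cover, assuming an existing SchemaRDD named `data` and a placeholder output path:

```scala
// The codec takes effect for Spark SQL's native Parquet writer, not for Hive DDL.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")  // or "snappy", "lzo", "uncompressed"
data.saveAsParquetFile("hdfs:///tmp/compressed_output.parquet")
```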

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Cheng Lian
|IF| is implemented as a generic UDF in Hive (|GenericUDFIf|). It seems that this function can’t be properly resolved. Could you provide a minimum code snippet that reproduces this issue? Cheng On 1/20/15 1:22 AM, Xuelin Cao wrote: Hi, I'm trying to migrate some hive scripts to

Re: Spark SQL: Assigning several aliases to the output (several return values) of an UDF

2015-01-20 Thread Cheng Lian
Guess this can be helpful: http://stackoverflow.com/questions/14252615/stack-function-in-hive-how-to-specify-multiple-aliases On 1/19/15 8:26 AM, mucks17 wrote: Hello I use Hive on Spark and have an issue with assigning several aliases to the output (several return values) of an UDF. I ran

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Cheng Lian
for future groupings (assuming we cache I suppose) Mick On 20 Jan 2015, at 20:44, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: First of all, even if the underlying dataset is partitioned as expected, a shuffle can’t be avoided. Because Spark SQL knows

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Cheng Lian
:07 PM, Cheng Lian lian.cs@gmail.com wrote: Hey Yana, Sorry for the late reply, missed this important thread somehow. And many thanks for reporting this. It turned out to be a bug — filter pushdown is only enabled when using client side metadata, which is not expected, because task side

Re: Support for SQL on unions of tables (merge tables?)

2015-01-20 Thread Cheng Lian
I think you can resort to a Hive table partitioned by date https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables On 1/11/15 9:51 PM, Paul Wais wrote: Dear List, What are common approaches for addressing over a union of tables / RDDs? E.g.

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote: This is because KMeanModel is neither a built-in type nor a user defined type recognized by Spark SQL. I think you can write your own UDT version of KMeansModel in this case

Re: Spark Sql reading whole table from cache instead of required coulmns

2015-01-20 Thread Cheng Lian
Hey Surbhit, In this case, the web UI stats are not accurate. Please refer to this thread for an explanation: https://www.mail-archive.com/user@spark.apache.org/msg18919.html Cheng On 1/13/15 1:46 AM, Surbhit wrote: Hi, I am using spark 1.1.0. I am using the spark-sql shell to run all the

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-20 Thread Cheng Lian
Hey Yana, Sorry for the late reply, I missed this important thread somehow. And many thanks for reporting this. It turned out to be a bug — filter pushdown is only enabled when using client-side metadata, which is not expected, because the task-side metadata code path is more performant. And I

Re: MapType in spark-sql

2015-01-20 Thread Cheng Lian
You need to provide the key type and value type for a map type, the element type for an array type, and whether they contain nulls: |StructType(Array( StructField("map_field", MapType(keyType = IntegerType, valueType = StringType, valueContainsNull = true), nullable = true),
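
A fuller, hedged sketch of the schema-application flow with the Spark 1.1/1.2-era `applySchema` API; the field names and sample rows are hypothetical:

```scala
import org.apache.spark.sql._

val schema = StructType(Array(
  StructField("map_field",
    MapType(keyType = IntegerType, valueType = StringType, valueContainsNull = true),
    nullable = true),
  StructField("array_field",
    ArrayType(elementType = StringType, containsNull = true),
    nullable = true)))

val rowRDD = sc.parallelize(Seq(Row(Map(1 -> "a", 2 -> "b"), Seq("x", "y"))))
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
schemaRDD.registerTempTable("complex_types")
```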

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-20 Thread Cheng Lian
. For example, |Sort| does defensive copy as it needs to cache rows for sorting. Keen to get the best performance and the best blend of SparkSQL and functional Spark. Cheers, Nathan From: Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com Date: Monday, 12 January 2015 1:21 am

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-20 Thread Cheng Lian
First of all, even if the underlying dataset is partitioned as expected, a shuffle can’t be avoided, because Spark SQL knows nothing about the underlying data distribution. However, this does reduce network IO. You can prepare your data like this (say |CustomerCode| is a string field with
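
The preparation code is cut off above; a hedged sketch of the general pattern (the `Customer` case class and partition count are hypothetical):

```scala
import org.apache.spark.HashPartitioner

case class Customer(customerCode: String, name: String)

val customers = sc.parallelize(Seq(Customer("C001", "Acme"), Customer("C002", "Globex")))

// Key by the column you want to distribute on, hash-partition, and keep it in memory.
val byCode = customers
  .keyBy(_.customerCode)
  .partitionBy(new HashPartitioner(200))
  .persist()

// Later groupings on the same key reuse the existing partitioning instead of reshuffling.
val grouped = byCode.groupByKey()
```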

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian
Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open a JIRA for it? Thanks! Cheng On 1/19/15 1:00 AM, Yi Tian wrote: Is there any way to support multiple users executing SQL on one thrift server? I

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Cheng Lian
I had once worked on a named row feature but haven’t got time to finish it. It looks like this: |sql(...).named.map { row: NamedRow => row[Int]('key) -> row[String]('value) } | Basically the |named| method generates a field name to ordinal map for each RDD partition. This map is then shared

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
|spark.sql.parquet.filterPushdown| defaults to |false| because there’s a bug in Parquet which may cause NPE, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration This bug hasn’t been fixed in Parquet master. We’ll turn this on once the bug is fixed.

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
In Spark SQL, Parquet filter pushdown doesn’t cover |HiveTableScan| for now. May I ask why you prefer |HiveTableScan| rather than |ParquetTableScan|? Cheng On 1/19/15 5:02 PM, Xiaoyu Wang wrote: The *spark.sql.parquet.filterPushdown=true* has been turned on. But set

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
This is because |KMeansModel| is neither a built-in type nor a user-defined type recognized by Spark SQL. I think you can write your own UDT version of |KMeansModel| in this case. You may refer to |o.a.s.mllib.linalg.Vector| and |o.a.s.mllib.linalg.VectorUDT| as an example. Cheng On 1/20/15

Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian
On 1/27/15 5:55 PM, Cheng Lian wrote: On 1/27/15 11:38 AM, Manoj Samel wrote: Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db. Use case is Spark Yarn app will start and serve as query server for multiple users i.e. always up and running. At startup, there is option

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-28 Thread Cheng Lian
On 1/21/15 10:39 AM, Cheng Lian wrote: Oh yes, thanks for adding that using |sc.hadoopConfiguration.set| also works :-) ​ On Wed, Jan 21, 2015 at 7:11 AM, Yana Kadiyska yana.kadiy...@gmail.com mailto:yana.kadiy...@gmail.com wrote: Thanks for looking Cheng. Just to clarify in case other

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Cheng Lian
According to the Gist Ayoub provided, the schema is fine. I reproduced this issue locally; it should be a bug, but I don't think it's related to SPARK-5236. Will investigate this soon. Ayoub - would you mind helping to file a JIRA for this issue? Thanks! Cheng On 1/30/15 11:28 AM, Michael

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-30 Thread Cheng Lian
Yeah, currently there isn't such a repo. However, the Spark team is working on this. Cheng On 1/30/15 8:19 AM, Ayoub wrote: I am not personally aware of a repo for snapshot builds. In my use case, I had to build spark 1.2.1-snapshot see

Re: Mathematical functions in spark sql

2015-01-27 Thread Cheng Lian
Hey Alexey, You need to use |HiveContext| in order to access Hive UDFs. You may try it with |bin/spark-sql| (|src| is a Hive table): |spark-sql select key / 3 from src limit 10; 79.33 28.668 103.67 9.0 55.0 136.34 85.0 92.67
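
The same thing from a Scala program, as a minimal sketch (assumes an existing `sc` and a Hive table named `src`):

```scala
import org.apache.spark.sql.hive.HiveContext

// Hive UDFs and built-ins resolve through HiveContext, not the plain SQLContext.
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT key / 3 FROM src LIMIT 10").collect().foreach(println)
```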

Re: Set is not parseable as row field in SparkSql

2015-01-28 Thread Cheng Lian
Hey Jorge, This is expected, because there isn’t an obvious mapping from |Set[T]| to any SQL type. Currently we have complex types like array, map, and struct, which are inherited from Hive. In your case, I’d transform the |Set[T]| into a |Seq[T]| first, then Spark SQL can map it to an
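
A hedged sketch of that transformation, using the Spark 1.1/1.2-era implicit SchemaRDD conversion; the `Event` case class and sample data are hypothetical:

```scala
case class Event(id: Int, tags: Seq[String])   // Seq instead of Set

val raw = sc.parallelize(Seq(1 -> Set("a", "b"), 2 -> Set("c")))

// Map each Set[T] to a Seq[T] so Spark SQL can treat the field as an array type.
val events = raw.map { case (id, tags) => Event(id, tags.toSeq) }

import sqlContext.createSchemaRDD   // implicit RDD[Product] -> SchemaRDD conversion
events.registerTempTable("events")
```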

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-11 Thread Cheng Lian
operator may also cache row objects. This is very implementation specific and may change between versions. Cheers, ~N From: Michael Armbrust mich...@databricks.com mailto:mich...@databricks.com Date: Saturday, 10 January 2015 3:41 am To: Cheng Lian lian.cs@gmail.com mailto:lian.cs

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Cheng Lian
Hey Nathan, Thanks for sharing, this is a very interesting post :) My comments are inlined below. Cheng On 1/7/15 11:53 AM, Nathan McCarthy wrote: Hi, I’m trying to use a combination of SparkSQL and ‘normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple

Re: New ColumnType For Decimal Caching

2015-02-15 Thread Cheng Lian
Hi Manoj, Yes, you've already hit the point. I think the timestamp type support in the in-memory columnar storage can be a good reference for you. Also, you may want to enable compression support for the decimal type by adding the DECIMAL column type to RunLengthEncoding.supports and

Re: Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde

2015-02-16 Thread Cheng Lian
Hi Jianshi, When accessing a Hive table with Parquet SerDe, Spark SQL tries to convert it into Spark SQL's native Parquet support for better performance. And yes, predicate push-down, column pruning are applied here. In 1.3.0, we'll also cover the write path except for writing partitioned table.

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5:37 PM, Xuelin Cao wrote: Hi, Curious and curious. I'm puzzled by the Spark SQL cached table. Theoretically, the cached table should be columnar table, and only scan the column that included in my SQL. However, in my

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
the input data for each task (in the stage detail page). And the sum of the input data for each task is also 1212.5MB On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5

Re: example insert statement in Spark SQL

2015-01-08 Thread Cheng Lian
Spark SQL supports the Hive insertion statement (Hive 0.14.0-style insertion is not supported, though): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries The small SQL dialect provided in Spark SQL doesn't support insertion

Re: SparkSQL support for reading Avro files

2015-01-08 Thread Cheng Lian
This package is moved here: https://github.com/databricks/spark-avro On 1/6/15 5:12 AM, yanenli2 wrote: Hi All, I want to use the SparkSQL to manipulate the data with Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore with

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-08 Thread Cheng Lian
The |+| operator only handles numeric data types; you may register your own concat function like this: |sqlContext.registerFunction("concat", (s: String, t: String) => s + t) sqlContext.sql("select concat('$', col1) from tbl") | Cheng On 1/5/15 1:13 PM, RK wrote: The issue is happening when I try

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
drops. If you try like this: cacheTable("tbl") sql("select * from tbl").collect() sql("select name from tbl").collect() sql("select * from tbl").collect() Is the input data of the 3rd SQL bigger than 49.1KB? On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian lian.cs@gmail.com

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Cheng Lian
Would you mind to provide the query? If it's confidential, could you please help constructing a query that reproduces this issue? Cheng On 3/18/15 6:03 PM, Roberto Coluccio wrote: Hi everybody, When trying to upgrade from Spark 1.1.1 to Spark 1.2.x (tried both 1.2.0 and 1.2.1) I encounter a

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-18 Thread Cheng Lian
You should probably increase executor memory by setting spark.executor.memory. Full list of available configurations can be found here http://spark.apache.org/docs/latest/configuration.html Cheng On 3/18/15 9:15 PM, Yiannis Gkoufas wrote: Hi there, I was trying the new DataFrame API with
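
A minimal sketch of setting it programmatically (the app name and value are hypothetical); the same setting can be passed as `--executor-memory` to spark-submit or put in spark-defaults.conf:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parquet-dataframe-job")
  .set("spark.executor.memory", "4g")   // size this to the data and the GC pressure observed

val sc = new SparkContext(conf)
```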

Re: Spark SQL weird exception after upgrading from 1.1.1 to 1.2.x

2015-03-18 Thread Cheng Lian
with only (and less than 22) String fields. Hope the situation is a bit more clear. Thanks anyone who will help me out here. Roberto On Wed, Mar 18, 2015 at 12:09 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: Would you mind to provide the query? If it's

Re: sparksql native jdbc driver

2015-03-18 Thread Cheng Lian
Yes On 3/18/15 8:20 PM, sequoiadb wrote: hey guys, In my understanding SparkSQL only supports JDBC connection through hive thrift server, is this correct? Thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

Re: [Spark SQL]: Convert JavaSchemaRDD back to JavaRDD of a specific class

2015-03-15 Thread Cheng Lian
Currently there’s no convenient way to convert a |SchemaRDD|/|JavaSchemaRDD| back to an |RDD|/|JavaRDD| of some case class. But you can convert a |SchemaRDD|/|JavaSchemaRDD| into an |RDD[Row]|/|JavaRDD<Row>| using |schemaRdd.rdd| and |new JavaRDD<Row>(schemaRdd.rdd)|. Cheng On 3/15/15 10:22 PM, Renato

Re: insert hive partitioned table

2015-03-16 Thread Cheng Lian
Not quite sure whether I understand your question properly. But if you just want to read the partition columns, it’s pretty easy. Take the “year” column as an example, you may do this in HiveQL: |hiveContext.sql("SELECT year FROM speed") | or in DataFrame DSL:
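
The DSL example is cut off by the archive; a hedged equivalent with the Spark 1.3 DataFrame API (same hypothetical `speed` table) would be:

```scala
// Select just the partition column through the DataFrame DSL.
hiveContext.table("speed").select("year").show()
```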

Re: configure number of cached partition in memory on SparkSQL

2015-03-16 Thread Cheng Lian
Hi Judy, In the case of |HadoopRDD| and |NewHadoopRDD|, partition number is actually decided by the |InputFormat| used. And |spark.sql.inMemoryColumnarStorage.batchSize| is not related to partition number, it controls the in-memory columnar batch size within a single partition. Also, what

Re: Explanation on the Hive in the Spark assembly

2015-03-15 Thread Cheng Lian
Spark SQL supports most commonly used features of HiveQL. However, different HiveQL statements are executed in different manners: 1. DDL statements (e.g. |CREATE TABLE|, |DROP TABLE|, etc.) and commands (e.g. |SET key = value|, |ADD FILE|, |ADD JAR|, etc.) In most cases, Spark SQL

Re: Running spark function on parquet without sql

2015-03-15 Thread Cheng Lian
That's an unfortunate documentation bug in the programming guide... We failed to update it after making the change. Cheng On 2/28/15 8:13 AM, Deborah Siegel wrote: Hi Michael, Would you help me understand the apparent difference here.. The Spark 1.2.1 programming guide indicates: Note

Re: Writing wide parquet file in Spark SQL

2015-03-15 Thread Cheng Lian
This article by Ryan Blue should be helpful to understand the problem http://ingest.tips/2015/01/31/parquet-row-group-size/ The TL;DR is, you may decrease |parquet.block.size| to reduce memory consumption. Anyway, 100K columns is a really big burden for Parquet, but I guess your data should
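
A hedged sketch of the knob (16 MB is an arbitrary example value, and `wideDF` plus the output path are hypothetical):

```scala
// Parquet reads the row group size from the Hadoop configuration at write time.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)

// Any subsequent Parquet write in this context picks up the smaller row group size.
wideDF.saveAsParquetFile("hdfs:///tmp/wide_table.parquet")
```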

Re: Read Parquet file from scala directly

2015-03-15 Thread Cheng Lian
The parquet-tools code should be pretty helpful (although it's Java) https://github.com/apache/incubator-parquet-mr/tree/master/parquet-tools/src/main/java/parquet/tools/command On 3/10/15 12:25 AM, Shuai Zheng wrote: Hi All, I have a lot of parquet files, and I try to open them directly

Re: From Spark web ui, how to prove the parquet column pruning working

2015-03-15 Thread Cheng Lian
Hey Yong, It seems that Hadoop `FileSystem` adds the size of a block to the metrics even if you only touch a fraction of it (reading Parquet metadata for example). This behavior can be verified by the following snippet: ```scala import org.apache.spark.sql.Row import

Re: Is there any problem in having a long opened connection to spark sql thrift server

2015-03-15 Thread Cheng Lian
It should be OK. If you encounter problems with a long-lived open connection to the Thrift server, it should be a bug. Cheng On 3/9/15 6:41 PM, fanooos wrote: I have some applications developed using PHP and currently we have a problem in connecting these applications to spark sql thrift

Re: Parquet and repartition

2015-03-16 Thread Cheng Lian
Hey Masf, I’ve created SPARK-6360 https://issues.apache.org/jira/browse/SPARK-6360 to track this issue. Detailed analysis is provided there. The TL;DR is, for Spark 1.1 and 1.2, if a SchemaRDD contains decimal or UDT column(s), after applying any traditional RDD transformations (e.g.

Re: insert hive partitioned table

2015-03-16 Thread Cheng Lian
, and the value of the partition column to be inserted must be from temporary registered table/dataframe. Patcharee On 16. mars 2015 15:26, Cheng Lian wrote: Not quite sure whether I understand your question properly. But if you just want to read the partition columns, it’s pretty easy. Take the “year

Re: Iterate over contents of schemaRDD loaded from parquet file to extract timestamp

2015-03-16 Thread Cheng Lian
I don't see non-serializable objects in the provided snippets. But you can always add -Dsun.io.serialization.extendedDebugInfo=true to Java options to debug serialization errors. Cheng On 3/17/15 12:43 PM, anu wrote: Spark Version - 1.1.0 Scala - 2.10.4 I have loaded following type data

Re: Question on Spark 1.3 SQL External Datasource

2015-03-17 Thread Cheng Lian
Hey Yang, My comments are in-lined below. Cheng On 3/18/15 6:53 AM, Yang Lei wrote: Hello, I am migrating my Spark SQL external datasource integration from Spark 1.2.x to Spark 1.3. I noticed, there are a couple of new filters now, e.g. org.apache.spark.sql.sources.And. However, for a

Re: Unable to saveAsParquetFile to HDFS since Spark 1.3.0

2015-03-17 Thread Cheng Lian
This has been fixed by https://github.com/apache/spark/pull/5020 On 3/18/15 12:24 AM, Franz Graf wrote: Hi all, today we tested Spark 1.3.0. Everything went pretty fine except that I seem to be unable to save an RDD as parquet to HDFS. A minimum example is: import sqlContext.implicits._ //

Re: Spark-sql query got exception.Help

2015-03-25 Thread Cheng Lian
$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2015-03-25 19:05 GMT+08:00 Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com: Could you please provide the full stack trace? On 3/25/15 6:26 PM, 李铖 wrote: It is ok when I do query data from

Re: Write Parquet File with spark-streaming with Spark 1.3

2015-03-26 Thread Cheng Lian
You may resort to the generic save API introduced in 1.3, which supports appending as long as the target data source supports it. And in 1.3, Parquet does support appending. Cheng On 3/26/15 4:13 PM, Richard Grossman wrote: Hi I've succeed to write kafka stream to parquet file in Spark 1.2
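
A hedged sketch of the streaming write path with the 1.3 generic save API; the `clickStream` DStream, the `Click` case class, and the output path are hypothetical:

```scala
import org.apache.spark.sql.SaveMode

case class Click(user: String, ts: Long)

clickStream.foreachRDD { rdd =>                    // clickStream: DStream[Click]
  if (rdd.take(1).nonEmpty) {                      // skip empty micro-batches
    val df = sqlContext.createDataFrame(rdd)
    // Parquet is the default data source, so each micro-batch is appended to one dataset.
    df.save("hdfs:///data/clicks", SaveMode.Append)
  }
}
```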

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Cheng Lian
I couldn’t reproduce this with the following spark-shell snippet: |scala> import sqlContext.implicits._ scala> Seq((1, 2)).toDF("a", "b") scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) scala> res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite) | The _common_metadata file is

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Cheng Lian
We're working together with AsiaInfo on this. Possibly will deliver an initial version of window function support in 1.4.0. But it's not a promise yet. Cheng On 3/26/15 7:27 PM, Arush Kharbanda wrote: Its not yet implemented. https://issues.apache.org/jira/browse/SPARK-1442 On Thu, Mar 26,

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-30 Thread Cheng Lian
as HiveContext constructor does not accept JaveSparkContext and JaveSparkContext is not subclass of SparkContext. Anyone else have any idea? I suspect this is supported now. On Sun, Mar 29, 2015 at 8:54 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: You may

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread Cheng Lian
The mysql command line doesn't use JDBC to talk to MySQL server, so this doesn't verify anything. I think this Hive metastore installation guide from Cloudera may be helpful. Although this document is for CDH4, the general steps are the same, and should help you to figure out the

Re: Can spark sql read existing tables created in hive

2015-03-30 Thread Cheng Lian
Ah, sorry, my bad... http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-2-0/CDH4-Installation-Guide/cdh4ig_topic_18_4.html On 3/30/15 10:24 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: Hello Lian Can you share the URL ? On Mon, Mar 30, 2015 at 6:12 PM, Cheng Lian lian.cs@gmail.com

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian
(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) Regards, Deepak On Fri, Mar 27, 2015 at 8:33 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: As the exception suggests, you don't have MySQL JDBC driver on your classpath

Re: spark-sql throws org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException

2015-03-26 Thread Cheng Lian
As the exception suggests, you don't have MySQL JDBC driver on your classpath. On 3/27/15 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: I am unable to run spark-sql form command line. I attempted the following 1) export SPARK_HOME=/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4 export

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
This should be a bug in Explode.eval(), which always assumes the underlying SQL array is represented by a Scala Seq. Would you mind opening a JIRA ticket for this? Thanks! Cheng On 3/27/15 7:00 PM, Jon Chase wrote: Spark 1.3.0 Two issues: a) I'm unable to get a lateral view explode

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Cheng Lian
, 2015 at 7:26 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: I couldn’t reproduce this with the following spark-shell snippet: |scala import sqlContext.implicits._ scala Seq((1, 2)).toDF(a, b) scala res0.save(xxx

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
(), as it produced a similar exception (though there was no use of explode there). On Fri, Mar 27, 2015 at 7:20 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: This should be a bug in the Explode.eval(), which always assumes the underlying SQL array is represented by a Scala

Re: Spark SQL lateral view explode doesn't work, and unable to save array types to Parquet

2015-03-27 Thread Cheng Lian
:14 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: Forgot to mention that, would you mind to also provide the full stack trace of the exception thrown in the saveAsParquetFile call? Thanks! Cheng On 3/27/15 7:35 PM, Jon Chase wrote: https

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Cheng Lian
You may simply pass in JavaSparkContext.sc On 3/29/15 9:25 PM, Vincent He wrote: All, I try Spark SQL with Java, I find HiveContext does not accept JavaSparkContext, is this true? Or any special build of Spark I need to do (I build with Hive and thrift server)? Can we use HiveContext in

Re: Does Spark HiveContext supported with JavaSparkContext?

2015-03-29 Thread Cheng Lian
is not subclass of SparkContext. Anyone else have any idea? I suspect this is supported now. On Sun, Mar 29, 2015 at 8:54 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: You may simply pass in JavaSparkContext.sc On 3/29/15 9:25 PM, Vincent He wrote: All

Re: Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Cheng Lian
You need either |.map { row => (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...) } | or |.map { case Row(f0: Float, f1: Float, ...) => (f0, f1) } | On 3/23/15 9:08 AM, Minnow Noir wrote: I'm following some online tutorial written in Python and trying to convert a Spark SQL table
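
A self-contained sketch of the second form, assuming a hypothetical `products` table whose two columns are FLOAT in the schema (adjust the pattern to the real column types and arity):

```scala
import org.apache.spark.sql.Row

val pairs = sqlContext.sql("SELECT price, discount FROM products").map {
  case Row(price: Float, discount: Float) => (price, discount)
}
```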

Re: Spark-SQL 1.2.0 sort by results are not consistent with Hive

2015-02-23 Thread Cheng Lian
(Move to user list.) Hi Kannan, You need to set |mapred.map.tasks| to 1 in hive-site.xml. The reason is this line of code https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L68, which overrides |spark.default.parallelism|. Also,
