Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2015-02-05 Thread Cheng Lian
pseudo distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks. On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian lian.cs@gmail.com

Re: Can't access remote Hive table from spark

2015-02-05 Thread Cheng Lian
Please note that Spark 1.2.0 /only/ supports Hive 0.13.1 /or/ 0.12.0; no other versions are supported. Best, Cheng On 1/25/15 12:18 AM, guxiaobo1982 wrote: Hi, I built and started a single node standalone Spark 1.2.0 cluster along with a single node Hive 0.14.0 instance installed by

Re: Parquet compression codecs not applied

2015-02-05 Thread Cheng Lian
Hi Ayoub, The doc page isn’t wrong, but it’s indeed confusing. |spark.sql.parquet.compression.codec| is used when you’re writing a Parquet file with something like |data.saveAsParquetFile(...)|. However, you are using Hive DDL in the example code. All Hive DDLs and commands like |SET| are
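
A minimal sketch of the distinction described above (output path is hypothetical; supported codec values here are uncompressed, snappy, gzip and lzo):

    // applies to Spark SQL's own Parquet writer, not to Hive DDL run through HiveContext
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    data.saveAsParquetFile("hdfs://host/path/output.parquet") // data: SchemaRDD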

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Cheng Lian
According to the Gist Ayoub provided, the schema is fine. I reproduced this issue locally; it should be a bug, but I don't think it's related to SPARK-5236. Will investigate this soon. Ayoub - would you mind filing a JIRA for this issue? Thanks! Cheng On 1/30/15 11:28 AM, Michael

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-30 Thread Cheng Lian
Yeah, currently there isn't such a repo. However, the Spark team is working on this. Cheng On 1/30/15 8:19 AM, Ayoub wrote: I am not personally aware of a repo for snapshot builds. In my use case, I had to build spark 1.2.1-snapshot see

Re: Error when get data from hive table. Use python code.

2015-01-29 Thread Cheng Lian
What versions of Spark and Hive are you using? Spark 1.1.0 and prior versions /only/ support Hive 0.12.0. Spark 1.2.0 supports Hive 0.12.0 /or/ 0.13.1. Cheng On 1/29/15 6:36 PM, QiuxuanZhu wrote: Dear all, I have no idea when it raises an error when I run the following code. def

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-28 Thread Cheng Lian
On 1/21/15 10:39 AM, Cheng Lian wrote: Oh yes, thanks for adding that using |sc.hadoopConfiguration.set| also works :-) On Wed, Jan 21, 2015 at 7:11 AM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Thanks for looking Cheng. Just to clarify in case other

Re: Set is not parseable as row field in SparkSql

2015-01-28 Thread Cheng Lian
Hey Jorge, This is expected, because there isn’t an obvious mapping from |Set[T]| to any SQL type. Currently we have complex types like array, map, and struct, which are inherited from Hive. In your case, I’d transform the |Set[T]| into a |Seq[T]| first, then Spark SQL can map it to an
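
A hedged sketch of the suggested Set[T]-to-Seq[T] transformation (case class and field names are hypothetical):

    case class Raw(id: Int, tags: Set[String])
    case class Flat(id: Int, tags: Seq[String])

    val flat = rawRdd.map(r => Flat(r.id, r.tags.toSeq)) // rawRdd: RDD[Raw]
    import sqlContext.createSchemaRDD                    // Spark 1.x implicit RDD-to-SchemaRDD conversion
    flat.registerTempTable("records")                    // tags is now mapped to a SQL array type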

Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian
On 1/27/15 5:55 PM, Cheng Lian wrote: On 1/27/15 11:38 AM, Manoj Samel wrote: Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db. Use case is Spark Yarn app will start and serve as query server for multiple users i.e. always up and running. At startup, there is option

Re: Mathematical functions in spark sql

2015-01-27 Thread Cheng Lian
Hey Alexey, You need to use |HiveContext| in order to access Hive UDFs. You may try it with |bin/spark-sql| (|src| is a Hive table): |spark-sql> select key / 3 from src limit 10;| 79.33 28.668 103.67 9.0 55.0 136.34 85.0 92.67
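
The same query issued programmatically, as a minimal sketch (assumes src is an existing Hive table, as in the reply):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc) // HiveContext is required for Hive UDFs
    hiveContext.sql("SELECT key / 3 FROM src LIMIT 10").collect().foreach(println)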

Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Cheng Lian
Currently no, if you don't want to use Spark SQL's HiveContext. But we're working on adding partitioning support to the external data sources API, with which you can create, for example, partitioned Parquet tables without using Hive. Cheng On 1/26/15 8:47 AM, Danny Yates wrote: Thanks

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-21 Thread Cheng Lian
for future groupings (assuming we cache I suppose) Mick On 20 Jan 2015, at 20:44, Cheng Lian lian.cs@gmail.com wrote: First of all, even if the underlying dataset is partitioned as expected, a shuffle can’t be avoided, because Spark SQL knows

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Cheng Lian
:07 PM, Cheng Lian lian.cs@gmail.com wrote: Hey Yana, Sorry for the late reply, missed this important thread somehow. And many thanks for reporting this. It turned out to be a bug — filter pushdown is only enabled when using client side metadata, which is not expected, because task side

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread Cheng Lian
|IF| is implemented as a generic UDF in Hive (|GenericUDFIf|). It seems that this function can’t be properly resolved. Could you provide a minimum code snippet that reproduces this issue? Cheng On 1/20/15 1:22 AM, Xuelin Cao wrote: Hi, I'm trying to migrate some hive scripts to

Re: Spark SQL: Assigning several aliases to the output (several return values) of an UDF

2015-01-20 Thread Cheng Lian
Guess this can be helpful: http://stackoverflow.com/questions/14252615/stack-function-in-hive-how-to-specify-multiple-aliases On 1/19/15 8:26 AM, mucks17 wrote: Hello I use Hive on Spark and have an issue with assigning several aliases to the output (several return values) of an UDF. I ran

Re: Support for SQL on unions of tables (merge tables?)

2015-01-20 Thread Cheng Lian
I think you can resort to a Hive table partitioned by date https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables On 1/11/15 9:51 PM, Paul Wais wrote: Dear List, What are common approaches for addressing over a union of tables / RDDs? E.g.

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
-means model from its cluster centers. -Xiangrui On Tue, Jan 20, 2015 at 11:55 AM, Cheng Lian lian.cs@gmail.com wrote: This is because KMeansModel is neither a built-in type nor a user defined type recognized by Spark SQL. I think you can write your own UDT version of KMeansModel in this case

Re: Spark Sql reading whole table from cache instead of required coulmns

2015-01-20 Thread Cheng Lian
Hey Surbhit, In this case, the web UI stats are not accurate. Please refer to this thread for an explanation: https://www.mail-archive.com/user@spark.apache.org/msg18919.html Cheng On 1/13/15 1:46 AM, Surbhit wrote: Hi, I am using spark 1.1.0. I am using the spark-sql shell to run all the

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-20 Thread Cheng Lian
Hey Yana, Sorry for the late reply, missed this important thread somehow. And many thanks for reporting this. It turned out to be a bug — filter pushdown is only enabled when using client side metadata, which is not expected, because task side metadata code path is more performant. And I

Re: MapType in spark-sql

2015-01-20 Thread Cheng Lian
You need to provide the key type and value type for a map type, the element type for an array type, and whether they contain null: |StructType(Array( StructField("map_field", MapType(keyType = IntegerType, valueType = StringType, containsNull = true), nullable = true),
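
A hedged completion of the schema above, applied to an RDD of rows (field names and rowRDD are hypothetical; the third MapType argument indicates whether map values may be null):

    import org.apache.spark.sql._

    val schema = StructType(Array(
      StructField("map_field", MapType(IntegerType, StringType, true), nullable = true),
      StructField("array_field", ArrayType(StringType, containsNull = true), nullable = true)))

    val schemaRDD = sqlContext.applySchema(rowRDD, schema) // rowRDD: RDD[Row], Spark 1.x API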

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-20 Thread Cheng Lian
. For example, |Sort| does a defensive copy as it needs to cache rows for sorting. Keen to get the best performance and the best blend of SparkSQL and functional Spark. Cheers, Nathan From: Cheng Lian lian.cs@gmail.com Date: Monday, 12 January 2015 1:21 am

Re: [SQL] Using HashPartitioner to distribute by column

2015-01-20 Thread Cheng Lian
First of all, even if the underlying dataset is partitioned as expected, a shuffle can’t be avoided, because Spark SQL knows nothing about the underlying data distribution. However, this does reduce network IO. You can prepare your data like this (say |CustomerCode| is a string field with
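
A sketch of the preparation step being described (record type and partition count are hypothetical):

    import org.apache.spark.HashPartitioner

    val byCustomer = records                // records: RDD[Order] with a customerCode field
      .keyBy(_.customerCode)
      .partitionBy(new HashPartitioner(48)) // co-locates rows sharing the same CustomerCode
      .values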

Re: Is there any way to support multiple users executing SQL on thrift server?

2015-01-20 Thread Cheng Lian
Hey Yi, I'm quite unfamiliar with Hadoop/HDFS auth mechanisms for now, but would like to investigate this issue later. Would you please open a JIRA for it? Thanks! Cheng On 1/19/15 1:00 AM, Yi Tian wrote: Is there any way to support multiple users executing SQL on one thrift server? I

Re: Scala Spark SQL row object Ordinal Method Call Aliasing

2015-01-20 Thread Cheng Lian
I had once worked on a named row feature but haven’t got time to finish it. It looks like this: |sql(...).named.map { row: NamedRow => row[Int]('key) -> row[String]('value) }| Basically the |named| method generates a field-name-to-ordinal map for each RDD partition. This map is then shared

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
|spark.sql.parquet.filterPushdown| defaults to |false| because there’s a bug in Parquet which may cause NPE, please refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration This bug hasn’t been fixed in Parquet master. We’ll turn this on once the bug is fixed.

Re: Why custom parquet format hive table execute ParquetTableScan physical plan, not HiveTableScan?

2015-01-20 Thread Cheng Lian
In Spark SQL, Parquet filter pushdown doesn’t cover |HiveTableScan| for now. May I ask why you prefer |HiveTableScan| rather than |ParquetTableScan|? Cheng On 1/19/15 5:02 PM, Xiaoyu Wang wrote: The *spark.sql.parquet.filterPushdown=true* setting has been turned on. But set

Re: Saving a mllib model in Spark SQL

2015-01-20 Thread Cheng Lian
This is because |KMeansModel| is neither a built-in type nor a user defined type recognized by Spark SQL. I think you can write your own UDT version of |KMeansModel| in this case. You may refer to |o.a.s.mllib.linalg.Vector| and |o.a.s.mllib.linalg.VectorUDT| as an example. Cheng On 1/20/15

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-11 Thread Cheng Lian
operator may also cache row objects. This is very implementation specific and may change between versions. Cheers, ~N From: Michael Armbrust mich...@databricks.com Date: Saturday, 10 January 2015 3:41 am To: Cheng Lian lian.cs@gmail.com

Re: SparkSQL schemaRDD MapPartitions calls - performance issues - columnar formats?

2015-01-09 Thread Cheng Lian
Hey Nathan, Thanks for sharing, this is a very interesting post :) My comments are inlined below. Cheng On 1/7/15 11:53 AM, Nathan McCarthy wrote: Hi, I'm trying to use a combination of SparkSQL and 'normal' Spark/Scala via rdd.mapPartitions(…). Using the latest release 1.2.0. Simple

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5:37 PM, Xuelin Cao wrote: Hi, Curious and curious. I'm puzzled by the Spark SQL cached table. Theoretically, the cached table should be columnar table, and only scan the column that included in my SQL. However, in my

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
the input data for each task (in the stage detail page). And the sum of the input data for each task is also 1212.5MB On Thu, Jan 8, 2015 at 6:40 PM, Cheng Lian lian.cs@gmail.com wrote: Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5

Re: example insert statement in Spark SQL

2015-01-08 Thread Cheng Lian
Spark SQL supports the Hive insertion statement (Hive 0.14.0-style insertion is not supported, though): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries The small SQL dialect provided in Spark SQL doesn't support insertion
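
A hedged example of the supported Hive-style insertion (table names are hypothetical):

    hiveContext.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM source_table")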

Re: SparkSQL support for reading Avro files

2015-01-08 Thread Cheng Lian
This package is moved here: https://github.com/databricks/spark-avro On 1/6/15 5:12 AM, yanenli2 wrote: Hi All, I want to use the SparkSQL to manipulate the data with Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore with

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-08 Thread Cheng Lian
The |+| operator only handles numeric data types; you may register your own concat function like this: |sqlContext.registerFunction("concat", (s: String, t: String) => s + t) sqlContext.sql("select concat('$', col1) from tbl")| Cheng On 1/5/15 1:13 PM, RK wrote: The issue is happening when I try

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
drops. If you try like this: cacheTable("tbl") sql("select * from tbl").collect() sql("select name from tbl").collect() sql("select * from tbl").collect() Is the input data of the 3rd SQL bigger than 49.1KB? On Thu, Jan 8, 2015 at 9:36 PM, Cheng Lian lian.cs@gmail.com

Re: SparkSQL 1.2.0 sources API error

2015-01-02 Thread Cheng Lian
Most of the time a NoSuchMethodError means wrong classpath settings, and some jar file is overridden by a wrong version. In your case it could be netty. On 1/3/15 1:36 PM, Niranda Perera wrote: Hi all, I am evaluating the spark sources API released with Spark 1.2.0. But I'm getting a

Re: Not Serializable exception when integrating SQL and Spark Streaming

2014-12-24 Thread Cheng Lian
Generally you can use |-Dsun.io.serialization.extendedDebugInfo=true| to enable serialization debugging information when serialization exceptions are raised. On 12/24/14 1:32 PM, bigdata4u wrote: I am trying to use sql over Spark streaming using Java. But i am getting Serialization
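
A sketch of setting this flag for both driver and executors through SparkConf (an alternative to passing it on the JVM command line):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")
      .set("spark.executor.extraJavaOptions", "-Dsun.io.serialization.extendedDebugInfo=true")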

Re: SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-24 Thread Cheng Lian
Hao and Lam - I think the issue here is that |registerRDDAsTable| only creates a temporary table, which is not seen by the Hive metastore. And Michael had once given a workaround for creating an external Parquet table:

Re: got ”org.apache.thrift.protocol.TProtocolException: Expected protocol id ffffff82 but got ffffff80“ from hive metastroe service when I use show tables command in spark-sql shell

2014-12-24 Thread Cheng Lian
Hi Roc, Spark SQL 1.2.0 can only work with Hive 0.12.0 or Hive 0.13.1 (controlled by compilation flags); versions prior to 1.2.0 only work with Hive 0.12.0. So Hive 0.15.0-SNAPSHOT is not an option. Would like to add that this is due to a backwards compatibility issue of the Hive metastore, AFAIK

Re: How to export data from hive into hdfs in spark program?

2014-12-23 Thread Cheng Lian
This depends on which output format you want. For Parquet, you can simply do this: |hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file")| On 12/23/14 5:22 PM, LinQili wrote: Hi Leo: Thanks for your reply. I am talking about using hive from spark to export data from

Re: SchemaRDD.sample problem

2014-12-23 Thread Cheng Lian
Here is a more cleaned-up version, which can be used in |./sbt/sbt hive/console| to easily reproduce this issue: |sql("SELECT * FROM src WHERE key % 2 = 0"). sample(withReplacement = false, fraction = 0.05). registerTempTable("sampled") println(table("sampled").queryExecution) val query = sql("SELECT

Re: Spark SQL job block when use hive udf from_unixtime

2014-12-23 Thread Cheng Lian
Could you please provide a complete stacktrace? Also it would be good if you can share your hive-site.xml as well. On 12/23/14 4:42 PM, Dai, Kevin wrote: Hi, there When I use hive udf from_unixtime with the HiveContext, the job block and the log is as follow: sun.misc.Unsafe.park(Native

Re: Spark SQL 1.2 with CDH 4, Hive UDF is not working.

2014-12-22 Thread Cheng Lian
Hi Ji, Spark SQL 1.2 only works with either Hive 0.12.0 or 0.13.1 due to Hive API/protocol compatibility issues. When interacting with Hive 0.11.x, connections and simple queries may succeed, but things may go crazy in unexpected corners (like UDF). Cheng On 12/22/14 4:15 PM, Ji ZHANG

Re: spark-sql with join terribly slow.

2014-12-21 Thread Cheng Lian
secs. Hari On Wed, Dec 17, 2014 at 10:09 PM, Cheng Lian lian.cs@gmail.com wrote: What kinds are the tables underlying the SchemaRDDs? Could you please provide the DDL of the tables and the query you executed? On 12/18/14 6:15 AM

Re: Querying registered RDD (AsTable) using JDBC

2014-12-21 Thread Cheng Lian
Evert - Thanks for the instructions, this is generally useful in other scenarios, but I think this isn’t what Shahab needs, because |saveAsTable| actually saves the contents of the SchemaRDD into Hive. Shahab - As Michael has answered in another thread, you may try

Re: Spark SQL DSL for joins?

2014-12-21 Thread Cheng Lian
On 12/17/14 1:43 PM, Jerry Raj wrote: Hi, I'm using the Scala DSL for Spark SQL, but I'm not able to do joins. I have two tables (backed by Parquet files) and I need to do a join across them using a common field (user_id). This works fine using standard SQL but not using the

Re: SparkSQL 1.2.1-snapshot Left Join problem

2014-12-21 Thread Cheng Lian
Could you please file a JIRA together with the Git commit you're using? Thanks! On 12/18/14 2:32 AM, Hao Ren wrote: Hi, When running SparkSQL branch 1.2.1 on EC2 standalone cluster, the following query does not work: create table debug as select v1.* from t1 as v1 left join t2 as v2 on

Re: integrating long-running Spark jobs with Thriftserver

2014-12-21 Thread Cheng Lian
Hi Schweichler, This is an interesting and practical question. I'm not familiar with how Tableau works, but would like to share some thoughts. In general, big data analytics frameworks like MR and Spark tend to perform immutable functional transformations over immutable data. Whilst in your

Re: [SPARK-SQL]how to run cache command with Running the Thrift JDBC/ODBC server

2014-12-19 Thread Cheng Lian
It seems that the Thrift server you connected to is the original HiveServer2 rather than Spark SQL HiveThriftServer2. On 12/19/14 4:08 PM, jeanlyn92 wrote: when I run the *cache table as* in beeline, which communicates with the Thrift server, I got the following error: 14/12/19 15:57:05 ERROR

Re: Spark SQL API Doc IsCached as SQL command

2014-12-12 Thread Cheng Lian
There isn’t a SQL statement that directly maps to |SQLContext.isCached|, but you can use |EXPLAIN EXTENDED| to check whether the underlying physical plan is an |InMemoryColumnarTableScan|. On 12/13/14 7:14 AM, Judy Nash wrote: Hello, Few questions on Spark SQL: 1) Does Spark SQL support
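
A sketch combining both checks (table name is hypothetical):

    sqlContext.cacheTable("tbl")
    sqlContext.isCached("tbl") // true, but only callable from code
    // from SQL, look for InMemoryColumnarTableScan in the physical plan:
    sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM tbl").collect().foreach(println)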

Re: Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

2014-12-11 Thread Cheng Lian
There are several overloaded versions of both |jsonFile| and |jsonRDD|. Schema inferring is kinda expensive since it requires an extra Spark job. You can avoid schema inferring by storing the inferred schema and then use it together with the following two methods: * |def jsonFile(path:
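
A sketch of reusing an inferred schema to skip the extra inference job (paths are hypothetical):

    val sample = sqlContext.jsonFile("hdfs://host/sample.json")     // triggers a schema-inference job
    val schema = sample.schema                                      // StructType of the inferred schema
    val full = sqlContext.jsonFile("hdfs://host/full.json", schema) // no inference job this time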

Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
Essentially, the Spark SQL JDBC Thrift server is just a Spark port of HiveServer2. You don't need to run Hive, but you do need a working Metastore. On 12/9/14 3:59 PM, Anas Mosaad wrote: Thanks Judy, this is exactly what I'm looking for. However, and plz forgive me if it's a dump question is:

Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
(0.106 seconds) 0: jdbc:hive2://localhost:1 Kindly advise, what am I missing? I want to read the RDD using SQL from outside spark-shell (i.e. like any other relational database) On Tue, Dec 9, 2014 at 11:05 AM, Cheng Lian lian.cs@gmail.com wrote

Re: Spark-SQL JDBC driver

2014-12-09 Thread Cheng Lian
) On Tue, Dec 9, 2014 at 11:44 AM, Cheng Lian lian.cs@gmail.com wrote: How did you register the table under spark-shell? Two things to notice: 1. To interact with Hive, HiveContext instead of SQLContext must be used. 2. `registerTempTable` doesn't

Re: Spark SQL: How to get the hierarchical element with SQL?

2014-12-07 Thread Cheng Lian
You may access it via something like |SELECT filterIp.element FROM tb|, just like Hive. Or if you’re using the Spark SQL DSL, you can use |tb.select("filterIp.element".attr)|. On 12/8/14 1:08 PM, Xuelin Cao wrote: Hi, I'm generating a Spark SQL table from an offline Json file. The

Re: SQL query in scala API

2014-12-05 Thread Cheng Lian
++ seen1) }).mapValues { case (count, seen) => (count, seen.size) } On 12/5/14 3:47 AM, Arun Luthra wrote: Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian lian.cs@gmail.com wrote: You may do
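
A hedged reconstruction of the full pure-RDD version the snippet above is cut from, assuming users is an RDD[(String, String)] of (zip, user) pairs:

    val stats = users.combineByKey(
        (user: String) => (1L, Set(user)),                                        // createCombiner
        (acc: (Long, Set[String]), user: String) => (acc._1 + 1L, acc._2 + user), // mergeValue
        (a: (Long, Set[String]), b: (Long, Set[String])) =>
          (a._1 + b._1, a._2 ++ b._2))                                            // mergeCombiners
      .mapValues { case (count, seen) => (count, seen.size) } // (COUNT(user), COUNT(DISTINCT user)) per zip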

Re: Window function by Spark SQL

2014-12-04 Thread Cheng Lian
Window functions are not supported yet, but there is a PR for it: https://github.com/apache/spark/pull/2953 On 12/5/14 12:22 PM, Dai, Kevin wrote: Hi, ALL How can I group by one column and order by another one, then select the first row for each group (which is just like window function
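
Until that PR lands, a pure-RDD workaround for "first row per group" is a reduceByKey on the grouping column, sketched here with hypothetical field names:

    case class Event(group: String, ts: Long, payload: String)

    val firstPerGroup = events // events: RDD[Event]
      .keyBy(_.group)
      .reduceByKey((a, b) => if (a.ts <= b.ts) a else b) // keeps the row with the smallest ts per group
      .values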

Re: Spark SQL table Join, one task is taking long

2014-12-03 Thread Cheng Lian
Hey Venkat, This behavior seems reasonable. According to the table names, I guess |DAgents| is the fact table here and |ContactDetails| is the dim table. Below is an explanation of a similar query; you may read |src| as |DAgents| and |src1| as |ContactDetails|. |0:

Re: SQL query in scala API

2014-12-03 Thread Cheng Lian
You may do this: |table("users").groupBy('zip)('zip, count('user), countDistinct('user))| On 12/4/14 8:47 AM, Arun Luthra wrote: I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark scala API,

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-26 Thread Cheng Lian
:37 GMT+09:00 Cheng Lian lian.cs@gmail.com: Spark SQL supports complex types, but casting doesn't work for complex types right now. On 11/25/14 4:04 PM, critikaled wrote: https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
What’s the command line you used to build Spark? Notice that you need to add |-Phive-thriftserver| to build the JDBC Thrift server. This profile was once removed in v1.1.0, but added back in v1.2.0 because of a dependency issue introduced by Scala 2.11 support. On 11/27/14 12:53 AM,

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
What version are you trying to build? I was at first assuming you're using the most recent master, but from your first mail it seems that you were trying to build Spark v1.1.0? On 11/27/14 12:57 PM, vdiwakar.malladi wrote: Thanks for your response. I'm using the following command. mvn

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
Hm, then the command line you used should be fine. Actually just tried it locally and it’s fine. Make sure to run it in the root directory of Spark source tree (don’t |cd| into assembly). On 11/27/14 1:35 PM, vdiwakar.malladi wrote: Yes, I'm building it from Spark 1.1.0 Thanks in advance.

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-26 Thread Cheng Lian
I see. As the exception stated, Maven can’t find |unzip| to help build PySpark. So you need a Windows version of |unzip| (probably from MinGW or Cygwin?) On 11/27/14 2:10 PM, vdiwakar.malladi wrote: Thanks for your prompt responses. I'm generating the assembly jar file from Windows 7

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-25 Thread Cheng Lian
Spark SQL supports complex types, but casting doesn't work for complex types right now. On 11/25/14 4:04 PM, critikaled wrote: https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala Doesn't

Re: Spark SQL Join returns less rows that expected

2014-11-25 Thread Cheng Lian
Which version are you using? Or if you are using the most recent master or branch-1.2, which commit are you using? On 11/25/14 4:08 PM, david wrote: Hi, I have 2 files which come from csv import of 2 Oracle tables. F1 has 46730613 rows F2 has 3386740 rows I build 2 tables with

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-24 Thread Cheng Lian
SparkContext unsuccessfully. Let me know if you need anything else. *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Friday, November 21, 2014 8:02 PM *To:* Judy Nash; u...@spark.incubator.apache.org *Subject:* Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Hi

Re: advantages of SparkSQL?

2014-11-24 Thread Cheng Lian
For the “never register a table” part, actually you /can/ use Spark SQL without registering a table via its DSL. Say you’re going to extract an |Int| field named |key| from the table and double it: |import org.apache.spark.sql.catalyst.dsl._ val data = sqc.parquetFile(path) val double =
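
A hedged sketch of that register-free DSL usage against the Spark 1.x SchemaRDD API (assuming importing the SQLContext members brings the symbol conversions into scope):

    import sqc._ // sqc: SQLContext; enables 'symbol column references

    val data = sqc.parquetFile(path)
    val doubled = data.select(('key + 'key) as 'doubledKey) // doubles the Int field `key`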

Re: spark-sql broken

2014-11-22 Thread Cheng Lian
You're probably hitting this issue https://issues.apache.org/jira/browse/SPARK-4532 Patrick made a fix for this https://github.com/apache/spark/pull/3398 On 11/22/14 10:39 AM, tridib wrote: After taking today's build from master branch I started getting this error when run spark-sql: Class

Re: Debug Sql execution

2014-11-22 Thread Cheng Lian
You may try |EXPLAIN EXTENDED sql| to see the logical plan, analyzed logical plan, optimized logical plan and physical plan. Also, |SchemaRDD.toDebugString| shows storage-related debugging information. On 11/21/14 4:11 AM, Gordon Benjamin wrote: hey, Can anyone tell me how to debug a sql
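
A minimal sketch of both debugging aids (query and table are hypothetical):

    val query = "SELECT key, COUNT(*) FROM logs GROUP BY key"
    sqlContext.sql("EXPLAIN EXTENDED " + query).collect().foreach(println) // all four plans
    println(sqlContext.sql(query).toDebugString)                           // RDD lineage and storage info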

Re: querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-22 Thread Cheng Lian
This thread might be helpful http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html On 11/20/14 4:11 AM, Mohammed Guller wrote: Hi – I was curious if anyone is using the Spark SQL Thrift JDBC server with Cassandra. It would be great be if you could share

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-21 Thread Cheng Lian
Hi Judy, could you please provide the commit SHA1 of the version you're using? Thanks! On 11/22/14 11:05 AM, Judy Nash wrote: Hi, Thrift server is failing to start for me on latest spark 1.2 branch. I got the error below when I start thrift server. Exception in thread main

Re: Why is ALS class serializable ?

2014-11-19 Thread Cheng Lian
When a field of an object is enclosed in a closure, the object itself is also enclosed automatically, thus the object needs to be serializable. On 11/19/14 6:39 PM, Hao Ren wrote: Hi, When reading through the ALS code, I find that: class ALS private ( private var numUserBlocks: Int,
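
A sketch illustrating that capture rule (class and field names are hypothetical):

    import org.apache.spark.rdd.RDD

    class Scaler(val factor: Int) { // not Serializable
      def scale(rdd: RDD[Int]): RDD[Int] =
        rdd.map(_ * factor) // `factor` is a field, so `this` is enclosed: Task not serializable
      def scaleSafely(rdd: RDD[Int]): RDD[Int] = {
        val f = factor      // copy into a local first
        rdd.map(_ * f)      // only the Int is enclosed
      }
    }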

Re: Building Spark with hive does not work

2014-11-18 Thread Cheng Lian
Ah... Thanks Ted! And Hao, sorry for being the original trouble maker :) On 11/18/14 1:50 AM, Ted Yu wrote: Looks like this was where you got that commandline: http://search-hadoop.com/m/JW1q5RlPrl Cheers On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Cheng Lian
A not so efficient way can be this: |val r0: RDD[OriginalRow] = ... val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row)) val r2 = r1.keys.distinct().zipWithIndex() val r3 = r2.join(r1).values| On 11/18/14 8:54 PM, shahab wrote: Hi, In my spark application, I am loading some

Re: Building Spark with hive does not work

2014-11-17 Thread Cheng Lian
Hey Hao, Which commit are you using? Just tried 64c6b9b with exactly the same command line flags, couldn't reproduce this issue. Cheng On 11/17/14 10:02 PM, Hao Ren wrote: Hi, I am building spark on the most recent master branch. I checked this page:

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Cheng Lian
(Forgot to cc user mail list) On 11/16/14 4:59 PM, Cheng Lian wrote: Hey Sadhan, Thanks for the additional information, this is helpful. Seems that some Parquet internal contract was broken, but I'm not sure whether it's caused by Spark SQL or Parquet, or even maybe the Parquet file itself

Re: Load json format dataset as RDD

2014-11-16 Thread Cheng Lian
|SQLContext.jsonFile| assumes one JSON record per line. Although I haven’t tried yet, it seems that this |JsonInputFormat| [1] can be helpful. You may read your original data set with |SparkContext.hadoopFile| and |JsonInputFormat|, then transform the resulting |RDD[String]| into a |JsonRDD|

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread Cheng Lian
Hi Sadhan, Could you please provide the stack trace of the |ArrayIndexOutOfBoundsException| (if any)? The reason why the first query succeeds is that Spark SQL doesn’t bother reading all data from the table to give |COUNT(*)|. In the second case, however, the whole table is asked to be

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Which version are you using? You probably hit this bug https://issues.apache.org/jira/browse/SPARK-3421 if some field name in the JSON contains characters other than [a-zA-Z0-9_]. This has been fixed in https://github.com/apache/spark/pull/2563 On 11/14/14 6:35 PM, vdiwakar.malladi wrote:

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Hm, I'm not sure whether this is the official way to upgrade CDH Spark, maybe you can checkout https://github.com/cloudera/spark, apply required patches, and then compile your own version. On 11/14/14 8:46 PM, vdiwakar.malladi wrote: Thanks for your response. I'm using Spark 1.1.0 Currently

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Cheng Lian
13, 2014 at 10:50 PM, Cheng Lian lian.cs@gmail.com wrote: No, the columnar buffer is built in a small batching manner; the batch size is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize| property. The default value for this in master

Re: Spark JDBC Thirft Server over HTTP

2014-11-13 Thread Cheng Lian
HTTP is not supported yet, and I don't think there's a JIRA ticket for it. On 11/14/14 8:21 AM, vs wrote: Does the Spark JDBC Thrift server allow connections over HTTP? http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server doesn't seem to indicate this

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com wrote

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Cheng Lian
If you’re looking for executor side setup and cleanup functions, there ain’t any yet, but you can achieve the same semantics via |RDD.mapPartitions|. Please check the “setup() and cleanup” section of this blog from Cloudera for details:
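
A minimal mapPartitions-based sketch of per-partition setup (setup() and process() are hypothetical user functions):

    rdd.mapPartitions { iter =>
      val conn = setup() // runs once per partition, on the executor
      iter.map(elem => process(conn, elem))
      // running cleanup here is subtle because the returned iterator is lazy
    }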

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Cheng Lian
can I write it like this? rdd.mapPartition(i => setup(); i).map(...).mapPartition(i => cleanup(); i) So I don't need to mess up the logic and can still use map, filter and other transformations on the RDD. Jianshi On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian lian.cs@gmail.com

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. Spark SQL uses in-memory columnar format while caching table rows, so we must read all the raw data and convert them into columnar format. However, you can enable in-memory columnar compression by setting
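
A sketch of the setting mentioned above (table name is hypothetical):

    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("tbl") // rows are converted into (compressed) in-memory columnar batches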

Re: To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread Cheng Lian
You may use |RDD.zipWithIndex|. On 11/10/14 10:03 PM, Lijun Wang wrote: Hi, I need a matrix with each row having an index, e.g., index = 0 for the first row, index = 1 for the second row. Could someone tell me how to generate such an IndexedRowMatrix from a RowMatrix? Besides, is there anyone
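
A sketch of that zipWithIndex approach (rowMatrix is hypothetical):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    val indexed = new IndexedRowMatrix(
      rowMatrix.rows.zipWithIndex().map { case (vector, idx) => IndexedRow(idx, vector) })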

Re: Understanding spark operation pipeline and block storage

2014-11-10 Thread Cheng Lian
On 11/6/14 1:39 AM, Hao Ren wrote: Hi, I would like to understand the pipeline of Spark's operations (transformations and actions) and some details on block storage. Let's consider the following code: val rdd1 = SparkContext.textFile("hdfs://...") rdd1.map(func1).map(func2).count For example, we

Re: thrift jdbc server probably running queries as hive query

2014-11-10 Thread Cheng Lian
Hey Sadhan, I really don't think this is a Spark log... Unlike Shark, Spark SQL doesn't even provide a Hive mode to let you execute queries against Hive. Would you please check whether there is an existing HiveServer2 running there? Spark SQL HiveThriftServer2 is just a Spark port of

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Hi Jean, Thanks for reporting this. This is indeed a bug: for some column types (Binary, Array, Map and Struct, and unfortunately, for some reason, Boolean), a NoopColumnStats is used to collect column statistics, which causes this issue. Filed SPARK-4182 to track this issue; will fix this ASAP.

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059 On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud j...@tellapart.com wrote: Great! Thanks. Sent from my iPad On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote: Hi Jean, Thanks for reporting

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread Cheng Lian
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and related PRs are already merged or being merged to the master branch. On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote: Hi, My Hive is 0.13.1, how to make Spark 1.1.0 run on Hive 0.13? Please advise. Or, any news

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Cheng Lian
Which version of Spark and Hadoop are you using? Could you please provide the full stack trace of the exception? On Tue, Oct 28, 2014 at 5:48 AM, Du Li l...@yahoo-inc.com.invalid wrote: Hi, I was trying to set up Spark SQL on a private cluster. I configured a hive-site.xml under

Re: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Would you mind sharing the DDLs of all involved tables? What format are these tables stored in? Is this issue specific to this query? I guess Hive, Shark and Spark SQL all read from the same HDFS dataset? On 10/27/14 3:45 PM, lyf刘钰帆 wrote: Hi, I am using SparkSQL 1.1.0 with cdh 4.6.0

Re: 答复: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
LOCAL INPATH '/home/data/testFolder/qrytblB.txt' INTO TABLE tblB; *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* October 27, 2014, 16:48 *To:* lyf刘钰帆; user@spark.apache.org *Subject:* Re: SparkSQL display wrong result Would you mind sharing the DDLs of all involved tables? What format

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Cheng Lian
I have never tried this yet, but maybe you can use an in-memory Derby database as metastore https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html I'll investigate this when free, guess we can use this for Spark SQL Hive support testing. On 10/27/14 4:38 PM, Jianshi Huang

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Cheng Lian
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-EmbeddedMetastore Cheers On Oct 27, 2014, at 6:20 AM, Cheng Lian lian.cs@gmail.com wrote: I have never tried this yet, but maybe you can use an in-memory Derby

Re: Getting Spark SQL talking to Sql Server

2014-10-21 Thread Cheng Lian
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL server. Currently Spark SQL can't run queries against SQL server. The foreign data source API planned in Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote: Hi, Is there a simple way to run spark
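
A minimal JdbcRDD sketch (connection URL, query and bounds are hypothetical; the query must contain exactly two ? placeholders):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val people = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:sqlserver://host;databaseName=db;user=u;password=p"),
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
      1L, 1000L, 10, // lowerBound, upperBound, numPartitions
      rs => (rs.getInt(1), rs.getString(2)))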
