Re: spark1.0.1 catalyst transform filter not push down

2014-07-14 Thread Yin Huai
Hi, queryPlan.baseLogicalPlan is not the plan used for execution. Actually, the baseLogicalPlan of a SchemaRDD (queryPlan in your case) is just the parsed plan (the parsed plan will be analyzed, and then optimized. Finally, a physical plan will be created). The plan shows up after you execute "val

Re: Potential bugs in SparkSQL

2014-07-14 Thread Yin Huai
I have opened https://issues.apache.org/jira/browse/SPARK-2474 to track this bug. I will also explain my understanding of the root cause. On Thu, Jul 10, 2014 at 6:03 PM, Michael Armbrust wrote: > Hmm, yeah looks like the table name is not getting applied to the > attributes of m. You can work

Re: Ambiguous references to id : what does it mean ?

2014-07-15 Thread Yin Huai
Hi Jao, Seems the SQL analyzer cannot resolve the references in the Join condition. What is your query? Did you use the Hive Parser (your query was submitted through hql(...)) or the basic SQL Parser (your query was submitted through sql(...))? Thanks, Yin On Tue, Jul 15, 2014 at 8:52 AM, Jaon

Re: Spark Streaming Json file groupby function

2014-07-16 Thread Yin Huai
Hi Srinivas, Seems the query you used is val results = sqlContext.sql("select type from table1"). However, table1 does not have a field called type. The schema of table1 is defined as the class definition of your case class Record (i.e. ID, name, score, and school are fields of your table1). Can yo

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-19 Thread Yin Huai
Can you attach your code? Thanks, Yin On Sat, Jul 19, 2014 at 4:10 PM, chutium wrote: > 160G parquet files (ca. 30 files, snappy compressed, made by cloudera > impala) > > ca. 30 full table scan, took 3-5 columns out, then some normal scala > operations like substring, groupby, filter, at the

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-21 Thread Yin Huai
Hi Victor, Instead of importing sqlContext.createSchemaRDD, can you explicitly call sqlContext.createSchemaRDD(rdd) to create a SchemaRDD? For example, You have a case class Record. case class Record(data_date: String, mobile: String, create_time: String) Then, you create an RDD[Record] and let
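
A minimal sketch of the suggested workaround (the input path and the comma parsing are placeholders, not from the thread):

    case class Record(data_date: String, mobile: String, create_time: String)
    // build an RDD[Record], then convert it explicitly instead of relying on
    // the implicit conversion imported via sqlContext.createSchemaRDD
    val rdd = sc.textFile("/path/to/data").map { line =>
      val f = line.split(",")
      Record(f(0), f(1), f(2))
    }
    val schemaRDD = sqlContext.createSchemaRDD(rdd)
    schemaRDD.registerAsTable("records")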

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-21 Thread Yin Huai
Instead of using union, can you try sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db").registerAsTable("parquetTable")? Then, var all = sql("select some_id, some_type, some_time from parquetTable").map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19 Thanks, Y
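
The code above is cut off by the archive; a completed sketch, assuming the truncated line simply closes its parentheses and that sql is called on the SQLContext:

    sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db")
      .registerAsTable("parquetTable")
    // take the three columns out and reshape each row into a key/value pair
    var all = sqlContext.sql("select some_id, some_type, some_time from parquetTable")
      .map(line => (line(0), (line(1).toString, line(2).toString.substring(0, 19))))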

Re: DROP IF EXISTS still throws exception about "table does not exist"?

2014-07-21 Thread Yin Huai
Hi Nan, It is basically a log entry because your table does not exist. It is not a real exception. Thanks, Yin On Mon, Jul 21, 2014 at 7:10 AM, Nan Zhu wrote: > a related JIRA: https://issues.apache.org/jira/browse/SPARK-2605 > > -- > Nan Zhu > > On Monday, July 21, 2014 at 10:10 AM, Nan Zh

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-22 Thread Yin Huai
On Tue, Jul 22, 2014 at 12:53 AM, Victor Sheng wrote: > Hi, Yin Huai > I test again with your snippet code. > It works well in spark-1.0.1 > > Here is my code: > > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > case class Record(data_date: String, mob

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-23 Thread Yin Huai
Yes, https://issues.apache.org/jira/browse/SPARK-2576 is used to track it. On Wed, Jul 23, 2014 at 9:11 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Do we have a JIRA issue to track this? I think I've run into a similar > issue. > > > On Wed, Jul 23,

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, I will try to reproduce the problem. Thanks, Yin On Wed, Jul 23, 2014 at 11:32 PM, Sarath Chandra < sarathchandra.jos...@algofusiontech.com> wrote: > Hi Michael, > > Sorry for the delayed response. > > I'm using Spark 1.0.1 (pre-built version for hadoop 1). I'm running spark > prog

Re: Simple record matching using Spark SQL

2014-07-24 Thread Yin Huai
Hi Sarath, Have you tried the current branch 1.0? If not, can you give it a try and see if the problem can be resolved? Thanks, Yin On Thu, Jul 24, 2014 at 11:17 AM, Yin Huai wrote: > Hi Sarath, > > I will try to reproduce the problem. > > Thanks, > > Yin > > >

Re: HiveContext is creating metastore warehouse locally instead of in hdfs

2014-07-31 Thread Yin Huai
Another way is to set "hive.metastore.warehouse.dir" explicitly to the HDFS dir storing Hive tables by using the SET command. For example: hiveContext.hql("SET hive.metastore.warehouse.dir=hdfs://localhost:54310/user/hive/warehouse") On Thu, Jul 31, 2014 at 8:05 AM, Andrew Lee wrote: > Hi All, >

Re: Inconsistent Spark SQL behavior when column names contain dots

2014-07-31 Thread Yin Huai
I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it. On Thu, Jul 31, 2014 at 11:47 AM, Budde, Adam wrote: > I still see the same “Unresolved attributes” error when using hql + > backticks. > > Here’s a code snippet that replicates this behavior: > > val hiveContext =

Re: pyspark inferSchema

2014-08-05 Thread Yin Huai
Yes, 2376 has been fixed in master. Can you give it a try? Also, for inferSchema, because Python is dynamically typed, I agree with Davies to provide a way to scan a subset (or all) of the dataset to figure out the proper schema. We will take a look at it. Thanks, Yin On Tue, Aug 5, 2014 at 12

Re: trouble with jsonRDD and jsonFile in pyspark

2014-08-05 Thread Yin Huai
I tried jsonRDD(...).printSchema() and it worked. Seems the problem is when we take the data back to the Python side, SchemaRDD#javaToPython failed on your cases. I have created https://issues.apache.org/jira/browse/SPARK-2875 to track it. Thanks, Yin On Tue, Aug 5, 2014 at 9:20 PM, Brad Miller

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
Hi Brad, It is a bug. I have filed https://issues.apache.org/jira/browse/SPARK-2908 to track it. It will be fixed soon. Thanks, Yin On Thu, Aug 7, 2014 at 10:55 AM, Brad Miller wrote: > Hi All, > > I'm having a bit of trouble with nested data structures in pyspark with > saveAsParquetFile.

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
will have a better story to handle NullType columns ( https://issues.apache.org/jira/browse/SPARK-2695). But, we still will not expose NullType to users. On Thu, Aug 7, 2014 at 1:41 PM, Brad Miller wrote: > Thanks Yin! > > best, > -Brad > > > On Thu, Aug 7, 2014 at 1:39 PM, Yin

Re: trouble with saveAsParquetFile

2014-08-07 Thread Yin Huai
The PR is https://github.com/apache/spark/pull/1840. On Thu, Aug 7, 2014 at 1:48 PM, Yin Huai wrote: > Actually, the issue is if values of a field are always null (or this field > is missing), we cannot figure out the data type. So, we use NullType (it is > an internal data type).

Re: CDH5, HiveContext, Parquet

2014-08-10 Thread Yin Huai
If the link to PR/1819 is broken, here is the one: https://github.com/apache/spark/pull/1819. On Sun, Aug 10, 2014 at 5:56 PM, Eric Friedman wrote: > Thanks Michael, I can try that too. > > I know you guys aren't in sales/marketing (thank G-d), but given all the > hoopla about the CDH<->DataBric

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Yin Huai
Hi Jenny, How's your metastore configured for both Hive and Spark SQL? Which metastore mode are you using (based on https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin )? Thanks, Yin On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao wrote: > > > you can reproduce this issue

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-12 Thread Yin Huai
> org.apache.hive.service.auth.WebConsoleAuthenticationProviderImpl > hive.server2.enable.impersonation: true > hive.security.webconsole.url: http://hdtest022.svl.ibm.com:8080 > hive.security.authorization.enabled: true >

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-13 Thread Yin Huai
mode, I am not able to switch to a database other than the default one, for > Yarn-client mode, it works fine. > > Thanks! > > Jenny > > > On Tue, Aug 12, 2014 at 12:53 PM, Yin Huai wrote: > >> Hi Jenny, >> >> Have you copied hive-site.xml to spark/conf

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Yin Huai
Hi Silvio, You can insert into a static partition via SQL statement. Dynamic partitioning is not supported at the moment. Thanks, Yin On Wed, Aug 13, 2014 at 2:03 PM, Michael Armbrust wrote: > This is not supported at the moment. There are no concrete plans at the > moment to support it tho
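
A sketch of a static-partition insert through HiveContext (the table, partition, and column names are made up for illustration):

    hiveContext.hql("""
      INSERT OVERWRITE TABLE events PARTITION (dt = '2014-08-13')
      SELECT user_id, action FROM staging_events
    """)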

Re: Spark RuntimeException due to Unsupported datatype NullType

2014-08-19 Thread Yin Huai
Hi Rafeeq, I think the following part triggered the bug https://issues.apache.org/jira/browse/SPARK-2908. [{"href":null,"rel":"me"}] It has been fixed. Can you try spark master and see if the error gets resolved? Thanks, Yin On Mon, Aug 11, 2014 at 3:53 AM, rafeeq s wrote: > Hi, > > Spar

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira tracking this issue. On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo wrote: > Thanks, Zhan for the follow up. > > But, do you know how I am supposed to set that table name on the jobConf? I > don't have access to that object from

Re: spark error when distinct on more than one cloume

2014-08-19 Thread Yin Huai
Hi, The SQLParser used by SQLContext is pretty limited. Instead, can you try HiveContext? Thanks, Yin On Tue, Aug 19, 2014 at 7:57 AM, wan...@testbird.com wrote: > > sql:SELECT app_id,COUNT(DISTINCT app_id, macaddr) cut from object group > by app_id > > > *Error Log* > > 14/08/19 17:58:26 IN

Re: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
If you want to filter the table name, you can use hc.sql("show tables").filter(row => !"test".equals(row.getString(0))). Seems making functionRegistry transient can fix the error. On Wed, Aug 20, 2014 at 8:53 PM, Vida Ha wrote: > Hi, > > I doubt the broadcast variable is your problem, sin

RE: Got NotSerializableException when access broadcast variable

2014-08-20 Thread Yin Huai
PR is https://github.com/apache/spark/pull/2074. -- From: Yin Huai Sent: 8/20/2014 10:56 PM To: Vida Ha Cc: tianyi ; Fengyun RAO ; user@spark.apache.org Subject: Re: Got NotSerializableException when access broadcast variable If you want to filter the table name

Re: Spark SQL: Caching nested structures extremely slow

2014-08-21 Thread Yin Huai
I have not profiled this part. But, I think one possible cause is allocating an array for every inner struct for every row (every struct value is represented by a Spark SQL row). I will play with it later and see what I find. On Tue, Aug 19, 2014 at 9:01 PM, Evan Chan wrote: > Hey guys, > > I'm

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Hi Sankar, You need to create an external table in order to specify the location of data (i.e. using CREATE EXTERNAL TABLE user1 LOCATION). You can take a look at this page for r
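
A sketch of the suggested statement (the column list, row format, and location are placeholders; only the table name comes from the thread):

    hiveContext.hql("""
      CREATE EXTERNAL TABLE user1 (id INT, name STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/path/to/data'
    """)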

Re: Spark SQL Parser error

2014-08-22 Thread Yin Huai
Thanks and Regards, > Sankar S. > > > > On , S Malligarjunan wrote: > > > Hello Yin, > > I have tried the create external table command as well. I get the same > error. > Please help me to find the root cause. > > Thanks and Regards, > Sankar S. >

Re: Spark SQL Parser error

2014-08-26 Thread Yin Huai
> ./bin/spark-shell --jars option, > > In all three options, when I try to create a temporary function I get the > ClassNotFoundException. What would be the issue here? > > Thanks and Regards, > Sankar S. > > > > On Saturday, 23 August 2014, 0:53, Yin Huai > wrote: >

Re: unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-26 Thread Yin Huai
Hello Du, Can you check if there is a dir "metastore" in the place where you launch your program? If so, can you delete it and try again? Also, can you try HiveContext? LocalHiveContext is deprecated. Thanks, Yin On Mon, Aug 25, 2014 at 6:33 PM, Du Li wrote: > Hi, > > I created an instance o

Re: Problem Accessing Hive Table from hiveContext

2014-09-01 Thread Yin Huai
Hello Igor, Although Decimal is supported, Hive 0.12 does not support user definable precision and scale (it was introduced in Hive 0.13). Thanks, Yin On Sat, Aug 30, 2014 at 1:50 AM, Zitser, Igor wrote: > Hi All, > New to spark and using Spark 1.0.2 and hive 0.12. > > If hive table created

Re: spark sql - create new_table as select * from table

2014-09-11 Thread Yin Huai
What is the schema of "table"? On Thu, Sep 11, 2014 at 4:30 PM, jamborta wrote: > thanks. this was actually using hivecontext. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html > Sen

Re: spark sql - create new_table as select * from table

2014-09-11 Thread Yin Huai
Oh, never mind. The support of CTAS queries is pretty limited. Can you try to first create the table and then use insert into? On Thu, Sep 11, 2014 at 6:45 PM, Yin Huai wrote: > What is the schema of "table"? > > On Thu, Sep 11, 2014 at 4:30 PM, jamborta wrote: > >&g

Re: Spark SQL and running parquet tables?

2014-09-11 Thread Yin Huai
It is in SQLContext ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext ). On Thu, Sep 11, 2014 at 3:21 PM, DanteSama wrote: > Michael Armbrust wrote > > You'll need to run parquetFile("path").registerTempTable("name") to > > refresh the table. > > I'm not

Re: Re: Spark SQL -- more than two tables for join

2014-09-11 Thread Yin Huai
1.0.1 does not support outer joins (support was added in 1.1). Can you try the 1.1 branch? On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com wrote: > Hi, Michael: > > I think Arthur.hk.chan isn't here now, I can > show something: > 1) my spark version is 1.0.1 > 2) when I use multiple join, like t

Re: compiling spark source code

2014-09-13 Thread Yin Huai
Can you try "sbt/sbt clean" first? On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu wrote: > bq. [error] File name too long > > It is not clear which file(s) loadfiles was loading. > Is the filename in earlier part of the output ? > > Cheers > > On Sat, Sep 13, 2014 at 10:58 AM, kkptninja wrote: > >> Hi

Re: About SparkSQL 1.1.0 join between more than two table

2014-09-15 Thread Yin Huai
1.0.1 does not support outer joins (support was added in 1.1). Your query should be fine in 1.1. On Mon, Sep 15, 2014 at 5:35 AM, Yanbo Liang wrote: > Spark SQL can support SQL and HiveSQL which used SQLContext and > HiveContext separately. > As far as I know, SQLContext of Spark SQL 1.1.0 can not

Re: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao wrote: > Thank you for pasting the steps, I will look at this, hopefully come out > with a solution soon. > > -Original Message- > From: linkpatrickliu [mailto:linkpatrick...@liv

Re: SparkSQL 1.1 hang when "DROP" or "LOAD"

2014-09-16 Thread Yin Huai
I meant it may be a Hive bug since we also call Hive's drop table internally. On Tue, Sep 16, 2014 at 1:44 PM, Yin Huai wrote: > Seems https://issues.apache.org/jira/browse/HIVE-5474 is related? > > On Tue, Sep 16, 2014 at 4:49 AM, Cheng, Hao wrote: > >> Thank you for pa

Re: spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-18 Thread Yin Huai
Hello Andy, Will our JSON support in Spark SQL help your case? If your JSON files store one JSON object per line, you can use SQLContext.jsonFile to load it. If you want to pre-process these files, once you have an RDD[String] (one JSON object per String), you can use SQLContext.jsonRDD. In bot
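
A sketch of both paths (the file layout and the cleanRecord function are assumptions):

    // one JSON object per line on disk
    val events = sqlContext.jsonFile("/path/to/events.json")
    events.registerTempTable("events")
    // or pre-process first, ending up with an RDD[String] of JSON objects
    val cleaned = sc.textFile("/path/to/raw").map(line => cleanRecord(line))
    val events2 = sqlContext.jsonRDD(cleaned)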

Re: Spark SQL 1.1.0: NPE when join two cached table

2014-09-22 Thread Yin Huai
It is a bug. I have created https://issues.apache.org/jira/browse/SPARK-3641 to track it. Thanks for reporting it. Yin On Mon, Sep 22, 2014 at 4:34 AM, Haopu Wang wrote: > I have two data sets and want to join them on each first field. Sample > data are below: > > > > data set 1: > > id2,na

Re: Spark SQL CLI

2014-09-22 Thread Yin Huai
Hi Gaurav, Can you put hive-site.xml in conf/ and try again? Thanks, Yin On Mon, Sep 22, 2014 at 4:02 PM, gtinside wrote: > Hi , > > I have been using spark shell to execute all SQLs. I am connecting to > Cassandra , converting the data in JSON and then running queries on it, I > am using Hi

Re: Spark SQL CLI

2014-09-22 Thread Yin Huai
> Regards, > Gaurav > > On Mon, Sep 22, 2014 at 6:30 PM, Yin Huai wrote: > >> Hi Gaurav, >> >> Can you put hive-site.xml in conf/ and try again? >> >> Thanks, >> >> Yin >> >> On Mon, Sep 22, 2014 at 4:02 PM, gtinside wrote:

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread Yin Huai
What version of Spark did you use? Can you try the master branch? On Mon, Sep 29, 2014 at 1:52 PM, vdiwakar.malladi < vdiwakar.mall...@gmail.com> wrote: > Thanks for your prompt response. > > Still on further note, I'm getting the exception while executing the query. > > "SELECT data[0].name FROM

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-30 Thread Yin Huai
I think this problem has been fixed after the 1.1 release. Can you try the master branch? On Mon, Sep 29, 2014 at 10:06 PM, vdiwakar.malladi < vdiwakar.mall...@gmail.com> wrote: > I'm using the latest version i.e. Spark 1.1.0 > > Thanks. > > > > -- > View this message in context: > http://apache-

Re: partition size for initial read

2014-10-02 Thread Yin Huai
Hi Tamas, Can you try to set mapred.map.tasks and see if it works? Thanks, Yin On Thu, Oct 2, 2014 at 10:33 AM, Tamas Jambor wrote: > That would work - I normally use hive queries through spark sql, I > have not seen something like that there. > > On Thu, Oct 2, 2014 at 3:13 PM, Ashish Jain
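
Since the queries go through Hive, the setting can be applied with a SET statement; a sketch (the value is arbitrary):

    hiveContext.sql("SET mapred.map.tasks=100")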

Re: SparkSQL DataType mappings

2014-10-02 Thread Yin Huai
Hi Costin, I am answering your questions below. 1. You can find the Spark SQL data type reference here. It explains, for each Spark SQL data type, the underlying type in the Scala, Java, and Python APIs. For

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-13 Thread Yin Huai
Question 1: Please check http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#hive-tables. Question 2: One workaround is to re-write it. You can use LEFT SEMI JOIN to implement the subquery with EXISTS and use LEFT OUTER JOIN + IS NULL to implement the subquery with NOT EXISTS. SELECT S_NA
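
A sketch of the two rewrites (table and column names are illustrative, not the query from the thread):

    // EXISTS (...) rewritten as LEFT SEMI JOIN
    hiveContext.sql("""
      SELECT o.* FROM orders o
      LEFT SEMI JOIN lineitem l ON (o.o_orderkey = l.l_orderkey)
    """)
    // NOT EXISTS (...) rewritten as LEFT OUTER JOIN + IS NULL
    hiveContext.sql("""
      SELECT o.* FROM orders o
      LEFT OUTER JOIN lineitem l ON (o.o_orderkey = l.l_orderkey)
      WHERE l.l_orderkey IS NULL
    """)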

Re: Is "Array Of Struct" supported in json RDDs? is it possible to query this?

2014-10-13 Thread Yin Huai
If you are using HiveContext, it should work in 1.1. Thanks, Yin On Mon, Oct 13, 2014 at 5:08 AM, shahab wrote: > Hello, > > Given the following structure, is it possible to query, e.g. session[0].id > ? > > In general, is it possible to query "Array Of Struct" in json RDDs? > > root > > |--
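
A sketch of such a query through HiveContext (the path is a placeholder; the session[0].id field comes from the thread):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.jsonFile("/path/to/sessions.json").registerTempTable("t")
    // index into the array of structs, then pull out a struct field
    hiveContext.sql("SELECT session[0].id FROM t").collect()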

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
Hi Shahab, Can you try to use HiveContext? It should work in 1.1. For SQLContext, this issue was not fixed in 1.1 and you need to use the master branch at the moment. Thanks, Yin On Sun, Oct 12, 2014 at 5:20 PM, shahab wrote: > Hi, > > Apparently it is possible to query nested json using sp

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
Seems the reason that you got "wrong" results was caused by timezone. The time in java.sql.Timestamp(long time) means "milliseconds since January 1, 1970, 00:00:00 *GMT*. A negative number is the number of milliseconds before January 1, 1970, 00:00:00 *GMT*." However, in ts>='1970-01-01 00:00:00'
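
A small illustration of the GMT semantics (the printed value depends on the local timezone):

    // 0L means 1970-01-01 00:00:00 GMT, but toString renders local time,
    // e.g. "1969-12-31 16:00:00.0" on a machine set to PST
    val ts = new java.sql.Timestamp(0L)
    println(ts)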

Re: Nested Query using SparkSQL 1.1.0

2014-10-13 Thread Yin Huai
"Easy JSON Data Manipulation in Spark"), is it > possible to perform aggregation kind queries, > for example counting the number of attributes (considering that attributes in > the schema are presented as "array"), or any other type of aggregation? > > best, > /Shahab

Re: Spark SQL parser bug?

2014-10-13 Thread Yin Huai
org.apache.spark.sql.SchemaRDD = > > SchemaRDD[20] at RDD at SchemaRDD.scala:103 > > == Query Plan == > > == Physical Plan == > > Project [a#2] > > ExistingRdd [a#2,ts#3], MapPartitionsRDD[22] at mapPartitions at > basicOperators.scala:208 > > > > scala> s.

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-14 Thread Yin Huai
Hello Terry, How many columns does pqt_rdt_snappy have? Thanks, Yin On Tue, Oct 14, 2014 at 11:52 AM, Terry Siu wrote: > Hi Michael, > > That worked for me. At least I’m now further than I was. Thanks for the > tip! > > -Terry > > From: Michael Armbrust > Date: Monday, October 13, 2014

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-16 Thread Yin Huai
name  data_type  comment > coll_def_id  string  None > seg_def_id  string  None > Time taken: 0.788 seconds, Fetched: 29 row(s) > As you can see, I have 21 data columns, foll

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
Hi Tridib, For the second approach, can you attach the complete stack trace? Thanks, Yin On Mon, Oct 20, 2014 at 8:24 PM, Michael Armbrust wrote: > I think you are running into a bug that will be fixed by this PR: > https://github.com/apache/spark/pull/2850 > > On Mon, Oct 20, 2014 at 4:34 PM

Re: spark sql: timestamp in json - fails

2014-10-20 Thread Yin Huai
https://issues.apache.org/jira/browse/SPARK-4003 > > You can check PR https://github.com/apache/spark/pull/2850. > > Thanks, > > Daoyuan > > From: Yin Huai [mailto:huaiyin@gmail.com] > Sent: Tuesday, October 21, 2014 10:00 AM > To: Michael Armbrust > Cc: tridib; u...@sp

Re: spark sql: sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread Yin Huai
Are there any specific issues you are facing? Thanks, Yin On Tue, Oct 21, 2014 at 4:00 PM, tridib wrote: > Any help? or comments? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-sqlContext-jsonFile-date-type-detection-and-perforormance-

Re: Spark SQL : sqlContext.jsonFile date type detection and perforormance

2014-10-21 Thread Yin Huai
code snippet that may help. val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) val schemaRDD = hiveContext.jsonFile(...) schemaRDD.registerTempTable("jsonTable") hiveContext.sql("SELECT CAST(columnName as DATE) FROM jsonTable") Thanks, Yin On Tue, Oct 21, 20

Re: SchemaRDD Convert

2014-10-22 Thread Yin Huai
The implicit conversion function mentioned by Hao is createSchemaRDD in SQLContext/HiveContext. You can import it by doing val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Or new org.apache.spark.sql.hive.HiveContext(sc) for HiveContext import sqlContext.createSchemaRDD On Wed, Oct 2

Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread Yin Huai
Hello Arthur, You can do aggregations in SQL. How did you create LINEITEM? Thanks, Yin On Thu, Oct 23, 2014 at 8:54 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I got $TreeNodeException, few questions: > Q1) How should I do aggregation in SparK? Can I use aggre

Re: Cleaning/transforming json befor converting to SchemaRDD

2014-11-04 Thread Yin Huai
Hi Daniel, Right now, you need to do the transformation manually. The feature you need is under development (https://issues.apache.org/jira/browse/SPARK-4190). Thanks, Yin On Tue, Nov 4, 2014 at 2:44 AM, Gerard Maas wrote: > You could transform the json to a case class instead of serializing

Re: save as JSON objects

2014-11-04 Thread Yin Huai
Hello Andrejs, For now, you need to use a JSON lib to serialize records of your datasets as JSON strings. In future, we will add a method to SchemaRDD to let you write a SchemaRDD in JSON format (I have created https://issues.apache.org/jira/browse/SPARK-4228 to track it). Thanks, Yin On Tue, N

Re: spark sql create nested schema

2014-11-04 Thread Yin Huai
Hello Tridib, For your case, you can use StructType(StructField("ParentInfo", parentInfo, true) :: StructField("ChildInfo", childInfo, true) :: Nil) to create the StructType representing the schema (parentInfo and childInfo are two existing StructTypes). You can take a look at our docs ( http://spa
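
A self-contained sketch (the fields inside parentInfo and childInfo are placeholders):

    import org.apache.spark.sql._

    val parentInfo = StructType(StructField("name", StringType, true) :: Nil)
    val childInfo = StructType(StructField("age", IntegerType, true) :: Nil)
    // nest the two existing StructTypes as fields of the outer schema
    val schema = StructType(
      StructField("ParentInfo", parentInfo, true) ::
      StructField("ChildInfo", childInfo, true) :: Nil)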

Re: [SQL] PERCENTILE is not working

2014-11-05 Thread Yin Huai
Hello Kevin, https://issues.apache.org/jira/browse/SPARK-3891 will fix this bug. Thanks, Yin On Wed, Nov 5, 2014 at 8:06 PM, Cheng, Hao wrote: > Which version are you using? I can reproduce that in the latest code, but > with different exception. > I've filed an bug https://issues.apache.org/

Re: jsonRdd and MapType

2014-11-07 Thread Yin Huai
Hello Brian, Right now, MapType is not supported in the StructType provided to jsonRDD/jsonFile. We will add the support. I have created https://issues.apache.org/jira/browse/SPARK-4302 to track this issue. Thanks, Yin On Fri, Nov 7, 2014 at 3:41 PM, boclair wrote: > I'm loading json into spa

Re: Converting a json struct to map

2014-11-19 Thread Yin Huai
Oh, actually, we do not support MapType in the schema given to jsonRDD at the moment (my bad..). Daniel, you need to wait for the patch for 4476 (I should have one soon). Thanks, Yin On Wed, Nov 19, 2014 at 2:32 PM, Daniel Haviv wrote: > Thank you Michael > I will try it out tomorrow >

Re: How to deal with BigInt in my case class for RDD => SchemaRDD convertion

2014-11-21 Thread Yin Huai
Hello Jianshi, The reason for that error is that we do not have a Spark SQL data type for Scala BigInt. You can use Decimal for your case. Thanks, Yin On Fri, Nov 21, 2014 at 5:11 AM, Jianshi Huang wrote: > Hi, > > I got an error during rdd.registerTempTable(...) saying scala.MatchError: > sca
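
A sketch of the change (the data and the second field are made up):

    // scala.math.BigInt has no Spark SQL mapping; BigDecimal maps to DecimalType
    case class Record(id: BigDecimal, name: String)
    val rdd = sc.parallelize(Seq(Record(BigDecimal("12345678901234567890"), "a")))
    import sqlContext.createSchemaRDD
    rdd.registerTempTable("records")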

Re: Spark SQL Join returns less rows that expected

2014-11-25 Thread Yin Huai
I guess you want to use split("\\|") instead of split("|"). On Tue, Nov 25, 2014 at 4:51 AM, Cheng Lian wrote: > Which version are you using? Or if you are using the most recent master or > branch-1.2, which commit are you using? > > > On 11/25/14 4:08 PM, david wrote: > >> Hi, >> >> I have 2
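
For reference, String.split takes a regular expression, so an unescaped pipe is the empty alternation and splits at every position:

    "a|b|c".split("|")    // every character becomes its own element
    "a|b|c".split("\\|")  // Array(a, b, c), as intended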

Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-26 Thread Yin Huai
Hello Jonathan, There was a bug regarding casting data types before inserting into a Hive table. Hive does not have the notion of "containsNull" for array values. So, for a Hive table, the containsNull will be always true for an array and we should ignore this field for Hive. This issue has been f

Re: can't get smallint field from hive on spark

2014-11-26 Thread Yin Huai
For "hive on spark", did you mean the thrift server of Spark SQL or https://issues.apache.org/jira/browse/HIVE-7292? If you meant the latter one, I think Hive's mailing list will be a good place to ask (see https://hive.apache.org/mailing_lists.html). Thanks, Yin On Wed, Nov 26, 2014 at 10:49 PM

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Yin Huai
Hello Jianshi, You meant you want to convert a Map to a Struct, right? We can extract some useful functions from JsonRDD.scala, so others can access them. Thanks, Yin On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang wrote: > I checked the source code for inferSchema. Looks like this is exactly w

Re: How to convert RDD to JSON?

2014-12-08 Thread Yin Huai
If you are using Spark SQL in 1.2, you can use toJSON to convert a SchemaRDD to an RDD[String] that contains one JSON object per string value. Thanks, Yin On Mon, Dec 8, 2014 at 11:52 PM, YaoPau wrote: > Pretty straightforward: Using Scala, I have an RDD that represents a table > with four col
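
A sketch (the table name and output path are placeholders; the method on SchemaRDD in 1.2 is toJSON):

    val schemaRDD = sqlContext.sql("SELECT * FROM myTable")
    // one JSON object per String element
    val json: org.apache.spark.rdd.RDD[String] = schemaRDD.toJSON
    json.saveAsTextFile("/path/to/output")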

Re: Hive UDAF percentile_approx says "This UDAF does not support the deprecated getEvaluator() method."

2015-01-13 Thread Yin Huai
Yeah, it's a bug. It has been fixed by https://issues.apache.org/jira/browse/SPARK-3891 in master. On Tue, Jan 13, 2015 at 2:41 PM, Ted Yu wrote: > Looking at the source code for AbstractGenericUDAFResolver, the following > (non-deprecated) method should be called: > > public GenericUDAFEvalua
