Re: Proper caching method

2014-04-14 Thread Cheng Lian
Hi Joe, You need to figure out which RDD is used most frequently. In your case, rdd2 and rdd3 are filtered results of rdd1, so usually they are relatively smaller than rdd1, and it would be more reasonable to cache rdd2 and/or rdd3 if rdd1 is not referenced elsewhere. Say rdd1 takes 10G, rdd2 takes 1G
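A minimal sketch of this advice (the input path and filter predicates are assumptions for illustration):

    val rdd1 = sc.textFile("hdfs://...")                   // large source RDD, scanned from disk
    val rdd2 = rdd1.filter(_.contains("ERROR")).cache()    // small filtered result, cheap to cache
    val rdd3 = rdd1.filter(_.contains("WARN")).cache()
    // later actions on rdd2/rdd3 reuse the cached data instead of rescanning rdd1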

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Cheng Lian
Your Spark solution first reduces partial results into a single partition, computes the final result, and then collects it to the driver side. This involves a shuffle and two waves of network traffic. Instead, you can directly collect partial results to the driver and then compute the final results
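A sketch of the suggested pattern, assuming an RDD[Long] and a per-partition sum as the partial result:

    // compute one partial result per partition, ship only those to the driver,
    // and finish the reduction locally; no shuffle is needed
    val partials = rdd.mapPartitions(it => Iterator(it.foldLeft(0L)(_ + _))).collect()
    val result = partials.sum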

Re: StackOverflow Error when run ALS with 100 iterations

2014-04-15 Thread Cheng Lian
Probably this JIRA issue https://spark-project.atlassian.net/browse/SPARK-1006 solves your problem. When running with a large iteration count, the lineage DAG of ALS becomes very deep; both DAGScheduler and the Java serializer may overflow because they are implemented in a recursive way. You may resort
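A common way to keep the lineage shallow is periodic checkpointing; a sketch (the checkpoint directory, interval, and |step| function are assumptions):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    var ratings = initialRdd
    for (i <- 1 to 100) {
      ratings = step(ratings)        // one hypothetical ALS-style iteration
      if (i % 10 == 0) {
        ratings.checkpoint()         // truncates the lineage DAG
        ratings.count()              // force materialization of the checkpoint
      }
    }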

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
A tip: using println is only convenient when you are working in local mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output of println goes to executor stdout. On Fri, Apr 18, 2014 at 6:53 AM, 诺铁 noty...@gmail.com wrote: yeah, I got it! using println to debug is great

Re: confused by reduceByKey usage

2014-04-17 Thread Cheng Lian
Is there a better way to debug? On Fri, Apr 18, 2014 at 9:27 AM, Cheng Lian lian.cs@gmail.com wrote: A tip: using println is only convenient when you are working in local mode. When running Spark in cluster mode (standalone/YARN/Mesos), the output of println goes to executor stdout. On Fri

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Without caching, an RDD will be evaluated multiple times if referenced multiple times by other RDDs. A silly example: val text = sc.textFile("input.log"); val r1 = text.filter(_ startsWith "ERROR"); val r2 = text.map(_.trim); val r3 = (r1 ++ r2).collect() Here the input file will be scanned twice
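A sketch of the fix: cache the shared RDD so both derived RDDs reuse one scan of the input:

    val text = sc.textFile("input.log").cache()   // evaluated once, reused by both r1 and r2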

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
Shouldn't the DAG optimizer optimize these routines? Sorry if it's a dumb question :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Apr 23, 2014 at 12:29 PM, Cheng Lian lian.cs@gmail.com wrote: Without

Re: two calls of saveAsTextFile() have different results on the same RDD

2014-04-23 Thread Cheng Lian
elements are printed only once. On Wed, Apr 23, 2014 at 4:35 PM, Cheng Lian lian.cs@gmail.com wrote: Good question :) Although the RDD DAG is lazily evaluated, it’s not exactly the same as a Scala lazy val. For a Scala lazy val, the evaluated value is automatically cached, while evaluated RDD elements
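A minimal contrast between the two behaviors (a sketch; where the println output lands depends on where the closure runs):

    lazy val x = { println("evaluated"); 42 }
    x; x                                   // "evaluated" is printed only once
    val r = sc.parallelize(1 to 3).map { i => println(i); i }
    r.count(); r.count()                   // without .cache(), the map closure runs twice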

Re: Access Last Element of RDD

2014-04-24 Thread Cheng Lian
You may try this: val lastOption = sc.textFile("input").mapPartitions { iterator => if (iterator.isEmpty) { iterator } else { Iterator .continually((iterator.next(), iterator.hasNext)) .collect { case (value, false) => value } .take(1) } }.collect().lastOption

Re: Cache issue for iteration with broadcast

2014-05-05 Thread Cheng Lian
Have you tried Broadcast.unpersist()? On Mon, May 5, 2014 at 6:34 PM, Earthson earthson...@gmail.com wrote: RDD.checkpoint works fine. But spark.cleaner.ttl is really ugly for broadcast cleaning. Maybe it could be removed automatically when there are no dependencies. -- View this message in
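A usage sketch: rebroadcast each iteration and unpersist the previous value instead of relying on spark.cleaner.ttl (|update| and |merge| are hypothetical functions):

    var model = initialModel
    for (i <- 1 to 10) {
      val bc = sc.broadcast(model)
      model = rdd.map(x => update(x, bc.value)).reduce(merge)
      bc.unpersist()                // drop the stale broadcast blocks on the executors
    }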

Re: Join : Giving incorrect result

2014-06-04 Thread Cheng Lian
Hi Ajay, would you mind synthesizing a minimal code snippet that can reproduce this issue and pasting it here? On Wed, Jun 4, 2014 at 8:32 PM, Ajay Srivastava a_k_srivast...@yahoo.com wrote: Hi, I am doing a join of two RDDs which is giving different results (counting number of records) each

Re: Is it possible to read file head in each partition?

2014-07-30 Thread Cheng Lian
What's the format of the file header? Is it possible to filter them out by prefix string matching or regex? On Wed, Jul 30, 2014 at 1:39 PM, Fengyun RAO raofeng...@gmail.com wrote: It will certainly cause bad performance, since it reads the whole content of a large file into one value,
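A sketch of the prefix-matching idea, assuming header lines start with "#":

    val records = sc.textFile("hdfs://path/to/input").filter(!_.startsWith("#"))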

Re: Got error “java.lang.IllegalAccessError when using HiveContext in Spark shell on AWS

2014-08-07 Thread Cheng Lian
Hey Zhun, Thanks for the detailed problem description. Please see my comments inlined below. On Thu, Aug 7, 2014 at 6:18 PM, Zhun Shen shenzhunal...@gmail.com wrote: Caused by: java.lang.IllegalAccessError: tried to access method

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
You may use groupByKey in this case. On Aug 7, 2014, at 9:18 PM, Konstantin Kudryavtsev kudryavtsev.konstan...@gmail.com wrote: Hi there, I'm interested if it is possible to get the same behavior as for reduce function from MR framework. I mean for each key K get list of associated

Re: reduceByKey to get all associated values

2014-08-07 Thread Cheng Lian
The point is that in many cases the operation passed to reduceByKey aggregates data into a much smaller size, say, + and * for integers. String concatenation doesn’t actually “shrink” data, thus in your case, rdd.reduceByKey(_ ++ _) and rdd.groupByKey suffer from similar performance issues. In general,
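A sketch of the two equivalent formulations, assuming an RDD[(K, String)]; both end up materializing every value per key:

    val grouped      = rdd.groupByKey()           // RDD[(K, Iterable[String])]
    val concatenated = rdd.reduceByKey(_ ++ _)    // concatenation doesn't shrink the data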

Re: Save an RDD to a SQL Database

2014-08-07 Thread Cheng Lian
Maybe a little off topic, but would you mind sharing your motivation for saving the RDD into an SQL DB? If you’re just trying to do further transformations/queries with SQL for convenience, then you may just use Spark SQL directly within your Spark application without saving them into a DB:

Re: Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-08-07 Thread Cheng Lian
Things have changed a bit in the master branch, and the SQL programming guide in master branch actually doesn’t apply to branch-1.0-jdbc. In branch-1.0-jdbc, Hive Thrift server and Spark SQL CLI are included in the hive profile and are thus not enabled by default. You need to either - pass

Re: Spark SQL dialect

2014-08-10 Thread Cheng Lian
Currently the SQL dialect provided by Spark SQL only supports a set of the most frequently used constructs and doesn't support DDL or DML operations. In the long run, we'd like to replace it with a full-featured SQL-92 implementation. On Sat, Aug 9, 2014 at 8:11 AM, Sathish Kumaran Vairavelu

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-10 Thread Cheng Lian
Hi Jenny, does this issue only happen when running Spark SQL with YARN in your environment? On Sat, Aug 9, 2014 at 3:56 AM, Jenny Zhao linlin200...@gmail.com wrote: Hi, I am able to run my hql query on yarn cluster mode when connecting to the default hive metastore defined in

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Cheng Lian
Since you were using hql(...), it’s probably not related to the JDBC driver. But I failed to reproduce this issue locally with a single-node pseudo-distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks. On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian

Re: Spark SQL JDBC

2014-08-11 Thread Cheng Lian
Hi John, the JDBC Thrift server resides in its own build profile and needs to be enabled explicitly by ./sbt/sbt -Phive-thriftserver assembly. On Tue, Aug 5, 2014 at 4:54 AM, John Omernik j...@omernik.com wrote: I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with the

Re: Spark SQL JDBC

2014-08-13 Thread Cheng Lian
) at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162) ... 32 more Any suggestions? On Tue, Aug 12, 2014 at 12:47 AM, Cheng Lian lian.cs@gmail.com wrote: Hi John, the JDBC Thrift server resides in its own build profile and need to be enabled

Re: s3:// sequence file startup time

2014-08-18 Thread Cheng Lian
Maybe irrelevant, but this resembles a lot the S3 Parquet file issue we've met before. It takes a dozen minutes to read the metadata because the ParquetInputFormat tries to call getFileStatus for all part-files sequentially. Just checked SequenceFileInputFormat, and found that a MapFile may share

Re: Potential Thrift Server Bug on Spark SQL,perhaps with cache table?

2014-08-25 Thread Cheng Lian
Hi John, I tried to follow your description but failed to reproduce this issue. Would you mind providing some more details? Especially: - Exact Git commit hash of the snapshot version you were using Mine: e0f946265b9ea5bc48849cf7794c2c03d5e29fba

Re: How to get prerelease thriftserver working?

2014-08-27 Thread Cheng Lian
Hey Matt, if you want to access existing Hive data, you still need to run a Hive metastore service, and provide a proper hive-site.xml (just drop it in $SPARK_HOME/conf). Could you provide the error log you saw? On Wed, Aug 27, 2014 at 12:09 PM, Michael Armbrust mich...@databricks.com

Re: Spark / Thrift / ODBC connectivity

2014-08-29 Thread Cheng Lian
You can use the Thrift server to access Hive tables that live in a legacy Hive warehouse and/or those generated by Spark SQL. Simba provides a Spark SQL ODBC driver that enables applications like Tableau. But right now I'm not 100% sure whether the driver has been officially released yet. On

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
You can always use sqlContext.uncacheTable to uncache the old table. On Fri, Sep 12, 2014 at 10:33 AM, pankaj.arora pankajarora.n...@gmail.com wrote: Hi Patrick, What if all the data has to be kept in cache all the time. If applying union results in a new RDD then caching this would result into
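A usage sketch of the cache/uncache cycle (the table name is assumed):

    sqlContext.uncacheTable("snapshot")    // release the stale cached copy
    // ... re-register the union-ed RDD under the same name ...
    sqlContext.cacheTable("snapshot")      // cache the refreshed table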

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread Cheng Lian
Ah, I see. So basically what you need is something like cache write-through support, which exists in Shark but is not implemented in Spark SQL yet. In Shark, when inserting data into a table that has already been cached, the newly inserted data will be automatically cached and “union”-ed with the

Re: Issue with Spark-1.1.0 and the start-thriftserver.sh script

2014-09-26 Thread Cheng Lian
Hi Helene, Thanks for the report. In Spark 1.1, we use a special exit code to indicate that |SparkSubmit| failed because of a class-not-found error. But unfortunately I chose a not-so-special exit code — 1… So whenever the process exits with 1 as the exit code, the |-Phive| error message is shown. A PR that

Re: Using one sql query's result inside another sql query

2014-09-26 Thread Cheng Lian
Hi Twinkle, The failure is caused by case sensitivity. The temp table actually stores the original un-analyzed logical plan, thus field names remain capitalized (F1, F2, etc.). I believe this issue has already been fixed by PR #2382 https://github.com/apache/spark/pull/2382. As a workaround, you

Re: SparkSQL Thriftserver in Mesos

2014-09-26 Thread Cheng Lian
You can avoid installing Spark on each node by uploading the Spark distribution tarball to HDFS and setting |spark.executor.uri| to the HDFS location. In this way, Mesos will download and extract the tarball before launching containers. Please refer to this Spark documentation page
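A configuration sketch (the tarball URI is an assumption):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.uri", "hdfs:///spark/spark-1.1.0-bin-hadoop2.4.tgz")
    val sc = new org.apache.spark.SparkContext(conf)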

Re: problem with HiveContext inside Actor

2014-09-26 Thread Cheng Lian
This is reasonable, since the constructor that actually gets called is |Driver()| rather than |Driver(HiveConf)|. The former initializes the |conf| field by: |conf = SessionState.get().getConf() | And |SessionState.get()| reads a thread-local (TSS) value. Thus executing SQL queries within another thread causes

Re: Access file name in map function

2014-09-26 Thread Cheng Lian
If the size of each file is small, you may try |SparkContext.wholeTextFiles|. Otherwise you can try something like this: |val filenames: Seq[String] = ... val combined: RDD[(String, String)] = filenames.map { name => sc.textFile(name).map(line => name -> line) }.reduce(_ ++ _) | On 9/26/14

Re: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Cheng Lian
Yes it is. The in-memory storage used with |SchemaRDD| also uses |RDD.cache()| under the hood. On 9/26/14 4:04 PM, Haopu Wang wrote: Hi, I'm querying a big table using Spark SQL. I see very long GC time in some stages. I wonder if I can improve it by tuning the storage parameter. The

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Cheng Lian
Would you mind providing the DDL of this partitioned table together with the query you tried? The stacktrace suggests that the query was trying to cast a map into something else, which is not supported in Spark SQL. And I doubt whether Hive supports casting a complex type to some other type.

Re: problem with HiveContext inside Actor

2014-09-26 Thread Cheng Lian
This fix is reasonable, since the constructor that actually gets called is |Driver()| rather than |Driver(HiveConf)|. The former initializes the |conf| field by: |conf = SessionState.get().getConf() | And |SessionState.get()| reads a thread-local (TSS) value. Thus executing SQL queries within another thread

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Cheng Lian
Would you mind providing the DDL of this partitioned table together with the query you tried? The stacktrace suggests that the query was trying to cast a map into something else, which is not supported in Spark SQL. And I doubt whether Hive supports casting a complex type to some other type.

Re: Using one sql query's result inside another sql query

2014-09-28 Thread Cheng Lian
[works] queryResult1withSchema = hiveContext.applySchema(Queryresult1, Queryresult1.schema) registerTempTable(queryResult1withSchema) Queryresult2 = Query2 using queryResult1withSchema [works] On Fri, Sep 26, 2014 at 5:13 PM, Cheng Lian lian.cs@gmail.com

Re: Unresolved attributes: SparkSQL on the schemaRDD

2014-09-29 Thread Cheng Lian
In your case, the table has only one row, whose content is “data”, which is an array. You need something like |SELECT data[0].name FROM json_table| to access the |name| field. On 9/29/14 11:08 PM, vdiwakar.malladi wrote: Hello, I'm exploring SparkSQL and I'm facing an issue while using the

Re: HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Cheng Lian
Cache table works with partitioned tables. I guess you’re experimenting with a default local metastore and the metastore_db directory doesn’t exist in the first place. In this case, all metastore tables/views don’t exist at first and will throw the error message you saw when the |PARTITIONS|

Re: SparkSQL on Hive error

2014-10-03 Thread Cheng Lian
Also make sure to call |hiveContext.sql| within the same thread where |hiveContext| is created, because Hive uses a thread-local variable to initialize the |Driver.conf|. On 10/3/14 4:52 PM, Michael Armbrust wrote: Are you running master? There was briefly a regression here that is hopefully
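A sketch of the safe pattern (the table name is assumed):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("SELECT COUNT(*) FROM src")   // same thread that created the context: OK
    // handing hiveContext.sql calls to another thread may hit an uninitialized SessionState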

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Cheng Lian
The build command should be correct. What exact error did you encounter when trying Spark 1.1 + Hive 0.12 + Hadoop 2.5.0? On 10/7/14 2:21 PM, Li HM wrote: Thanks for the reply. Please refer to my other post entitled How to make ./bin/spark-sql work with hive. It has all the

Re: Spark SQL parser bug?

2014-10-10 Thread Cheng Lian
Hi Mohammed, Would you mind sharing the DDL of the table |x| and the complete stacktrace of the exception you got? A full Spark shell session history would be more than helpful. PR #2084 had been merged into master in August, and timestamp type is supported in 1.1. I tried the following

Re: Spark SQL - Exception only when using cacheTable

2014-10-10 Thread Cheng Lian
Hi Poiuytrez, what version of Spark are you using? Exception details like stacktrace are really needed to investigate this issue. You can find them in the executor logs, or just browse the application stderr/stdout link from Spark Web UI. On 10/9/14 9:37 PM, poiuytrez wrote: Hello, I have a

Re: Unable to share Sql between HiveContext and JDBC Thrift Server

2014-10-10 Thread Cheng Lian
Which version are you using? Also, |.saveAsTable()| saves the table to the Hive metastore, so you need to make sure your Spark application points to the same Hive metastore instance as the JDBC Thrift server. For example, put |hive-site.xml| under |$SPARK_HOME/conf|, and run |spark-shell| and

Re: Spark SQL parser bug?

2014-10-10 Thread Cheng Lian
Hmm, there is a “T” in the timestamp string, which makes it an invalid timestamp string representation. Internally Spark SQL uses |java.sql.Timestamp.valueOf| to cast a string to a timestamp. On 10/11/14 2:08 AM, Mohammed Guller wrote: scala rdd.registerTempTable(x) scala val sRdd
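For illustration, |Timestamp.valueOf| expects the JDBC escape format "yyyy-mm-dd hh:mm:ss[.f...]", so an ISO-8601 string parses only after the "T" is replaced:

    java.sql.Timestamp.valueOf("2014-10-11T12:34:56".replace("T", " "))   // OK
    // java.sql.Timestamp.valueOf("2014-10-11T12:34:56") throws IllegalArgumentException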

Re: Spark SQL - Exception only when using cacheTable

2014-10-11 Thread Cheng Lian
How was the table created? Would you mind sharing the related code? It seems that the underlying type of the |customer_id| field is actually long, but the schema says it’s integer; basically it’s a type mismatch error. The first query succeeds because |SchemaRDD.count()| is translated to

Re: spark-sql failing for some tables in hive

2014-10-11 Thread Cheng Lian
Hmm, the details of the error didn't show in your mail... On 10/10/14 12:25 AM, sadhan wrote: We have a hive deployement on which we tried running spark-sql. When we try to do describe table_name for some of the tables, spark-sql fails with this: while it works for some of the other tables.

Re: Setting SparkSQL configuration

2014-10-13 Thread Cheng Lian
Currently Spark SQL doesn’t support reading SQL-specific configurations via system properties. But for |HiveContext|, you can put them in |hive-site.xml|. On 10/13/14 4:28 PM, Kevin Paul wrote: Hi all, I tried to set the configuration spark.sql.inMemoryColumnarStorage.compressed, and

Re: Steps to connect BI Tools with Spark SQL using Thrift JDBC server

2014-10-14 Thread Cheng Lian
Denny Lee wrote an awesome article on how to connect Tableau to Spark SQL recently: https://www.concur.com/blog/en-us/connect-tableau-to-sparksql On 10/14/14 6:10 PM, Neeraj Garg02 wrote: Hi Everybody, I’m looking for information on possible Thrift JDBC/ODBC clients and Thrift JDBC/ODBC

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-14 Thread Cheng Lian
On 10/14/14 7:31 PM, Neeraj Garg02 wrote: Hi All, I’ve downloaded and installed Apache Spark 1.1.0 pre-built for Hadoop 2.4. Now, I want to test two features of Spark: |1.|*YARN deployment* : As per my understanding, I need to modify “spark-defaults.conf” file with the settings mentioned

Re: SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-16 Thread Cheng Lian
The warehouse location needs to be specified before the |HiveContext| initialization; you can set it via: |./bin/spark-sql --hiveconf hive.metastore.warehouse.dir=/home/spark/hive/warehouse | On 10/15/14 8:55 PM, Hao Ren wrote: Hi, The following query in sparkSQL 1.1.0 CLI doesn't work.

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread Cheng Lian
On 10/16/14 12:44 PM, neeraj wrote: I would like to reiterate that I don't have Hive installed on the Hadoop cluster. I have some queries on following comment from Cheng Lian-2: The Thrift server is used to interact with existing Hive data, and thus needs Hive Metastore to access Hive catalog

Re: [SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-16 Thread Cheng Lian
Why do you need to convert a JavaSchemaRDD to SchemaRDD? Are you trying to use some API that doesn't exist in JavaSchemaRDD? On 10/15/14 5:50 PM, Earthson wrote: I don't know why the JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found that DataTypeConversions is protected[sql]. Finally I

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-16 Thread Cheng Lian
On 10/16/14 10:48 PM, neeraj wrote: 1. I'm trying to use Spark SQL as data source.. is it possible? Unfortunately Spark SQL ODBC/JDBC support are based on the Thrift server, so at least you need HDFS and a working Hive Metastore instance (used to persist catalogs) to make things work. 2.

Re: Help required on exercise Data Exploratin using Spark SQL

2014-10-16 Thread Cheng Lian
Hi Neeraj, The Spark Summit 2014 tutorial uses Spark 1.0. I guess you're using Spark 1.1? Parquet support got polished quite a bit since then, and changed the string representation of the query plan, but this output should be OK :) Cheng On 10/16/14 10:45 PM, neeraj wrote: Hi, I'm

Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian
Hi Michael, I'm not sure I fully understood your question, but I think RDD.aggregate can be helpful in your case. You can see it as a more general version of fold. Cheng On 10/16/14 11:15 PM, Michael Misiewicz wrote: Hi, I'm working on a problem where I'd like to sum items in an RDD /in
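A sketch of |RDD.aggregate| as a generalized fold, computing sum and count in one pass (the concrete combiners are illustrative):

    val (sum, count) = sc.parallelize(1 to 100).aggregate((0, 0))(
      (acc, x) => (acc._1 + x, acc._2 + 1),      // fold elements within a partition
      (a, b)   => (a._1 + b._1, a._2 + b._2))    // merge per-partition results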

Re: Spark SQL DDL, DML commands

2014-10-16 Thread Cheng Lian
I guess you're referring to the simple SQL dialect recognized by the SqlParser component. Spark SQL supports most DDL and DML of Hive. But the simple SQL dialect is still very limited. Usually it's used together with some Spark application written in Java/Scala/Python. Within a Spark

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread Cheng Lian
You can first union them into a single RDD and then call |foreach|. In Scala: |rddList.reduce(_.union(_)).foreach(myFunc) | For the serialization issue, I don’t have any clue unless more code can be shared. On 10/16/14 11:39 PM, /soumya/ wrote: Hi, my programming model requires me to

Re: Folding an RDD in order

2014-10-16 Thread Cheng Lian
, 2014 at 11:46 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: Hi Michael, I'm not sure I fully understood your question, but I think RDD.aggregate can be helpful in your case. You can see it as a more general version of fold. Cheng On 10/16/14

Re: Spark/HIVE Insert Into values Error

2014-10-18 Thread Cheng Lian
Currently Spark SQL uses Hive 0.12.0, which doesn't support the INSERT INTO ... VALUES ... syntax. On 10/18/14 1:33 AM, arthur.hk.c...@gmail.com wrote: Hi, When trying to insert records into HIVE, I got error, My Spark is 1.1.0 and Hive 0.12.0 Any idea what would be wrong? Regards Arthur

Re: Unable to connect to Spark thrift JDBC server with pluggable authentication

2014-10-18 Thread Cheng Lian
Hi Jenny, how did you configure the classpath and start the Thrift server (YARN client/YARN cluster/standalone/...)? On 10/18/14 4:14 AM, Jenny Zhao wrote: Hi, if Spark thrift JDBC server is started with non-secure mode, it is working fine. with a secured mode in case of pluggable

Re: a hivectx insertinto issue-can inertinto function be applied to a hive table

2014-10-18 Thread Cheng Lian
In your JSON snippet, 111 and 222 are quoted, i.e., they are strings. Thus they are automatically inferred as string rather than tinyint by |jsonRDD|. Try this in Spark shell: |val sparkContext = sc import org.apache.spark.sql._ import sparkContext._ val sqlContext = new
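A self-contained sketch of the inference behavior (the sample record is assumed):

    val json = sc.parallelize("""{"a": "111", "b": 222}""" :: Nil)
    sqlContext.jsonRDD(json).printSchema()
    // a is inferred as string (quoted); b as a numeric type (unquoted)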

Re: Getting Spark SQL talking to Sql Server

2014-10-21 Thread Cheng Lian
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL Server. Currently Spark SQL can't run queries against SQL Server. The foreign data source API planned for Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote: Hi, Is there a simple way to run spark
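A JdbcRDD sketch (the connection string, table, and bounds are assumptions; the query must carry two |?| placeholders for the partition bounds):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD
    val people = new JdbcRDD(sc,
      () => DriverManager.getConnection("jdbc:sqlserver://host;databaseName=db;user=u;password=p"),
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
      lowerBound = 1, upperBound = 100000, numPartitions = 10,
      mapRow = rs => (rs.getLong(1), rs.getString(2)))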

Re: Spark - HiveContext - Unstructured Json

2014-10-21 Thread Cheng Lian
You can resort to |SQLContext.jsonFile(path: String, samplingRate: Double)| and set |samplingRate| to 1.0, so that all the columns can be inferred. You can also use |SQLContext.applySchema| to specify your own schema (which is a |StructType|). On 10/22/14 5:56 AM, Harivardan Jayaraman
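A sketch of the |applySchema| route (the field names and row parser are hypothetical):

    import org.apache.spark.sql._
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("attrs", MapType(StringType, StringType), nullable = true)))
    val rowRDD = sc.textFile("data.json").map(parseToRow)   // parseToRow: String => Row, assumed
    val schemaRDD = sqlContext.applySchema(rowRDD, schema)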

Re: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
Would you mind sharing the DDLs of all involved tables? What format are these tables stored in? Is this issue specific to this query? I guess Hive, Shark and Spark SQL all read from the same HDFS dataset? On 10/27/14 3:45 PM, lyf刘钰帆 wrote: Hi, I am using SparkSQL 1.1.0 with cdh 4.6.0

Re: 答复: SparkSQL display wrong result

2014-10-27 Thread Cheng Lian
LOCAL INPATH '/home/data/testFolder/qrytblB.txt' INTO TABLE tblB; *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* 2014-10-27 16:48 *To:* lyf刘钰帆; user@spark.apache.org *Subject:* Re: SparkSQL display wrong result Would you mind sharing the DDLs of all involved tables? What format

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Cheng Lian
I have never tried this yet, but maybe you can use an in-memory Derby database as the metastore: https://db.apache.org/derby/docs/10.7/devguide/cdevdvlpinmemdb.html I'll investigate this when I'm free; I guess we can use this for Spark SQL Hive support testing. On 10/27/14 4:38 PM, Jianshi Huang

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Cheng Lian
https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin#AdminManualMetastoreAdmin-EmbeddedMetastore Cheers On Oct 27, 2014, at 6:20 AM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: I have never tried this yet, but maybe you can use an in-memory Derby

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Cheng Lian
Which version of Spark and Hadoop are you using? Could you please provide the full stack trace of the exception? On Tue, Oct 28, 2014 at 5:48 AM, Du Li l...@yahoo-inc.com.invalid wrote: Hi, I was trying to set up Spark SQL on a private cluster. I configured a hive-site.xml under

Re: Spark 1.1.0 on Hive 0.13.1

2014-10-29 Thread Cheng Lian
Spark 1.1.0 doesn't support Hive 0.13.1. We plan to support it in 1.2.0, and related PRs are already merged or being merged to the master branch. On 10/29/14 7:43 PM, arthur.hk.c...@gmail.com wrote: Hi, My Hive is 0.13.1, how to make Spark 1.1.0 run on Hive 0.13? Please advise. Or, any news

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Hi Jean, Thanks for reporting this. This is indeed a bug: for some column types (Binary, Array, Map and Struct, and unfortunately, for some reason, Boolean), a NoopColumnStats is used to collect column statistics, which causes this issue. Filed SPARK-4182 to track this issue, will fix this ASAP.

Re: SparkSQL + Hive Cached Table Exception

2014-11-01 Thread Cheng Lian
Just submitted a PR to fix this https://github.com/apache/spark/pull/3059 On Sun, Nov 2, 2014 at 12:36 AM, Jean-Pascal Billaud j...@tellapart.com wrote: Great! Thanks. Sent from my iPad On Nov 1, 2014, at 8:35 AM, Cheng Lian lian.cs@gmail.com wrote: Hi Jean, Thanks for reporting

Re: To generate IndexedRowMatrix from an RowMatrix

2014-11-10 Thread Cheng Lian
You may use |RDD.zipWithIndex|. On 11/10/14 10:03 PM, Lijun Wang wrote: Hi, I need a matrix with each row having an index, e.g., index = 0 for the first row, index = 1 for the second row. Could someone tell me how to generate such an IndexedRowMatrix from a RowMatrix? Besides, is there anyone
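A sketch of the |zipWithIndex| approach (|rowMatrix| is assumed to be an existing RowMatrix):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
    val indexed = new IndexedRowMatrix(
      rowMatrix.rows.zipWithIndex().map { case (row, idx) => IndexedRow(idx, row) })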

Re: Understanding spark operation pipeline and block storage

2014-11-10 Thread Cheng Lian
On 11/6/14 1:39 AM, Hao Ren wrote: Hi, I would like to understand the pipeline of spark's operation(transformation and action) and some details on block storage. Let's consider the following code: val rdd1 = SparkContext.textFile(hdfs://...) rdd1.map(func1).map(func2).count For example, we

Re: thrift jdbc server probably running queries as hive query

2014-11-10 Thread Cheng Lian
Hey Sadhan, I really don't think this is a Spark log... Unlike Shark, Spark SQL doesn't even provide a Hive mode to let you execute queries against Hive. Would you please check whether there is an existing HiveServer2 running there? Spark SQL HiveThriftServer2 is just a Spark port of

Re: Cache sparkSql data without uncompressing it in memory

2014-11-12 Thread Cheng Lian
Currently there’s no way to cache the compressed sequence file directly. Spark SQL uses in-memory columnar format while caching table rows, so we must read all the raw data and convert them into columnar format. However, you can enable in-memory columnar compression by setting
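A sketch of enabling that setting before caching (the table name is assumed):

    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("logs")   // rows are stored compressed in the columnar cache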

Re: Spark JDBC Thirft Server over HTTP

2014-11-13 Thread Cheng Lian
HTTP is not supported yet, and I don't think there's a JIRA ticket for it. On 11/14/14 8:21 AM, vs wrote: Does Spark JDBC thrift server allow connections over HTTP? http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server doesn't seem to indicate this

Re: Cache sparkSql data without uncompressing it in memory

2014-11-13 Thread Cheng Lian
one more question - does that mean that we still need enough memory in the cluster to uncompress the data before it can be compressed again or does that just read the raw data as is? On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Cheng Lian
If you’re looking for executor-side setup and cleanup functions, there ain’t any yet, but you can achieve the same semantics via |RDD.mapPartitions|. Please check the “setup() and cleanup()” section of this blog from Cloudera for details:
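A sketch of that pattern (|openConnection| and |process| are hypothetical; |toList| keeps the example simple at the cost of materializing the partition in memory):

    rdd.mapPartitions { iter =>
      val conn = openConnection()                        // setup, once per partition
      try iter.map(rec => process(conn, rec)).toList.iterator
      finally conn.close()                               // cleanup, after the partition is processed
    }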

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Cheng Lian
can I write it like this? rdd.mapPartitions(i => { setup(); i }).map(...).mapPartitions(i => { cleanup(); i }) So I don't need to mess up the logic and still can use map, filter and other transformations for RDD. Jianshi On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian lian.cs@gmail.com

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Which version are you using? You probably hit this bug https://issues.apache.org/jira/browse/SPARK-3421 if some field name in the JSON contains characters other than [a-zA-Z0-9_]. This has been fixed in https://github.com/apache/spark/pull/2563 On 11/14/14 6:35 PM, vdiwakar.malladi wrote:

Re: saveAsParquetFile throwing exception

2014-11-14 Thread Cheng Lian
Hm, I'm not sure whether this is the official way to upgrade CDH Spark; maybe you can check out https://github.com/cloudera/spark, apply the required patches, and then compile your own version. On 11/14/14 8:46 PM, vdiwakar.malladi wrote: Thanks for your response. I'm using Spark 1.1.0 Currently

Re: Cache sparkSql data without uncompressing it in memory

2014-11-14 Thread Cheng Lian
13, 2014 at 10:50 PM, Cheng Lian lian.cs@gmail.com mailto:lian.cs@gmail.com wrote: No, the columnar buffer is built in a small batching manner, the batch size is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize| property. The default value for this in master

Re: SparkSQL exception on cached parquet table

2014-11-15 Thread Cheng Lian
Hi Sadhan, Could you please provide the stack trace of the |ArrayIndexOutOfBoundsException| (if any)? The reason why the first query succeeds is that Spark SQL doesn’t bother reading all data from the table to give |COUNT(*)|. In the second case, however, the whole table is asked to be

Re: SparkSQL exception on cached parquet table

2014-11-16 Thread Cheng Lian
(Forgot to cc user mail list) On 11/16/14 4:59 PM, Cheng Lian wrote: Hey Sadhan, Thanks for the additional information, this is helpful. Seems that some Parquet internal contract was broken, but I'm not sure whether it's caused by Spark SQL or Parquet, or even maybe the Parquet file itself

Re: Load json format dataset as RDD

2014-11-16 Thread Cheng Lian
|SQLContext.jsonFile| assumes one JSON record per line. Although I haven’t tried yet, it seems that this |JsonInputFormat| [1] can be helpful. You may read your original data set with |SparkContext.hadoopFile| and |JsonInputFormat|, then transform the resulting |RDD[String]| into a |JsonRDD|

Re: Building Spark with hive does not work

2014-11-17 Thread Cheng Lian
Hey Hao, Which commit are you using? Just tried 64c6b9b with exactly the same command line flags, couldn't reproduce this issue. Cheng On 11/17/14 10:02 PM, Hao Ren wrote: Hi, I am building spark on the most recent master branch. I checked this page:

Re: Building Spark with hive does not work

2014-11-18 Thread Cheng Lian
Ah... Thanks Ted! And Hao, sorry for being the original trouble maker :) On 11/18/14 1:50 AM, Ted Yu wrote: Looks like this was where you got that commandline: http://search-hadoop.com/m/JW1q5RlPrl Cheers On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com mailto:inv...@gmail.com

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread Cheng Lian
A not-so-efficient way can be this: |val r0: RDD[OriginalRow] = ... val r1 = r0.keyBy(row => extractKeyFromOriginalRow(row)) val r2 = r1.keys.distinct().zipWithIndex() val r3 = r2.join(r1).values | On 11/18/14 8:54 PM, shahab wrote: Hi, In my spark application, I am loading some

Re: Why is ALS class serializable ?

2014-11-19 Thread Cheng Lian
When a field of an object is enclosed in a closure, the object itself is also enclosed automatically, thus the object needs to be serializable. On 11/19/14 6:39 PM, Hao Ren wrote: Hi, When reading through the ALS code, I find that: class ALS private ( private var numUserBlocks: Int,
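A sketch of the standard workaround: copy the field into a local val so the closure captures only the value instead of the enclosing object (the class and field names are illustrative):

    import org.apache.spark.rdd.RDD
    class Worker {
      val factor = 10
      def run(rdd: RDD[Int]): RDD[Int] = {
        val f = factor       // rdd.map(_ * factor) would drag `this` into the closure
        rdd.map(_ * f)
      }
    }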

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-21 Thread Cheng Lian
Hi Judy, could you please provide the commit SHA1 of the version you're using? Thanks! On 11/22/14 11:05 AM, Judy Nash wrote: Hi, Thrift server is failing to start for me on latest spark 1.2 branch. I got the error below when I start thrift server. Exception in thread main

Re: spark-sql broken

2014-11-22 Thread Cheng Lian
You're probably hitting this issue https://issues.apache.org/jira/browse/SPARK-4532 Patrick made a fix for this https://github.com/apache/spark/pull/3398 On 11/22/14 10:39 AM, tridib wrote: After taking today's build from master branch I started getting this error when run spark-sql: Class

Re: Debug Sql execution

2014-11-22 Thread Cheng Lian
You may try |EXPLAIN EXTENDED sql| to see the logical plan, analyzed logical plan, optimized logical plan and physical plan. Also |SchemaRDD.toDebugString| shows storage-related debugging information. On 11/21/14 4:11 AM, Gordon Benjamin wrote: hey, Can anyone tell me how to debug a sql
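A usage sketch (the query and table name are assumed):

    sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM logs WHERE level = 'ERROR'")
      .collect().foreach(println)   // prints the parsed, analyzed, optimized and physical plans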

Re: querying data from Cassandra through the Spark SQL Thrift JDBC server

2014-11-22 Thread Cheng Lian
This thread might be helpful http://apache-spark-user-list.1001560.n3.nabble.com/tableau-spark-sql-cassandra-tp19282.html On 11/20/14 4:11 AM, Mohammed Guller wrote: Hi – I was curious if anyone is using the Spark SQL Thrift JDBC server with Cassandra. It would be great be if you could share

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-24 Thread Cheng Lian
SparkContext unsuccessfully. Let me know if you need anything else. *From:* Cheng Lian [mailto:lian.cs@gmail.com] *Sent:* Friday, November 21, 2014 8:02 PM *To:* Judy Nash; u...@spark.incubator.apache.org *Subject:* Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava Hi

Re: advantages of SparkSQL?

2014-11-24 Thread Cheng Lian
For the “never register a table” part, actually you /can/ use Spark SQL without registering a table via its DSL. Say you’re going to extract an |Int| field named |key| from the table and double it: |import org.apache.spark.sql.catalyst.dsl._ val data = sqc.parquetFile(path) val double =

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-25 Thread Cheng Lian
Spark SQL supports complex types, but casting doesn't work for complex types right now. On 11/25/14 4:04 PM, critikaled wrote: https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala Doesn't

Re: Spark SQL Join returns less rows that expected

2014-11-25 Thread Cheng Lian
Which version are you using? Or if you are using the most recent master or branch-1.2, which commit are you using? On 11/25/14 4:08 PM, david wrote: Hi, I have 2 files which come from csv import of 2 Oracle tables. F1 has 46730613 rows F2 has 3386740 rows I build 2 tables with

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-26 Thread Cheng Lian
:37 GMT+09:00 Cheng Lian lian.cs@gmail.com: Spark SQL supports complex types, but casting doesn't work for complex types right now. On 11/25/14 4:04 PM, critikaled wrote: https://github.com/apache/spark/blob/84d79ee9ec47465269f7b0a7971176da93c96f3f/sql/catalyst/src/main/scala/org/apache
