Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Pei-Lun Lee
The first run gives an incorrect result: Array[org.apache.spark.sql.Row] = Array([b,180], [3,18], [a,75], [c,270], [4,56], [1,1]), and if I run the same query again, the new result is correct: sql("select k,count(*) from foo group by k").collect res2: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300]). Should I file a bug? -- Pei-Lun Lee
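A minimal sketch of the kind of setup the query assumes (Spark 1.0 API); the real source of table "foo" is not shown in the thread, so the data and registration below are placeholders chosen to match the correct counts quoted above:

    case class Foo(k: String)
    import sqlContext.createSchemaRDD          // implicit RDD -> SchemaRDD conversion in Spark 1.0
    sc.parallelize(Seq.fill(100)(Foo("a")) ++ Seq.fill(200)(Foo("b")) ++ Seq.fill(300)(Foo("c")))
      .registerAsTable("foo")                  // registerAsTable in 1.0.x; later renamed registerTempTable
    sqlContext.sql("select k, count(*) from foo group by k").collect()
    // Expected on every run: Array([a,100], [b,200], [c,300])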

Re: Spark SQL incorrect result on GROUP BY query

2014-06-12 Thread Pei-Lun Lee
I reran with master and it looks like it is fixed. 2014-06-12 1:26 GMT+08:00 Michael Armbrust : > I'd try rerunning with master. It is likely you are running into > SPARK-1994 <https://issues.apache.org/jira/browse/SPARK-1994>. > > Michael > > > On Wed, Jun

Re: LiveListenerBus throws exception and weird web UI bug

2014-06-26 Thread Pei-Lun Lee
found because those events were never > submitted. > > Don’t know if that can help. > > On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee wrote: > > > Hi, > > > > We have a long-running Spark application running on a Spark 1.0 standalone > server and after it runs se

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Pei-Lun Lee
Thanks for the DirectOutputCommitter example. However, I found it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass of FileOutputCommitter. On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor wrote: > FYI. We're cur
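For reference, a sketch of how a custom committer is wired into saveAsHadoopFile through the old mapred API; the committer class name and output path are placeholders, and the DirectOutputCommitter implementation is whatever happens to be on the classpath (it is not part of Spark):

    import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    // JobConf.getOutputCommitter reads this key (default: FileOutputCommitter),
    // so the old mapred write path used by saveAsHadoopFile picks it up.
    jobConf.set("mapred.output.committer.class", "com.example.DirectOutputCommitter")

    val pairs = sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
    pairs.saveAsHadoopFile(
      "s3n://my-bucket/output",                // placeholder S3 path
      classOf[String], classOf[String],
      classOf[TextOutputFormat[String, String]],
      jobConf)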

Re: Which OutputCommitter to use for S3?

2015-03-16 Thread Pei-Lun Lee
direct dependency makes this injection much more > difficult for saveAsParquetFile. > > On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee wrote: > >> Thanks for the DirectOutputCommitter example. >> However I found it only works for saveAsHadoopFile. What about >> saveAsParquetFile?

SparkSQL 1.3.0 JDBC data source issues

2015-03-18 Thread Pei-Lun Lee
Hi, I am trying the jdbc data source in Spark SQL 1.3.0 and found some issues. First, the syntax "where str_col='value'" gives an error for both postgresql and mysql: psql> create table foo(id int primary key,name text,age int); bash> SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar spark/bin/spark-shell
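A sketch of the failing case using the 1.3.0 load() API; the connection URL, table contents and filter value are placeholders, and the resulting error message is omitted (the issue was later filed as SPARK-6408, linked in the follow-up below):

    // Placeholder PostgreSQL URL; the same applies to MySQL per the report.
    val jdbcDF = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:postgresql://localhost/testdb?user=tester",
      "dbtable" -> "foo"))
    jdbcDF.registerTempTable("foo")

    // Filtering on a string column is what triggered the error in 1.3.0.
    sqlContext.sql("select * from foo where name='abc'").collect()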

Re: SparkSQL 1.3.0 JDBC data source issues

2015-03-19 Thread Pei-Lun Lee
JIRA and PR for first issue: https://issues.apache.org/jira/browse/SPARK-6408 https://github.com/apache/spark/pull/5087 On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee wrote: > Hi, > > I am trying jdbc data source in spark sql 1.3.0 and found some issues. > > First, the syntax

SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi, When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit of _common_metadata? Does reading perform better when it is present? Thanks, -- Pei-Lun
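A minimal sketch of the save being asked about, assuming the Spark 1.3 DataFrame API; the data and output path are placeholders:

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    df.save("/tmp/foo.parquet", SaveMode.Overwrite)   // parquet is the default data source in 1.3
    // Afterwards, check the output directory for _metadata and _common_metadata.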

Re: Which OutputCommitter to use for S3?

2015-03-25 Thread Pei-Lun Lee
I updated the PR for SPARK-6352 to be more like SPARK-3595. I added a new setting "spark.sql.parquet.output.committer.class" in the hadoop configuration to allow a custom implementation of ParquetOutputCommitter. Can someone take a look at the PR? On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun
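A sketch of how the proposed setting would be used, assuming the PR is merged; the committer class name is a placeholder for a custom ParquetOutputCommitter subclass available on the classpath:

    // Hypothetical custom subclass of ParquetOutputCommitter.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "com.example.DirectParquetOutputCommitter")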

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
_common_metadata file is typically much smaller than _metadata, > because it doesn’t contain row group information, and thus can be faster to > read than _metadata. > > Cheng > > On 3/26/15 12:48 PM, Pei-Lun Lee wrote: > > Hi, > > When I save a parquet file with SaveMode.Overwrite,

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
version matters here, but I did observe > cases where Spark behaves differently because of semantic differences of > the same API in different Hadoop versions. > > Cheng > > On 3/27/15 11:33 AM, Pei-Lun Lee wrote: > > Hi Cheng, > > on my computer, execute res0.save(

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
Would you > mind opening a JIRA for this? > > Cheng > > On 3/27/15 2:40 PM, Pei-Lun Lee wrote: > > I'm using 1.0.4 > > Thanks, > -- > Pei-Lun > > On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian wrote: > >> Hm, which version of Hadoop are you using? Actu

Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-14 Thread Pei-Lun Lee
Hi, I am using spark-sql 1.0.1 to load parquet files generated from the method described in: https://gist.github.com/massie/7224868 When I try to submit a select query with columns of type fixed-length byte array, the following error pops up: 14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed
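A sketch of the failing read, assuming the parquet file was written with a fixed-length byte array (Avro "fixed") column; the path and column name are placeholders:

    val records = sqlContext.parquetFile("/path/to/file.parquet")
    records.registerAsTable("records")         // registerAsTable in 1.0.x; later renamed registerTempTable
    sqlContext.sql("select fixed_col from records").collect()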

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
yet, but there is a PR open to fix it: > https://issues.apache.org/jira/browse/SPARK-2446 > > > On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee wrote: > >> Hi, >> >> I am using spark-sql 1.0.1 to load parquet files generated from the method >> described in: >

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Filed SPARK-2446 2014-07-15 16:17 GMT+08:00 Michael Armbrust : > Oh, maybe not. Please file another JIRA. > > > On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee wrote: > >> Hi Michael, >> >> Good to know it is being handled. I tried master branch (9fe693b5) and

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Sorry, should be SPARK-2489 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee : > Filed SPARK-2446 > > > > 2014-07-15 16:17 GMT+08:00 Michael Armbrust : > > Oh, maybe not. Please file another JIRA. >> >> >> On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee wrote: >

spark sql left join gives KryoException: Buffer overflow

2014-07-18 Thread Pei-Lun Lee
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1 Looks like Spark SQL tried to do a broadcast join, collecting one of the tables to the master, but it is too large. How do we explicitly control join behavior like this? -- Pei-Lun Lee
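If the goal is simply to stop Spark SQL from broadcasting either side, later releases expose a threshold for this; a sketch assuming a version where the setting exists (it may not be available in the 1.0.x release this thread is about, and per the follow-up below this particular query had no efficient non-broadcast implementation at the time, tracked as SPARK-2212):

    // -1 disables automatic broadcast (map-side) joins entirely, so neither
    // table is collected to the driver for broadcasting.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
    // equivalently, from SQL:
    sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")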

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-21 Thread Pei-Lun Lee
Unfortunately, this is a query where we just don't have an efficient > implementation yet. You might try switching the table order. > > Here is the JIRA for doing something more efficient: > https://issues.apache.org/jira/browse/SPARK-2212 > > > On Fri, Jul 18, 2014 at 7:05 AM

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-08-03 Thread Pei-Lun Lee
Hi, We have a PR to support fixed-length byte arrays in parquet files: https://github.com/apache/spark/pull/1737 Can someone help verify it? Thanks. 2014-07-15 19:23 GMT+08:00 Pei-Lun Lee : > Sorry, should be SPARK-2489 > > > 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee : > >

Spark SQL - custom aggregation function (UDAF)

2014-10-06 Thread Pei-Lun Lee
Hi, Does Spark SQL currently support user-defined custom aggregation functions (UDAF) in Scala, the way UDFs are defined with sqlContext.registerFunction? (not Hive UDAF) Thanks, -- Pei-Lun
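For contrast, the scalar-UDF registration the question refers to; a sketch using the Spark 1.x API with placeholder table and column names. The question is whether an aggregate function can be registered in a similar way:

    // Scalar UDF: registered once, then usable from SQL.
    sqlContext.registerFunction("strLen", (s: String) => s.length)
    sqlContext.sql("select strLen(name) from people").collect()
    // There is no analogous registerFunction for aggregates here, which is
    // what the question (and later SPARK-3947) is about.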

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-14 Thread Pei-Lun Lee
I created https://issues.apache.org/jira/browse/SPARK-3947 On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust wrote: > It's not on the roadmap for 1.2. I'd suggest opening a JIRA. > > On Mon, Oct 13, 2014 at 4:28 AM, Pierre B < > pierre.borckm...@realimpactanalytics.com> wrote: > >> Is it planned

Re: spark sql union all is slow

2014-10-14 Thread Pei-Lun Lee
Hi, You can merge them into one table by: sqlContext.table("table_1").unionAll(sqlContext.table("table_2")).unionAll(sqlContext.table("table_3")).registerTempTable("table_all") Or load them in one call by: sqlContext.parquetFile("table_1.parquet,table_2.parquet,table_3.parquet")