Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
version matters here, but I did observe cases where Spark behaves differently because of semantic differences of the same API in different Hadoop versions. Cheng On 3/27/15 11:33 AM, Pei-Lun Lee wrote: Hi Cheng, on my computer, execute res0.save(xxx, org.apache.spark.sql.SaveMode.Overwrite
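
A minimal sketch of the save call discussed in this thread, assuming a Spark 1.3-era shell where sc is the usual SparkContext; the paths and data are illustrative:

    import org.apache.spark.sql.{SQLContext, SaveMode}

    val sqlContext = new SQLContext(sc)
    // Load any existing parquet data, then write it back with Overwrite.
    val df = sqlContext.parquetFile("/tmp/input")
    // Per the thread, the resulting directory may lack _common_metadata
    // depending on the Hadoop version backing the write path.
    df.save("/tmp/foo", SaveMode.Overwrite)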

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-27 Thread Pei-Lun Lee
1.0.4. Would you mind opening a JIRA for this? Cheng On 3/27/15 2:40 PM, Pei-Lun Lee wrote: I'm using 1.0.4 Thanks, -- Pei-Lun On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote: Hm, which version of Hadoop are you using? Actually there should also be a _metadata

Re: SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-26 Thread Pei-Lun Lee
, and thus can be faster to read than _metadata. Cheng On 3/26/15 12:48 PM, Pei-Lun Lee wrote: Hi, When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit
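
A quick way to check whether the summary files were written, assuming a local filesystem destination (the path is illustrative); _common_metadata holds only the merged schema with no per-file row group footers, which is why it can be cheaper to read than _metadata:

    import java.io.File

    // List the output directory; with summary files enabled one expects
    // _SUCCESS, _common_metadata, _metadata and the part-* parquet files.
    new File("/tmp/foo").listFiles.map(_.getName).sorted.foreach(println)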

Re: Which OutputCommitter to use for S3?

2015-03-25 Thread Pei-Lun Lee
I updated the PR for SPARK-6352 to be more like SPARK-3595. I added a new setting, spark.sql.parquet.output.committer.class, in the Hadoop configuration to allow custom implementations of ParquetOutputCommitter. Can someone take a look at the PR? On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun Lee pl
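
A sketch of how the proposed setting would be used, assuming the SPARK-6352 patch is applied; the committer class name below is illustrative and would have to extend ParquetOutputCommitter:

    // Injected through the Hadoop configuration rather than SparkConf,
    // mirroring the spark.hadoop.* approach taken by SPARK-3595.
    sc.hadoopConfiguration.set(
      "spark.sql.parquet.output.committer.class",
      "com.example.DirectParquetOutputCommitter")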

SparkSQL overwrite parquet file does not generate _common_metadata

2015-03-25 Thread Pei-Lun Lee
Hi, When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not. Is this expected behavior? And what is the benefit of _common_metadata? Will reading perform better when it is present? Thanks, -- Pei-Lun

Re: SparkSQL 1.3.0 JDBC data source issues

2015-03-19 Thread Pei-Lun Lee
JIRA and PR for the first issue: https://issues.apache.org/jira/browse/SPARK-6408 https://github.com/apache/spark/pull/5087 On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am trying the JDBC data source in Spark SQL 1.3.0 and found some issues. First, the syntax where

SparkSQL 1.3.0 JDBC data source issues

2015-03-18 Thread Pei-Lun Lee
Hi, I am trying the JDBC data source in Spark SQL 1.3.0 and found some issues. First, the syntax where str_col='value' will give an error for both PostgreSQL and MySQL: psql create table foo(id int primary key, name text, age int); bash SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar
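
A minimal sketch of the failing predicate, assuming a Spark 1.3.0 shell (with its predefined sqlContext), the PostgreSQL driver on the classpath, and a reachable database; the URL and table contents are illustrative:

    val df = sqlContext.load("jdbc", Map(
      "url" -> "jdbc:postgresql://localhost/mydb",
      "dbtable" -> "foo"))
    df.registerTempTable("foo")
    // Per the thread, a string-literal equality filter triggers the error,
    // tracked as SPARK-6408 in the follow-up message above.
    sqlContext.sql("select * from foo where name='bob'").collect()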

Re: Which OutputCommitter to use for S3?

2015-03-16 Thread Pei-Lun Lee
that direct dependency makes this injection much more difficult for saveAsParquetFile. On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote: Thanks for the DirectOutputCommitter example. However, I found it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks

Re: Which OutputCommitter to use for S3?

2015-03-05 Thread Pei-Lun Lee
Thanks for the DirectOutputCommitter example. However, I found it only works for saveAsHadoopFile. What about saveAsParquetFile? It looks like SparkSQL is using ParquetOutputCommitter, which is a subclass of FileOutputCommitter. On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor
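
A sketch of how a direct committer plugs into the old mapred path that saveAsHadoopFile uses, assuming a custom DirectOutputCommitter class like the one shared earlier in the thread, and assuming pairRdd is an RDD[(NullWritable, Text)]; the class name and output path are illustrative:

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    // Any FileOutputCommitter subclass that writes straight to the
    // destination, skipping the rename out of _temporary, would do here.
    jobConf.set("mapred.output.committer.class", "com.example.DirectOutputCommitter")
    pairRdd.saveAsHadoopFile("s3n://bucket/out",
      classOf[NullWritable], classOf[Text],
      classOf[TextOutputFormat[NullWritable, Text]], jobConf)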

Re: Spark SQL - custom aggregation function (UDAF)

2014-10-14 Thread Pei-Lun Lee
I created https://issues.apache.org/jira/browse/SPARK-3947 On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust mich...@databricks.com wrote: It's not on the roadmap for 1.2. I'd suggest opening a JIRA. On Mon, Oct 13, 2014 at 4:28 AM, Pierre B pierre.borckm...@realimpactanalytics.com wrote:

Re: spark sql union all is slow

2014-10-14 Thread Pei-Lun Lee
Hi, You can merge them into one table by: sqlContext.table("table_1").unionAll(sqlContext.table("table_2")).unionAll(sqlContext.table("table_3")).registerTempTable("table_all") Or load them in one call by:

Re: spark sql left join gives KryoException: Buffer overflow

2014-07-21 Thread Pei-Lun Lee
Unfortunately, this is a query where we just don't have an efficient implementation yet. You might try switching the table order. Here is the JIRA for doing something more efficient: https://issues.apache.org/jira/browse/SPARK-2212 On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee pl

spark sql left join gives KryoException: Buffer overflow

2014-07-18 Thread Pei-Lun Lee
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1 Looks like Spark SQL tried to do a broadcast join, collecting one of the tables to the master, but it is too large. How do we explicitly control join behavior like this? -- Pei-Lun Lee
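
A hedged workaround sketch: in later Spark versions the broadcast decision is governed by spark.sql.autoBroadcastJoinThreshold, and setting it to -1 disables automatic broadcast (the 1.0-era knob was named differently, so treat the property name as version-dependent):

    // Disable automatic broadcast joins so the left join falls back to a
    // shuffle-based plan instead of collecting a table to the driver.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")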

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
, but there is a PR open to fix it: https://issues.apache.org/jira/browse/SPARK-2446 On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, I am using spark-sql 1.0.1 to load parquet files generated from method described in: https://gist.github.com/massie/7224868 When I

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Filed SPARK-2446 2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com: Oh, maybe not. Please file another JIRA. On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote: Hi Michael, Good to know it is being handled. I tried master branch (9fe693b5) and got

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-15 Thread Pei-Lun Lee
Sorry, should be SPARK-2489 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee pl...@appier.com: Filed SPARK-2446 2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com: Oh, maybe not. Please file another JIRA. On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote

Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-14 Thread Pei-Lun Lee
Hi, I am using spark-sql 1.0.1 to load parquet files generated from the method described in: https://gist.github.com/massie/7224868 When I try to submit a select query with columns of fixed-length byte array type, the following error pops up: 14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed

Re: LiveListenerBus throws exception and weird web UI bug

2014-06-26 Thread Pei-Lun Lee
submitted. Don’t know if that can help. On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee pl...@appier.com wrote: Hi, We have a long-running Spark application on a Spark 1.0 standalone server, and after it runs for several hours the following exception shows up: 14/06/25 23:13:08 ERROR

Re: Spark SQL incorrect result on GROUP BY query

2014-06-12 Thread Pei-Lun Lee
Pei-Lun Lee pl...@appier.com wrote: Hi, I am using spark 1.0.0 and found that in Spark SQL some queries using GROUP BY give weird results. To reproduce, type the following commands in spark-shell connecting to a standalone server: case class Foo(k: String, v: Int) val sqlContext = new

Spark SQL incorrect result on GROUP BY query

2014-06-11 Thread Pei-Lun Lee
], [c,270], [4,56], [1,1]) and if I run the same query again, the new result will be correct: sql(select k,count(*) from foo group by k).collect res2: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300]) Should I file a bug? -- Pei-Lun Lee
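
A reconstruction of the reproduction from this thread, assuming a Spark 1.0.0 shell attached to a standalone master (the 1.0-era API registers tables with registerAsTable; the row counts follow the correct result quoted above):

    case class Foo(k: String, v: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._

    // 100 "a" rows, 200 "b" rows, 300 "c" rows.
    sc.parallelize((1 to 100).map(Foo("a", _)) ++
                   (1 to 200).map(Foo("b", _)) ++
                   (1 to 300).map(Foo("c", _))).registerAsTable("foo")

    // The first run reportedly returned garbage; a second run returned
    // Array([b,200], [a,100], [c,300]).
    sql("select k, count(*) from foo group by k").collect()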