res1: Array[org.apache.spark.sql.Row] = Array([b,180], [3,18], [a,75],
[c,270], [4,56], [1,1])
and if I run the same query again, the new result will be correct:
sql("select k,count(*) from foo group by k").collect
res2: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])
Should I file a bug?
--
Pei-Lun Lee
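For context, a minimal sketch of how a table like "foo" might have been set up in the Spark 1.0 shell; the schema and data here are illustrative assumptions, not taken from the thread:

case class Foo(k: String, v: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings sql(...) and the RDD-to-SchemaRDD conversion into scope
// Register a plain RDD of case classes as a table (Spark 1.0 API).
sc.parallelize(Seq(Foo("a", 100), Foo("b", 200), Foo("c", 300))).registerAsTable("foo")
sql("select k, count(*) from foo group by k").collect()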
I reran with master and it looks like it is fixed.
2014-06-12 1:26 GMT+08:00 Michael Armbrust :
> I'd try rerunning with master. It is likely you are running into
> SPARK-1994 <https://issues.apache.org/jira/browse/SPARK-1994>.
>
> Michael
>
>
> On Wed, Jun
ound because those events have never been
> submitted.
>
> Don’t know if that can help.
>
> On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee wrote:
>
> >
> > Hi,
> >
> > We have a long-running Spark application running on a Spark 1.0 standalone
> server, and after it runs se
Thanks for the DirectOutputCommitter example.
However, I found it only works for saveAsHadoopFile. What about
saveAsParquetFile?
It looks like Spark SQL is using ParquetOutputCommitter, which is a subclass
of FileOutputCommitter.
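For reference, here is a sketch of how a direct committer can be wired into saveAsHadoopFile, which goes through the old mapred API; the DirectOutputCommitter below is a minimal stand-in for the committer discussed in this thread, and the output path is a placeholder:

import org.apache.hadoop.mapred.{JobConf, JobContext, OutputCommitter, TaskAttemptContext, TextOutputFormat}

// A no-op committer: tasks write straight to the destination and there is
// nothing to move or commit afterwards.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = ()
  override def setupTask(taskContext: TaskAttemptContext): Unit = ()
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = ()
  override def abortTask(taskContext: TaskAttemptContext): Unit = ()
}

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputCommitter(classOf[DirectOutputCommitter])
sc.parallelize(Seq(("a", "1"), ("b", "2")))
  .saveAsHadoopFile("s3n://bucket/out", classOf[String], classOf[String],
    classOf[TextOutputFormat[String, String]], jobConf)

The committer set on the JobConf is only honored by the old-API save paths, which matches the observation above about saveAsParquetFile.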
On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor
wrote:
> FYI. We're cur
ect dependency makes this injection much more
> difficult for saveAsParquetFile.
>
> On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee wrote:
>
>> Thanks for the DirectOutputCommitter example.
>> However I found it only works for saveAsHadoopFile. What about
>> saveAsParquetFile?
Hi,
I am trying the jdbc data source in Spark SQL 1.3.0 and found some issues.
First, the syntax "where str_col='value'" gives an error for both
PostgreSQL and MySQL:
psql> create table foo(id int primary key,name text,age int);
bash> SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar spark/bin/spark-s
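The shell command above is cut off in the archive; on the Spark side, a sketch of the kind of call that triggers the error in 1.3.0 (the connection URL is a placeholder):

// Load the "foo" table through the jdbc data source (Spark 1.3.0 API).
val df = sqlContext.jdbc("jdbc:postgresql://localhost/testdb", "foo")
df.registerTempTable("foo")
// The pushed-down string-equality predicate is where the error shows up.
sqlContext.sql("select * from foo where name='value'").collect()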
JIRA and PR for first issue:
https://issues.apache.org/jira/browse/SPARK-6408
https://github.com/apache/spark/pull/5087
On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee wrote:
> Hi,
>
> I am trying jdbc data source in spark sql 1.3.0 and found some issues.
>
> First, the syntax
Hi,
When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Will reading perform better
when it is present?
Thanks,
--
Pei-Lun
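A minimal sketch of the save call in question, assuming the Spark 1.3 DataFrame API; the path and table name are placeholders:

import org.apache.spark.sql.SaveMode
val df = sqlContext.table("some_table")  // any DataFrame will do here
// Write as parquet, replacing the destination if it already exists.
df.save("/tmp/out.parquet", "parquet", SaveMode.Overwrite)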
I updated the PR for SPARK-6352 to be more like SPARK-3595.
I added a new setting "spark.sql.parquet.output.committer.class" in the
Hadoop configuration to allow custom implementations of ParquetOutputCommitter.
Can someone take a look at the PR?
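With that setting, picking a committer would look roughly like this; com.example.DirectParquetOutputCommitter stands for whatever ParquetOutputCommitter subclass you supply:

// Point Spark SQL at a custom ParquetOutputCommitter subclass through the
// Hadoop configuration, per the setting added in the PR.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "com.example.DirectParquetOutputCommitter")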
On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun
n_metadata file is typically much smaller than _metadata,
> because it doesn’t contain row group information, and thus can be faster to
> read than _metadata.
>
> Cheng
>
> On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
>
> Hi,
>
> When I save parquet file with SaveMode.Overwrite,
ersion matters here, but I did observe
> cases where Spark behaves differently because of semantic differences of
> the same API in different Hadoop versions.
>
> Cheng
>
> On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
>
> Hi Cheng,
>
> on my computer, execute res0.save(
ld you
> mind opening a JIRA for this?
>
> Cheng
>
> On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
>
> I'm using 1.0.4
>
> Thanks,
> --
> Pei-Lun
>
> On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian wrote:
>
>> Hm, which version of Hadoop are you using? Actu
Hi,
I am using spark-sql 1.0.1 to load parquet files generated by the method
described in:
https://gist.github.com/massie/7224868
When I try to submit a select query with columns of type fixed-length byte
array, the following error pops up:
14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed
yet, but there is a PR open to fix it:
> https://issues.apache.org/jira/browse/SPARK-2446
>
>
> On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee wrote:
>
>> Hi,
>>
>> I am using spark-sql 1.0.1 to load parquet files generated from method
>> described in:
>&g
Filed SPARK-2446
2014-07-15 16:17 GMT+08:00 Michael Armbrust :
> Oh, maybe not. Please file another JIRA.
>
>
> On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee wrote:
>
>> Hi Michael,
>>
>> Good to know it is being handled. I tried master branch (9fe693b5) and
Sorry, should be SPARK-2489
2014-07-15 19:22 GMT+08:00 Pei-Lun Lee :
> Filed SPARK-2446
>
>
>
> 2014-07-15 16:17 GMT+08:00 Michael Armbrust :
>
> Oh, maybe not. Please file another JIRA.
>>
>>
>> On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee wrote:
>
:
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 1
Looks like Spark SQL tried to do a broadcast join, collecting one of the
tables to the master, but it is too large.
How do we explicitly control join behavior like this?
--
Pei-Lun Lee
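One knob for this, assuming a Spark SQL version that exposes it: spark.sql.autoBroadcastJoinThreshold is the table size below which one side of a join is broadcast, and -1 disables broadcasting entirely:

// Disable automatic broadcast joins so neither table is collected to the
// driver; Spark SQL then falls back to a shuffle-based join.
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")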
ely, this is a query where we just don't have an efficient
> implementation yet. You might try switching the table order.
>
> Here is the JIRA for doing something more efficient:
> https://issues.apache.org/jira/browse/SPARK-2212
>
>
> On Fri, Jul 18, 2014 at 7:05 AM
Hi,
We have a PR to support fixed-length byte arrays in parquet files.
https://github.com/apache/spark/pull/1737
Can someone help verify it?
Thanks.
2014-07-15 19:23 GMT+08:00 Pei-Lun Lee :
> Sorry, should be SPARK-2489
>
>
> 2014-07-15 19:22 GMT+08:00 Pei-Lun Lee :
>
>
Hi,
Does Spark SQL currently support user-defined aggregation functions in
Scala, the way UDFs are defined with sqlContext.registerFunction? (not
Hive UDAF)
Thanks,
--
Pei-Lun
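For contrast, this is the scalar-UDF registration the question refers to; the "people" table is a placeholder, and there is no analogous hook for aggregations:

// A scalar UDF registered via registerFunction (pre-1.3 API); an aggregate
// such as a custom sum cannot be expressed this way.
sqlContext.registerFunction("strLen", (s: String) => s.length)
sqlContext.sql("select strLen(name) from people").collect()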
I created https://issues.apache.org/jira/browse/SPARK-3947
On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust
wrote:
> It's not on the roadmap for 1.2. I'd suggest opening a JIRA.
>
> On Mon, Oct 13, 2014 at 4:28 AM, Pierre B <
> pierre.borckm...@realimpactanalytics.com> wrote:
>
>> Is it planned
Hi,
You can merge them into one table by:
sqlContext.table("table_1")
  .unionAll(sqlContext.table("table_2"))
  .unionAll(sqlContext.table("table_3"))
  .registerTempTable("table_all")
Or load them in one call by:
sqlContext.parquetFile("table_1.parquet,table_2.parquet,table_3.parquet")