], [c,270], [4,56], [1,1])
and if I run the same query again, the new result will be correct:
sql("select k,count(*) from foo group by k").collect
res2: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])
Should I file a bug?
--
Pei-Lun Lee
Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am using Spark 1.0.0 and found that some Spark SQL queries using GROUP BY
give weird results.
To reproduce, type the following commands in a spark-shell connected to a
standalone server:
case class Foo(k: String, v: Int)
val sqlContext = new
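(The snippet is cut off here. A minimal sketch of a complete repro, assuming the Spark 1.0 shell API; the row counts are hypothetical, chosen only to match the correct result quoted above:)

// Hypothetical completion of the truncated repro (Spark 1.0 shell API assumed).
case class Foo(k: String, v: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings in createSchemaRDD for the implicit conversion

// Row counts are assumptions matching the correct result [a,100], [b,200], [c,300].
val data = sc.parallelize(
  (1 to 100).map(i => Foo("a", i)) ++
  (1 to 200).map(i => Foo("b", i)) ++
  (1 to 300).map(i => Foo("c", i)))
data.registerAsTable("foo")  // Spark 1.0 name; renamed to registerTempTable in 1.1

sql("select k,count(*) from foo group by k").collect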
submitted.
I don't know if that helps.
On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
We have a long-running Spark application on a Spark 1.0 standalone
server, and after it runs for several hours the following exception shows up:
14/06/25 23:13:08 ERROR
Hi,
I am using spark-sql 1.0.1 to load parquet files generated from the method
described in:
https://gist.github.com/massie/7224868
When I try to submit a select query with columns of type fixed-length byte
array, the following error pops up:
14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed
, but there is a PR open to fix it:
https://issues.apache.org/jira/browse/SPARK-2446
On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote:
Filed SPARK-2446
2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com:
Oh, maybe not. Please file another JIRA.
On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi Michael,
Good to know it is being handled. I tried the master branch (9fe693b5) and got
Sorry, should be SPARK-2489
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 1
It looks like Spark SQL tried to do a broadcast join, collecting one of the
tables to the master, but the table is too large.
How do we explicitly control join behavior in cases like this?
--
Pei-Lun Lee
Unfortunately, this is a query for which we just don't have an efficient
implementation yet. You might try switching the table order.
Here is the JIRA for doing something more efficient:
https://issues.apache.org/jira/browse/SPARK-2212
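(Michael's table-order suggestion can be sketched like this; the table and column names are hypothetical:)

// Sketch of the suggested workaround (hypothetical names). Assuming the
// planner in these versions broadcasts a fixed side for outer joins,
// rewriting "small LEFT OUTER JOIN large" as the semantically equivalent
// "large RIGHT OUTER JOIN small" changes which table gets collected.
sqlContext.sql("""
  SELECT *
  FROM large_table l
  RIGHT OUTER JOIN small_table s ON l.id = s.id
""")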
On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee pl
I created https://issues.apache.org/jira/browse/SPARK-3947
On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust mich...@databricks.com
wrote:
It's not on the roadmap for 1.2. I'd suggest opening a JIRA.
On Mon, Oct 13, 2014 at 4:28 AM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
Hi,
You can merge them into one table by:
sqlContext.table("table_1")
  .unionAll(sqlContext.table("table_2"))
  .unionAll(sqlContext.table("table_3"))
  .registerTempTable("table_all")
Or load them in one call by:
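(The snippet is truncated after this line. As a side sketch — not the elided one-call method — the chained unionAll above can also be written as a fold over a list of table names:)

// Sketch: fold unionAll over a list of table names instead of chaining calls.
// Assumes all three tables share the same schema.
Seq("table_1", "table_2", "table_3")
  .map(sqlContext.table)
  .reduce(_ unionAll _)
  .registerTempTable("table_all")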
Hi,
I am trying the JDBC data source in Spark SQL 1.3.0 and found some issues.
First, the syntax where str_col='value' gives an error for both
PostgreSQL and MySQL:
psql> create table foo(id int primary key, name text, age int);
bash$ SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar
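(The shell command is truncated. A hedged sketch of how such a repro typically continues with the Spark 1.3.0 API — the connection URL and filter value below are assumptions:)

// Hypothetical continuation of the truncated repro (Spark 1.3.0 data source API).
val df = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://localhost/test",  // assumed connection URL
  "dbtable" -> "foo"))
df.registerTempTable("foo")
// Reported to fail: the string literal is not quoted when the filter is pushed down.
sqlContext.sql("select * from foo where name='bob'").collect()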
JIRA and PR for the first issue:
https://issues.apache.org/jira/browse/SPARK-6408
https://github.com/apache/spark/pull/5087
On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote:
that direct dependency makes this injection much more
difficult for saveAsParquetFile.
On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote:
version matters here, but I did observe
cases where Spark behaves differently because of semantic differences of
the same API in different Hadoop versions.
Cheng
On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
Hi Cheng,
on my computer, executing res0.save("xxx", org.apache.spark.sql.SaveMode.Overwrite)
, and thus can be faster to
read than _metadata.
Cheng
On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
Hi,
When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit
Thanks for the DirectOutputCommitter example.
However, I found it only works for saveAsHadoopFile. What about
saveAsParquetFile?
It looks like Spark SQL is using ParquetOutputCommitter, which is a subclass
of FileOutputCommitter.
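(For comparison, this is roughly how a custom committer gets injected on the saveAsHadoopFile path — a sketch; com.example.DirectOutputCommitter is a placeholder for the example class discussed here:)

import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

// Sketch: the old mapred API picks up the committer from the job configuration,
// which is why this injection works for saveAsHadoopFile.
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("mapred.output.committer.class",
  "com.example.DirectOutputCommitter")  // placeholder class name
sc.parallelize(Seq(("k1", "v1"), ("k2", "v2")))
  .saveAsHadoopFile("s3n://bucket/out", classOf[String], classOf[String],
    classOf[TextOutputFormat[String, String]], jobConf)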
On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor
I updated the PR for SPARK-6352 to be more like SPARK-3595.
I added a new setting, spark.sql.parquet.output.committer.class, to the Hadoop
configuration to allow a custom implementation of ParquetOutputCommitter.
Can someone take a look at the PR?
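(Assuming the PR lands as described, using the new setting would look something like this — a sketch; the committer class name is a placeholder:)

// Sketch of the setting added in the PR for SPARK-6352; the committer class
// name below is a placeholder for a direct-output implementation.
sc.hadoopConfiguration.set("spark.sql.parquet.output.committer.class",
  "com.example.DirectParquetOutputCommitter")
sqlContext.table("some_table")            // any existing DataFrame
  .saveAsParquetFile("s3n://bucket/out")  // assumed output location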
On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun Lee pl
Hi,
When I save a parquet file with SaveMode.Overwrite, it never generates
_common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Will reading perform better
when it is present?
Thanks,
--
Pei-Lun
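(One way to check the reported behavior is to list the summary files Parquet leaves in the output directory — a sketch; the output path is hypothetical:)

import org.apache.hadoop.fs.{FileSystem, Path}

// List which Parquet summary files were actually written after a save.
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(new Path("/tmp/out"))  // hypothetical output path
  .map(_.getPath.getName)
  .filter(n => n == "_metadata" || n == "_common_metadata")
  .foreach(println)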
1.0.4. Would you
mind opening a JIRA for this?
Cheng
On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
I'm using 1.0.4
Thanks,
--
Pei-Lun
On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote:
Hm, which version of Hadoop are you using? Actually there should also
be a _metadata