version matters here, but I did observe cases where Spark behaves differently because of semantic differences in the same API across Hadoop versions.
Cheng
On 3/27/15 11:33 AM, Pei-Lun Lee wrote:
Hi Cheng,
on my computer, execute res0.save(xxx, org.apache.spark.sql.SaveMode.Overwrite)
1.0.4. Would you mind opening a JIRA for this?
Cheng
On 3/27/15 2:40 PM, Pei-Lun Lee wrote:
I'm using 1.0.4
Thanks,
--
Pei-Lun
On Fri, Mar 27, 2015 at 2:32 PM, Cheng Lian lian.cs@gmail.com wrote:
Hm, which version of Hadoop are you using? Actually there should also be a _metadata file. _common_metadata contains only the merged schema (no per-row-group metadata), and thus can be faster to read than _metadata.
Cheng
On 3/26/15 12:48 PM, Pei-Lun Lee wrote:
Hi,
When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit
I updated the PR for SPARK-6352 to be more like SPARK-3595.
I added a new setting, spark.sql.parquet.output.committer.class, in the Hadoop configuration to allow a custom implementation of ParquetOutputCommitter.
Can someone take a look at the PR?
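For reference, a minimal sketch of how the proposed setting would be used (the committer class name here is hypothetical, not from the PR):

// Point Spark SQL's Parquet output at a custom committer through the
// Hadoop configuration, per the SPARK-6352 PR described above.
sc.hadoopConfiguration.set(
  "spark.sql.parquet.output.committer.class",
  "com.example.DirectParquetOutputCommitter")  // hypothetical class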
On Mon, Mar 16, 2015 at 5:23 PM, Pei-Lun Lee pl
Hi,
When I save a parquet file with SaveMode.Overwrite, it never generates _common_metadata, whether it overwrites an existing dir or not.
Is this expected behavior?
And what is the benefit of _common_metadata? Will reading perform better when it is present?
Thanks,
--
Pei-Lun
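For reference, a minimal sketch of the reported behavior (paths illustrative; Spark 1.3-era API):

// Saving with Overwrite: the output dir ends up with _metadata but no
// _common_metadata, whether or not the dir already existed.
val df = sqlContext.parquetFile("/tmp/in")  // illustrative input path
df.save("/tmp/out", org.apache.spark.sql.SaveMode.Overwrite)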
JIRA and PR for first issue:
https://issues.apache.org/jira/browse/SPARK-6408
https://github.com/apache/spark/pull/5087
On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am trying the JDBC data source in Spark SQL 1.3.0 and found some issues.
First, the syntax where
Hi,
I am trying the JDBC data source in Spark SQL 1.3.0 and found some issues.
First, the syntax where str_col='value' gives an error for both PostgreSQL and MySQL:
psql> create table foo(id int primary key, name text, age int);
bash> SPARK_CLASSPATH=postgresql-9.4-1201-jdbc41.jar
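A minimal sketch of the failing query shape (connection details illustrative; Spark 1.3-era API):

// Register the Postgres table through the JDBC data source, then filter
// on a string column; this is the shape that errored.
val jdbcDF = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://localhost/test",  // illustrative URL
  "dbtable" -> "foo"))
jdbcDF.registerTempTable("foo")
sqlContext.sql("SELECT * FROM foo WHERE name = 'bar'").collect()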
that direct dependency makes this injection much more
difficult for saveAsParquetFile.
On Thu, Mar 5, 2015 at 12:28 AM, Pei-Lun Lee pl...@appier.com wrote:
Thanks for the DirectOutputCommitter example.
However, I found it only works for saveAsHadoopFile. What about
saveAsParquetFile?
It looks
Thanks for the DirectOutputCommitter example.
However, I found it only works for saveAsHadoopFile. What about saveAsParquetFile?
It looks like Spark SQL is using ParquetOutputCommitter, which is a subclass of FileOutputCommitter.
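For context, a minimal sketch of the DirectOutputCommitter idea being discussed (old mapred API; a no-op committer that skips the temporary-directory rename, which is why it only takes effect for saveAsHadoopFile-style jobs):

import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

// No-op committer: tasks write directly to the destination, so there is
// nothing to move at commit time (useful on S3-like filesystems).
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}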
On Fri, Feb 27, 2015 at 1:52 AM, Thomas Demoor
I created https://issues.apache.org/jira/browse/SPARK-3947
On Tue, Oct 14, 2014 at 3:54 AM, Michael Armbrust mich...@databricks.com
wrote:
It's not on the roadmap for 1.2. I'd suggest opening a JIRA.
On Mon, Oct 13, 2014 at 4:28 AM, Pierre B
pierre.borckm...@realimpactanalytics.com wrote:
Hi,
You can merge them into one table by:
sqlContext.table("table_1")
  .unionAll(sqlContext.table("table_2"))
  .unionAll(sqlContext.table("table_3"))
  .registerTempTable("table_all")
Or load them in one call by:
:
Unfortunately, this is a query where we just don't have an efficient implementation yet. You might try switching the table order.
Here is the JIRA for doing something more efficient:
https://issues.apache.org/jira/browse/SPARK-2212
On Fri, Jul 18, 2014 at 7:05 AM, Pei-Lun Lee pl
:
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 1
Looks like Spark SQL tried to do a broadcast join, collecting one of the tables to the master, but it is too large.
How do we explicitly control join behavior like this?
--
Pei-Lun Lee
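One knob that controls this in Spark 1.1 and later (the setting did not exist under this name in the 1.0.x version discussed here) is the broadcast threshold:

// Setting the threshold to -1 disables automatic broadcast joins,
// forcing a shuffle join instead.
sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=-1")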
, but there is a PR open to fix it:
https://issues.apache.org/jira/browse/SPARK-2446
On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am using spark-sql 1.0.1 to load parquet files generated from the method described in:
https://gist.github.com/massie/7224868
When I
Filed SPARK-2446
2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com:
Oh, maybe not. Please file another JIRA.
On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi Michael,
Good to know it is being handled. I tried the master branch (9fe693b5) and got
Sorry, should be SPARK-2489
2014-07-15 19:22 GMT+08:00 Pei-Lun Lee pl...@appier.com:
Filed SPARK-2446
2014-07-15 16:17 GMT+08:00 Michael Armbrust mich...@databricks.com:
Oh, maybe not. Please file another JIRA.
On Tue, Jul 15, 2014 at 12:34 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
I am using spark-sql 1.0.1 to load parquet files generated from the method described in:
https://gist.github.com/massie/7224868
When I try to submit a select query with columns of type fixed-length byte array, the following error pops up:
14/07/14 11:09:14 INFO scheduler.DAGScheduler: Failed
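A minimal sketch of the failing query shape (table and column names hypothetical; Spark 1.0-era API):

// Load the Avro-generated Parquet file and select a fixed-length byte
// array column; this kind of SELECT triggered the error above.
val parquetRdd = sqlContext.parquetFile("/path/to/data.parquet")  // illustrative path
parquetRdd.registerAsTable("t")
sqlContext.sql("SELECT fixed_col FROM t").collect()  // fixed_col is hypothetical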
submitted.
Don’t know if that can help.
On Jun 26, 2014, at 6:41 AM, Pei-Lun Lee pl...@appier.com wrote:
Hi,
We have a long-running Spark application running on a Spark 1.0 standalone server, and after it runs for several hours the following exception shows up:
14/06/25 23:13:08 ERROR
-Lun Lee pl...@appier.com wrote:
Hi,
I am using Spark 1.0.0 and found that in Spark SQL some queries using GROUP BY give weird results.
To reproduce, type the following commands in a spark-shell connected to a standalone server:
case class Foo(k: String, v: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
], [c,270], [4,56], [1,1])
and if I run the same query again, the new result will be correct:
sql("select k, count(*) from foo group by k").collect
res2: Array[org.apache.spark.sql.Row] = Array([b,200], [a,100], [c,300])
Should I file a bug?
--
Pei-Lun Lee
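For reference, a hedged reconstruction of the truncated repro above (the exact data from the original email is elided; the row counts here are only chosen to match the expected result):

case class Foo(k: String, v: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._  // brings sql and the createSchemaRDD implicit into scope
val foo = sc.parallelize(
  Seq.fill(100)(Foo("a", 1)) ++ Seq.fill(200)(Foo("b", 1)) ++ Seq.fill(300)(Foo("c", 1)))
foo.registerAsTable("foo")  // Spark 1.0 API; later renamed registerTempTable
sql("select k, count(*) from foo group by k").collect()
// expected: Array([a,100], [b,200], [c,300])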