Hi all,
I create a dataframe, convert it to Avro with to_avro and write it to
Kafka.
Then I read it back out with from_avro.
(Not using Schema Registry.)
The problem is that the values skip every other field in the result.
I expect:
+---------+--------+-----+
|firstName|lastName|color|
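A minimal sketch of the round trip being described, assuming Spark 2.4+ with the spark-avro module and no Schema Registry (`df`, the topic name, the broker address, and the record schema are all illustrative assumptions, not from the original post). One way to get exactly this "values shifted into the wrong columns" symptom is passing `from_avro` a schema whose field order or types differ from what the writer used:

```scala
// Spark 3.x import path; in Spark 2.4 the functions live in
// org.apache.spark.sql.avro instead of ...avro.functions.
import org.apache.spark.sql.avro.functions.{from_avro, to_avro}
import org.apache.spark.sql.functions.struct

// Writer schema -- from_avro must be given this SAME schema on the read side.
val avroSchema =
  """{"type":"record","name":"person","fields":[
    |{"name":"firstName","type":"string"},
    |{"name":"lastName","type":"string"},
    |{"name":"color","type":"string"}]}""".stripMargin

// Write: serialize each row into the Kafka value column.
df.select(to_avro(struct("firstName", "lastName", "color")) as "value")
  .write.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
  .option("topic", "people")                           // placeholder
  .save()

// Read: deserialize with the identical schema, then flatten the struct.
val parsed = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "people")
  .load()
  .select(from_avro($"value", avroSchema) as "person")
  .select("person.*")
```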
I don't know if this is a bug or a feature, but it's a bit counter-intuitive
when reading code.
The "b" dataframe does not have field "bar" in its schema, but is still able to
filter on that field.
scala> val a = sc.parallelize(Seq((1,10),(2,20))).toDF("foo","bar")
a:
it.
On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud <nareshgoud.du...@gmail.com>
wrote:
> does this help?
>
> sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").as[(Int, Int)]
>   .map { case (foo, bar) => (foo, (foo, bar)) }
Is there a way to make outputs created with "partitionBy" to contain the
partitioned column? When reading the output with Spark or Hive or similar,
it's less of an issue because those tools know how to perform partition
discovery. But if I were to load the output into an external data warehouse
or
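A sketch of the usual workaround, since `partitionBy` moves the column into the directory path rather than into the data files: write a duplicate of the column under another name so one copy survives inside the files (`df`, the column names, and the output path below are illustrative assumptions):

```scala
import org.apache.spark.sql.functions.col

df.withColumn("date_copy", col("date")) // duplicate survives in the files
  .write
  .partitionBy("date")                  // "date" itself goes into the path only
  .parquet("s3://bucket/output")        // placeholder path
```

Downstream tools that do not perform partition discovery can then read `date_copy` like any ordinary column.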
I am finding using the Dataset API to be very cumbersome to use, which is
unfortunate, as I was looking forward to the type-safety after coming from
a Dataframe codebase.
This link summarizes my troubles: http://loicdescotte.github.io/posts/spark2-datasets-type-safety/
The problem is having to
d. In your example, what happens if data is of only 2 rows?
> On 27 Jul 2016 00:57, "Alex Nastetsky" <alex.nastet...@vervemobile.com>
> wrote:
>
>> Spark SQL has a "first" function that returns the first item in a group.
>> Is there a similar function
Spark SQL has a "first" function that returns the first item in a group. Is
there a similar function, perhaps in a third party lib, that allows you to
return an arbitrary (e.g. 3rd) item from the group? Was thinking of writing
a UDAF for it, but didn't want to reinvent the wheel. My end goal is to
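One way to avoid a custom UDAF entirely is a window function: number the rows within each group and keep the nth. This is a sketch under assumed column names (`key`, `ts`, `value`); the ordering column defines what "3rd" means:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Rank rows within each group by some ordering column.
val w = Window.partitionBy("key").orderBy("ts")

val third = df
  .withColumn("rn", row_number().over(w))
  .filter($"rn" === 3)   // keep only the 3rd row of each group
  .drop("rn")
```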
As a user of AWS EMR (running Spark and MapReduce), I am interested in
potential benefits that I may gain from Databricks Cloud. I was wondering
if anyone has used both and done comparison / contrast between the two
services.
In general, which resource manager(s) does Databricks Cloud use for
Ran into this need myself. Does Spark have an equivalent of "mapreduce.input.fileinputformat.list-status.num-threads"?
Thanks.
On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote:
> Hi,
>
> I am wondering if anyone has successfully enabled
>
Alex, see this jira-
> https://issues.apache.org/jira/browse/SPARK-9926
>
> On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> Ran into this need myself. Does Spark have an equivalent of "mapreduce.
>> input.fileinp
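For context, a sketch of the two knobs that play this role, depending on the code path (values are illustrative; SPARK-9926 is the JIRA cited above):

```scala
// For RDDs built on Hadoop input formats, the Hadoop property itself
// can be set on the SparkContext's Hadoop configuration:
sc.hadoopConfiguration.setInt(
  "mapreduce.input.fileinputformat.list-status.num-threads", 10)

// For Spark SQL file sources, SPARK-9926 parallelizes file listing once
// the number of input paths crosses this threshold:
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
```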
Hi,
I am using Spark 1.5.1.
I have a Spark SQL UDAF that works fine on a tiny dataset (13 narrow rows)
in local mode, but runs out of memory on YARN about half the time
(OutOfMemory: Java Heap Space). The rest of the time, it works on YARN.
Note that in all instances, the input data is the
Hi,
I believe I ran into the same bug in 1.5.0, although my error looks like
this:
Caused by: java.lang.ClassCastException:
[Lcom.verve.spark.sql.ElementWithCount; cannot be cast to
org.apache.spark.sql.types.ArrayData
at
with the identical join keys will be loaded by the
> same node/task , since lots of factors need to be considered, like task
> pool size, cluster size, source format, storage, data locality etc.,.
>
> I’ll agree it’s worth optimizing for performance concerns, and
> actually in Hive
Hi,
I'm trying to understand SortMergeJoin (SPARK-2213).
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMergeJoin after a "hashpartitioning".
CODE:
val
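A sketch of the kind of experiment being described (data and column names are assumptions): even when the two sides start with different partition counts, the planner repartitions both on the join key, so the plan shows an Exchange (hashpartitioning) feeding the SortMergeJoin:

```scala
// Two inputs with deliberately different partition counts.
val left  = sc.parallelize(1 to 100).map(i => (i, s"L$i")).toDF("k", "lv").repartition(4)
val right = sc.parallelize(1 to 100).map(i => (i, s"R$i")).toDF("k", "rv").repartition(7)

val joined = left.join(right, "k")
joined.explain() // expect SortMergeJoin above Exchange hashpartitioning(k, ...)
```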
I'm just trying to do some operation inside foreachPartition, but I can't
even get a simple println to work. Nothing gets printed.
scala> val a = sc.parallelize(List(1,2,3))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize
at :21
scala> a.foreachPartition(p =>
> On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> I'm just trying to do some operation inside foreachPartition, but I can't
>> even get a simple println to work. Nothing gets printed.
>>
>> scala> val a = sc.pa
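The usual explanation for the silence: the closure passed to `foreachPartition` runs on the executors, so `println` writes to the executors' stdout (visible in their logs), not the driver console. In `local[*]` mode it typically does print; on a cluster it won't. A sketch of both behaviors:

```scala
val a = sc.parallelize(List(1, 2, 3))

// Runs on executors -- output lands in executor logs, not the shell.
a.foreachPartition(part => part.foreach(x => println(s"saw $x")))

// To inspect partition contents on the driver instead, collect them back:
a.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(",")}")
}
```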
Does Spark have an implementation similar to CompositeInputFormat in
MapReduce?
CompositeInputFormat joins multiple datasets prior to the mapper, that are
partitioned the same way with the same number of partitions, using the
"part" number in the file name in each dataset to figure out which file
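Spark has no direct CompositeInputFormat equivalent, but the closest analogue is co-partitioning: if two pair RDDs share the same partitioner and partition count, a join between them is planned as a narrow dependency with no shuffle. A minimal sketch (data is illustrative):

```scala
import org.apache.spark.HashPartitioner

val p = new HashPartitioner(8)

// Both sides partitioned identically up front.
val left  = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p)
val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(p)

// Co-partitioned inputs: this join needs no shuffle.
val joined = left.join(right)
```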
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to
use it on myDF.rdd instead of converting it to a PairRDD first.
On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> Using Spark 1.5.1, Parquet 1.7.0.
>
> I'm tryin
Using Spark 1.5.1, Parquet 1.7.0.
I'm trying to write Avro/Parquet files. I have this code:
sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
classOf[AvroWriteSupport].getName)
AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
myDF.write.parquet(outputPath)
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this
(using avro-tools tojson):
{"field1":"value1","field2":976200}
{"field1":"value2","field2":976200}
{"field1":"value3","field2":614100}
But when I use spark-avro 2.0.1, it looks like this:
;
> On Wed, Oct 14, 2015 at 1:38 PM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
>> I save my dataframe to avro with spark-avro 1.0.0 and it looks like this
>> (using avro-tools tojson):
>>
>> {"field1":"value1","field2&quo
A lot of RDD methods take a numPartitions parameter that lets you specify
the number of partitions in the result. For example, groupByKey.
The DataFrame counterparts don't have a numPartitions parameter, e.g.
groupBy only takes a bunch of Columns as params.
I understand that the DataFrame API is
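For what it's worth, with DataFrames the shuffle width is controlled globally by `spark.sql.shuffle.partitions`, or per-lineage by repartitioning on the grouping key first. A sketch, assuming a later Spark version with `spark.conf` (`df` and the column name are assumptions):

```scala
// Global default for all SQL shuffles:
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Or explicitly for this one computation:
val counts = df
  .repartition(64, $"key")  // analogous to groupByKey's numPartitions
  .groupBy("key")
  .count()
```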
The doc for the DataFrameReader#json(RDD[String]) method says:
Unless the schema is specified using schema function, this function goes
through the input once to determine the input schema.
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
Why is this
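As the doc hints, supplying a schema up front is what skips that extra pass over the input. A sketch with assumed field names (`jsonRdd` stands for the RDD[String] in question):

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id",   LongType,   nullable = false),
  StructField("name", StringType, nullable = true)))

// With an explicit schema, no inference pass is needed.
val df = spark.read.schema(schema).json(jsonRdd)
```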
cause recomputing all the stages. Consider running the job by generating
the RDD from Dataframe and then using that.
Of course, you can use caching in both core and DataFrames, which will
solve all these concerns.
On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky
alex.nastet
Is it possible to restart the job from the last successful stage instead of
from the beginning?
For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a long
time and is successful, but the job fails on stage 1, it would be useful to
be able to restart from the output of stage 0
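Spark has no built-in "resume from stage N", so the common workaround is to persist the expensive intermediate result and make the job check for it on restart. A sketch under stated assumptions (`expensiveStage0` is a hypothetical step; the path is a placeholder):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val stage0Path = "hdfs:///tmp/job/stage0" // placeholder

val stage0 =
  if (fs.exists(new Path(stage0Path)))
    sc.objectFile[(String, Int)](stage0Path)  // reuse prior run's output
  else {
    val computed = expensiveStage0()          // hypothetical expensive step
    computed.saveAsObjectFile(stage0Path)
    computed
  }
```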
When I call rdd() on a DataFrame, it ends the current stage and starts a
new one that just maps the DataFrame to rdd and nothing else. It doesn't
seem to do a shuffle (which is good and expected), but then why is there a
separate stage?
I also thought that stages only end when there's a
I am reading JSON data that has different schemas for every record. That
is, for a given field that would have a null value, it's simply absent from
that record (and therefore, its schema).
I would like to use the DataFrame API to select specific fields from this
data, and for fields that are
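One way to handle this is to declare the superset schema with nullable fields; records that omit a field then come back with null rather than failing the select. A sketch with assumed field names:

```scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("always_there",     StringType, nullable = true),
  StructField("sometimes_absent", StringType, nullable = true)))

spark.read.schema(schema).json("data/*.json") // placeholder path
  .select("always_there", "sometimes_absent") // null where the field is absent
```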