Hi,
Can you show the code from map to reproduce the issue? You can create
encoders using Encoders object (I'm using it all over the place for schema
generation).
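For example, a quick sketch (the specific types here are just illustrative):

import org.apache.spark.sql.{Encoder, Encoders}

val tupleEnc: Encoder[(String, Int)] = Encoders.product[(String, Int)]  // case classes and tuples
val longEnc: Encoder[Long] = Encoders.scalaLong                         // primitives
val setEnc: Encoder[Set[String]] = Encoders.kryo[Set[String]]           // kryo fallback for unsupported types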
Jacek
On 25 Jan 2017 10:19 p.m., "Koert Kuipers" <ko...@tresata.com> wrote:
> i often run into problems like th
i often run into problems like this:
i need to write a Dataset[T] => Dataset[T], and inside i need to switch to
DataFrame for a particular operation.
but if i do:
dataset.toDF.map(...).as[T] i get error:
Unable to find encoder for type stored in a Dataset.
i know it has an encoder, because i
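a minimal sketch of the workaround i usually see suggested (this is an assumption, not something from this thread; the helper names viaDataFrame and mapRows are made up): require an implicit Encoder[T] from the caller so the final .as[T] can find it, and pass a RowEncoder explicitly for a Row-level map.

import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder  // note: catalyst-internal package

// the context bound supplies the Encoder[T] that .as[T] needs
def viaDataFrame[T: Encoder](ds: Dataset[T])(f: DataFrame => DataFrame): Dataset[T] =
  f(ds.toDF()).as[T]

// if the DataFrame step itself is a map over Rows, pass a RowEncoder explicitly
// (assumes f does not change the schema)
def mapRows(df: DataFrame)(f: Row => Row): DataFrame =
  df.map(f)(RowEncoder(df.schema))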
d thus will not change the
> schema of the dataframe.
>
> On Tue, Jan 24, 2017 at 8:28 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> scala> val x = Seq("a", "b").toDF("x")
>> x: org.apache.spark.sql.DataFrame = [x: string]
>>
>
fka producer, and write the
> data out to kafka.
> http://spark.apache.org/docs/latest/structured-streaming-
> programming-guide.html#using-foreach
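A rough sketch of such a foreach sink (the kafka producer wiring below is an assumption for illustration, not code from the guide):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// used as: ds.writeStream.foreach(new KafkaSink("mytopic", "broker:9092")).start()
class KafkaSink(topic: String, servers: String) extends ForeachWriter[String] {
  private var producer: KafkaProducer[String, String] = _
  def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", servers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }
  def process(value: String): Unit = producer.send(new ProducerRecord(topic, value))
  def close(errorOrNull: Throwable): Unit = if (producer != null) producer.close()
}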
>
> On Fri, Jan 13, 2017 at 8:28 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> how do you do this with structu
scala> val x = Seq("a", "b").toDF("x")
x: org.apache.spark.sql.DataFrame = [x: string]
scala> x.as[Array[Byte]].printSchema
root
|-- x: string (nullable = true)
scala> x.as[Array[Byte]].map(x => x).printSchema
root
|-- value: binary (nullable = true)
why does the first schema show string
looking at the docs for org.apache.spark.sql.expressions.Aggregator it says
for reduce method: "For performance, the function may modify `b` and return
it instead of constructing new object for b.".
it makes no such comment for the merge method.
this is surprising to me because i know that for
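for reference, a bare-bones Aggregator along these lines (illustrative only); the question is whether merge may mutate b1 the same way reduce may mutate b:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object CollectStrings extends Aggregator[String, ArrayBuffer[String], Seq[String]] {
  def zero: ArrayBuffer[String] = ArrayBuffer.empty[String]
  // the scaladoc explicitly allows mutating b here and returning it
  def reduce(b: ArrayBuffer[String], a: String): ArrayBuffer[String] = { b += a; b }
  // no such note here: is mutating b1 in place allowed too?
  def merge(b1: ArrayBuffer[String], b2: ArrayBuffer[String]): ArrayBuffer[String] = { b1 ++= b2; b1 }
  def finish(b: ArrayBuffer[String]): Seq[String] = b.toSeq
  def bufferEncoder: Encoder[ArrayBuffer[String]] = Encoders.kryo[ArrayBuffer[String]]
  def outputEncoder: Encoder[Seq[String]] = Encoders.kryo[Seq[String]]
}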
i get the same error using latest spark master branch
On Tue, Jan 17, 2017 at 6:24 PM, Koert Kuipers <ko...@tresata.com> wrote:
> and to be clear, this is not in the REPL or with Hive (both well known
> situations in which these errors arise)
>
> On Mon, Jan 16, 2017 at 11:51
sorry i meant to say SPARK-18980
On Sat, Jan 21, 2017 at 1:48 AM, Koert Kuipers <ko...@tresata.com> wrote:
> found it :) SPARK-1890
> thanks cloud-fan
>
> On Sat, Jan 21, 2017 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> trying to replicate this
i noticed when doing maven deploy for spark (for inhouse release) that it
tries to upload certain artifacts multiple times. for example it tried to
upload spark-network-common tests jar twice.
our inhouse repo doesnt appreciate this for releases. it will refuse the
second time.
also it makes no
found it :) SPARK-1890
thanks cloud-fan
On Sat, Jan 21, 2017 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote:
> trying to replicate this in spark itself i can for v2.1.0 but not for
> master. i guess it has been fixed
>
> On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers
trying to replicate this in spark itself i can for v2.1.0 but not for
master. i guess it has been fixed
On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i started printing out when kryo serializes my buffer data structure for
> my aggregator.
>
> i
into it). i realize that
in reality due to the order of the elements coming in this can not always
be achieved. but what i see instead is that the buffer is getting
serialized after every call to reduce a value into it, always. could this
be the reason it is so slow?
On Thu, Jan 19, 2017 at 4:17 PM, Koert
we just converted a job from RDD to Dataset. the job does a single map-red
phase using aggregators. we are seeing very bad performance for the Dataset
version, about 10x slower.
in the Dataset version we use kryo encoders for some of the aggregators.
based on some basic profiling of spark in
in our experience you can't really.
there are some settings to make spark wait longer before retrying when es
is overloaded, but i have never found them too much use.
check out these settings, maybe they are of some help:
es.batch.size.bytes
es.batch.size.entries
es.http.timeout
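for example, something like this when building the session (the values below are made up, tune for your cluster):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-writer")
  .config("es.batch.size.bytes", "1mb")     // smaller bulk requests
  .config("es.batch.size.entries", "500")
  .config("es.http.timeout", "5m")          // wait longer before giving up
  .getOrCreate()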
and to be clear, this is not in the REPL or with Hive (both well known
situations in which these errors arise)
On Mon, Jan 16, 2017 at 11:51 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i am experiencing a ScalaReflectionException exception when doing an
> aggregation on a spark-s
i am experiencing a ScalaReflectionException exception when doing an
aggregation on a spark-sql DataFrame. the error looks like this:
Exception in thread "main" scala.ScalaReflectionException: class
in JavaMirror with sun.misc.Launcher$AppClassLoader@28d93b30 of type class
how do you do this with structured streaming? i see no mention of writing
to kafka
On Fri, Jan 13, 2017 at 10:30 AM, Peyman Mohajerian
wrote:
> Yes, it is called Structured Streaming: https://docs.
> databricks.com/_static/notebooks/structured-streaming-kafka.html
>
i assumed topk of frequencies in one pass. if its topk by known
sorting/ordering then use priority queue aggregator instead of spacesaver.
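the core of the priority queue approach is just a bounded min-heap over the elements seen so far, something like this sketch (k and the types are illustrative):

import scala.collection.mutable.PriorityQueue

def topK[T](k: Int, xs: Iterator[T])(implicit ord: Ordering[T]): List[T] = {
  val pq = PriorityQueue.empty[T](ord.reverse)  // head is the smallest of the kept k
  xs.foreach { x =>
    if (pq.size < k) pq.enqueue(x)
    else if (ord.gt(x, pq.head)) { pq.dequeue(); pq.enqueue(x) }
  }
  var out = List.empty[T]
  while (pq.nonEmpty) out = pq.dequeue() :: out  // ends up largest-first
  out
}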
On Tue, Jan 3, 2017 at 3:11 PM, Koert Kuipers <ko...@tresata.com> wrote:
> i dont know anything about windowing or about not using devel
the defect SPARK-18787 to either force these settings when
> spark.shuffle.io.preferDirectBufs=false is set in spark config or
> document it.
>
> Hope it will be helpful for other users as well.
>
> Thanks,
> Aniket
>
> On Sat, Nov 26, 2016 at 3:31 PM Koert Kuipers <ko...@tresat
i dont know anything about windowing or about not using developer apis...
but
but a trivial implementation of top-k requires a total sort per group. this
can be done with dataset. we do this using spark-sorted (
https://github.com/tresata/spark-sorted) but its not hard to do it yourself
for
spark sql is "runtime strongly typed" meaning it must know the actual type.
so this will not work
On Jan 3, 2017 07:46, "Linyuxin" wrote:
> Hi all
>
> *With Spark 1.5.1*
>
>
>
> *When I want to implement an oracle decode function (like
> decode(col1,1,’xxx’,’p2’,’yyy’,0))*
>
ah yes you are right. i must not have fetched correctly earlier
On Wed, Dec 28, 2016 at 2:53 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:
> The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0
>
> On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers <ko...@
seems like the artifacts are on maven central but the website is not yet
updated.
strangely the tag v2.1.0 is not yet available on github. i assume its equal
to v2.1.0-rc5
On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller <
justin.mil...@protectwise.com> wrote:
> I'm curious about this as well.
spark only needs to be present on the machine that launches it using
spark-submit
On Sat, Dec 17, 2016 at 3:59 PM, Jorge Machado wrote:
> Hi Tiago,
>
> thx for the update. Last question: does this spark-submit that you are
> using need to be on the same version on all yarn hosts?
i have a dataframe on which i need to run many queries that start with a
filter on a column x.
currently i write the dataframe out to parquet datasource partitioned by
field x, after which i repeatedly read the datasource back in from parquet.
the queries are efficient because the filter gets
although this is correct, KeyValueGroupedDataset.coGroup requires one to
implement their own join logic with Iterator functions. its fun to do that,
and i appreciate the flexibility it gives, but i would not consider it a
good solution for someone that just wants to do a typed join
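to illustrate, a sketch of an inner join via cogroup (the case classes and the hand-rolled join logic are just for the example):

import org.apache.spark.sql.Dataset

case class LeftRow(k: String, v: Int)
case class RightRow(k: String, w: String)

def typedInnerJoin(left: Dataset[LeftRow], right: Dataset[RightRow]): Dataset[(LeftRow, RightRow)] = {
  import left.sparkSession.implicits._
  left.groupByKey(_.k).cogroup(right.groupByKey(_.k)) { (_, ls, rs) =>
    val rsSeq = rs.toSeq
    ls.flatMap(l => rsSeq.map(r => (l, r)))  // the join logic you have to write yourself
  }
}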
On Thu, Nov
i agree that offheap memory usage is unpredictable.
when we used rdds the memory was mostly on heap and total usage
predictable, and we almost never had yarn killing executors.
now with dataframes the memory usage is both on and off heap, and we have
no way of limiting the off heap memory usage
in YarnAllocator i see that memoryOverhead is by default set to
math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt,
MEMORY_OVERHEAD_MIN)
this does not take into account spark.memory.offHeap.size i think. should
it?
something like:
math.max((MEMORY_OVERHEAD_FACTOR * executorMemory +
we are testing Dataset/Dataframe jobs instead of RDD jobs. one thing we
keep running into is containers getting killed by yarn. i realize this has
to do with off-heap memory, and the suggestion is to increase
spark.yarn.executor.memoryOverhead.
at times our memoryOverhead is as large as the
getOrCreate uses existing SparkSession if available, in which case the
settings will be ignored
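i.e. something like this (sketch; the setting is just an example):

import org.apache.spark.sql.SparkSession

// builder configs only apply when the session is first created; if one already
// exists, getOrCreate returns it and these settings are ignored
val spark = SparkSession.builder()
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()

// for settings that can change at runtime, set them on the existing session instead
spark.conf.set("spark.sql.shuffle.partitions", "400")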
On Wed, Nov 16, 2016 at 10:55 PM, bluishpenguin
wrote:
> Hi all,
> I would like to configure the following setting during runtime as below:
>
> spark = (SparkSession
>
i am looking into writing a dataframe to parquet using partitioning. so
something like:
df
  .write
  .mode(saveMode)
  .partitionBy(partitionColumn)
  .format("parquet")
  .save(path)
i imagine i will have thousands of partitions. generally my goal is not to
recreate all partitions every time, but
We see the analyzer break down almost guaranteed when programs get to a
certain size or complexity. It starts complaining with messages along the
lines of "cannot find column x#255 in list of columns that includes x#255".
The workaround is to go to rdd and back. Is there a way to achieve the same
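The round trip itself is just this (a sketch, with df being the DataFrame in question):

// rebuild the DataFrame from its RDD and schema, which gives the analyzer fresh attribute ids
val reset = df.sparkSession.createDataFrame(df.rdd, df.schema)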
oh ok i see now its not the same
On Sat, Nov 12, 2016 at 2:48 PM, Koert Kuipers <ko...@tresata.com> wrote:
> not sure i see the faster algo in the paper you mention.
>
> i see this in section 6.1.2:
> "In what follows we give a simple labeling algorithm that computes
>
not sure i see the faster algo in the paper you mention.
i see this in section 6.1.2:
"In what follows we give a simple labeling algorithm that computes
connectivity on sparse graphs in O(log N) rounds."
N here is the size of the graph, not the largest component diameter.
that is the exact
the planner to look for Exchange->Sort pairs and change the
> exchange.
>
> On Fri, Nov 4, 2016 at 7:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i just noticed Sort for Dataset has a global flag. and Dataset also has
>> sortWithinPartitions.
>>
>>
shpartitioning(key#5, 2)
+- LocalTableScan [key#5, value#6]
On Fri, Nov 4, 2016 at 9:18 AM, Koert Kuipers <ko...@tresata.com> wrote:
> sure, but then my values are not sorted per key, right?
>
> so a group by key with values sorted according to to some ordering is an
&g
s are colocated, it is cheaper to
> do a groupByKey followed by a flatMapGroups
> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1828840559545742/2840265927289860/latest.html>
> .
>
>
>
> On Thu, Nov 3,
i guess i could sort by (hashcode(key), key, secondarySortColumn) and then
do mapPartitions?
sorry thinking out loud a bit here. ok i think that could work. thanks
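a sketch of that idea, but using repartition on the key plus sortWithinPartitions instead of a plain orderBy, so all rows for a key are guaranteed to land in one partition (column names are made up):

import org.apache.spark.sql.functions.col

val grouped = df
  .repartition(col("key"))                                       // co-locate each key in one partition
  .sortWithinPartitions(col("key"), col("secondarySortColumn"))

val result = grouped.rdd.mapPartitions { rows =>
  // rows for the same key are now adjacent and secondary-sorted
  rows  // replace with the actual per-group processing
}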
On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
> thats an interesting thought abou
thats an interesting thought about orderBy and mapPartitions. i guess i
could emulate a groupBy with secondary sort using those two. however isn't
using an orderBy expensive since it is a total sort? i mean a groupBy with
secondary sort is also a total sort under the hood, but its on
Oh okay that makes sense. The trick is to take max on tuple2 so you carry
the other column along.
It is still unclear to me why we should remember all these tricks (or add
lots of extra little functions) when this elegantly can be expressed in a
reduce operation with a simple one-line lambda
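For example, assuming a Dataset of a case class with key and value fields (a sketch):

// keep, per key, the row with the largest value, carrying the rest of the row along
ds.groupByKey(_.key)
  .reduceGroups((a, b) => if (a.value >= b.value) a else b)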
F is a good option .
>
> Can anyone share the best way to implement this in Spark .?
>
> Regards,
> Rabin Banerjee
>
> On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Just realized you only want to keep first element. You can do
Just realized you only want to keep first element. You can do this without
sorting by doing something similar to min or max operation using a custom
aggregator/udaf or reduceGroups on Dataset. This is also more efficient.
On Nov 3, 2016 7:53 AM, "Rabin Banerjee"
What you require is secondary sort which is not available as such for a
DataFrame. The Window operator is what comes closest but it is strangely
limited in its abilities (probably because it was inspired by a SQL
construct instead of a more generic programmatic transformation capability).
On Nov
ly share our opinions. It would be good to see something
> documented.
> This may be the cause of the issue?: https://issues.apache.
> org/jira/browse/CSV-135
>
> From: Koert Kuipers <ko...@tresata.com>
> Date: Thursday, October 27, 2016 at 12:49 PM
>
> To: "Jai
> Do you mind sharing why escaping should not work without quotes?
>
> From: Koert Kuipers <ko...@tresata.com>
> Date: Thursday, October 27, 2016 at 12:40 PM
> To: "Jain, Nishit" <nja...@underarmour.com>
> Cc: "user@spark.apache.org" <use
that is what i would expect: escaping only works if quoted
On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit
wrote:
> Interesting finding: Escaping works if data is quoted but not otherwise.
>
> From: "Jain, Nishit"
> Date: Thursday, October 27, 2016
i tried setting both dateFormat and timestampFormat to impossible values
(e.g. "~|.G~z~a|wW") and it still detected my data to be TimestampType
On Wed, Oct 26, 2016 at 1:15 PM, Koert Kuipers <ko...@tresata.com> wrote:
> we had the inference of dates/timestamps when reading
root
|-- date: timestamp (nullable = true)
On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> There are now timestampFormat for TimestampType and dateFormat for
> DateType.
>
> Do you mind if I ask to share your codes?
>
> On 27 Oct 2016 2:16 a.m., &
is there a reason a column with dates in format -mm-dd in a csv file is
inferred to be TimestampType and not DateType?
thanks! koert
we had the inference of dates/timestamps when reading csv files disabled in
spark 2.0.0 by always setting dateFormat to something impossible (e.g.
dateFormat "~|.G~z~a|wW")
i noticed in spark 2.0.1 that setting this impossible dateFormat does not
stop spark from inferring it is a date or
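for reference, the read roughly looks like this (a sketch; the impossible format string is the one mentioned above, and newer versions split timestampFormat out from dateFormat):

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "~|.G~z~a|wW")       // meant to make date inference impossible
  .option("timestampFormat", "~|.G~z~a|wW")
  .csv("path/to/file.csv")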
groupBy always materializes the entire group (on disk or in memory) which
is why you should avoid it for large groups.
The key is to never materialize the grouped and shuffled data.
To see one approach to do this take a look at
https://github.com/tresata/spark-sorted
It's basically a
This rather innocent looking optimization flag nullable has caused a lot of
bugs... Makes me wonder if we are better off without it
On Oct 21, 2016 8:37 PM, "Muthu Jayakumar" wrote:
> Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0.
>
> Thanks,
> Muthu
A single json object would mean for most parsers it needs to fit in memory
when reading or writing
On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:
> Hi:
>I'm doubt about the design of spark.read.json, why the json file is not
> a standard json file, who can tell me the internal
about the stackoverflow question, do this:
def validateAndTransform(df: DataFrame): DataFrame = {
  import df.sparkSession.implicits._
  ...
}
On Fri, Oct 14, 2016 at 5:51 PM, Koert Kuipers <ko...@tresata.com> wrote:
> basically the implicit conversions that need it are rd
basically the implicit conversions that need it are rdd => dataset and seq => dataset
On Fri, Oct 14, 2016 at 5:47 PM, Koert Kuipers <ko...@tresata.com> wrote:
> for example when do you Seq(1,2,3).toDF("a") it needs to get the
> SparkSession from somewhere. by
for example when do you Seq(1,2,3).toDF("a") it needs to get the
SparkSession from somewhere. by importing the implicits from
spark.implicits._ they have access to a SparkSession for operations like
this.
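i.e. something like this (sketch):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._         // these implicits capture this particular session
val df = Seq(1, 2, 3).toDF("a")  // toDF uses that captured session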
On Fri, Oct 14, 2016 at 4:42 PM, Jakub Dubovsky <
spark.dubovsky.ja...@gmail.com> wrote:
>
hadoop filesystem api at all?
On Thu, Oct 6, 2016 at 9:45 AM, Koert Kuipers <ko...@tresata.com> wrote:
> well it seems to work if set spark.sql.warehouse.dir to
> /tmp/spark-warehouse in spark-defaults, and it creates it on hdfs.
>
> however can this directory safely be share
i just replaced out spark 2.0.0 install on yarn cluster with spark 2.0.1
and copied over the configs.
to give it a quick test i started spark-shell and created a dataset. i get
this:
16/10/05 23:55:13 WARN spark.SparkContext: Use an existing SparkContext,
some configuration may not take effect.
oh i forgot in step1 you will have to modify spark's pom.xml to include
cloudera repo so it can find the cloudera artifacts
anyhow we found this process to be pretty easy and we stopped using the
spark versions bundles with the distros
On Mon, Sep 26, 2016 at 3:57 PM, Koert Kuipers <
it is also easy to launch many different spark versions on yarn by simply
having them installed side-by-side.
1) build spark for your cdh version. for example for cdh 5 i do:
$ git checkout v2.0.0
$ dev/make-distribution.sh --name cdh5.4-hive --tgz -Phadoop-2.6
-Dhadoop.version=2.6.0-cdh5.4.4
https://issues.apache.org/jira/browse/SPARK-17668
On Mon, Sep 26, 2016 at 3:40 PM, Koert Kuipers <ko...@tresata.com> wrote:
> ok will create jira
>
> On Mon, Sep 26, 2016 at 3:27 PM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> I agree this shoul
IRA.
>
> On Sun, Sep 25, 2016 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> after having gotten used to have case classes represent complex
>> structures in Datasets, i am surprised to find out that when i work in
>> DataFrames with udfs no such magic exists
Bedrytski Aliaksandr
> sp...@bedryt.ski
>
>
>
> On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote:
>
> after having gotten used to have case classes represent complex structures
> in Datasets, i am surprised to find out that when i work in DataFrames with
> ud
not pretty but this works:
import org.apache.spark.sql.functions.udf
df.withColumn("array", udf(() => Seq(1, 2, 3)).apply())
On Sun, Sep 25, 2016 at 6:13 PM, Jason White
wrote:
> It seems that `functions.lit` doesn't support ArrayTypes. To reproduce:
>
>
after having gotten used to have case classes represent complex structures
in Datasets, i am surprised to find out that when i work in DataFrames with
udfs no such magic exists, and i have to fall back to manipulating Row
objects, which is error prone and somewhat ugly.
for example:
case class
i noticed much to my surprise that spark csv supports newlines inside
quoted values. ok thats cool
but how does this work with splitting files when reading? i assume
splitting is still simply done on newlines or something similar. wouldnt
that potentially split in the middle of a record?
for unit tests i would like to create a SparkSession that does not load
anything from system properties, similar to:
new SQLContext(new SparkContext(new SparkConf(loadDefaults = false)))
how do i go about doing this? i dont see a way.
thanks! koert
why do you want the array to have nullable = false? what is the benefit?
On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki
wrote:
> Dear all,
> Would it be possible to let me know how to change nullable property in
> Dataset?
>
> When I looked for how to change nullable
you cannot mix spark 1 and spark 2 jars
change this
libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1"
to
libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0"
On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh wrote:
> Hi,
>
> In Spark
you can get it from the SparkSession for backwards compatibility:
val sqlContext = spark.sqlContext
On Mon, Aug 8, 2016 at 9:11 AM, Mich Talebzadeh
wrote:
> Hi,
>
> In Spark 1.6.1 this worked
>
> scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(),
you could have a very large key? perhaps a token value?
i love the rdd api but have found that for joins dataframe/dataset performs
better. maybe can you do the joins in that?
On Thu, Aug 11, 2016 at 7:41 PM, Muttineni, Vinay
wrote:
> Hello,
>
> I have a spark job that
i generally like the type inference feature of the spark-sql csv
datasource, however i have been stung several times by date inference. the
problem is that when a column is converted to a date type the original data
is lost. this is not a lossless conversion. and i often have a requirement
where i
thanks
On Fri, Aug 5, 2016 at 5:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> Yes, the 2.0 history server should be backwards compatible.
>
> On Fri, Aug 5, 2016 at 2:14 PM, Koert Kuipers <ko...@tresata.com> wrote:
> > we have spark 1.5.x, 1.6.x an
we have spark 1.5.x, 1.6.x and 2.0.0 job running on yarn
but yarn can have only one spark history server.
what to do? is it safe to use the spark 2 history server to report on spark
1 jobs?
we share a single sparksession across tests, and they can run in
parallel. is pretty fast
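something along these lines (a sketch; the names are made up and the test-framework wiring is left out):

import org.apache.spark.sql.SparkSession

object SharedSparkSession {
  // one lazily-created local session, shared by all suites in the jvm
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("tests")
    .getOrCreate()
}

trait SparkTestSupport {
  def spark: SparkSession = SharedSparkSession.spark
}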
On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson
wrote:
> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup that
> brings up a local master as in
leveraging this new repo?
>
>
> <dependency>
>   <groupId>org.apache.orc</groupId>
>   <artifactId>orc</artifactId>
>   <version>1.1.2</version>
>   <type>pom</type>
> </dependency>
>
> Sent from my iPhone
> On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
> parquet was inspired by dremel but written
[1]https://parquet.apache.org/documentation/latest/
> [2]https://orc.apache.org/docs/
> [3]
> http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
>
> On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote:
>
> when parq
when parquet came out it was developed by a community of companies, and was
designed as a library to be supported by multiple big data projects. nice.
orc on the other hand initially only supported hive. it wasn't even
designed as a library that can be re-used. even today it brings in the
kitchen
The whole point of a well designed global filesystem is to not move the data
On Jul 19, 2016 10:07, "Koert Kuipers" <ko...@tresata.com> wrote:
> If you run hdfs on those ssds (with low replication factor) wouldn't it
> also effectively write to local disk with low latency?
in my codebase i would like to gradually transition to SparkSession, so
while i start using SparkSession i also want a SQLContext to be available
as before (but with a deprecated warning when i use it). this should be
easy since SQLContext is now a wrapper for SparkSession.
so basically:
val
secondary sort first?
>>
>> On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <naik.puni...@gmail.com>
>> wrote:
>>
>>> Okay. Can't I supply the same partitioner I used for
>>> "repartitionAndSortWithinPartitions" as an
the same partitioner I used for
> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"?
>
> On 14-Jul-2016 11:38 PM, "Koert Kuipers" <ko...@tresata.com> wrote:
>
>> repartitionAndSortWithinPartitions partitions the rdd and so
wrote:
> Hi Koert
>
> I have already used "repartitionAndSortWithinPartitions" for secondary
> sorting and it works fine. Just wanted to know whether it will sort the
> entire RDD or not.
>
> On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com>
repartitionAndSortWithinPartitions sorts by keys, not values per key, so not
really secondary sort by itself.
for secondary sort also check out:
https://github.com/tresata/spark-sorted
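for completeness, the usual trick to get secondary sort directly out of repartitionAndSortWithinPartitions is to move the value into the key and partition on the original key only; a sketch with illustrative types:

import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k, _) => delegate.getPartition(k)  // partition on the original key only
    case other  => delegate.getPartition(other)
  }
}

def secondarySort(rdd: RDD[(String, Int)], partitions: Int): RDD[((String, Int), Unit)] =
  rdd
    .map { case (k, v) => ((k, v), ()) }     // composite key (k, v), empty value
    .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(partitions))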
On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote:
> Hi guys
>
> In my spark/scala code I
i dont see any easy way to extend the plans, beyond creating a custom
version of spark.
On Thu, Jul 7, 2016 at 9:31 AM, tan shai wrote:
> Hi,
>
> I need to add new operations to the dataframe API.
> Can any one explain to me how to extend the plans of query execution?
>
>
, tan shai <tan.shai...@gmail.com> wrote:
> Using partitioning with dataframes, how can we retrieve informations about
> partitions? partitions bounds for example
>
> Thanks,
> Shaira
>
> 2016-07-07 6:30 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
>
spark does keep some information on the partitions of an RDD, namely the
partitioning/partitioner.
GroupSorted is an extension for key-value RDDs that also keeps track of the
ordering, allowing for faster joins, non-reduce type operations on very
large groups of values per key, etc.
see here:
le the functionality discussed in SPARK-15598 ?
> without changing how the Aggregator works.
>
> I bypassed it by using Optional (Guava) because I'm using the Java API,
> but it's a bit cumbersome...
>
> Thanks,
> Amit
>
> On Thu, Jun 30, 2016 at 1:54 AM Koert Kuipers &
its the difference between a semigroup and a monoid, and yes max does not
easily fit into a monoid.
see also discussion here:
https://issues.apache.org/jira/browse/SPARK-15598
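in other words (a sketch):

// max combines fine (a semigroup) but has no natural zero element;
// wrapping in Option supplies the missing zero (None), giving a monoid
def combine(a: Int, b: Int): Int = a max b
def combineOpt(a: Option[Int], b: Option[Int]): Option[Int] =
  (a.toList ++ b.toList).reduceOption(_ max _)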
On Mon, Jun 27, 2016 at 3:19 AM, Amit Sela wrote:
> OK. I see that, but the current (provided)
an implicit encoder for Option[X] given an implicit encoder for X would be
nice, i run into this often too.
i do not think it exists. your best bet is to hope ExpressionEncoder will do...
On Thu, Jun 23, 2016 at 2:16 PM, Richard Marscher
wrote:
> Is there a proper way to
You can try passing in an explicit encoder:
org.apache.spark.sql.Encoders.kryo[Set[com.wix.accord.Violation]]
Although this might only be available in spark 2, i don't remember top of
my head...
On Wed, Jun 8, 2016 at 11:57 PM, Koert Kuipers <ko...@tresata.com> wrote:
> Sets are not
Sets are not supported. you basically need to stick to products (tuples,
case classes), Seq and Map (and in spark 2 also Option).
Or you need to resort to the kryo-based encoder.
On Wed, Jun 8, 2016 at 3:45 PM, Peter Halliday wrote:
> I have some code that was producing
").show(false)
> +--------+--------+
> |_1      |_2      |
> +--------+--------+
> |[foo,42]|[foo,42]|
> |[bar,24]|[bar,24]|
> +--------+--------+
>
> Pozdrawiam,
> Jacek Laskowski
>
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark h
for some operators on Dataset, like joinWith, one needs to use an
expression which means referring to columns by name.
how can i set the column names for a Dataset before doing a joinWith?
currently i am aware of:
df.toDF("k", "v").as[(K, V)]
but that seems inefficient/anti-pattern? i shouldn't
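i.e. something like this (a sketch; spark is the session, ds1/ds2 are existing Datasets, and the column and element types are made up):

import spark.implicits._

val left  = ds1.toDF("k", "v").as[(String, Int)]
val right = ds2.toDF("k", "w").as[(String, Double)]
val joined = left.joinWith(right, left("k") === right("k"))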
mhh i would not be very happy if the implication is that i have to start
maintaining separate spark builds for client clusters that use java 8...
On Mon, Jun 6, 2016 at 4:34 PM, Ted Yu wrote:
> Please see:
> https://spark.apache.org/docs/latest/security.html
>
> w.r.t. Java
i am glad to see this, i think we ran into this as well (in 2.0.0-SNAPSHOT)
but i couldn't reproduce it nicely.
my observation was that joins of 2 datasets that were derived from the same
datasource gave this kind of trouble. i changed my datasource from val to
def (so it got created twice) as a
., 0), (2, ..., 5)).toDF("input", "c0", "c1", other
> needed columns, "cX")
> df.select(func($"a").as("r"), $"c0", $"c1", $"cX").select($"r._1",
> $"r._2", $"
our input dataframe?
>
> // maropu
>
> On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> that is nice and compact, but it does not add the columns to an existing
>> dataframe
>>
>> On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro <
We do disk-to-disk iterative algorithms in spark all the time, on datasets
that do not fit in memory, and it works well for us. I usually have to do
some tuning of number of partitions for a new dataset but that's about it
in terms of inconveniences.
On May 26, 2016 2:07 AM, "Jörn Franke"