Re: can we plz open up encoder on dataset

2017-01-26 Thread Koert Kuipers
e: Hi, Can you show the code from map to reproduce the issue? You can create encoders using Encoders object (I'm using it all over the place for schema generation). Jacek On 25 Jan 2017 10:19 p.m., "Koert Kuipers" <ko...@tresata.com> wrote: > i often run into problems like th

can we plz open up encoder on dataset

2017-01-25 Thread Koert Kuipers
i often run into problems like this: i need to write a Dataset[T] => Dataset[T], and inside i need to switch to DataFrame for a particular operation. but if i do: dataset.toDF.map(...).as[T] i get error: Unable to find encoder for type stored in a Dataset. i know it has an encoder, because i
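
A minimal sketch of the kind of helper this thread is asking for, assuming the caller supplies an implicit Encoder[T] explicitly (the Dataset's own encoder is not publicly exposed); the helper name is made up.

```scala
// Hypothetical helper illustrating the workaround: require an implicit
// Encoder[T] so that .as[T] compiles after the DataFrame detour.
import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

def viaDataFrame[T: Encoder](ds: Dataset[T])(f: DataFrame => DataFrame): Dataset[T] =
  f(ds.toDF()).as[T]
```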

Re: printSchema showing incorrect datatype?

2017-01-25 Thread Koert Kuipers
d thus will not change the > schema of the dataframe. > > On Tue, Jan 24, 2017 at 8:28 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> scala> val x = Seq("a", "b").toDF("x") >> x: org.apache.spark.sql.DataFrame = [x: string] >> >

Re: Spark SQL DataFrame to Kafka Topic

2017-01-24 Thread Koert Kuipers
fka producer, and write the > data out to kafka. > http://spark.apache.org/docs/latest/structured-streaming- > programming-guide.html#using-foreach > > On Fri, Jan 13, 2017 at 8:28 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> how do you do this with structu

printSchema showing incorrect datatype?

2017-01-24 Thread Koert Kuipers
scala> val x = Seq("a", "b").toDF("x") x: org.apache.spark.sql.DataFrame = [x: string] scala> x.as[Array[Byte]].printSchema root |-- x: string (nullable = true) scala> x.as[Array[Byte]].map(x => x).printSchema root |-- value: binary (nullable = true) why does the first schema show string

Aggregator mutate b1 in place in merge

2017-01-23 Thread Koert Kuipers
looking at the docs for org.apache.spark.sql.expressions.Aggregator it says for reduce method: "For performance, the function may modify `b` and return it instead of constructing new object for b.". it makes no such comment for the merge method. this is surprising to me because i know that for
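
For concreteness, a minimal sketch (types and names hypothetical) of the shape of Aggregator under discussion: reduce mutates and returns its buffer as the scaladoc permits, and the open question is whether merge may treat b1 the same way.

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}
import scala.collection.mutable

object CollectInts extends Aggregator[Int, mutable.ArrayBuffer[Int], Seq[Int]] {
  def zero: mutable.ArrayBuffer[Int] = mutable.ArrayBuffer.empty[Int]

  // documented as safe: modify `b` in place and return it
  def reduce(b: mutable.ArrayBuffer[Int], a: Int): mutable.ArrayBuffer[Int] = { b += a; b }

  // the question in this thread: is mutating b1 here equally safe?
  def merge(b1: mutable.ArrayBuffer[Int], b2: mutable.ArrayBuffer[Int]): mutable.ArrayBuffer[Int] = { b1 ++= b2; b1 }

  def finish(b: mutable.ArrayBuffer[Int]): Seq[Int] = b.toSeq
  def bufferEncoder: Encoder[mutable.ArrayBuffer[Int]] = Encoders.kryo[mutable.ArrayBuffer[Int]]
  def outputEncoder: Encoder[Seq[Int]] = Encoders.kryo[Seq[Int]]
}
```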

Re: ScalaReflectionException (class not found) error for user class in spark 2.1.0

2017-01-23 Thread Koert Kuipers
i get the same error using latest spark master branch On Tue, Jan 17, 2017 at 6:24 PM, Koert Kuipers <ko...@tresata.com> wrote: > and to be clear, this is not in the REPL or with Hive (both well known > situations in which these errors arise) > > On Mon, Jan 16, 2017 at 11:51

Re: dataset aggregators with kryo encoder very slow

2017-01-21 Thread Koert Kuipers
sorry i meant to say SPARK-18980 On Sat, Jan 21, 2017 at 1:48 AM, Koert Kuipers <ko...@tresata.com> wrote: > found it :) SPARK-1890 > thanks cloud-fan > > On Sat, Jan 21, 2017 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> trying to replicate this

mvn deploy tries to upload artifacts multiple times

2017-01-21 Thread Koert Kuipers
i noticed when doing maven deploy for spark (for inhouse release) that it tries to upload certain artifacts multiple times. for example it tried to upload spark-network-common tests jar twice. our inhouse repo doesnt appreciate this for releases. it will refuse the second time. also it makes no

Re: dataset aggregators with kryo encoder very slow

2017-01-20 Thread Koert Kuipers
found it :) SPARK-1890 thanks cloud-fan On Sat, Jan 21, 2017 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote: > trying to replicate this in spark itself i can for v2.1.0 but not for > master. i guess it has been fixed > > On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers

Re: dataset aggregators with kryo encoder very slow

2017-01-20 Thread Koert Kuipers
trying to replicate this in spark itself i can for v2.1.0 but not for master. i guess it has been fixed On Fri, Jan 20, 2017 at 4:57 PM, Koert Kuipers <ko...@tresata.com> wrote: > i started printing out when kryo serializes my buffer data structure for > my aggregator. > > i

Re: dataset aggregators with kryo encoder very slow

2017-01-20 Thread Koert Kuipers
into it). i realize that in reality due to the order of the elements coming in this can not always be achieved. but what i see instead is that the buffer is getting serialized after every call to reduce a value into it, always. could this be the reason it is so slow? On Thu, Jan 19, 2017 at 4:17 PM, Koert

dataset aggregators with kryo encoder very slow

2017-01-19 Thread Koert Kuipers
we just converted a job from RDD to Dataset. the job does a single map-red phase using aggregators. we are seeing very bad performance for the Dataset version, about 10x slower. in the Dataset version we use kryo encoders for some of the aggregators. based on some basic profiling of spark in

Re: Spark / Elasticsearch Error: Maybe ES was overloaded? How to throttle down Spark as it writes to ES

2017-01-17 Thread Koert Kuipers
in our experience you can't really. there are some settings to make spark wait longer before retrying when es is overloaded, but i have never found them too much use. check out these settings, maybe they are of some help: es.batch.size.bytes es.batch.size.entries es.http.timeout
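
For reference, a sketch of where the settings named above go when writing with the elasticsearch-hadoop data source; host, index, and values are placeholders, not recommendations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("es-write").getOrCreate()
val df = spark.range(1000).toDF("id")

// smaller bulk requests and a longer HTTP timeout make overload less likely,
// but there is no true back-pressure knob, which is the point being made above
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "es-host:9200")          // placeholder
  .option("es.batch.size.bytes", "1mb")
  .option("es.batch.size.entries", "500")
  .option("es.http.timeout", "2m")
  .save("myindex/mytype")                      // placeholder index/type
```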

Re: ScalaReflectionException (class not found) error for user class in spark 2.1.0

2017-01-17 Thread Koert Kuipers
and to be clear, this is not in the REPL or with Hive (both well known situations in which these errors arise) On Mon, Jan 16, 2017 at 11:51 PM, Koert Kuipers <ko...@tresata.com> wrote: > i am experiencing a ScalaReflectionException exception when doing an > aggregation on a spark-s

ScalaReflectionException (class not found) error for user class in spark 2.1.0

2017-01-16 Thread Koert Kuipers
i am experiencing a ScalaReflectionException exception when doing an aggregation on a spark-sql DataFrame. the error looks like this: Exception in thread "main" scala.ScalaReflectionException: class in JavaMirror with sun.misc.Launcher$AppClassLoader@28d93b30 of type class

Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Koert Kuipers
how do you do this with structured streaming? i see no mention of writing to kafka On Fri, Jan 13, 2017 at 10:30 AM, Peyman Mohajerian wrote: > Yes, it is called Structured Streaming: https://docs. > databricks.com/_static/notebooks/structured-streaming-kafka.html >

Re: top-k function for Window

2017-01-04 Thread Koert Kuipers
i assumed topk of frequencies in one pass. if its topk by known sorting/ordering then use priority queue aggregator instead of spacesaver. On Tue, Jan 3, 2017 at 3:11 PM, Koert Kuipers <ko...@tresata.com> wrote: > i dont know anything about windowing or about not using devel

Re: OS killing Executor due to high (possibly off heap) memory usage

2017-01-03 Thread Koert Kuipers
the defect SPARK-18787 to either force these settings when > spark.shuffle.io.preferDirectBufs=false is set in spark config or > document it. > > Hope it will be helpful for other users as well. > > Thanks, > Aniket > > On Sat, Nov 26, 2016 at 3:31 PM Koert Kuipers <ko...@tresat

Re: top-k function for Window

2017-01-03 Thread Koert Kuipers
i dont know anything about windowing or about not using developer apis... but but a trivial implementation of top-k requires a total sort per group. this can be done with dataset. we do this using spark-sorted ( https://github.com/tresata/spark-sorted) but its not hard to do it yourself for
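
A minimal sketch of the "no total sort per group needed if the ordering is known" point: top-k per key with a bounded priority queue via groupByKey/mapGroups (an Aggregator version would look similar). Data and k are made up.

```scala
import org.apache.spark.sql.SparkSession
import scala.collection.mutable

val spark = SparkSession.builder().master("local[2]").appName("topk").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 5), ("a", 9), ("a", 1), ("b", 7), ("b", 3)).toDS()
val k = 2

val topK = ds.groupByKey(_._1).mapGroups { (key, it) =>
  // keep only the k largest values seen so far using a min-heap of size k
  val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
  it.foreach { case (_, v) =>
    heap += v
    if (heap.size > k) heap.dequeue()
  }
  (key, heap.dequeueAll.reverse.toList)
}
```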

Re: can UDF accept "Any"/"AnyVal"/"AnyRef"(java Object) as parameter or as return type ?

2017-01-03 Thread Koert Kuipers
spark sql is "runtime strongly typed" meaning it must know the actual type. so this will not work On Jan 3, 2017 07:46, "Linyuxin" wrote: > Hi all > > *With Spark 1.5.1* > > > > *When I want to implement a oracle decode function (like > decode(col1,1,’xxx’,’p2’,’yyy’,0))* >
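
A short illustration of the point being made: udf() derives the SQL schema from the Scala parameter and return types, so Any/AnyRef cannot be mapped to a column type.

```scala
import org.apache.spark.sql.functions.udf

// works: concrete types, so spark can derive the schema
val decode = udf((x: Int) => if (x == 1) "xxx" else "yyy")

// does not work as intended: Any has no SQL type, so no schema can be derived
// val decodeAny = udf((x: Any) => x.toString)
```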

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Koert Kuipers
ah yes you are right. i must not have fetched correctly earlier On Wed, Dec 28, 2016 at 2:53 PM, Mark Hamstra <m...@clearstorydata.com> wrote: > The v2.1.0 tag is there: https://github.com/apache/spark/tree/v2.1.0 > > On Wed, Dec 28, 2016 at 2:04 PM, Koert Kuipers <ko...@

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Koert Kuipers
seems like the artifacts are on maven central but the website is not yet updated. strangely the tag v2.1.0 is not yet available on github. i assume its equal to v2.1.0-rc5 On Fri, Dec 23, 2016 at 10:52 AM, Justin Miller < justin.mil...@protectwise.com> wrote: > I'm curious about this as well.

Re: Running Multiple Versions of Spark on the same cluster (YARN)

2016-12-17 Thread Koert Kuipers
spark only needs to be present on the machine that launches it using spark-submit On Sat, Dec 17, 2016 at 3:59 PM, Jorge Machado wrote: > Hi Tiago, > > thx for the update. Lat question : but this spark-submit that you are > using need to be on the same version on all yarn hosts ?

efficient filtering on a dataframe

2016-12-06 Thread Koert Kuipers
i have a dataframe on which i need to run many queries that start with a filter on a column x. currently i write the dataframe out to parquet datasource partitioned by field x, after which i repeatedly read the datasource back in from parquet. the queries are efficient because the filter gets
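
A sketch of the workflow described (paths and column names are placeholders): write once partitioned by x, then each query only reads the matching partition directories.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partitioned").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("x", "value")

// one-time write, partitioned on the filter column
df.write.mode("overwrite").partitionBy("x").parquet("/tmp/bytable")

// repeated reads: the filter on x is answered by partition pruning,
// so only the x=a directory is scanned
val onlyA = spark.read.parquet("/tmp/bytable").filter($"x" === "a")
```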

Re: type-safe join in the new DataSet API?

2016-11-26 Thread Koert Kuipers
although this is correct, KeyValueGroupedDataset.coGroup requires one to implement their own join logic with Iterator functions. its fun to do that, and i appreciate the flexibility it gives, but i would not consider it a good solution for someone that just wants to do a typed join On Thu, Nov

Re: OS killing Executor due to high (possibly off heap) memory usage

2016-11-26 Thread Koert Kuipers
i agree that offheap memory usage is unpredictable. when we used rdds the memory was mostly on heap and total usage predictable, and we almost never had yarn killing executors. now with dataframes the memory usage is both on and off heap, and we have no way of limiting the off heap memory usage

spark.yarn.executor.memoryOverhead

2016-11-23 Thread Koert Kuipers
in YarnAllocator i see that memoryOverhead is by default set to math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN)) this does not take into account spark.memory.offHeap.size i think. should it? something like: math.max((MEMORY_OVERHEAD_FACTOR * executorMemory +
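
In Spark at the time MEMORY_OVERHEAD_FACTOR is 0.10 and MEMORY_OVERHEAD_MIN is 384 (MB). Below is a sketch of the current formula next to the kind of change being suggested; the "proposed" version is my guess at the intended completion of the truncated suggestion, not a quote from the thread.

```scala
val memoryOverheadFactor = 0.10
val memoryOverheadMin    = 384   // MB

def currentOverhead(executorMemoryMb: Int): Int =
  math.max((memoryOverheadFactor * executorMemoryMb).toInt, memoryOverheadMin)

// assumed completion: fold spark.memory.offHeap.size into the same max()
def proposedOverhead(executorMemoryMb: Int, offHeapMb: Int): Int =
  math.max((memoryOverheadFactor * executorMemoryMb + offHeapMb).toInt, memoryOverheadMin)
```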

spark sql jobs heap memory

2016-11-23 Thread Koert Kuipers
we are testing Dataset/Dataframe jobs instead of RDD jobs. one thing we keep running into is containers getting killed by yarn. i realize this has to do with off-heap memory, and the suggestion is to increase spark.yarn.executor.memoryOverhead. at times our memoryOverhead is as large as the

Re: Configure spark.kryoserializer.buffer.max at runtime does not take effect

2016-11-17 Thread Koert Kuipers
getOrCreate uses existing SparkSession if available, in which case the settings will be ignored On Wed, Nov 16, 2016 at 10:55 PM, bluishpenguin wrote: > Hi all, > I would like to configure the following setting during runtime as below: > > spark = (SparkSession >
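
What the answer describes, as a small sketch: getOrCreate() hands back the already-existing session, so config set on a later builder may not take effect (the Kryo buffer size in particular is a SparkConf setting fixed when the context is created).

```scala
import org.apache.spark.sql.SparkSession

val first = SparkSession.builder()
  .master("local[2]")
  .config("spark.kryoserializer.buffer.max", "256m")
  .getOrCreate()

val second = SparkSession.builder()
  .config("spark.kryoserializer.buffer.max", "1g")   // ignored: session already exists
  .getOrCreate()

assert(first eq second)   // same underlying session
```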

replace some partitions when writing dataframe

2016-11-17 Thread Koert Kuipers
i am looking into writing a dataframe to parquet using partioning. so something like df .write .mode(saveMode) .partitionBy(partitionColumn) .format("parquet") .save(path) i imagine i will have thousands of partitions. generally my goal is not to recreate all partitions every time, but
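
One commonly used workaround at the time, sketched here (not necessarily the thread's conclusion): compute only the affected partition and write it straight into its directory, leaving the other partitions untouched. Path and column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("replace-partition").getOrCreate()
import spark.implicits._

val path = "/tmp/mytable"                              // placeholder
val df   = Seq(("2016-11-17", 1), ("2016-11-16", 2)).toDF("partitionColumn", "value")

// recompute and rewrite only one partition directory, leaving the rest alone
df.filter($"partitionColumn" === "2016-11-17")
  .drop("partitionColumn")
  .write
  .mode("overwrite")
  .parquet(s"$path/partitionColumn=2016-11-17")
```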

SQL analyzer breakdown

2016-11-15 Thread Koert Kuipers
We see the analyzer break down almost guaranteed when programs get to a certain size or complexity. It starts complaining with messages along the lines of "cannot find column x#255 in list of columns that includes x#255". The workaround is to go to rdd and back. Is there a way to achieve the same

Re: Strongly Connected Components

2016-11-12 Thread Koert Kuipers
oh ok i see now its not the same On Sat, Nov 12, 2016 at 2:48 PM, Koert Kuipers <ko...@tresata.com> wrote: > not sure i see the faster algo in the paper you mention. > > i see this in section 6.1.2: > "In what follows we give a simple labeling algorithm that computes >

Re: Strongly Connected Components

2016-11-12 Thread Koert Kuipers
not sure i see the faster algo in the paper you mention. i see this in section 6.1.2: "In what follows we give a simple labeling algorithm that computes connectivity on sparse graphs in O(log N) rounds." N here is the size of the graph, not the largest component diameter. that is the exact

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
the planner to look for Exchange->Sort pairs and change the > exchange. > > On Fri, Nov 4, 2016 at 7:06 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> i just noticed Sort for Dataset has a global flag. and Dataset also has >> sortWithinPartitions. >> >>

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
shpartitioning(key#5, 2) +- LocalTableScan [key#5, value#6] On Fri, Nov 4, 2016 at 9:18 AM, Koert Kuipers <ko...@tresata.com> wrote: > sure, but then my values are not sorted per key, right? > > so a group by key with values sorted according to to some ordering is an &g

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
s are colocated, it is cheaper to > do a groupByKey followed by a flatMapGroups > <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1828840559545742/2840265927289860/latest.html> > . > > > > On Thu, Nov 3,

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i guess i could sort by (hashcode(key), key, secondarySortColumn) and then do mapPartitions? sorry thinking out loud a bit here. ok i think that could work. thanks On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers <ko...@tresata.com> wrote: > thats an interesting thought abou

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
thats an interesting thought about orderBy and mapPartitions. i guess i could emulate a groupBy with secondary sort using those two. however isn't using an orderBy expensive since it is a total sort? i mean a groupBy with secondary sort is also a total sort under the hood, but its on

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Oh okay that makes sense. The trick is to take max on tuple2 so you carry the other column along. It is still unclear to me why we should remember all these tricks (or add lots of extra little functions) when this elegantly can be expressed in a reduce operation with a simple one line lamba

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
F is a good option . > > Can anyone share the best way to implement this in Spark .? > > Regards, > Rabin Banerjee > > On Thu, Nov 3, 2016 at 6:59 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> Just realized you only want to keep first element. You can do

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Just realized you only want to keep first element. You can do this without sorting by doing something similar to min or max operation using a custom aggregator/udaf or reduceGroups on Dataset. This is also more efficient. On Nov 3, 2016 7:53 AM, "Rabin Banerjee"
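
A sketch of the suggestion above: keep the "first" (here: lowest timestamp) row per key with a single reduceGroups, no sort and no window. The schema is made up.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("first-per-key").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 2L, "later"), ("a", 1L, "first"), ("b", 5L, "only")).toDS()  // (key, ts, value)

val firstPerKey = ds
  .groupByKey(_._1)
  .reduceGroups((x, y) => if (x._2 <= y._2) x else y)   // "min by ts", carrying the whole row
  .map(_._2)                                            // drop the key added by reduceGroups
```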

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
What you require is secondary sort which is not available as such for a DataFrame. The Window operator is what comes closest but it is strangely limited in its abilities (probably because it was inspired by a SQL construct instead of a more generic programmatic transformation capability). On Nov

Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
ly share our opinions. It would be good to see something > documented. > This may be the cause of the issue?: https://issues.apache. > org/jira/browse/CSV-135 > > From: Koert Kuipers <ko...@tresata.com> > Date: Thursday, October 27, 2016 at 12:49 PM > > To: "Jai

Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
t; Do you mind sharing why should escaping not work without quotes? > > From: Koert Kuipers <ko...@tresata.com> > Date: Thursday, October 27, 2016 at 12:40 PM > To: "Jain, Nishit" <nja...@underarmour.com> > Cc: "user@spark.apache.org" <use

Re: CSV escaping not working

2016-10-27 Thread Koert Kuipers
that is what i would expect: escaping only works if quoted On Thu, Oct 27, 2016 at 1:24 PM, Jain, Nishit wrote: > Interesting finding: Escaping works if data is quoted but not otherwise. > > From: "Jain, Nishit" > Date: Thursday, October 27, 2016
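
For reference, a sketch of the reader options in play; the observation in this thread is that the escape character is only honored inside quoted fields. The path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-escape").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("quote", "\"")     // escaping applies within quoted values...
  .option("escape", "\\")    // ...unquoted fields are not un-escaped
  .csv("/path/to/data.csv")
```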

Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Koert Kuipers
i tried setting both dateFormat and timestampFormat to impossible values (e.g. "~|.G~z~a|wW") and it still detected my data to be TimestampType On Wed, Oct 26, 2016 at 1:15 PM, Koert Kuipers <ko...@tresata.com> wrote: > we had the inference of dates/timestamps when reading

Re: spark infers date to be timestamp type

2016-10-26 Thread Koert Kuipers
ot |-- date: timestamp (nullable = true) On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > There are now timestampFormat for TimestampType and dateFormat for > DateType. > > Do you mind if I ask to share your codes? > > On 27 Oct 2016 2:16 a.m., &

spark infers date to be timestamp type

2016-10-26 Thread Koert Kuipers
is there a reason a column with dates in format yyyy-mm-dd in a csv file is inferred to be TimestampType and not DateType? thanks! koert

csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Koert Kuipers
we had the inference of dates/timestamps when reading csv files disabled in spark 2.0.0 by always setting dateFormat to something impossible (e.g. dateFormat "~|.G~z~a|wW") i noticed in spark 2.0.1 that setting this impossible dateFormat does not stop spark from inferring it is a date or
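
A sketch of the 2.0.0-era workaround being described, which this thread reports no longer suppresses inference in 2.0.1 (path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-infer").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "~|.G~z~a|wW")        // intentionally impossible pattern
  .option("timestampFormat", "~|.G~z~a|wW")   // 2.0.1+ has a separate timestamp option
  .csv("/path/to/data.csv")
```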

Re: RDD groupBy() then random sort each group ?

2016-10-22 Thread Koert Kuipers
groupBy always materializes the entire group (on disk or in memory) which is why you should avoid it for large groups. The key is to never materialize the grouped and shuffled data. To see one approach to do this take a look at https://github.com/tresata/spark-sorted It's basically a

Re: Dataframe schema...

2016-10-21 Thread Koert Kuipers
This rather innocent looking optimization flag nullable has caused a lot of bugs... Makes me wonder if we are better off without it On Oct 21, 2016 8:37 PM, "Muthu Jayakumar" wrote: > Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0. > > Thanks, > Muthu

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-16 Thread Koert Kuipers
A single json object would mean for most parsers it needs to fit in memory when reading or writing On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote: > Hi: >I'm doubt about the design of spark.read.json, why the json file is not > a standard json file, who can tell me the internal

Re: import sql.implicits._

2016-10-14 Thread Koert Kuipers
about the stackoverflow question, do this: def validateAndTransform(df: DataFrame) : DataFrame = { import df.sparkSession.implicits._ ... } On Fri, Oct 14, 2016 at 5:51 PM, Koert Kuipers <ko...@tresata.com> wrote: > b > ​asically the implicit conversiosn that need it are rd

Re: import sql.implicits._

2016-10-14 Thread Koert Kuipers
basically the implicit conversions that need it are rdd => dataset and seq => dataset On Fri, Oct 14, 2016 at 5:47 PM, Koert Kuipers <ko...@tresata.com> wrote: > for example when you do Seq(1,2,3).toDF("a") it needs to get the > SparkSession from somewhere. by

Re: import sql.implicits._

2016-10-14 Thread Koert Kuipers
for example when you do Seq(1,2,3).toDF("a") it needs to get the SparkSession from somewhere. by importing the implicits from spark.implicits._ they have access to a SparkSession for operations like this. On Fri, Oct 14, 2016 at 4:42 PM, Jakub Dubovsky < spark.dubovsky.ja...@gmail.com> wrote: >

Re: spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

2016-10-06 Thread Koert Kuipers
hadoop filesystem api at all? On Thu, Oct 6, 2016 at 9:45 AM, Koert Kuipers <ko...@tresata.com> wrote: > well it seems to work if set spark.sql.warehouse.dir to > /tmp/spark-warehouse in spark-defaults, and it creates it on hdfs. > > however can this directory safely be share

spark 2.0.1 upgrade breaks on WAREHOUSE_PATH

2016-10-05 Thread Koert Kuipers
i just replaced out spark 2.0.0 install on yarn cluster with spark 2.0.1 and copied over the configs. to give it a quick test i started spark-shell and created a dataset. i get this: 16/10/05 23:55:13 WARN spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-26 Thread Koert Kuipers
oh i forgot in step1 you will have to modify spark's pom.xml to include cloudera repo so it can find the cloudera artifacts anyhow we found this process to be pretty easy and we stopped using the spark versions bundles with the distros On Mon, Sep 26, 2016 at 3:57 PM, Koert Kuipers <

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-26 Thread Koert Kuipers
it is also easy to launch many different spark versions on yarn by simply having them installed side-by-side. 1) build spark for your cdh version. for example for cdh 5 i do: $ git checkout v2.0.0 $ dev/make-distribution.sh --name cdh5.4-hive --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.4

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-17668 On Mon, Sep 26, 2016 at 3:40 PM, Koert Kuipers <ko...@tresata.com> wrote: > ok will create jira > > On Mon, Sep 26, 2016 at 3:27 PM, Michael Armbrust <mich...@databricks.com> > wrote: > >> I agree this shoul

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
IRA. > > On Sun, Sep 25, 2016 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> after having gotten used to have case classes represent complex >> structures in Datasets, i am surprised to find out that when i work in >> DataFrames with udfs no such magic exists

Re: udf forces usage of Row for complex types?

2016-09-26 Thread Koert Kuipers
Bedrytski Aliaksandr > sp...@bedryt.ski > > > > On Sun, Sep 25, 2016, at 23:41, Koert Kuipers wrote: > > after having gotten used to have case classes represent complex structures > in Datasets, i am surprised to find out that when i work in DataFrames with > ud

Re: ArrayType support in Spark SQL

2016-09-25 Thread Koert Kuipers
not pretty but this works: import org.apache.spark.sql.functions.udf df.withColumn("array", sqlf.udf({ () => Seq(1, 2, 3) }).apply()) On Sun, Sep 25, 2016 at 6:13 PM, Jason White wrote: > It seems that `functions.lit` doesn't support ArrayTypes. To reproduce: > >

udf forces usage of Row for complex types?

2016-09-25 Thread Koert Kuipers
after having gotten used to have case classes represent complex structures in Datasets, i am surprised to find out that when i work in DataFrames with udfs no such magic exists, and i have to fall back to manipulating Row objects, which is error prone and somewhat ugly. for example: case class
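
A short illustration of the asymmetry being described, in the Spark 2.x API: a udf can return a case class (it becomes a struct column), but a struct argument arrives as a Row that has to be picked apart by hand.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

case class Point(x: Double, y: Double)

// output side: the case class is turned into a struct column automatically
val makePoint = udf((x: Double, y: Double) => Point(x, y))

// input side: no such magic -- the struct comes in as a Row
val shiftX = udf((p: Row) => Point(p.getDouble(0) + 1.0, p.getDouble(1)))
```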

newlines inside csv quoted values

2016-08-30 Thread Koert Kuipers
i noticed much to my surprise that spark csv supports newlines inside quoted values. ok thats cool but how does this work with splitting files when reading? i assume splitting is still simply done on newlines or something similar. wouldnt that potentially split in the middle of a record?

create SparkSession without loading defaults for unit tests

2016-08-16 Thread Koert Kuipers
for unit tests i would like to create a SparkSession that does not load anything from system properties, similar to: new SQLContext(new SparkContext(new SparkConf(loadDefaults = false))) how do i go about doing this? i dont see a way. thanks! koert

Re: Change nullable property in Dataset schema

2016-08-15 Thread Koert Kuipers
why do you want the array to have nullable = false? what is the benefit? On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki wrote: > Dear all, > Would it be possible to let me know how to change nullable property in > Dataset? > > When I looked for how to change nullable

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers
/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers
you cannot mix spark 1 and spark 2 jars change this libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" to libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh wrote: > Hi, > > In Spark

Re: Spark 2 and existing code with sqlContext

2016-08-12 Thread Koert Kuipers
you can get it from the SparkSession for backwards compatibility: val sqlContext = spark.sqlContext On Mon, Aug 8, 2016 at 9:11 AM, Mich Talebzadeh wrote: > Hi, > > In Spark 1.6.1 this worked > > scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(),

Re: Losing executors due to memory problems

2016-08-12 Thread Koert Kuipers
you could have a very large key? perhaps a token value? i love the rdd api but have found that for joins dataframe/dataset performs better. maybe can you do the joins in that? On Thu, Aug 11, 2016 at 7:41 PM, Muttineni, Vinay wrote: > Hello, > > I have a spark job that

type inference csv dates

2016-08-12 Thread Koert Kuipers
i generally like the type inference feature of the spark-sql csv datasource, however i have been stung several times by date inference. the problem is that when a column is converted to a date type the original data is lost. this is not a lossless conversion. and i often have a requirement where i

Re: spark historyserver backwards compatible

2016-08-05 Thread Koert Kuipers
thanks On Fri, Aug 5, 2016 at 5:21 PM, Marcelo Vanzin <van...@cloudera.com> wrote: > Yes, the 2.0 history server should be backwards compatible. > > On Fri, Aug 5, 2016 at 2:14 PM, Koert Kuipers <ko...@tresata.com> wrote: > > we have spark 1.5.x, 1.6.x an

spark historyserver backwards compatible

2016-08-05 Thread Koert Kuipers
we have spark 1.5.x, 1.6.x and 2.0.0 job running on yarn but yarn can have only one spark history server. what to do? is it safe to use the spark 2 history server to report on spark 1 jobs?

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Koert Kuipers
we share a single sparksession across tests, and they can run in parallel. is pretty fast On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson wrote: > Hi, > > Right now, if any code uses DataFrame/Dataset, I need a test setup that > brings up a local master as in
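
A minimal sketch of the setup described (object and trait names are made up): one lazily created local session shared by all test suites.

```scala
import org.apache.spark.sql.SparkSession

object SharedSpark {
  // created once, on first use, and reused by every suite
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .getOrCreate()
}

trait SharedSparkTest {
  def spark: SparkSession = SharedSpark.spark
}
```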

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
leveraging this new repo? > > <dependency> <groupId>org.apache.orc</groupId> <artifactId>orc</artifactId> <version>1.1.2</version> <type>pom</type> </dependency> > > Sent from my iPhone > On Jul 26, 2016, at 4:50 PM, Koert Kuipers <ko...@tresata.com> wrote: > > parquet was inspired by dremel but written

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
[1]https://parquet.apache.org/documentation/latest/ > [2]https://orc.apache.org/docs/ > [3] > http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet > > On 26 Jul 2016, at 15:19, Koert Kuipers <ko...@tresata.com> wrote: > > when parq

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen

Re: Execute function once on each node

2016-07-19 Thread Koert Kuipers
The whole point of a well designed global filesystem is to not move the data On Jul 19, 2016 10:07, "Koert Kuipers" <ko...@tresata.com> wrote: > If you run hdfs on those ssds (with low replication factor) wouldn't it > also effectively write to local disk with low latency?

transtition SQLContext to SparkSession

2016-07-18 Thread Koert Kuipers
in my codebase i would like to gradually transition to SparkSession, so while i start using SparkSession i also want a SQLContext to be available as before (but with a deprecated warning when i use it). this should be easy since SQLContext is now a wrapper for SparkSession. so basically: val
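
A sketch of the transition pattern described: new code uses the SparkSession, while legacy code keeps receiving the SQLContext it wraps, marked deprecated so usages produce a warning. The holder object is hypothetical.

```scala
import org.apache.spark.sql.{SparkSession, SQLContext}

object Sessions {   // hypothetical holder
  val spark: SparkSession = SparkSession.builder().appName("app").getOrCreate()

  @deprecated("use spark (SparkSession) instead", "internal")
  val sqlContext: SQLContext = spark.sqlContext
}
```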

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
secondary sort first? >> >> On Thu, Jul 14, 2016 at 11:52 PM, Punit Naik <naik.puni...@gmail.com> >> wrote: >> >>> Okay. Can't I supply the same partitioner I used for >>> "repartitionAndSortWithinPartitions" as an

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
the same partitioner I used for > "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? > > On 14-Jul-2016 11:38 PM, "Koert Kuipers" <ko...@tresata.com> wrote: > >> repartitionAndSortWithinPartitions partitions the rdd and so

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
wrote: > Hi Koert > > I have already used "repartitionAndSortWithinPartitions" for secondary > sorting and it works fine. Just wanted to know whether it will sort the > entire RDD or not. > > On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com>

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
repartitionAndSortWithinPartitions sorts by keys, not values per key, so not really secondary sort by itself. for secondary sort also check out: https://github.com/tresata/spark-sorted On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote: > Hi guys > > In my spark/scala code I
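
A sketch of how the secondary sort is usually assembled on top of it (partitioner and key layout are illustrative): put (key, secondary) in a composite key, partition on the key part only, and let the shuffle sort on the whole composite.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.{HashPartitioner, Partitioner}

val spark = SparkSession.builder().master("local[2]").appName("secondary-sort").getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(
  (("a", 2), "x"), (("a", 1), "y"), (("b", 3), "z")
))  // ((key, secondary), value)

// route rows by the primary key only, so each group stays in one partition
class KeyPartitioner(n: Int) extends Partitioner {
  private val hp = new HashPartitioner(n)
  def numPartitions: Int = n
  def getPartition(composite: Any): Int =
    hp.getPartition(composite.asInstanceOf[(String, Int)]._1)
}

// within each partition, rows are ordered by (key, secondary)
val sorted = rdd.repartitionAndSortWithinPartitions(new KeyPartitioner(4))
```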

Re: Extend Dataframe API

2016-07-07 Thread Koert Kuipers
i dont see any easy way to extend the plans, beyond creating a custom version of spark. On Thu, Jul 7, 2016 at 9:31 AM, tan shai wrote: > Hi, > > I need to add new operations to the dataframe API. > Can any one explain to me how to extend the plans of query execution? > >

Re: Question regarding structured data and partitions

2016-07-07 Thread Koert Kuipers
, tan shai <tan.shai...@gmail.com> wrote: > Using partitioning with dataframes, how can we retrieve informations about > partitions? partitions bounds for example > > Thanks, > Shaira > > 2016-07-07 6:30 GMT+02:00 Koert Kuipers <ko...@tresata.com>: >

Re: Question regarding structured data and partitions

2016-07-06 Thread Koert Kuipers
spark does keep some information on the partitions of an RDD, namely the partitioning/partitioner. GroupSorted is an extension for key-value RDDs that also keeps track of the ordering, allowing for faster joins, non-reduce type operations on very large groups of values per key, etc. see here:

Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-07-01 Thread Koert Kuipers
le the functionality discussed in SPARK-15598 ? > without changing how the Aggregator works. > > I bypassed it by using Optional (Guava) because I'm using the Java API, > but it's a bit cumbersome... > > Thanks, > Amit > > On Thu, Jun 30, 2016 at 1:54 AM Koert Kuipers &

Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-06-29 Thread Koert Kuipers
its the difference between a semigroup and a monoid, and yes max does not easily fit into a monoid. see also discussion here: https://issues.apache.org/jira/browse/SPARK-15598 On Mon, Jun 27, 2016 at 3:19 AM, Amit Sela wrote: > OK. I see that, but the current (provided)

Re: Option Encoder

2016-06-23 Thread Koert Kuipers
an implicit encoder for Option[X] given an implicit encoder for X would be nice, i run into this often too. i do not think it exists. your best is to hope ExpressionEncoder will do... On Thu, Jun 23, 2016 at 2:16 PM, Richard Marscher wrote: > Is there a proper way to

Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
You can try passing in an explicit encoder: org.apache.spark.sql.Encoders.kryo[Set[com.wix.accord.Violation]] Although this might only be available in spark 2, i don't remember top of my head... On Wed, Jun 8, 2016 at 11:57 PM, Koert Kuipers <ko...@tresata.com> wrote: > Sets are not

Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
Sets are not supported. you basically need to stick to products (tuples, case classes), Seq and Map (and in spark 2 also Option). Or you can need to resort to the kryo-based encoder. On Wed, Jun 8, 2016 at 3:45 PM, Peter Halliday wrote: > I have some code that was producing

Re: setting column names on dataset

2016-06-07 Thread Koert Kuipers
").show(false) > +++ > |_1 |_2 | > +++ > |[foo,42]|[foo,42]| > |[bar,24]|[bar,24]| > +++ > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark h

setting column names on dataset

2016-06-07 Thread Koert Kuipers
for some operators on Dataset, like joinWith, one needs to use an expression which means referring to columns by name. how can i set the column names for a Dataset before doing a joinWith? currently i am aware of: df.toDF("k", "v").as[(K, V)] but that seems inefficient/anti-pattern? i shouldn't

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Koert Kuipers
mhh i would not be very happy if the implication is that i have to start maintaining separate spark builds for client clusters that use java 8... On Mon, Jun 6, 2016 at 4:34 PM, Ted Yu wrote: > Please see: > https://spark.apache.org/docs/latest/security.html > > w.r.t. Java

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Koert Kuipers
i am glad to see this, i think we ran into this as well (in 2.0.0-SNAPSHOT) but i couldn't reproduce it nicely. my observation was that joins of 2 datasets that were derived from the same datasource gave this kind of trouble. i changed my datasource from val to def (so it got created twice) as a

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
., 0), (2, ..., 5)).toDF("input", "c0", "c1", other > needed columns, "cX") > df.select(func($"a").as("r"), $"c0", $"c1", $"cX").select($"r._1", > $"r._2", $"
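
A cleaned-up sketch of the pattern quoted above: one udf returning a tuple (a struct with fields _1 and _2), flattened back into two ordinary columns on the existing dataframe. Column names are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("two-outputs").getOrCreate()
import spark.implicits._

val df = Seq((1, "keep"), (2, "also keep")).toDF("input", "c0")

val twoOut = udf((a: Int) => (a + 1, a * 2))   // tuple result becomes a struct column

val result = df
  .withColumn("r", twoOut($"input"))
  .select($"*", $"r._1".as("out1"), $"r._2".as("out2"))
  .drop("r")
```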

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
our input dataframe? > > // maropu > > On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> that is nice and compact, but it does not add the columns to an existing >> dataframe >> >> On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro <

Re: Pros and Cons

2016-05-26 Thread Koert Kuipers
We do disk-to-disk iterative algorithms in spark all the time, on datasets that do not fit in memory, and it works well for us. I usually have to do some tuning of number of partitions for a new dataset but that's about it in terms of inconveniences. On May 26, 2016 2:07 AM, "Jörn Franke"
