Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
.toDF("a", "b") > df.select(func($"a").as("r")).select($"r._1", $"r._2") > > // maropu > > > On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> hello all, >> >>

Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
>> >> Cheng >> >> >> On 5/25/16 12:30 PM, Reynold Xin wrote: >> >> Based on this discussion I'm thinking we should deprecate the two explode >> functions. >> >> On Wednesday, May 25, 2016, Koert Kuipers < <ko...@tresata.com>

unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Koert Kuipers
hello all, i have a single udf that creates 2 outputs (so a tuple 2). i would like to add these 2 columns to my dataframe. my current solution is along these lines: df .withColumn("_temp_", udf(inputColumns)) .withColumn("x", col("_temp_")("_1")) .withColumn("y", col("_temp_")("_2"))
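
A minimal sketch of that pattern, with made-up data, column names and udf body; the Tuple2 returned by the udf becomes a struct column whose fields can then be pulled out:

    import org.apache.spark.sql.functions.{col, udf}
    import sqlContext.implicits._

    // hypothetical data and udf, only to illustrate the shape of the pattern above
    val df = Seq(("foo", 1), ("bar", 2)).toDF("a", "b")
    val pair = udf((s: String) => (s.toUpperCase, s.length)) // Tuple2 result becomes a struct column

    val result = df
      .withColumn("_temp_", pair(col("a")))
      .withColumn("x", col("_temp_")("_1"))
      .withColumn("y", col("_temp_")("_2"))
      .drop("_temp_")

The reply in this thread avoids the temporary column by selecting the struct fields directly, i.e. df.select(func($"a").as("r")).select($"r._1", $"r._2").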

Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
w that it exists (i.e. explode($"arrayCol").as("Item")). It would be >> great to understand more why you are using these instead. >> >> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers <ko...@tresata.com> wrote: >> >>> we currently have 2

feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
we currently have 2 explode definitions in Dataset: def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame 1) the separation of the functions
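
For reference, a hedged sketch of how the second (string-based) variant is typically called; the column names and the split function here are made up:

    // assumes a DataFrame df with a string column "line"; each input row is
    // expanded into one output row per word, carried in a new "word" column
    val words = df.explode("line", "word") { line: String => line.split(" ") }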

Re: Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Koert Kuipers
no but you can trivially build spark 1.6.1 for scala 2.11 On Wed, May 18, 2016 at 6:11 PM, Sergey Zelvenskiy wrote: > >

Re: VectorAssembler handling null values

2016-04-20 Thread Koert Kuipers
thanks for that, its good to know that functionality exists. but shouldn't a decision tree be able to handle missing (aka null) values more intelligently than simply using replacement values? see for example here:

Re: Apache Flink

2016-04-17 Thread Koert Kuipers
i never found much info that flink was actually designed to be fault tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that doesn't bode well for large scale data processing. spark was designed with fault tolerance in mind from the beginning. On Sun, Apr 17, 2016 at 9:52 AM,

Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
mbrust <mich...@databricks.com> wrote: > Did you see these? > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scala/typed.scala#L70 > > On Tue, Apr 12, 2016 at 9:46 AM, Koert Kuipers <ko...@tresata.com> wrote: >

Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
better because i have encoders so i can use kryo). On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers <ko...@tresata.com> wrote: > saw that, dont think it solves it. i basically want to add some children > to the expression i guess, to indicate what i am operating on? not sure if > e

Re: Aggregator support in DataFrame

2016-04-11 Thread Koert Kuipers
recently: > https://github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d > > I'm not sure that solves your problem though... > > On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> i like the Aggregator a lot >> (org.ap

Aggregator support in DataFrame

2016-04-11 Thread Koert Kuipers
i like the Aggregator a lot (org.apache.spark.sql.expressions.Aggregator), but i find the way to use it somewhat confusing. i am supposed to simply call aggregator.toColumn, but that doesn't allow me to specify which fields it operates on in a DataFrame. i would basically like to do something
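
For context, a minimal sketch of the 1.6-era Aggregator being discussed; the key/value types are hypothetical, and later versions changed the encoder handling:

    import org.apache.spark.sql.expressions.Aggregator
    import sqlContext.implicits._ // supplies the implicit encoders toColumn needs in 1.6

    // sums the Int half of (String, Int) pairs; zero/reduce/merge/finish is the whole contract
    val sumValues = new Aggregator[(String, Int), Int, Int] {
      def zero: Int = 0
      def reduce(b: Int, a: (String, Int)): Int = b + a._2
      def merge(b1: Int, b2: Int): Int = b1 + b2
      def finish(b: Int): Int = b
    }.toColumn

    // assuming ds: Dataset[(String, Int)]; this works on a typed Dataset, but there is no
    // obvious way to point the aggregator at chosen DataFrame columns, which is the question here
    val perKey = ds.groupBy(_._1).agg(sumValues)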

Re: Datasets combineByKey

2016-04-10 Thread Koert Kuipers
yes it is On Apr 10, 2016 3:17 PM, "Amit Sela" wrote: > I think *org.apache.spark.sql.expressions.Aggregator* is what I'm looking > for, makes sense ? > > On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote: > >> I'm mapping RDD API to Datasets API and I

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-26 Thread Koert Kuipers
To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals with this... On Mar 26, 2016 21:25, "Mich Talebzadeh" wrote: > > Hi, > > I have a standard csv file (saved as csv in HDFS) that has first

nullable in spark-sql

2016-03-24 Thread Koert Kuipers
In spark 2, is nullable treated as reliable? or is it just a hint for efficient code generation, the optimizer etc. The reason i ask is i see a lot of code generated with if statements handling null for struct fields where nullable=false

Re: spark 1.6.0 connect to hive metastore

2016-03-23 Thread Koert Kuipers
with CDH 5.5.3. > Not only with Spark 1.6 but with beeline as well. > I resolved it via installation & running hiveserver2 role instance at the > same server wher metastore is. <http://metastore.mycompany.com:9083> > > On Tue, Feb 9, 2016 at 10:58 PM, Koert Kuipers <ko..

spark shuffle service on yarn

2016-03-18 Thread Koert Kuipers
spark on yarn is nice because i can bring my own spark. i am worried that the shuffle service forces me to use some "sanctioned" spark version that is officially "installed" on the cluster. so... can i safely install the spark 1.3 shuffle service on yarn and use it with other 1.x versions of

Re: YARN process with Spark

2016-03-11 Thread Koert Kuipers
you get a spark executor per yarn container. the spark executor can have multiple cores, yes. this is configurable. so the number of partitions that can be processed in parallel is num-executors * executor-cores. and for processing a partition the available memory is executor-memory /
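
A small worked example with made-up settings, expressed as the equivalent configuration:

    import org.apache.spark.SparkConf

    // hypothetical sizing: 10 executors x 4 cores = 40 tasks running in parallel,
    // and roughly 8g / 4 cores = 2g of executor heap per concurrently running task
    val conf = new SparkConf()
      .set("spark.executor.instances", "10")
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "8g")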

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
>> Hadoop glob pattern doesn't support multi level wildcard. >>> >>> Thanks >>> >>> On Mar 9, 2016, at 6:15 AM, Koert Kuipers <ko...@tresata.com> wrote: >>> >>> if its based on HadoopFsRelation shouldn't it support it? >>> HadoopFsRelatio

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
i use multi level wildcard with hadoop fs -ls, which is the exact same glob function call On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Hadoop glob pattern doesn't support multi level wildcard. > > Thanks > > On Mar 9, 2016, at 6:15 AM, Koert Kuipe

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
if its based on HadoopFsRelation shouldn't it support it? HadoopFsRelation handles globs On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote: > This is currently not supported. > > On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote: > > Hey, > > is something

Re: Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Koert Kuipers
we are not, but it seems reasonable to me that a user has the ability to implement their own serializer. can you refactor and break compatibility, but not make it private? On Mon, Mar 7, 2016 at 9:57 PM, Josh Rosen wrote: > Does anyone implement Spark's serializer

Re: AVRO vs Parquet

2016-03-03 Thread Koert Kuipers
well can you use orc without bringing in the kitchen sink of dependencies also known as hive? On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim wrote: > How about ORC? I have experimented briefly with Parquet and ORC, and I > liked the fact that ORC has its schema within the

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-03-02 Thread Koert Kuipers
worried that at some point the legacy memory management will be deprecated and then i am stuck with this performance issue. On Mon, Feb 29, 2016 at 12:47 PM, Koert Kuipers <ko...@tresata.com> wrote: > setting spark.shuffle.reduceLocality.enabled=false worked for me, thanks > >

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Koert Kuipers
ry spark.shuffle.reduceLocality.enabled=false >> This is an undocumented configuration. >> See: >> https://github.com/apache/spark/pull/8280 >> https://issues.apache.org/jira/browse/SPARK-10567 >> >> It solved the problem for me (both with and without memory legacy mode) >

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-28 Thread Koert Kuipers
same results. > > Still looking for resolution. > > Lior > > On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> looking at the cached rdd i see a similar story: >> with useLegacyMode = true the cached rdd is spread out across 10 &g

Re: Spark-avro issue in 1.5.2

2016-02-24 Thread Koert Kuipers
does your spark version come with batteries (hadoop included) or is it built with hadoop provided and you are adding hadoop binaries to classpath On Wed, Feb 24, 2016 at 3:08 PM, wrote: > I’m trying to save a data frame in Avro format but am getting the >

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
the SQL gets translated into a much better plan (perhaps thanks to some pushdown into ORC?), i dont see why it can be much faster. On Wed, Feb 24, 2016 at 2:59 PM, Koert Kuipers <ko...@tresata.com> wrote: > i am still missing something. if it is executed in the source database, > w

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
te: > >> That is incorrect HiveContext does not need a hive instance to run. >> On Feb 24, 2016 19:15, "Sabarish Sasidharan" < >> sabarish.sasidha...@manthan.com> wrote: >> >>> Yes >>> >>> Regards >>> Sab &g

Re: Using functional programming rather than SQL

2016-02-24 Thread Koert Kuipers
are you saying that HiveContext.sql(...) runs on hive, and not on spark sql? On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: > When using SQL your full query, including the joins, were executed in > Hive(or RDBMS) and only the results were brought

Re: Using functional programming rather than SQL

2016-02-23 Thread Koert Kuipers
instead of: var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM sales") you should be able to do something like: val s = HiveContext.table("sales").select("AMOUNT_SOLD", "TIME_ID", "CHANNEL_ID") its not obvious to me why the dataframe (aka FP) version would be significantly

Re: Using functional programming rather than SQL

2016-02-22 Thread Koert Kuipers
however to really enjoy functional programming i assume you also want to use lambda in your map and filter, which means you need to convert DataFrame to Dataset, using df.as[SomeCaseClass]. Just be aware that its somewhat early days for Dataset. On Mon, Feb 22, 2016 at 6:45 PM, Kevin Mellott
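
A minimal sketch of that conversion; the case class and column names are hypothetical and have to line up with the DataFrame schema:

    import sqlContext.implicits._

    // assumes a DataFrame df whose columns match the case class fields
    case class Sale(amountSold: Double, timeId: Long, channelId: Int)

    val ds  = df.as[Sale]                          // DataFrame -> Dataset[Sale]
    val big = ds.filter(_.amountSold > 100.0)      // plain Scala lambdas instead of Column expressions
               .map(s => (s.channelId, s.amountSold))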

Re: Serializing collections in Datasets

2016-02-22 Thread Koert Kuipers
it works in 2.0.0-SNAPSHOT On Mon, Feb 22, 2016 at 6:24 PM, Michael Armbrust wrote: > I think this will be fixed in 1.6.1. Can you test when we post the first > RC? (hopefully later today) > > On Mon, Feb 22, 2016 at 1:51 PM, Daniel Siegmann < >

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-18 Thread Koert Kuipers
partitioner, 50 partitions) before being cached. On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers <ko...@tresata.com> wrote: > hello all, > we are just testing a semi-realtime application (it should return results > in less than 20 seconds from cached RDDs) on spark 1.6.0. before this i

spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-18 Thread Koert Kuipers
hello all, we are just testing a semi-realtime application (it should return results in less than 20 seconds from cached RDDs) on spark 1.6.0. before this it used to run on spark 1.5.1. in spark 1.6.0 the performance is similar to 1.5.1 if i set spark.memory.useLegacyMode = true, however if i

Re: How to use a custom partitioner in a dataframe in Spark

2016-02-18 Thread Koert Kuipers
although it is not a bad idea to write data out partitioned, and then use a merge join when reading it back in, this currently isn't even easily doable with rdds because when you read an rdd from disk the partitioning info is lost. re-introducing a partitioner at that point causes a shuffle

Re: trouble using Aggregator with DataFrame

2016-02-17 Thread Koert Kuipers
at 2.0? > > On Wed, Feb 17, 2016 at 2:22 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> first of all i wanted to say that i am very happy to see >> org.apache.spark.sql.expressions.Aggregator, it is a neat api, especially >> when compared to the UDAF/AggregateFuncti

trouble using Aggregator with DataFrame

2016-02-17 Thread Koert Kuipers
first of all i wanted to say that i am very happy to see org.apache.spark.sql.expressions.Aggregator, it is a neat api, especially when compared to the UDAF/AggregateFunction stuff. its doc/comments says: A base class for user-defined aggregations, which can be used in [[DataFrame]] and

Re: GroupedDataset needs a mapValues

2016-02-14 Thread Koert Kuipers
something similar using an Aggregator > <https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html>, > but I agree that we should consider something lighter weight like the > mapValues you propose. > > On Sat, Feb 13, 2016 at 1:35 PM,

GroupedDataset needs a mapValues

2016-02-13 Thread Koert Kuipers
i have a Dataset[(K, V)] i would like to group by k and then reduce V using a function (V, V) => V how do i do this? i would expect something like: val ds = Dataset[(K, V)] ds.groupBy(_._1).mapValues(_._2).reduce(f) or better: ds.grouped.reduce(f) # grouped only works on Dataset[(_, _)] and i
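
A sketch of what the 1.6 API forces today, which is what the proposed mapValues would tidy up; assume ds: Dataset[(String, Int)] and f: (Int, Int) => Int:

    import sqlContext.implicits._

    // without mapValues the reduce runs over the full pairs, so the key is
    // carried through the reduce and stripped off again afterwards
    val reduced = ds.groupBy(_._1)
      .reduce((a, b) => (a._1, f(a._2, b._2)))   // Dataset[(String, (String, Int))]
      .map { case (_, kv) => kv }                // back to Dataset[(String, Int)]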

Re: coalesce and executor memory

2016-02-13 Thread Koert Kuipers
sorry i meant to say: and my way to deal with OOMs is almost always simply to increase number of partitions. maybe there is a better way that i am not aware of. On Sat, Feb 13, 2016 at 11:38 PM, Koert Kuipers <ko...@tresata.com> wrote: > thats right, its the reduce operation t

Re: coalesce and executor memory

2016-02-13 Thread Koert Kuipers
OOMs. and my to OOMs is almost always simply to increase number of partitions. maybe there is a better way that i am not aware of. On Sat, Feb 13, 2016 at 6:32 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > > On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers <ko...@tre

Re: GroupedDataset needs a mapValues

2016-02-13 Thread Koert Kuipers
you propose. > > On Sat, Feb 13, 2016 at 1:35 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> i have a Dataset[(K, V)] >> i would like to group by k and then reduce V using a function (V, V) => V >> how do i do this? >> >> i would expect so

GroupedDataset flatMapGroups with sorting (aka secondary sort redux)

2016-02-12 Thread Koert Kuipers
is there a way to leverage the shuffle in Dataset/GroupedDataset so that Iterator[V] in flatMapGroups has a well-defined ordering? it is hard for me to see many good use cases for flatMapGroups and mapGroups if you do not have sorting. since spark has a sort-based shuffle, not exposing this would be

Re: coalesce and executor memory

2016-02-12 Thread Koert Kuipers
in spark, every partition needs to fit in the memory available to the core processing it. as you coalesce you reduce number of partitions, increasing partition size. at some point the partition no longer fits in memory. On Fri, Feb 12, 2016 at 4:50 PM, Silvio Fiorito <
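
A small sketch of the trade-off; the partition counts are arbitrary:

    // coalesce shrinks the partition count without a shuffle, so each partition grows;
    // going the other way gives more, smaller partitions at the cost of a full shuffle
    val fewerBigger = rdd.coalesce(20)
    val moreSmaller = rdd.repartition(2000)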

Dataset GroupedDataset.reduce

2016-02-12 Thread Koert Kuipers
i see that currently GroupedDataset.reduce simply calls flatMapGroups. does this mean that there is currently no partial aggregation for reduce?

spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
has anyone successfully connected to hive metastore using spark 1.6.0? i am having no luck. worked fine with spark 1.5.1 for me. i am on cdh 5.5 and launching spark with yarn. this is what i see in logs: 16/02/09 14:49:12 INFO hive.metastore: Trying to connect to metastore with URI

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
ive-site.xml on your classpath. Can you check that, please? > > Thanks, Alex. > > On Tue, Feb 9, 2016 at 8:58 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> has anyone successfully connected to hive metastore using spark 1.6.0? i >> am having no luck. worked fine

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
Cheers, Alex. > > On Tue, Feb 9, 2016 at 9:39 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> hey thanks. hive-site is on classpath in conf directory >> >> i currently got it to work by changing this hive setting in hive-site.xml: >> hive.metastore.schema.veri

Re: spark 1.6.0 connect to hive metastore

2016-02-09 Thread Koert Kuipers
hose set too? > > On Feb 9, 2016, at 1:12 PM, Koert Kuipers <ko...@tresata.com> wrote: > > yes its not using derby i think: i can see the tables in my actual hive > metastore. > > i was using a symlink to /etc/hive/conf/hive-site.xml for my hive-site.xml > which has a lo

Re: How to use a register temp table inside mapPartitions of an RDD

2016-02-09 Thread Koert Kuipers
if you mean to both register and use the table while you are inside mapPartition, i do not think that is possible or advisable. can you join the data? or broadcast it? On Tue, Feb 9, 2016 at 8:22 PM, SRK wrote: > hi, > > How to use a registerTempTable to register an

Re: Apache Spark data locality when integrating with Kafka

2016-02-06 Thread Koert Kuipers
spark can benefit from data locality and will try to launch tasks on the node where the kafka partition resides. however i think in production many organizations run a dedicated kafka cluster. On Sat, Feb 6, 2016 at 11:27 PM, Diwakar Dhanuskodi < diwakar.dhanusk...@gmail.com> wrote: > Yes . To

Re: sc.textFile the number of the workers to parallelize

2016-02-04 Thread Koert Kuipers
increase minPartitions: sc.textFile(path, minPartitions = 9) On Thu, Feb 4, 2016 at 11:41 PM, Takeshi Yamamuro wrote: > Hi, > > ISTM these tasks are just assigned with executors in preferred nodes, so > how about repartitioning rdd? > > s3File.repartition(9).count > > On

make-distribution fails due to wrong order of modules

2016-02-02 Thread Koert Kuipers
i am seeing make-distribution fail because lib_managed does not exist. what seems to happen is that the sql/hive module gets built and creates this directory. but after this, sometime later, the spark-parent module gets built, which includes: [INFO] Building Spark Project Parent POM 1.6.0-SNAPSHOT [INFO]

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
well the "hadoop" way is to save to a/b and a/c and read from a/* :) On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam wrote: > Hi Spark users and developers, > > anyone knows how to union two RDDs without the overhead of it? > > say rdd1.union(rdd2).saveTextFile(..) > This
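
Sketched out with made-up paths:

    // write each RDD under a common parent directory...
    rdd1.saveAsTextFile("a/b")
    rdd2.saveAsTextFile("a/c")

    // ...and read the union back in one pass with a glob, no union step involved
    val all = sc.textFile("a/*")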

Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
i am surprised union introduces a stage. UnionRDD should have only narrow dependencies. On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote: > well the "hadoop" way is to save to a/b and a/c and read from a/* :) > > On Tue, Feb 2, 2016 at 11:

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
with respect to joins, unfortunately not all implementations are available. for example i would like to use joins where one side is streaming (and the other cached). this seems to be available for DataFrame but not for RDD. On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
> string constants that falls apart left and right. Writing sql is old > school. period. good luck making money though :) > > On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> To have a product databricks can charge for their sql engine n

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread Koert Kuipers
If you have yarn you can just launch your spark 1.6 job from a single machine with spark 1.6 available on it and ignore the version of spark (1.2) that is installed On Jan 27, 2016 11:29, "kali.tumm...@gmail.com" wrote: > Hi All, > > Just realized cloudera version of

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread Koert Kuipers
sion of spark ? or should I say > override the spark_home variables to look at 1.6 spark jar ? > > Thanks > Sri > > On Wed, Jan 27, 2016 at 7:45 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> If you have yarn you can just launch your spark 1.6 job from a

Re: Spark 2.0.0 release plan

2016-01-26 Thread Koert Kuipers
y or so instead informally in > conversation. Does anyone have a particularly strong opinion on that? > That's basically an extra 3 month period. > > https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage > > On Tue, Jan 26, 2016 at 10:00 PM, Koert Kuipers <ko...@tresata.com>

Spark 2.0.0 release plan

2016-01-26 Thread Koert Kuipers
Is the idea that spark 2.0 comes out roughly 3 months after 1.6? So quarterly release as usual? Thanks

Re: simultaneous actions

2016-01-18 Thread Koert Kuipers
>> either unrecognized or a greatly under-appreciated and underused feature of >> Spark. >> >> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers <ko...@tresata.com> >> wrote: >> >>> the re-use of shuffle files is always a nice surprise to me >>> >>

Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
ng the RDD. > > On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> Same rdd means same sparkcontext means same workers >> >> Cache/persist the rdd to avoid repeated jobs >> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <men

Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
..@gmail.com>: > >> I stand corrected. How considerable are the benefits though? Will the >> scheduler be able to dispatch jobs from both actions simultaneously (or on >> a when-workers-become-available basis)? >> >> On 15 January 2016 at 11:44, Koert Kuipers <

Re: simultaneous actions

2016-01-15 Thread Koert Kuipers
we run multiple actions on the same (cached) rdd all the time, i guess in different threads indeed (its in akka) On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia wrote: > RDDs actually are thread-safe, and quite a few applications use them this > way, e.g. the JDBC

rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
we are having a join of 2 rdds thats fast (< 1 min), and suddenly it wouldn't even finish overnight anymore. the change was that the rdd was now derived from a dataframe. so the new code that runs forever is something like this: dataframe.rdd.map(row => (Row(row(0)), row)).join(...) any idea

Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
these together; perhaps by registering the Dataframes as > temp tables and constructing a Spark SQL query. > > Also, which version of Spark are you using? > > On Tue, Jan 12, 2016 at 4:16 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> we are having a join

Re: Spark on Apache Ingnite?

2016-01-11 Thread Koert Kuipers
where is ignite's resilience/fault-tolerance design documented? i can not find it. i would generally stay away from it if fault-tolerance is an afterthought. On Mon, Jan 11, 2016 at 10:31 AM, RodrigoB wrote: > Although I haven't work explicitly with either, they do

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
rhel/centos 6 ships with python 2.6, doesnt it? if so, i still know plenty of large companies where python 2.6 is the only option. asking them for python 2.7 is not going to work so i think its a bad idea On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland wrote: > I

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
>> >> I've been in a couple of projects using Spark (banking industry) where >> CentOS + Python 2.6 is the toolbox available. >> >> That said, I believe it should not be a concern for Spark. Python 2.6 is >> old and busted, which is totally opposite to the Spark ph

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
access). Does this address the Python versioning concerns for RHEL users? > > On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com> wrote: > >> yeah, the practical concern is that we have no control over java or >> python version on large company clusters. our curr

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
e, Jan 5, 2016 at 3:05 PM, Nicholas Chammas < >> nicholas.cham...@gmail.com> wrote: >> >>> I think all the slaves need the same (or a compatible) version of Python >>> installed since they run Python code in PySpark jobs natively. >>> >>> On Tue, Jan

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
d > version without making your changes open source. The GPL-compatible > licenses make it possible to combine Python with other software that is > released under the GPL; the others don’t. > > Nick > ​ > > On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app (does it?) then that could be important indeed. On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers <ko...@tresata.com> wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas &l

Re: Large number of conf broadcasts

2015-12-17 Thread Koert Kuipers
our patch part of a pull request from the master branch in github? > > Thanks, > Prasad. > > From: Anders Arpteg > Date: Thursday, October 22, 2015 at 10:37 AM > To: Koert Kuipers > Cc: user > Subject: Re: Large number of conf broadcasts > > Yes, seems unnecessary.

Re: Preventing an RDD from shuffling

2015-12-16 Thread Koert Kuipers
a join needs a partitioner, and will shuffle the data as needed for the given partitioner (or if the data is already partitioned then it will leave it alone), after which it will process with something like a map-side join. if you can specify a partitioner that meets the exact layout of your data
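
A small sketch of lining the partitioners up so the join can reuse them; assumes leftRdd and rightRdd are pair RDDs, and the partition count is arbitrary:

    import org.apache.spark.HashPartitioner

    // both sides get the same partitioner, so the join sees matching
    // partitioners and does not shuffle either side again
    val part   = new HashPartitioner(100)
    val left   = leftRdd.partitionBy(part).cache()
    val right  = rightRdd.partitionBy(part).cache()
    val joined = left.join(right)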

Re: Dataset and lambas

2015-12-07 Thread Koert Kuipers
great thanks On Mon, Dec 7, 2015 at 3:02 PM, Michael Armbrust <mich...@databricks.com> wrote: > These specific JIRAs don't exist yet, but watch SPARK- as we'll make > sure everything shows up there. > > On Sun, Dec 6, 2015 at 10:06 AM, Koert Kuipers <ko...@tresata.co

Re: Dataset and lambas

2015-12-06 Thread Koert Kuipers
ich...@databricks.com> wrote: > On Sat, Dec 5, 2015 at 9:42 AM, Koert Kuipers <ko...@tresata.com> wrote: > >> hello all, >> DataFrame internally uses a different encoding for values then what the >> user sees. i assume the same is true for Dataset? >> > > This is

Dataset and lambas

2015-12-05 Thread Koert Kuipers
hello all, DataFrame internally uses a different encoding for values than what the user sees. i assume the same is true for Dataset? if so, does this mean that a function like Dataset.map needs to convert all the values twice (once to user format and then back to internal format)? or is it

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-24 Thread Koert Kuipers
what is your hdfs replication set to? On Wed, Nov 25, 2015 at 1:31 AM, AlexG wrote: > I downloaded a 3.8 T dataset from S3 to a freshly launched spark-ec2 > cluster > with 16.73 Tb storage, using > distcp. The dataset is a collection of tar files of about 1.7 Tb each. >

Re: NullPointerException with joda time

2015-11-12 Thread Koert Kuipers
i remember us having issues with joda classes not serializing properly and coming out null "on the other side" in tasks On Thu, Nov 12, 2015 at 10:12 AM, Ted Yu wrote: > Even if log4j didn't work, you can still get some clue by wrapping the > following call with try block:

Re: Slow stage?

2015-11-11 Thread Koert Kuipers
i am a person that usually hates UIs, and i have to say i love these. very useful On Wed, Nov 11, 2015 at 3:23 PM, Mark Hamstra wrote: > Those are from the Application Web UI -- look for the "DAG Visualization" > and "Event Timeline" elements on Job and Stage pages. > >

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
> Sent from my iPhone > > On 01 Nov 2015, at 21:03, Koert Kuipers <ko...@tresata.com> wrote: > > hello all, > i am trying to get familiar with spark sql partitioning support. > > my data is partitioned by date, so like this: > data/date=2015-01-01 > data/date=201

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
it seems pretty fast, but if i have 2 partitions and 10mm records i do have to dedupe (distinct) 10mm records. a direct way to just find out what the 2 partitions are would be much faster. spark knows it, but its not exposed. On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers <ko...@tresata.com>

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
if it requires scanning the whole data by > "explain" the query. The physical plan should say something about it. I > wonder if you are trying the distinct-sort-by-limit approach or the > max-date approach? > > Best Regards, > > Jerry > > > On Sun, Nov 1, 2

spark sql partitioned by date... read last date

2015-11-01 Thread Koert Kuipers
hello all, i am trying to get familiar with spark sql partitioning support. my data is partitioned by date, so like this: data/date=2015-01-01 data/date=2015-01-02 data/date=2015-01-03 ... lets say i would like a batch process to read data for the latest date only. how do i proceed? generally
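
One of the approaches mentioned in the replies above (the max-date approach), sketched with a made-up path and assuming parquet; as noted in this thread it still scans the date column across all partitions rather than just listing them:

    import org.apache.spark.sql.functions.{col, max}

    val df      = sqlContext.read.parquet("data")             // partition column "date" is discovered
    val latest  = df.agg(max("date")).first().getString(0)    // assumes the partition value reads back as a string
    val lastDay = df.filter(col("date") === latest)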

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Koert Kuipers
it seems HadoopFsRelation keeps track of all part files (instead of just the data directories). i believe this has something to do with parquet footers but i didnt bother to look more into it. but yet the result is that driver side it: 1) tries to keep track of all part files in a Map[Path,

Re: question about HadoopFsRelation

2015-10-25 Thread Koert Kuipers
thanks i will read up on that On Sat, Oct 24, 2015 at 12:53 PM, Ted Yu <yuzhih...@gmail.com> wrote: > The code below was introduced by SPARK-7673 / PR #6225 > > See item #1 in the description of the PR. > > Cheers > > On Sat, Oct 24, 2015 at 12:59 AM, Koert Kuiper

Re: question about HadoopFsRelation

2015-10-24 Thread Koert Kuipers
in directories (to avoid the overhead and very large serialized jobconfs)? On Sat, Oct 24, 2015 at 12:23 AM, Koert Kuipers <ko...@tresata.com> wrote: > i noticed in the comments for HadoopFsRelation.buildScan it says: > * @param inputFiles For a non-partitioned relation, it contains paths o

Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
Anders > > > On Thu, Oct 22, 2015 at 7:03 PM Koert Kuipers <ko...@tresata.com> wrote: > >> i am seeing the same thing. its gona completely crazy creating broadcasts >> for the last 15 mins or so. killing it... >> >> On Thu, Sep 24, 2015 at 1:24 PM, Anders A

Re: Large number of conf broadcasts

2015-10-23 Thread Koert Kuipers
https://github.com/databricks/spark-avro/pull/95 On Fri, Oct 23, 2015 at 5:01 AM, Koert Kuipers <ko...@tresata.com> wrote: > oh no wonder... it undoes the glob (i was reading from /some/path/*), > creates a hadoopRdd for every path, and then creates a union of them using > Unio

question about HadoopFsRelation

2015-10-23 Thread Koert Kuipers
i noticed in the comments for HadoopFsRelation.buildScan it says: @param inputFiles For a non-partitioned relation, it contains paths of all data files in the relation. For a partitioned relation, it contains paths of all data files in a single selected partition. do i

Re: Large number of conf broadcasts

2015-10-22 Thread Koert Kuipers
i am seeing the same thing. its gone completely crazy creating broadcasts for the last 15 mins or so. killing it... On Thu, Sep 24, 2015 at 1:24 PM, Anders Arpteg wrote: > Hi, > > Running spark 1.5.0 in yarn-client mode, and am curios in why there are so > many broadcast

Re: Secondary Sorting in Spark

2015-10-04 Thread Koert Kuipers
See also https://github.com/tresata/spark-sorted On Oct 5, 2015 3:41 AM, "Bill Bejeck" wrote: > I've written blog post on secondary sorting in Spark and I'd thought I'd > share it with the group > > http://codingjunkie.net/spark-secondary-sort/ > > Thanks, > Bill >

Re: in joins, does one side stream?

2015-09-20 Thread Koert Kuipers
thought RDD also opens only an >>> iterator. Does it get materialized for joins? >>> >>> Rishi >>> >>> On Saturday, September 19, 2015, Reynold Xin <r...@databricks.com> >>> wrote: >>> >>>> Yes for RDD -- both are materia

Re: in joins, does one side stream?

2015-09-20 Thread Koert Kuipers
sorry that was a typo. i meant to say: why do we have these features (broadcast join and sort-merge join) in DataFrame but not in RDD? they don't seem specific to structured data analysis to me. thanks! koert On Sun, Sep 20, 2015 at 2:46 PM, Koert Kuipers <ko...@tresata.com> wrote: >
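
The DataFrame-side broadcast join referred to above, as a small sketch with made-up table and key names:

    import org.apache.spark.sql.functions.broadcast

    // hint that the small side should be broadcast; the large side then streams
    // through a map-side hash join instead of being shuffled
    val joined = bigDf.join(broadcast(smallDf), "key")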

in joins, does one side stream?

2015-09-17 Thread Koert Kuipers
in scalding we join with the smaller side on the left, since the smaller side will get buffered while the bigger side streams through the join. looking at CoGroupedRDD i do not get the impression such a distinction is made. it seems both sides are put into a map that can spill to disk. is this

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Koert Kuipers
other, most likely many of these new streaming logic > containers will also be obsolete in the next few years. > Best regards, > Tom > > ------ > *From:* Koert Kuipers <ko...@tresata.com> > *To:* Bertrand Dechoux <decho...@gmail.com> >

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Koert Kuipers
obsolete is not the same as dead... we have a few very large tech companies to prove that point On Tue, Sep 15, 2015 at 4:32 PM, Bertrand Dechoux wrote: > The big question would be what feature of Esper your are using. Esper is a > CEP solution. I doubt that Spark Streaming
