Re: pull request template

2016-03-15 Thread Marcelo Vanzin
Nobody has suggested removing the template. On Tue, Mar 15, 2016 at 3:59 PM, Joseph Bradley wrote: > +1 for keeping the template > > I figure any template will require conscientiousness & enforcement. > > On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen

Re: spark 2.0 logging binary incompatibility

2016-03-15 Thread Koert Kuipers
Makes sense. Note that Logging was not private[spark] in 1.x, which is why I used it. On Tue, Mar 15, 2016 at 12:55 PM, Marcelo Vanzin wrote: > Logging is a "private[spark]" class so binary compatibility is not > important at all, because code outside of Spark isn't

Re: spark 2.0 logging binary incompatibility

2016-03-15 Thread Koert Kuipers
Oh, I just noticed the big warning in the Spark 1.x Logging scaladoc: "NOTE: DO NOT USE this class outside of Spark. It is intended as an internal utility. This will likely be changed or removed in future releases." On Tue, Mar 15, 2016 at 3:29 PM, Koert Kuipers wrote: > makes

Re: spark 2.0 logging binary incompatibility

2016-03-15 Thread Reynold Xin
Yea, we are going to tighten a lot of classes' visibility. A lot of APIs were made experimental, developer, or public for no good reason in the past. Many of them (not Logging in this case) are tied to the internal implementation of Spark at a specific time, and no longer make sense given the

Re: pull request template

2016-03-15 Thread Joseph Bradley
+1 for keeping the template I figure any template will require conscientiousness & enforcement. On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen wrote: > The template is a great thing as it gets instructions even more right > in front of people. > > Another idea is to just write

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Akhil Das
You can achieve this the normal RDD way: have one extra stage in the pipeline where you properly standardize all the values (like replacing doc with doctor) for all the columns before the join. Thanks Best Regards On Tue, Mar 15, 2016 at 9:16 AM, Suniti Singh
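A minimal sketch of that extra standardization stage, in the RDD style Akhil describes. The record layouts, the replacement map and the standardize helper are all made up for illustration; only the shape of the pipeline (standardize, re-key, then join) is the point:

    import org.apache.spark.{SparkConf, SparkContext}

    object StandardizeBeforeJoin {
      // hypothetical mapping from observed variants to a canonical title
      val canonical = Map("doc" -> "doctor")

      // crude normalization: lower-case and strip punctuation before the lookup
      def standardize(title: String): String = {
        val key = title.toLowerCase.replaceAll("[^a-z]", "")
        canonical.getOrElse(key, key)
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("standardize-before-join").setMaster("local[*]"))

        // stand-ins for the two tables: (title, department) and (title, location)
        val left  = sc.parallelize(Seq(("D.O.C", "er"), ("Doctor", "icu")))
        val right = sc.parallelize(Seq(("doc", "north"), ("doctor", "south")))

        // the extra stage: key both sides by the standardized title, then join as usual
        val joined = left.map { case (t, d) => (standardize(t), d) }
                         .join(right.map { case (t, l) => (standardize(t), l) })

        joined.collect().foreach(println)
        sc.stop()
      }
    }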

SparkConf constructor now private

2016-03-15 Thread Koert Kuipers
In this commit (8301fadd8d269da11e72870b7a889596e3337839, Author: Marcelo Vanzin, Date: Mon Mar 14 14:27:33 2016 -0700, "[SPARK-13626][CORE] Avoid duplicate config deprecation warnings."), the following change was made: -class SparkConf(loadDefaults: Boolean) extends Cloneable
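For reference, this is the kind of call site outside Spark that stops working if the (loadDefaults: Boolean) constructor is no longer public; a common pattern in library test code that wants a conf without picking up spark.* system properties (the setAppName/setMaster values here are just placeholders):

    import org.apache.spark.SparkConf

    // new SparkConf(false) skips loading spark.* system properties; with the
    // constructor private[spark], this can no longer be written outside Spark itself
    val conf = new SparkConf(false).setAppName("my-lib-test").setMaster("local[*]")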

spark 2.0 logging binary incompatibility

2016-03-15 Thread Koert Kuipers
I have been using spark 2.0 snapshots with some libraries built for spark 1.0 so far (simply because it worked). In the last few days I noticed this new error: [error] Uncaught exception when running com.tresata.spark.sql.fieldsapi.FieldsApiSpec: java.lang.AbstractMethodError sbt.ForkMain$ForkError:

Re: DynamicPartitionKafkaRDD - 1:n mapping between kafka and RDD partition

2016-03-15 Thread Cody Koeninger
No, I don't agree that someone explicitly calling repartition or shuffle is the same as a constructor that implicitly breaks guarantees. Realistically speaking, the changes you have made are also totally incompatible with the way kafka's new consumer works. Pulling different out-of-order chunks

Release Announcement: XGBoost4J - Portable Distributed XGBoost in Spark, Flink and Dataflow

2016-03-15 Thread Nan Zhu
Dear Spark Users and Developers, We (Distributed (Deep) Machine Learning Community (http://dmlc.ml/)) are happy to announce the release of XGBoost4J (http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html), a Portable Distributed XGBoost in Spark,

question about catalyst and TreeNode

2016-03-15 Thread Koert Kuipers
I am trying to understand some parts of the Catalyst optimizer, but I struggle with one bigger-picture issue: LogicalPlan extends TreeNode, which makes sense since the optimizations rely on tree transformations like transformUp and transformDown. But how can a LogicalPlan be a tree? Isn't it
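A rough, spark-shell-style sketch of what transformUp looks like from the outside (not Spark's own optimizer code); the local SQLContext setup and the rule that strips Filter nodes are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}

    val sc = new SparkContext(new SparkConf().setAppName("treenode-demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name").filter("id > 1")
    val plan: LogicalPlan = df.queryExecution.analyzed

    // transformUp visits children before parents; subtrees the rule leaves alone are
    // reused, and only changed nodes are copied, so the result is a new immutable tree
    val rewritten = plan transformUp {
      case f: Filter => f.child   // hypothetical rule: drop every Filter node
    }
    println(rewritten.treeString)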

Re: SparkConf constructor now private

2016-03-15 Thread Marcelo Vanzin
Oh, my bad. I think I left that from a previous part of the patch and forgot to revert it. Will fix. On Tue, Mar 15, 2016 at 7:37 AM, Koert Kuipers wrote: > in this commit > > 8301fadd8d269da11e72870b7a889596e3337839 > Author: Marcelo Vanzin > Date: Mon

Re: spark 2.0 logging binary incompatibility

2016-03-15 Thread Marcelo Vanzin
Logging is a "private[spark]" class, so binary compatibility is not important at all, because code outside of Spark isn't supposed to use it. Mixing Spark library versions is also not recommended, and not just for this reason. There have been other binary changes in the Logging class in the
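For libraries that were mixing in Spark's Logging, one way to stay off Spark internals is a small logging trait of your own over slf4j (which Spark itself logs through anyway); a minimal sketch, with the class name made up:

    import org.slf4j.{Logger, LoggerFactory}

    trait LibLogging {
      // lazy so the logger is only created on first use; @transient so it is not
      // dragged along when the enclosing object gets serialized
      @transient protected lazy val log: Logger = LoggerFactory.getLogger(getClass)
    }

    class FieldsApi extends LibLogging {   // hypothetical library class
      def run(): Unit = log.info("logging without touching org.apache.spark.Logging")
    }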

Re: SparkConf constructor now private

2016-03-15 Thread Pete Robbins
Is the SparkConf effectively a singleton? Could there be a Utils method to return a clone of the SparkConf? Cheers On Tue, 15 Mar 2016 at 16:49 Marcelo Vanzin wrote: > Oh, my bad. I think I left that from a previous part of the patch and > forgot to revert it. Will fix. >
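SparkConf is Cloneable and already exposes a public clone(), so a defensive copy can be taken today without a new Utils helper; a quick sketch (the spark.foo key is made up):

    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("demo").set("spark.foo", "bar")
    val copy = conf.clone().set("spark.foo", "baz")  // mutate the copy only
    println(conf.get("spark.foo"))                   // still "bar": the original is untouched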

Re: Contributing to managed memory, Tungsten..

2016-03-15 Thread Jan Kotek
Hi Amit, I am slowly getting into it, so I will be in contact in a few weeks. Jan On Friday, March 11, 2016 09:22:27 Amit Chavan wrote: Hi Jan, Welcome to the group. I have used mapdb on some personal projects and really enjoyed working with it. I am also willing to contribute to the spark

Re: Various forks

2016-03-15 Thread Sean Owen
Picking up this old thread, since we have the same problem updating to Scala 2.11.8 https://github.com/apache/spark/pull/11681#issuecomment-196932777 We can see the org.spark-project packages here: http://search.maven.org/#search%7Cga%7C1%7Cg%3A%22org.spark-project%22 I've forgotten who

Re: Various forks

2016-03-15 Thread Reynold Xin
+Xiangrui On Tue, Mar 15, 2016 at 10:24 AM, Sean Owen wrote: > Picking up this old thread, since we have the same problem updating to > Scala 2.11.8 > > https://github.com/apache/spark/pull/11681#issuecomment-196932777 > > We can see the org.spark-project packages here: > >

Re: question about catalyst and TreeNode

2016-03-15 Thread Michael Armbrust
Trees are immutable, and TreeNode takes care of copying unchanged parts of the tree when you are doing transformations. As a result, even if you do construct a DAG with the Dataset API, the first transformation will turn it back into a tree. The only exception to this rule is when we share the
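A quick way to see the "DAG turns back into a tree" point, reusing the local SQLContext from the earlier Catalyst sketch above; the same DataFrame referenced on both sides of a join shows up as two separate child subtrees of the Join node in the analyzed plan:

    // one DataFrame object, referenced twice at the API level (a DAG shape)
    val base = Seq((1, "a"), (2, "b")).toDF("id", "name")
    val selfJoin = base.join(base, "id")

    // the analyzed logical plan prints the base plan under the Join twice,
    // i.e. the plan is handled as a tree rather than a shared-node DAG
    println(selfJoin.queryExecution.analyzed.treeString)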

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
Is it always the case that one title is a substring of another? -- Not always. One title can have values like D.O.C, doctor_{areacode}, doc_{dep,areacode}. On Mon, Mar 14, 2016 at 10:39 PM, Wail Alkowaileet wrote: > I think you need some sort of fuzzy join? > Is it always

Re: Compare a column in two different tables/find the distance between column data

2016-03-15 Thread Suniti Singh
The data in the title is different, so correcting the data in the column requires finding out what the correct data is and then replacing it. Finding the correct data could be tedious, but if some mechanism is in place which can help group the partially matched data, then it might help to do the
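One possible mechanism for grouping partially matched titles (a sketch of an approach, not something proposed in the thread): normalize the strings, then compare them with Spark SQL's levenshtein function and keep pairs under a small edit-distance threshold. The table contents, column names and the threshold of 3 are all made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Column, SQLContext}
    import org.apache.spark.sql.functions.{col, levenshtein, lower, regexp_replace}

    val sc = new SparkContext(new SparkConf().setAppName("fuzzy-titles").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // stand-ins for the two tables being compared
    val df1 = Seq(("D.O.C", 1), ("nurse", 2)).toDF("title", "dept")
    val df2 = Seq(("doctor_510", 10), ("nurse", 20)).toDF("title", "loc")

    // lower-case and strip everything that is not a letter (dots, underscores, area codes)
    def norm(c: Column): Column = lower(regexp_replace(c, "[^a-zA-Z]", ""))

    // keep pairs whose normalized titles are within a small edit distance
    val matches = df1.as("a").join(df2.as("b"),
      levenshtein(norm(col("a.title")), norm(col("b.title"))) <= 3)
    matches.show()

Note that a pure distance condition makes this effectively a cross join, so it only scales if at least one side is small or the candidate pairs are pre-filtered in some way.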