Re: Process for backports?

2018-04-24 Thread Reynold Xin
1. We don't backport features. 2. In general we don't bump dependencies, unless they are for critical bug fixes. 3. We weigh the risk of new regressions vs. the benefit of the bug fix. To state the obvious, we wouldn't backport a bug fix if it only affects a very small number of use cases but requires very complex

Process for backports?

2018-04-24 Thread Cody Koeninger
https://issues.apache.org/jira/browse/SPARK-24067 is asking to backport a change to the 2.3 branch. My questions: - In general, are there any concerns about what qualifies for backporting? This change adds a configuration variable but shouldn't change default behavior. - Is a separate JIRA + PR

Re: [MLLib] Logistic Regression and standardization

2018-04-24 Thread DB Tsai
As I’m one of the original authors, let me chime in with some comments. Without standardization, LBFGS will be unstable. For example, if a feature is multiplied by 10, then the corresponding coefficient should be divided by 10 to make the same prediction. But without standardization, LBFGS will
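
A worked sketch of the scale-invariance argument DB Tsai is describing (notation is mine, not from the thread): rescaling a feature by 10 forces the fitted coefficient to shrink by 10 if the linear predictor, and hence the prediction, is to stay the same.

    % Illustrative only: scale invariance of the linear predictor
    z = w^\top x = \sum_j w_j x_j, \qquad
    \tilde{x}_j = 10\,x_j \;\Rightarrow\; \tilde{w}_j = w_j / 10, \qquad
    \tilde{w}_j \tilde{x}_j = (w_j/10)(10\,x_j) = w_j x_j .

So the loss is unchanged under this rescaling, but the optimizer sees very differently scaled coordinates, which is where the instability without standardization comes from.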

Re: Transform plan with scope

2018-04-24 Thread Marco Gaido
Hi Joseph, Herman, thanks for your answers. The specific rule I was looking at is FoldablePropagation. If you look at it, what is done is that first an AttributeMap containing all the possible foldable aliases is collected, then they are replaced throughout the whole plan (it is a bit more complex than this,
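
For readers not familiar with the rule, a minimal illustration of the kind of plan FoldablePropagation rewrites; the table name `events` and column names below are hypothetical:

    // Illustrative only: `events` is a hypothetical registered view.
    // The alias `one` is bound to a foldable expression (a literal), so later
    // references to `one` (here, in the ORDER BY) can be replaced with the
    // literal 1, after which the sort can be simplified away.
    val df = spark.sql("SELECT 1 AS one, value FROM events ORDER BY one")
    df.explain(true)  // compare the analyzed and optimized plans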

Re: Sorting on a streaming dataframe

2018-04-24 Thread Chayapan Khannabha
Perhaps your use case fits Apache Kafka better. More info at: https://kafka.apache.org/documentation/streams/ Everything really comes down to the architecture design and algorithm spec. However, from my experience with Spark, there are many

Transform plan with scope

2018-04-24 Thread Marco Gaido
Hi all, while working on SPARK-24051 I realized that currently, in the Optimizer and in all the places where we transform a query plan, we lack the context information of what is in scope and what is not. Coming back to the ticket, the bug reported there is caused mainly by two

Re: Sorting on a streaming dataframe

2018-04-24 Thread Arun Mahadevan
I guess sorting would make sense only when you have the complete data set. In streaming you don’t know what record is coming next, so it doesn’t make sense to sort (except in the aggregated complete output mode, where the entire result table is emitted each time and the results can be sorted).
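
A minimal sketch of the one case Arun describes where sorting a streaming result is allowed: an aggregation written in complete output mode. The socket source, host, and port below are illustrative placeholders, not from the thread.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("sorted-complete-mode").getOrCreate()
    import spark.implicits._

    // Illustrative socket source; any streaming source behaves the same way here.
    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()

    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value").count()
      .orderBy(desc("count"))   // sorting is permitted here because the full
                                // result table is re-emitted on every trigger
    val query = counts.writeStream
      .outputMode("complete")   // required for orderBy on a streaming DataFrame
      .format("console")
      .start()
    query.awaitTermination()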

Re: Block Missing Exception while connecting Spark with HDP

2018-04-24 Thread Marco Gaido
Hi Jasbir, as a first note, if you are using a vendor distribution, please contact the vendor for any issue you are facing. This mailing list is for the community, so we focus on the community edition of Spark. Anyway, the error seems to be quite clear: your file on HDFS has a missing

Block Missing Exception while connecting Spark with HDP

2018-04-24 Thread Sing, Jasbir
I am using HDP 2.6.3 and 2.6.4 with the code below: 1. Creating a SparkContext object 2. Reading a text file using rdd = sc.textFile("hdfs://192.168.142.129:8020/abc/test1.txt") 3. println(rdd.count) After executing the 3rd line I am getting the error below: Caused by:
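
For reference, a self-contained version of the three steps above (the HDFS URI is the one from the original report; the app name is a placeholder, and you can reuse an existing SparkContext instead of creating one):

    import org.apache.spark.{SparkConf, SparkContext}

    // 1. Create the SparkContext (master is expected to come from spark-submit)
    val conf = new SparkConf().setAppName("hdfs-read-test")
    val sc = new SparkContext(conf)

    // 2. Read the text file from HDFS (URI taken from the original report)
    val rdd = sc.textFile("hdfs://192.168.142.129:8020/abc/test1.txt")

    // 3. Count the lines; this action is what actually reads the HDFS blocks,
    //    so a missing/corrupt block surfaces here as the reported exception
    println(rdd.count())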

Re: Sorting on a streaming dataframe

2018-04-24 Thread Hemant Bhanawat
Thanks Chris. There are many ways in which I can solve this problem, but they are cumbersome. The easiest way would have been to sort the streaming dataframe. The reason I asked this question is that I could not find a reason why sorting on a streaming dataframe is disallowed. Hemant