Evolutionary algorithm (EA) in Spark

2016-11-02 Thread Chris Lin
Hi All, I would like to know if there is any plan to implement evolutionary algorithms in Spark ML, such as particle swarm optimization, genetic algorithms, ant colony optimization, etc. If someone is working on this in Spark or has already done so, I would like to contribute to it and get

[VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-02 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.6.3 [ ] -1 Do not release this package because ... The

Re: [VOTE] Release Apache Spark 1.6.3 (RC1)

2016-11-02 Thread Reynold Xin
This vote is cancelled and I'm sending out a new vote for rc2 now. On Mon, Oct 17, 2016 at 5:18 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.3. The vote is open until Thursday, Oct 20, 2016 at 18:00 PDT and >

Blocked PySpark changes

2016-11-02 Thread Holden Karau
Hi Spark Developers & Maintainers, I know we've been talking a lot about what changes we want in PySpark to help keep it interesting and usable (see http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html).

Re: Question about using collaborative filtering in MLlib

2016-11-02 Thread Yuhao Yang
Hi Zak, Indeed the function is missing in the DataFrame-based API. I can probably provide a quick prototype to see if we can merge the function into the next release. I will send an update here; feel free to give it a try. Regards, Yuhao 2016-11-01 10:00 GMT-07:00 Zak H

Re: Structured streaming aggregation - update mode

2016-11-02 Thread Michael Armbrust
Yeah, agreed. As mentioned here, it's near the top of my list. I just opened SPARK-18234 to track. On Wed, Nov 2, 2016 at 3:24 PM, Cristian Opris wrote: > Hi, > > I've

Structured streaming aggregation - update mode

2016-11-02 Thread Cristian Opris
Hi, I've been looking at planned JIRAs for this, but can't find anything. Is this something that may be added soon? It's not clear to me how aggregation can realistically be used in a production scenario without this. Thanks, Cristian

Re: Updating Parquet dep to 1.9

2016-11-02 Thread Ryan Blue
The stats problem is on the write side. Parquet compares byte buffers (used for UTF8 strings also) using byte-wise comparison, but got it wrong and compares the Java byte values, which are signed. UTF8 ordering is the same as byte-wise comparison, but only if the bytes are compared as unsigned
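
To illustrate the signed versus unsigned distinction described in this message, here is a minimal Python sketch. It is not Parquet's code: the helper names (to_signed, compare_unsigned, compare_signed) are hypothetical and only mimic how Java's signed byte type changes the result of a byte-wise comparison.

```python
def to_signed(b: int) -> int:
    """Reinterpret an unsigned byte value (0..255) as a Java-style signed byte (-128..127)."""
    return b - 256 if b > 127 else b


def compare_unsigned(a: bytes, b: bytes) -> int:
    """Lexicographic comparison treating each byte as 0..255 (returns -1, 0, or 1)."""
    return (a > b) - (a < b)


def compare_signed(a: bytes, b: bytes) -> int:
    """Lexicographic comparison treating each byte as -128..127, as Java's byte type does."""
    sa = [to_signed(x) for x in a]
    sb = [to_signed(x) for x in b]
    return (sa > sb) - (sa < sb)


# 'z' is U+007A and encodes to one byte (0x7A); 'é' is U+00E9 and encodes to 0xC3 0xA9.
ascii_z = "z".encode("utf-8")
e_acute = "é".encode("utf-8")

# Unsigned comparison agrees with code point order: 'z' sorts before 'é'.
unsigned_result = compare_unsigned(ascii_z, e_acute)

# Signed comparison sees 0xC3 as -61, so 'é' wrongly sorts before 'z'.
signed_result = compare_signed(ascii_z, e_acute)

print(unsigned_result, signed_result)
```

With a signed comparison, any value whose leading byte is 0x80 or above sorts below plain ASCII, so min/max statistics computed this way can be wrong for non-ASCII strings.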

BiMap BroadCast Variable - Kryo Serialization Issue

2016-11-02 Thread Kalpana Jalawadi
Hi, I am getting a NullPointerException due to a Kryo serialization issue while trying to read a BiMap broadcast variable. Attached are the code snippets. Pointers shared here didn't help - link1, link2

Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Prajwal Tuladhar
Some messages from Apache mailing lists (Spark and ZK) were being marked as spam by Gmail. After manually unmarking them as spam a few times, it seems to have worked for me. On Wed, Nov 2, 2016 at 5:29 PM, Russell Spitzer wrote: > I had one bounce message last week, but

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Adam Roberts
I'm seeing the same failure, but manifesting itself as a stack overflow, on various operating systems and architectures (RHEL 7.1, CentOS 7.2, SUSE 12, Ubuntu 14.04 and 16.04 LTS). Build and test options: mvn -T 1C -Psparkr -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests clean package mvn

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Reynold Xin
Looks like there is an issue with Maven (likely just the test itself though). We should look into it. On Wed, Nov 2, 2016 at 11:32 AM, Dongjoon Hyun wrote: > Hi, Sean. > > The same failure blocks me, too. > > - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Dongjoon Hyun
Hi, Sean. The same failure blocks me, too. - SPARK-18189: Fix serialization issue in KeyValueGroupedDataset *** FAILED *** I used `-Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver -Dsparkr` on CentOS 7 / OpenJDK1.8.0_111. Dongjoon. On 2016-11-02 10:44 (-0700), Sean Owen

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-02 Thread Sean Owen
Sigs, license, etc are OK. There are no Blockers for 2.0.2, though here are the 4 issues still open: SPARK-14387 Enable Hive-1.x ORC compatibility with spark.sql.hive.convertMetastoreOrc SPARK-17957 Calling outer join and na.fill(0) and then inner join will miss rows SPARK-17981 Incorrectly Set

Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Russell Spitzer
I had one bounce message last week, but haven't seen anything else. I also do the skip-inbox filter thing, though. On Wed, Nov 2, 2016 at 10:16 AM Matei Zaharia wrote: > It might be useful to ask Apache Infra whether they have any information > on these (e.g. what do

Re: Handling questions in the mailing lists

2016-11-02 Thread Reynold Xin
Actually after talking with more ASF members, I believe the only policy is that development decisions have to be made and announced on ASF properties (dev list or jira), but user questions don't have to. I'm going to double check this. If it is true, I would actually recommend us moving entirely

Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Matei Zaharia
It might be useful to ask Apache Infra whether they have any information on these (e.g. what do their own spam metrics say, do they get any feedback from Google, etc). Unfortunately mailing lists seem to be less and less well supported by most email providers. Matei > On Nov 2, 2016, at 6:48

Re: Handling questions in the mailing lists

2016-11-02 Thread Nicholas Chammas
We’ve discussed several times upgrading our communication tools, as far back as 2014 and maybe even before that too. The bottom line is that we can’t due to ASF rules requiring the use of ASF-managed mailing lists. For some history, see this discussion: -

Re: Handling questions in the mailing lists

2016-11-02 Thread Ricardo Almeida
I feel Assaf's point is quite relevant if we want to move this project forward from the Spark user perspective (as I do). In fact, we're still using 20th century tools (mailing lists) with some add-ons (like Stack Overflow). As usual, Sean and Cody's contributions are very to the point. I feel it

Re: Handling questions in the mailing lists

2016-11-02 Thread Cody Koeninger
So concrete things people could do - users could tag subject lines appropriately to the component they're asking about - contributors could monitor user@ for tags relating to components they've worked on. I'd be surprised if my miss rate for any mailing list questions well-labeled as Kafka was

Re: Updating Parquet dep to 1.9

2016-11-02 Thread Michael Allman
Sounds great. Regarding the min/max stats issue, is that an issue with the way the files are written or read? What's the Parquet project issue for that bug? What's the 1.9.1 release timeline look like? I will aim to have a PR in by the end of the week. I feel strongly that either this or

Re: Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Pete Robbins
I have Gmail filters to add labels and skip the inbox for anything sent to dev@spark, user@spark, etc., but still get the occasional message marked as spam. On Wed, 2 Nov 2016 at 08:18 Sean Owen wrote: > I couldn't figure out why I was missing a lot of dev@ announcements, and > have

Re: Handling questions in the mailing lists

2016-11-02 Thread Sean Owen
There's already reviews@ and issues@. dev@ is for project development itself and I think is OK. You're suggesting splitting up user@ and I sympathize with the motivation. Experience tells me that we'll have a beginner@ that's then totally ignored, and people will quickly learn to post to advanced@

RE: Handling questions in the mailing lists

2016-11-02 Thread Mendelson, Assaf
What I am suggesting is basically to fix that. For example, we might say that mailing list A is only for voting, mailing list B is only for PRs, and have something like Stack Overflow for developer questions (I would even go as far as to have beginner, intermediate and advanced mailing lists for

Re: Handling questions in the mailing lists

2016-11-02 Thread Sean Owen
I think that unfortunately mailing lists don't scale well. This one has thousands of subscribers with different interests and levels of experience. For any given person, most messages will be irrelevant. I also find that a lot of questions on user@ are not well-asked, aren't an SSCCE (

Handling questions in the mailing lists

2016-11-02 Thread assaf.mendelson
Hi, I know this is a little off topic, but I wanted to raise an issue about handling questions in the mailing list (this is true for both the user and dev mailing lists, but since there are other options such as Stack Overflow for user questions, this is more problematic in dev). Let's say I

Anyone seeing a lot of Spark emails go to Gmail spam?

2016-11-02 Thread Sean Owen
I couldn't figure out why I was missing a lot of dev@ announcements, and have just realized hundreds of messages to dev@ over the past month or so have been marked as spam for me by Gmail. I have no idea why but it's usually messages from Michael and Reynold, but not all of them. I'll see replies