Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-23 Thread Josh Devins
Thanks for the pointer Peter, that change will indeed fix this bug and it looks like it will make it into the upcoming 1.3.0 release. @Evan, for reference, completeness and posterity: Just to be clear - you're currently calling .persist() before you pass data to LogisticRegressionWithLBFGS?

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Robin East
Running ec2 launch scripts gives me the following error: ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed Full stack trace at https://gist.github.com/insidedctm/4d41600bc22560540a26 I’m running OSX Mavericks 10.9.5 I’ll

Re: Spark SQL - Long running job

2015-02-23 Thread Cheng Lian
I meant using |saveAsParquetFile|. As for partition number, you can always control it with |spark.sql.shuffle.partitions| property. Cheng On 2/23/15 1:38 PM, nitin wrote: I believe calling processedSchemaRdd.persist(DISK) and processedSchemaRdd.checkpoint() only persists data and I will lose

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
This vote was supposed to close on Saturday but it looks like no PMCs voted (other than the implicit vote from Patrick). Was there a discussion offline to cut an RC2? Was the vote extended? On Mon, Feb 23, 2015 at 6:59 AM, Robin East robin.e...@xense.co.uk wrote: Running ec2 launch scripts

Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread The Watcher
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its own Parquet support to read partitioned Parquet tables declared in Hive metastore. Only writing to partitioned tables is not covered yet. These improvements will be included in Spark 1.3.0. Just created SPARK-5948 to

Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its own Parquet support to read partitioned Parquet tables declared in Hive metastore. Only writing to partitioned tables is not covered yet. These improvements will be included in Spark 1.3.0. Just created SPARK-5948 to

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
Thanks Sean. I glossed over the comment about SPARK-5669. On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen so...@cloudera.com wrote: Yes my understanding from Patrick's comment is that this RC will not be released, but, to keep testing. There's an implicit -1 out of the gates there, I believe, and

RE: StreamingContext textFileStream question

2015-02-23 Thread Shao, Saisai
Hi Mark, For input streams like text input stream, only RDDs can be recovered from checkpoint, not missed files; if a file is missing, an exception will actually be raised. If you use HDFS, HDFS will guarantee no data loss since it has 3 copies. Otherwise user logic has to guarantee no file deleted

Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 - 2.7

2015-02-23 Thread Nicholas Chammas
The first concern for Spark will probably be to ensure that we still build and test against Python 2.6, since that's the minimum version of Python we support. Otherwise this seems OK. We use numpy and other Python packages in PySpark, but I don't think we're pinned to any particular version of

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
So actually, the list of blockers on JIRA is a bit outdated. These days I won't cut RC1 unless there are no known issues that I'm aware of that would actually block the release (that's what the snapshot ones are for). I'm going to clean those up and push others to do so also. The main issues I'm

StreamingContext textFileStream question

2015-02-23 Thread mkhaitman
Hello, I was interested in creating a StreamingContext textFileStream based job, which runs for long durations, and can also recover from prolonged driver failure... It seems like StreamingContext checkpointing is mainly used for the case when the driver dies during the processing of an RDD, and

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
It's only been reported on this thread by Tom, so far. On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin van...@cloudera.com wrote: Hey Patrick, Do you have a link to the bug related to Python and Yarn? I looked at the blockers in Jira but couldn't find it. On Mon, Feb 23, 2015 at 10:18 AM,

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Marcelo Vanzin
Hey Patrick, Do you have a link to the bug related to Python and Yarn? I looked at the blockers in Jira but couldn't find it. On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell pwend...@gmail.com wrote: So actually, the list of blockers on JIRA is a bit outdated. These days I won't cut RC1

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Marcelo Vanzin
Hi Tom, are you using an sbt-built assembly by any chance? If so, take a look at SPARK-5808. I haven't had any problems with the maven-built assembly. Setting SPARK_HOME on the executors is a workaround if you want to use the sbt assembly. On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves

[jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 - 2.7

2015-02-23 Thread shane knapp
good morning, developers! TL;DR: i will be installing anaconda and setting it in the system PATH so that your python will default to 2.7, as well as it taking over management of all of the sci-py packages. this is potentially a big change, so i'll be testing locally on my staging instance

RE: StreamingContext textFileStream question

2015-02-23 Thread mkhaitman
Hi Jerry, Thanks for the quick response! Looks like I'll need to come up with an alternative solution in the meantime, since I'd like to avoid the other input streams + WAL approach. :) Thanks again, Mark. -- View this message in context:

Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 - 2.7

2015-02-23 Thread shane knapp
On Mon, Feb 23, 2015 at 11:36 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: The first concern for Spark will probably be to ensure that we still build and test against Python 2.6, since that's the minimum version of Python we support. sounds good... we can set up separate 2.6

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Michael Armbrust
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com wrote: So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the Building Spark page isn't looking too

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Mark Hamstra
Nothing that I can point to, so this may only be a problem in test scope. I am looking at a problem where some UDFs that run with 0.12 fail with 0.13; but that problem is already present in Spark 1.2.x, so it's not a blocking regression for 1.3. (Very likely a HiveFunctionWrapper serde problem,

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Soumitra Kumar
+1 (non-binding) For: https://issues.apache.org/jira/browse/SPARK-3660 . Docs OK . Example code is good -Soumitra. On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin van...@cloudera.com wrote: Hi Tom, are you using an sbt-built assembly by any chance? If so, take a look at SPARK-5808. I

Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian
Ah, sorry for not being clear enough. So now in Spark 1.3.0, we have two Parquet support implementations: the old one is tightly coupled with the Spark SQL framework, while the new one is based on the data sources API. In both versions, we try to intercept operations over Parquet tables

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Tathagata Das
Hey all, I found a major issue where JobProgressListener (a listener used to keep track of jobs for the web UI) never forgets stages in one of its data structures. This is a blocker for long running applications. https://issues.apache.org/jira/browse/SPARK-5967 I am testing a fix for this right

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Cheng Lian
My bad, had once fixed all Hive 12 test failures in PR #4107, but didn't get time to get it merged. Considering the release is close, I can cherry-pick those Hive 12 fixes from #4107 and open a more surgical PR soon. Cheng On 2/24/15 4:18 AM, Michael Armbrust wrote: On Sun, Feb 22, 2015 at