Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Running the EC2 launch scripts gives me the following error:

ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Full stack trace at https://gist.github.com/insidedctm/4d41600bc22560540a26. I'm running OS X Mavericks 10.9.5. I'll investigate further but wondered if anyone else has run into this.

Robin
Re: Spark SQL, Hive & Parquet data types
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its own Parquet support to read partitioned Parquet tables declared in the Hive metastore; only writing to partitioned tables is not covered yet. These improvements will be included in Spark 1.3.0. Just created SPARK-5948 to track writing to partitioned Parquet tables.

Cheng

On 2/20/15 10:58 PM, The Watcher wrote:
1. In Spark 1.3.0, timestamp support was added; also, Spark SQL uses its own Parquet support to handle both the read path and the write path when dealing with Parquet tables declared in the Hive metastore, as long as you're not writing to a partitioned table. So yes, you can.

Ah, I had missed the part about being partitioned or not. Is this related to the work being done on ParquetRelation2? We will indeed write to a partitioned table: does neither the read nor the write path go through Spark SQL's Parquet support in that case? Is there a JIRA/PR I can monitor to see when this would change? Thanks

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Spark SQL - Long running job
I meant using |saveAsParquetFile|. As for the partition number, you can always control it with the |spark.sql.shuffle.partitions| property.

Cheng

On 2/23/15 1:38 PM, nitin wrote:
I believe calling processedSchemaRdd.persist(DISK) and processedSchemaRdd.checkpoint() only persists the data; I will lose all the RDD metadata, and when I restart my driver that data is kind of useless for me (correct me if I am wrong). I thought of doing processedSchemaRdd.saveAsParquetFile (HDFS file system), but I fear that in case my "HDFS block size" > "partition file size", I will get more partitions when reading instead of the original schemaRdd.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Long-running-job-tp10717p10727.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
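As a concrete illustration of the suggestion above, here is a hedged sketch (Spark 1.2-era API) of writing a processed SchemaRDD out as Parquet so it survives a driver restart, while controlling how many part files are produced. The `sc` variable, the query, and the HDFS paths are assumptions for illustration, not from the thread.

```scala
// Sketch, assuming an existing SparkContext `sc` and a registered
// `events` table; paths and query are illustrative.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Fewer shuffle partitions means fewer, larger Parquet part files,
// which helps when the HDFS block size exceeds the per-partition size.
sqlContext.setConf("spark.sql.shuffle.partitions", "16")

val processed = sqlContext.sql("SELECT key, COUNT(*) AS cnt FROM events GROUP BY key")
processed.saveAsParquetFile("hdfs:///warehouse/processed.parquet")

// After a restart, reload with the schema intact instead of recomputing:
val reloaded = sqlContext.parquetFile("hdfs:///warehouse/processed.parquet")
```

Unlike `persist`/`checkpoint`, the Parquet file carries its own schema, so the reloaded data is usable by a fresh driver.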
Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm
Thanks for the pointer Peter, that change will indeed fix this bug, and it looks like it will make it into the upcoming 1.3.0 release. @Evan, for reference, completeness and posterity:

> Just to be clear - you're currently calling .persist() before you pass data
> to LogisticRegressionWithLBFGS?

No. I added persist in GeneralizedLinearAlgorithm right before the `data` RDD goes into the optimizer (LBFGS in our case). See here: https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204

> Also - can you give some parameters about the problem/cluster size you're
> solving this on? How much memory per node? How big are n and d, what is its
> sparsity (if any) and how many iterations are you running for? Is 0:45 the
> per-iteration time or total time for some number of iterations?

The vector is very sparse (a few hundred non-zero entries) but 2.5M in dimension. The dataset is about 30M examples to learn from. 16 machines, each with 64 GB memory and 32 cores.

Josh

On 17 February 2015 at 17:31, Peter Rudenko wrote:
> It's fixed today: https://github.com/apache/spark/pull/4593
>
> Thanks,
> Peter Rudenko
>
> On 2015-02-17 18:25, Evan R. Sparks wrote:
>>
>> Josh - thanks for the detailed write up - this seems a little funny to me.
>> I agree that with the current code path there is more work being done than
>> needs to be (e.g. the features are re-scaled at every iteration), but the
>> relatively costly process of fitting the StandardScaler should not be
>> re-done at each iteration. Instead, at each iteration, all points are
>> re-scaled according to the pre-computed standard deviations in the
>> StandardScalerModel, and then an intercept is appended.
>>
>> Just to be clear - you're currently calling .persist() before you pass data
>> to LogisticRegressionWithLBFGS?
>>
>> Also - can you give some parameters about the problem/cluster size you're
>> solving this on? How much memory per node?
>> How big are n and d, what is its
>> sparsity (if any) and how many iterations are you running for? Is 0:45 the
>> per-iteration time or total time for some number of iterations?
>>
>> A useful test might be to call GeneralizedLinearAlgorithm with useFeatureScaling
>> set to false (and maybe also addIntercept set to false) on persisted data,
>> and see if you see the same performance wins. If that's the case we've
>> isolated the issue and can start profiling to see where all the time is
>> going.
>>
>> It would be great if you can open a JIRA.
>>
>> Thanks!
>>
>> On Tue, Feb 17, 2015 at 6:36 AM, Josh Devins wrote:
>>
>>> Cross-posting as I got no response on the users mailing list last
>>> week. Any response would be appreciated :)
>>>
>>> Josh
>>>
>>> -- Forwarded message --
>>> From: Josh Devins
>>> Date: 9 February 2015 at 15:59
>>> Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
>>> To: "u...@spark.apache.org"
>>>
>>> I've been looking into a performance problem when using
>>> LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm).
>>> Here's an outline of what I've figured out so far; it would be
>>> great to get some confirmation of the problem, some input on how
>>> widespread this problem might be, and any ideas on a nice way to fix
>>> it.
>>>
>>> Context:
>>> - I will reference `branch-1.1` as we are currently on v1.1.1; however,
>>> this appears to still be a problem on `master`
>>> - The cluster is run on YARN, on bare-metal hardware (no VMs)
>>> - I've not filed a JIRA issue yet but can do so
>>> - This problem affects all algorithms based on
>>> GeneralizedLinearAlgorithm (GLA) that use feature scaling (and less so
>>> when not, but still a problem) (e.g. LogisticRegressionWithLBFGS)
>>>
>>> Problem Outline:
>>> - Starting at GLA line 177
>>> (https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177),
>>> a feature scaler is created using the `input` RDD
>>> - Refer next to line 186, which then maps over the `input` RDD and
>>> produces a new `data` RDD
>>> (https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186)
>>> - If you are using feature scaling or adding intercepts, the user
>>> `input` RDD has been mapped over *after* the user has persisted it
>>> (hopefully) and *before* going into the (iterative) optimizer on line
>>> 204 (https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204)
>>> - Since the RDD `data` that is iterated over in the optimizer is
>>> unpersisted, when we are running the cost function in the optimizer
>>> (e.g. LBFGS --
>>> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198),
>>> the map
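The fix discussed in this thread amounts to caching the transformed RDD once, before the iterative optimizer touches it. A hedged sketch of that idea follows; `input`, `scaler`, `addBias`, and `optimizer` mirror the shape of the branch-1.1 GeneralizedLinearAlgorithm code but are illustrative names, not the exact upstream implementation.

```scala
// Sketch of the workaround: persist the scaled `data` RDD once so each
// LBFGS iteration does not re-run the scaling map. Names are illustrative.
import org.apache.spark.storage.StorageLevel

val data = input.map { case (label, features) =>
  // scale (and append the intercept term) exactly once, up front
  (label, addBias(scaler.transform(features)))
}
data.persist(StorageLevel.MEMORY_AND_DISK)  // avoids re-scaling every iteration

val weights = optimizer.optimize(data, initialWeights)
data.unpersist()  // free the cached copy once training is done
```

This is essentially what https://github.com/apache/spark/pull/4593 does inside GeneralizedLinearAlgorithm itself.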
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
This vote was supposed to close on Saturday, but it looks like no PMC members voted (other than the implicit vote from Patrick). Was there a discussion offline to cut an RC2? Was the vote extended?

On Mon, Feb 23, 2015 at 6:59 AM, Robin East wrote:
> Running the EC2 launch scripts gives me the following error:
>
> ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL
> routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
>
> Full stack trace at
> https://gist.github.com/insidedctm/4d41600bc22560540a26
>
> I'm running OS X Mavericks 10.9.5
>
> I'll investigate further but wondered if anyone else has run into this.
>
> Robin
Re: Spark SQL, Hive & Parquet data types
> Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
> own Parquet support to read partitioned Parquet tables declared in the Hive
> metastore. Only writing to partitioned tables is not covered yet. These
> improvements will be included in Spark 1.3.0.
>
> Just created SPARK-5948 to track writing to partitioned Parquet tables.

Ok, this is still a little confusing. Since in 1.2.0 I am able to write to a partitioned Hive table by registering my SchemaRDD and calling INSERT INTO "the hive partitioned table" SELECT "the registered", what is the write path in this case? Full Hive with a Spark SQL<->Hive bridge? If that were the case, why wouldn't SKEWED ON be honored (see another thread I opened)? Thanks
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Yes, my understanding from Patrick's comment is that this RC will not be released, but to keep testing. There's an implicit -1 out of the gates there, I believe, and so the vote won't pass, so perhaps that's why there weren't further binding votes. I'm sure that will be formalized shortly.

FWIW here are 10 issues still listed as blockers for 1.3.0:

SPARK-5910 DataFrame.selectExpr("col as newName") does not work
SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
SPARK-5873 Can't see partially analyzed plans
SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
SPARK-5517 SPARK-5166 Add input types for Java UDFs
SPARK-5463 Fix Parquet filter push-down
SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
SPARK-5183 SPARK-5180 Document data source API
SPARK-3650 Triangle Count handles reverse edges incorrectly
SPARK-3511 Create a RELEASE-NOTES.txt file in the repo

On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet wrote:
> This vote was supposed to close on Saturday but it looks like no PMCs voted
> (other than the implicit vote from Patrick). Was there a discussion offline
> to cut an RC2? Was the vote extended?
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Thanks Sean. I glossed over the comment about SPARK-5669. On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen wrote: > Yes my understanding from Patrick's comment is that this RC will not > be released, but, to keep testing. There's an implicit -1 out of the > gates there, I believe, and so the vote won't pass, so perhaps that's > why there weren't further binding votes. I'm sure that will be > formalized shortly. > > FWIW here are 10 issues still listed as blockers for 1.3.0: > > SPARK-5910 DataFrame.selectExpr("col as newName") does not work > SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java > SPARK-5873 Can't see partially analyzed plans > SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API > SPARK-5517 SPARK-5166 Add input types for Java UDFs > SPARK-5463 Fix Parquet filter push-down > SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3 > SPARK-5183 SPARK-5180 Document data source API > SPARK-3650 Triangle Count handles reverse edges incorrectly > SPARK-3511 Create a RELEASE-NOTES.txt file in the repo > > > On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet wrote: > > This vote was supposed to close on Saturday but it looks like no PMCs > voted > > (other than the implicit vote from Patrick). Was there a discussion > offline > > to cut an RC2? Was the vote extended? >
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
So actually, the list of blockers on JIRA is a bit outdated. These days I won't cut an RC1 unless there are no known issues that I'm aware of that would actually block the release (that's what the snapshot ones are for). I'm going to clean those up and push others to do so also.

The main issues I'm aware of that came about post-RC1 are:
1. Python submission broken on YARN
2. The license issue in MLlib [now fixed]
3. Varargs broken for Java DataFrames [now fixed]

Re: Corey - yeah, as it stands now I try to wait if there are things that look like implicit -1 votes.

On Mon, Feb 23, 2015 at 6:13 AM, Corey Nolet wrote:
> Thanks Sean. I glossed over the comment about SPARK-5669.
>
> On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen wrote:
>> Yes my understanding from Patrick's comment is that this RC will not
>> be released, but, to keep testing. There's an implicit -1 out of the
>> gates there, I believe, and so the vote won't pass, so perhaps that's
>> why there weren't further binding votes. I'm sure that will be
>> formalized shortly.
>>
>> FWIW here are 10 issues still listed as blockers for 1.3.0:
>>
>> SPARK-5910 DataFrame.selectExpr("col as newName") does not work
>> SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
>> SPARK-5873 Can't see partially analyzed plans
>> SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
>> SPARK-5517 SPARK-5166 Add input types for Java UDFs
>> SPARK-5463 Fix Parquet filter push-down
>> SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
>> SPARK-5183 SPARK-5180 Document data source API
>> SPARK-3650 Triangle Count handles reverse edges incorrectly
>> SPARK-3511 Create a RELEASE-NOTES.txt file in the repo
>>
>> On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet wrote:
>> > This vote was supposed to close on Saturday but it looks like no PMCs
>> > voted
>> > (other than the implicit vote from Patrick). Was there a discussion
>> > offline
>> > to cut an RC2? Was the vote extended?
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
It's only been reported on this thread by Tom, so far. On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin wrote: > Hey Patrick, > > Do you have a link to the bug related to Python and Yarn? I looked at > the blockers in Jira but couldn't find it. > > On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell wrote: >> So actually, the list of blockers on JIRA is a bit outdated. These >> days I won't cut RC1 unless there are no known issues that I'm aware >> of that would actually block the release (that's what the snapshot >> ones are for). I'm going to clean those up and push others to do so >> also. >> >> The main issues I'm aware of that came about post RC1 are: >> 1. Python submission broken on YARN >> 2. The license issue in MLlib [now fixed]. >> 3. Varargs broken for Java Dataframes [now fixed] >> >> Re: Corey - yeah, as it stands now I try to wait if there are things >> that look like implicit -1 votes. > > -- > Marcelo
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Hey Patrick, Do you have a link to the bug related to Python and Yarn? I looked at the blockers in Jira but couldn't find it. On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell wrote: > So actually, the list of blockers on JIRA is a bit outdated. These > days I won't cut RC1 unless there are no known issues that I'm aware > of that would actually block the release (that's what the snapshot > ones are for). I'm going to clean those up and push others to do so > also. > > The main issues I'm aware of that came about post RC1 are: > 1. Python submission broken on YARN > 2. The license issue in MLlib [now fixed]. > 3. Varargs broken for Java Dataframes [now fixed] > > Re: Corey - yeah, as it stands now I try to wait if there are things > that look like implicit -1 votes. -- Marcelo
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Hi Tom, are you using an sbt-built assembly by any chance? If so, take a look at SPARK-5808. I haven't had any problems with the maven-built assembly. Setting SPARK_HOME on the executors is a workaround if you want to use the sbt assembly.

On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves wrote:
> Trying to run pyspark on yarn in client mode with the basic wordcount
> example I see the following error when doing the collect:
>
> Error from python worker: /usr/bin/python: No module named sql
> PYTHONPATH was:
> /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jar
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
>     at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
>     at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>     at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:64)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:722)
>
> any ideas on this?
> Tom
>
> On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 1.3.0!
>
> The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1069/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.0!
>
> The vote is open until Saturday, February 21, at 08:03 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.2 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.3 QA period,
> so -1 votes should only occur for significant regressions from 1.2.1.
> Bugs already present in 1.2.X, minor regressions, or bugs related
> to new features will not block this release.
>
> - Patrick

--
Marcelo
StreamingContext textFileStream question
Hello, I was interested in creating a StreamingContext textFileStream based job, which runs for long durations, and can also recover from prolonged driver failure... It seems like StreamingContext checkpointing is mainly used for the case when the driver dies during the processing of an RDD, and to recover that one RDD, but my question specifically relates to whether there is a way to also recover which files were missed between the timeframe of the driver dying and being started back up (whether manually or automatically). Any assistance/suggestions with this one would be greatly appreciated! Thanks, Mark.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
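For reference, the standard checkpoint-recovery idiom the question builds on looks like the following. This is a hedged sketch only: it shows how a driver rebuilds its StreamingContext from a checkpoint after a restart; whether files that arrived while the driver was down get replayed is exactly the open question here. Paths, batch interval, and the job logic are assumptions.

```scala
// Sketch: recreate the streaming job from a checkpoint if one exists.
// Checkpoint/input paths and the 30s batch interval are illustrative.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/textfile-job"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("textFileStreamJob")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // placeholder processing: count characters per line in new files
  ssc.textFileStream("hdfs:///incoming").map(_.length).print()
  ssc
}

// On a fresh start this calls createContext(); after a crash it rebuilds
// the context (and in-flight RDD lineage) from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```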
[jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7
good morning, developers!

TL;DR: i will be installing anaconda and setting it in the system PATH so that your python will default to 2.7, as well as it taking over management of all of the sci-py packages. this is potentially a big change, so i'll be testing locally on my staging instance before deployment to the wide world.

deployment is *tentatively* next monday, march 2nd.

a little background: the jenkins test infra is currently (and happily) managed by a set of tools that allow me to set up and deploy new workers, manage their packages and make sure that all spark and research projects can happily and successfully build. we're currently at the state where ~50 or so packages are installed and configured on each worker. this is getting a little cumbersome, as the package-to-build dep tree is getting pretty large. the biggest offender is the science-based python infrastructure. everything is blindly installed w/yum and pip, so it's hard to control *exactly* what version of any given library is as compared to what's on a dev's laptop.

the solution: anaconda (https://store.continuum.io/cshop/anaconda/)! everything is centralized! i can manage specific versions much easier!

what this means to you:
* python 2.7 will be the default system python.
* 2.6 will still be installed and available (/usr/bin/python or /usr/bin/python2.6)

what you need to do:
* install anaconda, have it update your PATH
* build locally and try to fix any bugs (for spark, this "should just work")
* if you have problems, reach out to me and i'll see what i can do to help. if we can't get your stuff running under python2.7, we can default to 2.6 via a job config change.

what i will be doing:
* setting up anaconda on my staging instance and spot-testing a lot of builds before deployment

please let me know if there are any issues/concerns... i'll be posting updates this week and will let everyone know if there are any changes to the Plan[tm].

your friendly devops engineer,
shane
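For anyone following along locally, the PATH change amounts to something like the sketch below. The install prefix /usr/local/anaconda is a hypothetical path for illustration, not the actual Jenkins location.

```shell
# Hypothetical sketch of the PATH change described above; the prefix
# /usr/local/anaconda is an assumption, not the real Jenkins install path.
export PATH="/usr/local/anaconda/bin:$PATH"

# `python` now resolves under the anaconda prefix first; on a worker this
# would report Python 2.7.x:
#   python --version
# Jobs that must stay on the old system interpreter can call it explicitly:
#   /usr/bin/python2.6 run_tests.py

echo "$PATH" | cut -d: -f1   # prints /usr/local/anaconda/bin
```

Jobs that pin 2.6 simply invoke the absolute interpreter path instead of relying on PATH resolution.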
RE: StreamingContext textFileStream question
Hi Mark,

For input streams like the text file input stream, only RDDs can be recovered from a checkpoint, not missed files; if a file is missed, an exception will actually be raised. If you use HDFS, HDFS will guarantee no data loss since it keeps 3 copies; otherwise, user logic has to guarantee that no file is deleted before recovering.

For input streams which are receiver-based, like the Kafka input stream or socket input stream, a WAL (write-ahead log) mechanism can be enabled to store the received data as well as metadata, so data can be recovered from failure.

Thanks
Jerry

-Original Message-
From: mkhaitman [mailto:mark.khait...@chango.com]
Sent: Monday, February 23, 2015 10:54 AM
To: dev@spark.apache.org
Subject: StreamingContext textFileStream question

Hello, I was interested in creating a StreamingContext textFileStream based job, which runs for long durations, and can also recover from prolonged driver failure... It seems like StreamingContext checkpointing is mainly used for the case when the driver dies during the processing of an RDD, and to recover that one RDD, but my question specifically relates to whether there is a way to also recover which files were missed between the timeframe of the driver dying and being started back up (whether manually or automatically). Any assistance/suggestions with this one would be greatly appreciated! Thanks, Mark.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
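The WAL mechanism mentioned above for receiver-based streams is switched on through configuration. A hedged sketch, using the Spark 1.2/1.3-era property name; the app name and checkpoint path are assumptions:

```scala
// Config sketch: enable the write-ahead log for receiver-based input
// streams (Kafka, socket, ...). Received blocks are then journaled to the
// checkpoint directory before being acknowledged.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("receiver-job")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// The WAL lives under the checkpoint directory, so checkpointing
// must be enabled as well:
ssc.checkpoint("hdfs:///checkpoints/receiver-job")
```

Note this applies only to receiver-based streams, not to textFileStream, which is the limitation driving the discussion here.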
RE: StreamingContext textFileStream question
Hi Jerry, Thanks for the quick response! Looks like I'll need to come up with an alternative solution in the meantime, since I'd like to avoid the other input streams + WAL approach. :) Thanks again, Mark.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742p10745.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7
The first concern for Spark will probably be to ensure that we still build and test against Python 2.6, since that's the minimum version of Python we support. Otherwise this seems OK. We use numpy and other Python packages in PySpark, but I don't think we're pinned to any particular version of those packages. Nick On Mon Feb 23 2015 at 2:15:19 PM shane knapp wrote: > good morning, developers! > > TL;DR: > > i will be installing anaconda and setting it in the system PATH so that > your python will default to 2.7, as well as it taking over management of > all of the sci-py packages. this is potentially a big change, so i'll be > testing locally on my staging instance before deployment to the wide world. > > deployment is *tentatively* next monday, march 2nd. > > a little background: > > the jenkins test infra is currently (and happily) managed by a set of tools > that allow me to set up and deploy new workers, manage their packages and > make sure that all spark and research projects can happily and successfully > build. > > we're currently at the state where ~50 or so packages are installed and > configured on each worker. this is getting a little cumbersome, as the > package-to-build dep tree is getting pretty large. > > the biggest offender is the science-based python infrastructure. > everything is blindly installed w/yum and pip, so it's hard to control > *exactly* what version of any given library is as compared to what's on a > dev's laptop. > > the solution: > > anaconda (https://store.continuum.io/cshop/anaconda/)! everything is > centralized! i can manage specific versions much easier! > > what this means to you: > > * python 2.7 will be the default system python. 
> * 2.6 will still be installed and available (/usr/bin/python or > /usr/bin/python/2.6) > > what you need to do: > * install anaconda, have it update your PATH > * build locally and try to fix any bugs (for spark, this "should just > work") > * if you have problems, reach out to me and i'll see what i can do to help. > if we can't get your stuff running under python2.7, we can default to 2.6 > via a job config change. > > what i will be doing: > * setting up anaconda on my staging instance and spot-testing a lot of > builds before deployment > > please let me know if there are any issues/concerns... i'll be posting > updates this week and will let everyone know if there are any changes to > the Plan[tm]. > > your friendly devops engineer, > > shane >
Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7
On Mon, Feb 23, 2015 at 11:36 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > The first concern for Spark will probably be to ensure that we still build > and test against Python 2.6, since that's the minimum version of Python we > support. > > sounds good... we can set up separate 2.6 builds on specific versions... this could allow you to easily differentiate between "baseline" and "latest and greatest" if you wanted. it'll have a little bit more administrative overhead, due to more jobs needing configs, but offers more flexibility. let me know what you think. > Otherwise this seems OK. We use numpy and other Python packages in > PySpark, but I don't think we're pinned to any particular version of those > packages. > > cool. i'll start mucking about and let you guys know how it goes. shane
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra wrote: > So what are we expecting of Hive 0.12.0 builds with this RC? I know not > every combination of Hadoop and Hive versions, etc., can be supported, but > even an example build from the "Building Spark" page isn't looking too good > to me. > I would definitely expect this to build and we do actually test that for each PR. We don't yet run the tests for both versions of Hive and thus unfortunately these do get out of sync. Usually these are just problems diff-ing golden output or cases where we have added a test that uses a feature not available in hive 12. Have you seen problems with using Hive 12 outside of these test failures?
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Nothing that I can point to, so this may only be a problem in test scope. I am looking at a problem where some UDFs that run with 0.12 fail with 0.13; but that problem is already present in Spark 1.2.x, so it's not a blocking regression for 1.3. (Very likely a HiveFunctionWrapper serde problem, but I haven't run it to ground yet.) On Mon, Feb 23, 2015 at 12:18 PM, Michael Armbrust wrote: > On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra > wrote: > >> So what are we expecting of Hive 0.12.0 builds with this RC? I know not >> every combination of Hadoop and Hive versions, etc., can be supported, but >> even an example build from the "Building Spark" page isn't looking too >> good >> to me. >> > > I would definitely expect this to build and we do actually test that for > each PR. We don't yet run the tests for both versions of Hive and thus > unfortunately these do get out of sync. Usually these are just problems > diff-ing golden output or cases where we have added a test that uses a > feature not available in hive 12. > > Have you seen problems with using Hive 12 outside of these test failures? >
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
My bad; I had once fixed all Hive 12 test failures in PR #4107, but didn't get time to get it merged. Considering the release is close, I can cherry-pick those Hive 12 fixes from #4107 and open a more surgical PR soon.

Cheng

On 2/24/15 4:18 AM, Michael Armbrust wrote:
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra wrote:
So what are we expecting of Hive 0.12.0 builds with this RC? I know not every combination of Hadoop and Hive versions, etc., can be supported, but even an example build from the "Building Spark" page isn't looking too good to me.

I would definitely expect this to build and we do actually test that for each PR. We don't yet run the tests for both versions of Hive and thus unfortunately these do get out of sync. Usually these are just problems diff-ing golden output or cases where we have added a test that uses a feature not available in hive 12.

Have you seen problems with using Hive 12 outside of these test failures?
Re: Spark SQL, Hive & Parquet data types
Ah, sorry for not being clear enough. So now in Spark 1.3.0, we have two Parquet support implementations, the old one is tightly coupled with the Spark SQL framework, while the new one is based on data sources API. In both versions, we try to intercept operations over Parquet tables registered in metastore when possible for better performance (mainly filter push-down optimization and extra metadata for more accurate schema inference). The distinctions are: 1. For old version (set |spark.sql.parquet.useDataSourceApi| to |false|) When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we “hijack” the read path. Namely whenever you query a Parquet table registered in metastore, we’re using our own Parquet implementation. For write path, we fallback to default Hive SerDe implementation (namely Spark SQL’s |InsertIntoHiveTable| operator). 2. For new data source version (set |spark.sql.parquet.useDataSourceApi| to |true|, which is the default value in master and branch-1.3) When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we “hijack” both read and write path, but if you’re writing to a partitioned table, we still fallback to default Hive SerDe implementation. For Spark 1.2.0, only 1 applies. Spark 1.2.0 also has a Parquet data source, but it’s not enabled if you’re not using data sources API specific DDL (|CREATE TEMPORARY TABLE USING |). Cheng On 2/23/15 10:05 PM, The Watcher wrote: Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its own Parquet support to read partitioned Parquet tables declared in Hive metastore. Only writing to partitioned tables is not covered yet. These improvements will be included in Spark 1.3.0. Just created SPARK-5948 to track writing to partitioned Parquet tables. Ok, this is still a little confusing. 
Since in 1.2.0 I am able to write to a partitioned Hive table by registering my SchemaRDD and calling INSERT INTO "the Hive partitioned table" SELECT "the registered table", what is the write path in this case? Full Hive with a Spark SQL <-> Hive bridge? If that were the case, why wouldn't SKEWED ON be honored (see another thread I opened). Thanks
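For reference, the 1.2.0 pattern being described above is roughly the sketch below. The table and partition names are hypothetical, and `sc` is an existing SparkContext; in 1.2.0 this INSERT goes through Hive's SerDe write path rather than Spark SQL's native Parquet writer, which is the behavior being asked about:

```scala
// Sketch of the Spark 1.2.0 write pattern described above; assumes an
// existing partitioned Hive table `events` (all names hypothetical).
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

val processed = hiveContext.sql("SELECT * FROM staging_events")
processed.registerTempTable("processed_events")

// In 1.2.0 this write is handled by Hive's SerDe path
// (InsertIntoHiveTable), not Spark SQL's own Parquet support.
hiveContext.sql(
  """INSERT INTO TABLE events PARTITION (dt = '2015-02-23')
    |SELECT id, payload FROM processed_events""".stripMargin)
```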
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
+1 (non-binding)

For: https://issues.apache.org/jira/browse/SPARK-3660

. Docs OK
. Example code is good

-Soumitra.

On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin wrote:
> Hi Tom, are you using an sbt-built assembly by any chance? If so, take
> a look at SPARK-5808.
>
> I haven't had any problems with the maven-built assembly. Setting
> SPARK_HOME on the executors is a workaround if you want to use the sbt
> assembly.
>
> On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves wrote:
> > Trying to run pyspark on yarn in client mode with the basic wordcount
> > example, I see the following error when doing the collect:
> > Error from python worker: /usr/bin/python: No module named sql
> > PYTHONPATH was: /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jar
> > java.io.EOFException
> >   at java.io.DataInputStream.readInt(DataInputStream.java:392)
> >   at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
> >   at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> >   at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> >   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
> >   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> >   at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> >   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> >   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> >   at org.apache.spark.scheduler.Task.run(Task.scala:64)
> >   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >   at java.lang.Thread.run(Thread.java:722)
> >
> > any ideas on this?
> >
> > Tom
> >
> > On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell <pwend...@gmail.com> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version 1.3.0!
> >
> > The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.0-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1069/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.3.0!
> >
> > The vote is open until Saturday, February 21, at 08:03 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.3.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > == How can I help test this release? ==
> > If you are a Spark user, you can help us test this release by
> > taking a Spark 1.2 workload and running on this release candidate,
> > then reporting any regressions.
> >
> > == What justifies a -1 vote for this release? ==
> > This vote is happening towards the end of the 1.3 QA period,
> > so -1 votes should only occur for significant regressions from 1.2.1.
> > Bugs already present in 1.2.X, minor regressions, or bugs related
> > to new features will not block this release.
> >
> > - Patrick
>
> --
> Marcelo
Re: [VOTE] Release Apache Spark 1.3.0 (RC1)
Hey all, I found a major issue where JobProgressListener (a listener used to keep track of jobs for the web UI) never forgets stages in one of its data structures. This is a blocker for long-running applications. https://issues.apache.org/jira/browse/SPARK-5967 I am testing a fix for this right now.

TD

On Mon, Feb 23, 2015 at 7:23 PM, Soumitra Kumar wrote:
> +1 (non-binding)
>
> For: https://issues.apache.org/jira/browse/SPARK-3660
>
> . Docs OK
> . Example code is good
>
> -Soumitra.