Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-23 Thread Josh Devins
Thanks for the pointer, Peter; that change will indeed fix this bug, and
it looks like it will make it into the upcoming 1.3.0 release.

@Evan, for reference, completeness and posterity:

 Just to be clear - you're currently calling .persist() before you pass data 
 to LogisticRegressionWithLBFGS?

No. I added a persist in GeneralizedLinearAlgorithm right before the
`data` RDD goes into the optimizer (LBFGS in our case). See here:
https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204
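For completeness, here is a self-contained sketch of the idea behind that one-line change (not the actual patch): persist and materialize the derived RDD before handing it to the iterative optimizer, so the upstream scaling/intercept map is not re-run on every LBFGS iteration. The toy data, storage level and LBFGS parameters below are illustrative only.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.storage.StorageLevel

    object PersistBeforeOptimize {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("persist-before-optimize").setMaster("local[*]"))

        val input = sc.parallelize(Seq(
          LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
          LabeledPoint(0.0, Vectors.dense(0.0, 1.0))))

        // Stand-in for the scaling + intercept map done inside GeneralizedLinearAlgorithm.
        val data = input
          .map(lp => (lp.label, MLUtils.appendBias(lp.features)))
          .persist(StorageLevel.MEMORY_AND_DISK)  // <-- the added persist
        data.count()                              // materialize once, up front

        val initialWeights = Vectors.dense(0.0, 0.0, 0.0)
        val (weights, _) = LBFGS.runLBFGS(
          data,
          new LogisticGradient(),
          new SquaredL2Updater(),
          10,    // numCorrections
          1e-4,  // convergence tolerance
          20,    // max iterations
          0.0,   // regularization parameter
          initialWeights)

        println(weights)
        data.unpersist()
      }
    }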

 Also - can you give some parameters about the problem/cluster size you're 
 solving this on? How much memory per node? How big are n and d, what is its 
 sparsity (if any) and how many iterations are you running for? Is 0:45 the 
 per-iteration time or total time for some number of iterations?

The feature vectors are very sparse (a few hundred entries each) but 2.5M
in size. The dataset is about 30M examples to learn from. 16 machines, with
64GB memory and 32 cores each.

Josh


On 17 February 2015 at 17:31, Peter Rudenko petro.rude...@gmail.com wrote:
 It's fixed today: https://github.com/apache/spark/pull/4593

 Thanks,
 Peter Rudenko

 On 2015-02-17 18:25, Evan R. Sparks wrote:

 Josh - thanks for the detailed write-up - this seems a little funny to me.
 I agree that with the current code path there is more work being done than
 needs to be (e.g. the features are re-scaled at every iteration), but the
 relatively costly process of fitting the StandardScaler should not be
 re-done at each iteration. Instead, at each iteration, all points are
 re-scaled according to the pre-computed standard deviations in the
 StandardScalerModel, and then an intercept is appended.
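For reference, a minimal sketch of the fit-once pattern described above, using MLlib's StandardScaler directly (the toy data and storage level are illustrative): fit the scaler in a single pass, then re-use the resulting StandardScalerModel to transform points, and persist the result so an iterative optimizer never re-triggers the scaling map.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.storage.StorageLevel

    object ScaleOnceSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("scale-once").setMaster("local[*]"))

        val input = sc.parallelize(Seq(
          LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
          LabeledPoint(0.0, Vectors.dense(3.0, 4.0))))

        // Fit once: a single pass over the data computes the per-column std deviations.
        val scalerModel = new StandardScaler(false, true).fit(input.map(_.features))

        // Scale once and persist; subsequent iterations re-use the cached, scaled points.
        val scaled = input
          .map(lp => LabeledPoint(lp.label, scalerModel.transform(lp.features)))
          .persist(StorageLevel.MEMORY_AND_DISK)
        scaled.count()  // materialize before handing the RDD to an optimizer

        println(scaled.collect().mkString(", "))
      }
    }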

 Just to be clear - you're currently calling .persist() before you pass
 data
 to LogisticRegressionWithLBFGS?

 Also - can you give some parameters about the problem/cluster size you're
 solving this on? How much memory per node? How big are n and d, what is
 its
 sparsity (if any) and how many iterations are you running for? Is 0:45 the
 per-iteration time or total time for some number of iterations?

 A useful test might be to call GeneralizedLinearAlgorithm with
 useFeatureScaling set to false (and maybe also addIntercept set to false)
 on persisted data,
 and see if you see the same performance wins. If that's the case we've
 isolated the issue and can start profiling to see where all the time is
 going.

 It would be great if you can open a JIRA.

 Thanks!



 On Tue, Feb 17, 2015 at 6:36 AM, Josh Devins j...@soundcloud.com wrote:

 Cross-posting as I got no response on the users mailing list last
 week. Any response would be appreciated :)

 Josh


 -- Forwarded message --
 From: Josh Devins j...@soundcloud.com
 Date: 9 February 2015 at 15:59
 Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
 To: u...@spark.apache.org u...@spark.apache.org


 I've been looking into a performance problem when using
 LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm).
 Here's an outline of what I've figured out so far, and it would be
 great to get some confirmation of the problem, some input on how
 widespread this problem might be, and any ideas on a nice way to fix
 this.

 Context:
 - I will reference `branch-1.1` as we are currently on v1.1.1; however,
 this appears to still be a problem on `master`
 - The cluster is run on YARN, on bare-metal hardware (no VMs)
 - I've not filed a Jira issue yet but can do so
 - This problem affects all algorithms based on
 GeneralizedLinearAlgorithm (GLA) that use feature scaling, e.g.
 LogisticRegressionWithLBFGS (and, to a lesser extent, those that don't,
 but it is still a problem there)

 Problem Outline:
 - Starting at GLA line 177
 (

 https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177
 ),
 a feature scaler is created using the `input` RDD
 - Refer next to line 186 which then maps over the `input` RDD and
 produces a new `data` RDD
 (

 https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186
 )
 - If you are using feature scaling or adding intercepts, the user
 `input` RDD has been mapped over *after* the user has persisted it
 (hopefully) and *before* going into the (iterative) optimizer on line
 204 (

 https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204
 )
 - Since the RDD `data` that is iterated over in the optimizer is
 unpersisted, when we are running the cost function in the optimizer
 (e.g. LBFGS --

 https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198
 ),
 the map phase will actually first go back and rerun the feature
 scaling (map tasks on `input`) and then map with the cost function
 (two maps pipelined into one stage)
 - As a 
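To see the recomputation described in the outline above in isolation, here is a self-contained sketch using plain Spark core (not MLlib); the accumulators are only there to count how often the map function runs, and all names are made up for the example.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.storage.StorageLevel

    object RecomputationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("recompute-demo").setMaster("local[*]"))

        val input = sc.parallelize(1 to 1000, 4).cache()
        input.count()  // `input` plays the role of the user's persisted RDD

        // Stand-in for the feature-scaling map done inside GeneralizedLinearAlgorithm.
        val unpersistedRuns = sc.accumulator(0L)
        val data = input.map { x => unpersistedRuns += 1L; x * 2.0 }

        // Every "iteration" (action) re-executes the map, because `data` is unpersisted.
        (1 to 5).foreach(_ => data.sum())
        println(s"unpersisted: map ran over ${unpersistedRuns.value} records")  // ~5000

        // With a persist before iterating, the map runs over the data only once.
        val persistedRuns = sc.accumulator(0L)
        val persisted = input
          .map { x => persistedRuns += 1L; x * 2.0 }
          .persist(StorageLevel.MEMORY_ONLY)
        (1 to 5).foreach(_ => persisted.sum())
        println(s"persisted: map ran over ${persistedRuns.value} records")  // ~1000
      }
    }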

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Robin East
Running ec2 launch scripts gives me the following error:

ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL 
routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Full stack trace at
https://gist.github.com/insidedctm/4d41600bc22560540a26

I’m running OSX Mavericks 10.9.5

I’ll investigate further but wondered if anyone else has run into this.

Robin

Re: Spark SQL - Long running job

2015-02-23 Thread Cheng Lian
I meant using |saveAsParquetFile|. As for the partition number, you can 
always control it with the |spark.sql.shuffle.partitions| property.
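A minimal sketch of that suggestion against the Spark 1.2-era SchemaRDD API (the Record case class and the paths are illustrative, not from this thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(id: Int, value: String)

    object SaveAsParquetSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("save-as-parquet-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD  // implicit RDD[Record] -> SchemaRDD (1.2 API)

        // Cap the number of partitions produced by shuffles, so written files stay large.
        sqlContext.setConf("spark.sql.shuffle.partitions", "64")

        val processedSchemaRdd = sc.parallelize(1 to 1000).map(i => Record(i, s"v$i"))

        // Persist the result, schema included, instead of RDD persist/checkpoint.
        processedSchemaRdd.saveAsParquetFile("hdfs:///tmp/processed.parquet")

        // After a driver restart, the data and its schema can be reloaded directly:
        val reloaded = sqlContext.parquetFile("hdfs:///tmp/processed.parquet")
        reloaded.registerTempTable("processed")
        println(sqlContext.sql("SELECT COUNT(*) FROM processed").collect().mkString)
      }
    }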


Cheng

On 2/23/15 1:38 PM, nitin wrote:


I believe calling processedSchemaRdd.persist(DISK) and
processedSchemaRdd.checkpoint() only persists the data; I will lose all the
RDD metadata, and when I re-start my driver, that data is kind of useless for
me (correct me if I am wrong).

I thought of doing processedSchemaRdd.saveAsParquetFile (hdfs file system),
but I fear that in case my HDFS block size < partition file size, I will
get more partitions when reading than in the original schemaRdd.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Long-running-job-tp10717p10727.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.






Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
This vote was supposed to close on Saturday but it looks like no PMCs voted
(other than the implicit vote from Patrick). Was there a discussion offline
to cut an RC2? Was the vote extended?

On Mon, Feb 23, 2015 at 6:59 AM, Robin East robin.e...@xense.co.uk wrote:

 Running ec2 launch scripts gives me the following error:

 ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL
 routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

 Full stack trace at
 https://gist.github.com/insidedctm/4d41600bc22560540a26

 I’m running OSX Mavericks 10.9.5

 I’ll investigate further but wondered if anyone else has run into this.

 Robin


Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread The Watcher

 Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
 own Parquet support to read partitioned Parquet tables declared in Hive
 metastore. Only writing to partitioned tables is not covered yet. These
 improvements will be included in Spark 1.3.0.

 Just created SPARK-5948 to track writing to partitioned Parquet tables.

Ok, this is still a little confusing.

Since I am able in 1.2.0 to write to a partitioned Hive table by registering
my SchemaRDD and calling INSERT INTO the Hive partitioned table SELECT from
the registered table, what is the write path in this case? Full Hive with a
SparkSQL-Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?

Thanks
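For reference, a minimal sketch of the 1.2.0 write path described in the question, via a HiveContext; the table and column names are made up, and per Cheng's later reply in this thread the INSERT is handled by the Hive SerDe write path (Spark SQL's InsertIntoHiveTable operator).

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    case class Event(id: Long, payload: String)

    object InsertIntoPartitionedSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("insert-into-partitioned-sketch"))
        val hive = new HiveContext(sc)
        import hive.createSchemaRDD  // implicit RDD[Event] -> SchemaRDD (1.2 API)

        // Register the SchemaRDD so it can be referenced from HiveQL.
        val events = sc.parallelize(Seq(Event(1L, "a"), Event(2L, "b")))
        events.registerTempTable("staged_events")

        // Target table (STORED AS PARQUET requires Hive 0.13+, the default in Spark 1.2).
        hive.sql("""CREATE TABLE IF NOT EXISTS events_partitioned (id BIGINT, payload STRING)
                    PARTITIONED BY (dt STRING) STORED AS PARQUET""")

        // The INSERT ... SELECT from the question, into a static partition.
        hive.sql("""INSERT INTO TABLE events_partitioned PARTITION (dt = '2015-02-23')
                    SELECT id, payload FROM staged_events""")
      }
    }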


Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses 
its own Parquet support to read partitioned Parquet tables declared in 
Hive metastore. Only writing to partitioned tables is not covered yet. 
These improvements will be included in Spark 1.3.0.


Just created SPARK-5948 to track writing to partitioned Parquet tables.

Cheng

On 2/20/15 10:58 PM, The Watcher wrote:


1. In Spark 1.3.0, timestamp support was added, also Spark SQL uses
its own Parquet support to handle both read path and write path when
dealing with Parquet tables declared in Hive metastore, as long as you’re
not writing to a partitioned table. So yes, you can.

Ah, I had missed the part about being partitioned or not. Is this related
to the work being done on ParquetRelation2?

We will indeed write to a partitioned table: does neither the read nor the
write path go through Spark SQL's Parquet support in that case? Is there a
JIRA/PR I can monitor to see when this would change?

Thanks







Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
Thanks Sean. I glossed over the comment about SPARK-5669.

On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen so...@cloudera.com wrote:

 Yes, my understanding from Patrick's comment is that this RC will not
 be released, but that we should keep testing. There's an implicit -1 out of
 the gates there, I believe, and so the vote won't pass, so perhaps that's
 why there weren't further binding votes. I'm sure that will be
 formalized shortly.

 FWIW here are 10 issues still listed as blockers for 1.3.0:

 SPARK-5910 DataFrame.selectExpr(col as newName) does not work
 SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
 SPARK-5873 Can't see partially analyzed plans
 SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
 SPARK-5517 SPARK-5166 Add input types for Java UDFs
 SPARK-5463 Fix Parquet filter push-down
 SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
 SPARK-5183 SPARK-5180 Document data source API
 SPARK-3650 Triangle Count handles reverse edges incorrectly
 SPARK-3511 Create a RELEASE-NOTES.txt file in the repo


 On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet cjno...@gmail.com wrote:
  This vote was supposed to close on Saturday but it looks like no PMCs
 voted
  (other than the implicit vote from Patrick). Was there a discussion
 offline
  to cut an RC2? Was the vote extended?



RE: StreamingContext textFileStream question

2015-02-23 Thread Shao, Saisai
Hi Mark,

For input streams like the text file input stream, only RDDs can be recovered
from the checkpoint, not missed files; if a file is missing, an exception will
actually be raised. If you use HDFS, HDFS will guarantee no data loss since it
keeps 3 copies. Otherwise, user logic has to guarantee that no file is deleted
before recovering.

For receiver-based input streams, like the Kafka input stream or socket input
stream, a WAL (write-ahead log) mechanism can be enabled to store the received
data as well as metadata, so data can be recovered from failure.
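To make the recovery part concrete, a minimal sketch of checkpoint-based driver recovery with textFileStream (paths, batch interval, and the word-count logic are illustrative). As noted above, on restart this recovers generated RDDs and metadata from the checkpoint; it does not replay files that arrived while the driver was down.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object TextFileStreamRecovery {
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("textFileStream-recovery")
        val ssc = new StreamingContext(conf, Seconds(30))
        ssc.checkpoint(checkpointDir)
        ssc.textFileStream("hdfs:///data/incoming")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)
          .print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, getOrCreate rebuilds the context (and pending RDDs) from the checkpoint.
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }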

Thanks
Jerry

-Original Message-
From: mkhaitman [mailto:mark.khait...@chango.com] 
Sent: Monday, February 23, 2015 10:54 AM
To: dev@spark.apache.org
Subject: StreamingContext textFileStream question

Hello,

I was interested in creating a StreamingContext textFileStream based job, which 
runs for long durations, and can also recover from prolonged driver failure... 
It seems like StreamingContext checkpointing is mainly used for the case when 
the driver dies during the processing of an RDD, and to recover that one RDD, 
but my question specifically relates to whether there is a way to also recover 
which files were missed between the timeframe of the driver dying and being 
started back up (whether manually or automatically).

Any assistance/suggestions with this one would be greatly appreciated!

Thanks,
Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.






Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread Nicholas Chammas
The first concern for Spark will probably be to ensure that we still build
and test against Python 2.6, since that's the minimum version of Python we
support.

Otherwise this seems OK. We use numpy and other Python packages in PySpark,
but I don't think we're pinned to any particular version of those packages.

Nick

On Mon Feb 23 2015 at 2:15:19 PM shane knapp skn...@berkeley.edu wrote:

 good morning, developers!

 TL;DR:

 i will be installing anaconda and setting it in the system PATH so that
 your python will default to 2.7, as well as it taking over management of
 all of the sci-py packages.  this is potentially a big change, so i'll be
 testing locally on my staging instance before deployment to the wide world.

 deployment is *tentatively* next monday, march 2nd.

 a little background:

 the jenkins test infra is currently (and happily) managed by a set of tools
 that allow me to set up and deploy new workers, manage their packages and
 make sure that all spark and research projects can happily and successfully
 build.

 we're currently at the state where ~50 or so packages are installed and
 configured on each worker.  this is getting a little cumbersome, as the
 package-to-build dep tree is getting pretty large.

 the biggest offender is the science-based python infrastructure.
  everything is blindly installed w/yum and pip, so it's hard to control
 *exactly* what version of any given library is as compared to what's on a
 dev's laptop.

 the solution:

 anaconda (https://store.continuum.io/cshop/anaconda/)!  everything is
 centralized!  i can manage specific versions much easier!

 what this means to you:

 * python 2.7 will be the default system python.
 * 2.6 will still be installed and available (/usr/bin/python or
 /usr/bin/python/2.6)

 what you need to do:
 * install anaconda, have it update your PATH
 * build locally and try to fix any bugs (for spark, this should just
 work)
 * if you have problems, reach out to me and i'll see what i can do to help.
  if we can't get your stuff running under python2.7, we can default to 2.6
 via a job config change.

 what i will be doing:
 * setting up anaconda on my staging instance and spot-testing a lot of
 builds before deployment

 please let me know if there are any issues/concerns...  i'll be posting
 updates this week and will let everyone know if there are any changes to
 the Plan[tm].

 your friendly devops engineer,

 shane



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
So actually, the list of blockers on JIRA is a bit outdated. These
days I won't cut RC1 unless there are no known issues that I'm aware
of that would actually block the release (that's what the snapshot
ones are for). I'm going to clean those up and push others to do so
also.

The main issues I'm aware of that came about post RC1 are:
1. Python submission broken on YARN
2. The license issue in MLlib [now fixed].
3. Varargs broken for Java Dataframes [now fixed]

Re: Corey - yeah, as it stands now I try to wait if there are things
that look like implicit -1 votes.

On Mon, Feb 23, 2015 at 6:13 AM, Corey Nolet cjno...@gmail.com wrote:
 Thanks Sean. I glossed over the comment about SPARK-5669.

 On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen so...@cloudera.com wrote:

 Yes my understanding from Patrick's comment is that this RC will not
 be released, but, to keep testing. There's an implicit -1 out of the
 gates there, I believe, and so the vote won't pass, so perhaps that's
 why there weren't further binding votes. I'm sure that will be
 formalized shortly.

 FWIW here are 10 issues still listed as blockers for 1.3.0:

 SPARK-5910 DataFrame.selectExpr(col as newName) does not work
 SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
 SPARK-5873 Can't see partially analyzed plans
 SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
 SPARK-5517 SPARK-5166 Add input types for Java UDFs
 SPARK-5463 Fix Parquet filter push-down
 SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
 SPARK-5183 SPARK-5180 Document data source API
 SPARK-3650 Triangle Count handles reverse edges incorrectly
 SPARK-3511 Create a RELEASE-NOTES.txt file in the repo


 On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet cjno...@gmail.com wrote:
  This vote was supposed to close on Saturday but it looks like no PMCs
  voted
  (other than the implicit vote from Patrick). Was there a discussion
  offline
  to cut an RC2? Was the vote extended?






StreamingContext textFileStream question

2015-02-23 Thread mkhaitman
Hello,

I was interested in creating a StreamingContext textFileStream based job,
which runs for long durations, and can also recover from prolonged driver
failure... It seems like StreamingContext checkpointing is mainly used for
the case when the driver dies during the processing of an RDD, and to
recover that one RDD, but my question specifically relates to whether there
is a way to also recover which files were missed between the timeframe of
the driver dying and being started back up (whether manually or
automatically).

Any assistance/suggestions with this one would be greatly appreciated!

Thanks,
Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
It's only been reported on this thread by Tom, so far.

On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin van...@cloudera.com wrote:
 Hey Patrick,

 Do you have a link to the bug related to Python and Yarn? I looked at
 the blockers in Jira but couldn't find it.

 On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell pwend...@gmail.com wrote:
 So actually, the list of blockers on JIRA is a bit outdated. These
 days I won't cut RC1 unless there are no known issues that I'm aware
 of that would actually block the release (that's what the snapshot
 ones are for). I'm going to clean those up and push others to do so
 also.

 The main issues I'm aware of that came about post RC1 are:
 1. Python submission broken on YARN
 2. The license issue in MLlib [now fixed].
 3. Varargs broken for Java Dataframes [now fixed]

 Re: Corey - yeah, as it stands now I try to wait if there are things
 that look like implicit -1 votes.

 --
 Marcelo




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Marcelo Vanzin
Hey Patrick,

Do you have a link to the bug related to Python and Yarn? I looked at
the blockers in Jira but couldn't find it.

On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell pwend...@gmail.com wrote:
 So actually, the list of blockers on JIRA is a bit outdated. These
 days I won't cut RC1 unless there are no known issues that I'm aware
 of that would actually block the release (that's what the snapshot
 ones are for). I'm going to clean those up and push others to do so
 also.

 The main issues I'm aware of that came about post RC1 are:
 1. Python submission broken on YARN
 2. The license issue in MLlib [now fixed].
 3. Varargs broken for Java Dataframes [now fixed]

 Re: Corey - yeah, as it stands now I try to wait if there are things
 that look like implicit -1 votes.

-- 
Marcelo




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Marcelo Vanzin
Hi Tom, are you using an sbt-built assembly by any chance? If so, take
a look at SPARK-5808.

I haven't had any problems with the maven-built assembly. Setting
SPARK_HOME on the executors is a workaround if you want to use the sbt
assembly.

On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
tgraves...@yahoo.com.invalid wrote:
 Trying to run pyspark on yarn in client mode with basic wordcount example I 
 see the following error when doing the collect:
 Error from python worker:
   /usr/bin/python: No module named sql
 PYTHONPATH was:
   /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jar
 java.io.EOFException
     at java.io.DataInputStream.readInt(DataInputStream.java:392)
     at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
     at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
     at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
     at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
     at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
     at org.apache.spark.scheduler.Task.run(Task.scala:64)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:722)
 any ideas on this?
 Tom

  On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell 
 pwend...@gmail.com wrote:


  Please vote on releasing the following candidate as Apache Spark version 
 1.3.0!

 The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2

 The release files, including signatures, digests, etc. can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc1/

 Release artifacts are signed with the following key:
 https://people.apache.org/keys/committer/pwendell.asc

 The staging repository for this release can be found at:
 https://repository.apache.org/content/repositories/orgapachespark-1069/

 The documentation corresponding to this release can be found at:
 http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/

 Please vote on releasing this package as Apache Spark 1.3.0!

 The vote is open until Saturday, February 21, at 08:03 UTC and passes
 if a majority of at least 3 +1 PMC votes are cast.

 [ ] +1 Release this package as Apache Spark 1.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see
 http://spark.apache.org/

 == How can I help test this release? ==
 If you are a Spark user, you can help us test this release by
 taking a Spark 1.2 workload and running on this release candidate,
 then reporting any regressions.

 == What justifies a -1 vote for this release? ==
 This vote is happening towards the end of the 1.3 QA period,
 so -1 votes should only occur for significant regressions from 1.2.1.
 Bugs already present in 1.2.X, minor regressions, or bugs related
 to new features will not block this release.

 - Patrick








-- 
Marcelo




[jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread shane knapp
good morning, developers!

TL;DR:

i will be installing anaconda and setting it in the system PATH so that
your python will default to 2.7, as well as it taking over management of
all of the sci-py packages.  this is potentially a big change, so i'll be
testing locally on my staging instance before deployment to the wide world.

deployment is *tentatively* next monday, march 2nd.

a little background:

the jenkins test infra is currently (and happily) managed by a set of tools
that allow me to set up and deploy new workers, manage their packages and
make sure that all spark and research projects can happily and successfully
build.

we're currently at the state where ~50 or so packages are installed and
configured on each worker.  this is getting a little cumbersome, as the
package-to-build dep tree is getting pretty large.

the biggest offender is the science-based python infrastructure.
 everything is blindly installed w/yum and pip, so it's hard to control
*exactly* what version of any given library is as compared to what's on a
dev's laptop.

the solution:

anaconda (https://store.continuum.io/cshop/anaconda/)!  everything is
centralized!  i can manage specific versions much easier!

what this means to you:

* python 2.7 will be the default system python.
* 2.6 will still be installed and available (/usr/bin/python or
/usr/bin/python/2.6)

what you need to do:
* install anaconda, have it update your PATH
* build locally and try to fix any bugs (for spark, this should just work)
* if you have problems, reach out to me and i'll see what i can do to help.
 if we can't get your stuff running under python2.7, we can default to 2.6
via a job config change.

what i will be doing:
* setting up anaconda on my staging instance and spot-testing a lot of
builds before deployment

please let me know if there are any issues/concerns...  i'll be posting
updates this week and will let everyone know if there are any changes to
the Plan[tm].

your friendly devops engineer,

shane


RE: StreamingContext textFileStream question

2015-02-23 Thread mkhaitman
Hi Jerry,

Thanks for the quick response! Looks like I'll need to come up with an
alternative solution in the meantime,  since I'd like to avoid the other
input streams + WAL approach. :)

Thanks again,
Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742p10745.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread shane knapp
On Mon, Feb 23, 2015 at 11:36 AM, Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 The first concern for Spark will probably be to ensure that we still build
 and test against Python 2.6, since that's the minimum version of Python we
 support.

 sounds good...  we can set up separate 2.6 builds on specific versions...
 this could allow you to easily differentiate between baseline and
latest and greatest if you wanted.  it'll have a little bit more
administrative overhead, due to more jobs needing configs, but offers more
flexibility.

let me know what you think.


 Otherwise this seems OK. We use numpy and other Python packages in
 PySpark, but I don't think we're pinned to any particular version of those
 packages.

 cool.  i'll start mucking about and let you guys know how it goes.

shane


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Michael Armbrust
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
 every combination of Hadoop and Hive versions, etc., can be supported, but
 even an example build from the Building Spark page isn't looking too good
 to me.


I would definitely expect this to build and we do actually test that for
each PR.  We don't yet run the tests for both versions of Hive and thus
unfortunately these do get out of sync.  Usually these are just problems
diff-ing golden output or cases where we have added a test that uses a
feature not available in hive 12.

Have you seen problems with using Hive 12 outside of these test failures?


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Mark Hamstra
Nothing that I can point to, so this may only be a problem in test scope.
I am looking at a problem where some UDFs that run with 0.12 fail with
0.13; but that problem is already present in Spark 1.2.x, so it's not a
blocking regression for 1.3.  (Very likely a HiveFunctionWrapper serde
problem, but I haven't run it to ground yet.)

On Mon, Feb 23, 2015 at 12:18 PM, Michael Armbrust mich...@databricks.com
wrote:

 On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
 every combination of Hadoop and Hive versions, etc., can be supported, but
 even an example build from the Building Spark page isn't looking too
 good
 to me.


 I would definitely expect this to build and we do actually test that for
 each PR.  We don't yet run the tests for both versions of Hive and thus
 unfortunately these do get out of sync.  Usually these are just problems
 diff-ing golden output or cases where we have added a test that uses a
 feature not available in hive 12.

 Have you seen problems with using Hive 12 outside of these test failures?



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Soumitra Kumar
+1 (non-binding)

For: https://issues.apache.org/jira/browse/SPARK-3660

. Docs OK
. Example code is good

-Soumitra.


On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin van...@cloudera.com
wrote:

 Hi Tom, are you using an sbt-built assembly by any chance? If so, take
 a look at SPARK-5808.

 I haven't had any problems with the maven-built assembly. Setting
 SPARK_HOME on the executors is a workaround if you want to use the sbt
 assembly.

 On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
 tgraves...@yahoo.com.invalid wrote:
  Trying to run pyspark on yarn in client mode with basic wordcount
 example I see the following error when doing the collect:
  Error from python worker:  /usr/bin/python: No module named
 sqlPYTHONPATH was:
 /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jarjava.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at
 org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
   at
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
   at
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
 org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
   at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:722)
  any ideas on this?
  Tom
 
   On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell 
 pwend...@gmail.com wrote:
 
 
   Please vote on releasing the following candidate as Apache Spark
 version 1.3.0!
 
  The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
 
  The release files, including signatures, digests, etc. can be found at:
  http://people.apache.org/~pwendell/spark-1.3.0-rc1/
 
  Release artifacts are signed with the following key:
  https://people.apache.org/keys/committer/pwendell.asc
 
  The staging repository for this release can be found at:
  https://repository.apache.org/content/repositories/orgapachespark-1069/
 
  The documentation corresponding to this release can be found at:
  http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
 
  Please vote on releasing this package as Apache Spark 1.3.0!
 
  The vote is open until Saturday, February 21, at 08:03 UTC and passes
  if a majority of at least 3 +1 PMC votes are cast.
 
  [ ] +1 Release this package as Apache Spark 1.3.0
  [ ] -1 Do not release this package because ...
 
  To learn more about Apache Spark, please see
  http://spark.apache.org/
 
  == How can I help test this release? ==
  If you are a Spark user, you can help us test this release by
  taking a Spark 1.2 workload and running on this release candidate,
  then reporting any regressions.
 
  == What justifies a -1 vote for this release? ==
  This vote is happening towards the end of the 1.3 QA period,
  so -1 votes should only occur for significant regressions from 1.2.1.
  Bugs already present in 1.2.X, minor regressions, or bugs related
  to new features will not block this release.
 
  - Patrick
 
 
 
 
 



 --
 Marcelo





Re: Spark SQL, Hive Parquet data types

2015-02-23 Thread Cheng Lian

Ah, sorry for not being clear enough.

So now in Spark 1.3.0, we have two Parquet support implementations: the
old one is tightly coupled with the Spark SQL framework, while the new
one is based on the data sources API. In both versions, we try to intercept
operations over Parquet tables registered in the metastore when possible,
for better performance (mainly filter push-down optimization and extra
metadata for more accurate schema inference). The distinctions are:

1. For the old version (set |spark.sql.parquet.useDataSourceApi| to |false|):

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” the read path. Namely, whenever you query a Parquet table
   registered in the metastore, we're using our own Parquet implementation.

   For the write path, we fall back to the default Hive SerDe implementation
   (namely Spark SQL's |InsertIntoHiveTable| operator).

2. For the new data source version (set |spark.sql.parquet.useDataSourceApi|
   to |true|, which is the default value in master and branch-1.3):

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” both the read and the write path, but if you're writing to a
   partitioned table, we still fall back to the default Hive SerDe
   implementation.

For Spark 1.2.0, only case 1 applies. Spark 1.2.0 also has a Parquet data
source, but it's not enabled if you're not using the data sources API
specific DDL (|CREATE TEMPORARY TABLE table-name USING data-source|).
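As a concrete illustration of the two flags named above, a minimal sketch (the table name and local master setting are made up for the example):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object ParquetPathSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("parquet-path-sketch").setMaster("local[*]"))
        val hive = new HiveContext(sc)

        // Let Spark SQL intercept metastore Parquet tables instead of using the Hive SerDe.
        hive.setConf("spark.sql.hive.convertMetastoreParquet", "true")

        // false = old, Spark-SQL-internal Parquet support (only the read path is hijacked);
        // true  = new data-sources-API-based support (default in master / branch-1.3),
        //         which hijacks reads and non-partitioned writes.
        hive.setConf("spark.sql.parquet.useDataSourceApi", "true")

        // Reads of a metastore Parquet table now go through Spark SQL's Parquet support;
        // writes to a *partitioned* table still fall back to the Hive SerDe path.
        hive.sql("SELECT COUNT(*) FROM some_parquet_table").collect().foreach(println)
      }
    }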


Cheng

On 2/23/15 10:05 PM, The Watcher wrote:


Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
own Parquet support to read partitioned Parquet tables declared in Hive
metastore. Only writing to partitioned tables is not covered yet. These
improvements will be included in Spark 1.3.0.

Just created SPARK-5948 to track writing to partitioned Parquet tables.


Ok, this is still a little confusing.

Since I am able in 1.2.0 to write to a partitioned Hive table by registering
my SchemaRDD and calling INSERT INTO the Hive partitioned table SELECT from
the registered table, what is the write path in this case? Full Hive with a
SparkSQL-Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?

Thanks




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Tathagata Das
Hey all,

I found a major issue where JobProgressListener (a listener used to keep
track of jobs for the web UI) never forgets stages in one of its data
structures. This is a blocker for long running applications.
https://issues.apache.org/jira/browse/SPARK-5967

I am testing a fix for this right now.

TD

On Mon, Feb 23, 2015 at 7:23 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:

 +1 (non-binding)

 For: https://issues.apache.org/jira/browse/SPARK-3660

 . Docs OK
 . Example code is good

 -Soumitra.


 On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin van...@cloudera.com
 wrote:

  Hi Tom, are you using an sbt-built assembly by any chance? If so, take
  a look at SPARK-5808.
 
  I haven't had any problems with the maven-built assembly. Setting
  SPARK_HOME on the executors is a workaround if you want to use the sbt
  assembly.
 
  On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
  tgraves...@yahoo.com.invalid wrote:
   Trying to run pyspark on yarn in client mode with basic wordcount
  example I see the following error when doing the collect:
   Error from python worker:  /usr/bin/python: No module named
  sqlPYTHONPATH was:
 
 /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jarjava.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
  at
 
 org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at
 
 org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at
 
 org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
at
 org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
  org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)at
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
   any ideas on this?
   Tom
  
On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell 
  pwend...@gmail.com wrote:
  
  
Please vote on releasing the following candidate as Apache Spark
  version 1.3.0!
  
   The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
  
 
 https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
  
   The release files, including signatures, digests, etc. can be found at:
   http://people.apache.org/~pwendell/spark-1.3.0-rc1/
  
   Release artifacts are signed with the following key:
   https://people.apache.org/keys/committer/pwendell.asc
  
   The staging repository for this release can be found at:
  
 https://repository.apache.org/content/repositories/orgapachespark-1069/
  
   The documentation corresponding to this release can be found at:
   http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
  
   Please vote on releasing this package as Apache Spark 1.3.0!
  
   The vote is open until Saturday, February 21, at 08:03 UTC and passes
   if a majority of at least 3 +1 PMC votes are cast.
  
   [ ] +1 Release this package as Apache Spark 1.3.0
   [ ] -1 Do not release this package because ...
  
   To learn more about Apache Spark, please see
   http://spark.apache.org/
  
   == How can I help test this release? ==
   If you are a Spark user, you can help us test this release by
   taking a Spark 1.2 workload and running on this release candidate,
   then reporting any regressions.
  
   == What justifies a -1 vote for this release? ==
   This vote is happening towards the end of the 1.3 QA period,
   so -1 votes should only occur for significant regressions from 1.2.1.
   Bugs already present in 1.2.X, minor regressions, or bugs related
   to new features will not block this release.
  
   - Patrick
  
  
  
  
  
 
 
 
  --
  Marcelo
 
 
 



Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Cheng Lian
My bad, I had once fixed all Hive 12 test failures in PR #4107, but didn't 
get time to get it merged.


Considering the release is close, I can cherry-pick those Hive 12 fixes 
from #4107 and open a more surgical PR soon.


Cheng

On 2/24/15 4:18 AM, Michael Armbrust wrote:

On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra m...@clearstorydata.com
wrote:


So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
every combination of Hadoop and Hive versions, etc., can be supported, but
even an example build from the Building Spark page isn't looking too good
to me.


I would definitely expect this to build and we do actually test that for
each PR.  We don't yet run the tests for both versions of Hive and thus
unfortunately these do get out of sync.  Usually these are just problems
diff-ing golden output or cases where we have added a test that uses a
feature not available in hive 12.

Have you seen problems with using Hive 12 outside of these test failures?



