Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Robin East
Running ec2 launch scripts gives me the following error:

ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL 
routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Full stack trace at
https://gist.github.com/insidedctm/4d41600bc22560540a26

I’m running OSX Mavericks 10.9.5

I’ll investigate further but wondered if anyone else has run into this.

Robin

Re: Spark SQL, Hive & Parquet data types

2015-02-23 Thread Cheng Lian
Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses 
its own Parquet support to read partitioned Parquet tables declared in 
Hive metastore. Only writing to partitioned tables is not covered yet. 
These improvements will be included in Spark 1.3.0.


Just created SPARK-5948 to track writing to partitioned Parquet tables.

Cheng

On 2/20/15 10:58 PM, The Watcher wrote:


1. In Spark 1.3.0, timestamp support was added; also, Spark SQL uses
its own Parquet support to handle both the read path and the write path when
dealing with Parquet tables declared in the Hive metastore, as long as you’re
not writing to a partitioned table. So yes, you can.

Ah, I had missed the part about being partitioned or not. Is this related
to the work being done on ParquetRelation2?

We will indeed write to a partitioned table: does neither the read nor the
write path go through Spark SQL's Parquet support in that case? Is there a
JIRA/PR I can monitor to see when this would change?

Thanks







Re: Spark SQL - Long running job

2015-02-23 Thread Cheng Lian
I meant using |saveAsParquetFile|. As for partition number, you can 
always control it with |spark.sql.shuffle.partitions| property.
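For readers of the archive, here is a minimal sketch of the suggestion above, using the Spark 1.2/1.3-era SchemaRDD API. The input path, query, output path and partition count are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SaveAsParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-as-parquet-sketch"))
    val sqlContext = new SQLContext(sc)

    // Control how many partitions shuffles (joins, aggregations) produce before the
    // write, which in turn bounds the number of Parquet files created.
    sqlContext.setConf("spark.sql.shuffle.partitions", "64")

    val input = sqlContext.jsonFile("hdfs:///data/events.json") // hypothetical input
    input.registerTempTable("events")
    val processedSchemaRdd = sqlContext.sql(
      "SELECT id, COUNT(*) AS cnt FROM events GROUP BY id")

    // Persist both data and schema to Parquet so a restarted driver can reload them.
    processedSchemaRdd.saveAsParquetFile("hdfs:///tmp/processed.parquet")

    // After a driver restart:
    val reloaded = sqlContext.parquetFile("hdfs:///tmp/processed.parquet")
    reloaded.registerTempTable("processed")
  }
}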


Cheng

On 2/23/15 1:38 PM, nitin wrote:


I believe calling processedSchemaRdd.persist(DISK) and
processedSchemaRdd.checkpoint() only persists the data; I will lose all the
RDD metadata, and when I restart my driver that data is more or less useless to
me (correct me if I am wrong).

I thought of doing processedSchemaRdd.saveAsParquetFile (to an HDFS file system),
but I fear that if my "HDFS block size" > "partition file size", I will
get more partitions when reading back than the original schemaRDD had.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Long-running-job-tp10717p10727.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.



Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-23 Thread Josh Devins
Thanks for the pointer Peter, that change will indeed fix this bug and
it looks like it will make it into the upcoming 1.3.0 release.

@Evan, for reference, completeness and posterity:

> Just to be clear - you're currently calling .persist() before you pass data 
> to LogisticRegressionWithLBFGS?

No. I added persist in GeneralizedLinearAlgorithm right before the
`data` RDD goes into optimizer (LBFGS in our case). See here:
https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204
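For context, a minimal user-level sketch of the caching pattern being discussed (paths are hypothetical): the user persists the input RDD, while the scaled RDD derived inside GeneralizedLinearAlgorithm was, prior to the fix in PR #4593, not cached before being handed to the iterative optimizer.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.storage.StorageLevel

object LBFGSCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lbfgs-caching-sketch"))

    // Roughly the shape described below: ~30M examples, ~2.5M-dimensional sparse vectors.
    val training = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
      .persist(StorageLevel.MEMORY_AND_DISK)

    // GLA internally re-maps `training` for feature scaling and intercept handling;
    // that derived RDD is what goes into the LBFGS optimizer.
    val model = new LogisticRegressionWithLBFGS().run(training)

    training.unpersist()
    println(s"Learned a model with ${model.weights.size} weights")
  }
}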

> Also - can you give some parameters about the problem/cluster size you're 
> solving this on? How much memory per node? How big are n and d, what is its 
> sparsity (if any) and how many iterations are you running for? Is 0:45 the 
> per-iteration time or total time for some number of iterations?

The feature vectors are very sparse (a few hundred non-zero entries) but 2.5M in
dimension. The dataset is about 30M examples to learn from. 16 machines, 64GB
memory, 32 cores each.

Josh


On 17 February 2015 at 17:31, Peter Rudenko  wrote:
> It's fixed today: https://github.com/apache/spark/pull/4593
>
> Thanks,
> Peter Rudenko
>
> On 2015-02-17 18:25, Evan R. Sparks wrote:
>>
>> Josh - thanks for the detailed write up - this seems a little funny to me.
>> I agree that with the current code path there is more work being done than
>> needs to be (e.g. the features are re-scaled at every iteration), but the
>> relatively costly process of fitting the StandardScaler should not be
>> re-done at each iteration. Instead, at each iteration, all points are
>> re-scaled according to the pre-computed standard deviations in the
>> StandardScalerModel, and then an intercept is appended.
>>
>> Just to be clear - you're currently calling .persist() before you pass
>> data
>> to LogisticRegressionWithLBFGS?
>>
>> Also - can you give some parameters about the problem/cluster size you're
>> solving this on? How much memory per node? How big are n and d, what is
>> its
>> sparsity (if any) and how many iterations are you running for? Is 0:45 the
>> per-iteration time or total time for some number of iterations?
>>
>> A useful test might be to call GeneralizedLinearAlgorithm
>> useFeatureScaling
>> set to false (and maybe also addIntercept set to false) on persisted data,
>> and see if you see the same performance wins. If that's the case we've
>> isolated the issue and can start profiling to see where all the time is
>> going.
>>
>> It would be great if you can open a JIRA.
>>
>> Thanks!
>>
>>
>>
>> On Tue, Feb 17, 2015 at 6:36 AM, Josh Devins  wrote:
>>
>>> Cross-posting as I got no response on the users mailing list last
>>> week. Any response would be appreciated :)
>>>
>>> Josh
>>>
>>>
>>> -- Forwarded message --
>>> From: Josh Devins 
>>> Date: 9 February 2015 at 15:59
>>> Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
>>> To: "u...@spark.apache.org" 
>>>
>>>
>>> I've been looking into a performance problem when using
>>> LogisticRegressionWithLBFGS (and in turn GeneralizedLinearAlgorithm).
>>> Here's an outline of what I've figured out so far and it would be
>>> great to get some confirmation of the problem, some input on how
>>> wide-spread this problem might be and any ideas on a nice way to fix
>>> this.
>>>
>>> Context:
>>> - I will reference `branch-1.1` as we are currently on v1.1.1 however
>>> this appears to still be a problem on `master`
>>> - The cluster is run on YARN, on bare-metal hardware (no VMs)
>>> - I've not filed a Jira issue yet but can do so
>>> - This problem affects all algorithms based on
>>> GeneralizedLinearAlgorithm (GLA) that use feature scaling (and less so
>>> when not, but still a problem) (e.g. LogisticRegressionWithLBFGS)
>>>
>>> Problem Outline:
>>> - Starting at GLA line 177
>>> (
>>>
>>> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177
>>> ),
>>> a feature scaler is created using the `input` RDD
>>> - Refer next to line 186 which then maps over the `input` RDD and
>>> produces a new `data` RDD
>>> (
>>>
>>> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186
>>> )
>>> - If you are using feature scaling or adding intercepts, the user
>>> `input` RDD has been mapped over *after* the user has persisted it
>>> (hopefully) and *before* going into the (iterative) optimizer on line
>>> 204 (
>>>
>>> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204
>>> )
>>> - Since the RDD `data` that is iterated over in the optimizer is
>>> unpersisted, when we are running the cost function in the optimizer
>>> (e.g. LBFGS --
>>>
>>> https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198
>>> ),
>>> the map

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
This vote was supposed to close on Saturday but it looks like no PMCs voted
(other than the implicit vote from Patrick). Was there a discussion offline
to cut an RC2? Was the vote extended?

On Mon, Feb 23, 2015 at 6:59 AM, Robin East  wrote:

> Running ec2 launch scripts gives me the following error:
>
> ssl.SSLError: [Errno 1] _ssl.c:504: error:14090086:SSL
> routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
>
> Full stack trace at
> https://gist.github.com/insidedctm/4d41600bc22560540a26
>
> I’m running OSX Mavericks 10.9.5
>
> I’ll investigate further but wondered if anyone else has run into this.
>
> Robin


Re: Spark SQL, Hive & Parquet data types

2015-02-23 Thread The Watcher
>
> Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
> own Parquet support to read partitioned Parquet tables declared in Hive
> metastore. Only writing to partitioned tables is not covered yet. These
> improvements will be included in Spark 1.3.0.
>
> Just created SPARK-5948 to track writing to partitioned Parquet tables.
>
Ok, this is still a little confusing.

Since I am able in 1.2.0 to write to a partitioned Hive table by registering my
SchemaRDD and calling INSERT INTO "the Hive partitioned table" SELECT "the
registered", what is the write path in this case? Full Hive with a
SparkSQL<->Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?
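For readers following along, a small sketch of the 1.2.0 usage pattern being asked about (table, column and path names are hypothetical), which, per Cheng's reply later in this thread, is executed through Spark SQL's Hive write path (the InsertIntoHiveTable operator) rather than the native Parquet writer:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object InsertIntoPartitionedHiveTableSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("insert-into-partitioned-sketch"))
    val hiveContext = new HiveContext(sc)

    // Stage some data as a temporary table (hypothetical source).
    val staged = hiveContext.jsonFile("hdfs:///incoming/events.json")
    staged.registerTempTable("staged_events")

    // Write into an existing partitioned Hive table; in 1.2.0 this statement goes
    // through the Hive SerDe write path, not Spark SQL's own Parquet writer.
    hiveContext.sql(
      """INSERT INTO TABLE events_partitioned PARTITION (dt = '2015-02-23')
        |SELECT id, payload FROM staged_events""".stripMargin)
  }
}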

Thanks


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Sean Owen
Yes my understanding from Patrick's comment is that this RC will not
be released, but, to keep testing. There's an implicit -1 out of the
gates there, I believe, and so the vote won't pass, so perhaps that's
why there weren't further binding votes. I'm sure that will be
formalized shortly.

FWIW here are 10 issues still listed as blockers for 1.3.0:

SPARK-5910 DataFrame.selectExpr("col as newName") does not work
SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
SPARK-5873 Can't see partially analyzed plans
SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
SPARK-5517 SPARK-5166 Add input types for Java UDFs
SPARK-5463 Fix Parquet filter push-down
SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
SPARK-5183 SPARK-5180 Document data source API
SPARK-3650 Triangle Count handles reverse edges incorrectly
SPARK-3511 Create a RELEASE-NOTES.txt file in the repo


On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet  wrote:
> This vote was supposed to close on Saturday but it looks like no PMCs voted
> (other than the implicit vote from Patrick). Was there a discussion offline
> to cut an RC2? Was the vote extended?




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
Thanks Sean. I glossed over the comment about SPARK-5669.

On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen  wrote:

> Yes my understanding from Patrick's comment is that this RC will not
> be released, but, to keep testing. There's an implicit -1 out of the
> gates there, I believe, and so the vote won't pass, so perhaps that's
> why there weren't further binding votes. I'm sure that will be
> formalized shortly.
>
> FWIW here are 10 issues still listed as blockers for 1.3.0:
>
> SPARK-5910 DataFrame.selectExpr("col as newName") does not work
> SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
> SPARK-5873 Can't see partially analyzed plans
> SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
> SPARK-5517 SPARK-5166 Add input types for Java UDFs
> SPARK-5463 Fix Parquet filter push-down
> SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
> SPARK-5183 SPARK-5180 Document data source API
> SPARK-3650 Triangle Count handles reverse edges incorrectly
> SPARK-3511 Create a RELEASE-NOTES.txt file in the repo
>
>
> On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet  wrote:
> > This vote was supposed to close on Saturday but it looks like no PMCs
> voted
> > (other than the implicit vote from Patrick). Was there a discussion
> offline
> > to cut an RC2? Was the vote extended?
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
So actually, the list of blockers on JIRA is a bit outdated. These
days I won't cut RC1 unless there are no known issues that I'm aware
of that would actually block the release (that's what the snapshot
ones are for). I'm going to clean those up and push others to do so
also.

The main issues I'm aware of that came about post RC1 are:
1. Python submission broken on YARN
2. The license issue in MLlib [now fixed].
3. Varargs broken for Java Dataframes [now fixed]

Re: Corey - yeah, as it stands now I try to wait if there are things
that look like implicit -1 votes.

On Mon, Feb 23, 2015 at 6:13 AM, Corey Nolet  wrote:
> Thanks Sean. I glossed over the comment about SPARK-5669.
>
> On Mon, Feb 23, 2015 at 9:05 AM, Sean Owen  wrote:
>>
>> Yes my understanding from Patrick's comment is that this RC will not
>> be released, but, to keep testing. There's an implicit -1 out of the
>> gates there, I believe, and so the vote won't pass, so perhaps that's
>> why there weren't further binding votes. I'm sure that will be
>> formalized shortly.
>>
>> FWIW here are 10 issues still listed as blockers for 1.3.0:
>>
>> SPARK-5910 DataFrame.selectExpr("col as newName") does not work
>> SPARK-5904 SPARK-5166 DataFrame methods with varargs do not work in Java
>> SPARK-5873 Can't see partially analyzed plans
>> SPARK-5546 Improve path to Kafka assembly when trying Kafka Python API
>> SPARK-5517 SPARK-5166 Add input types for Java UDFs
>> SPARK-5463 Fix Parquet filter push-down
>> SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3
>> SPARK-5183 SPARK-5180 Document data source API
>> SPARK-3650 Triangle Count handles reverse edges incorrectly
>> SPARK-3511 Create a RELEASE-NOTES.txt file in the repo
>>
>>
>> On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet  wrote:
>> > This vote was supposed to close on Saturday but it looks like no PMCs
>> > voted
>> > (other than the implicit vote from Patrick). Was there a discussion
>> > offline
>> > to cut an RC2? Was the vote extended?
>
>




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Patrick Wendell
It's only been reported on this thread by Tom, so far.

On Mon, Feb 23, 2015 at 10:29 AM, Marcelo Vanzin  wrote:
> Hey Patrick,
>
> Do you have a link to the bug related to Python and Yarn? I looked at
> the blockers in Jira but couldn't find it.
>
> On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell  wrote:
>> So actually, the list of blockers on JIRA is a bit outdated. These
>> days I won't cut RC1 unless there are no known issues that I'm aware
>> of that would actually block the release (that's what the snapshot
>> ones are for). I'm going to clean those up and push others to do so
>> also.
>>
>> The main issues I'm aware of that came about post RC1 are:
>> 1. Python submission broken on YARN
>> 2. The license issue in MLlib [now fixed].
>> 3. Varargs broken for Java Dataframes [now fixed]
>>
>> Re: Corey - yeah, as it stands now I try to wait if there are things
>> that look like implicit -1 votes.
>
> --
> Marcelo




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Marcelo Vanzin
Hey Patrick,

Do you have a link to the bug related to Python and Yarn? I looked at
the blockers in Jira but couldn't find it.

On Mon, Feb 23, 2015 at 10:18 AM, Patrick Wendell  wrote:
> So actually, the list of blockers on JIRA is a bit outdated. These
> days I won't cut RC1 unless there are no known issues that I'm aware
> of that would actually block the release (that's what the snapshot
> ones are for). I'm going to clean those up and push others to do so
> also.
>
> The main issues I'm aware of that came about post RC1 are:
> 1. Python submission broken on YARN
> 2. The license issue in MLlib [now fixed].
> 3. Varargs broken for Java Dataframes [now fixed]
>
> Re: Corey - yeah, as it stands now I try to wait if there are things
> that look like implicit -1 votes.

-- 
Marcelo




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Marcelo Vanzin
Hi Tom, are you using an sbt-built assembly by any chance? If so, take
a look at SPARK-5808.

I haven't had any problems with the maven-built assembly. Setting
SPARK_HOME on the executors is a workaround if you want to use the sbt
assembly.

On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
 wrote:
> Trying to run pyspark on yarn in client mode with basic wordcount example I 
> see the following error when doing the collect:
> Error from python worker:
>   /usr/bin/python: No module named sql
> PYTHONPATH was:
>   /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jar
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
>         at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
>         at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>         at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
>         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:722)
> any ideas on this?
> Tom
>
>  On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell 
>  wrote:
>
>
>  Please vote on releasing the following candidate as Apache Spark version 
> 1.3.0!
>
> The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1069/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
>
> Please vote on releasing this package as Apache Spark 1.3.0!
>
> The vote is open until Saturday, February 21, at 08:03 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> == How can I help test this release? ==
> If you are a Spark user, you can help us test this release by
> taking a Spark 1.2 workload and running on this release candidate,
> then reporting any regressions.
>
> == What justifies a -1 vote for this release? ==
> This vote is happening towards the end of the 1.3 QA period,
> so -1 votes should only occur for significant regressions from 1.2.1.
> Bugs already present in 1.2.X, minor regressions, or bugs related
> to new features will not block this release.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>



-- 
Marcelo




StreamingContext textFileStream question

2015-02-23 Thread mkhaitman
Hello,

I was interested in creating a StreamingContext textFileStream-based job, which
runs for long durations and can also recover from prolonged driver
failure... It seems like StreamingContext checkpointing is mainly used for
the case when the driver dies during the processing of an RDD, and to
recover that one RDD, but my question specifically relates to whether there
is a way to also recover the files that were missed in the window between the
driver dying and being started back up (whether manually or
automatically).

Any assistance/suggestions with this one would be greatly appreciated!

Thanks,
Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




[jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread shane knapp
good morning, developers!

TL;DR:

i will be installing anaconda and setting it in the system PATH so that
your python will default to 2.7, as well as it taking over management of
all of the sci-py packages.  this is potentially a big change, so i'll be
testing locally on my staging instance before deployment to the wide world.

deployment is *tentatively* next monday, march 2nd.

a little background:

the jenkins test infra is currently (and happily) managed by a set of tools
that allow me to set up and deploy new workers, manage their packages and
make sure that all spark and research projects can happily and successfully
build.

we're currently at the state where ~50 or so packages are installed and
configured on each worker.  this is getting a little cumbersome, as the
package-to-build dep tree is getting pretty large.

the biggest offender is the science-based python infrastructure.
 everything is blindly installed w/yum and pip, so it's hard to control
*exactly* what version of any given library is as compared to what's on a
dev's laptop.

the solution:

anaconda (https://store.continuum.io/cshop/anaconda/)!  everything is
centralized!  i can manage specific versions much easier!

what this means to you:

* python 2.7 will be the default system python.
* 2.6 will still be installed and available (/usr/bin/python or
/usr/bin/python/2.6)

what you need to do:
* install anaconda, have it update your PATH
* build locally and try to fix any bugs (for spark, this "should just work")
* if you have problems, reach out to me and i'll see what i can do to help.
 if we can't get your stuff running under python2.7, we can default to 2.6
via a job config change.

what i will be doing:
* setting up anaconda on my staging instance and spot-testing a lot of
builds before deployment

please let me know if there are any issues/concerns...  i'll be posting
updates this week and will let everyone know if there are any changes to
the Plan[tm].

your friendly devops engineer,

shane


RE: StreamingContext textFileStream question

2015-02-23 Thread Shao, Saisai
Hi Mark,

For input streams like the text file input stream, only RDDs can be recovered from 
the checkpoint, not missed files; if a file is missing, an exception will actually be 
raised. If you use HDFS, HDFS will guarantee no data loss since it keeps 3 
replicas. Otherwise, user logic has to guarantee that no file is deleted before recovery.

For receiver-based input streams, like the Kafka or socket input streams, a WAL 
(write-ahead log) mechanism can be enabled to store the received data as well as 
metadata, so data can be recovered after a failure.
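As an illustration of the driver-recovery pattern under discussion, a minimal sketch (directory paths and the batch interval are hypothetical) using StreamingContext.getOrCreate with a checkpoint directory; as noted above, files that arrive while the driver is down are not replayed by textFileStream itself.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object TextFileStreamRecoverySketch {
  val checkpointDir = "hdfs:///checkpoints/textFileStreamJob"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("textFileStream-recovery-sketch")
    val ssc = new StreamingContext(conf, Seconds(60))
    ssc.checkpoint(checkpointDir)

    // Only files that appear in this directory while the context is running are
    // picked up; in-flight batch metadata is recovered from the checkpoint.
    val lines = ssc.textFileStream("hdfs:///incoming/")
    lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, rebuild the context from the checkpoint if one exists.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}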

Thanks
Jerry

-Original Message-
From: mkhaitman [mailto:mark.khait...@chango.com] 
Sent: Monday, February 23, 2015 10:54 AM
To: dev@spark.apache.org
Subject: StreamingContext textFileStream question

Hello,

I was interested in creating a StreamingContext textFileStream based job, which 
runs for long durations, and can also recover from prolonged driver failure... 
It seems like StreamingContext checkpointing is mainly used for the case when 
the driver dies during the processing of an RDD, and to recover that one RDD, 
but my question specifically relates to whether there is a way to also recover 
which files were missed between the timeframe of the driver dying and being 
started back up (whether manually or automatically).

Any assistance/suggestions with this one would be greatly appreciated!

Thanks,
Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




RE: StreamingContext textFileStream question

2015-02-23 Thread mkhaitman
Hi Jerry,

Thanks for the quick response! Looks like I'll need to come up with an
alternative solution in the meantime,  since I'd like to avoid the other
input streams + WAL approach. :)

Thanks again,
Mark.



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/StreamingContext-textFileStream-question-tp10742p10745.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread Nicholas Chammas
The first concern for Spark will probably be to ensure that we still build
and test against Python 2.6, since that's the minimum version of Python we
support.

Otherwise this seems OK. We use numpy and other Python packages in PySpark,
but I don't think we're pinned to any particular version of those packages.

Nick

On Mon Feb 23 2015 at 2:15:19 PM shane knapp  wrote:

> good morning, developers!
>
> TL;DR:
>
> i will be installing anaconda and setting it in the system PATH so that
> your python will default to 2.7, as well as it taking over management of
> all of the sci-py packages.  this is potentially a big change, so i'll be
> testing locally on my staging instance before deployment to the wide world.
>
> deployment is *tentatively* next monday, march 2nd.
>
> a little background:
>
> the jenkins test infra is currently (and happily) managed by a set of tools
> that allow me to set up and deploy new workers, manage their packages and
> make sure that all spark and research projects can happily and successfully
> build.
>
> we're currently at the state where ~50 or so packages are installed and
> configured on each worker.  this is getting a little cumbersome, as the
> package-to-build dep tree is getting pretty large.
>
> the biggest offender is the science-based python infrastructure.
>  everything is blindly installed w/yum and pip, so it's hard to control
> *exactly* what version of any given library is as compared to what's on a
> dev's laptop.
>
> the solution:
>
> anaconda (https://store.continuum.io/cshop/anaconda/)!  everything is
> centralized!  i can manage specific versions much easier!
>
> what this means to you:
>
> * python 2.7 will be the default system python.
> * 2.6 will still be installed and available (/usr/bin/python or
> /usr/bin/python/2.6)
>
> what you need to do:
> * install anaconda, have it update your PATH
> * build locally and try to fix any bugs (for spark, this "should just
> work")
> * if you have problems, reach out to me and i'll see what i can do to help.
>  if we can't get your stuff running under python2.7, we can default to 2.6
> via a job config change.
>
> what i will be doing:
> * setting up anaconda on my staging instance and spot-testing a lot of
> builds before deployment
>
> please let me know if there are any issues/concerns...  i'll be posting
> updates this week and will let everyone know if there are any changes to
> the Plan[tm].
>
> your friendly devops engineer,
>
> shane
>


Re: [jenkins infra -- pls read ] installing anaconda, moving default python from 2.6 -> 2.7

2015-02-23 Thread shane knapp
On Mon, Feb 23, 2015 at 11:36 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> The first concern for Spark will probably be to ensure that we still build
> and test against Python 2.6, since that's the minimum version of Python we
> support.
>
sounds good...  we can set up separate 2.6 builds on specific versions...
this could allow you to easily differentiate between "baseline" and
"latest and greatest" if you wanted.  it'll have a little bit more
administrative overhead, due to more jobs needing configs, but offers more
flexibility.

let me know what you think.


> Otherwise this seems OK. We use numpy and other Python packages in
> PySpark, but I don't think we're pinned to any particular version of those
> packages.
>
cool.  i'll start mucking about and let you guys know how it goes.

shane


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Michael Armbrust
On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra 
wrote:

> So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
> every combination of Hadoop and Hive versions, etc., can be supported, but
> even an example build from the "Building Spark" page isn't looking too good
> to me.
>

I would definitely expect this to build and we do actually test that for
each PR.  We don't yet run the tests for both versions of Hive and thus
unfortunately these do get out of sync.  Usually these are just problems
diff-ing golden output or cases where we have added a test that uses a
feature not available in hive 12.

Have you seen problems with using Hive 12 outside of these test failures?


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Mark Hamstra
Nothing that I can point to, so this may only be a problem in test scope.
I am looking at a problem where some UDFs that run with 0.12 fail with
0.13; but that problem is already present in Spark 1.2.x, so it's not a
blocking regression for 1.3.  (Very likely a HiveFunctionWrapper serde
problem, but I haven't run it to ground yet.)

On Mon, Feb 23, 2015 at 12:18 PM, Michael Armbrust 
wrote:

> On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra 
> wrote:
>
>> So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
>> every combination of Hadoop and Hive versions, etc., can be supported, but
>> even an example build from the "Building Spark" page isn't looking too
>> good
>> to me.
>>
>
> I would definitely expect this to build and we do actually test that for
> each PR.  We don't yet run the tests for both versions of Hive and thus
> unfortunately these do get out of sync.  Usually these are just problems
> diff-ing golden output or cases where we have added a test that uses a
> feature not available in hive 12.
>
> Have you seen problems with using Hive 12 outside of these test failures?
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Cheng Lian
My bad, I had once fixed all the Hive 12 test failures in PR #4107, but didn't 
get time to get it merged.


Considering the release is close, I can cherry-pick those Hive 12 fixes 
from #4107 and open a more surgical PR soon.


Cheng

On 2/24/15 4:18 AM, Michael Armbrust wrote:

On Sun, Feb 22, 2015 at 11:20 PM, Mark Hamstra 
wrote:


So what are we expecting of Hive 0.12.0 builds with this RC?  I know not
every combination of Hadoop and Hive versions, etc., can be supported, but
even an example build from the "Building Spark" page isn't looking too good
to me.


I would definitely expect this to build and we do actually test that for
each PR.  We don't yet run the tests for both versions of Hive and thus
unfortunately these do get out of sync.  Usually these are just problems
diff-ing golden output or cases where we have added a test that uses a
feature not available in hive 12.

Have you seen problems with using Hive 12 outside of these test failures?







Re: Spark SQL, Hive & Parquet data types

2015-02-23 Thread Cheng Lian

Ah, sorry for not being clear enough.

So now in Spark 1.3.0, we have two Parquet support implementations: the 
old one is tightly coupled with the Spark SQL framework, while the new 
one is based on the data sources API. In both versions, we try to intercept 
operations over Parquet tables registered in metastore when possible for 
better performance (mainly filter push-down optimization and extra 
metadata for more accurate schema inference). The distinctions are:


1.

   For the old version (set |spark.sql.parquet.useDataSourceApi| to |false|)

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” the read path. Namely whenever you query a Parquet table
   registered in metastore, we’re using our own Parquet implementation.

   For the write path, we fall back to the default Hive SerDe implementation
   (namely Spark SQL’s |InsertIntoHiveTable| operator).

2.

   For the new data source version (set
   |spark.sql.parquet.useDataSourceApi| to |true|, which is the default
   value in master and branch-1.3)

   When |spark.sql.hive.convertMetastoreParquet| is set to |true|, we
   “hijack” both the read and write paths, but if you’re writing to a
   partitioned table, we still fall back to the default Hive SerDe
   implementation.

For Spark 1.2.0, only item 1 applies. Spark 1.2.0 also has a Parquet data 
source, but it’s not enabled unless you’re using the data sources API 
specific DDL (|CREATE TEMPORARY TABLE  USING |).
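As an editorial illustration, a minimal sketch of toggling the two flags described above (Spark 1.3-era HiveContext API; the table name is hypothetical), purely to show which code path gets selected:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object ParquetPathFlagsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-path-flags-sketch"))
    val hiveContext = new HiveContext(sc)

    // Use Spark SQL's own Parquet support for metastore Parquet tables (read path,
    // and in 1.3 also the non-partitioned write path).
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

    // Select between the old implementation (false) and the new data-sources-API
    // based ParquetRelation2 (true, the default in master and branch-1.3).
    hiveContext.setConf("spark.sql.parquet.useDataSourceApi", "true")

    hiveContext.sql("SELECT COUNT(*) FROM some_parquet_table").collect().foreach(println)
  }
}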


Cheng

On 2/23/15 10:05 PM, The Watcher wrote:


Yes, recently we improved ParquetRelation2 quite a bit. Spark SQL uses its
own Parquet support to read partitioned Parquet tables declared in Hive
metastore. Only writing to partitioned tables is not covered yet. These
improvements will be included in Spark 1.3.0.

Just created SPARK-5948 to track writing to partitioned Parquet tables.


Ok, this is still a little confusing.

Since I am able in 1.2.0 to write to a partitioned Hive table by registering my
SchemaRDD and calling INSERT INTO "the Hive partitioned table" SELECT "the
registered", what is the write path in this case? Full Hive with a
SparkSQL<->Hive bridge?
If that were the case, why wouldn't SKEWED ON be honored (see another
thread I opened)?

Thanks




Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Soumitra Kumar
+1 (non-binding)

For: https://issues.apache.org/jira/browse/SPARK-3660

. Docs OK
. Example code is good

-Soumitra.


On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin 
wrote:

> Hi Tom, are you using an sbt-built assembly by any chance? If so, take
> a look at SPARK-5808.
>
> I haven't had any problems with the maven-built assembly. Setting
> SPARK_HOME on the executors is a workaround if you want to use the sbt
> assembly.
>
> On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
>  wrote:
> > Trying to run pyspark on yarn in client mode with basic wordcount
> example I see the following error when doing the collect:
> > Error from python worker:  /usr/bin/python: No module named
> sqlPYTHONPATH was:
> /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jarjava.io.EOFException
>   at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
>   at
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
>   at
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
>   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
> org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>   at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:64)at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
>   at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> > any ideas on this?
> > Tom
> >
> >  On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell <
> pwend...@gmail.com> wrote:
> >
> >
> >  Please vote on releasing the following candidate as Apache Spark
> version 1.3.0!
> >
> > The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.0-rc1/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1069/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
> >
> > Please vote on releasing this package as Apache Spark 1.3.0!
> >
> > The vote is open until Saturday, February 21, at 08:03 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.3.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > == How can I help test this release? ==
> > If you are a Spark user, you can help us test this release by
> > taking a Spark 1.2 workload and running on this release candidate,
> > then reporting any regressions.
> >
> > == What justifies a -1 vote for this release? ==
> > This vote is happening towards the end of the 1.3 QA period,
> > so -1 votes should only occur for significant regressions from 1.2.1.
> > Bugs already present in 1.2.X, minor regressions, or bugs related
> > to new features will not block this release.
> >
> > - Patrick
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
> >
> >
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Tathagata Das
Hey all,

I found a major issue where JobProgressListener (a listener used to keep
track of jobs for the web UI) never forgets stages in one of its data
structures. This is a blocker for long running applications.
https://issues.apache.org/jira/browse/SPARK-5967

I am testing a fix for this right now.

TD

On Mon, Feb 23, 2015 at 7:23 PM, Soumitra Kumar 
wrote:

> +1 (non-binding)
>
> For: https://issues.apache.org/jira/browse/SPARK-3660
>
> . Docs OK
> . Example code is good
>
> -Soumitra.
>
>
> On Mon, Feb 23, 2015 at 10:33 AM, Marcelo Vanzin 
> wrote:
>
> > Hi Tom, are you using an sbt-built assembly by any chance? If so, take
> > a look at SPARK-5808.
> >
> > I haven't had any problems with the maven-built assembly. Setting
> > SPARK_HOME on the executors is a workaround if you want to use the sbt
> > assembly.
> >
> > On Fri, Feb 20, 2015 at 2:56 PM, Tom Graves
> >  wrote:
> > > Trying to run pyspark on yarn in client mode with basic wordcount
> > example I see the following error when doing the collect:
> > > Error from python worker:  /usr/bin/python: No module named
> > sqlPYTHONPATH was:
> >
> /grid/3/tmp/yarn-local/usercache/tgraves/filecache/20/spark-assembly-1.3.0-hadoop2.6.0.1.1411101121.jarjava.io.EOFException
> >   at java.io.DataInputStream.readInt(DataInputStream.java:392)
> > at
> >
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
> >   at
> >
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> >   at
> >
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> >   at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:105)
> >   at
> org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:69)
> >   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
> > org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:308)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)at
> >
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> >   at
> >
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> >   at org.apache.spark.scheduler.Task.run(Task.scala:64)at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> >   at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >   at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >   at java.lang.Thread.run(Thread.java:722)
> > > any ideas on this?
> > > Tom
> > >
> > >  On Wednesday, February 18, 2015 2:14 AM, Patrick Wendell <
> > pwend...@gmail.com> wrote:
> > >
> > >
> > >  Please vote on releasing the following candidate as Apache Spark
> > version 1.3.0!
> > >
> > > The tag to be voted on is v1.3.0-rc1 (commit f97b0d4a):
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=f97b0d4a6b26504916816d7aefcf3132cd1da6c2
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-1.3.0-rc1/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/pwendell.asc
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1069/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~pwendell/spark-1.3.0-rc1-docs/
> > >
> > > Please vote on releasing this package as Apache Spark 1.3.0!
> > >
> > > The vote is open until Saturday, February 21, at 08:03 UTC and passes
> > > if a majority of at least 3 +1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Spark 1.3.0
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see
> > > http://spark.apache.org/
> > >
> > > == How can I help test this release? ==
> > > If you are a Spark user, you can help us test this release by
> > > taking a Spark 1.2 workload and running on this release candidate,
> > > then reporting any regressions.
> > >
> > > == What justifies a -1 vote for this release? ==
> > > This vote is happening towards the end of the 1.3 QA period,
> > > so -1 votes should only occur for significant regressions from 1.2.1.
> > > Bugs already present in 1.2.X, minor regressions, or bugs related
> > > to new features will not block this release.
> > >
> > > - Patrick
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > >
> > >
> > >
> > >
> >
> >
> >
> > --
> > Marcelo
> >
> > --