Re: LinearRegressionWithSGD accuracy

2015-01-28 Thread DB Tsai
Hi Robin,

You can try this PR out. It has built-in feature scaling and ElasticNet
regularization (an L1/L2 mix), and the implementation converges stably to the
same model as R's glmnet package.

https://github.com/apache/spark/pull/4259
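
For reference, a rough sketch of the intended usage. The exact API may still
change before the PR is merged, so treat the class and setter names below as
illustrative:

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setMaxIter(100)
  .setRegParam(0.01)         // overall regularization strength
  .setElasticNetParam(0.5)   // 0.0 = pure L2 (ridge), 1.0 = pure L1 (lasso)
val model = lr.fit(training) // training: DataFrame with "label" and "features" columns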

Sincerely,

DB Tsai
---
Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai



On Thu, Jan 15, 2015 at 9:42 AM, Robin East  wrote:
> -dev, +user
>
> You’ll need to set the gradient descent step size to something small - a bit 
> of trial and error shows that 0.0001 works.
>
> You’ll need to create a LinearRegressionWithSGD instance and set the step 
> size explicitly:
>
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> val model = lr.run(parsedData)
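>
> A fuller sketch (illustrative only), scaling the features and then training
> and evaluating on the same scaled RDD (reusing parsedData from the code
> quoted below):
>
> import org.apache.spark.mllib.feature.StandardScaler
> import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
>
> val scaler = new StandardScaler(withMean = true, withStd = true)
>   .fit(parsedData.map(_.features))
> val scaledData = parsedData
>   .map(p => LabeledPoint(p.label, scaler.transform(p.features)))
>   .cache()
>
> val lr = new LinearRegressionWithSGD()
> lr.optimizer.setStepSize(0.0001)
> lr.optimizer.setNumIterations(100)
> val model = lr.run(scaledData)
>
> val mse = scaledData
>   .map(p => math.pow(p.label - model.predict(p.features), 2))
>   .mean()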
>
> On 15 Jan 2015, at 16:46, devl.development  wrote:
>
>> From what I gather, you use LinearRegressionWithSGD to predict y or the
>> response variable given a feature vector x.
>>
>> In a simple example I used a perfectly linear dataset such that x=y
>> y,x
>> 1,1
>> 2,2
>> ...
>>
>> 1,1
>>
>> Using the out-of-box example from the website (with and without scaling):
>>
>> val data = sc.textFile(file)
>>
>> val parsedData = data.map { line =>
>>   val parts = line.split(',')
>>   LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
>> }
>>
>> val scaler = new StandardScaler(withMean = true, withStd = true)
>>   .fit(parsedData.map(x => x.features))
>> val scaledData = parsedData.map(x =>
>>   LabeledPoint(x.label, scaler.transform(Vectors.dense(x.features.toArray))))
>>
>> // Building the model
>> val numIterations = 100
>> val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>>
>> // Evaluate model on training examples and compute training error
>> // (tried using both scaledData and parsedData)
>> val valuesAndPreds = scaledData.map { point =>
>>   val prediction = model.predict(point.features)
>>   (point.label, prediction)
>> }
>> val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
>> println("training Mean Squared Error = " + MSE)
>>
>> Both scaled and unscaled attempts give:
>>
>> training Mean Squared Error = NaN
>>
>> I've even tried x, y + sample noise (drawn from a normal with mean 0 and
>> stddev 1); it still comes up with the same thing.
>>
>> Is this not supposed to work for x and y or 2 dimensional plots? Is there
>> something I'm missing or wrong in the code above? Or is there a limitation
>> in the method?
>>
>> Thanks for any advice.
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
>> Sent from the Apache Spark Developers List mailing list archive at 
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: emergency jenkins restart soon

2015-01-28 Thread shane knapp
np!  the master builds haven't triggered yet, but let's give the rube
goldberg machine a minute to get its bearings.

On Wed, Jan 28, 2015 at 10:31 PM, Reynold Xin  wrote:

> Thanks for doing that, Shane!
>
>
> On Wed, Jan 28, 2015 at 10:29 PM, shane knapp  wrote:
>
>> jenkins is back up and all builds have been retriggered...  things are
>> building and looking good, and i'll keep an eye on the spark master builds
>> tonite and tomorrow.
>>
>> On Wed, Jan 28, 2015 at 9:56 PM, shane knapp  wrote:
>>
>> > the spark master builds stopped triggering ~yesterday and the logs don't
>> > show anything.  i'm going to give the current batch of spark pull
>> request
>> > builder jobs a little more time (~30 mins) to finish, then kill
>> whatever is
>> > left and restart jenkins.  anything that was queued or killed will be
>> > retriggered once jenkins is back up.
>> >
>> > sorry for the inconvenience, we'll get this sorted asap.
>> >
>> > thanks,
>> >
>> > shane
>> >
>>
>
>


Re: emergency jenkins restart soon

2015-01-28 Thread Reynold Xin
Thanks for doing that, Shane!


On Wed, Jan 28, 2015 at 10:29 PM, shane knapp  wrote:

> jenkins is back up and all builds have been retriggered...  things are
> building and looking good, and i'll keep an eye on the spark master builds
> tonite and tomorrow.
>
> On Wed, Jan 28, 2015 at 9:56 PM, shane knapp  wrote:
>
> > the spark master builds stopped triggering ~yesterday and the logs don't
> > show anything.  i'm going to give the current batch of spark pull request
> > builder jobs a little more time (~30 mins) to finish, then kill whatever
> is
> > left and restart jenkins.  anything that was queued or killed will be
> > retriggered once jenkins is back up.
> >
> > sorry for the inconvenience, we'll get this sorted asap.
> >
> > thanks,
> >
> > shane
> >
>


Re: emergency jenkins restart soon

2015-01-28 Thread shane knapp
jenkins is back up and all builds have been retriggered...  things are
building and looking good, and i'll keep an eye on the spark master builds
tonite and tomorrow.

On Wed, Jan 28, 2015 at 9:56 PM, shane knapp  wrote:

> the spark master builds stopped triggering ~yesterday and the logs don't
> show anything.  i'm going to give the current batch of spark pull request
> builder jobs a little more time (~30 mins) to finish, then kill whatever is
> left and restart jenkins.  anything that was queued or killed will be
> retriggered once jenkins is back up.
>
> sorry for the inconvenience, we'll get this sorted asap.
>
> thanks,
>
> shane
>


emergency jenkins restart soon

2015-01-28 Thread shane knapp
the spark master builds stopped triggering ~yesterday and the logs don't
show anything.  i'm going to give the current batch of spark pull request
builder jobs a little more time (~30 mins) to finish, then kill whatever is
left and restart jenkins.  anything that was queued or killed will be
retriggered once jenkins is back up.

sorry for the inconvenience, we'll get this sorted asap.

thanks,

shane


Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan R. Sparks
You've got to be a little bit careful here. "NA" in systems like R or
pandas may have special meaning that is distinct from "null".

See, e.g. http://www.r-bloggers.com/r-na-vs-null/



On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin  wrote:

> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
> wrote:
>
> > I believe that most DataFrame implementations out there, like Pandas,
> > supports the idea of missing values / NA, and some support the idea of
> > Not Meaningful as well.
> >
> > Does Row support anything like that?  That is important for certain
> > applications.  I thought that Row worked by being a mutable object,
> > but haven't looked into the details in a while.
> >
> > -Evan
> >
> > On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
> wrote:
> > > It shouldn't change the data source api at all because data sources
> > create
> > > RDD[Row], and that gets converted into a DataFrame automatically
> > (previously
> > > to SchemaRDD).
> > >
> > >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> > >
> > > One thing that will break the data source API in 1.3 is the location of
> > > types. Types were previously defined in sql.catalyst.types, and now
> > moved to
> > > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
> > APIs
> > > have first class classes/objects defined in sql directly.
> > >
> > >
> > >
> > > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
> > wrote:
> > >>
> > >> Hey guys,
> > >>
> > >> How does this impact the data sources API?  I was planning on using
> > >> this for a project.
> > >>
> > >> +1 that many things from spark-sql / DataFrame is universally
> > >> desirable and useful.
> > >>
> > >> By the way, one thing that prevents the columnar compression stuff in
> > >> Spark SQL from being more useful is, at least from previous talks with
> > >> Reynold and Michael et al., that the format was not designed for
> > >> persistence.
> > >>
> > >> I have a new project that aims to change that.  It is a
> > >> zero-serialisation, high performance binary vector library, designed
> > >> from the outset to be a persistent storage friendly.  May be one day
> > >> it can replace the Spark SQL columnar compression.
> > >>
> > >> Michael told me this would be a lot of work, and recreates parts of
> > >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> > >>
> > >> -Evan
> > >>
> > >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
> > wrote:
> > >> > Alright I have merged the patch (
> > >> > https://github.com/apache/spark/pull/4173
> > >> > ) since I don't see any strong opinions against it (as a matter of
> > fact
> > >> > most were for it). We can still change it if somebody lays out a
> > strong
> > >> > argument.
> > >> >
> > >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> > >> > 
> > >> > wrote:
> > >> >
> > >> >> The type alias means your methods can specify either type and they
> > will
> > >> >> work. It's just another name for the same type. But Scaladocs and
> > such
> > >> >> will
> > >> >> show DataFrame as the type.
> > >> >>
> > >> >> Matei
> > >> >>
> > >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> > >> >> dirceu.semigh...@gmail.com> wrote:
> > >> >> >
> > >> >> > Reynold,
> > >> >> > But with type alias we will have the same problem, right?
> > >> >> > If the methods doesn't receive schemardd anymore, we will have to
> > >> >> > change
> > >> >> > our code to migrade from schema to dataframe. Unless we have an
> > >> >> > implicit
> > >> >> > conversion between DataFrame and SchemaRDD
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> > >> >> >
> > >> >> >> Dirceu,
> > >> >> >>
> > >> >> >> That is not possible because one cannot overload return types.
> > >> >> >>
> > >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> > some
> > >> >> type,
> > >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> > >> >> >>
> > >> >> >> In 1.3, we will create a type alias for DataFrame called
> SchemaRDD
> > >> >> >> to
> > >> >> not
> > >> >> >> break source compatibility for Scala.
> > >> >> >>
> > >> >> >>
> > >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> > >> >> >> dirceu.semigh...@gmail.com> wrote:
> > >> >> >>
> > >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> > removed
> > >> >> >>> in
> > >> >> the
> > >> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> > >> >> DataFrame?
> > >> >> >>> With this, we don't impact in existing code for the next few
> > >> >> >>> releases.
> > >> >> >>>
> > >> >> >>>
> > >> >> >>>
> > >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta  >:
> > >> >> >>>
> > >> >>  I want to address the issue that Matei raised about the heavy
> > >> >>  lifting
> > >> >>  required for a full SQL support. It is amazing that even after
> > 30
> > >> >> years
> 

Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Patrick Wendell
It's maintained here:

https://github.com/pwendell/akka/tree/2.2.3-shaded-proto

Over time, this is something that would be great to get rid of, per rxin

On Wed, Jan 28, 2015 at 3:33 PM, Reynold Xin  wrote:
> Hopefully problems like this will go away entirely in the next couple of
> releases. https://issues.apache.org/jira/browse/SPARK-5293
>
>
>
> On Wed, Jan 28, 2015 at 3:12 PM, jay vyas 
> wrote:
>
>> Hi spark. Where is akka coming from in spark ?
>>
>> I see the distribution referenced is a spark artifact... but not in the
>> apache namespace.
>>
>>  <groupId>org.spark-project.akka</groupId>
>>  <version>2.3.4-spark</version>
>>
>> Clearly this is a deliberate thought out change (See SPARK-1812), but its
>> not clear where 2.3.4 spark is coming from and who is maintaining its
>> release?
>>
>> --
>> jay vyas
>>
>> PS
>>
>> I've had some conversations with will benton as well about this, and its
>> clear that some modifications to akka are needed, or else a protobug error
>> occurs, which amount to serialization incompatibilities, hence if one wants
>> to build spark from sources, the patched akka is required (or else, manual
>> patching needs to be done)...
>>
>> 15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread
>> [sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down
>> ActorSystem [sparkWorker] java.lang.VerifyError: class
>> akka.remote.WireFormats$AkkaControlMessage overrides final method
>> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Michael Armbrust
In particular the performance tricks are in SpecificMutableRow.
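
A much-simplified illustration of the general trick (not the actual
SpecificMutableRow code): each column is backed by an unboxed primitive plus a
null flag, so representing a SQL NULL never requires boxing an Int.

final class MutableIntField {
  var value: Int = 0
  var isNull: Boolean = true                       // starts out as SQL NULL
  def set(v: Int): Unit = { value = v; isNull = false }
  def setNull(): Unit = { isNull = true }
}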

On Wed, Jan 28, 2015 at 5:49 PM, Evan Chan  wrote:

> Yeah, it's "null".   I was worried you couldn't represent it in Row
> because of primitive types like Int (unless you box the Int, which
> would be a performance hit).  Anyways, I'll take another look at the
> Row API again  :-p
>
> On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin  wrote:
> > Isn't that just "null" in SQL?
> >
> > On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan 
> wrote:
> >>
> >> I believe that most DataFrame implementations out there, like Pandas,
> >> supports the idea of missing values / NA, and some support the idea of
> >> Not Meaningful as well.
> >>
> >> Does Row support anything like that?  That is important for certain
> >> applications.  I thought that Row worked by being a mutable object,
> >> but haven't looked into the details in a while.
> >>
> >> -Evan
> >>
> >> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin 
> wrote:
> >> > It shouldn't change the data source api at all because data sources
> >> > create
> >> > RDD[Row], and that gets converted into a DataFrame automatically
> >> > (previously
> >> > to SchemaRDD).
> >> >
> >> >
> >> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >> >
> >> > One thing that will break the data source API in 1.3 is the location
> of
> >> > types. Types were previously defined in sql.catalyst.types, and now
> >> > moved to
> >> > sql.types. After 1.3, sql.catalyst is hidden from users, and all
> public
> >> > APIs
> >> > have first class classes/objects defined in sql directly.
> >> >
> >> >
> >> >
> >> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
> >> > wrote:
> >> >>
> >> >> Hey guys,
> >> >>
> >> >> How does this impact the data sources API?  I was planning on using
> >> >> this for a project.
> >> >>
> >> >> +1 that many things from spark-sql / DataFrame is universally
> >> >> desirable and useful.
> >> >>
> >> >> By the way, one thing that prevents the columnar compression stuff in
> >> >> Spark SQL from being more useful is, at least from previous talks
> with
> >> >> Reynold and Michael et al., that the format was not designed for
> >> >> persistence.
> >> >>
> >> >> I have a new project that aims to change that.  It is a
> >> >> zero-serialisation, high performance binary vector library, designed
> >> >> from the outset to be a persistent storage friendly.  May be one day
> >> >> it can replace the Spark SQL columnar compression.
> >> >>
> >> >> Michael told me this would be a lot of work, and recreates parts of
> >> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> >> >>
> >> >> -Evan
> >> >>
> >> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
> >> >> wrote:
> >> >> > Alright I have merged the patch (
> >> >> > https://github.com/apache/spark/pull/4173
> >> >> > ) since I don't see any strong opinions against it (as a matter of
> >> >> > fact
> >> >> > most were for it). We can still change it if somebody lays out a
> >> >> > strong
> >> >> > argument.
> >> >> >
> >> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> >> > 
> >> >> > wrote:
> >> >> >
> >> >> >> The type alias means your methods can specify either type and they
> >> >> >> will
> >> >> >> work. It's just another name for the same type. But Scaladocs and
> >> >> >> such
> >> >> >> will
> >> >> >> show DataFrame as the type.
> >> >> >>
> >> >> >> Matei
> >> >> >>
> >> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >> >> >
> >> >> >> > Reynold,
> >> >> >> > But with type alias we will have the same problem, right?
> >> >> >> > If the methods doesn't receive schemardd anymore, we will have
> to
> >> >> >> > change
> >> >> >> > our code to migrade from schema to dataframe. Unless we have an
> >> >> >> > implicit
> >> >> >> > conversion between DataFrame and SchemaRDD
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> >> >> >> >
> >> >> >> >> Dirceu,
> >> >> >> >>
> >> >> >> >> That is not possible because one cannot overload return types.
> >> >> >> >>
> >> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> >> >> >> >> some
> >> >> >> type,
> >> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >> >> >>
> >> >> >> >> In 1.3, we will create a type alias for DataFrame called
> >> >> >> >> SchemaRDD
> >> >> >> >> to
> >> >> >> not
> >> >> >> >> break source compatibility for Scala.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >> >> >>
> >> >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> >> >> >> >>> removed
> >> >> >> >>> in
> >> >> >> the
> >> >> >> >>> release 1.5(+/- 1)  for example, and the new code been added
> to
> >> >> >> DataFrame?
> >> >> >> >>> With this, we don't impact in e

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
Yeah, it's "null".   I was worried you couldn't represent it in Row
because of primitive types like Int (unless you box the Int, which
would be a performance hit).  Anyways, I'll take another look at the
Row API again  :-p

On Wed, Jan 28, 2015 at 4:42 PM, Reynold Xin  wrote:
> Isn't that just "null" in SQL?
>
> On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan  wrote:
>>
>> I believe that most DataFrame implementations out there, like Pandas,
>> supports the idea of missing values / NA, and some support the idea of
>> Not Meaningful as well.
>>
>> Does Row support anything like that?  That is important for certain
>> applications.  I thought that Row worked by being a mutable object,
>> but haven't looked into the details in a while.
>>
>> -Evan
>>
>> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin  wrote:
>> > It shouldn't change the data source api at all because data sources
>> > create
>> > RDD[Row], and that gets converted into a DataFrame automatically
>> > (previously
>> > to SchemaRDD).
>> >
>> >
>> > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>> >
>> > One thing that will break the data source API in 1.3 is the location of
>> > types. Types were previously defined in sql.catalyst.types, and now
>> > moved to
>> > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
>> > APIs
>> > have first class classes/objects defined in sql directly.
>> >
>> >
>> >
>> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
>> > wrote:
>> >>
>> >> Hey guys,
>> >>
>> >> How does this impact the data sources API?  I was planning on using
>> >> this for a project.
>> >>
>> >> +1 that many things from spark-sql / DataFrame is universally
>> >> desirable and useful.
>> >>
>> >> By the way, one thing that prevents the columnar compression stuff in
>> >> Spark SQL from being more useful is, at least from previous talks with
>> >> Reynold and Michael et al., that the format was not designed for
>> >> persistence.
>> >>
>> >> I have a new project that aims to change that.  It is a
>> >> zero-serialisation, high performance binary vector library, designed
>> >> from the outset to be a persistent storage friendly.  May be one day
>> >> it can replace the Spark SQL columnar compression.
>> >>
>> >> Michael told me this would be a lot of work, and recreates parts of
>> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
>> >>
>> >> -Evan
>> >>
>> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
>> >> wrote:
>> >> > Alright I have merged the patch (
>> >> > https://github.com/apache/spark/pull/4173
>> >> > ) since I don't see any strong opinions against it (as a matter of
>> >> > fact
>> >> > most were for it). We can still change it if somebody lays out a
>> >> > strong
>> >> > argument.
>> >> >
>> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> >> > 
>> >> > wrote:
>> >> >
>> >> >> The type alias means your methods can specify either type and they
>> >> >> will
>> >> >> work. It's just another name for the same type. But Scaladocs and
>> >> >> such
>> >> >> will
>> >> >> show DataFrame as the type.
>> >> >>
>> >> >> Matei
>> >> >>
>> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> >> >> dirceu.semigh...@gmail.com> wrote:
>> >> >> >
>> >> >> > Reynold,
>> >> >> > But with type alias we will have the same problem, right?
>> >> >> > If the methods doesn't receive schemardd anymore, we will have to
>> >> >> > change
>> >> >> > our code to migrade from schema to dataframe. Unless we have an
>> >> >> > implicit
>> >> >> > conversion between DataFrame and SchemaRDD
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
>> >> >> >
>> >> >> >> Dirceu,
>> >> >> >>
>> >> >> >> That is not possible because one cannot overload return types.
>> >> >> >>
>> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
>> >> >> >> some
>> >> >> type,
>> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
>> >> >> >>
>> >> >> >> In 1.3, we will create a type alias for DataFrame called
>> >> >> >> SchemaRDD
>> >> >> >> to
>> >> >> not
>> >> >> >> break source compatibility for Scala.
>> >> >> >>
>> >> >> >>
>> >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> >> >> dirceu.semigh...@gmail.com> wrote:
>> >> >> >>
>> >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
>> >> >> >>> removed
>> >> >> >>> in
>> >> >> the
>> >> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> >> >> DataFrame?
>> >> >> >>> With this, we don't impact in existing code for the next few
>> >> >> >>> releases.
>> >> >> >>>
>> >> >> >>>
>> >> >> >>>
>> >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
>> >> >> >>>
>> >> >>  I want to address the issue that Matei raised about the heavy
>> >> >>  lifting
>> >> >>  required for a full SQL support. It is amazing that even after
>> >> >>  30
>> >> >> years
>> >> >> >>> of
>> >> >>

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Reynold Xin
Isn't that just "null" in SQL?

On Wed, Jan 28, 2015 at 4:41 PM, Evan Chan  wrote:

> I believe that most DataFrame implementations out there, like Pandas,
> supports the idea of missing values / NA, and some support the idea of
> Not Meaningful as well.
>
> Does Row support anything like that?  That is important for certain
> applications.  I thought that Row worked by being a mutable object,
> but haven't looked into the details in a while.
>
> -Evan
>
> On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin  wrote:
> > It shouldn't change the data source api at all because data sources
> create
> > RDD[Row], and that gets converted into a DataFrame automatically
> (previously
> > to SchemaRDD).
> >
> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
> >
> > One thing that will break the data source API in 1.3 is the location of
> > types. Types were previously defined in sql.catalyst.types, and now
> moved to
> > sql.types. After 1.3, sql.catalyst is hidden from users, and all public
> APIs
> > have first class classes/objects defined in sql directly.
> >
> >
> >
> > On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan 
> wrote:
> >>
> >> Hey guys,
> >>
> >> How does this impact the data sources API?  I was planning on using
> >> this for a project.
> >>
> >> +1 that many things from spark-sql / DataFrame is universally
> >> desirable and useful.
> >>
> >> By the way, one thing that prevents the columnar compression stuff in
> >> Spark SQL from being more useful is, at least from previous talks with
> >> Reynold and Michael et al., that the format was not designed for
> >> persistence.
> >>
> >> I have a new project that aims to change that.  It is a
> >> zero-serialisation, high performance binary vector library, designed
> >> from the outset to be a persistent storage friendly.  May be one day
> >> it can replace the Spark SQL columnar compression.
> >>
> >> Michael told me this would be a lot of work, and recreates parts of
> >> Parquet, but I think it's worth it.  LMK if you'd like more details.
> >>
> >> -Evan
> >>
> >> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin 
> wrote:
> >> > Alright I have merged the patch (
> >> > https://github.com/apache/spark/pull/4173
> >> > ) since I don't see any strong opinions against it (as a matter of
> fact
> >> > most were for it). We can still change it if somebody lays out a
> strong
> >> > argument.
> >> >
> >> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
> >> > 
> >> > wrote:
> >> >
> >> >> The type alias means your methods can specify either type and they
> will
> >> >> work. It's just another name for the same type. But Scaladocs and
> such
> >> >> will
> >> >> show DataFrame as the type.
> >> >>
> >> >> Matei
> >> >>
> >> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >> >
> >> >> > Reynold,
> >> >> > But with type alias we will have the same problem, right?
> >> >> > If the methods doesn't receive schemardd anymore, we will have to
> >> >> > change
> >> >> > our code to migrade from schema to dataframe. Unless we have an
> >> >> > implicit
> >> >> > conversion between DataFrame and SchemaRDD
> >> >> >
> >> >> >
> >> >> >
> >> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> >> >> >
> >> >> >> Dirceu,
> >> >> >>
> >> >> >> That is not possible because one cannot overload return types.
> >> >> >>
> >> >> >> SQLContext.parquetFile (and many other methods) needs to return
> some
> >> >> type,
> >> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >> >>
> >> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD
> >> >> >> to
> >> >> not
> >> >> >> break source compatibility for Scala.
> >> >> >>
> >> >> >>
> >> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >> >>
> >> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be
> removed
> >> >> >>> in
> >> >> the
> >> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> >> >> DataFrame?
> >> >> >>> With this, we don't impact in existing code for the next few
> >> >> >>> releases.
> >> >> >>>
> >> >> >>>
> >> >> >>>
> >> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
> >> >> >>>
> >> >>  I want to address the issue that Matei raised about the heavy
> >> >>  lifting
> >> >>  required for a full SQL support. It is amazing that even after
> 30
> >> >> years
> >> >> >>> of
> >> >>  research there is not a single good open source columnar
> database
> >> >>  like
> >> >>  Vertica. There is a column store option in MySQL, but it is not
> >> >>  nearly
> >> >> >>> as
> >> >>  sophisticated as Vertica or MonetDB. But there's a true need for
> >> >>  such
> >> >> a
> >> >>  system. I wonder why so and it's high time to change that.
> >> >>  On Jan 26, 2015 5:47 PM, "Sandy Ryza" 
> >> >> wrote:
> >> >> 
> >> >> > Both SchemaRDD and

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
I believe that most DataFrame implementations out there, like Pandas,
support the idea of missing values / NA, and some support the idea of
Not Meaningful as well.

Does Row support anything like that?  That is important for certain
applications.  I thought that Row worked by being a mutable object,
but haven't looked into the details in a while.

-Evan

On Wed, Jan 28, 2015 at 4:23 PM, Reynold Xin  wrote:
> It shouldn't change the data source api at all because data sources create
> RDD[Row], and that gets converted into a DataFrame automatically (previously
> to SchemaRDD).
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala
>
> One thing that will break the data source API in 1.3 is the location of
> types. Types were previously defined in sql.catalyst.types, and now moved to
> sql.types. After 1.3, sql.catalyst is hidden from users, and all public APIs
> have first class classes/objects defined in sql directly.
>
>
>
> On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan  wrote:
>>
>> Hey guys,
>>
>> How does this impact the data sources API?  I was planning on using
>> this for a project.
>>
>> +1 that many things from spark-sql / DataFrame is universally
>> desirable and useful.
>>
>> By the way, one thing that prevents the columnar compression stuff in
>> Spark SQL from being more useful is, at least from previous talks with
>> Reynold and Michael et al., that the format was not designed for
>> persistence.
>>
>> I have a new project that aims to change that.  It is a
>> zero-serialisation, high performance binary vector library, designed
>> from the outset to be a persistent storage friendly.  May be one day
>> it can replace the Spark SQL columnar compression.
>>
>> Michael told me this would be a lot of work, and recreates parts of
>> Parquet, but I think it's worth it.  LMK if you'd like more details.
>>
>> -Evan
>>
>> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin  wrote:
>> > Alright I have merged the patch (
>> > https://github.com/apache/spark/pull/4173
>> > ) since I don't see any strong opinions against it (as a matter of fact
>> > most were for it). We can still change it if somebody lays out a strong
>> > argument.
>> >
>> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia
>> > 
>> > wrote:
>> >
>> >> The type alias means your methods can specify either type and they will
>> >> work. It's just another name for the same type. But Scaladocs and such
>> >> will
>> >> show DataFrame as the type.
>> >>
>> >> Matei
>> >>
>> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> >> dirceu.semigh...@gmail.com> wrote:
>> >> >
>> >> > Reynold,
>> >> > But with type alias we will have the same problem, right?
>> >> > If the methods doesn't receive schemardd anymore, we will have to
>> >> > change
>> >> > our code to migrade from schema to dataframe. Unless we have an
>> >> > implicit
>> >> > conversion between DataFrame and SchemaRDD
>> >> >
>> >> >
>> >> >
>> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
>> >> >
>> >> >> Dirceu,
>> >> >>
>> >> >> That is not possible because one cannot overload return types.
>> >> >>
>> >> >> SQLContext.parquetFile (and many other methods) needs to return some
>> >> type,
>> >> >> and that type cannot be both SchemaRDD and DataFrame.
>> >> >>
>> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD
>> >> >> to
>> >> not
>> >> >> break source compatibility for Scala.
>> >> >>
>> >> >>
>> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> >> dirceu.semigh...@gmail.com> wrote:
>> >> >>
>> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed
>> >> >>> in
>> >> the
>> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> >> DataFrame?
>> >> >>> With this, we don't impact in existing code for the next few
>> >> >>> releases.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
>> >> >>>
>> >>  I want to address the issue that Matei raised about the heavy
>> >>  lifting
>> >>  required for a full SQL support. It is amazing that even after 30
>> >> years
>> >> >>> of
>> >>  research there is not a single good open source columnar database
>> >>  like
>> >>  Vertica. There is a column store option in MySQL, but it is not
>> >>  nearly
>> >> >>> as
>> >>  sophisticated as Vertica or MonetDB. But there's a true need for
>> >>  such
>> >> a
>> >>  system. I wonder why so and it's high time to change that.
>> >>  On Jan 26, 2015 5:47 PM, "Sandy Ryza" 
>> >> wrote:
>> >> 
>> >> > Both SchemaRDD and DataFrame sound fine to me, though I like the
>> >> >>> former
>> >> > slightly better because it's more descriptive.
>> >> >
>> >> > Even if SchemaRDD's needs to rely on Spark SQL under the covers,
>> >> > it
>> >> >>> would
>> >> > be more clear from a user-facing perspective to at least choose a
>> >> >>> package
>> >> > name for it that o

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Reynold Xin
It shouldn't change the data source api at all because data sources create
RDD[Row], and that gets converted into a DataFrame automatically
(previously to SchemaRDD).

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala

One thing that will break the data source API in 1.3 is the location of
types. Types were previously defined in sql.catalyst.types, and now moved
to sql.types. After 1.3, sql.catalyst is hidden from users, and all public
APIs have first class classes/objects defined in sql directly.
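
For data source implementations the change is mostly a matter of updating
imports, roughly along these lines (illustrative):

// Spark 1.2:
//   import org.apache.spark.sql.catalyst.types.{StructType, StructField, StringType}
// Spark 1.3+:
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val schema = StructType(Seq(StructField("name", StringType, nullable = true)))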



On Wed, Jan 28, 2015 at 4:20 PM, Evan Chan  wrote:

> Hey guys,
>
> How does this impact the data sources API?  I was planning on using
> this for a project.
>
> +1 that many things from spark-sql / DataFrame is universally
> desirable and useful.
>
> By the way, one thing that prevents the columnar compression stuff in
> Spark SQL from being more useful is, at least from previous talks with
> Reynold and Michael et al., that the format was not designed for
> persistence.
>
> I have a new project that aims to change that.  It is a
> zero-serialisation, high performance binary vector library, designed
> from the outset to be a persistent storage friendly.  May be one day
> it can replace the Spark SQL columnar compression.
>
> Michael told me this would be a lot of work, and recreates parts of
> Parquet, but I think it's worth it.  LMK if you'd like more details.
>
> -Evan
>
> On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin  wrote:
> > Alright I have merged the patch (
> https://github.com/apache/spark/pull/4173
> > ) since I don't see any strong opinions against it (as a matter of fact
> > most were for it). We can still change it if somebody lays out a strong
> > argument.
> >
> > On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia  >
> > wrote:
> >
> >> The type alias means your methods can specify either type and they will
> >> work. It's just another name for the same type. But Scaladocs and such
> will
> >> show DataFrame as the type.
> >>
> >> Matei
> >>
> >> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
> >> dirceu.semigh...@gmail.com> wrote:
> >> >
> >> > Reynold,
> >> > But with type alias we will have the same problem, right?
> >> > If the methods doesn't receive schemardd anymore, we will have to
> change
> >> > our code to migrade from schema to dataframe. Unless we have an
> implicit
> >> > conversion between DataFrame and SchemaRDD
> >> >
> >> >
> >> >
> >> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
> >> >
> >> >> Dirceu,
> >> >>
> >> >> That is not possible because one cannot overload return types.
> >> >>
> >> >> SQLContext.parquetFile (and many other methods) needs to return some
> >> type,
> >> >> and that type cannot be both SchemaRDD and DataFrame.
> >> >>
> >> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
> >> not
> >> >> break source compatibility for Scala.
> >> >>
> >> >>
> >> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
> >> >> dirceu.semigh...@gmail.com> wrote:
> >> >>
> >> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed
> in
> >> the
> >> >>> release 1.5(+/- 1)  for example, and the new code been added to
> >> DataFrame?
> >> >>> With this, we don't impact in existing code for the next few
> releases.
> >> >>>
> >> >>>
> >> >>>
> >> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
> >> >>>
> >>  I want to address the issue that Matei raised about the heavy
> lifting
> >>  required for a full SQL support. It is amazing that even after 30
> >> years
> >> >>> of
> >>  research there is not a single good open source columnar database
> like
> >>  Vertica. There is a column store option in MySQL, but it is not
> nearly
> >> >>> as
> >>  sophisticated as Vertica or MonetDB. But there's a true need for
> such
> >> a
> >>  system. I wonder why so and it's high time to change that.
> >>  On Jan 26, 2015 5:47 PM, "Sandy Ryza" 
> >> wrote:
> >> 
> >> > Both SchemaRDD and DataFrame sound fine to me, though I like the
> >> >>> former
> >> > slightly better because it's more descriptive.
> >> >
> >> > Even if SchemaRDD's needs to rely on Spark SQL under the covers,
> it
> >> >>> would
> >> > be more clear from a user-facing perspective to at least choose a
> >> >>> package
> >> > name for it that omits "sql".
> >> >
> >> > I would also be in favor of adding a separate Spark Schema module
> for
> >>  Spark
> >> > SQL to rely on, but I imagine that might be too large a change at
> >> this
> >> > point?
> >> >
> >> > -Sandy
> >> >
> >> > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
> >> >>> matei.zaha...@gmail.com>
> >> > wrote:
> >> >
> >> >> (Actually when we designed Spark SQL we thought of giving it
> another
> >> > name,
> >> >> like Spark Schema, but we decided to stick with SQL since that
> was
> >> >>> the
> >> > most
> >> >> obvious use case

Re: renaming SchemaRDD -> DataFrame

2015-01-28 Thread Evan Chan
Hey guys,

How does this impact the data sources API?  I was planning on using
this for a project.

+1 that many things from spark-sql / DataFrame are universally
desirable and useful.

By the way, one thing that prevents the columnar compression stuff in
Spark SQL from being more useful is, at least from previous talks with
Reynold and Michael et al., that the format was not designed for
persistence.

I have a new project that aims to change that.  It is a
zero-serialisation, high-performance binary vector library, designed
from the outset to be persistent-storage friendly.  Maybe one day
it can replace the Spark SQL columnar compression.

Michael told me this would be a lot of work, and recreates parts of
Parquet, but I think it's worth it.  LMK if you'd like more details.

-Evan

On Tue, Jan 27, 2015 at 4:35 PM, Reynold Xin  wrote:
> Alright I have merged the patch ( https://github.com/apache/spark/pull/4173
> ) since I don't see any strong opinions against it (as a matter of fact
> most were for it). We can still change it if somebody lays out a strong
> argument.
>
> On Tue, Jan 27, 2015 at 12:25 PM, Matei Zaharia 
> wrote:
>
>> The type alias means your methods can specify either type and they will
>> work. It's just another name for the same type. But Scaladocs and such will
>> show DataFrame as the type.
>>
>> Matei
>>
>> > On Jan 27, 2015, at 12:10 PM, Dirceu Semighini Filho <
>> dirceu.semigh...@gmail.com> wrote:
>> >
>> > Reynold,
>> > But with type alias we will have the same problem, right?
>> > If the methods doesn't receive schemardd anymore, we will have to change
>> > our code to migrade from schema to dataframe. Unless we have an implicit
>> > conversion between DataFrame and SchemaRDD
>> >
>> >
>> >
>> > 2015-01-27 17:18 GMT-02:00 Reynold Xin :
>> >
>> >> Dirceu,
>> >>
>> >> That is not possible because one cannot overload return types.
>> >>
>> >> SQLContext.parquetFile (and many other methods) needs to return some
>> type,
>> >> and that type cannot be both SchemaRDD and DataFrame.
>> >>
>> >> In 1.3, we will create a type alias for DataFrame called SchemaRDD to
>> not
>> >> break source compatibility for Scala.
>> >>
>> >>
>> >> On Tue, Jan 27, 2015 at 6:28 AM, Dirceu Semighini Filho <
>> >> dirceu.semigh...@gmail.com> wrote:
>> >>
>> >>> Can't the SchemaRDD remain the same, but deprecated, and be removed in
>> the
>> >>> release 1.5(+/- 1)  for example, and the new code been added to
>> DataFrame?
>> >>> With this, we don't impact in existing code for the next few releases.
>> >>>
>> >>>
>> >>>
>> >>> 2015-01-27 0:02 GMT-02:00 Kushal Datta :
>> >>>
>>  I want to address the issue that Matei raised about the heavy lifting
>>  required for a full SQL support. It is amazing that even after 30
>> years
>> >>> of
>>  research there is not a single good open source columnar database like
>>  Vertica. There is a column store option in MySQL, but it is not nearly
>> >>> as
>>  sophisticated as Vertica or MonetDB. But there's a true need for such
>> a
>>  system. I wonder why so and it's high time to change that.
>>  On Jan 26, 2015 5:47 PM, "Sandy Ryza" 
>> wrote:
>> 
>> > Both SchemaRDD and DataFrame sound fine to me, though I like the
>> >>> former
>> > slightly better because it's more descriptive.
>> >
>> > Even if SchemaRDD's needs to rely on Spark SQL under the covers, it
>> >>> would
>> > be more clear from a user-facing perspective to at least choose a
>> >>> package
>> > name for it that omits "sql".
>> >
>> > I would also be in favor of adding a separate Spark Schema module for
>>  Spark
>> > SQL to rely on, but I imagine that might be too large a change at
>> this
>> > point?
>> >
>> > -Sandy
>> >
>> > On Mon, Jan 26, 2015 at 5:32 PM, Matei Zaharia <
>> >>> matei.zaha...@gmail.com>
>> > wrote:
>> >
>> >> (Actually when we designed Spark SQL we thought of giving it another
>> > name,
>> >> like Spark Schema, but we decided to stick with SQL since that was
>> >>> the
>> > most
>> >> obvious use case to many users.)
>> >>
>> >> Matei
>> >>
>> >>> On Jan 26, 2015, at 5:31 PM, Matei Zaharia <
>> >>> matei.zaha...@gmail.com>
>> >> wrote:
>> >>>
>> >>> While it might be possible to move this concept to Spark Core
>> > long-term,
>> >> supporting structured data efficiently does require quite a bit of
>> >>> the
>> >> infrastructure in Spark SQL, such as query planning and columnar
>>  storage.
>> >> The intent of Spark SQL though is to be more than a SQL server --
>> >>> it's
>> >> meant to be a library for manipulating structured data. Since this
>> >>> is
>> >> possible to build over the core API, it's pretty natural to
>> >>> organize it
>> >> that way, same as Spark Streaming is a library.
>> >>>
>> >>> Matei
>> >>>
>>  On Jan 26, 2015, at 4:26 PM, Koert Kuipers 
>>  wrote:
>> 

Re: spark akka fork : is the source anywhere?

2015-01-28 Thread Reynold Xin
Hopefully problems like this will go away entirely in the next couple of
releases. https://issues.apache.org/jira/browse/SPARK-5293



On Wed, Jan 28, 2015 at 3:12 PM, jay vyas 
wrote:

> Hi spark. Where is akka coming from in spark ?
>
> I see the distribution referenced is a spark artifact... but not in the
> apache namespace.
>
>  <groupId>org.spark-project.akka</groupId>
>  <version>2.3.4-spark</version>
>
> Clearly this is a deliberate thought out change (See SPARK-1812), but its
> not clear where 2.3.4 spark is coming from and who is maintaining its
> release?
>
> --
> jay vyas
>
> PS
>
> I've had some conversations with will benton as well about this, and its
> clear that some modifications to akka are needed, or else a protobug error
> occurs, which amount to serialization incompatibilities, hence if one wants
> to build spark from sources, the patched akka is required (or else, manual
> patching needs to be done)...
>
> 15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread
> [sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down
> ActorSystem [sparkWorker] java.lang.VerifyError: class
> akka.remote.WireFormats$AkkaControlMessage overrides final method
> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
>


spark akka fork : is the source anywhere?

2015-01-28 Thread jay vyas
Hi spark. Where is akka coming from in spark?

I see the distribution referenced is a spark artifact... but not in the
apache namespace.

 <groupId>org.spark-project.akka</groupId>
 <version>2.3.4-spark</version>

Clearly this is a deliberate, thought-out change (see SPARK-1812), but it's
not clear where 2.3.4-spark is coming from and who is maintaining its
release.

-- 
jay vyas

PS

I've had some conversations with will benton as well about this, and it's
clear that some modifications to akka are needed, or else a protobuf error
occurs, which amounts to a serialization incompatibility; hence if one wants
to build spark from sources, the patched akka is required (or else manual
patching needs to be done)...

15/01/28 22:58:10 ERROR ActorSystemImpl: Uncaught fatal error from thread
[sparkWorker-akka.remote.default-remote-dispatcher-6] shutting down
ActorSystem [sparkWorker] java.lang.VerifyError: class
akka.remote.WireFormats$AkkaControlMessage overrides final method
getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;


Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Krishna Sankar
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:22 min
 mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests
2. Tested pyspark and MLlib - ran them and compared results with 1.1.x & 1.2.0
2.1. statistics (min,max,mean,Pearson,Spearman) OK
2.2. Linear/Ridge/Lasso Regression OK
2.3. Decision Tree, Naive Bayes OK
2.4. KMeans OK
   Center And Scale OK
   Fixed : org.apache.spark.SparkException in zip !
2.5. rdd operations OK
  State of the Union Texts - MapReduce, Filter,sortByKey (word count)
2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
   Model evaluation/optimization (rank, numIter, lmbda) with itertools
OK

Cheers


On Wed, Jan 28, 2015 at 5:17 AM, Sean Owen  wrote:

> +1 (nonbinding). I verified that all the hash / signing items I
> mentioned before are resolved.
>
> The source package compiles on Ubuntu / Java 8. I ran tests and they
> passed. Well, actually I see the same failure I've been seeing locally on
> OS X and on Ubuntu for a while, but I think nobody else has seen this?
>
> MQTTStreamSuite:
> - mqtt input stream *** FAILED ***
>   org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in
> progress
>   at
> org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)
>
> Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
> is something perhaps related to my env that I haven't figured out yet,
> so should not be considered a blocker.
>
> On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell 
> wrote:
> > Please vote on releasing the following candidate as Apache Spark version
> 1.2.1!
> >
> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1062/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
> >
> > Changes from rc1:
> > This has no code changes from RC1. Only minor changes to the release
> script.
> >
> > Please vote on releasing this package as Apache Spark 1.2.1!
> >
> > The vote is open until  Saturday, January 31, at 10:04 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.1
> > [ ] -1 Do not release this package because ...
> >
> > For a list of fixes in this release, see http://s.apache.org/Mpn.
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Data source API | Support for dynamic schema

2015-01-28 Thread Reynold Xin
It's an interesting idea, but there are major challenges with per row
schema.

1. Performance - the query optimizer and execution engine use assumptions
about schema and data to generate optimized query plans. Having to re-reason
about the schema for each row can substantially slow down the engine, both
because it limits optimization and because of the overhead of schema
information associated with each row.

2. Data model: per-row schema is fundamentally a different data model. The
current relational model has gone through 40 years of research and has
very well defined semantics. I don't think there are well defined semantics
for a per-row schema data model. For example, what are the semantics of a
UDF that operates on a data cell with an incompatible schema?
Should we coerce or convert the data type? If yes, will that lead to
conflicting semantics with some other rules? We need to answer questions
like this in order to have a robust data model.
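
To make the UDF question concrete, a contrived sketch (not a proposed API) of
what a per-row schema forces on every operation:

// Each row carries its own schema, so even a trivial numeric UDF has to
// dispatch on the runtime type of the cell and pick some coercion rule.
case class DynRow(schema: Map[String, String], values: Map[String, Any])

def plusOne(row: DynRow): Any = row.values("age") match {
  case i: Int    => i + 1
  case l: Long   => l + 1
  case s: String => ???   // coerce to a number? fail? return null? Undefined today.
  case null      => null
}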





On Wed, Jan 28, 2015 at 11:26 AM, Cheng Lian  wrote:

> Hi Aniket,
>
> In general the schema of all rows in a single table must be same. This is
> a basic assumption made by Spark SQL. Schema union does make sense, and
> we're planning to support this for Parquet. But as you've mentioned, it
> doesn't help if types of different versions of a column differ from each
> other. Also, you need to reload the data source table after schema changes
> happen.
>
> Cheng
>
>
> On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:
>
>> I saw the talk on Spark data sources and looking at the interfaces, it
>> seems that the schema needs to be provided upfront. This works for many
>> data sources but I have a situation in which I would need to integrate a
>> system that supports schema evolutions by allowing users to change schema
>> without affecting existing rows. Basically, each row contains a schema
>> hint
>> (id and version) and this allows developers to evolve schema over time and
>> perform migration at will. Since the schema needs to be specified upfront
>> in the data source API, one possible way would be to build a union of all
>> schema versions and handle populating row values appropriately. This works
>> in case columns have been added or deleted in the schema but doesn't work
>> if types have changed. I was wondering if it is possible to change the API
>>   to provide schema for each row instead of expecting data source to
>> provide
>> schema upfront?
>>
>> Thanks,
>> Aniket
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Data source API | Support for dynamic schema

2015-01-28 Thread Cheng Lian

Hi Aniket,

In general the schema of all rows in a single table must be the same. This
is a basic assumption made by Spark SQL. Schema union does make sense,
and we're planning to support this for Parquet. But as you've mentioned,
it doesn't help if the types of different versions of a column differ from
each other. Also, you need to reload the data source table after schema
changes happen.
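
As a rough illustration of the kind of schema union meant here (hand-rolled,
not an existing Spark API), assuming the types of shared columns agree:

import org.apache.spark.sql.types.{StructField, StructType}

def unionSchemas(a: StructType, b: StructType): StructType = {
  // Keep a's fields and append the fields that only exist in b. Conflicting
  // types for a shared column name are exactly the case this does not handle.
  val extra = b.fields.filterNot(f => a.fieldNames.contains(f.name))
  StructType(a.fields ++ extra)
}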


Cheng

On 1/28/15 2:12 AM, Aniket Bhatnagar wrote:

I saw the talk on Spark data sources and looking at the interfaces, it
seems that the schema needs to be provided upfront. This works for many
data sources but I have a situation in which I would need to integrate a
system that supports schema evolutions by allowing users to change schema
without affecting existing rows. Basically, each row contains a schema hint
(id and version) and this allows developers to evolve schema over time and
perform migration at will. Since the schema needs to be specified upfront
in the data source API, one possible way would be to build a union of all
schema versions and handle populating row values appropriately. This works
in case columns have been added or deleted in the schema but doesn't work
if types have changed. I was wondering if it is possible to change the API
  to provide schema for each row instead of expecting data source to provide
schema upfront?

Thanks,
Aniket




-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Extending Scala style checks

2015-01-28 Thread Nicholas Chammas
FYI: scalastyle just merged in a patch to add support for external rules.

I forget why I was following the linked issue, but I assume it's related to
this discussion.
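
For anyone looking for the trailing-whitespace rule discussed below, the
built-in checker can be enabled in scalastyle-config.xml with an entry like
this (illustrative; check the scalastyle docs for the exact class name in your
version):

<check level="error" class="org.scalastyle.file.WhitespaceEndOfLineChecker" enabled="true"/>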

Nick


On Thu Oct 09 2014 at 2:56:30 AM Reynold Xin  wrote:

> Thanks. I added one.
>
>
> On Wed, Oct 8, 2014 at 8:49 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I've created SPARK-3849: Automate remaining Scala style rules.
>>
>>
>> Please create sub-tasks on this issue for rules that we have not automated
>> and let's work through them as possible.
>>
>> I went ahead and created the first sub-task, SPARK-3850: Scala style:
>> Disallow trailing spaces.
>>
>> Nick
>>
>> On Tue, Oct 7, 2014 at 4:45 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com
>> > wrote:
>>
>> > For starters, do we have a list of all the Scala style rules that are
>> > currently not enforced automatically but are likely well-suited for
>> > automation?
>> >
>> > Let's put such a list together in a JIRA issue and work through
>> > implementing them.
>> >
>> > Nick
>> >
>> > On Thu, Oct 2, 2014 at 12:06 AM, Cheng Lian 
>> wrote:
>> >
>> >> Since we can easily catch the list of all changed files in a PR, I
>> think
>> >> we can start with adding the no trailing space check for newly changed
>> >> files only?
>> >>
>> >>
>> >> On 10/2/14 9:24 AM, Nicholas Chammas wrote:
>> >>
>> >>> Yeah, I remember that hell when I added PEP 8 to the build checks and
>> >>> fixed
>> >>> all the outstanding Python style issues. I had to keep rebasing and
>> >>> resolving merge conflicts until the PR was merged.
>> >>>
>> >>> It's a rough process, but thankfully it's also a one-time process. I
>> >>> might
>> >>> be able to help with that in the next week or two if no-one else
>> wants to
>> >>> pick it up.
>> >>>
>> >>> Nick
>> >>>
>> >>> On Wed, Oct 1, 2014 at 9:20 PM, Michael Armbrust <
>> mich...@databricks.com
>> >>> >
>> >>> wrote:
>> >>>
>> >>>  The hard part here is updating the existing code base... which is
>> going
>>  to
>>  create merge conflicts with like all of the open PRs...
>> 
>>  On Wed, Oct 1, 2014 at 6:13 PM, Nicholas Chammas <
>>  nicholas.cham...@gmail.com> wrote:
>> 
>>   Ah, since there appears to be a built-in rule for end-of-line
>> > whitespace,
>> > Michael and Cheng, y'all should be able to add this in pretty
>> easily.
>> >
>> > Nick
>> >
>> > On Wed, Oct 1, 2014 at 6:37 PM, Patrick Wendell > >
>> > wrote:
>> >
>> >  Hey Nick,
>> >>
>> >> We can always take built-in rules. Back when we added this Prashant
>> >> Sharma actually did some great work that lets us write our own
>> style
>> >> rules in cases where rules don't exist.
>> >>
>> >> You can see some existing rules here:
>> >>
>> >>
>> >>  https://github.com/apache/spark/tree/master/project/
>> > spark-style/src/main/scala/org/apache/spark/scalastyle
>> >
>> >> Prashant has over time contributed a lot of our custom rules
>> upstream
>> >> to stalastyle, so now there are only a couple there.
>> >>
>> >> - Patrick
>> >>
>> >> On Wed, Oct 1, 2014 at 2:36 PM, Ted Yu 
>> wrote:
>> >>
>> >>> Please take a look at WhitespaceEndOfLineChecker under:
>> >>> http://www.scalastyle.org/rules-0.1.0.html
>> >>>
>> >>> Cheers
>> >>>
>> >>> On Wed, Oct 1, 2014 at 2:01 PM, Nicholas Chammas <
>> >>>
>> >> nicholas.cham...@gmail.com
>> >>
>> >>> wrote:
>>  As discussed here ,
>> it
>> 
>> >>> would be
>> >>
>> >>> good to extend our Scala style checks to programmatically enforce
>> as
>> 
>> >>> many
>> >>
>> >>> of our style rules as possible.
>> 
>>  Does anyone know if it's relatively straightforward to enforce
>> 
>> >>> additional
>> >>
>> >>> rules like the "no trailing spaces" rule mentioned in the linked
>> PR?
>> 
>>  Nick
>> 
>> 
>> 
>> >>
>> >
>>
>


Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Dirceu Semighini Filho
Before this I was facing the same problem, and fixed it by adding the plugin
to the root pom.xml.

Maybe this is related to the release, mine is:
Apache Maven 3.2.3 (33f8c3e1027c3ddde99d3cdebad2656a31e8fdf4;
2014-08-11T17:58:10-03:00)
Java version: 1.8.0_20, vendor: Oracle Corporation
OS name: "linux", version: "3.13.0-24-generic", arch: "amd64", family:
"unix"

Or the command that I'm using:
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests -Phive
-Phive-thriftserver clean compile assembly:single

I'm trying to build using pr/1290.
wyphao.2007, have you figured out how to complete the build?



2015-01-28 13:32 GMT-02:00 Sean Owen :

> I don't see how this would relate to the problem in the OP? The
> assemblies build fine already as far as I can tell.
>
> Your new error may be introduced by your change.
>
> On Wed, Jan 28, 2015 at 2:52 PM, Dirceu Semighini Filho
>  wrote:
> > I was facing the same problem, and I fixed it by adding
> >
> > 
> > <plugin>
> >   <artifactId>maven-assembly-plugin</artifactId>
> >   <version>2.4.1</version>
> >   <configuration>
> >     <descriptors>
> >       <descriptor>assembly/src/main/assembly/assembly.xml</descriptor>
> >     </descriptors>
> >   </configuration>
> > </plugin>
> >  in the root pom.xml, following the maven assembly plugin docs
> > <
> http://maven.apache.org/plugins-archives/maven-assembly-plugin-2.4.1/examples/multimodule/module-source-inclusion-simple.html
> >
> >
> > I can make a PR on this if you consider this an issue.
> >
> > Now I'm facing this problem, is that what you have now?
> > [ERROR] Failed to execute goal
> > org.apache.maven.plugins:maven-assembly-plugin:2.4.1:single (default-cli)
> > on project spark-network-common_2.10: Failed to create assembly: Error
> > adding file
> 'org.apache.spark:spark-network-common_2.10:jar:1.3.0-SNAPSHOT'
> > to archive:
> > /home/dirceu/projects/spark/network/common/target/scala-2.10/classes
> isn't
> > a file. -> [Help 1]
> >
> >
> > 2015-01-27 9:23 GMT-02:00 Sean Owen :
> >
> >> You certainly do not need to build Spark as root. It might clumsily
> >> overcome a permissions problem in your local env but probably causes
> other
> >> problems.
> >> On Jan 27, 2015 11:18 AM, "angel__" 
> >> wrote:
> >>
> >> > I had that problem when I tried to build Spark 1.2. I don't exactly
> know
> >> > what
> >> > is causing it, but I guess it might have something to do with user
> >> > permissions.
> >> >
> >> > I could finally fix this by building Spark as "root" user (now I'm
> >> dealing
> >> > with another problem, but ...that's another story...)
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> >
> >>
> http://apache-spark-developers-list.1001551.n3.nabble.com/Use-mvn-to-build-Spark-1-2-0-failed-tp9876p10285.html
> >> > Sent from the Apache Spark Developers List mailing list archive at
> >> > Nabble.com.
> >> >
> >> > -
> >> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> > For additional commands, e-mail: dev-h...@spark.apache.org
> >> >
> >> >
> >>
>
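
For comparison, the stock invocation from the "Building Spark" docs produces
the assembly through the normal package phase, without running
assembly:single at the root; the profiles shown here are only an example and
should be adjusted to match your Hadoop/Hive setup:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package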


Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Sean Owen
I don't see how this would relate to the problem in the OP? The
assemblies build fine already as far as I can tell.

Your new error may be introduced by your change.

On Wed, Jan 28, 2015 at 2:52 PM, Dirceu Semighini Filho
 wrote:
> I was facing the same problem, and I fixed it by adding
>
> 
> <plugin>
>   <artifactId>maven-assembly-plugin</artifactId>
>   <version>2.4.1</version>
>   <configuration>
>     <descriptors>
>       <descriptor>assembly/src/main/assembly/assembly.xml</descriptor>
>     </descriptors>
>   </configuration>
> </plugin>
>  in the root pom.xml, following the maven assembly plugin docs
> 
>
> I can make a PR on this if you consider this an issue.
>
> Now I'm facing this problem, is that what you have now?
> [ERROR] Failed to execute goal
> org.apache.maven.plugins:maven-assembly-plugin:2.4.1:single (default-cli)
> on project spark-network-common_2.10: Failed to create assembly: Error
> adding file 'org.apache.spark:spark-network-common_2.10:jar:1.3.0-SNAPSHOT'
> to archive:
> /home/dirceu/projects/spark/network/common/target/scala-2.10/classes isn't
> a file. -> [Help 1]
>
>
> 2015-01-27 9:23 GMT-02:00 Sean Owen :
>
>> You certainly do not need to build Spark as root. It might clumsily
>> overcome a permissions problem in your local env but probably causes other
>> problems.
>> On Jan 27, 2015 11:18 AM, "angel__" 
>> wrote:
>>
>> > I had that problem when I tried to build Spark 1.2. I don't exactly know
>> > what
>> > is causing it, but I guess it might have something to do with user
>> > permissions.
>> >
>> > I could finally fix this by building Spark as "root" user (now I'm
>> dealing
>> > with another problem, but ...that's another story...)
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> >
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Use-mvn-to-build-Spark-1-2-0-failed-tp9876p10285.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> > Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Use mvn to build Spark 1.2.0 failed

2015-01-28 Thread Dirceu Semighini Filho
I was facing the same problem, and I fixed it by adding


<plugin>
  <artifactId>maven-assembly-plugin</artifactId>
  <version>2.4.1</version>
  <configuration>
    <descriptors>
      <descriptor>assembly/src/main/assembly/assembly.xml</descriptor>
    </descriptors>
  </configuration>
</plugin>
 in the root pom.xml, following the maven assembly plugin docs


I can make a PR on this if you consider this an issue.

Now I'm facing this problem, is that what you have now?
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.4.1:single (default-cli)
on project spark-network-common_2.10: Failed to create assembly: Error
adding file 'org.apache.spark:spark-network-common_2.10:jar:1.3.0-SNAPSHOT'
to archive:
/home/dirceu/projects/spark/network/common/target/scala-2.10/classes isn't
a file. -> [Help 1]


2015-01-27 9:23 GMT-02:00 Sean Owen :

> You certainly do not need to build Spark as root. It might clumsily
> overcome a permissions problem in your local env but probably causes other
> problems.
> On Jan 27, 2015 11:18 AM, "angel__" 
> wrote:
>
> > I had that problem when I tried to build Spark 1.2. I don't exactly know
> > what
> > is causing it, but I guess it might have something to do with user
> > permissions.
> >
> > I could finally fix this by building Spark as "root" user (now I'm
> dealing
> > with another problem, but ...that's another story...)
> >
> >
> >
> > --
> > View this message in context:
> >
> http://apache-spark-developers-list.1001551.n3.nabble.com/Use-mvn-to-build-Spark-1-2-0-failed-tp9876p10285.html
> > Sent from the Apache Spark Developers List mailing list archive at
> > Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
> >
>


Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Sean Owen
We had both been using Java 8; Ye reports that it fails on Java 6 too.
We both believe this has been failing for a fair while, so I do not
think it's a regression. I'll make a JIRA though.

On Wed, Jan 28, 2015 at 1:22 PM, Ye Xianjin  wrote:
> Sean,
> the MQTTStreamSuite also failed for me on Mac OS X, though I don’t have
> time to investigate that.
>
> --
> Ye Xianjin
> Sent with Sparrow
>
> On Wednesday, January 28, 2015 at 9:17 PM, Sean Owen wrote:
>
> +1 (nonbinding). I verified that all the hash / signing items I
> mentioned before are resolved.
>
> The source package compiles on Ubuntu / Java 8. I ran tests and they
> passed. Well, actually I see the same failure I've been seeing locally on
> OS X and on Ubuntu for a while, but I think nobody else has seen this?
>
> MQTTStreamSuite:
> - mqtt input stream *** FAILED ***
> org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in progress
> at
> org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)
>
> Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
> is something perhaps related to my env that I haven't figured out yet,
> so should not be considered a blocker.
>
> On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell 
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1062/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>
> Changes from rc1:
> This has no code changes from RC1. Only minor changes to the release script.
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until Saturday, January 31, at 10:04 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Ye Xianjin
Sean,  
the MQTTStreamSuite also failed for me on Mac OS X, though I don’t have time
to investigate that.

--  
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, January 28, 2015 at 9:17 PM, Sean Owen wrote:

> +1 (nonbinding). I verified that all the hash / signing items I
> mentioned before are resolved.
>  
> The source package compiles on Ubuntu / Java 8. I ran tests and they
> passed. Well, actually I see the same failure I've been seeing locally on
> OS X and on Ubuntu for a while, but I think nobody else has seen this?
>  
> MQTTStreamSuite:
> - mqtt input stream *** FAILED ***
> org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in progress
> at 
> org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)
>  
> Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
> is something perhaps related to my env that I haven't figured out yet,
> so should not be considered a blocker.
>  
> On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell  wrote:
> > Please vote on releasing the following candidate as Apache Spark version 
> > 1.2.1!
> >  
> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
> >  
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
> >  
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >  
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1062/
> >  
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
> >  
> > Changes from rc1:
> > This has no code changes from RC1. Only minor changes to the release script.
> >  
> > Please vote on releasing this package as Apache Spark 1.2.1!
> >  
> > The vote is open until Saturday, January 31, at 10:04 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >  
> > [ ] +1 Release this package as Apache Spark 1.2.1
> > [ ] -1 Do not release this package because ...
> >  
> > For a list of fixes in this release, see http://s.apache.org/Mpn.
> >  
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
> >  
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >  
>  
>  
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>  
>  




Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Sean Owen
+1 (nonbinding). I verified that all the hash / signing items I
mentioned before are resolved.

The source package compiles on Ubuntu / Java 8. I ran tests and they
passed. Well, actually I see the same failure I've been seeing locally on
OS X and on Ubuntu for a while, but I think nobody else has seen this?

MQTTStreamSuite:
- mqtt input stream *** FAILED ***
  org.eclipse.paho.client.mqttv3.MqttException: Too many publishes in progress
  at 
org.eclipse.paho.client.mqttv3.internal.ClientState.send(ClientState.java:423)

Doesn't happen on Jenkins. If nobody else is seeing this, I suspect it
is something perhaps related to my env that I haven't figured out yet,
so should not be considered a blocker.

On Wed, Jan 28, 2015 at 10:06 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1062/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>
> Changes from rc1:
> This has no code changes from RC1. Only minor changes to the release script.
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until  Saturday, January 31, at 10:04 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [SQL] Self join with ArrayType columns problems

2015-01-28 Thread PierreB
Should I file a JIRA for this?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-Self-join-with-ArrayType-columns-problems-tp10269p10322.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Yes - it fixes that issue.

On Wed, Jan 28, 2015 at 2:17 AM, Aniket  wrote:
> Hi Patrick,
>
> I am wondering if this version will address issues around certain artifacts
> not getting published in 1.2, which are blocking people from migrating to
> 1.2. One such issue is https://issues.apache.org/jira/browse/SPARK-5144
>
> Thanks,
> Aniket
>
> On Wed Jan 28 2015 at 15:39:43 Patrick Wendell [via Apache Spark Developers
> List]  wrote:
>
>> Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not
>> v1.2.1-rc1).
>>
>> On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell <[hidden email]
>> > wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark version
>> 1.2.1!
>> >
>> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
>> >
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1062/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>> >
>> > Changes from rc1:
>> > This has no code changes from RC1. Only minor changes to the release
>> script.
>> >
>> > Please vote on releasing this package as Apache Spark 1.2.1!
>> >
>> > The vote is open until  Saturday, January 31, at 10:04 UTC and passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.2.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > For a list of fixes in this release, see http://s.apache.org/Mpn.
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>>
>> -
>> To unsubscribe, e-mail: [hidden email]
>> 
>> For additional commands, e-mail: [hidden email]
>> 
>>
>>
>>
>> --
>>  If you reply to this email, your message will be added to the discussion
>> below:
>>
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10318.html
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10320.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Aniket
Hi Patrick,

I am wondering if this version will address issues around certain artifacts
not getting published in 1.2, which are blocking people from migrating to
1.2. One such issue is https://issues.apache.org/jira/browse/SPARK-5144

Thanks,
Aniket

On Wed Jan 28 2015 at 15:39:43 Patrick Wendell [via Apache Spark Developers
List]  wrote:

> Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not
> v1.2.1-rc1).
>
> On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell <[hidden email]
> > wrote:
>
> > Please vote on releasing the following candidate as Apache Spark version
> 1.2.1!
> >
> > The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
> >
> > The release files, including signatures, digests, etc. can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2/
> >
> > Release artifacts are signed with the following key:
> > https://people.apache.org/keys/committer/pwendell.asc
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1062/
> >
> > The documentation corresponding to this release can be found at:
> > http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
> >
> > Changes from rc1:
> > This has no code changes from RC1. Only minor changes to the release
> script.
> >
> > Please vote on releasing this package as Apache Spark 1.2.1!
> >
> > The vote is open until  Saturday, January 31, at 10:04 UTC and passes
> > if a majority of at least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Spark 1.2.1
> > [ ] -1 Do not release this package because ...
> >
> > For a list of fixes in this release, see http://s.apache.org/Mpn.
> >
> > To learn more about Apache Spark, please see
> > http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: [hidden email]
> 
> For additional commands, e-mail: [hidden email]
> 
>
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10318.html
>




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-2-1-RC2-tp10317p10320.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Data source API | Support for dynamic schema

2015-01-28 Thread Aniket Bhatnagar
I saw the talk on Spark data sources and, looking at the interfaces, it
seems that the schema needs to be provided upfront. This works for many
data sources, but I have a situation in which I would need to integrate a
system that supports schema evolution by allowing users to change the
schema without affecting existing rows. Basically, each row contains a
schema hint (id and version), and this allows developers to evolve the
schema over time and perform migrations at will. Since the schema needs to
be specified upfront in the data source API, one possible way would be to
build a union of all schema versions and handle populating row values
appropriately. This works when columns have been added or deleted in the
schema but doesn't work if types have changed. I was wondering if it is
possible to change the API to provide a schema for each row instead of
expecting the data source to provide the schema upfront?

Thanks,
Aniket
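
A rough sketch of that union-schema workaround (the column names are
hypothetical; this assumes the org.apache.spark.sql.types package from the
1.3-line data sources work, and that a column keeps the same type in every
schema version):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Two illustrative versions of the same logical schema.
val v1 = StructType(Seq(StructField("id", IntegerType), StructField("name", StringType)))
val v2 = StructType(Seq(
  StructField("id", IntegerType), StructField("name", StringType),
  StructField("email", StringType)))

// Union of all versions, keeping the first occurrence of each field name.
val allFields = v1.fields ++ v2.fields
val unionSchema = StructType(
  allFields.map(_.name).distinct.map(n => allFields.find(_.name == n).get))

// A row stored under an older schema version is padded with nulls for the
// columns it lacks, so every row can be reported against unionSchema.
def pad(values: Map[String, Any]): Seq[Any] =
  unionSchema.fields.toSeq.map(f => values.getOrElse(f.name, null))

As noted, this only covers added or removed columns; it cannot express a type
change for an existing column, which is where a per-row schema would be needed.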


Re: [VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Minor typo in the above e-mail - the tag is named v1.2.1-rc2 (not v1.2.1-rc1).

On Wed, Jan 28, 2015 at 2:06 AM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.2.1!
>
> The tag to be voted on is v1.2.1-rc1 (commit b77f876):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1062/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/
>
> Changes from rc1:
> This has no code changes from RC1. Only minor changes to the release script.
>
> Please vote on releasing this package as Apache Spark 1.2.1!
>
> The vote is open until  Saturday, January 31, at 10:04 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.2.1
> [ ] -1 Do not release this package because ...
>
> For a list of fixes in this release, see http://s.apache.org/Mpn.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[VOTE] Release Apache Spark 1.2.1 (RC2)

2015-01-28 Thread Patrick Wendell
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc1 (commit b77f876):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b77f87673d1f9f03d4c83cf583158227c551359b

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc2/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1062/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc2-docs/

Changes from rc1:
This has no code changes from RC1. Only minor changes to the release script.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until  Saturday, January 31, at 10:04 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC1)

2015-01-28 Thread Patrick Wendell
This vote is cancelled in favor of RC2.

On Tue, Jan 27, 2015 at 4:20 PM, Reynold Xin  wrote:
> +1
>
> Tested on Mac OS X
>
> On Tue, Jan 27, 2015 at 12:35 PM, Krishna Sankar 
> wrote:
>>
>> +1
>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 12:55 min
>>  mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4
>> -Dhadoop.version=2.6.0 -Phive -DskipTests
>> 2. Tested pyspark, MLlib - ran them and compared results with 1.1.x &
>> 1.2.0
>> 2.1. statistics OK
>> 2.2. Linear/Ridge/Lasso Regression OK
>> 2.3. Decision Tree, Naive Bayes OK
>> 2.4. KMeans OK
>>Center And Scale OK
>>Fixed : org.apache.spark.SparkException in zip !
>> 2.5. rdd operations OK
>>State of the Union Texts - MapReduce, Filter, sortByKey (word count)
>> 2.6. recommendation OK
>>
>> Cheers
>> 
>>
>> On Mon, Jan 26, 2015 at 11:02 PM, Patrick Wendell 
>> wrote:
>>
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.2.1!
>> >
>> > The tag to be voted on is v1.2.1-rc1 (commit 3e2d7d3):
>> >
>> >
>> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e2d7d310b76c293b9ac787f204e6880f508f6ec
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc1/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1061/
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-1.2.1-rc1-docs/
>> >
>> > Please vote on releasing this package as Apache Spark 1.2.1!
>> >
>> > The vote is open until Friday, January 30, at 07:00 UTC and passes
>> > if a majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.2.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > For a list of fixes in this release, see http://s.apache.org/Mpn.
>> >
>> > To learn more about Apache Spark, please see
>> > http://spark.apache.org/
>> >
>> > - Patrick
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>> >
>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org