status of 2.11 support?

2015-12-14 Thread Sachin Aggarwal
Hi,


Adding a question from the user group to the dev group; we need expert advice.
Please help us decide which version to standardize on for production.

http://apache-spark-user-list.1001560.n3.nabble.com/Status-of-2-11-support-tp25362.html
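
For context, the documented way to build 1.6 for Scala 2.11 (a sketch from the
build instructions; we have not tried this in production yet) is:

    # switch the build to Scala 2.11, then build without tests
    ./dev/change-scala-version.sh 2.11
    mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package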

thanks

-- 

Thanks & Regards

Sachin Aggarwal
7760502772


Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Mark Hamstra
I'm afraid you're correct, Krishna:

core/src/main/scala/org/apache/spark/package.scala:  val SPARK_VERSION =
"1.6.0-SNAPSHOT"
docs/_config.yml:SPARK_VERSION: 1.6.0-SNAPSHOT

On Mon, Dec 14, 2015 at 6:51 PM, Krishna Sankar  wrote:

> Guys,
> The sc.version gives 1.6.0-SNAPSHOT. Need to change to 1.6.0. Can you
> please verify?
> Cheers
> 
>
> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc2
>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>> *
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>
>> The test repository (versioned as v1.6.0-rc2) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==
>> == Major changes to help you focus your testing ==
>> ==
>>
>> Spark 1.6.0 Preview
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>>- SPARK-2629  
>>trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>- SPARK-12165 
>>SPARK-12189  Fix
>>bugs in eviction of storage memory by execution.
>>- SPARK-12258  correct
>>passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>> Spark SQL
>>
>>- SPARK-11787  Parquet
>>Performance - Improve Parquet scan performance when using flat
>>schemas.
>>- SPARK-10810 
>>    Session Management - Isolated default database (i.e. USE mydb) even on
>>shared clusters.
>>- SPARK-   Dataset
>>API - A type-safe API (similar to RDDs) that performs many operations
>>on serialized binary data and code generation (i.e. Project Tungsten).
>>- SPARK-1  Unified
>>Memory Management - Shared memory for execution and caching instead
>>of exclusive division of the regions.
>>- SPARK-11197  SQL
>>Queries on Files - Concise syntax for running SQL queries over files
>>of any supported format without registering a table.
>>- SPARK-11745  Reading
>>non-standard JSON files - Added options to read non-standard JSON
>>files (e.g. single-quotes, unquoted attributes)
>>- SPARK-10412  
>> Per-operator
>>Metrics for SQL Execution - Display statistics on a peroperator basis
>>fo

Re: Problem using User Defined Predicate pushdown with core RDD and parquet - UDP class not found

2015-12-14 Thread chao chu
+spark user mailing list

Hi there,

I have exactly the same problem as mentioned below. My current workaround
is to add the jar containing my UDP to one of the system classpath locations
(for example, put it under the same path as
/opt/cloudera/parcels/CDH-5.4.2-1.cdh5.4.2.p0.2/jars/parquet-hadoop-bundle-1.5.0-cdh5.4.2.jar)
listed in "Classpath Entries" of the Spark executors.

Obviously, the downside is that you have to copy the jar to every node of
the cluster, and it's hard to maintain when the cluster's setup gets updated.

I'd like to hear if anyone has a better solution for this. Thanks a lot!
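
One variant I have been meaning to try (an untested sketch, reusing the jar
names from the original mail): on YARN, jars passed with --jars are localized
into each executor's container working directory, so the UDP jar can be put on
the executor's system classpath by its bare file name instead of copying it to
every node by hand:

    spark-submit \
      --master yarn-client \
      --jars ./lib/my-jar-with-dependencies.jar \
      --conf spark.executor.extraClassPath=my-jar-with-dependencies.jar \
      --conf spark.driver.extraClassPath=./lib/my-jar-with-dependencies.jar \
      --class my.app.parquet.filters.tools.TestSparkApp \
      ./lib/my-jar-with-dependencies.jar yarn-client "/user/vvlad/2015/*/*/*/EVENTS"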


>
>
> -- Forwarded message --
> From: Vladimir Vladimirov 
> To: dev@spark.apache.org
> Cc:
> Date: Mon, 19 Oct 2015 19:38:07 -0400
> Subject: Problem using User Defined Predicate pushdown with core RDD and
> parquet - UDP class not found
> Hi all
>
> I feel like this question is more Spark dev related than Spark user
> related. Please correct me if I'm wrong.
>
> My project's data flow involves sampling records from data stored as a
> Parquet dataset.
> I've checked the DataFrames API and it doesn't support user-defined
> predicate pushdown - only simple filter expressions.
> I want to use Parquet's custom filter (user-defined predicate) pushdown
> feature while loading data with newAPIHadoopFile.
> Simple filters constructed with the org.apache.parquet.filter2 API work fine.
> But a User Defined Predicate works only in `--master local` mode.
>
> When I try to run my test program in yarn-client mode, the UDP class handed
> to parquet-mr triggers a class-not-found exception.
>
> I suspect that the issue could be related to the way the class loader
> works from Parquet, or maybe to the fact that the Spark executor processes
> have my jar loaded from an HTTP server and some security policy intervenes
> (the classpath shows that the jar URI is actually an HTTP URL and not a
> local file).
>
> I've tried to create an uber jar with all dependencies and ship it with the
> Spark app - no success.
>
> PS I'm using spark 1.5.1.
>
> Here is my command line I'm using to submit the application:
>
> SPARK_CLASSPATH=./lib/my-jar-with-dependencies.jar spark-submit \
> --master yarn-client
> --num-executors 3 --driver-memory 3G --executor-memory 2G \
> --executor-cores 1 \
> --jars
> ./lib/my-jar-with-dependencies.jar,./lib/snappy-java-1.1.2.jar,./lib/parquet-hadoop-1.7.0.jar,./lib/parquet-avro-1.7.0.jar,./lib/parquet-column-1.7.0.jar,/opt/cloudera/parcels/CDH/jars/avro-1.7.6-cdh5.4.0.jar,/opt/cloudera/parcels/CDH/jars/avro-mapred-1.7.6-cdh5.4.0-hadoop2.jar,
> \
> --class my.app.parquet.filters.tools.TestSparkApp \
> ./lib/my-jar-with-dependencies.jar \
> yarn-client \
> "/user/vvlad/2015/*/*/*/EVENTS"
>
> Here is the code of my UDP class:
>
> package my.app.parquet.filters.udp
>
> import org.apache.parquet.filter2.predicate.Statistics
> import org.apache.parquet.filter2.predicate.UserDefinedPredicate
>
>
> import java.lang.{Integer => JInt}
>
> import scala.util.Random
>
> class SampleIntColumn(threshold: Double) extends
> UserDefinedPredicate[JInt] with Serializable {
>   lazy val random = { new Random() }
>   val myThreshold = threshold
>   override def keep(value: JInt): Boolean = {
> random.nextFloat() < myThreshold
>   }
>
>   override def canDrop(statistics: Statistics[JInt]): Boolean = false
>
>   override def inverseCanDrop(statistics: Statistics[JInt]): Boolean =
> false
>
>   override def toString: String = {
> "%s(%f)".format(getClass.getName, myThreshold)
>   }
> }
>
> Spark app:
>
> package my.app.parquet.filters.tools
>
> import my.app.parquet.filters.udp.SampleIntColumn
> import org.apache.avro.generic.GenericRecord
> import org.apache.hadoop.mapreduce.Job
> import org.apache.parquet.avro.AvroReadSupport
> import org.apache.parquet.filter2.dsl.Dsl.IntColumn
> import org.apache.parquet.hadoop.ParquetInputFormat
> import org.apache.spark.{SparkContext, SparkConf}
>
> import org.apache.parquet.filter2.dsl.Dsl._
> import org.apache.parquet.filter2.predicate.FilterPredicate
>
>
> object TestSparkApp {
>   def main (args: Array[String]) {
> val conf = new SparkConf()
>   //"local[2]" or yarn-client etc
>   .setMaster(args(0))
>   .setAppName("Spark Scala App")
>   .set("spark.executor.memory", "1g")
>   .set("spark.rdd.compress", "true")
>   .set("spark.storage.memoryFraction", "1")
>
> val sc = new SparkContext(conf)
>
> val job = new Job(sc.hadoopConfiguration)
> ParquetInputFormat.setReadSupportClass(job,
> classOf[AvroReadSupport[GenericRecord]])
>
> val sampler = new SampleIntColumn(0.05)
> val impField = IntColumn("impression")
>
> val pred: FilterPredicate = impField.filterBy(sampler)
>
> ParquetInputFormat.setFilterPredicate(job.getConfiguration, pred)
>
>
> println(job.getConfiguration.get("parquet.private.read.filter.predicate"))
>
> println(job.getConfiguration.get("parquet.private.read.f

Re: Secondary Indexing of RDDs?

2015-12-14 Thread Nitin Goyal
Spark SQL's in-memory cache stores statistics per column, which in turn are
used to skip batches (default size 10,000 rows) within a partition:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala#L25
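
A minimal sketch of how that batch-level skipping gets exercised from the shell
(assuming an existing SparkContext `sc`; the batch size is controlled by
spark.sql.inMemoryColumnarStorage.batchSize and the pruning itself by
spark.sql.inMemoryColumnarStorage.partitionPruning):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Cache a DataFrame: each partition is stored as column batches with
    // per-column min/max statistics.
    val df = sc.parallelize(1 to 1000000).toDF("id")
    df.cache()
    df.count()  // materialize the cache

    // Filters on the cached data can skip whole batches whose min/max range
    // does not overlap the predicate.
    df.filter($"id" > 999000).count()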

Hope this helps

Thanks
-Nitin

On Tue, Dec 15, 2015 at 12:28 AM, Michael Segel 
wrote:

> Hi,
>
> This may be a silly question… couldn’t find the answer on my own…
>
> I’m trying to find out if anyone has implemented secondary indexing on
> Spark’s RDDs.
>
> If anyone could point me to some references, it would be helpful.
>
> I’ve seen some stuff on Succinct Spark (see:
> https://amplab.cs.berkeley.edu/succinct-spark-queries-on-compressed-rdds/
>  )
> but was more interested in integration with SparkSQL and SparkSQL support
> for secondary indexing.
>
> Also the reason I’m posting this to the dev list is that there’s more to
> this question …
>
>
> Thx
>
> -Mike
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Krishna Sankar
Guys,
   The sc.version gives 1.6.0-SNAPSHOT. Need to change to 1.6.0. Can you please
verify?
Cheers


On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Spark 1.6.0 Preview
> Notable changes since 1.6 RC1
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>bugs in eviction of storage memory by execution.
>- SPARK-12258  correct
>passing null into ScalaUDF
>
> Notable Features Since 1.5
> Spark SQL
>
>- SPARK-11787  Parquet
>Performance - Improve Parquet scan performance when using flat schemas.
>- SPARK-10810 
>    Session Management - Isolated default database (i.e. USE mydb) even on
>shared clusters.
>- SPARK-   Dataset
>API - A type-safe API (similar to RDDs) that performs many operations
>on serialized binary data and code generation (i.e. Project Tungsten).
>- SPARK-1  Unified
>Memory Management - Shared memory for execution and caching instead of
>exclusive division of the regions.
>- SPARK-11197  SQL
>Queries on Files - Concise syntax for running SQL queries over files
>of any supported format without registering a table.
>- SPARK-11745  Reading
>non-standard JSON files - Added options to read non-standard JSON
>files (e.g. single-quotes, unquoted attributes)
>- SPARK-10412  
> Per-operator
>    Metrics for SQL Execution - Display statistics on a per-operator basis
>for memory usage and spilled data size.
>- SPARK-11329  Star
>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>arbitrary numbers of columns
>- SPARK-10917 ,
>SPARK-11149 

Re: [build system] brief downtime right now

2015-12-14 Thread shane knapp
that looks like the lintr checks failed, causing the build to fail.

On Mon, Dec 14, 2015 at 3:05 PM, Yin Huai  wrote:
> Hi Shane,
>
> Seems Spark's lint-r started to fail from
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/4260/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/console.
> Is it related to the upgrade work of R?
>
> Thanks,
>
> Yin
>
> On Mon, Dec 14, 2015 at 11:55 AM, shane knapp  wrote:
>>
>> ...and we're back.  we were getting reverse proxy timeouts, which seem
>> to have been caused by jenkins churning and doing a lot of IO.  i'll
>> dig in to the logs and see if i can find out what happened.
>>
>> weird.
>>
>> shane
>>
>> On Mon, Dec 14, 2015 at 11:51 AM, shane knapp  wrote:
>> > something is up w/apache.  looking.
>> >
>> > On Mon, Dec 14, 2015 at 11:37 AM, shane knapp 
>> > wrote:
>> >> after killing and restarting jenkins, things seem to be VERY slow.
>> >> i'm gonna kick jenkins again and see if that helps.
>> >>
>> >>
>> >>
>> >> On Mon, Dec 14, 2015 at 11:26 AM, shane knapp 
>> >> wrote:
>> >>> ok, we're back up and building.
>> >>>
>> >>> On Mon, Dec 14, 2015 at 10:31 AM, shane knapp 
>> >>> wrote:
>>  last week i forgot to downgrade R to 3.1.1, and since there's not
>>  much
>>  activity right now, i'm going to take jenkins down and finish up the
>>  ticket.
>> 
>>  https://issues.apache.org/jira/browse/SPARK-11255
>> 
>>  we should be back up and running within 30 minutes.
>> 
>>  thanks!
>> 
>>  shane
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [build system] brief downtime right now

2015-12-14 Thread Yin Huai
Hi Shane,

Seems Spark's lint-r started to fail from
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-SBT/4260/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/console.
Is it related to the upgrade work of R?

Thanks,

Yin

On Mon, Dec 14, 2015 at 11:55 AM, shane knapp  wrote:

> ...and we're back.  we were getting reverse proxy timeouts, which seem
> to have been caused by jenkins churning and doing a lot of IO.  i'll
> dig in to the logs and see if i can find out what happened.
>
> weird.
>
> shane
>
> On Mon, Dec 14, 2015 at 11:51 AM, shane knapp  wrote:
> > something is up w/apache.  looking.
> >
> > On Mon, Dec 14, 2015 at 11:37 AM, shane knapp 
> wrote:
> >> after killing and restarting jenkins, things seem to be VERY slow.
> >> i'm gonna kick jenkins again and see if that helps.
> >>
> >>
> >>
> >> On Mon, Dec 14, 2015 at 11:26 AM, shane knapp 
> wrote:
> >>> ok, we're back up and building.
> >>>
> >>> On Mon, Dec 14, 2015 at 10:31 AM, shane knapp 
> wrote:
>  last week i forgot to downgrade R to 3.1.1, and since there's not much
>  activity right now, i'm going to take jenkins down and finish up the
>  ticket.
> 
>  https://issues.apache.org/jira/browse/SPARK-11255
> 
>  we should be back up and running within 30 minutes.
> 
>  thanks!
> 
>  shane
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [build system] brief downtime right now

2015-12-14 Thread shane knapp
...and we're back.  we were getting reverse proxy timeouts, which seem
to have been caused by jenkins churning and doing a lot of IO.  i'll
dig in to the logs and see if i can find out what happened.

weird.

shane

On Mon, Dec 14, 2015 at 11:51 AM, shane knapp  wrote:
> something is up w/apache.  looking.
>
> On Mon, Dec 14, 2015 at 11:37 AM, shane knapp  wrote:
>> after killing and restarting jenkins, things seem to be VERY slow.
>> i'm gonna kick jenkins again and see if that helps.
>>
>>
>>
>> On Mon, Dec 14, 2015 at 11:26 AM, shane knapp  wrote:
>>> ok, we're back up and building.
>>>
>>> On Mon, Dec 14, 2015 at 10:31 AM, shane knapp  wrote:
 last week i forgot to downgrade R to 3.1.1, and since there's not much
 activity right now, i'm going to take jenkins down and finish up the
 ticket.

 https://issues.apache.org/jira/browse/SPARK-11255

 we should be back up and running within 30 minutes.

 thanks!

 shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [build system] brief downtime right now

2015-12-14 Thread shane knapp
something is up w/apache.  looking.

On Mon, Dec 14, 2015 at 11:37 AM, shane knapp  wrote:
> after killing and restarting jenkins, things seem to be VERY slow.
> i'm gonna kick jenkins again and see if that helps.
>
>
>
> On Mon, Dec 14, 2015 at 11:26 AM, shane knapp  wrote:
>> ok, we're back up and building.
>>
>> On Mon, Dec 14, 2015 at 10:31 AM, shane knapp  wrote:
>>> last week i forgot to downgrade R to 3.1.1, and since there's not much
>>> activity right now, i'm going to take jenkins down and finish up the
>>> ticket.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-11255
>>>
>>> we should be back up and running within 30 minutes.
>>>
>>> thanks!
>>>
>>> shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Kousuke Saruta

+1 (non-binding)

Tested some workloads using the basic API and the DataFrame API on my 4-node
YARN cluster (1 master and 3 slaves).

I also tested the Web UI.

(I'm resending this mail just in case because it seems that I failed to 
send the mail to dev@)

On 2015/12/13 2:39, Michael Armbrust wrote:
Please vote on releasing the following candidate as Apache Spark 
version 1.6.0!


The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and 
passes if a majority of at least 3 +1 PMC votes are cast.


[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is _v1.6.0-rc2 
(23f8dfd45187cb8f2216328ab907ddb5fbdffd0b) 
_


The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/ 



Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1169/

The test repository (versioned as v1.6.0-rc2) for this release can be 
found at:

https://repository.apache.org/content/repositories/orgapachespark-1168/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/ 


===
== How can I help test this release? ==
===
If you are a Spark user, you can help us test this release by taking 
an existing Spark workload and running on this release candidate, then 
reporting any regressions.



== What justifies a -1 vote for this release? ==

This vote is happening towards the end of the 1.6 QA period, so -1 
votes should only occur for significant regressions from 1.5. Bugs 
already present in 1.5, minor regressions, or bugs related to new 
features will not block this release.


===
== What should happen to JIRA tickets still targeting 1.6.0? ==
===
1. It is OK for documentation patches to target 1.6.0 and still go 
into branch-1.6, since documentations will be published separately 
from the release.

2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the 
target version.



==
== Major changes to help you focus your testing ==
==


  Spark 1.6.0 Preview


Notable changes since 1.6 RC1


  Spark Streaming

  * SPARK-2629 
|trackStateByKey| has been renamed to |mapWithState|


  Spark SQL

  * SPARK-12165 
SPARK-12189
 Fix bugs in
eviction of storage memory by execution.
  * SPARK-12258
 correct
passing null into ScalaUDF


Notable Features Since 1.5


  Spark SQL

  * SPARK-11787 
Parquet Performance - Improve Parquet scan performance when using
flat schemas.
  * SPARK-10810
Session
Management - Isolated default database (i.e. |USE mydb|) even on
shared clusters.
  * SPARK- 
Dataset API - A type-safe API (similar to RDDs) that performs many
operations on serialized binary data and code generation (i.e.
Project Tungsten).
  * SPARK-1 
Unified Memory Management - Shared memory for execution and
caching instead of exclusive division of the regions.
  * SPARK-11197 
SQL Queries on Files - Concise syntax for running SQL queries over
files of any supported format without registering a table.
  * SPARK-11745 
Reading non-standard JSON files - Added options to read
non-standard JSON files (e.g. single-quotes, unquoted attributes)
  * SPARK-10412 
Per-operator Metrics for SQL Execution - Display statistics on a
per-operator basis for memory usage and spilled data size.
  * SPARK-11329 
Star (*) expansion for StructTypes - Makes it easier to nest and unnest
arbitrary numbers of columns.

Re: [build system] brief downtime right now

2015-12-14 Thread shane knapp
after killing and restarting jenkins, things seem to be VERY slow.
i'm gonna kick jenkins again and see if that helps.



On Mon, Dec 14, 2015 at 11:26 AM, shane knapp  wrote:
> ok, we're back up and building.
>
> On Mon, Dec 14, 2015 at 10:31 AM, shane knapp  wrote:
>> last week i forgot to downgrade R to 3.1.1, and since there's not much
>> activity right now, i'm going to take jenkins down and finish up the
>> ticket.
>>
>> https://issues.apache.org/jira/browse/SPARK-11255
>>
>> we should be back up and running within 30 minutes.
>>
>> thanks!
>>
>> shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [build system] brief downtime right now

2015-12-14 Thread shane knapp
ok, we're back up and building.

On Mon, Dec 14, 2015 at 10:31 AM, shane knapp  wrote:
> last week i forgot to downgrade R to 3.1.1, and since there's not much
> activity right now, i'm going to take jenkins down and finish up the
> ticket.
>
> https://issues.apache.org/jira/browse/SPARK-11255
>
> we should be back up and running within 30 minutes.
>
> thanks!
>
> shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Andrew Or
+1

Ran PageRank on standalone mode with 4 nodes and noticed a speedup after
the specific commits that were in RC2 but not RC1:

c247b6a Dec 10 [SPARK-12155][SPARK-12253] Fix executor OOM in unified
memory management
05e441e Dec 9 [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage
memory by execution

Also jobs that triggered these issues now run successfully.


2015-12-14 10:45 GMT-08:00 Reynold Xin :

> +1
>
> Tested some dataframe operations on my Mac.
>
>
> On Saturday, December 12, 2015, Michael Armbrust 
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.6.0!
>>
>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
>> if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.6.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is *v1.6.0-rc2
>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>> *
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>
>> The test repository (versioned as v1.6.0-rc2) for this release can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>
>> ===
>> == How can I help test this release? ==
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> 
>> == What justifies a -1 vote for this release? ==
>> 
>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>> should only occur for significant regressions from 1.5. Bugs already
>> present in 1.5, minor regressions, or bugs related to new features will not
>> block this release.
>>
>> ===
>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>> ===
>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>> branch-1.6, since documentations will be published separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.7+.
>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
>> version.
>>
>>
>> ==
>> == Major changes to help you focus your testing ==
>> ==
>>
>> Spark 1.6.0 Preview
>> Notable changes since 1.6 RC1
>> Spark Streaming
>>
>>- SPARK-2629  
>>trackStateByKey has been renamed to mapWithState
>>
>> Spark SQL
>>
>>- SPARK-12165 
>>SPARK-12189  Fix
>>bugs in eviction of storage memory by execution.
>>- SPARK-12258  correct
>>passing null into ScalaUDF
>>
>> Notable Features Since 1.5
>> Spark SQL
>>
>>- SPARK-11787  Parquet
>>Performance - Improve Parquet scan performance when using flat
>>schemas.
>>- SPARK-10810 
>>    Session Management - Isolated default database (i.e. USE mydb) even on
>>shared clusters.
>>- SPARK-   Dataset
>>API - A type-safe API (similar to RDDs) that performs many operations
>>on serialized binary data and code generation (i.e. Project Tungsten).
>>- SPARK-1  Unified
>>Memory Management - Shared memory for execution and caching instead
>>of exclusive division of the regions.
>>- SPARK-11197  SQL
>>Queries on Files - Concise syntax for running SQL queries over files
>>of any supported format without registering a table.
>>- SPARK-11745  Reading
>>non-standard JSON files - Added options to read non-standard JSON
>>files (e.g. single-quotes, unquoted attributes)
>>- SPARK-10412 

SparkML algos limitations question.

2015-12-14 Thread Eugene Morozov
Hello!

I'm currently working on a POC and trying to use Random Forest (classification
and regression). I also have to check SVM and multiclass perceptron (other
algos are less important at the moment). So far I've discovered that Random
Forest has a maxDepth limit for its trees, and just out of curiosity I wonder
why such a limitation was introduced.

The actual question is this: I'm going to use Spark ML in production next
year and would like to know whether there are other limitations, like maxDepth
in RF, for other algorithms: Logistic Regression, Perceptron, SVM, etc.
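
For concreteness, a minimal sketch of the parameter I mean (spark.ml API; as
far as I can tell the 30-level cap comes from the tree implementation packing
node ids into fixed-width integers, but please correct me if that is off):

    import org.apache.spark.ml.classification.RandomForestClassifier

    // maxDepth is validated at fit time; depths above 30 are rejected by the
    // current tree implementation.
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(100)
      .setMaxDepth(30)  // the deepest tree the implementation accepts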

Thanks in advance for your time.
--
Be well!
Jean Morozov


Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Reynold Xin
+1

Tested some dataframe operations on my Mac.

On Saturday, December 12, 2015, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Spark 1.6.0 Preview
> Notable changes since 1.6 RC1
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>bugs in eviction of storage memory by execution.
>- SPARK-12258  correct
>passing null into ScalaUDF
>
> Notable Features Since 1.5
> Spark SQL
>
>- SPARK-11787  Parquet
>Performance - Improve Parquet scan performance when using flat schemas.
>- SPARK-10810 
>    Session Management - Isolated default database (i.e. USE mydb) even on
>shared clusters.
>- SPARK-   Dataset
>API - A type-safe API (similar to RDDs) that performs many operations
>on serialized binary data and code generation (i.e. Project Tungsten).
>- SPARK-1  Unified
>Memory Management - Shared memory for execution and caching instead of
>exclusive division of the regions.
>- SPARK-11197  SQL
>Queries on Files - Concise syntax for running SQL queries over files
>of any supported format without registering a table.
>- SPARK-11745  Reading
>non-standard JSON files - Added options to read non-standard JSON
>files (e.g. single-quotes, unquoted attributes)
>- SPARK-10412  
> Per-operator
>    Metrics for SQL Execution - Display statistics on a per-operator basis
>for memory usage and spilled data size.
>- SPARK-11329  Star
>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>arbitrary numbers of columns
>- SPARK-10917 ,
>SPARK-11149  In-memory
>Columnar Cache Performance -

[build system] brief downtime right now

2015-12-14 Thread shane knapp
last week i forgot to downgrade R to 3.1.1, and since there's not much
activity right now, i'm going to take jenkins down and finish up the
ticket.

https://issues.apache.org/jira/browse/SPARK-11255

we should be back up and running within 30 minutes.

thanks!

shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Maven build against Hadoop 2.4 times out

2015-12-14 Thread shane knapp
++joshrosen

This Is Known[tm], and we have a bug open against it:
https://issues.apache.org/jira/browse/SPARK-11823

On Mon, Dec 14, 2015 at 7:42 AM, Ted Yu  wrote:
> Attached is the tail of the test suite output from a local run.
> I got a test failure.
>
> FYI
>
> On Sun, Dec 13, 2015 at 10:03 PM, Yin Huai  wrote:
>>
>> Can you reproduce the problem in your local environment? Our 1.6 hadoop
>> 2.4 maven build looks pretty good. Since our 1.6 is pretty close to master,
>> I am wondering if there is any environment related issue.
>>
>> On Sun, Dec 13, 2015 at 3:38 PM, Ted Yu  wrote:
>>>
>>> Thanks for checking, Yin.
>>>
>>> Looks like the cause might be in one of the commits for build #4438
>>>
>>> Cheers
>>>
>>> On Sat, Dec 12, 2015 at 6:19 PM, Yin Huai  wrote:

 Ted,

 Looks like thrift server tests were just hanging. See
 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven-with-YARN/4453/HADOOP_PROFILE=hadoop-2.4,label=spark-test/artifact/sql/hive-thriftserver/target/unit-tests.log.
 If it is caused by a recent commit, it is also possible that a commit listed
 in build 4438 or 4439 caused it, since 4438 and 4439 failed well before the
 thrift server tests.

 On Fri, Dec 11, 2015 at 10:27 AM, Ted Yu  wrote:
>
> Hi,
> You may have noticed that maven build against Hadoop 2.4 times out on
> Jenkins.
>
> The last module is spark-hive-thriftserver
>
> This seemed to start with build #4440
>
> FYI
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

>>>
>>
>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Michael Armbrust
Here is a fixed version of the docs for 1.6:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docsfixed-docs

There still might be some minor rendering issues on the ML page, but people
are investigating.

On Sat, Dec 12, 2015 at 6:58 PM, Burak Yavuz  wrote:

> +1 tested SparkSQL and Streaming on some production sized workloads
>
> On Sat, Dec 12, 2015 at 4:16 PM, Mark Hamstra 
> wrote:
>
>> +1
>>
>> On Sat, Dec 12, 2015 at 9:39 AM, Michael Armbrust > > wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 1.6.0!
>>>
>>> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and
>>> passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 1.6.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is *v1.6.0-rc2
>>> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>>> *
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1169/
>>>
>>> The test repository (versioned as v1.6.0-rc2) for this release can be
>>> found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1168/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>>>
>>> ===
>>> == How can I help test this release? ==
>>> ===
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> 
>>> == What justifies a -1 vote for this release? ==
>>> 
>>> This vote is happening towards the end of the 1.6 QA period, so -1 votes
>>> should only occur for significant regressions from 1.5. Bugs already
>>> present in 1.5, minor regressions, or bugs related to new features will not
>>> block this release.
>>>
>>> ===
>>> == What should happen to JIRA tickets still targeting 1.6.0? ==
>>> ===
>>> 1. It is OK for documentation patches to target 1.6.0 and still go into
>>> branch-1.6, since documentations will be published separately from the
>>> release.
>>> 2. New features for non-alpha-modules should target 1.7+.
>>> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the
>>> target version.
>>>
>>>
>>> ==
>>> == Major changes to help you focus your testing ==
>>> ==
>>>
>>> Spark 1.6.0 Preview
>>> Notable changes since 1.6 RC1
>>> Spark Streaming
>>>
>>>- SPARK-2629  
>>>trackStateByKey has been renamed to mapWithState
>>>
>>> Spark SQL
>>>
>>>- SPARK-12165 
>>>SPARK-12189  Fix
>>>bugs in eviction of storage memory by execution.
>>>- SPARK-12258  correct
>>>passing null into ScalaUDF
>>>
>>> Notable Features Since 1.5
>>> Spark SQL
>>>
>>>- SPARK-11787  Parquet
>>>Performance - Improve Parquet scan performance when using flat
>>>schemas.
>>>- SPARK-10810 
>>>    Session Management - Isolated default database (i.e. USE mydb) even
>>>on shared clusters.
>>>- SPARK-   Dataset
>>>API - A type-safe API (similar to RDDs) that performs many
>>>operations on serialized binary data and code generation (i.e. Project
>>>Tungsten).
>>>- SPARK-1  Unified
>>>Memory Management - Shared memory for execution and caching instead
>>>of exclusive division of the regions.
>>>- SPARK-11197  SQL
>>>Queries on Files - Concise syntax for running SQL queries over files
>>>of any supported format without registering a table.
>>>- SPARK-11745  Reading
>>>non-standard JSON files - Added options to read non-standard JSON
>>>    files (e.g. single-quotes, unquoted attributes)

BIRCH clustering algorithm

2015-12-14 Thread Dženan Softić
Hi,

As part of a project, we are trying to create a parallel implementation of
the BIRCH clustering algorithm [1]. We are mostly getting the idea of how to
do it from this paper, which used CUDA to make BIRCH parallel [2]. ([2] is a
short paper; only section 4 is relevant.)

We would like to implement BIRCH on Spark. Would this be an interesting
contribution for MLlib? Has anyone already tried to implement BIRCH on Spark?

Any suggestions for implementation itself would be very much appreciated!


[1] http://www.cs.sfu.ca/CourseCentral/459/han/papers/zhang96.pdf
[2] http://boyuan.global-optimization.com/Mypaper/IDEAL2013-88.pdf


Best,
Dzeno


Re: Maven build against Hadoop 2.4 times out

2015-12-14 Thread Ted Yu
Attached is the tail of the test suite output from a local run.
I got a test failure.

FYI

On Sun, Dec 13, 2015 at 10:03 PM, Yin Huai  wrote:

> Can you reproduce the problem in your local environment? Our 1.6 hadoop
> 2.4 maven build looks pretty good. Since our 1.6 is pretty close to master,
> I am wondering if there is any environment related issue.
>
> On Sun, Dec 13, 2015 at 3:38 PM, Ted Yu  wrote:
>
>> Thanks for checking, Yin.
>>
>> Looks like the cause might be in one of the commits for build #4438
>>
>> Cheers
>>
>> On Sat, Dec 12, 2015 at 6:19 PM, Yin Huai  wrote:
>>
>>> Ted,
>>>
>>> Looks like thrift server tests were just hanging. See
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-Master-Maven-with-YARN/4453/HADOOP_PROFILE=hadoop-2.4,label=spark-test/artifact/sql/hive-thriftserver/target/unit-tests.log.
>>> If it is caused by a recent commit, it is also possible that a commit
>>> listed in build 4438 or 4439 caused it, since 4438 and 4439 failed well
>>> before the thrift server tests.
>>>
>>> On Fri, Dec 11, 2015 at 10:27 AM, Ted Yu  wrote:
>>>
 Hi,
 You may have noticed that maven build against Hadoop 2.4 times out on
 Jenkins.

 The last module is spark-hive-thriftserver

 This seemed to start with build #4440

 FYI
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>
>


spark-thrift-server.out
Description: Binary data

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Dev Environment (again)

2015-12-14 Thread Al Pivonka
I've read through the mail archives and read the different threads.



I believe there is a great deal of value in teaching others.



I'm a 14-year veteran of Java and would like to contribute to different Spark
projects. Here are my dilemmas:

1) How does one quickly get a working environment up and running?



What do I mean by environment?


   1. I have an IDE - not a problem; I can build using sbt.
   2. Environment to me means a working standalone Spark cluster (Docker) to
   which I can deploy what I build in #1, so I can test out my changes, etc.
   3. What are the dependencies between projects internal to Spark?



How to on-board a new developer and make them productive as soon as
possible.



Not looking for answers to just these questions/dilemmas.



There is a wealth of knowledge here (existing Spark & sub-projects
developers/Architects).



My proposal is to document the onboarding process and dependencies for a new
contributor: what someone will need in order to get a working dev environment
up and running so that they can add and test new functionality in the Spark
project. Also, document how to set up a test environment in order to deploy
and test out said new functionality (Docker/standalone).





Suggestions?


Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Sean Owen
With Java 7 / Ubuntu 15, and "-Pyarn -Phadoop-2.6 -Phive
-Phive-thriftserver", I still see the Docker tests fail every time. Is
anyone else seeing them fail (or running them)?

The Hive CliSuite also fails (stack trace at the bottom).

Same deal -- if people are running this test and it's not failing,
this is probably just flakiness of some form.

There's the aforementioned doc generation issue too.

Other than that it compiled and ran all tests for me.

JIRA score: 28 issues, of which 11 bugs, of which 5 critical (listed
below), of which 0 blockers. OK there.



Critical bugs:
SPARK-8447 Test external shuffle service with all shuffle managers
SPARK-10680 Flaky test:
network.RequestTimeoutIntegrationSuite.timeoutInactiveRequests
SPARK-11224 Flaky test: o.a.s.ExternalShuffleServiceSuite
SPARK-11266 Peak memory tests swallow failures
SPARK-11293 Spillable collections leak shuffle memory



- Simple commands *** FAILED ***
  ===
  CliSuite failure output
  ===
  Spark SQL CLI command line: ../../bin/spark-sql --master local
--driver-java-options -Dderby.system.durability=test --conf
spark.ui.enabled=false --hiveconf
javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/home/srowen/spark-1.6.0/sql/hive-thriftserver/target/tmp/spark-240e9e22-8fe8-408b-a116-2a894b3cbf1f;create=true
--hiveconf 
hive.metastore.warehouse.dir=/home/srowen/spark-1.6.0/sql/hive-thriftserver/target/tmp/spark-c336bc67-8e51-4284-b574-e8b79d0d4fce
--hiveconf 
hive.exec.scratchdir=/home/srowen/spark-1.6.0/sql/hive-thriftserver/target/tmp/spark-3a4f9564-d9f1-467f-8016-d4c95389e568
  Exception: java.util.concurrent.TimeoutException: Futures timed out
after [3 minutes]
  Executed query 0 "CREATE TABLE hive_test(key INT, val STRING);",
  But failed to capture expected output "OK" within 3 minutes.

  2015-12-14 13:47:23.07 - stderr> SLF4J: Class path contains multiple
SLF4J bindings.
  2015-12-14 13:47:23.07 - stderr> SLF4J: Found binding in
[jar:file:/home/srowen/spark-1.6.0/assembly/target/scala-2.10/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  2015-12-14 13:47:23.07 - stderr> SLF4J: Found binding in
[jar:file:/home/srowen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
  2015-12-14 13:47:23.07 - stderr> SLF4J: See
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
  2015-12-14 13:47:23.074 - stderr> SLF4J: Actual binding is of type
[org.slf4j.impl.Log4jLoggerFactory]
  2015-12-14 13:47:39.36 - stdout> SET spark.sql.hive.version=1.2.1
  ===
  End CliSuite failure output
  === (CliSuite.scala:151)




On Sat, Dec 12, 2015 at 5:39 PM, Michael Armbrust
 wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes if
> a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already present
> in 1.5, minor regressions, or bugs related to new features will not block
> this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentation will be published separately from the release.

Re: [VOTE] Release Apache Spark 1.6.0 (RC2)

2015-12-14 Thread Ricardo Almeida
+1 (non binding)

Tested our workloads on a standalone cluster:
- Spark Core
- Spark SQL
- Spark MLlib
- Python API



On 12 December 2015 at 18:39, Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc2
> (23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1169/
>
> The test repository (versioned as v1.6.0-rc2) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1168/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Spark 1.6.0 Preview
> Notable changes since 1.6 RC1
> Spark Streaming
>
>- SPARK-2629  
>trackStateByKey has been renamed to mapWithState
>
> Spark SQL
>
>- SPARK-12165 
>SPARK-12189  Fix
>bugs in eviction of storage memory by execution.
>- SPARK-12258  correct
>passing null into ScalaUDF
>
> Notable Features Since 1.5
> Spark SQL
>
>- SPARK-11787  Parquet
>Performance - Improve Parquet scan performance when using flat schemas.
>- SPARK-10810 
>    Session Management - Isolated default database (i.e. USE mydb) even on
>shared clusters.
>- SPARK-   Dataset
>API - A type-safe API (similar to RDDs) that performs many operations
>on serialized binary data and code generation (i.e. Project Tungsten).
>- SPARK-1  Unified
>Memory Management - Shared memory for execution and caching instead of
>exclusive division of the regions.
>- SPARK-11197  SQL
>Queries on Files - Concise syntax for running SQL queries over files
>of any supported format without registering a table.
>- SPARK-11745  Reading
>non-standard JSON files - Added options to read non-standard JSON
>files (e.g. single-quotes, unquoted attributes)
>- SPARK-10412  
> Per-operator
>    Metrics for SQL Execution - Display statistics on a per-operator basis
>for memory usage and spilled data size.
>- SPARK-11329  Star
>    (*) expansion for StructTypes - Makes it easier to nest and unnest
>arbitrary numbers of columns
>- SPARK-10917 ,
>SPARK-11149 

Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Jeff Zhang
Thanks Shivaram, created https://issues.apache.org/jira/browse/SPARK-12318
I will work on it.
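
For reference, a minimal sketch of the Scala behavior the JIRA would align
saveDF with (DataFrameWriter defaults to SaveMode.ErrorIfExists, so appending
has to be requested explicitly; assumes an existing sqlContext from the shell):

    import org.apache.spark.sql.SaveMode

    val df = sqlContext.range(10)

    // Default mode is ErrorIfExists: this fails if /tmp/out already exists.
    df.write.parquet("/tmp/out")

    // Appending (the current SparkR default) must be opted into explicitly.
    df.write.mode(SaveMode.Append).parquet("/tmp/out")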

On Mon, Dec 14, 2015 at 4:13 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I think it's just a bug -- I think we originally followed the Python
> API (in the original PR [1]) but the Python API seems to have been
> changed to match Scala / Java in
> https://issues.apache.org/jira/browse/SPARK-6366
>
> Feel free to open a JIRA / PR for this.
>
> Thanks
> Shivaram
>
> [1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files
>
> On Sun, Dec 13, 2015 at 11:58 PM, Jeff Zhang  wrote:
> > It is inconsistent with the Scala API, which is "error" by default. Any
> > reason for that? Thanks
> >
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: [SparkR] Any reason why saveDF's mode is append by default ?

2015-12-14 Thread Shivaram Venkataraman
I think it's just a bug -- I think we originally followed the Python
API (in the original PR [1]) but the Python API seems to have been
changed to match Scala / Java in
https://issues.apache.org/jira/browse/SPARK-6366

Feel free to open a JIRA / PR for this.

Thanks
Shivaram

[1] https://github.com/amplab-extras/SparkR-pkg/pull/199/files

On Sun, Dec 13, 2015 at 11:58 PM, Jeff Zhang  wrote:
> It is inconsistent with the Scala API, which is "error" by default. Any
> reason for that? Thanks
>
>
>
> --
> Best Regards
>
> Jeff Zhang

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org