Re: Please add me

2016-11-29 Thread Bu Jianjian
Hi Srinivas,

You can subscribe to the mailing list yourself via the community page:
http://spark.apache.org/community.html

On Tue, Nov 29, 2016 at 9:59 AM, Srinivas Potluri wrote:

> Hi,
>
> I am interested in contributing code to Spark. Could you please add me to
> the mailing list / DL.
>
> Thanks,
> *Srinivas Potluri*
>


Proposal for SPARK-18278

2016-11-29 Thread Matt Cheah
Hi everyone,

Kubernetes is a key player in the cluster computing world. Currently, running 
Spark applications on Kubernetes requires deploying a standalone Spark cluster 
on the Kubernetes cluster and then running the jobs against that standalone 
cluster. However, there would be many benefits to running Spark on Kubernetes 
natively, so SPARK-18278 has been filed to track discussion around supporting 
Kubernetes as a cluster manager, alongside the existing Mesos, YARN, and 
standalone cluster managers.

A first draft of a proposal outlining a potential long-term plan around this 
feature has been attached to the JIRA ticket. Any feedback and discussion would 
be greatly appreciated.

Thanks,

-Matt Cheah


Re: How is the order ensured in the jdbc relation provider when inserting data from multiple executors

2016-11-29 Thread Sachith Withana
Hi all,

To explain the scenario a bit more.

We need to retain the order when writing to the RDBMS tables.
The way we found was to execute the DB write *job* for each partition, which
is really costly.
One reason is that the partition count is really high (200), and it seems
we cannot control the count (since it is inferred from the parent RDD).

When we execute the insert job, the executors run the writing tasks in
parallel, which jumbles up the order.
Is there any way we can execute the tasks sequentially, or any other way of
doing this?
We have noticed that you handle this from inside Spark itself, to retain
the order when writing to an RDBMS from Spark.

Thanks,
Sachith
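
(A minimal sketch of one possible workaround, not something prescribed by this
thread: assuming Spark 2.x, an illustrative JDBC URL and credentials, and that
a single slower writer is acceptable, collapsing the sorted result to one
partition makes the JDBC write sequential by construction, at the cost of all
write parallelism.)

import java.util.Properties

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ordered-jdbc-write").getOrCreate()

// Sorted result to be written; table1/table2 as in the scenario above.
val result = spark.sql(
  "select value, count(*) from table1 group by value order by value")

val props = new Properties()
props.setProperty("user", "dbuser")       // illustrative only
props.setProperty("password", "dbpass")   // illustrative only

// coalesce(1) avoids a shuffle and reads the sorted partitions with a single
// task, so the rows are written sequentially in order.
result.coalesce(1)
  .write
  .mode("append")
  .jdbc("jdbc:postgresql://dbhost:5432/mydb", "table2", props)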



On Fri, Nov 25, 2016 at 8:05 AM, nirandap  wrote:

> Hi Maciej,
>
> Thanks again for the reply. One small clarification about your answer to
> my #1 point.
> I used local[4]; shouldn't this force Spark to read from 4 partitions in
> parallel and write in parallel (by parallel I mean that the order in which
> data is read from a set of 4 partitions is non-deterministic)? That was the
> reason I was surprised to see that the final results are in the same order.
>
> On Tue, Nov 22, 2016 at 5:24 PM, Maciej Szymkiewicz [via Apache Spark
> Developers List] <[hidden email]
> > wrote:
>
>> On 11/22/2016 12:11 PM, nirandap wrote:
>>
>> Hi Maciej,
>>
>> Thank you for your reply.
>>
>> I have 2 queries.
>> 1. I can understand your explanation. But in my experience, when I check
>> the final RDBMS table, I see that the results follow the expected order,
>> without an issue. Is this just a coincidence?
>>
>> Not exactly a coincidence. This is typically a result of the physical
>> location on disk. If writes and reads are sequential (this is usually the
>> case), you'll see things in the expected order, but you have to remember
>> that the location on disk is not stable. For example, if you perform some
>> updates and deletes followed by VACUUM FULL (PostgreSQL), the physical
>> location on disk will change, and with it the order you see.
>>
>> There are, of course, more advanced mechanisms out there. For example,
>> modern columnar RDBMSs like HANA use techniques such as dimension sorting
>> and differential stores, so even the initial order may differ. There are
>> probably other solutions that choose different strategies (maybe some
>> time-series-oriented projects?) that I am not aware of.
>>
>>
>> 2. I was further looking into this. So, say I run this query
>> "select value, count(*) from table1 group by value order by value"
>>
>> and I call df.collect() on the resultant dataframe. From my experience, I
>> see that the returned values follow the expected order. May I know how Spark
>> manages to retain the order of the results in a collect operation?
>>
>> Once you execute an ordered operation, each partition is sorted and the
>> order of the partitions defines the global ordering. All collect does is
>> preserve this order by creating an array of results for each partition and
>> flattening it.
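
(A minimal sketch of the point above, assuming a Spark 2.x spark-shell; the
column name and data are illustrative. After orderBy, each shuffle partition
holds a sorted range, and collect() concatenates the per-partition arrays in
partition order, so the result comes back globally sorted.)

import org.apache.spark.sql.functions.rand

// Illustrative data: 1000 random ints in [0, 100).
val df = spark.range(0, 1000).select((rand() * 100).cast("int").as("value"))

val counts = df.groupBy("value").count().orderBy("value")

counts.rdd.getNumPartitions   // typically spark.sql.shuffle.partitions (200)
counts.collect().take(5)      // rows arrive in ascending "value" order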
>>
>>
>> Best
>>
>>
>> On Mon, Nov 21, 2016 at 3:02 PM, Maciej Szymkiewicz [via Apache Spark
>> Developers List] <[hidden email]
>> > wrote:
>>
>>> In commonly used RDBMSs, relations have no fixed order, and the physical
>>> location of records can change during routine maintenance operations.
>>> Unless you explicitly order the data during retrieval, the order you see
>>> is incidental and not guaranteed.
>>>
>>> Conclusion: the order of inserts just doesn't matter.
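
(A minimal sketch of the takeaway, assuming Spark 2.x and an illustrative JDBC
URL and credentials: if order matters on the read side, request it explicitly
instead of relying on insert order.)

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")       // illustrative only
props.setProperty("password", "dbpass")   // illustrative only

// Read the table back and order explicitly; physical/insert order is not
// guaranteed by the database.
val readBack = spark.read
  .jdbc("jdbc:postgresql://dbhost:5432/mydb", "table2", props)
  .orderBy("value")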
>>> On 11/21/2016 10:03 AM, Niranda Perera wrote:
>>>
>>> Hi,
>>>
>>> Say I have a table with 1 column and 1000 rows. I want to save the
>>> result in an RDBMS table using the jdbc relation provider, so I run the
>>> following query:
>>>
>>> "insert into table table2 select value, count(*) from table1 group by
>>> value order by value"
>>>
>>> While debugging, I found that the resultant df from "select value,
>>> count(*) from table1 group by value order by value" would have around 200+
>>> partitions, and say I have 4 executors attached to my driver. So I would
>>> have 200+ writing tasks assigned to 4 executors. I want to understand how
>>> these executors are able to write the data to the underlying RDBMS table
>>> table2 without messing up the order.
>>>
>>> I checked the jdbc insertable relation and in jdbcUtils [1] it does the
>>> following
>>>
>>> df.foreachPartition { iterator =>
>>>   savePartition(getConnection, table, iterator, rddSchema,
>>>     nullTypes, batchSize, dialect)
>>> }
>>>
>>> So, my understanding is that all 4 of my executors will run the
>>> savePartition function (or closure) in parallel, without knowing which one
>>> should write its data before the others!
>>>
>>> In the savePartition method, in the comment, it says
>>> "Saves a partition of a DataFrame to the JDBC database.  This is done in
>>>* a single database transaction in 

Spark-9487, Need some insight

2016-11-29 Thread Saikat Kanjilal
Hello Spark dev community,

I took on SPARK-9487 (PR: https://github.com/apache/spark/pull/15848) and am 
looking for some general pointers. I am running into issues where things work 
successfully in local development on my MacBook Pro but fail on Jenkins for a 
multitude of reasons and errors. Here's an example: in this build output 
report, https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69297/ 
you will see the DataFrameStatSuite failures. Locally I am running these 
individual tests with this command: ./build/mvn test -P... -DwildcardSuites=none 
-Dtest=org.apache.spark.sql.DataFrameStatSuite. It seems that I need to emulate 
a Jenkins-like environment locally, which feels like an untenable hurdle, 
granted that my changes involve changing the total number of workers in the 
SparkContext; if so, should I be testing my changes in an environment that more 
closely resembles Jenkins? I really want to work on and complete this PR, but I 
keep getting hamstrung by a dev environment that is not equivalent to our CI 
environment.
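
(One possibility, offered as a hedged sketch rather than a known fix: Spark's
own suites exercise multi-worker behaviour with the "local-cluster" master,
which launches separate worker JVMs locally and so behaves closer to a
multi-worker Jenkins run than plain local[N]. It needs a locally built Spark
distribution; the worker count, cores, and memory below are illustrative only.)

import org.apache.spark.sql.SparkSession

// 2 workers, 1 core and 1024 MB each -- values are illustrative.
val spark = SparkSession.builder()
  .master("local-cluster[2,1,1024]")
  .appName("jenkins-like-local-run")
  .getOrCreate()

// ... run the failing test logic against this session ...

spark.stop()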



I'm guessing (and hoping) I'm not the first one to run into this, so any 
insights or pointers to get past it would be very much appreciated. I would 
love to keep contributing, and I hope this is a hurdle that can be overcome 
with some tweaks to my dev environment.



Thanks in advance.


Question about spark.mllib.GradientDescent

2016-11-29 Thread WangJianfei
Hi devs:
   I think it's unnecessary to use c1._1 += c2._1 in the combOp operation; I
think it would be the same if we used c1._1 + c2._1. See the code below, from
GradientDescent.scala:

    val (gradientSum, lossSum, miniBatchSize) = data.sample(false,
      miniBatchFraction, 42 + i)
      .treeAggregate((BDV.zeros[Double](n), 0.0, 0L))(
        seqOp = (c, v) => {
          // c: (grad, loss, count), v: (label, features)
          // c._1 (i.e. grad) will be updated in gradient.compute
          val l = gradient.compute(v._2, v._1, bcWeights.value,
            Vectors.fromBreeze(c._1))
          (c._1, c._2 + l, c._3 + 1)
        },
        combOp = (c1, c2) => {
          // c: (grad, loss, count)
          (c1._1 += c2._1, c1._2 + c2._2, c1._3 + c2._3)
        })
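
(A note on why the in-place += is likely there, with a minimal illustrative
sketch rather than anything from the maintainers: both forms produce the same
numeric result, but += mutates the existing Breeze vector in place, while +
allocates a new dense vector on every combine, which matters when the gradient
dimension n is large.)

import breeze.linalg.DenseVector

val c1 = DenseVector.zeros[Double](3)
val c2 = DenseVector(1.0, 2.0, 3.0)

// In-place add: reuses c1's underlying array, no new vector is allocated.
c1 += c2

// Functional add: allocates and returns a brand-new vector on each call.
val c3 = c1 + c2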






Re: Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Michael Allman
This is not an issue with all tables created in Spark 2.1, though I'm not sure 
why some work and some do not. I have found that a table created as follows

sql("create table test stored as parquet as select 1")

in Spark 2.1 cannot be read in previous versions of Spark.

Michael


> On Nov 29, 2016, at 5:15 PM, Michael Allman  wrote:
> 
> Hello,
> 
> When I try to read from a Hive table created by Spark 2.1 in Spark 2.0 or 
> earlier, I get an error:
> 
> java.lang.ClassNotFoundException: Failed to load class for data source: hive.
> 
> Is there a way to get previous versions of Spark to read tables written with 
> Spark 2.1?
> 
> Cheers,
> 
> Michael





Can't read tables written in Spark 2.1 in Spark 2.0 (and earlier)

2016-11-29 Thread Michael Allman
Hello,

When I try to read from a Hive table created by Spark 2.1 in Spark 2.0 or 
earlier, I get an error:

java.lang.ClassNotFoundException: Failed to load class for data source: hive.

Is there a way to get previous versions of Spark to read tables written with 
Spark 2.1?

Cheers,

Michael



Please add me

2016-11-29 Thread Srinivas Potluri
Hi,

I am interested in contributing code to Spark. Could you please add me to
the mailing list / DL.

Thanks,
*Srinivas Potluri*


Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-29 Thread Marcelo Vanzin
I'll send a -1 because of SPARK-18546. Haven't looked at anything else yet.

On Mon, Nov 28, 2016 at 5:25 PM, Reynold Xin  wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc1
> (80aabc0bd33dc5661a90133156247e7a8c1bf7f5)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1216/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
>



-- 
Marcelo




Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-29 Thread Sean Owen
We still have several blockers for 2.1, so I imagine at least one will mean
this won't be the final RC:

SPARK-18318 ML, Graph 2.1 QA: API: New Scala APIs, docs
SPARK-18319 ML, Graph 2.1 QA: API: Experimental, DeveloperApi, final,
sealed audit
SPARK-18326 SparkR 2.1 QA: New R APIs and API docs
SPARK-18516 Separate instantaneous state from progress performance
statistics
SPARK-18538 Concurrent Fetching DataFrameReader JDBC APIs Do Not Work
SPARK-18553 Executor loss may cause TaskSetManager to be leaked

However, I understand the purpose here is of course to get started testing
early, and we should all do so.

BTW here are the Critical issues still open:

SPARK-12347 Write script to run all MLlib examples for testing
SPARK-16032 Audit semantics of various insertion operations related to
partitioned tables
SPARK-17861 Store data source partitions in metastore and push partition
pruning into metastore
SPARK-18091 Deep if expressions cause Generated SpecificUnsafeProjection
code to exceed JVM code size limit
SPARK-18274 Memory leak in PySpark StringIndexer
SPARK-18316 Spark MLlib, GraphX 2.1 QA umbrella
SPARK-18322 ML, Graph 2.1 QA: Update user guide for new features & APIs
SPARK-18323 Update MLlib, GraphX websites for 2.1
SPARK-18324 ML, Graph 2.1 QA: Programming guide update and migration guide
SPARK-18329 Spark R 2.1 QA umbrella
SPARK-18330 SparkR 2.1 QA: Update user guide for new features & APIs
SPARK-18331 Update SparkR website for 2.1
SPARK-18332 SparkR 2.1 QA: Programming guide, migration guide, vignettes
updates
SPARK-18468 Flaky test:
org.apache.spark.sql.hive.HiveSparkSubmitSuite.SPARK-9757 Persist Parquet
relation with decimal column
SPARK-18549 Failed to Uncache a View that References a Dropped Table.
SPARK-18560 Receiver data can not be dataSerialized properly.


On Tue, Nov 29, 2016 at 1:26 AM Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.1.0. The vote is open until Thursday, December 1, 2016 at 18:00 UTC and
> passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.1.0
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v2.1.0-rc1
> (80aabc0bd33dc5661a90133156247e7a8c1bf7f5)
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1216/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> ===
> What should happen to JIRA tickets still targeting 2.1.0?
> ===
> Committers should look at those and triage. Extremely important bug fixes,
> documentation, and API tweaks that impact compatibility should be worked on
> immediately. Everything else please retarget to 2.1.1 or 2.2.0.
>
>
>