Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
Thanks for the link and info Cody !


Regards,
Aakash


On Tue, Nov 15, 2016 at 7:47 PM, Cody Koeninger  wrote:

> Generating / defining an RDD is not the same thing as running the
> compute() method of an RDD.  The direct stream definitely runs Kafka
> consumers on the executors.
>
> If you want more info, the blog post and video linked from
> https://github.com/koeninger/kafka-exactly-once refers to the 0.8
> implementation, but the general design is similar for the 0.10
> version.
>
> I think the likelihood of an official release supporting 0.9 is fairly
> slim at this point, it's a year out of date and wouldn't be a drop-in
> dependency change.
>
>
> On Tue, Nov 15, 2016 at 5:50 PM, aakash aakash 
> wrote:
> >
> >
> >> You can use the 0.8 artifact to consume from a 0.9 broker
> >
> > We are currently using "Camus" in production, and one of the main goals in
> > moving to Spark is to use the new Kafka Consumer API of Kafka 0.9; in our
> > case we need the security provisions available in 0.9, which is why we
> > cannot use the 0.8 client.
> >
> >> Where are you reading documentation indicating that the direct stream
> >> only runs on the driver?
> >
> > I might be wrong here, but I see that the new Kafka + Spark streaming code
> > extends InputDStream, and its documentation says: "Input streams that can
> > generate RDDs from new data by running a service/thread only on the driver
> > node (that is, without running a receiver on worker nodes)."
> >
> > Thanks and regards,
> > Aakash Pradeep
> >
> >
> > On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger 
> wrote:
> >>
> >> It'd probably be worth no longer marking the 0.8 interface as
> >> experimental.  I don't think it's likely to be subject to active
> >> development at this point.
> >>
> >> You can use the 0.8 artifact to consume from a 0.9 broker
> >>
> >> Where are you reading documentation indicating that the direct stream
> >> only runs on the driver?  It runs consumers on the worker nodes.
> >>
> >>
> >> On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash
> >> wrote:
> >> > Re-posting it at dev group.
> >> >
> >> > Thanks and Regards,
> >> > Aakash
> >> >
> >> >
> >> > -- Forwarded message --
> >> > From: aakash aakash 
> >> > Date: Mon, Nov 14, 2016 at 4:10 PM
> >> > Subject: using Spark Streaming with Kafka 0.9/0.10
> >> > To: user-subscr...@spark.apache.org
> >> >
> >> >
> >> > Hi,
> >> >
> >> > I am planning to use Spark Streaming to consume messages from Kafka 0.9.
> >> > I have a couple of questions regarding this:
> >> >
> >> > I see the APIs are annotated with @Experimental. Can you please tell me
> >> > when they are planned to be production ready?
> >> > Currently I see we are using Kafka 0.10, so I am curious to know why we
> >> > did not start with Kafka 0.9 instead of 0.10. As I see it, the 0.10 Kafka
> >> > client is not compatible with the 0.9 client since there are some changes
> >> > in the consumer API arguments.
> >> > The current API extends InputDStream, and as per the documentation this
> >> > means RDDs will be generated by running a service/thread only on the
> >> > driver node instead of the worker nodes. Can you please explain why this
> >> > is done and what is required to make sure it runs on the worker nodes?
> >> >
> >> >
> >> > Thanks in advance !
> >> >
> >> > Regards,
> >> > Aakash
> >> >
> >
> >
>


Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
Generating / defining an RDD is not the same thing as running the
compute() method of an RDD.  The direct stream definitely runs Kafka
consumers on the executors.
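
For reference, a minimal sketch of the 0.10 direct stream API (the broker
address, group id and topic name below are placeholders, not taken from this
thread; ssc is assumed to be an existing StreamingContext):

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.kafka010._

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "broker1:9092",            // placeholder broker address
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "example-group"                      // placeholder consumer group
  )

  // The stream is defined on the driver, but the Kafka consumers that actually
  // poll records run inside tasks on the executors.
  val stream = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("example-topic"), kafkaParams)
  )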

If you want more info, the blog post and video linked from
https://github.com/koeninger/kafka-exactly-once refers to the 0.8
implementation, but the general design is similar for the 0.10
version.

I think the likelihood of an official release supporting 0.9 is fairly
slim at this point, it's a year out of date and wouldn't be a drop-in
dependency change.


On Tue, Nov 15, 2016 at 5:50 PM, aakash aakash  wrote:
>
>
>> You can use the 0.8 artifact to consume from a 0.9 broker
>
> We are currently using "Camus" in production, and one of the main goals in
> moving to Spark is to use the new Kafka Consumer API of Kafka 0.9; in our case
> we need the security provisions available in 0.9, which is why we cannot use
> the 0.8 client.
>
>> Where are you reading documentation indicating that the direct stream
>> only runs on the driver?
>
> I might be wrong here, but I see that the new Kafka + Spark streaming code
> extends InputDStream, and its documentation says: "Input streams that can
> generate RDDs from new data by running a service/thread only on the driver
> node (that is, without running a receiver on worker nodes)."
>
> Thanks and regards,
> Aakash Pradeep
>
>
> On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger  wrote:
>>
>> It'd probably be worth no longer marking the 0.8 interface as
>> experimental.  I don't think it's likely to be subject to active
>> development at this point.
>>
>> You can use the 0.8 artifact to consume from a 0.9 broker
>>
>> Where are you reading documentation indicating that the direct stream
>> only runs on the driver?  It runs consumers on the worker nodes.
>>
>>
>> On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash 
>> wrote:
>> > Re-posting it at dev group.
>> >
>> > Thanks and Regards,
>> > Aakash
>> >
>> >
>> > -- Forwarded message --
>> > From: aakash aakash 
>> > Date: Mon, Nov 14, 2016 at 4:10 PM
>> > Subject: using Spark Streaming with Kafka 0.9/0.10
>> > To: user-subscr...@spark.apache.org
>> >
>> >
>> > Hi,
>> >
>> > I am planning to use Spark Streaming to consume messages from Kafka 0.9.
>> > I have a couple of questions regarding this:
>> >
>> > I see the APIs are annotated with @Experimental. Can you please tell me
>> > when they are planned to be production ready?
>> > Currently I see we are using Kafka 0.10, so I am curious to know why we did
>> > not start with Kafka 0.9 instead of 0.10. As I see it, the 0.10 Kafka client
>> > is not compatible with the 0.9 client since there are some changes in the
>> > consumer API arguments.
>> > The current API extends InputDStream, and as per the documentation this
>> > means RDDs will be generated by running a service/thread only on the driver
>> > node instead of the worker nodes. Can you please explain why this is done
>> > and what is required to make sure it runs on the worker nodes?
>> >
>> >
>> > Thanks in advance !
>> >
>> > Regards,
>> > Aakash
>> >
>
>




Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
> You can use the 0.8 artifact to consume from a 0.9 broker

We are currently using "Camus
" in production and one
of the main goal to move to Spark is to use new Kafka Consumer API  of
Kafka 0.9 and in our case we need the security provisions available in 0.9,
that why we cannot use 0.8 client.

> Where are you reading documentation indicating that the direct stream
only runs on the driver?

I might be wrong here, but I see that new

kafka+Spark stream code extend the InputStream

and its documentation says :

* Input streams that can generate RDDs from new data by running a
service/thread only on the driver node (that is, without running a receiver
on worker nodes) *
Thanks and regards,
Aakash Pradeep


On Tue, Nov 15, 2016 at 2:55 PM, Cody Koeninger  wrote:

> It'd probably be worth no longer marking the 0.8 interface as
> experimental.  I don't think it's likely to be subject to active
> development at this point.
>
> You can use the 0.8 artifact to consume from a 0.9 broker
>
> Where are you reading documentation indicating that the direct stream
> only runs on the driver?  It runs consumers on the worker nodes.
>
>
> On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash 
> wrote:
> > Re-posting it at dev group.
> >
> > Thanks and Regards,
> > Aakash
> >
> >
> > -- Forwarded message --
> > From: aakash aakash 
> > Date: Mon, Nov 14, 2016 at 4:10 PM
> > Subject: using Spark Streaming with Kafka 0.9/0.10
> > To: user-subscr...@spark.apache.org
> >
> >
> > Hi,
> >
> > I am planning to use Spark Streaming to consume messages from Kafka 0.9.
> > I have a couple of questions regarding this:
> >
> > I see the APIs are annotated with @Experimental. Can you please tell me
> > when they are planned to be production ready?
> > Currently I see we are using Kafka 0.10, so I am curious to know why we did
> > not start with Kafka 0.9 instead of 0.10. As I see it, the 0.10 Kafka client
> > is not compatible with the 0.9 client since there are some changes in the
> > consumer API arguments.
> > The current API extends InputDStream, and as per the documentation this
> > means RDDs will be generated by running a service/thread only on the driver
> > node instead of the worker nodes. Can you please explain why this is done
> > and what is required to make sure it runs on the worker nodes?
> >
> >
> > Thanks in advance !
> >
> > Regards,
> > Aakash
> >
>


Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
With predError.zip(input) we get a zipped RDD, so we could just do a sample on
predError or input; but if we do that, we can't use zip (the number of elements
must be the same in each partition). Thank you!
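
A rough sketch of the alternative (sampling after the zip, so the per-partition
element counts stay aligned); predError, input, loss, subsamplingRate, seed and
m are the names used in the GradientBoostedTrees.boost() snippet quoted later in
this thread:

  // Sample the already-zipped RDD instead of sampling predError or input separately.
  val zipped = predError.zip(input)            // RDD[((Double, Double), LabeledPoint)]
  val sampled =
    if (subsamplingRate < 1.0) zipped.sample(withReplacement = false, subsamplingRate, seed + m)
    else zipped
  val data = sampled.map { case ((pred, _), point) =>
    LabeledPoint(-loss.gradient(pred, point.label), point.features)
  }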



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-sample-first-in-GradientBoostedTrees-with-the-condition-that-subsam0-tp19826p19905.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
With predError.zip(input) we get a zipped RDD, so we could just do a sample on
predError or input; but if we do that, we can't use zip (the number of elements
must be the same in each partition). Thank you!




-- Original Message --
From: "Joseph Bradley [via Apache Spark Developers List]"
Sent: Wednesday, November 16, 2016, 3:54 AM
To: "WangJianfei"

Subject: Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if
subsamplingRate < 1.0



Thanks for the suggestion.  That would be faster, but less accurate in
most cases.  It's generally better to use a new random sample on each
iteration, based on literature and results I've seen.
Joseph


On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei <[hidden email]> wrote:
when we train the model, we will use the data with a subSampleRate, so if the
subSampleRate < 1.0, we can do a sample first to reduce the memory usage.
See the code below in GradientBoostedTrees.boost():

  while (m < numIterations && !doneLearning) {
    // Update data with pseudo-residuals (residual errors)
    val data = predError.zip(input).map { case ((pred, _), point) =>
      LabeledPoint(-loss.gradient(pred, point.label), point.features)
    }

    timer.start(s"building tree $m")
    logDebug("###")
    logDebug("Gradient boosting tree iteration " + m)
    logDebug("###")
    val dt = new DecisionTreeRegressor().setSeed(seed + m)
    val model = dt.train(data, treeStrategy)
 
 
 
 
 
 --
 View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-in-GradientBoostedTrees-if-subsamplingRate-1-0-tp19826.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
 
 -
 To unsubscribe e-mail: [hidden email]
 
 








--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-inGradientBoostedTrees-if-subsamplingRate-1-0-tp19904.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
It'd probably be worth no longer marking the 0.8 interface as
experimental.  I don't think it's likely to be subject to active
development at this point.

You can use the 0.8 artifact to consume from a 0.9 broker

Where are you reading documentation indicating that the direct stream
only runs on the driver?  It runs consumers on the worker nodes.
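
For reference, a minimal sketch of the 0.8 integration's direct stream, which
also works against a 0.9 broker (the broker address and topic name below are
placeholders; ssc is assumed to be an existing StreamingContext):

  import kafka.serializer.StringDecoder
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
  val topics = Set("example-topic")                                // placeholder topic

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)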


On Tue, Nov 15, 2016 at 10:58 AM, aakash aakash  wrote:
> Re-posting it at dev group.
>
> Thanks and Regards,
> Aakash
>
>
> -- Forwarded message --
> From: aakash aakash 
> Date: Mon, Nov 14, 2016 at 4:10 PM
> Subject: using Spark Streaming with Kafka 0.9/0.10
> To: user-subscr...@spark.apache.org
>
>
> Hi,
>
> I am planning to use Spark Streaming to consume messages from Kafka 0.9. I
> have a couple of questions regarding this:
>
> I see the APIs are annotated with @Experimental. Can you please tell me when
> they are planned to be production ready?
> Currently I see we are using Kafka 0.10, so I am curious to know why we did
> not start with Kafka 0.9 instead of 0.10. As I see it, the 0.10 Kafka client
> is not compatible with the 0.9 client since there are some changes in the
> consumer API arguments.
> The current API extends InputDStream, and as per the documentation this means
> RDDs will be generated by running a service/thread only on the driver node
> instead of the worker nodes. Can you please explain why this is done and what
> is required to make sure it runs on the worker nodes?
>
>
> Thanks in advance !
>
> Regards,
> Aakash
>




Re: Running lint-java during PR builds?

2016-11-15 Thread Shixiong(Ryan) Zhu
I remember it's because you need to run `mvn install` before running
lint-java if the maven cache is empty, and `mvn install` is pretty heavy.
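
For anyone trying it locally, a rough sketch of that sequence from a Spark
checkout (standard Maven flags, not taken from this thread):

  # populate the local Maven repository once (heavy), then run the Java style check
  ./build/mvn -DskipTests install
  ./dev/lint-java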

On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin  wrote:

> Hey all,
>
> Is there a reason why lint-java is not run during PR builds? I see it
> seems to be maven-only, is it really expensive to run after an sbt
> build?
>
> I see a lot of PRs coming in to fix Java style issues, and those all
> seem a little unnecessary. Either we're enforcing style checks or
> we're not, and right now it seems we aren't.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Running lint-java during PR builds?

2016-11-15 Thread Marcelo Vanzin
Hey all,

Is there a reason why lint-java is not run during PR builds? I see it
seems to be maven-only, is it really expensive to run after an sbt
build?

I see a lot of PRs coming in to fix Java style issues, and those all
seem a little unnecessary. Either we're enforcing style checks or
we're not, and right now it seems we aren't.

-- 
Marcelo




NodeManager heap size with ExternalShuffleService

2016-11-15 Thread Artur Sukhenko
Hello guys,

When you enable the ExternalShuffleService (spark-shuffle) in the NodeManager,
there is no suggestion to increase the NM heap size in the Spark docs or
anywhere else. Shouldn't we include this in Spark's documentation?

I have seen the NM take a lot of memory (5+ GB with the default 1 GB), and when
it hits GC pauses Spark can become very slow while tasks are shuffling. I don't
think users are aware of the NM becoming a bottleneck.
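
For context, a sketch of the pieces involved (property and variable names are
the ones from the standard YARN shuffle-service setup; the heap value is only
an example, not a recommendation):

  # yarn-site.xml -- run Spark's external shuffle service as an NM aux service
  yarn.nodemanager.aux-services = mapreduce_shuffle,spark_shuffle
  yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService

  # yarn-env.sh -- the shuffle service lives inside the NM JVM, so its heap may need to grow
  export YARN_NODEMANAGER_HEAPSIZE=4096   # MB, example value only

  # spark-defaults.conf
  spark.shuffle.service.enabled   true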


Sincerely,
Artur Sukhenko
-- 
--
Artur Sukhenko


Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread Joseph Bradley
Thanks for the suggestion.  That would be faster, but less accurate in most
cases.  It's generally better to use a new random sample on each iteration,
based on literature and results I've seen.
Joseph

On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei <
wangjianfe...@otcaix.iscas.ac.cn> wrote:

> when we train the model, we will use the data with a subSampleRate, so if
> the subSampleRate < 1.0, we can do a sample first to reduce the memory
> usage. See the code below in GradientBoostedTrees.boost():
>
>  while (m < numIterations && !doneLearning) {
>    // Update data with pseudo-residuals (residual errors)
>    val data = predError.zip(input).map { case ((pred, _), point) =>
>      LabeledPoint(-loss.gradient(pred, point.label), point.features)
>    }
>
>    timer.start(s"building tree $m")
>    logDebug("###")
>    logDebug("Gradient boosting tree iteration " + m)
>    logDebug("###")
>    val dt = new DecisionTreeRegressor().setSeed(seed + m)
>    val model = dt.train(data, treeStrategy)
>
>
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Reduce-the-memory-
> usage-if-we-do-same-first-in-GradientBoostedTrees-if-
> subsamplingRate-1-0-tp19826.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Fwd: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
Re-posting it at dev group.

Thanks and Regards,
Aakash


-- Forwarded message --
From: aakash aakash 
Date: Mon, Nov 14, 2016 at 4:10 PM
Subject: using Spark Streaming with Kafka 0.9/0.10
To: user-subscr...@spark.apache.org


Hi,

I am planning to use Spark Streaming to consume messages from Kafka 0.9. I
have a couple of questions regarding this:


   - I see the APIs are annotated with @Experimental. Can you please tell me
   when they are planned to be production ready?
   - Currently, I see we are using Kafka 0.10, so I am curious to know why we
   did not start with Kafka 0.9 instead of 0.10. As I see it, the 0.10 Kafka
   client is not compatible with the 0.9 client since there are some changes
   in the consumer API arguments.
   - The current API extends InputDStream, and as per the documentation this
   means RDDs will be generated by running a service/thread only on the driver
   node instead of the worker nodes. Can you please explain why this is done
   and what is required to make sure it runs on the worker nodes?


Thanks in advance !

Regards,
Aakash


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
You still have the problem that even within a single Job it is often the
case that not every Exchange really wants to use the same number of shuffle
partitions.
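
A rough sketch of what that means in practice today (the DataFrame and path
names below are placeholders, and spark is the SparkSession): the setting is
read when each action executes, so at best it can be toggled between actions,
not between Exchanges inside one job:

  // spark.sql.shuffle.partitions is picked up at execution time, so without
  // adaptive execution the finest control you get is per action.
  spark.conf.set("spark.sql.shuffle.partitions", "2000")   // wide aggregation
  val counts = events.groupBy("userId").count()            // 'events' is a placeholder DataFrame
  counts.write.parquet("/tmp/counts")                      // placeholder path

  spark.conf.set("spark.sql.shuffle.partitions", "200")    // smaller join
  counts.join(dims, "userId").write.parquet("/tmp/joined") // 'dims' is a placeholder DataFrame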

On Tue, Nov 15, 2016 at 2:46 AM, Sean Owen  wrote:

> Once you get to needing this level of fine-grained control, should you not
> consider using the programmatic API in part, to let you control individual
> jobs?
>
>
> On Tue, Nov 15, 2016 at 1:19 AM leo9r  wrote:
>
>> Hi Daniel,
>>
>> I completely agree with your request. As the amount of data being
>> processed
>> with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need
>> to prevent OOM and performance degradation. The fact that
>> sql.shuffle.partitions cannot be set several times in the same job/action,
>> because of the reason you explain, is a big inconvenience for the
>> development
>> of ETL pipelines.
>>
>> Have you got any answer or feedback in this regard?
>>
>> Thanks,
>> Leo Lezcano
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-
>> developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-
>> partitions-should-be-stored-in-the-lineage-tp13240p19867.html
>> Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
AFAIK, the adaptive shuffle partitioning still isn't completely ready to be
made the default, and there are some corner issues that need to be
addressed before this functionality is declared finished and ready.  E.g.,
the current logic can make data skew problems worse by turning One Big
Partition into an even larger partition before the ExchangeCoordinator
decides to create a new one.  That can be worked around by changing the
logic to "If including the nextShuffleInputSize would exceed the target
partition size, then start a new partition":
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ExchangeCoordinator.scala#L173

If you're willing to work around those kinds of issues to fit your use
case, then I do know that the adaptive shuffle partitioning can be made to
work well even if it is not perfect.  It would be nice, though, to see
adaptive partitioning be finished and hardened to the point where it
becomes the default, because a fixed number of shuffle partitions has some
significant limitations and problems.
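
For anyone who wants to experiment with it as it stands, a minimal sketch of
turning the current adaptive shuffle partitioning on (the target size is just
an illustrative value, and spark is the SparkSession):

  // Enable adaptive (post-shuffle) partition coalescing.
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  // Target size per post-shuffle partition; the ExchangeCoordinator merges small
  // shuffle partitions up to roughly this size (64 MB shown as an example).
  spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")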

On Tue, Nov 15, 2016 at 12:50 AM, leo9r  wrote:

> That's great insight Mark, I'm looking forward to give it a try!!
>
> According to jira's  Adaptive execution in Spark
>   , it seems that some
> functionality was added in Spark 1.6.0 and the rest is still in progress.
> Are there any improvements to the SparkSQL adaptive behavior in Spark 2.0+
> that you know?
>
> Thanks and best regards,
> Leo
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-
> partitions-should-be-stored-in-the-lineage-tp13240p19885.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


How to collect statistics on key run time

2016-11-15 Thread 王桥石
hi guys!


Is there a way to collect statistics on the top N run times per key, i.e. the
data for keys in a shuffle or in transforms after a shuffle (e.g. reduceByKey,
groupByKey)?

That way one could see at a glance which keys are causing problems.

RE: Handling questions in the mailing lists

2016-11-15 Thread assaf.mendelson
Should probably also update the helping others section in the how to contribute 
section 
(https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingbyHelpingOtherUsers)
Assaf.

From: Denny Lee [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19835...@n3.nabble.com]
Sent: Sunday, November 13, 2016 8:52 AM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

Hey Reynold,

Looks like we have all of the proposed changes in the Proposed Community Mailing 
Lists / StackOverflow Changes document. Anything else we can do to update the 
Spark Community page / welcome email?

Meanwhile, let's all start answering questions on SO, eh?! :)
Denny

On Thu, Nov 10, 2016 at 1:54 PM Holden Karau <[hidden 
email]> wrote:
That's a good question, looking at 
http://stackoverflow.com/tags/apache-spark/topusers shows a few contributors 
who have already been active on SO including some committers and  PMC members 
with very high overall SO reputations for any administrative needs (as well as 
a number of other contributors besides just PMC/committers).

On Wed, Nov 9, 2016 at 2:18 AM, assaf.mendelson <[hidden 
email]> wrote:
I was just wondering, before we move on to SO:
Do we have enough contributors with enough reputation to manage things in SO?
We would need contributors with enough reputation to have the relevant privileges.
For example: creating tags (requires 1500 reputation), editing questions and 
answers (2000), creating tag synonyms (2500), approving tag wiki edits (5000), 
access to moderator tools (10000, this is required to delete questions etc.), 
protecting questions (15000).
All of these are important if we plan to have SO as a main resource.
I know I originally suggested SO, however, if we do not have contributors with 
the required privileges and the willingness to help manage everything then I am 
not sure this is a good fit.
Assaf.

From: Denny Lee [via Apache Spark Developers List] [mailto:[hidden email]]
Sent: Wednesday, November 09, 2016 9:54 AM
To: Mendelson, Assaf
Subject: Re: Handling questions in the mailing lists

Agreed that simply moving the questions to SO will not solve anything, but I 
think the call out about the meta-tags is that we need to abide by SO rules, 
and if we were to just jump in and start creating meta-tags we would be 
violating at minimum the spirit and at maximum the actual conventions around SO.

Saying this, perhaps we could suggest tags that we place in the header of the 
question, whether it be SO or the mailing lists, that will help us sort through 
all of these questions faster just as you suggested.  The Proposed Community 
Mailing Lists / StackOverflow Changes document has been updated to include 
suggested tags.  WDYT?

On Tue, Nov 8, 2016 at 11:02 PM assaf.mendelson <[hidden 
email]> wrote:
I like the document and I think it is good but I still feel like we are missing 
an important part here.

Look at SO today. There are:

-   4658 unanswered questions under apache-spark tag.

-  394 unanswered questions under spark-dataframe tag.

-  639 unanswered questions under apache-spark-sql

-  859 unanswered questions under pyspark

Just moving people to ask there will not help. The whole issue is having people 
answer the questions.

The problem is that many of these questions do not fit SO (but are already 
there so they are noise), are bad (i.e. unclear or hard to answer), orphaned 
etc. while some are simply harder than what people with some experience in 
spark can handle and require more expertise.
The problem is that people with the relevant expertise are drowning in noise. 
This is true for the mailing list and this is true for SO.

For this reason I believe that just moving people to SO will not solve anything.

My original thought was that if we had different tags then different people 
could watch open questions on these tags and therefore have a much lower noise. 
I thought that we would have a low tier (current one) of people just not 
following the documentation (which would remain as noise), then a beginner tier 
where we could have people downvoting bad questions but in most cases the 
community can answer the questions because they are common, then a “medium” 
tier which would mean harder questions but that can still be answered by 
advanced users, and lastly an “advanced” tier to which committers can actually 
subscribe (and adding sub tags for subsystems would improve this even more).

I was not aware of SO policy for meta tags (the burnination link is about 
removing tags completely so I am not sure how it applies, I believe this link 

Fwd:

2016-11-15 Thread Anton Okolnychyi
Hi,

I have experienced a problem using the Datasets API in Spark 1.6, while
almost identical code works fine in Spark 2.0.
The problem is related to encoders and custom aggregators.

*Spark 1.6 (the aggregation produces an empty map):*

  implicit val intStringMapEncoder: Encoder[Map[Int, String]] = ExpressionEncoder()
  // implicit val intStringMapEncoder: Encoder[Map[Int, String]] =
  //   org.apache.spark.sql.Encoders.kryo[Map[Int, String]]

  val sparkConf = new SparkConf()
    .setAppName("IVU DS Spark 1.6 Test")
    .setMaster("local[4]")
  val sparkContext = new SparkContext(sparkConf)
  val sparkSqlContext = new SQLContext(sparkContext)

  import sparkSqlContext.implicits._

  val stopPointDS = Seq(TestStopPoint("33", 1, "id#1"), TestStopPoint("33", 2, "id#2")).toDS()

  val stopPointSequenceMap = new Aggregator[TestStopPoint, Map[Int, String], Map[Int, String]] {
    override def zero = Map[Int, String]()
    override def reduce(map: Map[Int, String], stopPoint: TestStopPoint) = {
      map.updated(stopPoint.sequenceNumber, stopPoint.id)
    }
    override def merge(map: Map[Int, String], anotherMap: Map[Int, String]) = {
      map ++ anotherMap
    }
    override def finish(reduction: Map[Int, String]) = reduction
  }.toColumn

  val resultMap = stopPointDS
    .groupBy(_.line)
    .agg(stopPointSequenceMap)
    .collect()
    .toMap

In spark.sql.execution.TypedAggregateExpression.scala, I see that the reduce
is done correctly, but Spark cannot read the reduced values in the merge phase.
If I replace the ExpressionEncoder with the Kryo-based one (commented out in
the code above), then it works fine.

*Spark 2.0 (works correctly):*

  val spark = SparkSession
    .builder()
    .appName("IVU DS Spark 2.0 Test")
    .config("spark.sql.warehouse.dir", "file:///D://sparkSql")
    .master("local[4]")
    .getOrCreate()

  import spark.implicits._

  val stopPointDS = Seq(TestStopPoint("33", 1, "id#1"), TestStopPoint("33", 2, "id#2")).toDS()

  val stopPointSequenceMap = new Aggregator[TestStopPoint, Map[Int, String], Map[Int, String]] {
    override def zero = Map[Int, String]()
    override def reduce(map: Map[Int, String], stopPoint: TestStopPoint) = {
      map.updated(stopPoint.sequenceNumber, stopPoint.id)
    }
    override def merge(map: Map[Int, String], anotherMap: Map[Int, String]) = {
      map ++ anotherMap
    }
    override def finish(reduction: Map[Int, String]) = reduction
    override def bufferEncoder: Encoder[Map[Int, String]] = ExpressionEncoder()
    override def outputEncoder: Encoder[Map[Int, String]] = ExpressionEncoder()
  }.toColumn

  val resultMap = stopPointDS
    .groupByKey(_.line)
    .agg(stopPointSequenceMap)
    .collect()
    .toMap

I know that Spark 1.6 has only a preview of the Datasets concept and a lot
changed in 2.0. However, I would like to know if I am doing anything wrong
in my 1.6 code.

Thanks in advance,
Anton


Re: separate spark and hive

2016-11-15 Thread Herman van Hövell tot Westerflier
You can start Spark without Hive support by setting the
spark.sql.catalogImplementation configuration to in-memory, for example:
>
> ./bin/spark-shell --master local[*] --conf
> spark.sql.catalogImplementation=in-memory


I would not change the default from Hive to Spark-only just yet.
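
A minimal sketch of the same thing done programmatically (just skip
enableHiveSupport(); the explicit in-memory setting only makes the intent
obvious, and the app name is a placeholder):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("no-hive-example")                              // placeholder app name
    .config("spark.sql.catalogImplementation", "in-memory")
    .getOrCreate()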

On Tue, Nov 15, 2016 at 9:38 AM, assaf.mendelson 
wrote:

> After looking at the code, I found that spark.sql.catalogImplementation
> is set to “hive”. I would propose that it should be set to “in-memory” by
> default (or at least have this in the documentation; the configuration
> documentation at http://spark.apache.org/docs/latest/configuration.html
> has no mention of hive at all)
>
> Assaf.
>
>
>
> *From:* Mendelson, Assaf
> *Sent:* Tuesday, November 15, 2016 10:11 AM
> *To:* 'rxin [via Apache Spark Developers List]'
> *Subject:* RE: separate spark and hive
>
>
>
> Spark shell (and pyspark) by default create the spark session with hive
> support (also true when the session is created using getOrCreate, at least
> in pyspark)
>
> At a minimum there should be a way to configure it using
> spark-defaults.conf
>
> Assaf.
>
>
>
> *From:* rxin [via Apache Spark Developers List] [[hidden email]
> ]
> *Sent:* Tuesday, November 15, 2016 9:46 AM
> *To:* Mendelson, Assaf
> *Subject:* Re: separate spark and hive
>
>
>
> If you just start a SparkSession without calling enableHiveSupport it
> actually won't use the Hive catalog support.
>
>
>
>
>
> On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf <[hidden email]
> > wrote:
>
> The default generation of spark context is actually a hive context.
>
> I tried to find on the documentation what are the differences between hive
> context and sql context and couldn’t find it for spark 2.0 (I know for
> previous versions there were a couple of functions which required hive
> context as well as window functions but those seem to have all been fixed
> for spark 2.0).
>
> Furthermore, I can’t seem to find a way to configure spark not to use
> hive. I can only find how to compile it without hive (and having to build
> from source each time is not a good idea for a production system).
>
>
>
> I would suggest that working without hive should be either a simple
> configuration or even the default and that if there is any missing
> functionality it should be documented.
>
> Assaf.
>
>
>
>
>
> *From:* Reynold Xin [mailto:[hidden email]
> ]
> *Sent:* Tuesday, November 15, 2016 9:31 AM
> *To:* Mendelson, Assaf
> *Cc:* [hidden email] 
> *Subject:* Re: separate spark and hive
>
>
>
> I agree with the high level idea, and thus SPARK-15691
> .
>
>
>
> In reality, it's a huge amount of work to create & maintain a custom
> catalog. It might actually make sense to do, but it just seems a lot of
> work to do right now and it'd take a toll on interoperability.
>
>
>
> If you don't need persistent catalog, you can just run Spark without Hive
> mode, can't you?
>
>
>
>
>
>
>
>
>
> On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden email]
> > wrote:
>
> Hi,
>
> Today, we basically force people to use hive if they want to get the full
> use of spark SQL.
>
> When doing the default installation this means that a derby.log and
> metastore_db directory are created where we run from.
>
> The problem with this is that if we run multiple scripts from the same
> working directory we have a problem.
>
> The solution we employ locally is to always run from different directory
> as we ignore hive in practice (this of course means we lose the ability to
> use some of the catalog options in spark session).
>
> The only other solution is to create a full blown hive installation with
> proper configuration (probably for a JDBC solution).
>
>
>
> I would propose that in most cases there shouldn’t be any hive use at all.
> Even for catalog elements such as saving a permanent table, we should be
> able to configure a target directory and simply write to it (doing
> everything file based to avoid the need for locking). Hive should be
> reserved for those who actually use it (probably for backward
> compatibility).
>
>
>
> Am I missing something here?
>
> Assaf.
>
>
> --
>
> View this message in context: separate spark and hive
> 
> Sent from the Apache Spark Developers List mailing list archive
>  at
> Nabble.com.
>
>
>
>
>
>
> --
>

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Sean Owen
Once you get to needing this level of fine-grained control, should you not
consider using the programmatic API in part, to let you control individual
jobs?

On Tue, Nov 15, 2016 at 1:19 AM leo9r  wrote:

> Hi Daniel,
>
> I completely agree with your request. As the amount of data being processed
> with SparkSQL grows, tweaking sql.shuffle.partitions becomes a common need
> to prevent OOM and performance degradation. The fact that
> sql.shuffle.partitions cannot be set several times in the same job/action,
> because of the reason you explain, is a big inconvenience for the
> development
> of ETL pipelines.
>
> Have you got any answer or feedback in this regard?
>
> Thanks,
> Leo Lezcano
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-partitions-should-be-stored-in-the-lineage-tp13240p19867.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread leo9r
That's great insight Mark, I'm looking forward to giving it a try!!

According to the JIRA "Adaptive execution in Spark", it seems that some
functionality was added in Spark 1.6.0 and the rest is still in progress.
Are there any improvements to the SparkSQL adaptive behavior in Spark 2.0+
that you know?

Thanks and best regards,
Leo



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-parameters-like-shuffle-partitions-should-be-stored-in-the-lineage-tp13240p19885.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.




RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
After looking at the code, I found that spark.sql.catalogImplementation is set 
to “hive”. I would propose that it should be set to “in-memory” by default (or 
at least have this in the documentation; the configuration documentation at 
http://spark.apache.org/docs/latest/configuration.html has no mention of 
hive at all)
Assaf.

From: Mendelson, Assaf
Sent: Tuesday, November 15, 2016 10:11 AM
To: 'rxin [via Apache Spark Developers List]'
Subject: RE: separate spark and hive

Spark shell (and pyspark) by default create the spark session with hive support 
(also true when the session is created using getOrCreate, at least in pyspark)
At a minimum there should be a way to configure it using spark-defaults.conf
Assaf.

From: rxin [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19882...@n3.nabble.com]
Sent: Tuesday, November 15, 2016 9:46 AM
To: Mendelson, Assaf
Subject: Re: separate spark and hive

If you just start a SparkSession without calling enableHiveSupport it actually 
won't use the Hive catalog support.


On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf <[hidden 
email]> wrote:
The default generation of spark context is actually a hive context.
I tried to find on the documentation what are the differences between hive 
context and sql context and couldn’t find it for spark 2.0 (I know for previous 
versions there were a couple of functions which required hive context as well 
as window functions but those seem to have all been fixed for spark 2.0).
Furthermore, I can’t seem to find a way to configure spark not to use hive. I 
can only find how to compile it without hive (and having to build from source 
each time is not a good idea for a production system).

I would suggest that working without hive should be either a simple 
configuration or even the default and that if there is any missing 
functionality it should be documented.
Assaf.


From: Reynold Xin [mailto:[hidden 
email]]
Sent: Tuesday, November 15, 2016 9:31 AM
To: Mendelson, Assaf
Cc: [hidden email]
Subject: Re: separate spark and hive

I agree with the high level idea, and thus 
SPARK-15691.

In reality, it's a huge amount of work to create & maintain a custom catalog. 
It might actually make sense to do, but it just seems a lot of work to do right 
now and it'd take a toll on interoperability.

If you don't need persistent catalog, you can just run Spark without Hive mode, 
can't you?




On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden 
email]> wrote:
Hi,
Today, we basically force people to use hive if they want to get the full use 
of spark SQL.
When doing the default installation this means that a derby.log and 
metastore_db directory are created where we run from.
The problem with this is that if we run multiple scripts from the same working 
directory we have a problem.
The solution we employ locally is to always run from different directory as we 
ignore hive in practice (this of course means we lose the ability to use some 
of the catalog options in spark session).
The only other solution is to create a full blown hive installation with proper 
configuration (probably for a JDBC solution).

I would propose that in most cases there shouldn’t be any hive use at all. Even 
for catalog elements such as saving a permanent table, we should be able to 
configure a target directory and simply write to it (doing everything file 
based to avoid the need for locking). Hive should be reserved for those who 
actually use it (probably for backward compatibility).

Am I missing something here?
Assaf.


View this message in context: separate spark and 
hive
Sent from the Apache Spark Developers List mailing list 
archive at 
Nabble.com.









RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
Spark shell (and pyspark) by default create the spark session with hive support 
(also true when the session is created using getOrCreate, at least in pyspark)
At a minimum there should be a way to configure it using spark-defaults.conf
Assaf.

From: rxin [via Apache Spark Developers List] 
[mailto:ml-node+s1001551n19882...@n3.nabble.com]
Sent: Tuesday, November 15, 2016 9:46 AM
To: Mendelson, Assaf
Subject: Re: separate spark and hive

If you just start a SparkSession without calling enableHiveSupport it actually 
won't use the Hive catalog support.


On Mon, Nov 14, 2016 at 11:44 PM, Mendelson, Assaf <[hidden 
email]> wrote:
The default generation of spark context is actually a hive context.
I tried to find on the documentation what are the differences between hive 
context and sql context and couldn’t find it for spark 2.0 (I know for previous 
versions there were a couple of functions which required hive context as well 
as window functions but those seem to have all been fixed for spark 2.0).
Furthermore, I can’t seem to find a way to configure spark not to use hive. I 
can only find how to compile it without hive (and having to build from source 
each time is not a good idea for a production system).

I would suggest that working without hive should be either a simple 
configuration or even the default and that if there is any missing 
functionality it should be documented.
Assaf.


From: Reynold Xin [mailto:[hidden 
email]]
Sent: Tuesday, November 15, 2016 9:31 AM
To: Mendelson, Assaf
Cc: [hidden email]
Subject: Re: separate spark and hive

I agree with the high level idea, and thus 
SPARK-15691.

In reality, it's a huge amount of work to create & maintain a custom catalog. 
It might actually make sense to do, but it just seems a lot of work to do right 
now and it'd take a toll on interoperability.

If you don't need persistent catalog, you can just run Spark without Hive mode, 
can't you?




On Mon, Nov 14, 2016 at 11:23 PM, assaf.mendelson <[hidden 
email]> wrote:
Hi,
Today, we basically force people to use hive if they want to get the full use 
of spark SQL.
When doing the default installation this means that a derby.log and 
metastore_db directory are created where we run from.
The problem with this is that if we run multiple scripts from the same working 
directory we have a problem.
The solution we employ locally is to always run from different directory as we 
ignore hive in practice (this of course means we lose the ability to use some 
of the catalog options in spark session).
The only other solution is to create a full blown hive installation with proper 
configuration (probably for a JDBC solution).

I would propose that in most cases there shouldn’t be any hive use at all. Even 
for catalog elements such as saving a permanent table, we should be able to 
configure a target directory and simply write to it (doing everything file 
based to avoid the need for locking). Hive should be reserved for those who 
actually use it (probably for backward compatibility).

Am I missing something here?
Assaf.


View this message in context: separate spark and 
hive
Sent from the Apache Spark Developers List mailing list 
archive at 
Nabble.com.








--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/separate-spark-and-hive-tp19879p19883.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.