Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
Thanks for the link and info Cody! Regards, Aakash On Tue, Nov 15, 2016 at 7:47 PM, Cody Koeninger wrote: > Generating / defining an RDD is not the same thing as running the > compute() method of an RDD. The direct stream definitely runs Kafka > consumers on the

Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
Generating / defining an RDD is not the same thing as running the compute() method of an RDD. The direct stream definitely runs Kafka consumers on the executors. If you want more info, the blog post and video linked from https://github.com/koeninger/kafka-exactly-once refer to the 0.8
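
For reference, a minimal sketch of how the 0.10 direct stream is typically wired up (the broker address, group id, and topic name below are illustrative assumptions, not values from this thread):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("kafka-010-sketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder consumer settings.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    // The stream is *defined* here on the driver, but the Kafka consumers that
    // fetch records in compute() run inside tasks on the executors.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))

    stream.map(record => (record.key, record.value)).print()
    ssc.start()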

Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
> You can use the 0.8 artifact to consume from a 0.9 broker We are currently using "Camus" in production, and one of the main goals of moving to Spark is to use the new Kafka consumer API of Kafka 0.9, and in our case we need the security provisions

Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
With predError.zip(input) we get an RDD of pairs, so we could just do a sample on predError or input, but if we do so we can't use zip (the number of elements must be the same in each partition). Thank you!
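
A small sketch of the constraint being discussed, assuming predError and input come from the same upstream computation (same number of partitions and elements per partition): sampling the zipped RDD keeps the pairs aligned, while sampling either RDD on its own would break zip's precondition.

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Zip first, then sample, so prediction errors stay paired with their instances.
    // Sampling predError or input separately would leave partitions with different
    // element counts, and a later zip would fail at runtime.
    def subsample[E: ClassTag, I: ClassTag](predError: RDD[E], input: RDD[I],
                                            rate: Double, seed: Long): RDD[(E, I)] =
      predError.zip(input).sample(withReplacement = false, rate, seed)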

Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread WangJianfei
With predError.zip(input) we get an RDD of pairs, so we could just do a sample on predError or input, but if we do so we can't use zip (the number of elements must be the same in each partition). Thank you! -- Original message -- From: "Joseph Bradley [via Apache Spark Developers

Re: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread Cody Koeninger
It'd probably be worth no longer marking the 0.8 interface as experimental. I don't think it's likely to be subject to active development at this point. You can use the 0.8 artifact to consume from a 0.9 broker. Where are you reading documentation indicating that the direct stream only runs on

Re: Running lint-java during PR builds?

2016-11-15 Thread Shixiong(Ryan) Zhu
I remember it's because you need to run `mvn install` before running lint-java if the maven cache is empty, and `mvn install` is pretty heavy. On Tue, Nov 15, 2016 at 1:21 PM, Marcelo Vanzin wrote: > Hey all, > > Is there a reason why lint-java is not run during PR builds?

Running lint-java during PR builds?

2016-11-15 Thread Marcelo Vanzin
Hey all, Is there a reason why lint-java is not run during PR builds? It seems to be Maven-only; is it really expensive to run after an sbt build? I see a lot of PRs coming in to fix Java style issues, and those all seem a little unnecessary. Either we're enforcing style checks or we're

NodeManager heap size with ExternalShuffleService

2016-11-15 Thread Artur Sukhenko
Hello guys, When you enable the ExternalShuffleService (spark-shuffle) in the NodeManager, there is no suggestion to increase the NM heap size in the Spark docs or anywhere else. Shouldn't we include this in Spark's documentation? I have seen the NM take a lot of memory (5+ GB with the default 1 GB), and in case of its

Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

2016-11-15 Thread Joseph Bradley
Thanks for the suggestion. That would be faster, but less accurate in most cases. It's generally better to use a new random sample on each iteration, based on literature and results I've seen. Joseph On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei < wangjianfe...@otcaix.iscas.ac.cn> wrote: > when
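
A sketch of the point about fresh samples (illustrative only, not the actual GradientBoostedTrees internals): varying the seed per boosting iteration draws a new random subsample each time instead of reusing one fixed sample.

    import org.apache.spark.rdd.RDD

    // Draw a fresh subsample on every iteration by deriving the seed from the
    // iteration number; with a constant seed the same rows would be reused.
    def iterationSample[T](data: RDD[T], subsamplingRate: Double,
                           iteration: Int, baseSeed: Long = 42L): RDD[T] =
      if (subsamplingRate < 1.0) {
        data.sample(withReplacement = false, subsamplingRate, baseSeed + iteration)
      } else {
        data
      }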

Fwd: using Spark Streaming with Kafka 0.9/0.10

2016-11-15 Thread aakash aakash
Re-posting it to the dev group. Thanks and Regards, Aakash -- Forwarded message -- From: aakash aakash Date: Mon, Nov 14, 2016 at 4:10 PM Subject: using Spark Streaming with Kafka 0.9/0.10 To: user-subscr...@spark.apache.org Hi, I am planning to use Spark

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
You still have the problem that even within a single Job it is often the case that not every Exchange really wants to use the same number of shuffle partitions. On Tue, Nov 15, 2016 at 2:46 AM, Sean Owen wrote: > Once you get to needing this level of fine-grained control,
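
To illustrate why one session-wide setting is coarse, a sketch (table contents and sizes are made-up assumptions): both exchanges below inherit the same spark.sql.shuffle.partitions value, regardless of how much data each one actually shuffles.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]").appName("shuffle-partitions-sketch").getOrCreate()
    import spark.implicits._

    // One session-wide knob: every exchange planned after this point uses
    // 2000 shuffle partitions, whether it needs them or not.
    spark.conf.set("spark.sql.shuffle.partitions", "2000")

    val df = spark.range(0L, 10000000L)
      .select(($"id" % 1000).as("key"), $"id".as("value"))

    df.groupBy("key").count().collect()                      // exchange #1: 2000 partitions for ~1000 groups
    df.select(($"value" % 10).as("v")).distinct().collect()  // exchange #2: 2000 partitions for ~10 rows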

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Mark Hamstra
AFAIK, the adaptive shuffle partitioning still isn't completely ready to be made the default, and there are some corner issues that need to be addressed before this functionality is declared finished and ready. E.g., the current logic can make data skew problems worse by turning One Big Partition

How to get statistics on key run time

2016-11-15 Thread 王桥石
Hi guys! Is there a way to get statistics on the top N keys by run time, for keys in a shuffle or in transformations after a shuffle, e.g. reduceByKey, groupByKey? That way one could find at a glance which keys cause problems.
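
As a rough illustration of what such a statistic could look like (a sketch; the function and names are assumptions, not an existing Spark API), counting records per key and taking the top N is a common way to spot the skewed keys that dominate reduceByKey/groupByKey run time:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    // Approximate "problem keys" by record count: task run time in a shuffle is
    // usually dominated by the most frequent keys.
    def topNKeysByCount[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)], n: Int): Array[(K, Long)] =
      rdd.mapValues(_ => 1L)
         .reduceByKey(_ + _)
         .top(n)(Ordering.by[(K, Long), Long](_._2))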

RE: Handling questions in the mailing lists

2016-11-15 Thread assaf.mendelson
We should probably also update the section on helping other users in the how-to-contribute guide (https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingbyHelpingOtherUsers). Assaf. From: Denny Lee [via Apache Spark Developers List]

Fwd:

2016-11-15 Thread Anton Okolnychyi
Hi, I have experienced a problem using the Datasets API in Spark 1.6, while almost identical code works fine in Spark 2.0. The problem is related to encoders and custom aggregators. *Spark 1.6 (the aggregation produces an empty map):* implicit val intStringMapEncoder: Encoder[Map[Int,
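
The snippet above is cut off, so as a general illustration only (the String value type is an assumption, and this is not Anton's original code): one way to supply an explicit encoder for a map-typed value in both 1.6 and 2.0 is the generic Kryo-based encoder, which treats the map as an opaque blob rather than a real MapType column.

    import org.apache.spark.sql.{Encoder, Encoders}

    // Illustrative only: a Kryo-backed encoder for a Map value. The map is
    // serialized as a binary blob, so the column is not queryable as a SQL map.
    implicit val intStringMapEncoder: Encoder[Map[Int, String]] =
      Encoders.kryo[Map[Int, String]]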

Re: separate spark and hive

2016-11-15 Thread Herman van Hövell tot Westerflier
You can start Spark without Hive support by setting the spark.sql.catalogImplementation configuration to in-memory, for example: ./bin/spark-shell --master local[*] --conf spark.sql.catalogImplementation=in-memory. I would not change the default from Hive to Spark-only just yet. On Tue,

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread Sean Owen
Once you get to needing this level of fine-grained control, should you not consider using the programmatic API in part, to let you control individual jobs? On Tue, Nov 15, 2016 at 1:19 AM leo9r wrote: > Hi Daniel, > > I completely agree with your request. As the amount of

Re: Spark-SQL parameters like shuffle.partitions should be stored in the lineage

2016-11-15 Thread leo9r
That's great insight Mark, I'm looking forward to giving it a try! According to the JIRA "Adaptive execution in Spark", it seems that some functionality was added in Spark 1.6.0 and the rest is still in progress. Are there any improvements to the

RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
After looking at the code, I found that spark.sql.catalogImplementation is set to “hive”. I would propose that it be set to “in-memory” by default (or at least have this in the documentation; the configuration documentation at http://spark.apache.org/docs/latest/configuration.html has

RE: separate spark and hive

2016-11-15 Thread assaf.mendelson
The Spark shell (and pyspark) by default creates the Spark session with Hive support (this is also true when the session is created using getOrCreate, at least in pyspark). At a minimum there should be a way to configure it using spark-defaults.conf. Assaf. From: rxin [via Apache Spark Developers List]
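
For reference, a sketch of the programmatic route under Spark 2.0's SparkSession API (enableHiveSupport() is essentially a shortcut for setting the same key to "hive"); note that if a session already exists, getOrCreate() returns it unchanged, which is the behaviour described above.

    import org.apache.spark.sql.SparkSession

    // Build a session that uses the in-memory catalog instead of the Hive one.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.catalogImplementation", "in-memory")
      .getOrCreate()

    println(spark.conf.get("spark.sql.catalogImplementation"))  // expected: in-memory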