Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
Streaming UI tab showing empty events and very different metrics than on 1.5.2 On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams wrote: > After a bit of effort I moved from a Spark cluster running 1.5.2, to a > Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP.

Re: Improving performance of a kafka spark streaming app

2016-06-22 Thread Colin Kincaid Williams
After a bit of effort I moved from a Spark cluster running 1.5.2, to a Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The completed batches are no longer showing the number of events processed in the Streaming UI tab. I'm getting around 4k inserts per second in hbase, but I

Re: Spark Kafka stream processing time increasing gradually

2016-06-22 Thread Roshan Singh
Thanks for the detailed explanation. Just tested it, worked like a charm. On Mon, Jun 20, 2016 at 1:02 PM, N B wrote: > Its actually necessary to retire keys that become "Zero" or "Empty" so to > speak. In your case, the key is "imageURL" and values are a dictionary, one >

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Praveen R
I had some errors like SqlBaseParser class missing, and figured out I needed to get these classes from SqlBase.g4 using antlr4. It works fine now. On Thu, Jun 23, 2016 at 9:20 AM, Jeff Zhang wrote: > It works well with me. You can try reimport it into intellij. > > On Thu, Jun

Re: Building Spark 2.X in Intellij

2016-06-22 Thread Jeff Zhang
It works well with me. You can try reimport it into intellij. On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch wrote: > > Building inside intellij is an ever moving target. Anyone have the magical > procedures to get it going for 2.X? > > There are numerous library references

Re: NullPointerException when starting StreamingContext

2016-06-22 Thread Ted Yu
Which Scala version / Spark release are you using ? Cheers On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind wrote: > Hello Experts, > > I am getting this error repeatedly: > > 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the > context, marking it as

NullPointerException when starting StreamingContext

2016-06-22 Thread Sunita Arvind
Hello Experts, I am getting this error repeatedly: 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the context, marking it as stopped java.lang.NullPointerException at com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)

why did spark2.0 Disallow ROW FORMAT and STORED AS (parquet | orc | avro etc.)

2016-06-22 Thread linxi zeng
Hi All, I have tried the spark sql of Spark branch-2.0 and encountered an unexpected problem: Operation not allowed: ROW FORMAT DELIMITED is only compatible with 'textfile', not 'orc'(line 1, pos 0) the sql is like: CREATE TABLE IF NOT EXISTS test.test_orc ( ... ) PARTITIONED BY (xxx) ROW
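Since ORC (like parquet and avro) defines its own serialization, one hedged workaround is simply to drop the ROW FORMAT DELIMITED clause; the column and partition names below are placeholders:

    CREATE TABLE IF NOT EXISTS test.test_orc (
      col1 STRING,
      col2 INT
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC;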

Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
Building inside intellij is an ever moving target. Anyone have the magical procedures to get it going for 2.X? There are numerous library references that - although included in the pom.xml build - are for some reason not found when processed within Intellij.

Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread Ted Yu
The tar ball was built against hadoop 2.6, which is compatible with hadoop 2.7.2. So the answer should be yes. On Wed, Jun 22, 2016 at 7:10 PM, 喜之郎 <251922...@qq.com> wrote: > Thanks. > > In addition, I want to know, if I can use spark-1.6.1-bin-hadoop2.6.tgz >

Re: Creating a python port for a Scala Spark Projeect

2016-06-22 Thread Daniel Imberman
Thank you Holden, I look forward to watching your talk! On Wed, Jun 22, 2016 at 7:12 PM Holden Karau wrote: > PySpark RDDs are (on the Java side) are essentially RDD of pickled objects > and mostly (but not entirely) opaque to the JVM. It is possible (by using > some

Re: Creating a python port for a Scala Spark Projeect

2016-06-22 Thread Holden Karau
PySpark RDDs are (on the Java side) are essentially RDD of pickled objects and mostly (but not entirely) opaque to the JVM. It is possible (by using some internals) to pass a PySpark DataFrame to a Scala library (you may or may not find the talk I gave at Spark Summit useful

Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread 喜之郎
Thanks. In addition, I want to know if I can use spark-1.6.1-bin-hadoop2.6.tgz, which is a pre-built package, on hadoop 2.7.2? -- -- From: "Ted Yu"; Date: 2016-06-22 (Wed) 11:51; To:

Creating a python port for a Scala Spark Projeect

2016-06-22 Thread Daniel Imberman
Hi All, I've developed a spark module in scala that I would like to add a python port for. I want to be able to allow users to create a pyspark RDD and send it to my system. I've been looking into the pyspark source code as well as py4J and was wondering if there has been anything like this

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
Thank you. Sure, if I find something I will post it. Regards, Raghava. On Wed, Jun 22, 2016 at 7:43 PM, Nirav Patel wrote: > I believe it would be task, partitions, task status etc information. I do > not know exact of those things but I had OOM on driver with 512MB and

Re: Spark ml and PMML export

2016-06-22 Thread jayantshekhar
I have the same question on Spark ML and PMML export as Philippe. Is there a way to export Spark ML generated models to PMML? Jayant -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ml-and-PMML-export-tp26773p27213.html Sent from the Apache Spark

Re: GraphX performance and settings

2016-06-22 Thread Maja Kabiljo
Thank you for the reply Deepak. I know with more executors / memory per executor it will work, we actually have a bunch of experiments we ran with various setups. I'm just trying to confirm that limits we are hitting are right, or there are some other configuration parameters we didn't try yet

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Nirav Patel
I believe it would be task, partitions, task status etc information. I do not know the exact details of those things, but I had OOM on the driver with 512MB and increasing it did help. Someone else might be able to answer about the exact memory usage of the driver better. You also seem to use broadcast, which means sending

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
Ok. Would you be able to shed more light on what exact metadata it manages and what the relationship is with a larger number of partitions/nodes? There is one executor running on each node -- so there are 64 executors in total. Each executor, including the driver, is given 12GB and this is the maximum

Re: Shuffle service fails to register driver - Spark - Mesos

2016-06-22 Thread Feller, Eugen
Make sure you are running the MesosShuffleService and not the standard shuffle service: * org.apache.spark.deploy.mesos.MesosExternalShuffleService v.s. org.apache.spark.deploy.ExternalShuffleService * start-mesos-shuffle-service.sh v.s. start-shuffle-service.sh Thanks to Timothy Chen

Re: Explode row with start and end dates into row for each date

2016-06-22 Thread Saurabh Sardeshpande
I don't think there would be any issues since MLlib is part of Spark as against being an external package. Most of the problems I've had to deal with were because of the existence of both versions of Python on a system, and not Python 3 itself. On Wed, Jun 22, 2016 at 3:51 PM, John Aherne

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Nirav Patel
Yes, the driver keeps a fair amount of metadata to manage scheduling across all your executors. I assume with 64 nodes you have more executors as well. A simple way to test is to increase driver memory. On Wed, Jun 22, 2016 at 10:10 AM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > It is an

Re: Explode row with start and end dates into row for each date

2016-06-22 Thread John Aherne
Thanks Saurabh! That explode function looks like it is exactly what I need. We will be using MLlib quite a lot - Do I have to worry about python versions for that? John On Wed, Jun 22, 2016 at 4:34 PM, Saurabh Sardeshpande wrote: > Hi John, > > If you can do it in Hive,

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
PS. In my reduceByKey operation I have two mutable object. What I do is merge mutable2 into mutable1 and return mutable1. I read that it works for aggregateByKey so thought it will work for reduceByKey as well. I might be wrong here. Can someone verify if this will work or be un predictable? On
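For reference, a minimal sketch of the pattern being described, with MyObj#merge standing in for the real merge logic; whether reusing the first argument like this is safe inside reduceByKey is exactly the question being asked:

    val reduced = rdd.reduceByKey { (a, b) =>
      a.merge(b)   // mutates a in place
      a            // returns the mutated left-hand object
    }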

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
The only documentation on this… in terms of direction … (that I could find) If your client is not close to the cluster (e.g. your PC) then you definitely want to go cluster to improve performance. If your client is close to the cluster (e.g. an edge node) then you could go either client or

Re: Explode row with start and end dates into row for each date

2016-06-22 Thread Saurabh Sardeshpande
Hi John, If you can do it in Hive, you should be able to do it in Spark. Just make sure you import HiveContext instead of SQLContext. If your intent is to explore rather than get stuff done, I've not aware of any RDD operations that do this for you, but there is a DataFrame operation called
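A minimal sketch of that DataFrame route, assuming Spark 1.6, Java 8 (for java.time), and placeholder column names start_date / end_date holding yyyy-MM-dd strings:

    import java.time.LocalDate
    import org.apache.spark.sql.functions.{explode, udf}
    import sqlContext.implicits._

    // Build the list of days between start and end (inclusive) for each row.
    val datesBetween = udf { (start: String, end: String) =>
      val s = LocalDate.parse(start)
      val e = LocalDate.parse(end)
      Iterator.iterate(s)(_.plusDays(1)).takeWhile(!_.isAfter(e)).map(_.toString).toList
    }

    // explode() turns the generated array into one output row per day,
    // carrying the remaining columns along unchanged.
    val perDay = df.withColumn("date", explode(datesBetween($"start_date", $"end_date")))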

Explode row with start and end dates into row for each date

2016-06-22 Thread John Aherne
Hi Everyone, I am pretty new to Spark (and the mailing list), so forgive me if the answer is obvious. I have a dataset, and each row contains a start date and end date. I would like to explode each row so that each day between the start and end dates becomes its own row. e.g. row1 2015-01-01

Recovery techniques for Spark Streaming scheduling delay

2016-06-22 Thread C. Josephson
We have a Spark Streaming application that has basically zero scheduling delay for hours, but then suddenly it jumps up to multiple minutes and spirals out of control (see screenshot of job manager here: http://i.stack.imgur.com/kSftN.png) This happens after a while even if we double the batch

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Marcelo Vanzin
On Wed, Jun 22, 2016 at 1:32 PM, Mich Talebzadeh wrote: > Does it also depend on the number of Spark nodes involved in choosing which > way to go? Not really. -- Marcelo - To unsubscribe, e-mail:

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
Thanks Marcelo, Sounds like cluster mode is more resilient than the client-mode. Does it also depend on the number of Spark nodes involved in choosing which way to go? Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Marcelo Vanzin
Trying to keep the answer short and simple... On Wed, Jun 22, 2016 at 1:19 PM, Michael Segel wrote: > But this gets to the question… what are the real differences between client > and cluster modes? > What are the pros/cons and use cases where one has advantages over

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
LOL… I hate YARN, but unfortunately I don’t get to make the call on which tools we’re going to use, I just get paid to make stuff work on the tools provided. ;-) Testing is somewhat problematic. You have to really test at some large enough fraction of scale. Fortunately for this issue (YARN

Executors killed in Workers with Error: invalid log directory

2016-06-22 Thread Yiannis Gkoufas
Hi there, I have been getting a strange error in spark-1.6.1 The job submitted uses only the executor launched on the Master node while the other workers are idle. When I check the errors from the web ui to investigate on the killed executors I see the error: Error: invalid log directory

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
This is exactly the sort of topic that distinguishes lab work from enterprise practice :) The question on YARN client versus YARN cluster mode. I am not sure how much of an impact it is going to make in real life if I choose one over the other? These days I tell developers that it is perfectly valid

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
Hi, I do not see any indication of errors or executor getting killed in spark UI - jobs, stages, event timelines. No task failures. I also don't see any errors in executor logs. Thanks On Wed, Jun 22, 2016 at 2:32 AM, Ted Yu wrote: > For the run which returned incorrect

Re: spark streaming questions

2016-06-22 Thread pandees waran
For my question (2), From my understanding checkpointing ensures the recovery from failures. Sent from my iPhone > On Jun 22, 2016, at 10:27 AM, pandees waran wrote: > > In general, if you have multiple steps in a workflow : > For every batch > 1.stream data from s3 >

Networking Exceptions in Spark 1.6.1 with Dynamic Allocation and YARN Pre-Emption

2016-06-22 Thread Nick Peterson
Hey all, We're working on setting up a Spark 1.6.1 cluster on Amazon EC2, and encountering some problems related to pre-emption. We have followed all the instructions for setting up dynamic allocation, including enabling the external spark shuffle service in the YARN NodeManagers. When a
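For anyone comparing notes, the switches that setup usually involves look roughly like this (listed only for reference, since the thread says they are already in place):

    # spark-defaults.conf
    spark.dynamicAllocation.enabled   true
    spark.shuffle.service.enabled     true

    # yarn-site.xml on every NodeManager
    yarn.nodemanager.aux-services                        mapreduce_shuffle,spark_shuffle
    yarn.nodemanager.aux-services.spark_shuffle.class    org.apache.spark.network.yarn.YarnShuffleService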

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
JDBC reliability problem? Ok… a bit more explanation… Usually when you have to go back to a legacy system, its because the data set is usually metadata and is relatively small. Its not the sort of data that gets ingested in to a data lake unless you’re also ingesting the metadata and are

Re: spark streaming questions

2016-06-22 Thread pandees waran
In general, if you have multiple steps in a workflow : For every batch 1.stream data from s3 2.write it to hbase 3.execute a hive step using the data in s3 In this case all these 3 steps are part of the workflow. That's the reason I mentioned about workflow orchestration. The other question

Re: spark streaming questions

2016-06-22 Thread Mich Talebzadeh
Hi Pandees, can you kindly explain what you are trying to achieve by incorporating Spark streaming with workflow orchestration. Is this some form of back-to-back seamless integration. I have not used it myself but would be interested in knowing more about your use case. Cheers, Dr Mich

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
It is an iterative algorithm which uses map, mapPartitions, join, union, filter, broadcast and count. The goal is to compute a set of tuples and in each iteration few tuples are added to it. Outline is given below 1) Start with initial set of tuples, T 2) In each iteration compute deltaT, and add
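Schematically, the loop described above looks something like this sketch, where computeDelta stands in for the real map/mapPartitions/join/filter pipeline:

    var tuples = initialTuples.cache()
    var delta  = computeDelta(tuples)          // placeholder for the real pipeline

    while (delta.count() > 0) {                // count() is an action evaluated on the driver
      tuples = tuples.union(delta).cache()
      delta  = computeDelta(tuples)
    }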

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-22 Thread Ajay Chander
Thanks for the confirmation Mich! On Wednesday, June 22, 2016, Mich Talebzadeh wrote: > Hi Ajay, > > I am afraid for now transaction heart beat do not work through Spark, so I > have no other solution. > > This is interesting point as with Hive running on Spark engine

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Xinh Huynh
I can see how the linked documentation could be confusing: "Aggregate function: returns the number of items in a group." What it doesn't mention is that it returns the number of rows for which the given column is non-null. Xinh On Wed, Jun 22, 2016 at 9:31 AM, Takeshi Yamamuro

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-22 Thread Mich Talebzadeh
Hi Ajay, I am afraid for now transaction heart beat do not work through Spark, so I have no other solution. This is interesting point as with Hive running on Spark engine there is no issue with this as Hive handles the transactions, I gather in simplest form Hive has to deal with its metadata

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
Nice reactions. My comments: @Ted.Yu: I see now that count(*) works for what I want @Takeshi: I understand this is the syntax but it was not clear to me what this $"b" column will be used for... My line of thinking was this: I started with 1) someDF.groupBy("colA").count() and then I realized

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
Thanks Mike for clarification. I think there is another option to get data out of RDBMS through some form of SELECT ALL COLUMNS TAB SEPARATED OR OTHER and put them in a flat file or files. scp that file from the RDBMS directory to a private directory on HDFS system and push it into HDFS. That

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Sonal Goyal
What does your application do? Best Regards, Sonal Founder, Nube Technologies Reifier at Strata Hadoop World Reifier at Spark Summit 2015

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Takeshi Yamamuro
Hi, An argument for `functions.count` is needed for per-column counting; df.groupBy($"a").agg(count($"b")) // maropu On Thu, Jun 23, 2016 at 1:27 AM, Ted Yu wrote: > See the first example in: > > http://www.w3schools.com/sql/sql_func_count.asp > > On Wed, Jun 22, 2016 at
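Put differently (a small sketch, assuming a DataFrame df with columns a and b):

    import org.apache.spark.sql.functions.{count, lit}

    df.groupBy($"a").agg(count($"b"))    // counts only the rows where b is non-null
    df.groupBy($"a").agg(count(lit(1)))  // counts every row in the group
    df.groupBy($"a").count()             // shorthand for the per-group row count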

OOM on the driver after increasing partitions

2016-06-22 Thread Raghava Mutharaju
Hello All, We have a Spark cluster where driver and master are running on the same node. We are using Spark Standalone cluster manager. If the number of nodes (and the partitions) are increased, the same dataset that used to run to completion on lesser number of nodes is now giving an out of

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Ted Yu
See the first example in: http://www.w3schools.com/sql/sql_func_count.asp On Wed, Jun 22, 2016 at 9:21 AM, Jakub Dubovsky < spark.dubovsky.ja...@gmail.com> wrote: > Hey Ted, > > thanks for reacting. > > I am refering to both of them. They both take column as parameter > regardless of its type.

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
Hey Ted, thanks for reacting. I am referring to both of them. They both take a column as a parameter regardless of its type. The intuition here is that count should take no parameter. Or am I missing something? Jakub On Wed, Jun 22, 2016 at 6:19 PM, Ted Yu wrote: > Are you

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Ted Yu
Are you referring to the following method in sql/core/src/main/scala/org/apache/spark/sql/functions.scala : def count(e: Column): Column = withAggregateFunction { Did you notice this method ? def count(columnName: String): TypedColumn[Any, Long] = On Wed, Jun 22, 2016 at 9:06 AM, Jakub

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
That is the cost of exactly once :) > On 22 Jun 2016, at 12:54, sandesh deshmane wrote: > > We are going with checkpointing . we don't have identifier available to > identify if the message is already processed or not . > Even if we had it, then it will slow down the

Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
Hey sparkers, an aggregate function *count* in *org.apache.spark.sql.functions* package takes a *column* as an argument. Is this needed for something? I find it confusing that I need to supply a column there. It feels like it might be distinct count or something. This can be seen in latest

Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread Ted Yu
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr -Dhadoop.version=2.7.2 package On Wed, Jun 22, 2016 at 8:00 AM, 251922566 <251922...@qq.com> wrote: > ok,i will rebuild myself. if i want to use spark with hadoop 2.7.2, when i > build spark, i should put what on param

Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-22 Thread Ajay Chander
Hi Mich, Right now I have a similar usecase where I have to delete some rows from a hive table. My hive table is of type ORC, Bucketed and included transactional property. I can delete from hive shell but not from my spark-shell or spark app. Were you able to find any work around? Thank you.

Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread 251922566
OK, I will rebuild it myself. If I want to use spark with hadoop 2.7.2, when I build spark, what should I put in the --hadoop param, 2.7.2 or something else? Sent from my Huawei phone. Original message -- Subject: Re: spark-1.6.1-bin-without-hadoop can not use spark-sql From: Ted Yu To: 喜之郎 <251922...@qq.com> Cc: user

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Michael Segel
Hi, Just to clear a few things up… First I know its hard to describe some problems because they deal with client confidential information. (Also some basic ‘dead hooker’ thought problems to work through before facing them at a client.) The questions I pose here are very general and deal

spark streaming questions

2016-06-22 Thread pandees waran
Hello all, I have few questions regarding spark streaming : * I am wondering anyone uses spark streaming with workflow orchestrators such as data pipeline/SWF/any other framework. Is there any advantages /drawbacks on using a workflow orchestrator for spark streaming? *How do you guys manage

Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread Ted Yu
I wonder if the tar ball was built with: -Phive -Phive-thriftserver Maybe rebuild by yourself with the above ? FYI On Wed, Jun 22, 2016 at 4:38 AM, 喜之郎 <251922...@qq.com> wrote: > Hi all. > I download spark-1.6.1-bin-without-hadoop.tgz >

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
Thank you all very much for your patient help. I changed the spark.yarn.jar para from spark.yarn.jar hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar to spark.yarn.jar hdfs://master:9000/user/shihj/spark_lib/spark-assembly-1.6.1-hadoop2.6.0.jar then run

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Cody Koeninger
The direct stream doesn't automagically give you exactly-once semantics. Indeed, you should be pretty suspicious of anything that claims to give you end-to-end exactly-once semantics without any additional work on your part. To the original poster, have you read / watched the materials linked

[Spark + MLlib] how to update offline model with the online model

2016-06-22 Thread diplomatic Guru
Hello all, I have built a spark batch model using MLlib and a streaming online model. Now I would like to load the offline model in the streaming job and apply and update the model. Could you please advise me how to do it? Is there an example to look at? The streaming model does not allow saving or
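If the offline model happens to be k-means, one hedged way to do this with MLlib is to seed a StreamingKMeans with the offline cluster centers and keep training it on the stream; the model path and the two DStream[Vector]s below are placeholders:

    import org.apache.spark.mllib.clustering.{KMeansModel, StreamingKMeans}

    // Load the batch model and use its centers as the starting point.
    val offline = KMeansModel.load(sc, "hdfs:///models/offline-kmeans")

    val online = new StreamingKMeans()
      .setK(offline.clusterCenters.length)
      .setDecayFactor(1.0)
      .setInitialCenters(offline.clusterCenters,
                         Array.fill(offline.clusterCenters.length)(1.0))

    online.trainOn(trainingStream)                 // keeps updating the centers per batch
    val predictions = online.predictOn(testStream)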

Running JavaBased Implementation of StreamingKmeans Spark

2016-06-22 Thread Biplob Biswas
Hi, I implemented the streamingKmeans example provided in the spark website but in Java. The full implementation is here, http://pastebin.com/CJQfWNvk But i am not getting anything in the output except occasional timestamps like one below: --- Time:

Re: Can I use log4j2.xml in my Apache Spark application

2016-06-22 Thread Prajwal Tuladhar
One way to integrate log4j2 would be to enable flags `spark.executor.userClassPathFirst` and `spark.driver.userClassPathFirst` when submitting the application. This would cause application class loader to load first, initializing log4j2 logging context. But this can also potentially break other
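Sketched as a spark-submit invocation (the class and jar names are placeholders, and --files is just one way of shipping log4j2.xml alongside the application):

    spark-submit \
      --conf spark.driver.userClassPathFirst=true \
      --conf spark.executor.userClassPathFirst=true \
      --files log4j2.xml \
      --class com.example.MyApp \
      my-app.jar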

spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread ??????
Hi all. I download spark-1.6.1-bin-without-hadoop.tgz from website. And I configured "SPARK_DIST_CLASSPATH" in spark-env.sh. Now spark-shell run well. But spark-sql can not run. My hadoop version is 2.7.2. This is error infos: bin/spark-sql java.lang.ClassNotFoundException:

Can I use log4j2.xml in my Apache Spark application

2016-06-22 Thread Charan Adabala
Hi, We are trying to integrate log4j2.xml instead of log4j.properties in an Apache Spark application. We integrated log4j2.xml, but the problem is that it is unable to write the worker log of the application, while there is no problem writing the driver log. Can any one suggest how to integrate log4j2.xml in

Spark Task failure with File segment length as negative

2016-06-22 Thread Priya Ch
Hi All, I am running a Spark Application with 1.8TB of data (which is stored in Hive tables format). I am reading the data using HiveContext and processing it. The cluster has 5 nodes total, 25 cores per machine and 250GB per node. I am launching the application with 25 executors with 5 cores each

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We have not tried the direct approach. We are using the receiver based approach (we use zookeepers to connect from spark). We have around 20+ Kafka brokers and sometimes we replace the kafka brokers (they go down). So each time I need to change the list in the spark application and I need to restart the streaming

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Denys Cherepanin
Hi Sandesh, As I understand you are using the "receiver based" approach to integrate kafka with spark streaming. Did you try the "direct" approach? In this case offsets will be tracked by
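For reference, a minimal sketch of the direct approach with the Spark 1.6 spark-streaming-kafka API; the broker list and topic are placeholders and ssc is assumed to be an existing StreamingContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("events")

    // Offsets are tracked by the stream itself rather than by a receiver/WAL.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)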

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
We are going with checkpointing. We don't have an identifier available to identify if the message is already processed or not. Even if we had it, it would slow down the processing as we get 300k messages per sec, so the lookup will slow things down. Thanks Sandesh On Wed, Jun 22, 2016 at 3:28 PM,

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Jörn Franke
Spark Streaming does not guarantee exactly once for output actions. It only means that an item is processed once within an RDD. You can achieve at most once or at least once. You could however do at least once (via checkpointing) and record which messages have been processed (some identifier available?) and
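A very rough sketch of that "record which messages have been processed" idea; loadProcessedIds, markProcessed and process are placeholders for whatever external store and output action are actually in use:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val seen = loadProcessedIds()                      // placeholder: read processed ids
        records.filter(r => !seen.contains(r.id)).foreach { r =>
          process(r)                                       // placeholder: the real output action
          markProcessed(r.id)                              // placeholder: remember this id
        }
      }
    }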

Re: javax.net.ssl.SSLHandshakeException: unable to find valid certification path to requested target

2016-06-22 Thread Steve Loughran
On 21 Jun 2016, at 00:03, Utkarsh Sengar > wrote: We are intermittently getting this error when spark tries to load data from S3: Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Hi Sandesh, Where do these messages end up? Are they written to a sink (file, database etc)? What is the reason your app fails? Can that be remedied to reduce the impact? How do you identify that duplicates are sent and processed? Cheers, Dr Mich Talebzadeh LinkedIn *

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Mich Talebzadeh thanks for the reply. We have a retention policy of 4 hours for kafka messages and we have multiple other consumers which read from the kafka cluster (spark is one of them). We have a timestamp in the message, but we actually have multiple messages with the same timestamp. It's very hard to

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Ted Yu
For the run which returned incorrect result, did you observe any error (on workers) ? Cheers On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel wrote: > I have an RDD[String, MyObj] which is a result of Join + Map operation. It > has no partitioner info. I run reduceByKey

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
Yes this is more of a Kafka issue as Kafka sends the messages again. In your topic, do messages come with an ID or timestamp so that you can reject them if they have already been processed? In other words, do you have a way to tell which message was last processed via Spark before failing? You can of course

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Here I refer to a failure in the spark app. So when I restart, I see duplicate messages. To replicate the scenario, I just kill my spark app and then restart. On Wed, Jun 22, 2016 at 1:10 PM, Mich Talebzadeh wrote: > As I see it you are using Spark streaming to read

Re: FullOuterJoin on Spark

2016-06-22 Thread Gourav Sengupta
+1 for the guidance from Nirvan. Also it would be better to repartition and store the data in parquet format in case you are planning to do the joins more than once or with other data sources. Parquet with SPARK works like a charm. Over S3 I have seen its performance being quite close to cached

'numBins' property not honoured in BinaryClassificationMetrics class when spark.default.parallelism is not set to 1

2016-06-22 Thread sneha29shukla
Hi, I'm trying to use the BinaryClassificationMetrics class to compute the pr curve as below - import org.apache.avro.generic.GenericRecord import org.apache.hadoop.conf.Configuration import org.apache.hadoop.mapred.JobConf import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
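For context, the constructor in question looks like this (a sketch; scoreAndLabels is an RDD[(Double, Double)] of (score, label) pairs):

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // numBins down-samples the curve; whether it is honoured is what this thread is about.
    val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 100)
    val prCurve = metrics.pr()   // RDD of (recall, precision) points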

Re: how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread Mich Talebzadeh
As I see it you are using Spark streaming to read data from the source through Kafka. Your batch interval is 10 sec, so in that interval you have 10*300K = 3 million messages. When you say there is a failure, are you referring to a failure in the source or in the Spark streaming app? HTH Dr Mich Talebzadeh

Re: Does saveAsHadoopFile depend on master?

2016-06-22 Thread Spico Florin
Hi! I had a similar issue when the user that submitted the job to the spark cluster didn't have permission to write into hdfs. If you have the hdfs GUI then you can check which users exist and what permissions they have. Also you can check in the hdfs browser:(

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Takeshi Yamamuro
Hi, Could you check the issue also occurs in v1.6.1 and v2.0? // maropu On Wed, Jun 22, 2016 at 2:42 PM, Nirav Patel wrote: > I have an RDD[String, MyObj] which is a result of Join + Map operation. It > has no partitioner info. I run reduceByKey without passing any

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
Yes, it runs well: shihj@master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6$ ./bin/spark-submit \ > --class org.apache.spark.examples.SparkPi \ > --master local[4] \ > lib/spark-examples-1.6.1-hadoop2.6.0.jar 10 16/06/22 15:08:14 INFO SparkContext: Running Spark version 1.6.1 16/06/22

how to avoid duplicate messages with spark streaming using checkpoint after restart in case of failure

2016-06-22 Thread sandesh deshmane
Hi, I am writing a spark streaming application which reads messages from Kafka. I am using checkpointing and write ahead logs (WAL) to achieve fault tolerance. I have created a batch size of 10 sec for reading messages from kafka. I read messages from kafka and generate the count of messages as

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
I cannot get a lot of info from these logs but it surely seems like yarn setup issue. Did you try the local mode to check if it works - ./bin/spark-submit \ > --class org.apache.spark.examples.SparkPi \ > --master local[4] \ > spark-examples-1.6.1-hadoop2.6.0.jar 10 Note - the jar is a local

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
Application application_1466568126079_0006 failed 2 times due to AM Container for appattempt_1466568126079_0006_02 exited with exitCode: 1 For more detailed output, check the application tracking page: http://master:8088/proxy/application_1466568126079_0006/ Then, click on links to logs of each

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Are you able to run anything else on the cluster, I suspect its yarn that not able to run the class. If you could just share the logs in pastebin we could confirm that. On Wed, Jun 22, 2016 at 4:43 PM, 另一片天 <958943...@qq.com> wrote: > i want to avoid Uploading resource file (especially jar

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
I want to avoid uploading resource files (especially the jar package) because they are very big and the application waits too long; is there a good method? That is why I configured that para, but I did not get the effect I wanted. -- -- From: "Yash

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Ok, we moved to the next level :) Could you share more info on the error. You could get logs by the command - yarn logs -applicationId application_1466568126079_0006 On Wed, Jun 22, 2016 at 4:38 PM, 另一片天 <958943...@qq.com> wrote: > shihj@master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6$ >

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
shihj@master:/usr/local/spark/spark-1.6.1-bin-hadoop2.6$ ./bin/spark-submit \ > --class org.apache.spark.examples.SparkPi \ > --master yarn-cluster \ > --driver-memory 512m \ > --num-executors 2 \ > --executor-memory 512m \ > --executor-cores 2 \ >

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Try with : --master yarn-cluster On Wed, Jun 22, 2016 at 4:30 PM, 另一片天 <958943...@qq.com> wrote: > ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master > yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m > --executor-cores 2 >

Re: Spark 2.0.0 : GLM problem

2016-06-22 Thread april_ZMQ
The picture below shows the reply from the creator for this package, Yanbo Liang( https://github.com/yanboliang ) -- View this message in context:

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-client --driver-memory 512m --num-executors 2 --executor-memory 512m --executor-cores 2 hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar 10 Warning: Skip remote jar

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Or better , try the master as yarn-cluster, ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ --driver-memory 512m \ --num-executors 2 \ --executor-memory 512m \ --executor-cores 2 \

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
I meant try having the full path in the spark submit command- ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn-client \ --driver-memory 512m \ --num-executors 2 \ --executor-memory 512m \ --executor-cores 2 \

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
"Is it able to run on local mode ?" -- what does that mean? standalone mode? -- -- From: "Yash Sharma"; Date: 2016-06-22 (Wed) 2:18; To: "Saisai Shao"; Cc: "另一片天"<958943...@qq.com>;

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread 另一片天
shihj@master:~/workspace/hadoop-2.6.4$ bin/hadoop fs -ls hdfs://master:9000/user/shihj/spark_lib Found 1 items -rw-r--r-- 3 shihj supergroup 118955968 2016-06-22 10:24 hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar shihj@master:~/workspace/hadoop-2.6.4$ can

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2016-06-22 Thread Yash Sharma
Try providing the jar with the hdfs prefix. Its probably just because its not able to find the jar on all nodes. hdfs://master:9000/user/shihj/spark_lib/spark-examples-1.6.1-hadoop2.6.0.jar Is it able to run on local mode ? On Wed, Jun 22, 2016 at 4:14 PM, Saisai Shao

Re: Silly question about Yarn client vs Yarn cluster modes...

2016-06-22 Thread Mich Talebzadeh
If you are going to get data out of an RDBMS like Oracle then the correct procedure is: 1. Use Hive on Spark execution engine. That improves Hive performance 2. You can use JDBC through Spark itself. No issue there. It will use JDBC provided by HiveContext 3. JDBC is fine. Every
