Recovering from checkpoint question

2017-01-24 Thread shyla deshpande
If I just want to resubmit the Spark Streaming app with different configuration options, like a different --executor-memory or --total-executor-cores, will the checkpoint directory help me continue from where I left off? Appreciate your response. Thanks
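
For reference, the standard recovery pattern looks roughly like this (a sketch, assuming Spark Streaming 2.x; the checkpoint path is illustrative). One caveat worth knowing: Spark also serializes the SparkConf into the checkpoint, so some resubmitted settings may be overridden by the checkpointed values on recovery.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/myApp"  // illustrative path

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(new SparkConf().setAppName("myApp"), Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define the DStream graph here
      ssc
    }

    // Recovers the context (and the DStream graph) from the checkpoint if one
    // exists; otherwise builds a fresh context via createContext.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()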

Re: How to make the state in a streaming application idempotent?

2017-01-24 Thread shyla deshpande
Please share your thoughts. On Tue, Jan 24, 2017 at 4:01 PM, shyla deshpande wrote: > > > On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande > wrote: > >> My streaming application stores lot of aggregations using mapWithState. >> >> I want

[ML - Intermediate - Debug] - Loading Customized Transformers in Apache Spark raised a NullPointerException

2017-01-24 Thread Saulo Ricci
Hi, sorry if I'm being short here. I'm facing the issue described at this link; I would really appreciate any help from the team and am happy to talk and discuss more about this

Cached table details

2017-01-24 Thread kumar r
Hi, I have cached some tables in Spark Thrift Server. I want to get information about all cached tables. I can see it on the web UI at port 4040. Is there any command or other way to get the cached table details programmatically? Thanks, Kumar
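
One programmatic option, assuming Spark 2.x and a session that can see the same catalog as the Thrift Server (a sketch, not a Thrift-Server-specific API):

    // List tables and report which are cached, via the Catalog API.
    val spark = org.apache.spark.sql.SparkSession.builder.getOrCreate()
    spark.catalog.listTables().collect()
      .filter(t => spark.catalog.isCached(t.name))
      .foreach(t => println(s"cached: ${t.database}.${t.name}"))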

Re: How many spark masters and zookeeper servers do I need to tolerate one failure in one DC with two AZs?

2017-01-24 Thread kant kodali
I think I made a mistake... I would need at least N/2 + 1 nodes available at all times to reach quorum and be able to do leader election within the ZooKeeper ensemble. Given that I don't know ahead of time which availability zone is going to go down, I guess I can't really tolerate one AZ going down within
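
The arithmetic behind that conclusion, as a quick sketch: an ensemble of n nodes needs floor(n/2) + 1 for quorum, and with only two AZs one of them must host at least ceil(n/2) nodes, so losing that AZ always breaks quorum:

    def quorum(n: Int): Int = n / 2 + 1
    for (n <- 3 to 7) {
      val largerAz = (n + 1) / 2        // ceil(n/2) nodes live in the larger AZ
      val remaining = n - largerAz      // survivors if the larger AZ goes down
      println(s"n=$n quorum=${quorum(n)} survivors=$remaining tolerated=${remaining >= quorum(n)}")
    }
    // Prints tolerated=false for every n: two AZs can never survive a whole-AZ loss.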

How many spark masters and do I need to tolerate one failure in one DC and two AZ?

2017-01-24 Thread kant kodali
How many spark masters and zookeeper servers do I need to tolerate one failure in one DC that has two availability zones ? Note: The one failure that I want to tolerate can be in either availability zone. Here is my understanding so far. please correct me If I am wrong? for Zookeeper I would

Re: printSchema showing incorrect datatype?

2017-01-24 Thread Takeshi Yamamuro
Hi, AFAIK `Dataset#printSchema` just prints the output schema of the logical plan that the Dataset has. The logical plans in your example are as follows: --- scala> x.as[Array[Byte]].explain(true) == Analyzed Logical Plan == x: string Project [value#1 AS x#3] +- LocalRelation [value#1]
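
A condensed repro of the behavior being described (a sketch for the 2.x spark-shell): the as[Array[Byte]] alone doesn't change the analyzed plan, while the map forces deserialization, after which the schema is binary:

    val x = Seq("a", "b").toDF("x")
    x.as[Array[Byte]].printSchema()                // x: string (plan unchanged)
    x.as[Array[Byte]].map(identity).printSchema()  // value: binary (after deserialization)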

json_tuple fails to parse string with emoji

2017-01-24 Thread Andrew Ehrlich
On Spark 1.6.0, calling json_tuple() with an emoji character in one of the values returns nulls: Input: """ "myJsonBody": { "field1": "" } """ Query: """ ... LATERAL VIEW JSON_TUPLE(e.myJsonBody,'field1') k AS field1, ... """ This looks like a platform-dependent issue; the parsing

Re: Spark SQL DataFrame to Kafka Topic

2017-01-24 Thread ayan guha
Is there a plan to have this in pyspark in some later release? On Wed, 25 Jan 2017 at 10:01 am, Koert Kuipers wrote: > i implemented a sink using foreach it was indeed straightforward thanks > > On Fri, Jan 13, 2017 at 6:30 PM, Tathagata Das < > tathagata.das1...@gmail.com>

Re: How to make the state in a streaming application idempotent?

2017-01-24 Thread shyla deshpande
On Tue, Jan 24, 2017 at 9:44 AM, shyla deshpande wrote: > My streaming application stores lot of aggregations using mapWithState. > > I want to know what are all the possible ways I can make it idempotent. > > Please share your views. > > Thanks > > On Mon, Jan 23, 2017

Re: Spark SQL DataFrame to Kafka Topic

2017-01-24 Thread Koert Kuipers
I implemented a sink using foreach; it was indeed straightforward, thanks. On Fri, Jan 13, 2017 at 6:30 PM, Tathagata Das wrote: > Structured Streaming has a foreach sink, where you can essentially do what > you want with your data. It's easy to create a Kafka producer,
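
For reference, a minimal sketch of such a sink (assuming Spark 2.x Structured Streaming and the Kafka 0.10 producer; the topic, broker, and serialization choices are illustrative):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.{ForeachWriter, Row}

    class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {
      var producer: KafkaProducer[String, String] = _

      def open(partitionId: Long, version: Long): Boolean = {
        val props = new Properties()
        props.put("bootstrap.servers", servers)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        producer = new KafkaProducer[String, String](props)
        true
      }

      def process(row: Row): Unit =
        producer.send(new ProducerRecord[String, String](topic, row.mkString(",")))

      def close(errorOrNull: Throwable): Unit =
        if (producer != null) producer.close()
    }

    // df.writeStream.foreach(new KafkaSink("myTopic", "broker:9092")).start()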

Re: Spark Streaming proactive monitoring

2017-01-24 Thread Jacek Laskowski
Hi, my suggestion is to use a StreamingListener to track metrics and react appropriately. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0
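
A small sketch of that idea (the delay threshold and the reaction are illustrative):

    import org.apache.spark.streaming.scheduler._

    class LagMonitor extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val delay = batch.batchInfo.totalDelay.getOrElse(0L)
        if (delay > 60000L) {  // alert if a batch took more than a minute end to end
          println(s"WARN: batch delayed $delay ms")
        }
      }
    }

    // ssc.addStreamingListener(new LagMonitor)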

How do I dynamically add nodes to spark standalone cluster and be able to discover them?

2017-01-24 Thread kant kodali
Hi, How do I dynamically add nodes to spark standalone cluster and be able to discover them? Does Zookeeper do service discovery? What is the standard tool for these things? Thanks, kant
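
In standalone mode, workers register themselves with the master, so adding a node is typically just starting a worker pointed at the master URL (a sketch; the host name is illustrative):

    # On the new node; the master discovers the worker when it registers.
    ./sbin/start-slave.sh spark://master-host:7077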

Re: best practice for parallelizing model training

2017-01-24 Thread Jacek Laskowski
Hi Shiyuan, Re 1) Yes, but it has (almost) nothing to do with Spark since model1 = pipeline1.fit(df) is a blocking operation and therefore the following line will only be executed after this line has finished. Re 2) Use a concurrency library like Java's
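
A sketch of Jacek's second point using Scala Futures (pipeline1, pipeline2, and df are the names from the original question):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // Submit both fits concurrently; each blocks only its own thread.
    val f1 = Future { pipeline1.fit(df) }  // #A
    val f2 = Future { pipeline2.fit(df) }  // #B
    val model1 = Await.result(f1, Duration.Inf)
    val model2 = Await.result(f2, Duration.Inf)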

best practice for parallelizing model training

2017-01-24 Thread Shiyuan
Hi Spark users, I am looking for a way to parallelize #A and #B in the code below. Since dataframes in Spark are immutable, #A and #B are completely independent operations. My questions are: 1) As of Spark 2.1, #B only starts when #A is completed. Is that right? 2) What's the best way to

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
If you haven't looked at the offset ranges in the logs for the time period in question, I'd start there. On Jan 24, 2017 2:51 PM, "Hakan İlter" wrote: Sorry for misunderstanding. When I said that, I meant there are no lag in consumer. Kafka Manager shows each consumer's
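
For reference, a sketch of logging the offset ranges each batch actually processes (this is the standard HasOffsetRanges pattern; stream is the direct stream from the question, assuming the kafka010 integration):

    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // The direct stream's RDDs carry the exact Kafka offsets they cover.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic} partition ${r.partition}: offsets ${r.fromOffset} -> ${r.untilOffset}")
      }
    }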

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Hakan İlter
Sorry for the misunderstanding. When I said that, I meant there is no lag in the consumer. Kafka Manager shows each consumer's coverage and lag status. On Tue, Jan 24, 2017 at 10:45 PM, Cody Koeninger wrote: > When you said " I check the offset ranges from Kafka Manager and don't >

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
When you said " I check the offset ranges from Kafka Manager and don't see any significant deltas.", what were you comparing it against? The offset ranges printed in spark logs? On Tue, Jan 24, 2017 at 2:11 PM, Hakan İlter wrote: > First of all, I can both see the "Input

Re: Sorting each partitions and writing to CSVs

2017-01-24 Thread Ivan Gozali
For those interested, after digging further, I was able to consistently reproduce the issue with a synthetic dataset. My findings are documented here: https://gist.github.com/igozali/d327a85646abe7ab10c2ae479bed431f -- Regards, Ivan Gozali Lecida Email: i...@lecida.com On Wed, Jan 18, 2017

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Hakan İlter
First of all, I can see both the "Input Rate" on the Spark job's statistics page and the Kafka producer messages/sec in Kafka Manager. The numbers differ when I have the problem; normally they are very close. Besides, the job is an ETL job; it writes the results to Elasticsearch. An

Node guardian implementation and Spark standalone process

2017-01-24 Thread Fernando Avalos
Hi, we were taking a look at https://github.com/killrweather/killrweather. This application is deployed on an Akka cluster and looks like a reference app from DataStax. Is this the best approach for Spark Streaming? Or should we go with Spark standalone? What are the pros and cons? Thanks,

Re: How to make the state in a streaming application idempotent?

2017-01-24 Thread shyla deshpande
My streaming application stores a lot of aggregations using mapWithState. I want to know all the possible ways I can make it idempotent. Please share your views. Thanks. On Mon, Jan 23, 2017 at 5:41 PM, shyla deshpande wrote: > In a Wordcount application which
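
One common technique for this (a sketch, not the thread's agreed answer; the Agg type, the event-id field, and the dedup-by-id approach are illustrative): track already-seen event ids in the state so replayed records don't double-count.

    import org.apache.spark.streaming.State

    case class Agg(count: Long, seen: Set[String])

    // (key, event) -> running count; an event is (eventId, increment)
    def update(key: String, event: Option[(String, Long)], state: State[Agg]): (String, Long) = {
      val agg = state.getOption.getOrElse(Agg(0L, Set.empty))
      event match {
        case Some((eventId, n)) if !agg.seen(eventId) =>
          val next = Agg(agg.count + n, agg.seen + eventId)
          state.update(next)
          (key, next.count)
        case _ =>
          (key, agg.count)  // duplicate or empty batch: state unchanged
      }
    }

    // stream.mapWithState(StateSpec.function(update _))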

PrefixSpan

2017-01-24 Thread Madabhattula Rajesh Kumar
Hi, please point me to material on the internal functionality of PrefixSpan, with examples. Regards, Rajesh
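
A minimal runnable example, adapted from the MLlib documentation, may help as a starting point (sc is a SparkContext; each sequence is an array of itemsets):

    import org.apache.spark.mllib.fpm.PrefixSpan

    val sequences = sc.parallelize(Seq(
      Array(Array(1, 2), Array(3)),
      Array(Array(1), Array(3, 2), Array(1, 2)),
      Array(Array(1, 2), Array(5)),
      Array(Array(6))
    ), 2).cache()

    val model = new PrefixSpan()
      .setMinSupport(0.5)         // pattern must appear in half the sequences
      .setMaxPatternLength(5)
      .run(sequences)

    model.freqSequences.collect().foreach { fs =>
      println(fs.sequence.map(_.mkString("[", ",", "]")).mkString("<", ",", ">") + ", " + fs.freq)
    }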

printSchema showing incorrect datatype?

2017-01-24 Thread Koert Kuipers
scala> val x = Seq("a", "b").toDF("x") x: org.apache.spark.sql.DataFrame = [x: string] scala> x.as[Array[Byte]].printSchema root |-- x: string (nullable = true) scala> x.as[Array[Byte]].map(x => x).printSchema root |-- value: binary (nullable = true) why does the first schema show string

Re: help!!!----issue with spark-sql type cast from long to LongWritable

2017-01-24 Thread Takeshi Yamamuro
Hi, Could you show us the whole code to reproduce that? // maropu On Wed, Jan 25, 2017 at 12:02 AM, Deepak Sharma wrote: > Can you try writing the UDF directly in spark and register it with spark > sql or hive context ? > Or do you want to reuse the existing UDF jar for

Re: Spark streaming multiple kafka topic doesn't work at-least-once

2017-01-24 Thread Cody Koeninger
I'm confused, if you don't see any difference between the offsets the job is processing and the offsets available in kafka, then how do you know it's processing less than all of the data? On Tue, Jan 24, 2017 at 12:35 AM, Hakan İlter wrote: > I'm using DirectStream as one

Re: Failure handling

2017-01-24 Thread Cody Koeninger
Can you identify the error case and call System.exit ? It'll get retried on another executor, but as long as that one fails the same way... If you can identify the error case at the time you're doing database interaction and just prevent data being written then, that's what I typically do. On
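
A sketch of that fail-fast idea (isCorrupt and writeToDatabase are hypothetical placeholders, and rdd is assumed to be some RDD[String]):

    def isCorrupt(r: String): Boolean = r.contains("bad")  // hypothetical check
    def writeToDatabase(r: String): Unit = ()              // hypothetical writer

    rdd.foreachPartition { records =>
      records.foreach { record =>
        if (isCorrupt(record)) {
          // Kill the executor JVM before the write happens; the task is retried
          // elsewhere, and repeated failures eventually fail the whole job.
          System.exit(1)
        }
        writeToDatabase(record)
      }
    }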

Re: help!!!----issue with spark-sql type cast from long to LongWritable

2017-01-24 Thread Deepak Sharma
Can you try writing the UDF directly in Spark and registering it with the Spark SQL or Hive context? Or do you want to reuse the existing UDF jar from Hive in Spark? Thanks Deepak On Jan 24, 2017 5:29 PM, "Sirisha Cheruvu" wrote: > Hi Team, > > I am trying to keep below code in
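
A sketch of the first suggestion, assuming a Spark 1.6-style sqlContext (the function body and names are illustrative): a native Scala function avoids the Writable wrapper types entirely.

    sqlContext.udf.register("asDouble", (x: Long) => x.toDouble)
    sqlContext.sql("SELECT asDouble(some_long_col) FROM some_table").show()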

Failure handling

2017-01-24 Thread Erwan ALLAIN
Hello guys, I have a question regarding how Spark handles failures. I'm using the Kafka direct stream, Spark 2.0.2, Kafka 0.10.0.1. Here is a snippet of code: val stream = createDirectStream(...) stream.map(...).foreachRDD(doSomething) stream.map(...).foreachRDD(doSomethingElse) The execution is in

Re: why does spark web UI keeps changing its port?

2017-01-24 Thread smartzjp
To configure the Spark master web UI, set the env variable SPARK_MASTER_WEBUI_PORT=. You can run "netstat -nao | grep 4040" to check whether port 4040 is in use. ——— I am not sure why the Spark web UI keeps changing its port every time I restart a cluster. How can I make it always run on one port? I did make
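
If the goal is the driver/application UI (default 4040, which increments when the port is already taken), the spark.ui.port property can be pinned explicitly (a sketch; the master web UI is controlled by SPARK_MASTER_WEBUI_PORT as noted above):

    val conf = new org.apache.spark.SparkConf()
      .setAppName("myApp")
      .set("spark.ui.port", "4040")  // pin the application UI port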

Re: Do jobs fail because of other users of a cluster?

2017-01-24 Thread David Frese
On 24/01/2017 at 02:43, Matthew Dailey wrote: In general, Java processes fail with an OutOfMemoryError when your code and data do not fit into the memory allocated to the runtime. In Spark, that memory is controlled through the --executor-memory flag. If you are running Spark on YARN, then
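
For example (a sketch; the values, class, and jar names are illustrative, and the flag maps to the spark.executor.memory property):

    spark-submit --master yarn --executor-memory 4g --class com.example.Main app.jar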

help!!!----issue with spark-sql type cast from long to LongWritable

2017-01-24 Thread Sirisha Cheruvu
Hi Team, I am trying to keep the below code in a get method, calling that get method from another Hive UDF, and running the Hive UDF using the HiveContext.sql procedure: switch (f) { case "double": return ((DoubleWritable) obj).get(); case "bigint": return ((LongWritable) obj).get();

Weird performance on Windows laptop

2017-01-24 Thread Mendelson, Assaf
Hi, I created a simple synthetic test which does a sample calculation twice, each time with different partitioning: def time[R](block: => R): Long = { val t0 = System.currentTimeMillis(); block /* call-by-name */; val t1 = System.currentTimeMillis(); t1 - t0 } val base_df =

Can I use speed up my sparse vector operator with mkl?

2017-01-24 Thread haowei tian
I looked into the source code of MLlib, and it seems sparse vector operations are implemented with while loops rather than MKL sparse operations. Am I right? Can I speed up sparse vector operations with MKL? Thank you.

Running Hive Beeline .hql file in Spark

2017-01-24 Thread Ravi Prasad
Hi, currently we are running Hive Beeline queries as below. Beeline: beeline -u "jdbc:hive2://localhost:1/default;principal=hive/_HOST@nsroot.net" --showHeader=false --silent=true --outputformat=dsv --verbose=false -f /home/sample.hql > output_partition.txt Note: We run the
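
One way to run the same .hql file through Spark instead of beeline is the spark-sql CLI, which also accepts a file (a sketch; whether the other options carry over depends on your setup):

    spark-sql -f /home/sample.hql > output_partition.txt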

pyspark.ml Pipeline stages are corrupted under multi-threaded access - is this a bug?

2017-01-24 Thread Vinayak Joshi5
Hi, the code we're executing constructs pyspark.ml.Pipeline objects concurrently in separate Python threads. We observe that the stages fed to the pipeline object get corrupted, i.e. the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in