unsubscribe

2018-02-27 Thread 学生张洪斌
Student Zhang Hongbin | Email: hongbinzh...@163.com | Signature customized by NetEase Mail Master

Re: Data loss in spark job

2018-02-27 Thread yncxcw
Hi, please check whether your OS supports memory overcommit. I suspect this is caused by your OS banning memory overcommitment and killing the process when overcommitment is detected (the Spark executor is chosen to be killed). This is why you receive SIGTERM, and the executor failed with the

Re: Data loss in spark job

2018-02-27 Thread Faraz Mateen
Hi, I saw the following error message in the executor logs: *Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000662f0, 520093696, 0) failed; error='Cannot allocate memory' (errno=12)* By increasing the RAM of my nodes to 40 GB each, I was able to get rid of the RPC connection

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
I see! I get the logic now! On Tue, Feb 27, 2018 at 5:55 PM, naresh Goud wrote: > Hi Kant, > > TD's explanation makes a lot of sense. Refer to this Stack Overflow answer, where it > was explained with program output. Hope this helps. > >

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread naresh Goud
Hi Kant, TD's explanation makes a lot of sense. Refer to this Stack Overflow answer, where it was explained with program output. Hope this helps. https://stackoverflow.com/questions/45579100/structured-streaming-watermark-vs-exactly-once-semantics Thanks, Naresh www.linkedin.com/in/naresh-dulam

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread Tathagata Das
Let me answer the original question directly, that is, how do we determine that an event is late. We simply track the maximum event time the engine has seen in the data it has processed till now. And any data that has event time less than the max is basically "late" (as it is out-of-order). Now,
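TD's description above can be sketched in a few lines of plain Scala. This is an illustrative model of the bookkeeping only, not Spark's actual implementation; the names (`Event`, `isLate`, `maxEventTimeSeen`) are hypothetical:

```scala
// Simplified model of late-event detection: the engine tracks the maximum
// event time seen so far, and any incoming event whose event time is below
// that maximum is considered late (out-of-order).
case class Event(id: String, eventTime: Long)

var maxEventTimeSeen = Long.MinValue

def isLate(e: Event): Boolean = {
  val late = e.eventTime < maxEventTimeSeen          // compared to the max event time, not processing time
  maxEventTimeSeen = math.max(maxEventTimeSeen, e.eventTime)
  late
}

val events = Seq(Event("a", 100L), Event("b", 200L), Event("c", 150L))
val flags = events.map(isLate)
// "c" arrives with event time 150, below the max seen so far (200), so it is flagged late
```

In real Structured Streaming, `withWatermark` additionally bounds how far behind the max event time an event may be before it is dropped from stateful aggregations.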

Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-27 Thread Tathagata Das
Unfortunately, exposing Kafka headers is not yet supported in Structured Streaming. The community is more than welcome to add support for it :) On Tue, Feb 27, 2018 at 2:51 PM, Karthik Jayaraman wrote: > Hi all, > > I am using Spark 2.2.1 Structured Streaming to read

[Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-27 Thread Karthik Jayaraman
Hi all, I am using Spark 2.2.1 Structured Streaming to read messages from Kafka. I would like to know how to access the Kafka headers programmatically? Since Kafka message header support was introduced in Kafka 0.11 (https://issues.apache.org/jira/browse/KAFKA-4208

Re: Spark MLlib: Should I call .cache before fitting a model?

2018-02-27 Thread Nick Pentreath
Currently, fit for many (most I think) models will cache the input data. For LogisticRegression this is definitely the case, so you won't get any benefit from caching it yourself. On Tue, 27 Feb 2018 at 21:25 Gevorg Hari wrote: > Imagine that I am training a Spark MLlib

Re: SizeEstimator

2018-02-27 Thread David Capwell
Thanks for the reply and sorry for my delayed response, had to go find the profile data to lookup the class again. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala That class extends SizeEstimator and has a field "map" which

Spark MLlib: Should I call .cache before fitting a model?

2018-02-27 Thread Gevorg Hari
Imagine that I am training a Spark MLlib model as follows:

val traingData = loadTrainingData(...)
val logisticRegression = new LogisticRegression()
traingData.cache
val logisticRegressionModel = logisticRegression.fit(trainingData)

Does the call traingData.cache improve performances at training

Suppressing output from Apache Ivy (?) when calling spark-submit with --packages

2018-02-27 Thread Nicholas Chammas
I’m not sure whether this is something controllable via Spark, but when you call spark-submit with --packages you get a lot of output. Is there any way to suppress it? Does it come from Apache Ivy? I posted more details about what I’m seeing on Stack Overflow

Re: CATALYST rule join

2018-02-27 Thread Yong Zhang
I don't fully understand your question, but maybe you want to check out this JIRA: https://issues.apache.org/jira/browse/SPARK-17728, especially the comments area. There is some discussion about why a UDF could be executed multiple times by Spark. Yong From:
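As a rough illustration of what that JIRA discusses (a plain-Scala analogy, not actual Catalyst behavior; the function name is made up): when the optimizer inlines a derived column into every expression that references it, an expensive or non-deterministic UDF can run once per reference rather than once per row.

```scala
// Count how many times the "UDF" body actually runs.
var calls = 0
def expensiveUdf(x: Int): Int = { calls += 1; x * 2 }

// Conceptually like: df.withColumn("y", expensiveUdf($"x")).select($"y", $"y" + 1)
// After the optimizer inlines "y", each reference re-evaluates the UDF:
val x = 5
val y1 = expensiveUdf(x)       // first reference to "y"
val y2 = expensiveUdf(x) + 1   // second reference to "y" after inlining
// The UDF body ran twice for a single input row.
```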

Unsubscribe

2018-02-27 Thread purna pradeep
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Returns Null when reading data from XML

2018-02-27 Thread Sateesh Karuturi
I am trying to parse data from an XML file through Spark using the Databricks library. Here is my code:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions
import java.text.Format
import

How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread kant kodali
I read through the Spark Structured Streaming documentation and I wonder how Spark Structured Streaming determines that an event has arrived late? Does it compare the event time with the processing time? Taking the above

Re: Spark on K8s - using files fetched by init-container?

2018-02-27 Thread Felix Cheung
Yes you were pointing to HDFS on a loopback address... From: Jenna Hoole Sent: Monday, February 26, 2018 1:11:35 PM To: Yinan Li; user@spark.apache.org Subject: Re: Spark on K8s - using files fetched by init-container? Oh, duh. I

Re: CATALYST rule join

2018-02-27 Thread tan shai
Hi, I need to write a rule to customize the join function using the Spark Catalyst optimizer. The objective is to duplicate the second dataset using this process: - Execute a UDF on the column called x; this UDF returns an array - Execute an explode function on the new column Using SQL terms, my
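The duplication the process above describes behaves like a flatMap; here is a plain-Scala sketch of the intended semantics (the UDF, the case class, and the column names are hypothetical, not part of the original question):

```scala
// Each input row is duplicated once per element of the array the UDF returns,
// which is what applying explode to an array column does.
case class InputRow(id: Int, x: String)

def myUdf(x: String): Seq[String] = x.split(",").toSeq  // hypothetical UDF returning an array

val rows = Seq(InputRow(1, "a,b"), InputRow(2, "c"))
val exploded = rows.flatMap(r => myUdf(r.x).map(v => (r.id, v)))
// row 1 is duplicated into two output rows, one per array element
```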