An exception causes different behavior

2018-04-16 Thread big data
Hi all, we have two environments for a Spark Streaming job, which consumes a Kafka topic to do calculations. Now in one environment, the Spark Streaming job consumed non-standard data from Kafka and threw an exception (not caught in the code), and then the streaming job went down. But in another environment,

Re: can we use mapGroupsWithState in raw sql?

2018-04-16 Thread Tathagata Das
Unfortunately no. Honestly, it does not make sense: for type-aware operations like map, mapGroups, etc., you have to provide an actual JVM function. That does not fit in with the SQL language structure. On Mon, Apr 16, 2018 at 7:34 PM, kant kodali wrote: > Hi All, > > can

unsubscribe

2018-04-16 Thread 韩盼
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

pyspark execution

2018-04-16 Thread anudeep
Hi All, I have a Python file which I am executing directly with the spark-submit command. Inside the Python file, I have SQL written using a HiveContext. I created a generic variable for the database name inside the SQL. The problem is: how can I pass the value for this variable dynamically, just as we
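One common way to do this (a minimal sketch; the script, table, and argument names are hypothetical, not from the original post) is to pass the database name as an application argument to spark-submit and read it with sys.argv:

```python
import sys

def build_query(db_name):
    """Substitute the database name into the SQL text.
    The table name 'events' is a placeholder."""
    return "SELECT * FROM {}.events".format(db_name)

if __name__ == "__main__":
    # spark-submit my_job.py prod_db  ->  sys.argv[1] == "prod_db"
    db = sys.argv[1] if len(sys.argv) > 1 else "default"
    query = build_query(db)
    print(query)
    # In the real job you would then run: hive_context.sql(query)
```

Anything after the script name on the spark-submit command line (e.g. `spark-submit my_job.py prod_db`) lands in sys.argv, so the same script can target a different database per environment without code changes.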

can we use mapGroupsWithState in raw sql?

2018-04-16 Thread kant kodali
Hi All, can we use mapGroupsWithState in raw SQL? or is it in the roadmap? Thanks!

Re: Warning from user@spark.apache.org

2018-04-16 Thread Prasad Velagaleti
Hello, I got a message saying that messages sent to me (my Gmail ID) from the mailing list bounced. I wonder why? thanks, Prasad. On Mon, Apr 16, 2018 at 6:16 PM, wrote: > Hi! This is the ezmlm program. I'm managing the > user@spark.apache.org mailing list. > >

[Spark 2.x Core] Job writing out an extra empty part-0000* file

2018-04-16 Thread klrmowse
The Spark job succeeds (and with correct output), except there is always an extra part-* file, and it is empty... I even set the number of partitions to only 2 via spark-submit, but a 3rd, empty part-file still shows up. Why does it do that? How can I fix it? Thank you -- Sent
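A likely cause, shown here as a plain-Python illustration (not Spark code): each shuffle partition is written out as one part-* file, and a hash partitioner can leave a partition with no rows, which still produces a file, just an empty one:

```python
def partition_rows(rows, num_partitions):
    """Assign each (key, value) row to a partition by hashing its key,
    roughly what a shuffle's hash partitioner does."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

# Only two distinct keys spread over 3 partitions: by the pigeonhole
# principle at least one partition gets no rows -> one empty part-* file.
rows = [("a", 1), ("a", 2), ("b", 3)]
parts = partition_rows(rows, 3)
print([i for i, p in enumerate(parts) if not p])
```

If this is the cause, calling coalesce(n) before the write, or repartitioning on a column with enough distinct values, is the usual remedy; this is a common fix rather than a guaranteed one.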

Re: Driver aborts on Mesos when unable to connect to one of external shuffle services

2018-04-16 Thread igor.berman
Hi Szuromi, We manage the external shuffle service with Marathon, not manually. Sometimes, though, e.g. when adding a new node to the cluster, there is some delay between Mesos scheduling tasks on a slave and Marathon scheduling the external shuffle service task on that node. -- Sent from:

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
Hi Gerard, "If your actual source is Kafka, the original solution of using `spark.streams.awaitAnyTermination` should solve the problem." I tried literally everything, but nothing worked out. 1) Tried nc from two different ports for two different streams; still nothing worked. 2) Tried same using

Re: PySpark ML: Get best set of parameters from TrainValidationSplit

2018-04-16 Thread Bryan Cutler
Hi Aakash, First you will want to get the random forest model stage from the best pipeline model result, for example if RF is the first stage: rfModel = model.bestModel.stages[0] Then you can check the values of the params you tuned like this: rfModel.getNumTrees On Mon, Apr 16, 2018 at

Structured streaming: Tried to fetch $offset but the returned record offset was ${record.offset}"

2018-04-16 Thread ARAVIND SETHURATHNAM
Hi, We have several Structured Streaming jobs (Spark version 2.2.0) consuming from Kafka and writing to S3. They were running fine for a month; since yesterday a few jobs started failing, and I see the below exception in the failed jobs' log: ```Tried to fetch 473151075 but the returned record

Curious case of Spark SQL 2.3 - number of stages different for the same query ever?

2018-04-16 Thread Jacek Laskowski
Hi, I've got a case where the same structured query (it's a union) gives 1 stage for one run and 5 stages for another. I could not find any pattern yet (and it's hard to reproduce due to the volume and the application), but I'm pretty certain that it's *never* possible that Spark 2.3 could come up

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Gerard Maas
Aakash, There are two issues here. The issue with the code on the first question is that the first query blocks and the code for the second does not get executed. Panagiotis pointed this out correctly. In the updated code, the issue is related to netcat (nc) and the way structured streaming

Re: Spark-ML : Streaming library for Factorization Machine (FM/FFM)

2018-04-16 Thread Maximilien DEFOURNE
Hi, Unfortunately no. I just used this lib for raw FM and FFM. I thought it could be a good baseline for your need. Regards Maximilien On 16/04/18 15:43, Sundeep Kumar Mehta wrote: Hi Maximilien, Thanks for your response, Did you convert this repo into DStream for continuous/incremental

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Lalwani, Jayesh
You could have a really large window. From: Aakash Basu Date: Monday, April 16, 2018 at 10:56 AM To: "Lalwani, Jayesh" Cc: spark receiver , Panagiotis Garefalakis , user

Error: NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT while running a Spark-Hive Job

2018-04-16 Thread Rishikesh Gawade
Hello there, I am using *spark-2.3.0* compiled using the following Maven Command: *mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean install.* I have configured it to run with *Hive v2.3.3*. Also, all the Hive related jars (*v1.2.1*) in the Spark's JAR folder have been replaced by all the

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
If I use timestamp-based windowing, then my average will not be a global average but grouped by timestamp, which is not my requirement. I want to recalculate the avg of the entire column every time a new row(s) comes in, and divide the other column by the updated avg. Let me know, in case you or

Re: ERROR: Hive on Spark

2018-04-16 Thread naresh Goud
Change your table name in the query to spam.spamdataset instead of spamdataset. On Sun, Apr 15, 2018 at 2:12 PM Rishikesh Gawade wrote: > Hello there. I am a newbie in the world of Spark. I have been working on a > Spark Project using Java. > I have configured Hive and

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Lalwani, Jayesh
You could do it if you had a timestamp in your data. You can use windowed operations to divide a value by its own average over a window. However, in Structured Streaming, you can only window by timestamp columns. You cannot do window aggregations on integers. From: Aakash Basu
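To make the timestamp-window idea concrete, here is a plain-Python sketch (hypothetical data, not the Spark API) of a fixed-window average keyed by a timestamp:

```python
from collections import defaultdict

def windowed_avg(events, window_secs):
    """Group (timestamp, value) events into fixed-size windows and average
    each window; what a timestamp window aggregation computes in spirit."""
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % window_secs)  # floor ts to its window
        buckets[window_start].append(value)
    return {w: sum(vs) / len(vs) for w, vs in buckets.items()}

events = [(0, 10.0), (5, 20.0), (12, 30.0)]
print(windowed_avg(events, 10))  # {0: 15.0, 10: 30.0}
```

Spark's window() function does the equivalent bucketing, but only on a real timestamp column, which is why a plain integer column cannot be windowed directly; casting an epoch-seconds integer to a timestamp first is the usual workaround.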

Re: Spark-ML : Streaming library for Factorization Machine (FM/FFM)

2018-04-16 Thread Sundeep Kumar Mehta
Hi Maximilien, Thanks for your response, Did you convert this repo into DStream for continuous/incremental training ? Regards Sundeep On Mon, Apr 16, 2018 at 4:17 PM, Maximilien DEFOURNE < maximilien.defou...@s4m.io> wrote: > Hi, > > I used this repo for FM/FFM : https://github.com/Intel- >

Re: Spark-ML : Streaming library for Factorization Machine (FM/FFM)

2018-04-16 Thread Maximilien DEFOURNE
Hi, I used this repo for FM/FFM : https://github.com/Intel-bigdata/imllib-spark Regards Maximilien DEFOURNE On 15/04/18 05:14, Sundeep Kumar Mehta wrote: Hi All, Any library/ github project to use factorization machine or field aware factorization machine via online learning for

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
Hey Jayesh and Others, Is there then, any other way to come to a solution for this use-case? Thanks, Aakash. On Mon, Apr 16, 2018 at 8:11 AM, Lalwani, Jayesh < jayesh.lalw...@capitalone.com> wrote: > Note that what you are trying to do here is join a streaming data frame > with an aggregated

Re: Structured Streaming on Kubernetes

2018-04-16 Thread Krishna Kalyan
Thank you so much TD, Matt, Anirudh and Oz, really appreciate this. On Fri, Apr 13, 2018 at 9:54 PM, Oz Ben-Ami wrote: > I can confirm that Structured Streaming works on Kubernetes, though we're > not quite on production with that yet. Issues we're looking at are: > -