Re: Why spark.sql.autoBroadcastJoinThreshold not available

2017-05-15 Thread Jone Zhang
Solved it by removing the lazy identifier. 2. HiveContext.sql("cache table feature as select * from src where ..."), whose result size is only 100K. Thanks! 2017-05-15 21:26 GMT+08:00 Yong Zhang : > You should post the execution plan here, so we can provide more accurate > support. >
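
A minimal sketch of forcing the broadcast from the DataFrame side (shown with a Spark 2.x SparkSession; the thread uses HiveContext, where the same calls exist on the context object, and table/column names are placeholders):

    import org.apache.spark.sql.functions.broadcast

    // Cache the small, filtered table first, as in the thread.
    spark.sql("CACHE TABLE feature AS SELECT * FROM src WHERE ...")

    // Explicitly mark the small side for broadcast and inspect the plan;
    // spark.sql.autoBroadcastJoinThreshold (default 10 MB) only kicks in when
    // Spark can estimate the relation's size.
    val joined = spark.table("big_table").join(broadcast(spark.table("feature")), "user_id")
    joined.explain()  // expect BroadcastHashJoin in the physical plan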

Re: Application dies, Driver keeps on running

2017-05-15 Thread map reduced
Ah, interesting. I stopped the Spark context and called System.exit() from the driver with supervise ON, and that seemed to restart the app if it gets killed. On Mon, May 15, 2017 at 5:01 PM, map reduced wrote: > Hi, > I was looking in the wrong place for logs; yes, I see some errors in the logs: > >
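
A rough sketch of the pattern described above (the streaming context and exit code are assumptions; --supervise only applies when the driver is submitted to a standalone master in cluster deploy mode):

    // Submit with: spark-submit --deploy-mode cluster --supervise ...
    ssc.start()
    try {
      ssc.awaitTermination()
    } finally {
      // Stop the context and exit with a non-zero code so the master's supervise
      // flag relaunches the driver if the app dies unexpectedly.
      ssc.stop(stopSparkContext = true, stopGracefully = false)
      System.exit(1)
    }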

Re: Application dies, Driver keeps on running

2017-05-15 Thread map reduced
Hi, I was looking in the wrong place for logs; yes, I see some errors in the logs: "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages."

Re: Restful API Spark Application

2017-05-15 Thread Nipun Arora
Thanks all for your responses. I will have a look at them. Nipun On Sat, May 13, 2017 at 2:38 AM vincent gromakowski < vincent.gromakow...@gmail.com> wrote: > It's in Scala but it should be portable to Java > https://github.com/vgkowski/akka-spark-experiments > > > On 12 May 2017 at 10:54 PM,

Re: Application dies, Driver keeps on running

2017-05-15 Thread Shixiong(Ryan) Zhu
So you are using `client` mode, right? If so, the Spark cluster doesn't manage the driver for you. Did you see any error logs in the driver? On Mon, May 15, 2017 at 3:01 PM, map reduced wrote: > Hi, > > Setup: Standalone cluster with 32 workers, 1 master > I am running a long

Application dies, Driver keeps on running

2017-05-15 Thread map reduced
Hi, Setup: Standalone cluster with 32 workers, 1 master. I am running a long-running streaming Spark job (read from Kafka -> process -> send to an HTTP endpoint) which should ideally never stop. I have 2 questions: 1) I have seen at times that the Driver is still running but the application is marked as

Re: what is the difference between json format vs kafka format?

2017-05-15 Thread Michael Armbrust
For that simple count, you don't actually have to parse the JSON data at all; you can just do a count. The following code assumes you are running Spark 2.2.
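
The code itself is truncated from this digest; a hedged reconstruction of what such a count could look like with the Spark 2.2 Structured Streaming Kafka source (server and topic names are placeholders):

    // Count Kafka records without parsing the JSON payload at all.
    val counts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events")
      .load()
      .groupBy()   // single global group
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()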

Re: Spark SQL DataFrame to Kafka Topic

2017-05-15 Thread Michael Armbrust
The foreach sink from that blog post requires that you have a DataFrame with two columns in the form of a Tuple2, (String, String), whereas your DataFrame has only a single column, `payload`. You could change the KafkaSink to extend ForeachWriter[KafkaMessage] and then it would work. I'd also
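
A hedged sketch of the suggested change (the KafkaMessage case class and producer wiring are assumptions, not code from the blog post):

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.spark.sql.ForeachWriter

    case class KafkaMessage(payload: String)

    class KafkaSink(topic: String, servers: String) extends ForeachWriter[KafkaMessage] {
      var producer: KafkaProducer[String, String] = _

      override def open(partitionId: Long, version: Long): Boolean = {
        val props = new java.util.Properties()
        props.put("bootstrap.servers", servers)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        producer = new KafkaProducer[String, String](props)
        true
      }

      // Each row of the single-column Dataset[KafkaMessage] becomes one Kafka record.
      override def process(value: KafkaMessage): Unit =
        producer.send(new ProducerRecord[String, String](topic, value.payload))

      override def close(errorOrNull: Throwable): Unit =
        if (producer != null) producer.close()
    }

    // Usage (assuming a Dataset[KafkaMessage]):
    // ds.writeStream.foreach(new KafkaSink("topic", "host1:9092")).start()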

Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Edward Capriolo
Here is a similar, though not exact, way I did something close to what you did. I had two data files in different formats; the different columns needed to become different features. I wanted to feed them into Spark's:

Re: Spark streaming - TIBCO EMS

2017-05-15 Thread Piotr Smoliński
Hi Pradeep, You need to connect via the regular JMS API. Obtain the factory from JNDI or create it directly using com.tibco.tibjms.TibjmsConnectionFactory. On the classpath you need the JMS 2.0 API (jms-2.0.jar) and the EMS driver classes (tibjms.jar). Regards, Piotr On Mon, May 15, 2017 at 5:47 PM, Pradeep
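
A hedged sketch of a custom receiver along those lines (connection details are placeholders, and the TibjmsConnectionFactory call is inferred from the class name above rather than tested code):

    import javax.jms._
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class EmsReceiver(url: String, queue: String, user: String, pass: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      @volatile private var connection: Connection = _

      override def onStart(): Unit = {
        val factory = new com.tibco.tibjms.TibjmsConnectionFactory(url)
        connection = factory.createConnection(user, pass)
        val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
        val consumer = session.createConsumer(session.createQueue(queue))
        consumer.setMessageListener(new MessageListener {
          override def onMessage(msg: Message): Unit = msg match {
            case t: TextMessage => store(t.getText)   // push each text message into Spark
            case _              => // ignore non-text messages in this sketch
          }
        })
        connection.start()
      }

      override def onStop(): Unit = if (connection != null) connection.close()
    }

    // Usage: ssc.receiverStream(new EmsReceiver("tcp://ems-host:7222", "some.queue", "user", "pass"))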

Spark streaming - TIBCO EMS

2017-05-15 Thread Pradeep
What is the best way to connect to TIBCO EMS using Spark Streaming? Do we need to write custom receivers, or do any libraries already exist? Thanks, Pradeep

Re: Adding worker dynamically in standalone mode

2017-05-15 Thread Sonal Goyal
If I remember correctly, just start the worker process pointing at the current master. On Monday, May 15, 2017, Seemanto Barua wrote: > Hi > > Is it possible to add a worker dynamically to the master in standalone > mode. If so can you please share the steps on how to ? > Thanks > --

Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread ayan guha
You may consider writing all your data to a NoSQL datastore such as HBase, using user id as the key. There is a SQL solution using max and an inner case expression and finally unioning the results, but that may be expensive. On Tue, 16 May 2017 at 12:13 am, Didac Gil wrote: > Or maybe you

Adding worker dynamically in standalone mode

2017-05-15 Thread Seemanto Barua
Hi, Is it possible to add a worker dynamically to the master in standalone mode? If so, can you please share the steps on how to? Thanks

Adding worker dynamically in standalone mode

2017-05-15 Thread seemanto.barua
Hi, Is it possible to add a worker dynamically to the master in standalone mode? If so, can you please share the steps on how to? -thanks Seemanto Barua

Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Didac Gil
Or maybe you could also check using the collect_list from the SQL functions: val compacter = Data1.groupBy("UserID") .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures")) > On 15 May 2017, at 15:15, Jone Zhang wrote: > > For example >

Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Didac Gil
I guess that if your user_id field is the key, you could use the updateStateByKey function. I did not test it, but it could be something along these lines: def yourCombineFunction(input: Seq[String], accumulatedInput: Option[String]) = { val state = accumulatedInput.getOrElse("")
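
A hedged completion of that idea for a DStream keyed by user id (like the snippet above, this is untested; it also needs ssc.checkpoint(...) because updateStateByKey keeps state across batches):

    def combineFeatures(newFeatures: Seq[String], state: Option[String]): Option[String] = {
      val current = state.getOrElse("")
      // Append any newly seen features to the accumulated, space-separated string.
      Some((current +: newFeatures).filter(_.nonEmpty).mkString(" "))
    }

    // pairs: DStream[(String, String)] of (user_id, feature)
    val merged = pairs.updateStateByKey(combineFeatures _)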

Re: Why spark.sql.autoBroadcastJoinThreshold not available

2017-05-15 Thread Yong Zhang
You should post the execution plan here, so we can provide more accurate support. Since you are building your feature table with a projection ("where "), my guess is that the following JIRA (SPARK-13383) stops the broadcast join.

How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Jone Zhang
For example

Data1 (has 1 billion records)
user_id1 feature1
user_id1 feature2

Data2 (has 1 billion records)
user_id1 feature3

Data3 (has 1 billion records)
user_id1 feature4
user_id1 feature5
...
user_id1 feature100

I want to get the result as follows:
user_id1 feature1 feature2 feature3

Re: Kafka 0.8.x / 0.9.x support in structured streaming

2017-05-15 Thread David Kaczynski
I haven't done Structured Streaming in Spark 2.1 with Kafka 0.9.x, but I did do stream processing with Spark 2.0.1 and Kafka 0.10. Here's the official documentation that I used for Spark Streaming with Kafka 0.10: https://spark.apache.org/docs/2.1.0/streaming-kafka-integration.html. It looks like
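
For reference, a minimal sketch of the Kafka 0.10 direct-stream integration mentioned above (DStream API; broker, group and topic names are placeholders, and ssc is an existing StreamingContext):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "host1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.map(record => record.value).print()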

Re: ElasticSearch Spark error

2017-05-15 Thread Rohit Verma
Try switching on trace logging. Is your ES cluster running behind Docker? It's possible that your Spark cluster can't communicate using the Docker IPs. Regards Rohit On May 15, 2017, at 4:55 PM, Nick Pentreath wrote: It may be best to
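
Not suggested in this thread, but a related knob worth checking when ES is only reachable through published Docker addresses is the elasticsearch-hadoop connector's es.nodes.wan.only setting (host, port and index below are placeholders):

    // Restrict the connector to the declared nodes instead of discovering
    // (possibly unroutable) internal data-node IPs.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "10.0.1.8")
      .option("es.port", "9200")
      .option("es.nodes.wan.only", "true")
      .load("myindex/mytype")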

Re: ElasticSearch Spark error

2017-05-15 Thread Nick Pentreath
It may be best to ask on the elasticsearch-hadoop GitHub project. On Mon, 15 May 2017 at 13:19, nayan sharma wrote: > Hi All, > > *ERROR:-* > > *Caused by: org.apache.spark.util.TaskCompletionListenerException: > Connection error (check network and/or proxy settings)-

ElasticSearch Spark error

2017-05-15 Thread nayan sharma
Hi All, ERROR:- Caused by: org.apache.spark.util.TaskCompletionListenerException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[10.0.1.8*:9200, 10.0.1.**:9200, 10.0.1.***:9200]] I am getting this error while trying to show the dataframe. df.count =5190767

Test

2017-05-15 Thread nayan sharma
Test

Re: save SPark ml

2017-05-15 Thread issues solution
Hi, please, I need help with this question. 2017-05-15 10:32 GMT+02:00 issues solution : > Hi, > I am on PySpark 1.6 and I want to save my model to HDFS, like a Parquet file > > how can I do this ? > > > My model is a RandomForestClassifier performed with

RE: Spark SQL DataFrame to Kafka Topic

2017-05-15 Thread Revin Chalil
I couldn't get this working yet. If anyone has successfully used the foreach sink for Kafka with Structured Streaming, please share. Thanks. From: Revin Chalil [mailto:rcha...@expedia.com] Sent: Sunday, May 14, 2017 9:32 AM To: Tathagata Das ; mich...@databricks.com Cc:

Kafka 0.8.x / 0.9.x support in structured streaming

2017-05-15 Thread Swapnil Chougule
Hello, I am new to Structured Streaming. I wanted to learn whether there is support for Kafka 0.8.x or Kafka 0.9.x in Structured Streaming. My Kafka source is version 0.9.x and I want to have a Structured Streaming solution on top of it. I checked the documentation for Spark release 2.1.0 but didn't get exact

Any solution for this?

2017-05-15 Thread Aakash Basu
Hi all, Any solution for this issue - http://stackoverflow.com/q/43921392/7998705 Thanks, Aakash.

save SPark ml

2017-05-15 Thread issues solution
Hi, I am on PySpark 1.6 and I want to save my model to HDFS, like a Parquet file. How can I do this? My model is a RandomForestClassifier trained with cross-validation, like this: rf_csv2 = CrossValidator(). How can I save it? Thanks in advance.

spark on yarn cluster model can't use saveAsTable ?

2017-05-15 Thread lk_spark
hi, all: I have a test under Spark 2.1.0 which reads txt files as a DataFrame and saves to Hive. When I submit the app jar in yarn-client mode it works well, but if I submit in cluster mode, it will not create the table or write data, and I didn't find any error log... can anybody