Re: Why spark.sql.autoBroadcastJoinThreshold not available

2017-05-15 Thread Jone Zhang
Solved it by removing the lazy keyword, so step 2 becomes:
2. HiveContext.sql("cache table feature as select * from src where ...")
which has a result size of only 100K

Thanks!

2017-05-15 21:26 GMT+08:00 Yong Zhang :

> You should post the execution plan here, so we can provide more accurate
> support.
>
>
> Since you are building your feature table with a projection ("where
> ..."), my guess is that the following JIRA (SPARK-13383) stops the broadcast
> join. This is fixed in Spark 2.x. Can you try it on Spark 2.0?
>
> Yong
>
> --
> *From:* Jone Zhang 
> *Sent:* Wednesday, May 10, 2017 7:10 AM
> *To:* user @spark/'user @spark'/spark users/user@spark
> *Subject:* Why spark.sql.autoBroadcastJoinThreshold not available
>
> Now i use spark1.6.0 in java
> I wish the following sql to be executed in BroadcastJoin way
> *select * from sample join feature*
>
> This is my step
> 1.set spark.sql.autoBroadcastJoinThreshold=100M
> 2.HiveContext.sql("cache lazy table feature as "select * from src where
> ...) which result size is only 100K
> 3.HiveContext.sql("select * from sample join feature")
> Why the join is SortMergeJoin?
>
> Grateful for any idea!
> Thanks.
>


Re: Application dies, Driver keeps on running

2017-05-15 Thread map reduced
Ah, interesting. I stopped the Spark context and called System.exit() from the
driver with supervise ON, and that seemed to restart the app if it gets killed.
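
For reference, a minimal sketch of that workaround (the idle-timeout trigger is an
assumption for illustration; ssc is the StreamingContext): detect that no batches are
completing any more, stop the context, and exit non-zero so supervise relaunches the driver.

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

val lastBatchCompletedAt = new AtomicLong(System.currentTimeMillis())

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit =
    lastBatchCompletedAt.set(System.currentTimeMillis())
})

// Watchdog thread: if no batch has completed for 10 minutes, assume the app is gone.
new Thread(new Runnable {
  override def run(): Unit = while (true) {
    Thread.sleep(60 * 1000)
    if (System.currentTimeMillis() - lastBatchCompletedAt.get() > 10 * 60 * 1000) {
      ssc.stop(stopSparkContext = true, stopGracefully = false)
      System.exit(1)   // non-zero exit; with supervise the driver gets relaunched
    }
  }
}).start()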

On Mon, May 15, 2017 at 5:01 PM, map reduced  wrote:

> Hi,
> I was looking at incorrect place for logs, yes I see some errors in logs:
>
> "Remote RPC client disassociated. Likely due to containers exceeding
> thresholds, or network issues. Check driver logs for WARN messages."
>
> logger="org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend",message="Disconnected
> from Spark cluster! Waiting for reconnection..."
>
> So what is best way to deal with this situation? I would rather have
> driver killed along with it, is there a way to achieve that?
>
>
> On Mon, May 15, 2017 at 3:05 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> So you are using `client` mode. Right? If so, Spark cluster doesn't
>> manage the driver for you. Did you see any error logs in driver?
>>
>> On Mon, May 15, 2017 at 3:01 PM, map reduced  wrote:
>>
>>> Hi,
>>>
>>> Setup: Standalone cluster with 32 workers, 1 master
>>> I am running a long running streaming spark job (read from Kafka ->
>>> process -> send to Http endpoint) which should ideally never stop.
>>>
>>> I have 2 questions:
>>> 1) I have seen some times Driver is still running but application marked
>>> as *Finished*. *Any idea why this happens or any way to debug this?*
>>> Sometimes after running for say 2-3 days (or 4-5 days - random
>>> timeframe) this issue arises, not sure what is causing it. Nothing in logs
>>> suggests failures or exceptions
>>>
>>> 2) Is there a way for Driver to kill itself instead of keeping on
>>> running without any application to drive?
>>>
>>> Thanks,
>>> KP
>>>
>>
>>
>


Re: Application dies, Driver keeps on running

2017-05-15 Thread map reduced
Hi,
I was looking in the wrong place for the logs; yes, I do see some errors:

"Remote RPC client disassociated. Likely due to containers exceeding
thresholds, or network issues. Check driver logs for WARN messages."

logger="org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend",message="Disconnected
from Spark cluster! Waiting for reconnection..."

So what is the best way to deal with this situation? I would rather have the
driver killed along with it; is there a way to achieve that?


On Mon, May 15, 2017 at 3:05 PM, Shixiong(Ryan) Zhu  wrote:

> So you are using `client` mode. Right? If so, Spark cluster doesn't manage
> the driver for you. Did you see any error logs in driver?
>
> On Mon, May 15, 2017 at 3:01 PM, map reduced  wrote:
>
>> Hi,
>>
>> Setup: Standalone cluster with 32 workers, 1 master
>> I am running a long running streaming spark job (read from Kafka ->
>> process -> send to Http endpoint) which should ideally never stop.
>>
>> I have 2 questions:
>> 1) I have seen some times Driver is still running but application marked
>> as *Finished*. *Any idea why this happens or any way to debug this?*
>> Sometimes after running for say 2-3 days (or 4-5 days - random timeframe)
>> this issue arises, not sure what is causing it. Nothing in logs suggests
>> failures or exceptions
>>
>> 2) Is there a way for Driver to kill itself instead of keeping on running
>> without any application to drive?
>>
>> Thanks,
>> KP
>>
>
>


Re: Restful API Spark Application

2017-05-15 Thread Nipun Arora
Thanks all for your response. I will have a look at them.

Nipun

On Sat, May 13, 2017 at 2:38 AM vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> It's in Scala, but it should be portable to Java.
> https://github.com/vgkowski/akka-spark-experiments
>
>
> On 12 May 2017 at 10:54 PM, "Василец Дмитрий"  wrote:
>
> and livy
> https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/
>
> On Fri, May 12, 2017 at 10:51 PM, Sam Elamin 
> wrote:
> > Hi Nipun
> >
> > Have you checked out the job server?
> >
> > https://github.com/spark-jobserver/spark-jobserver
> >
> > Regards
> > Sam
> > On Fri, 12 May 2017 at 21:00, Nipun Arora 
> wrote:
> >>
> >> Hi,
> >>
> >> We have written a Java Spark application (it primarily uses Spark SQL). We
> >> want to expand this to provide our application "as a service". For this, we
> >> are trying to write a REST API. While a simple REST API can easily be made,
> >> and I can get Spark to run through the launcher, I wonder how the Spark
> >> context can be used by service requests to process data.
> >>
> >> Are there any simple Java examples to illustrate this use case? I am sure
> >> people have faced this before.
> >>
> >>
> >> Thanks
> >> Nipun
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: Application dies, Driver keeps on running

2017-05-15 Thread Shixiong(Ryan) Zhu
So you are using `client` mode. Right? If so, Spark cluster doesn't manage
the driver for you. Did you see any error logs in driver?

On Mon, May 15, 2017 at 3:01 PM, map reduced  wrote:

> Hi,
>
> Setup: Standalone cluster with 32 workers, 1 master
> I am running a long running streaming spark job (read from Kafka ->
> process -> send to Http endpoint) which should ideally never stop.
>
> I have 2 questions:
> 1) I have seen some times Driver is still running but application marked
> as *Finished*. *Any idea why this happens or any way to debug this?*
> Sometimes after running for say 2-3 days (or 4-5 days - random timeframe)
> this issue arises, not sure what is causing it. Nothing in logs suggests
> failures or exceptions
>
> 2) Is there a way for Driver to kill itself instead of keeping on running
> without any application to drive?
>
> Thanks,
> KP
>


Application dies, Driver keeps on running

2017-05-15 Thread map reduced
Hi,

Setup: Standalone cluster with 32 workers, 1 master
I am running a long running streaming spark job (read from Kafka -> process
-> send to Http endpoint) which should ideally never stop.

I have 2 questions:
1) I have seen that sometimes the Driver is still running but the application is
marked as *Finished*. *Any idea why this happens, or any way to debug it?*
Sometimes this issue arises after running for, say, 2-3 days (or 4-5 days - a
random timeframe); I am not sure what is causing it. Nothing in the logs
suggests failures or exceptions.

2) Is there a way for the Driver to kill itself instead of keeping on running
without any application to drive?

Thanks,
KP


Re: what is the difference between json format vs kafka format?

2017-05-15 Thread Michael Armbrust
For that simple count, you don't actually have to even parse the JSON
data.  You can just do a count.  The following code assumes you are
running Spark 2.2.

df.groupBy().count().writeStream.outputMode("complete").format("console").start()

If you want to do something more complicated, you will need to specify the
schema, at least for the columns that you want Spark to understand.  We need
to know the names and types of the columns so that we know how to extract
them from the JSON.  It's okay, however, to omit columns that you don't care
about.

df.select(from_json($"value".cast("string"), "name STRING, age INT") as 'message)
  .groupBy($"message.name").agg(avg($"age"))

If you are not sure what the JSON looks like, you can ask Spark to infer it
based on a sample.

spark.read.json(spark.read.format("kafka").option(...).load().limit(1000).select($"value".as[String])).printSchema()
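
For completeness, a minimal end-to-end sketch stitching those pieces together (broker,
topic and schema are placeholders, and the spark-sql-kafka-0-10 package is assumed to be
on the classpath):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("name", StringType).add("age", IntegerType)   // placeholder schema

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")   // placeholder brokers
  .option("subscribe", "events")                     // placeholder topic
  .load()

// Parse the Kafka value as JSON, keeping only the declared columns.
val parsed = df.select(from_json($"value".cast("string"), schema).as("message"))

// Running aggregation; a streaming aggregation needs the "complete" (or "update") output mode.
val query = parsed
  .groupBy($"message.name")
  .agg(avg($"message.age"))
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()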

On Sat, May 13, 2017 at 8:48 PM, kant kodali  wrote:

> Hi,
>
> Here is a little bit of background.
>
> I've been using stateless streaming API's for a while like using
> JavaDstream and so on and they worked well. It has come to a point where
> we need to do realtime stateful streaming based on event time and other
> things but for now I am just trying to get used to structured streaming
> API's by running simple aggregations like count(). so naturally, I tend to
> think that the concepts that are behind JavaDstream would also apply in
> StructuredStreaming as well (Please correct me if I am wrong). For example,
> I can do the following with Dstreams without writing to any text files.
>
> // dstreams version
> jsonDStream.foreachRDD{rdd =>
> val jsonDF = spark.read.json(rdd)
> jsonDF.createOrReplaceTempView("dataframe")
> }
> javaStreamingContext.start()
> select count(*) from dataframe;
>
> or I can also do javaDstream.count() such that at every batch interval it
> spits out the count.
>
> I was expecting something like this with Structured Streaming as well. so
> I thought of doing something like below to mimic the above version. It
> looks very similar to me so I am not sure what you mean by
>
> "For streaming queries, you have to let it run in the background
> continuously by starting it using writeStreamstart()." Dstreams are
> also unbounded right? except at every batch interval the count() action
> gets invoked so why I can't call .count() on stream of dataframes in
> structured streaming (especially when it is possible with stream of RDD's
> like Dstreams)? I guess I probably have some misconception somewhere.
>
> //structured streaming version
> val ds = datasetRows.selectExpr("CAST(value AS STRING)").toJSON
> val foo = ds.select("*").count()
> val query = foo.writeStream.outputMode("complete").format("console").start();
> query.awaitTermination()
>
> How should I change this code to do a simple count in structured
> streaming? you can assume there is schema ahead of time if thats really a
> problem.
>
> since we want to do real time structured streaming we would like to avoid
> any extra level of indirections such as writing to text files and so on but
> if I really have to do a workaround to infer schema like writing to text
> files I rather try and figure out how I can get schemas ahead of time which
> is not ideal for our case but I can try to survive.
>
> Thanks a lot!
>
> On Sat, May 13, 2017 at 7:11 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> You can't do ".count()" directly on streaming DataFrames. This is because
>> "count" is an Action (remember RDD actions) that executes and returns a
>> result immediately, which can be done only when the data is bounded (e.g.
>> batch/interactive queries). For streaming queries, you have to let it run
>> in the background continuously by starting it using writeStream...start().
>>
>> And for streaming queries, you have specify schema from before so that at
>> runtime it explicitly fails when schema is incorrectly changed.
>>
>> In your case, what you can do is the following.
>> - Run a streaming query that converts the binary data from Kafka to
>> string, and saves it as text files (i.e. *writeStream.format("text").start("path")*).
>>
>> - Then run a batch query on the saved text files with format json (i.e.
>> *read.format("json").load(path)*) with schema inference, and get the schema
>> from the Dataset created (i.e. Dataset.schema).
>>
>> - Then you can run the real streaming query with from_json and the learnt
>> schema.
>>
>> Make sure that the generated text file have sufficient data to infer the
>> full schema. Let me know if this works for you.
>>
>> TD
>>
>>
>> On Sat, May 13, 2017 at 6:04 PM, kant kodali  wrote:
>>
>>> Hi!
>>>
>>> Thanks for the response. Looks like from_json requires schema ahead of
>>> time. Is there any function I can use to infer schema from the json

Re: Spark SQL DataFrame to Kafka Topic

2017-05-15 Thread Michael Armbrust
The foreach sink from that blog post requires that you have a DataFrame
with two columns in the form of a Tuple2, (String, String), whereas your
dataframe has only a single column, `payload`.  You could change the
KafkaSink to extend ForeachWriter[KafkaMessage] and then it would work.

I'd also suggest you just try the native Kafka sink that is part of Spark 2.2.
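
Following the first suggestion, a rough sketch of a KafkaSink that extends
ForeachWriter[KafkaMessage] might look like the following (the KafkaMessage case class
and the broker/topic parameters come from the thread; producer settings are illustrative):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

class KafkaSink(topic: String, brokers: String) extends ForeachWriter[KafkaMessage] {
  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true   // proceed with this partition/epoch
  }

  // Send each row's payload string to the target topic.
  override def process(value: KafkaMessage): Unit =
    producer.send(new ProducerRecord[String, String](topic, value.payload))

  override def close(errorCause: Throwable): Unit =
    if (producer != null) producer.close()
}

// Usage, matching the snippet below:
// val query = data.writeStream.foreach(new KafkaSink("kafka-topic", KafkaBroker)).outputMode("update").start()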

On Sun, May 14, 2017 at 9:31 AM, Revin Chalil  wrote:

> Hi TD / Michael,
>
>
>
> I am trying to use the foreach sink to write to Kafka and followed this post
> from the Databricks blog by Sunil Sitaula. I get the below with
> DF.writeStream.foreach(writer).outputMode("update").start() when using a
> simple DF:
>
> Type mismatch, expected: foreachWriter[Row], actual: KafkaSink
>
> Cannot resolve reference foreach with such signature
>
>
>
> Below is the snippet
>
> val data = session
>   .readStream
>   .format("kafka")
>   .option("kafka.bootstrap.servers", KafkaBroker)
>   .option("subscribe", InTopic)
>   .load()
>   .select($"value".as[Array[Byte]])
>   .flatMap(d => {
>     var events = AvroHelper.readEvents(d)
>     events.map((event: HdfsEvent) => {
>       var payload = EventPayloadParser.read(event.getPayload)
>       new KafkaMessage(payload)
>     })
>   })
>
>
>
> case class KafkaMessage(
>   payload: String)
>
>
>
> This is where I use the foreach
>
> val writer = new KafkaSink("kafka-topic", KafkaBroker)
> val query = data.writeStream.foreach(writer).outputMode("update").start()
>
>
>
> In this case, it shows –
>
> Type mismatch, expected: foreachWriter[Main.KafkaMesage], actual: 
> Main.KafkaSink
>
> Cannot resolve reference foreach with such signature
>
>
>
> Any help is much appreciated. Thank you.
>
>
>
>
>
> *From:* Tathagata Das [mailto:tathagata.das1...@gmail.com]
> *Sent:* Friday, January 13, 2017 3:31 PM
> *To:* Koert Kuipers 
> *Cc:* Peyman Mohajerian ; Senthil Kumar <
> senthilec...@gmail.com>; User ;
> senthilec...@apache.org
> *Subject:* Re: Spark SQL DataFrame to Kafka Topic
>
>
>
> Structured Streaming has a foreach sink, where you can essentially do what
> you want with your data. Its easy to create a Kafka producer, and write the
> data out to kafka.
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach
>
>
>
> On Fri, Jan 13, 2017 at 8:28 AM, Koert Kuipers  wrote:
>
> how do you do this with structured streaming? i see no mention of writing
> to kafka
>
>
>
> On Fri, Jan 13, 2017 at 10:30 AM, Peyman Mohajerian 
> wrote:
>
> Yes, it is called Structured Streaming:
> https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html
>
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>
>
>
> On Fri, Jan 13, 2017 at 3:32 AM, Senthil Kumar 
> wrote:
>
> Hi Team ,
>
>
>
>  Sorry if this question already asked in this forum..
>
>
>
> Can we ingest data to Apache Kafka Topic from Spark SQL DataFrame ??
>
>
>
> Here is my Code which Reads Parquet File :
>
>
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
>
> val df = sqlContext.read.parquet("/temp/*.parquet")
>
> df.registerTempTable("beacons")
>
>
>
> I want to directly ingest df DataFrame to Kafka ! Is there any way to
> achieve this ??
>
>
>
> Cheers,
>
> Senthil
>


Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Edward Capriolo
Here is a similar, though not exact, way I did something like what you want. I
had two data files in different formats, and the different columns needed to
become different features. I wanted to feed them into Spark's FP-Growth:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm

This only works because I have a few named features, and they become fields
in the model object AntecedentUnion. This would be a crappy solution for a
large sparse matrix.

Also my Scala code is crap too, so there is probably a better way to do this!


// Normalize each source into (tdid -> AntecedentUnion) groups, tagging every record with its source type.
val b = targ.as[TargetingAntecedent]
val b1 = b.map(c => (c.tdid, c)).rdd.groupByKey()
val bgen = b1.map(f =>
  (f._1, f._2.map(x => AntecedentUnion("targeting", "", x.targetingdataid, "", ""))))

val c = imp.as[ImpressionAntecedent]
val c1 = c.map(k => (k.tdid, k)).rdd.groupByKey()
val cgen = c1.map(f =>
  (f._1, f._2.map(x => AntecedentUnion("impression", "", "", x.campaignid, x.adgroupid)).toSet.toIterable))

// The same two steps, factored into helper methods (in one scope these vals would need
// different names from the ones above to compile):
val bgen = TargetingUtil.targetingAntecedent(sparkSession, sqlContext, targ)
val cgen = TargetingUtil.impressionAntecedent(sparkSession, sqlContext, imp)

// Join the per-user groups from both sources, concatenate them, and keep only the merged arrays.
val joined = bgen.join(cgen)
val merged = joined.map(f => (f._1, f._2._1 ++: f._2._2))
val fullResults: RDD[Array[AntecedentUnion]] =
  merged.map(f => f._2).map(_.toArray[audacity.AntecedentUnion])


So essentially it converts everything into AntecedentUnion, where the first
column is the type of the tuple and the other fields are supplied or left empty.
Then merge all of those and run FP-Growth on them. Hope that helps!
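
For context, a minimal sketch of feeding fullResults into MLlib's FP-Growth (the support
threshold and partition count are illustrative):

import org.apache.spark.mllib.fpm.FPGrowth

val fpg = new FPGrowth()
  .setMinSupport(0.01)   // illustrative threshold
  .setNumPartitions(64)  // illustrative parallelism

val model = fpg.run(fullResults)   // fullResults: RDD[Array[AntecedentUnion]] from above

// Inspect a few frequent itemsets.
model.freqItemsets.take(20).foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + " -> " + itemset.freq)
}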



On Mon, May 15, 2017 at 12:06 PM, goun na  wrote:
>
> I stated it the other way around: collect_list keeps duplicated results (collect_set eliminates them).
>
> 2017-05-16 0:50 GMT+09:00 goun na :
>>
>> Hi, Jone Zhang
>>
>> 1. Hive UDF
>> You might need collect_set or collect_list (to eliminate duplication),
>> but make sure to reduce its cardinality before applying the UDFs, as it can
>> cause problems while handling 1 billion records. Union datasets 1, 2, 3 ->
>> group by user_id1 -> collect_set(feature column) would work.
>>
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>>
>> 2.Spark Dataframe Pivot
>>
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
>>
>> - Goun
>>
>> 2017-05-15 22:15 GMT+09:00 Jone Zhang :
>>>
>>> For example
>>> Data1(has 1 billion records)
>>> user_id1  feature1
>>> user_id1  feature2
>>>
>>> Data2(has 1 billion records)
>>> user_id1  feature3
>>>
>>> Data3(has 1 billion records)
>>> user_id1  feature4
>>> user_id1  feature5
>>> ...
>>> user_id1  feature100
>>>
>>> I want to get the result as follow
>>> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>>>
>>> Is there a more efficient way except join?
>>>
>>> Thanks!
>>
>>
>


Re: Spark streaming - TIBCO EMS

2017-05-15 Thread Piotr Smoliński
Hi Pradeep,

You need to connect via regular JMS API. Obtain factory from JNDI or create
it directly using
com.tibco.tibjms.TibjmsConnectionFactory. On the classpath you need JMS 2.0
API (jms-2.0.jar)
and EMS driver classes (tibjms.jar).

Regards,
Piotr
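
To make that concrete, here is a rough sketch of a custom receiver built on the plain JMS
API (server URL, topic and credentials are placeholders; it assumes the
TibjmsConnectionFactory(serverUrl) constructor from tibjms.jar):

import javax.jms._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class EmsReceiver(serverUrl: String, topicName: String, user: String, password: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  @transient private var connection: Connection = _

  override def onStart(): Unit = {
    val factory = new com.tibco.tibjms.TibjmsConnectionFactory(serverUrl)
    connection = factory.createConnection(user, password)
    val session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
    val consumer = session.createConsumer(session.createTopic(topicName))
    consumer.setMessageListener(new MessageListener {
      override def onMessage(msg: Message): Unit = msg match {
        case t: TextMessage => store(t.getText)   // hand the payload to Spark
        case _              => ()                 // ignore non-text messages in this sketch
      }
    })
    connection.start()
  }

  override def onStop(): Unit = if (connection != null) connection.close()
}

// Usage: val lines = ssc.receiverStream(new EmsReceiver("tcp://ems-host:7222", "some.topic", "user", "pass"))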

On Mon, May 15, 2017 at 5:47 PM, Pradeep  wrote:

> What is the best way to connect to TIBCO EMS using spark streaming?
>
> Do we need to write custom receivers or any libraries already exist.
>
> Thanks,
> Pradeep
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Spark streaming - TIBCO EMS

2017-05-15 Thread Pradeep
What is the best way to connect to TIBCO EMS using spark streaming?

Do we need to write custom receivers or any libraries already exist.

Thanks,
Pradeep

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Adding worker dynamically in standalone mode

2017-05-15 Thread Sonal Goyal
If I remember correctly, you just start a worker process on the new machine, pointing it at the current master.
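
Concretely (a sketch; host and port are placeholders, and the script is named start-slave.sh in Spark 1.x/2.x), on the new machine:

./sbin/start-slave.sh spark://<master-host>:7077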

On Monday, May 15, 2017, Seemanto Barua  wrote:

> Hi
>
> Is it possible to add a worker dynamically to the master in standalone
> mode. If so can you please share the steps on how to ?
> Thanks
>


-- 
Thanks,
Sonal
Nube Technologies 




Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread ayan guha
You may consider writing all your data to a NoSQL datastore such as HBase,
using the user id as the key.

There is a SQL solution using max and an inner case and finally a union of the
results, but that may be expensive.
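
One possible reading of that SQL approach, sketched with DataFrame functions (it assumes
the distinct feature names are known up front, and union is unionAll on Spark 1.6):

import org.apache.spark.sql.functions._

val featureNames = Seq("feature1", "feature2", "feature3")   // illustrative subset of the feature names
val all = data1.union(data2).union(data3)                    // data1/2/3 stand for the three datasets

// One max(when(...)) column per feature, then a single group-by on the user id.
val aggCols = featureNames.map(f => max(when($"feature" === f, f)).as(f))
val wide = all.groupBy($"user_id").agg(aggCols.head, aggCols.tail: _*)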
On Tue, 16 May 2017 at 12:13 am, Didac Gil  wrote:

> Or maybe you could also check using the collect_list from the SQL functions
>
> val compacter = Data1.groupBy("UserID")
>   .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures"))
>
>
>
> On 15 May 2017, at 15:15, Jone Zhang  wrote:
>
> For example
> Data1(has 1 billion records)
> user_id1  feature1
> user_id1  feature2
>
> Data2(has 1 billion records)
> user_id1  feature3
>
> Data3(has 1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
>
> I want to get the result as follow
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
>
> Is there a more efficient way except join?
>
> Thanks!
>
> Didac Gil de la Iglesia
> PhD in Computer Science
> didacg...@gmail.com
> Spain: +34 696 285 544
> Sweden: +46 (0)730229737
> Skype: didac.gil.de.la.iglesia
>
> --
Best Regards,
Ayan Guha


Adding worker dynamically in standalone mode

2017-05-15 Thread Seemanto Barua
Hi
Is it possible to add a worker dynamically to the master in standalone mode. If 
so can you please share the steps on how to ?

Thanks

Adding worker dynamically in standalone mode

2017-05-15 Thread seemanto.barua
Hi,

Is it possible to add a worker dynamically to the master in standalone mode. If 
so can you please share the steps on how to ?

-thanks
Seemanto Barua





Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Didac Gil
Or maybe you could also check using collect_list from the SQL functions:

val compacter = Data1.groupBy("UserID")
  .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures"))
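
Building on that, a sketch of producing the single row per user from the example data
(dataset and column names follow the example; on Spark 1.6 use unionAll and a HiveContext
for collect_list):

import org.apache.spark.sql.functions._

val all = data1.union(data2).union(data3)   // the three feature datasets from the example

val merged = all.groupBy($"user_id")
  .agg(concat_ws(" ", collect_list($"feature")).as("features"))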


> On 15 May 2017, at 15:15, Jone Zhang  wrote:
> 
> For example
> Data1(has 1 billion records)
> user_id1  feature1
> user_id1  feature2
> 
> Data2(has 1 billion records)
> user_id1  feature3
> 
> Data3(has 1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
> 
> I want to get the result as follow
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
> 
> Is there a more efficient way except join?
> 
> Thanks!

Didac Gil de la Iglesia
PhD in Computer Science
didacg...@gmail.com
Spain: +34 696 285 544
Sweden: +46 (0)730229737
Skype: didac.gil.de.la.iglesia





Re: How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Didac Gil
I guess that if your user_id field is the key, you could use the 
updateStateByKey function.

I did not test it, but it could be something along these lines:

def yourCombineFunction(input: Seq[String], accumulatedInput: Option[String]): Option[String] = {
  val state = accumulatedInput.getOrElse("")   // In case the current key was not seen before, the feature list is empty
  val features = input.mkString(" ")           // The feature values that arrived for this key in this batch

  val newFeatures = (state + " " + features).trim
  Some(newFeatures)                            // The new accumulated value for the features is returned
}

val updatedData = Data1.updateStateByKey(yourCombineFunction) // This "iterates" over all the entries
// in your DStream and, for each key, updates the accumulated features

Good luck

> On 15 May 2017, at 15:15, Jone Zhang  wrote:
> 
> For example
> Data1(has 1 billion records)
> user_id1  feature1
> user_id1  feature2
> 
> Data2(has 1 billion records)
> user_id1  feature3
> 
> Data3(has 1 billion records)
> user_id1  feature4
> user_id1  feature5
> ...
> user_id1  feature100
> 
> I want to get the result as follow
> user_id1  feature1 feature2 feature3 feature4 feature5...feature100
> 
> Is there a more efficient way except join?
> 
> Thanks!

Didac Gil de la Iglesia
PhD in Computer Science
didacg...@gmail.com
Spain: +34 696 285 544
Sweden: +46 (0)730229737
Skype: didac.gil.de.la.iglesia





Re: Why spark.sql.autoBroadcastJoinThreshold not available

2017-05-15 Thread Yong Zhang
You should post the execution plan here, so we can provide more accurate 
support.


Since you are building your feature table with a projection ("where ..."), my
guess is that the following JIRA (SPARK-13383) stops the broadcast join. This is
fixed in Spark 2.x. Can you try it on Spark 2.0?
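
As a possible workaround on 1.6 (a sketch, not something suggested in the thread), the
explicit broadcast hint bypasses the threshold/statistics issue; the join column below is
a placeholder:

import org.apache.spark.sql.functions.broadcast

val sample  = hiveContext.table("sample")
val feature = hiveContext.table("feature")

// Explicitly mark the small side for broadcast instead of relying on autoBroadcastJoinThreshold.
val joined = sample.join(broadcast(feature), sample("id") === feature("id"))   // "id" is a placeholder join key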

Yong


From: Jone Zhang 
Sent: Wednesday, May 10, 2017 7:10 AM
To: user @spark/'user @spark'/spark users/user@spark
Subject: Why spark.sql.autoBroadcastJoinThreshold not available

Now I use Spark 1.6.0 in Java.
I wish the following SQL to be executed as a BroadcastJoin:
select * from sample join feature

These are my steps:
1. set spark.sql.autoBroadcastJoinThreshold=100M
2. HiveContext.sql("cache lazy table feature as select * from src where ...")
   whose result size is only 100K
3. HiveContext.sql("select * from sample join feature")
Why is the join a SortMergeJoin?

Grateful for any idea!
Thanks.


How can i merge multiple rows to one row in sparksql or hivesql?

2017-05-15 Thread Jone Zhang
For example
Data1(has 1 billion records)
user_id1  feature1
user_id1  feature2

Data2(has 1 billion records)
user_id1  feature3

Data3(has 1 billion records)
user_id1  feature4
user_id1  feature5
...
user_id1  feature100

I want to get the result as follow
user_id1  feature1 feature2 feature3 feature4 feature5...feature100

Is there a more efficient way other than a join?

Thanks!


Re: Kafka 0.8.x / 0.9.x support in structured streaming

2017-05-15 Thread David Kaczynski
I haven't done Structured Streaming in Spark 2.1 with Kafka 0.9.x, but I
did do stream processing with Spark 2.0.1 and Kafka 0.10.

Here's the official documentation that I used for Spark Streaming with Kafka
0.10:  https://spark.apache.org/docs/2.1.0/streaming-kafka-integration.html.
It looks like you might need to use the 0.8.x connector, but I found that
connector to be inadequate for some use cases, such as using a regex to
subscribe to multiple topics.  Your results may vary, and I wouldn't be
surprised if there was another source of official documentation on the
topic.
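
For the 0.8/0.9 case, a minimal sketch of the DStream-based direct connector (not
structured streaming); broker list and topic are placeholders, ssc is an existing
StreamingContext, and the spark-streaming-kafka-0-8 artifact is assumed to be on the classpath:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")   // placeholder brokers
val topics = Set("my-topic")                                                    // placeholder topic

// Receiver-less direct stream of (key, value) pairs from 0.8/0.9-compatible brokers.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()   // e.g. print the record count of each batch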

On Mon, May 15, 2017 at 5:25 AM Swapnil Chougule 
wrote:

> Hello
>
> I am new to structured streaming. Wanted to learn if there is support for
> Kafka 0.8.x or Kafka 0.9.x in structured streaming ?
> My Kafka source is version 0.9.x and I want to have a structured streaming
> solution on top of it.
> I checked documentation for Spark release 2.1.0 but didn't get exact info.
> Any help is most welcome. Thanks in advance.
>
> --Swapnil
>


Re: ElasticSearch Spark error

2017-05-15 Thread Rohit Verma
Try switching on trace logging. Is your ES cluster running behind Docker? It's possible
that your Spark cluster can't communicate using the Docker IPs.

Regards
Rohit
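
If it does turn out to be the Docker/internal-IP case, one elasticsearch-hadoop setting
worth checking is es.nodes.wan.only (a sketch; addresses and the index name are placeholders):

val df = sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes", "10.0.1.80")       // the externally reachable ES address (placeholder)
  .option("es.port", "9200")
  .option("es.nodes.wan.only", "true")   // talk only to the declared nodes, not the cluster's internal (e.g. Docker) IPs
  .load("myindex/mytype")                // placeholder index/type

df.show(5)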

On May 15, 2017, at 4:55 PM, Nick Pentreath 
> wrote:

It may be best to ask on the elasticsearch-Hadoop github project

On Mon, 15 May 2017 at 13:19, nayan sharma 
> wrote:
Hi All,

ERROR:-

Caused by: org.apache.spark.util.TaskCompletionListenerException: Connection 
error (check network and/or proxy settings)- all nodes failed; tried 
[[10.0.1.8*:9200, 10.0.1.**:9200, 10.0.1.***:9200]]

I am getting this error while trying to show the dataframe.

df.count =5190767 and df.printSchema both are working fine.
It has 329 columns.

Do any one have any idea regarding this.Please help me to fix this.


Thanks,
Nayan






Re: ElasticSearch Spark error

2017-05-15 Thread Nick Pentreath
It may be best to ask on the elasticsearch-Hadoop github project

On Mon, 15 May 2017 at 13:19, nayan sharma  wrote:

> Hi All,
>
> *ERROR:-*
>
> *Caused by: org.apache.spark.util.TaskCompletionListenerException:
> Connection error (check network and/or proxy settings)- all nodes failed;
> tried [[10.0.1.8*:9200, 10.0.1.**:9200, 10.0.1.***:9200]]*
>
> I am getting this error while trying to show the dataframe.
>
> df.count =5190767 and df.printSchema both are working fine.
> It has 329 columns.
>
> Do any one have any idea regarding this.Please help me to fix this.
>
>
> Thanks,
> Nayan
>
>
>
>


ElasticSearch Spark error

2017-05-15 Thread nayan sharma
Hi All,

ERROR:-

Caused by: org.apache.spark.util.TaskCompletionListenerException: Connection 
error (check network and/or proxy settings)- all nodes failed; tried 
[[10.0.1.8*:9200, 10.0.1.**:9200, 10.0.1.***:9200]]

I am getting this error while trying to show the dataframe.

df.count =5190767 and df.printSchema both are working fine.
It has 329 columns.

Does anyone have any idea regarding this? Please help me fix it.


Thanks,
Nayan





Test

2017-05-15 Thread nayan sharma
Test

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: save SPark ml

2017-05-15 Thread issues solution
Hi,
please, I need help with this question.

2017-05-15 10:32 GMT+02:00 issues solution :

> Hi,
> I am under Pyspark 1.6 i want save my model in hdfs file like parquet
>
> how i can do this ?
>
>
> My  model it s  a RandomForestClassifier performed with corssvalidation
>
> like this
>
>
>
> rf_csv2  = CrossValidator()
>
> how i can save it ?
>
> thx for adavance
>
>
>
>
>


RE: Spark SQL DataFrame to Kafka Topic

2017-05-15 Thread Revin Chalil
I couldn't get this working yet. If anyone has successfully used the foreach sink for
Kafka with structured streaming, please share. Thanks.

From: Revin Chalil [mailto:rcha...@expedia.com]
Sent: Sunday, May 14, 2017 9:32 AM
To: Tathagata Das ; mich...@databricks.com
Cc: Peyman Mohajerian ; Senthil Kumar 
; User ; 
senthilec...@apache.org; Ofir Manor ; Hemanth Gudela 
; lucas.g...@gmail.com; Koert Kuipers 
; silvio.fior...@granturing.com
Subject: RE: Spark SQL DataFrame to Kafka Topic

Hi TD / Michael,


I am trying to use the foreach sink to write to Kafka and followed this post
from the Databricks blog by Sunil Sitaula. I get the below with
DF.writeStream.foreach(writer).outputMode("update").start() when using a
simple DF:

Type mismatch, expected: foreachWriter[Row], actual: KafkaSink

Cannot resolve reference foreach with such signature



Below is the snippet

val data = session
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", KafkaBroker)
  .option("subscribe", InTopic)
  .load()
  .select($"value".as[Array[Byte]])
  .flatMap(d => {
var events = AvroHelper.readEvents(d)
events.map((event: HdfsEvent) => {
  var payload = EventPayloadParser.read(event.getPayload)
  new KafkaMessage(payload)
})
  })



case class KafkaMessage(
  payload: String)



This is where I use the foreach

val writer = new KafkaSink("kafka-topic", KafkaBroker)
val query = data.writeStream.foreach(writer).outputMode("update").start()



In this case, it shows –

Type mismatch, expected: foreachWriter[Main.KafkaMesage], actual: Main.KafkaSink

Cannot resolve reference foreach with such signature



Any help is much appreciated. Thank you.


From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Friday, January 13, 2017 3:31 PM
To: Koert Kuipers >
Cc: Peyman Mohajerian >; Senthil 
Kumar >; User 
>; 
senthilec...@apache.org
Subject: Re: Spark SQL DataFrame to Kafka Topic

Structured Streaming has a foreach sink, where you can essentially do what you 
want with your data. Its easy to create a Kafka producer, and write the data 
out to kafka.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach

On Fri, Jan 13, 2017 at 8:28 AM, Koert Kuipers 
> wrote:
how do you do this with structured streaming? i see no mention of writing to 
kafka

On Fri, Jan 13, 2017 at 10:30 AM, Peyman Mohajerian 
> wrote:
Yes, it is called Structured Streaming: 
https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

On Fri, Jan 13, 2017 at 3:32 AM, Senthil Kumar 
> wrote:
Hi Team ,

 Sorry if this question already asked in this forum..

Can we ingest data to Apache Kafka Topic from Spark SQL DataFrame ??

Here is my Code which Reads Parquet File :


val sqlContext = new org.apache.spark.sql.SQLContext(sc);

val df = sqlContext.read.parquet("/temp/*.parquet")

df.registerTempTable("beacons")



I want to directly ingest df DataFrame to Kafka ! Is there any way to achieve 
this ??



Cheers,

Senthil





Kafka 0.8.x / 0.9.x support in structured streaming

2017-05-15 Thread Swapnil Chougule
Hello

I am new to structured streaming. Wanted to learn if there is support for
Kafka 0.8.x or Kafka 0.9.x in structured streaming ?
My Kafka source is version 0.9.x and I want to have a structured streaming
solution on top of it.
I checked documentation for Spark release 2.1.0 but didn't get exact info.
Any help is most welcome. Thanks in advance.

--Swapnil


Any solution for this?

2017-05-15 Thread Aakash Basu
Hi all,

Any solution for this issue - http://stackoverflow.com/q/43921392/7998705



Thanks,
Aakash.


save SPark ml

2017-05-15 Thread issues solution
Hi,
I am on PySpark 1.6 and I want to save my model to an HDFS file, like Parquet.

How can I do this?

My model is a RandomForestClassifier tuned with cross-validation, like this:

rf_csv2 = CrossValidator()

How can I save it?

Thanks in advance


spark on yarn cluster model can't use saveAsTable ?

2017-05-15 Thread lk_spark
hi, all:
    I have a test under Spark 2.1.0 which reads txt files as a DataFrame and
saves them to Hive. When I submit the app jar in yarn client mode it works well,
but if I submit in cluster mode, it does not create the table or write any data,
and I didn't find any error log... can anybody give me some clue?

2017-05-15


lk_spark