Re: Using data frames to join separate RDDs in spark streaming

2016-06-05 Thread Cyril Scetbon
Problem solved by creating only one RDD.
> On Jun 1, 2016, at 14:05, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> It seems that to join a DStream with a RDD I can use :
> 
> mgs.transform(rdd => rdd.join(rdd1))
> 
> or
> 
> mgs.foreachRDD(rdd => rdd.join(rdd1))
> 
> But, I can't see why rdd1.toDF("id","aid") really causes SPARK-5063
> 
> 
>> On Jun 1, 2016, at 12:00, Cyril Scetbon <cyril.scet...@free.fr> wrote:
>> 
>> Hi guys,
>> 
>> I have a 2 input data streams that I want to join using Dataframes and 
>> unfortunately I get the message produced by 
>> https://issues.apache.org/jira/browse/SPARK-5063 as I can't reference rdd1  
>> in (2) :
>> 
>> (1)
>> val rdd1 = sc.esRDD(es_resource.toLowerCase, query)
>>.map(r => (r._1, r._2))
>> 
>> (2)
>> mgs.map(x => x._1)
>>  .foreachRDD { rdd =>
>>val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
>>import sqlContext.implicits._
>> 
>>val df_aids = rdd.toDF("id")
>> 
>>val df = rdd1.toDF("id", "aid")
>> 
>>df.select(explode(df("aid")).as("aid"), df("id"))
>>   .join(df_aids, $"aid" === df_aids("id"))
>>   .select(df("id"), df_aids("id"))
>>  .
>>  }
>> 
>> Is there a way to still use Dataframes to do it or I need to do everything 
>> using RDDs join only ?
>> And If I need to use only RDDs join, how to do it ? as I have a RDD (rdd1) 
>> and a DStream (mgs) ?
>> 
>> Thanks
>> -- 
>> Cyril SCETBON
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Streaming w/variables used as dynamic queries

2016-06-03 Thread Cyril Scetbon
Hey guys,

Can someone help me solve the following issue? My code is:

(1)
var arr = new ArrayBuffer[String]()
sa_msgs.map(x => x._1)
       .foreachRDD { rdd => arr = new ArrayBuffer[String]() }

(2)
sa_msgs.map(x => x._1)
       .foreachRDD { rdd => arr ++= rdd.collect }

(3)
val rdd_site_aids = sc.esRDD(EsIndex,
                             arr.toList.mkString(prefix, separator, suffix))

(4)
sa_msgs.map(x => x._1)
       .foreachRDD { rdd =>
         rdd.map(x => (x, 0))
            .join(rdd_site_aids)
         ...
       }
This code works well with Spark but not with Spark Streaming, because arr has not been populated yet by the time (3) and (4) run.

I'd like to do everything inside the foreachRDD in (4), but sc is not serializable ... and 
esRDD needs a query that I build from sa_msgs.

Does anybody see a way to solve this? The fact that (1) and (2) don't block seems to be the 
real issue, but there is certainly a Spark Streaming way to handle it.
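
For what it's worth, the body of foreachRDD runs on the driver, so a SparkContext is reachable there via rdd.sparkContext without being serialized into the closure. A minimal sketch of doing everything per batch that way, reusing the names from (1)-(4) above (EsIndex, prefix, separator, suffix, and the elasticsearch-hadoop import that provides esRDD), and assuming each batch of keys is small enough to collect on the driver:

import org.elasticsearch.spark._

sa_msgs.map(x => x._1)
       .foreachRDD { rdd =>
         // Driver side: collect this batch's keys and build the ES query from them.
         val keys = rdd.collect()
         if (keys.nonEmpty) {
           val query = keys.toList.mkString(prefix, separator, suffix)
           // esRDD is usable here because this block runs on the driver.
           val rdd_site_aids = rdd.sparkContext.esRDD(EsIndex, query)
           rdd.map(x => (x, 0))
              .join(rdd_site_aids)
           // ... rest of the per-batch processing
         }
       }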

Thanks

Re: Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
It seems that to join a DStream with an RDD I can use:

mgs.transform(rdd => rdd.join(rdd1))

or

mgs.foreachRDD(rdd => rdd.join(rdd1))

But I can't see why rdd1.toDF("id", "aid") actually triggers SPARK-5063.
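
For reference, a minimal sketch of the transform-based variant above, assuming mgs is a DStream of (id, value) pairs and rdd1 is the (id, aid) pair RDD built in (1):

val joined = mgs.transform { rdd =>
  // Per-batch pair-RDD join; each element becomes (id, (value, aid)).
  rdd.join(rdd1)
}
// Peek at a few joined records per batch, printed on the driver.
joined.foreachRDD(rdd => rdd.take(10).foreach(println))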

 
> On Jun 1, 2016, at 12:00, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> Hi guys,
> 
> I have a 2 input data streams that I want to join using Dataframes and 
> unfortunately I get the message produced by 
> https://issues.apache.org/jira/browse/SPARK-5063 as I can't reference rdd1  
> in (2) :
> 
> (1)
> val rdd1 = sc.esRDD(es_resource.toLowerCase, query)
> .map(r => (r._1, r._2))
> 
> (2)
> mgs.map(x => x._1)
>   .foreachRDD { rdd =>
> val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
> import sqlContext.implicits._
> 
> val df_aids = rdd.toDF("id")
> 
> val df = rdd1.toDF("id", "aid")
> 
> df.select(explode(df("aid")).as("aid"), df("id"))
>.join(df_aids, $"aid" === df_aids("id"))
>.select(df("id"), df_aids("id"))
>   .
>   }
> 
> Is there a way to still use Dataframes to do it or I need to do everything 
> using RDDs join only ?
> And If I need to use only RDDs join, how to do it ? as I have a RDD (rdd1) 
> and a DStream (mgs) ?
> 
> Thanks
> -- 
> Cyril SCETBON
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Using data frames to join separate RDDs in spark streaming

2016-06-01 Thread Cyril Scetbon
Hi guys,

I have 2 input data streams that I want to join using DataFrames, but unfortunately I get 
the error described in https://issues.apache.org/jira/browse/SPARK-5063 because I can't 
reference rdd1 in (2):

(1)
val rdd1 = sc.esRDD(es_resource.toLowerCase, query)
            .map(r => (r._1, r._2))

(2)
mgs.map(x => x._1)
   .foreachRDD { rdd =>
     val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
     import sqlContext.implicits._

     val df_aids = rdd.toDF("id")

     val df = rdd1.toDF("id", "aid")

     df.select(explode(df("aid")).as("aid"), df("id"))
       .join(df_aids, $"aid" === df_aids("id"))
       .select(df("id"), df_aids("id"))
       ...
   }

Is there a way to still use DataFrames for this, or do I need to do everything with plain 
RDD joins? And if I need to use only RDD joins, how do I do it, given that I have an RDD 
(rdd1) and a DStream (mgs)?
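
For what it's worth, one arrangement to try (a sketch only, not verified against this exact SPARK-5063 case): convert rdd1 to a DataFrame once on the driver, outside any DStream closure, and only reference the resulting DataFrame inside foreachRDD. Names are the ones used above; whether this avoids the error depends on how the streaming closures end up being serialized (e.g. with checkpointing).

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._

// Built once on the driver, before the streaming logic.
val df_static = rdd1.toDF("id", "aid")

mgs.map(x => x._1).foreachRDD { rdd =>
  val df_aids = rdd.toDF("id")

  df_static.select(explode(df_static("aid")).as("aid"), df_static("id"))
           .join(df_aids, $"aid" === df_aids("id"))
           .select(df_static("id"), df_aids("id"))
           .show()
}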

Thanks
-- 
Cyril SCETBON


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Joining a RDD to a Dataframe

2016-05-12 Thread Cyril Scetbon
Nobody has the answer?

Another thing I've seen is that if I have no documents at all:

scala> df.select(explode(df("addresses.id")).as("aid")).collect
res27: Array[org.apache.spark.sql.Row] = Array()

Then

scala> df.select(explode(df("addresses.id")).as("aid"), df("id"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among 
(adresses);
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)

Is there a better way to query nested objects, and to join a DF containing nested objects 
with another regular data frame (yes, that's my current case)?
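
For the empty-index case, a small defensive sketch (assumption, suggested by the error above: the schema inferred for df only contains fields actually present, so "id" and the nested fields can be missing when there are no documents; df, df_input and the sqlContext implicits are the ones from the snippets below):

import org.apache.spark.sql.functions.explode

if (df.columns.contains("addresses") && df.columns.contains("id")) {
  df.select(explode(df("addresses.id")).as("aid"), df("id"))
    .join(df_input, $"aid" === df_input("id"))
    .select(df("id"))
    .show()
} else {
  // Nothing to join against; print the schema we actually got.
  println(s"Skipping join, unexpected schema:\n${df.schema.treeString}")
}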

> On May 9, 2016, at 00:42, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> Hi Ashish,
> 
> The issue is not related to converting a RDD to a DF. I did it. I was just 
> asking if I should do it differently.
> 
> The issue regards the exception when using array_contains with a sql.Column 
> instead of a value.
> 
> I found another way to do it using explode as follows : 
> 
> df.select(explode(df("addresses.id")).as("aid"), df("id")).join(df_input, 
> $"aid" === df_input("id")).select(df("id"))
> 
> However, I'm wondering if it does almost the same or if the query is 
> different and worst in term of performance.
> 
> If someone can comment on it and maybe give me advices.
> 
> Thank you.
> 
>> On May 8, 2016, at 22:12, Ashish Dubey <ashish@gmail.com 
>> <mailto:ashish@gmail.com>> wrote:
>> 
>> Is there any reason you dont want to convert this - i dont think join b/w 
>> RDD and DF is supported.
>> 
>> On Sat, May 7, 2016 at 11:41 PM, Cyril Scetbon <cyril.scet...@free.fr 
>> <mailto:cyril.scet...@free.fr>> wrote:
>> Hi,
>> 
>> I have a RDD built during a spark streaming job and I'd like to join it to a 
>> DataFrame (E/S input) to enrich it.
>> It seems that I can't join the RDD and the DF without converting first the 
>> RDD to a DF (Tell me if I'm wrong). Here are the schemas of both DF :
>> 
>> scala> df
>> res32: org.apache.spark.sql.DataFrame = [f1: string, addresses: 
>> array<struct<sf1:int,sf2:string,sf3:string,id:string>>, id: string]
>> 
>> scala> df_input
>> res33: org.apache.spark.sql.DataFrame = [id: string]
>> 
>> scala> df_input.collect
>> res34: Array[org.apache.spark.sql.Row] = Array([idaddress2], [idaddress12])
>> 
>> I can get ids I want if I know the value to look for in addresses.id 
>> <http://addresses.id/> using :
>> 
>> scala> df.filter(array_contains(df("addresses.id <http://addresses.id/>"), 
>> "idaddress2")).select("id").collect
>> res35: Array[org.apache.spark.sql.Row] = Array([], [YY])
>> 
>> However when I try to join df_input and df and to use the previous filter as 
>> the join condition I get an exception :
>> 
>> scala> df.join(df_input, array_contains(df("adresses.id 
>> <http://adresses.id/>"), df_input("id")))
>> java.lang.RuntimeException: Unsupported literal type class 
>> org.apache.spark.sql.Column id
>> at 
>> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:50)
>> at 
>> org.apache.spark.sql.functions$.array_contains(functions.scala:2452)
>> ...
>> 
>> It seems that array_contains only supports static arguments and does not 
>> replace a sql.Column by its value.
>> 
>> What's the best way to achieve what I want to do ? (Also speaking in term of 
>> performance)
>> 
>> Thanks
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 
>> 
> 



Re: Joining a RDD to a Dataframe

2016-05-08 Thread Cyril Scetbon
Hi Ashish,

The issue is not related to converting an RDD to a DF. I did that. I was just asking 
whether I should do it differently.

The issue is about the exception raised when using array_contains with a sql.Column 
instead of a literal value.

I found another way to do it, using explode, as follows:

df.select(explode(df("addresses.id")).as("aid"), df("id"))
  .join(df_input, $"aid" === df_input("id"))
  .select(df("id"))

However, I'm wondering whether it does roughly the same thing, or whether the query is 
different and worse in terms of performance.

If someone could comment on that and maybe give me some advice, I'd appreciate it.
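
One way to compare the two formulations without guessing is to look at the plans Spark generates: explain(true) prints the logical and physical plans to the driver's stdout. A minimal sketch, assuming df, df_input and the sqlContext implicits from the snippets below are in scope:

import org.apache.spark.sql.functions.{array_contains, explode}

// Plan of the explode-based join.
df.select(explode(df("addresses.id")).as("aid"), df("id"))
  .join(df_input, $"aid" === df_input("id"))
  .select(df("id"))
  .explain(true)

// Plan of the array_contains filter (only possible with a literal value).
df.filter(array_contains(df("addresses.id"), "idaddress2"))
  .select("id")
  .explain(true)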

Thank you.

> On May 8, 2016, at 22:12, Ashish Dubey <ashish@gmail.com> wrote:
> 
> Is there any reason you dont want to convert this - i dont think join b/w RDD 
> and DF is supported.
> 
> On Sat, May 7, 2016 at 11:41 PM, Cyril Scetbon <cyril.scet...@free.fr 
> <mailto:cyril.scet...@free.fr>> wrote:
> Hi,
> 
> I have a RDD built during a spark streaming job and I'd like to join it to a 
> DataFrame (E/S input) to enrich it.
> It seems that I can't join the RDD and the DF without converting first the 
> RDD to a DF (Tell me if I'm wrong). Here are the schemas of both DF :
> 
> scala> df
> res32: org.apache.spark.sql.DataFrame = [f1: string, addresses: 
> array<struct<sf1:int,sf2:string,sf3:string,id:string>>, id: string]
> 
> scala> df_input
> res33: org.apache.spark.sql.DataFrame = [id: string]
> 
> scala> df_input.collect
> res34: Array[org.apache.spark.sql.Row] = Array([idaddress2], [idaddress12])
> 
> I can get ids I want if I know the value to look for in addresses.id 
> <http://addresses.id/> using :
> 
> scala> df.filter(array_contains(df("addresses.id <http://addresses.id/>"), 
> "idaddress2")).select("id").collect
> res35: Array[org.apache.spark.sql.Row] = Array([], [YY])
> 
> However when I try to join df_input and df and to use the previous filter as 
> the join condition I get an exception :
> 
> scala> df.join(df_input, array_contains(df("adresses.id 
> <http://adresses.id/>"), df_input("id")))
> java.lang.RuntimeException: Unsupported literal type class 
> org.apache.spark.sql.Column id
> at 
> org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:50)
> at 
> org.apache.spark.sql.functions$.array_contains(functions.scala:2452)
> ...
> 
> It seems that array_contains only supports static arguments and does not 
> replace a sql.Column by its value.
> 
> What's the best way to achieve what I want to do ? (Also speaking in term of 
> performance)
> 
> Thanks
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Joining a RDD to a Dataframe

2016-05-08 Thread Cyril Scetbon
Hi,

I have an RDD built during a Spark Streaming job and I'd like to join it to a DataFrame 
(E/S input) to enrich it.
It seems that I can't join the RDD and the DF without first converting the RDD to a DF 
(tell me if I'm wrong). Here are the schemas of both DFs:

scala> df
res32: org.apache.spark.sql.DataFrame = [f1: string, addresses: 
array<struct<sf1:int,sf2:string,sf3:string,id:string>>, id: string]

scala> df_input
res33: org.apache.spark.sql.DataFrame = [id: string]

scala> df_input.collect
res34: Array[org.apache.spark.sql.Row] = Array([idaddress2], [idaddress12])

I can get the ids I want if I know the value to look for in addresses.id, using:

scala> df.filter(array_contains(df("addresses.id"), 
"idaddress2")).select("id").collect
res35: Array[org.apache.spark.sql.Row] = Array([], [YY])

However, when I try to join df_input and df using the previous filter as the join 
condition, I get an exception:

scala> df.join(df_input, array_contains(df("adresses.id"), df_input("id")))
java.lang.RuntimeException: Unsupported literal type class 
org.apache.spark.sql.Column id
at 
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:50)
at org.apache.spark.sql.functions$.array_contains(functions.scala:2452)
...

It seems that array_contains only supports static arguments and does not substitute a 
sql.Column with its value.

What's the best way to achieve what I want to do (also in terms of performance)?

Thanks
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
The only way I've found to make it work for now is to use the current Spark context and 
change its configuration through spark-shell options. That's quite different from pyspark, 
where I could instantiate a new one, initialize it, etc.
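
For future readers, a minimal sketch of that approach: reuse the SparkContext that spark-shell already created (sc) instead of building a new one from a SparkConf, and pass any extra settings on the command line (the --conf value below is only a placeholder):

// spark-shell --master yarn-client --conf spark.hadoop.validateOutputSpecs=false

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build the StreamingContext on top of the shell's existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(3))
// ... define the streams, then:
// ssc.start()
// ssc.awaitTermination()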
> On Apr 4, 2016, at 18:16, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> It doesn't as you can see : http://pastebin.com/nKcMCtGb 
> <http://pastebin.com/nKcMCtGb>
> 
> I don't need to set the master as I'm using Yarn and I'm on one of the yarn 
> nodes. When I instantiate the Spark Streaming Context with the spark conf, it 
> tries to create a new Spark Context but even with 
> .set("spark.driver.allowMultipleContexts", "true") it doesn't work and 
> complains at line 956 that the Spark Context created by spark-shell was not 
> initialized with allowMultipleContexts ...
> 
> 
>> On Apr 4, 2016, at 16:29, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi Cyril,
>> 
>> You can connect to Spark shell from any node. The connection is made to 
>> master through --master IP Address like below:
>> 
>> spark-shell --master spark://50.140.197.217:7077 
>> <http://50.140.197.217:7077/>
>> 
>> Now in the Scala code you can specify something like below:
>> 
>> val sparkConf = new SparkConf().
>>  setAppName("StreamTest").
>>  setMaster("local").
>>  set("spark.cores.max", "2").
>>  set("spark.driver.allowMultipleContexts", "true").
>>  set("spark.hadoop.validateOutputSpecs", "false")
>> 
>> And that will work
>> 
>> Have you tried it?
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 4 April 2016 at 21:11, Cyril Scetbon <cyril.scet...@free.fr 
>> <mailto:cyril.scet...@free.fr>> wrote:
>> I suppose it doesn't work using spark-shell too ? If you can confirm
>> 
>> Thanks
>> 
>>> On Apr 3, 2016, at 03:39, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> This works fine for me
>>> 
>>> val sparkConf = new SparkConf().
>>>  setAppName("StreamTest").
>>>  setMaster("yarn-client").
>>>  set("spark.cores.max", "12").
>>>  set("spark.driver.allowMultipleContexts", "true").
>>>  set("spark.hadoop.validateOutputSpecs", "false")
>>> 
>>> Time: 1459669805000 ms
>>> ---
>>> -------
>>> Time: 145966986 ms
>>> ---
>>> (Sun Apr 3 08:35:01 BST 2016  === Sending messages from rhes5)
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 3 April 2016 at 03:34, Cyril Scetbon <cyril.scet...@free.fr 
>>> <mailto:cyril.scet...@free.fr>> wrote:
>>> Nobody has any idea ?
>>> 
>>> > On Mar 31, 2016, at 23:22, Cyril Scetbon <cyril.scet...@free.fr 
>>> > <mailto:cyril.scet...@free.fr>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I'm having issues to create a StreamingContext with Scala using 
>>> > spark-shell. It tries to access the localhost interface and the 
>>> > Application Master is not running on this interface :
>>> >
>>> > ERROR ApplicationMaster: Failed to connect to driver at localhost:47257, 
>>> > retrying ...
>>> >
>>> > I don't have the issue with Python and pyspark which works fine (you can 
>>> > see it uses the ip address) :
>>> >

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
It doesn't, as you can see: http://pastebin.com/nKcMCtGb

I don't need to set the master since I'm using YARN and I'm on one of the YARN nodes. When 
I instantiate the StreamingContext with the Spark conf, it tries to create a new 
SparkContext, but even with .set("spark.driver.allowMultipleContexts", "true") it doesn't 
work: it complains at line 956 that the SparkContext created by spark-shell was not 
initialized with allowMultipleContexts ...


> On Apr 4, 2016, at 16:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Hi Cyril,
> 
> You can connect to Spark shell from any node. The connection is made to 
> master through --master IP Address like below:
> 
> spark-shell --master spark://50.140.197.217:7077 <http://50.140.197.217:7077/>
> 
> Now in the Scala code you can specify something like below:
> 
> val sparkConf = new SparkConf().
>  setAppName("StreamTest").
>  setMaster("local").
>  set("spark.cores.max", "2").
>  set("spark.driver.allowMultipleContexts", "true").
>  set("spark.hadoop.validateOutputSpecs", "false")
> 
> And that will work
> 
> Have you tried it?
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 4 April 2016 at 21:11, Cyril Scetbon <cyril.scet...@free.fr 
> <mailto:cyril.scet...@free.fr>> wrote:
> I suppose it doesn't work using spark-shell too ? If you can confirm
> 
> Thanks
> 
>> On Apr 3, 2016, at 03:39, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> This works fine for me
>> 
>> val sparkConf = new SparkConf().
>>  setAppName("StreamTest").
>>  setMaster("yarn-client").
>>  set("spark.cores.max", "12").
>>  set("spark.driver.allowMultipleContexts", "true").
>>  set("spark.hadoop.validateOutputSpecs", "false")
>> 
>> Time: 1459669805000 ms
>> ---
>> ---
>> Time: 145966986 ms
>> -------
>> (Sun Apr 3 08:35:01 BST 2016  === Sending messages from rhes5)
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 3 April 2016 at 03:34, Cyril Scetbon <cyril.scet...@free.fr 
>> <mailto:cyril.scet...@free.fr>> wrote:
>> Nobody has any idea ?
>> 
>> > On Mar 31, 2016, at 23:22, Cyril Scetbon <cyril.scet...@free.fr 
>> > <mailto:cyril.scet...@free.fr>> wrote:
>> >
>> > Hi,
>> >
>> > I'm having issues to create a StreamingContext with Scala using 
>> > spark-shell. It tries to access the localhost interface and the 
>> > Application Master is not running on this interface :
>> >
>> > ERROR ApplicationMaster: Failed to connect to driver at localhost:47257, 
>> > retrying ...
>> >
>> > I don't have the issue with Python and pyspark which works fine (you can 
>> > see it uses the ip address) :
>> >
>> > ApplicationMaster: Driver now available: 192.168.10.100:43290 
>> > <http://192.168.10.100:43290/>
>> >
>> > I use similar codes though :
>> >
>> > test.scala :
>> > --
>> >
>> > import org.apache.spark._
>> > import org.apache.spark.streaming._
>> > val app = "test-scala"
>> > val conf = new SparkConf().setAppName(app).setMaster("yarn-client")
>> > val ssc = new StreamingContext(conf, Seconds(3))
>> >
>> > command used : spark-shell -i test.scala
>> >
>> > test.py :
>> > ---
>> >
>> > from pyspark import SparkConf, SparkContext
>> > from pyspark.streaming import StreamingContext
>> > app = &qu

Re: spark-shell failing but pyspark works

2016-04-04 Thread Cyril Scetbon
I suppose it doesn't work using spark-shell either? Can you confirm?

Thanks
> On Apr 3, 2016, at 03:39, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> This works fine for me
> 
> val sparkConf = new SparkConf().
>  setAppName("StreamTest").
>  setMaster("yarn-client").
>  set("spark.cores.max", "12").
>  set("spark.driver.allowMultipleContexts", "true").
>  set("spark.hadoop.validateOutputSpecs", "false")
> 
> Time: 1459669805000 ms
> ---
> ---
> Time: 145966986 ms
> ---
> (Sun Apr 3 08:35:01 BST 2016  === Sending messages from rhes5)
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 3 April 2016 at 03:34, Cyril Scetbon <cyril.scet...@free.fr 
> <mailto:cyril.scet...@free.fr>> wrote:
> Nobody has any idea ?
> 
> > On Mar 31, 2016, at 23:22, Cyril Scetbon <cyril.scet...@free.fr 
> > <mailto:cyril.scet...@free.fr>> wrote:
> >
> > Hi,
> >
> > I'm having issues to create a StreamingContext with Scala using 
> > spark-shell. It tries to access the localhost interface and the Application 
> > Master is not running on this interface :
> >
> > ERROR ApplicationMaster: Failed to connect to driver at localhost:47257, 
> > retrying ...
> >
> > I don't have the issue with Python and pyspark which works fine (you can 
> > see it uses the ip address) :
> >
> > ApplicationMaster: Driver now available: 192.168.10.100:43290 
> > <http://192.168.10.100:43290/>
> >
> > I use similar codes though :
> >
> > test.scala :
> > --
> >
> > import org.apache.spark._
> > import org.apache.spark.streaming._
> > val app = "test-scala"
> > val conf = new SparkConf().setAppName(app).setMaster("yarn-client")
> > val ssc = new StreamingContext(conf, Seconds(3))
> >
> > command used : spark-shell -i test.scala
> >
> > test.py :
> > ---
> >
> > from pyspark import SparkConf, SparkContext
> > from pyspark.streaming import StreamingContext
> > app = "test-python"
> > conf = SparkConf().setAppName(app).setMaster("yarn-client")
> > sc = SparkContext(conf=conf)
> > ssc = StreamingContext(sc, 3)
> >
> > command used : pyspark test.py
> >
> > Any idea why scala can't instantiate it ? I thought python was barely using 
> > scala under the hood, but it seems there are differences. Are there any 
> > parameters set using Scala but not Python ?
> >
> > Thanks
> > --
> > Cyril SCETBON
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > <mailto:user-unsubscr...@spark.apache.org>
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > <mailto:user-h...@spark.apache.org>
> >
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Re: spark-shell failing but pyspark works

2016-04-02 Thread Cyril Scetbon
Nobody has any idea?

> On Mar 31, 2016, at 23:22, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> Hi,
> 
> I'm having issues to create a StreamingContext with Scala using spark-shell. 
> It tries to access the localhost interface and the Application Master is not 
> running on this interface :
> 
> ERROR ApplicationMaster: Failed to connect to driver at localhost:47257, 
> retrying ...
> 
> I don't have the issue with Python and pyspark which works fine (you can see 
> it uses the ip address) : 
> 
> ApplicationMaster: Driver now available: 192.168.10.100:43290
> 
> I use similar codes though :
> 
> test.scala :
> --
> 
> import org.apache.spark._
> import org.apache.spark.streaming._
> val app = "test-scala"
> val conf = new SparkConf().setAppName(app).setMaster("yarn-client")
> val ssc = new StreamingContext(conf, Seconds(3))
> 
> command used : spark-shell -i test.scala
> 
> test.py :
> ---
> 
> from pyspark import SparkConf, SparkContext
> from pyspark.streaming import StreamingContext
> app = "test-python"
> conf = SparkConf().setAppName(app).setMaster("yarn-client")
> sc = SparkContext(conf=conf)
> ssc = StreamingContext(sc, 3)
> 
> command used : pyspark test.py
> 
> Any idea why scala can't instantiate it ? I thought python was barely using 
> scala under the hood, but it seems there are differences. Are there any 
> parameters set using Scala but not Python ? 
> 
> Thanks
> -- 
> Cyril SCETBON
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark-shell failing but pyspark works

2016-03-31 Thread Cyril Scetbon
Hi,

I'm having issues creating a StreamingContext in Scala using spark-shell. It tries to use 
the localhost interface, and the Application Master is not running on that interface:

ERROR ApplicationMaster: Failed to connect to driver at localhost:47257, 
retrying ...

I don't have the issue with Python and pyspark, which works fine (you can see it uses the 
IP address):

ApplicationMaster: Driver now available: 192.168.10.100:43290

I use similar code in both, though:

test.scala :
--

import org.apache.spark._
import org.apache.spark.streaming._
val app = "test-scala"
val conf = new SparkConf().setAppName(app).setMaster("yarn-client")
val ssc = new StreamingContext(conf, Seconds(3))

command used : spark-shell -i test.scala

test.py :
---

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
app = "test-python"
conf = SparkConf().setAppName(app).setMaster("yarn-client")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 3)

command used : pyspark test.py

Any idea why Scala can't instantiate it? I thought Python was essentially just calling 
Scala under the hood, but it seems there are differences. Are there any parameters set by 
Scala but not by Python?

Thanks
-- 
Cyril SCETBON


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: StateSpec raises error "missing arguments for method"

2016-03-27 Thread Cyril Scetbon
Hmm, I finally found that the following syntax works better:

scala> val mappingFunction = (key: String, value: Option[Int], state: 
State[Int]) => {
 | Option(key)
 | }
mappingFunction: (String, Option[Int], org.apache.spark.streaming.State[Int]) 
=> Option[String] = 

scala> val spec = StateSpec.function(mappingFunction)
spec: org.apache.spark.streaming.StateSpec[String,Int,Int,Option[String]] = 
StateSpecImpl()

This code comes from one of the Spark examples. If anyone can explain the difference 
between the two, I'd appreciate it.
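
For reference, a sketch of the other workaround the compiler hint points at, assuming the usual explanation applies here: a def is a method, not yet a Function3 value, and since StateSpec.function is overloaded the compiler does not eta-expand it automatically; either a function literal (as above) or an explicit trailing underscore gives it an actual function value.

import org.apache.spark.streaming.{State, StateSpec}

def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
  Option(key)
}

// Explicit eta-expansion: the trailing underscore turns the method into a
// (String, Option[Int], State[Int]) => Option[String] function value.
val spec = StateSpec.function(mappingFunction _)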

Regards 
> On Mar 27, 2016, at 23:57, Cyril Scetbon <cyril.scet...@free.fr> wrote:
> 
> Hi,
> 
> I'm testing a sample code and I get this error. Here is the sample code I use 
> :
> 
> scala> import org.apache.spark.streaming._
> import org.apache.spark.streaming._
> 
> scala> def mappingFunction(key: String, value: Option[Int], state: 
> State[Int]): Option[String] = {
> |Option(key)
> | }
> mappingFunction: (key: String, value: Option[Int], state: 
> org.apache.spark.streaming.State[Int])Option[String]
> 
> scala> val spec = StateSpec.function(mappingFunction)
> :41: error: missing arguments for method mappingFunction;
> follow this method with `_' if you want to treat it as a partially applied 
> function
> val spec = StateSpec.function(mappingFunction)
>   ^
> I followed the current documentation 
> https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/streaming/StateSpec.html
> 
> Any idea ?
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



StateSpec raises error "missing arguments for method"

2016-03-27 Thread Cyril Scetbon
Hi,

I'm testing some sample code and I get this error. Here is the code I use:

scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._

scala> def mappingFunction(key: String, value: Option[Int], state: State[Int]): 
Option[String] = {
 |Option(key)
 | }
mappingFunction: (key: String, value: Option[Int], state: 
org.apache.spark.streaming.State[Int])Option[String]

scala> val spec = StateSpec.function(mappingFunction)
:41: error: missing arguments for method mappingFunction;
follow this method with `_' if you want to treat it as a partially applied 
function
 val spec = StateSpec.function(mappingFunction)
   ^
I followed the current documentation 
https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/streaming/StateSpec.html

Any idea?
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: python support of mapWithState

2016-03-24 Thread Cyril Scetbon
It's a pity; I now have to port my Python code because the support across the three 
languages is not the same :(
> On Mar 24, 2016, at 20:58, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> In general the Python API lags behind the Scala & Java APIs. The Scala & Java 
> APIs tend to be easier to keep in sync since they are both in the JVM and a 
> bit more work is needed to expose the same functionality from the JVM in 
> Python (or re-implement the Scala code in Python where appropriate).
> 
> On Thursday, March 24, 2016, Cyril Scetbon <cyril.scet...@free.fr 
> <mailto:cyril.scet...@free.fr>> wrote:
> Hi guys,
> 
> It seems that mapWithState is not supported. Am I right ? Is there a reason 
> there are 3 languages supported and one is behind the two others ?
> 
> Thanks
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org <>
> For additional commands, e-mail: user-h...@spark.apache.org <>
> 



python support of mapWithState

2016-03-24 Thread Cyril Scetbon
Hi guys,

It seems that mapWithState is not supported in Python. Am I right? Is there a reason three 
languages are supported but one lags behind the other two?

Thanks
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming with Kafka DirectStream

2016-02-17 Thread Cyril Scetbon
I don't think we can print an integer value in a Spark Streaming process, as opposed to a 
Spark job. I think I can print the content of an RDD but not debug messages. Am I wrong?
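
For what it's worth, the body of foreachRDD runs on the driver, so plain println calls placed there end up in the driver's stdout once per batch. A minimal sketch (in Scala for illustration; stream stands for whichever DStream the job builds):

stream.foreachRDD { rdd =>
  // Driver-side debug output for this batch.
  println(s"batch has ${rdd.partitions.size} partitions")
  rdd.take(5).foreach(record => println(s"sample record: $record"))
}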

Cyril Scetbon

> On Feb 17, 2016, at 12:51 AM, ayan guha <guha.a...@gmail.com> wrote:
> 
> Hi
> 
> You can always use RDD properties, which already has partition information.
> 
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
>  
> 
>> On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon <cyril.scet...@free.fr> wrote:
>> Your understanding is the right one (having re-read the documentation). 
>> Still wondering how I can verify that 5 partitions have been created. My job 
>> is reading from a topic in Kafka that has 5 partitions and sends the data to 
>> E/S. I can see that when there is one task to read from Kafka there are 5 
>> tasks writing to E/S. So I'm supposing that the task reading from Kafka does 
>> it in // using 5 partitions and that's why there are then 5 tasks to write 
>> to E/S. But I'm supposing ...
>> 
>>> On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com> wrote:
>>> 
>>> I have a slightly different understanding. 
>>> 
>>> Direct stream generates 1 RDD per batch, however, number of partitions in 
>>> that RDD = number of partitions in kafka topic.
>>> 
>>>> On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <cyril.scet...@free.fr> 
>>>> wrote:
>>>> Hi guys,
>>>> 
>>>> I'm making some tests with Spark and Kafka using a Python script. I use 
>>>> the second method that doesn't need any receiver (Direct Approach). It 
>>>> should adapt the number of RDDs to the number of partitions in the topic. 
>>>> I'm trying to verify it. What's the easiest way to verify it ? I also 
>>>> tried to co-locate Yarn, Spark and Kafka to check if RDDs are created 
>>>> depending on the leaders of partitions in a topic, and they are not. Can 
>>>> you confirm that RDDs are not created depending on the location of 
>>>> partitions and that co-locating Kafka with Spark is not a must-have or 
>>>> that Spark does not take advantage of it ?
>>>> 
>>>> As the parallelism is simplified (by creating as many RDDs as there are 
>>>> partitions) I suppose that the biggest part of the tuning is playing with 
>>>> KafKa partitions (not talking about network configuration or management of 
>>>> Spark resources) ?
>>>> 
>>>> Thank you
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Ayan Guha
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha


Re: Spark Streaming with Kafka DirectStream

2016-02-16 Thread Cyril Scetbon
Your understanding is the right one (having re-read the documentation). I'm still wondering 
how I can verify that 5 partitions have been created. My job reads from a Kafka topic that 
has 5 partitions and sends the data to E/S. I can see that when there is one task reading 
from Kafka, there are 5 tasks writing to E/S. So I'm assuming that the task reading from 
Kafka does it in parallel across 5 partitions, and that's why there are then 5 tasks 
writing to E/S. But I'm only assuming ...

> On Feb 16, 2016, at 21:12, ayan guha <guha.a...@gmail.com> wrote:
> 
> I have a slightly different understanding. 
> 
> Direct stream generates 1 RDD per batch, however, number of partitions in 
> that RDD = number of partitions in kafka topic.
> 
> On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon <cyril.scet...@free.fr 
> <mailto:cyril.scet...@free.fr>> wrote:
> Hi guys,
> 
> I'm making some tests with Spark and Kafka using a Python script. I use the 
> second method that doesn't need any receiver (Direct Approach). It should 
> adapt the number of RDDs to the number of partitions in the topic. I'm trying 
> to verify it. What's the easiest way to verify it ? I also tried to co-locate 
> Yarn, Spark and Kafka to check if RDDs are created depending on the leaders 
> of partitions in a topic, and they are not. Can you confirm that RDDs are not 
> created depending on the location of partitions and that co-locating Kafka 
> with Spark is not a must-have or that Spark does not take advantage of it ?
> 
> As the parallelism is simplified (by creating as many RDDs as there are 
> partitions) I suppose that the biggest part of the tuning is playing with 
> KafKa partitions (not talking about network configuration or management of 
> Spark resources) ?
> 
> Thank you
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha



Spark Streaming with Kafka DirectStream

2016-02-16 Thread Cyril Scetbon
Hi guys,

I'm running some tests with Spark and Kafka using a Python script. I use the second method, 
which doesn't need any receiver (the Direct Approach). It should adapt the number of RDDs 
to the number of partitions in the topic, and I'm trying to verify that. What's the easiest 
way to verify it? I also tried to co-locate YARN, Spark and Kafka to check whether RDDs are 
created depending on the leaders of the partitions in a topic, and they are not. Can you 
confirm that RDDs are not created depending on the location of partitions, and that 
co-locating Kafka with Spark is not a must-have, or that Spark does not take advantage of 
it?

As the parallelism is simplified (by creating as many RDDs as there are partitions), I 
suppose that the biggest part of the tuning is playing with Kafka partitions (not talking 
about network configuration or management of Spark resources)?
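
For reference, one way to check the partition mapping with the direct approach (a sketch in Scala for illustration; directStream stands for the DStream returned by KafkaUtils.createDirectStream): each batch RDD produced by the direct stream exposes its Kafka offset ranges, one per Kafka partition, and the RDD has one Spark partition per range. Note the cast only works on the RDD coming straight out of createDirectStream, before any transformation.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directStream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} offsets=${r.fromOffset}..${r.untilOffset}")
  }
  println(s"rdd partitions: ${rdd.partitions.size}")
}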

Thank you
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org