Re: What are you using Spark for

2016-08-01 Thread Rodrick Brown
Each of our microservices logs events to a Kafka topic; we then use Spark to 
consume messages from that queue and write them into Elasticsearch. 
The data from ES is used by a number of support applications: graphs, 
monitoring, reports, dashboards for client services teams, etc. 
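
A minimal sketch of this kind of Kafka -> Spark Streaming -> Elasticsearch
pipeline is shown below for illustration only; it assumes the
spark-streaming-kafka 0.8 and elasticsearch-hadoop ("org.elasticsearch.spark")
artifacts are on the classpath and that es.nodes is set in the Spark conf.
All ZooKeeper, topic and index names are placeholders, not an actual setup.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.elasticsearch.spark.rdd.EsSpark

// 10-second micro-batches on top of an existing SparkContext `sc`.
val ssc = new StreamingContext(sc, Seconds(10))

// Receiver-based Kafka stream; values are assumed to be JSON event strings.
val events = KafkaUtils.createStream(
  ssc, "zk-host:2181", "es-indexer-group", Map("service-events" -> 1)).map(_._2)

// Index each micro-batch into Elasticsearch ("index/type" is a placeholder).
events.foreachRDD { rdd =>
  EsSpark.saveJsonToEs(rdd, "events/log")
}

ssc.start()
ssc.awaitTermination()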
 
-- 
 
Rodrick Brown / DevOPs Engineer 
+1 917 445 6839 / rodr...@orchardplatform.com 

Orchard Platform 
101 5th Avenue, 4th Floor, New York, NY 10003 
http://www.orchardplatform.com 
Orchard Blog  | Marketplace Lending 
Meetup 
> On Aug 2, 2016, at 1:48 AM, Rohit L  wrote:
> 
> ch Spark is used and hence can you please share for what purpose you are 
> using Apache Spark in your project? 
> 




Re: What are you using Spark for

2016-08-01 Thread Xiao Li
Hi, Rohit,

The Spark summit has many interesting use cases. Hopefully, it can
answer your question.

https://spark-summit.org/2015/schedule/
https://spark-summit.org/2016/schedule/

Thanks,

Xiao

2016-08-01 22:48 GMT-07:00 Rohit L :
> Hi Everyone,
>
>   I want to know the real world uses cases for which Spark is used and
> hence can you please share for what purpose you are using Apache Spark in
> your project?
>
> --
> Rohit




Spark GraphFrames

2016-08-01 Thread Divya Gehlot
Hi,

Has anybody worked with GraphFrames?
Please let me know, as I need to know the real-world scenarios where it can
be used.


Thanks,
Divya


What are you using Spark for

2016-08-01 Thread Rohit L
Hi Everyone,

  I want to know the real-world use cases for which Spark is used, so can you
please share what purpose you are using Apache Spark for in your project?

--
Rohit


Re: Sqoop On Spark

2016-08-01 Thread Takeshi Yamamuro
Hi,

Have you seen this previous thread?
https://www.mail-archive.com/user@spark.apache.org/msg49025.html
I'm not sure this is what you want though.

// maropu


On Tue, Aug 2, 2016 at 1:52 PM, Selvam Raman  wrote:

>  Hi Team,
>
> how can i use spark as execution engine in sqoop2. i see the patch(S
> QOOP-1532 ) but it
> shows in progess.
>
> so can not we use sqoop on spark.
>
> Please help me if you have an any idea.
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>



-- 
---
Takeshi Yamamuro


Sqoop On Spark

2016-08-01 Thread Selvam Raman
 Hi Team,

how can I use Spark as the execution engine in Sqoop2? I see the patch
(SQOOP-1532) but it shows in progress.

So can we not use Sqoop on Spark?

Please help me if you have any idea.

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao,

HashingTF directly applies a hash function (MurmurHash3) to the features to
determine their column index. It does not take the term frequency or the
length of the document into account. It does similar work to sklearn's
FeatureHasher. The result is increased speed and reduced memory usage, but
it does not remember what the input features looked like and cannot convert
the output back to the original features. Actually, we misnamed this
transformer; it only does feature hashing rather than computing hashed term
frequencies.

CountVectorizer will select the top vocabSize words, ordered by term
frequency across the corpus, to build the vocabulary of features, so it will
consume more memory than HashingTF. However, we can convert the output back
to the original features.

Neither transformer considers the length of each document. If you want to
compute term frequency divided by the length of the document, you should
write your own function based on the transformers provided by MLlib.
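
As an illustration only (not an official MLlib API), a minimal sketch of
such a function could look like this, assuming Spark 2.x, a SparkSession in
scope, and a DataFrame `docs` with a string column named "text" (both names
are placeholders):

import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, size, udf}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tokenized = tokenizer.transform(docs)

// Raw per-document term counts from CountVectorizer.
val cv = new CountVectorizer().setInputCol("words").setOutputCol("rawCounts")
val counted = cv.fit(tokenized).transform(tokenized)

// Divide every count by the document length to get a true term frequency.
val normalize = udf { (counts: Vector, docLen: Int) =>
  val sv = counts.toSparse
  Vectors.sparse(sv.size, sv.indices, sv.values.map(_ / docLen.toDouble))
}
val withTf = counted.withColumn("tf", normalize(col("rawCounts"), size(col("words"))))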

Thanks
Yanbo

2016-08-01 15:29 GMT-07:00 Hao Ren :

> When computing term frequency, we can use either HashTF or CountVectorizer
> feature extractors.
> However, both of them just use the number of times that a term appears in
> a document.
> It is not a true frequency. Acutally, it should be divided by the length
> of the document.
>
> Is this a wanted feature ?
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>


unsubscribe

2016-08-01 Thread zhangjp
unsubscribe

Re: sql to spark scala rdd

2016-08-01 Thread sri hari kali charan Tummala
Hi All,

The code below calculates a cumulative (running) sum and a moving average
using Scala RDD-style programming. I was using the wrong function (sliding);
use scanLeft instead.


sc.textFile("C:\\Users\\kalit_000\\Desktop\\Hadoop_IMP_DOC\\spark\\data.txt")
  .map(x => x.split("\\~"))
  .map(x => (x(0), x(1), x(2).toDouble))
  .groupBy(_._1)
  .mapValues(x => x.toList.sortBy(_._2).zip(Stream from 1)
    .scanLeft(("", "", 0.0, 0.0, 0.0, 0.0)) { (a, b) =>
      (b._1._1, b._1._2, b._1._3,
        b._1._3.toDouble + a._3.toDouble,
        (b._1._3.toDouble + a._3.toDouble) / b._2,
        b._2)
    }.tail)
  .flatMapValues(x => x.sortBy(_._1))
  .foreach(println)

Input Data:-

Headers:-
Key,Date,balance

786~20160710~234
786~20160709~-128
786~20160711~-457
987~20160812~456
987~20160812~567

Output Data:-

Column Headers:-
key, (key,Date,balance , daily balance, running average , row_number based
on key)

(786,(786,20160709,-128.0,-128.0,-128.0,1.0))
(786,(786,20160710,234.0,106.0,53.0,2.0))
(786,(786,20160711,-457.0,-223.0,-74.33,3.0))

(987,(987,20160812,567.0,1023.0,511.5,2.0))
(987,(987,20160812,456.0,456.0,456.0,1.0))

Reference:-

https://bzhangusc.wordpress.com/2014/06/21/calculate-running-sums/


Thanks
Sri


On Mon, Aug 1, 2016 at 12:07 AM, Sri  wrote:

> Hi ,
>
> I solved it using spark SQL which uses similar window functions mentioned
> below , for my own knowledge I am trying to solve using Scala RDD which I
> am unable to.
> What function in Scala supports window function like SQL unbounded
> preceding and current row ? Is it sliding ?
>
>
> Thanks
> Sri
>
> Sent from my iPhone
>
> On 31 Jul 2016, at 23:16, Mich Talebzadeh 
> wrote:
>
> hi
>
> You mentioned:
>
> I already solved it using DF and spark sql ...
>
> Are you referring to this code which is a classic analytics:
>
> SELECT DATE,balance,
>  SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>
>  AND
>
>  CURRENT ROW) daily_balance
>
>  FROM  table
>
>
> So how did you solve it using DF in the first place?
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 07:04, Sri  wrote:
>
>> Hi ,
>>
>> Just wondering how spark SQL works behind the scenes does it not convert
>> SQL to some Scala RDD ? Or Scala ?
>>
>> How to write below SQL in Scala or Scala RDD
>>
>> SELECT DATE,balance,
>>
>> SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>>
>> AND
>>
>> CURRENT ROW) daily_balance
>>
>> FROM  table
>>
>>
>> Thanks
>> Sri
>> Sent from my iPhone
>>
>> On 31 Jul 2016, at 13:21, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> Impossible - see
>>
>> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr]
>> .
>>
>> I tried to show you why you ended up with "non-empty iterator" after
>> println. You should really start with
>> http://www.scala-lang.org/documentation/
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sun, Jul 31, 2016 at 8:49 PM, sri hari kali charan Tummala
>>  wrote:
>>
>> Tuple
>>
>>
>> [Lscala.Tuple2;@65e4cb84
>>
>>
>> On Sun, Jul 31, 2016 at 1:00 AM, Jacek Laskowski  wrote:
>>
>>
>> Hi,
>>
>>
>> What's the result type of sliding(2,1)?
>>
>>
>> Pozdrawiam,
>>
>> Jacek Laskowski
>>
>> 
>>
>> https://medium.com/@jaceklaskowski/
>>
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>>
>> On Sun, Jul 31, 2016 at 9:23 AM, sri hari kali charan Tummala
>>
>>  wrote:
>>
>> tried this no luck, wht is non-empty iterator here ?
>>
>>
>> OP:-
>>
>> (-987,non-empty iterator)
>>
>> (-987,non-empty iterator)
>>
>> (-987,non-empty iterator)
>>
>> (-987,non-empty iterator)
>>
>> (-987,non-empty iterator)
>>
>>
>>
>> sc.textFile(file).keyBy(x => x.split("\\~") (0))
>>
>>  .map(x => x._2.split("\\~"))
>>
>>  .map(x => (x(0),x(2)))
>>
>>.map { case (key,value) =>
>>
>> (key,value.toArray.toSeq.sliding(2,1).map(x
>>
>> => x.sum/x.size))}.foreach(println)
>>
>>
>>
>> On Sun, Jul 31, 2016 at 12:03 AM, sri hari kali charan Tummala
>>
>>  wrote:
>>
>>
>> Hi All,
>>
>>
>> I managed to write using sliding 

Re: tpcds for spark2.0

2016-08-01 Thread kevin
Finally I used spark-sql-perf-0.4.3:
./bin/spark-shell --jars
/home/dcos/spark-sql-perf-0.4.3/target/scala-2.11/spark-sql-perf_2.11-0.4.3.jar
--executor-cores 4 --executor-memory 10G --master spark://master1:7077
If I don't use "--jars" I get the error I mentioned.

2016-07-29 21:17 GMT+08:00 Olivier Girardot :

> I have the same kind of issue (not using spark-sql-perf), just trying to
> deploy 2.0.0 on mesos.
> I'll keep you posted as I investigate
>
>
>
> On Wed, Jul 27, 2016 1:06 PM, kevin kiss.kevin...@gmail.com wrote:
>
>> hi,all:
>> I want to have a test about tpcds99 sql run on spark2.0.
>> I user https://github.com/databricks/spark-sql-perf
>>
>> about the master version ,when I run :val tpcds = new TPCDS (sqlContext =
>> sqlContext) I got error:
>>
>> scala> val tpcds = new TPCDS (sqlContext = sqlContext)
>> error: missing or invalid dependency detected while loading class file
>> 'Benchmarkable.class'.
>> Could not access term typesafe in package com,
>> because it (or its dependencies) are missing. Check your build definition
>> for
>> missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
>> the problematic classpath.)
>> A full rebuild may help if 'Benchmarkable.class' was compiled against an
>> incompatible version of com.
>> error: missing or invalid dependency detected while loading class file
>> 'Benchmarkable.class'.
>> Could not access term scalalogging in value com.typesafe,
>> because it (or its dependencies) are missing. Check your build definition
>> for
>> missing or conflicting dependencies. (Re-run with -Ylog-classpath to see
>> the problematic classpath.)
>> A full rebuild may help if 'Benchmarkable.class' was compiled against an
>> incompatible version of com.typesafe.
>>
>> about spark-sql-perf-0.4.3 when I run
>> :tables.genData("hdfs://master1:9000/tpctest", "parquet", true, false,
>> false, false, false) I got error:
>>
>> Generating table catalog_sales in database to
>> hdfs://master1:9000/tpctest/catalog_sales with save mode Overwrite.
>> 16/07/27 18:59:59 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
>> slave1): java.lang.ClassCastException: cannot assign instance of
>> scala.collection.immutable.List$SerializationProxy to field
>> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
>> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>>
>>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


Re: spark.read.format("jdbc")

2016-08-01 Thread kevin
I fixed it by:
val jdbcDF =
spark.read.format("org.apache.spark.sql.execution.datasources.jdbc.DefaultSource")
  .options(Map("url" -> s"jdbc:mysql://${mysqlhost}:3306/test",
    "driver" -> "com.mysql.jdbc.Driver", "dbtable" -> "i_user",
    "user" -> "root", "password" -> "123456"))
  .load()

where org.apache.spark.sql.execution.datasources.jdbc.DefaultSource and
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider both
have the same short name: jdbc

2016-08-01 15:24 GMT+08:00 Nikolay Zhebet :

> You should specify classpath for your jdbc connection.
> As example, if you want connect to Impala, you can try it snippet:
>
>
>
> import java.util.Properties
> import org.apache.spark._
> import org.apache.spark.sql.SQLContext
> import java.sql.Connection
> import java.sql.DriverManager
> Class.forName("com.cloudera.impala.jdbc41.Driver")
>
> var conn: java.sql.Connection = null
> conn = 
> DriverManager.getConnection("jdbc:impala://127.0.0.1:21050/default;auth=noSasl",
>  "", "")
> val statement = conn.createStatement();
>
> val result = statement.executeQuery("SELECT * FROM users limit 10")
> result.next()
> result.getString("user_id")
> val sql_insert = "INSERT INTO users VALUES('user_id','email','gender')"
> statement.executeUpdate(sql_insert)
>
>
> Also you should specify path your jdbc jar file in --driver-class-path
> variable when you running spark-submit:
>
> spark-shell --master "local[2]" --driver-class-path 
> /opt/cloudera/parcels/CDH/jars/ImpalaJDBC41.jar
>
>
> 2016-08-01 9:37 GMT+03:00 kevin :
>
>> maybe there is another version spark on the classpath?
>>
>> 2016-08-01 14:30 GMT+08:00 kevin :
>>
>>> hi,all:
>>>I try to load data from jdbc datasource,but I got error with :
>>> java.lang.RuntimeException: Multiple sources found for jdbc
>>> (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider,
>>> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please
>>> specify the fully qualified class name.
>>>
>>> spark version is 2.0
>>>
>>>
>>
>


Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-08-01 Thread Steve Rowe
UnaryTransformer’s scaladoc says "Abstract class for transformers that take one 
input column, apply transformation, and output the result as a new column.”

If you want to allow specification of more than one input column, or if your 
output column already exists, or you want multiple output columns, then you 
can’t use UnaryTransformer.  

If none of those conditions applies, though, UnaryTransformer will simplify 
your subclass.

BTW the scaladocs for StringType say "The data type representing `String` 
values. Please use the singleton [[DataTypes.StringType]].” <- do that instead 
of calling StringType’s ctor.
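
For illustration, a minimal UnaryTransformer subclass might look like the
sketch below (Spark 2.x assumed; the class name Capitalizer mirrors the one
in the question and is only an example, not tested code):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

class Capitalizer(override val uid: String)
    extends UnaryTransformer[String, String, Capitalizer] {

  def this() = this(Identifiable.randomUID("capitalizer"))

  // The single-column transformation itself.
  override protected def createTransformFunc: String => String = _.toUpperCase

  // Use the StringType singleton, not its constructor.
  override protected def outputDataType: DataType = StringType
}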

--
Steve
www.lucidworks.com

> On Aug 1, 2016, at 2:30 PM, janardhan shetty  wrote:
> 
> What is the difference between UnaryTransformer and Transformer classes. In 
> which scenarios should we use  one or the other ?
> 
> On Sun, Jul 31, 2016 at 8:27 PM, janardhan shetty  
> wrote:
> Developing in scala but any help with difference between UnaryTransformer (Is 
> this experimental still ?)and Transformer class is appreciated.
> 
> Right now encountering  error for the code which extends UnaryTransformer
> override protected def outputDataType: DataType = new StringType
> 
> Error:(26, 53) constructor StringType in class StringType cannot be accessed 
> in class Capitalizer
>   override protected def outputDataType: DataType = new StringType
> ^
> 
> 
> On Thu, Jul 28, 2016 at 8:20 PM, Phuong LE-HONG  wrote:
> Hi,
> 
> I've developed a simple ML estimator (in Java) that implements
> conditional Markov model for sequence labelling in Vitk toolkit. You
> can check it out here:
> 
> https://github.com/phuonglh/vn.vitk/blob/master/src/main/java/vn/vitk/tag/CMM.java
> 
> Phuong Le-Hong
> 
> On Fri, Jul 29, 2016 at 9:01 AM, janardhan shetty
>  wrote:
> > Thanks Steve.
> >
> > Any pointers to custom estimators development as well ?
> >
> > On Wed, Jul 27, 2016 at 11:35 AM, Steve Rowe  wrote:
> >>
> >> You can see the source for my transformer configurable bridge to Lucene
> >> analysis components here, in my company Lucidworks’ spark-solr project:
> >> .
> >>
> >> Here’s a blog I wrote about using this transformer, as well as
> >> non-ML-context use in Spark of the underlying analysis component, here:
> >> .
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >> > On Jul 27, 2016, at 1:31 PM, janardhan shetty 
> >> > wrote:
> >> >
> >> > 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
> >> >
> >> > 2. Any links or blogs to develop custom estimators ? ex: any ml
> >> > algorithm
> >>
> >
> 
> 





Re: The equivalent for INSTR in Spark FP

2016-08-01 Thread Mich Talebzadeh
Thanks Jacek.

It sounds like the issue is the position of the second argument in substring().

This works

scala> val wSpec2 =
Window.partitionBy(substring($"transactiondescription",1,20))
wSpec2: org.apache.spark.sql.expressions.WindowSpec =
org.apache.spark.sql.expressions.WindowSpec@1a4eae2

Using udf as suggested

scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
 |  s.substring(start, end) }
mySubstr: org.apache.spark.sql.UserDefinedFunction =
UserDefinedFunction(,StringType,List(StringType, IntegerType,
IntegerType))


This was throwing an error:

val wSpec2 = Window.partitionBy(substring("transactiondescription",1,
indexOf("transactiondescription",'CD')-2))


So I tried using udf

scala> val wSpec2 =
Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
instr('s, "CD")))
 | )
:28: error: value select is not a member of
org.apache.spark.sql.ColumnName
 val wSpec2 =
Window.partitionBy($"transactiondescription".select(mySubstr('s, lit(1),
instr('s, "CD")))

Obviously I am not doing this correctly :(
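
For reference, select() is a Dataset method, not a Column method, so one
hedged sketch of the intended expression is to pass the udf result (itself a
Column) straight to partitionBy. The offsets are illustrative only, since
String.substring is 0-based while SQL INSTR is 1-based:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, instr, lit}

val wSpec2 = Window.partitionBy(
  mySubstr(col("transactiondescription"), lit(0),
    instr(col("transactiondescription"), "CD") - 2))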

cheers



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 23:02, Jacek Laskowski  wrote:

> Hi,
>
> Interesting...
>
> I'm temping to think that substring function should accept the columns
> that hold the numbers for start and end. I'd love hearing people's
> thought on this.
>
> For now, I'd say you need to define udf to do substring as follows:
>
> scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
> s.substring(start, end) }
> mySubstr: org.apache.spark.sql.expressions.UserDefinedFunction =
> UserDefinedFunction(,StringType,Some(List(StringType,
> IntegerType, IntegerType)))
>
> scala> df.show
> +---+
> |  s|
> +---+
> |hello world|
> +---+
>
> scala> df.select(mySubstr('s, lit(1), instr('s, "ll"))).show
> +---+
> |UDF(s, 1, instr(s, ll))|
> +---+
> | el|
> +---+
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Aug 1, 2016 at 11:18 PM, Mich Talebzadeh
>  wrote:
> > Thanks Jacek,
> >
> > Do I have any other way of writing this with functional programming?
> >
> > select
> > substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
> >
> >
> > Cheers,
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
> >
> >
> > On 1 August 2016 at 22:13, Jacek Laskowski  wrote:
> >>
> >> Hi Mich,
> >>
> >> There's no indexOf UDF -
> >>
> >>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
> >>
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Mon, Aug 1, 2016 at 7:24 PM, Mich Talebzadeh
> >>  wrote:
> >> > Hi,
> >> >
> >> > What is the equivalent of FP for the following window/analytic that
> >> > works OK
> >> > in Spark SQL
> >> >
> >> > This one using INSTR
> >> >
> >> > select
> >> >
> >> >
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
> >> >
> >> >
> >> > select distinct *
> >> > from (
> >> >   select
> >> >
> >> >
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
> >> >   SUM(debitamount) OVER (PARTITION BY
> >> >
> >> >
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2)) AS
> >> > spent
> >> >   from accounts.ll_18740868 where transactiontype = 'DEB'
> >> >  ) tmp
> >> >
> >> >
> >> > I tried indexOf but it does not work!
> >> >
> >> > val wSpec2 =
> >> >
> >> >
> 

[MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Hao Ren
When computing term frequency, we can use either the HashingTF or the
CountVectorizer feature extractor.
However, both of them just use the number of times that a term appears in a
document.
It is not a true frequency. Actually, it should be divided by the length of
the document.

Is this the intended behavior?

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Holden Karau
That's a good point - there is an open issue for spark-testing-base to
support this shared SparkSession approach, but I haven't had the time
(https://github.com/holdenk/spark-testing-base/issues/123). I'll try to
include this in the next release :)
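
For illustration, a minimal hand-rolled version of the shared-SparkSession
idea could look like the sketch below (ScalaTest assumed; this is not the
spark-testing-base API, and the trait name is a placeholder):

import org.apache.spark.sql.SparkSession
import org.scalatest.Suite

trait SharedSparkSession { self: Suite =>

  // getOrCreate() hands every suite in the test JVM the same local session,
  // so suites can share it and run without paying the startup cost each time;
  // the session is simply left to die with the JVM.
  @transient lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .getOrCreate()
}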

On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers  wrote:

> we share a single SparkSession across tests, and they can run in
> parallel. It is pretty fast.
>
> On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson <
> ever...@nuna.com.invalid> wrote:
>
>> Hi,
>>
>> Right now, if any code uses DataFrame/Dataset, I need a test setup that
>> brings up a local master as in this article
>> 
>> .
>>
>> That's a lot of overhead for unit testing and the tests can't run in
>> parallel, so testing is slow -- this is more like what I'd call an
>> integration test.
>>
>> Do people have any tricks to get around this? Maybe using spy mocks on
>> fake DataFrame/Datasets?
>>
>> Anyone know if there are plans to make more traditional unit testing
>> possible with Spark SQL, perhaps with a stripped down in-memory
>> implementation? (I admit this does seem quite hard since there's so much
>> functionality in these classes!)
>>
>> Thanks!
>>
>> - Everett
>>
>>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: The equivalent for INSTR in Spark FP

2016-08-01 Thread Jacek Laskowski
Hi,

Interesting...

I'm temping to think that substring function should accept the columns
that hold the numbers for start and end. I'd love hearing people's
thought on this.

For now, I'd say you need to define udf to do substring as follows:

scala> val mySubstr = udf { (s: String, start: Int, end: Int) =>
s.substring(start, end) }
mySubstr: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(,StringType,Some(List(StringType,
IntegerType, IntegerType)))

scala> df.show
+---+
|  s|
+---+
|hello world|
+---+

scala> df.select(mySubstr('s, lit(1), instr('s, "ll"))).show
+---+
|UDF(s, 1, instr(s, ll))|
+---+
| el|
+---+

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Aug 1, 2016 at 11:18 PM, Mich Talebzadeh
 wrote:
> Thanks Jacek,
>
> Do I have any other way of writing this with functional programming?
>
> select
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>
>
> Cheers,
>
>
>
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 1 August 2016 at 22:13, Jacek Laskowski  wrote:
>>
>> Hi Mich,
>>
>> There's no indexOf UDF -
>>
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
>>
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Mon, Aug 1, 2016 at 7:24 PM, Mich Talebzadeh
>>  wrote:
>> > Hi,
>> >
>> > What is the equivalent of FP for the following window/analytic that
>> > works OK
>> > in Spark SQL
>> >
>> > This one using INSTR
>> >
>> > select
>> >
>> > substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>> >
>> >
>> > select distinct *
>> > from (
>> >   select
>> >
>> > substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>> >   SUM(debitamount) OVER (PARTITION BY
>> >
>> > substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2)) 
>> > AS
>> > spent
>> >   from accounts.ll_18740868 where transactiontype = 'DEB'
>> >  ) tmp
>> >
>> >
>> > I tried indexOf but it does not work!
>> >
>> > val wSpec2 =
>> >
>> > Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
>> > :26: error: not found: value indexOf
>> >  val wSpec2 =
>> >
>> > Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
>> >
>> >
>> > Thanks
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may
>> > arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The
>> > author will in no case be liable for any monetary damages arising from
>> > such
>> > loss, damage or destruction.
>> >
>> >
>
>




Re: Problems initializing SparkUI

2016-08-01 Thread Mich Talebzadeh
sounds strange. What happens if you submit the job from the other node?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 22:23, Maximiliano Patricio Méndez  wrote:

> Hi, the output of the host where the driver is running:
>
> ~$ jps
> 9895 DriverWrapper
> 24057 Jps
> 3531 Worker
>
> In that host, I gave too much memory for the driver and no executor could
> be place for that worker.
>
>
> 2016-08-01 18:06 GMT-03:00 Mich Talebzadeh :
>
>> OK
>>
>> Can you on the hostname that driver program is running do jps and send
>> the output please?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 August 2016 at 22:03, Maximiliano Patricio Méndez <
>> mmen...@despegar.com> wrote:
>>
>>> Hi, thanks again for the answer.
>>>
>>> Looking a little bit closer into that, I found out that the
>>> DriverWrapper process was not running in the hostname the log reported. It
>>> is runnning, but in another host. Mistery.
>>>
>>> If I manually go to the host that has the DriverWrapper running in it,
>>> on port 4040, I can see the sparkUI without problems, but if I go through
>>> the master > applicationUI it tries to send me to the wrong host (the one
>>> the driver reports in its log).
>>>
>>> The hostname the driver reports is the same from which I send the submit
>>> request.
>>>
>>> 2016-08-01 17:27 GMT-03:00 Mich Talebzadeh :
>>>
 Fine.

 In that case which process is your driver program (from jps output)?

 Thanks

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 1 August 2016 at 21:23, Maximiliano Patricio Méndez <
 mmen...@despegar.com> wrote:

> Hi, MIch, thanks for replying.
>
> I'm deploying from the same instance from where I showed the logs and
> commands using --deploy-mode cluster.
>
> The SparkSubmit process only appears while the bin/spark-submit binary
> is active.
> When the application starts and the driver takes control, the
> SparkSubmit process dies.
>
> 2016-08-01 16:07 GMT-03:00 Mich Talebzadeh 
> :
>
>> OK I can see the Worker (19286 Worker and the executor(6548
>> CoarseGrainedExecutorBackend) running on it
>>
>> Where is spark-submit? Did you submit your job from another node or
>> used another method to run it?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for any loss, damage or destruction of data or any other property which 
>> may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 August 2016 at 19:08, Maximiliano Patricio Méndez <
>> mmen...@despegar.com> wrote:
>>
>>> I just 

Re: Problems initializing SparkUI

2016-08-01 Thread Maximiliano Patricio Méndez
Hi, the output of the host where the driver is running:

~$ jps
9895 DriverWrapper
24057 Jps
3531 Worker

On that host, I gave too much memory to the driver, so no executor could
be placed on that worker.


2016-08-01 18:06 GMT-03:00 Mich Talebzadeh :

> OK
>
> Can you on the hostname that driver program is running do jps and send the
> output please?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 22:03, Maximiliano Patricio Méndez <
> mmen...@despegar.com> wrote:
>
>> Hi, thanks again for the answer.
>>
>> Looking a little bit closer into that, I found out that the DriverWrapper
>> process was not running in the hostname the log reported. It is runnning,
>> but in another host. Mistery.
>>
>> If I manually go to the host that has the DriverWrapper running in it, on
>> port 4040, I can see the sparkUI without problems, but if I go through the
>> master > applicationUI it tries to send me to the wrong host (the one the
>> driver reports in its log).
>>
>> The hostname the driver reports is the same from which I send the submit
>> request.
>>
>> 2016-08-01 17:27 GMT-03:00 Mich Talebzadeh :
>>
>>> Fine.
>>>
>>> In that case which process is your driver program (from jps output)?
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 August 2016 at 21:23, Maximiliano Patricio Méndez <
>>> mmen...@despegar.com> wrote:
>>>
 Hi, MIch, thanks for replying.

 I'm deploying from the same instance from where I showed the logs and
 commands using --deploy-mode cluster.

 The SparkSubmit process only appears while the bin/spark-submit binary
 is active.
 When the application starts and the driver takes control, the
 SparkSubmit process dies.

 2016-08-01 16:07 GMT-03:00 Mich Talebzadeh :

> OK I can see the Worker (19286 Worker and the executor(6548
> CoarseGrainedExecutorBackend) running on it
>
> Where is spark-submit? Did you submit your job from another node or
> used another method to run it?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damages
> arising from such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 19:08, Maximiliano Patricio Méndez <
> mmen...@despegar.com> wrote:
>
>> I just recently tried again, the port 4040 is not used. And even if
>> it were, I think the log would reflect that trying to use the following
>> port (4041) as you mentioned.
>>
>> This is what the driver log says:
>>
>> 16/08/01 13:55:56 INFO Utils: Successfully started service 'SparkUI' on 
>> port 4040.
>> 16/08/01 13:55:56 INFO SparkUI: Started SparkUI at http://hostname:4040
>>
>>
>> If I go to {hostname}:
>> ~$ jps
>> 6548 CoarseGrainedExecutorBackend
>> 19286 Worker
>> 6843 Jps
>> 19182 Master
>>
>> ~$ netstat -nltp
>> Active Internet connections (only servers)
>> Proto Recv-Q Send-Q Local Address   Foreign Address
>> State   PID/Program name
>> tcp6   0  0 192.168.22.245:43037:::*
>>  LISTEN  6548/java
>> tcp6   0  0 192.168.22.245:56929:::*

Re: The equivalent for INSTR in Spark FP

2016-08-01 Thread Mich Talebzadeh
Thanks Jacek,

Do I have any other way of writing this with functional programming?

select
substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),


Cheers,











Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 22:13, Jacek Laskowski  wrote:

> Hi Mich,
>
> There's no indexOf UDF -
>
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
>
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Mon, Aug 1, 2016 at 7:24 PM, Mich Talebzadeh
>  wrote:
> > Hi,
> >
> > What is the equivalent of FP for the following window/analytic that
> works OK
> > in Spark SQL
> >
> > This one using INSTR
> >
> > select
> > substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
> >
> >
> > select distinct *
> > from (
> >   select
> > substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
> >   SUM(debitamount) OVER (PARTITION BY
> >
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2)) AS
> > spent
> >   from accounts.ll_18740868 where transactiontype = 'DEB'
> >  ) tmp
> >
> >
> > I tried indexOf but it does not work!
> >
> > val wSpec2 =
> >
> Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
> > :26: error: not found: value indexOf
> >  val wSpec2 =
> >
> Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
> >
> >
> > Thanks
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> > Disclaimer: Use it at your own risk. Any and all responsibility for any
> > loss, damage or destruction of data or any other property which may arise
> > from relying on this email's technical content is explicitly disclaimed.
> The
> > author will in no case be liable for any monetary damages arising from
> such
> > loss, damage or destruction.
> >
> >
>


Re: The equivalent for INSTR in Spark FP

2016-08-01 Thread Jacek Laskowski
Hi Mich,

There's no indexOf UDF -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$


Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Mon, Aug 1, 2016 at 7:24 PM, Mich Talebzadeh
 wrote:
> Hi,
>
> What is the equivalent of FP for the following window/analytic that works OK
> in Spark SQL
>
> This one using INSTR
>
> select
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>
>
> select distinct *
> from (
>   select
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>   SUM(debitamount) OVER (PARTITION BY
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2)) AS
> spent
>   from accounts.ll_18740868 where transactiontype = 'DEB'
>  ) tmp
>
>
> I tried indexOf but it does not work!
>
> val wSpec2 =
> Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
> :26: error: not found: value indexOf
>  val wSpec2 =
> Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
>
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>




Spark 2.0 History Server Storage

2016-08-01 Thread Andrei Ivanov
Hi all,

I've just tried upgrading Spark to 2.0 and so far it looks generally good.

But there is at least one issue I see right away - job histories are
missing storage information (persisted RDDs).
This info is also missing from pre-upgrade jobs.

Does anyone have a clue what can be wrong?

Thanks, Andrei Ivanov.


Re: Problems initializing SparkUI

2016-08-01 Thread Mich Talebzadeh
OK

Can you on the hostname that driver program is running do jps and send the
output please?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 22:03, Maximiliano Patricio Méndez  wrote:

> Hi, thanks again for the answer.
>
> Looking a little bit closer into that, I found out that the DriverWrapper
> process was not running in the hostname the log reported. It is runnning,
> but in another host. Mistery.
>
> If I manually go to the host that has the DriverWrapper running in it, on
> port 4040, I can see the sparkUI without problems, but if I go through the
> master > applicationUI it tries to send me to the wrong host (the one the
> driver reports in its log).
>
> The hostname the driver reports is the same from which I send the submit
> request.
>
> 2016-08-01 17:27 GMT-03:00 Mich Talebzadeh :
>
>> Fine.
>>
>> In that case which process is your driver program (from jps output)?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 August 2016 at 21:23, Maximiliano Patricio Méndez <
>> mmen...@despegar.com> wrote:
>>
>>> Hi, MIch, thanks for replying.
>>>
>>> I'm deploying from the same instance from where I showed the logs and
>>> commands using --deploy-mode cluster.
>>>
>>> The SparkSubmit process only appears while the bin/spark-submit binary
>>> is active.
>>> When the application starts and the driver takes control, the
>>> SparkSubmit process dies.
>>>
>>> 2016-08-01 16:07 GMT-03:00 Mich Talebzadeh :
>>>
 OK I can see the Worker (19286 Worker and the executor(6548
 CoarseGrainedExecutorBackend) running on it

 Where is spark-submit? Did you submit your job from another node or
 used another method to run it?

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 1 August 2016 at 19:08, Maximiliano Patricio Méndez <
 mmen...@despegar.com> wrote:

> I just recently tried again, the port 4040 is not used. And even if it
> were, I think the log would reflect that trying to use the following port
> (4041) as you mentioned.
>
> This is what the driver log says:
>
> 16/08/01 13:55:56 INFO Utils: Successfully started service 'SparkUI' on 
> port 4040.
> 16/08/01 13:55:56 INFO SparkUI: Started SparkUI at http://hostname:4040
>
>
> If I go to {hostname}:
> ~$ jps
> 6548 CoarseGrainedExecutorBackend
> 19286 Worker
> 6843 Jps
> 19182 Master
>
> ~$ netstat -nltp
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address   Foreign Address
> State   PID/Program name
> tcp6   0  0 192.168.22.245:43037:::*
>  LISTEN  6548/java
> tcp6   0  0 192.168.22.245:56929:::*
>  LISTEN  19286/java
> tcp6   0  0 192.168.22.245:7077 :::*
>  LISTEN  19182/java
> tcp6   0  0 :::33296:::*
>  LISTEN  6548/java
> tcp6   0  0 :::8080 :::*
>  LISTEN  19182/java
> tcp6   0  0 :::8081 :::*
>  LISTEN  19286/java
> tcp6   0  0 192.168.22.245:6066 :::*
>  LISTEN  

Re: Problems initializing SparkUI

2016-08-01 Thread Maximiliano Patricio Méndez
Hi, thanks again for the answer.

Looking a little closer into that, I found out that the DriverWrapper
process was not running on the hostname the log reported. It is running,
but on another host. Mystery.

If I manually go to the host that has the DriverWrapper running in it, on
port 4040, I can see the sparkUI without problems, but if I go through the
master > applicationUI it tries to send me to the wrong host (the one the
driver reports in its log).

The hostname the driver reports is the same from which I send the submit
request.

2016-08-01 17:27 GMT-03:00 Mich Talebzadeh :

> Fine.
>
> In that case which process is your driver program (from jps output)?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 21:23, Maximiliano Patricio Méndez <
> mmen...@despegar.com> wrote:
>
>> Hi, MIch, thanks for replying.
>>
>> I'm deploying from the same instance from where I showed the logs and
>> commands using --deploy-mode cluster.
>>
>> The SparkSubmit process only appears while the bin/spark-submit binary is
>> active.
>> When the application starts and the driver takes control, the SparkSubmit
>> process dies.
>>
>> 2016-08-01 16:07 GMT-03:00 Mich Talebzadeh :
>>
>>> OK I can see the Worker (19286 Worker and the executor(6548
>>> CoarseGrainedExecutorBackend) running on it
>>>
>>> Where is spark-submit? Did you submit your job from another node or used
>>> another method to run it?
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 August 2016 at 19:08, Maximiliano Patricio Méndez <
>>> mmen...@despegar.com> wrote:
>>>
 I just recently tried again, the port 4040 is not used. And even if it
 were, I think the log would reflect that trying to use the following port
 (4041) as you mentioned.

 This is what the driver log says:

 16/08/01 13:55:56 INFO Utils: Successfully started service 'SparkUI' on 
 port 4040.
 16/08/01 13:55:56 INFO SparkUI: Started SparkUI at http://hostname:4040


 If I go to {hostname}:
 ~$ jps
 6548 CoarseGrainedExecutorBackend
 19286 Worker
 6843 Jps
 19182 Master

 ~$ netstat -nltp
 Active Internet connections (only servers)
 Proto Recv-Q Send-Q Local Address   Foreign Address
 State   PID/Program name
 tcp6   0  0 192.168.22.245:43037:::*
  LISTEN  6548/java
 tcp6   0  0 192.168.22.245:56929:::*
  LISTEN  19286/java
 tcp6   0  0 192.168.22.245:7077 :::*
  LISTEN  19182/java
 tcp6   0  0 :::33296:::*
  LISTEN  6548/java
 tcp6   0  0 :::8080 :::*
  LISTEN  19182/java
 tcp6   0  0 :::8081 :::*
  LISTEN  19286/java
 tcp6   0  0 192.168.22.245:6066 :::*
  LISTEN  19182/java

 ~$ netstat -nltap | grep 4040

 I'm really lost here and don't know much about spark yet, but shouldn't
 there be a DriverWrapper process which holds the bind on port 4040?


 2016-08-01 13:49 GMT-03:00 Mich Talebzadeh :

> Can you check if port 4040 is actually used? If it used the next
> available one would 4041. For example below Zeppelin uses it
>
>
> *netstat -plten|grep 4040*tcp0  0
> :::4040 :::*LISTEN
> 1005   73372882   *10699*/java
> *ps aux|grep 10699*
> hduser   10699  0.1  3.8 3172308 952932 pts/3  SNl  Jul30   5:57
> /usr/java/latest/bin/java -cp /data6/hduser/zeppelin-0.6.0/ ...
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> 

Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Put the queue in a static variable that is first referenced on the
workers (inside an RDD closure).  That way it will be created on each
of the workers, not the driver.

The easiest way to do that is with a lazy val in a companion object.
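
For illustration, a minimal sketch of that pattern (the object, queue type
and sampling call are placeholders):

import scala.collection.mutable

object ReservoirHolder {
  // Referenced for the first time inside an RDD closure, so this queue is
  // created lazily in each executor JVM rather than on the driver.
  lazy val reservoir: mutable.Queue[String] = mutable.Queue.empty[String]
}

// e.g. inside the DStream pipeline:
// dstream.transform { rdd =>
//   rdd.mapPartitions { iter =>
//     val q = ReservoirHolder.reservoir          // per-executor instance
//     iter.filter(item => sample(q, item))       // sample() is a placeholder
//   }
// }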

On Mon, Aug 1, 2016 at 3:22 PM, Martin Le  wrote:
> How to do that? if I put the queue inside .transform operation, it doesn't
> work.
>
> On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger  wrote:
>>
>> Can you keep a queue per executor in memory?
>>
>> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le 
>> wrote:
>> > Hi Cody and all,
>> >
>> > Thank you for your answer. I implement simple random sampling (SRS) for
>> > DStream using transform method, and it works fine.
>> > However, I have a problem when I implement reservoir sampling (RS). In
>> > RS, I
>> > need to maintain a reservoir (a queue) to store selected data items
>> > (RDDs).
>> > If I define a large stream window, the queue also increases  and it
>> > leads to
>> > the driver run out of memory.  I explain my problem in detail here:
>> >
>> > https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>> >
>> > Could you please give me some suggestions or advice to fix this problem?
>> >
>> > Thanks
>> >
>> > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger 
>> > wrote:
>> >>
>> >> Most stream systems you're still going to incur the cost of reading
>> >> each message... I suppose you could rotate among reading just the
>> >> latest messages from a single partition of a Kafka topic if they were
>> >> evenly balanced.
>> >>
>> >> But once you've read the messages, nothing's stopping you from
>> >> filtering most of them out before doing further processing.  The
>> >> dstream .transform method will let you do any filtering / sampling you
>> >> could have done on an rdd.
>> >>
>> >> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le 
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I have to handle high-speed rate data stream. To reduce the heavy
>> >> > load,
>> >> > I
>> >> > want to use sampling techniques for each stream window. It means that
>> >> > I
>> >> > want
>> >> > to process a subset of data instead of whole window data. I saw Spark
>> >> > support sampling operations for RDD, but for DStream, Spark supports
>> >> > sampling operation as well? If not,  could you please give me a
>> >> > suggestion
>> >> > how to implement it?
>> >> >
>> >> > Thanks,
>> >> > Martin
>> >
>> >
>
>




Re: Problems initializing SparkUI

2016-08-01 Thread Mich Talebzadeh
Fine.

In that case which process is your driver program (from jps output)?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 21:23, Maximiliano Patricio Méndez  wrote:

> Hi, MIch, thanks for replying.
>
> I'm deploying from the same instance from where I showed the logs and
> commands using --deploy-mode cluster.
>
> The SparkSubmit process only appears while the bin/spark-submit binary is
> active.
> When the application starts and the driver takes control, the SparkSubmit
> process dies.
>
> 2016-08-01 16:07 GMT-03:00 Mich Talebzadeh :
>
>> OK I can see the Worker (19286 Worker and the executor(6548
>> CoarseGrainedExecutorBackend) running on it
>>
>> Where is spark-submit? Did you submit your job from another node or used
>> another method to run it?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 August 2016 at 19:08, Maximiliano Patricio Méndez <
>> mmen...@despegar.com> wrote:
>>
>>> I just recently tried again, the port 4040 is not used. And even if it
>>> were, I think the log would reflect that trying to use the following port
>>> (4041) as you mentioned.
>>>
>>> This is what the driver log says:
>>>
>>> 16/08/01 13:55:56 INFO Utils: Successfully started service 'SparkUI' on 
>>> port 4040.
>>> 16/08/01 13:55:56 INFO SparkUI: Started SparkUI at http://hostname:4040
>>>
>>>
>>> If I go to {hostname}:
>>> ~$ jps
>>> 6548 CoarseGrainedExecutorBackend
>>> 19286 Worker
>>> 6843 Jps
>>> 19182 Master
>>>
>>> ~$ netstat -nltp
>>> Active Internet connections (only servers)
>>> Proto Recv-Q Send-Q Local Address   Foreign Address
>>> State   PID/Program name
>>> tcp6   0  0 192.168.22.245:43037:::*
>>>  LISTEN  6548/java
>>> tcp6   0  0 192.168.22.245:56929:::*
>>>  LISTEN  19286/java
>>> tcp6   0  0 192.168.22.245:7077 :::*
>>>  LISTEN  19182/java
>>> tcp6   0  0 :::33296:::*
>>>  LISTEN  6548/java
>>> tcp6   0  0 :::8080 :::*
>>>  LISTEN  19182/java
>>> tcp6   0  0 :::8081 :::*
>>>  LISTEN  19286/java
>>> tcp6   0  0 192.168.22.245:6066 :::*
>>>  LISTEN  19182/java
>>>
>>> ~$ netstat -nltap | grep 4040
>>>
>>> I'm really lost here and don't know much about spark yet, but shouldn't
>>> there be a DriverWrapper process which holds the bind on port 4040?
>>>
>>>
>>> 2016-08-01 13:49 GMT-03:00 Mich Talebzadeh :
>>>
 Can you check if port 4040 is actually used? If it used the next
 available one would 4041. For example below Zeppelin uses it


 *netstat -plten|grep 4040*tcp0  0
 :::4040 :::*LISTEN
 1005   73372882   *10699*/java
 *ps aux|grep 10699*
 hduser   10699  0.1  3.8 3172308 952932 pts/3  SNl  Jul30   5:57
 /usr/java/latest/bin/java -cp /data6/hduser/zeppelin-0.6.0/ ...

 HTH

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 1 August 2016 at 17:44, Maximiliano Patricio Méndez <
 mmen...@despegar.com> wrote:

> Hi,
>
> Thanks for the answers.
>
> @Jacek: To verify if the ui is up, I enter 

Re: The equivalent for INSTR in Spark FP

2016-08-01 Thread Mich Talebzadeh
Any programming expert who can shed some light on this?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 18:24, Mich Talebzadeh 
wrote:

> Hi,
>
> What is the equivalent of FP for the following window/analytic that works
> OK in Spark SQL
>
> This one using INSTR
>
> select
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>
>
> select distinct *
> from (
>   select
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
>   SUM(debitamount) OVER (PARTITION BY
> substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2))
> AS spent
>   from accounts.ll_18740868 where transactiontype = 'DEB'
>  ) tmp
>
>
> I tried indexOf but it does not work!
>
> val wSpec2 =
> Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
> :26: error: not found: value indexOf
>  val wSpec2 =
> Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
>
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
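
One way that should work in the DataFrame API (a sketch, untested; it assumes the Hive table has already been loaded into a DataFrame called ll_18740868): the DataFrame substring() needs literal Int positions, so the simplest route is to keep the SQL fragment inside expr(); functions.instr() also exists if you prefer building the Column up by hand.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, expr, sum}

// Column equivalent of substring(td, 1, instr(td, 'CD') - 2)
val cdPrefix = expr("substring(transactiondescription, 1, instr(transactiondescription, 'CD') - 2)")

val wSpec2 = Window.partitionBy(cdPrefix)

val spent = ll_18740868
  .filter(col("transactiontype") === "DEB")
  .select(cdPrefix.alias("prefix"), sum(col("debitamount")).over(wSpec2).alias("spent"))
  .distinct()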


Re: Problems initializing SparkUI

2016-08-01 Thread Maximiliano Patricio Méndez
Hi Mich, thanks for replying.

I'm deploying with --deploy-mode cluster, from the same instance where I took
the logs and ran the commands.

The SparkSubmit process only appears while the bin/spark-submit binary is
active.
When the application starts and the driver takes control, the SparkSubmit
process dies.

2016-08-01 16:07 GMT-03:00 Mich Talebzadeh :

> OK I can see the Worker (19286 Worker and the executor(6548
> CoarseGrainedExecutorBackend) running on it
>
> Where is spark-submit? Did you submit your job from another node or used
> another method to run it?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 19:08, Maximiliano Patricio Méndez <
> mmen...@despegar.com> wrote:
>
>> I just recently tried again, the port 4040 is not used. And even if it
>> were, I think the log would reflect that trying to use the following port
>> (4041) as you mentioned.
>>
>> This is what the driver log says:
>>
>> 16/08/01 13:55:56 INFO Utils: Successfully started service 'SparkUI' on port 
>> 4040.
>> 16/08/01 13:55:56 INFO SparkUI: Started SparkUI at http://hostname:4040
>>
>>
>> If I go to {hostname}:
>> ~$ jps
>> 6548 CoarseGrainedExecutorBackend
>> 19286 Worker
>> 6843 Jps
>> 19182 Master
>>
>> ~$ netstat -nltp
>> Active Internet connections (only servers)
>> Proto Recv-Q Send-Q Local Address   Foreign Address State
>>   PID/Program name
>> tcp6   0  0 192.168.22.245:43037:::*
>>  LISTEN  6548/java
>> tcp6   0  0 192.168.22.245:56929:::*
>>  LISTEN  19286/java
>> tcp6   0  0 192.168.22.245:7077 :::*
>>  LISTEN  19182/java
>> tcp6   0  0 :::33296:::*
>>  LISTEN  6548/java
>> tcp6   0  0 :::8080 :::*
>>  LISTEN  19182/java
>> tcp6   0  0 :::8081 :::*
>>  LISTEN  19286/java
>> tcp6   0  0 192.168.22.245:6066 :::*
>>  LISTEN  19182/java
>>
>> ~$ netstat -nltap | grep 4040
>>
>> I'm really lost here and don't know much about spark yet, but shouldn't
>> there be a DriverWrapper process which holds the bind on port 4040?
>>
>>
>> 2016-08-01 13:49 GMT-03:00 Mich Talebzadeh :
>>
>>> Can you check if port 4040 is actually used? If it used the next
>>> available one would 4041. For example below Zeppelin uses it
>>>
>>>
>>> *netstat -plten|grep 4040*tcp0  0
>>> :::4040 :::*LISTEN
>>> 1005   73372882   *10699*/java
>>> *ps aux|grep 10699*
>>> hduser   10699  0.1  3.8 3172308 952932 pts/3  SNl  Jul30   5:57
>>> /usr/java/latest/bin/java -cp /data6/hduser/zeppelin-0.6.0/ ...
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 1 August 2016 at 17:44, Maximiliano Patricio Méndez <
>>> mmen...@despegar.com> wrote:
>>>
 Hi,

 Thanks for the answers.

 @Jacek: To verify if the ui is up, I enter to all the worker nodes of
 my cluster and run netstat -nltp | grep 4040 with no result. The log of the
 driver tells me in which server and on which port should the spark ui be
 up, but it isn't.


 @Mich: I've tried to specify spark.ui.port=nnn but I only manage to
 change the log, reporting that the driver should be in another port.

 The ui has no problem to start in that port (4040) when I run my
 application in client mode.

 Could there be a network issue making the ui to fail silently? I've
 read some of the code regarding those parts of the driver log, but couldn't
 find anything weird.

 2016-07-29 19:45 GMT-03:00 Mich Talebzadeh :

> why chance it. Best to explicitly specify in spark-submit (or
> whatever) which port to listen to
>
>  --conf 

Re: sampling operation for DStream

2016-08-01 Thread Martin Le
How would I do that? If I put the queue inside the .transform operation, it
doesn't work.

On Mon, Aug 1, 2016 at 6:43 PM, Cody Koeninger  wrote:

> Can you keep a queue per executor in memory?
>
> On Mon, Aug 1, 2016 at 11:24 AM, Martin Le 
> wrote:
> > Hi Cody and all,
> >
> > Thank you for your answer. I implement simple random sampling (SRS) for
> > DStream using transform method, and it works fine.
> > However, I have a problem when I implement reservoir sampling (RS). In
> RS, I
> > need to maintain a reservoir (a queue) to store selected data items
> (RDDs).
> > If I define a large stream window, the queue also increases  and it
> leads to
> > the driver run out of memory.  I explain my problem in detail here:
> >
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
> >
> > Could you please give me some suggestions or advice to fix this problem?
> >
> > Thanks
> >
> > On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger 
> wrote:
> >>
> >> Most stream systems you're still going to incur the cost of reading
> >> each message... I suppose you could rotate among reading just the
> >> latest messages from a single partition of a Kafka topic if they were
> >> evenly balanced.
> >>
> >> But once you've read the messages, nothing's stopping you from
> >> filtering most of them out before doing further processing.  The
> >> dstream .transform method will let you do any filtering / sampling you
> >> could have done on an rdd.
> >>
> >> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le 
> >> wrote:
> >> > Hi all,
> >> >
> >> > I have to handle high-speed rate data stream. To reduce the heavy
> load,
> >> > I
> >> > want to use sampling techniques for each stream window. It means that
> I
> >> > want
> >> > to process a subset of data instead of whole window data. I saw Spark
> >> > support sampling operations for RDD, but for DStream, Spark supports
> >> > sampling operation as well? If not,  could you please give me a
> >> > suggestion
> >> > how to implement it?
> >> >
> >> > Thanks,
> >> > Martin
> >
> >
>


Re: Java Recipes for Spark

2016-08-01 Thread Gourav Sengupta
JAVA? AGAIN? I am getting into serious depression


Regards,
Gourav

On Mon, Aug 1, 2016 at 9:03 PM, Marco Mistroni  wrote:

> Hi jg
>   +1 for link. I'd add ML and graph examples if u can
>   -1 for programming language choice :))
>
>
> kr
>
> On 31 Jul 2016 9:13 pm, "Jean Georges Perrin"  wrote:
>
>> Thanks Guys - I really appreciate :)... If you have any idea of something
>> missing, I'll gladly add it.
>>
>> (and yeah, come on! Is that some kind of primitive racism or what: Java
>> rocks! What are those language where you can turn a list to a string and
>> back to an object. #StrongTypingRules)
>>
>> On Jul 30, 2016, at 12:19 AM, Shiva Ramagopal  wrote:
>>
>> +1 for the Java love :-)
>>
>> On 30-Jul-2016 4:39 AM, "Renato Perini"  wrote:
>>
>>> Not only very useful, but finally some Java love :-)
>>>
>>> Thank you.
>>>
>>>
>>> Il 29/07/2016 22:30, Jean Georges Perrin ha scritto:
>>>
 Sorry if this looks like a shameless self promotion, but some of you
 asked me to say when I'll have my Java recipes for Apache Spark updated.
 It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in
 the GitHub repo.

 Enjoy / have a great week-end.

 jg



>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: Java Recipes for Spark

2016-08-01 Thread Marco Mistroni
Hi jg
  +1 for link. I'd add ML and graph examples if u can
  -1 for programming language choice :))


kr

On 31 Jul 2016 9:13 pm, "Jean Georges Perrin"  wrote:

> Thanks Guys - I really appreciate :)... If you have any idea of something
> missing, I'll gladly add it.
>
> (and yeah, come on! Is that some kind of primitive racism or what: Java
> rocks! What are those language where you can turn a list to a string and
> back to an object. #StrongTypingRules)
>
> On Jul 30, 2016, at 12:19 AM, Shiva Ramagopal  wrote:
>
> +1 for the Java love :-)
>
> On 30-Jul-2016 4:39 AM, "Renato Perini"  wrote:
>
>> Not only very useful, but finally some Java love :-)
>>
>> Thank you.
>>
>>
>> Il 29/07/2016 22:30, Jean Georges Perrin ha scritto:
>>
>>> Sorry if this looks like a shameless self promotion, but some of you
>>> asked me to say when I'll have my Java recipes for Apache Spark updated.
>>> It's done here: http://jgp.net/2016/07/22/spark-java-recipes/ and in
>>> the GitHub repo.
>>>
>>> Enjoy / have a great week-end.
>>>
>>> jg
>>>
>>>
>>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Mlib RandomForest (Spark 2.0) predict a single vector

2016-08-01 Thread itai.efrati
Hello,

After training a RandomForestRegressor in a PipelineModel using the new ml
package and DataFrames (Spark 2.0),
I loaded the saved model into my RT environment in order to predict with it.
Each request is handled and transformed through the loaded PipelineModel, but
in the process I had to convert the single request vector to a one-row
DataFrame using spark.createDataFrame, and all of this takes around 700ms!

Compare that to 2.5ms if I use the mllib RDD
RandomForestRegressor.predict(VECTOR).
Is there any way to use the new ml package to predict a single vector without
converting it to a DataFrame, or to do something else to speed things up?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Mlib-RandomForest-Spark-2-0-predict-a-single-vector-tp27447.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SQL predicate pushdown on parquet or other columnar formats

2016-08-01 Thread Mich Talebzadeh
Hi,

You mentioned:

In general, is this optimization done for all columnar databases or file
formats ?


Have you tried it with an ORC file? That is another columnar format.

Spark follows a rule-based optimizer; it does not have a cost-based
optimizer yet. That is planned for the future, I believe:


https://issues.apache.org/jira/browse/SPARK-16026
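
A quick way to compare (a sketch, untested; peopleDF stands for the DataFrame
behind parquetFile, and the ORC source needs Hive support in Spark 1.x): write
the same data out as ORC, enable ORC filter pushdown, and look for
PushedFilters in explain():

import org.apache.spark.sql.functions.col

// ORC predicate pushdown is off by default, so switch it on first
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

peopleDF.write.orc("/tmp/people.orc")

val orcDF = sqlContext.read.orc("/tmp/people.orc")
orcDF.filter(col("age") === 50 && col("name") === "someone").explain()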

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 19:17, Sandeep Joshi  wrote:

>
> Hi
>
> I just want to confirm my understanding of the physical plan generated by
> Spark SQL while reading from a Parquet file.
>
> When multiple predicates are pushed to the PrunedFilterScan, does Spark
> ensure that the Parquet file is not read multiple times while evaluating
> each predicate ?
>
> In general, is this optimization done for all columnar databases or file
> formats ?
>
> When I ran the following query in the spark-shell
>
> > val nameDF = sqlContext.sql("SELECT name FROM parquetFile WHERE age = 50
> AND name = 'someone'")
>
> I saw that both the filters are pushed, but I can't seem to find where it
> applies them to the file data.
>
> > nameDF.explain()
>
> shows
>
> Project [name#112]
> +- Filter ((age#111L = 50) && (name#112 = someone))
>+- Scan ParquetRelation[name#112,age#111L] InputPaths:
> file:/home/spark/spark-1.6.1/people.parquet,
>   PushedFilters: [EqualTo(age,50), EqualTo(name,someone)]
>
>
>


Re: Problems initializing SparkUI

2016-08-01 Thread Mich Talebzadeh
OK, I can see the Worker (19286 Worker) and the executor (6548
CoarseGrainedExecutorBackend) running on it.

Where is spark-submit? Did you submit your job from another node, or use
another method to run it?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 19:08, Maximiliano Patricio Méndez  wrote:

> I just recently tried again, the port 4040 is not used. And even if it
> were, I think the log would reflect that trying to use the following port
> (4041) as you mentioned.
>
> This is what the driver log says:
>
> 16/08/01 13:55:56 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/08/01 13:55:56 INFO SparkUI: Started SparkUI at http://hostname:4040
>
>
> If I go to {hostname}:
> ~$ jps
> 6548 CoarseGrainedExecutorBackend
> 19286 Worker
> 6843 Jps
> 19182 Master
>
> ~$ netstat -nltp
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address   Foreign Address State
>   PID/Program name
> tcp6   0  0 192.168.22.245:43037:::*
>  LISTEN  6548/java
> tcp6   0  0 192.168.22.245:56929:::*
>  LISTEN  19286/java
> tcp6   0  0 192.168.22.245:7077 :::*
>  LISTEN  19182/java
> tcp6   0  0 :::33296:::*LISTEN
>  6548/java
> tcp6   0  0 :::8080 :::*LISTEN
>  19182/java
> tcp6   0  0 :::8081 :::*LISTEN
>  19286/java
> tcp6   0  0 192.168.22.245:6066 :::*
>  LISTEN  19182/java
>
> ~$ netstat -nltap | grep 4040
>
> I'm really lost here and don't know much about spark yet, but shouldn't
> there be a DriverWrapper process which holds the bind on port 4040?
>
>
> 2016-08-01 13:49 GMT-03:00 Mich Talebzadeh :
>
>> Can you check if port 4040 is actually used? If it used the next
>> available one would 4041. For example below Zeppelin uses it
>>
>>
>> *netstat -plten|grep 4040*tcp0  0
>> :::4040 :::*LISTEN
>> 1005   73372882   *10699*/java
>> *ps aux|grep 10699*
>> hduser   10699  0.1  3.8 3172308 952932 pts/3  SNl  Jul30   5:57
>> /usr/java/latest/bin/java -cp /data6/hduser/zeppelin-0.6.0/ ...
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 August 2016 at 17:44, Maximiliano Patricio Méndez <
>> mmen...@despegar.com> wrote:
>>
>>> Hi,
>>>
>>> Thanks for the answers.
>>>
>>> @Jacek: To verify if the ui is up, I enter to all the worker nodes of my
>>> cluster and run netstat -nltp | grep 4040 with no result. The log of the
>>> driver tells me in which server and on which port should the spark ui be
>>> up, but it isn't.
>>>
>>>
>>> @Mich: I've tried to specify spark.ui.port=nnn but I only manage to
>>> change the log, reporting that the driver should be in another port.
>>>
>>> The ui has no problem to start in that port (4040) when I run my
>>> application in client mode.
>>>
>>> Could there be a network issue making the ui to fail silently? I've read
>>> some of the code regarding those parts of the driver log, but couldn't find
>>> anything weird.
>>>
>>> 2016-07-29 19:45 GMT-03:00 Mich Talebzadeh :
>>>
 why chance it. Best to explicitly specify in spark-submit (or whatever)
 which port to listen to

  --conf "spark.ui.port=nnn"

 and see if it works

 HTH


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data 

Re: Writing custom Transformers and Estimators like Tokenizer in spark ML

2016-08-01 Thread janardhan shetty
What is the difference between the UnaryTransformer and Transformer classes? In
which scenarios should we use one or the other?

On Sun, Jul 31, 2016 at 8:27 PM, janardhan shetty 
wrote:

> I am developing in Scala, but any help with the difference between UnaryTransformer
> (is this still experimental?) and the Transformer class is appreciated.
>
> Right now I am encountering an error in code which extends UnaryTransformer:
>
> override protected def outputDataType: DataType = new StringType
>
> Error:(26, 53) constructor StringType in class StringType cannot be accessed 
> in class Capitalizer
>   override protected def outputDataType: DataType = new StringType
> ^
>
>
>
> On Thu, Jul 28, 2016 at 8:20 PM, Phuong LE-HONG 
> wrote:
>
>> Hi,
>>
>> I've developed a simple ML estimator (in Java) that implements
>> conditional Markov model for sequence labelling in Vitk toolkit. You
>> can check it out here:
>>
>>
>> https://github.com/phuonglh/vn.vitk/blob/master/src/main/java/vn/vitk/tag/CMM.java
>>
>> Phuong Le-Hong
>>
>> On Fri, Jul 29, 2016 at 9:01 AM, janardhan shetty
>>  wrote:
>> > Thanks Steve.
>> >
>> > Any pointers to custom estimators development as well ?
>> >
>> > On Wed, Jul 27, 2016 at 11:35 AM, Steve Rowe  wrote:
>> >>
>> >> You can see the source for my transformer configurable bridge to Lucene
>> >> analysis components here, in my company Lucidworks’ spark-solr project:
>> >> <
>> https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/ml/feature/LuceneTextAnalyzerTransformer.scala
>> >.
>> >>
>> >> Here’s a blog I wrote about using this transformer, as well as
>> >> non-ML-context use in Spark of the underlying analysis component, here:
>> >> > >.
>> >>
>> >> --
>> >> Steve
>> >> www.lucidworks.com
>> >>
>> >> > On Jul 27, 2016, at 1:31 PM, janardhan shetty <
>> janardhan...@gmail.com>
>> >> > wrote:
>> >> >
>> >> > 1.  Any links or blogs to develop custom transformers ? ex: Tokenizer
>> >> >
>> >> > 2. Any links or blogs to develop custom estimators ? ex: any ml
>> >> > algorithm
>> >>
>> >
>>
>
>
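
On the original question: UnaryTransformer is a convenience base class for the
common one-input-column to one-output-column case — it implements transform()
and transformSchema() for you, and you only supply createTransformFunc,
outputDataType and (optionally) validateInputType. Extend Transformer directly
when you need anything more general (several columns, access to the whole
Dataset, and so on). For the StringType error, return the StringType singleton
object instead of calling new StringType. A minimal sketch (untested):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// One-column-in / one-column-out transformer; the key point for the error
// above is to use the StringType singleton, not `new StringType`.
class Capitalizer(override val uid: String)
  extends UnaryTransformer[String, String, Capitalizer] {

  def this() = this(Identifiable.randomUID("capitalizer"))

  override protected def createTransformFunc: String => String = _.toUpperCase

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == StringType, s"Input type must be StringType but got $inputType")

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): Capitalizer = defaultCopy(extra)
}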


SQL predicate pushdown on parquet or other columnar formats

2016-08-01 Thread Sandeep Joshi
Hi

I just want to confirm my understanding of the physical plan generated by
Spark SQL while reading from a Parquet file.

When multiple predicates are pushed to the PrunedFilterScan, does Spark
ensure that the Parquet file is not read multiple times while evaluating
each predicate ?

In general, is this optimization done for all columnar databases or file
formats ?

When I ran the following query in the spark-shell

> val nameDF = sqlContext.sql("SELECT name FROM parquetFile WHERE age = 50
AND name = 'someone'")

I saw that both the filters are pushed, but I can't seem to find where it
applies them to the file data.

> nameDF.explain()

shows

Project [name#112]
+- Filter ((age#111L = 50) && (name#112 = someone))
   +- Scan ParquetRelation[name#112,age#111L] InputPaths:
file:/home/spark/spark-1.6.1/people.parquet,
  PushedFilters: [EqualTo(age,50), EqualTo(name,someone)]


Re: Problems initializing SparkUI

2016-08-01 Thread Maximiliano Patricio Méndez
I just tried again; port 4040 is not in use. And even if it were, I think
the log would reflect that by trying the next port (4041), as you mentioned.

This is what the driver log says:

16/08/01 13:55:56 INFO Utils: Successfully started service 'SparkUI'
on port 4040.
16/08/01 13:55:56 INFO SparkUI: Started SparkUI at http://hostname:4040


If I go to {hostname}:
~$ jps
6548 CoarseGrainedExecutorBackend
19286 Worker
6843 Jps
19182 Master

~$ netstat -nltp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address   Foreign Address State
PID/Program name
tcp6   0  0 192.168.22.245:43037:::*LISTEN
 6548/java
tcp6   0  0 192.168.22.245:56929:::*LISTEN
 19286/java
tcp6   0  0 192.168.22.245:7077 :::*LISTEN
 19182/java
tcp6   0  0 :::33296:::*LISTEN
 6548/java
tcp6   0  0 :::8080 :::*LISTEN
 19182/java
tcp6   0  0 :::8081 :::*LISTEN
 19286/java
tcp6   0  0 192.168.22.245:6066 :::*LISTEN
 19182/java

~$ netstat -nltap | grep 4040

I'm really lost here and don't know much about Spark yet, but shouldn't
there be a DriverWrapper process bound to port 4040?


2016-08-01 13:49 GMT-03:00 Mich Talebzadeh :

> Can you check if port 4040 is actually used? If it used the next available
> one would 4041. For example below Zeppelin uses it
>
>
> netstat -plten|grep 4040
> tcp        0      0 :::4040        :::*        LISTEN      1005   73372882   10699/java
>
> ps aux|grep 10699
> hduser   10699  0.1  3.8 3172308 952932 pts/3  SNl  Jul30   5:57
> /usr/java/latest/bin/java -cp /data6/hduser/zeppelin-0.6.0/ ...
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 August 2016 at 17:44, Maximiliano Patricio Méndez <
> mmen...@despegar.com> wrote:
>
>> Hi,
>>
>> Thanks for the answers.
>>
>> @Jacek: To verify if the ui is up, I enter to all the worker nodes of my
>> cluster and run netstat -nltp | grep 4040 with no result. The log of the
>> driver tells me in which server and on which port should the spark ui be
>> up, but it isn't.
>>
>>
>> @Mich: I've tried to specify spark.ui.port=nnn but I only manage to
>> change the log, reporting that the driver should be in another port.
>>
>> The ui has no problem to start in that port (4040) when I run my
>> application in client mode.
>>
>> Could there be a network issue making the ui to fail silently? I've read
>> some of the code regarding those parts of the driver log, but couldn't find
>> anything weird.
>>
>> 2016-07-29 19:45 GMT-03:00 Mich Talebzadeh :
>>
>>> why chance it. Best to explicitly specify in spark-submit (or whatever)
>>> which port to listen to
>>>
>>>  --conf "spark.ui.port=nnn"
>>>
>>> and see if it works
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 29 July 2016 at 23:37, Jacek Laskowski  wrote:
>>>
 Hi,

 I'm curious about "For some reason, sometimes the SparkUI does not
 appear to be bound on port 4040 (or any other) but the application
 runs perfectly and finishes giving the expected answer." How do you
 check that web UI listens to the port 4040?

 Pozdrawiam,
 Jacek Laskowski
 
 https://medium.com/@jaceklaskowski/
 Mastering Apache Spark http://bit.ly/mastering-apache-spark
 Follow me at https://twitter.com/jaceklaskowski


 On Thu, Jul 28, 2016 at 11:37 PM, Maximiliano Patricio Méndez
  wrote:
 > Hi,
 >
 > I'm having some trouble trying to 

The equivalent for INSTR in Spark FP

2016-08-01 Thread Mich Talebzadeh
Hi,

What is the FP (DataFrame API) equivalent of the following window/analytic
query that works OK in Spark SQL?

This one uses INSTR:

select
substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),


select distinct *
from (
  select
substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2),
  SUM(debitamount) OVER (PARTITION BY
substring(transactiondescription,1,INSTR(transactiondescription,'CD')-2))
AS spent
  from accounts.ll_18740868 where transactiontype = 'DEB'
 ) tmp


I tried indexOf but it does not work!

val wSpec2 =
Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))
<console>:26: error: not found: value indexOf
 val wSpec2 =
Window.partitionBy(substring(col("transactiondescription"),1,indexOf(col("transactiondescription"),"CD")))


Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


python 'Jupyter' data frame problem with autocompletion

2016-08-01 Thread Andy Davidson
I started using python3 and jupyter in a chrome browser. I seem to be having
trouble with data frame code completion. Regular python functions seems to
work correctly.

I wonder if I need to import something so the notebook knows about data
frames?

Kind regards


Andy 




Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Nikolay Zhebet
Yes, Spark always tries to deliver the snippet of code to the data (not vice
versa). But you should realize that if you run a groupBy or join on a large
dataset, you still have to move the temporarily, locally grouped data from one
worker node to another (that is the shuffle operation, as far as I know). At
the end of the whole batch process you can fetch your grouped dataset, but
under the hood you will see a lot of network connections between worker nodes,
because all of your 2 TB of data was split into 128 MB parts and written to
different HDFS DataNodes.

As an example: say you analyze your workflow and realize that in most cases you
group your data by date (yyyy-mm-dd). In that case you can keep all the data
for one day in a single Region Server (if you use the Spark-on-HBase DataFrame
connector). Then your "group by date" operation can be done on the local
worker node, without shuffling temporary data between the other worker nodes.
Maybe this article can be useful:
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
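
As a rough sketch of that pre-partitioning idea (untested; rawLogs and its
date field are illustrative), with plain RDDs you can hash-partition by the
grouping key once, cache the result, and later key-based operations on the
same key then reuse that partitioner instead of shuffling again:

import org.apache.spark.HashPartitioner

val keyed = rawLogs.map(r => (r.date, r))                    // key by the grouping column
val partitioned = keyed.partitionBy(new HashPartitioner(48)).cache()

// mapValues preserves the partitioner, so this reduceByKey needs no new shuffle
val visitsPerDay = partitioned.mapValues(_ => 1L).reduceByKey(_ + _)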

2016-08-01 18:56 GMT+03:00 Jestin Ma :

> Hi Nikolay, I'm looking at data locality improvements for Spark, and I
> have conflicting sources on using YARN for Spark.
>
> Reynold said that Spark workers automatically take care of data locality
> here:
> https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS
>
> However, I've read elsewhere (
> https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
> that Spark on YARN increases data locality because YARN tries to place
> tasks next to HDFS blocks.
>
> Can anyone verify/support one side or the other?
>
> Thank you,
> Jestin
>
> On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet  wrote:
>
>> Hi.
>> Maybe "data locality" can help you.
>> If you use groupBy and joins, then most likely you will see a lot of
>> network operations. This can be very slow. You can try to prepare and
>> transform your information in a way that minimizes transporting temporary
>> information between worker nodes.
>>
>> Try googling "data locality in Hadoop".
>>
>>
>> 2016-08-01 4:41 GMT+03:00 Jestin Ma :
>>
>>> It seems that the number of tasks being this large do not matter. Each
>>> task was set default by the HDFS as 128 MB (block size) which I've heard to
>>> be ok. I've tried tuning the block (task) size to be larger and smaller to
>>> no avail.
>>>
>>> I tried coalescing to 50 but that introduced large data skew and slowed
>>> down my job a lot.
>>>
>>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich 
>>> wrote:
>>>
 15000 seems like a lot of tasks for that size. Test it out with a
 .coalesce(50) placed right after loading the data. It will probably either
 run faster or crash with out of memory errors.

 On Jul 29, 2016, at 9:02 AM, Jestin Ma 
 wrote:

 I am processing ~2 TB of hdfs data using DataFrames. The size of a task
 is equal to the block size specified by hdfs, which happens to be 128 MB,
 leading to about 15000 tasks.

 I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
 I'm performing groupBy, count, and an outer-join with another DataFrame
 of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
 to disk.

 Right now it takes about 55 minutes, and I've been trying to tune it.

 I read on the Spark Tuning guide that:
 *In general, we recommend 2-3 tasks per CPU core in your cluster.*

 This means that I should have about 30-50 tasks instead of 15000, and
 each task would be much bigger in size. Is my understanding correct, and is
 this suggested? I've read from difference sources to decrease or increase
 parallelism, or even keep it default.

 Thank you for your help,
 Jestin



>>>
>>
>


Re: Problems initializing SparkUI

2016-08-01 Thread Mich Talebzadeh
Can you check whether port 4040 is actually in use? If it is used, the next
available one would be 4041. For example, below Zeppelin uses it:


netstat -plten|grep 4040
tcp        0      0 :::4040        :::*        LISTEN      1005   73372882   10699/java

ps aux|grep 10699
hduser   10699  0.1  3.8 3172308 952932 pts/3  SNl  Jul30   5:57
/usr/java/latest/bin/java -cp /data6/hduser/zeppelin-0.6.0/ ...

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 17:44, Maximiliano Patricio Méndez  wrote:

> Hi,
>
> Thanks for the answers.
>
> @Jacek: To verify if the ui is up, I enter to all the worker nodes of my
> cluster and run netstat -nltp | grep 4040 with no result. The log of the
> driver tells me in which server and on which port should the spark ui be
> up, but it isn't.
>
>
> @Mich: I've tried to specify spark.ui.port=nnn but I only manage to change
> the log, reporting that the driver should be in another port.
>
> The ui has no problem to start in that port (4040) when I run my
> application in client mode.
>
> Could there be a network issue making the ui to fail silently? I've read
> some of the code regarding those parts of the driver log, but couldn't find
> anything weird.
>
> 2016-07-29 19:45 GMT-03:00 Mich Talebzadeh :
>
>> why chance it. Best to explicitly specify in spark-submit (or whatever)
>> which port to listen to
>>
>>  --conf "spark.ui.port=nnn"
>>
>> and see if it works
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 29 July 2016 at 23:37, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> I'm curious about "For some reason, sometimes the SparkUI does not
>>> appear to be bound on port 4040 (or any other) but the application
>>> runs perfectly and finishes giving the expected answer." How do you
>>> check that web UI listens to the port 4040?
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Thu, Jul 28, 2016 at 11:37 PM, Maximiliano Patricio Méndez
>>>  wrote:
>>> > Hi,
>>> >
>>> > I'm having some trouble trying to submit an application to my spark
>>> cluster.
>>> > For some reason, sometimes the SparkUI does not appear to be bound on
>>> port
>>> > 4040 (or any other) but the application runs perfectly and finishes
>>> giving
>>> > the expected answer.
>>> >
>>> > And don't know why, but if I restart all the workers at once sometimes
>>> it
>>> > begins to work and sometimes it doesn't.
>>> >
>>> > In the driver logs, when it fails to start the SparkUI I see some these
>>> > lines:
>>> > 16/07/28 16:13:37 INFO Utils: Successfully started service 'SparkUI'
>>> on port
>>> > 4040.
>>> > 16/07/28 16:13:37 INFO SparkUI: Started SparkUI at
>>> http://hostname-00:4040
>>> >
>>> > but nothing running in those ports.
>>> >
>>> > I'm attaching the full driver log in which I've activated jetty logs on
>>> > DEBUG but couldn't find anything.
>>> >
>>> > The only properties that I'm not leaving at default at the
>>> configuration is
>>> > the SPARK_PUBLIC_DNS=$(hostname), SPARK_WORKER_CORES and
>>> SPARK_WORKER_MEMORY
>>> >
>>> > Have anyone faced something similar?
>>> >
>>> > Thanks
>>> >
>>> >
>>> > -
>>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: Problems initializing SparkUI

2016-08-01 Thread Maximiliano Patricio Méndez
Hi,

Thanks for the answers.

@Jacek: To verify whether the UI is up, I log in to all the worker nodes of my
cluster and run netstat -nltp | grep 4040 with no result. The driver log tells
me on which server and port the Spark UI should be up, but it isn't.


@Mich: I've tried to specify spark.ui.port=nnn, but I only manage to change
the log message, which then reports that the UI should be on another port.

The UI has no problem starting on that port (4040) when I run my
application in client mode.

Could there be a network issue making the UI fail silently? I've read
some of the code behind those parts of the driver log, but couldn't find
anything weird.

2016-07-29 19:45 GMT-03:00 Mich Talebzadeh :

> why chance it. Best to explicitly specify in spark-submit (or whatever)
> which port to listen to
>
>  --conf "spark.ui.port=nnn"
>
> and see if it works
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 July 2016 at 23:37, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> I'm curious about "For some reason, sometimes the SparkUI does not
>> appear to be bound on port 4040 (or any other) but the application
>> runs perfectly and finishes giving the expected answer." How do you
>> check that web UI listens to the port 4040?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Thu, Jul 28, 2016 at 11:37 PM, Maximiliano Patricio Méndez
>>  wrote:
>> > Hi,
>> >
>> > I'm having some trouble trying to submit an application to my spark
>> cluster.
>> > For some reason, sometimes the SparkUI does not appear to be bound on
>> port
>> > 4040 (or any other) but the application runs perfectly and finishes
>> giving
>> > the expected answer.
>> >
>> > And don't know why, but if I restart all the workers at once sometimes
>> it
>> > begins to work and sometimes it doesn't.
>> >
>> > In the driver logs, when it fails to start the SparkUI I see some these
>> > lines:
>> > 16/07/28 16:13:37 INFO Utils: Successfully started service 'SparkUI' on
>> port
>> > 4040.
>> > 16/07/28 16:13:37 INFO SparkUI: Started SparkUI at
>> http://hostname-00:4040
>> >
>> > but nothing running in those ports.
>> >
>> > I'm attaching the full driver log in which I've activated jetty logs on
>> > DEBUG but couldn't find anything.
>> >
>> > The only properties that I'm not leaving at default at the
>> configuration is
>> > the SPARK_PUBLIC_DNS=$(hostname), SPARK_WORKER_CORES and
>> SPARK_WORKER_MEMORY
>> >
>> > Have anyone faced something similar?
>> >
>> > Thanks
>> >
>> >
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: spark 2.0 readStream from a REST API

2016-08-01 Thread Amit Sela
I think you're missing:

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

Did it help?

On Mon, Aug 1, 2016 at 2:44 PM Jacek Laskowski  wrote:

> On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali
>  wrote:
>
> > the problem now is that when I consume the dataframe for example with
> count
> > I get the stack trace below.
>
> Mind sharing the entire pipeline?
>
> > I followed the implementation of TextSocketSourceProvider to implement my
> > data source and Text Socket source is used in the official documentation
> > here.
>
> Right. Completely forgot about the provider. Thanks for reminding me about
> it!
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: sampling operation for DStream

2016-08-01 Thread Cody Koeninger
Can you keep a queue per executor in memory?
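
Something along these lines (a sketch, untested; the stream name, element type
and reservoir size are illustrative) keeps a bounded Algorithm R reservoir per
partition inside transform(), so nothing unbounded accumulates on the driver:

import scala.collection.mutable.ArrayBuffer
import scala.util.Random

val k = 1000  // reservoir size per partition (illustrative)

val sampled = stream.transform { rdd =>          // stream: DStream[String], illustrative
  rdd.mapPartitions { iter =>
    val reservoir = new ArrayBuffer[String](k)
    var n = 0
    iter.foreach { item =>
      n += 1
      if (reservoir.size < k) {
        reservoir += item                        // fill the reservoir first
      } else {
        val j = Random.nextInt(n)                // classic Algorithm R replacement
        if (j < k) reservoir(j) = item
      }
    }
    reservoir.iterator
  }
}

Note that this reservoir is rebuilt for every batch/window; a reservoir that
has to survive across batches would need updateStateByKey or an external store
instead of driver memory.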

On Mon, Aug 1, 2016 at 11:24 AM, Martin Le  wrote:
> Hi Cody and all,
>
> Thank you for your answer. I implement simple random sampling (SRS) for
> DStream using transform method, and it works fine.
> However, I have a problem when I implement reservoir sampling (RS). In RS, I
> need to maintain a reservoir (a queue) to store selected data items (RDDs).
> If I define a large stream window, the queue also increases  and it leads to
> the driver run out of memory.  I explain my problem in detail here:
> https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok
>
> Could you please give me some suggestions or advice to fix this problem?
>
> Thanks
>
> On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger  wrote:
>>
>> Most stream systems you're still going to incur the cost of reading
>> each message... I suppose you could rotate among reading just the
>> latest messages from a single partition of a Kafka topic if they were
>> evenly balanced.
>>
>> But once you've read the messages, nothing's stopping you from
>> filtering most of them out before doing further processing.  The
>> dstream .transform method will let you do any filtering / sampling you
>> could have done on an rdd.
>>
>> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le 
>> wrote:
>> > Hi all,
>> >
>> > I have to handle high-speed rate data stream. To reduce the heavy load,
>> > I
>> > want to use sampling techniques for each stream window. It means that I
>> > want
>> > to process a subset of data instead of whole window data. I saw Spark
>> > support sampling operations for RDD, but for DStream, Spark supports
>> > sampling operation as well? If not,  could you please give me a
>> > suggestion
>> > how to implement it?
>> >
>> > Thanks,
>> > Martin
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Koert Kuipers
We share a single SparkSession across tests, and they can run in
parallel. It is pretty fast.

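For example, something like this (a sketch; the shared session is simply left
to die with the test JVM rather than being stopped per suite):

import org.apache.spark.sql.SparkSession

// Every suite that mixes this in gets the same local SparkSession, because
// getOrCreate() returns the already-running one, so suites can run in parallel.
trait SharedSparkSession {
  @transient lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("unit-tests")
    .getOrCreate()
}
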
On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson 
wrote:

> Hi,
>
> Right now, if any code uses DataFrame/Dataset, I need a test setup that
> brings up a local master as in this article
> 
> .
>
> That's a lot of overhead for unit testing and the tests can't run in
> parallel, so testing is slow -- this is more like what I'd call an
> integration test.
>
> Do people have any tricks to get around this? Maybe using spy mocks on
> fake DataFrame/Datasets?
>
> Anyone know if there are plans to make more traditional unit testing
> possible with Spark SQL, perhaps with a stripped down in-memory
> implementation? (I admit this does seem quite hard since there's so much
> functionality in these classes!)
>
> Thanks!
>
> - Everett
>
>


Re: sampling operation for DStream

2016-08-01 Thread Martin Le
Hi Cody and all,

Thank you for your answer. I implemented simple random sampling (SRS) for
DStream using the transform method, and it works fine.
However, I have a problem when I implement reservoir sampling (RS). In RS,
I need to maintain a reservoir (a queue) to store the selected data items
(RDDs). If I define a large stream window, the queue keeps growing and it
leads to the driver running out of memory. I explain my problem in detail
here:
https://docs.google.com/document/d/1YBV_eHH6U_dVF1giiajuG4ayoVN5R8Qw3IthoKvW5ok

Could you please give me some suggestions or advice to fix this problem?

Thanks

On Fri, Jul 29, 2016 at 6:28 PM, Cody Koeninger  wrote:

> Most stream systems you're still going to incur the cost of reading
> each message... I suppose you could rotate among reading just the
> latest messages from a single partition of a Kafka topic if they were
> evenly balanced.
>
> But once you've read the messages, nothing's stopping you from
> filtering most of them out before doing further processing.  The
> dstream .transform method will let you do any filtering / sampling you
> could have done on an rdd.
>
> On Fri, Jul 29, 2016 at 9:57 AM, Martin Le 
> wrote:
> > Hi all,
> >
> > I have to handle high-speed rate data stream. To reduce the heavy load, I
> > want to use sampling techniques for each stream window. It means that I
> want
> > to process a subset of data instead of whole window data. I saw Spark
> > support sampling operations for RDD, but for DStream, Spark supports
> > sampling operation as well? If not,  could you please give me a
> suggestion
> > how to implement it?
> >
> > Thanks,
> > Martin
>


Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Jestin Ma
Hi Nikolay, I'm looking at data locality improvements for Spark, and I have
conflicting sources on using YARN for Spark.

Reynold said that Spark workers automatically take care of data locality
here:
https://www.quora.com/Does-Apache-Spark-take-care-of-data-locality-when-Spark-workers-load-data-from-HDFS

However, I've read elsewhere (
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/yarn/)
that Spark on YARN increases data locality because YARN tries to place
tasks next to HDFS blocks.

Can anyone verify/support one side or the other?

Thank you,
Jestin

On Mon, Aug 1, 2016 at 1:15 AM, Nikolay Zhebet  wrote:

> Hi.
> Maybe "data locality" can help you.
> If you use groupBy and joins, then most likely you will see a lot of
> network operations. This can be very slow. You can try to prepare and
> transform your information in a way that minimizes transporting temporary
> information between worker nodes.
>
> Try googling "data locality in Hadoop".
>
>
> 2016-08-01 4:41 GMT+03:00 Jestin Ma :
>
>> It seems that the number of tasks being this large do not matter. Each
>> task was set default by the HDFS as 128 MB (block size) which I've heard to
>> be ok. I've tried tuning the block (task) size to be larger and smaller to
>> no avail.
>>
>> I tried coalescing to 50 but that introduced large data skew and slowed
>> down my job a lot.
>>
>> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich 
>> wrote:
>>
>>> 15000 seems like a lot of tasks for that size. Test it out with a
>>> .coalesce(50) placed right after loading the data. It will probably either
>>> run faster or crash with out of memory errors.
>>>
>>> On Jul 29, 2016, at 9:02 AM, Jestin Ma 
>>> wrote:
>>>
>>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>>> leading to about 15000 tasks.
>>>
>>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>>> I'm performing groupBy, count, and an outer-join with another DataFrame
>>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>>> to disk.
>>>
>>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>>
>>> I read on the Spark Tuning guide that:
>>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>>
>>> This means that I should have about 30-50 tasks instead of 15000, and
>>> each task would be much bigger in size. Is my understanding correct, and is
>>> this suggested? I've read from difference sources to decrease or increase
>>> parallelism, or even keep it default.
>>>
>>> Thank you for your help,
>>> Jestin
>>>
>>>
>>>
>>
>


Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Everett Anderson
Hi,

Right now, if any code uses DataFrame/Dataset, I need a test setup that
brings up a local master as in this article

.

That's a lot of overhead for unit testing and the tests can't run in
parallel, so testing is slow -- this is more like what I'd call an
integration test.

Do people have any tricks to get around this? Maybe using spy mocks on fake
DataFrame/Datasets?

Anyone know if there are plans to make more traditional unit testing
possible with Spark SQL, perhaps with a stripped down in-memory
implementation? (I admit this does seem quite hard since there's so much
functionality in these classes!)

Thanks!

- Everett


Re: Possible to push sub-queries down into the DataSource impl?

2016-08-01 Thread Timothy Potter
yes, that's exactly what I was looking for, thanks for the pointer ;-)

On Thu, Jul 28, 2016 at 1:07 AM, Takeshi Yamamuro  wrote:
> Hi,
>
> Have you seen this ticket?
> https://issues.apache.org/jira/browse/SPARK-12449
>
> // maropu
>
> On Thu, Jul 28, 2016 at 2:13 AM, Timothy Potter 
> wrote:
>>
>> I'm not looking for a one-off solution for a specific query that can
>> be solved on the client side as you suggest, but rather a generic
>> solution that can be implemented within the DataSource impl itself
>> when it knows a sub-query can be pushed down into the engine. In other
>> words, I'd like to intercept the query planning process to be able to
>> push-down computation into the engine when it makes sense.
>>
>> On Wed, Jul 27, 2016 at 8:04 AM, Marco Colombo
>>  wrote:
>> > Why don't you create a dataframe filtered, map it as temporary table and
>> > then use it in your query? You can also cache it, of multiple queries on
>> > the
>> > same inner queries are requested.
>> >
>> >
>> > Il mercoledì 27 luglio 2016, Timothy Potter  ha
>> > scritto:
>> >>
>> >> Take this simple join:
>> >>
>> >> SELECT m.title as title, solr.aggCount as aggCount FROM movies m INNER
>> >> JOIN (SELECT movie_id, COUNT(*) as aggCount FROM ratings WHERE rating
>> >> >= 4 GROUP BY movie_id ORDER BY aggCount desc LIMIT 10) as solr ON
>> >> solr.movie_id = m.movie_id ORDER BY aggCount DESC
>> >>
>> >> I would like the ability to push the inner sub-query aliased as "solr"
>> >> down into the data source engine, in this case Solr as it will
>> >> greatlly reduce the amount of data that has to be transferred from
>> >> Solr into Spark. I would imagine this issue comes up frequently if the
>> >> underlying engine is a JDBC data source as well ...
>> >>
>> >> Is this possible? Of course, my example is a bit cherry-picked so
>> >> determining if a sub-query can be pushed down into the data source
>> >> engine is probably not a trivial task, but I'm wondering if Spark has
>> >> the hooks to allow me to try ;-)
>> >>
>> >> Cheers,
>> >> Tim
>> >>
>> >> -
>> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Ing. Marco Colombo
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>
>
> --
> ---
> Takeshi Yamamuro

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2 and Solr

2016-08-01 Thread Andrés Ivaldi
Hello, does anyone know if Spark 2.0 will have a Solr connector?
Lucidworks has one, but it is not available yet for Spark 2.0.


thanks!!


Need Advice: Spark-Streaming Setup

2016-08-01 Thread David Kaufman
Hi,

I'm currently working on a Spark/HBase setup which processes log files (~10
GB/day). These log files are persisted hourly on n > 10 application servers
and copied to a 4-node HDFS.

Our current spark-job aggregates single visits (based on a session-uuid)
across all application-servers on a daily basis. Visits are filtered (only
about 1% of data remains) and stored in an HBase for further processing.

Currently there is no use of the Spark-Streaming API, i.e. a cronjob runs
every day and fires the visit calculation.

Questions
1) Is it really necessary to store the log files in HDFS, or can Spark
somehow read the files from a local file system and distribute the data to
the other nodes? Rationale: the data is (probably) only read once during
the visit calculation, which defies the purpose of a DFS.

2) If the raw log files have to be in the HDFS, I have to remove the files
from the HDFS after processing them, so COPY -> PROCESS -> REMOVE. Is this
the way to go?

3) Before I can process the visits for an hour, I have to wait until the log
files of all application servers have been copied to HDFS. It doesn't
seem like StreamingContext.fileStream can wait for more sophisticated
patterns, e.g. ("context*/logs-2016-08-01-15"). Do you guys have a
recommendation to solve this problem? One possible solution: after the
files have been copied, create an additional file that indicates to Spark that
all files are available?

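One possible direction for 3) (very much a sketch, untested; the marker name,
paths and the ssc variable are illustrative): fileStream has an overload that
takes a Path filter, so the copy job could drop a _COMPLETE marker into each
hour directory once all servers have been copied, and the filter only admits
files whose directory already has that marker:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// true only for real log files sitting in a directory that has the marker
def hourIsComplete(path: Path): Boolean = {
  val fs = path.getFileSystem(new Configuration())
  !path.getName.startsWith("_") &&
    fs.exists(new Path(path.getParent, "_COMPLETE"))
}

val logs = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///logs", (p: Path) => hourIsComplete(p), newFilesOnly = true)
  .map(_._2.toString)
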
If you have any questions, please don't hesitate to ask.

Thanks,
David


Re: Windows operation orderBy desc

2016-08-01 Thread Mich Talebzadeh
You need to use the desc function rather than calling .desc on a String:


val wSpec = Window.partitionBy("col1").orderBy(desc("col2"))
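
For completeness, a minimal sketch with the imports (either form should compile):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc}

// "col2".desc fails because desc is not a String method; use one of these:
val w1 = Window.partitionBy("col1").orderBy(desc("col2"))
val w2 = Window.partitionBy("col1").orderBy(col("col2").desc)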

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 14:56, Ashok Kumar  wrote:

> Hi,
>
> in the following Window spec I want orderBy ("") to be displayed
> in descending order please
>
> val W = Window.partitionBy("col1").orderBy("col2")
>
> If I Do
>
> val W = Window.partitionBy("col1").orderBy("col2".desc)
>
> It throws error
>
> console>:26: error: value desc is not a member of String
>
> How can I achieve that?
>
> Thanking you
>


Windows operation orderBy desc

2016-08-01 Thread Ashok Kumar
Hi,

In the following Window spec I want the orderBy ("") to be displayed in
descending order, please:

val W = Window.partitionBy("col1").orderBy("col2")

If I do

val W = Window.partitionBy("col1").orderBy("col2".desc)

it throws an error:

<console>:26: error: value desc is not a member of String

How can I achieve that?

Thanking you

Re: Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-08-01 Thread Nikolay Zhebet
Your exception says that you have connection trouble with the Spark master.

Check whether it is reachable from the environment where you are trying to run
the job. On a Linux system, suitable commands for this are: "telnet 127.0.0.1
7077", "netstat -ntpl | grep 7077" or "nmap 127.0.0.1 | grep 7077".

Try the Windows equivalents of these commands and check whether the Spark
master is reachable from your environment.

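For example, rough Windows equivalents (telnet needs the optional Telnet
Client feature enabled):

netstat -ano | findstr :7077
telnet localhost 7077
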
2016-08-01 14:35 GMT+03:00 ayan guha :

> No I confirmed master is running by spark ui at localhost:8080
> On 1 Aug 2016 18:22, "Nikolay Zhebet"  wrote:
>
>> I think you haven't run spark master yet, or maybe port 7077 is not yours
>> default port for spark master.
>>
>> 2016-08-01 4:24 GMT+03:00 ayan guha :
>>
>>> Hi
>>>
>>> I just downloaded Spark 2.0 on my windows 7 to check it out. However,
>>> not able to set up a standalone cluster:
>>>
>>>
>>> Step 1: master set up (Successful)
>>>
>>> bin/spark-class org.apache.spark.deploy.master.Master
>>>
>>> It did throw an error about not able to find winutils, but started
>>> successfully.
>>>
>>> Step II: Set up Worker (Failed)
>>>
>>> bin/spark-class org.apache.spark.deploy.worker.Worker
>>> spark://localhost:7077
>>>
>>> This step fails with following error:
>>>
>>> 16/08/01 11:21:27 INFO Worker: Connecting to master localhost:7077...
>>> 16/08/01 11:21:28 WARN Worker: Failed to connect to master localhost:7077
>>> org.apache.spark.SparkException: Exception thrown in awaitResult
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
>>> la:77)
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
>>> la:75)
>>> at
>>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.s
>>> cala:36)
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
>>> rElse(RpcTimeout.scala:59)
>>> at
>>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
>>> rElse(RpcTimeout.scala:59)
>>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>>> at
>>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>>> at
>>> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>>> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>>> at
>>> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deplo
>>> y$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
>>> Source)
>>> at java.util.concurrent.FutureTask.run(Unknown Source)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
>>> Source)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
>>> Source)
>>> at java.lang.Thread.run(Unknown Source)
>>> Caused by: java.io.IOException: Failed to connect to localhost/
>>> 127.0.0.1:7077
>>> at
>>> org.apache.spark.network.client.TransportClientFactory.createClient(T
>>> ransportClientFactory.java:228)
>>> at
>>> org.apache.spark.network.client.TransportClientFactory.createClient(T
>>> ransportClientFactory.java:179)
>>> at
>>> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala
>>> :197)
>>> at
>>> org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
>>> at
>>> org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
>>> ... 4 more
>>> Caused by: java.net.ConnectException: Connection refused: no further
>>> information
>>> : localhost/127.0.0.1:7077
>>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>>> at
>>> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocke
>>> tChannel.java:224)
>>> at
>>> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConne
>>> ct(AbstractNioChannel.java:289)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jav
>>> a:528)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEve
>>> ntLoop.java:468)
>>> at
>>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.ja
>>> va:382)
>>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>>> at
>>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
>>> EventExecutor.java:111)
>>> ... 1 more
>>>
>>> Am I doing something wrong?
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>


Re: spark 2.0 readStream from a REST API

2016-08-01 Thread Jacek Laskowski
On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali
 wrote:

> the problem now is that when I consume the dataframe for example with count
> I get the stack trace below.

Mind sharing the entire pipeline?

> I followed the implementation of TextSocketSourceProvider to implement my
> data source and Text Socket source is used in the official documentation
> here.

Right. Completely forgot about the provider. Thanks for reminding me about it!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-08-01 Thread ayan guha
No I confirmed master is running by spark ui at localhost:8080
On 1 Aug 2016 18:22, "Nikolay Zhebet"  wrote:

> I think you haven't run spark master yet, or maybe port 7077 is not yours
> default port for spark master.
>
> 2016-08-01 4:24 GMT+03:00 ayan guha :
>
>> Hi
>>
>> I just downloaded Spark 2.0 on my windows 7 to check it out. However, not
>> able to set up a standalone cluster:
>>
>>
>> Step 1: master set up (Successful)
>>
>> bin/spark-class org.apache.spark.deploy.master.Master
>>
>> It did throw an error about not able to find winutils, but started
>> successfully.
>>
>> Step II: Set up Worker (Failed)
>>
>> bin/spark-class org.apache.spark.deploy.worker.Worker
>> spark://localhost:7077
>>
>> This step fails with following error:
>>
>> 16/08/01 11:21:27 INFO Worker: Connecting to master localhost:7077...
>> 16/08/01 11:21:28 WARN Worker: Failed to connect to master localhost:7077
>> org.apache.spark.SparkException: Exception thrown in awaitResult
>> at
>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
>> la:77)
>> at
>> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
>> la:75)
>> at
>> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.s
>> cala:36)
>> at
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
>> rElse(RpcTimeout.scala:59)
>> at
>> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
>> rElse(RpcTimeout.scala:59)
>> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
>> at
>> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
>> at
>> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
>> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
>> at
>> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deplo
>> y$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
>> Source)
>> at java.util.concurrent.FutureTask.run(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
>> Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
>> Source)
>> at java.lang.Thread.run(Unknown Source)
>> Caused by: java.io.IOException: Failed to connect to localhost/
>> 127.0.0.1:7077
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(T
>> ransportClientFactory.java:228)
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(T
>> ransportClientFactory.java:179)
>> at
>> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala
>> :197)
>> at
>> org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
>> at
>> org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
>> ... 4 more
>> Caused by: java.net.ConnectException: Connection refused: no further
>> information
>> : localhost/127.0.0.1:7077
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>> at
>> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocke
>> tChannel.java:224)
>> at
>> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConne
>> ct(AbstractNioChannel.java:289)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jav
>> a:528)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEve
>> ntLoop.java:468)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.ja
>> va:382)
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> at
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
>> EventExecutor.java:111)
>> ... 1 more
>>
>> Am I doing something wrong?
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-08-01 Thread Ted Yu
Have you seen the following ?
http://stackoverflow.com/questions/27553547/xloggc-not-creating-log-file-if-path-doesnt-exist-for-the-first-time
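
If the relative path is the culprit, as in the linked question, one sketch of a workaround is to point -Xloggc at an absolute directory that already exists on every executor host. This is the same thing the quoted --conf flag does, written as Scala SparkConf code; /tmp/executor_gc.log is only a placeholder path:

import org.apache.spark.SparkConf

// Same JVM options as in the quoted --conf line, but with an absolute,
// pre-existing log path instead of ./jvm_gc.log (placeholder path).
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps " +
    "-XX:+PrintGCDateStamps -Xloggc:/tmp/executor_gc.log")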

On Sat, Jul 23, 2016 at 5:18 PM, Ascot Moss  wrote:

> I tried to add -Xloggc:./jvm_gc.log
>
> --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -Xloggc:./jvm_gc.log -XX:+PrintGCDateStamps"
>
> however, I could not find ./jvm_gc.log
>
> How to resolve the OOM and gc log issue?
>
> Regards
>
> On Sun, Jul 24, 2016 at 6:37 AM, Ascot Moss  wrote:
>
>> My JDK is Java 1.8 u40
>>
>> On Sun, Jul 24, 2016 at 3:45 AM, Ted Yu  wrote:
>>
>>> Since you specified +PrintGCDetails, you should be able to get some
>>> more detail from the GC log.
>>>
>>> Also, which JDK version are you using ?
>>>
>>> Please use Java 8 where G1GC is more reliable.
>>>
>>> On Sat, Jul 23, 2016 at 10:38 AM, Ascot Moss 
>>> wrote:
>>>
 Hi,

 I added the following parameter:

 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
 -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
 -XX:InitiatingHeapOccupancyPercent=70 -XX:+PrintGCDetails
 -XX:+PrintGCTimeStamps"

 Still got Java heap space error.

 Any idea to resolve?  (my spark is 1.6.1)


 16/07/23 23:31:50 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID
 22, n1791): java.lang.OutOfMemoryError: Java heap space   at
 scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)

 at
 scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:248)

 at
 org.apache.spark.util.collection.CompactBuffer.toArray(CompactBuffer.scala:30)

 at
 org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findSplits$1(DecisionTree.scala:1009)
 at
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)

 at
 org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)

 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

 at scala.collection.Iterator$class.foreach(Iterator.scala:727)

 at
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

 at
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)

 at
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)

 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)

 at scala.collection.AbstractIterator.to(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)

 at
 scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)

 at
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)

 at
 scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

 at
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

 at
 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

 at
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)

 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

 at org.apache.spark.scheduler.Task.run(Task.scala:89)

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 at java.lang.Thread.run(Thread.java:745)

 Regards



 On Sat, Jul 23, 2016 at 9:49 AM, Ascot Moss 
 wrote:

> Thanks. Trying with extra conf now.
>
> On Sat, Jul 23, 2016 at 6:59 AM, RK Aduri 
> wrote:
>
>> I can see large number of collections happening on driver and
>> eventually, driver is running out of memory. ( am not sure whether you 
>> have
>> persisted any rdd or data frame). May be you would want to avoid doing so
>> many collections or persist unwanted data in memory.
>>
>> To begin with, you may want to re-run the job with this following
>> config: --conf 

Testing --supervise flag

2016-08-01 Thread Noorul Islam K M

Hi all,

I was trying to test --supervise flag of spark-submit.

The documentation [1] says that, the flag helps in restarting your
application automatically if it exited with non-zero exit code.

I am looking for some clarification on that documentation. In this
context, does application means the driver?

Will the driver be re-launched if an exception is thrown by the
application? I tested this scenario and the driver is not re-launched.

~/spark-1.6.1/bin/spark-submit --deploy-mode cluster --master 
spark://10.29.83.162:6066 --class 
org.apache.spark.examples.ExceptionHandlingTest 
/home/spark/spark-1.6.1/lib/spark-examples-1.6.1-hadoop2.6.0.jar

I killed the driver java process using 'kill -9' command and the driver
is re-launched. 

Is this the only scenario were driver will be re-launched? Is there a
way to simulate non-zero exit code and test the use of --supervise flag?

Regards,
Noorul

[1] 
http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
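
One rough way to simulate a non-zero exit code (a sketch, not an official recipe) is to submit a trivial app whose main ends with sys.exit(1), so the driver JVM itself returns a failure status rather than just logging an exception; whether --supervise then relaunches it is exactly what the test would verify:

import org.apache.spark.{SparkConf, SparkContext}

// A deliberately failing job for exercising --supervise: the driver JVM
// exits with status 1, which is the condition the documentation describes.
object AlwaysFails {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("always-fails"))
    sc.parallelize(1 to 10).count()   // do a little work first
    sc.stop()
    sys.exit(1)                       // non-zero exit code from the driver
  }
}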

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: JettyUtils.createServletHandler Method not Found?

2016-08-01 Thread Ted Yu
Original discussion was about Spark 1.3

Which Spark release are you using ?

Cheers

On Mon, Aug 1, 2016 at 1:37 AM, bg_spark <1412743...@qq.com> wrote:

> hello,I have the same problem like you,  how do you solve the problem?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JettyUtils-createServletHandler-Method-not-Found-tp22262p27446.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


java.net.UnknownHostException

2016-08-01 Thread pseudo oduesp
hi
I get the following errors when I try using pyspark 2.0 with ipython on
yarn.
Can someone help me please?
java.lang.IllegalArgumentException: java.net.UnknownHostException:
s001.bigdata.;s003.bigdata;s008bigdata.
at
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at
org.apache.hadoop.crypto.key.kms.KMSClientProvider.getDelegationTokenService(KMSClientProvider.java:823)
at
org.apache.hadoop.crypto.key.kms.KMSClientProvider.addDelegationTokens(KMSClientProvider.java:779)
at
org.apache.hadoop.crypto.key.KeyProviderDelegationTokenExtension.addDelegationTokens(KeyProviderDelegationTokenExtension.java:86)
at
org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2046)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:133)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$obtainTokensForNamenodes$1.apply(YarnSparkHadoopUtil.scala:130)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:94)
at
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokensForNamenodes(YarnSparkHadoopUtil.scala:130)
at
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:367)
at
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:834)
at
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:149)
at org.apache.spark.SparkContext.(SparkContext.scala:500)
at
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
at
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:236)
at
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Caused by:
java.net.UnknownHostException:s001.bigdata.;s003.bigdata;s008bigdata.


thanks


Re: multiple spark streaming contexts

2016-08-01 Thread Nikolay Zhebet
You can always save data in HDFS wherever you need, and you can control
parallelism in your app by configuring --driver-cores and --driver-memory. This
approach keeps the Spark master in charge, so it can handle your failure issues,
data locality, etc. But if you want to control it yourself with
"Executors.newFixedThreadPool(threadNum)" or other means, I think you can
run into problems with the yarn/mesos job recovery and failure mechanism.
I wish you good luck in your struggle with parallelism )) This is an
interesting question!)
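
For the "one topic, one HDFS location" requirement mentioned below, a possible middle ground is a single StreamingContext (hence one shared batch interval) with one DStream per topic, each written to its own directory. A rough sketch against the Spark 1.x Kafka direct API; the broker address, topic names, paths and 30-second interval are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PerTopicToHdfs {
  def main(args: Array[String]): Unit = {
    // One StreamingContext, so one shared batch interval for all topics.
    val ssc = new StreamingContext(new SparkConf().setAppName("per-topic-sink"), Seconds(30))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker

    // One direct stream per topic; each topic is persisted to its own HDFS directory.
    Seq("topicA", "topicB").foreach { topic =>
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set(topic))
      stream.map(_._2).saveAsTextFiles(s"hdfs:///data/$topic/part", "txt")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}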

2016-08-01 10:41 GMT+03:00 Sumit Khanna :

> Hey Nikolay,
>
> I know the approach, but this pretty much doesnt fit the bill for my
> usecase wherein each topic needs to be logged / persisted as a separate
> hdfs location.
>
> I am looking for something where a streaming context pertains to a topic
> and that topic only, and was wondering if I could have them all in parallel
> in one app / jar run.
>
> Thanks,
>
> On Mon, Aug 1, 2016 at 1:08 PM, Nikolay Zhebet  wrote:
>
>> Hi, If you want read several kafka topics in spark-streaming job, you can
>> set names of topics splited by coma and after that you can read all
>> messages from all topics in one flow:
>>
>> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>>
>> val lines = KafkaUtils.createStream[String, String, StringDecoder, 
>> StringDecoder](ssc, kafkaParams, topicMap, 
>> StorageLevel.MEMORY_ONLY).map(_._2)
>>
>>
>> After that you can use ".filter" function for splitting your topics and 
>> iterate messages separately.
>>
>> val orders_paid = lines.filter(x => { x("table_name") == 
>> "kismia.orders_paid"})
>>
>> orders_paid.foreachRDD( rdd => { 
>>
>>
>> Or you can you you if..else construction for splitting your messages by
>> names in foreachRDD:
>>
>> lines.foreachRDD((recrdd, time: Time) => {
>>
>>recrdd.foreachPartition(part => {
>>
>>   part.foreach(item_row => {
>>
>>  if (item_row("table_name") == "kismia.orders_paid") { ...} else if 
>> (...) {...}
>>
>> 
>>
>>
>> 2016-08-01 9:39 GMT+03:00 Sumit Khanna :
>>
>>> Any ideas guys? What are the best practices for multiple streams to be
>>> processed?
>>> I could trace a few Stack overflow comments wherein they better
>>> recommend a jar separate for each stream / use case. But that isn't pretty
>>> much what I want, as in it's better if one / multiple spark streaming
>>> contexts can all be handled well within a single jar.
>>>
>>> Guys please reply,
>>>
>>> Awaiting,
>>>
>>> Thanks,
>>> Sumit
>>>
>>> On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna 
>>> wrote:
>>>
 Any ideas on this one guys ?

 I can do a sample run but can't be sure of imminent problems if any?
 How can I ensure different batchDuration etc etc in here, per
 StreamingContext.

 Thanks,

 On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna 
 wrote:

> Hey,
>
> Was wondering if I could create multiple spark stream contexts in my
> application (e.g instantiating a worker actor per topic and it has its own
> streaming context its own batch duration everything).
>
> What are the caveats if any?
> What are the best practices?
>
> Have googled half heartedly on the same but the air isn't pretty much
> demystified yet. I could skim through something like
>
>
> http://stackoverflow.com/questions/29612726/how-do-you-setup-multiple-spark-streaming-jobs-with-different-batch-durations
>
>
> http://stackoverflow.com/questions/37006565/multiple-spark-streaming-contexts-on-one-worker
>
> Thanks in Advance!
> Sumit
>


>>>
>>
>


Re: spark 2.0 readStream from a REST API

2016-08-01 Thread Ayoub Benali
Hello,

using the full class name worked, thanks.

the problem now is that when I consume the dataframe for example with count
I get the stack trace below.

I followed the implementation of TextSocketSourceProvider to implement my
data source, and the Text Socket source is used in the official documentation
here.

Why does count work in the example documentation? Is there some other trait
that needs to be implemented?

Thanks,
Ayoub.


org.apache.spark.sql.AnalysisException: Queries with streaming sources must
> be executed with writeStream.start();
> at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:173)
> at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:33)
> at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:31)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:125)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:125)
> at
> org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:31)
> at
> org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:59)
> at
> org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:70)
> at
> org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:68)
> at
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:74)
> at
> org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:74)
> at
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
> at
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
> at
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
> at
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
> at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2541)
> at org.apache.spark.sql.Dataset.count(Dataset.scala:2216)
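
For what it's worth, in the documented example the count is a streaming aggregation that is itself started through writeStream, rather than a batch count() on the streaming Dataset (which is what triggers the exception above). A sketch of that shape, reusing the "mysource" format name from this thread and the spark-shell's spark session:

// Batch-style actions such as count() raise the AnalysisException above;
// a streaming Dataset has to be consumed through a started streaming query.
val streamingDF = spark.readStream.format("mysource").load()

val query = streamingDF
  .groupBy()            // running (streaming) count instead of a batch count()
  .count()
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()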







2016-07-31 21:56 GMT+02:00 Michael Armbrust :

> You have to add a file in resource too (example
> ).
> Either that or give a full class name.
>
> On Sun, Jul 31, 2016 at 9:45 AM, Ayoub Benali  > wrote:
>
>> Looks like the way to go in spark 2.0 is to implement
>> StreamSourceProvider
>> 
>>  with DataSourceRegister
>> .
>> But now spark fails at loading the class when doing:
>>
>> spark.readStream.format("mysource").load()
>>
>> I get :
>>
>> java.lang.ClassNotFoundException: Failed to find data source: mysource.
>> Please find packages at http://spark-packages.org
>>
>> Is there something I need to do in order to "load" the Stream source
>> provider ?
>>
>> Thanks,
>> Ayoub
>>
>> 2016-07-31 17:19 GMT+02:00 Jacek Laskowski :
>>
>>> On Sun, Jul 31, 2016 at 12:53 PM, Ayoub Benali
>>>  wrote:
>>>
>>> > I started playing with the Structured Streaming API in spark 2.0 and I
>>> am
>>> > looking for a way to create streaming Dataset/Dataframe from a rest
>>> HTTP
>>> > endpoint but I am bit stuck.
>>>
>>> What a great idea! Why did I myself not think about this?!?!
>>>
>>> > What would be the easiest way to hack around it ? Do I need to
>>> implement the
>>> > Datasource API ?
>>>
>>> Yes and perhaps Hadoop API too, but not sure which one exactly since I
>>> haven't even thought about it (not even once).
>>>
>>> > Are there examples on how to create a DataSource from a REST endpoint ?
>>>
>>> Never heard of one.
>>>
>>> I'm hosting a Spark/Scala meetup this week so I'll definitely propose
>>> it as a topic. Thanks a lot!
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> 

Re: Getting error, when I do df.show()

2016-08-01 Thread Saisai Shao
>
> java.lang.NoClassDefFoundError: spray/json/JsonReader
>
> at
> com.memsql.spark.pushdown.MemSQLPhysicalRDD$.fromAbstractQueryTree(MemSQLPhysicalRDD.scala:95)
>
> at
> com.memsql.spark.pushdown.MemSQLPushdownStrategy.apply(MemSQLPushdownStrategy.scala:49)
>

Looks like a memsql problem from the stack. Maybe some jars are missing
when you're using the memsql datasource.
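
Assuming spray.json.JsonReader comes from the spray-json artifact (a guess based on the package name), one sketch is to add that dependency to the application build, e.g. in sbt; the version below is only an example, and the jar could equally be passed with --jars or --packages at submit time:

// build.sbt fragment (Scala syntax): pull in spray-json so that
// spray.json.JsonReader is on the classpath at runtime.
libraryDependencies += "io.spray" %% "spray-json" % "1.3.2"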

On Mon, Aug 1, 2016 at 4:22 PM, Subhajit Purkayastha 
wrote:

> I am getting this error in the spark-shell when I do . Which jar file I
> need to download to fix this error?
>
>
>
> Df.show()
>
>
>
> Error
>
>
>
> scala> val df = msc.sql(query)
>
> df: org.apache.spark.sql.DataFrame = [id: int, name: string]
>
>
>
> scala> df.show()
>
> java.lang.NoClassDefFoundError: spray/json/JsonReader
>
> at
> com.memsql.spark.pushdown.MemSQLPhysicalRDD$.fromAbstractQueryTree(MemSQLPhysicalRDD.scala:95)
>
> at
> com.memsql.spark.pushdown.MemSQLPushdownStrategy.apply(MemSQLPushdownStrategy.scala:49)
>
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>
> at
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>
> at
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:374)
>
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>
> at
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>
> at
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
>
> at
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
>
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
>
> at
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>
> at
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>
> at
> org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
>
> at
> org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
>
> at
> org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
>
> at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
>
> at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
>
> at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
>
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:48)
>
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:57)
>
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:59)
>
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:61)
>
>
>
>
>


Re: JettyUtils.createServletHandler Method not Found?

2016-08-01 Thread bg_spark
hello, I have the same problem as you. How did you solve it?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/JettyUtils-createServletHandler-Method-not-Found-tp22262p27446.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Windows - Spark 2 - Standalone - Worker not able to connect to Master

2016-08-01 Thread Nikolay Zhebet
I think you haven't run the Spark master yet, or maybe port 7077 is not your
default port for the Spark master.

2016-08-01 4:24 GMT+03:00 ayan guha :

> Hi
>
> I just downloaded Spark 2.0 on my windows 7 to check it out. However, not
> able to set up a standalone cluster:
>
>
> Step 1: master set up (Successful)
>
> bin/spark-class org.apache.spark.deploy.master.Master
>
> It did throw an error about not able to find winutils, but started
> successfully.
>
> Step II: Set up Worker (Failed)
>
> bin/spark-class org.apache.spark.deploy.worker.Worker
> spark://localhost:7077
>
> This step fails with following error:
>
> 16/08/01 11:21:27 INFO Worker: Connecting to master localhost:7077...
> 16/08/01 11:21:28 WARN Worker: Failed to connect to master localhost:7077
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
> la:77)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.sca
> la:75)
> at
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.s
> cala:36)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
> rElse(RpcTimeout.scala:59)
> at
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyO
> rElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
> at
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deplo
> y$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.io.IOException: Failed to connect to localhost/
> 127.0.0.1:7077
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(T
> ransportClientFactory.java:228)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(T
> ransportClientFactory.java:179)
> at
> org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala
> :197)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
> ... 4 more
> Caused by: java.net.ConnectException: Connection refused: no further
> information
> : localhost/127.0.0.1:7077
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
> at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocke
> tChannel.java:224)
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConne
> ct(AbstractNioChannel.java:289)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.jav
> a:528)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEve
> ntLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.ja
> va:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThread
> EventExecutor.java:111)
> ... 1 more
>
> Am I doing something wrong?
>
>
> --
> Best Regards,
> Ayan Guha
>


Getting error, when I do df.show()

2016-08-01 Thread Subhajit Purkayastha
I am getting this error in the spark-shell when I do df.show(). Which jar file
do I need to download to fix this error?

 

Df.show()

 

Error

 

scala> val df = msc.sql(query)

df: org.apache.spark.sql.DataFrame = [id: int, name: string]

 

scala> df.show()

java.lang.NoClassDefFoundError: spray/json/JsonReader

at
com.memsql.spark.pushdown.MemSQLPhysicalRDD$.fromAbstractQueryTree(MemSQLPhy
sicalRDD.scala:95)

at
com.memsql.spark.pushdown.MemSQLPushdownStrategy.apply(MemSQLPushdownStrateg
y.scala:49)

at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPl
anner.scala:58)

at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPl
anner.scala:58)

at
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:
59)

at
org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.s
cala:54)

at
org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkSt
rategies.scala:374)

at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPl
anner.scala:58)

at
org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPl
anner.scala:58)

at
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)

at
org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:
59)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLConte
xt.scala:926)

at
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:92
4)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLCo
ntext.scala:930)

at
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala
:930)

at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution
.scala:53)

at
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)

at
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)

at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)

at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)

at
org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)

at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)

at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)

at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)

at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
iwC$$iwC$$iwC.(:48)

at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
iwC$$iwC.(:53)

at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$
iwC.(:55)

at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<
init>(:57)

at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.
(:59)

at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:61)

 

 



Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Nikolay Zhebet
Hi.
Maybe "data locality" can help you.
If you use groupBy and joins, then most likely you will see a lot of network
operations. This can be very slow. You can try to prepare and transform your
data in a way that minimizes shipping temporary data between worker nodes.

Try googling "Data locality in Hadoop".
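
Since the other side of the outer join mentioned below is only ~200 MB, one common way to cut the shuffle traffic is to broadcast the small DataFrame instead of shuffling the ~2 TB one across the network. A sketch with made-up DataFrame and column names (and note that 200 MB is on the large side for a broadcast, so keep an eye on driver and executor memory):

import org.apache.spark.sql.functions.broadcast

// Ship the small (~200 MB) DataFrame to every executor instead of
// shuffling the ~2 TB DataFrame for the join; names are placeholders.
val joined = bigDF.join(broadcast(smallDF), Seq("joinKey"), "left_outer")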


2016-08-01 4:41 GMT+03:00 Jestin Ma :

> It seems that the number of tasks being this large do not matter. Each
> task was set default by the HDFS as 128 MB (block size) which I've heard to
> be ok. I've tried tuning the block (task) size to be larger and smaller to
> no avail.
>
> I tried coalescing to 50 but that introduced large data skew and slowed
> down my job a lot.
>
> On Sun, Jul 31, 2016 at 5:27 PM, Andrew Ehrlich 
> wrote:
>
>> 15000 seems like a lot of tasks for that size. Test it out with a
>> .coalesce(50) placed right after loading the data. It will probably either
>> run faster or crash with out of memory errors.
>>
>> On Jul 29, 2016, at 9:02 AM, Jestin Ma  wrote:
>>
>> I am processing ~2 TB of hdfs data using DataFrames. The size of a task
>> is equal to the block size specified by hdfs, which happens to be 128 MB,
>> leading to about 15000 tasks.
>>
>> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
>> I'm performing groupBy, count, and an outer-join with another DataFrame
>> of ~200 MB size (~80 MB cached but I don't need to cache it), then saving
>> to disk.
>>
>> Right now it takes about 55 minutes, and I've been trying to tune it.
>>
>> I read on the Spark Tuning guide that:
>> *In general, we recommend 2-3 tasks per CPU core in your cluster.*
>>
>> This means that I should have about 30-50 tasks instead of 15000, and
>> each task would be much bigger in size. Is my understanding correct, and is
>> this suggested? I've read from difference sources to decrease or increase
>> parallelism, or even keep it default.
>>
>> Thank you for your help,
>> Jestin
>>
>>
>>
>


Re: multiple spark streaming contexts

2016-08-01 Thread Sumit Khanna
Hey Nikolay,

I know the approach, but this pretty much doesn't fit the bill for my
use case, wherein each topic needs to be logged / persisted to a separate
HDFS location.

I am looking for something where a streaming context pertains to a topic
and that topic only, and was wondering if I could have them all running in
parallel in one app / jar run.

Thanks,

On Mon, Aug 1, 2016 at 1:08 PM, Nikolay Zhebet  wrote:

> Hi, If you want read several kafka topics in spark-streaming job, you can
> set names of topics splited by coma and after that you can read all
> messages from all topics in one flow:
>
> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>
> val lines = KafkaUtils.createStream[String, String, StringDecoder, 
> StringDecoder](ssc, kafkaParams, topicMap, StorageLevel.MEMORY_ONLY).map(_._2)
>
>
> After that you can use ".filter" function for splitting your topics and 
> iterate messages separately.
>
> val orders_paid = lines.filter(x => { x("table_name") == 
> "kismia.orders_paid"})
>
> orders_paid.foreachRDD( rdd => { 
>
>
> Or you can you you if..else construction for splitting your messages by
> names in foreachRDD:
>
> lines.foreachRDD((recrdd, time: Time) => {
>
>recrdd.foreachPartition(part => {
>
>   part.foreach(item_row => {
>
>  if (item_row("table_name") == "kismia.orders_paid") { ...} else if 
> (...) {...}
>
> 
>
>
> 2016-08-01 9:39 GMT+03:00 Sumit Khanna :
>
>> Any ideas guys? What are the best practices for multiple streams to be
>> processed?
>> I could trace a few Stack overflow comments wherein they better recommend
>> a jar separate for each stream / use case. But that isn't pretty much what
>> I want, as in it's better if one / multiple spark streaming contexts can
>> all be handled well within a single jar.
>>
>> Guys please reply,
>>
>> Awaiting,
>>
>> Thanks,
>> Sumit
>>
>> On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna 
>> wrote:
>>
>>> Any ideas on this one guys ?
>>>
>>> I can do a sample run but can't be sure of imminent problems if any? How
>>> can I ensure different batchDuration etc etc in here, per StreamingContext.
>>>
>>> Thanks,
>>>
>>> On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna 
>>> wrote:
>>>
 Hey,

 Was wondering if I could create multiple spark stream contexts in my
 application (e.g instantiating a worker actor per topic and it has its own
 streaming context its own batch duration everything).

 What are the caveats if any?
 What are the best practices?

 Have googled half heartedly on the same but the air isn't pretty much
 demystified yet. I could skim through something like


 http://stackoverflow.com/questions/29612726/how-do-you-setup-multiple-spark-streaming-jobs-with-different-batch-durations


 http://stackoverflow.com/questions/37006565/multiple-spark-streaming-contexts-on-one-worker

 Thanks in Advance!
 Sumit

>>>
>>>
>>
>


Re: multiple spark streaming contexts

2016-08-01 Thread Nikolay Zhebet
Hi, if you want to read several Kafka topics in a spark-streaming job, you can
set the topic names separated by commas and after that you can read all
messages from all topics in one flow:

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

val lines = KafkaUtils.createStream[String, String, StringDecoder,
StringDecoder](ssc, kafkaParams, topicMap,
StorageLevel.MEMORY_ONLY).map(_._2)


After that you can use the ".filter" function to split your topics
and iterate over the messages separately.

val orders_paid = lines.filter(x => { x("table_name") == "kismia.orders_paid"})

orders_paid.foreachRDD( rdd => { 


Or you can use an if..else construction to split your messages by
name in foreachRDD:

lines.foreachRDD((recrdd, time: Time) => {

   recrdd.foreachPartition(part => {

  part.foreach(item_row => {

 if (item_row("table_name") == "kismia.orders_paid") { ...}
else if (...) {...}




2016-08-01 9:39 GMT+03:00 Sumit Khanna :

> Any ideas guys? What are the best practices for multiple streams to be
> processed?
> I could trace a few Stack overflow comments wherein they better recommend
> a jar separate for each stream / use case. But that isn't pretty much what
> I want, as in it's better if one / multiple spark streaming contexts can
> all be handled well within a single jar.
>
> Guys please reply,
>
> Awaiting,
>
> Thanks,
> Sumit
>
> On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna 
> wrote:
>
>> Any ideas on this one guys ?
>>
>> I can do a sample run but can't be sure of imminent problems if any? How
>> can I ensure different batchDuration etc etc in here, per StreamingContext.
>>
>> Thanks,
>>
>> On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna 
>> wrote:
>>
>>> Hey,
>>>
>>> Was wondering if I could create multiple spark stream contexts in my
>>> application (e.g instantiating a worker actor per topic and it has its own
>>> streaming context its own batch duration everything).
>>>
>>> What are the caveats if any?
>>> What are the best practices?
>>>
>>> Have googled half heartedly on the same but the air isn't pretty much
>>> demystified yet. I could skim through something like
>>>
>>>
>>> http://stackoverflow.com/questions/29612726/how-do-you-setup-multiple-spark-streaming-jobs-with-different-batch-durations
>>>
>>>
>>> http://stackoverflow.com/questions/37006565/multiple-spark-streaming-contexts-on-one-worker
>>>
>>> Thanks in Advance!
>>> Sumit
>>>
>>
>>
>


Re: spark.read.format("jdbc")

2016-08-01 Thread Nikolay Zhebet
You should specify the classpath for your JDBC connection.
As an example, if you want to connect to Impala, you can try this snippet:



import java.util.Properties
import org.apache.spark._
import org.apache.spark.sql.SQLContext
import java.sql.Connection
import java.sql.DriverManager
Class.forName("com.cloudera.impala.jdbc41.Driver")

var conn: java.sql.Connection = null
conn = 
DriverManager.getConnection("jdbc:impala://127.0.0.1:21050/default;auth=noSasl",
"", "")
val statement = conn.createStatement();

val result = statement.executeQuery("SELECT * FROM users limit 10")
result.next()
result.getString("user_id")val sql_insert = "INSERT INTO users
VALUES('user_id','email','gender')"
statement.executeUpdate(sql_insert)


Also, you should specify the path to your JDBC jar file in the --driver-class-path
option when running spark-submit:

spark-shell --master "local[2]" --driver-class-path
/opt/cloudera/parcels/CDH/jars/ImpalaJDBC41.jar


2016-08-01 9:37 GMT+03:00 kevin :

> maybe there is another version spark on the classpath?
>
> 2016-08-01 14:30 GMT+08:00 kevin :
>
>> hi,all:
>>I try to load data from jdbc datasource,but I got error with :
>> java.lang.RuntimeException: Multiple sources found for jdbc
>> (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider,
>> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please
>> specify the fully qualified class name.
>>
>> spark version is 2.0
>>
>>
>


Re: sql to spark scala rdd

2016-08-01 Thread Sri
Hi,

I solved it using Spark SQL, which uses the window functions mentioned below;
for my own knowledge I am trying to solve it using a Scala RDD, which I am
unable to do.
What function in Scala supports a window like SQL's UNBOUNDED PRECEDING AND
CURRENT ROW? Is it sliding?


Thanks
Sri

Sent from my iPhone
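
On the RDD side there is no built-in UNBOUNDED PRECEDING window: sliding(n) only gives fixed-size windows (a moving average), while a running total is easier to express with scanLeft inside each key's group. A rough sketch for the spark-shell, assuming "~"-delimited rows with the key in field 0, an ordering column (e.g. a date) in field 1 and the balance in field 2 (field positions guessed from the snippets quoted below), and assuming each key's rows fit in memory, since groupByKey pulls a whole key onto one executor:

val running = sc.textFile("file.txt")                 // placeholder input path
  .map(_.split("\\~"))
  .map(a => (a(0), (a(1), a(2).toDouble)))            // (key, (date, balance))
  .groupByKey()                                       // gather all rows of a key
  .flatMapValues { rows =>
    val ordered = rows.toSeq.sortBy(_._1)             // ORDER BY date within the key
    // scanLeft gives the UNBOUNDED PRECEDING AND CURRENT ROW running total
    val totals = ordered.scanLeft(0.0)(_ + _._2).tail
    ordered.map(_._1).zip(totals)                     // (date, running balance)
  }

running.collect().foreach(println)                    // (key, (date, running balance))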

> On 31 Jul 2016, at 23:16, Mich Talebzadeh  wrote:
> 
> hi
> 
> You mentioned:
> 
> I already solved it using DF and spark sql ...
> 
> Are you referring to this code which is a classic analytics:
> 
> 
> SELECT DATE,balance,
> 
>  SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>  AND
>  CURRENT ROW) daily_balance
>  FROM  table
> 
> So how did you solve it using DF in the first place?
> 
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 1 August 2016 at 07:04, Sri  wrote:
>> Hi ,
>> 
>> Just wondering how spark SQL works behind the scenes does it not convert SQL 
>> to some Scala RDD ? Or Scala ?
>> 
>> How to write below SQL in Scala or Scala RDD
>> 
>> SELECT DATE,balance,
>> SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>> AND
>> CURRENT ROW) daily_balance
>> FROM  table
>> 
>> Thanks
>> Sri
>> Sent from my iPhone
>> 
>>> On 31 Jul 2016, at 13:21, Jacek Laskowski  wrote:
>>> 
>>> Hi,
>>> 
>>> Impossible - see
>>> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr].
>>> 
>>> I tried to show you why you ended up with "non-empty iterator" after
>>> println. You should really start with
>>> http://www.scala-lang.org/documentation/
>>> 
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>> 
>>> 
>>> On Sun, Jul 31, 2016 at 8:49 PM, sri hari kali charan Tummala
>>>  wrote:
 Tuple
 
 [Lscala.Tuple2;@65e4cb84
 
> On Sun, Jul 31, 2016 at 1:00 AM, Jacek Laskowski  wrote:
> 
> Hi,
> 
> What's the result type of sliding(2,1)?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Sun, Jul 31, 2016 at 9:23 AM, sri hari kali charan Tummala
>  wrote:
>> tried this no luck, wht is non-empty iterator here ?
>> 
>> OP:-
>> (-987,non-empty iterator)
>> (-987,non-empty iterator)
>> (-987,non-empty iterator)
>> (-987,non-empty iterator)
>> (-987,non-empty iterator)
>> 
>> 
>> sc.textFile(file).keyBy(x => x.split("\\~") (0))
>>  .map(x => x._2.split("\\~"))
>>  .map(x => (x(0),x(2)))
>>.map { case (key,value) =>
>> (key,value.toArray.toSeq.sliding(2,1).map(x
>> => x.sum/x.size))}.foreach(println)
>> 
>> 
>> On Sun, Jul 31, 2016 at 12:03 AM, sri hari kali charan Tummala
>>  wrote:
>>> 
>>> Hi All,
>>> 
>>> I managed to write using sliding function but can it get key as well in
>>> my
>>> output ?
>>> 
>>> sc.textFile(file).keyBy(x => x.split("\\~") (0))
>>>  .map(x => x._2.split("\\~"))
>>>  .map(x => (x(2).toDouble)).toArray().sliding(2,1).map(x =>
>>> (x,x.size)).foreach(println)
>>> 
>>> 
>>> at the moment my output:-
>>> 
>>> 75.0
>>> -25.0
>>> 50.0
>>> -50.0
>>> -100.0
>>> 
>>> I want with key how to get moving average output based on key ?
>>> 
>>> 
>>> 987,75.0
>>> 987,-25
>>> 987,50.0
>>> 
>>> Thanks
>>> Sri
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Jul 30, 2016 at 11:40 AM, sri hari kali charan Tummala
>>>  wrote:
 
 for knowledge just wondering how to write it up in scala or spark RDD.
 
 Thanks
 Sri
 
 On Sat, Jul 30, 2016 at 11:24 AM, Jacek Laskowski 
 wrote:
> 
> Why?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 

Re: multiple spark streaming contexts

2016-08-01 Thread Sumit Khanna
Any ideas, guys? What are the best practices for processing multiple
streams?
I could trace a few Stack Overflow comments that recommend a separate jar
for each stream / use case. But that isn't quite what I want; it would be
better if one or multiple spark streaming contexts could all be handled
well within a single jar.

Guys please reply,

Awaiting,

Thanks,
Sumit

On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna  wrote:

> Any ideas on this one guys ?
>
> I can do a sample run but can't be sure of imminent problems if any? How
> can I ensure different batchDuration etc etc in here, per StreamingContext.
>
> Thanks,
>
> On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna 
> wrote:
>
>> Hey,
>>
>> Was wondering if I could create multiple spark stream contexts in my
>> application (e.g instantiating a worker actor per topic and it has its own
>> streaming context its own batch duration everything).
>>
>> What are the caveats if any?
>> What are the best practices?
>>
>> Have googled half heartedly on the same but the air isn't pretty much
>> demystified yet. I could skim through something like
>>
>>
>> http://stackoverflow.com/questions/29612726/how-do-you-setup-multiple-spark-streaming-jobs-with-different-batch-durations
>>
>>
>> http://stackoverflow.com/questions/37006565/multiple-spark-streaming-contexts-on-one-worker
>>
>> Thanks in Advance!
>> Sumit
>>
>
>


Re: spark.read.format("jdbc")

2016-08-01 Thread kevin
maybe there is another version of spark on the classpath?

2016-08-01 14:30 GMT+08:00 kevin :

> hi,all:
>I try to load data from jdbc datasource,but I got error with :
> java.lang.RuntimeException: Multiple sources found for jdbc
> (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider,
> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please
> specify the fully qualified class name.
>
> spark version is 2.0
>
>


spark.read.format("jdbc")

2016-08-01 Thread kevin
hi, all:
   I try to load data from a jdbc datasource, but I get this error:
java.lang.RuntimeException: Multiple sources found for jdbc
(org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider,
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please
specify the fully qualified class name.

spark version is 2.0
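
As the error message itself suggests, one workaround sketch is to pass the fully qualified provider class name instead of the short "jdbc" alias until the duplicate registration on the classpath is sorted out; the connection options below are placeholders:

// Use one of the provider classes listed in the error instead of "jdbc".
val df = spark.read
  .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")     // placeholder URL
  .option("dbtable", "my_table")                      // placeholder table
  .option("user", "user")
  .option("password", "password")
  .load()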


Re: sql to spark scala rdd

2016-08-01 Thread Mich Talebzadeh
hi

You mentioned:

I already solved it using DF and spark sql ...

Are you referring to this code, which is a classic analytics query:

SELECT DATE, balance,
       SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
                          AND CURRENT ROW) daily_balance
FROM table


So how did you solve it using DF in the first place?


HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 1 August 2016 at 07:04, Sri  wrote:

> Hi ,
>
> Just wondering how spark SQL works behind the scenes does it not convert
> SQL to some Scala RDD ? Or Scala ?
>
> How to write below SQL in Scala or Scala RDD
>
> SELECT DATE,balance,
>
> SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>
> AND
>
> CURRENT ROW) daily_balance
>
> FROM  table
>
>
> Thanks
> Sri
> Sent from my iPhone
>
> On 31 Jul 2016, at 13:21, Jacek Laskowski  wrote:
>
> Hi,
>
> Impossible - see
>
> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr]
> .
>
> I tried to show you why you ended up with "non-empty iterator" after
> println. You should really start with
> http://www.scala-lang.org/documentation/
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Jul 31, 2016 at 8:49 PM, sri hari kali charan Tummala
>  wrote:
>
> Tuple
>
>
> [Lscala.Tuple2;@65e4cb84
>
>
> On Sun, Jul 31, 2016 at 1:00 AM, Jacek Laskowski  wrote:
>
>
> Hi,
>
>
> What's the result type of sliding(2,1)?
>
>
> Pozdrawiam,
>
> Jacek Laskowski
>
> 
>
> https://medium.com/@jaceklaskowski/
>
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>
> Follow me at https://twitter.com/jaceklaskowski
>
>
>
> On Sun, Jul 31, 2016 at 9:23 AM, sri hari kali charan Tummala
>
>  wrote:
>
> tried this no luck, wht is non-empty iterator here ?
>
>
> OP:-
>
> (-987,non-empty iterator)
>
> (-987,non-empty iterator)
>
> (-987,non-empty iterator)
>
> (-987,non-empty iterator)
>
> (-987,non-empty iterator)
>
>
>
> sc.textFile(file).keyBy(x => x.split("\\~") (0))
>
>  .map(x => x._2.split("\\~"))
>
>  .map(x => (x(0),x(2)))
>
>.map { case (key,value) =>
>
> (key,value.toArray.toSeq.sliding(2,1).map(x
>
> => x.sum/x.size))}.foreach(println)
>
>
>
> On Sun, Jul 31, 2016 at 12:03 AM, sri hari kali charan Tummala
>
>  wrote:
>
>
> Hi All,
>
>
> I managed to write using sliding function but can it get key as well in
>
> my
>
> output ?
>
>
> sc.textFile(file).keyBy(x => x.split("\\~") (0))
>
>  .map(x => x._2.split("\\~"))
>
>  .map(x => (x(2).toDouble)).toArray().sliding(2,1).map(x =>
>
> (x,x.size)).foreach(println)
>
>
>
> at the moment my output:-
>
>
> 75.0
>
> -25.0
>
> 50.0
>
> -50.0
>
> -100.0
>
>
> I want with key how to get moving average output based on key ?
>
>
>
> 987,75.0
>
> 987,-25
>
> 987,50.0
>
>
> Thanks
>
> Sri
>
>
>
>
>
>
>
> On Sat, Jul 30, 2016 at 11:40 AM, sri hari kali charan Tummala
>
>  wrote:
>
>
> for knowledge just wondering how to write it up in scala or spark RDD.
>
>
> Thanks
>
> Sri
>
>
> On Sat, Jul 30, 2016 at 11:24 AM, Jacek Laskowski 
>
> wrote:
>
>
> Why?
>
>
> Pozdrawiam,
>
> Jacek Laskowski
>
> 
>
> https://medium.com/@jaceklaskowski/
>
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>
> Follow me at https://twitter.com/jaceklaskowski
>
>
>
> On Sat, Jul 30, 2016 at 4:42 AM, kali.tumm...@gmail.com
>
>  wrote:
>
> Hi All,
>
>
> I managed to write business requirement in spark-sql and hive I am
>
> still
>
> learning scala how this below sql be written using spark RDD not
>
> spark
>
> data
>
> frames.
>
>
> SELECT DATE,balance,
>
> SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
>
> AND
>
> CURRENT ROW) daily_balance
>
> FROM  table
>
>
>
>
>
>
> --
>
> View this message in context:
>
>
>
> http://apache-spark-user-list.1001560.n3.nabble.com/sql-to-spark-scala-rdd-tp27433.html
>
> Sent from the Apache Spark User List mailing list archive at
>
> Nabble.com .
>
>
>
> -
>
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
>
>
> --
>
> Thanks & Regards
>
> Sri Tummala
>
>
>
>
>
> 

Re: sql to spark scala rdd

2016-08-01 Thread Sri
Hi,

Just wondering how Spark SQL works behind the scenes: does it not convert SQL to
some Scala RDD? Or Scala?

How to write below SQL in Scala or Scala RDD

 SELECT DATE,balance,
 SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
 AND
 CURRENT ROW) daily_balance
 FROM  table

Thanks
Sri
Sent from my iPhone

> On 31 Jul 2016, at 13:21, Jacek Laskowski  wrote:
> 
> Hi,
> 
> Impossible - see
> http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr].
> 
> I tried to show you why you ended up with "non-empty iterator" after
> println. You should really start with
> http://www.scala-lang.org/documentation/
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Sun, Jul 31, 2016 at 8:49 PM, sri hari kali charan Tummala
>  wrote:
>> Tuple
>> 
>> [Lscala.Tuple2;@65e4cb84
>> 
>>> On Sun, Jul 31, 2016 at 1:00 AM, Jacek Laskowski  wrote:
>>> 
>>> Hi,
>>> 
>>> What's the result type of sliding(2,1)?
>>> 
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>> 
>>> 
>>> On Sun, Jul 31, 2016 at 9:23 AM, sri hari kali charan Tummala
>>>  wrote:
 tried this no luck, wht is non-empty iterator here ?
 
 OP:-
 (-987,non-empty iterator)
 (-987,non-empty iterator)
 (-987,non-empty iterator)
 (-987,non-empty iterator)
 (-987,non-empty iterator)
 
 
 sc.textFile(file).keyBy(x => x.split("\\~") (0))
  .map(x => x._2.split("\\~"))
  .map(x => (x(0),x(2)))
.map { case (key,value) =>
 (key,value.toArray.toSeq.sliding(2,1).map(x
 => x.sum/x.size))}.foreach(println)
 
 
 On Sun, Jul 31, 2016 at 12:03 AM, sri hari kali charan Tummala
  wrote:
> 
> Hi All,
> 
> I managed to write using sliding function but can it get key as well in
> my
> output ?
> 
> sc.textFile(file).keyBy(x => x.split("\\~") (0))
>  .map(x => x._2.split("\\~"))
>  .map(x => (x(2).toDouble)).toArray().sliding(2,1).map(x =>
> (x,x.size)).foreach(println)
> 
> 
> at the moment my output:-
> 
> 75.0
> -25.0
> 50.0
> -50.0
> -100.0
> 
> I want with key how to get moving average output based on key ?
> 
> 
> 987,75.0
> 987,-25
> 987,50.0
> 
> Thanks
> Sri
> 
> 
> 
> 
> 
> 
> On Sat, Jul 30, 2016 at 11:40 AM, sri hari kali charan Tummala
>  wrote:
>> 
>> for knowledge just wondering how to write it up in scala or spark RDD.
>> 
>> Thanks
>> Sri
>> 
>> On Sat, Jul 30, 2016 at 11:24 AM, Jacek Laskowski 
>> wrote:
>>> 
>>> Why?
>>> 
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>> 
>>> 
>>> On Sat, Jul 30, 2016 at 4:42 AM, kali.tumm...@gmail.com
>>>  wrote:
 Hi All,
 
 I managed to write business requirement in spark-sql and hive I am
 still
 learning scala how this below sql be written using spark RDD not
 spark
 data
 frames.
 
 SELECT DATE,balance,
 SUM(balance) OVER (ORDER BY DATE ROWS BETWEEN UNBOUNDED PRECEDING
 AND
 CURRENT ROW) daily_balance
 FROM  table
 
 
 
 
 
 --
 View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/sql-to-spark-scala-rdd-tp27433.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
 
 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
>> 
>> 
>> 
>> --
>> Thanks & Regards
>> Sri Tummala
> 
> 
> 
> --
> Thanks & Regards
> Sri Tummala
 
 
 
 --
 Thanks & Regards
 Sri Tummala
>> 
>> 
>> 
>> 
>> --
>> Thanks & Regards
>> Sri Tummala
>>