Spark - Mesos HTTP API

2016-04-03 Thread John Omernik
Hey all, are there any plans to implement the Mesos HTTP API rather than the
native libs? The reason I ask is that I am trying to run an application in a
Docker container (Zeppelin) that would use Spark connecting to Mesos, but I
am finding that using the NATIVE_LIB from Docker is difficult, or would
result in really big/heavy Docker images. That got me thinking about the
HTTP API, and I was wondering if there is a JIRA to track this or if this is
something Spark is planning.

Thanks!

John


Re: multiple splits fails

2016-04-03 Thread Mich Talebzadeh
I am afraid print(Integer.MAX_VALUE) does not return any lines! However,
print(1000) does.
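
If the goal is to materialize every element of each batch rather than the
first N, a sketch (not from this thread; it assumes v is the
DStream[(String, Int)] built earlier) is to use the foreachRDD output
operator and collect each RDD on the driver:

// Sketch only: prints every element of each micro-batch.
// collect() pulls the whole RDD back to the driver, so this is only
// sensible when each batch is small.
v.foreachRDD { rdd =>
  rdd.collect().foreach(println)
}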

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 3 April 2016 at 19:46, Ted Yu  wrote:

> Since num is an Int, you can specify Integer.MAX_VALUE
>
> FYI
>
> On Sun, Apr 3, 2016 at 10:58 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Ted.
>>
>> As I see print() materializes the first 10 rows. On the other hand
>> print(n) will materialise n rows.
>>
>> How about if I wanted to materialize all rows?
>>
>> Cheers
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 April 2016 at 18:05, Ted Yu  wrote:
>>
>>> Mich:
>>> See the following method of DStream:
>>>
>>>* Print the first num elements of each RDD generated in this DStream.
>>> This is an output
>>>* operator, so this DStream will be registered as an output stream
>>> and there materialized.
>>>*/
>>>   def print(num: Int): Unit = ssc.withScope {
>>>
>>> On Sun, Apr 3, 2016 at 9:32 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 However this works. I am checking the logic to see if it does what I
 asked it to do

 val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE
 INDEX STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
 1)).reduceByKey(_ + _).print

 scala> ssc.start()

 scala> ---
 Time: 1459701925000 ms
 ---
 (* Check that you have run UPDATE INDEX STATISTICS on all ASE 15
 databases,27)
 (o You can try UPDATE INDEX STATISTICS WITH SAMPLING in ASE 15 OR,27)
 (Once databases are loaded to ASE 15, then you will need to maintain
 them the way you maintain your PROD. For example run UPDATE INDEX
 STATISTICS and REORG COMPACT as necessary. One of the frequent mistakes
 that people do is NOT pruning data from daily log tables in ASE 15 etc as
 they do it in PROD. This normally results in slower performance on ASE 15
 databases as test cycles continue. Use MDA readings to measure daily DML
 activities on ASE 15 tables and compare them with those of PROD. A 24 hour
 cycle measurement should be good. If you notice that certain tables have
 different DML hits (insert/update/delete) compared to PROD you will know
 that either ASE 15 is not doing everything in terms of batch activity (some
 jobs are missing), or there is something inconsistent somewhere. ,27)
 (* Make sure that you have enough tempdb system segment space for
 UPDATE INDEX STATISTICS. It is always advisable to gauge the tempdb size
 required in ASE 15 QA and expand the tempdb database in production
 accordingly. The last thing you want is to blow up tempdb over the
 migration weekend.,27)
 (o In ASE 15 you can subdivide the task by running parallel UPDATE
 INDEX STATISTICS on different tables in the same database at the same time.
 Watch tempdb segment growth though! OR,27)



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 3 April 2016 at 17:06, Mich Talebzadeh 
 wrote:

> Thanks Ted
>
> This works
>
> scala> val messages = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
> String)] = 
> org.apache.spark.streaming.kafka.DirectKafkaInputDStream@3bfc0063
>
> scala> // Get the lines
> scala> val lines = messages.map(_._2)
> lines: org.apache.spark.streaming.dstream.DStream[String] =
> org.apache.spark.streaming.dstream.MappedDStream@1e4afd64
>
> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + _)
> v: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
> org.apache.spark.streaming.dstream.ShuffledDStream@5aa09d
>
> However, this fails
>
> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\n,")).map(word => (word, 

Re: multiple splits fails

2016-04-03 Thread Ted Yu
Since num is an Int, you can specify Integer.MAX_VALUE

FYI

On Sun, Apr 3, 2016 at 10:58 AM, Mich Talebzadeh 
wrote:

> Thanks Ted.
>
> As I see print() materializes the first 10 rows. On the other hand
> print(n) will materialise n rows.
>
> How about if I wanted to materialize all rows?
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 18:05, Ted Yu  wrote:
>
>> Mich:
>> See the following method of DStream:
>>
>>* Print the first num elements of each RDD generated in this DStream.
>> This is an output
>>* operator, so this DStream will be registered as an output stream and
>> there materialized.
>>*/
>>   def print(num: Int): Unit = ssc.withScope {
>>
>> On Sun, Apr 3, 2016 at 9:32 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> However this works. I am checking the logic to see if it does what I
>>> asked it to do
>>>
>>> val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE
>>> INDEX STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
>>> 1)).reduceByKey(_ + _).print
>>>
>>> scala> ssc.start()
>>>
>>> scala> ---
>>> Time: 1459701925000 ms
>>> ---
>>> (* Check that you have run UPDATE INDEX STATISTICS on all ASE 15
>>> databases,27)
>>> (o You can try UPDATE INDEX STATISTICS WITH SAMPLING in ASE 15 OR,27)
>>> (Once databases are loaded to ASE 15, then you will need to maintain
>>> them the way you maintain your PROD. For example run UPDATE INDEX
>>> STATISTICS and REORG COMPACT as necessary. One of the frequent mistakes
>>> that people do is NOT pruning data from daily log tables in ASE 15 etc as
>>> they do it in PROD. This normally results in slower performance on ASE 15
>>> databases as test cycles continue. Use MDA readings to measure daily DML
>>> activities on ASE 15 tables and compare them with those of PROD. A 24 hour
>>> cycle measurement should be good. If you notice that certain tables have
>>> different DML hits (insert/update/delete) compared to PROD you will know
>>> that either ASE 15 is not doing everything in terms of batch activity (some
>>> jobs are missing), or there is something inconsistent somewhere. ,27)
>>> (* Make sure that you have enough tempdb system segment space for UPDATE
>>> INDEX STATISTICS. It is always advisable to gauge the tempdb size required
>>> in ASE 15 QA and expand the tempdb database in production accordingly. The
>>> last thing you want is to blow up tempdb over the migration weekend.,27)
>>> (o In ASE 15 you can subdivide the task by running parallel UPDATE INDEX
>>> STATISTICS on different tables in the same database at the same time. Watch
>>> tempdb segment growth though! OR,27)
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 3 April 2016 at 17:06, Mich Talebzadeh 
>>> wrote:
>>>
 Thanks Ted

 This works

 scala> val messages = KafkaUtils.createDirectStream[String, String,
 StringDecoder, StringDecoder](ssc, kafkaParams, topic)
 messages: org.apache.spark.streaming.dstream.InputDStream[(String,
 String)] = 
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream@3bfc0063

 scala> // Get the lines
 scala> val lines = messages.map(_._2)
 lines: org.apache.spark.streaming.dstream.DStream[String] =
 org.apache.spark.streaming.dstream.MappedDStream@1e4afd64

 scala> val v = lines.filter(_.contains("ASE 15")).filter(_
 contains("UPDATE INDEX STATISTICS")).flatMap(line =>
 line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + _)
 v: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
 org.apache.spark.streaming.dstream.ShuffledDStream@5aa09d

 However, this fails

 scala> val v = lines.filter(_.contains("ASE 15")).filter(_
 contains("UPDATE INDEX STATISTICS")).flatMap(line =>
 line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
 _).collect.foreach(println)
 :43: error: value collect is not a member of
 org.apache.spark.streaming.dstream.DStream[(String, Int)]
  val v = lines.filter(_.contains("ASE 15")).filter(_
 contains("UPDATE INDEX STATISTICS")).flatMap(line =>
 line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
 _).collect.foreach(println)


 Dr Mich Talebzadeh



 LinkedIn * 
 

GenericRowWithSchema to case class

2016-04-03 Thread asethia
Hi,

My Cassandra table has a custom user-defined type, for example:

CREATE TYPE address (
 addressline1 text,
 addressline2 text,
 city text,
 state text,
 country text,
 pincode text
)

create table person (
  id text,
  name text,
  addresses set<frozen<address>>,
  PRIMARY KEY (id));

val rdd=sqlContext.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" ->"test", "table" -> "test")).load()

rdd.collect()

I would like to collect these as case objects, which I am able to do for person
id and name using row.getAs, but addresses comes back as a WrappedArray, and
each array element is a
CompactBuffer(org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema).

Is there any way I can get addresses mapped to Address case class?

Thanks,
Arun
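
One possible way to get there (a sketch only; it assumes the column and field
names match the schema above, that rdd is the DataFrame loaded above, and that
the nested elements are exposed as Spark SQL Rows, which is what the
GenericRowWithSchema output suggests) is to map each nested Row into the case
class by field name:

import org.apache.spark.sql.Row

case class Address(addressline1: String, addressline2: String, city: String,
                   state: String, country: String, pincode: String)

// Sketch: convert the addresses column (a WrappedArray of Rows) into Address
// case objects, keyed by the person's id and name.
val people = rdd.map { row =>
  val addresses = row.getAs[Seq[Row]]("addresses").map { a =>
    Address(a.getAs[String]("addressline1"), a.getAs[String]("addressline2"),
            a.getAs[String]("city"), a.getAs[String]("state"),
            a.getAs[String]("country"), a.getAs[String]("pincode"))
  }
  (row.getAs[String]("id"), row.getAs[String]("name"), addresses)
}
people.collect()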


 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/GenericRowWithSchema-to-case-class-tp2.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: multiple splits fails

2016-04-03 Thread Mich Talebzadeh
Thanks Ted.

As I see print() materializes the first 10 rows. On the other hand
print(n) will materialise n rows.

How about if I wanted to materialize all rows?

Cheers

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 3 April 2016 at 18:05, Ted Yu  wrote:

> Mich:
> See the following method of DStream:
>
>* Print the first num elements of each RDD generated in this DStream.
> This is an output
>* operator, so this DStream will be registered as an output stream and
> there materialized.
>*/
>   def print(num: Int): Unit = ssc.withScope {
>
> On Sun, Apr 3, 2016 at 9:32 AM, Mich Talebzadeh  > wrote:
>
>> However this works. I am checking the logic to see if it does what I
>> asked it to do
>>
>> val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE
>> INDEX STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
>> 1)).reduceByKey(_ + _).print
>>
>> scala> ssc.start()
>>
>> scala> ---
>> Time: 1459701925000 ms
>> ---
>> (* Check that you have run UPDATE INDEX STATISTICS on all ASE 15
>> databases,27)
>> (o You can try UPDATE INDEX STATISTICS WITH SAMPLING in ASE 15 OR,27)
>> (Once databases are loaded to ASE 15, then you will need to maintain them
>> the way you maintain your PROD. For example run UPDATE INDEX STATISTICS and
>> REORG COMPACT as necessary. One of the frequent mistakes that people do is
>> NOT pruning data from daily log tables in ASE 15 etc as they do it in PROD.
>> This normally results in slower performance on ASE 15 databases as test
>> cycles continue. Use MDA readings to measure daily DML activities on ASE 15
>> tables and compare them with those of PROD. A 24 hour cycle measurement
>> should be good. If you notice that certain tables have different DML hits
>> (insert/update/delete) compared to PROD you will know that either ASE 15 is
>> not doing everything in terms of batch activity (some jobs are missing), or
>> there is something inconsistent somewhere. ,27)
>> (* Make sure that you have enough tempdb system segment space for UPDATE
>> INDEX STATISTICS. It is always advisable to gauge the tempdb size required
>> in ASE 15 QA and expand the tempdb database in production accordingly. The
>> last thing you want is to blow up tempdb over the migration weekend.,27)
>> (o In ASE 15 you can subdivide the task by running parallel UPDATE INDEX
>> STATISTICS on different tables in the same database at the same time. Watch
>> tempdb segment growth though! OR,27)
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 April 2016 at 17:06, Mich Talebzadeh 
>> wrote:
>>
>>> Thanks Ted
>>>
>>> This works
>>>
>>> scala> val messages = KafkaUtils.createDirectStream[String, String,
>>> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
>>> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
>>> String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@3bfc0063
>>>
>>> scala> // Get the lines
>>> scala> val lines = messages.map(_._2)
>>> lines: org.apache.spark.streaming.dstream.DStream[String] =
>>> org.apache.spark.streaming.dstream.MappedDStream@1e4afd64
>>>
>>> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
>>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + _)
>>> v: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
>>> org.apache.spark.streaming.dstream.ShuffledDStream@5aa09d
>>>
>>> However, this fails
>>>
>>> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
>>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>>> _).collect.foreach(println)
>>> :43: error: value collect is not a member of
>>> org.apache.spark.streaming.dstream.DStream[(String, Int)]
>>>  val v = lines.filter(_.contains("ASE 15")).filter(_
>>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>>> _).collect.foreach(println)
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 3 April 2016 at 16:01, Ted Yu  wrote:
>>>
 bq. is not a member of (String, String)

 As shown 

Re: multiple splits fails

2016-04-03 Thread Ted Yu
Mich:
See the following method of DStream:

   * Print the first num elements of each RDD generated in this DStream.
This is an output
   * operator, so this DStream will be registered as an output stream and
there materialized.
   */
  def print(num: Int): Unit = ssc.withScope {

On Sun, Apr 3, 2016 at 9:32 AM, Mich Talebzadeh 
wrote:

> However this works. I am checking the logic to see if it does what I asked
> it to do
>
> val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE INDEX
> STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
> 1)).reduceByKey(_ + _).print
>
> scala> ssc.start()
>
> scala> ---
> Time: 1459701925000 ms
> ---
> (* Check that you have run UPDATE INDEX STATISTICS on all ASE 15
> databases,27)
> (o You can try UPDATE INDEX STATISTICS WITH SAMPLING in ASE 15 OR,27)
> (Once databases are loaded to ASE 15, then you will need to maintain them
> the way you maintain your PROD. For example run UPDATE INDEX STATISTICS and
> REORG COMPACT as necessary. One of the frequent mistakes that people do is
> NOT pruning data from daily log tables in ASE 15 etc as they do it in PROD.
> This normally results in slower performance on ASE 15 databases as test
> cycles continue. Use MDA readings to measure daily DML activities on ASE 15
> tables and compare them with those of PROD. A 24 hour cycle measurement
> should be good. If you notice that certain tables have different DML hits
> (insert/update/delete) compared to PROD you will know that either ASE 15 is
> not doing everything in terms of batch activity (some jobs are missing), or
> there is something inconsistent somewhere. ,27)
> (* Make sure that you have enough tempdb system segment space for UPDATE
> INDEX STATISTICS. It is always advisable to gauge the tempdb size required
> in ASE 15 QA and expand the tempdb database in production accordingly. The
> last thing you want is to blow up tempdb over the migration weekend.,27)
> (o In ASE 15 you can subdivide the task by running parallel UPDATE INDEX
> STATISTICS on different tables in the same database at the same time. Watch
> tempdb segment growth though! OR,27)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 17:06, Mich Talebzadeh 
> wrote:
>
>> Thanks Ted
>>
>> This works
>>
>> scala> val messages = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
>> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
>> String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@3bfc0063
>>
>> scala> // Get the lines
>> scala> val lines = messages.map(_._2)
>> lines: org.apache.spark.streaming.dstream.DStream[String] =
>> org.apache.spark.streaming.dstream.MappedDStream@1e4afd64
>>
>> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ + _)
>> v: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
>> org.apache.spark.streaming.dstream.ShuffledDStream@5aa09d
>>
>> However, this fails
>>
>> scala> val v = lines.filter(_.contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>> _).collect.foreach(println)
>> :43: error: value collect is not a member of
>> org.apache.spark.streaming.dstream.DStream[(String, Int)]
>>  val v = lines.filter(_.contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>> _).collect.foreach(println)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 April 2016 at 16:01, Ted Yu  wrote:
>>
>>> bq. is not a member of (String, String)
>>>
>>> As shown above, contains shouldn't be applied directly on a tuple.
>>>
>>> Choose the element of the tuple and then apply contains on it.
>>>
>>> On Sun, Apr 3, 2016 at 7:54 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Thank you gents.

 That should "\n" as carriage return

 OK I am using spark streaming to analyse the message

 It does the streaming

 import _root_.kafka.serializer.StringDecoder
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming._
 import org.apache.spark.streaming.kafka.KafkaUtils
 //
 scala> val 

Re: multiple splits fails

2016-04-03 Thread Mich Talebzadeh
Thanks Ted

This works

scala> val messages = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topic)
messages: org.apache.spark.streaming.dstream.InputDStream[(String, String)]
= org.apache.spark.streaming.kafka.DirectKafkaInputDStream@3bfc0063

scala> // Get the lines
scala> val lines = messages.map(_._2)
lines: org.apache.spark.streaming.dstream.DStream[String] =
org.apache.spark.streaming.dstream.MappedDStream@1e4afd64

scala> val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE
INDEX STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
1)).reduceByKey(_ + _)
v: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
org.apache.spark.streaming.dstream.ShuffledDStream@5aa09d

However, this fails

scala> val v = lines.filter(_.contains("ASE 15")).filter(_ contains("UPDATE
INDEX STATISTICS")).flatMap(line => line.split("\n,")).map(word => (word,
1)).reduceByKey(_ + _).collect.foreach(println)
<console>:43: error: value collect is not a member of
org.apache.spark.streaming.dstream.DStream[(String, Int)]
 val v = lines.filter(_.contains("ASE 15")).filter(_
contains("UPDATE INDEX STATISTICS")).flatMap(line =>
line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
_).collect.foreach(println)
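
collect is an RDD method rather than a DStream method, which is why the last
statement fails. A sketch of the usual workaround (an illustration, not part
of the original message) is to drop to the RDD level inside foreachRDD:

// Sketch: apply RDD operations such as collect inside foreachRDD.
lines.filter(_.contains("ASE 15"))
  .filter(_.contains("UPDATE INDEX STATISTICS"))
  .flatMap(_.split("\n,"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .foreachRDD(rdd => rdd.collect().foreach(println))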


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 3 April 2016 at 16:01, Ted Yu  wrote:

> bq. is not a member of (String, String)
>
> As shown above, contains shouldn't be applied directly on a tuple.
>
> Choose the element of the tuple and then apply contains on it.
>
> On Sun, Apr 3, 2016 at 7:54 AM, Mich Talebzadeh  > wrote:
>
>> Thank you gents.
>>
>> That should "\n" as carriage return
>>
>> OK I am using spark streaming to analyse the message
>>
>> It does the streaming
>>
>> import _root_.kafka.serializer.StringDecoder
>> import org.apache.spark.SparkConf
>> import org.apache.spark.streaming._
>> import org.apache.spark.streaming.kafka.KafkaUtils
>> //
>> scala> val sparkConf = new SparkConf().
>>  |  setAppName("StreamTest").
>>  |  setMaster("local[12]").
>>  |  set("spark.driver.allowMultipleContexts", "true").
>>  |  set("spark.hadoop.validateOutputSpecs", "false")
>> scala> val ssc = new StreamingContext(sparkConf, Seconds(55))
>> scala>
>> scala> val kafkaParams = Map[String, String]("bootstrap.servers" ->
>> "rhes564:9092", "schema.registry.url" -> "http://rhes564:8081;,
>> "zookeeper.connect" -> "rhes564:2181", "group.id" -> "StreamTest" )
>> kafkaParams: scala.collection.immutable.Map[String,String] =
>> Map(bootstrap.servers -> rhes564:9092, schema.registry.url ->
>> http://rhes564:8081, zookeeper.connect -> rhes564:2181, group.id ->
>> StreamTest)
>> scala> val topic = Set("newtopic")
>> topic: scala.collection.immutable.Set[String] = Set(newtopic)
>> scala> val messages = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
>> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
>> String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@5d8ccb6c
>>
>> This part is tricky
>>
>> scala> val showlines = messages.filter(_ contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>> _).collect.foreach(println)
>> :47: error: value contains is not a member of (String, String)
>>  val showlines = messages.filter(_ contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
>> _).collect.foreach(println)
>>
>>
>> How does one refer to the content of the stream here?
>>
>> Thanks
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> //
>> // Now want to do some analysis on the same text file
>> //
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 April 2016 at 15:32, Ted Yu  wrote:
>>
>>> bq. split"\t," splits the filter by carriage return
>>>
>>> Minor correction: "\t" denotes tab character.
>>>
>>> On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas  wrote:
>>>
 Hi Mich,

 1. The first underscore in your filter call is refering to a line in
 the file (as textFile() results in a collection of strings)
 2. You're correct. No need for it.
 3. Filter is expecting a Boolean result. So you can merge your contains
 filters to one with AND (&&) statement.
 4. Correct. Each character in split() is used 

Re: spark.driver.memory meaning

2016-04-03 Thread Carlile, Ken



Cool. My users tend to interact with the driver via iPython Notebook, so clearly I’ll have to leave (fairly significant amounts of) RAM for that. But I should be able to write a one-liner in spark-env.sh that will determine whether it’s on a 128 or 256 GB node and have it size itself accordingly.


Thanks!


—Ken



On Apr 3, 2016, at 11:06 AM, Yong Zhang  wrote:



In the standalone mode, it applies to the Driver JVM processor heap size.


You should consider giving enough memory space to it, in standalone mode, due to:


1) Any data you bring back to the driver will store in it, like RDD.collect or DF.show
2) The Driver also host a web UI for the application job you are running, and there could be big memory requirement as huge job related metrics data, if the job contains lots of stages and tasks.


Yong

> From: carli...@janelia.hhmi.org
> To: user@spark.apache.org
> Subject: spark.driver.memory meaning
> Date: Sun, 3 Apr 2016 14:57:51 +
> 
> In the spark-env.sh example file, the comments indicate that the spark.driver.memory is the memory for the master in YARN mode. None of that actually makes any sense… 
> 
> In any case, I’m using spark in a standalone mode, running the driver on a separate machine from the master. I have a few questions regarding that: 
> 
> Does the spark.driver.memory only work in YARN mode? 
> 
> Does the value apply to the master or the driver? 
> 
> If the memory applies to the driver, what is that memory used for? 
> 
> Does it make sense to change it based on what kind of machine the driver is running on? (We have both 256GB nodes and 128GB nodes available for use as the driver)
> 
> Thanks,
> Ken
> -
> To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org
> For additional commands, e-mail: 
user-h...@spark.apache.org
> 










-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: spark.driver.memory meaning

2016-04-03 Thread Yong Zhang
In standalone mode, it applies to the Driver JVM process heap size.
You should consider giving it enough memory space in standalone mode, because:
1) Any data you bring back to the driver will be stored in it, e.g. RDD.collect
or DF.show.
2) The Driver also hosts a web UI for the application you are running, and the
job-related metrics data can require a lot of memory if the job contains lots
of stages and tasks.
Yong

> From: carli...@janelia.hhmi.org
> To: user@spark.apache.org
> Subject: spark.driver.memory meaning
> Date: Sun, 3 Apr 2016 14:57:51 +
> 
> In the spark-env.sh example file, the comments indicate that the 
> spark.driver.memory is the memory for the master in YARN mode. None of that 
> actually makes any sense… 
> 
> In any case, I’m using spark in a standalone mode, running the driver on a 
> separate machine from the master. I have a few questions regarding that: 
> 
> Does the spark.driver.memory only work in YARN mode? 
> 
> Does the value apply to the master or the driver? 
> 
> If the memory applies to the driver, what is that memory used for? 
> 
> Does it make sense to change it based on what kind of machine the driver is 
> running on? (We have both 256GB nodes and 128GB nodes available for use as 
> the driver)
> 
> Thanks,
> Ken
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
  

Re: multiple splits fails

2016-04-03 Thread Ted Yu
bq. is not a member of (String, String)

As shown above, contains shouldn't be applied directly on a tuple.

Choose the element of the tuple and then apply contains on it.
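
For example, a minimal sketch along those lines (messages being the
DStream[(String, String)] from the direct Kafka stream, with the message
value as the second element of the tuple):

// Sketch: filter on the message value rather than on the (key, value) tuple.
val showlines = messages.map(_._2)
  .filter(line => line.contains("ASE 15") && line.contains("UPDATE INDEX STATISTICS"))
  .flatMap(_.split("\n,"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
showlines.print()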

On Sun, Apr 3, 2016 at 7:54 AM, Mich Talebzadeh 
wrote:

> Thank you gents.
>
> That should "\n" as carriage return
>
> OK I am using spark streaming to analyse the message
>
> It does the streaming
>
> import _root_.kafka.serializer.StringDecoder
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming._
> import org.apache.spark.streaming.kafka.KafkaUtils
> //
> scala> val sparkConf = new SparkConf().
>  |  setAppName("StreamTest").
>  |  setMaster("local[12]").
>  |  set("spark.driver.allowMultipleContexts", "true").
>  |  set("spark.hadoop.validateOutputSpecs", "false")
> scala> val ssc = new StreamingContext(sparkConf, Seconds(55))
> scala>
> scala> val kafkaParams = Map[String, String]("bootstrap.servers" ->
> "rhes564:9092", "schema.registry.url" -> "http://rhes564:8081;,
> "zookeeper.connect" -> "rhes564:2181", "group.id" -> "StreamTest" )
> kafkaParams: scala.collection.immutable.Map[String,String] =
> Map(bootstrap.servers -> rhes564:9092, schema.registry.url ->
> http://rhes564:8081, zookeeper.connect -> rhes564:2181, group.id ->
> StreamTest)
> scala> val topic = Set("newtopic")
> topic: scala.collection.immutable.Set[String] = Set(newtopic)
> scala> val messages = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](ssc, kafkaParams, topic)
> messages: org.apache.spark.streaming.dstream.InputDStream[(String,
> String)] = org.apache.spark.streaming.kafka.DirectKafkaInputDStream@5d8ccb6c
>
> This part is tricky
>
> scala> val showlines = messages.filter(_ contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
> _).collect.foreach(println)
> :47: error: value contains is not a member of (String, String)
>  val showlines = messages.filter(_ contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
> _).collect.foreach(println)
>
>
> How does one refer to the content of the stream here?
>
> Thanks
>
>
>
>
>
>
>
>
>
>
> //
> // Now want to do some analysis on the same text file
> //
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 15:32, Ted Yu  wrote:
>
>> bq. split"\t," splits the filter by carriage return
>>
>> Minor correction: "\t" denotes tab character.
>>
>> On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas  wrote:
>>
>>> Hi Mich,
>>>
>>> 1. The first underscore in your filter call is refering to a line in the
>>> file (as textFile() results in a collection of strings)
>>> 2. You're correct. No need for it.
>>> 3. Filter is expecting a Boolean result. So you can merge your contains
>>> filters to one with AND (&&) statement.
>>> 4. Correct. Each character in split() is used as a divider.
>>>
>>> Eliran Bivas
>>>
>>> *From:* Mich Talebzadeh 
>>> *Sent:* Apr 3, 2016 15:06
>>> *To:* Eliran Bivas
>>> *Cc:* user @spark
>>> *Subject:* Re: multiple splits fails
>>>
>>> Hi Eliran,
>>>
>>> Many thanks for your input on this.
>>>
>>> I thought about what I was trying to achieve so I rewrote the logic as
>>> follows:
>>>
>>>
>>>1. Read the text file in
>>>2. Filter out empty lines (well not really needed here)
>>>3. Search for lines that contain "ASE 15" and further have sentence
>>>"UPDATE INDEX STATISTICS" in the said line
>>>4. Split the text by "\t" and ","
>>>5. Print the outcome
>>>
>>>
>>> This was what I did with your suggestions included
>>>
>>> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
>>> f.cache()
>>>  f.filter(_.length > 0).filter(_ contains("ASE 15")).filter(_
>>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>>> line.split("\t,")).map(word => (word, 1)).reduceByKey(_ +
>>> _).collect.foreach(println)
>>>
>>>
>>> Couple of questions if I may
>>>
>>>
>>>1. I take that "_" refers to content of the file read in by default?
>>>2. _.length > 0 basically filters out blank lines (not really needed
>>>here)
>>>3. Multiple filters are needed for each *contains* logic
>>>4. split"\t," splits the filter by carriage return AND ,?
>>>
>>>
>>> Regards
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 3 April 2016 at 12:35, 

spark.driver.memory meaning

2016-04-03 Thread Carlile, Ken
In the spark-env.sh example file, the comments indicate that the 
spark.driver.memory is the memory for the master in YARN mode. None of that 
actually makes any sense… 

In any case, I’m using spark in a standalone mode, running the driver on a 
separate machine from the master. I have a few questions regarding that: 

Does the spark.driver.memory only work in YARN mode? 

Does the value apply to the master or the driver? 

If the memory applies to the driver, what is that memory used for? 

Does it make sense to change it based on what kind of machine the driver is 
running on? (We have both 256GB nodes and 128GB nodes available for use as the 
driver)

Thanks,
Ken
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: multiple splits fails

2016-04-03 Thread Mich Talebzadeh
Thank you gents.

That should be "\n", the newline character.

OK I am using spark streaming to analyse the message

It does the streaming

import _root_.kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka.KafkaUtils
//
scala> val sparkConf = new SparkConf().
 |  setAppName("StreamTest").
 |  setMaster("local[12]").
 |  set("spark.driver.allowMultipleContexts", "true").
 |  set("spark.hadoop.validateOutputSpecs", "false")
scala> val ssc = new StreamingContext(sparkConf, Seconds(55))
scala>
scala> val kafkaParams = Map[String, String]("bootstrap.servers" ->
"rhes564:9092", "schema.registry.url" -> "http://rhes564:8081;,
"zookeeper.connect" -> "rhes564:2181", "group.id" -> "StreamTest" )
kafkaParams: scala.collection.immutable.Map[String,String] =
Map(bootstrap.servers -> rhes564:9092, schema.registry.url ->
http://rhes564:8081, zookeeper.connect -> rhes564:2181, group.id ->
StreamTest)
scala> val topic = Set("newtopic")
topic: scala.collection.immutable.Set[String] = Set(newtopic)
scala> val messages = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topic)
messages: org.apache.spark.streaming.dstream.InputDStream[(String, String)]
= org.apache.spark.streaming.kafka.DirectKafkaInputDStream@5d8ccb6c

This part is tricky

scala> val showlines = messages.filter(_ contains("ASE 15")).filter(_
contains("UPDATE INDEX STATISTICS")).flatMap(line =>
line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
_).collect.foreach(println)
<console>:47: error: value contains is not a member of (String, String)
 val showlines = messages.filter(_ contains("ASE 15")).filter(_
contains("UPDATE INDEX STATISTICS")).flatMap(line =>
line.split("\n,")).map(word => (word, 1)).reduceByKey(_ +
_).collect.foreach(println)


How does one refer to the content of the stream here?

Thanks










//
// Now want to do some analysis on the same text file
//



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 3 April 2016 at 15:32, Ted Yu  wrote:

> bq. split"\t," splits the filter by carriage return
>
> Minor correction: "\t" denotes tab character.
>
> On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas  wrote:
>
>> Hi Mich,
>>
>> 1. The first underscore in your filter call is refering to a line in the
>> file (as textFile() results in a collection of strings)
>> 2. You're correct. No need for it.
>> 3. Filter is expecting a Boolean result. So you can merge your contains
>> filters to one with AND (&&) statement.
>> 4. Correct. Each character in split() is used as a divider.
>>
>> Eliran Bivas
>>
>> *From:* Mich Talebzadeh 
>> *Sent:* Apr 3, 2016 15:06
>> *To:* Eliran Bivas
>> *Cc:* user @spark
>> *Subject:* Re: multiple splits fails
>>
>> Hi Eliran,
>>
>> Many thanks for your input on this.
>>
>> I thought about what I was trying to achieve so I rewrote the logic as
>> follows:
>>
>>
>>1. Read the text file in
>>2. Filter out empty lines (well not really needed here)
>>3. Search for lines that contain "ASE 15" and further have sentence
>>"UPDATE INDEX STATISTICS" in the said line
>>4. Split the text by "\t" and ","
>>5. Print the outcome
>>
>>
>> This was what I did with your suggestions included
>>
>> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
>> f.cache()
>>  f.filter(_.length > 0).filter(_ contains("ASE 15")).filter(_
>> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
>> line.split("\t,")).map(word => (word, 1)).reduceByKey(_ +
>> _).collect.foreach(println)
>>
>>
>> Couple of questions if I may
>>
>>
>>1. I take that "_" refers to content of the file read in by default?
>>2. _.length > 0 basically filters out blank lines (not really needed
>>here)
>>3. Multiple filters are needed for each *contains* logic
>>4. split"\t," splits the filter by carriage return AND ,?
>>
>>
>> Regards
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 3 April 2016 at 12:35, Eliran Bivas  wrote:
>>
>>> Hi Mich,
>>>
>>> Few comments:
>>>
>>> When doing .filter(_ > “”) you’re actually doing a lexicographic
>>> comparison and not filtering for empty lines (which could be achieved with
>>> _.notEmpty or _.length > 0).
>>> I think that filtering with _.contains should be sufficient and the
>>> first filter can be omitted.
>>>
>>> As for line => line.split(“\t”).split(“,”):
>>> You have to do a second map or (since split() 

Re: multiple splits fails

2016-04-03 Thread Ted Yu
bq. split"\t," splits the filter by carriage return

Minor correction: "\t" denotes tab character.

On Sun, Apr 3, 2016 at 7:24 AM, Eliran Bivas  wrote:

> Hi Mich,
>
> 1. The first underscore in your filter call is refering to a line in the
> file (as textFile() results in a collection of strings)
> 2. You're correct. No need for it.
> 3. Filter is expecting a Boolean result. So you can merge your contains
> filters to one with AND (&&) statement.
> 4. Correct. Each character in split() is used as a divider.
>
> Eliran Bivas
>
> *From:* Mich Talebzadeh 
> *Sent:* Apr 3, 2016 15:06
> *To:* Eliran Bivas
> *Cc:* user @spark
> *Subject:* Re: multiple splits fails
>
> Hi Eliran,
>
> Many thanks for your input on this.
>
> I thought about what I was trying to achieve so I rewrote the logic as
> follows:
>
>
>1. Read the text file in
>2. Filter out empty lines (well not really needed here)
>3. Search for lines that contain "ASE 15" and further have sentence
>"UPDATE INDEX STATISTICS" in the said line
>4. Split the text by "\t" and ","
>5. Print the outcome
>
>
> This was what I did with your suggestions included
>
> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
> f.cache()
>  f.filter(_.length > 0).filter(_ contains("ASE 15")).filter(_
> contains("UPDATE INDEX STATISTICS")).flatMap(line =>
> line.split("\t,")).map(word => (word, 1)).reduceByKey(_ +
> _).collect.foreach(println)
>
>
> Couple of questions if I may
>
>
>1. I take that "_" refers to content of the file read in by default?
>2. _.length > 0 basically filters out blank lines (not really needed
>here)
>3. Multiple filters are needed for each *contains* logic
>4. split"\t," splits the filter by carriage return AND ,?
>
>
> Regards
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 3 April 2016 at 12:35, Eliran Bivas  wrote:
>
>> Hi Mich,
>>
>> Few comments:
>>
>> When doing .filter(_ > “”) you’re actually doing a lexicographic
>> comparison and not filtering for empty lines (which could be achieved with
>> _.notEmpty or _.length > 0).
>> I think that filtering with _.contains should be sufficient and the
>> first filter can be omitted.
>>
>> As for line => line.split(“\t”).split(“,”):
>> You have to do a second map or (since split() requires a regex as input)
>> .split(“\t,”).
>> The problem is that your first split() call will generate an Array and
>> then your second call will result in an error.
>> e.g.
>>
>> val lines: Array[String] = line.split(“\t”)
>> lines.split(“,”) // Compilation error - no method split() exists for Array
>>
>> So either go with map(_.split(“\t”)).map(_.split(“,”)) or
>> map(_.split(“\t,”))
>>
>> Hope that helps.
>>
>> *Eliran Bivas*
>> Data Team | iguaz.io
>>
>>
>> On 3 Apr 2016, at 13:31, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I am not sure this is the correct approach
>>
>> Read a text file in
>>
>> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
>>
>>
>> Now I want to get rid of empty lines and filter only the lines that
>> contain "ASE15"
>>
>>  f.filter(_ > "").filter(_ contains("ASE15")).
>>
>> The above works but I am not sure whether I need two filter
>> transformation above? Can it be done in one?
>>
>> Now I want to map the above filter to lines with carriage return ans
>> split them by ","
>>
>> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t")))
>> res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at
>> map at :30
>>
>> Now I want to split the output by ","
>>
>> scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t").split(",")))
>> :30: error: value split is not a member of Array[String]
>>   f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
>> (line.split("\t").split(",")))
>>
>> ^
>> Any advice will be appreciated
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>


Re: multiple splits fails

2016-04-03 Thread Eliran Bivas
Hi Mich,

1. The first underscore in your filter call is referring to a line in the file 
(as textFile() results in a collection of strings)
2. You're correct. No need for it.
3. Filter is expecting a Boolean result. So you can merge your contains filters 
to one with AND (&&) statement.
4. Correct. Each character in split() is used as a divider.

Eliran Bivas

From: Mich Talebzadeh 
Sent: Apr 3, 2016 15:06
To: Eliran Bivas
Cc: user @spark
Subject: Re: multiple splits fails

Hi Eliran,

Many thanks for your input on this.

I thought about what I was trying to achieve so I rewrote the logic as follows:


  1.  Read the text file in
  2.  Filter out empty lines (well not really needed here)
  3.  Search for lines that contain "ASE 15" and further have sentence "UPDATE 
INDEX STATISTICS" in the said line
  4.  Split the text by "\t" and ","
  5.  Print the outcome


This was what I did with your suggestions included

val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
f.cache()
 f.filter(_.length > 0).filter(_ contains("ASE 15")).filter(_ contains("UPDATE 
INDEX STATISTICS")).flatMap(line => line.split("\t,")).map(word => (word, 
1)).reduceByKey(_ + _).collect.foreach(println)


Couple of questions if I may


  1.  I take that "_" refers to content of the file read in by default?
  2.  _.length > 0 basically filters out blank lines (not really needed here)
  3.  Multiple filters are needed for each *contains* logic
  4.  split"\t," splits the filter by carriage return AND ,?

Regards



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 3 April 2016 at 12:35, Eliran Bivas 
> wrote:
Hi Mich,

Few comments:

When doing .filter(_ > "") you're actually doing a lexicographic comparison and 
not filtering for empty lines (which could be achieved with _.nonEmpty or 
_.length > 0).
I think that filtering with _.contains should be sufficient and the first 
filter can be omitted.

As for line => line.split("\t").split(","):
You have to do a second map or (since split() requires a regex as input) 
.split("\t,").
The problem is that your first split() call will generate an Array and then 
your second call will result in an error.
e.g.

val lines: Array[String] = line.split("\t")
lines.split(",") // Compilation error - no method split() exists for Array

So either go with map(_.split("\t")).map(_.split(",")) or map(_.split("\t,"))

Hope that helps.

Eliran Bivas
Data Team | iguaz.io


On 3 Apr 2016, at 13:31, Mich Talebzadeh 
> wrote:

Hi,

I am not sure this is the correct approach

Read a text file in

val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")


Now I want to get rid of empty lines and filter only the lines that contain 
"ASE15"

 f.filter(_ > "").filter(_ contains("ASE15")).

The above works but I am not sure whether I need two filter transformation 
above? Can it be done in one?

Now I want to map the above filter to lines with carriage return ans split them 
by ","

f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t")))
res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at map at <console>:30

Now I want to split the output by ","

scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line => 
(line.split("\t").split(",")))
<console>:30: error: value split is not a member of Array[String]
  f.filter(_ > "").filter(_ contains("ASE15")).map(line => 
(line.split("\t").split(",")))

 ^
Any advice will be appreciated

Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com






Evicting a lower version of a library loaded in Spark Worker

2016-04-03 Thread Yuval.Itzchakov
My code uses "com.typesafe.config" in order to read configuration values.
Currently, our code uses v1.3.0, whereas Spark uses 1.2.1 internally.

When I initiate a job, the worker process invokes a method in my code but
fails, because it's defined abstract in v1.2.1 whereas in v1.3.0 it is not.
The exception message is:

java.lang.IllegalAccessError: tried to access class
com.typesafe.config.impl.SimpleConfig from class
com.typesafe.config.impl.ConfigBeanImpl
at
com.typesafe.config.impl.ConfigBeanImpl.createInternal(ConfigBeanImpl.java:40)

My spark-submit command is follows:

spark-submit \
--driver-class-path "config-1.3.0.jar:hadoop-aws-2.7.1.jar:aws-java-sdk-1.10.62.jar" \
--driver-java-options "-Dconfig.file=/classes/application.conf
-Dlog4j.configurationFile=/classes/log4j2.xml -XX:+UseG1GC
-XX:+UseStringDeduplication" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.executor.memory=5g" \
--conf "spark.driver.memory=5g" \
--conf
"spark.executor.extraClassPath=./config-1.3.0.jar:./hadoop-aws-2.7.1.jar:./aws-java-sdk-1.10.62.jar"
\
--conf "spark.executor.extraJavaOptionsFile=-Dconfig.file=./application.conf
-Dlog4j.configurationFile=./classes/log4j2.xml -XX:+UseG1GC
-XX:+UseStringDeduplication" \
--class SparkRunner spark-job-0.1.1.jar

This still fails, even though the Spark worker loads the jar. This is what I see
on the worker node:

Fetching http://***/jars/config-1.3.0.jar to
/tmp/spark-cac4dfb9-bf59-49bc-ab81-e24923051c86/executor-1104e522-3fa5-4eff-8e0c-43b3c3b24c65/spark-1ce3b6bd-92f2-4652-993e-4f7054d07d21/fetchFileTemp3242938794803852920.tmp
16/04/03 13:57:20 INFO util.Utils: Copying
/tmp/spark-cac4dfb9-bf59-49bc-ab81-e24923051c86/executor-1104e522-3fa5-4eff-8e0c-43b3c3b24c65/spark-1ce3b6bd-92f2-4652-993e-4f7054d07d21/-13916990391459691836506_cache
to /var/run/spark/work/app-20160403135716-0040/1/./config-1.3.0.jar
16/04/03 13:57:20 INFO executor.Executor: Adding
file:/var/run/spark/work/app-20160403135716-0040/1/./config-1.3.0.jar to
class loader

I have tried setting "spark.executor.userClassPathFirst" to true, but then
it blows up with an error saying a different SLF4J library was already
loaded, and crashes the worker process.

Has anyone had to deal with anything similar?
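
One approach that is sometimes used for this kind of version clash (it is not
confirmed in this thread, and the relocated package name below is just a
placeholder) is to shade the conflicting package into the application jar
with the sbt-assembly plugin, so the job uses its own relocated copy of the
library instead of the one Spark ships:

// build.sbt sketch, assuming the job is built with sbt-assembly:
// relocate com.typesafe.config inside the fat jar so the application
// classes no longer resolve against Spark's bundled config-1.2.1.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "shadedconfig.@1").inAll
)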



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Evicting-a-lower-version-of-a-library-loaded-in-Spark-Worker-tp26664.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: multiple splits fails

2016-04-03 Thread Mich Talebzadeh
Hi Eliran,

Many thanks for your input on this.

I thought about what I was trying to achieve so I rewrote the logic as
follows:


   1. Read the text file in
   2. Filter out empty lines (well not really needed here)
   3. Search for lines that contain "ASE 15" and further have sentence
   "UPDATE INDEX STATISTICS" in the said line
   4. Split the text by "\t" and ","
   5. Print the outcome


This was what I did with your suggestions included

val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
f.cache()
 f.filter(_.length > 0).filter(_ contains("ASE 15")).filter(_
contains("UPDATE INDEX STATISTICS")).flatMap(line =>
line.split("\t,")).map(word => (word, 1)).reduceByKey(_ +
_).collect.foreach(println)


Couple of questions if I may


   1. I take that "_" refers to content of the file read in by default?
   2. _.length > 0 basically filters out blank lines (not really needed
   here)
   3. Multiple filters are needed for each *contains* logic
   4. split"\t," splits the filter by carriage return AND ,?


Regards


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 3 April 2016 at 12:35, Eliran Bivas  wrote:

> Hi Mich,
>
> Few comments:
>
> When doing .filter(_ > “”) you’re actually doing a lexicographic
> comparison and not filtering for empty lines (which could be achieved with
> _.notEmpty or _.length > 0).
> I think that filtering with _.contains should be sufficient and the first
> filter can be omitted.
>
> As for line => line.split(“\t”).split(“,”):
> You have to do a second map or (since split() requires a regex as input)
> .split(“\t,”).
> The problem is that your first split() call will generate an Array and
> then your second call will result in an error.
> e.g.
>
> val lines: Array[String] = line.split(“\t”)
> lines.split(“,”) // Compilation error - no method split() exists for Array
>
> So either go with map(_.split(“\t”)).map(_.split(“,”)) or
> map(_.split(“\t,”))
>
> Hope that helps.
>
> *Eliran Bivas*
> Data Team | iguaz.io
>
>
> On 3 Apr 2016, at 13:31, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I am not sure this is the correct approach
>
> Read a text file in
>
> val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")
>
>
> Now I want to get rid of empty lines and filter only the lines that
> contain "ASE15"
>
>  f.filter(_ > "").filter(_ contains("ASE15")).
>
> The above works but I am not sure whether I need two filter transformation
> above? Can it be done in one?
>
> Now I want to map the above filter to lines with carriage return ans split
> them by ","
>
> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
> (line.split("\t")))
> res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at
> map at :30
>
> Now I want to split the output by ","
>
> scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
> (line.split("\t").split(",")))
> :30: error: value split is not a member of Array[String]
>   f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
> (line.split("\t").split(",")))
>
> ^
> Any advice will be appreciated
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: multiple splits fails

2016-04-03 Thread Eliran Bivas
Hi Mich,

Few comments:

When doing .filter(_ > "") you're actually doing a lexicographic comparison and 
not filtering for empty lines (which could be achieved with _.nonEmpty or 
_.length > 0).
I think that filtering with _.contains should be sufficient and the first 
filter can be omitted.

As for line => line.split("\t").split(","):
You have to do a second map or (since split() requires a regex as input) 
.split("\t,").
The problem is that your first split() call will generate an Array and then 
your second call will result in an error.
e.g.

val lines: Array[String] = line.split("\t")
lines.split(",") // Compilation error - no method split() exists for Array

So either go with map(_.split("\t")).map(_.split(",")) or map(_.split("\t,"))

Hope that helps.

Eliran Bivas
Data Team | iguaz.io
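
Worth noting about the split (as an aside, and assuming the intent is to
split on tab or comma): since split() takes a regex, "\t," matches a literal
tab immediately followed by a comma, whereas splitting on either a tab or a
comma needs a character class:

"a\t,b,c".split("\t,")    // Array(a, b,c)  - splits only on tab-then-comma
"a\tb,c".split("[\t,]")   // Array(a, b, c) - splits on tab or comma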


On 3 Apr 2016, at 13:31, Mich Talebzadeh 
> wrote:

Hi,

I am not sure this is the correct approach

Read a text file in

val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")


Now I want to get rid of empty lines and filter only the lines that contain 
"ASE15"

 f.filter(_ > "").filter(_ contains("ASE15")).

The above works but I am not sure whether I need two filter transformation 
above? Can it be done in one?

Now I want to map the above filter to lines with carriage return ans split them 
by ","

f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t")))
res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at map at <console>:30

Now I want to split the output by ","

scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line => 
(line.split("\t").split(",")))
<console>:30: error: value split is not a member of Array[String]
  f.filter(_ > "").filter(_ contains("ASE15")).map(line => 
(line.split("\t").split(",")))

 ^
Any advice will be appreciated

Thanks

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com





multiple splits fails

2016-04-03 Thread Mich Talebzadeh
Hi,

I am not sure this is the correct approach

Read a text file in

val f = sc.textFile("/tmp/ASE15UpgradeGuide.txt")


Now I want to get rid of empty lines and filter only the lines that contain
"ASE15"

 f.filter(_ > "").filter(_ contains("ASE15")).

The above works, but I am not sure whether I need two filter transformations
above. Can it be done in one?

Now I want to map the above filter to lines with carriage return and split
them by ","

f.filter(_ > "").filter(_ contains("ASE15")).map(line => (line.split("\t")))
res88: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[131] at map at <console>:30

Now I want to split the output by ","

scala> f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
(line.split("\t").split(",")))
<console>:30: error: value split is not a member of Array[String]
  f.filter(_ > "").filter(_ contains("ASE15")).map(line =>
(line.split("\t").split(",")))

^
Any advice will be appreciated

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: value saveToCassandra is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2016-04-03 Thread Yuval.Itzchakov
You need to import com.datastax.spark.connector.streaming._ to have the
methods available.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/8_streaming.md



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/value-saveToCassandra-is-not-a-member-of-org-apache-spark-streaming-dstream-DStream-String-Int-tp26655p26663.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RDDs caching in typical machine learning use cases

2016-04-03 Thread Sergey
Hi Spark ML experts!

Do you use RDD caching together with MLlib to speed up computation?
I mean typical machine learning use cases:
train-test split, train, evaluate, apply the model.

Sergey.
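
For what it is worth, a minimal sketch of the pattern (names are illustrative,
and it assumes data is a DataFrame with label and features columns): cache the
training split before an iterative fit so it is not recomputed on every pass.

import org.apache.spark.ml.classification.LogisticRegression

// Sketch: split, cache the training data, train, evaluate.
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
training.cache()                      // iterative algorithms re-read this repeatedly
val model = new LogisticRegression().fit(training)
val predictions = model.transform(test)
training.unpersist()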


Random forest implementation details

2016-04-03 Thread Sergey
Hi!

I'm playing with the random forest implementation in Apache Spark.
My first impression is that it is not fast :-(

Does somebody know how random forest is parallelized in Spark?
I mean both fitting and predicting.

Also, what do these parameters mean? I didn't find documentation for them.
maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10

Sergey
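
For reference, a small sketch of where those parameters are set (assuming the
DataFrame-based RandomForestClassifier; the column names are placeholders).
maxMemoryInMB bounds the memory used for aggregating split statistics,
cacheNodeIds caches each instance's current node instead of passing trees to
executors, and checkpointInterval controls how often that cache is checkpointed.

import org.apache.spark.ml.classification.RandomForestClassifier

// Sketch: the three parameters in question, alongside the usual column setup.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .setMaxMemoryInMB(256)
  .setCacheNodeIds(false)
  .setCheckpointInterval(10)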


Re: spark-shell failing but pyspark works

2016-04-03 Thread Mich Talebzadeh
This works fine for me

val sparkConf = new SparkConf().
 setAppName("StreamTest").
 setMaster("yarn-client").
 set("spark.cores.max", "12").
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")

Time: 1459669805000 ms
---
---
Time: 145966986 ms
---
(Sun Apr 3 08:35:01 BST 2016  === Sending messages from rhes5)
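
Since the reported error shows the ApplicationMaster trying to reach the
driver on localhost, one thing that may be worth checking (a guess, not
something verified in this thread) is whether the driver is advertising a
reachable address; it can be set explicitly via spark.driver.host, e.g.:

import org.apache.spark._
import org.apache.spark.streaming._

// Sketch: force the driver to advertise a specific address (the IP below is
// simply the one shown in the pyspark log, used here as an example).
val conf = new SparkConf().
  setAppName("test-scala").
  setMaster("yarn-client").
  set("spark.driver.host", "192.168.10.100")
val ssc = new StreamingContext(conf, Seconds(3))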




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 3 April 2016 at 03:34, Cyril Scetbon  wrote:

> Nobody has any idea ?
>
> > On Mar 31, 2016, at 23:22, Cyril Scetbon  wrote:
> >
> > Hi,
> >
> > I'm having issues creating a StreamingContext with Scala using
> spark-shell. It tries to access the localhost interface and the Application
> Master is not running on this interface :
> >
> > ERROR ApplicationMaster: Failed to connect to driver at localhost:47257,
> retrying ...
> >
> > I don't have the issue with Python and pyspark which works fine (you can
> see it uses the ip address) :
> >
> > ApplicationMaster: Driver now available: 192.168.10.100:43290
> >
> > I use similar codes though :
> >
> > test.scala :
> > --
> >
> > import org.apache.spark._
> > import org.apache.spark.streaming._
> > val app = "test-scala"
> > val conf = new SparkConf().setAppName(app).setMaster("yarn-client")
> > val ssc = new StreamingContext(conf, Seconds(3))
> >
> > command used : spark-shell -i test.scala
> >
> > test.py :
> > ---
> >
> > from pyspark import SparkConf, SparkContext
> > from pyspark.streaming import StreamingContext
> > app = "test-python"
> > conf = SparkConf().setAppName(app).setMaster("yarn-client")
> > sc = SparkContext(conf=conf)
> > ssc = StreamingContext(sc, 3)
> >
> > command used : pyspark test.py
> >
> > Any idea why scala can't instantiate it ? I thought python was barely
> using scala under the hood, but it seems there are differences. Are there
> any parameters set using Scala but not Python ?
> >
> > Thanks
> > --
> > Cyril SCETBON
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>