Re: Spark 2.0 Release Date

2016-06-07 Thread Reynold Xin
It'd be great to cut an RC as soon as possible. Looking at the
blocker/critical issue list, the majority of them are API audits. I think
people will get back to those once Spark Summit is over, and then we should
see some good progress towards an RC.

On Tue, Jun 7, 2016 at 6:20 AM, Jacek Laskowski  wrote:

> Finally, the PMC voice on the subject. Thanks a lot, Sean!
>
> p.s. Given how much time it takes to ship 2.0 (with so many cool
> features already baked in!) I'd vote for releasing a few more RCs
> before 2.0 hits the shelves. I hope 2.0 is not Java 9 or Jigsaw ;-)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jun 7, 2016 at 3:06 PM, Sean Owen  wrote:
> > I don't believe the intent was to get it out before Spark Summit or
> > something. That shouldn't drive the schedule anyway. But now that
> > there's a 2.0.0 preview available, people who are eager to experiment
> > or test on it can do so now.
> >
> > That probably reduces urgency to get it out the door in order to
> > deliver new functionality. I guessed the 2.0.0 release would be mid
> > June and now I'd guess early July. But, nobody's discussed it per se.
> >
> > In theory only fixes, tests and docs are being merged, so the JIRA
> > count should be going down. It has, slowly. Right now there are 72
> > open issues for 2.0.0, of which 20 are blockers. Most of those are
> > simple "audit x" or "document x" tasks or umbrellas, but, they do
> > represent things that have to get done before a release, and that to
> > me looks like a few more weeks of finishing, pushing, and sweeping
> > under carpets.
> >
> >
> > On Tue, Jun 7, 2016 at 1:45 PM, Jacek Laskowski  wrote:
> >> On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel 
> wrote:
> >>> Do we have any further updates on release date?
> >>
> >> Nope :( And it's even quieter than I could have thought. I was so
> >> certain that today's the date. Looks like Spark Summit has "consumed"
> >> all the people behind 2.0... Can't believe no one (from the
> >> PMC/committers) is even willing to name a date :( Patrick's gone. Reynold's
> >> busy. Perhaps Sean?
> >>
> >>> Also, is there updated documentation for 2.0 somewhere?
> >>
> >>
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/
> ?
> >>
> >> Jacek
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Apache design pattern approaches

2016-06-07 Thread sparkie
Hi, I have been working through some example tutorials for Apache Spark in an
attempt to establish how I would solve the following scenario (see data
examples in the Appendix):
I have 1 billion+ rows that have a key value (i.e. driver ID) and a number
of relevant attributes (product class, date/time) that I need to evaluate
using certain business rules/algorithms. These rules are based on grouped
data (i.e. perform business rules on driver ID 1 and then perform the same
rules on driver ID 2 etc.); typical business rules include the ability to
perform backward and forward looking checks (see sample below) within a
grouped dataset. Importantly, I need to process the grouped data (driver ID
1, 2,3,4 …) concurrently. An example of the business rules:
 
For each data grouping / set (i.e. driver ID = 1 order chronologically by
date):
· The first row is always an ‘INITIATE’ = ROW ID 1

· the product class value also occurs earlier or later (backward- or
forward-looking check) = ‘DUPLICATE’ = ROW ID 2

· the product changed (backward-looking check only) within the same product
class, e.g. A -> A1 = ‘SWAP’ = ROW ID 3

· the product has not previously occurred = ‘ADD’ = ROW ID 4

· the product class value also occurs earlier or later (backward- or
forward-looking check) = ‘DUPLICATE’ = ROW ID 5

· the product class value also occurs earlier or later (backward- or
forward-looking check) = ‘DUPLICATE’ = ROW ID 6

 Questions:
1.   Should I use DataFrames to pull the source data? If so, do I do a
GROUP BY and ORDER BY as part of the SQL query?

2.   How do I then split the grouped data (i.e. driver ID key value
pairs) so that it can be parallelized for concurrent processing (i.e. ideally
the number of parallel grouped data sets should run at maximum cluster
capacity)? Do I need to do some sort of mapPartitions processing?

3.   Pending the answers to (1) & (2): how does each grouped data set
(DataFrame, RDD or Dataset) perform these rule-based checks (i.e. backward-
and forward-looking checks)? In other words, how is this achieved in Spark?

P.S. I have a solid Java background but am a complete Apache Spark novice, so
your help would be really appreciated.
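
For what it's worth, a minimal sketch of how the per-driver, chronologically
ordered, backward/forward-looking checks might look with DataFrame window
functions (this is an illustration only: column names are assumed from the
Appendix, it needs Spark 1.6+ with a HiveContext or a 2.0 SparkSession for
window functions, and the SWAP rule is left out for brevity):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import sqlContext.implicits._

// Toy data with the assumed Appendix columns.
val events = Seq(
  (1, 1, "A",  "2016-01-01"),
  (2, 1, "A",  "2016-02-02"),
  (3, 1, "A1", "2016-03-04")
).toDF("row_id", "driver_id", "product_class", "event_date")

// All rows of one driver, ordered chronologically; lag/lead expose the
// previous/next row within that group.
val byDriver = Window.partitionBy("driver_id").orderBy("event_date")

val flagged = events
  .withColumn("prev_class", lag("product_class", 1).over(byDriver))
  .withColumn("next_class", lead("product_class", 1).over(byDriver))
  .withColumn("result",
    when(row_number().over(byDriver) === 1, "INITIATE")
      .when($"product_class" === $"prev_class" ||
            $"product_class" === $"next_class", "DUPLICATE")
      .otherwise("ADD"))

flagged.show()

Note that lag/lead only look one row back/forward; a full "has this product
class ever occurred before/after" check would need an unbounded window frame
(or a groupBy/collect-style aggregation) instead.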



Appendix
Input/OUTPUT
 
ROWID, Driver ID, product class, date,   RESULT
1,     1,         A,             1/1/16, INITIATE
2,     1,         A,             2/2/16, DUPLICATE
3,     1,         A1,            3/4/16, SWAP
4,     1,         B,             2/5/16, ADD
5,     1,         C,             1/1/16, DUPLICATE
6,     1,         C,             2/2/16, DUPLICATE
7,     2,         A,             2/2/16, INITIATE
8,     2,         B,             3/4/16, ADD
9,     2,         A,             2/5/16, DUPLICATE


--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-design-pattern-approaches-tp27109.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache design patterns

2016-06-07 Thread Francois Le Roux
Thanks Ted

Hi, I have been working through some example tutorials for Apache Spark in
an attempt to establish how I would solve the following scenario (see data
examples in the Appendix):

I have 1 billion+ rows that have a key value (i.e. driver ID) and a number
of relevant attributes (product class, date/time) that I need to evaluate
using certain business rules/algorithms. These rules are based on grouped
data (i.e. perform business rules on driver ID 1 and then perform the same
rules on driver ID 2 etc.); typical business rules include the ability to
perform backward and forward looking checks (see sample below) within a
grouped dataset. Importantly, I need to process the grouped data (driver ID
1, 2,3,4 …) concurrently. An example of the business rules:



For each data grouping / set (i.e. driver ID = 1 order chronologically by
date):

· The first row is always an ‘INITIATE’ = ROW ID 1

· the product class value also occurs earlier or later (backward- or
forward-looking check) = ‘DUPLICATE’ = ROW ID 2

· the product changed (backward-looking check only) within the same product
class, e.g. A -> A1 = ‘SWAP’ = ROW ID 3

· the product has not previously occurred = ‘ADD’ = ROW ID 4

· the product class value also occurs earlier or later (backward- or
forward-looking check) = ‘DUPLICATE’ = ROW ID 5

· the product class value also occurs earlier or later (backward- or
forward-looking check) = ‘DUPLICATE’ = ROW ID 6

 Questions:

1.   Should I use DataFrames to pull the source data? If so, do I do a
GROUP BY and ORDER BY as part of the SQL query?

2.   How do I then split the grouped data (i.e. driver ID key value
pairs) so that it can be parallelized for concurrent processing (i.e. ideally
the number of parallel grouped data sets should run at maximum cluster
capacity)? Do I need to do some sort of mapPartitions processing?

3.   Pending the answers to (1) & (2): how does each grouped data set
(DataFrame, RDD or Dataset) perform these rule-based checks (i.e. backward-
and forward-looking checks)? In other words, how is this achieved in Spark?

P.S. I have a solid Java background but am a complete Apache Spark novice, so
your help would be really appreciated.


sparkie



Appendix

Input/OUTPUT



ROWID, Driver ID, product class, date,   RESULT

1,     1,         A,             1/1/16, INITIATE
2,     1,         A,             2/2/16, DUPLICATE
3,     1,         A1,            3/4/16, SWAP
4,     1,         B,             2/5/16, ADD
5,     1,         C,             1/1/16, DUPLICATE
6,     1,         C,             2/2/16, DUPLICATE
7,     2,         A,             2/2/16, INITIATE
8,     2,         B,             3/4/16, ADD
9,     2,         A,             2/5/16, DUPLICATE

On Wed, Jun 8, 2016 at 1:39 PM, Ted Yu  wrote:

> I think this is the correct forum.
>
> Please describe your case.
>
> On Jun 7, 2016, at 8:33 PM, Francois Le Roux 
> wrote:
>
> HI folks, I have been working through the available online Apache spark
>  tutorials and I am stuck with a scenario that i would like to solve in
> SPARK. Is this a forum where i can publish a narrative for the problem /
> scenario that i am trying to solve ?
>
>
> any assitance appreciated
>
>
> thanks
>
> frank
>
>


Re: Apache design patterns

2016-06-07 Thread Ted Yu
I think this is the correct forum. 

Please describe your case. 

> On Jun 7, 2016, at 8:33 PM, Francois Le Roux  wrote:
> 
> HI folks, I have been working through the available online Apache spark  
> tutorials and I am stuck with a scenario that i would like to solve in SPARK. 
> Is this a forum where i can publish a narrative for the problem / scenario 
> that i am trying to solve ? 
> 
> 
> 
> any assitance appreciated
> 
> 
> 
> thanks
> 
> frank


Re: Spark_Usecase

2016-06-07 Thread vaquar khan
Deepak, Spark does provide support for incremental loads if users want to
schedule their batch jobs frequently and have the data loaded incrementally
from their databases.

You will not get good performance updating Spark SQL tables backed by files.
Instead, you can use message queues and Spark Streaming, or do an incremental
select, to make sure your Spark SQL tables stay up to date with your
production databases.
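
As a hedged sketch of what an incremental select can look like (the
connection details, table and column names below are made up), you can push a
watermark predicate into the JDBC read and only pull rows changed since the
last run:

// Read only the rows changed since the last run; the watermark is something
// the job persists itself (e.g. in a small HDFS file or a control table).
val lastWatermark = "2016-06-01 00:00:00"

val increment = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://dbhost:3306/sales",
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> s"(select * from orders where updated_at > '$lastWatermark') t"
)).load()

increment.write.mode("append").saveAsTable("orders_staging")
// then persist the new max(updated_at) as the watermark for the next run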

Regards,
Vaquar khan
On 7 Jun 2016 10:29, "Deepak Sharma"  wrote:

I am not sure if Spark provides any support for incremental extracts
inherently.
But you can maintain a file, e.g. extractRange.conf in HDFS, read the end of
the last extracted range from it, and have the Spark job update it with the
new end range before it finishes, so the relevant ranges are available for
the next run.

On Tue, Jun 7, 2016 at 8:49 PM, Ajay Chander  wrote:

> Hi Mich, thanks for your inputs. I used sqoop to get the data from MySQL.
> Now I am using spark to do the same. Right now, I am trying
> to implement incremental updates while loading from MySQL through spark.
> Can you suggest any best practices for this ? Thank you.
>
>
> On Tuesday, June 7, 2016, Mich Talebzadeh 
> wrote:
>
>> I use Spark rather than Sqoop to import data from an Oracle table into a
>> Hive ORC table.
>>
>> It uses JDBC for this purpose. All inclusive in Scala itself.
>>
>> Also, Hive can run on the Spark engine, an order of magnitude faster than
>> Hive on map-reduce.
>>
>> Pretty simple.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 June 2016 at 15:38, Ted Yu  wrote:
>>
>>> bq. load the data from edge node to hdfs
>>>
>>> Does the loading involve accessing sqlserver ?
>>>
>>> Please take a look at
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>>
>>> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni 
>>> wrote:
>>>
 Hi
 how about

 1.  have a process that reads the data from your sqlserver and dumps it
 as a file into a directory on your hd
 2. use spark-streaming to read data from that directory and store it
 into hdfs

 perhaps there is some sort of spark 'connectors' that allows you to
 read data from a db directly so you don't need to go via spark streaming?


 hth










 On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander 
 wrote:

> Hi Spark users,
>
> Right now we are using spark for everything(loading the data from
> sqlserver, apply transformations, save it as permanent tables in
> hive) in our environment. Everything is being done in one spark 
> application.
>
> The only thing we do before we launch our spark application through
> oozie is, to load the data from edge node to hdfs(it is being triggered
> through a ssh action from oozie to run shell script on edge node).
>
> My question is, is there any way we can accomplish the edge-to-hdfs copy
> through spark ? So that everything is done in one spark DAG and lineage
> graph?
>
> Any pointers are highly appreciated. Thanks
>
> Regards,
> Aj
>


>>>
>>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Apache design patterns

2016-06-07 Thread Francois Le Roux
Hi folks, I have been working through the available online Apache Spark
tutorials and I am stuck with a scenario that I would like to solve in
Spark. Is this a forum where I can publish a narrative for the problem /
scenario that I am trying to solve?


Any assistance appreciated


thanks

frank


Re: SparkR interaction with R libraries (currently 1.5.2)

2016-06-07 Thread Sun Rui
Hi, Ian,
You should not use the Spark DataFrame a_df in your closure.
For an R function for lapplyPartition, the parameter is a list of lists, 
representing the rows in the corresponding partition.
In Spark 2.0, SparkR provides a new public API called dapply, which can apply 
an R function to each partition of a Spark DataFrame. The input of the R 
function is a data.frame corresponding to the partition data, and the output 
is also a data.frame.
You may download the Spark 2.0 preview release and give it a try.
> On Jun 8, 2016, at 01:58, rachmaninovquartet  
> wrote:
> 
> Hi,
> I'm trying to figure out how to work with R libraries in spark, properly.
> I've googled and done some trial and error. The main error, I've been
> running into is "cannot coerce class "structure("DataFrame", package =
> "SparkR")" to a data.frame". I'm wondering if there is a way to use the R
> dataframe functionality on worker nodes or if there is a way to "hack" the R
> function in order to make it accept Spark dataframes. Here is an example of
> what I'm trying to do, with a_df being a spark dataframe:
> 
> ***DISTRIBUTED***
> #0 filter out nulls
> a_df <- filter(a_df, isNotNull(a_df$Ozone))
> 
> #1 make closure
> treeParty <- function(x) {
># Use sparseMatrix function from the Matrix package
>air.ct <- ctree(Ozone ~ ., data = a_df)
> }
> 
> #2  put package in context
> SparkR:::includePackage(sc, partykit)
> 
> #3 apply to all partitions
> partied <- SparkR:::lapplyPartition(a_df, treeParty)
> 
> 
> **LOCAL***
> Here is R code that works with a local dataframe, local_df:
> local_df <- subset(airquality, !is.na(Ozone))
> air.ct <- ctree(Ozone ~ ., data = local_df)
> 
> Any advice would be greatly appreciated!
> 
> Thanks,
> 
> Ian
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-interaction-with-R-libraries-currently-1-5-2-tp27107.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Dealing with failures

2016-06-07 Thread Mohit Anchlia
I am looking to write an ETL job using Spark that reads data from the
source, performs transformations and inserts it into the destination. I am
trying to understand how Spark deals with failures. I can't seem to find
the documentation. I am interested in the following scenarios:
1. The source becomes slow or unresponsive. How do I control such a situation
so that it doesn't cause a DDoS on the source? Also, at the same time, how do
I make it resilient so that it picks up from where it left off?
2. The same questions when the destination becomes slow or unresponsive.
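
For the "don't overwhelm the source" part of (1), one knob worth knowing
about is the partitioned JDBC read, where numPartitions bounds how many
concurrent queries Spark issues against the source. A minimal sketch (the
connection details, table and column names are made up):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "etl")
props.setProperty("password", "secret")

// At most 4 parallel reads hit the source, each covering a slice of `id`.
val orders = sqlContext.read.jdbc(
  url = "jdbc:postgresql://dbhost:5432/shop",
  table = "orders",
  columnName = "id",
  lowerBound = 1L,
  upperBound = 1000000L,
  numPartitions = 4,
  connectionProperties = props)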


Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
I'm using dataframes, types are all doubles and I'm only extracting what I
need.

The caveat on these is that I am porting an existing system for a client,
and for their business it's likely to be cheaper to throw hardware (in AWS)
at the problem for a couple of hours than to re-engineer their algorithms.

cheers


On 7 June 2016 at 21:54, Jörn Franke  wrote:

> Before hardware optimization there is always software optimization.
> Are you using dataset / dataframe? Are you using the  right data types (
> eg int where int is appropriate , try to avoid string and char etc)
> Do you extract only the stuff needed? What are the algorithm parameters?
>
> > On 07 Jun 2016, at 13:09, Franc Carter  wrote:
> >
> >
> > Hi,
> >
> > I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and
> am interested in how it might be best to scale it - e.g more cpus per
> instances, more memory per instance, more instances etc.
> >
> > I'm currently using 32 m3.xlarge instances for a training set with
> 2.5 million rows, 1300 columns and a total size of 31GB (parquet)
> >
> > thanks
> >
> > --
> > Franc
>



-- 
Franc


Re: Sequential computation over several partitions

2016-06-07 Thread Mich Talebzadeh
Am I correct in understanding that you want to read and iterate over all the
data for it to be correct? For example, if a user is already unsubscribed
then you want to ignore all the subsequent subscribes regardless.

How often do you want to iterate through the full data, i.e. what is the
frequency of your analysis?

The issue I believe you may face is that as you go from t0 -> t1 -> ... tn,
your volume of data is going to rise.

How about periodic storage of your analysis and working on deltas only
afterwards?

What sort of data is it? Is it typical web users?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 7 June 2016 at 22:54, Jeroen Miller  wrote:

> Dear fellow Sparkers,
>
> I am a new Spark user and I am trying to solve a (conceptually simple)
> problem which may not be a good use case for Spark, at least for the RDD
> API. But before I turn my back on it, I would rather have the opinion of
> more knowledgeable developers than me, as it is highly likely that I am
> missing something.
>
> Here is my problem in a nutshell.
>
> I have numerous files where each line is an event of the form:
>
> (subscribe|unsubscribe),,,
>
> I need to gather time-framed (for example, weekly or monthly) statistics
> of the kind:
>
>   ,
>   ,
>   ,
>   
>
> Ideally, I would need a Spark job that output these statistics for all
> time periods at once. The number of unique  is a about a few
> hundreds, the number of unique  is a few dozens of millions.
>
> The trouble is that the data is not "clean", in the sense that I can have
> 'unsubscribe' events for users that are not subscribed, or 'subscribe'
> events for users that are already subscribed.
>
> This means I have to keep in memory the complete list of
>
> (subscribe|unsubscribe),,,
>
> keeping only the entry for the most recent  for a given couple
> (list_id,user_id).
>
> If one is only interested in keeping the final statistics, this is
> relatively easy to do with reduceByKey and combineByKey on a properly keyed
> RDD containing all events.
>
> However I am struggling when it comes down to compute the "partial"
> statistics, as I obviously do not want to duplicate most of the
> computations done for period (i-1) when I am adding the events for period
> (i) as my reduceByKey/combineByKey approach will lead to.
>
> Sequentially, the problem is trivial: keep all events (with the latest
> 'valid' event for each couple (list_id,user_id)) in a huge hash table which
> can be used to decide whether to increment or decrement 
> (for example) and save the states of the current statistics whenever we are
> done dealing with period (i).
>
> I do not know how to efficiently solve this in Spark though.
>
> A naive idea would be to fetch the data for period(0) in an explicitly
> partitioned RDD (for example according to the last few characters of
> ) and proceed in a sequential fashion within a call to
> mapPartition.
>
> The trouble would then be how to process new data files for later periods.
> Suppose I store the event RDDs in an array 'data' (indexed by period
> number), all of them similarly partitioned, I am afraid something like this
> is not viable (please excuse pseudo-code):
>
> data[0].mapPartitionWithIndex(
>
>   (index, iterator) => {
> //
> // 1. Initialize 'hashmap' keyed by (list_id,user_id) for the
> partition
> //
> val hashmap = new HashMap[(String, String), Event]
>
> //
> // 2. Iterate over events in data[0] rdd, update 'hashmap',
> //output stats for this partition and period.
> //
> while (iterator.hasNext) {
> //
> // Process entry, update 'hashmap', output stats
> // for the partition and period.
> //
> }
>
> //
> // 3. Loop over all the periods.
> //
> for (period <- 1 until max) {
> val next = data[period].mapPartitionWithIndex(
> (index2, iterator2) => {
> if (index2 == index) {
> while (iterator2.hasNext) {
> //
> // Iterate over the elements of next (since
> // the data should be on the same node, so
> no
> // shuffling after the initial
> partitioning,
> // right?), update 'hashmap', and output
> stats
> // for this partition and period.
> //
> }
> } else {
> iterator2
> }
> }
> 

Sequential computation over several partitions

2016-06-07 Thread Jeroen Miller
Dear fellow Sparkers,

I am a new Spark user and I am trying to solve a (conceptually simple)
problem which may not be a good use case for Spark, at least for the RDD
API. But before I turn my back on it, I would rather have the opinion of
more knowledgeable developers than me, as it is highly likely that I am
missing something.

Here is my problem in a nutshell.

I have numerous files where each line is an event of the form:

(subscribe|unsubscribe),<list_id>,<user_id>,<timestamp>

I need to gather time-framed (for example, weekly or monthly) statistics of
the kind:

  <period>,
  <list_id>,
  <number of subscribes>,
  <number of unsubscribes>

Ideally, I would need a Spark job that outputs these statistics for all time
periods at once. The number of unique <list_id> values is about a few
hundred; the number of unique <user_id> values is a few tens of millions.

The trouble is that the data is not "clean", in the sense that I can have
'unsubscribe' events for users that are not subscribed, or 'subscribe'
events for users that are already subscribed.

This means I have to keep in memory the complete list of

(subscribe|unsubscribe),<list_id>,<user_id>,<timestamp>

keeping only the entry for the most recent <timestamp> for a given couple
(list_id, user_id).

If one is only interested in keeping the final statistics, this is
relatively easy to do with reduceByKey and combineByKey on a properly keyed
RDD containing all events.

However I am struggling when it comes down to compute the "partial"
statistics, as I obviously do not want to duplicate most of the
computations done for period (i-1) when I am adding the events for period
(i) as my reduceByKey/combineByKey approach will lead to.

Sequentially, the problem is trivial: keep all events (with the latest
'valid' event for each couple (list_id, user_id)) in a huge hash table which
can be used to decide whether to increment or decrement the subscriber count
(for example), and save the state of the current statistics whenever we are
done dealing with period (i).
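
For the "latest valid event per (list_id, user_id)" part on its own, a
minimal Spark sketch could look like the following (an illustration only:
Event and parse are made up, and the timestamp is assumed to parse to a
Long); the incremental per-period statistics are the part this does not
address:

// Keep the most recent event per (list_id, user_id).
case class Event(action: String, listId: String, userId: String, ts: Long)

def parse(line: String): Event = {
  val Array(action, listId, userId, ts) = line.split(",")
  Event(action, listId, userId, ts.toLong)
}

val latest = sc.textFile("events/*")               // hypothetical path
  .map(parse)
  .map(e => ((e.listId, e.userId), e))
  .reduceByKey((a, b) => if (a.ts >= b.ts) a else b)   // newest wins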

I do not know how to efficiently solve this in Spark though.

A naive idea would be to fetch the data for period(0) in an explicitly
partitioned RDD (for example according to the last few characters of
) and proceed in a sequential fashion within a call to
mapPartition.

The trouble would then be how to process new data files for later periods.
Suppose I store the event RDDs in an array 'data' (indexed by period
number), all of them similarly partitioned, I am afraid something like this
is not viable (please excuse pseudo-code):

data[0].mapPartitionWithIndex(
  (index, iterator) => {
    //
    // 1. Initialize 'hashmap' keyed by (list_id, user_id) for the partition.
    //
    val hashmap = new HashMap[(String, String), Event]

    //
    // 2. Iterate over events in the data[0] rdd, update 'hashmap',
    //    output stats for this partition and period.
    //
    while (iterator.hasNext) {
      //
      // Process entry, update 'hashmap', output stats
      // for the partition and period.
      //
    }

    //
    // 3. Loop over all the periods.
    //
    for (period <- 1 until max) {
      val next = data[period].mapPartitionWithIndex(
        (index2, iterator2) => {
          if (index2 == index) {
            while (iterator2.hasNext) {
              //
              // Iterate over the elements of next (since the data should be
              // on the same node, so no shuffling after the initial
              // partitioning, right?), update 'hashmap', and output stats
              // for this partition and period.
              //
            }
          } else {
            iterator2
          }
        }
      )
    }
  }
)

The trouble with this approach is that I am afraid the data files for
period (i > 0) will be read as many times as there are partitions in
data[0], unless I explicitly persist them, which is inefficient. That
said, there is probably a (clumsy) way to unpersist them once I am sure I'm
100% done with them.

All of this looks not only inelegant but shamefully un-spark like to me.

Am I missing a trick here, maybe a well-known pattern? Are RDDs not the
most appropriate API to handle this kind of tasks? If so, what do you
suggest I could look into?

Thank you for taking the time to read that overly long message ;-)

Jeroen


Optional columns in Aggregated Metrics by Executor in web UI?

2016-06-07 Thread Jacek Laskowski
Hi,

I'd like to see the other optional columns in the Aggregated Metrics by
Executor table per stage in the web UI. I can easily get the Shuffle Read
Size / Records and Shuffle Write Size / Records columns:

scala> sc.parallelize(0 to 9).map((_,1)).groupBy(_._1).count

I can't seem to figure out what Spark job to execute to have Input
Size / Records and Output Size / Records + Shuffle Spill (Memory) and
Shuffle Spill (Disk) columns.
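
A guess that might help narrow it down (not verified): the Input/Output
columns seem tied to reading from and writing to Hadoop-backed storage rather
than parallelized collections, so something like the job below should light
up at least Input Size / Records (the paths are hypothetical), while the
spill columns additionally require the stage to actually spill:

val words = sc.textFile("README.md")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

words.saveAsTextFile("/tmp/wordcounts")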

Any ideas? Thanks!

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: setting column names on dataset

2016-06-07 Thread Koert Kuipers
That's neat
On Jun 7, 2016 4:34 PM, "Jacek Laskowski"  wrote:

> Hi,
>
> What about this?
>
> scala> final case class Person(name: String, age: Int)
> warning: there was one unchecked warning; re-run with -unchecked for
> details
> defined class Person
>
> scala> val ds = Seq(Person("foo", 42), Person("bar", 24)).toDS
> ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]
>
> scala> ds.as("a").joinWith(ds.as("b"), $"a.name" === $"b.name
> ").show(false)
> +++
> |_1  |_2  |
> +++
> |[foo,42]|[foo,42]|
> |[bar,24]|[bar,24]|
> +++
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jun 7, 2016 at 9:30 PM, Koert Kuipers  wrote:
> > for some operators on Dataset, like joinWith, one needs to use an
> expression
> > which means referring to columns by name.
> >
> > how can i set the column names for a Dataset before doing a joinWith?
> >
> > currently i am aware of:
> > df.toDF("k", "v").as[(K, V)]
> >
> > but that seems inefficient/anti-pattern? i shouldn't have to go to a
> > DataFrame and back to set the column names?
> >
> > or if this is the only way to set names, and column names really
> shouldn't
> > be used in Datasets, can i perhaps refer to the columns by their
> position?
> >
> > thanks, koert
>


Re: Dataset - reduceByKey

2016-06-07 Thread Jacek Laskowski
Hi Bryan,

What about groupBy [1] and agg [2]? What about UserDefinedAggregateFunction [3]?

[1] 
https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@groupBy(col1:String,cols:String*):org.apache.spark.sql.RelationalGroupedDataset
[2] 
https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.RelationalGroupedDataset
[3] 
https://home.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction
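
As a concrete, hedged sketch (Spark 2.0 APIs, toy data and a made-up Sale
type; `spark` is assumed to be a 2.0 SparkSession): the typed route goes
through groupByKey/reduceGroups, the untyped route through groupBy/agg:

import org.apache.spark.sql.functions.sum
import spark.implicits._

case class Sale(shop: String, amount: Long)
val sales = Seq(Sale("a", 1L), Sale("a", 2L), Sale("b", 5L)).toDS()

// Typed: roughly rdd.map(s => (s.shop, s.amount)).reduceByKey(_ + _)
val typed = sales
  .map(s => (s.shop, s.amount))
  .groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2))
  .map { case (shop, (_, total)) => (shop, total) }

// Untyped: groupBy + agg
val untyped = sales.groupBy($"shop").agg(sum($"amount"))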

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 8:32 PM, Bryan Jeffrey  wrote:
> Hello.
>
> I am looking at the option of moving RDD based operations to Dataset based
> operations.  We are calling 'reduceByKey' on some pair RDDs we have.  What
> would the equivalent be in the Dataset interface - I do not see a simple
> reduceByKey replacement.
>
> Regards,
>
> Bryan Jeffrey
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: setting column names on dataset

2016-06-07 Thread Jacek Laskowski
Hi,

What about this?

scala> final case class Person(name: String, age: Int)
warning: there was one unchecked warning; re-run with -unchecked for details
defined class Person

scala> val ds = Seq(Person("foo", 42), Person("bar", 24)).toDS
ds: org.apache.spark.sql.Dataset[Person] = [name: string, age: int]

scala> ds.as("a").joinWith(ds.as("b"), $"a.name" === $"b.name").show(false)
+--------+--------+
|_1      |_2      |
+--------+--------+
|[foo,42]|[foo,42]|
|[bar,24]|[bar,24]|
+--------+--------+

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 9:30 PM, Koert Kuipers  wrote:
> for some operators on Dataset, like joinWith, one needs to use an expression
> which means referring to columns by name.
>
> how can i set the column names for a Dataset before doing a joinWith?
>
> currently i am aware of:
> df.toDF("k", "v").as[(K, V)]
>
> but that seems inefficient/anti-pattern? i shouldn't have to go to a
> DataFrame and back to set the column names?
>
> or if this is the only way to set names, and column names really shouldn't
> be used in Datasets, can i perhaps refer to the columns by their position?
>
> thanks, koert

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
Hi,

I'm not surprised to see Hadoop jars on the driver (yet I couldn't
explain exactly why they need to be there). I can't find a way now to
display the classpath for executors.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 10:05 PM, satish saley  wrote:
> Thank you Jacek.
> In case of YARN, I see that hadoop jars are present in system classpath for
> Driver. Will it be the same for all executors?
>
> On Tue, Jun 7, 2016 at 11:22 AM, Jacek Laskowski  wrote:
>>
>> Ouch, I made a mistake :( Sorry.
>>
>> You've asked about spark **history** server. It's pretty much the same.
>>
>> HistoryServer is a web interface for completed and running (aka
>> incomplete) Spark applications. It uses EventLoggingListener to
>> collect events as JSON using org.apache.spark.util.JsonProtocol
>> object.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Tue, Jun 7, 2016 at 8:18 PM, Jacek Laskowski  wrote:
>> > Hi,
>> >
>> > It is the driver - see the port. Is this 4040 or similar? It's started
>> > when SparkContext starts and is controlled by spark.ui.enabled.
>> >
>> > spark.ui.enabled (default: true) = controls whether the web UI is
>> > started or not.
>> >
>> > It's through JobProgressListener which is the SparkListener for web UI
>> > that the console knows what happens under the covers (and can
>> > calculate the stats).
>> >
>> > BTW, spark.ui.port (default: 4040) controls the port Web UI binds to.
>> >
>> > Pozdrawiam,
>> > Jacek Laskowski
>> > 
>> > https://medium.com/@jaceklaskowski/
>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>> >
>> > On Tue, Jun 7, 2016 at 8:11 PM, satish saley 
>> > wrote:
>> >> Hi,
>> >> In spark history server, we see environment tab. Is it show environment
>> >> of
>> >> Driver or Executor or both?
>> >>
>> >> Jobs
>> >> Stages
>> >> Storage
>> >> Environment
>> >> Executors
>> >>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Join two Spark Streaming

2016-06-07 Thread vinay453
I am working on windowed DStreams wherein each DStream contains 3 RDDs with
the following keys:

a,b,c
b,c,d
c,d,e
d,e,f

I want to get only the unique keys across all DStreams:

a,b,c,d,e,f

How do I do this in PySpark Streaming?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Join-two-Spark-Streaming-tp9052p27108.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Environment tab meaning

2016-06-07 Thread satish saley
Thank you Jacek.
In case of YARN, I see that hadoop jars are present in system classpath for
Driver. Will it be the same for all executors?

On Tue, Jun 7, 2016 at 11:22 AM, Jacek Laskowski  wrote:

> Ouch, I made a mistake :( Sorry.
>
> You've asked about spark **history** server. It's pretty much the same.
>
> HistoryServer is a web interface for completed and running (aka
> incomplete) Spark applications. It uses EventLoggingListener to
> collect events as JSON using org.apache.spark.util.JsonProtocol
> object.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jun 7, 2016 at 8:18 PM, Jacek Laskowski  wrote:
> > Hi,
> >
> > It is the driver - see the port. Is this 4040 or similar? It's started
> > when SparkContext starts and is controlled by spark.ui.enabled.
> >
> > spark.ui.enabled (default: true) = controls whether the web UI is
> > started or not.
> >
> > It's through JobProgressListener which is the SparkListener for web UI
> > that the console knows what happens under the covers (and can
> > calculate the stats).
> >
> > BTW, spark.ui.port (default: 4040) controls the port Web UI binds to.
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > 
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> >
> > On Tue, Jun 7, 2016 at 8:11 PM, satish saley 
> wrote:
> >> Hi,
> >> In spark history server, we see environment tab. Is it show environment
> of
> >> Driver or Executor or both?
> >>
> >> Jobs
> >> Stages
> >> Storage
> >> Environment
> >> Executors
> >>
>


Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
All,

Thank you for the replies.  It seems as though the Dataset API is still far
behind the RDD API.  This is unfortunate as the Dataset API potentially
provides a number of performance benefits.  I will move to using it in a
more limited set of cases for the moment.

Thank you!

Bryan Jeffrey

On Tue, Jun 7, 2016 at 2:50 PM, Richard Marscher 
wrote:

> There certainly are some gaps between the richness of the RDD API and the
> Dataset API. I'm also migrating from RDD to Dataset and ran into
> reduceByKey and join scenarios.
>
> In the spark-dev list, one person was discussing reduceByKey being
> sub-optimal at the moment and it spawned this JIRA
> https://issues.apache.org/jira/browse/SPARK-15598. But you might be able
> to get by with groupBy().reduce() for now, check performance though.
>
> As for join, the approach would be using the joinWith function on Dataset.
> Although the API isn't as sugary as it was for RDD IMO, something which
> I've been discussing in a separate thread as well. I can't find a weblink
> for it but the thread subject is "Dataset Outer Join vs RDD Outer Join".
>
> On Tue, Jun 7, 2016 at 2:40 PM, Bryan Jeffrey 
> wrote:
>
>> It would also be nice if there was a better example of joining two
>> Datasets. I am looking at the documentation here:
>> http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems
>> a little bit sparse - is there a better documentation source?
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>> On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
>> wrote:
>>
>>> Hello.
>>>
>>> I am looking at the option of moving RDD based operations to Dataset
>>> based operations.  We are calling 'reduceByKey' on some pair RDDs we have.
>>> What would the equivalent be in the Dataset interface - I do not see a
>>> simple reduceByKey replacement.
>>>
>>> Regards,
>>>
>>> Bryan Jeffrey
>>>
>>>
>>
>
>
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook  | LinkedIn
> 
>


Affinity Propagation

2016-06-07 Thread Tim Gautier
Does anyone know of a good library usable in scala spark that has affinity
propagation?


setting column names on dataset

2016-06-07 Thread Koert Kuipers
for some operators on Dataset, like joinWith, one needs to use an
expression which means referring to columns by name.

how can i set the column names for a Dataset before doing a joinWith?

currently i am aware of:
df.toDF("k", "v").as[(K, V)]

but that seems inefficient/anti-pattern? i shouldn't have to go to a
DataFrame and back to set the column names?

or if this is the only way to set names, and column names really shouldn't
be used in Datasets, can i perhaps refer to the columns by their position?

thanks, koert


MapType in Java unsupported in Spark 1.5

2016-06-07 Thread Baichuan YANG
Hi ,

I am a Spark 1.5 user and am currently doing some testing on spark-shell.
When I execute a SQL query using a UDF returning a Map type in Java,
spark-shell returns an error message saying that MapType in Java is
unsupported. So I wonder whether it is possible to convert a MapType in Java
to a MapType in Scala or to any other type supported in the link below:
http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types
Or is there no way to do so?
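
For comparison, a Scala UDF returning a Map registers fine in spark-shell, so
one workaround may be to define the UDF on the Scala side (a minimal sketch,
not a fix for the Java API):

// A Scala UDF whose return type is a Map, usable from SQL.
sqlContext.udf.register("tags",
  (s: String) => Map("raw" -> s, "upper" -> s.toUpperCase))

sqlContext.sql("select tags('hello') as t").show(false)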
Thanks

Regards,
BaiChuan Yang


Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher
There certainly are some gaps between the richness of the RDD API and the
Dataset API. I'm also migrating from RDD to Dataset and ran into
reduceByKey and join scenarios.

In the spark-dev list, one person was discussing reduceByKey being
sub-optimal at the moment and it spawned this JIRA
https://issues.apache.org/jira/browse/SPARK-15598. But you might be able to
get by with groupBy().reduce() for now, check performance though.

As for join, the approach would be using the joinWith function on Dataset.
Although the API isn't as sugary as it was for RDD IMO, something which
I've been discussing in a separate thread as well. I can't find a weblink
for it but the thread subject is "Dataset Outer Join vs RDD Outer Join".

On Tue, Jun 7, 2016 at 2:40 PM, Bryan Jeffrey 
wrote:

> It would also be nice if there was a better example of joining two
> Datasets. I am looking at the documentation here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems
> a little bit sparse - is there a better documentation source?
>
> Regards,
>
> Bryan Jeffrey
>
> On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
> wrote:
>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations.  We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the equivalent be in the Dataset interface - I do not see a
>> simple reduceByKey replacement.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>


-- 
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



Re: Dataset - reduceByKey

2016-06-07 Thread Takeshi Yamamuro
Seems you can see docs for 2.0 for now;
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/spark-2.0.0-SNAPSHOT-2016_06_07_07_01-1e2c931-docs/

// maropu

On Tue, Jun 7, 2016 at 11:40 AM, Bryan Jeffrey 
wrote:

> It would also be nice if there was a better example of joining two
> Datasets. I am looking at the documentation here:
> http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems
> a little bit sparse - is there a better documentation source?
>
> Regards,
>
> Bryan Jeffrey
>
> On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
> wrote:
>
>> Hello.
>>
>> I am looking at the option of moving RDD based operations to Dataset
>> based operations.  We are calling 'reduceByKey' on some pair RDDs we have.
>> What would the equivalent be in the Dataset interface - I do not see a
>> simple reduceByKey replacement.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: Dataset Outer Join vs RDD Outer Join

2016-06-07 Thread Richard Marscher
For anyone following along the chain went private for a bit, but there were
still issues with the bytecode generation in the 2.0-preview so this JIRA
was created: https://issues.apache.org/jira/browse/SPARK-15786

On Mon, Jun 6, 2016 at 1:11 PM, Michael Armbrust 
wrote:

> That kind of stuff is likely fixed in 2.0.  If you can get a reproduction
> working there it would be very helpful if you could open a JIRA.
>
> On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher  > wrote:
>
>> A quick unit test attempt didn't get far replacing map with as[], I'm
>> only working against 1.6.1 at the moment though, I was going to try 2.0 but
>> I'm having a hard time building a working spark-sql jar from source, the
>> only ones I've managed to make are intended for the full assembly fat jar.
>>
>>
>> Example of the error from calling joinWith as left_outer and then
>> .as[(Option[T], U]) where T and U are Int and Int.
>>
>> [info] newinstance(class scala.Tuple2,decodeusingserializer(input[0,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class
>> scala.Tuple2),None)
>> [info] :- decodeusingserializer(input[0,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true)
>> [info] :  +- input[0, StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))]
>> [info] +- decodeusingserializer(input[1,
>> StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))],scala.Option,true)
>> [info]+- input[1, StructType(StructField(_1,IntegerType,true),
>> StructField(_2,IntegerType,true))]
>>
>> Cause: java.util.concurrent.ExecutionException: java.lang.Exception:
>> failed to compile: org.codehaus.commons.compiler.CompileException: File
>> 'generated.java', Line 32, Column 60: No applicable constructor/method
>> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
>> candidates are: "public static java.nio.ByteBuffer
>> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
>> java.nio.ByteBuffer.wrap(byte[], int, int)"
>>
>> The generated code is passing InternalRow objects into the ByteBuffer
>>
>> Starting from two Datasets of types Dataset[(Int, Int)] with expression
>> $"left._1" === $"right._1". I'll have to spend some time getting a better
>> understanding of this analysis phase, but hopefully I can come up with
>> something.
>>
>> On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust 
>> wrote:
>>
>>> Option should play nicely with encoders, but it's always possible there
>>> are bugs.  I think those function signatures are slightly more expensive
>>> (one extra object allocation) and it's not as Java-friendly so we probably
>>> don't want them to be the default.
>>>
>>> That said, I would like to enable that kind of sugar while still taking
>>> advantage of all the optimizations going on under the covers.  Can you get
>>> it to work if you use `as[...]` instead of `map`?
>>>
>>> On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <
>>> rmarsc...@localytics.com> wrote:
>>>
 Ah thanks, I missed seeing the PR for
 https://issues.apache.org/jira/browse/SPARK-15441. If the rows became
 null objects then I can implement methods that will map those back to
 results that align closer to the RDD interface.

 As a follow on, I'm curious about thoughts regarding enriching the
 Dataset join interface versus a package or users sugaring for themselves. I
 haven't considered the implications of what the optimizations datasets,
 tungsten, and/or bytecode gen can do now regarding joins so I may be
 missing a critical benefit there around say avoiding Options in favor of
 nulls. If nothing else, I guess Option doesn't have a first class Encoder
 or DataType yet and maybe for good reasons.

 I did find the RDD join interface elegant, though. In the ideal world
 an API comparable the following would be nice:
 https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06


 On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

> Thanks for the feedback.  I think this will address at least some of
> the problems you are describing:
> https://github.com/apache/spark/pull/13425
>
> On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <
> rmarsc...@localytics.com> wrote:
>
>> Hi,
>>
>> I've been working on transitioning from RDD to Datasets in our
>> codebase in anticipation of being able to leverage features of 2.0.
>>
>> I'm having a lot of difficulties with the impedance mismatches
>> between how outer joins worked with RDD versus Dataset. The Dataset joins
>> feel like a big 

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
It would also be nice if there was a better example of joining two
Datasets. I am looking at the documentation here:
http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a
little bit sparse - is there a better documentation source?

Regards,

Bryan Jeffrey

On Tue, Jun 7, 2016 at 2:32 PM, Bryan Jeffrey 
wrote:

> Hello.
>
> I am looking at the option of moving RDD based operations to Dataset based
> operations.  We are calling 'reduceByKey' on some pair RDDs we have.  What
> would the equivalent be in the Dataset interface - I do not see a simple
> reduceByKey replacement.
>
> Regards,
>
> Bryan Jeffrey
>
>


Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Deepak, thanks for the info. I was thinking of reading both the source and
destination tables into separate RDDs/DataFrames, then applying some specific
transformations to find the updated info, removing the updated keyed rows
from the destination and appending the updated info to the destination. Any
pointers on this kind of usage?
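
Roughly, a sketch of that read-compare-replace idea with DataFrames (the
table and column names are made up, and this is an illustration rather than a
recommendation):

import org.apache.spark.sql.functions.col

val dest    = sqlContext.table("warehouse.orders")
val updates = sqlContext.table("staging.orders_increment")  // e.g. from an incremental select

// Destination rows whose key is not being updated ("leftanti" needs Spark 2.0;
// on 1.6 use a left outer join plus a filter on a null right-side key instead).
val untouched = dest
  .join(updates.select("id"), Seq("id"), "leftanti")
  .select(updates.columns.map(col): _*)        // align column order for the union

val merged = untouched.unionAll(updates)
merged.write.mode("overwrite").saveAsTable("warehouse.orders_merged")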

It would be great if you could provide an example of what you mentioned
below. Thanks much.

Regards,
Aj

On Tuesday, June 7, 2016, Deepak Sharma  wrote:

> I am not sure if Spark provides any support for incremental extracts
> inherently.
> But you can maintain a file e.g. extractRange.conf in hdfs , to read from
> it the end range and update it with new end range from  spark job before it
> finishes with the new relevant ranges to be used next time.
>
> On Tue, Jun 7, 2016 at 8:49 PM, Ajay Chander  > wrote:
>
>> Hi Mich, thanks for your inputs. I used sqoop to get the data from MySQL.
>> Now I am using spark to do the same. Right now, I am trying
>> to implement incremental updates while loading from MySQL through spark.
>> Can you suggest any best practices for this ? Thank you.
>>
>>
>> On Tuesday, June 7, 2016, Mich Talebzadeh > > wrote:
>>
>>> I use Spark rather that Sqoop to import data from an Oracle table into a
>>> Hive ORC table.
>>>
>>> It used JDBC for this purpose. All inclusive in Scala itself.
>>>
>>> Also Hive runs on Spark engine. Order of magnitude faster with Inde on
>>> map-reduce/.
>>>
>>> pretty simple.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 7 June 2016 at 15:38, Ted Yu  wrote:
>>>
 bq. load the data from edge node to hdfs

 Does the loading involve accessing sqlserver ?

 Please take a look at
 https://spark.apache.org/docs/latest/sql-programming-guide.html

 On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni 
 wrote:

> Hi
> how about
>
> 1.  have a process that read the data from your sqlserver and dumps it
> as a file into a directory on your hd
> 2. use spark-streanming to read data from that directory  and store it
> into hdfs
>
> perhaps there is some sort of spark 'connectors' that allows you to
> read data from a db directly so you dont need to go via spk streaming?
>
>
> hth
>
>
>
>
>
>
>
>
>
>
> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander 
> wrote:
>
>> Hi Spark users,
>>
>> Right now we are using spark for everything(loading the data from
>> sqlserver, apply transformations, save it as permanent tables in
>> hive) in our environment. Everything is being done in one spark 
>> application.
>>
>> The only thing we do before we launch our spark application through
>> oozie is, to load the data from edge node to hdfs(it is being triggered
>> through a ssh action from oozie to run shell script on edge node).
>>
>> My question is, is there any way we can accomplish the edge-to-hdfs copy
>> through spark ? So that everything is done in one spark DAG and lineage
>> graph?
>>
>> Any pointers are highly appreciated. Thanks
>>
>> Regards,
>> Aj
>>
>
>

>>>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>


Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
Hello.

I am looking at the option of moving RDD based operations to Dataset based
operations.  We are calling 'reduceByKey' on some pair RDDs we have.  What
would the equivalent be in the Dataset interface - I do not see a simple
reduceByKey replacement.

Regards,

Bryan Jeffrey


Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
Ouch, I made a mistake :( Sorry.

You've asked about spark **history** server. It's pretty much the same.

HistoryServer is a web interface for completed and running (aka
incomplete) Spark applications. It uses EventLoggingListener to
collect events as JSON using org.apache.spark.util.JsonProtocol
object.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 8:18 PM, Jacek Laskowski  wrote:
> Hi,
>
> It is the driver - see the port. Is this 4040 or similar? It's started
> when SparkContext starts and is controlled by spark.ui.enabled.
>
> spark.ui.enabled (default: true) = controls whether the web UI is
> started or not.
>
> It's through JobProgressListener which is the SparkListener for web UI
> that the console knows what happens under the covers (and can
> calculate the stats).
>
> BTW, spark.ui.port (default: 4040) controls the port Web UI binds to.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jun 7, 2016 at 8:11 PM, satish saley  wrote:
>> Hi,
>> In spark history server, we see environment tab. Is it show environment of
>> Driver or Executor or both?
>>
>> Jobs
>> Stages
>> Storage
>> Environment
>> Executors
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
Hi,

It is the driver - see the port. Is this 4040 or similar? It's started
when SparkContext starts and is controlled by spark.ui.enabled.

spark.ui.enabled (default: true) = controls whether the web UI is
started or not.

It's through JobProgressListener which is the SparkListener for web UI
that the console knows what happens under the covers (and can
calculate the stats).

BTW, spark.ui.port (default: 4040) controls the port Web UI binds to.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 8:11 PM, satish saley  wrote:
> Hi,
> In spark history server, we see environment tab. Is it show environment of
> Driver or Executor or both?
>
> Jobs
> Stages
> Storage
> Environment
> Executors
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark dynamic allocation - efficiently request new resource

2016-06-07 Thread Nirav Patel
Hi,

Does current or future (2.0) Spark dynamic allocation have the capability to
request a container with varying resource requirements based on various
factors? Factors I can think of are the stage and the data it is processing:
depending on these, it could ask for more CPUs or more memory, i.e. a new
executor could have a different number of CPU cores or a different amount of
memory available for all of its tasks.
That way Spark could handle data skew with a heavier executor, by assigning
more memory or CPUs to new executors.

Thanks
Nirav




Environment tab meaning

2016-06-07 Thread satish saley
Hi,
In the Spark history server, we see the Environment tab. Does it show the
environment of the Driver, the Executors, or both?


   - Jobs
   - Stages
   - Storage
   - Environment
   - Executors


SparkR interaction with R libraries (currently 1.5.2)

2016-06-07 Thread rachmaninovquartet
Hi,
I'm trying to figure out how to work with R libraries in spark, properly.
I've googled and done some trial and error. The main error, I've been
running into is "cannot coerce class "structure("DataFrame", package =
"SparkR")" to a data.frame". I'm wondering if there is a way to use the R
dataframe functionality on worker nodes or if there is a way to "hack" the R
function in order to make it accept Spark dataframes. Here is an example of
what I'm trying to do, with a_df being a spark dataframe:

***DISTRIBUTED***
#0 filter out nulls
a_df <- filter(a_df, isNotNull(a_df$Ozone))

#1 make closure
treeParty <- function(x) {
# Use sparseMatrix function from the Matrix package
air.ct <- ctree(Ozone ~ ., data = a_df)
}

#2  put package in context
SparkR:::includePackage(sc, partykit)

#3 apply to all partitions
partied <- SparkR:::lapplyPartition(a_df, treeParty)


**LOCAL***
Here is R code that works with a local dataframe, local_df:
local_df <- subset(airquality, !is.na(Ozone))
air.ct <- ctree(Ozone ~ ., data = local_df)

Any advice would be greatly appreciated!

Thanks,

Ian




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-interaction-with-R-libraries-currently-1-5-2-tp27107.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL: org.apache.spark.sql.AnalysisException: cannot resolve "some columns" given input columns.

2016-06-07 Thread Ted Yu
Please see:

[SPARK-13953][SQL] Specifying the field name for corrupted record via
option at JSON datasource

FYI
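
For the immediate problem, a hedged sketch of one workaround (the path is
made up): supply an explicit schema when reading the JSON, so files that lack
middleName still expose the column as null:

import org.apache.spark.sql.types._

// One schema covering all expected fields; files missing middleName yield null.
val schema = StructType(Seq(
  StructField("firstName",  StringType, nullable = true),
  StructField("middleName", StringType, nullable = true),
  StructField("lastName",   StringType, nullable = true)))

val people = sqlContext.read.schema(schema).json("people/*.json")
people.registerTempTable("jsonTable")

sqlContext.sql(
  "select firstName as first_name, middleName as middle_name, " +
  "lastName as last_name from jsonTable").show()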

On Tue, Jun 7, 2016 at 10:18 AM, Jerry Wong 
wrote:

> Hi,
>
> Two JSON files but one of them miss some columns, like
>
> {"firstName": "Jack", "lastName": "Nelson"}
> {"firstName": "Landy", "middleName": "Ken", "lastName": "Yong"}
>
> sqlContext.sql("select firstName as first_name, middleName as middle_name,
> lastName as last_name from jsonTable")
>
> But there is an error:
> org.apache.spark.sql.AnalysisException: cannot resolve 'middleName' given
> input columns firstName, lastName;
>
> Can anybody give me your wisdom or any suggestions?
>
> Thanks!
> Jerry
>
>


Spark SQL: org.apache.spark.sql.AnalysisException: cannot resolve "some columns" given input columns.

2016-06-07 Thread Jerry Wong
Hi,

Two JSON files, but one of them misses some columns, like:

{"firstName": "Jack", "lastName": "Nelson"}
{"firstName": "Landy", "middleName": "Ken", "lastName": "Yong"}

sqlContext.sql("select firstName as first_name, middleName as middle_name,
lastName as last_name from jsonTable")

But there is an error:
org.apache.spark.sql.AnalysisException: cannot resolve 'middleName' given
input columns firstName, lastName;

Can anybody give me your wisdom or any suggestions?

Thanks!
Jerry


Re: Spark 2.0 Release Date

2016-06-07 Thread Sean Owen
For this moment you can look at
https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/

On Tue, Jun 7, 2016 at 6:14 PM, Arun Patel  wrote:
> Thanks Sean and Jacek.
>
> Do we have any updated documentation for 2.0 somewhere?
>
> On Tue, Jun 7, 2016 at 9:34 AM, Jacek Laskowski  wrote:
>>
>> On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen  wrote:
>> > That's not any kind of authoritative statement, just my opinion and
>> > guess.
>>
>> Oh, come on. You're not **a** Sean but **the** Sean (= a PMC member
>> and the JIRA/PRs keeper) so what you say **is** kinda official. Sorry.
>> But don't worry the PMC (the group) can decide whatever it wants to
>> decide so any date is a good date.
>>
>> > Reynold mentioned the idea of releasing monthly milestone releases for
>> > the latest branch. That's an interesting idea for the future.
>>
>> +1
>>
>> > I know there was concern that publishing a preview release, which is
>> > like an alpha,
>> > might leave alpha-quality code out there too long as the latest
>> > release. Hence, probably support for publishing some kind of preview 2
>> > or beta or whatever
>>
>> The issue seems to have been sorted out when Matei and Reynold agreed
>> to push the preview out (which is a good thing!), and I'm sure
>> there'll be little to no concern to do it again and again. 2.0 is
>> certainly taking far too long (as if there were some magic in 2.0).
>>
>> p.s. It's so frustrating to tell people about the latest and greatest
>> of 2.0 and then switch to 1.6.1 or even older in projects :( S
>> frustrating...
>>
>> Jacek
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Thanks Sean and Jacek.

Do we have any updated documentation for 2.0 somewhere?

On Tue, Jun 7, 2016 at 9:34 AM, Jacek Laskowski  wrote:

> On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen  wrote:
> > That's not any kind of authoritative statement, just my opinion and
> guess.
>
> Oh, come on. You're not **a** Sean but **the** Sean (= a PMC member
> and the JIRA/PRs keeper) so what you say **is** kinda official. Sorry.
> But don't worry the PMC (the group) can decide whatever it wants to
> decide so any date is a good date.
>
> > Reynold mentioned the idea of releasing monthly milestone releases for
> > the latest branch. That's an interesting idea for the future.
>
> +1
>
> > I know there was concern that publishing a preview release, which is
> like an alpha,
> > might leave alpha-quality code out there too long as the latest
> > release. Hence, probably support for publishing some kind of preview 2
> > or beta or whatever
>
> The issue seems to have been sorted out when Matei and Reynold agreed
> to push the preview out (which is a good thing!), and I'm sure
> there'll be little to no concern to do it again and again. 2.0 is
> certainly taking far too long (as if there were some magic in 2.0).
>
> p.s. It's so frustrating to tell people about the latest and greatest
> of 2.0 and then switch to 1.6.1 or even older in projects :( S
> frustrating...
>
> Jacek
>


Integrating spark source in an eclipse project?

2016-06-07 Thread Cesar Flores
I created a Spark application in Eclipse by including the
spark-assembly-1.6.0-hadoop2.6.0.jar file in the path.

However, this method does not let me see the Spark source code. Is there an easy
way to include the Spark source code for reference in an application developed in
Eclipse?


Thanks !
-- 
Cesar Flores
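
A sketch of an alternative setup: declare Spark as a normal sbt dependency
instead of the pre-built assembly jar, so the build tool can also fetch the
"-sources" jars for the IDE. The versions below are only an example.

// build.sbt (sketch)
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "1.6.0" % "provided"

With the sbteclipse plugin (if it is used), running sbt "eclipse with-source=true"
should generate an Eclipse project with the Spark sources attached.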


Re: Integrating spark source in an eclipse project?

2016-06-07 Thread Ted Yu
Please see:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup

Use the proper branch.

FYI

On Tue, Jun 7, 2016 at 9:04 AM, Cesar Flores  wrote:

>
> I created a spark application in Eclipse by including the
> spark-assembly-1.6.0-hadoop2.6.0.jar file in the path.
>
> However, this method does not allow me see spark code. Is there an easy
> way to include spark source code for reference in an application developed
> in Eclipse?
>
>
> Thanks !
> --
> Cesar Flores
>


Re: RESOLVED - Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
OK, so this was a Kafka issue?

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 June 2016 at 16:55, Dominik Safaric  wrote:

> Dear all,
>
> I managed to resolve the issue. Since I kept getting the exception
> "org.apache.spark.SparkException: java.nio.channels.ClosedChannelException”,
>
> a reasonable direction was checking the advertised.host.name key which as
> I’ve read from the docs basically sets for the broker the host.name it
> should advertise to the consumers and producers.
>
> By setting this property, I instantly started receiving Kafka log messages.
>
> Nevertheless, thank you all for your help, I appreciate it!
>
> On 07 Jun 2016, at 17:44, Dominik Safaric 
> wrote:
>
> Dear Todd,
>
> By running bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic
>  --broker-list localhost:9092 --time -1
>
> I get the following current offset for  :0:1760
>
> But I guess this does not provide as much information.
>
> To answer your other question, onto how exactly do I track the offset -
> implicitly via Spark Streaming, i.e. using the default checkpoints.
>
> On 07 Jun 2016, at 15:46, Todd Nist  wrote:
>
> Hi Dominik,
>
> Right, and spark 1.6.x uses Kafka v0.8.2.x as I recall.  However, it
> appears as though the v.0.8 consumer is compatible with the Kafka v0.9.x
> broker, but not the other way around; sorry for the confusion there.
>
> With the direct stream, simple consumer, offsets are tracked by Spark
> Streaming within its checkpoints by default.  You can also manage them
> yourself if desired.  How are you dealing with offsets ?
>
> Can you verify the offsets on the broker:
>
> kafka-run-class.sh kafka.tools.GetOffsetShell --topic 
> --broker-list  --time -1
>
> -Todd
>
> On Tue, Jun 7, 2016 at 8:17 AM, Dominik Safaric 
> wrote:
>
>> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0"
>> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
>> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
>> "1.6.1"
>>
>> Please take a look at the SBT copy.
>>
>> I would rather think that the problem is related to the Zookeeper/Kafka
>> consumers.
>>
>> [2016-06-07 11:24:52,484] WARN Either no config or no quorum defined in
>> config, running  in standalone mode
>> (org.apache.zookeeper.server.quorum.QuorumPeerMain)
>>
>> Any indication onto why the channel connection might be closed? Would it
>> be Kafka or Zookeeper related?
>>
>> On 07 Jun 2016, at 14:07, Todd Nist  wrote:
>>
>> What version of Spark are you using?  I do not believe that 1.6.x is
>> compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2
>> and 0.9.0.x.  See this for more information:
>>
>> https://issues.apache.org/jira/browse/SPARK-12177
>>
>> -Todd
>>
>> On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric > > wrote:
>>
>>> Hi,
>>>
>>> Correct, I am using the 0.9.0.1 version.
>>>
>>> As already described, the topic contains messages. Those messages are
>>> produced using the Confluence REST API.
>>>
>>> However, what I’ve observed is that the problem is not in the Spark
>>> configuration, but rather Zookeeper or Kafka related.
>>>
>>> Take a look at the exception’s stack top item:
>>>
>>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>> Set([,0])
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at scala.util.Either.fold(Either.scala:97)
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>>> at
>>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>>> at
>>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>>> at org.mediasoft.spark.Driver$.main(Driver.scala:22)
>>> at .(:11)
>>> at .()
>>> at .(:7)
>>>
>>> By listing all active connections using netstat, I’ve also observed that
>>> both Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka
>>> 9092.
>>>
>>> Furthermore, I am also able to retrieve all log messages using the
>>> console consumer.
>>>
>>> Any clue what might be going wrong?
>>>
>>> On 07 Jun 2016, at 13:13, Jacek Laskowski  wrote:
>>>
>>> Hi,
>>>
>>> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you?
>>> What's the topic name?
>>>
>>> Jacek
>>> On 7 Jun 2016 11:06 a.m., "Dominik Safaric" 
>>> wrote:
>>>
 As I am trying to integrate Kafka into Spark, 

RESOLVED - Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Dear all,

I managed to resolve the issue. Since I kept getting the exception
"org.apache.spark.SparkException: java.nio.channels.ClosedChannelException",
a reasonable direction was to check the advertised.host.name key, which, as I've
read in the docs, sets the host name the broker should advertise to consumers
and producers.

By setting this property, I instantly started receiving Kafka log messages.

Nevertheless, thank you all for your help, I appreciate it! 

> On 07 Jun 2016, at 17:44, Dominik Safaric  wrote:
> 
> Dear Todd,
> 
> By running bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic 
>  --broker-list localhost:9092 --time -1
> 
> I get the following current offset for  :0:1760
> 
> But I guess this does not provide as much information. 
> 
> To answer your other question, onto how exactly do I track the offset - 
> implicitly via Spark Streaming, i.e. using the default checkpoints. 
> 
>> On 07 Jun 2016, at 15:46, Todd Nist > > wrote:
>> 
>> Hi Dominik,
>> 
>> Right, and spark 1.6.x uses Kafka v0.8.2.x as I recall.  However, it appears 
>> as though the v.0.8 consumer is compatible with the Kafka v0.9.x broker, but 
>> not the other way around; sorry for the confusion there.
>> 
>> With the direct stream, simple consumer, offsets are tracked by Spark 
>> Streaming within its checkpoints by default.  You can also manage them 
>> yourself if desired.  How are you dealing with offsets ?
>> 
>> Can you verify the offsets on the broker:
>> 
>> kafka-run-class.sh kafka.tools.GetOffsetShell --topic  --broker-list 
>>  --time -1
>> 
>> -Todd
>> 
>> On Tue, Jun 7, 2016 at 8:17 AM, Dominik Safaric > > wrote:
>> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0"
>> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
>> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
>> "1.6.1"
>> Please take a look at the SBT copy. 
>> 
>> I would rather think that the problem is related to the Zookeeper/Kafka 
>> consumers. 
>> 
>> [2016-06-07 11:24:52,484] WARN Either no config or no quorum defined in 
>> config, running  in standalone mode 
>> (org.apache.zookeeper.server.quorum.QuorumPeerMain)
>> 
>> Any indication onto why the channel connection might be closed? Would it be 
>> Kafka or Zookeeper related? 
>> 
>>> On 07 Jun 2016, at 14:07, Todd Nist >> > wrote:
>>> 
>>> What version of Spark are you using?  I do not believe that 1.6.x is 
>>> compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2 
>>> and 0.9.0.x.  See this for more information:
>>> 
>>> https://issues.apache.org/jira/browse/SPARK-12177 
>>> 
>>> 
>>> -Todd
>>> 
>>> On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric >> > wrote:
>>> Hi,
>>> 
>>> Correct, I am using the 0.9.0.1 version. 
>>> 
>>> As already described, the topic contains messages. Those messages are 
>>> produced using the Confluence REST API.
>>> 
>>> However, what I’ve observed is that the problem is not in the Spark 
>>> configuration, but rather Zookeeper or Kafka related. 
>>> 
>>> Take a look at the exception’s stack top item:
>>> 
>>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>>> org.apache.spark.SparkException: Couldn't find leader offsets for 
>>> Set([,0])
>>> at 
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at 
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at scala.util.Either.fold(Either.scala:97)
>>> at 
>>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>>> at 
>>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>>> at 
>>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>>> at org.mediasoft.spark.Driver$.main(Driver.scala:22)
>>> at .(:11)
>>> at .()
>>> at .(:7)
>>> 
>>> By listing all active connections using netstat, I’ve also observed that 
>>> both Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka 
>>> 9092. 
>>> 
>>> Furthermore, I am also able to retrieve all log messages using the console 
>>> consumer.
>>> 
>>> Any clue what might be going wrong?
>>> 
 On 07 Jun 2016, at 13:13, Jacek Laskowski > wrote:
 
 Hi,
 
 What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's 
 the topic name?
 
 Jacek
 
 On 7 Jun 2016 11:06 a.m., "Dominik Safaric" > wrote:
 As I am 

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Dear Todd,

By running bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic 
 --broker-list localhost:9092 --time -1

I get the following current offset for  :0:1760

But I guess this does not provide much information.

To answer your other question as to how exactly I track the offsets:
implicitly via Spark Streaming, i.e. using the default checkpoints.

> On 07 Jun 2016, at 15:46, Todd Nist  wrote:
> 
> Hi Dominik,
> 
> Right, and spark 1.6.x uses Kafka v0.8.2.x as I recall.  However, it appears 
> as though the v.0.8 consumer is compatible with the Kafka v0.9.x broker, but 
> not the other way around; sorry for the confusion there.
> 
> With the direct stream, simple consumer, offsets are tracked by Spark 
> Streaming within its checkpoints by default.  You can also manage them 
> yourself if desired.  How are you dealing with offsets ?
> 
> Can you verify the offsets on the broker:
> 
> kafka-run-class.sh kafka.tools.GetOffsetShell --topic  --broker-list 
>  --time -1
> 
> -Todd
> 
> On Tue, Jun 7, 2016 at 8:17 AM, Dominik Safaric  > wrote:
> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0"
> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
> "1.6.1"
> Please take a look at the SBT copy. 
> 
> I would rather think that the problem is related to the Zookeeper/Kafka 
> consumers. 
> 
> [2016-06-07 11:24:52,484] WARN Either no config or no quorum defined in 
> config, running  in standalone mode 
> (org.apache.zookeeper.server.quorum.QuorumPeerMain)
> 
> Any indication onto why the channel connection might be closed? Would it be 
> Kafka or Zookeeper related? 
> 
>> On 07 Jun 2016, at 14:07, Todd Nist > > wrote:
>> 
>> What version of Spark are you using?  I do not believe that 1.6.x is 
>> compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2 
>> and 0.9.0.x.  See this for more information:
>> 
>> https://issues.apache.org/jira/browse/SPARK-12177 
>> 
>> 
>> -Todd
>> 
>> On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric > > wrote:
>> Hi,
>> 
>> Correct, I am using the 0.9.0.1 version. 
>> 
>> As already described, the topic contains messages. Those messages are 
>> produced using the Confluence REST API.
>> 
>> However, what I’ve observed is that the problem is not in the Spark 
>> configuration, but rather Zookeeper or Kafka related. 
>> 
>> Take a look at the exception’s stack top item:
>> 
>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for 
>> Set([,0])
>>  at 
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>  at 
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>  at scala.util.Either.fold(Either.scala:97)
>>  at 
>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>>  at 
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>>  at 
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>>  at org.mediasoft.spark.Driver$.main(Driver.scala:22)
>>  at .(:11)
>>  at .()
>>  at .(:7)
>> 
>> By listing all active connections using netstat, I’ve also observed that 
>> both Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka 
>> 9092. 
>> 
>> Furthermore, I am also able to retrieve all log messages using the console 
>> consumer.
>> 
>> Any clue what might be going wrong?
>> 
>>> On 07 Jun 2016, at 13:13, Jacek Laskowski >> > wrote:
>>> 
>>> Hi,
>>> 
>>> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's 
>>> the topic name?
>>> 
>>> Jacek
>>> 
>>> On 7 Jun 2016 11:06 a.m., "Dominik Safaric" >> > wrote:
>>> As I am trying to integrate Kafka into Spark, the following exception 
>>> occurs:
>>> 
>>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>> Set([**,0])
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at scala.util.Either.fold(Either.scala:97)
>>> at
>>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>>> at
>>> 

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
Thanks. This is getting a bit confusing.

I have these modes for using Spark:


   1. Spark local. All on the same host --> --master local[n]. No need to
   start master and slaves; it uses resources as you submit the job.
   2. Spark Standalone. Uses a simple cluster manager included with Spark
   that makes it easy to set up a cluster --> --master
   spark://:7077. Can run on different hosts. Does not rely on
   YARN; it looks after scheduling itself. Need to start the master and slaves.


The doc says: There are two deploy modes that can be used to launch Spark
applications on YARN.

In cluster mode, the Spark driver runs inside an application master
process which is managed by YARN on the cluster, and the client can go away
after initiating the application.

In client mode, the driver runs in the client process, and the
application master is only used for requesting resources from YARN.

Unlike Spark standalone and Mesos modes, in which the master's address is
specified in the --master parameter, in YARN mode the ResourceManager's
address is picked up from the Hadoop configuration. Thus, the --master
parameter is yarn.

So either we have --> --master yarn --deploy-mode cluster

OR --> --master yarn-client

So I am not sure that running Spark on YARN in either yarn-client or
yarn-cluster mode is going to make much difference. It sounds like
yarn-cluster supersedes yarn-client?


Any comments welcome




Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 June 2016 at 15:40, Sebastian Piu  wrote:

> If you run that job then the driver will ALWAYS run in the machine from
> where you are issuing the spark submit command (E.g. some edge node with
> the clients installed). No matter where the resource manager is running.
>
> If you change yarn-client for yarn-cluster then your driver will start
> somewhere else in the cluster as will the workers and the spark submit
> command will return before the program finishes
>
> On Tue, 7 Jun 2016, 14:53 Jacek Laskowski,  wrote:
>
>> Hi,
>>
>> --master yarn-client is deprecated and you should use --master yarn
>> --deploy-mode client instead. There are two deploy-modes: client
>> (default) and cluster. See
>> http://spark.apache.org/docs/latest/cluster-overview.html.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Tue, Jun 7, 2016 at 2:50 PM, Mich Talebzadeh
>>  wrote:
>> > ok thanks
>> >
>> > so I start SparkSubmit or similar Spark app on the Yarn resource manager
>> > node.
>> >
>> > What you are stating is that Yan may decide to start the driver program
>> in
>> > another node as opposed to the resource manager node
>> >
>> > ${SPARK_HOME}/bin/spark-submit \
>> > --driver-memory=4G \
>> > --num-executors=5 \
>> > --executor-memory=4G \
>> > --master yarn-client \
>> > --executor-cores=4 \
>> >
>> > Due to lack of resources in the resource manager node? What is the
>> > likelihood of that. The resource manager node is the defector master
>> node in
>> > all probability much more powerful than other nodes. Also the node that
>> > running resource manager is also running one of the node manager as
>> well. So
>> > in theory may be in practice may not?
>> >
>> > HTH
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> >
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> >
>> >
>> > On 7 June 2016 at 13:20, Sebastian Piu  wrote:
>> >>
>> >> What you are explaining is right for yarn-client mode, but the
>> question is
>> >> about yarn-cluster in which case the spark driver is also submitted
>> and run
>> >> in one of the node managers
>> >>
>> >>
>> >> On Tue, 7 Jun 2016, 13:45 Mich Talebzadeh, 
>> >> wrote:
>> >>>
>> >>> can you elaborate on the above statement please.
>> >>>
>> >>> When you start yarn you start the resource manager daemon only on the
>> >>> resource manager node
>> >>>
>> >>> yarn-daemon.sh start resourcemanager
>> >>>
>> >>> Then you start nodemanager deamons on all nodes
>> >>>
>> >>> yarn-daemon.sh start nodemanager
>> >>>
>> >>> A spark app has to start somewhere. That is SparkSubmit. and that is
>> >>> deterministic. I start SparkSubmit that talks to Yarn Resource
>> Manager that
>> >>> initialises and registers an Application master. The crucial 

Re: Spark_Usecase

2016-06-07 Thread Deepak Sharma
I am not sure if Spark provides any inherent support for incremental
extracts. But you can maintain a file, e.g. extractRange.conf, in HDFS: read
the last end range from it, and have the Spark job update it with the new end
range before it finishes, so the next run picks up the relevant ranges.
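
A rough sketch of that pattern on Spark 1.6, assuming a HiveContext, a MySQL
table with a monotonically increasing id column, and made-up paths and names
(sc is an existing SparkContext):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.max

val hiveContext = new HiveContext(sc)

// read the last high-water mark stored by the previous run
val lastId = sc.textFile("hdfs:///etl/extractRange.conf").first().toLong

// pull only the new rows via a JDBC subquery
val delta = hiveContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:mysql://dbhost:3306/sales",
  "driver"  -> "com.mysql.jdbc.Driver",
  "dbtable" -> s"(select * from orders where id > $lastId) t"
)).load()

delta.write.mode("append").saveAsTable("staging.orders")

// persist the new high-water mark for the next run (a real job would write
// to a temp path and then rename it over the old file)
val newId = delta.agg(max("id")).first().get(0).toString
sc.parallelize(Seq(newId), 1).saveAsTextFile("hdfs:///etl/extractRange.conf.next")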

On Tue, Jun 7, 2016 at 8:49 PM, Ajay Chander  wrote:

> Hi Mich, thanks for your inputs. I used sqoop to get the data from MySQL.
> Now I am using spark to do the same. Right now, I am trying
> to implement incremental updates while loading from MySQL through spark.
> Can you suggest any best practices for this ? Thank you.
>
>
> On Tuesday, June 7, 2016, Mich Talebzadeh 
> wrote:
>
>> I use Spark rather that Sqoop to import data from an Oracle table into a
>> Hive ORC table.
>>
>> It used JDBC for this purpose. All inclusive in Scala itself.
>>
>> Also Hive runs on Spark engine. Order of magnitude faster with Inde on
>> map-reduce/.
>>
>> pretty simple.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 June 2016 at 15:38, Ted Yu  wrote:
>>
>>> bq. load the data from edge node to hdfs
>>>
>>> Does the loading involve accessing sqlserver ?
>>>
>>> Please take a look at
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>>
>>> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni 
>>> wrote:
>>>
 Hi
 how about

 1.  have a process that read the data from your sqlserver and dumps it
 as a file into a directory on your hd
 2. use spark-streanming to read data from that directory  and store it
 into hdfs

 perhaps there is some sort of spark 'connectors' that allows you to
 read data from a db directly so you dont need to go via spk streaming?


 hth










 On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander 
 wrote:

> Hi Spark users,
>
> Right now we are using spark for everything(loading the data from
> sqlserver, apply transformations, save it as permanent tables in
> hive) in our environment. Everything is being done in one spark 
> application.
>
> The only thing we do before we launch our spark application through
> oozie is, to load the data from edge node to hdfs(it is being triggered
> through a ssh action from oozie to run shell script on edge node).
>
> My question is,  there's any way we can accomplish edge-to-hdfs copy
> through spark ? So that everything is done in one spark DAG and lineage
> graph?
>
> Any pointers are highly appreciated. Thanks
>
> Regards,
> Aj
>


>>>
>>


-- 
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Mich, thanks for your inputs. I used Sqoop to get the data from MySQL.
Now I am using Spark to do the same. Right now, I am trying
to implement incremental updates while loading from MySQL through Spark.
Can you suggest any best practices for this? Thank you.


On Tuesday, June 7, 2016, Mich Talebzadeh  wrote:

> I use Spark rather that Sqoop to import data from an Oracle table into a
> Hive ORC table.
>
> It used JDBC for this purpose. All inclusive in Scala itself.
>
> Also Hive runs on Spark engine. Order of magnitude faster with Inde on
> map-reduce/.
>
> pretty simple.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 15:38, Ted Yu  > wrote:
>
>> bq. load the data from edge node to hdfs
>>
>> Does the loading involve accessing sqlserver ?
>>
>> Please take a look at
>> https://spark.apache.org/docs/latest/sql-programming-guide.html
>>
>> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni > > wrote:
>>
>>> Hi
>>> how about
>>>
>>> 1.  have a process that read the data from your sqlserver and dumps it
>>> as a file into a directory on your hd
>>> 2. use spark-streanming to read data from that directory  and store it
>>> into hdfs
>>>
>>> perhaps there is some sort of spark 'connectors' that allows you to read
>>> data from a db directly so you dont need to go via spk streaming?
>>>
>>>
>>> hth
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander >> > wrote:
>>>
 Hi Spark users,

 Right now we are using spark for everything(loading the data from
 sqlserver, apply transformations, save it as permanent tables in
 hive) in our environment. Everything is being done in one spark 
 application.

 The only thing we do before we launch our spark application through
 oozie is, to load the data from edge node to hdfs(it is being triggered
 through a ssh action from oozie to run shell script on edge node).

 My question is,  there's any way we can accomplish edge-to-hdfs copy
 through spark ? So that everything is done in one spark DAG and lineage
 graph?

 Any pointers are highly appreciated. Thanks

 Regards,
 Aj

>>>
>>>
>>
>


Re: Spark_Usecase

2016-06-07 Thread Mich Talebzadeh
I use Spark rather than Sqoop to import data from an Oracle table into a
Hive ORC table.

It uses JDBC for this purpose, all from within Scala itself.

Also, Hive runs on the Spark engine, an order of magnitude faster than on
map-reduce.

Pretty simple.

HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com
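
A rough sketch of that JDBC-to-ORC pattern; the Oracle connection details,
table and Hive database names are placeholders, sc is an existing SparkContext,
and Hive support is assumed:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// read the source table over JDBC
val src = hiveContext.read.format("jdbc").options(Map(
  "url"      -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
  "driver"   -> "oracle.jdbc.OracleDriver",
  "dbtable"  -> "scott.sales",
  "user"     -> "scott",
  "password" -> "changeme"
)).load()

// write it out as a Hive ORC table
src.write.format("orc").mode("overwrite").saveAsTable("test.sales_orc")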



On 7 June 2016 at 15:38, Ted Yu  wrote:

> bq. load the data from edge node to hdfs
>
> Does the loading involve accessing sqlserver ?
>
> Please take a look at
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni 
> wrote:
>
>> Hi
>> how about
>>
>> 1.  have a process that read the data from your sqlserver and dumps it as
>> a file into a directory on your hd
>> 2. use spark-streanming to read data from that directory  and store it
>> into hdfs
>>
>> perhaps there is some sort of spark 'connectors' that allows you to read
>> data from a db directly so you dont need to go via spk streaming?
>>
>>
>> hth
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander  wrote:
>>
>>> Hi Spark users,
>>>
>>> Right now we are using spark for everything(loading the data from
>>> sqlserver, apply transformations, save it as permanent tables in
>>> hive) in our environment. Everything is being done in one spark application.
>>>
>>> The only thing we do before we launch our spark application through
>>> oozie is, to load the data from edge node to hdfs(it is being triggered
>>> through a ssh action from oozie to run shell script on edge node).
>>>
>>> My question is,  there's any way we can accomplish edge-to-hdfs copy
>>> through spark ? So that everything is done in one spark DAG and lineage
>>> graph?
>>>
>>> Any pointers are highly appreciated. Thanks
>>>
>>> Regards,
>>> Aj
>>>
>>
>>
>


Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Marco, Ted, thanks for your time. I am sorry if I wasn't clear enough. We
have two sources,

1) sql server
2) files are pushed onto edge node by upstreams on a daily basis.

Point 1 has been achieved by using JDBC format in spark sql.

Point 2 has been achieved by using shell script.

My only concern is about point 2: to see if there is any way I can do it in
my Spark app instead of the shell script.

Thanks.
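
For point 2, a hedged sketch: if the edge-node files are visible to the driver
(e.g. the application runs in client mode on the edge node), the copy can be
driven from inside the same Spark application with the Hadoop FileSystem API,
so the separate shell action goes away. The paths and the spark-csv package are
assumptions, and the copy itself runs on the driver rather than as part of the
RDD lineage.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)              // sc: existing SparkContext

// copy the local edge-node file into HDFS from the driver
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.copyFromLocalFile(
  new Path("file:///data/incoming/feed_20160607.csv"),
  new Path("hdfs:///landing/feed/feed_20160607.csv"))

// then read the landed file in the same application
val feed = sqlContext.read
  .format("com.databricks.spark.csv")            // assumes the spark-csv package is on the classpath
  .option("header", "true")
  .load("hdfs:///landing/feed/feed_20160607.csv")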

On Tuesday, June 7, 2016, Ted Yu  wrote:

> bq. load the data from edge node to hdfs
>
> Does the loading involve accessing sqlserver ?
>
> Please take a look at
> https://spark.apache.org/docs/latest/sql-programming-guide.html
>
> On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni  > wrote:
>
>> Hi
>> how about
>>
>> 1.  have a process that read the data from your sqlserver and dumps it as
>> a file into a directory on your hd
>> 2. use spark-streanming to read data from that directory  and store it
>> into hdfs
>>
>> perhaps there is some sort of spark 'connectors' that allows you to read
>> data from a db directly so you dont need to go via spk streaming?
>>
>>
>> hth
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander > > wrote:
>>
>>> Hi Spark users,
>>>
>>> Right now we are using spark for everything(loading the data from
>>> sqlserver, apply transformations, save it as permanent tables in
>>> hive) in our environment. Everything is being done in one spark application.
>>>
>>> The only thing we do before we launch our spark application through
>>> oozie is, to load the data from edge node to hdfs(it is being triggered
>>> through a ssh action from oozie to run shell script on edge node).
>>>
>>> My question is,  there's any way we can accomplish edge-to-hdfs copy
>>> through spark ? So that everything is done in one spark DAG and lineage
>>> graph?
>>>
>>> Any pointers are highly appreciated. Thanks
>>>
>>> Regards,
>>> Aj
>>>
>>
>>
>


Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
thanks I will have a look.

Mich

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 June 2016 at 13:38, Jörn Franke  wrote:

> Solr is basically an in-memory text index with a lot of capabilities for
> language analysis extraction (you can compare  it to a Google for your
> tweets). The system itself has a lot of features and has a complexity
> similar to Big data systems. This index files can be backed by HDFS. You
> can put the tweets directly into solr without going via HDFS files.
>
> Carefully decide what fields to index / you want to search. It does not
> make sense to index everything.
>
> On 07 Jun 2016, at 13:51, Mich Talebzadeh 
> wrote:
>
> Ok So basically for predictive off-line (as opposed to streaming) in a
> nutshell one can use Apache Flume to store twitter data in hdfs and use
> Solr to query the data?
>
> This is what it says:
>
> Solr is a standalone enterprise search server with a REST-like API. You
> put documents in it (called "indexing") via JSON, XML, CSV or binary over
> HTTP. You query it via HTTP GET and receive JSON, XML, CSV or binary
> results.
>
> thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 12:39, Jörn Franke  wrote:
>
>> Well I have seen that The algorithms mentioned are used for this. However
>> some preprocessing through solr makes sense - it takes care of synonyms,
>> homonyms, stemming etc
>>
>> On 07 Jun 2016, at 13:33, Mich Talebzadeh 
>> wrote:
>>
>> Thanks Jorn,
>>
>> To start I would like to explore how can one turn some of the data into
>> useful information.
>>
>> I would like to look at certain trend analysis. Simple correlation shows
>> that the more there is a mention of a typical topic say for example
>> "organic food" the more people are inclined to go for it. To see one can
>> deduce that orgaind food is a potential growth area.
>>
>> Now I have all infra-structure to ingest that data. Like using flume to
>> store it or Spark streaming to do near real time work.
>>
>> Now I want to slice and dice that data for say organic food.
>>
>> I presume this is a typical question.
>>
>> You mentioned Spark ml (machine learning?) . Is that something viable?
>>
>> Cheers
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 June 2016 at 12:22, Jörn Franke  wrote:
>>
>>> Spark ml Support Vector machines or neural networks could be candidates.
>>> For unstructured learning it could be clustering.
>>> For doing a graph analysis On the followers you can easily use Spark
>>> Graphx
>>> Keep in mind that each tweet contains a lot of meta data (location,
>>> followers etc) that is more or less structured.
>>> For unstructured text analytics (eg tweet itself)I recommend
>>> solr/ElasticSearch .
>>>
>>> However I am not sure what you want to do with the data exactly.
>>>
>>>
>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh 
>>> wrote:
>>>
>>> Hi,
>>>
>>> This is really a general question.
>>>
>>> I use Spark to get twitter data. I did some looking at it
>>>
>>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>> val tweets = TwitterUtils.createStream(ssc, None)
>>> val statuses = tweets.map(status => status.getText())
>>> statuses.print()
>>>
>>> Ok
>>>
>>> Also I can use Apache flume to store data in hdfs directory
>>>
>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
>>> Dflume.root.logger=DEBUG,console -n TwitterAgent
>>> Now that stores twitter data in binary format in  hdfs directory.
>>>
>>> My question is pretty basic.
>>>
>>> What is the best tool/language to dif in to that data. For example
>>> twitter streaming data. I am getting all sorts od stuff coming in. Say I am
>>> only interested in certain topics like sport etc. How can I detect the
>>> signal from the noise using what tool and language?
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>
>
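
On the question of separating the signal from the noise, a first cut can be a
keyword filter on the stream itself; a small sketch, where the keywords are
placeholders and sparkConf is an existing SparkConf:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sparkConf, Seconds(2))
val statuses = TwitterUtils.createStream(ssc, None).map(_.getText)

// keep only tweets that mention one of the topics of interest
val keywords = Seq("organic food", "sport")
val interesting = statuses.filter { text =>
  val t = text.toLowerCase
  keywords.exists(k => t.contains(k))
}
interesting.print()

ssc.start()
ssc.awaitTermination()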


Re: Specify node where driver should run

2016-06-07 Thread Sebastian Piu
If you run that job then the driver will ALWAYS run on the machine from
which you are issuing the spark-submit command (e.g. some edge node with
the clients installed), no matter where the resource manager is running.

If you change yarn-client for yarn-cluster then your driver will start
somewhere else in the cluster, as will the workers, and the spark-submit
command will return before the program finishes.

On Tue, 7 Jun 2016, 14:53 Jacek Laskowski,  wrote:

> Hi,
>
> --master yarn-client is deprecated and you should use --master yarn
> --deploy-mode client instead. There are two deploy-modes: client
> (default) and cluster. See
> http://spark.apache.org/docs/latest/cluster-overview.html.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jun 7, 2016 at 2:50 PM, Mich Talebzadeh
>  wrote:
> > ok thanks
> >
> > so I start SparkSubmit or similar Spark app on the Yarn resource manager
> > node.
> >
> > What you are stating is that Yan may decide to start the driver program
> in
> > another node as opposed to the resource manager node
> >
> > ${SPARK_HOME}/bin/spark-submit \
> > --driver-memory=4G \
> > --num-executors=5 \
> > --executor-memory=4G \
> > --master yarn-client \
> > --executor-cores=4 \
> >
> > Due to lack of resources in the resource manager node? What is the
> > likelihood of that. The resource manager node is the defector master
> node in
> > all probability much more powerful than other nodes. Also the node that
> > running resource manager is also running one of the node manager as
> well. So
> > in theory may be in practice may not?
> >
> > HTH
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> >
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >
> >
> >
> > http://talebzadehmich.wordpress.com
> >
> >
> >
> >
> > On 7 June 2016 at 13:20, Sebastian Piu  wrote:
> >>
> >> What you are explaining is right for yarn-client mode, but the question
> is
> >> about yarn-cluster in which case the spark driver is also submitted and
> run
> >> in one of the node managers
> >>
> >>
> >> On Tue, 7 Jun 2016, 13:45 Mich Talebzadeh, 
> >> wrote:
> >>>
> >>> can you elaborate on the above statement please.
> >>>
> >>> When you start yarn you start the resource manager daemon only on the
> >>> resource manager node
> >>>
> >>> yarn-daemon.sh start resourcemanager
> >>>
> >>> Then you start nodemanager deamons on all nodes
> >>>
> >>> yarn-daemon.sh start nodemanager
> >>>
> >>> A spark app has to start somewhere. That is SparkSubmit. and that is
> >>> deterministic. I start SparkSubmit that talks to Yarn Resource Manager
> that
> >>> initialises and registers an Application master. The crucial point is
> Yarn
> >>> Resource manager which is basically a resource scheduler. It optimizes
> for
> >>> cluster resource utilization to keep all resources in use all the time.
> >>> However, resource manager itself is on the resource manager node.
> >>>
> >>> Now I always start my Spark app on the same node as the resource
> manager
> >>> node and let Yarn take care of the rest.
> >>>
> >>> Thanks
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>>
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>>
> >>>
> >>> On 7 June 2016 at 12:17, Jacek Laskowski  wrote:
> 
>  Hi,
> 
>  It's not possible. YARN uses CPU and memory for resource constraints
> and
>  places AM on any node available. Same about executors (unless data
> locality
>  constraints the placement).
> 
>  Jacek
> 
>  On 6 Jun 2016 1:54 a.m., "Saiph Kappa"  wrote:
> >
> > Hi,
> >
> > In yarn-cluster mode, is there any way to specify on which node I
> want
> > the driver to run?
> >
> > Thanks.
> >>>
> >>>
> >
>


Re: Spark_Usecase

2016-06-07 Thread Ted Yu
bq. load the data from edge node to hdfs

Does the loading involve accessing sqlserver ?

Please take a look at
https://spark.apache.org/docs/latest/sql-programming-guide.html

On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni  wrote:

> Hi
> how about
>
> 1.  have a process that read the data from your sqlserver and dumps it as
> a file into a directory on your hd
> 2. use spark-streanming to read data from that directory  and store it
> into hdfs
>
> perhaps there is some sort of spark 'connectors' that allows you to read
> data from a db directly so you dont need to go via spk streaming?
>
>
> hth
>
>
>
>
>
>
>
>
>
>
> On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander  wrote:
>
>> Hi Spark users,
>>
>> Right now we are using spark for everything(loading the data from
>> sqlserver, apply transformations, save it as permanent tables in
>> hive) in our environment. Everything is being done in one spark application.
>>
>> The only thing we do before we launch our spark application through
>> oozie is, to load the data from edge node to hdfs(it is being triggered
>> through a ssh action from oozie to run shell script on edge node).
>>
>> My question is,  there's any way we can accomplish edge-to-hdfs copy
>> through spark ? So that everything is done in one spark DAG and lineage
>> graph?
>>
>> Any pointers are highly appreciated. Thanks
>>
>> Regards,
>> Aj
>>
>
>


Re: Spark_Usecase

2016-06-07 Thread Marco Mistroni
Hi,
how about:

1. have a process that reads the data from your SQL Server and dumps it as a
file into a directory on your hard drive
2. use Spark Streaming to read data from that directory and store it into
HDFS

Perhaps there is some sort of Spark 'connector' that allows you to read
data from a db directly so you don't need to go via Spark Streaming?


hth










On Tue, Jun 7, 2016 at 3:09 PM, Ajay Chander  wrote:

> Hi Spark users,
>
> Right now we are using spark for everything(loading the data from
> sqlserver, apply transformations, save it as permanent tables in hive) in
> our environment. Everything is being done in one spark application.
>
> The only thing we do before we launch our spark application through
> oozie is, to load the data from edge node to hdfs(it is being triggered
> through a ssh action from oozie to run shell script on edge node).
>
> My question is,  there's any way we can accomplish edge-to-hdfs copy
> through spark ? So that everything is done in one spark DAG and lineage
> graph?
>
> Any pointers are highly appreciated. Thanks
>
> Regards,
> Aj
>


Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Spark users,

Right now we are using Spark for everything (loading the data from
SQL Server, applying transformations, saving the results as permanent tables
in Hive) in our environment. Everything is being done in one Spark
application.

The only thing we do before we launch our Spark application through
Oozie is to load the data from the edge node to HDFS (it is triggered
through an ssh action from Oozie that runs a shell script on the edge node).

My question is: is there any way we can accomplish the edge-to-HDFS copy
through Spark, so that everything is done in one Spark DAG and lineage
graph?

Any pointers are highly appreciated. Thanks

Regards,
Aj


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
Hi Dominik,

Right, and spark 1.6.x uses Kafka v0.8.2.x as I recall.  However, it
appears as though the v.0.8 consumer is compatible with the Kafka v0.9.x
broker, but not the other way around; sorry for the confusion there.

With the direct stream, simple consumer, offsets are tracked by Spark
Streaming within its checkpoints by default.  You can also manage them
yourself if desired.  How are you dealing with offsets ?

Can you verify the offsets on the broker:

kafka-run-class.sh kafka.tools.GetOffsetShell --topic  --broker-list
 --time -1

-Todd
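
As a concrete illustration of the offset tracking described above, a minimal
sketch for the Spark 1.6 direct stream; the broker address and topic name are
placeholders and ssc is assumed to be an existing StreamingContext:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))

// each batch RDD carries the offset ranges it was built from
stream.foreachRDD { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r =>
    println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}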

On Tue, Jun 7, 2016 at 8:17 AM, Dominik Safaric 
wrote:

> libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0"
> libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
> libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
> "1.6.1"
>
> Please take a look at the SBT copy.
>
> I would rather think that the problem is related to the Zookeeper/Kafka
> consumers.
>
> [2016-06-07 11:24:52,484] WARN Either no config or no quorum defined in
> config, running  in standalone mode
> (org.apache.zookeeper.server.quorum.QuorumPeerMain)
>
> Any indication onto why the channel connection might be closed? Would it
> be Kafka or Zookeeper related?
>
> On 07 Jun 2016, at 14:07, Todd Nist  wrote:
>
> What version of Spark are you using?  I do not believe that 1.6.x is
> compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2
> and 0.9.0.x.  See this for more information:
>
> https://issues.apache.org/jira/browse/SPARK-12177
>
> -Todd
>
> On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric 
> wrote:
>
>> Hi,
>>
>> Correct, I am using the 0.9.0.1 version.
>>
>> As already described, the topic contains messages. Those messages are
>> produced using the Confluence REST API.
>>
>> However, what I’ve observed is that the problem is not in the Spark
>> configuration, but rather Zookeeper or Kafka related.
>>
>> Take a look at the exception’s stack top item:
>>
>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([,0])
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at scala.util.Either.fold(Either.scala:97)
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>> at org.mediasoft.spark.Driver$.main(Driver.scala:22)
>> at .(:11)
>> at .()
>> at .(:7)
>>
>> By listing all active connections using netstat, I’ve also observed that
>> both Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka
>> 9092.
>>
>> Furthermore, I am also able to retrieve all log messages using the
>> console consumer.
>>
>> Any clue what might be going wrong?
>>
>> On 07 Jun 2016, at 13:13, Jacek Laskowski  wrote:
>>
>> Hi,
>>
>> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you?
>> What's the topic name?
>>
>> Jacek
>> On 7 Jun 2016 11:06 a.m., "Dominik Safaric" 
>> wrote:
>>
>>> As I am trying to integrate Kafka into Spark, the following exception
>>> occurs:
>>>
>>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>> Set([**,0])
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>>> at scala.util.Either.fold(Either.scala:97)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>>> at
>>>
>>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>>> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
>>> at .(:11)
>>> at .()
>>> at .(:7)
>>> at .()
>>> at $print()
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:483)
>>> at
>>> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
>>> at
>>> 

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-07 Thread Daniel Darabos
On Sun, Jun 5, 2016 at 9:51 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

> If you fill up the cache, 1.6.0+ will suffer performance degradation from
> GC thrashing. You can set spark.memory.useLegacyMode to true, or
> spark.memory.fraction to 0.66, or spark.executor.extraJavaOptions to
> -XX:NewRatio=3 to avoid this issue.
>
> I think my colleague filed a ticket for this issue, but I can't find it
> now. So treat it like unverified rumor for now, and try it for yourself if
> you're out of better ideas :). Good luck!
>

FYI, there is a ticket for this issue now, with many more details:
https://issues.apache.org/jira/browse/SPARK-15796
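
For reference, the workarounds quoted above expressed as Spark configuration;
a sketch only, using the values mentioned in the thread rather than tuned
recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-app")
  .set("spark.memory.useLegacyMode", "true")
  // or, alternatively:
  // .set("spark.memory.fraction", "0.66")
  // .set("spark.executor.extraJavaOptions", "-XX:NewRatio=3")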

On Sat, Jun 4, 2016 at 11:49 AM, Cosmin Ciobanu  wrote:
>
>> Microbatch is 20 seconds. We’re not using window operations.
>>
>>
>>
>> The graphs are for a test cluster, and the entire load is artificially
>> generated by load tests (100k / 200k generated sessions).
>>
>>
>>
>> We’ve performed a few more performance tests. On the same 5 node cluster,
>> with the same application:
>>
>> · Spark 1.5.1 handled 170k+ generated sessions for 24hours with
>> no scheduling delay – the limit seems to be around 180k, above which
>> scheduling delay starts to increase;
>>
>> · Spark 1.6.1 had constant upward-trending scheduling delay from
>> the beginning for 100k+ generated sessions (this is also mentioned in the
>> initial post) – the load test was stopped after 25 minutes as scheduling
>> delay reached 3,5 minutes.
>>
>>
>>
>> P.S. Florin and I will be in SF next week, attending the Spark Summit on
>> Tuesday and Wednesday. We can meet and go into more details there - is
>> anyone working on Spark Streaming available?
>>
>>
>>
>> Cosmin
>>
>>
>>
>>
>>
>> *From: *Mich Talebzadeh 
>> *Date: *Saturday 4 June 2016 at 12:33
>> *To: *Florin Broască 
>> *Cc: *David Newberger , Adrian Tanase <
>> atan...@adobe.com>, "user@spark.apache.org" ,
>> ciobanu 
>> *Subject: *Re: [REPOST] Severe Spark Streaming performance degradation
>> after upgrading to 1.6.1
>>
>>
>>
>> batch interval I meant
>>
>>
>>
>> thx
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn  
>> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>> On 4 June 2016 at 10:32, Mich Talebzadeh 
>> wrote:
>>
>> I may have missed these but:
>>
>>
>>
>> What is the windows interval, windowsLength and SlidingWindow
>>
>>
>>
>> Has the volume of ingest data (Kafka streaming) changed recently that you
>> may not be aware of?
>>
>>
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn  
>> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>>
>> On 4 June 2016 at 09:50, Florin Broască  wrote:
>>
>> Hi David,
>>
>>
>>
>> Thanks for looking into this. This is how the processing time looks like:
>>
>>
>>
[image: inline image 1 - graph of batch processing times]
>>
>>
>>
>> Appreciate any input,
>>
>> Florin
>>
>>
>>
>>
>>
>> On Fri, Jun 3, 2016 at 3:22 PM, David Newberger <
>> david.newber...@wandcorp.com> wrote:
>>
>> What does your processing time look like. Is it consistently within that
>> 20sec micro batch window?
>>
>>
>>
>> *David Newberger*
>>
>>
>>
>> *From:* Adrian Tanase [mailto:atan...@adobe.com]
>> *Sent:* Friday, June 3, 2016 8:14 AM
>> *To:* user@spark.apache.org
>> *Cc:* Cosmin Ciobanu
>> *Subject:* [REPOST] Severe Spark Streaming performance degradation after
>> upgrading to 1.6.1
>>
>>
>>
>> Hi all,
>>
>>
>>
>> Trying to repost this question from a colleague on my team, somehow his
>> subscription is not active:
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Severe-Spark-Streaming-performance-degradation-after-upgrading-to-1-6-1-td27056.html
>>
>>
>>
>> Appreciate any thoughts,
>>
>> -adrian
>>
>>
>>
>>
>>
>>
>>
>
>


Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen  wrote:
> That's not any kind of authoritative statement, just my opinion and guess.

Oh, come on. You're not **a** Sean but **the** Sean (= a PMC member
and the JIRA/PRs keeper) so what you say **is** kinda official. Sorry.
But don't worry the PMC (the group) can decide whatever it wants to
decide so any date is a good date.

> Reynold mentioned the idea of releasing monthly milestone releases for
> the latest branch. That's an interesting idea for the future.

+1

> I know there was concern that publishing a preview release, which is like an 
> alpha,
> might leave alpha-quality code out there too long as the latest
> release. Hence, probably support for publishing some kind of preview 2
> or beta or whatever

The issue seems to have been sorted out when Matei and Reynold agreed
to push the preview out (which is a good thing!), and I'm sure
there'll be little to no concern to do it again and again. 2.0 is
certainly taking far too long (as if there were some magic in 2.0).

p.s. It's so frustrating to tell people about the latest and greatest
of 2.0 and then switch to 1.6.1 or even older in projects :( So
frustrating...

Jacek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 2.0 Release Date

2016-06-07 Thread Sean Owen
That's not any kind of authoritative statement, just my opinion and guess.

Reynold mentioned the idea of releasing monthly milestone releases for
the latest branch. That's an interesting idea for the future.

If the issue burn-down takes significantly longer to get right, then
maybe indeed another preview release would help. I know there was
concern that publishing a preview release, which is like an alpha,
might leave alpha-quality code out there too long as the latest
release. Hence, probably support for publishing some kind of preview 2
or beta or whatever

On Tue, Jun 7, 2016 at 2:20 PM, Jacek Laskowski  wrote:
> Finally, the PMC voice on the subject. Thanks a lot, Sean!
>
> p.s. Given how much time it takes to ship 2.0 (with so many cool
> features already backed in!) I'd vote for releasing a few more RCs
> before 2.0 hits the shelves. I hope 2.0 is not Java 9 or Jigsaw ;-)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Tue, Jun 7, 2016 at 3:06 PM, Sean Owen  wrote:
>> I don't believe the intent was to get it out before Spark Summit or
>> something. That shouldn't drive the schedule anyway. But now that
>> there's a 2.0.0 preview available, people who are eager to experiment
>> or test on it can do so now.
>>
>> That probably reduces urgency to get it out the door in order to
>> deliver new functionality. I guessed the 2.0.0 release would be mid
>> June and now I'd guess early July. But, nobody's discussed it per se.
>>
>> In theory only fixes, tests and docs are being merged, so the JIRA
>> count should be going down. It has, slowly. Right now there are 72
>> open issues for 2.0.0, of which 20 are blockers. Most of those are
>> simple "audit x" or "document x" tasks or umbrellas, but, they do
>> represent things that have to get done before a release, and that to
>> me looks like a few more weeks of finishing, pushing, and sweeping
>> under carpets.
>>
>>
>> On Tue, Jun 7, 2016 at 1:45 PM, Jacek Laskowski  wrote:
>>> On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel  wrote:
 Do we have any further updates on release date?
>>>
>>> Nope :( And it's even more quiet than I could have thought. I was so
>>> certain that today's the date. Looks like Spark Summit has "consumed"
>>> all the people behind 2.0...Can't believe no one (from the
>>> PMC/committers) even mean to shoot a date :( Patrick's gone. Reynold's
>>> busy. Perhaps Sean?
>>>
 Also, Is there a updated documentation for 2.0 somewhere?
>>>
>>> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ ?
>>>
>>> Jacek
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Solr is basically an in-memory text index with a lot of capabilities for 
language analysis extraction (you can compare  it to a Google for your tweets). 
The system itself has a lot of features and has a complexity similar to Big 
data systems. This index files can be backed by HDFS. You can put the tweets 
directly into solr without going via HDFS files.

Carefully decide what fields to index / you want to search. It does not make 
sense to index everything.
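
As a rough illustration of that advice (this sketch is not from the original
thread), indexing and querying tweets with the SolrJ client could look roughly
like the following. The core name "tweets", the field names and the localhost
URL are assumptions; index only the fields you actually want to search:

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

object TweetIndexSketch {
  def main(args: Array[String]): Unit = {
    // Assumed: a Solr core named "tweets" running on the default port.
    val solr = new HttpSolrClient("http://localhost:8983/solr/tweets")

    // Index only the fields you intend to search or facet on.
    val doc = new SolrInputDocument()
    doc.addField("id", "742000000000000000")              // tweet id (assumed field name)
    doc.addField("text_t", "organic food is trending")    // analysed text field
    doc.addField("created_at_dt", "2016-06-07T11:00:00Z") // date field
    solr.add(doc)
    solr.commit()

    // Query it back; SolrJ wraps the HTTP GET described elsewhere in the thread.
    val response = solr.query(new SolrQuery("text_t:\"organic food\""))
    println(s"hits: ${response.getResults.getNumFound}")

    solr.close()
  }
}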

> On 07 Jun 2016, at 13:51, Mich Talebzadeh  wrote:
> 
> Ok So basically for predictive off-line (as opposed to streaming) in a 
> nutshell one can use Apache Flume to store twitter data in hdfs and use Solr 
> to query the data?
> 
> This is what it says:
> 
> Solr is a standalone enterprise search server with a REST-like API. You put 
> documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP. 
> You query it via HTTP GET and receive JSON, XML, CSV or binary results.
> 
> thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 7 June 2016 at 12:39, Jörn Franke  wrote:
>> Well I have seen that The algorithms mentioned are used for this. However 
>> some preprocessing through solr makes sense - it takes care of synonyms, 
>> homonyms, stemming etc
>> 
>>> On 07 Jun 2016, at 13:33, Mich Talebzadeh  wrote:
>>> 
>>> Thanks Jorn,
>>> 
>>> To start I would like to explore how can one turn some of the data into 
>>> useful information.
>>> 
>>> I would like to look at certain trend analysis. Simple correlation shows 
>>> that the more there is a mention of a typical topic say for example 
>>> "organic food" the more people are inclined to go for it. To see one can 
>>> deduce that orgaind food is a potential growth area.
>>> 
>>> Now I have all infra-structure to ingest that data. Like using flume to 
>>> store it or Spark streaming to do near real time work.
>>> 
>>> Now I want to slice and dice that data for say organic food.
>>> 
>>> I presume this is a typical question.
>>> 
>>> You mentioned Spark ml (machine learning?) . Is that something viable?
>>> 
>>> Cheers
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 7 June 2016 at 12:22, Jörn Franke  wrote:
 Spark ml Support Vector machines or neural networks could be candidates. 
 For unstructured learning it could be clustering.
 For doing a graph analysis On the followers you can easily use Spark Graphx
 Keep in mind that each tweet contains a lot of meta data (location, 
 followers etc) that is more or less structured.
 For unstructured text analytics (eg tweet itself)I recommend 
 solr/ElasticSearch .
 
 However I am not sure what you want to do with the data exactly.
 
 
> On 07 Jun 2016, at 13:16, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> This is really a general question.
> 
> I use Spark to get twitter data. I did some looking at it
> 
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> val tweets = TwitterUtils.createStream(ssc, None)
> val statuses = tweets.map(status => status.getText())
> statuses.print()
> 
> Ok
> 
> Also I can use Apache flume to store data in hdfs directory
> 
> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf 
> Dflume.root.logger=DEBUG,console -n TwitterAgent
> Now that stores twitter data in binary format in  hdfs directory.
> 
> My question is pretty basic.
> 
> What is the best tool/language to dif in to that data. For example 
> twitter streaming data. I am getting all sorts od stuff coming in. Say I 
> am only interested in certain topics like sport etc. How can I detect the 
> signal from the noise using what tool and language?
> 
> Thanks
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 


Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
Finally, the PMC voice on the subject. Thanks a lot, Sean!

p.s. Given how much time it takes to ship 2.0 (with so many cool
features already backed in!) I'd vote for releasing a few more RCs
before 2.0 hits the shelves. I hope 2.0 is not Java 9 or Jigsaw ;-)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 3:06 PM, Sean Owen  wrote:
> I don't believe the intent was to get it out before Spark Summit or
> something. That shouldn't drive the schedule anyway. But now that
> there's a 2.0.0 preview available, people who are eager to experiment
> or test on it can do so now.
>
> That probably reduces urgency to get it out the door in order to
> deliver new functionality. I guessed the 2.0.0 release would be mid
> June and now I'd guess early July. But, nobody's discussed it per se.
>
> In theory only fixes, tests and docs are being merged, so the JIRA
> count should be going down. It has, slowly. Right now there are 72
> open issues for 2.0.0, of which 20 are blockers. Most of those are
> simple "audit x" or "document x" tasks or umbrellas, but, they do
> represent things that have to get done before a release, and that to
> me looks like a few more weeks of finishing, pushing, and sweeping
> under carpets.
>
>
> On Tue, Jun 7, 2016 at 1:45 PM, Jacek Laskowski  wrote:
>> On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel  wrote:
>>> Do we have any further updates on release date?
>>
>> Nope :( And it's even more quiet than I could have thought. I was so
>> certain that today's the date. Looks like Spark Summit has "consumed"
>> all the people behind 2.0...Can't believe no one (from the
>> PMC/committers) even mean to shoot a date :( Patrick's gone. Reynold's
>> busy. Perhaps Sean?
>>
>>> Also, Is there a updated documentation for 2.0 somewhere?
>>
>> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ ?
>>
>> Jacek
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 2.0 Release Date

2016-06-07 Thread Sean Owen
I don't believe the intent was to get it out before Spark Summit or
something. That shouldn't drive the schedule anyway. But now that
there's a 2.0.0 preview available, people who are eager to experiment
or test on it can do so now.

That probably reduces urgency to get it out the door in order to
deliver new functionality. I guessed the 2.0.0 release would be mid
June and now I'd guess early July. But, nobody's discussed it per se.

In theory only fixes, tests and docs are being merged, so the JIRA
count should be going down. It has, slowly. Right now there are 72
open issues for 2.0.0, of which 20 are blockers. Most of those are
simple "audit x" or "document x" tasks or umbrellas, but, they do
represent things that have to get done before a release, and that to
me looks like a few more weeks of finishing, pushing, and sweeping
under carpets.


On Tue, Jun 7, 2016 at 1:45 PM, Jacek Laskowski  wrote:
> On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel  wrote:
>> Do we have any further updates on release date?
>
> Nope :( And it's even more quiet than I could have thought. I was so
> certain that today's the date. Looks like Spark Summit has "consumed"
> all the people behind 2.0...Can't believe no one (from the
> PMC/committers) even mean to shoot a date :( Patrick's gone. Reynold's
> busy. Perhaps Sean?
>
>> Also, Is there a updated documentation for 2.0 somewhere?
>
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ ?
>
> Jacek
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Specify node where driver should run

2016-06-07 Thread Jacek Laskowski
Hi,

--master yarn-client is deprecated and you should use --master yarn
--deploy-mode client instead. There are two deploy-modes: client
(default) and cluster. See
http://spark.apache.org/docs/latest/cluster-overview.html.
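
If you need to pick the deploy mode programmatically rather than on the
command line, the SparkLauncher API can do the same thing. A minimal sketch
(the jar path, main class and executor memory are placeholders, not values
from this thread):

import org.apache.spark.launcher.SparkLauncher

object SubmitToYarn {
  def main(args: Array[String]): Unit = {
    // Equivalent of: spark-submit --master yarn --deploy-mode cluster ...
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // placeholder application jar
      .setMainClass("com.example.MyApp")       // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")                // or "client"
      .setConf("spark.executor.memory", "4g")
      .launch()

    process.waitFor()                          // block until spark-submit exits
  }
}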

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Tue, Jun 7, 2016 at 2:50 PM, Mich Talebzadeh
 wrote:
> ok thanks
>
> so I start SparkSubmit or similar Spark app on the Yarn resource manager
> node.
>
> What you are stating is that Yan may decide to start the driver program in
> another node as opposed to the resource manager node
>
> ${SPARK_HOME}/bin/spark-submit \
> --driver-memory=4G \
> --num-executors=5 \
> --executor-memory=4G \
> --master yarn-client \
> --executor-cores=4 \
>
> Due to lack of resources in the resource manager node? What is the
> likelihood of that. The resource manager node is the defector master node in
> all probability much more powerful than other nodes. Also the node that
> running resource manager is also running one of the node manager as well. So
> in theory may be in practice may not?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 7 June 2016 at 13:20, Sebastian Piu  wrote:
>>
>> What you are explaining is right for yarn-client mode, but the question is
>> about yarn-cluster in which case the spark driver is also submitted and run
>> in one of the node managers
>>
>>
>> On Tue, 7 Jun 2016, 13:45 Mich Talebzadeh, 
>> wrote:
>>>
>>> can you elaborate on the above statement please.
>>>
>>> When you start yarn you start the resource manager daemon only on the
>>> resource manager node
>>>
>>> yarn-daemon.sh start resourcemanager
>>>
>>> Then you start nodemanager deamons on all nodes
>>>
>>> yarn-daemon.sh start nodemanager
>>>
>>> A spark app has to start somewhere. That is SparkSubmit. and that is
>>> deterministic. I start SparkSubmit that talks to Yarn Resource Manager that
>>> initialises and registers an Application master. The crucial point is Yarn
>>> Resource manager which is basically a resource scheduler. It optimizes for
>>> cluster resource utilization to keep all resources in use all the time.
>>> However, resource manager itself is on the resource manager node.
>>>
>>> Now I always start my Spark app on the same node as the resource manager
>>> node and let Yarn take care of the rest.
>>>
>>> Thanks
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>>
>>> On 7 June 2016 at 12:17, Jacek Laskowski  wrote:

 Hi,

 It's not possible. YARN uses CPU and memory for resource constraints and
 places AM on any node available. Same about executors (unless data locality
 constraints the placement).

 Jacek

 On 6 Jun 2016 1:54 a.m., "Saiph Kappa"  wrote:
>
> Hi,
>
> In yarn-cluster mode, is there any way to specify on which node I want
> the driver to run?
>
> Thanks.
>>>
>>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
ok thanks

so I start SparkSubmit or similar Spark app on the Yarn resource manager
node.

What you are stating is that Yarn may decide to start the driver program on
another node as opposed to the resource manager node.

${SPARK_HOME}/bin/spark-submit \
--driver-memory=4G \
--num-executors=5 \
--executor-memory=4G \
--master yarn-client \
--executor-cores=4 \

Due to lack of resources on the resource manager node? What is the
likelihood of that? The resource manager node is the de facto master node and
in all probability much more powerful than the other nodes. Also, the node
running the resource manager is also running one of the node managers. So it
may happen in theory, but perhaps not in practice?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 7 June 2016 at 13:20, Sebastian Piu  wrote:

> What you are explaining is right for yarn-client mode, but the question is
> about yarn-cluster in which case the spark driver is also submitted and run
> in one of the node managers
>
>
> On Tue, 7 Jun 2016, 13:45 Mich Talebzadeh, 
> wrote:
>
>> can you elaborate on the above statement please.
>>
>> When you start yarn you start the resource manager daemon only on the
>> resource manager node
>>
>> yarn-daemon.sh start resourcemanager
>>
>> Then you start nodemanager deamons on all nodes
>>
>> yarn-daemon.sh start nodemanager
>>
>> A spark app has to start somewhere. That is SparkSubmit. and that is
>> deterministic. I start SparkSubmit that talks to Yarn Resource Manager that
>> initialises and registers an Application master. The crucial point is Yarn
>> Resource manager which is basically a resource scheduler. It optimizes for
>> cluster resource utilization to keep all resources in use all the time.
>> However, resource manager itself is on the resource manager node.
>>
>> Now I always start my Spark app on the same node as the resource manager
>> node and let Yarn take care of the rest.
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 7 June 2016 at 12:17, Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> It's not possible. YARN uses CPU and memory for resource constraints and
>>> places AM on any node available. Same about executors (unless data locality
>>> constraints the placement).
>>>
>>> Jacek
>>> On 6 Jun 2016 1:54 a.m., "Saiph Kappa"  wrote:
>>>
 Hi,

 In yarn-cluster mode, is there any way to specify on which node I want
 the driver to run?

 Thanks.

>>>
>>


Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel  wrote:
> Do we have any further updates on release date?

Nope :( And it's even more quiet than I could have thought. I was so
certain that today's the date. Looks like Spark Summit has "consumed"
all the people behind 2.0...Can't believe no one (from the
PMC/committers) even mean to shoot a date :( Patrick's gone. Reynold's
busy. Perhaps Sean?

> Also, Is there a updated documentation for 2.0 somewhere?

http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/ ?

Jacek

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Specify node where driver should run

2016-06-07 Thread Sebastian Piu
What you are explaining is right for yarn-client mode, but the question is
about yarn-cluster mode, in which case the Spark driver is also submitted and
run in one of the node managers.

On Tue, 7 Jun 2016, 13:45 Mich Talebzadeh, 
wrote:

> can you elaborate on the above statement please.
>
> When you start yarn you start the resource manager daemon only on the
> resource manager node
>
> yarn-daemon.sh start resourcemanager
>
> Then you start nodemanager deamons on all nodes
>
> yarn-daemon.sh start nodemanager
>
> A spark app has to start somewhere. That is SparkSubmit. and that is
> deterministic. I start SparkSubmit that talks to Yarn Resource Manager that
> initialises and registers an Application master. The crucial point is Yarn
> Resource manager which is basically a resource scheduler. It optimizes for
> cluster resource utilization to keep all resources in use all the time.
> However, resource manager itself is on the resource manager node.
>
> Now I always start my Spark app on the same node as the resource manager
> node and let Yarn take care of the rest.
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 12:17, Jacek Laskowski  wrote:
>
>> Hi,
>>
>> It's not possible. YARN uses CPU and memory for resource constraints and
>> places AM on any node available. Same about executors (unless data locality
>> constraints the placement).
>>
>> Jacek
>> On 6 Jun 2016 1:54 a.m., "Saiph Kappa"  wrote:
>>
>>> Hi,
>>>
>>> In yarn-cluster mode, is there any way to specify on which node I want
>>> the driver to run?
>>>
>>> Thanks.
>>>
>>
>


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % 
"1.6.1"
Please take a look at the SBT copy. 

I would rather think that the problem is related to the Zookeeper/Kafka 
consumers. 

[2016-06-07 11:24:52,484] WARN Either no config or no quorum defined in config, 
running  in standalone mode (org.apache.zookeeper.server.quorum.QuorumPeerMain)

Any indication as to why the channel connection might be closed? Would it be
Kafka or Zookeeper related?

> On 07 Jun 2016, at 14:07, Todd Nist  wrote:
> 
> What version of Spark are you using?  I do not believe that 1.6.x is 
> compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2 
> and 0.9.0.x.  See this for more information:
> 
> https://issues.apache.org/jira/browse/SPARK-12177 
> 
> 
> -Todd
> 
> On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric  > wrote:
> Hi,
> 
> Correct, I am using the 0.9.0.1 version. 
> 
> As already described, the topic contains messages. Those messages are 
> produced using the Confluence REST API.
> 
> However, what I’ve observed is that the problem is not in the Spark 
> configuration, but rather Zookeeper or Kafka related. 
> 
> Take a look at the exception’s stack top item:
> 
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for 
> Set([,0])
>   at 
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>   at 
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>   at scala.util.Either.fold(Either.scala:97)
>   at 
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>   at 
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>   at 
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>   at org.mediasoft.spark.Driver$.main(Driver.scala:22)
>   at .(:11)
>   at .()
>   at .(:7)
> 
> By listing all active connections using netstat, I’ve also observed that both 
> Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka 9092. 
> 
> Furthermore, I am also able to retrieve all log messages using the console 
> consumer.
> 
> Any clue what might be going wrong?
> 
>> On 07 Jun 2016, at 13:13, Jacek Laskowski > > wrote:
>> 
>> Hi,
>> 
>> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's 
>> the topic name?
>> 
>> Jacek
>> 
>> On 7 Jun 2016 11:06 a.m., "Dominik Safaric" > > wrote:
>> As I am trying to integrate Kafka into Spark, the following exception occurs:
>> 
>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([**,0])
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at scala.util.Either.fold(Either.scala:97)
>> at
>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>> at
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
>> at .(:11)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:483)
>> at 
>> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
>> at 
>> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
>> at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
>> at 
>> scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
>> at 
>> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
>> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
>> at 

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
What version of Spark are you using?  I do not believe that 1.6.x is
compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2
and 0.9.0.x.  See this for more information:

https://issues.apache.org/jira/browse/SPARK-12177
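
If the client versions are the culprit, the first thing worth checking is that
the three Spark artifacts in the build are on the same release; the SBT snippet
posted earlier in this thread mixes 1.6.0 and 1.6.1. A hedged sketch of an
aligned build.sbt fragment (Scala 2.10 and Spark 1.6.1 assumed; note that the
Kafka client bundled with spark-streaming-kafka 1.6.x is still from the 0.8.2
line, not 0.9.x):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"            % "1.6.1",
  "org.apache.spark" %% "spark-streaming"       % "1.6.1",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
)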

-Todd

On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric 
wrote:

> Hi,
>
> Correct, I am using the 0.9.0.1 version.
>
> As already described, the topic contains messages. Those messages are
> produced using the Confluence REST API.
>
> However, what I’ve observed is that the problem is not in the Spark
> configuration, but rather Zookeeper or Kafka related.
>
> Take a look at the exception’s stack top item:
>
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([,0])
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at scala.util.Either.fold(Either.scala:97)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at org.mediasoft.spark.Driver$.main(Driver.scala:22)
> at .(:11)
> at .()
> at .(:7)
>
> By listing all active connections using netstat, I’ve also observed that
> both Zookeper and Kafka are running. Zookeeper on port 2181, while Kafka
> 9092.
>
> Furthermore, I am also able to retrieve all log messages using the console
> consumer.
>
> Any clue what might be going wrong?
>
> On 07 Jun 2016, at 13:13, Jacek Laskowski  wrote:
>
> Hi,
>
> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's
> the topic name?
>
> Jacek
> On 7 Jun 2016 11:06 a.m., "Dominik Safaric" 
> wrote:
>
>> As I am trying to integrate Kafka into Spark, the following exception
>> occurs:
>>
>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([**,0])
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at scala.util.Either.fold(Either.scala:97)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
>> at .(:11)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:483)
>> at
>> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
>> at
>> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
>> at
>> scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
>> at
>> scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
>> at
>> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
>> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
>> at
>> scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
>> at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
>> at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
>> at
>>
>> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
>> at
>>
>> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
>> at
>>
>> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
>> at
>>
>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
>> at scala.tools.nsc.interpreter.ILoop.main(ILoop.scala:904)
>> at
>>
>> org.jetbrains.plugins.scala.compiler.rt.ConsoleRunner.main(ConsoleRunner.java:64)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> 

Re: Advice on Scaling RandomForest

2016-06-07 Thread Jörn Franke
Before hardware optimization there is always software optimization.
Are you using Dataset/DataFrame? Are you using the right data types (e.g. int
where int is appropriate; try to avoid string and char, etc.)?
Do you extract only the stuff needed? What are the algorithm parameters?
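
To make the "algorithm parameters" question concrete, below is a minimal
spark.ml sketch showing the knobs that usually dominate RandomForest training
cost (tree count, depth, bins, subsampling). The parquet path and the column
names are placeholders, not details from the original post:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.regression.RandomForestRegressor

object RandomForestSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rf-training"))
    val sqlContext = new SQLContext(sc)

    // Placeholder path; assumes a "features" vector column and a "label" column.
    val training = sqlContext.read.parquet("s3://my-bucket/training.parquet")

    val rf = new RandomForestRegressor()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(100)               // more trees: more cost, more stability
      .setMaxDepth(10)                // depth drives memory and runtime sharply
      .setMaxBins(32)
      .setSubsamplingRate(0.7)        // each tree trains on a sample of the rows
      .setFeatureSubsetStrategy("sqrt")

    val model = rf.fit(training)
    println(model)                    // inspect the fitted model summary
  }
}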

> On 07 Jun 2016, at 13:09, Franc Carter  wrote:
> 
> 
> Hi,
> 
> I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am 
> interested in how it might be best to scale it - e.g more cpus per instances, 
> more memory per instance, more instances etc.
> 
> I'm currently using 32 m3.xlarge instances for for a training set with 2.5 
> million rows, 1300 columns and a total size of 31GB (parquet)
> 
> thanks
> 
> -- 
> Franc

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
OK. So basically, for predictive offline analysis (as opposed to streaming),
in a nutshell one can use Apache Flume to store Twitter data in HDFS and use
Solr to query the data?

This is what it says:

Solr is a standalone enterprise search server with a REST-like API. You put
documents in it (called "indexing") via JSON, XML, CSV or binary over HTTP.
You query it via HTTP GET and receive JSON, XML, CSV or binary results.

thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 7 June 2016 at 12:39, Jörn Franke  wrote:

> Well I have seen that The algorithms mentioned are used for this. However
> some preprocessing through solr makes sense - it takes care of synonyms,
> homonyms, stemming etc
>
> On 07 Jun 2016, at 13:33, Mich Talebzadeh 
> wrote:
>
> Thanks Jorn,
>
> To start I would like to explore how can one turn some of the data into
> useful information.
>
> I would like to look at certain trend analysis. Simple correlation shows
> that the more there is a mention of a typical topic say for example
> "organic food" the more people are inclined to go for it. To see one can
> deduce that orgaind food is a potential growth area.
>
> Now I have all infra-structure to ingest that data. Like using flume to
> store it or Spark streaming to do near real time work.
>
> Now I want to slice and dice that data for say organic food.
>
> I presume this is a typical question.
>
> You mentioned Spark ml (machine learning?) . Is that something viable?
>
> Cheers
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 12:22, Jörn Franke  wrote:
>
>> Spark ml Support Vector machines or neural networks could be candidates.
>> For unstructured learning it could be clustering.
>> For doing a graph analysis On the followers you can easily use Spark
>> Graphx
>> Keep in mind that each tweet contains a lot of meta data (location,
>> followers etc) that is more or less structured.
>> For unstructured text analytics (eg tweet itself)I recommend
>> solr/ElasticSearch .
>>
>> However I am not sure what you want to do with the data exactly.
>>
>>
>> On 07 Jun 2016, at 13:16, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> This is really a general question.
>>
>> I use Spark to get twitter data. I did some looking at it
>>
>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>> val tweets = TwitterUtils.createStream(ssc, None)
>> val statuses = tweets.map(status => status.getText())
>> statuses.print()
>>
>> Ok
>>
>> Also I can use Apache flume to store data in hdfs directory
>>
>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
>> Dflume.root.logger=DEBUG,console -n TwitterAgent
>> Now that stores twitter data in binary format in  hdfs directory.
>>
>> My question is pretty basic.
>>
>> What is the best tool/language to dif in to that data. For example
>> twitter streaming data. I am getting all sorts od stuff coming in. Say I am
>> only interested in certain topics like sport etc. How can I detect the
>> signal from the noise using what tool and language?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>>
>


Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
can you elaborate on the above statement please.

When you start Yarn you start the resource manager daemon only on the
resource manager node:

yarn-daemon.sh start resourcemanager

Then you start the nodemanager daemons on all nodes:

yarn-daemon.sh start nodemanager

A Spark app has to start somewhere. That is SparkSubmit, and that is
deterministic. I start SparkSubmit, which talks to the Yarn Resource Manager,
which initialises and registers an Application Master. The crucial point is
the Yarn Resource Manager, which is basically a resource scheduler. It
optimizes for cluster resource utilization to keep all resources in use all
the time. However, the resource manager itself is on the resource manager node.

Now I always start my Spark app on the same node as the resource manager
node and let Yarn take care of the rest.

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 7 June 2016 at 12:17, Jacek Laskowski  wrote:

> Hi,
>
> It's not possible. YARN uses CPU and memory for resource constraints and
> places AM on any node available. Same about executors (unless data locality
> constraints the placement).
>
> Jacek
> On 6 Jun 2016 1:54 a.m., "Saiph Kappa"  wrote:
>
>> Hi,
>>
>> In yarn-cluster mode, is there any way to specify on which node I want
>> the driver to run?
>>
>> Thanks.
>>
>


Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Well, I have seen that the algorithms mentioned are used for this. However,
some preprocessing through Solr makes sense - it takes care of synonyms,
homonyms, stemming etc.

> On 07 Jun 2016, at 13:33, Mich Talebzadeh  wrote:
> 
> Thanks Jorn,
> 
> To start I would like to explore how can one turn some of the data into 
> useful information.
> 
> I would like to look at certain trend analysis. Simple correlation shows that 
> the more there is a mention of a typical topic say for example "organic food" 
> the more people are inclined to go for it. To see one can deduce that orgaind 
> food is a potential growth area. 
> 
> Now I have all infra-structure to ingest that data. Like using flume to store 
> it or Spark streaming to do near real time work.
> 
> Now I want to slice and dice that data for say organic food.
> 
> I presume this is a typical question.
> 
> You mentioned Spark ml (machine learning?) . Is that something viable?
> 
> Cheers
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 7 June 2016 at 12:22, Jörn Franke  wrote:
>> Spark ml Support Vector machines or neural networks could be candidates. 
>> For unstructured learning it could be clustering.
>> For doing a graph analysis On the followers you can easily use Spark Graphx
>> Keep in mind that each tweet contains a lot of meta data (location, 
>> followers etc) that is more or less structured.
>> For unstructured text analytics (eg tweet itself)I recommend 
>> solr/ElasticSearch .
>> 
>> However I am not sure what you want to do with the data exactly.
>> 
>> 
>>> On 07 Jun 2016, at 13:16, Mich Talebzadeh  wrote:
>>> 
>>> Hi,
>>> 
>>> This is really a general question.
>>> 
>>> I use Spark to get twitter data. I did some looking at it
>>> 
>>> val ssc = new StreamingContext(sparkConf, Seconds(2))
>>> val tweets = TwitterUtils.createStream(ssc, None)
>>> val statuses = tweets.map(status => status.getText())
>>> statuses.print()
>>> 
>>> Ok
>>> 
>>> Also I can use Apache flume to store data in hdfs directory
>>> 
>>> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf 
>>> Dflume.root.logger=DEBUG,console -n TwitterAgent
>>> Now that stores twitter data in binary format in  hdfs directory.
>>> 
>>> My question is pretty basic.
>>> 
>>> What is the best tool/language to dif in to that data. For example twitter 
>>> streaming data. I am getting all sorts od stuff coming in. Say I am only 
>>> interested in certain topics like sport etc. How can I detect the signal 
>>> from the noise using what tool and language?
>>> 
>>> Thanks
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
> 


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Hi,

Correct, I am using the 0.9.0.1 version. 

As already described, the topic contains messages. Those messages are
produced using the Confluent REST API.

However, what I’ve observed is that the problem is not in the Spark 
configuration, but rather Zookeeper or Kafka related. 

Take a look at the exception’s stack top item:

org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: Couldn't find leader offsets for 
Set([,0])
at 
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at 
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
at 
org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at org.mediasoft.spark.Driver$.main(Driver.scala:22)
at .(:11)
at .()
at .(:7)

By listing all active connections using netstat, I’ve also observed that both
Zookeeper and Kafka are running: Zookeeper on port 2181, Kafka on 9092.

Furthermore, I am also able to retrieve all log messages using the console 
consumer.

Any clue what might be going wrong?

> On 07 Jun 2016, at 13:13, Jacek Laskowski  wrote:
> 
> Hi,
> 
> What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's 
> the topic name?
> 
> Jacek
> 
> On 7 Jun 2016 11:06 a.m., "Dominik Safaric"  > wrote:
> As I am trying to integrate Kafka into Spark, the following exception occurs:
> 
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([**,0])
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at scala.util.Either.fold(Either.scala:97)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
> at .(:11)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
> at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
> at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
> at 
> scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
> at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
> at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
> at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
> at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
> at scala.tools.nsc.interpreter.ILoop.main(ILoop.scala:904)
> at
> org.jetbrains.plugins.scala.compiler.rt.ConsoleRunner.main(ConsoleRunner.java:64)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> 
> As for the Spark configuration:
> 
>

Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
Thanks Jorn,

To start, I would like to explore how one can turn some of the data into
useful information.

I would like to look at certain trend analysis. Simple correlation shows
that the more mentions there are of a typical topic, say for example
"organic food", the more people are inclined to go for it. From this one can
deduce that organic food is a potential growth area.

Now I have all the infrastructure to ingest that data, like using Flume to
store it or Spark Streaming to do near real-time work.

Now I want to slice and dice that data for, say, organic food.

I presume this is a typical question.

You mentioned Spark ml (machine learning?). Is that something viable?

Cheers





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 7 June 2016 at 12:22, Jörn Franke  wrote:

> Spark ml Support Vector machines or neural networks could be candidates.
> For unstructured learning it could be clustering.
> For doing a graph analysis On the followers you can easily use Spark Graphx
> Keep in mind that each tweet contains a lot of meta data (location,
> followers etc) that is more or less structured.
> For unstructured text analytics (eg tweet itself)I recommend
> solr/ElasticSearch .
>
> However I am not sure what you want to do with the data exactly.
>
>
> On 07 Jun 2016, at 13:16, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> This is really a general question.
>
> I use Spark to get twitter data. I did some looking at it
>
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> val tweets = TwitterUtils.createStream(ssc, None)
> val statuses = tweets.map(status => status.getText())
> statuses.print()
>
> Ok
>
> Also I can use Apache flume to store data in hdfs directory
>
> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
> Dflume.root.logger=DEBUG,console -n TwitterAgent
> Now that stores twitter data in binary format in  hdfs directory.
>
> My question is pretty basic.
>
> What is the best tool/language to dif in to that data. For example twitter
> streaming data. I am getting all sorts od stuff coming in. Say I am only
> interested in certain topics like sport etc. How can I detect the signal
> from the noise using what tool and language?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Do we have any further updates on release date?

Also, Is there a updated documentation for 2.0 somewhere?

Thanks
Arun

On Thu, Apr 28, 2016 at 4:50 PM, Jacek Laskowski  wrote:

> Hi Arun,
>
> My bet is...https://spark-summit.org/2016 :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Apr 28, 2016 at 1:43 PM, Arun Patel 
> wrote:
> > A small request.
> >
> > Would you mind providing an approximate date of Spark 2.0 release?  Is it
> > early May or Mid May or End of May?
> >
> > Thanks,
> > Arun
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Spark ML Support Vector Machines or neural networks could be candidates.
For unsupervised learning it could be clustering.
For doing a graph analysis on the followers you can easily use Spark GraphX.
Keep in mind that each tweet contains a lot of metadata (location, followers
etc.) that is more or less structured.
For unstructured text analytics (e.g. the tweet text itself) I recommend
Solr/Elasticsearch.

However I am not sure what you want to do with the data exactly.
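
As a rough illustration of the clustering route (this is a sketch, not
something from the thread), one could tokenize the tweet text, hash it into
term-frequency vectors and run k-means over it with spark.ml. The column
names, the number of clusters and the input DataFrame are all assumptions:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.DataFrame

// `tweets` is assumed to be a DataFrame with a string column "text",
// e.g. built from collected tweet statuses.
def clusterTweets(tweets: DataFrame): DataFrame = {
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val tf = new HashingTF().setInputCol("words").setOutputCol("features")
    .setNumFeatures(10000)
  val kmeans = new KMeans().setK(10).setFeaturesCol("features")
    .setPredictionCol("cluster")

  val pipeline = new Pipeline().setStages(Array(tokenizer, tf, kmeans))
  pipeline.fit(tweets).transform(tweets)   // adds a "cluster" column per tweet
}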


> On 07 Jun 2016, at 13:16, Mich Talebzadeh  wrote:
> 
> Hi,
> 
> This is really a general question.
> 
> I use Spark to get twitter data. I did some looking at it
> 
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> val tweets = TwitterUtils.createStream(ssc, None)
> val statuses = tweets.map(status => status.getText())
> statuses.print()
> 
> Ok
> 
> Also I can use Apache flume to store data in hdfs directory
> 
> $FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf 
> Dflume.root.logger=DEBUG,console -n TwitterAgent
> Now that stores twitter data in binary format in  hdfs directory.
> 
> My question is pretty basic.
> 
> What is the best tool/language to dif in to that data. For example twitter 
> streaming data. I am getting all sorts od stuff coming in. Say I am only 
> interested in certain topics like sport etc. How can I detect the signal from 
> the noise using what tool and language?
> 
> Thanks
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  


Re: Specify node where driver should run

2016-06-07 Thread Jacek Laskowski
Hi,

It's not possible. YARN uses CPU and memory for resource constraints and
places AM on any node available. Same about executors (unless data locality
constraints the placement).

Jacek
On 6 Jun 2016 1:54 a.m., "Saiph Kappa"  wrote:

> Hi,
>
> In yarn-cluster mode, is there any way to specify on which node I want the
> driver to run?
>
> Thanks.
>


Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
Hi,

This is really a general question.

I use Spark to get Twitter data. I did some looking at it:

val ssc = new StreamingContext(sparkConf, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val statuses = tweets.map(status => status.getText())
statuses.print()

Ok

Also I can use Apache flume to store data in hdfs directory

$FLUME_HOME/bin/flume-ng agent --conf ./conf/ -f conf/twitter.conf
-Dflume.root.logger=DEBUG,console -n TwitterAgent
Now that stores Twitter data in binary format in an HDFS directory.

My question is pretty basic.

What is the best tool/language to dig into that data, for example Twitter
streaming data? I am getting all sorts of stuff coming in. Say I am only
interested in certain topics like sport etc. How can I detect the signal
from the noise, and with what tool and language?
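
One very simple way to start separating signal from noise, sketched against
the statuses DStream from the snippet above (the keyword list and the window
lengths below are made up for illustration):

import org.apache.spark.streaming.Seconds

// `statuses` is the DStream[String] of tweet texts created above.
val keywords = Set("sport", "football", "olympics")   // assumed topics of interest
val matching = statuses.filter { text =>
  val lower = text.toLowerCase
  keywords.exists(k => lower.contains(k))
}

// Crude trend signal: count of matching tweets over a sliding 10-minute window.
matching.window(Seconds(600), Seconds(60)).count().print()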

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Jacek Laskowski
Hi,

What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's
the topic name?

Jacek
On 7 Jun 2016 11:06 a.m., "Dominik Safaric" 
wrote:

> As I am trying to integrate Kafka into Spark, the following exception
> occurs:
>
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([**,0])
> at
>
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at
>
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at scala.util.Either.fold(Either.scala:97)
> at
>
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
> at
>
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
> at
>
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
> at .(:11)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
> at
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
> at
> scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
> at
> scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
> at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
> at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
> at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
> at
>
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
>
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
> at scala.tools.nsc.interpreter.ILoop.main(ILoop.scala:904)
> at
>
> org.jetbrains.plugins.scala.compiler.rt.ConsoleRunner.main(ConsoleRunner.java:64)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
>
> As for the Spark configuration:
>
>val conf: SparkConf = new
> SparkConf().setAppName("AppName").setMaster("local[2]")
>
> val confParams: Map[String, String] = Map(
>   "metadata.broker.list" -> ":9092",
>   "auto.offset.reset" -> "largest"
> )
>
> val topics: Set[String] = Set("")
>
> val context: StreamingContext = new StreamingContext(conf, Seconds(1))
> val kafkaStream = KafkaUtils.createDirectStream(context,confParams,
> topics)
>
> kafkaStream.foreachRDD(rdd => {
>   rdd.collect().foreach(println)
> })
>
> context.start()
> context.awaitTermination()
>
> The Kafka topic does exist, Kafka server is up and running and I am able to
> produce messages to that particular topic using the Confluent REST API.
>
> What might the problem actually be?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Kafka-Integration-org-apache-spark-SparkException-Couldn-t-find-leader-offsets-for-Set-tp27103.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi,

I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am
interested in how it might be best to scale it - e.g more cpus per
instances, more memory per instance, more instances etc.

I'm currently using 32 m3.xlarge instances for a training set with 2.5
million rows, 1300 columns and a total size of 31GB (parquet).

thanks

-- 
Franc


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
For now you can move away from Spark and look at the cause of your Kafka
publishing problem.

Also check that zookeeper is running
jps
17102 QuorumPeerMain

runs on default port 2181

netstat -plten | grep 2181
tcp        0      0 :::2181      :::*      LISTEN      1005   8765628    17102/java

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 7 June 2016 at 11:59, Dominik Safaric  wrote:

> Sounds like the issue is with Kafka channel, it is closing.
>
>
> Made the same conclusion as well. I’ve even tried further refining the
> configuration files:
>
> Zookeeper properties:
>
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> # the directory where the snapshot is stored.
> dataDir=/tmp/zookeeper
> # the port at which the clients will connect
> clientPort=2181
> # disable the per-ip limit on the number of connections since this is a
> non-production config
> maxClientCnxns=20
>
> Kafka server properties:
>
> # Licensed to the Apache Software Foundation (ASF) under one or more
> # contributor license agreements.  See the NOTICE file distributed with
> # this work for additional information regarding copyright ownership.
> # The ASF licenses this file to You under the Apache License, Version 2.0
> # (the "License"); you may not use this file except in compliance with
> # the License.  You may obtain a copy of the License at
> #
> #http://www.apache.org/licenses/LICENSE-2.0
> #
> # Unless required by applicable law or agreed to in writing, software
> # distributed under the License is distributed on an "AS IS" BASIS,
> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
> # See the License for the specific language governing permissions and
> # limitations under the License.
> # see kafka.server.KafkaConfig for additional details and defaults
>
> # Server Basics #
>
> # The id of the broker. This must be set to a unique integer for each
> broker.
> broker.id=1
>
> # Socket Server Settings
> #
>
> listeners=PLAINTEXT://:9092
>
> # The port the socket server listens on
> #port=9092
>
> # Hostname the broker will bind to. If not set, the server will bind to
> all interfaces
> host.name=0.0.0.0
>
> # Hostname the broker will advertise to producers and consumers. If not
> set, it uses the
> # value for "host.name" if configured.  Otherwise, it will use the value
> returned from
> # java.net.InetAddress.getCanonicalHostName().
> #advertised.host.name=
>
> # The port to publish to ZooKeeper for clients to use. If this is not set,
> # it will publish the same port that the broker binds to.
> #advertised.port=
>
> # The number of threads handling network requests
> num.network.threads=3
>
> # The number of threads doing disk I/O
> num.io.threads=8
>
> # The send buffer (SO_SNDBUF) used by the socket server
> socket.send.buffer.bytes=102400
>
> # The receive buffer (SO_RCVBUF) used by the socket server
> socket.receive.buffer.bytes=102400
>
> # The maximum size of a request that the socket server will accept
> (protection against OOM)
> socket.request.max.bytes=104857600
>
>
> # Log Basics #
>
> # A comma seperated list of directories under which to store log files
> log.dirs=/tmp/kafka-logs
>
> # The default number of log partitions per topic. More partitions allow
> greater
> # parallelism for consumption, but this will also result in more files
> across
> # the brokers.
> num.partitions=1
>
> # The number of threads per data directory to be used for log recovery at
> startup and flushing at shutdown.
> # This value is recommended to be increased for installations with data
> dirs located in RAID array.
> num.recovery.threads.per.data.dir=1
>
> # Log Flush Policy
> #
>
> # Messages are immediately written to the filesystem but by default we
> only fsync() to sync
> # the OS cache lazily. The 

Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-07 Thread Ofir Manor
TD - this might not be the best forum, but (1) - batch left outer stream -
is always feasible under reasonable constraints, for example a window
constraint on the stream.

I think it would be super useful to have a central place in the 2.0 docs
that spells out what exactly is included, what is targeted to 2.1 and what
will likely be post 2.1...
I think that so far it is not well-communicated (and we are a couple of
weeks after the preview release) - as a user and potential early adopter I
have to constantly dig into the source code and pull requests trying to
decipher if I could use 2.0 APIs for my use case.

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Tue, Jun 7, 2016 at 12:36 PM, Tathagata Das 
wrote:

> 1.  Not all types of joins are supported. Here is the list.
> - Right outer joins - stream-batch not allowed, batch-stream allowed
> - Left outer joins - batch-stream not allowed, stream-batch allowed
>  (reverse of Right outer join)
> - Stream-stream joins are not allowed
>
> In the cases of outer joins, the not-allowed-cases are fundamentally hard
> because to do them correctly, every time there is new data in the stream,
> all the past data in the stream needs to be processed. Since we cannot
> stored ever-increasing amount of data in memory, this is not feasible.
>
> 2. For the update mode, the timeline is Spark 2.1.
>
>
> TD
>
> On Mon, Jun 6, 2016 at 6:54 AM, raaggarw  wrote:
>
>> Thanks
>> So,
>>
>> 1) For joins (stream-batch) - are all types of joins supported - i mean
>> inner, leftouter etc or specific ones?
>> Also what is the timeline for complete support - I mean stream-stream
>> joins?
>>
>> 2) So now outputMode is exposed via DataFrameWriter but will work in
>> specific cases as you mentioned? We were looking for delta & append output
>> modes for aggregation/groupBy. What is the timeline for that?
>>
>> Thanks
>> Ravi
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Timeline-for-supporting-basic-operations-like-groupBy-joins-etc-on-Streaming-DataFrames-tp27091p27093.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
> Sounds like the issue is with the Kafka channel; it is closing.


I came to the same conclusion. I've even tried further refining the
configuration files:

Zookeeper properties:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
# 
#http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# the directory where the snapshot is stored.
dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
# disable the per-ip limit on the number of connections since this is a 
non-production config
maxClientCnxns=20

Kafka server properties:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# see kafka.server.KafkaConfig for additional details and defaults

# Server Basics #

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1

# Socket Server Settings 
#

listeners=PLAINTEXT://:9092

# The port the socket server listens on
#port=9092

# Hostname the broker will bind to. If not set, the server will bind to all 
interfaces
host.name=0.0.0.0

# Hostname the broker will advertise to producers and consumers. If not set, it 
uses the
# value for "host.name" if configured.  Otherwise, it will use the value 
returned from
# java.net.InetAddress.getCanonicalHostName().
#advertised.host.name=

# The port to publish to ZooKeeper for clients to use. If this is not set,
# it will publish the same port that the broker binds to.
#advertised.port=

# The number of threads handling network requests
num.network.threads=3

# The number of threads doing disk I/O
num.io.threads=8

# The send buffer (SO_SNDBUF) used by the socket server
socket.send.buffer.bytes=102400

# The receive buffer (SO_RCVBUF) used by the socket server
socket.receive.buffer.bytes=102400

# The maximum size of a request that the socket server will accept (protection 
against OOM)
socket.request.max.bytes=104857600


# Log Basics #

# A comma separated list of directories under which to store log files
log.dirs=/tmp/kafka-logs

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=1

# The number of threads per data directory to be used for log recovery at 
startup and flushing at shutdown.
# This value is recommended to be increased for installations with data dirs 
located in RAID array.
num.recovery.threads.per.data.dir=1

# Log Flush Policy #

# Messages are immediately written to the filesystem but by default we only 
fsync() to sync
# the OS cache lazily. The following configurations control the flush of data 
to disk.
# There are a few important trade-offs here:
#1. Durability: Unflushed data may be lost if you are not using replication.
#2. Latency: Very large flush intervals may lead to latency spikes when the 
flush does occur as there will be a lot of data to flush.
#3. Throughput: The flush is generally the most expensive operation, and a 
small flush interval may lead to excessive seeks.
# The settings below allow one to configure the flush policy to flush data 
after a period of time or
# every N messages (or both). This can be done globally and overridden on a 
per-topic basis.

# The number of messages to accept before forcing a flush of data to disk
#log.flush.interval.messages=1

# The maximum amount of time a message can sit in a log before 

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
Sounds like the issue is with the Kafka channel; it is closing.

 Reconnect due to socket error: java.nio.channels.ClosedChannelException

Can you relax the batch interval, e.g.

val ssc = new StreamingContext(sparkConf, Seconds(20))

Also, how are you getting your source data? You can actually have both Spark
and the console consumer below running at the same time to see the exact cause of it:

${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181
--from-beginning --topic newtopic
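
Since the error is about leader offsets, it is also worth asking zookeeper
which broker is the leader for each partition of the topic (host and topic
below follow the example above; substitute your own):

${KAFKA_HOME}/bin/kafka-topics.sh --describe --zookeeper rhes564:2181 --topic newtopic

If the leader is reported under a hostname that the Spark driver cannot
resolve, the commented-out advertised.host.name / advertised.port entries in
the broker config are worth setting explicitly.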





Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 June 2016 at 11:32, Dominik Safaric  wrote:

> Unfortunately, even with this Spark configuration and Kafka parameters,
> the same exception keeps occurring:
>
> 16/06/07 12:26:11 INFO SimpleConsumer: Reconnect due to socket error:
> java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([,0])
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at scala.util.Either.fold(Either.scala:97)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>
> If it helps for troubleshooting, here are the logs of the Kafka server:
>
> 16-06-07 10:24:58,349] INFO Initiating client connection,
> connectString=localhost:2181 sessionTimeout=6000
> watcher=org.I0Itec.zkclient.ZkClient@4e05faa7
> (org.apache.zookeeper.ZooKeeper)
> [2016-06-07 10:24:58,365] INFO Opening socket connection to server
> localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL
> (unknown error) (org.apache.zookeeper.ClientCnxn)
> [2016-06-07 10:24:58,365] INFO Waiting for keeper state SyncConnected
> (org.I0Itec.zkclient.ZkClient)
> [2016-06-07 10:24:58,375] INFO Socket connection established to localhost/
> 127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
> [2016-06-07 10:24:58,405] INFO Session establishment complete on server
> localhost/127.0.0.1:2181, sessionid = 0x1552a64a9a8, negotiated
> timeout = 6000 (org.apache.zookeeper.ClientCnxn)
> [2016-06-07 10:24:58,408] INFO zookeeper state changed (SyncConnected)
> (org.I0Itec.zkclient.ZkClient)
> [2016-06-07 10:24:58,562] INFO Loading logs. (kafka.log.LogManager)
> [2016-06-07 10:24:58,608] INFO Completed load of log -0 with
> log end offset 15 (kafka.log.Log)
> [2016-06-07 10:24:58,614] INFO Completed load of log _schemas-0 with log
> end offset 1 (kafka.log.Log)
> [2016-06-07 10:24:58,617] INFO Completed load of log -0 with
> log end offset 5 (kafka.log.Log)
> [2016-06-07 10:24:58,620] INFO Completed load of log -0 with
> log end offset 2 (kafka.log.Log)
> [2016-06-07 10:24:58,629] INFO Completed load of log -0 with
> log end offset 1759 (kafka.log.Log)
> [2016-06-07 10:24:58,635] INFO Logs loading complete.
> (kafka.log.LogManager)
> [2016-06-07 10:24:58,737] INFO Starting log cleanup with a period of
> 30 ms. (kafka.log.LogManager)
> [2016-06-07 10:24:58,739] INFO Starting log flusher with a default period
> of 9223372036854775807 ms. (kafka.log.LogManager)
> [2016-06-07 10:24:58,798] INFO Awaiting socket connections on 0.0.0.0:9092.
> (kafka.network.Acceptor)
> [2016-06-07 10:24:58,809] INFO [Socket Server on Broker 1], Started 1
> acceptor threads (kafka.network.SocketServer)
> [2016-06-07 10:24:58,849] INFO [ExpirationReaper-1], Starting
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2016-06-07 10:24:58,850] INFO [ExpirationReaper-1], Starting
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2016-06-07 10:24:58,953] INFO Creating /controller (is it secure? false)
> (kafka.utils.ZKCheckedEphemeral)
> [2016-06-07 10:24:58,973] INFO Result of znode creation is: OK
> (kafka.utils.ZKCheckedEphemeral)
> [2016-06-07 10:24:58,974] INFO 1 successfully elected as leader
> (kafka.server.ZookeeperLeaderElector)
> [2016-06-07 10:24:59,180] INFO [GroupCoordinator 1]: Starting up.
> (kafka.coordinator.GroupCoordinator)
> [2016-06-07 10:24:59,191] INFO [ExpirationReaper-1], Starting
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2016-06-07 10:24:59,194] INFO New leader is 1
> (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
> [2016-06-07 10:24:59,198] INFO [Group Metadata Manager on Broker 1]:
> Removed 0 expired offsets in 16 milliseconds.
> (kafka.coordinator.GroupMetadataManager)
> [2016-06-07 10:24:59,195] INFO [ExpirationReaper-1], Starting
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2016-06-07 10:24:59,195] INFO 

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Unfortunately, even with this Spark configuration and Kafka parameters, the 
same exception keeps occurring:

16/06/07 12:26:11 INFO SimpleConsumer: Reconnect due to socket error: 
java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: Couldn't find leader offsets for 
Set([,0])
at 
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at 
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
at 
org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
at 
org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)

If it helps for troubleshooting, here are the logs of the Kafka server:

16-06-07 10:24:58,349] INFO Initiating client connection, 
connectString=localhost:2181 sessionTimeout=6000 
watcher=org.I0Itec.zkclient.ZkClient@4e05faa7 (org.apache.zookeeper.ZooKeeper)
[2016-06-07 10:24:58,365] INFO Opening socket connection to server 
localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown 
error) (org.apache.zookeeper.ClientCnxn)
[2016-06-07 10:24:58,365] INFO Waiting for keeper state SyncConnected 
(org.I0Itec.zkclient.ZkClient)
[2016-06-07 10:24:58,375] INFO Socket connection established to 
localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2016-06-07 10:24:58,405] INFO Session establishment complete on server 
localhost/127.0.0.1:2181, sessionid = 0x1552a64a9a8, negotiated timeout = 
6000 (org.apache.zookeeper.ClientCnxn)
[2016-06-07 10:24:58,408] INFO zookeeper state changed (SyncConnected) 
(org.I0Itec.zkclient.ZkClient)
[2016-06-07 10:24:58,562] INFO Loading logs. (kafka.log.LogManager)
[2016-06-07 10:24:58,608] INFO Completed load of log -0 with log 
end offset 15 (kafka.log.Log)
[2016-06-07 10:24:58,614] INFO Completed load of log _schemas-0 with log end 
offset 1 (kafka.log.Log)
[2016-06-07 10:24:58,617] INFO Completed load of log -0 with log 
end offset 5 (kafka.log.Log)
[2016-06-07 10:24:58,620] INFO Completed load of log -0 with log 
end offset 2 (kafka.log.Log)
[2016-06-07 10:24:58,629] INFO Completed load of log -0 with log 
end offset 1759 (kafka.log.Log)
[2016-06-07 10:24:58,635] INFO Logs loading complete. (kafka.log.LogManager)
[2016-06-07 10:24:58,737] INFO Starting log cleanup with a period of 30 ms. 
(kafka.log.LogManager)
[2016-06-07 10:24:58,739] INFO Starting log flusher with a default period of 
9223372036854775807 ms. (kafka.log.LogManager)
[2016-06-07 10:24:58,798] INFO Awaiting socket connections on 0.0.0.0:9092. 
(kafka.network.Acceptor)
[2016-06-07 10:24:58,809] INFO [Socket Server on Broker 1], Started 1 acceptor 
threads (kafka.network.SocketServer)
[2016-06-07 10:24:58,849] INFO [ExpirationReaper-1], Starting  
(kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2016-06-07 10:24:58,850] INFO [ExpirationReaper-1], Starting  
(kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2016-06-07 10:24:58,953] INFO Creating /controller (is it secure? false) 
(kafka.utils.ZKCheckedEphemeral)
[2016-06-07 10:24:58,973] INFO Result of znode creation is: OK 
(kafka.utils.ZKCheckedEphemeral)
[2016-06-07 10:24:58,974] INFO 1 successfully elected as leader 
(kafka.server.ZookeeperLeaderElector)
[2016-06-07 10:24:59,180] INFO [GroupCoordinator 1]: Starting up. 
(kafka.coordinator.GroupCoordinator)
[2016-06-07 10:24:59,191] INFO [ExpirationReaper-1], Starting  
(kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2016-06-07 10:24:59,194] INFO New leader is 1 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2016-06-07 10:24:59,198] INFO [Group Metadata Manager on Broker 1]: Removed 0 
expired offsets in 16 milliseconds. (kafka.coordinator.GroupMetadataManager)
[2016-06-07 10:24:59,195] INFO [ExpirationReaper-1], Starting  
(kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2016-06-07 10:24:59,195] INFO [GroupCoordinator 1]: Startup complete. 
(kafka.coordinator.GroupCoordinator)
[2016-06-07 10:24:59,215] INFO [ThrottledRequestReaper-Produce], Starting  
(kafka.server.ClientQuotaManager$ThrottledRequestReaper)
[2016-06-07 10:24:59,217] INFO [ThrottledRequestReaper-Fetch], Starting  
(kafka.server.ClientQuotaManager$ThrottledRequestReaper)
[2016-06-07 10:24:59,220] INFO Will not load MX4J, mx4j-tools.jar is not in the 
classpath (kafka.utils.Mx4jLoader$)
[2016-06-07 10:24:59,230] INFO Creating /brokers/ids/1 (is it secure? false) 
(kafka.utils.ZKCheckedEphemeral)
[2016-06-07 10:24:59,244] INFO Result of znode creation is: OK 
(kafka.utils.ZKCheckedEphemeral)
[2016-06-07 10:24:59,245] INFO Registered broker 1 at path /brokers/ids/1 with 
addresses: PLAINTEXT -> EndPoint(,9092,PLAINTEXT) 
(kafka.utils.ZkUtils)
[2016-06-07 10:24:59,257] INFO Kafka version : 

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
OK, that is good.

Yours is basically simple streaming with Kafka (publishing to a topic) into
Spark Streaming. Use the following as a blueprint:

// Create a local StreamingContext with two working threads and a batch
// interval of 2 seconds.
val sparkConf = new SparkConf().
 setAppName("CEP_streaming").
 setMaster("local[2]").
 set("spark.executor.memory", "4G").
 set("spark.cores.max", "2").
 set("spark.streaming.concurrentJobs", "2").
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val kafkaParams = Map[String, String](
  "bootstrap.servers" -> "rhes564:9092",
  "schema.registry.url" -> "http://rhes564:8081",
  "zookeeper.connect" -> "rhes564:2181",
  "group.id" -> "CEP_streaming")
val topics = Set("newtopic")
val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
  StringDecoder](ssc, kafkaParams, topics)
dstream.cache()

val lines = dstream.map(_._2)
val price = lines.map(_.split(',').view(2)).map(_.toFloat)
// window length - the duration of the window; must be a multiple of the
// batch interval n in StreamingContext(sparkConf, Seconds(n))
val windowLength = 4
// sliding interval - the interval at which the window operation is performed,
// in other words how often the data collected over the window is emitted
val slidingInterval = 2  // keep this the same as the batch interval for
                         // continuous streaming; you are aggregating the data
                         // collected over the batch window
val countByValueAndWindow = price.filter(_ > 95.0)
  .countByValueAndWindow(Seconds(windowLength), Seconds(slidingInterval))
countByValueAndWindow.print()
//
ssc.start()
ssc.awaitTermination()
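
To run this outside the shell, a spark-submit along the following lines
should work. The class name, jar path and Kafka package coordinates are
illustrative and need to match your own build (Spark 1.6.x with Scala 2.10
assumed here):

${SPARK_HOME}/bin/spark-submit \
  --master local[2] \
  --class CEP_streaming \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
  target/scala-2.10/cep-streaming_2.10-1.0.jar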

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 June 2016 at 10:58, Dominik Safaric  wrote:

> Dear Mich,
>
> Thank you for the reply.
>
> By running the following command in the command line:
>
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic
>  --from-beginning
>
> I do indeed retrieve all messages of a topic.
>
> Any indication as to what might cause the issue?
>
> An important note to make,  I’m using the default configuration of both
> Kafka and Zookeeper.
>
> On 07 Jun 2016, at 11:39, Mich Talebzadeh 
> wrote:
>
> I assume your zookeeper is up and running
>
> can you confirm that you are getting topics from kafka independently for
> example on the command line
>
> ${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181
> --from-beginning --topic newtopic
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 7 June 2016 at 10:06, Dominik Safaric  wrote:
>
>> As I am trying to integrate Kafka into Spark, the following exception
>> occurs:
>>
>> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
>> org.apache.spark.SparkException: Couldn't find leader offsets for
>> Set([**,0])
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
>> at scala.util.Either.fold(Either.scala:97)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
>> at
>>
>> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
>> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
>> at .(:11)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:483)
>> at
>> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
>> at
>> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
>> at
>> scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
>> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
>> at
>> 

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Dear Mich,

Thank you for the reply.

By running the following command in the command line:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic  
--from-beginning

I do indeed retrieve all messages of a topic. 

Any indication as to what might cause the issue?

An important note to make,  I’m using the default configuration of both Kafka 
and Zookeeper.

> On 07 Jun 2016, at 11:39, Mich Talebzadeh  wrote:
> 
> I assume your zookeeper is up and running
> 
> can you confirm that you are getting topics from kafka independently for 
> example on the command line
> 
> ${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181 
> --from-beginning --topic newtopic
> 
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 7 June 2016 at 10:06, Dominik Safaric  > wrote:
> As I am trying to integrate Kafka into Spark, the following exception occurs:
> 
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([**,0])
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at scala.util.Either.fold(Either.scala:97)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
> at .(:11)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
> at 
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
> at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
> at 
> scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
> at 
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
> at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
> at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
> at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
> at scala.tools.nsc.interpreter.ILoop.main(ILoop.scala:904)
> at
> org.jetbrains.plugins.scala.compiler.rt.ConsoleRunner.main(ConsoleRunner.java:64)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> 
> As for the Spark configuration:
> 
>val conf: SparkConf = new
> SparkConf().setAppName("AppName").setMaster("local[2]")
> 
> val confParams: Map[String, String] = Map(
>   "metadata.broker.list" -> ":9092",
>   "auto.offset.reset" -> "largest"
> )
> 
> val topics: Set[String] = Set("")
> 
> val context: StreamingContext = new StreamingContext(conf, Seconds(1))
> val kafkaStream = KafkaUtils.createDirectStream(context,confParams,
> topics)
> 
> kafkaStream.foreachRDD(rdd => {
>   rdd.collect().foreach(println)
> })
> 
> context.start()
> context.awaitTermination()
> 
> The Kafka topic 

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
By default the driver will start where you have started
sbin/start-master.sh; that is where you launch your app via SparkSubmit.

The slaves have to have an entry in the conf/slaves file.
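
For example, a minimal conf/slaves on the master looks like this (the
hostnames are placeholders for your own worker nodes):

# conf/slaves - one worker hostname per line
worker1.example.com
worker2.example.com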

What is the issue here?




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 6 June 2016 at 18:59, Bryan Cutler  wrote:

> I'm not an expert on YARN so anyone please correct me if I'm wrong, but I
> believe the Resource Manager will schedule the application to be run on the
> AM of any node that has a Node Manager, depending on available resources.
> So you would normally query the RM via the REST API to determine that.  You
> can restrict which nodes get scheduled using this property
> spark.yarn.am.nodeLabelExpression.
> See here for details
> http://spark.apache.org/docs/latest/running-on-yarn.html
>
> On Mon, Jun 6, 2016 at 9:04 AM, Saiph Kappa  wrote:
>
>> How can I specify the node where application master should run in the
>> yarn conf? I haven't found any useful information regarding that.
>>
>> Thanks.
>>
>> On Mon, Jun 6, 2016 at 4:52 PM, Bryan Cutler  wrote:
>>
>>> In that mode, it will run on the application master, whichever node that
>>> is as specified in your yarn conf.
>>> On Jun 5, 2016 4:54 PM, "Saiph Kappa"  wrote:
>>>
 Hi,

 In yarn-cluster mode, is there any way to specify on which node I want
 the driver to run?

 Thanks.

>>>
>>
>


Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
I assume your zookeeper is up and running.

Can you confirm that you are getting topics from Kafka independently, for
example on the command line:

${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181
--from-beginning --topic newtopic
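
You can also list the topics known to zookeeper to confirm that the topic
name matches exactly (same host and port as above):

${KAFKA_HOME}/bin/kafka-topics.sh --list --zookeeper rhes564:2181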






Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 7 June 2016 at 10:06, Dominik Safaric  wrote:

> As I am trying to integrate Kafka into Spark, the following exception
> occurs:
>
> org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
> org.apache.spark.SparkException: Couldn't find leader offsets for
> Set([**,0])
> at
>
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at
>
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
> at scala.util.Either.fold(Either.scala:97)
> at
>
> org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
> at
>
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
> at
>
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at org.mediasoft.spark.Driver$.main(Driver.scala:42)
> at .(:11)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
> at
> scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
> at
> scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
> at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
> at
> scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
> at
> scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
> at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
> at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
> at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
> at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
> at
>
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
> scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
> at
>
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
> at scala.tools.nsc.interpreter.ILoop.main(ILoop.scala:904)
> at
>
> org.jetbrains.plugins.scala.compiler.rt.ConsoleRunner.main(ConsoleRunner.java:64)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at
> com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
>
> As for the Spark configuration:
>
>val conf: SparkConf = new
> SparkConf().setAppName("AppName").setMaster("local[2]")
>
> val confParams: Map[String, String] = Map(
>   "metadata.broker.list" -> ":9092",
>   "auto.offset.reset" -> "largest"
> )
>
> val topics: Set[String] = Set("")
>
> val context: StreamingContext = new StreamingContext(conf, Seconds(1))
> val kafkaStream = KafkaUtils.createDirectStream(context,confParams,
> topics)
>
> kafkaStream.foreachRDD(rdd => {
>   rdd.collect().foreach(println)
> })
>
> context.start()
> context.awaitTermination()
>
> The Kafka topic does exist, Kafka server is up and running and I am able to
> produce messages to that particular topic using the Confluent REST API.
>
> What might the problem actually be?
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Kafka-Integration-org-apache-spark-SparkException-Couldn-t-find-leader-offsets-for-Set-tp27103.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, 

Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-07 Thread Tathagata Das
1.  Not all types of joins are supported. Here is the list.
- Right outer joins - stream-batch not allowed, batch-stream allowed
- Left outer joins - batch-stream not allowed, stream-batch allowed
 (reverse of Right outer join)
- Stream-stream joins are not allowed

In the cases of outer joins, the not-allowed-cases are fundamentally hard
because to do them correctly, every time there is new data in the stream,
all the past data in the stream needs to be processed. Since we cannot
store an ever-increasing amount of data in memory, this is not feasible.
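
To make point 1 concrete, here is a minimal sketch of a supported
stream-batch join against the 2.0-style readStream/writeStream API (method
names shifted during the 2.0 cycle, and the paths, schema reuse and the join
key "id" below are assumptions for illustration only):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StreamBatchJoinSketch").getOrCreate()

// Static (batch) side, e.g. a reference table
val staticDF = spark.read.parquet("/data/reference")

// Streaming side: files arriving in a directory, read with the static schema
val streamDF = spark.readStream.schema(staticDF.schema).json("/data/incoming")

// A stream joined with a static DataFrame is supported;
// a stream-stream join, e.g. streamDF.join(otherStreamDF, Seq("id")), is not.
val joined = streamDF.join(staticDF, Seq("id"))

val query = joined.writeStream
  .format("parquet")
  .option("path", "/data/out")
  .option("checkpointLocation", "/data/checkpoints")
  .outputMode("append")
  .start()

query.awaitTermination()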

2. For the update mode, the timeline is Spark 2.1.


TD

On Mon, Jun 6, 2016 at 6:54 AM, raaggarw  wrote:

> Thanks
> So,
>
> 1) For joins (stream-batch) - are all types of joins supported - i mean
> inner, leftouter etc or specific ones?
> Also what is the timeline for complete support - I mean stream-stream
> joins?
>
> 2) So now outputMode is exposed via DataFrameWriter but will work in
> specific cases as you mentioned? We were looking for delta & append output
> modes for aggregation/groupBy. What is the timeline for that?
>
> Thanks
> Ravi
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Timeline-for-supporting-basic-operations-like-groupBy-joins-etc-on-Streaming-DataFrames-tp27091p27093.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
As I am trying to integrate Kafka into Spark, the following exception occurs:

org.apache.spark.SparkException: java.nio.channels.ClosedChannelException
org.apache.spark.SparkException: Couldn't find leader offsets for
Set([**,0])
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$checkErrors$1.apply(KafkaCluster.scala:366)
at scala.util.Either.fold(Either.scala:97)
at
org.apache.spark.streaming.kafka.KafkaCluster$.checkErrors(KafkaCluster.scala:365)
at
org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:222)
at
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at org.mediasoft.spark.Driver$.main(Driver.scala:42)
at .(:11)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:734)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:983)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:604)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:568)
at scala.tools.nsc.interpreter.ILoop.reallyInterpret$1(ILoop.scala:760)
at 
scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:805)
at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:717)
at scala.tools.nsc.interpreter.ILoop.processLine$1(ILoop.scala:581)
at scala.tools.nsc.interpreter.ILoop.innerLoop$1(ILoop.scala:588)
at scala.tools.nsc.interpreter.ILoop.loop(ILoop.scala:591)
at
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:882)
at
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
at
scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:837)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:837)
at scala.tools.nsc.interpreter.ILoop.main(ILoop.scala:904)
at
org.jetbrains.plugins.scala.compiler.rt.ConsoleRunner.main(ConsoleRunner.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

As for the Spark configuration:

   val conf: SparkConf = new
SparkConf().setAppName("AppName").setMaster("local[2]")

val confParams: Map[String, String] = Map(
  "metadata.broker.list" -> ":9092",
  "auto.offset.reset" -> "largest"
)

val topics: Set[String] = Set("")

val context: StreamingContext = new StreamingContext(conf, Seconds(1))
val kafkaStream = KafkaUtils.createDirectStream(context,confParams,
topics)

kafkaStream.foreachRDD(rdd => {
  rdd.collect().foreach(println)
})

context.start()
context.awaitTermination()

The Kafka topic does exist, Kafka server is up and running and I am able to
produce messages to that particular topic using the Confluent REST API. 

What might the problem actually be? 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Kafka-Integration-org-apache-spark-SparkException-Couldn-t-find-leader-offsets-for-Set-tp27103.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Switching broadcast mechanism from torrrent

2016-06-07 Thread Takeshi Yamamuro
Hi,

Since `HttpBroadcastFactory` has already been removed in master,
you cannot use that broadcast mechanism in future releases.

Anyway, I couldn't find a root cause only from the stacktraces...
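
For reference, on 1.x the factory can still be switched through
configuration. A minimal sketch (the property and factory class are the ones
named earlier in this thread; the app name and broadcast contents are
illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Select the HTTP broadcast implementation instead of the default torrent one
// (Spark 1.x only; the setting disappears together with HttpBroadcastFactory in 2.x)
val conf = new SparkConf()
  .setAppName("HttpBroadcastExample")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")

val sc = new SparkContext(conf)
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
println(lookup.value.getOrElse("a", 0))
sc.stop()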

// maropu




On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
> I've set  spark.broadcast.factory to
> org.apache.spark.broadcast.HttpBroadcastFactory and it indeed resolve my
> issue.
>
> I'm creating a dataframe which creates a broadcast variable internally and
> then fails due to the torrent broadcast with the following stacktrace:
> Caused by: org.apache.spark.SparkException: Failed to get
> broadcast_3_piece0 of broadcast_3
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at org.apache.spark.broadcast.TorrentBroadcast.org
> $apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
> at
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1220)
>
> I'm using spark 1.6.0 on CDH 5.7
>
> Thanks,
> Daniel
>
>
> On Wed, Jun 1, 2016 at 5:52 PM, Ted Yu  wrote:
>
>> I found spark.broadcast.blockSize but no parameter to switch broadcast
>> method.
>>
>> Can you describe the issues with torrent broadcast in more detail ?
>>
>> Which version of Spark are you using ?
>>
>> Thanks
>>
>> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv <
>> daniel.ha...@veracity-group.com> wrote:
>>
>>> Hi,
>>> Our application is failing due to issues with the torrent broadcast, is
>>> there a way to switch to another broadcast method ?
>>>
>>> Thank you.
>>> Daniel
>>>
>>
>>
>


-- 
---
Takeshi Yamamuro