Re: Spark 2.0 Release Date

2016-06-07 Thread Reynold Xin
It'd be great to cut an RC as soon as possible. Looking at the blocker/critical issue list, majority of them are API audits. I think people will get back to those once Spark Summit is over, and then we should see some good progress towards an RC. On Tue, Jun 7, 2016 at 6:20 AM, Jacek Laskowski

Apache design pattern approaches

2016-06-07 Thread sparkie
Hi, I have been working through some example tutorials for Apache Spark in an attempt to establish how I would solve the following scenario (see data examples in Appendix): I have 1 billion+ rows that have a key value (i.e. driver ID) and a number of relevant attributes (product class, date/time)

Re: Apache design patterns

2016-06-07 Thread Francois Le Roux
Thanks Ted. Hi, I have been working through some example tutorials for Apache Spark in an attempt to establish how I would solve the following scenario (see data examples in Appendix): I have 1 billion+ rows that have a key value (i.e. driver ID) and a number of relevant attributes (product

Re: Apache design patterns

2016-06-07 Thread Ted Yu
I think this is the correct forum. Please describe your case. > On Jun 7, 2016, at 8:33 PM, Francois Le Roux wrote: > > HI folks, I have been working through the available online Apache spark > tutorials and I am stuck with a scenario that i would like to solve in

Re: Spark_Usecase

2016-06-07 Thread vaquar khan
Deepak, Spark does provide support for incremental load, if users want to schedule their batch jobs frequently and want to have incremental loads of their data from databases. You will not get good performance updating your Spark SQL tables backed by files. Instead, you can use message queues and

Apache design patterns

2016-06-07 Thread Francois Le Roux
Hi folks, I have been working through the available online Apache Spark tutorials and I am stuck with a scenario that I would like to solve in Spark. Is this a forum where I can publish a narrative for the problem / scenario that I am trying to solve? Any assistance appreciated. Thanks, Frank

Re: SparkR interaction with R libraries (currently 1.5.2)

2016-06-07 Thread Sun Rui
Hi, Ian, You should not use the Spark DataFrame a_df in your closure. For an R function for lapplyPartition, the parameter is a list of lists, representing the rows in the corresponding partition. In Spark 2.0, SparkR provides a new public API called dapply, which can apply an R function to each

Dealing with failures

2016-06-07 Thread Mohit Anchlia
I am looking to write an ETL job using Spark that reads data from the source, performs transformations and inserts it into the destination. I am trying to understand how Spark deals with failures. I can't seem to find the documentation. I am interested in learning about the following scenarios: 1. Source

Re: Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
I'm using dataframes, the types are all doubles and I'm only extracting what I need. The caveat on these is that I am porting an existing system for a client, and for their business it's likely to be cheaper to throw hardware (in AWS) at the problem for a couple of hours than re-engineer their

Re: Sequential computation over several partitions

2016-06-07 Thread Mich Talebzadeh
Am I correct in understanding that you want to read and iterate over all the data to be correct? For example, if a user is already unsubscribed then you want to ignore all subsequent subscribes, regardless. How often do you want to iterate through the full data? The frequency of your analysis? The

Sequential computation over several partitions

2016-06-07 Thread Jeroen Miller
Dear fellow Sparkers, I am a new Spark user and I am trying to solve a (conceptually simple) problem which may not be a good use case for Spark, at least for the RDD API. But before I turn my back on it, I would rather have the opinion of more knowledgeable developers than me, as it is highly

Optional columns in Aggregated Metrics by Executor in web UI?

2016-06-07 Thread Jacek Laskowski
Hi, I'd like to have the other optional columns in Aggregated Metrics by Executor table per stage in web UI. I can easily have Shuffle Read Size / Records and Shuffle Write Size / Records columns. scala> sc.parallelize(0 to 9).map((_,1)).groupBy(_._1).count I can't seem to figure out what Spark

Re: setting column names on dataset

2016-06-07 Thread Koert Kuipers
That's neat On Jun 7, 2016 4:34 PM, "Jacek Laskowski" wrote: > Hi, > > What about this? > > scala> final case class Person(name: String, age: Int) > warning: there was one unchecked warning; re-run with -unchecked for > details > defined class Person > > scala> val ds =

Re: Dataset - reduceByKey

2016-06-07 Thread Jacek Laskowski
Hi Bryan, What about groupBy [1] and agg [2]? What about UserDefinedAggregateFunction [3]? [1]
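Following the pointers above, here is a hedged sketch of what a reduceByKey-style aggregation can look like on a Dataset, both typed (groupByKey/reduceGroups) and untyped (groupBy/agg). It assumes the Spark 2.0 Dataset API; the data and names are illustrative, not from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.master("local[*]").appName("DatasetReduceByKey").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()

// Typed: group by the key, then reduce the values within each group.
val typed = ds.groupByKey(_._1).reduceGroups((x, y) => (x._1, x._2 + y._2))

// Untyped alternative using groupBy/agg, as suggested above.
val untyped = ds.groupBy($"_1").agg(sum($"_2"))

typed.show()
untyped.show()
```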

Re: setting column names on dataset

2016-06-07 Thread Jacek Laskowski
Hi, What about this? scala> final case class Person(name: String, age: Int) warning: there was one unchecked warning; re-run with -unchecked for details defined class Person scala> val ds = Seq(Person("foo", 42), Person("bar", 24)).toDS ds: org.apache.spark.sql.Dataset[Person] = [name: string,
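A small sketch expanding on the case-class approach above: since the Dataset's columns are named after the case class fields, operators such as joinWith can refer to them by name without a toDF rename. This assumes the Spark 2.0 API; the Score class and the data are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("NamedDataset").getOrCreate()
import spark.implicits._

final case class Person(name: String, age: Int)
final case class Score(name: String, score: Double)

val people = Seq(Person("foo", 42), Person("bar", 24)).toDS()
val scores = Seq(Score("foo", 3.14)).toDS()

// Columns are already "name"/"age" and "name"/"score" -- no toDF("k", "v") renaming needed.
val joined = people.joinWith(scores, people("name") === scores("name"), "left_outer")
joined.show()
```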

Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
Hi, I'm not surprised to see Hadoop jars on the driver (yet I couldn't explain exactly why they need to be there). I can't find a way now to display the classpath for executors. Regards, Jacek Laskowski

Re: Join two Spark Streaming

2016-06-07 Thread vinay453
I am working on windowed DStreams wherein each DStream contains 3 RDDs with the following keys: a,b,c b,c,d c,d,e d,e,f. I want to get only the unique keys across all DStreams: a,b,c,d,e,f. How do I do this in PySpark Streaming?
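A minimal Scala sketch of the idea (the question asks about PySpark, so this only shows the shape of the answer): gather several batches into one window and de-duplicate across it. The window and slide durations are illustrative.

```scala
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Given a DStream of keys, return the distinct keys seen in each 30-second window.
def uniqueKeysPerWindow(keys: DStream[String]): DStream[String] =
  keys.window(Seconds(30), Seconds(30))
      .transform(rdd => rdd.distinct())
```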

Re: Environment tab meaning

2016-06-07 Thread satish saley
Thank you Jacek. In case of YARN, I see that hadoop jars are present in system classpath for Driver. Will it be the same for all executors? On Tue, Jun 7, 2016 at 11:22 AM, Jacek Laskowski wrote: > Ouch, I made a mistake :( Sorry. > > You've asked about spark **history**

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
All, Thank you for the replies. It seems as though the Dataset API is still far behind the RDD API. This is unfortunate as the Dataset API potentially provides a number of performance benefits. I will move to using it in a more limited set of cases for the moment. Thank you! Bryan Jeffrey

Affinity Propagation

2016-06-07 Thread Tim Gautier
Does anyone know of a good library usable in scala spark that has affinity propagation?

setting column names on dataset

2016-06-07 Thread Koert Kuipers
For some operators on Dataset, like joinWith, one needs to use an expression, which means referring to columns by name. How can I set the column names for a Dataset before doing a joinWith? Currently I am aware of: df.toDF("k", "v").as[(K, V)] but that seems inefficient / an anti-pattern? I shouldn't

MapType in Java unsupported in Spark 1.5

2016-06-07 Thread Baichuan YANG
Hi, I am a Spark 1.5 user and am currently doing some testing on spark-shell. When I execute a SQL query using a UDF returning Map type in Java, spark-shell returns an error message saying that MapType in Java is unsupported. So I wonder whether it is possible to convert MapType in Java to MapType in

Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher
There certainly are some gaps between the richness of the RDD API and the Dataset API. I'm also migrating from RDD to Dataset and ran into reduceByKey and join scenarios. In the spark-dev list, one person was discussing reduceByKey being sub-optimal at the moment and it spawned this JIRA

Re: Dataset - reduceByKey

2016-06-07 Thread Takeshi Yamamuro
It seems you can look at the docs for 2.0 for now: https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/spark-2.0.0-SNAPSHOT-2016_06_07_07_01-1e2c931-docs/ // maropu On Tue, Jun 7, 2016 at 11:40 AM, Bryan Jeffrey wrote: > It would also be nice if there was a

Re: Dataset Outer Join vs RDD Outer Join

2016-06-07 Thread Richard Marscher
For anyone following along the chain went private for a bit, but there were still issues with the bytecode generation in the 2.0-preview so this JIRA was created: https://issues.apache.org/jira/browse/SPARK-15786 On Mon, Jun 6, 2016 at 1:11 PM, Michael Armbrust wrote: >

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
It would also be nice if there was a better example of joining two Datasets. I am looking at the documentation here: http://spark.apache.org/docs/latest/sql-programming-guide.html. It seems a little bit sparse - is there a better documentation source? Regards, Bryan Jeffrey On Tue, Jun 7, 2016

Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Deepak, thanks for the info. I was thinking of reading both source and destination tables into separate rdds/dataframes, then apply some specific transformations to find the updated info, remove updated keyed rows from destination and append updated info to the destination. Any pointers on this

Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
Hello. I am looking at the option of moving RDD based operations to Dataset based operations. We are calling 'reduceByKey' on some pair RDDs we have. What would the equivalent be in the Dataset interface - I do not see a simple reduceByKey replacement. Regards, Bryan Jeffrey

Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
Ouch, I made a mistake :( Sorry. You've asked about spark **history** server. It's pretty much the same. HistoryServer is a web interface for completed and running (aka incomplete) Spark applications. It uses EventLoggingListener to collect events as JSON using org.apache.spark.util.JsonProtocol

Re: Environment tab meaning

2016-06-07 Thread Jacek Laskowski
Hi, It is the driver - see the port. Is this 4040 or similar? It's started when SparkContext starts and is controlled by spark.ui.enabled. spark.ui.enabled (default: true) = controls whether the web UI is started or not. It's through JobProgressListener which is the SparkListener for web UI

Spark dynamic allocation - efficiently request new resource

2016-06-07 Thread Nirav Patel
Hi, Does current or future (2.0) Spark dynamic allocation have the capability to request a container with varying resource requirements based on various factors? A few factors I can think of: based on the stage and the data it is processing, it could either ask for more CPUs or more memory, i.e. a new executor could have

Environment tab meaning

2016-06-07 Thread satish saley
Hi, In the Spark history server we see the Environment tab. Does it show the environment of the Driver, the Executors, or both? - Jobs - Stages - Storage

SparkR interaction with R libraries (currently 1.5.2)

2016-06-07 Thread rachmaninovquartet
Hi, I'm trying to figure out how to work with R libraries in Spark properly. I've googled and done some trial and error. The main error I've been running into is "cannot coerce class "structure("DataFrame", package = "SparkR")" to a data.frame". I'm wondering if there is a way to use the R

Re: Spark SQL: org.apache.spark.sql.AnalysisException: cannot resolve "some columns" given input columns.

2016-06-07 Thread Ted Yu
Please see: [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource FYI On Tue, Jun 7, 2016 at 10:18 AM, Jerry Wong wrote: > Hi, > > Two JSON files but one of them miss some columns, like > > {"firstName": "Jack", "lastName":

Spark SQL: org.apache.spark.sql.AnalysisException: cannot resolve "some columns" given input columns.

2016-06-07 Thread Jerry Wong
Hi, Two JSON files, but one of them misses some columns, like {"firstName": "Jack", "lastName": "Nelson"} {"firstName": "Landy", "middleName": "Ken", "lastName": "Yong"} sqlContext.sql("select firstName as first_name, middleName as middle_name, lastName as last_name from jsonTable") But there are
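A hedged sketch of one way to deal with JSON files that miss some columns: supply an explicit schema when reading, so absent fields resolve to null instead of failing analysis. This is a sketch against the Spark 1.5/1.6 API; the file path is an illustrative assumption and the field names follow the example above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val sc = new SparkContext(new SparkConf().setAppName("JsonMissingColumns").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Declare all expected fields up front; records missing a field get null for it.
val schema = StructType(Seq(
  StructField("firstName", StringType, nullable = true),
  StructField("middleName", StringType, nullable = true),
  StructField("lastName", StringType, nullable = true)))

val people = sqlContext.read.schema(schema).json("/path/to/people.json")
people.registerTempTable("jsonTable")

sqlContext.sql(
  "select firstName as first_name, middleName as middle_name, lastName as last_name from jsonTable"
).show()
```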

Re: Spark 2.0 Release Date

2016-06-07 Thread Sean Owen
For this moment you can look at https://home.apache.org/~pwendell/spark-nightly/spark-branch-2.0-docs/latest/ On Tue, Jun 7, 2016 at 6:14 PM, Arun Patel wrote: > Thanks Sean and Jacek. > > Do we have any updated documentation for 2.0 somewhere? > > On Tue, Jun 7, 2016 at

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Thanks Sean and Jacek. Do we have any updated documentation for 2.0 somewhere? On Tue, Jun 7, 2016 at 9:34 AM, Jacek Laskowski wrote: > On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen wrote: > > That's not any kind of authoritative statement, just my opinion and

Integrating spark source in an eclipse project?

2016-06-07 Thread Cesar Flores
I created a Spark application in Eclipse by including the spark-assembly-1.6.0-hadoop2.6.0.jar file in the path. However, this method does not allow me to see the Spark code. Is there an easy way to include the Spark source code for reference in an application developed in Eclipse? Thanks! -- Cesar

Re: Integrating spark source in an eclipse project?

2016-06-07 Thread Ted Yu
Please see: https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup Use proper branch. FYI On Tue, Jun 7, 2016 at 9:04 AM, Cesar Flores wrote: > > I created a spark application in Eclipse by including the >

Re: RESOLVED - Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
OK, so this was a Kafka issue? On 7 June 2016 at 16:55,

RESOLVED - Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Dear all, I managed to resolve the issue. Since I kept getting the exception "org.apache.spark.SparkException: java.nio.channels.ClosedChannelException", a reasonable direction was checking the advertised.host.name key, which, as I've read from the docs, basically sets for the broker the

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Dear Todd, By running bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic --broker-list localhost:9092 --time -1 I get the following current offset for :0:1760. But I guess this does not provide that much information. To answer your other question, as to how exactly I track the offset

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
Thanks. This is getting a bit confusing. I have these modes for using Spark. 1. Spark local: all on the same host --> --master local[n]. No need to start master and slaves. Uses resources as you submit the job. 2. Spark Standalone: uses a simple cluster manager included with Spark

Re: Spark_Usecase

2016-06-07 Thread Deepak Sharma
I am not sure if Spark provides any support for incremental extracts inherently. But you can maintain a file, e.g. extractRange.conf in HDFS, read the end range from it, and have the Spark job update it with the new end range before it finishes, so the new relevant range is used next time. On
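A minimal sketch of this watermark-file idea, under stated assumptions: the paths, JDBC URL, table and column names are illustrative, not from the thread, and the high-water mark is a single numeric id stored as text in HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.max

val sc = new SparkContext(new SparkConf().setAppName("IncrementalExtract"))
val sqlContext = new SQLContext(sc)

// 1. Read the last extracted id (the "end range") persisted in HDFS.
val lastId = sc.textFile("hdfs:///etl/extractRange.conf").first().trim.toLong

// 2. Pull only the rows added since then from the source database over JDBC.
val delta = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/sales")
  .option("dbtable", s"(SELECT * FROM orders WHERE id > $lastId) AS delta")
  .option("user", "etl").option("password", "secret")
  .load()

// 3. Append the delta to the target and persist the new high-water mark for the next run.
delta.write.mode("append").format("orc").save("hdfs:///warehouse/orders_orc")
val newMax = delta.agg(max("id")).first().getLong(0)
sc.parallelize(Seq(newMax.toString), 1).saveAsTextFile("hdfs:///etl/extractRange.next")
```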

Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Mich, thanks for your inputs. I used Sqoop to get the data from MySQL. Now I am using Spark to do the same. Right now, I am trying to implement incremental updates while loading from MySQL through Spark. Can you suggest any best practices for this? Thank you. On Tuesday, June 7, 2016, Mich

Re: Spark_Usecase

2016-06-07 Thread Mich Talebzadeh
I use Spark rather than Sqoop to import data from an Oracle table into a Hive ORC table. It uses JDBC for this purpose, all inclusive in Scala itself. Also, Hive runs on the Spark engine, an order of magnitude faster than on map-reduce. Pretty simple. HTH
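A minimal sketch of the approach described above (JDBC read from Oracle, write to a Hive-managed ORC table), against the Spark 1.6-style API. The connection details, schema and table names are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("OracleToHiveOrc"))
val hiveContext = new HiveContext(sc)

// Read the source table from Oracle over JDBC.
val oracleDF = hiveContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//orahost:1521/ORCL")
  .option("dbtable", "SCOTT.SALES")
  .option("user", "scott").option("password", "tiger")
  .load()

// Write it out as an ORC-backed Hive table.
oracleDF.write.format("orc").mode("overwrite").saveAsTable("default.sales_orc")
```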

Re: Spark_Usecase

2016-06-07 Thread Ajay Chander
Marco, Ted, thanks for your time. I am sorry if I wasn't clear enough. We have two sources: 1) SQL Server 2) files pushed onto the edge node by upstreams on a daily basis. Point 1 has been achieved by using the JDBC format in Spark SQL. Point 2 has been achieved by using a shell script. My only

Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
thanks I will have a look. Mich

Re: Specify node where driver should run

2016-06-07 Thread Sebastian Piu
If you run that job then the driver will ALWAYS run on the machine from where you are issuing the spark-submit command (e.g. some edge node with the clients installed), no matter where the resource manager is running. If you change yarn-client for yarn-cluster then your driver will start

Re: Spark_Usecase

2016-06-07 Thread Ted Yu
bq. load the data from edge node to hdfs Does the loading involve accessing sqlserver ? Please take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html On Tue, Jun 7, 2016 at 7:19 AM, Marco Mistroni wrote: > Hi > how about > > 1. have a process that

Re: Spark_Usecase

2016-06-07 Thread Marco Mistroni
Hi, how about: 1. have a process that reads the data from your SQL Server and dumps it as a file into a directory on your hd 2. use spark-streaming to read data from that directory and store it into HDFS. Perhaps there is some sort of Spark 'connector' that allows you to read data from a db

Spark_Usecase

2016-06-07 Thread Ajay Chander
Hi Spark users, Right now we are using Spark for everything (loading the data from SQL Server, applying transformations, saving it as permanent tables in Hive) in our environment. Everything is being done in one Spark application. The only thing we do before we launch our Spark application through

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
Hi Dominik, Right, and spark 1.6.x uses Kafka v0.8.2.x as I recall. However, it appears as though the v.0.8 consumer is compatible with the Kafka v0.9.x broker, but not the other way around; sorry for the confusion there. With the direct stream, simple consumer, offsets are tracked by Spark

Re: [REPOST] Severe Spark Streaming performance degradation after upgrading to 1.6.1

2016-06-07 Thread Daniel Darabos
On Sun, Jun 5, 2016 at 9:51 PM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > If you fill up the cache, 1.6.0+ will suffer performance degradation from > GC thrashing. You can set spark.memory.useLegacyMode to true, or > spark.memory.fraction to 0.66, or

Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen wrote: > That's not any kind of authoritative statement, just my opinion and guess. Oh, come on. You're not **a** Sean but **the** Sean (= a PMC member and the JIRA/PRs keeper) so what you say **is** kinda official. Sorry. But don't

Re: Spark 2.0 Release Date

2016-06-07 Thread Sean Owen
That's not any kind of authoritative statement, just my opinion and guess. Reynold mentioned the idea of releasing monthly milestone releases for the latest branch. That's an interesting idea for the future. If the issue burn-down takes significantly longer to get right, then maybe indeed

Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Solr is basically an in-memory text index with a lot of capabilities for language analysis and extraction (you can compare it to a Google for your tweets). The system itself has a lot of features and has a complexity similar to big data systems. The index files can be backed by HDFS. You can put

Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
Finally, the PMC voice on the subject. Thanks a lot, Sean! p.s. Given how much time it takes to ship 2.0 (with so many cool features already baked in!) I'd vote for releasing a few more RCs before 2.0 hits the shelves. I hope 2.0 is not Java 9 or Jigsaw ;-) Regards, Jacek Laskowski

Re: Spark 2.0 Release Date

2016-06-07 Thread Sean Owen
I don't believe the intent was to get it out before Spark Summit or something. That shouldn't drive the schedule anyway. But now that there's a 2.0.0 preview available, people who are eager to experiment or test on it can do so now. That probably reduces urgency to get it out the door in order to

Re: Specify node where driver should run

2016-06-07 Thread Jacek Laskowski
Hi, --master yarn-client is deprecated and you should use --master yarn --deploy-mode client instead. There are two deploy modes: client (default) and cluster. See http://spark.apache.org/docs/latest/cluster-overview.html. Regards, Jacek Laskowski

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
OK thanks. So I start SparkSubmit or a similar Spark app on the YARN resource manager node. What you are stating is that YARN may decide to start the driver program on another node as opposed to the resource manager node. ${SPARK_HOME}/bin/spark-submit \ --driver-memory=4G \

Re: Spark 2.0 Release Date

2016-06-07 Thread Jacek Laskowski
On Tue, Jun 7, 2016 at 1:25 PM, Arun Patel wrote: > Do we have any further updates on release date? Nope :( And it's even more quiet than I could have thought. I was so certain that today's the date. Looks like Spark Summit has "consumed" all the people behind

Re: Specify node where driver should run

2016-06-07 Thread Sebastian Piu
What you are explaining is right for yarn-client mode, but the question is about yarn-cluster, in which case the Spark driver is also submitted and run on one of the node managers. On Tue, 7 Jun 2016, 13:45 Mich Talebzadeh, wrote: > can you elaborate on the above

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0" libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "1.6.0" libraryDependencies += "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1" Please take a look at the SBT copy. I would rather think that

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Todd Nist
What version of Spark are you using? I do not believe that 1.6.x is compatible with 0.9.0.1 due to changes in the kafka clients between 0.8.2.2 and 0.9.0.x. See this for more information: https://issues.apache.org/jira/browse/SPARK-12177 -Todd On Tue, Jun 7, 2016 at 7:35 AM, Dominik Safaric

Re: Advice on Scaling RandomForest

2016-06-07 Thread Jörn Franke
Before hardware optimization there is always software optimization. Are you using dataset / dataframe? Are you using the right data types (e.g. int where int is appropriate; try to avoid string and char, etc.)? Do you extract only the stuff needed? What are the algorithm parameters? > On 07 Jun

Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
OK. So basically, for predictive off-line analysis (as opposed to streaming), in a nutshell one can use Apache Flume to store Twitter data in HDFS and use Solr to query the data? This is what it says: Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
Can you elaborate on the above statement please? When you start YARN you start the resource manager daemon only on the resource manager node: yarn-daemon.sh start resourcemanager. Then you start nodemanager daemons on all nodes: yarn-daemon.sh start nodemanager. A Spark app has to start

Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Well, I have seen that the algorithms mentioned are used for this. However, some preprocessing through Solr makes sense - it takes care of synonyms, homonyms, stemming, etc. > On 07 Jun 2016, at 13:33, Mich Talebzadeh wrote: > > Thanks Jorn, > > To start I would like

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Hi, Correct, I am using the 0.9.0.1 version. As already described, the topic contains messages. Those messages are produced using the Confluent REST API. However, what I've observed is that the problem is not in the Spark configuration, but rather Zookeeper or Kafka related. Take a look

Re: Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
Thanks Jorn, To start, I would like to explore how one can turn some of the data into useful information. I would like to look at certain trend analysis. Simple correlation shows that the more mentions there are of a typical topic, say for example "organic food", the more people are inclined to go

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Do we have any further updates on release date? Also, Is there a updated documentation for 2.0 somewhere? Thanks Arun On Thu, Apr 28, 2016 at 4:50 PM, Jacek Laskowski wrote: > Hi Arun, > > My bet is...https://spark-summit.org/2016 :) > > Pozdrawiam, > Jacek Laskowski > >

Re: Analyzing twitter data

2016-06-07 Thread Jörn Franke
Spark ML Support Vector Machines or neural networks could be candidates. For unsupervised learning it could be clustering. For doing a graph analysis on the followers you can easily use Spark GraphX. Keep in mind that each tweet contains a lot of metadata (location, followers, etc.) that is more

Re: Specify node where driver should run

2016-06-07 Thread Jacek Laskowski
Hi, It's not possible. YARN uses CPU and memory for resource constraints and places the AM on any available node. Same for executors (unless data locality constrains the placement). Jacek On 6 Jun 2016 1:54 a.m., "Saiph Kappa" wrote: > Hi, > > In yarn-cluster mode, is

Analyzing twitter data

2016-06-07 Thread Mich Talebzadeh
Hi, This is really a general question. I use Spark to get twitter data. I did some looking at it val ssc = new StreamingContext(sparkConf, Seconds(2)) val tweets = TwitterUtils.createStream(ssc, None) val statuses = tweets.map(status => status.getText()) statuses.print() Ok
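A minimal Scala sketch expanding the snippet above into something end-to-end: counting hashtags over a sliding window. It assumes the spark-streaming-twitter artifact is on the classpath and Twitter OAuth credentials are configured; the window sizes are illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val sparkConf = new SparkConf().setAppName("TwitterTrends")
val ssc = new StreamingContext(sparkConf, Seconds(2))

val tweets = TwitterUtils.createStream(ssc, None)
val hashTags = tweets.flatMap(_.getText.split(" ")).filter(_.startsWith("#"))

// Top hashtags over the last 60 seconds, recomputed on every 2-second batch.
val topCounts = hashTags.map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .map { case (tag, count) => (count, tag) }
  .transform(_.sortByKey(ascending = false))

topCounts.print()
ssc.start()
ssc.awaitTermination()
```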

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Jacek Laskowski
Hi, What's the version of Spark? You're using Kafka 0.9.0.1, ain't you? What's the topic name? Jacek On 7 Jun 2016 11:06 a.m., "Dominik Safaric" wrote: > As I am trying to integrate Kafka into Spark, the following exception > occurs: > >

Advice on Scaling RandomForest

2016-06-07 Thread Franc Carter
Hi, I am training a RandomForest Regression Model on Spark-1.6.1 (EMR) and am interested in how it might be best to scale it - e.g. more CPUs per instance, more memory per instance, more instances, etc. I'm currently using 32 m3.xlarge instances for a training set with 2.5 million rows, 1300
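As a point of reference for the parameter question raised earlier in the thread, here is a hedged sketch of the spark.ml knobs that usually drive RandomForest training cost (Spark 1.6-style API). The column names and values are illustrative assumptions, not the poster's actual settings.

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)                 // more trees -> roughly linear increase in work
  .setMaxDepth(10)                  // deeper trees grow cost quickly
  .setMaxBins(32)                   // fewer bins reduce memory per split search
  .setSubsamplingRate(0.8)          // train each tree on a sample of the rows
  .setFeatureSubsetStrategy("sqrt") // fewer candidate features per split

// trainingDF is assumed to be a DataFrame with "label" and vector-typed "features" columns.
// val model = rf.fit(trainingDF)
```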

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
For now you can move away from Spark and look at the cause of your Kafka publishing. Also check that ZooKeeper is running: jps shows *17102* QuorumPeerMain, which runs on the default port 2181 (netstat -plten | grep 2181 shows :::2181 LISTEN)

Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-07 Thread Ofir Manor
TD - this might not be the best forum, but (1) - batch left outer stream - is always feasible under reasonable constraints, for example a window constraint on the stream. I think it would be super useful to have a central place in the 2.0 docs that spells out what exactly is included, what is

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
> Sounds like the issue is with Kafka channel, it is closing. I came to the same conclusion as well. I've even tried further refining the configuration files. Zookeeper properties:

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
Sounds like the issue is with the Kafka channel; it is closing: Reconnect due to socket error: java.nio.channels.ClosedChannelException. Can you relax that: val ssc = new StreamingContext(sparkConf, Seconds(20))? Also, how are you getting your source data? You can actually have both Spark and the

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Unfortunately, even with this Spark configuration and Kafka parameters, the same exception keeps occurring: 16/06/07 12:26:11 INFO SimpleConsumer: Reconnect due to socket error: java.nio.channels.ClosedChannelException org.apache.spark.SparkException: java.nio.channels.ClosedChannelException

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
OK, that is good. Yours is basically simple streaming with Kafka (publishing a topic) and your Spark Streaming job. Use the following as a blueprint: // Create a local StreamingContext with two working threads and a batch interval of 2 seconds. val sparkConf = new SparkConf().
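Since the blueprint above is cut off, here is a hedged sketch of the usual shape of such a job with the direct Kafka stream (Spark 1.6 / spark-streaming-kafka_2.10). The broker address and topic name are illustrative assumptions, not the original poster's values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

// Create a local StreamingContext with two working threads and a batch interval of 2 seconds.
val sparkConf = new SparkConf().setAppName("KafkaDirectExample").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("mytopic")

// Direct stream: no receiver, and offsets are tracked by Spark itself.
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

messages.map(_._2).count().print() // number of records per batch

ssc.start()
ssc.awaitTermination()
```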

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
Dear Mich, Thank you for the reply. By running the following command on the command line: bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic --from-beginning I do indeed retrieve all messages of a topic. Any indication as to what might cause the issue? An important note to

Re: Specify node where driver should run

2016-06-07 Thread Mich Talebzadeh
By default the driver will start where you have started sbin/start-master.sh; that is where you start your app with SparkSubmit. The slaves have to have an entry in the slaves file. What is the issue here?

Re: Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Mich Talebzadeh
I assume your ZooKeeper is up and running. Can you confirm that you are getting topics from Kafka independently, for example on the command line: ${KAFKA_HOME}/bin/kafka-console-consumer.sh --zookeeper rhes564:2181 --from-beginning --topic newtopic

Re: Timeline for supporting basic operations like groupBy, joins etc on Streaming DataFrames

2016-06-07 Thread Tathagata Das
1. Not all types of joins are supported. Here is the list: Right outer joins - stream-batch not allowed, batch-stream allowed; Left outer joins - batch-stream not allowed, stream-batch allowed (the reverse of right outer join); Stream-stream joins are not allowed. In the case of outer joins,
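To make the allowed case concrete, here is a small sketch of a stream-batch left outer join, which the list above says is supported (Spark 2.0 Structured Streaming). The paths, schema and column names are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.master("local[*]").appName("StreamBatchJoin").getOrCreate()

// Static (batch) side: a small dimension table.
val users = spark.read.json("hdfs:///data/users")   // e.g. columns: userId, country

// Streaming side: file sources need an explicit schema.
val eventSchema = StructType(Seq(
  StructField("userId", StringType),
  StructField("action", StringType)))
val events = spark.readStream.schema(eventSchema).json("hdfs:///data/incoming-events")

// Stream on the left, batch on the right: left outer join.
val enriched = events.join(users, Seq("userId"), "left_outer")

val query = enriched.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```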

Apache Spark Kafka Integration - org.apache.spark.SparkException: Couldn't find leader offsets for Set()

2016-06-07 Thread Dominik Safaric
As I am trying to integrate Kafka into Spark, the following exception occurs: org.apache.spark.SparkException: java.nio.channels.ClosedChannelException org.apache.spark.SparkException: Couldn't find leader offsets for Set([**,0]) at

Re: Switching broadcast mechanism from torrrent

2016-06-07 Thread Takeshi Yamamuro
Hi, Since `HttpBroadcastFactory` has already been removed in master, you cannot use that broadcast mechanism in future releases. Anyway, I couldn't find a root cause from the stacktraces alone... // maropu On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv < daniel.ha...@veracity-group.com>