Trouble while running Spark on an EC2 cluster

2016-07-15 Thread Hassaan Chaudhry
Hi, I have launched my cluster and I am trying to submit my application to run on the cluster, but it's not allowing me to connect. It prompts the following error: "Master endpoint spark://ec2-54-187-59-117.us-west-2.compute.amazonaws.com:7077

Fwd: Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Diwakar Dhanuskodi
-- Forwarded message -- From: Diwakar Dhanuskodi Date: Sat, Jul 16, 2016 at 9:30 AM Subject: Re: Spark streaming takes longer time to read json into dataframes To: Jean Georges Perrin Hello, I need it in memory. Increased executor

Re: Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Jean Georges Perrin
Do you need it on disk or just push it to memory? Can you try to increase memory or # of cores (I know it sounds basic) > On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi > wrote: > > Hello, > > I have 400K json messages pulled from Kafka into spark streaming

Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Diwakar Dhanuskodi
Hello, I have 400K json messages pulled from Kafka into Spark Streaming using the DirectStream approach. The size of the 400K messages is around 5 GB. The Kafka topic has a single partition. I am using rdd.read.json(_._2) inside foreachRDD to convert the rdd into a dataframe. It takes almost 2.3 minutes to convert into
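A rough sketch of the pattern being described, in Spark 1.6 syntax (broker, topic, and batch interval are placeholders). One thing worth noting: read.json has to infer a schema, which costs an extra pass over the 5 GB batch; supplying an explicit schema via read.schema(...) before .json(...) avoids that pass.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object JsonStreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("json-stream-sketch")
        val ssc = new StreamingContext(conf, Seconds(30))               // placeholder batch interval
        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
        val topics = Set("events")                                      // placeholder topic

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        stream.foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
            // _._2 is the JSON payload; read.json infers the schema from the data
            val df = sqlContext.read.json(rdd.map(_._2))
            df.count() // stand-in for the real processing
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }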

Size of cached dataframe

2016-07-15 Thread Brandon White
Is there any public API to get the size of a dataframe in cache? It's visible in the Spark UI, but I don't see an API to access this information. Do I need to build it myself using a forked version of Spark?
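One hedged possibility, short of forking Spark: the numbers behind the Storage tab are reachable from SparkContext.getRDDStorageInfo, which is a DeveloperApi rather than a stable public one. A minimal sketch, assuming df is the cached DataFrame and sc the SparkContext:

    df.cache()
    df.count() // materialize the cache before asking for sizes

    // getRDDStorageInfo is @DeveloperApi, so it may change between releases
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize / 1024 / 1024} MB in memory, " +
        s"${info.diskSize / 1024 / 1024} MB on disk (${info.numCachedPartitions} cached partitions)")
    }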

Re: RDD and Dataframes

2016-07-15 Thread Taotao.Li
Hi brccosta, Databricks have just posted a blog about RDDs, DataFrames and Datasets; you can check it here: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html . I think it will be very helpful for you.

Re: Spark Streaming - Best Practices to handle multiple datapoints arriving at different time interval

2016-07-15 Thread RK Aduri
You can probably define sliding windows and set larger batch intervals. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Best-Practices-to-handle-multiple-datapoints-arriving-at-different-time-interval-tp27315p27348.html Sent from the Apache
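As a concrete illustration of the sliding-window suggestion (names and durations here are hypothetical): with a pair DStream keyed by data point, reduceByKeyAndWindow aggregates everything that arrived inside the window regardless of which batch it landed in, as long as the window and slide durations are multiples of the batch interval.

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    // hypothetical keyed stream of (pointId, value) pairs
    val stream: DStream[(String, Double)] = ???

    val windowed = stream.reduceByKeyAndWindow(
      (a: Double, b: Double) => a + b, // aggregation within the window
      Seconds(600),                    // window length: 10 minutes
      Seconds(120))                    // slide interval: 2 minutes

    windowed.print()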

Re: java.lang.OutOfMemoryError related to Graphframe bfs

2016-07-15 Thread RK Aduri
Did you try with a different driver memory? Increasing the driver's memory can be one option. Can you print the GC and post the GC times? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-related-to-Graphframe-bfs-tp27318p27347.html
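For reference, both suggestions can be applied at submit time; the class and jar names below are placeholders and the values are illustrative only.

    spark-submit \
      --class com.example.GraphBfsJob \
      --driver-memory 8g \
      --conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      graphframes-bfs-app.jar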

Re: RDD and Dataframes

2016-07-15 Thread RK Aduri
DataFrames use RDDs as the internal implementation of their structure. A DataFrame doesn't convert to an RDD; it uses RDD partitions to produce its logical plan. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Dataframes-tp27306p27346.html Sent from the Apache Spark
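A small illustration of that relationship (the input path is hypothetical and a Spark 1.x sqlContext is assumed): the same data can be viewed as an RDD of Rows, and explain(true) shows the logical plan that is eventually executed over those partitions.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    val df = sqlContext.read.json("people.json") // hypothetical input
    val rows: RDD[Row] = df.rdd                  // RDD view over the same partitions
    df.explain(true)                             // prints the logical and physical plans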

Re: spark.executor.cores

2016-07-15 Thread Brad Cox
Mitch: could you elaborate on: You can practically run most of your unit testing with Local mode and deploy variety of options including running SQL queries, reading data from CSV files, writing to HDFS, creating Hive tables including ORC tables and doing Spark Streaming. In particular, are

Re: Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)

2016-07-15 Thread Jörn Franke
I am not sure if I exactly understand your use case, but for my Hadoop/Spark format that reads the Bitcoin blockchain I extend from FileInputFormat. I use the default split mechanism. This could mean that I split in the middle of a bitcoin block, which is no issue, because the first split can

Re: spark.executor.cores

2016-07-15 Thread Mich Talebzadeh
Great stuff, thanks Jean. These are from my notes: these are the Spark operation modes that I know - Spark Local - Spark runs on the local host. This is the simplest setup and best suited for learners who want to understand different concepts of Spark and those performing unit

Re: Streaming from Kinesis is not getting data in Yarn cluster

2016-07-15 Thread Yash Sharma
I struggled with kinesis for a long time and got all my findings documented at - http://stackoverflow.com/questions/35567440/spark-not-able-to-fetch-events-from-amazon-kinesis Let me know if it helps. Cheers, Yash - Thanks, via mobile, excuse brevity. On Jul 16, 2016 6:05 AM, "dharmendra"

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Hey Mich, Oh well, you know, us humble programmers try to modestly understand what the brilliant data scientists are designing and, I can assure you, it is not easy. Basically I use Spark in two ways: 1) As a developer I just embed the Spark binaries (jars) in my Maven POM. In

Streaming from Kinesis is not getting data in Yarn cluster

2016-07-15 Thread dharmendra
I have created a small Spark Streaming program to fetch data from Kinesis and put some data in a database. When I ran it on a Spark standalone cluster using master local[*] it worked fine, but when I tried to run it on a YARN cluster with master "yarn", the application doesn't receive any data. I submit

How can we control CPU and Memory per Spark job operation..

2016-07-15 Thread Pavan Achanta
Hi All, Here is my use case: I have a pipeline job consisting of 2 map functions: 1. A CPU-intensive map operation that does not require a lot of memory. 2. A memory-intensive map operation that requires up to 4 GB of memory. And this 4 GB of memory cannot be distributed since it is an NLP model.

Re: How to convert from DataFrame to Dataset[Row]?

2016-07-15 Thread Mich Talebzadeh
Can't you create a temp table from the DF, say df.registerTempTable("tmp"), and use that instead? Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
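A minimal sketch of that suggestion, assuming a Spark 1.x SQLContext named sqlContext and an existing DataFrame df:

    df.registerTempTable("tmp")
    val again = sqlContext.sql("SELECT * FROM tmp") // comes back as a DataFrame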

How to convert from DataFrame to Dataset[Row]?

2016-07-15 Thread Daniel Barclay
In Spark 1.6.1, how can I convert a DataFrame to a Dataset[Row]? Is there a direct conversion? (Trying .as[Row] doesn't work, even after importing .implicits._ .) Is there some way to map the Rows from the Dataframe into the Dataset[Row]? (DataFrame.map would just make another Dataframe,
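For what it's worth: in 1.6 there is no built-in Encoder for Row, which is why .as[Row] fails; DataFrame only becomes an alias for Dataset[Row] in 2.0. A common workaround in 1.6 is to go to a typed Dataset through a case class whose fields match the column names (the schema below is hypothetical):

    import sqlContext.implicits._

    case class Person(name: String, age: Long) // hypothetical schema matching the DataFrame's columns

    val ds: org.apache.spark.sql.Dataset[Person] = df.as[Person]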

Re: spark.executor.cores

2016-07-15 Thread Mich Talebzadeh
Interesting. For some stuff I create an uber jar file and use that with spark-submit. I have not attempted to start the cluster from within the application. I tend to use a shell program (actually a k-shell script) to compile it via Maven or sbt and then run it accordingly. In general you can

spark single PROCESS_LOCAL task

2016-07-15 Thread Matt K
Hi all, I'm seeing some curious behavior which I have a hard time interpreting. I have a job which does a "groupByKey" and results in 300 executors. 299 are run in NODE_LOCAL mode. 1 executor is run in PROCESS_LOCAL mode. The 1 executor that runs in PROCESS_LOCAL mode gets about 10x as much

Custom InputFormat (SequenceFileInputFormat vs FileInputFormat)

2016-07-15 Thread jtgenesis
I'm working with a single image file that consists of headers and a multitude of different data segment types (each data segment having its own sub-header that contains metadata). Currently using Hadoop's HDFS. Example file layout: | Header | Seg A-1 Sub-Header | Seg A-1 Data | Seg A-2

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-15 Thread Mark Hamstra
Nothing has changed in that regard, nor is there likely to be "progress", since more sophisticated or capable resource scheduling at the Application level is really beyond the design goals for standalone mode. If you want more in the way of multi-Application resource scheduling, then you should

Re: Maximum Size of Reference Look Up Table in Spark

2016-07-15 Thread Jacek Laskowski
Hi, Never worked in a project that would require it. Jacek On 15 Jul 2016 5:31 p.m., "Saravanan Subramanian" wrote: > Hello Jacek, > > Have you seen any practical limitation or performance degradation issues > while using more than 10GB of broadcast cache ? > > Thanks, >

many 'active' jobs are pending

2016-07-15 Thread 陆巍|Wei Lu(RD)
Hi there, I am hitting a "many Active jobs" issue when using direct Kafka streaming on YARN (Spark 1.5, Hadoop 2.6, CDH 5.5.1). The problem happens when Kafka has almost NO traffic. From the application UI, I see many 'active' jobs pending for hours. And finally the driver is "Requesting 4 new

standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-15 Thread Teng Qiu
Hi, http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/spark-standalone.html#resource-scheduling "The standalone cluster mode currently only supports a simple FIFO scheduler across applications." Is this sentence still true? Any progress on this? It would be really helpful. Some

Re: Complications with saving Kafka offsets?

2016-07-15 Thread Cody Koeninger
The bottom line short answer for this is that if you actually care about data integrity, you need to store your offsets transactionally alongside your results in the same data store. If you're ok with double-counting in the event of failures, saving offsets _after_ saving your results, using
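A hedged sketch of the "offsets alongside results" idea, using a JDBC store. The table names, connection string, and the aggregation are all hypothetical; countByValue brings the per-batch results back to the driver, which is where this transaction runs, and the stream is assumed to come from KafkaUtils.createDirectStream so each RDD implements HasOffsetRanges.

    import java.sql.DriverManager
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    val stream: InputDStream[(String, String)] = ??? // hypothetical direct stream

    stream.foreachRDD { rdd =>
      val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val results = rdd.map(_._2).countByValue() // stand-in for the real aggregation

      val conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "pass")
      try {
        conn.setAutoCommit(false)
        val insertResult = conn.prepareStatement("INSERT INTO results (key, cnt) VALUES (?, ?)")
        results.foreach { case (k, v) =>
          insertResult.setString(1, k); insertResult.setLong(2, v); insertResult.executeUpdate()
        }
        val updateOffset = conn.prepareStatement(
          "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND partition = ?")
        offsetRanges.foreach { o =>
          updateOffset.setLong(1, o.untilOffset); updateOffset.setString(2, o.topic)
          updateOffset.setInt(3, o.partition); updateOffset.executeUpdate()
        }
        conn.commit() // offsets only advance if this batch's results actually committed
      } catch {
        case e: Exception => conn.rollback(); throw e
      } finally {
        conn.close()
      }
    }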

Re: Spark Streaming - Direct Approach

2016-07-15 Thread Cody Koeninger
We've been running direct stream jobs in production for over a year, with uptimes in the range of months. I'm pretty slammed with work right now, but when I get time to submit a PR for the 0.10 docs i'll remove the experimental note from 0.8 On Mon, Jul 11, 2016 at 4:35 PM, Tathagata Das

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
lol - young padawan I am and path to knowledge seeking I am... And on this path I also tried (without luck)... if (restId == 0) { conf = conf.setExecutorEnv("spark.executor.cores", "22"); } else { conf =

Re: spark.executor.cores

2016-07-15 Thread Daniel Darabos
Mich's invocation is for starting a Spark application against an already running Spark standalone cluster. It will not start the cluster for you. We used to not use "spark-submit", but we started using it when it solved some problem for us. Perhaps that day has also come for you? :) On Fri, Jul

Re: Maximum Size of Reference Look Up Table in Spark

2016-07-15 Thread Saravanan Subramanian
Hello Jacek, Have you seen any practical limitation or performance degradation issues while using more than 10GB of broadcast cache? Thanks, Saravanan S. On Thursday, 14 July 2016 8:06 PM, Jacek Laskowski wrote: Hi, My understanding is that the maximum size of a

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
I don't use submit: I start my standalone cluster and connect to it remotely. Is that a bad practice? I'd like to be able to do it dynamically, as the system knows whether it needs more or fewer resources based on its own context. > On Jul 15, 2016, at 10:55 AM, Mich Talebzadeh

Re: spark.executor.cores

2016-07-15 Thread Mich Talebzadeh
Hi, You can also do all this at env or submit time with spark-submit which I believe makes it more flexible than coding in. Example ${SPARK_HOME}/bin/spark-submit \ --packages com.databricks:spark-csv_2.11:1.3.0 \ --driver-memory 2G \
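Building on that, the resource knobs in this thread all have spark-submit equivalents; a sketch with illustrative values (the class and jar names are placeholders), where --total-executor-cores is the standalone-mode cap on how many cores the whole application may take:

    ${SPARK_HOME}/bin/spark-submit \
      --master spark://10.0.100.120:7077 \
      --class com.example.NCEateryApp \
      --packages com.databricks:spark-csv_2.11:1.3.0 \
      --driver-memory 2G \
      --executor-memory 4G \
      --executor-cores 2 \
      --total-executor-cores 12 \
      nc-eatery.jar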

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
Spark's shuffle mechanism takes care of this kind of optimization internally when you use the sort-based shuffle (which is the default). On Thu, Jul 14, 2016 at 2:57 PM, Punit Naik wrote: > I meant to say that first we can sort the individual partitions and then > sort

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
sortByKey needs to use a range partitioner, a very particular partitioner, so you cannot supply your own partitioner. You should not have to shuffle twice to do a secondary sort algorithm. On Thu, Jul 14, 2016 at 2:22 PM, Punit Naik wrote: > Okay. Can't I supply the same
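A sketch of a single-shuffle secondary sort with repartitionAndSortWithinPartitions (the record layout, key types, and partition count are assumptions): partition on the primary key only, but let the shuffle order records by the full (key, timestamp) pair.

    import org.apache.spark.Partitioner

    // Partition on the primary key only; the shuffle's ordering uses the whole (String, Long) key
    class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(composite: Any): Int = {
        val (key, _) = composite.asInstanceOf[(String, Long)]
        val h = key.hashCode % partitions
        if (h < 0) h + partitions else h
      }
    }

    val records = sc.parallelize(Seq(
      (("user-a", 3L), "x"), (("user-a", 1L), "y"), (("user-b", 2L), "z")))

    val sorted = records.repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(4))
    // Within each partition, each primary key's records are contiguous and ordered by timestamp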

How to verify in Spark 1.6.x usage, User Memory used after Cache table

2016-07-15 Thread Yogesh Rajak
Hi Team, I am using the HDP 2.4 Sandbox to check the Spark 1.6 memory feature. I have connected to Spark through the Spark Thrift Server from Squirrel (JDBC client) and executed the CACHE command to cache a Hive table. The command executes successfully and SQL returns data within seconds.

Error starting HiveServer2: could not start ThriftBinaryCLIService

2016-07-15 Thread ram kumar
Hi all, I started the Hive Thrift Server with the command /sbin/start-thriftserver.sh --master yarn -hiveconf hive.server2.thrift.port 10003. The Thrift server started on that node without any error. When doing the same, but pointing to a different node to start the server,
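For comparison, the documented form of that invocation passes Hive settings as --hiveconf key=value pairs; a sketch with illustrative values (hive.server2.thrift.bind.host is worth setting explicitly when the server should listen on a specific node):

    ${SPARK_HOME}/sbin/start-thriftserver.sh \
      --master yarn \
      --hiveconf hive.server2.thrift.port=10003 \
      --hiveconf hive.server2.thrift.bind.host=<node-hostname>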

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Merci Nihed, this is one of the tests I did :( still not working > On Jul 15, 2016, at 8:41 AM, nihed mbarek wrote: > > can you try with : > SparkConf conf = new SparkConf().setAppName("NC Eatery > app").set("spark.executor.memory", "4g") >

XML

2016-07-15 Thread VON RUEDEN, Jonathan
Hi everyone, I want to read an XML file with multiple attributes per tag and would need some help. I am able to read and process the sample files but can't find a solution for my XML. Here's the file structure: https://githudoc.doc.doc.doc.docm.md; severity="Info" />
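One route people often take for attribute-heavy XML is the com.databricks:spark-xml package (added via --packages with the spark-xml artifact matching your Scala version); with it, attributes of the row element surface as columns carrying the attribute prefix ("_" by default), so an attribute like severity becomes a _severity column. A hedged sketch in which rowTag and the path are assumptions:

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record")   // hypothetical row element
      .load("/data/input.xml")      // hypothetical path

    df.printSchema()
    df.select("_severity").show()   // attributes appear with the "_" prefix by default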

Re: spark.executor.cores

2016-07-15 Thread nihed mbarek
can you try with : SparkConf conf = new SparkConf().setAppName("NC Eatery app").set( "spark.executor.memory", "4g") .setMaster("spark://10.0.100.120:7077"); if (restId == 0) { conf = conf.set("spark.executor.cores", "22"); } else { conf = conf.set("spark.executor.cores", "2"); } JavaSparkContext

spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Hi, Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores. My process uses all the cores of my server (good), but I am trying to limit it so I can actually submit a second job. I tried SparkConf conf = new SparkConf().setAppName("NC Eatery
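One point that may help frame the replies in this thread: on a standalone cluster, spark.executor.cores limits cores per executor, while spark.cores.max caps the total cores the whole application takes, which is usually the knob that leaves room for a second job. A sketch with illustrative values (Scala here, but the same set(...) calls apply to the Java SparkConf):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("NC Eatery app")
      .setMaster("spark://10.0.100.120:7077")
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "2")  // cores per executor
      .set("spark.cores.max", "12")      // total cores for this application -- illustrative value

    val sc = new SparkContext(conf)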

Random Forest gererate model failed (DecisionTree.scala:642), which has no missing parents

2016-07-15 Thread Ascot Moss
Hi, I am trying to create a Random Forest model; my source code is as follows: val rf_model = RandomForest.trainClassifier(trainData, 7, Map[Int,Int](), 20, "auto", "entropy", 30, 300) I got the following error: ## 16/07/15 19:55:04 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks
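For reference, the positional arguments in that call line up with the MLlib 1.x trainClassifier signature roughly as annotated below; maxDepth, maxBins, and numTrees tend to be the parameters that most directly drive memory use in this API.

    import org.apache.spark.mllib.tree.RandomForest

    val rf_model = RandomForest.trainClassifier(
      trainData,        // RDD[LabeledPoint]
      7,                // numClasses
      Map[Int, Int](),  // categoricalFeaturesInfo (empty: all features treated as continuous)
      20,               // numTrees
      "auto",           // featureSubsetStrategy
      "entropy",        // impurity
      30,               // maxDepth
      300)              // maxBins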

Random Forest Job got killed (DAGScheduler: failed: Set() , DecisionTree.scala:642), which has no missing parents)

2016-07-15 Thread Ascot Moss
Hi, I am trying to create a Random Forest model; my source code is as follows: val rf_model = edhRF.trainClassifier(trainData, 7, Map[Int,Int](), 20, "auto", "entropy", 30, 300) I got the following error: ## 16/07/15 19:55:04 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all

Re: How to recommend most similar users using Spark ML

2016-07-15 Thread nguyen duc Tuan
Hi jeremycod, If you want to find the top-N nearest neighbors for all users using an exact top-k algorithm, I recommend using the same approach as used in MLlib:

Re: Input path does not exist error in giving input file for word count program

2016-07-15 Thread Ted Yu
From examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala : val lines = ssc.textFileStream(args(0)) val words = lines.flatMap(_.split(" ")) In your case, it looks like inputfile didn't correspond to an existing path. On Fri, Jul 15, 2016 at 1:05 AM, RK Spark

RE: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-15 Thread Joaquin Alzola
It is on the 20th (Wednesday) next week. From: Marco Mistroni [mailto:mmistr...@gmail.com] Sent: 15 July 2016 11:04 To: Mich Talebzadeh Cc: user @spark ; user Subject: Re: Presentation in London: Running Spark on Hive or

Re: Presentation in London: Running Spark on Hive or Hive on Spark

2016-07-15 Thread Marco Mistroni
Dr Mich do you have any slides or videos available for the presentation you did @Canary Wharf? kindest regards marco On Wed, Jul 6, 2016 at 10:37 PM, Mich Talebzadeh wrote: > Dear forum members > > I will be presenting on the topic of "Running Spark on Hive or Hive

Re: Call http request from within Spark

2016-07-15 Thread ayan guha
Can you explain what you mean by "count never stops"? On 15 Jul 2016 00:53, "Amit Dutta" wrote: > Hi All, > > > I have a requirement to call a rest service url for 300k customer ids. > > Things I have tried so far is > > > custid_rdd =

Re: scala.MatchError on stand-alone cluster mode

2016-07-15 Thread Saisai Shao
The error stack is throwing from your code: Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class [Ljava.lang.String;) at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29) at com.jd.deeplog.LogAggregator.main(LogAggregator.scala) I think you should debug

Input path does not exist error in giving input file for word count program

2016-07-15 Thread RK Spark
val count = inputfile.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_ + _); org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

Re: Getting error in inputfile | inputFile

2016-07-15 Thread RK Spark
scala> val count = inputfile.flatMap(line => line.split((" ").map(word => (word,1)).reduceByKey(_ + _) | | You typed two blank lines. Starting a new command. This is what I am getting; how do I solve it? Regards, Ramkrishna KT
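The reason the REPL keeps showing | prompts is that the quoted snippet has unbalanced parentheses: split((" ") opens parentheses that are never closed, so Scala keeps waiting for more input. A balanced version of the same word count:

    val inputfile = sc.textFile("input.txt")
    val counts = inputfile
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println) // or counts.saveAsTextFile("output")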

Re: Getting error in inputfile | inputFile

2016-07-15 Thread ram kumar
check the "*inputFile*" variable name lol On Fri, Jul 15, 2016 at 12:12 PM, RK Spark wrote: > I am using Spark version is 1.5.1, I am getting errors in first program of > spark,ie.e., word count. Please help me to solve this > > *scala> val inputfile =

scala.MatchError on stand-alone cluster mode

2016-07-15 Thread Mekal Zheng
Hi, I have a Spark Streaming job written in Scala that runs well in local and client mode, but when I submit it in cluster mode, the driver reports the error shown below. Does anyone know what is wrong here? Please help me! The job CODE is after 16/07/14 17:28:21 DEBUG ByteBufUtil:

Getting error in inputfile | inputFile

2016-07-15 Thread RK Spark
I am using Spark version is 1.5.1, I am getting errors in first program of spark,ie.e., word count. Please help me to solve this *scala> val inputfile = sc.textFile("input.txt")* *inputfile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at textFile at :21* *scala> val counts =

find two consecutive points

2016-07-15 Thread Divya Gehlot
Hi, I have a huge data set similar to the one below: timestamp,fieldid,point_id 1468564189,89,1 1468564090,76,4 1468304090,89,9 1468304090,54,6 1468304090,54,4 I have a configuration file of consecutive points -- 1,9 4,6 -- i.e. 1 and 9 are consecutive points, and similarly 4 and 6 are consecutive points. Now I need