Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread Charles Feduke
You could also try setting your `nofile` value in /etc/security/limits.conf for `soft` to some ridiculously high value if you haven't done so already. On Fri, Apr 3, 2015 at 2:09 AM Akhil Das wrote: > Did you try these? > > - Disable shuffle : spark.shuffle.spill=false > - Enable log rotation: >
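
For reference, a raised open-file limit in /etc/security/limits.conf usually takes the form below (the user name and value are placeholders, not recommendations):

    sparkuser  soft  nofile  1000000
    sparkuser  hard  nofile  1000000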

Re: Delaying failed task retries + giving failing tasks to different nodes

2015-04-02 Thread Akhil Das
I think these are the configurations that you are looking for: *spark.locality.wait*: Number of milliseconds to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait will be used to step through multiple locality levels (process-local, nod
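
As a rough sketch of where such settings go (the values are illustrative, not tuned recommendations):

    val conf = new SparkConf()
      .set("spark.locality.wait", "3000")       // ms to wait before falling back to a less-local level
      .set("spark.locality.wait.node", "3000")  // per-level overrides (process/node/rack) also exist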

Re: Connection pooling in spark jobs

2015-04-02 Thread Charles Feduke
Out of curiosity I wanted to see what JBoss supported in terms of clustering and database connection pooling since its implementation should suffice for your use case. I found: *Note:* JBoss does not recommend using this feature on a production environment. It requires accessing a connection pool

Re: Mllib kmeans #iteration

2015-04-02 Thread amoners
Have you referred to the official document of kmeans at https://spark.apache.org/docs/1.1.1/mllib-clustering.html ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353p22365.html Sent from the Apache Spark User List mailing list archive

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread Akhil Das
Did you try these? - Disable shuffle : spark.shuffle.spill=false - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size") .set("spark.executor.logs.rolling.size.maxBytes", "1024") .set("spark.executor.logs.rolling.maxRetainedFiles", "3") Thanks Best Regards On Fri,

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-02 Thread Akhil Das
How did you build spark? which version of spark are you having? Doesn't this thread already explains it? https://www.mail-archive.com/user@spark.apache.org/msg25505.html Thanks Best Regards On Thu, Apr 2, 2015 at 11:10 PM, Todd Nist wrote: > Hi Akhil, > > Tried your suggestion to no avail. I a

Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Each executor runs for about 5 secs until which time the db connection can potentially be open. Each executor will have 1 connection open. Connection pooling surely has its advantages of performance and not hitting the dbserver for every open/close. The database in question is not just used by the

Matei Zaharia: Reddit Ask Me Anything

2015-04-02 Thread ben lorica
*Ask Me Anything about Apache Spark & big data* Reddit AMA with Matei Zaharia Friday, April 3 at 9AM PT/ 12PM ET Details can be found here: http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama -- View this message in context: http://apache-spark-user-list.1001560.n3.na

Fwd:

2015-04-02 Thread Himanish Kushary
Actually they may not be sequentially generated and also the list (RDD) could come from a different component. For example from this RDD : (105,918) (105,757) (502,516) (105,137) (516,816) (350,502) I would like to separate into two RDD's : 1) (105,918) (502,516) 2) (105,757) (105,1

Re: Connection pooling in spark jobs

2015-04-02 Thread Charles Feduke
How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single connectio

Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we wou

RE: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread Michael Armbrust
Looks like a typo, try: df.select(df("name"), df("age") + 1) Or df.select("name", "age") PRs to fix docs are always appreciated :) On Apr 2, 2015 7:44 PM, "java8964" wrote: > The import command was already run. > > Forgot to mention, the rest of the examples related to "df" all work, ju
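
A minimal sketch of the working form, assuming the people.json example from the SQL programming guide:

    val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
    df.select(df("name"), df("age") + 1).show()   // column expressions need the df("...") form
    df.select("name", "age").show()               // plain column names work with the string form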

Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Thanks Michael - that was it! I was drawing a blank on this one for some reason - much appreciated! On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust wrote: > A lateral view explode using HiveQL. I'm hoping to add explode shorthand > directly to the df API in 1.4. > > On Thu, Apr 2, 2015 at 7:

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread a mesar
Yes, with spark.cleaner.ttl set there is no cleanup. We pass --properties-file spark-dev.conf to spark-submit where spark-dev.conf contains: spark.master spark://10.250.241.66:7077 spark.logConf true spark.cleaner.ttl 1800 spark.executor.memory 10709m spark.cores.max 4 spark.shuffle.consolidateF

Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Yin Huai
For cast, you can use selectExpr method. For example, df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2"). Or, df.select(df("colA").cast("int"), ...) On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust wrote: > val df = Seq(("test", 1)).toDF("col1", "col2") > > You can use SQ

Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Michael Armbrust
val df = Seq(("test", 1)).toDF("col1", "col2") You can use SQL style expressions as a string: df.filter("col1 IS NOT NULL").collect() res1: Array[org.apache.spark.sql.Row] = Array([test,1]) Or you can also reference columns using df("colName") or quot;colName" or col("colName") df.filter(df("c

Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Michael Armbrust
A lateral view explode using HiveQL. I'm hoping to add explode shorthand directly to the df API in 1.4. On Thu, Apr 2, 2015 at 7:10 PM, Denny Lee wrote: > Quick question - the output of a dataframe is in the format of: > > [2015-04, ArrayBuffer(A, B, C, D)] > > and I'd like to return it as: >
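
A hedged sketch of that HiveQL approach, assuming the frame is registered as a temp table and the array column is named "items" (the table and column names here are illustrative):

    df.registerTempTable("monthly")
    hiveContext.sql(
      "SELECT month, item FROM monthly LATERAL VIEW explode(items) t AS item").show()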

Delaying failed task retries + giving failing tasks to different nodes

2015-04-02 Thread Stephen Merity
Hi there, I've been using Spark for processing 33,000 gzipped files that contain billions of JSON records (the metadata [WAT] dataset from Common Crawl). I've hit a few issues and have not yet found the answers from the documentation / search. This may well just be me not finding the right pages t

[SQL] Simple DataFrame questions

2015-04-02 Thread Yana Kadiyska
Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package. 1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not setter on DF 2. I am trying to

RE: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
The import command was already run. Forgot to mention, the rest of the examples related to "df" all work; just this one caused a problem. Thanks Yong Date: Fri, 3 Apr 2015 10:36:45 +0800 From: fightf...@163.com To: java8...@hotmail.com; user@spark.apache.org Subject: Re: Cannot run the example in the Spa

Re: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread fightf...@163.com
Hi there, you may need to add: import sqlContext.implicits._ Best, Sun fightf...@163.com From: java8964 Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document I tried to check out Spark SQL 1.3.0. I installed it an

maven compile error

2015-04-02 Thread myelinji
Hi all, I just checked out spark-1.2 on GitHub and wanted to build it with Maven; however, I encountered an error during compiling: [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile

Tableau + Spark SQL Thrift Server + Cassandra

2015-04-02 Thread Mohammed Guller
Hi - Is anybody using Tableau to analyze data in Cassandra through the Spark SQL Thrift Server? Thanks! Mohammed

RE: ArrayBuffer within a DataFrame

2015-04-02 Thread Mohammed Guller
Hint: DF.rdd.map{} Mohammed From: Denny Lee [mailto:denny.g@gmail.com] Sent: Thursday, April 2, 2015 7:10 PM To: user@spark.apache.org Subject: ArrayBuffer within a DataFrame Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like
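
Expanding the hint into a sketch (the column names and types here are assumptions about the frame in question):

    import org.apache.spark.sql.Row
    val flattened = df.rdd.flatMap { case Row(month: String, items: Seq[_]) =>
      items.map(item => (month, item))
    }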

Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
I tried to check out Spark SQL 1.3.0. I installed it and followed the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this: // Select everybody, but increment the age by 1 df.select("name", df("age") + 1).show() //

ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Okehee Goh
Michael, You are right. The build brought " org.scala-lang:scala-library:2.10.1" from other package (as below). It works fine after excluding the old scala version. Thanks a lot, Okehee == dependency: |+--- org.apache.kafka:kafka_2.10:0.8.1.1 ||+--- com.yammer.metrics:metrics-core:2

Re: "Spark-events does not exist" error, while it does with all the req. rights

2015-04-02 Thread Marcelo Vanzin
FYI I wrote a small test to try to reproduce this, and filed SPARK-6688 to track the fix. On Tue, Mar 31, 2015 at 1:15 PM, Marcelo Vanzin wrote: > Hmmm... could you try to set the log dir to > "file:/home/hduser/spark/spark-events"? > > I checked the code and it might be the case that the behavio

Re: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
Hi Young, Sorry for the duplicate post, I want to reply to all. I just downloaded the prebuilt bits from the Apache Spark download site. Started the spark shell and got the same error. I then started the shell as follows: ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2 --d

RE: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread java8964
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1. If we check the code change below: Spark 1.3 branch: https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark

Re: Spark SQL does not read from cached table if table is renamed

2015-04-02 Thread Michael Armbrust
I'll add we just back ported this so it'll be included in 1.2.2 also. On Wed, Apr 1, 2015 at 4:14 PM, Michael Armbrust wrote: > This is fixed in Spark 1.3. > https://issues.apache.org/jira/browse/SPARK-5195 > > On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash > wrote: > >> Hi all, >> >> >> >> Noticed

RE: Reading a large file (binary) into RDD

2015-04-02 Thread java8964
I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a
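
Only as an illustration of the shape -- MyRecordInputFormat below is a hypothetical custom InputFormat, not an existing class:

    import org.apache.hadoop.io.{LongWritable, BytesWritable}
    // assumes MyRecordInputFormat extends org.apache.hadoop.mapred.FileInputFormat[LongWritable, BytesWritable]
    val records = sc.hadoopFile[LongWritable, BytesWritable, MyRecordInputFormat](
      "hdfs:///data/graph.bin", 64)   // path, minPartitions
    val bytes = records.map { case (_, value) => value.copyBytes() }  // copy out of the reused Writable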

RE: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Michael, thanks for the response and looking forward to try 1.3.1 From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k,

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
This looks to me like you have incompatible versions of scala on your classpath? On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh wrote: > yes, below is the stacktrace. > Thanks, > Okehee > > java.lang.NoSuchMethodError: > scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; >

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Okehee Goh
yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.(StdNames.scala:97) at scala.reflect.internal.StdNames$Keywords.(StdNames.scala:203)

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread Tathagata Das
Are you saying that even with the spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem wrote: > Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and > the worker nodes eventually run out of inodes. > We see tons of old shuf

Re: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

2015-04-02 Thread Michael Armbrust
Thanks for reporting. The root cause is (SPARK-5632 ), which is actually pretty hard to fix. Fortunately, for this particular case there is an easy workaround: https://github.com/apache/spark/pull/5337 We can try to include this in 1.3.1. On Thu

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
The file has a specific structure. I outline it below. The input file is basically a representation of a graph. INT INT(A) LONG (B) A INTs(Degrees) A SHORTINTs (Vertex_Attribute) B INTs B INTs B SHORTINTs B SHORTINTs A - number of vertices B - number of edges (no

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Michael Armbrust
This is actually a problem with our use of Scala's reflection library. Unfortunately you need to load Spark SQL using the primordial classloader, otherwise you run into this problem. If anyone from the scala side can hint how we can tell scala.reflect which classloader to use when creating the mir

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
Do you have a full stack trace? On Thu, Apr 2, 2015 at 11:45 AM, ogoh wrote: > > Hello, > My ETL uses sparksql to generate parquet files which are served through > Thriftserver using hive ql. > It especially defines a schema programmatically since the schema can be > only > known at runtime. > W

RE: Spark SQL. Memory consumption

2015-04-02 Thread java8964
It is hard to say what the reason could be without more detailed information. If you provide some more information, maybe people here can help you better. 1) What is your worker's memory setting? It looks like your nodes have 128G of physical memory each, but what do you specify for the worker's heap

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread jay vyas
yup a related JIRA is here https://issues.apache.org/jira/browse/SPARK-5113 which you might want to leave a comment in. This can be quite tricky we found ! but there are a host of env variable hacks you can use when launching spark masters/slaves. On Thu, Apr 2, 2015 at 5:18 PM, Michael Quinl

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread Michael Quinlan
I was able to hack around this on my similar setup by running (on the driver) $ sudo hostname ip where ip is the same value set in the "spark.driver.host" property. This isn't a solution I would use universally, and I hope that someone can fix this bug in the distribution. Regards, Mike -- View

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
Hm, that will indeed be trickier because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once re

Re: Need a spark mllib tutorial

2015-04-02 Thread Reza Zadeh
Here's one: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html Reza On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil) < pyada...@cisco.com> wrote: > Hi, > > > > I am new to the spark MLLib and I was browsing through the internet for > good tutorials ad

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records correspond

Re: Spark 1.3 UDF ClassNotFoundException

2015-04-02 Thread Ted Yu
Can you show more code in CreateMasterData ? How do you run your code ? Thanks On Thu, Apr 2, 2015 at 11:06 AM, ganterm wrote: > Hello, > > I started to use the dataframe API in Spark 1.3 with Scala. > I am trying to implement a UDF and am following the sample here: > > https://spark.apache.or

Re: input size too large | Performance issues with Spark

2015-04-02 Thread Christian Perez
To Akhil's point, see Tuning Data structures. Avoid standard collection hashmap. With fewer machines, try running 4 or 5 cores per executor and only 3-4 executors (1 per node): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Ought to reduce shuffle performance hit
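
For example, on YARN that advice might translate into submit flags shaped like this (every number is a placeholder to tune, and the class and jar names are hypothetical):

    spark-submit --master yarn \
      --num-executors 4 \
      --executor-cores 4 \
      --executor-memory 16g \
      --class com.example.MyJob myjob.jar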

Need a spark mllib tutorial

2015-04-02 Thread Phani Yadavilli -X (pyadavil)
Hi, I am new to Spark MLlib and I was browsing the internet for good tutorials more advanced than the Spark documentation examples, but I could not find any. Need help. Regards Phani Kumar

Mesos - spark task constraints

2015-04-02 Thread Ankur Chauhan
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hi, I am trying to figure out how to run spark jobs on a mesos cluster. The mesos cluster has some nodes that have Tachyon installed, and I would like the spark jobs to be started on only those nodes. Each of these nodes has been configure

Re: persist(MEMORY_ONLY) takes lot of time

2015-04-02 Thread Christian Perez
+1. Caching is way too slow. On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti wrote: > Hi Experts, > > I have a parquet dataset of 550 MB ( 9 Blocks) in HDFS. I want to run SQL > queries repetitively. > > Few questions : > > 1. When I do the below (persist to memory after reading from disk), it takes

RE: Date and decimal datatype not working

2015-04-02 Thread BASAK, ANANDA
Thanks all. Finally I am able to run my code successfully. It is running in Spark 1.2.1. I will try it on Spark 1.3 too. The major cause of all errors I faced was that the delimiter was not correctly declared. val TABLE_A = sc.textFile("/Myhome/SPARK/files/table_a_file.txt").map(_.split("|")).m
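
One detail worth calling out, since split() takes a regular expression: an unescaped pipe is a regex alternation and will not split on the literal character. A tiny sketch:

    "105|918".split("|")     // splits on the empty pattern, yielding single-character tokens
    "105|918".split("\\|")   // Array(105, 918) -- the pipe is escaped, so it is matched literally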

Re: Mllib kmeans #iteration

2015-04-02 Thread Joseph Bradley
Check out the Spark docs for that parameter: *maxIterations* http://spark.apache.org/docs/latest/mllib-clustering.html#k-means On Thu, Apr 2, 2015 at 4:42 AM, podioss wrote: > Hello, > i am running the Kmeans algorithm in cluster mode from Mllib and i was > wondering if i could run the algorithm
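
A minimal MLlib sketch showing where maxIterations is supplied (the input path and k are placeholders):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("hdfs:///data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
    val model = KMeans.train(points, 10, 50)   // k = 10, maxIterations = 50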

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Joseph Bradley
Peter's suggestion sounds good, but watch out for the match case since I believe you'll have to match on: case (Row(feature1, feature2, ...), Row(label)) => On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko wrote: > Hi try next code: > > val labeledPoints: RDD[LabeledPoint] = features.zip(labels).

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than actual computation time). Also, I am short of memory at the driver when it has to

Simple but faster data streaming

2015-04-02 Thread Harut Martirosyan
Hi guys. Is there a more lightweight way of stream processing with Spark? What we want is a simpler way, preferably with no scheduling, which just streams the data to multiple destinations. We extensively use Spark Core, SQL, Streaming, and GraphX, so it's our main tool and we don't want to add new thin

Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread ogoh
Hello, My ETL uses sparksql to generate parquet files which are served through Thriftserver using hive ql. It especially defines a schema programmatically since the schema can be only known at runtime. With spark 1.2.1, it worked fine (followed https://spark.apache.org/docs/latest/sql-programming
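
For context, a schema built programmatically with the 1.3 Scala API typically looks roughly like this (the field names are examples only):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val rowRDD = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
    val df = sqlContext.createDataFrame(rowRDD, schema)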

Spark 1.3 UDF ClassNotFoundException

2015-04-02 Thread ganterm
Hello, I started to use the dataframe API in Spark 1.3 with Scala. I am trying to implement a UDF and am following the sample here: https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.UserDefinedFunction meaning val predict = udf((score: Double) => if (score > 0.5) tr

Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
I was trying a simple test from the spark-shell to see if 1.3.0 would address a problem I was having with locating the json_tuple class and got the following error: scala> import org.apache.spark.sql.hive._ import org.apache.spark.sql.hive._ scala> val sqlContext = new HiveContext(sc) sqlContext:

Re: Spark + Kinesis

2015-04-02 Thread Vadim Bichutskiy
Thanks Jonathan. Helpful. VB > On Apr 2, 2015, at 1:15 PM, Kelly, Jonathan wrote: > > It looks like you're attempting to mix Scala versions, so that's going to > cause some problems. If you really want to use Scala 2.11.5, you must also > use Spark package versions built for Scala 2.11 rath

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
If it’s a flat binary file and each record is the same length (in bytes), you can use Spark’s binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here’s an example in python to show how it works: > # write data from an ar
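
A rough Scala equivalent of the same idea (the record length and field layout are assumptions for illustration):

    import java.nio.ByteBuffer
    // assume each record is exactly 8 bytes: two big-endian Ints
    val records = sc.binaryRecords("hdfs:///data/records.bin", 8)
    val pairs = records.map { bytes =>
      val buf = ByteBuffer.wrap(bytes)
      (buf.getInt(), buf.getInt())
    }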

Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-02 Thread Todd Nist
Hi Akhil, Tried your suggestion to no avail. I actually do not see any "jackson" or "json serde" jars in the $HIVE/lib directory. This is hive 0.13.1 and spark 1.2.1. Here is what I did: I have added the lib folder to the –jars option when starting the spark-shell, but the job fails. The hive-s

Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Dean Wampler
I misread that you're running in standalone mode, so ignore the "local[3]" example ;) How many separate readers are listening to rabbitmq topics? This might not be the problem, but I'm just eliminating possibilities. Another possibility is that the in-bound data rate exceeds your ability to proce

Re: Spark SQL. Memory consumption

2015-04-02 Thread Vladimir Rodionov
>> Using large memory for executors (*--executor-memory 120g*). Not really a good advice. On Thu, Apr 2, 2015 at 9:17 AM, Cheng, Hao wrote: > Spark SQL tries to load the entire partition data and organized as > In-Memory HashMaps, it does eat large memory if there are not many > duplicated gro

Re: How to learn Spark ?

2015-04-02 Thread Dean Wampler
You're welcome. Two limitations to know about: 1. I haven't updated it to 1.3 2. It uses Scala for all examples (my bias ;), so less useful if you don't want to use Scala. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Type

Re: Spark + Kinesis

2015-04-02 Thread Kelly, Jonathan
It looks like you're attempting to mix Scala versions, so that's going to cause some problems. If you really want to use Scala 2.11.5, you must also use Spark package versions built for Scala 2.11 rather than 2.10. Anyway, that's not quite the correct way to specify Scala dependencies in build
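
In sbt that generally means letting %% pick the Scala-suffixed artifacts rather than mixing a 2.11 scalaVersion with 2.10 Spark jars -- a sketch with illustrative versions:

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"                 % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming"            % "1.3.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0")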

Re: How to learn Spark ?

2015-04-02 Thread Slim Baltagi
Hi I maintain an Apache Spark Knowledge Base at http://www.SparkBigData.com with over 4,000 related web resources. You can check the ‘Quick Start’ section at http://sparkbigdata.com/tutorials There are plenty of tutorials and examples to start with after you decide what you would like to use:

Re: How to learn Spark ?

2015-04-02 Thread Vadim Bichutskiy
Thanks Dean. This is great. On Thu, Apr 2, 2015 at 9:01 AM, Dean Wampler wrote: > I have a "self-study" workshop here: > > https://github.com/deanwampler/spark-workshop > > dean > > Dean Wampler, Ph.D. > Author: Programming Scala, 2nd Edition >

Spark + Kinesis

2015-04-02 Thread Vadim Bichutskiy
Hi all, I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify *build.sbt*: --- *import AssemblyKeys._* *name := "Kinesis Consumer"* *version := "1.0"organization := "com.myconsumer"scalaVersion := "2.11.5"

Re: How to learn Spark ?

2015-04-02 Thread Star Guo
Yes, I just searched for it! Best Regards, Star Guo == You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo wrote:

RE: Spark SQL. Memory consumption

2015-04-02 Thread Cheng, Hao
Spark SQL tries to load the entire partition data and organize it as in-memory HashMaps; it does eat a lot of memory if there are not many duplicated group-by keys over a large number of records. A couple of things you can try case by case: - Increasing the partition numbers (the records count in
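
For instance, the number of post-shuffle partitions used by aggregations can be raised like this (the value is only a starting point to experiment with):

    sqlContext.setConf("spark.sql.shuffle.partitions", "400")   // default is 200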

Is the disk space in SPARK_LOCAL_DIRS cleanned up?

2015-04-02 Thread Wang, Ningjun (LNG-NPV)
I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, spark writes to this folder. I found that the disk space used by this folder keeps increasing quickly, and at a certain point I will run out of disk space. I wonder, does spark clean up the disk space in this folder once the shuffle

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied: spark.executor.memory 21475m spark.cores.max 16 spark.driver.memory 5235m On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler wrote: > Are you allocating 1 c

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
Sorry for the obvious typo, I have 4 workers with 16 cores total* On Thu, Apr 2, 2015 at 11:56 AM, Bill Young wrote: > Thank you for the response, Dean. There are 2 worker nodes, with 8 cores > total, attached to the stream. I have the following settings applied: > > spark.executor.memory 21475m

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Dean Wampler
Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need "local[3]" at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition
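
Concretely, a sketch of the constraint for two input streams (the master URL is only for the local-mode illustration):

    val conf = new SparkConf()
      .setAppName("two-receivers")
      .setMaster("local[3]")   // 2 cores are pinned by the 2 receivers; at least 1 is left for processing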

Spark SQL. Memory consumption

2015-04-02 Thread Masf
Hi. I'm using Spark SQL 1.2. I have this query: CREATE TABLE test_MA STORED AS PARQUET AS SELECT field1 ,field2 ,field3 ,field4 ,field5 ,COUNT(1) AS field6 ,MAX(field7) ,MIN(field8) ,SUM(field9 / 100) ,COUNT(field10) ,SUM(IF(field11 < -500, 1, 0)) ,MAX(field12) ,SUM(IF(field13 = 1, 1, 0)) ,SUM(I

Spark Streaming Error in block pushing thread

2015-04-02 Thread byoung
I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed out

Re: conversion from java collection type to scala JavaRDD

2015-04-02 Thread Dean Wampler
Use JavaSparkContext.parallelize. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe

Re: Spark, snappy and HDFS

2015-04-02 Thread Nick Travers
Thanks all. I was able to get the decompression working by adding the following to my spark-env.sh script: export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native export SPARK_LIBRARY_PATH=$SPARK_LIBRAR

conversion from java collection type to scala JavaRDD

2015-04-02 Thread Jeetendra Gangele
Hi All, Is there a way to make a JavaRDD from an existing Java collection type List? I know this can be done using Scala, but I am looking at how to do this using Java. Regards Jeetendra

Re: workers no route to host

2015-04-02 Thread Dean Wampler
It appears you are using a Cloudera Spark build, 1.3.0-cdh5.4.0-SNAPSHOT, which expects to find the hadoop command: /data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop: command not found If you don't want to use Hadoop, download one of the pre-built Spark releases from spark.ap

Re: Re:How to learn Spark ?

2015-04-02 Thread Star Guo
Thanks a lot. I will follow your suggestion. Best Regards, Star Guo = The best way of learning spark is to use spark; you may follow the instructions on the apache spark website: http://spark.apache.org/docs/latest/ download->deploy it in standalone mode->run som

Spark Streaming Worker runs out of inodes

2015-04-02 Thread andrem
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ receive

Re How to learn Spark ?

2015-04-02 Thread Star Guo
So cool !! Thanks. Best Regards, Star Guo = You can also refer this blog http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo wrote: Hi, all I am new to here. Could you give me some suggestion to learn Spark ? Tha

Re: How to learn Spark ?

2015-04-02 Thread Star Guo
Thank you! I will begin with it. Best Regards, Star Guo I have a "self-study" workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-02 Thread Dean Wampler
To clarify one thing, is count() the first "action" ( http://spark.apache.org/docs/latest/programming-guide.html#actions) you're attempting? As defined in the programming guide, an action forces evaluation of the pipeline of RDDs. It's only then that reading the data actually occurs. So, count() mi

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Peter Rudenko
Hi, try this code: val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map { case Row(feature1, feature2, ..., label) => LabeledPoint(label, Vectors.dense(feature1, feature2, ...)) } Thanks, Peter Rudenko On 2015-04-02 17:17, drarse wrote: Hello! I have had a question for several days.

A problem with Spark 1.3 artifacts

2015-04-02 Thread Jacek Lewandowski
A very simple example which works well with Spark 1.2, and fail to compile with Spark 1.3: build.sbt: name := "untitled" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" Test.scala: package org.apache.spark.metrics import org.apache.s

Spark streaming error in block pushing thread

2015-04-02 Thread Bill Young
I am running a standalone Spark streaming cluster, connected to multiple RabbitMQ endpoints. The application will run for 20-30 minutes before raising the following error: WARN 2015-04-01 21:00:53,944 > org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove > RDD 22 - Ask time

Re: Connection pooling in spark jobs

2015-04-02 Thread Cody Koeninger
Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition rdd.mapPartition { part => setupPool part.map { ... See "Design Patterns for using foreachRDD" i
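
A bare-bones sketch of that per-partition pattern; createConnection and saveRecord are hypothetical helpers, not real APIs:

    rdd.foreachPartition { partition =>
      val conn = createConnection()   // opened once per partition, not once per record
      try {
        partition.foreach(record => saveRecord(conn, record))
      } finally {
        conn.close()
      }
    }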

Re: Setup Spark jobserver for Spark SQL

2015-04-02 Thread Daniel Siegmann
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika wrote: > Hi,

From DataFrame to LabeledPoint

2015-04-02 Thread drarse
Hello! I have had a question for several days. I am working with DataFrames and Spark SQL. I imported a jsonFile: val df = sqlContext.jsonFile("file.json") In this json I have the label and the features. I selected them: val features = df.select("feature1","feature2","feature3",...); val labe

RE: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-02 Thread Ashley Rose
That’s precisely what I was trying to check. It should have 42577 records in it, because that’s how many there were in the text file I read in. // Load a text file and convert each line to a JavaBean. JavaRDD lines = sc.textFile("file.txt"); JavaRDD tbBER = lines.map(s ->

Re: A stream of json objects using Java

2015-04-02 Thread Sean Owen
This just reduces to finding a library that can translate a String of JSON into a POJO, Map, or other representation of the JSON. There are loads of these, like Gson or Jackson. Sure, you can easily use these in a function that you apply to each JSON string in each line of the file. It's not differ
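
For example, with Jackson this can reduce to mapping each line through an ObjectMapper (a sketch; here each record is deserialized into a java.util.Map, and "lines" stands for the stream or RDD of JSON strings):

    import com.fasterxml.jackson.databind.ObjectMapper

    val parsed = lines.mapPartitions { iter =>
      val mapper = new ObjectMapper()   // build once per partition so it need not be serialized
      iter.map(json => mapper.readValue(json, classOf[java.util.Map[String, Object]]))
    }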

A stream of json objects using Java

2015-04-02 Thread James King
I'm reading a stream of string lines that are in json format. I'm using Java with Spark. Is there a way to get this from a transformation? so that I end up with a stream of JSON objects. I would also welcome any feedback about this approach or alternative approaches. thanks jk

Re: How to learn Spark ?

2015-04-02 Thread Dean Wampler
I have a "self-study" workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://pol

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Dean Wampler
It failed to find the class org.apache.spark.sql.catalyst.ScalaReflection in the Spark SQL library. Make sure it's in the classpath and the version is correct, too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Types

Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Right, I am aware on how to use connection pooling with oracle, but the specific question is how to use it in the context of spark job execution On 2 Apr 2015 17:41, "Ted Yu" wrote: > http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm > > The question doesn't seem to be Spark specif

Error in SparkSQL/Scala IDE

2015-04-02 Thread Sathish Kumaran Vairavelu
Hi Everyone, I am getting following error while registering table using Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1 import sqlContext.createSchemaRDD val empFile = sc.textFile("/tmp/emp.csv", 4) .map ( _.split(",") )

  1   2   >