Re: Spark Sql - Missing Jar ? json_tuple NoClassDefFoundError

2015-04-02 Thread Akhil Das
Try adding all the jars in your $HIVE/lib directory. If you want the specific jar, you could look for jackson or json serde in it. Thanks Best Regards On Thu, Apr 2, 2015 at 12:49 AM, Todd Nist tsind...@gmail.com wrote: I have a feeling I’m missing a Jar that provides the support or could this

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Xiangrui Meng
I think before 1.3 you would also get the stackoverflow problem in ~35 iterations. In 1.3.x, please use setCheckpointInterval to solve this problem, which is available in the current master and 1.3.1 (to be released soon). Btw, do you find 80 iterations are needed for convergence? -Xiangrui On Wed, Apr 1,
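A minimal sketch of the suggested fix (assumes Spark 1.3.1+, an existing RDD[Rating] named ratings, and a checkpoint directory the cluster can write to; the path here is hypothetical):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // hypothetical path
    val model = new ALS()
      .setRank(10)
      .setIterations(80)
      .setCheckpointInterval(10) // truncate the lineage every 10 iterations
      .run(ratings)              // ratings: RDD[Rating]

Checkpointing every few iterations keeps the RDD lineage short, which is what prevents the stack overflow on long runs.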

Re: Spark, snappy and HDFS

2015-04-02 Thread Sean Owen
Yes, any Hadoop-related process that asks for Snappy compression or needs to read it will have to have the Snappy libs available on the library path. That's usually set up for you in a distro or you can do it manually like this. This is not Spark-specific. The second question also isn't

How to learn Spark ?

2015-04-02 Thread Star Guo
Hi, all. I am new here. Could you give me some suggestions on how to learn Spark? Thanks. Best Regards, Star Guo

Support for Data flow graphs and not DAG only

2015-04-02 Thread anshu shukla
Hey, I didn't find any documentation regarding support for cycles in a spark topology, although storm supports this using manual configuration in the acker function logic (setting it to a particular count). By cycles I don't mean infinite loops. -- Thanks Regards, Anshu Shukla

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
Thanks Xiangrui, I used 80 iterations to demonstrate the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote: I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use

how to find near duplicate items from given dataset using spark

2015-04-02 Thread Somnath Pandeya
Hi All, I want to find near duplicate items from a given dataset. E.g. consider a data set: 1. Cricket,bat,ball,stumps 2. Cricket,bowler,ball,stumps, 3. Football,goalie,midfielder,goal 4. Football,referee,midfielder,goal, Here 1 and 2 are near duplicates (only field 2 is

Re: pyspark hbase range scan

2015-04-02 Thread gen tang
Hi, Maybe this might be helpful: https://github.com/GenTang/spark_hbase/blob/master/src/main/scala/examples/pythonConverters.scala Cheers Gen On Thu, Apr 2, 2015 at 1:50 AM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I am attempting to read an hbase table in pyspark with a range

Re: Streaming anomaly detection using ARIMA

2015-04-02 Thread Sean Owen
This inside-out parallelization has been a way people have used R with MapReduce for a long time: run N copies of an R script on the cluster, on different subsets of the data, babysat by Mappers. You just need R installed on the cluster. Hadoop Streaming makes this easy, and things like RDD.pipe in
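A minimal sketch of the RDD.pipe route; the Rscript path and the script itself (reading lines on stdin, writing lines on stdout) are assumptions, and R must be installed on every worker:

    val input = sc.textFile("hdfs:///data/points.csv")
    val scored = input.pipe("Rscript /opt/scripts/score.R") // one record per line in and out
    scored.saveAsTextFile("hdfs:///data/scored")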

JAVA_HOME problem

2015-04-02 Thread 董帅阳
spark 1.3.0 spark@pc-zjqdyyn1:~ tail /etc/profile export JAVA_HOME=/usr/jdk64/jdk1.7.0_45 export PATH=$PATH:$JAVA_HOME/bin # # End of /etc/profile # But ERROR LOG Container: container_1427449644855_0092_02_01 on pc-zjqdyy04_45454

Re: Spark throws rsync: change_dir errors on startup

2015-04-02 Thread Horsmann, Tobias
Hi, Verbose output showed no additional information about the origin of the error rsync from right sending incremental file list sent 20 bytes received 12 bytes 64.00 bytes/sec total size is 0 speedup is 0.00 starting org.apache.spark.deploy.master.Master, logging to

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Nick Pentreath
Fair enough, but I'd say you hit that diminishing return after 20 iterations or so... :) On Thu, Apr 2, 2015 at 9:39 AM, Justin Yip yipjus...@gmail.com wrote: Thanks Xiangrui, I used 80 iterations to demonstrate the marginal diminishing return in prediction quality :) Justin On Apr 2,

Starting httpd: http: Syntax error on line 154

2015-04-02 Thread Ganon Pierce
I’m unable to access ganglia, it looks like due the web server not starting as I receive this error when I launch spark: Starting httpd: http: Syntax error on line 154 of /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so This occurs when I’m using the vanilla script.

StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
Hello, I have been using Mllib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and I encountered a stackoverflow problem. After some digging, I realized that when the iteration count reaches ~35, I will get the overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there

Setup Spark jobserver for Spark SQL

2015-04-02 Thread Harika
Hi, I am trying to use Spark Jobserver (https://github.com/spark-jobserver/spark-jobserver) for running Spark SQL jobs. I was able to start the server but when I run my application (my Scala class which extends SparkSqlJob), I am getting the

Re: Unable to save dataframe with UDT created with sqlContext.createDataFrame

2015-04-02 Thread Xiangrui Meng
I reproduced the bug on master and submitted a patch for it: https://github.com/apache/spark/pull/5329. It may get into Spark 1.3.1. Thanks for reporting the bug! -Xiangrui On Wed, Apr 1, 2015 at 12:57 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hmm, I got the same error with the master. Here

Re: HiveContext setConf seems not stable

2015-04-02 Thread Hao Ren
Hi, Jira created: https://issues.apache.org/jira/browse/SPARK-6675 Thank you. On Wed, Apr 1, 2015 at 7:50 PM, Michael Armbrust mich...@databricks.com wrote: Can you open a JIRA please? On Wed, Apr 1, 2015 at 9:38 AM, Hao Ren inv...@gmail.com wrote: Hi, I find HiveContext.setConf does

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Joseph Bradley
Peter's suggestion sounds good, but watch out for the match case, since I believe you'll have to match on: case (Row(feature1, feature2, ...), Row(label)) => On Thu, Apr 2, 2015 at 7:57 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi, try this code: val labeledPoints: RDD[LabeledPoint] =
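A sketch combining both suggestions, assuming the features and labels DataFrames from the original post line up row-for-row and hold numeric columns:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.Row

    val labeledPoints = features.rdd.zip(labels.rdd).map {
      case (Row(f1: Double, f2: Double, f3: Double), Row(label: Double)) =>
        LabeledPoint(label, Vectors.dense(f1, f2, f3))
    }

Note that RDD.zip requires both sides to have the same partitioning and per-partition counts, which typically holds when both DataFrames are selected from the same parent without an intervening shuffle.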

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
Do you have a full stack trace? On Thu, Apr 2, 2015 at 11:45 AM, ogoh oke...@gmail.com wrote: Hello, My ETL uses sparksql to generate parquet files which are served through Thriftserver using hive ql. It especially defines a schema programmatically since the schema can be only known at

Re: persist(MEMORY_ONLY) takes lot of time

2015-04-02 Thread Christian Perez
+1. Caching is way too slow. On Wed, Apr 1, 2015 at 12:33 PM, SamyaMaiti samya.maiti2...@gmail.com wrote: Hi Experts, I have a parquet dataset of 550 MB ( 9 Blocks) in HDFS. I want to run SQL queries repetitively. Few questions : 1. When I do the below (persist to memory after reading

Re: Data locality across jobs

2015-04-02 Thread Sandy Ryza
This isn't currently a capability that Spark has, though it has definitely been discussed: https://issues.apache.org/jira/browse/SPARK-1061. The primary obstacle at this point is that Hadoop's FileInputFormat doesn't guarantee that each file corresponds to a single split, so the records

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Jeremy Freeman
Hm, that will indeed be trickier because this method assumes records are the same byte size. Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once
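If the file does have a regular short, long, short, long structure, each (short, long) pair is a fixed 10-byte record and sc.binaryRecords can split it; the layout, byte order, and paths below are purely assumptions for illustration:

    import java.nio.{ByteBuffer, ByteOrder}

    val records = sc.binaryRecords("hdfs:///data/file.bin", 10) // RDD[Array[Byte]]
    val parsed = records.map { bytes =>
      val buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN)
      (buf.getShort(), buf.getLong()) // decode one 2-byte short and one 8-byte long
    }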

RE: Date and decimal datatype not working

2015-04-02 Thread BASAK, ANANDA
Thanks all. Finally I am able to run my code successfully. It is running in Spark 1.2.1. I will try it on Spark 1.3 too. The major cause of all errors I faced was that the delimiter was not correctly declared. val TABLE_A =

Need a spark mllib tutorial

2015-04-02 Thread Phani Yadavilli -X (pyadavil)
Hi, I am new to the spark MLLib and I was browsing through the internet for good tutorials more advanced than the spark documentation examples. But I did not find any. Need help. Regards Phani Kumar

Re: Need a spark mllib tutorial

2015-04-02 Thread Reza Zadeh
Here's one: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html Reza On Thu, Apr 2, 2015 at 12:51 PM, Phani Yadavilli -X (pyadavil) pyada...@cisco.com wrote: Hi, I am new to the spark MLLib and I was browsing through the internet for good tutorials advanced

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread Michael Quinlan
I was able to hack this on my similar setup issue by running (on the driver) $ sudo hostname ip Where ip is the same value set in the spark.driver.host property. This isn't a solution I would use universally and hope that someone can fix this bug in the distribution. Regards, Mike -- View

Re: Mllib kmeans #iteration

2015-04-02 Thread Joseph Bradley
Check out the Spark docs for that parameter, maxIterations: http://spark.apache.org/docs/latest/mllib-clustering.html#k-means On Thu, Apr 2, 2015 at 4:42 AM, podioss grega...@hotmail.com wrote: Hello, I am running the Kmeans algorithm in cluster mode from Mllib and I was wondering if I could
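A minimal sketch of that parameter in use; vectors is an assumed RDD of mllib Vectors built from your features:

    import org.apache.spark.mllib.clustering.KMeans

    val model = new KMeans()
      .setK(10)
      .setMaxIterations(20) // fixed upper bound on the number of iterations
      .run(vectors)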

Re: input size too large | Performance issues with Spark

2015-04-02 Thread Christian Perez
To Akhil's point, see the Tuning Data Structures guide; avoid the standard collection HashMap. With fewer machines, try running 4 or 5 cores per executor and only 3-4 executors (1 per node): http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/. Ought to reduce shuffle performance

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread jay vyas
yup, a related JIRA is here: https://issues.apache.org/jira/browse/SPARK-5113 which you might want to leave a comment in. We found this can be quite tricky! But there are a host of env variable hacks you can use when launching spark masters/slaves. On Thu, Apr 2, 2015 at 5:18 PM, Michael

RE: Spark SQL. Memory consumption

2015-04-02 Thread java8964
It is hard to say what the reason could be without more detailed information. If you provide some more information, maybe people here can help you better. 1) What is your worker's memory setting? It looks like your nodes have 128G physical memory each, but what do you specify for the worker's

RE: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Michael, thanks for the response and looking forward to trying 1.3.1 From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k,

Re: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread fightf...@163.com
Hi there, you may need to add: import sqlContext.implicits._ Best, Sun fightf...@163.com From: java8964 Date: 2015-04-03 10:15 To: user@spark.apache.org Subject: Cannot run the example in the Spark 1.3.0 following the document I tried to check out Spark SQL 1.3.0. I installed it

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Michael Armbrust
This looks to me like you have incompatible versions of scala on your classpath? On Thu, Apr 2, 2015 at 4:28 PM, Okehee Goh oke...@gmail.com wrote: yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError:

Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
I tried to check out Spark SQL 1.3.0. I installed it following the online document here: http://spark.apache.org/docs/latest/sql-programming-guide.html In the example, it shows something like this: // Select everybody, but increment the age by 1 df.select("name", df("age") + 1).show() //

RE: ArrayBuffer within a DataFrame

2015-04-02 Thread Mohammed Guller
Hint: DF.rdd.map{} Mohammed From: Denny Lee [mailto:denny.g@gmail.com] Sent: Thursday, April 2, 2015 7:10 PM To: user@spark.apache.org Subject: ArrayBuffer within a DataFrame Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd
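A sketch of the hinted RDD route; the column positions and types are assumptions based on the output shown:

    val flattened = df.rdd.flatMap { row =>
      val month = row.getString(0)          // e.g. "2015-04"
      val items = row.getAs[Seq[String]](1) // the ArrayBuffer column
      items.map(item => (month, item))      // one pair per element
    }
    // yields ("2015-04", "A"), ("2015-04", "B"), ...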

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread a mesar
Yes, with spark.cleaner.ttl set there is no cleanup. We pass --properties-file spark-dev.conf to spark-submit where spark-dev.conf contains: spark.master spark://10.250.241.66:7077 spark.logConf true spark.cleaner.ttl 1800 spark.executor.memory 10709m spark.cores.max 4

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Okehee Goh
Michael, You are right. The build brought org.scala-lang:scala-library:2.10.1 from other package (as below). It works fine after excluding the old scala version. Thanks a lot, Okehee == dependency: |+--- org.apache.kafka:kafka_2.10:0.8.1.1 ||+---

maven compile error

2015-04-02 Thread myelinji
Hi, all: Just now I checked out spark-1.2 on GitHub and wanted to build it with Maven; however, I encountered an error during compilation: [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile

[SQL] Simple DataFrame questions

2015-04-02 Thread Yana Kadiyska
Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package. 1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not setter on DF 2. I am trying

RE: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread java8964
The import command was already run. Forgot to mention, the rest of the examples related to df all work; just this one caused a problem. Thanks Yong Date: Fri, 3 Apr 2015 10:36:45 +0800 From: fightf...@163.com To: java8...@hotmail.com; user@spark.apache.org Subject: Re: Cannot run the example in the

Re: Generating a schema in Spark 1.3 failed while using DataTypes.

2015-04-02 Thread Okehee Goh
yes, below is the stacktrace. Thanks, Okehee java.lang.NoSuchMethodError: scala.reflect.NameTransformer$.LOCAL_SUFFIX_STRING()Ljava/lang/String; at scala.reflect.internal.StdNames$CommonNames.<init>(StdNames.scala:97) at

Re: Spark SQL does not read from cached table if table is renamed

2015-04-02 Thread Michael Armbrust
I'll add we just back ported this so it'll be included in 1.2.2 also. On Wed, Apr 1, 2015 at 4:14 PM, Michael Armbrust mich...@databricks.com wrote: This is fixed in Spark 1.3. https://issues.apache.org/jira/browse/SPARK-5195 On Wed, Apr 1, 2015 at 4:05 PM, Judy Nash

Re: Spark-events does not exist error, while it does with all the req. rights

2015-04-02 Thread Marcelo Vanzin
FYI I wrote a small test to try to reproduce this, and filed SPARK-6688 to track the fix. On Tue, Mar 31, 2015 at 1:15 PM, Marcelo Vanzin van...@cloudera.com wrote: Hmmm... could you try to set the log dir to file:/home/hduser/spark/spark-events? I checked the code and it might be the case

RE: Reading a large file (binary) into RDD

2015-04-02 Thread java8964
I think implementing your own InputFormat and using SparkContext.hadoopFile() is the best option for your case. Yong From: kvi...@vt.edu Date: Thu, 2 Apr 2015 17:31:30 -0400 Subject: Re: Reading a large file (binary) into RDD To: freeman.jer...@gmail.com CC: user@spark.apache.org The file has a

Re: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread Todd Nist
Hi Young, Sorry for the duplicate post, want to reply to all. I just downloaded the prebuilt bits from the apache spark download site. Started the spark shell and got the same error. I then started the shell as follows: ./bin/spark-shell --master spark://radtech.io:7077 --total-executor-cores 2

Re: [SQL] Simple DataFrame questions

2015-04-02 Thread Yin Huai
For cast, you can use the selectExpr method. For example, df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2"). Or, df.select(df("colA").cast("int"), ...) On Thu, Apr 2, 2015 at 8:33 PM, Michael Armbrust mich...@databricks.com wrote: val df = Seq(("test", 1)).toDF("col1", "col2") You can
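The two styles side by side, as a short sketch over an assumed string-typed DataFrame with columns col1 and col2:

    val typed = df.selectExpr("cast(col1 as int) as col1", "cast(col2 as bigint) as col2")
    // or, with the column API:
    val typed2 = df.select(df("col1").cast("int").as("col1"), df("col2").cast("bigint").as("col2"))
    typed.printSchema() // verify the new column types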

Re: Spark Streaming Worker runs out of inodes

2015-04-02 Thread Tathagata Das
Are you saying that even with the spark.cleaner.ttl set your files are not getting cleaned up? TD On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote: Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see

RE: Spark SQL 1.3.0 - spark-shell error : HiveMetastoreCatalog.class refers to term cache in package com.google.common which is not available

2015-04-02 Thread java8964
Hmm, I just tested my own Spark 1.3.0 build. I have the same problem, but I cannot reproduce it on Spark 1.2.1 If we check the code change below: Spark 1.3 branch: https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala vs Spark

ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Quick question - the output of a dataframe is in the format of: [2015-04, ArrayBuffer(A, B, C, D)] and I'd like to return it as: 2015-04, A 2015-04, B 2015-04, C 2015-04, D What's the best way to do this? Thanks in advance!

Re: How to learn Spark ?

2015-04-02 Thread luohui20001
The best way of learning spark is to use spark. You may follow the instructions on the apache spark website: http://spark.apache.org/docs/latest/ Download -> deploy it in standalone mode -> run some examples -> try cluster deploy mode -> then try to develop your own app and deploy it in your spark cluster.

Re: Connection pooling in spark jobs

2015-04-02 Thread Ted Yu
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to be Spark specific, btw On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case that we will have to run concurrent jobs (for the same algorithm) on

Re: Error in SparkSQL/Scala IDE

2015-04-02 Thread Dean Wampler
It failed to find the class org.apache.spark.sql.catalyst.ScalaReflection in the Spark SQL library. Make sure it's in the classpath and the version is correct, too. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
yes! thank you very much:-) On Apr 2, 2015, at 7:13 PM, Sean Owen so...@cloudera.com wrote: Right, I asked because in your original message, you were looking at the initialization to a random vector. But that is the initial state, not final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong

Error in SparkSQL/Scala IDE

2015-04-02 Thread Sathish Kumaran Vairavelu
Hi Everyone, I am getting the following error while registering a table using Scala IDE. Please let me know how to resolve this error. I am using Spark 1.2.1 import sqlContext.createSchemaRDD val empFile = sc.textFile("/tmp/emp.csv", 4).map(_.split(","))

Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Right, I am aware on how to use connection pooling with oracle, but the specific question is how to use it in the context of spark job execution On 2 Apr 2015 17:41, Ted Yu yuzhih...@gmail.com wrote: http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm The question doesn't seem to

[SparkSQL 1.3.0] Cannot resolve column name SUM('p.q) among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Hi, I want to rename an aggregation field using DataFrame API. The aggregation is done on a nested field. But I got below exception. Do you see the same issue and any workaround? Thank you very much! == Exception in thread main org.apache.spark.sql.AnalysisException: Cannot resolve

Re: Error reading smallin in hive table with parquet format

2015-04-02 Thread Masf
No, my company is using Cloudera distributions, and 1.2.0 is the latest version of Spark available. Thanks On Wed, Apr 1, 2015 at 8:08 PM, Michael Armbrust mich...@databricks.com wrote: Can you try with Spark 1.3? Much of this code path has been rewritten / improved in this version. On Wed, Apr

Matrix Transpose

2015-04-02 Thread Spico Florin
Hello! I have a CSV file that has the following content: C1;C2;C3 11;22;33 12;23;34 13;24;35 What is the best approach to use Spark (API, MLLib) for achieving the transpose of it? C1 11 12 13 C2 22 23 24 C3 33 34 35 I look forward to your solutions and suggestions (some Scala code will be
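One possible plain-RDD approach, sketched under the assumption that the matrix is small enough for a per-column groupByKey:

    val rows = sc.textFile("matrix.csv").map(_.split(";"))
    val transposed = rows
      .zipWithIndex()                      // remember each row's position
      .flatMap { case (cols, rowIdx) =>
        cols.zipWithIndex.map { case (value, colIdx) => (colIdx, (rowIdx, value)) }
      }
      .groupByKey()                        // gather each output row (= input column)
      .sortByKey()
      .map { case (_, cells) => cells.toSeq.sortBy(_._1).map(_._2).mkString(" ") }
    transposed.collect().foreach(println)  // C1 11 12 13 / C2 22 23 24 / C3 33 34 35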

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
NO, I’m referring to the result. You mean there might be so many zero features in the als result? I think it is not related to the initial state, but I do not know why the percentage of zero-vectors is so high (around 50%). On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote: You're referring

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread Sean Owen
Right, I asked because in your original message, you were looking at the initialization to a random vector. But that is the initial state, not final state. On Thu, Apr 2, 2015 at 11:51 AM, lisendong lisend...@163.com wrote: NO, I’m referring to the result. You mean there might be so many zero

Mllib kmeans #iteration

2015-04-02 Thread podioss
Hello, I am running the Kmeans algorithm in cluster mode from Mllib and I was wondering if I could run the algorithm with a fixed number of iterations in some way. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-kmeans-iteration-tp22353.html

Re: How to learn Spark ?

2015-04-02 Thread prabeesh k
You can also refer to this blog: http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote: Hi, all I am new to here. Could you give me some suggestion to learn Spark ? Thanks. Best Regards, Star Guo

Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Hi, We have a case that we will have to run concurrent jobs (for the same algorithm) on different data sets. And these jobs can run in parallel and each one of them would be fetching the data from the database. We would like to optimize the database connections by making use of connection

Re: Issue on Spark SQL insert or create table with Spark running on AWS EMR -- s3n.S3NativeFileSystem: rename never finished

2015-04-02 Thread Wollert, Fabian
Hey Christopher, I'm working with Teng on this issue. Thank you for the explanation. I tried both workarounds: just leaving hive.metastore.warehouse.dir empty is not doing anything. Still the tmp data is written to S3 and the job attempts to rename/copy+delete from S3 to S3. But anyway, since

Re: there are about 50% all-zero vector in the als result

2015-04-02 Thread lisendong
Oh, I found the reason. According to the ALS optimization formula: if all of a user's ratings are zero, that is, R(i, Ii) is a zero matrix, then the final feature vector of this user will be an all-zero vector… On Apr 2, 2015, at 6:08 PM, Sean Owen so...@cloudera.com wrote: You're referring to the

Re: How to learn Spark ?

2015-04-02 Thread Vadim Bichutskiy
You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo st...@ceph.me wrote: Hi, all I am new to here. Could you give me some suggestion to learn Spark ?

Re: How to learn Spark ?

2015-04-02 Thread Dean Wampler
I have a self-study workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe http://typesafe.com @deanwampler http://twitter.com/deanwampler

Re: Connection pooling in spark jobs

2015-04-02 Thread Charles Feduke
How long does each executor keep the connection open for? How many connections does each executor open? Are you certain that connection pooling is a performant and suitable solution? Are you running out of resources on the database server and cannot tolerate each executor having a single

Re: ArrayBuffer within a DataFrame

2015-04-02 Thread Denny Lee
Thanks Michael - that was it! I was drawing a blank on this one for some reason - much appreciated! On Thu, Apr 2, 2015 at 8:27 PM Michael Armbrust mich...@databricks.com wrote: A lateral view explode using HiveQL. I'm hoping to add explode shorthand directly to the df API in 1.4. On

RE: Cannot run the example in the Spark 1.3.0 following the document

2015-04-02 Thread Michael Armbrust
Looks like a typo, try: df.select(df("name"), df("age") + 1) Or: df.select("name", "age") PRs to fix docs are always appreciated :) On Apr 2, 2015 7:44 PM, java8964 java8...@hotmail.com wrote: The import command was already run. Forgot to mention, the rest of the examples related to df all

Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
But this basically means that the pool is confined to the job (of a single app) in question, but is not sharable across multiple apps? The setup we have is a job server (the spark-jobserver) that creates jobs. Currently, we have each job opening and closing a connection to the database. What we

Fwd:

2015-04-02 Thread Himanish Kushary
Actually they may not be sequentially generated and also the list (RDD) could come from a different component. For example from this RDD : (105,918) (105,757) (502,516) (105,137) (516,816) (350,502) I would like to separate into two RDD's : 1) (105,918) (502,516) 2) (105,757)
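One reading of the requirement, as a sketch: route the i-th occurrence of each key to sub-RDD i, so no key repeats within a sub-RDD (this interpretation, and the ordering within each group, are assumptions):

    val indexed = pairs.groupByKey().flatMap { case (k, vs) =>
      vs.zipWithIndex.map { case (v, i) => (i, (k, v)) } // i = occurrence number
    }
    val first  = indexed.filter(_._1 == 0).values // one pair per distinct key
    val second = indexed.filter(_._1 == 1).values // second occurrences, e.g. (105,757)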

Matei Zaharia: Reddit Ask Me Anything

2015-04-02 Thread ben lorica
Ask Me Anything about Apache Spark and big data: Reddit AMA with Matei Zaharia, Friday, April 3 at 9AM PT / 12PM ET. Details can be found here: http://strataconf.com/big-data-conference-uk-2015/public/content/reddit-ama -- View this message in context:

Re: Connection pooling in spark jobs

2015-04-02 Thread Sateesh Kavuri
Each executor runs for about 5 secs, during which time the db connection can potentially be open. Each executor will have 1 connection open. Connection pooling surely has its advantages of performance and not hitting the dbserver for every open/close. The database in question is not just used by the

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have a large overhead (much more than actual computation time). Also, I am short of memory at the driver when it has to

Re: workers no route to host

2015-04-02 Thread Dean Wampler
It appears you are using a Cloudera Spark build, 1.3.0-cdh5.4.0-SNAPSHOT, which expects to find the hadoop command: /data/PlatformDep/cdh5/dist/bin/compute-classpath.sh: line 164: hadoop: command not found If you don't want to use Hadoop, download one of the pre-built Spark releases from

conversion from java collection type to scala JavaRDD<Object>

2015-04-02 Thread Jeetendra Gangele
Hi All, Is there a way to make a JavaRDD<Object> from an existing java collection type List<Object>? I know this can be done using scala, but I am looking for how to do this using java. Regards Jeetendra

Re: Spark, snappy and HDFS

2015-04-02 Thread Nick Travers
Thanks all. I was able to get the decompression working by adding the following to my spark-env.sh script: export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nickt/lib/hadoop/lib/native export

Spark SQL. Memory consumption

2015-04-02 Thread Masf
Hi. I'm using Spark SQL 1.2. I have this query: CREATE TABLE test_MA STORED AS PARQUET AS SELECT field1 ,field2 ,field3 ,field4 ,field5 ,COUNT(1) AS field6 ,MAX(field7) ,MIN(field8) ,SUM(field9 / 100) ,COUNT(field10) ,SUM(IF(field11 -500, 1, 0)) ,MAX(field12) ,SUM(IF(field13 = 1, 1, 0))

Re: How to learn Spark ?

2015-04-02 Thread Star Guo
Thank you! I will begin with it. Best Regards, Star Guo I have a self-study workshop here: https://github.com/deanwampler/spark-workshop dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Spark Streaming Error in block pushing thread

2015-04-02 Thread byoung
I am running a spark streaming stand-alone cluster, connected to rabbitmq endpoint(s). The application will run for 20-30 minutes before failing with the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied: spark.executor.memory 21475m spark.cores.max 16 spark.driver.memory 5235m On Thu, Apr 2, 2015 at 11:50 AM, Dean Wampler deanwamp...@gmail.com wrote:

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
Sorry for the obvious typo, I have 4 workers with 16 cores total* On Thu, Apr 2, 2015 at 11:56 AM, Bill Young bill.yo...@threatstack.com wrote: Thank you for the response, Dean. There are 2 worker nodes, with 8 cores total, attached to the stream. I have the following settings applied:

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Peter Rudenko
Hi, try this code: val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map { case Row(feature1, feature2, ..., label) => LabeledPoint(label, Vectors.dense(feature1, feature2, ...)) } Thanks, Peter Rudenko On 2015-04-02 17:17, drarse wrote: Hello! I have had a question for days.

Re How to learn Spark ?

2015-04-02 Thread Star Guo
So cool !! Thanks. Best Regards, Star Guo = You can also refer this blog http://blog.prabeeshk.com/blog/archives/ On 2 April 2015 at 12:19, Star Guo st...@ceph.me wrote: Hi, all I am new to here. Could you give me some suggestion to

Re: conversion from java collection type to scala JavaRDD<Object>

2015-04-02 Thread Dean Wampler
Use JavaSparkContext.parallelize. http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkContext.html#parallelize(java.util.List) Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly) Typesafe

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Dean Wampler
Are you allocating 1 core per input stream plus additional cores for the rest of the processing? Each input stream Reader requires a dedicated core. So, if you have two input streams, you'll need local[3] at least. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition

Re: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-02 Thread Dean Wampler
To clarify one thing, is count() the first action ( http://spark.apache.org/docs/latest/programming-guide.html#actions) you're attempting? As defined in the programming guide, an action forces evaluation of the pipeline of RDDs. It's only then that reading the data actually occurs. So, count()

Spark Streaming Worker runs out of inodes

2015-04-02 Thread andrem
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and the worker nodes eventually run out of inodes. We see tons of old shuffle_*.data and *.index files that are never deleted. How do we get Spark to remove these files? We have a simple standalone app with one RabbitMQ

Re: Re: How to learn Spark ?

2015-04-02 Thread Star Guo
Thanks a lot. I will follow your suggestion. Best Regards, Star Guo = The best way of learning spark is to use spark you may follow the instruction of apache spark website.http://spark.apache.org/docs/latest/ download-deploy it in standalone mode-run some

A stream of json objects using Java

2015-04-02 Thread James King
I'm reading a stream of string lines that are in json format. I'm using Java with Spark. Is there a way to get this from a transformation, so that I end up with a stream of JSON objects? I would also welcome any feedback about this approach or alternative approaches. thanks jk

Spark streaming error in block pushing thread

2015-04-02 Thread Bill Young
I am running a standalone Spark streaming cluster, connected to multiple RabbitMQ endpoints. The application will run for 20-30 minutes before raising the following error: WARN 2015-04-01 21:00:53,944 org.apache.spark.storage.BlockManagerMaster.logWarning.71: Failed to remove RDD 22 - Ask timed

RE: Spark 1.3.0 DataFrame count() method throwing java.io.EOFException

2015-04-02 Thread Ashley Rose
That’s precisely what I was trying to check. It should have 42577 records in it, because that’s how many there were in the text file I read in. // Load a text file and convert each line to a JavaBean. JavaRDD<String> lines = sc.textFile("file.txt"); JavaRDD<BERecord> tbBER =

From DataFrame to LabeledPoint

2015-04-02 Thread drarse
Hello! I have had a question for days. I am working with DataFrames and with Spark SQL. I imported a jsonFile: val df = sqlContext.jsonFile("file.json") In this json I have the label and the features. I selected them: val features = df.select("feature1", "feature2", "feature3", ...); val labels =

Re: A stream of json objects using Java

2015-04-02 Thread Sean Owen
This just reduces to finding a library that can translate a String of JSON into a POJO, Map, or other representation of the JSON. There are loads of these, like Gson or Jackson. Sure, you can easily use these in a function that you apply to each JSON string in each line of the file. It's not
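A sketch of that idea with Jackson (shown in Scala for brevity; the same pattern applies from the Java API), assuming jackson-module-scala is on the classpath:

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.module.scala.DefaultScalaModule

    val jsonObjects = lines.mapPartitions { iter =>
      // one mapper per partition: ObjectMapper is costly to create per record
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      iter.map(line => mapper.readValue(line, classOf[Map[String, Any]]))
    }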

Re: Setup Spark jobserver for Spark SQL

2015-04-02 Thread Daniel Siegmann
You shouldn't need to do anything special. Are you using a named context? I'm not sure those work with SparkSqlJob. By the way, there is a forum on Google groups for the Spark Job Server: https://groups.google.com/forum/#!forum/spark-jobserver On Thu, Apr 2, 2015 at 5:10 AM, Harika

Re: Connection pooling in spark jobs

2015-04-02 Thread Cody Koeninger
Connection pools aren't serializable, so you generally need to set them up inside of a closure. Doing that for every item is wasteful, so you typically want to use mapPartitions or foreachPartition: rdd.mapPartitions { part => setupPool; part.map { ... See Design Patterns for using foreachRDD in
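A sketch of the foreachPartition variant of that pattern; ConnectionPool is a placeholder for your own lazily initialized pool (e.g. a singleton object wrapping a JDBC DataSource):

    rdd.foreachPartition { partition =>
      val conn = ConnectionPool.getConnection() // obtained once per partition
      try {
        partition.foreach { record =>
          // ... write record using conn ...
        }
      } finally {
        conn.close() // return the connection to the pool
      }
    }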

A problem with Spark 1.3 artifacts

2015-04-02 Thread Jacek Lewandowski
A very simple example which works well with Spark 1.2, but fails to compile with Spark 1.3: build.sbt: name := "untitled" version := "1.0" scalaVersion := "2.10.4" libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" Test.scala: package org.apache.spark.metrics import

Re: How to learn Spark ?

2015-04-02 Thread Star Guo
Yes, I just searched for it! Best Regards, Star Guo == You can start with http://spark.apache.org/docs/1.3.0/index.html Also get the Learning Spark book http://amzn.to/1NDFI5x. It's great. Enjoy! Vadim On Thu, Apr 2, 2015 at 4:19 AM, Star Guo

Spark + Kinesis

2015-04-02 Thread Vadim Bichutskiy
Hi all, I am trying to write an Amazon Kinesis consumer Scala app that processes data in the Kinesis stream. Is this the correct way to specify build.sbt: --- import AssemblyKeys._ name := "Kinesis Consumer" version := "1.0" organization := "com.myconsumer" scalaVersion :=
