Re: spark streaming rate limiting from kafka

2014-07-20 Thread Bill Jay
Hi Tobias, It seems that repartition can create more executors for the stages following data receiving. However, the number of executors is still far less than what I require (I specify one core for each executor). Based on the index of the executors in the stage, I find many numbers are missing
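
For context, a minimal sketch of the receive-then-repartition pattern under discussion, assuming the Spark 1.0.x Kafka receiver API; the ZooKeeper quorum, consumer group, and topic below are placeholders. Note that repartition() spreads the received data over more partitions (and therefore more parallel tasks) in the following stages; the number of executors themselves is fixed by the cluster configuration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaRepartitionSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder ZooKeeper quorum, consumer group and topic map.
    val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
      .map(_._2)          // keep only the message payload
      .repartition(32)    // spread received data over 32 partitions for downstream stages

    lines.count().print()
    ssc.start()
    ssc.awaitTermination()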

Re: spark1.0.1 hadoop2.2.0 issue

2014-07-20 Thread Debasish Das
Yup... the Scala version 2.11.0 caused it... with 2.10.4, I could compile both 1.0.1 and HEAD for Hadoop 2.3.0-cdh5.0.2. On Sat, Jul 19, 2014 at 8:14 PM, Debasish Das debasish.da...@gmail.com wrote: I compiled spark 1.0.1 with 2.3.0-cdh5.0.2 today... No issues with mvn compilation but my sbt build

Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-20 Thread Chris DuBois
Using the spark-ec2 script with m3.2xlarge instances seems to not have /mnt and /mnt2 pointing to the 80gb SSDs that come with that instance. Does anybody know whether extra steps are required when using this instance type? Thanks, Chris

Re: Out of any idea

2014-07-20 Thread boci
Hi, I created a demo input: https://gist.github.com/b0c1/e3721af839feec433b56#file-gistfile1-txt-L10 As you can see at line 10, the JSON is received (JSON or string, it doesn't matter). After that everything is OK, except the processing never starts... Any idea? Please help, guys... I don't have any idea what I

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-20 Thread Victor Sheng
Hi Michael, I only modified the default Hadoop version to 0.20.2-cdh3u5 and set DEFAULT_HIVE=true in SparkBuild.scala, then ran sbt/sbt assembly. I just run in local standalone mode using sbin/start-all.sh. The Hadoop version is 0.20.2-cdh3u5. Then I use spark-shell to execute the spark

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-20 Thread chutium
like this: val sc = new SparkContext(new SparkConf().setAppName("SLA Filter")) val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext._ val suffix = args(0) sqlContext.parquetFile("/user/hive/warehouse/xxx_parquet.db/xx001_" +
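
A self-contained sketch of the pattern shown in this snippet, with placeholder path, table name, and filter (the real ones are truncated in the message above):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("SLA Filter"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._

    // Placeholder parquet path; the original builds it from args(0).
    val records = sqlContext.parquetFile("/user/hive/warehouse/some_parquet_table")
    records.registerAsTable("records")

    // Placeholder filter query (Spark 1.0.x SchemaRDD API).
    val filtered = sqlContext.sql("SELECT * FROM records WHERE sla_ms > 1000")
    println(filtered.count())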

JDBC Connections / newbie question

2014-07-20 Thread Ahmed Ibrahim
Hi All, In a Java-based scenario where we have a large Oracle DB and want to use Spark to do some distributed analysis on the data, how exactly do we go about defining a JDBC connection and querying the data? Thanks, -- Ahmed Osama Ibrahim ITSC International

RDD.pipe(...)

2014-07-20 Thread jay vyas
According to the API docs for the pipe operator, def pipe(command: String): RDD[String] (http://spark.apache.org/docs/1.0.0/api/scala/org/apache/spark/rdd/RDD.html): Return an RDD created by piping elements to a forked external process. However, it's not clear to me: will the resulting RDD capture

Re: RDD.pipe(...)

2014-07-20 Thread jay vyas
Nevermind :) I found my answer in the docs for the PipedRDD /** * An RDD that pipes the contents of each parent partition through an external command * (printing them one per line) and returns the output as a collection of strings. */ private[spark] class PipedRDD[T: ClassTag]( So, this is
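
To make the pipe semantics concrete, a small sketch (not from the thread): each element of the parent partition is written to the command's stdin, one per line, and the resulting RDD[String] holds the lines the command prints to stdout.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("PipeSketch").setMaster("local[2]"))

    val nums = sc.parallelize(Seq("1", "2", "3", "4"), 2)

    // Each partition is streamed through `grep`; the output RDD captures
    // whatever the external process writes to stdout, one element per line.
    val piped = nums.pipe("grep -v 3")

    piped.collect().foreach(println)   // prints 1, 2, 4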

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-20 Thread Matei Zaharia
Is this with the 1.0.0 scripts? I believe it's fixed in 1.0.1. Matei On Jul 20, 2014, at 1:22 AM, Chris DuBois chris.dub...@gmail.com wrote: Using the spark-ec2 script with m3.2xlarge instances seems to not have /mnt and /mnt2 pointing to the 80gb SSDs that come with that instance. Does

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-20 Thread Chris DuBois
I pulled the latest last night. I'm on commit 4da01e3. On Sun, Jul 20, 2014 at 2:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Is this with the 1.0.0 scripts? I believe it's fixed in 1.0.1. Matei On Jul 20, 2014, at 1:22 AM, Chris DuBois chris.dub...@gmail.com wrote: Using the

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-20 Thread Kevin Jung
Hi Victor, I got the same issue and posted about it. In my case, it only happens when I run some spark-sql queries on Spark 1.0.1; on Spark 1.0.0 it works properly. Have you run the same job on Spark 1.0.0? Sincerely, Kevin

RE: Hive From Spark

2014-07-20 Thread Cheng, Hao
JiaJia, I've checked out the latest 1.0 branch and then done the following steps: SPARK_HIVE=true sbt/sbt clean assembly cd examples ../bin/run-example sql.hive.HiveFromSpark It works well on my local machine. From your log output, it shows Invalid method name: 'get_table', which seems to be an incompatible jar version
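
For readers not building from source, a minimal sketch of what sql.hive.HiveFromSpark exercises, assuming a Spark 1.0.x assembly built with Hive support (the table below is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("HiveSketch"))
    val hiveContext = new HiveContext(sc)
    import hiveContext._

    // hql() is the HiveQL entry point in Spark 1.0.x; the table is a placeholder.
    hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    hql("SELECT COUNT(*) FROM src").collect().foreach(println)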

Re: Large Task Size?

2014-07-20 Thread Xiangrui Meng
It was because of the latest change to task serialization: https://github.com/apache/spark/commit/1efb3698b6cf39a80683b37124d2736ebf3c9d9a The task size is no longer limited by akka.frameSize but we show warning messages if the task size is above 100KB. Please check the objects referenced in the
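
A hedged illustration of the usual remedy when this warning appears (not from the original message): if the closure captures a large driver-side object, broadcast it so each task only carries a small handle.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("TaskSizeSketch"))

    // Hypothetical large lookup table built on the driver.
    val bigLookup: Map[Int, String] = (1 to 1000000).map(i => i -> ("v" + i)).toMap

    // Referencing bigLookup directly in the closure would serialize it into
    // every task (triggering the >100KB warning); broadcasting ships it once
    // per executor instead.
    val bcLookup = sc.broadcast(bigLookup)

    val hits = sc.parallelize(1 to 1000)
      .map(i => bcLookup.value.getOrElse(i, "missing"))
      .count()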

JDBCRDD / Example

2014-07-20 Thread Ahmed Ibrahim
Hi Guys, Any simplistic example for JDBCRDD for a newbie? -- Ahmed Osama Ibrahim ITSC International Technology Services Corporation www.itscorpmd.com Tel: +1 240 685 1444 Fax: +1 240 668 9841
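
A minimal sketch of org.apache.spark.rdd.JdbcRDD as of Spark 1.0.x, with placeholder connection details and query; note the SQL must contain two '?' parameters, which JdbcRDD binds to the key range of each partition.

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    val sc = new SparkContext(new SparkConf().setAppName("JdbcRddSketch"))

    // Placeholder Oracle JDBC URL and credentials.
    val url = "jdbc:oracle:thin:@//dbhost:1521/ORCL"

    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url, "user", "password"),
      // The two '?' placeholders are bound to each partition's key range.
      "SELECT id, name FROM customers WHERE id >= ? AND id <= ?",
      1L, 100000L,   // overall key range (placeholder)
      4,             // number of partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )

    rows.take(10).foreach(println)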

What does @DeveloperApi mean?

2014-07-20 Thread 我是will
Hello, what does @DeveloperApi mean? I saw it appear many times in the Spark source code. Thanks

Re: What does @DeveloperApi mean?

2014-07-20 Thread Stephen Boesch
The javaDoc seems reasonably helpful: /** * A lower-level, unstable API intended for developers. * * Developer API's might change or be removed in minor versions of Spark. * */ These would be contrasted with non-Developer (more or less production?) API's that are deemed to be stable within a
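
As one concrete example (my own sketch, not part of the reply), SparkListener is a @DeveloperApi-annotated interface: application code may use it, but its shape can shift between minor releases.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // A listener built against a @DeveloperApi interface; pin your Spark
    // version, since the interface may change in minor releases.
    class TaskLogger extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println("task finished: " + taskEnd.taskInfo.taskId)
    }

    val sc = new SparkContext(new SparkConf().setAppName("DeveloperApiSketch").setMaster("local[2]"))
    sc.addSparkListener(new TaskLogger)
    sc.parallelize(1 to 10).count()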

Re: Graphx : Perfomance comparison over cluster

2014-07-20 Thread Ankur Dave
On Fri, Jul 18, 2014 at 9:07 PM, ShreyanshB shreyanshpbh...@gmail.com wrote: Does the suggested version with in-memory shuffle affect performance too much? We've observed a 2-3x speedup from it, at least on larger graphs like twitter-2010 http://law.di.unimi.it/webdata/twitter-2010/ and

which kind of BlockId should I use?

2014-07-20 Thread william
On Spark 0.7.3, I used SparkEnv.get.blockManager.getLocal(model) and SparkEnv.get.blockManager.put(model, buf, StorageLevel.MEMORY_ONLY, false) to cache a model object. When porting to Spark 1.0.1, I found that the APIs of SparkEnv.get.blockManager.getLocal and SparkEnv.get.blockManager.put changed

Re: which kind of BlockId should I use?

2014-07-20 Thread Aaron Davidson
Hm, this is not a public API, but you should theoretically be able to use TestBlockId if you like. Internally, we just use the BlockId's natural hashing and equality to do lookups and puts, so it should work fine. However, since it is in no way public API, it may change even in maintenance
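
To make that concrete, a hedged sketch (my own, not Aaron's); it assumes the Spark 1.0.x BlockManager still exposes putSingle/getSingle, and being internal API these can change in any release.

    import org.apache.spark.SparkEnv
    import org.apache.spark.storage.{StorageLevel, TestBlockId}

    // Placeholder object to cache.
    val model = Map("weights" -> Array(0.1, 0.2, 0.3))

    // Internal, non-public API: lookups and puts just use BlockId equality/hashing.
    val blockManager = SparkEnv.get.blockManager
    val blockId = TestBlockId("model")

    blockManager.putSingle(blockId, model, StorageLevel.MEMORY_ONLY, tellMaster = false)

    // Later, on the same executor:
    val cached = blockManager.getSingle(blockId)   // Option[Any]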

Re: Error with spark-submit (formatting corrected)

2014-07-20 Thread ranjanp
Thanks for your help; problem resolved. As pointed out by Andrew and Meethu, I needed to use spark://vmsparkwin1:7077 rather than the equivalent spark://10.1.3.7:7077 in the spark-submit command. It appears that the argument in the --master option for the spark-submit must match exactly (not
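
The same exact-match requirement applies when setting the master programmatically (a hedged sketch, reusing the hostname from this thread as a placeholder): use the spark:// URL the standalone master shows at the top of its web UI, not an IP that merely resolves to it.

    import org.apache.spark.{SparkConf, SparkContext}

    // Must match the master's advertised URL character-for-character.
    val conf = new SparkConf()
      .setAppName("MasterUrlSketch")
      .setMaster("spark://vmsparkwin1:7077")   // placeholder hostname from the thread

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).count())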

Re: What does @DeveloperApi mean?

2014-07-20 Thread william
Thank you, Stephen -- Original Message -- From: Stephen Boesch; java...@gmail.com Sent: Monday, July 21, 2014, 11:55 AM To: user user@spark.apache.org Subject: Re: What does @DeveloperApi mean? The javaDoc seems reasonably helpful: /** * A lower-level, unstable API intended for

Re: which kind of BlockId should I use?

2014-07-20 Thread william
Thank you, Aaron -- Original Message -- From: Aaron Davidson; ilike...@gmail.com Sent: Monday, July 21, 2014, 1:40 PM To: user user@spark.apache.org Subject: Re: which kind of BlockId should I use? Hm, this is not a public API, but you should theoretically be able to use