Re: Python script runs fine in local mode, errors in other modes

2015-03-08 Thread Davies Liu
I got another report about this recently, and figured out that it's caused by having different versions of Python on the driver and on YARN: http://stackoverflow.com/questions/28879803/spark-runs-in-local-but-not-in-yarn/28931934#28931934 Created JIRA:

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Ted Yu
Cycling related bits: http://search-hadoop.com/m/LgpTk2DLMvc On Sun, Mar 8, 2015 at 2:29 PM, Nasir Khan nasirkhan.onl...@gmail.com wrote: Hi, I am going to submit a proposal to my university to set up my standalone Spark cluster; what hardware should I include in my proposal? I will be

General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Nasir Khan
Hi, I am going to submit a proposal to my university to set up my standalone Spark cluster; what hardware should I include in my proposal? I will be working on classification (Spark MLlib) of data streams (Spark Streaming). If somebody can fill in these answers, that will be great! Thanks *Cores *=

Re: Bulk insert strategy

2015-03-08 Thread Ashrafuzzaman
Yes, so that brings me to another question: how do I do a batch insert from the worker? In prod we are planning to set up a 3-shard Kinesis stream, so the number of partitions should be 3, right? On Mar 8, 2015 8:57 PM, Ted Yu yuzhih...@gmail.com wrote: What's the expected number of partitions in your use
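
A minimal sketch of what "batching in the workers" could look like: write per partition with foreachPartition and group records into batches. BulkSink, makeSink and batchSize are made-up stand-ins for the real Kinesis or database client; note also that the number of RDD partitions is not necessarily equal to the number of Kinesis shards, since it depends on how the receiver generates blocks.

    import org.apache.spark.streaming.dstream.DStream

    // Stand-in for whatever bulk-write client is actually used (Kinesis, a database, ...).
    trait BulkSink {
      def writeBatch(records: Seq[String]): Unit
      def close(): Unit
    }

    // Batch inserts on the workers: one client per partition, records grouped into batches.
    def batchInsert(stream: DStream[String],
                    makeSink: () => BulkSink,
                    batchSize: Int = 500): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val sink = makeSink()   // created on the worker, never on the driver
          try {
            records.grouped(batchSize).foreach(batch => sink.writeBatch(batch))
          } finally {
            sink.close()
          }
        }
      }
    }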

Re: Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-08 Thread Akhil Das
Usually, when you use different versions of jars, it will throw incompatible-version errors. Thanks Best Regards On Fri, Mar 6, 2015 at 7:38 PM, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi, I submit spark jobs in yarn-cluster mode remotely from java code by calling

Re: Help with transformWith in SparkStreaming

2015-03-08 Thread Akhil Das
You could do it like this: val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => { var first = ""; var second = ""; var third = 0
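
For completeness, a self-contained sketch along the same lines; the (file, time) and anomaly types match the snippet, but the way the two batches are combined here is invented for illustration.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // fileAndTime is a DStream[(String, String)] and anomaly a DStream[Int], as above.
    def combine(fileAndTime: DStream[(String, String)],
                anomaly: DStream[Int]): DStream[(String, String, Int)] =
      fileAndTime.transformWith(anomaly, (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
        // Tag every (file, time) record with the first anomaly value of the batch, if any.
        val score = rdd2.take(1).headOption.getOrElse(0)
        rdd1.map { case (file, time) => (file, time, score) }
      })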

Re: distcp on ec2 standalone spark cluster

2015-03-08 Thread Akhil Das
Did you follow these steps? https://wiki.apache.org/hadoop/AmazonS3 Also make sure your JobTracker/MapReduce processes are running fine. Thanks Best Regards On Sun, Mar 8, 2015 at 7:32 AM, roni roni.epi...@gmail.com wrote: Did you get this to work? I got past the issues with the cluster not

A way to share RDD directly using Tachyon?

2015-03-08 Thread Yijie Shen
Hi, I would like to share an RDD across several Spark applications, i.e., create one in application A, publish its ID somewhere, and get the RDD back directly using that ID in application B. I know I can use Tachyon just as a filesystem, e.g. s.saveAsTextFile("tachyon://localhost:19998/Y"). But
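
Spark itself does not let one application hand an in-memory RDD to another by ID (an RDD is tied to its SparkContext), so the filesystem route mentioned above is the usual workaround. A small sketch of that pattern, using the Tachyon path itself as the published ID; the path and data here are made up.

    import org.apache.spark.SparkContext

    // Application A: write the RDD to Tachyon and publish the path as the shared "ID".
    def publish(sc: SparkContext, id: String): String = {
      val path = s"tachyon://localhost:19998/shared/$id"   // assumes the Tachyon master is on localhost:19998
      sc.parallelize(1 to 1000).saveAsObjectFile(path)
      path
    }

    // Application B: rebuild an equivalent RDD from the published path.
    def load(sc: SparkContext, path: String): org.apache.spark.rdd.RDD[Int] =
      sc.objectFile[Int](path)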

Re: How to reuse a ML trained model?

2015-03-08 Thread Xi Shen
errr... do you have any suggestions for me before the 1.3 release? I can't believe there's no ML model serialization method in Spark. I think training the models is quite expensive, isn't it? Thanks, David On Sun, Mar 8, 2015 at 5:14 AM Burak Yavuz brk...@gmail.com wrote: Hi, There is model

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Cui Lin
No wonder I had out-of-memory issues before… I doubt we really need such a configuration at production level… Best regards, Cui Lin From: Krishna Sankar ksanka...@gmail.com Date: Sunday, March 8, 2015 at 3:27 PM To: Nasir Khan

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Krishna Sankar
Without knowing the data size and the computation/storage requirements ...: - Dual 6- or 8-core machines, 256 GB memory each, 12-15 TB of disk per machine; probably 5-10 machines. - Don't go for the most exotic machines; on the other hand, don't go for the cheapest ones either. - Find a sweet spot with your

Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-08 Thread James
Hi, it still doesn't work. Are there any working instructions for passing a date to an hql script? Alcaid 2015-03-07 2:43 GMT+08:00 Zhan Zhang zzh...@hortonworks.com: Do you mean “--hiveConf” (two dashes) instead of -hiveconf (one dash)? Thanks. Zhan Zhang On Mar 6, 2015, at 4:20 AM,
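
Not an answer about the CLI flag itself, but one alternative is to drive the query from application code so the date becomes an ordinary parameter. A sketch with HiveContext; the table and column names are invented.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    // Run the same query for a given date without relying on CLI variable substitution.
    def runForDate(sc: SparkContext, date: String) = {
      val hiveContext = new HiveContext(sc)
      hiveContext.sql(s"SELECT * FROM events WHERE dt = '$date'")   // 'events' and 'dt' are made up
    }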

Re: How to reuse a ML trained model?

2015-03-08 Thread Sean Owen
You don't need SparkContext to simply serialize and deserialize objects. It is a Java mechanism. On Mar 8, 2015 10:29 AM, Xi Shen davidshe...@gmail.com wrote: errr... do you have any suggestions for me before the 1.3 release? I can't believe there's no ML model serialization method in Spark. I think
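
A sketch of the plain Java serialization Sean refers to, assuming the model is a simple serializable object such as LogisticRegressionModel; models that wrap RDDs (e.g. MatrixFactorizationModel) need different treatment.

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.classification.LogisticRegressionModel

    // Save a trained model to the local filesystem with plain Java serialization.
    def saveModel(model: LogisticRegressionModel, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try out.writeObject(model) finally out.close()
    }

    // Load it back later, in the same or another driver program.
    def loadModel(path: String): LogisticRegressionModel = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[LogisticRegressionModel] finally in.close()
    }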

Re: How to reuse a ML trained model?

2015-03-08 Thread Simon Chan
You may also take a look at PredictionIO, which can persist and then deploy MLlib models as web services. Simon On Sunday, March 8, 2015, Sean Owen so...@cloudera.com wrote: You don't need SparkContext to simply serialize and deserialize objects. It is a Java mechanism. On Mar 8, 2015 10:29 AM,

Using sparkContext inside a map function

2015-03-08 Thread danielil
Hi, We are designing a solution which pulls file paths from Kafka and for the current stage just counts the lines in each of these files. When running the code it fails on: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at

Re: Bulk insert strategy

2015-03-08 Thread Ted Yu
What's the expected number of partitions in your use case? Have you thought of doing batching in the workers? Cheers On Sat, Mar 7, 2015 at 10:54 PM, A.K.M. Ashrafuzzaman ashrafuzzaman...@gmail.com wrote: While processing DStream in the Spark Programming Guide, the suggested usage of

using sparkContext from within a map function (from spark streaming app)

2015-03-08 Thread Daniel Haviv
Hi, We are designing a solution which pulls file paths from Kafka and for the current stage just counts the lines in each of these files. When running the code it fails on: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at

Re: using sparkContext from within a map function (from spark streaming app)

2015-03-08 Thread Sean Owen
Yes, you can never use the SparkContext inside a remote function. It is on the driver only. On Sun, Mar 8, 2015 at 4:22 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, We are designing a solution which pulls file paths from Kafka and for the current stage just counts the lines in
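
A minimal sketch of the pattern this implies: keep all SparkContext calls on the driver. Here the file paths pulled from Kafka are assumed to be few enough to loop over on the driver.

    import org.apache.spark.SparkContext

    // Wrong: a closure like  pathsRdd.map(path => sc.textFile(path).count())  would try to
    // ship the SparkContext to the executors and fail with "Task not serializable".

    // Workable: iterate over the collected paths on the driver and use sc only there.
    def countLines(sc: SparkContext, paths: Seq[String]): Map[String, Long] =
      paths.map(path => path -> sc.textFile(path).count()).toMap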

Can't cache RDD of collaborative filtering on MLlib

2015-03-08 Thread Yuichiro Sakamoto
Hello. I created a collaborative filtering program using Spark, but I have trouble with its speed. I want to implement a recommendation program using ALS (MLlib), which runs as a separate process from Spark. But access to the MatrixFactorizationModel object on HDFS is slow, so I want to cache it,
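
One possible approach, sketched below: a MatrixFactorizationModel is essentially two factor RDDs, and caching those after loading keeps predictions from going back to HDFS each time. How the model was stored and reloaded is left out here.

    import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

    // Cache the factor RDDs and materialize them once, so later predict() calls hit memory.
    def warm(model: MatrixFactorizationModel): MatrixFactorizationModel = {
      model.userFeatures.cache()
      model.productFeatures.cache()
      model.userFeatures.count()
      model.productFeatures.count()
      model
    }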