Re: distcp on ec2 standalone spark cluster

2015-03-08 Thread Akhil Das
Did you follow these steps? https://wiki.apache.org/hadoop/AmazonS3 Also make sure your jobtracker/mapreduce processes are running fine. Thanks Best Regards On Sun, Mar 8, 2015 at 7:32 AM, roni wrote: > Did you get this to work? > I got past the issues with the cluster not started problem > I

Re: Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-08 Thread Akhil Das
Mostly, when you use different versions of jars, it will throw up incompatible version errors. Thanks Best Regards On Fri, Mar 6, 2015 at 7:38 PM, Zsolt Tóth wrote: > Hi, > > I submit spark jobs in yarn-cluster mode remotely from java code by > calling Client.submitApplication(). For some reaso

Re: Help with transformWith in SparkStreaming

2015-03-08 Thread Akhil Das
You could do it like this: val transformedFileAndTime = fileAndTime.transformWith(anomaly, (rdd1: RDD[(String,String)], rdd2 : RDD[Int]) => { var first = " "; var second = " "; var third = 0
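A minimal sketch of how the two-RDD transformWith variant can be wired up, assuming fileAndTime is a DStream[(String, String)] and anomaly is a DStream[Int] as in the snippet above; the combining logic (pairing each record with the batch's first anomaly value) is only illustrative:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def combine(fileAndTime: DStream[(String, String)],
            anomaly: DStream[Int]): DStream[(String, String, Int)] =
  fileAndTime.transformWith(anomaly, (rdd1: RDD[(String, String)], rdd2: RDD[Int]) => {
    // take(1) runs on the driver once per batch; default to 0 when the batch is empty
    val score = rdd2.take(1).headOption.getOrElse(0)
    rdd1.map { case (file, time) => (file, time, score) }
  })

The function passed to transformWith is evaluated on the driver for every batch, so per-batch driver-side work like take(1) is fine there, while the returned RDD's operations still run on the executors.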

Re: How to reuse a ML trained model?

2015-03-08 Thread Xi Shen
errr...do you have any suggestions for me before the 1.3 release? I can't believe there's no ML model serialize method in Spark. I think training the models is quite expensive, isn't it? Thanks, David On Sun, Mar 8, 2015 at 5:14 AM Burak Yavuz wrote: > Hi, > > There is model import/export for s

A way to share RDD directly using Tachyon?

2015-03-08 Thread Yijie Shen
Hi, I would like to share an RDD across several Spark applications, i.e., create one in application A, publish its ID somewhere, and get the RDD back directly by ID in application B. I know I can use Tachyon just as a filesystem and save it with s.saveAsTextFile("tachyon://localhost:19998/Y") like this. But g
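One way to get the filesystem-style sharing described above, sketched under the assumption that both applications can reach the same Tachyon master and that an object file (rather than text) is acceptable; the path is illustrative:

// Application A: persist the RDD's data to Tachyon
val data = sc.parallelize(1 to 1000000)
data.saveAsObjectFile("tachyon://localhost:19998/shared/my-rdd")

// Application B: rebuild an RDD from the same path with its own SparkContext
val restored = sc.objectFile[Int]("tachyon://localhost:19998/shared/my-rdd")

Note this shares the data rather than a live RDD handle: application B gets a new RDD backed by the Tachyon files, not the exact RDD object created by application A.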

Re: How to reuse a ML trained model?

2015-03-08 Thread Sean Owen
You don't need SparkContext to simply serialize and deserialize objects. It is a Java mechanism. On Mar 8, 2015 10:29 AM, "Xi Shen" wrote: > errr...do you have any suggestions for me before 1.3 release? > > I can't believe there's no ML model serialize method in Spark. I think > training the models
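A minimal sketch of the plain Java serialization Sean is referring to, assuming the model class (KMeansModel is used here only as an example) implements java.io.Serializable and that a local path on the driver is acceptable:

import java.io._
import org.apache.spark.mllib.clustering.KMeansModel

def saveModel(model: KMeansModel, path: String): Unit = {
  val out = new ObjectOutputStream(new FileOutputStream(path))
  try out.writeObject(model) finally out.close()
}

def loadModel(path: String): KMeansModel = {
  val in = new ObjectInputStream(new FileInputStream(path))
  try in.readObject().asInstanceOf[KMeansModel] finally in.close()
}

The same pattern works for any serializable model object; the built-in model import/export mentioned earlier in the thread arrives in 1.3 for some models.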

Re: How to reuse a ML trained model?

2015-03-08 Thread Simon Chan
You may also take a look at PredictionIO, which can persist and then deploy MLlib models as web services. Simon On Sunday, March 8, 2015, Sean Owen wrote: > You don't need SparkContext to simply serialize and deserialize objects. It > is a Java mechanism. > On Mar 8, 2015 10:29 AM, "Xi Shen" > wr

Re: Bulk insert strategy

2015-03-08 Thread Ted Yu
What's the expected number of partitions in your use case? Have you thought of doing batching in the workers? Cheers On Sat, Mar 7, 2015 at 10:54 PM, A.K.M. Ashrafuzzaman < ashrafuzzaman...@gmail.com> wrote: > While processing DStream in the Spark Programming Guide, the suggested > usage of c

Using sparkContext inside a map function

2015-03-08 Thread danielil
Hi, We are designing a solution which pulls file paths from Kafka and for the current stage just counts the lines in each of these files. When running the code it fails on: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureClea

using sparkContext from within a map function (from spark streaming app)

2015-03-08 Thread Daniel Haviv
Hi, We are designing a solution which pulls file paths from Kafka and for the current stage just counts the lines in each of these files. When running the code it fails on: Exception in thread "main" org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureClea

Re: using sparkContext from within a map function (from spark streaming app)

2015-03-08 Thread Sean Owen
Yes, you can never use the SparkContext inside a remote function. It is on the driver only. On Sun, Mar 8, 2015 at 4:22 PM, Daniel Haviv wrote: > Hi, > We are designing a solution which pulls file paths from Kafka and for the > current stage just counts the lines in each of these files. > When ru
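A sketch of the usual workaround for this thread's use case: keep SparkContext calls on the driver, e.g. inside foreachRDD, rather than inside map/flatMap closures that run on executors. pathStream is assumed to be the DStream[String] of file paths read from Kafka and sc the driver's SparkContext:

import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

def countLines(pathStream: DStream[String], sc: SparkContext): Unit =
  pathStream.foreachRDD { rdd =>
    // this function body runs on the driver, so sc is usable here
    rdd.collect().foreach { path =>
      val lineCount = sc.textFile(path).count()  // each count is still a distributed job
      println(s"$path -> $lineCount lines")
    }
  }

Collecting the paths is cheap because only the paths (not the file contents) come back to the driver; the per-file counting still happens on the cluster.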

Can't cache RDD of collaborative filtering on MLlib

2015-03-08 Thread Yuichiro Sakamoto
Hello. I created a collaborative filtering program using Spark, but I am having trouble with computation speed. I want to implement a recommendation program using ALS (MLlib), which runs as a separate process from Spark. But the access speed of the MatrixFactorizationModel object on HDFS is slow, so I want to cache it, b
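A MatrixFactorizationModel cannot be cached as a single object, but its two factor RDDs can be; a minimal sketch, assuming model is a MatrixFactorizationModel that has been trained or reloaded in the serving process:

import org.apache.spark.storage.StorageLevel

model.userFeatures.persist(StorageLevel.MEMORY_AND_DISK)
model.productFeatures.persist(StorageLevel.MEMORY_AND_DISK)
// Force materialization once, so later predict/recommend calls hit the cache
model.userFeatures.count()
model.productFeatures.count()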

Re: Bulk insert strategy

2015-03-08 Thread Ashrafuzzaman
Yes, so that brings me to another question: how do I do a batch insert from a worker? In prod we are planning to use a Kinesis stream with 3 shards. So the number of partitions should be 3, right? On Mar 8, 2015 8:57 PM, "Ted Yu" wrote: > What's the expected number of partitions in your use case? > > Have you
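On the batch-insert question, the usual pattern is foreachPartition, so each partition opens one client and writes records in chunks; a sketch where MyRecord, createClient and sendBatch are hypothetical stand-ins for the real record type and the real Kinesis (or database) client calls:

import org.apache.spark.streaming.dstream.DStream

case class MyRecord(key: String, payload: String)
def createClient(): AnyRef = new Object                        // stand-in: build the real client here
def sendBatch(client: AnyRef, batch: Seq[MyRecord]): Unit = () // stand-in: one bulk write per call

def writeInBatches(stream: DStream[MyRecord]): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val client = createClient()                              // one client per partition
      records.grouped(500).foreach(chunk => sendBatch(client, chunk))
    }
  }

Batching per partition keeps connection setup and per-request overhead off the per-record path.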

Re: Python script runs fine in local mode, errors in other modes

2015-03-08 Thread Davies Liu
I got another report about this recently and figured out that it's caused by having different versions of Python on the driver and on YARN: http://stackoverflow.com/questions/28879803/spark-runs-in-local-but-not-in-yarn/28931934#28931934 Created JIRA: https://issues.apache.org/jira/browse/SPARK-6216?fi

General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Nasir Khan
Hi, I am going to submit a proposal to my university to set up my standalone Spark cluster; what hardware should I include in my proposal? I will be working on classification (Spark MLlib) of data streams (Spark Streaming). If somebody can fill in these answers, that will be great! Thanks *Cores *=

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Ted Yu
Cycling related bits: http://search-hadoop.com/m/LgpTk2DLMvc On Sun, Mar 8, 2015 at 2:29 PM, Nasir Khan wrote: > HI, I am going to submit a proposal to my University to setup my Standalone > Spark Cluster, what hardware should i include in my proposal? > > I will be Working on classification (Sp

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Krishna Sankar
Without knowing the data size, computation & storage requirements ... : - Dual 6 or 8 core machines, 256 GB memory each, 12-15 TB per machine. Probably 5-10 machines. - Don't go for the most exotic machines, otoh don't go for cheapest ones either. - Find a sweet spot with your ve

Re: General Purpose Spark Cluster Hardware Requirements?

2015-03-08 Thread Cui Lin
No wonder I had out-of-memory issues before… I doubt we really need such a configuration at production level… Best regards, Cui Lin From: Krishna Sankar <ksanka...@gmail.com> Date: Sunday, March 8, 2015 at 3:27 PM To: Nasir Khan <nasirkhan.onl...@gmail.com> Cc: "user@spark.apache.o

Re: [SPARK-SQL] How to pass parameter when running hql script using cli?

2015-03-08 Thread James
Hi, it still doesn't work. Is there any working example of how to pass a date to an hql script? Alcaid 2015-03-07 2:43 GMT+08:00 Zhan Zhang : > Do you mean "--hiveConf" (two dashes) instead of -hiveconf (one dash)? > > Thanks. > > Zhan Zhang > > On Mar 6, 2015, at 4:20 AM, James wrote: > >

A strange problem in spark sql join

2015-03-08 Thread Dai, Kevin
Hi guys, I encountered a strange problem as follows: I joined two tables (both parquet files) and then did a groupBy. The groupBy took 19 hours to finish. However, when I killed this job twice in the groupBy stage, the third try will su But after I killed this job and ran it again, it s

How to use the TF-IDF model?

2015-03-08 Thread Xi Shen
Hi, I read this page: http://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html. But I am wondering, how do I use this TF-IDF RDD? What does this TF-IDF vector look like? Can someone give me some guidance? Thanks, Xi Shen about.me/davidshen
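For reference, a minimal sketch of producing and consuming the TF-IDF vectors with the MLlib 1.2 API described on that page; the input path and the whitespace tokenization are illustrative:

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] =
  sc.textFile("hdfs:///data/docs.txt").map(_.split(" ").toSeq)

val tf: RDD[Vector] = new HashingTF().transform(documents)
tf.cache()                                   // tf is used twice: once for fit, once for transform
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)

Each element of tfidf is a sparse Vector whose indices are hashed term ids and whose values are tf * idf weights; it can be fed straight into MLlib algorithms that expect RDD[Vector], such as KMeans or a classifier.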

what are the types of tasks when running ALS iterations

2015-03-08 Thread lisendong
You see, the core of ALS 1.0.0 is the following code: there should be flatMap and groupByKey tasks when running ALS iterations, right? But when I run the ALS iterations, there are ONLY flatMap tasks... Do you know why? private def updateFeatures( products: RDD[(Int, Array[Arr

How to load my ML model?

2015-03-08 Thread Xi Shen
Hi, I used the method on this page, http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/train.html, to save my k-means model. But now I have no idea how to load it back... I tried sc.objectFile("/path/to/data/file/directory/") but I got this error:
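For what it's worth, a hedged sketch of the load side that matches a save done as an object file of cluster centers (e.g. something like sc.makeRDD(model.clusterCenters).saveAsObjectFile(path)): read the centers back and wrap them in a new KMeansModel. The path is the placeholder from the question, and the element type must match what was actually written:

import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

val centers = sc.objectFile[Vector]("/path/to/data/file/directory/").collect()
val model = new KMeansModel(centers)

A common cause of errors at this step is parameterizing objectFile with a different type than the one saved; it has to be the exact element type that went into saveAsObjectFile.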