Re: MovieALS Implicit Error

2015-07-28 Thread Xiangrui Meng
Hi Benedict, Did you set lambda to zero? -Xiangrui On Mon, Jul 13, 2015 at 4:18 AM, Benedict Liang bli...@thecarousell.com wrote: Hi Sean, This user dataset is organic. What do you think is a good ratings threshold then? I am only encountering this with the implicit type though. The explicit

log file directory

2015-07-28 Thread Jack Yang
Hi all, I have questions regarding the log file directory. Say I run spark-submit --master local[4]; where is the log file? And what about if I run standalone with spark-submit --master spark://mymaster:7077? Best regards, Jack

Re: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-28 Thread Xiangrui Meng
Hi Aniruddh, Increasing the number of partitions doesn't always help in ALS due to the communication/computation trade-off. What rank did you set? If the rank is not large, I'd recommend a small number of partitions. There are some other numbers to watch. Do you have super popular items/users in your

Re: Cluster sizing for recommendations

2015-07-28 Thread Xiangrui Meng
Hi Danny, You might need to reduce the number of partitions (or set userBlocks and productBlocks directly in ALS). Using a large number of partitions increases shuffle size and memory requirements. If you have 16 x 16 = 256 cores, I would recommend 64 or 128 instead of 2048.
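A minimal sketch of what setting the block counts directly might look like with the MLlib ALS API; the input path, field layout, rank, and iteration count below are placeholders, not values from the thread:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical ratings RDD; the path and CSV layout are placeholders.
    val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
      val Array(user, item, value) = line.split(",")
      Rating(user.toInt, item.toInt, value.toDouble)
    }

    // Keep the block counts modest (e.g. 64 or 128 for 256 cores) to limit
    // shuffle size and per-executor memory pressure.
    val model = new ALS()
      .setRank(10)
      .setIterations(10)
      .setUserBlocks(64)
      .setProductBlocks(64)
      .run(ratings)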

Re: Failed stages and dropped executors when running implicit matrix factorization/ALS : Too many values to unpack

2015-07-28 Thread Xiangrui Meng
It seems that the error happens before ALS iterations. Could you try `ratings.first()` right after `ratings = newrdd.map(lambda l: Rating(int(l[1]),int(l[2]),l[4])).partitionBy(50)`? -Xiangrui On Fri, Jun 26, 2015 at 2:28 PM, Ayman Farahat ayman.fara...@yahoo.com wrote: I tried something similar

Re: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Tathagata Das
You have to partition the data in Spark Streaming by the primary key, and then make sure you insert data into Cassandra atomically per key, or per set of keys in the partition. You can use the combination of (batch time, partition id) of the RDD inside foreachRDD as the unique id for the
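A rough sketch of the (batch time, partition id) idea: `keyedStream` is assumed to be a DStream of key/value pairs already partitioned by the primary key, and `writeAtomically` is a hypothetical stand-in for whatever per-key atomic write your Cassandra client or connector provides.

    import org.apache.spark.TaskContext

    // Sketch only: build a deterministic id from the batch time and partition id,
    // so a re-executed partition overwrites the same rows instead of inserting duplicates.
    keyedStream.foreachRDD { (rdd, batchTime) =>
      rdd.foreachPartition { records =>
        val uniqueId = s"${batchTime.milliseconds}-${TaskContext.partitionId()}"
        records.foreach { case (key, value) =>
          writeAtomically(uniqueId, key, value) // hypothetical helper
        }
      }
    }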

Re: [Spark ML] HasInputCol, etc.

2015-07-28 Thread Feynman Liang
Unfortunately, AFAIK custom transformers are not part of the public API so you will have to continue with what you're doing. On Tue, Jul 28, 2015 at 1:32 PM, Matt Narrell matt.narr...@gmail.com wrote: Hey, Our ML ETL pipeline has several complex steps that I’d like to address with custom
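For reference, a minimal sketch of a custom Transformer that declares its own input/output column Params instead of relying on the private Has* traits; the class name, column handling, and the upper-casing logic are placeholders, and base-class signatures may differ between Spark versions:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.{Param, ParamMap}
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, udf}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Sketch: a Transformer with hand-rolled inputCol/outputCol Params.
    class UpperCaseTransformer(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("upperCase"))

      val inputCol = new Param[String](this, "inputCol", "input column name")
      val outputCol = new Param[String](this, "outputCol", "output column name")
      def setInputCol(value: String): this.type = set(inputCol, value)
      def setOutputCol(value: String): this.type = set(outputCol, value)

      private val toUpper = udf((s: String) => Option(s).map(_.toUpperCase).orNull)

      override def transform(dataset: DataFrame): DataFrame =
        dataset.withColumn($(outputCol), toUpper(col($(inputCol))))

      override def transformSchema(schema: StructType): StructType =
        StructType(schema.fields :+ StructField($(outputCol), StringType))

      override def copy(extra: ParamMap): UpperCaseTransformer = defaultCopy(extra)
    }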

Re: restart from last successful stage

2015-07-28 Thread Tathagata Das
If you are using the same RDDs in both attempts to run the job, the previous stage outputs generated in the previous job will indeed be reused. This applies to Spark core, though. For DataFrames, depending on what you do, the physical plan may get generated again, leading to new RDDs which may

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sachin Naik
I agree with Sean - using VirtualBox on Windows with a Linux VM is a lot easier than trying to circumvent the Cygwin oddities. A lot of functionality might not work in Cygwin, and you will end up trying to do back patches. Unless there is a compelling reason, Cygwin support seems not

Re: Spark - Eclipse IDE - Maven

2015-07-28 Thread Carol McDonald
I agree, I found this book very useful for getting started with Spark and Eclipse. On Tue, Jul 28, 2015 at 11:10 AM, Petar Zecevic petar.zece...@gmail.com wrote: Sorry about self-promotion, but there's a really nice tutorial for setting up Eclipse for Spark in the Spark in Action book:

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Corey Nolet
On Tue, Jul 28, 2015 at 2:17 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Thanks Corey for your answer. Do you mean that "final status: SUCCEEDED" in the terminal logs means that the YARN RM could clean the resources after the application has finished (application finishing does not necessarily

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Elkhan Dadashov
I run Spark in yarn-cluster mode, and yes, log aggregation is enabled. In the YARN aggregated logs I can see the job status correctly. The issue is that the YARN client logs (which are written to stdout in the terminal) state that the job has succeeded even though it has failed. As user is not testing if Yarn RM

Re: Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-28 Thread Steve Loughran
Try looking at the causes and steps here: https://wiki.apache.org/hadoop/BindException On 28 Jul 2015, at 09:22, Wayne Song wayne.e.s...@gmail.com wrote: I made this message with the Nabble web interface; I included the stack trace there, but I guess it didn't

Re: NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-28 Thread Sean Owen
That's for the Windows interpreter rather than bash-running Cygwin. I don't know that it's worth doing a lot of legwork for Cygwin, but if it's really just a few lines of classpath translation in one script, it seems reasonable. On Tue, Jul 28, 2015 at 9:13 PM, Steve Loughran ste...@hortonworks.com

[Spark ML] HasInputCol, etc.

2015-07-28 Thread Matt Narrell
Hey, Our ML ETL pipeline has several complex steps that I’d like to address with custom Transformers in an ML Pipeline. Looking at the Tokenizer and HashingTF transformers, I see these handy traits (HasInputCol, HasLabelCol, HasOutputCol, etc.), but they have strict access modifiers. How can I

Re: Do I really need to build Spark for Hive/Thrift Server support?

2015-07-28 Thread ReeceRobinson
I am building an analytics environment based on Spark and want to use Hive in multi-user mode, i.e. not using the embedded Derby database but Postgres and HDFS instead. I am using the included Spark Thrift Server to process queries using Spark SQL. The documentation gives me the impression that
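For reference, a hedged example of what pointing the Hive metastore at Postgres typically looks like in a hive-site.xml placed on Spark's classpath (conf/); the host, database name, and credentials below are placeholders, and the Postgres JDBC driver jar must also be available:

    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://dbhost:5432/metastore</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>hive</value>
    </property>
    <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>secret</value>
    </property>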

Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Brandon White
val ssc = new StreamingContext(sc, Minutes(10)) //500 textFile streams watching S3 directories val streams = streamPaths.par.map { path => ssc.textFileStream(path) } streams.par.foreach { stream => stream.foreachRDD { rdd => //do something } } ssc.start() Would something like this
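For reference, a compilable version of the sketch above with placeholder paths and processing; whether 500 simultaneous file streams perform acceptably is a separate question:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // Placeholder list of S3 directories to watch; real paths would come from config.
    val streamPaths: Seq[String] = (1 to 500).map(i => s"s3n://bucket/dir$i/")

    val sc = new SparkContext() // assumes master/app name are set via spark-submit
    val ssc = new StreamingContext(sc, Minutes(10))

    // One textFileStream per directory. Stream creation and foreachRDD registration
    // happen on the driver during setup, so a plain sequential collection is fine here.
    val streams = streamPaths.map(path => ssc.textFileStream(path))
    streams.foreach { stream =>
      stream.foreachRDD { rdd =>
        // do something with each batch, e.g. rdd.count()
      }
    }

    ssc.start()
    ssc.awaitTermination()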
