Re: flume spark streaming receiver host random

2014-09-27 Thread Sean Owen
I don't think you control which host the receiver runs on, right? That's so Spark can handle the failure of that node and reassign the receiver. On Sep 27, 2014 2:43 AM, centerqi hu cente...@gmail.com wrote: the receiver is not running on the machine I expect 2014-09-26 14:09 GMT+08:00 Sean

Re: Log hdfs blocks sending

2014-09-27 Thread Andrew Ash
Hi Alexey, You're looking in the right place in the first log from the driver. Specifically the locality is on the TaskSetManager INFO log level and looks like this: 14/09/26 16:57:31 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 10, 10.54.255.191, ANY, 1341 bytes) The ANY there

RDD logic and control

2014-09-27 Thread pop1998
Hello, I'm examining Spark RDDs and trying to understand how the RDD flow works. Can anyone please tell me how an RDD decides to (and where I can find the relevant code): 1. re-split into a new RDD? 2. move to a new PC? 3. perform PC selection? 4. perform a union of multiple RDDs? 5.

Re: Is it possible to use Parquet with Dremel encoding

2014-09-27 Thread Michael Armbrust
Based on your first example it looks like what you want is actually run length encoding (which parquet does support https://github.com/Parquet/parquet-format/blob/master/Encodings.md). Repetition and definition levels are used to reconstruct nested or repeated (arrays) data that has been shredded

MLlib 1.2 New Interesting Features

2014-09-27 Thread Krishna Sankar
Guys, - Need help in terms of the interesting features coming up in MLlib 1.2. - I have a 2 Part, ~3 hr hands-on tutorial at the Big Data Tech Con - The Hitchhiker's Guide to Machine Learning with Python Apache Spark[2] - At minimum, it would be good to take the last 30 min

Re: Retrieve dataset of Big Data Benchmark

2014-09-27 Thread Tom
Hi, I was able to download the dataset this way (and just reconfirmed it by doing so again): //Following before starting spark export AWS_ACCESS_KEY_ID=*key_id* export AWS_SECRET_ACCESS_KEY=*access_key* //Start spark ./spark-shell //In the spark shell val dataset =
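Tom's flattened snippet can be expanded into the two steps it describes. A minimal sketch, assuming the AMPLab benchmark's public S3 bucket layout (`s3n://big-data-benchmark/pavlo/...`) and placeholder credentials — substitute your own keys and the dataset size/format you actually want:

```scala
// Run in spark-shell AFTER exporting AWS credentials in the launching shell:
//   export AWS_ACCESS_KEY_ID=your_key_id
//   export AWS_SECRET_ACCESS_KEY=your_access_key
//   ./spark-shell

// Load one of the benchmark tables as plain text lines.
// The bucket/path below is the commonly cited Big Data Benchmark location;
// adjust "tiny" to "1node"/"5nodes" etc. for larger dataset sizes.
val dataset = sc.textFile("s3n://big-data-benchmark/pavlo/text/tiny/rankings")

// Sanity-check that the download works before doing anything heavier.
dataset.take(5).foreach(println)
```

The `s3n://` scheme picks up the exported AWS credentials automatically in Spark 1.x; alternatively the keys can be embedded in the URL, though environment variables keep them out of shell history and logs.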

iPython notebook ec2 cluster matlabplot not found?

2014-09-27 Thread Andy Davidson
Hi, I am having a heck of a time trying to get Python to work correctly on my cluster created using the spark-ec2 script. The following link was really helpful: https://issues.apache.org/jira/browse/SPARK-922. I am still running into problems with matplotlib (it works fine on my Mac). I can not

Re: iPython notebook ec2 cluster matlabplot not found?

2014-09-27 Thread Nicholas Chammas
Can you first confirm that the regular PySpark shell works on your cluster? Without upgrading to 2.7. That is, you log on to your master using spark-ec2 login and run bin/pyspark successfully without any special flags. And as far as I can tell, you should be able to use IPython at 2.6, so I’d

yarn does not accept job in cluster mode

2014-09-27 Thread jamborta
hi all, I have a job that works ok in yarn-client mode, but when I try yarn-cluster mode it returns the following: WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory the cluster has
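That warning usually means YARN cannot allocate containers of the requested size; note that in yarn-cluster mode the driver also runs inside a YARN container (as the ApplicationMaster), so its memory counts against the cluster's headroom too, unlike in yarn-client mode. A hedged sketch of a submission with explicit, modest resource requests — the class name, jar, and sizes are placeholders to tune against what the cluster actually has free:

```shell
# Spark 1.1-era syntax: "--master yarn-cluster" selects cluster deploy mode.
# Keep executor-memory * num-executors (plus driver-memory) below what YARN
# reports as available, or the job will sit waiting for containers.
spark-submit \
  --master yarn-cluster \
  --num-executors 4 \
  --executor-memory 2g \
  --executor-cores 2 \
  --driver-memory 1g \
  --class com.example.MyJob \
  myjob.jar
```

If the same settings work in yarn-client mode, the extra driver container is the first thing to check: shrink `--driver-memory` or free up a node and retry.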

Re: How to run hive scripts pro-grammatically in Spark 1.1.0 ?

2014-09-27 Thread jamborta
Hi, you can create a Spark context in your Python or Scala environment and use that to run your Hive queries, pretty much the same way as you'd do it in the shell. thanks,
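The approach jamborta describes can be sketched in Scala against the Spark 1.1 API, where Hive queries go through a `HiveContext` wrapped around a plain `SparkContext`. The app name, table, and query below are illustrative placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Build a SparkContext programmatically, then wrap it in a HiveContext.
// Requires Spark built with Hive support (-Phive) and hive-site.xml on
// the classpath if you want to reach an existing metastore.
val sc = new SparkContext(new SparkConf().setAppName("HiveScripts"))
val hiveContext = new HiveContext(sc)

// sql() on a HiveContext uses the HiveQL dialect by default in 1.1,
// so the same statements that run in the hive shell run here.
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
val results = hiveContext.sql("SELECT key, value FROM src").collect()
results.foreach(println)
```

Each statement is submitted one at a time, so a multi-statement Hive script has to be split on `;` and fed to `sql()` statement by statement.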

Re: New user question on Spark SQL: can I really use Spark SQL like a normal DB?

2014-09-27 Thread jamborta
hi, Yes, I have been using Spark SQL extensively that way. I have just tried it, and saveAsTable() works OK on 1.1.0. Alternatively, you can write the data from the SchemaRDD to HDFS using saveAsTextFile, and create an external table on top of it. thanks,
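Both routes mentioned above can be sketched against the Spark 1.1 SchemaRDD API. This assumes an existing `SparkContext` named `sc`, a source table, and an HDFS path that are all placeholders:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // assumes an existing SparkContext `sc`

// Option 1: persist a SchemaRDD directly as a Hive-managed table.
val people = hiveContext.sql("SELECT name, age FROM some_source")
people.saveAsTable("people")  // new in 1.1.0

// Option 2: write tab-delimited text to HDFS, then declare an
// external table over those files so Hive/Spark SQL can query them.
people.map(r => s"${r(0)}\t${r(1)}").saveAsTextFile("hdfs:///data/people")
hiveContext.sql(
  """CREATE EXTERNAL TABLE people_ext (name STRING, age INT)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    |LOCATION 'hdfs:///data/people'""".stripMargin)
```

The external-table route survives dropping the table (only the metadata goes away, the HDFS files stay), which is why it's a common choice when other tools also read the data.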

Re: Build spark with Intellij IDEA 13

2014-09-27 Thread maddenpj
I actually got this same exact issue compiling an unrelated project (not using Spark). Maybe it's a protobuf issue? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Build-spark-with-Intellij-IDEA-13-tp9904p15284.html Sent from the Apache Spark User List

PageRank execution imbalance, might hurt performance by 6x

2014-09-27 Thread Larry Xiao
Hi all! I'm running PageRank on GraphX, and I find that some tasks on one machine can spend 5~6 times more time than others, while the rest are perfectly balanced (around 1 second to finish). And since the time for a stage (iteration) is determined by the slowest task, the performance is undesirable. I

How to use multi thread in RDD map function ?

2014-09-27 Thread myasuka
Hi, everyone. I've come across a problem with increasing the concurrency. In a program, after a shuffle write, each node should fetch 16 pairs of matrices to do matrix multiplication, such as: import breeze.linalg.{DenseMatrix => BDM} pairs.map(t => { val b1 =

Re: How to use multi thread in RDD map function ?

2014-09-27 Thread qinwei
Among the options of spark-submit, there are two that may be helpful for your problem: --total-executor-cores NUM (standalone and Mesos only) and --executor-cores NUM (YARN only). qinwei From: myasuka Date: 2014-09-28 11:44 To: user Subject: How to use multi thread in RDD map function
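The two flags qinwei names can be sketched as spark-submit invocations; the master URLs, jar, and numbers are placeholders. Note they control parallelism differently: one caps the application's total cores, the other sets cores per executor:

```shell
# Standalone / Mesos: cap the TOTAL cores the application may take
# across all executors (Spark decides how to spread them).
spark-submit --master spark://master:7077 \
  --total-executor-cores 16 \
  app.jar

# YARN: cores are requested PER executor; total parallelism is
# --executor-cores * --num-executors (here 4 * 4 = 16 task slots).
spark-submit --master yarn-cluster \
  --executor-cores 4 \
  --num-executors 4 \
  app.jar
```

More task slots lets more of those matrix multiplications run concurrently without spawning threads inside the `map` function itself, which is usually the simpler and safer route in Spark.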