Re: Python + Spark unable to connect to S3 bucket .... Invalid hostname in URI

2014-08-15 Thread Miroslaw
So after doing some more research I found the root cause of the problem. The bucket name we were using contained an underscore ('_'), which violates the newer requirements for naming buckets. Using a bucket name without an underscore solved the issue. If anyone else runs into this
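For anyone hitting the same error, a minimal sketch (not from the thread) of checking a bucket name against the DNS-compliant rules that s3n:// URIs require; underscores fail the check:

    // Sketch: S3 bucket names used in s3n:// URIs must be DNS-compliant --
    // 3-63 chars of lowercase letters, digits, dots, and hyphens, no underscores.
    def isDnsCompliantBucketName(name: String): Boolean =
      name.length >= 3 && name.length <= 63 &&
        name.matches("[a-z0-9][a-z0-9.-]*[a-z0-9]")

    // isDnsCompliantBucketName("my_bucket")  // false -- "Invalid hostname in URI"
    // isDnsCompliantBucketName("my-bucket")  // true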

spark on yarn cluster can't launch

2014-08-15 Thread centerqi hu
The following spark-submit invocation does not run:

    ../bin/spark-submit --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \
      --verbose \
      --num-executors 3 \
      --driver-memory 4g \
      --executor-memory 2g \
      --executor-cores 1 \
      ../lib/spark-examples*.jar \
      100

Exception in thread

Re: Debugging Task not serializable

2014-08-15 Thread Juan Rodríguez Hortalá
Hi Sourav, I will take a look at that too; thanks a lot for your help. Greetings, Juan 2014-07-30 10:58 GMT+02:00 Sourav Chandra sourav.chan...@livestream.com: While running the application, set -Dsun.io.serialization.extendedDebugInfo=true. This is applicable to Java versions after 1.6. On
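For reference, a hedged sketch of passing that flag through spark-submit; the class and jar names are made up, and this covers only the driver (spark.executor.extraJavaOptions would cover the executors):

    # driver-side only; set spark.executor.extraJavaOptions similarly for executors
    spark-submit \
      --class com.example.MyApp \
      --driver-java-options "-Dsun.io.serialization.extendedDebugInfo=true" \
      myapp.jar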

Re: spark streaming - lamda architecture

2014-08-15 Thread Sean Owen
You may be interested in https://github.com/OryxProject/oryx which is at heart exactly lambda architecture on Spark Streaming. With ML pipelines on top. The architecture diagram and a peek at the code may give you a good example of how this could be implemented. I choose to view the batch layer as

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread Cui xp
Did I not describe the problem clearly? Is anyone familiar with softmax regression? Thanks. Cui xp. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-implement-multinomial-logistic-regression-softmax-regression-in-Spark-tp11939p12175.html Sent from the

Re: spark won't build with maven

2014-08-15 Thread visakh
You are running continuous compilation. AFAIK, it runs in an infinite loop and recompiles only the modified files. For compiling with Maven, have a look at these steps: https://spark.apache.org/docs/latest/building-with-maven.html Thanks, Visakh -- View this message in context:
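For reference, the relevant commands from that page at the time (profile names are examples; check the page for your version):

    # one-off build
    mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
    # continuous compilation (the recompile loop described above)
    mvn scala:cc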

Re: SparkR: split, apply, combine strategy for dataframes?

2014-08-15 Thread Carlos J. Gil Bellosta
Thanks for your reply. I think the problem was that SparkR tried to serialize the whole environment. Mind that the large dataframe was part of it, so every worker received its slice/partition (which is very small) plus the whole thing! So I deleted the large dataframe and list before
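The same pitfall exists on the Scala side, and deleting the big object is one fix; another is copying just the needed piece into a small local value before the closure captures anything. A hedged sketch with made-up data:

    // Hypothetical stand-in for the big object that was being serialized.
    val bigTable: Map[String, Double] = Map("scale" -> 2.0)
    val scale = bigTable("scale")            // small local copy
    val rdd = sc.parallelize(1 to 10)        // example data
    val scaled = rdd.map(_ * scale)          // closure now carries just one Double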

Issues with S3 client library and Apache Spark

2014-08-15 Thread Darin McBeath
I've seen a couple of issues posted about this, but I never saw a resolution. When I'm using Spark 1.0.2 (and the spark-submit script to submit my jobs) and AWS SDK 1.8.7, I get the stack trace below.  However, if I drop back to AWS SDK 1.3.26 (or anything from the AWS SDK 1.4.* family) then
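A hedged sbt workaround sketch, pinning the SDK to the family that worked (these are the standard AWS coordinates, but verify against your build):

    // build.sbt: keep the AWS SDK's httpclient dependency compatible
    // with the one Spark 1.0.2 ships
    libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.3.26"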

Re: Seattle Spark Meetup: Spark at eBay - Troubleshooting the everyday issues Slides

2014-08-15 Thread Denny Lee
Apologies, but we had restricted downloading of the slides to Seattle Spark Meetup members only, when we actually meant to share them with everyone. We have since fixed this and you can now download them. HTH! On August 14, 2014 at 18:14:35, Denny Lee (denny.g@gmail.com) wrote: For

Re: Spark webUI - application details page

2014-08-15 Thread Brad Miller
Hi Andrew, I'm running something close to the present master (I compiled several days ago) but am having some trouble viewing history. I set spark.eventLog.dir to true, but continually receive the error message (via the web UI) Application history not found...No event logs found for application
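For comparison, the event-log settings as documented (a conf/spark-defaults.conf sketch; the directory is an example and must exist and be writable):

    spark.eventLog.enabled   true
    spark.eventLog.dir       file:///tmp/spark-events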

Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I've been using a standalone cluster all this time and it worked fine. Recently I've been using another Spark cluster that is based on YARN, and I have no experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm

Re: Spark webUI - application details page

2014-08-15 Thread SK
Hi, Ok, I was specifying --master local. I changed that to --master spark://localhostname:7077 and am now able to see the completed applications. It provides summary stats about runtime and memory usage, which is sufficient for me at this time. However it doesn't seem to archive the info in

Re: Running Spark shell on YARN

2014-08-15 Thread Andrew Or
Hi Soumya, The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify what the executors are getting. You should use the '--executor-memory' argument to spark-shell. For instance, assuming each node has 48G of
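A sketch of such an invocation for the 10-node / 480G cluster described above (all numbers illustrative; they must fit under yarn.scheduler.maximum-allocation-mb):

    spark-shell --master yarn-client \
      --num-executors 10 \
      --executor-memory 40g \
      --executor-cores 4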

Re: spark on yarn cluster can't launch

2014-08-15 Thread Andrew Or
Hi 齐忠, Thanks for reporting this. You're correct that the default deploy mode is client. However, this seems to be a bug in the YARN integration code; we should never throw a null pointer exception. What version of Spark are you using? Andrew 2014-08-15 0:23 GMT-07:00 centerqi hu

[Spark Streaming] How can we use consecutive data points as the features?

2014-08-15 Thread Yan Fang
Hi guys, We have a use case where we need to use consecutive data points to predict a status (yes, like using time series data to predict machine failure). Is there a straightforward way to do this in Spark Streaming? If all consecutive data points are in one batch, it's not complicated
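If the points must span batch boundaries, one straightforward option is a sliding window over the DStream; a minimal sketch (interval lengths illustrative, dStream assumed created elsewhere):

    import org.apache.spark.streaming.Seconds

    // Each RDD in `windowed` covers the last 30s of data, recomputed every 10s,
    // so consecutive points from neighbouring batches land in the same RDD.
    val windowed = dStream.window(Seconds(30), Seconds(10))
    windowed.foreachRDD { rdd =>
      // sort by timestamp within the window and assemble feature vectors here
    }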

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Cui You can take a look at multinomial logistic regression PR I created. https://github.com/apache/spark/pull/1379 Ref: http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I just checked the YARN config and it looks like I need to change this value. Should it be raised to 48G (the max memory allocated to YARN) per node?

    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>6144</value>
      <source>java.io.BufferedInputStream@2e7e1ee</source>
    </property>

On Fri, Aug 15,

Re: Running Spark shell on YARN

2014-08-15 Thread Sandy Ryza
We generally recommend setting yarn.scheduler.maximum-allocation-mb to the maximum node capacity. -Sandy On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I just checked the YARN config and looks like I need to change this value. Should be upgraded to 48G (the
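Following that advice for the 48G nodes discussed above, the property would look something like this in yarn-site.xml (the value is illustrative):

    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>49152</value>  <!-- 48G: the node capacity in this example -->
    </property>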

Hardware Context on Spark Worker Hosts

2014-08-15 Thread Chris Brown
Is it practical to maintain a hardware context on each of the worker hosts in Spark? In my particular problem I have an OpenCL (or JavaCL) context which has two things associated with it: - Data stored on a GPU - Code compiled for the GPU If the context goes away, the data is lost and the
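There is no first-class Spark API for this, but a common pattern is a per-JVM lazy singleton that each executor initializes once and reuses across tasks. A hedged sketch with hypothetical stand-ins for the real OpenCL/JavaCL objects:

    // Hypothetical stand-in for the real GPU context.
    class GpuContext { def run(x: Double): Double = x }

    // One context per executor JVM, created lazily on first use.
    object GpuContextHolder {
      lazy val ctx = new GpuContext
    }

    val data = sc.parallelize(Seq(1.0, 2.0, 3.0))  // example data
    val results = data.mapPartitions { rows =>
      val ctx = GpuContextHolder.ctx               // same instance for every task in this JVM
      rows.map(row => ctx.run(row))                // hypothetical kernel invocation
    }

One caveat: executors can be restarted, so anything resident on the GPU must be reloadable through that context.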

Re: Spark webUI - application details page

2014-08-15 Thread Andrew Or
@Brad Your configuration looks alright to me. We parse both file:/ and file:/// the same way so that shouldn't matter. I just tried this on the latest master and verified that it works for me. Can you dig into the directory /tmp/spark-events/ml-pipeline-1408117588599 to make sure that it's not

closure issue - works in scalatest but not in spark-shell

2014-08-15 Thread Mohit Jaggi
Folks, I wrote the following wrapper on top of combineByKey. The RDD is of Array[Any] and I am extracting a field at a given index for combining. There are two ways in which I tried this. Option A: leave colIndex abstract in the Aggregator class and define it in the derived object Aggtor with value -1. It
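For reference, a standalone sketch of that kind of wrapper (not Mohit's exact code); the usual serialization fix is copying the index into a local val so the closure doesn't capture the enclosing class:

    import org.apache.spark.SparkContext._  // pair-RDD functions (pre-1.3 imports)
    import org.apache.spark.rdd.RDD

    // Sum the numeric field at `colIndex` for each key.
    def sumByIndex(rdd: RDD[(String, Array[Any])], colIndex: Int): RDD[(String, Double)] = {
      val idx = colIndex  // local copy: only this Int is serialized, not `this`
      rdd.combineByKey(
        (row: Array[Any]) => row(idx).asInstanceOf[Double],
        (acc: Double, row: Array[Any]) => acc + row(idx).asInstanceOf[Double],
        (a: Double, b: Double) => a + b)
    }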

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread Debasish Das
DB, Did you compare softmax regression with one-vs-all and find that softmax is better? One-vs-all can be implemented as a wrapper over the binary classifier that we have in MLlib... I am curious whether softmax multinomial is better in most cases, or whether it is worthwhile to add a one-vs-all version of MLOR
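For concreteness, a hedged sketch of that wrapper over the existing binary classifier (labels assumed to be 0 until numClasses; not an official API):

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Train one binary model per class.
    def trainOneVsAll(data: RDD[LabeledPoint], numClasses: Int,
                      iters: Int): Seq[LogisticRegressionModel] =
      (0 until numClasses).map { k =>
        val binary = data.map(p => LabeledPoint(if (p.label == k) 1.0 else 0.0, p.features))
        val m = LogisticRegressionWithSGD.train(binary, iters)
        m.clearThreshold()  // keep raw scores so the argmax below is meaningful
        m
      }

    // Predict the class whose model scores highest.
    def predictOneVsAll(models: Seq[LogisticRegressionModel], x: Vector): Double =
      models.zipWithIndex.maxBy(_._1.predict(x))._2.toDouble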

Re: How to implement multinomial logistic regression(softmax regression) in Spark?

2014-08-15 Thread DB Tsai
Hi Debasish, I didn't try one-vs-all vs softmax regression. One issue is that for one-vs-all, we have to train k classifiers for a k-class problem. The training time will be k times longer. Sincerely, DB Tsai --- My Blog:

ALS checkpoint performance

2014-08-15 Thread Debasish Das
Hi, Are there any experiments detailing the performance hit due to HDFS checkpointing in ALS? As we scale to large ranks with more ratings, I believe we have to cut the RDD lineage to safeguard against the lineage issue... Thanks. Deb
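For anyone experimenting with this, enabling checkpointing only requires a checkpoint directory on the SparkContext before training (the path is illustrative, and ALS must support checkpointing in your version; see the PRs cited in the reply below). The checkpoint writes are what cut the lineage:

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")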

Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
After changing the allocation I'm getting the following in my logs. No idea what this means.

    14/08/15 15:44:33 INFO cluster.YarnClientSchedulerBackend: Application report from ASM:
      appMasterRpcPort: -1
      appStartTime: 1408131861372
      yarnAppState: ACCEPTED
    14/08/15 15:44:34 INFO

Re: Running Spark shell on YARN

2014-08-15 Thread Kevin Markey
Sandy and others: Is there a single source of Yarn/Hadoop properties that should be set or reset for running Spark on Yarn? We've sort of stumbled through one property after another, and (unless there's an update I've not yet seen) CDH5 Spark-related properties

spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread salemi
Hi All, I am just trying to save the Kafka DStream to Hadoop as follows:

    val dStream = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap)
    dStream.saveAsHadoopFiles(hdfsDataUrl, data)

It throws the following exception. What am I doing wrong? 14/08/15 14:30:09 ERROR

Open sourcing Spindle by Adobe Research, a web analytics processing engine in Scala, Spark, and Parquet.

2014-08-15 Thread Brandon Amos
Hi Spark community, At Adobe Research, we're happy to open source a prototype technology called Spindle we've been developing over the past few months for processing analytics queries with Spark. Please take a look at the repository on GitHub at https://github.com/adobe-research/spindle, and we

Re: spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread Sean Owen
Somewhere, your function has a reference to the Hadoop JobConf object and is trying to send that to the workers. It's not in the code you pasted, so it must be from something slightly different? It shouldn't need to send that around, and in fact it can't be serialized, as you see. If you need a Hadoop

Re: ALS checkpoint performance

2014-08-15 Thread Xiangrui Meng
Guoqiang reported some results in his PRs https://github.com/apache/spark/pull/828 and https://github.com/apache/spark/pull/929 . But this is really problem-dependent. -Xiangrui On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, Are there any experiments

Re: spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread salemi
Look, this is the whole program. I am not trying to serialize the JobConf.

    def main(args: Array[String]) {
      try {
        val properties = getProperties(settings.properties)
        StreamingExamples.setStreamingLogLevels()
        val zkQuorum = properties.get(zookeeper.list).toString()
        val

MLlib model viewing and saving

2014-08-15 Thread Sameer Tilak
Hi All, I have an MLlib model: val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth). I see the model has the following methods: algo, asInstanceOf, isInstanceOf, predict, toString, topNode. model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0,
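For viewing, a hedged sketch that walks the tree from topNode (field names as in the 1.0.x mllib.tree.model.Node API; double-check against your version):

    import org.apache.spark.mllib.tree.model.Node

    // Recursively print each node's id, prediction, and split feature.
    def printTree(node: Node, indent: String = ""): Unit = {
      println(indent + s"id=${node.id} predict=${node.predict} isLeaf=${node.isLeaf}")
      if (!node.isLeaf) {
        node.split.foreach(s => println(indent + s"  splits on feature ${s.feature}"))
        node.leftNode.foreach(printTree(_, indent + "  "))
        node.rightNode.foreach(printTree(_, indent + "  "))
      }
    }

    printTree(model.topNode)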

Does HiveContext support Parquet?

2014-08-15 Thread lyc
Since SQLContext supports less SQL than Hive (if I understand correctly), I plan to run more queries via HQL. However, is it possible to create tables as Parquet in HQL? What kind of commands should I use? Thanks in advance for any information. -- View this message in context:

Re: Scala Spark Distinct on a case class doesn't work

2014-08-15 Thread clarkroberts
I just discovered that the Distinct call is working as expected when I run a driver through spark-submit. This is only an issue in the REPL environment. Very strange... -- View this message in context:

Updating exising JSON files

2014-08-15 Thread ejb11235
I have a bunch of JSON files stored in HDFS that I want to read in, modify, and write back out. I'm new to all this and am not sure if this is even the right thing to do. Basically, my JSON files contain my raw data, and I want to calculate some derived data and add it to the existing data. So
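Since HDFS files aren't modified in place, the usual pattern is read, derive, and write a new dataset. A hedged sketch using json4s (which Spark already depends on); paths and the derived field are made up, and one JSON object per line is assumed:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._
    import org.json4s.JsonDSL._

    val raw = sc.textFile("hdfs:///data/raw/*.json")
    val enriched = raw.map { line =>
      val json = parse(line)
      val derived: JValue = ("derivedScore" -> 0.42)  // hypothetical derived field
      compact(render(json merge derived))
    }
    enriched.saveAsTextFile("hdfs:///data/enriched")  // write out as a new dataset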

Re: Does HiveContext support Parquet?

2014-08-15 Thread Silvio Fiorito
Yes, you can write to Parquet tables. On Spark 1.0.2 all I had to do was include the parquet-hive-bundle-1.5.0.jar on my classpath.
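With that jar on the classpath, a Parquet-backed table can be declared through the bundle's SerDe; a sketch using Hive 0.12-era syntax (table and columns made up):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.hql("""
      CREATE TABLE parquet_events (id INT, name STRING)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
        OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    """)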

Re: spark streaming - saving kafka DStream into hadoop throws exception

2014-08-15 Thread salemi
If I reduce the app to the following code then I don't see the exception. It creates the Hadoop files but they are empty! The DStream doesn't get written out to the files!

    def main(args: Array[String]) {
      try {
        val properties = getProperties(settings.properties)

Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-15 Thread abhiguruvayya
My use case is as follows:
1. Read input data from the local file system using sparkContext.textFile(input path).
2. Partition the input data (80 million records) into partitions using RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer function. Without using coalesce() or
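Two knobs worth comparing here (a sketch; the counts are illustrative): textFile's minPartitions sets the split count at read time, while coalesce only narrows an existing count; increasing it requires repartition, which shuffles.

    // Option 1: ask for the partition count when reading.
    val raw = sc.textFile("/path/to/input", 64)

    // Option 2: change it afterwards.
    val fewer = raw.coalesce(16)      // narrows without a shuffle
    val more  = raw.repartition(128)  // full shuffle; needed to *increase* partitions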

Re: GraphX Pagerank application

2014-08-15 Thread Ankur Dave
On Wed, Aug 6, 2014 at 11:37 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: To perform the PageRank I have to create a graph object, adding the edges by setting sourceID=id and distID=brand. In GraphLab there is a function: g = SGraph().add_edges(data, src_field='id',
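In GraphX the rough equivalent is building the edge RDD yourself; a hedged sketch that assumes ids and brands have already been encoded as Longs (GraphX vertex ids are 64-bit):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // Hypothetical (id, brand) pairs, both pre-encoded as Long vertex ids.
    val pairs: RDD[(Long, Long)] = sc.parallelize(Seq((1L, 100L), (2L, 100L), (2L, 101L)))
    val edges = pairs.map { case (id, brand) => Edge(id, brand, 1.0) }
    val graph = Graph.fromEdges(edges, defaultValue = 1.0)
    val ranks = graph.pageRank(0.0001).vertices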

Re: Does HiveContext support Parquet?

2014-08-15 Thread lyc
Thank you for your reply. Do you know where I can find some detailed information about how to use Parquet in HiveContext? Any information is appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-HiveContext-support-Parquet-tp12209p12216.html

Error in sbt/sbt package

2014-08-15 Thread Deep Pradhan
I am getting the following error while running SPARK_HADOOP_VERSION=2.3.0 sbt/sbt package:

    java.io.IOException: Cannot run program
    "/home/deep/spark-1.0.0/usr/lib/jvm/java-7-oracle/bin/javac":
    error=2, No such file or directory
    ...lots of errors
    [error] (core/compile:compile)
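The concatenated path in the error (the Spark directory glued onto the JVM path) suggests JAVA_HOME was set to a relative path; a likely fix, offered as an assumption rather than a confirmed resolution:

    # point JAVA_HOME at an absolute path, then rebuild
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    SPARK_HADOOP_VERSION=2.3.0 sbt/sbt package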