Re: Spark Streaming in Production

2014-12-12 Thread rahulkumar-aws
Run a Spark cluster managed by Apache Mesos. Mesos can run in high-availability mode, in which multiple Mesos masters run simultaneously. - Software Developer, SigmoidAnalytics, Bangalore
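A minimal sketch of pointing an application at a highly-available Mesos master through ZooKeeper (the ZooKeeper hosts and app name are hypothetical placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // with multiple Mesos masters, the driver discovers the current leader via ZooKeeper
    val conf = new SparkConf()
      .setMaster("mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos")
      .setAppName("streaming-app")
    val sc = new SparkContext(conf)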

Re: Access to s3 from spark

2014-12-12 Thread rahulkumar-aws
Try either of the following: *1. Set the access key and secret key in the SparkContext:* sparkContext.set("AWS_ACCESS_KEY_ID", yourAccessKey) sparkContext.set("AWS_SECRET_ACCESS_KEY", yourSecretKey) *2. Set the access key and secret key in the environment before starting your application:* export
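A minimal sketch of the first approach, using the SparkContext's Hadoop configuration, which is how s3n:// paths typically pick up credentials in Spark 1.x (the bucket, path, and placeholder keys are hypothetical; never hard-code real credentials):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("s3-read"))
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "yourAccessKey")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "yourSecretKey")

    // any s3n:// path read after this picks up the credentials above
    val lines = sc.textFile("s3n://your-bucket/path/to/data.txt")
    println(lines.count())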

Re: Exception using amazonaws library

2014-12-12 Thread Akhil Das
It's a jar conflict (httpclient, http://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.0-alpha4). You could download the appropriate version of that jar and put it on the classpath before your assembly jar, and hopefully it will avoid the conflict. Thanks Best Regards On

Serialization issue when using HBase with Spark

2014-12-12 Thread yangliuyu
The scenario is using an HTable instance to scan multiple rowkey ranges in Spark tasks, which looks like the code below: Option 1: val users = input .map { case (deviceId, uid) => uid }.distinct().sortBy(x => x).mapPartitions(iterator => { val conf = HBaseConfiguration.create() val table = new HTable(conf,
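A minimal sketch of the per-partition pattern being discussed (assuming the HBase 0.98-era client API, an input RDD of (deviceId, uid) pairs, and a hypothetical table name); creating HBaseConfiguration and HTable inside mapPartitions keeps the non-serializable configuration out of the closure:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{Get, HTable}
    import org.apache.hadoop.hbase.util.Bytes

    // input: RDD[(String, String)] of (deviceId, uid) -- assumed
    val users = input
      .map { case (deviceId, uid) => uid }
      .distinct()
      .mapPartitions { iterator =>
        // built on the executor once per partition, so nothing non-serializable is shipped
        val conf = HBaseConfiguration.create()
        val table = new HTable(conf, "user_table")   // hypothetical table name
        val rows = iterator.map { uid =>
          val result = table.get(new Get(Bytes.toBytes(uid)))
          (uid, !result.isEmpty)                     // keep only simple, serializable values
        }.toList                                     // materialize before closing the table
        table.close()
        rows.iterator
      }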

Re: Serialization issue when using HBase with Spark

2014-12-12 Thread Akhil Das
Can you paste the complete code? It looks like at some point you are passing a Hadoop Configuration, which is not serializable. You can look at this thread for a similar discussion: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-into-HBase-td13378.html Thanks Best Regards On

Re: Exception using amazonaws library

2014-12-12 Thread Albert Manyà
Hi, Thanks for your reply. I tried with the jar you pointed to, but it complains about the missing HttpPatch class, which only appears in httpclient 4.2: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/methods/HttpPatch at
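One way to force a compatible httpclient onto the classpath in an sbt build (a hedged sketch; 4.2.6 is just an example release at or above 4.2, not a version named in the thread):

    // build.sbt -- pin httpclient so the assembly does not ship the older 4.0-alpha4 classes
    dependencyOverrides += "org.apache.httpcomponents" % "httpclient" % "4.2.6"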

Re: Adding a column to a SchemaRDD

2014-12-12 Thread Yanbo Liang
An RDD is immutable, so you cannot modify it. If you want to modify some value or the schema in an RDD, use map to generate a new RDD. The following code is for your reference: def add(a: Int, b: Int): Int = { a + b } val d1 = sc.parallelize(1 to 10).map { i => (i, i+1, i+2) } val d2 = d1.map { i => (i._1,

[Graphx] the communication cost of leftJoin

2014-12-12 Thread Yifan LI
Hi, I am trying to leftJoin another vertex RDD (e.g. vB) with this one (vA): vA.leftJoin(vB)(f) - vA is the vertex RDD in graph G, and G is edge-partitioned using EdgePartition2D. - vB is created using the default partitioner (actually I am not sure...) So, I am wondering, if vB has the same

Spark CDH5 packages

2014-12-12 Thread Jing Dong
Hi, I'm new to this list, so please excuse me if I'm asking simple questions. We are experimenting with Spark deployment on existing CDH clusters. However, the Spark packages that come with CDH are very out of date (v1.0.0). Has anyone had experience with a custom Spark upgrade for CDH5? Any

Re: Spark CDH5 packages

2014-12-12 Thread Sean Owen
No, CDH 5.2 actually includes Spark 1.1, which is the latest released minor version; 5.3 will include 1.2, which is not released yet. You can make a build of just about any version of Spark for CDH5 and manually install it yourself, sure, but it would be easier to just update CDH. The instructions are

Read data from SparkStreaming from Java socket.

2014-12-12 Thread Guillermo Ortiz
Hi, I'm a newbie with Spark. I'm just trying to use Spark Streaming and filter some data sent over a Java socket, but it's not working... it works when I use ncat. Why is it not working? My Spark code is just this: val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Test") val
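For reference, a minimal working shape of such a program (the host, port, and filter string are hypothetical; socketTextStream connects as a client to an already-running TCP server):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Test")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // connects to a TCP server at localhost:9999; start one first, e.g. with `nc -lk 9999`
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.filter(_.contains("ERROR")).print()

    ssc.start()
    ssc.awaitTermination()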

...FileNotFoundException: Path is not a file: - error on accessing HDFS with sc.wholeTextFiles

2014-12-12 Thread Karen Murphy
When I try to load a text file from an HDFS path using sc.wholeTextFiles("hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/") I get the following error: java.io.FileNotFoundException: Path is not a file: /graphx/anywebsite.com/anywebsite.com/css (full stack trace at bottom of

Unit testing and Spark Streaming

2014-12-12 Thread Eric Loots
Hi, I've started my first experiments with Spark Streaming and began by setting up an environment using ScalaTest for unit testing. I poked around on this mailing list and googled the topic. One of the things I wanted to be able to do is use Scala sequences as a data source in the tests
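One common way to feed a Scala collection into a streaming test is queueStream; a minimal sketch (the batch interval, sample data, and sleep-based wait are hypothetical simplifications of what a real ScalaTest would do):

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    // turn an in-memory sequence of batches into an input DStream, one RDD per batch
    val batches = Seq(Seq(1, 2, 3), Seq(4, 5))
    val queue = mutable.Queue[RDD[Int]](batches.map(b => ssc.sparkContext.parallelize(b)): _*)
    val stream = ssc.queueStream(queue)

    val results = mutable.ArrayBuffer[Int]()
    stream.foreachRDD(rdd => results ++= rdd.collect())

    ssc.start()
    Thread.sleep(3000)   // in a real test, wait for batch completion instead of sleeping
    ssc.stop()
    assert(results.sorted == Seq(1, 2, 3, 4, 5))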

Re: Read data from SparkStreaming from Java socket.

2014-12-12 Thread Akhil Das
I have created a ServerSocket program which you can find here: https://gist.github.com/akhld/4286df9ab0677a555087 It simply listens on the given port, and when a client connects it will send the contents of the given file. I'm attaching the executable jar as well; you can run the jar as: java

Re: ...FileNotFoundException: Path is not a file: - error on accessing HDFS with sc.wholeTextFiles

2014-12-12 Thread Akhil Das
I'm not quite sure whether Spark will go inside subdirectories and pick up files from them. You could do something like the following to bring all files into one directory: find . -iname '*' -exec mv '{}' . \; Thanks Best Regards On Fri, Dec 12, 2014 at 6:34 PM, Karen Murphy k.l.mur...@qub.ac.uk
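An alternative worth trying (an assumption on my part, not something confirmed in the thread: the path passed to wholeTextFiles accepts Hadoop glob patterns, so a file-matching wildcard can skip over subdirectories such as css/):

    // read only matching files directly under the site directory, ignoring subdirectories
    val pages = sc.wholeTextFiles("hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/*.html")
    println(pages.count())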

Re: Read data from SparkStreaming from Java socket.

2014-12-12 Thread Guillermo Ortiz
I don't understand what Spark Streaming's socketTextStream is waiting for... Is it like a server, so you just have to send data from a client? Or what is it expecting? 2014-12-12 14:19 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com: I have created a ServerSocket program which you can find over here

Re: Read data from SparkStreaming from Java socket.

2014-12-12 Thread Akhil Das
socketTextStream is a socket client which will read from a TCP ServerSocket. Thanks Best Regards On Fri, Dec 12, 2014 at 7:21 PM, Guillermo Ortiz konstt2...@gmail.com wrote: I don't understand what Spark Streaming's socketTextStream is waiting for... is it like a server so you just have to send data

Re: why is spark + scala code so slow, compared to python?

2014-12-12 Thread rzykov
Try this: https://github.com/RetailRocket/SparkMultiTool This loader solved the slow reading of a big data set of small files in HDFS.

Re: Spark 1.1.1, Hadoop 2.6 - Protobuf conflict

2014-12-12 Thread kmurph
I also had this problem with Spark 1.1.1. At the time I was using Hadoop 0.20. To get around it I installed Hadoop 2.5.2 and set protobuf.version to 2.5.0 in the build command, like so: mvn -Phadoop-2.5 -Dhadoop.version=2.5.2 -Dprotobuf.version=2.5.0 -DskipTests clean package So I

Re: Spark 1.1.1, Hadoop 2.6 - Protobuf conflict

2014-12-12 Thread Sean Owen
There is no hadoop-2.5 profile. You can use hadoop-2.4 for 2.4+. This profile already sets protobuf.version to 2.5.0 for this reason. It is already something you can set on the command line as it is read as a Maven build property. It does not pick up an older version because it's somewhere on your

Re: Unit testing and Spark Streaming

2014-12-12 Thread Emre Sevinc
On Fri, Dec 12, 2014 at 2:17 PM, Eric Loots eric.lo...@gmail.com wrote: How can the log level in test mode be reduced (or extended when needed)? Hello Eric, The following might be helpful for reducing the log messages during unit testing: http://stackoverflow.com/a/2736/236007 -- Emre
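One approach along those lines, configuring log4j programmatically from the test setup (a hedged sketch; the chosen logger names and levels are just the usual noisy candidates, adjust as needed):

    import org.apache.log4j.{Level, Logger}

    // silence Spark's INFO chatter during tests; call before creating the SparkContext/StreamingContext
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty").setLevel(Level.ERROR)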

Re: Including data nucleus tools

2014-12-12 Thread spark.dubovsky.jakub
Hi, I had time to try it again. I submitted my app with the same command plus these additional options: --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,lib/datanucleus-rdbms-3.2.9.jar Now the app successfully creates a hive context. So my question remains: Is

Re: Spark CDH5 packages

2014-12-12 Thread Jing Dong
Hi Sowen, Thanks for the tip. When will CDH 5.3 be released? Some sort of timeline would be helpful. Thanks, Jing On 12 Dec 2014, at 12:34, Sean Owen so...@cloudera.com wrote: No, CDH 5.2 includes Spark 1.1 actually, which is the latest released minor version; 5.3 will include 1.2, which

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-12 Thread Mario Pastorelli
Hi, I asked on SO and got an answer about this: http://stackoverflow.com/questions/27444512/missing-classes-from-the-assembly-file-created-by-sbt-assembly . Adding fullClasspath in assembly := (fullClasspath in Compile).value at the end of my build.sbt solved the problem, apparently. Best,

Cannot pickle DecisionTreeModel in the pyspark

2014-12-12 Thread Gen
Hi everyone, I am trying to save a decision tree model in Python and I use pickle.dump() to do this. However, it returns the following error: cPickle.UnpickleableError: Cannot pickle type 'thread.lock' objects I did some tests on the other models. It seems that decision tree

RE: Session for connections?

2014-12-12 Thread Ashic Mahtab
Looks like the way to go. Quick question regarding the connection pool approach - if I have a connection that gets lazily instantiated, will it automatically die if I kill the driver application? In my scenario, I can keep a connection open for the duration of the app, and aren't that

Re: Spark Streaming in Production

2014-12-12 Thread twizansk
Thanks for the reply. I might be misunderstanding something basic. As far as I can tell, the cluster manager (e.g. Mesos) manages the master and worker nodes but not the drivers or receivers; those are external to the Spark cluster: http://spark.apache.org/docs/latest/cluster-overview.html

Re: Spark Streaming in Production

2014-12-12 Thread francois . garillot
IIUC, Receivers run on workers, colocated with other tasks. The Driver, on the other hand, can either run on the querying machine (local mode) or as a worker (cluster mode). — FG On Fri, Dec 12, 2014 at 4:49 PM, twizansk twiza...@gmail.com wrote: Thanks for the reply. I might be

Re: Adding a column to a SchemaRDD

2014-12-12 Thread Nathan Kronenfeld
(1) I understand about immutability; that's why I said I wanted a new SchemaRDD. (2) I specifically asked for a non-SQL solution that takes a SchemaRDD and results in a new SchemaRDD with one new function. (3) The DSL stuff is a big clue, but I can't find adequate documentation for it. What I'm
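For what the non-SQL route looked like on the Spark 1.1/1.2-era SchemaRDD API, a hedged sketch (the input SchemaRDD people, its columns, and the derived column are all hypothetical, and method names may need adjusting for the exact release; the idea is to map the rows and re-apply an extended schema):

    import org.apache.spark.sql._

    // sqlContext: SQLContext, people: SchemaRDD with columns (name: String, age: Int) -- assumed
    val rowsWithExtra = people.map(row => Row((row.toSeq :+ row.getInt(1) * 2): _*))
    val newSchema = StructType(people.schema.fields :+ StructField("doubleAge", IntegerType, nullable = true))
    val withNewColumn: SchemaRDD = sqlContext.applySchema(rowsWithExtra, newSchema)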

Re: Unit testing and Spark Streaming

2014-12-12 Thread Jay Vyas
https://github.com/jayunit100/SparkStreamingCassandraDemo On this note, I've built a framework which is mostly pure, so that functional unit tests can be run composing mock data for Twitter statuses, with just regular JUnit... That might also be relevant. I think at some point we should come

Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread Bui, Tri
Hi, Trying to use LBFGS as the optimizer, do I need to implement feature scaling via StandardScaler or does LBFGS do it by default? The following code generated the error "Failure again! Giving up and returning. Maybe the objective is just poorly behaved?". val data =

RDD lineage and broadcast variables

2014-12-12 Thread Ron Ayoub
I'm still wrapping my head around the fact that the data backing an RDD is immutable, since an RDD may need to be reconstructed from its lineage at any point. In the context of clustering there are many iterations where an RDD may need to change (for instance cluster assignments, etc.) based on

Passing Spark Configuration from Driver (Master) to all of the Slave nodes

2014-12-12 Thread Demi Ben-Ari
Hi to all, Our problem was passing configuration from the Spark driver to the slaves. After a lot of time spent figuring out how things work, this is the solution I came up with. Hope this will be helpful for others as well. You can read about it in my blog post

Re: Spark Server - How to implement

2014-12-12 Thread Manoj Samel
Thanks Marcelo. Spark gurus/Databricks team - do you have something on the roadmap for such a Spark server? Thanks, On Thu, Dec 11, 2014 at 5:43 PM, Marcelo Vanzin van...@cloudera.com wrote: Oops, sorry, fat fingers. We've been playing with something like that inside Hive:

How to get driver id?

2014-12-12 Thread Xingwei Yang
Hi Guys: I want to kill an application, but I could not find the driver id of the application in the web UI. Is there any way to get it from the command line? Thanks -- Sincerely Yours Xingwei Yang https://sites.google.com/site/xingweiyang1223/

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
You need to apply the StandardScaler yourself to help convergence. LBFGS just takes whatever objective function you provide without doing any scaling. I would like to provide a LinearRegressionWithLBFGS which does the scaling internally in the near future. Sincerely, DB Tsai
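A minimal sketch of that scaling step before calling into LBFGS (assuming the training set is an RDD[LabeledPoint] named data; scaling the response/label is a separate concern, discussed later in the thread):

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.regression.LabeledPoint

    // fit the scaler on the feature vectors only, then transform every point
    val scaler = new StandardScaler(withMean = true, withStd = true).fit(data.map(_.features))
    val scaled = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features))).cache()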

RE: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread Bui, Tri
Thanks for the confirmation. FYI, the code below works for a similar dataset, but with the feature magnitude changed, LBFGS converged to the right weights. For example, a time-sequential feature with values 1, 2, 3, 4, 5 would generate the error, while the sequential feature 14111, 14112, 14113, 14115 would

resource allocation spark on yarn

2014-12-12 Thread gpatcham
Hi All, I have Spark on YARN and there are multiple Spark jobs on the cluster. Sometimes some jobs are not getting enough resources even when there are enough free resources available on the cluster, even when I use the settings below: --num-workers 75 \ --worker-cores 16 Jobs stick with the resources

GraphX for large scale PageRank (~4 billion nodes, ~128 billion edges)

2014-12-12 Thread Stephen Merity
Hi! tldr; We're looking at potentially using Spark+GraphX to compute PageRank over a 4 billion node + 128 billion edge graph on a regular (monthly) basis, possibly growing larger in size over time. If anyone has hints / tips / upcoming optimizations I should test out (or wants to contribute --

how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
I have an RDD which is potentially too large to store in memory with collect. I want a single task to write the contents as a file to HDFS. Time is not a large issue but memory is. I use the following, converting my RDD (scans) to a local iterator. This works, but hasNext shows up as a separate task
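A hedged sketch of that approach, pulling partitions through the driver one at a time and writing a single HDFS file (the output path and record type are hypothetical; toLocalIterator keeps only one partition in driver memory at a time, though each partition still runs as its own job):

    import java.io.PrintWriter
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // scans: RDD[String] -- assumed
    val fs = FileSystem.get(new Configuration())
    val out = new PrintWriter(fs.create(new Path("hdfs:///tmp/scans.txt")))
    try {
      scans.toLocalIterator.foreach(out.println)
    } finally {
      out.close()
    }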

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
It seems that your response is not scaled, which will cause issues in LBFGS. Typically, people train linear regression with zero-mean/unit-variance features and response, without training the intercept. Since the response is zero-mean, the intercept will always be zero. When you convert the

Spark 1.2 + Avro file does not work in HDP2.2

2014-12-12 Thread Manas Kar
Hi Experts, I have recently installed HDP 2.2 (which depends on Hadoop 2.6). My Spark 1.2 is built with the hadoop-2.4 profile. My program has the following dependencies: val avro = "org.apache.avro" % "avro-mapred" % "1.7.7" val spark = "org.apache.spark" % "spark-core_2.10" % "1.2.0" % "provided" My

Spark 1.2 + Avro does not work in HDP2.2

2014-12-12 Thread manasdebashiskar
Hi Experts, I have recently installed HDP 2.2 (which depends on Hadoop 2.6). My Spark 1.2 is built with the hadoop-2.3 profile: (mvn -Pyarn -Dhadoop.version=2.6.0 -Dyarn.version=2.6.0 -Phadoop-2.3 -Phive -DskipTests clean package) My program has the following dependencies: val avro =

Re: how to convert an rdd to a single output file

2014-12-12 Thread Sameer Farooqui
Instead of doing this on the compute side, I would just write out the file with different blocks initially into HDFS and then use hadoop fs -getmerge or HDFSConcat to get one final output file. - SF On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis lordjoe2...@gmail.com wrote: I have an RDD
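The same merge can also be done programmatically with Hadoop's FileUtil.copyMerge (a hedged sketch; this is an alternative I'm adding rather than something from the thread, the paths are hypothetical, and copyMerge exists in Hadoop 2.x but was removed in Hadoop 3):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    // write the RDD out in parallel first, then merge the part files into one
    rdd.saveAsTextFile("/tmp/output-parts")
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    FileUtil.copyMerge(fs, new Path("/tmp/output-parts"),
                       fs, new Path("/tmp/output.txt"),
                       false /* deleteSource */, conf, null)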

RE: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread Bui, Tri
Thanks for the info. How do I use StandardScaler() to scale example data (10246.0,[14111.0,1.0]) ? Thx tri -Original Message- From: dbt...@dbtsai.com [mailto:dbt...@dbtsai.com] Sent: Friday, December 12, 2014 1:26 PM To: Bui, Tri Cc: user@spark.apache.org Subject: Re: Do I need to

Re: resource allocation spark on yarn

2014-12-12 Thread Sameer Farooqui
Hi, FYI - There are no Worker JVMs used when Spark is launched under YARN. Instead the NodeManager in YARN does what the Worker JVM does in Spark Standalone mode. For YARN you'll want to look into the following settings: --num-executors: controls how many executors will be allocated

Re: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-12 Thread DB Tsai
You can do something like the following: val rddVector = input.map({ case (response, vec) => { val newVec = MLUtils.appendBias(vec); newVec.toBreeze(newVec.size - 1) = response; newVec } }) val scalerWithResponse = new StandardScaler(true, true).fit(rddVector) val trainingData =

Re: Spark Server - How to implement

2014-12-12 Thread Patrick Wendell
Hey Manoj, One proposal potentially of interest is the Spark Kernel project from IBM - you should look for their announcement. The interface in that project is more of a remote REPL interface, i.e. you submit commands (as strings) and get back results (as strings), but you don't have direct programmatic

Re: resource allocation spark on yarn

2014-12-12 Thread Giri P
But on Spark 0.9 we don't have these options: --num-executors: controls how many executors will be allocated --executor-memory: RAM for each executor --executor-cores: CPU cores for each executor On Fri, Dec 12, 2014 at 12:27 PM, Sameer Farooqui same...@databricks.com wrote: Hi, FYI - There

IBM open-sources Spark Kernel

2014-12-12 Thread Robert C Senkbeil
We are happy to announce a developer preview of the Spark Kernel which enables remote applications to dynamically interact with Spark. You can think of the Spark Kernel as a remote Spark Shell that uses the IPython notebook interface to provide a common entrypoint for any application. The Spark

Re: Submiting multiple jobs via different threads

2014-12-12 Thread Michael Quinlan
Haoming, If the Spark UI states that one of the jobs is in the WAITING state, this is a resource issue. You will need to set properties such as spark.executor.memory and spark.cores.max. Set these so that each instance only takes a portion of the available worker memory and cores. Regards, Mike
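For example, a hedged sketch of capping each application so that several can share the cluster (the values are placeholders to be sized against the available workers):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("job-a")
      .set("spark.executor.memory", "2g")   // per-executor RAM
      .set("spark.cores.max", "4")          // total cores this application may claim
    val sc = new SparkContext(conf)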

Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
The objective is to let the Spark application generate a file in a format which can be consumed by other programs. As I said, I am willing to give up parallelism at this stage (all the expensive steps were earlier), but I do want an efficient way to pass once through an RDD without the requirement to

Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
What would good spill settings be? On Fri, Dec 12, 2014 at 2:45 PM, Sameer Farooqui same...@databricks.com wrote: You could try re-partitioning or coalescing the RDD to one partition and then writing it to disk. Make sure you have good spill settings enabled so that the RDD can spill to the local

Spark SQL API Doc IsCached as SQL command

2014-12-12 Thread Judy Nash
Hello, A few questions on Spark SQL: 1) Does Spark SQL support an equivalent SQL query for the Scala command isCached(tableName)? 2) Is there a documentation spec I can reference for questions like this? The closest doc I can find is this one:

Re: Spark SQL API Doc IsCached as SQL command

2014-12-12 Thread Mark Hamstra
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory On Fri, Dec 12, 2014 at 3:14 PM, Judy Nash judyn...@exchange.microsoft.com wrote: Hello, Few questions on Spark SQL: 1) Does Spark SQL support equivalent SQL Query for Scala command:

Re: IBM open-sources Spark Kernel

2014-12-12 Thread Robert C Senkbeil
Hi Sam, We developed the Spark Kernel with a focus on the newest version of the IPython message protocol (5.0) for the upcoming IPython 3.0 release. We are building around Apache Spark's REPL, which is used in the current Spark Shell implementation. The Spark Kernel was designed to be

SVMWithSGD.run source code

2014-12-12 Thread Caron
I'm looking at the source code of SVM.scala and trying to find the location of the source code of the following function: def train(...): SVMModel = { new SVMWithSGD( ... ).run(input, initialWeights) } I'm wondering where I can find the code for SVMWithSGD().run()? I'd like to see the

Re: SVMWithSGD.run source code

2014-12-12 Thread Sean Owen
class SVMWithSGD is defined in the same file you're already looking at. It inherits the run() method from its superclass, GeneralizedLinearAlgorithm. An IDE would help you trace this right away. On Sat, Dec 13, 2014 at 12:52 AM, Caron caron.big...@gmail.com wrote: I'm looking at the source code

sbt assembly with hive

2014-12-12 Thread Stephen Boesch
What is the proper way to build with Hive from sbt? SPARK_HIVE is deprecated. However, after running the following: sbt -Pyarn -Phadoop-2.3 -Phive assembly/assembly and then in bin/pyspark: hivectx = HiveContext(sc) hivectx.hiveql("select * from my_table") Exception: (You must build

clean up of state in State Dstream

2014-12-12 Thread Sunil Yarram
I am using *updateStateByKey* to maintain state in my streaming application; the state accumulates over time. Is there a way I can delete old state data or put a limit on the amount of state the state DStream can keep in the system? Thanks, Sunil.

Re: sbt assembly with hive

2014-12-12 Thread Abhi Basu
I am getting the same message when trying to get a HiveContext in CDH 5.1 after enabling Spark. I am thinking Spark should come with Hive enabled (default option), as the Hive metastore is a common way to share data, given the popularity of Hive and other SQL-over-Hadoop technologies like Impala. Thanks,

Re: clean up of state in State Dstream

2014-12-12 Thread Silvio Fiorito
If you no longer need to maintain state for a key, just return None for that value and it gets removed. From: Sunil Yarram yvsu...@gmail.com Date: Friday, December 12, 2014 at 9:44 PM To: user@spark.apache.org
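A minimal sketch of that pattern (the pair stream, the state type, and the 10-batch expiry threshold are hypothetical; returning None drops the key's state):

    import org.apache.spark.streaming.dstream.DStream

    // state = (running count, number of batches since the key was last seen)
    def update(newValues: Seq[Int], state: Option[(Int, Int)]): Option[(Int, Int)] = {
      val (count, idle) = state.getOrElse((0, 0))
      if (newValues.nonEmpty) Some((count + newValues.sum, 0))
      else if (idle >= 10) None                  // key not seen for 10 batches: remove its state
      else Some((count, idle + 1))
    }

    // pairs: DStream[(String, Int)] -- assumed; checkpointing must be enabled on the StreamingContext
    val counts: DStream[(String, (Int, Int))] = pairs.updateStateByKey(update _)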

Re: resource allocation spark on yarn

2014-12-12 Thread Tsuyoshi OZAWA
Hi, In addition to the options Sameer mentioned, we need to enable the external shuffle manager, right? Thanks, - Tsuyoshi On Sat, Dec 13, 2014 at 5:27 AM, Sameer Farooqui same...@databricks.com wrote: Hi, FYI - There are no Worker JVMs used when Spark is launched under YARN. Instead the

Re: Read data from SparkStreaming from Java socket.

2014-12-12 Thread Tathagata Das
Yes, socketTextStream starts a TCP client that tries to connect to a TCP server (localhost: in your case). If there is a server running on that port that can send data to connected TCP connections, then you will receive data in the stream. Did you check out the quick example in the streaming

Re: Spark SQL API Doc IsCached as SQL command

2014-12-12 Thread Cheng Lian
There isn't a SQL statement that directly maps to SQLContext.isCached, but you can use EXPLAIN EXTENDED to check whether the underlying physical plan is an InMemoryColumnarTableScan. On 12/13/14 7:14 AM, Judy Nash wrote: Hello, A few questions on Spark SQL: 1) Does Spark SQL support