Spark: All masters are unresponsive!

2014-07-07 Thread Sameer Tilak
Hi All, I am having a few issues with stability and scheduling. When I use spark-shell to submit my application, I get the following error message and spark-shell crashes. I have a small 4-node cluster for PoC. I tried both manual and script-based cluster setup. I tried using FQDN as well for

Spark SQL registerAsTable requires a Java Class?

2014-07-07 Thread Ionized
The Java API requires a Java Class to register as a table. // Apply a schema to an RDD of JavaBeans and register it as a table. JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class); schemaPeople.registerAsTable("people"); If instead of JavaRDD I had JavaRDD (along with the knowledge

Error and doubts in using MLlib Naive Bayes for text classification

2014-07-07 Thread Rahul Bhojwani
Hello, I am a novice. I want to classify text into two classes. For this purpose I want to use a Naive Bayes model. I am using Python for it. Here are the problems I am facing: *Problem 1:* I wanted to use all words as features for the bag of words model. Which means my features will be count
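
The question is about Python, but for reference, here is a minimal Scala sketch of the same bag-of-words idea with MLlib's NaiveBayes, assuming an existing SparkContext sc; the vocabulary and documents are toy examples:

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical fixed vocabulary; each document becomes a vector of word counts.
val vocab = Array("spark", "fast", "hadoop", "mapreduce")

// Toy labeled documents: (label, text).
val docs = sc.parallelize(Seq((1.0, "spark is fast"), (0.0, "hadoop mapreduce rocks")))

val training = docs.map { case (label, text) =>
  val words = text.toLowerCase.split("\\s+")
  LabeledPoint(label, Vectors.dense(vocab.map(w => words.count(_ == w).toDouble)))
}

val model = NaiveBayes.train(training)
model.predict(Vectors.dense(Array(1.0, 1.0, 0.0, 0.0)))  // classify a new count vector
```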

Reply: Spark RDD Disk Persistence

2014-07-07 Thread Lizhengbing (bing, BIPA)
You might store your data in Tachyon. From: Jahagirdar, Madhu [mailto:madhu.jahagir...@philips.com] Sent: July 8, 2014 10:16 To: user@spark.apache.org Subject: Spark RDD Disk Persistence Should I use disk-based persistence for RDDs, and if the machine goes down during the program execution, next ti

Re: the Pre-built packages for CDH4 cannot support YARN?

2014-07-07 Thread Matei Zaharia
They are for CDH4 without YARN, since YARN is experimental in that release. You can download one of the Hadoop 2 packages if you want to run on YARN. Or you might have to build specifically against CDH4's version of YARN if that doesn't work. Matei On Jul 7, 2014, at 9:37 PM, ch huang wrote: > hi,mai

Re: Is the order of messages guaranteed in a DStream?

2014-07-07 Thread Mayur Rustagi
If you receive data through multiple receivers across the cluster, I don't think any order can be guaranteed. Order in distributed systems is tough. On Tuesday, July 8, 2014, Yan Fang wrote: > I know the order of processing DStream is guaranteed. Wondering if the > order of messages in one DStre

Re: Spark Installation

2014-07-07 Thread Krishna Sankar
Couldn't find any reference to CDH in pom.xml (profiles or the hadoop.version). I am also wondering how the CDH-compatible artifact was compiled. Cheers On Mon, Jul 7, 2014 at 8:07 PM, Srikrishna S wrote: > Hi All, > > Does anyone know what the command line arguments to mvn are to generate > the

Seattle Spark Meetup slides: xPatterns, Fun Things, and Machine Learning Streams - next is Interactive OLAP

2014-07-07 Thread Denny Lee
Apologies for the delay, but we've had a bunch of great slides and sessions at Seattle Spark Meetup these past couple of months, including Claudiu Barbura's "xPatterns on Spark, Shark, Mesos, and Tachyon"; Paco Nathan's "Fun Things You Can Do with Spark 1.0"; and "Machine Learning Streams with Spar

Re: Spark Installation

2014-07-07 Thread Jaideep Dhok
Hi Srikrishna, You can use the make-distribution script in Spark to generate the binary. Example - ./make-distribution.sh --tgz --hadoop HADOOP_VERSION The above script calls maven, so you can look into it to get the exact mvn command too. Thanks, Jaideep On Tue, Jul 8, 2014 at 8:37 AM, Srikris

Spark Installation

2014-07-07 Thread Srikrishna S
Hi All, Does anyone know what the command line arguments to mvn are to generate the pre-built binary for Spark on Hadoop 2 (CDH5)? I would like to pull in a recent bug fix from spark-master and rebuild the binaries in exactly the same way as those provided on the website. I have tried th

Help for the large number of the input data files

2014-07-07 Thread innowireless TaeYun Kim
Hi, I need help with implementation best practices. The operating environment is as follows: - Log data files arrive irregularly. - The size of a log data file ranges from 3.9KB to 8.5MB; the average is about 1MB. - The number of records in a data file ranges from 13 to 22,000 lines.

RE: Spark RDD Disk Persistence

2014-07-07 Thread Shao, Saisai
Hi Madhu, I don't think you can reuse the persisted RDD the next time you run the program, because the folder for RDD materialization will have changed, and Spark will also lose the information about how to retrieve the previously persisted RDD. AFAIK Spark has a fault tolerance mechanism; node failure wil

Re: SparkSQL - Partitioned Parquet

2014-07-07 Thread Michael Armbrust
The only partitioning that is currently supported is through Hive partitioned tables. Supporting this for parquet as well is on our radar, but probably won't happen for 1.1. On Sun, Jul 6, 2014 at 10:00 PM, Raffael Marty wrote: > Does SparkSQL support partitioned parquet tables? How do I save

Spark RDD Disk Persistence

2014-07-07 Thread Jahagirdar, Madhu
Should I use disk-based persistence for RDDs, and if the machine goes down during the program execution, the next time I rerun the program would the data be intact and not lost? Regards, Madhu Jahagirdar The information contained in this message may be confide

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
Here is a simple example of registering an RDD of Products as a table. It is important that all of the fields are defined as vals in the constructor and that you implement canEqual, productArity and productElement. class Record(val x1: String) extends Product with Serializable { def canEqual(that:
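
For reference, a sketch of how the truncated snippet plausibly continues, with a one-field record registered through the implicit createSchemaRDD conversion (table and field names are illustrative):

```scala
class Record(val x1: String) extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
  def productArity: Int = 1
  def productElement(n: Int): Any = n match {
    case 0 => x1
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion RDD[Product] => SchemaRDD

val records = sc.parallelize(Seq(new Record("a"), new Record("b")))
records.registerAsTable("records")
sqlContext.sql("SELECT x1 FROM records").collect()
```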

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread 张包峰
Hi guys, previously I checked out the old "spork" and updated it to Hadoop 2.0, Scala 2.10.3 and Spark 0.9.1; see my github project: https://github.com/pelick/flare-spork It is also highly experimental, and just directly maps Pig physical operations to Spark RDD transformations/actions

RE: The number of cores vs. the number of executors

2014-07-07 Thread innowireless TaeYun Kim
For your information, I've attached the Ganglia monitoring screen capture to the Stack Overflow question. Please see: http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]

Is the order of messages guaranteed in a DStream?

2014-07-07 Thread Yan Fang
I know the order of processing a DStream is guaranteed. I'm wondering if the order of messages within one DStream is guaranteed. My gut feeling is yes, because RDDs are immutable. Some simple tests support this. I'd like to hear from an "authority" to persuade myself. Thank you. Best, Fang, Yan yanfan

RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
Hi Michael, Thanks for the reply. Actually, last week I tried to play with the Product interface, but I'm not really sure whether I did it correctly or not. Here is what I did: 1. Created an abstract class A with the Product interface, which has 20 parameters, 2. Created a case class B that extends A, and B has 20 paramete

the Pre-built packages for CDH4 cannot support YARN?

2014-07-07 Thread ch huang
Hi, mailing list: I downloaded the pre-built Spark packages for CDH4, but it says they cannot support YARN. Why? Do I need to build it myself with YARN support enabled?

Re: Spark SQL user defined functions

2014-07-07 Thread Michael Armbrust
The names of the directories that are created for the metastore are different ("metastore" vs "metastore_db"), but that should be it. Really we should get rid of LocalHiveContext as it is mostly redundant and the current state is kind of confusing. I've created a JIRA to figure this out before th

Re: usage question for Spark run on YARN

2014-07-07 Thread DB Tsai
spark-client mode runs the driver in your application's JVM, while spark-cluster mode runs the driver in the YARN cluster. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Mon, Jul 7, 2014 at 5:44 PM, C

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Tobias Pfeiffer
Juan, I am doing something similar, just not "insert into SQL database", but "issue some RPC call". I think mapPartitions() may be helpful to you. You could do something like dstream.mapPartitions(iter => { val db = new DbConnection() // maybe only do the above if !iter.isEmpty iter.map(ite
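
A minimal sketch of that idea, assuming dstream is a DStream[String]; it uses foreachPartition, a variant that sidesteps the lazy iterator returned by mapPartitions (which can otherwise outlive the connection). DbConnection is a hypothetical stand-in for a real client:

```scala
// Hypothetical connection type standing in for a real client library.
class DbConnection {
  def insert(record: String): Unit = ()  // stub
  def close(): Unit = ()                 // stub
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    if (records.nonEmpty) {
      val db = new DbConnection()          // one connection per partition, not per record
      try records.foreach(db.insert)       // reuse it for every record in the partition
      finally db.close()
    }
  }
}
```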

Powered By Spark: Can you please add our org?

2014-07-07 Thread Alex Gaudio
Hi, Sailthru is also using Spark. Could you please add us to the Powered By Spark page when you have a chance? Organization Name: Sailthru URL: www.sailthru.com Short Description: Our data science platform uses Spark to build p

Re: how to set spark.executor.memory and heap size

2014-07-07 Thread Alex Gaudio
Hi All, This is a bit late, but I found it helpful. Piggy-backing on Wang Hao's comment, spark will ignore the "spark.executor.memory" setting if you add it to SparkConf via: conf.set("spark.executor.memory", "1g") What you actually should do depends on how you run spark. I found some "offic
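
For what it's worth, a sketch of one common pitfall behind this (not necessarily the exact case Alex describes, and all values are illustrative): executor memory has to be in the SparkConf before the SparkContext is created.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The setting only takes effect if it is in the conf *before* the context
// (and hence the executors) is launched; setting it afterwards does nothing.
val conf = new SparkConf()
  .setAppName("my-app")                // illustrative name
  .set("spark.executor.memory", "1g")  // illustrative value
val sc = new SparkContext(conf)

// With spark-submit, the equivalent is the --executor-memory 1g flag.
```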

Re: Data loading to Parquet using spark

2014-07-07 Thread Soren Macbeth
I typed "spark parquet" into google and the top results was this blog post about reading and writing parquet files from spark http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ On Mon, Jul 7, 2014 at 5:23 PM, Michael Armbrust wrote: > SchemaRDDs, provided by Spark SQL, have a saveAsPar

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread DB Tsai
Actually, the one mode that needs the jar installed on each individual node is standalone mode, which works for both MR1 and MR2. Cloudera and Hortonworks currently support Spark in this way, as far as I know. For both yarn-cluster and yarn-client, Spark will distribute the jars through the distributed cache and

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
> > We know Scala 2.11 has removed the limitation on the number of parameters, but > Spark 1.0 is not compatible with it. So now we are considering using Java > beans instead of Scala case classes. > You can also manually create a class that implements Scala's Product interface. Finally, SPARK-2179

RE: SparkSQL with sequence file RDDs

2014-07-07 Thread Haoming Zhang
Hi Gary, As Michael mentioned, you need to take care of the Scala case classes or Java beans, because SparkSQL needs the schema. Currently we are trying to insert our data into HBase with Scala 2.10.4 and Spark 1.0. All the data are tables. We created one case class for each row, which means th

usage question for Spark run on YARN

2014-07-07 Thread Cheng Ju Chuang
Hi, I am running some simple samples for my project. Right now the Spark sample is running on Hadoop 2.2 with YARN. My question is: what is the main difference when we run as spark-client versus spark-cluster, apart from the different way to submit our job? And what is the specific way to configure the job e

The number of cores vs. the number of executors

2014-07-07 Thread innowireless TaeYun Kim
Hi, I'm trying to understand the relationship between the number of cores and the number of executors when running a Spark job on YARN. The test environment is as follows: - # of data nodes: 3 - Data node machine spec: - CPU: Core i7-4790 (# of cores: 4, # of threads: 8) - RAM: 32GB (8GB

Re: Data loading to Parquet using spark

2014-07-07 Thread Michael Armbrust
SchemaRDDs, provided by Spark SQL, have a saveAsParquetFile command. You can turn a normal RDD into a SchemaRDD using the techniques described here: http://spark.apache.org/docs/latest/sql-programming-guide.html This should work with Impala, but if you run into any issues please let me know. On
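
A minimal sketch following the linked guide, assuming an existing SparkContext sc and illustrative paths:

```scala
// Define a schema via a case class, then write the RDD out as Parquet.
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly turns RDD[Person] into a SchemaRDD

val people = sc.textFile("people.txt")       // illustrative input path
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.saveAsParquetFile("people.parquet")   // writes a directory of Parquet files
```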

Re: Spark SQL : Join throws exception

2014-07-07 Thread Yin Huai
Hi Subacini, Just want to follow up on this issue. SPARK-2339 has been merged into the master and 1.0 branch. Thanks, Yin On Tue, Jul 1, 2014 at 2:00 PM, Yin Huai wrote: > Seems it is a bug. I have opened > https://issues.apache.org/jira/browse/SPARK-2339 to track it. > > Thank you for repor

Re: Comparative study

2014-07-07 Thread Soumya Simanta
Daniel, Do you mind sharing the size of your cluster and the production data volumes? Thanks Soumya > On Jul 7, 2014, at 3:39 PM, Daniel Siegmann wrote: > > From a development perspective, I vastly prefer Spark to MapReduce. The > MapReduce API is very constrained; Spark's API feels muc

Re: SparkSQL with sequence file RDDs

2014-07-07 Thread Michael Armbrust
I haven't heard any reports of this yet, but I don't see any reason why it wouldn't work. You'll need to manually convert the objects that come out of the sequence file into something where SparkSQL can detect the schema (i.e. scala case classes or java beans) before you can register the RDD as a t

Re: Comparative study

2014-07-07 Thread Sean Owen
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon wrote: > For Scala API on map/reduce (hadoop engine) there's a library called > "Scalding". It's built on top of Cascading. If you have a huge dataset or > if you consider using map/reduce engine for your job, for any reason, you > can try Scalding. >

Re: Comparative study

2014-07-07 Thread Nabeel Memon
For Scala API on map/reduce (hadoop engine) there's a library called "Scalding". It's built on top of Cascading. If you have a huge dataset or if you consider using map/reduce engine for your job, for any reason, you can try Scalding. However, Spark vs Impala doesn't make sense to me. It should've

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-07-07 Thread Nan Zhu
Hey Cheney, Does the problem still exist? Sorry for the delay; I'm starting to look at this issue. Best, -- Nan Zhu On Tuesday, May 6, 2014 at 10:06 PM, Cheney Sun wrote: > Hi Nan, > > In the worker's log, I see the following exception thrown when trying to launch an > executor. (The S

memory leak query

2014-07-07 Thread Michael Lewis
Hi, I hope someone can help, as I'm not sure if I'm using Spark correctly. Basically, in the simple example below I create an RDD which is just a sequence of random numbers. I then have a loop where I just invoke rdd.count(); what I can see is that the memory use always nudges upwards. If I at

Re: reading compressed LZO files

2014-07-07 Thread Nicholas Chammas
I found it quite painful to figure out all the steps required and have filed SPARK-2394 to track improving this. Perhaps I have been going about it the wrong way, but it seems way more painful than it should be to set up a Spark cluster built using

Cannot create dir in Tachyon when running Spark with OFF_HEAP caching (FileDoesNotExistException)

2014-07-07 Thread Teng Long
Hi guys, I'm running Spark 1.0.0 with Tachyon 0.4.1, both in single-node mode. Tachyon's own tests (./bin/tachyon runTests) pass, and manual file system operations like mkdir work well. But when I tried to run a very simple Spark task with RDD persistence as OFF_HEAP, I got the following FileDoe

RE: Spark logging strategy on YARN

2014-07-07 Thread Andrew Lee
Hi Kudryavtsev, Here's what I am doing as a common practice and reference. I don't want to call it best practice, since that requires a lot of customer experience and feedback, but from a development and operations standpoint, it is helpful to separate the YARN container logs from the Spark lo

acl for spark ui

2014-07-07 Thread Koert Kuipers
I was testing the ACLs for the Spark UI in secure mode on YARN in client mode. It works great. My Spark 1.0.0 configuration has: spark.authenticate = true spark.ui.acls.enable = true spark.ui.view.acls = koert spark.ui.filters = org.apache.hadoop.security.authentication.server.AuthenticationFilte

RE: Enable Parsing Failed or Incompleted jobs on HistoryServer (YARN mode)

2014-07-07 Thread Andrew Lee
Hi Suren, It showed up after a while when I touched the APPLICATION_COMPLETE file in the event log folders. I checked the source code and it looks like it re-scans (polls) the folders every 10 seconds (configurable)? Not sure what exactly triggers that 'refresh'; I may need to do more digging.

Re: Running the BroadcastTest.scala with TorrentBroadcastFactory in a standalone cluster

2014-07-07 Thread jackxucs
Hi Mosharaf, Thanks a lot for the detailed reply. The reason I am using Torrent is mainly because I would like to have something different to be a comparison with the IP multicast/broadcast. I will need to transmit larger data size as well. I tried to increase spark.executor.memory to 2g and spar

Spark shell error messages and app exit issues

2014-07-07 Thread Sameer Tilak
Hi All, When I run my application, it runs for a while and gives me part of the output correctly. I then get the following error, and spark-shell exits. 14/07/07 13:54:53 INFO SendingConnection: Initiating connection to [localhost.localdomain/127.0.0.1:57423] 14/07/07 13:54:53 INFO ConnectionM

Re: NoSuchMethodError in KafkaReciever

2014-07-07 Thread mcampbell
xtrahotsauce wrote > I had this same problem as well. I ended up just adding the necessary > code > in KafkaUtil and compiling my own spark jar. Something like this for the > "raw" stream: > > def createRawStream( > jssc: JavaStreamingContext, > kafkaParams: JMap[String, String], >

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Chester Chen
@Andrew Yes, the link points to the same redirected http://localhost/proxy/application_1404443455764_0010/ I suspect it is something to do with the cluster setup. I will let you know once I find something. Chester On Mon, Jul 7, 2014 at 1:07 PM, Andrew Or wrote: > @Yan, the UI shoul

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Thank you, Andrew. That makes sense to me now. I was confused by "In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster" in http://spark.apache.org/docs/latest/running-on-yarn.html . After your explanation, it's clear now. Thank you

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
@Yan, the UI should still work. As long as you look into the container that launches the driver, you will find the SparkUI address and port. Note that in yarn-cluster mode the Spark driver doesn't actually run in the Application Manager; just like the executors, it runs in a container that is launc

RE: Comparative study

2014-07-07 Thread santosh.viswanathan
Thanks Daniel for sharing this info. Regards, Santosh Karthikeyan From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Tuesday, July 08, 2014 1:10 AM To: user@spark.apache.org Subject: Re: Comparative study From a development perspective, I vastly prefer Spark to MapReduce. The MapRedu

Re: Comparative study

2014-07-07 Thread Daniel Siegmann
From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can ju
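
To illustrate the point about local development, a minimal sketch of a throwaway local context for a unit test:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A local[2] context runs driver and executors in-process; no cluster needed.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
try {
  val doubled = sc.parallelize(Seq(1, 2, 3)).map(_ * 2).collect()
  assert(doubled.sameElements(Array(2, 4, 6)))
} finally {
  sc.stop()  // always stop the context so the next test can create one
}
```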

SparkSQL with sequence file RDDs

2014-07-07 Thread Gary Malouf
Has anyone reported issues using SparkSQL with sequence files (all of our data is in this format within HDFS)? We are considering whether to burn the time upgrading to Spark 1.0 from 0.9 now and this is a main decision point for us.

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
Thanks - that did solve my error, but instead got a different one: java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/lib/input/FileInputFormat It seems like with that setting, spark can't find Hadoop. On 7/7/14, Koert Kuipers wrote: > spark has a setting to put user jars in front of

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
I don't have experience deploying to EC2. can you use add.jar conf to add the missing jar at runtime ? I haven't tried this myself. Just a guess. On Mon, Jul 7, 2014 at 12:16 PM, Chester Chen wrote: > with "provided" scope, you need to provide the "provided" jars at the > runtime yourself. I

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
with "provided" scope, you need to provide the "provided" jars at the runtime yourself. I guess in this case Hadoop jar files. On Mon, Jul 7, 2014 at 12:13 PM, Robert James wrote: > Thanks - that did solve my error, but instead got a different one: > java.lang.NoClassDefFoundError: > org/apac

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
Chester - I'm happy rebuilding Spark, but then how can I deploy it to EC2? On 7/7/14, Chester Chen wrote: > Have you tried to change the spark SBT scripts? You can change the > dependency scope to "provided". This similar to compile scope, except JDK > or container need to provide the dependenc

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Chester Chen
As Andrew explained, the port is random rather than 4040, as the Spark driver is started in the Application Master and the port is randomly selected. But I have a similar UI issue. I am running in YARN cluster mode against my local CDH5 cluster. The log states "14/07/07 11:59:29 INFO ui.SparkUI: St

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Hi Andrew, Thanks for the quick reply. It works with the yarn-client mode. One question about the yarn-cluster mode: actually I was checking the AM for the log, since the spark driver is running in the AM, the UI should also work, right? But that is not true in my case. Best, Fang, Yan yanfang.

[no subject]

2014-07-07 Thread Juan Rodríguez Hortalá
Hi all, I'm writing a Spark Streaming program that uses reduceByKeyAndWindow(), and when I change the windowLength or slidingInterval I get the following exceptions, running in local mode: 14/07/06 13:03:46 ERROR actor.OneForOneStrategy: key not found: 1404677026000 ms java.util.NoSuchElementExc

Re: Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Andrew Or
I will assume that you are running in yarn-cluster mode. Because the driver is launched in one of the containers, it doesn't make sense to expose port 4040 for the node that contains the container. (Imagine if multiple driver containers are launched on the same node. This will cause a port collisio

Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-07 Thread Juan Rodríguez Hortalá
Hi list, I'm writing a Spark Streaming program that reads from a kafka topic, performs some transformations on the data, and then inserts each record in a database with foreachRDD. I was wondering which is the best way to handle the connection to the database so each worker, or even each task, use

Re: Kafka - streaming from multiple topics

2014-07-07 Thread Sergey Malov
I opened a JIRA issue with Spark, as an improvement though, not as a bug. Hopefully someone there will notice it. From: Tobias Pfeiffer <t...@preferred.jp> Reply-To: "user@spark.apache.org" <user@spark.apache.org> Date: Thursday, July 3, 2014 at 9:41 P

Issues in opening UI when running Spark Streaming in YARN

2014-07-07 Thread Yan Fang
Hi guys, Not sure if you have similar issues. I did not find relevant tickets in JIRA. When I deploy Spark Streaming to YARN, I have the following two issues: 1. The UI port is random. It is not the default 4040. I have to look at the container's log to check the UI port. Is this supposed to be this wa

Error while launching spark cluster manually

2014-07-07 Thread Sameer Tilak
Hi All, I am having the following issue -- maybe an FQDN/IP resolution issue, but I'm not sure; any help with this will be great! On the master node I get the following error. I start the master using ./start-master.sh: starting org.apache.spark.deploy.master.Master, logging to /apps/software/spark-1.0.0-bin-

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Chester Chen
Have you tried to change the spark SBT scripts? You can change the dependency scope to "provided". This similar to compile scope, except JDK or container need to provide the dependency at runtime. This assume the Spark will work with the new version of common libraries. Of course, this is not a

Re: tiers of caching

2014-07-07 Thread Andrew Or
Others have also asked for this on the mailing list, and hence there's a related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur brings up a good point in that any current implementation of in-memory shuffles will compete with application RDD blocks. I think we should definitely add t

Re: spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Koert Kuipers
Spark has a setting to put user jars in front of the classpath, which should do the trick. However, I had no luck with this. See here: https://issues.apache.org/jira/browse/SPARK-1863 On Mon, Jul 7, 2014 at 1:31 PM, Robert James wrote: > spark-submit includes a spark-assembly uber jar, which has o
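
The setting referred to here is presumably the experimental spark.files.userClassPathFirst flag; a minimal sketch (which, per the JIRA above, may not work reliably in 1.0):

```scala
import org.apache.spark.SparkConf

// Experimental in Spark 1.0: ask executors to load classes from user-added
// jars before the classes bundled in the spark-assembly jar.
val conf = new SparkConf().set("spark.files.userClassPathFirst", "true")
```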

spark-assembly libraries conflict with application libraries

2014-07-07 Thread Robert James
spark-submit includes a spark-assembly uber jar, which has older versions of many common libraries. These conflict with some of the dependencies we need. I have been racking my brain trying to find a solution (including experimenting with ProGuard), but haven't been able to: when we use spark-sub

spark-assembly libraries conflict with needed libraries

2014-07-07 Thread Robert James
spark-submit includes a spark-assembly uber jar, which has older versions of many common libraries. These conflict with some of the dependencies we need. I have been racking my brain trying to find a solution (including experimenting with ProGuard), but haven't been able to: when we use spark-sub

Re: tiers of caching

2014-07-07 Thread Ankur Dave
I think tiers/priorities for caching are a very good idea and I'd be interested to see what others think. In addition to letting libraries cache RDDs liberally, it could also unify memory management across other parts of Spark. For example, small shuffles benefit from explicitly keeping the shuffle

tiers of caching

2014-07-07 Thread Koert Kuipers
I noticed that some algorithms, such as GraphX, liberally cache RDDs for efficiency, which makes sense. However, it can also leave a long trail of unused yet cached RDDs that might push other RDDs out of memory. In a long-lived Spark context I would like to decide which RDDs stick around. Would it m

Re: Dense to sparse vector converter

2014-07-07 Thread Xiangrui Meng
No, but it should be easy to add one. -Xiangrui On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander wrote: > Hi, > > > > Is there a method in Spark/MLlib to convert DenseVector to SparseVector? > > > > Best regards, Alexander
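
Until one exists, a minimal sketch of such a converter, dropping explicit zeros:

```scala
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

// Keep only the non-zero entries of the dense vector.
def toSparse(dv: DenseVector): SparseVector = {
  val nz = dv.toArray.zipWithIndex.filter(_._1 != 0.0)
  new SparseVector(dv.size, nz.map(_._2), nz.map(_._1))
}

toSparse(new DenseVector(Array(0.0, 3.0, 0.0, 1.5)))  // -> (4,[1,3],[3.0,1.5])
```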

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-07 Thread Xiangrui Meng
It seems to me to be a setup issue. I just tested news20.binary (1,355,191 features) on a 2-node EC2 cluster and it worked well. I added one line to conf/spark-env.sh: export SPARK_JAVA_OPTS=" -Dspark.akka.frameSize=20 " and launched spark-shell with "--driver-memory 20g". Could you re-try with an EC2 se

Re: Control number of tasks per stage

2014-07-07 Thread Daniel Siegmann
The default number of tasks when reading files is based on how the files are split among the nodes. Beyond that, the default number of tasks after a shuffle is based on the property spark.default.parallelism. (see http://spark.apache.org/docs/latest/configuration.html). You can use RDD.repartition
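
A sketch of the two knobs mentioned above; the partition counts are illustrative:

```scala
// Hint more input splits when reading, and widen the shuffle afterwards.
val lines = sc.textFile("hdfs:///logs/input", 64)  // minimum number of partitions
val wider = lines.repartition(64)                  // 64 tasks in the stages that follow

// Or raise the post-shuffle default for the whole job, before creating the context:
// conf.set("spark.default.parallelism", "64")
```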

Re: Java sample for using cassandra-driver-spark

2014-07-07 Thread Piotr Kołaczkowski
Hi, we're planning to add a basic Java-API very soon, possibly this week. There's a ticket for it here: https://github.com/datastax/cassandra-driver-spark/issues/11 We're open to any ideas. Just let us know what you need the API to have in the comments. Regards, Piotr Kołaczkowski 2014-07-05 0:

Control number of tasks per stage

2014-07-07 Thread Konstantin Kudryavtsev
Hi all, Is there any way to control the number of tasks per stage? Currently I see a situation where only 2 tasks are created per stage and each of them is very slow, while at the same time the cluster has a huge number of unused nodes. Thank you, Konstantin Kudryavtsev

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
That version is old :). We are not forking Pig but cleanly separating out the Pig execution engine. Let me know if you are willing to give it a go. Also, I would love to know what features of Pig you are using. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_ru

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
I saw a wiki page from your company but with an old version of Spark. http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1 I have no reason to use it yet but I am interested in the state of the initiative. What's your point of view (personal and/or professional) about the P

Comparative study

2014-07-07 Thread santosh.viswanathan
Hello Experts, I am doing a comparative study on the below: Spark vs Impala, Spark vs MapReduce. Is it worth migrating from an existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh This message is for the designated

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Mayur Rustagi
Hi, We have fixed many major issues around Spork & deploying it with some customers. Would be happy to provide a working version to you to try out. We are looking for more folks to try it out & submit bugs. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_r

Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
Hi, I was wondering what was the state of the Pig+Spark initiative now that the execution engine of Pig is pluggable? Granted, it was done in order to use Tez but could it be used by Spark? I know about a 'theoretical' project called Spork but I don't know any stable and maintained version of it.

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Hi Chester, Thank you very much; it is clear now - just two different ways to support Spark on a cluster. Thank you, Konstantin Kudryavtsev On Mon, Jul 7, 2014 at 3:22 PM, Chester @work wrote: > In YARN cluster mode, you can either have Spark on all the cluster nodes > or supply the Spark jar yo

Possible bug in Spark Streaming :: TextFileStream

2014-07-07 Thread Luis Ángel Vicente Sánchez
I have a basic spark streaming job that is watching a folder, processing any new file and updating a column family in cassandra using the new cassandra-spark-driver. I think there is a problem with SparkStreamingContext.textFileStream... if I start my job in local mode with no files in the folder

Re: graphx Joining two VertexPartitions with different indexes is slow.

2014-07-07 Thread Koert Kuipers
You could only do the deep check if the hashcodes are the same, and design hashcodes that do not take all elements into account. The alternative seems to be putting cache statements all over GraphX, as is currently the case, which is trouble for any long-lived application where caching is carefully

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Chester @work
In YARN cluster mode, you can either have Spark on all the cluster nodes or supply the Spark jar yourself. In the second case, you don't need to install Spark on the cluster at all, as you supply the Spark assembly as well as your app jar together. I hope this makes it clear. Chester Sent from my iPhone

spark-submit conflicts with dependencies

2014-07-07 Thread Robert James
When I use spark-submit (along with spark-ec2), I get dependency conflicts. spark-assembly includes older versions of apache commons codec and httpclient, and these conflict with many of the libs our software uses. Is there any way to resolve these? Or, if we use the precompiled spark, can we si

Re: Spark memory optimization

2014-07-07 Thread Surendranauth Hiraman
Using persist() is sort of a "hack" or a hint (depending on your perspective :-)) to make the RDD use disk, not memory. As I mentioned, though, the disk I/O has consequences, mainly (I think) making sure you have enough disks that I/O doesn't become a bottleneck. Increasing partitions I think is the othe
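
For reference, a sketch of both suggestions combined, assuming an existing rdd; the partition count is illustrative:

```scala
import org.apache.spark.storage.StorageLevel

// Materialize the RDD on local disk instead of in RAM; later actions re-read
// it from disk, trading memory pressure for disk I/O.
val onDisk = rdd.persist(StorageLevel.DISK_ONLY)

// More, smaller partitions also lower the per-task memory footprint.
val finer = onDisk.repartition(200)
```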

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Thank you Krishna! Could you please explain why I need to install Spark on each node if the Spark official site says: "If you have a Hadoop 2 cluster, you can run Spark without any installation needed." I have HDP 2 (YARN) and that's why I hope I don't need to install Spark on each node. Thank you, Kon

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Krishna Sankar
Konstantin, 1. You need to install the hadoop rpms on all nodes. If it is Hadoop 2, the nodes would have hdfs & YARN. 2. Then you need to install Spark on all nodes. I haven't had experience with HDP, but the tech preview might have installed Spark as well. 3. In the end, one should

which Spark package (w.r.t. GraphX) should I install to do graph computation on a cluster?

2014-07-07 Thread Yifan LI
Hi, I am planning to do graph (social network) computation on a cluster (Hadoop has been installed), but it seems there is a "Pre-built" package for Hadoop which I am NOT sure includes GraphX. Or should I install another released version (in which GraphX has obviously been included)

Re: How to parallelize model fitting with different cross-validation folds?

2014-07-07 Thread sparkuser2345
Thank you for all the replies! Realizing that I can't distribute the modelling with different cross-validation folds to the cluster nodes this way (but to the threads only), I decided not to create nfolds data sets but to parallelize the calculation (threadwise) over folds and to zip the original

Re: Spark memory optimization

2014-07-07 Thread Igor Pernek
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling cache/persist); do I still need to use the DISK_ONLY storage level? However, I do use reduceByKey and sortByKey. Mayur, you mentioned that sortByKey requires data to fit in memory. Is there any way to work around this (mayb

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-07 Thread Konstantin Kudryavtsev
Guys, I'm not talking about running Spark on a VM; I don't have a problem with that. I'm confused by the following: 1) Hortonworks describes the installation process as RPMs on each node; 2) the Spark home page says that everything I need is YARN. And I'm stuck on understanding what I need to do to run Spark on YARN

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-07 Thread Eustache DIEMERT
Ok, I've tried to add the intercept term myself (code here [1]), but with no luck. It seems that adding a column of ones doesn't help with convergence either. I may have missed something in the coding as I'm quite a noob in Scala, but printing the data seems to indicate I succeeded in adding the o

Re: Broadcast variable in Spark Java application

2014-07-07 Thread Cesar Arevalo
Hi Praveen: It may be easier for other people to help you if you provide more details about what you are doing. It may be worthwhile to also mention which spark version you are using. And if you can share the code which doesn't work for you, that may also give others more clues as to what you a

Broadcast variable in Spark Java application

2014-07-07 Thread Praveen R
I need a variable to be broadcast from the driver to the executor processes in my Spark Java application. I tried using Spark's broadcast mechanism to achieve this, but had no luck there. Could someone help me do this, and perhaps share some code? Thanks, Praveen R
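
The original application is Java; as a reference point, here is a minimal Scala sketch of the broadcast mechanism, assuming an existing SparkContext sc and an illustrative lookup table:

```scala
// Broadcast a read-only value once per executor instead of shipping it with
// every task; executors read it through .value.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

val mapped = sc.parallelize(Seq("a", "b", "a", "c"))
  .map(k => lookup.value.getOrElse(k, 0))
  .collect()  // Array(1, 2, 1, 0)
```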

Dense to sparse vector converter

2014-07-07 Thread Ulanov, Alexander
Hi, Is there a method in Spark/MLlib to convert DenseVector to SparseVector? Best regards, Alexander

Re: Spark SQL user defined functions

2014-07-07 Thread Martin Gammelsæter
Hi again, and thanks for your reply! On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust wrote: > >> Sweet. Any idea about when this will be merged into master? > > > It is probably going to be a couple of weeks. There is a fair amount of > cleanup that needs to be done. It works though and we use

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

2014-07-07 Thread Eustache DIEMERT
Well, why not, but IMHO MLlib Logistic Regression is unusable right now. The inability to use an intercept is just a no-go. I could hack in a column of ones to inject the intercept into the data, but frankly it's a pity to have to do so. 2014-07-05 23:04 GMT+02:00 DB Tsai : > You may try LBFGS to have
