Re: Worker nodes: Error messages

2014-06-26 Thread Akhil Das
Can you paste the stderr from the worker logs? (Found in the work/app-20140625133031-0002/ directory.) Most likely you might need to set SPARK_MASTER_IP in your spark-env.sh file. (Not sure why I'm seeing akka.tcp://spark@localhost:56569 instead of akka.tcp://spark@*serverip*:56569.) Thanks Best

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-26 Thread Akhil Das
Try deleting the .ivy2 directory in your home directory and then running sbt clean assembly; that should solve this issue, I guess. Thanks Best Regards On Thu, Jun 26, 2014 at 3:10 AM, Robert James srobertja...@gmail.com wrote: In case anyone else is having this problem, deleting all ivy's cache, then doing a sbt

Re: wholeTextFiles like for binary files ?

2014-06-26 Thread Akhil Das
You cannot read image files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files because they are not splittable http://www.bigdataspeak.com/2013_01_01_archive.html (source proving it): override def createRecordReader( split: InputSplit,

Re: ElasticSearch enrich

2014-06-26 Thread boci
That's okay, but Hadoop has ES integration. What happens if I run saveAsHadoopFile without Hadoop? (Or do I need to bring up Hadoop programmatically, if I can?) b0c1

Re: Spark standalone network configuration problems

2014-06-26 Thread Akhil Das
Hi Shannon, It should be a configuration issue, check in your /etc/hosts and make sure localhost is not associated with the SPARK_MASTER_IP you provided. Thanks Best Regards On Thu, Jun 26, 2014 at 6:37 AM, Shannon Quinn squ...@gatech.edu wrote: Hi all, I have a 2-machine Spark network

Re: ElasticSearch enrich

2014-06-26 Thread Nick Pentreath
You can just add elasticsearch-hadoop as a dependency to your project to use the ESInputFormat and ESOutputFormat ( https://github.com/elasticsearch/elasticsearch-hadoop). Some other basics here: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html For testing, yes I
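
For reference, a minimal read-side sketch using the old Map/Reduce API, as in the elasticsearch-hadoop docs (index/type and host are placeholders):

    import org.apache.hadoop.io.{MapWritable, Text}
    import org.apache.hadoop.mapred.JobConf
    import org.elasticsearch.hadoop.mr.EsInputFormat

    val jobConf = new JobConf()
    jobConf.set("es.resource", "myindex/mytype")   // placeholder index/type
    jobConf.set("es.nodes", "localhost:9200")      // placeholder ES node
    // keys are document ids (Text), values are documents (MapWritable)
    val esRDD = sc.hadoopRDD(jobConf,
      classOf[EsInputFormat[Text, MapWritable]],
      classOf[Text], classOf[MapWritable])
    println(esRDD.count())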

Spark Streaming Window without slideDuration parameter

2014-06-26 Thread haopu
If a window is defined without the slideDuration parameter, how will it slide? I guess it will use the context's batchInterval as the slideDuration? Thanks for any help.
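
For what it's worth, the one-argument window() overload behaves exactly that way. A small sketch, assuming a 2-second batch interval and a placeholder socket source:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("window-test")
    val ssc = new StreamingContext(conf, Seconds(2))     // batch interval = 2s
    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
    // without slideDuration, the 30s window slides once per batch interval (2s)
    val windowed = lines.window(Seconds(30))
    windowed.count().print()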

Re: Using CQLSSTableWriter to batch load data from Spark to Cassandra.

2014-06-26 Thread Rohit Rai
Hi Gerard, Which versions of Spark, Hadoop, Cassandra and Calliope are you using? We never built Calliope against Hadoop 2, as neither we nor our clients use Hadoop in their deployments, or they use it only as the infra component for Spark, in which case H1/H2 doesn't make a difference for them. I know

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-26 Thread Ulanov, Alexander
Hi, I cannot argue about other use-cases; however, MLlib doesn’t support text classification out of the box. There was basic support in MLI (thanks Sean for correcting me that it is MLI, not MLlib), but I don’t know why it is not developed anymore. For text classification in

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Sean Owen
Dataflow is a hosted service and tries to abstract an entire pipeline; Spark maps to some components in that pipeline and is software. My first reaction was that Dataflow mapped more to Summingbird, as part of it is a higher-level system for doing a specific thing in batch/streaming --

Re: Hadoop interface vs class

2014-06-26 Thread Sean Owen
You seem to have the binary for Hadoop 2, since it was compiled expecting that TaskAttemptContext is an interface. So the error indicates that Spark is also seeing Hadoop 1 classes somewhere. On Wed, Jun 25, 2014 at 4:41 PM, Robert James srobertja...@gmail.com wrote: After upgrading to Spark

What's the best practice to deploy spark on Big SMP servers?

2014-06-26 Thread guxiaobo1982
Hi, We have a big SMP server (with 128G RAM and 32 CPU cores) to run small-scale analytical workloads. What's the best practice for deploying standalone Spark on the server to achieve good performance? How many instances should be configured, and how much RAM and how many CPU cores should be allocated for

Re: Spark executor error

2014-06-26 Thread Surendranauth Hiraman
I unfortunately haven't seen this directly. But some typical things I try when debugging are as follows. Do you see a corresponding error on the other side of that connection (alpinenode7.alpinenow.local)? Or is that the same machine? Also, do the driver logs show any longer stack trace and have

running multiple applications at the same time

2014-06-26 Thread jamborta
Hi all, not sure if this is a config issue or it's by design, but when I run the spark shell, and try to submit another application from elsewhere, the second application waits for the first to finish and outputs the following: Initial job has not accepted any resources; check your cluster UI to

Re: running multiple applications at the same time

2014-06-26 Thread Akhil Das
Hi Jamborta, You can use the following options in your application to limit the usage of resources: - spark.cores.max - spark.executor.memory It's better to use Mesos if you want to run multiple applications on the same cluster smoothly. Thanks Best Regards On Thu, Jun 26,
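
A minimal sketch of those two settings (the values are placeholders to adjust per cluster):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("app-one")
      .set("spark.cores.max", "4")         // cap on total cores this app may grab
      .set("spark.executor.memory", "2g")  // memory per executor
    val sc = new SparkContext(conf)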

About StorageLevel

2014-06-26 Thread tomsheep...@gmail.com
Hi all, I have a newbie question about StorageLevel in Spark. I came across these sentences in the Spark documentation: If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as

[ANNOUNCE] Apache MRQL 0.9.2-incubating released

2014-06-26 Thread Leonidas Fegaras
The Apache MRQL team is pleased to announce the release of Apache MRQL 0.9.2-incubating. This is our second Apache release. Apache MRQL is a query processing and optimization system for large-scale distributed data analysis, built on top of Apache Hadoop, Hama, and Spark. The release artifacts

Re: running multiple applications at the same time

2014-06-26 Thread Akhil Das
Yep, it does. Thanks Best Regards On Thu, Jun 26, 2014 at 6:11 PM, jamborta jambo...@gmail.com wrote: thanks a lot. I have tried restricting the memory usage before, but it seems it was the issue with the number of cores available. I am planning to run this on a yarn cluster, I assume

Fine-grained mesos execution hangs on Debian 7.4

2014-06-26 Thread Fedechicco
Hello, as per the subject: when I run the Scala spark-shell on our Mesos (0.19) cluster, some Spark slaves just hang at the end of the staging phase for any given computation. The cluster has mixed OSes (Ubuntu 14.04 / Debian 7.4), but if I run the same shell and commands using coarse-grained mode

Re: Hadoop interface vs class

2014-06-26 Thread Sean Owen
Yes it does. The idea is to override the dependency if needed. I thought you mentioned that you had built for Hadoop 2. On Jun 26, 2014 11:07 AM, Robert James srobertja...@gmail.com wrote: Yes. As far as I can tell, Spark seems to be including Hadoop 1 via its transitive dependency:

Serialization of objects

2014-06-26 Thread Sameer Tilak
Hi everyone, Aaron, thanks for your help so far. I am trying to serialize objects that I instantiate from a 3rd party library, namely instances of com.wcohen.ss.Jaccard and com.wcohen.ss.BasicStringWrapper. However, I am having problems with serialization. I am trying (at least) to use Kryo
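
For context, the usual Kryo setup for third-party classes looks roughly like this sketch (it assumes the classes are Kryo-serializable at all, which is the part that often fails):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class SSRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[com.wcohen.ss.Jaccard])
        kryo.register(classOf[com.wcohen.ss.BasicStringWrapper])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "SSRegistrator") // fully-qualified name if packaged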

Re: Where Can I find the full documentation for Spark SQL?

2014-06-26 Thread Michael Armbrust
The programming guide is part of the standard documentation: http://spark.apache.org/docs/latest/sql-programming-guide.html Regarding specifics about SQL syntax and functions, I'd recommend using a HiveContext and the HQL method currently, as that is much more complete than the basic SQL parser
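
A minimal sketch of that approach ('src' is a placeholder Hive table; in Spark 1.0 the method is hql):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    import hiveContext._
    // hql() parses HiveQL, which covers far more than the basic SQL parser
    val rows = hql("SELECT key, value FROM src LIMIT 10")
    rows.collect().foreach(println)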

Re: About StorageLevel

2014-06-26 Thread Andrew Or
Hi Kang, You raise a good point. Spark does not automatically cache all your RDDs. Why? Simply because the application may create many RDDs, and not all of them are to be reused. After all, there is only so much memory available to each executor, and caching an RDD adds some overhead especially
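
To make the trade-off concrete, a small sketch (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("hdfs:///data/input")    // placeholder path
    rdd.persist(StorageLevel.MEMORY_ONLY)          // what cache() uses by default
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spills partitions that don't fit
    rdd.count()   // first action materializes the cached partitions
    rdd.count()   // subsequent actions read from the cache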

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Michael Bach Bui
The current problem with Spark is the big overhead and cost of bringing up a cluster. On a good day, it takes AWS spot instances 15-20 minutes to bring up a 30-node cluster. This makes it inefficient for computations which may take only 10-15 minutes. Hmm, this is a misleading message. The

Re: Hadoop interface vs class

2014-06-26 Thread Robert James
On 6/26/14, Sean Owen so...@cloudera.com wrote: Yes it does. The idea is to override the dependency if needed. I thought you mentioned that you had built for Hadoop 2. I'm very confused :-( I downloaded the Spark distro for Hadoop 2, and installed it on my machine. But the code doesn't have a

Improving Spark multithreaded performance?

2014-06-26 Thread Kyle Ellrott
I'm working to set up a calculation that involves calling mllib's SVMWithSGD.train several thousand times on different permutations of the data. I'm trying to run the separate jobs using a threadpool to dispatch the different requests to a spark context connected to a Mesos cluster, using coarse
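
For reference, the concurrent-submission pattern usually looks something like this sketch (SparkContext is thread-safe; the permutations collection and iteration count are placeholders):

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}
    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // placeholder: one RDD[LabeledPoint] per permutation of the data
    val permutations: Seq[RDD[LabeledPoint]] = ???

    implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))
    // each Future submits an independent job; the scheduler runs them concurrently
    val models = permutations.map { data =>
      Future { SVMWithSGD.train(data, 100) }
    }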

Running new code on a Spark Cluster

2014-06-26 Thread Pat Ferrel
I’ve created a CLI driver for a Spark version of a Mahout job called item similarity with several tests that all work fine on local[4] Spark standalone. The code even reads and writes to clustered HDFS. But switching to clustered Spark has a problem that seems tied to a broadcast and/or

Re: Fine-grained mesos execution hangs on Debian 7.4

2014-06-26 Thread Sébastien Rainville
Hello Federico, is it working with the 1.0 branch? In either branch, make sure that you have this commit: https://github.com/apache/spark/commit/1132e472eca1a00c2ce10d2f84e8f0e79a5193d3 I never saw the behavior you are describing, but that commit is important if you are running in fine-grained

Re: Spark Streaming RDD transformation

2014-06-26 Thread Sean Owen
If you want to transform an RDD to a Map, I assume you have an RDD of pairs. The method collectAsMap() creates a Map from the RDD in this case. Do you mean that you want to update a Map object using data in each RDD? You would use foreachRDD() in that case. Then you can use RDD.foreach to do
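
A rough sketch of the foreachRDD variant, assuming a DStream of pairs and a driver-side map:

    import scala.collection.mutable

    // pairs: DStream[(String, Long)], a placeholder stream of key/value pairs
    val globalMap = mutable.Map[String, Long]()   // lives on the driver
    pairs.foreachRDD { rdd =>
      // collectAsMap() pulls the batch to the driver; only sane for small batches
      globalMap ++= rdd.collectAsMap()
    }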

Re: Hadoop interface vs class

2014-06-26 Thread Sean Owen
On Thu, Jun 26, 2014 at 1:44 PM, Robert James srobertja...@gmail.com wrote: I downloaded the Spark distro for Hadoop 2, and installed it on my machine. But the code doesn't have a reference to that path - it uses sbt for dependencies. As far as I can tell, using sbt or maven or ivy will
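
In sbt terms, overriding the transitive Hadoop 1 dependency would look roughly like this (versions are placeholders):

    // build.sbt sketch: exclude Spark's transitive hadoop-client, pin Hadoop 2
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.0.0"
        exclude("org.apache.hadoop", "hadoop-client"),
      "org.apache.hadoop" % "hadoop-client" % "2.2.0"
    )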

Re: Spark Streaming RDD transformation

2014-06-26 Thread Bill Jay
Thanks, Sean! I am currently using foreachRDD to update the global map using data in each RDD. The reason I want to return a map as an RDD instead of just updating the map is that an RDD provides many handy methods for output. For example, I want to save the global map into files in HDFS for each batch

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Nicholas Chammas
On Thu, Jun 26, 2014 at 2:26 PM, Michael Bach Bui free...@adatao.com wrote: The overhead of bringing up a AWS Spark spot instances is NOT the inherent problem of Spark. That’s technically true, but I’d be surprised if there wasn’t a lot of room for improvement in spark-ec2 regarding cluster

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Aureliano Buendia
On Thu, Jun 26, 2014 at 9:42 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: That’s technically true, but I’d be surprised if there wasn’t a lot of room for improvement in spark-ec2 regarding cluster launch+config times. Unfortunately, this is not a Spark issue, but an AWS one.

Spark-submit failing on cluster

2014-06-26 Thread ajatix
My setup --- I have a private cluster running on 4 nodes. I want to use the spark-submit script to execute spark applications on the cluster. I am using Mesos to manage the cluster. This is the command I ran in local mode, which ran successfully --- ./bin/spark-submit --master local --class

Re: Spark vs Google cloud dataflow

2014-06-26 Thread Nicholas Chammas
Hmm, I remember a discussion on here about how the way in which spark-ec2 rsyncs stuff to the cluster for setup could be improved, and I’m assuming there are other such improvements to be made. Perhaps those improvements don’t matter much when compared to EC2 instance launch times, but I’m not

Re: ElasticSearch enrich

2014-06-26 Thread boci
Thanks. Without the local option I can connect to ES remotely; now I only have one problem: how can I use elasticsearch-hadoop with Spark Streaming? I mean, DStream doesn't have a saveAsHadoopFiles method. My second problem is that the output index depends on the input data. Thanks

Re: Spark standalone network configuration problems

2014-06-26 Thread Shannon Quinn
My *best guess* (please correct me if I'm wrong) is that the master (machine1) is sending the command to the worker (machine2) with the localhost argument as-is; that is, machine2 isn't doing any weird address conversion on its end. Consequently, I've been focusing on the settings of the

Re: ElasticSearch enrich

2014-06-26 Thread Holden Karau
Hi b0c1, I have an example of how to do this in the repo for my talk as well, the specific example is at https://github.com/holdenk/elasticsearchspark/blob/master/src/main/scala/com/holdenkarau/esspark/IndexTweetsLive.scala . Since DStream doesn't have a saveAsHadoopDataset we use foreachRDD and
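
The shape of that workaround is roughly the following sketch (index/type and host are placeholders, and the records are assumed to be in writable form already):

    import org.apache.hadoop.io.{MapWritable, NullWritable}
    import org.apache.hadoop.mapred.JobConf
    import org.elasticsearch.hadoop.mr.EsOutputFormat

    // docs: DStream[(NullWritable, MapWritable)], already converted to writables
    docs.foreachRDD { rdd =>
      val jobConf = new JobConf(rdd.context.hadoopConfiguration)
      jobConf.set("es.resource", "twitter/tweet")   // placeholder index/type
      jobConf.set("es.nodes", "localhost:9200")     // placeholder ES node
      jobConf.setOutputFormat(classOf[EsOutputFormat])
      jobConf.setOutputKeyClass(classOf[NullWritable])
      jobConf.setOutputValueClass(classOf[MapWritable])
      rdd.saveAsHadoopDataset(jobConf)
    }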

SparkSQL- saveAsParquetFile

2014-06-26 Thread anthonyjschu...@gmail.com
Hi all: I am attempting to execute a simple test of the SparkSQL system capability of persisting to parquet files... My code is:

    val conf = new SparkConf()
      .setMaster("local[1]")
      .setAppName("test")
    implicit val sc = new SparkContext(conf)
    val sqlContext = new
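
For comparison, the complete flow from the Spark 1.0 SQL guide looks roughly like this sketch (the case class and paths are placeholders):

    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion

    val records = sc.parallelize(1 to 10).map(i => Record(i, s"val$i"))
    records.saveAsParquetFile("records.parquet")           // placeholder path
    val loaded = sqlContext.parquetFile("records.parquet")
    loaded.registerAsTable("records")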

SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread anthonyjschu...@gmail.com
Hello all: I am attempting to persist a parquet file comprised of a SchemaRDD of nested case classes... Creating a SchemaRDD object seems to work fine, but an exception is thrown when I attempt to persist this object to a parquet file... my code:

    case class Trivial(trivial: String = "trivial",

Re: ElasticSearch enrich

2014-06-26 Thread boci
Wow, thanks for your fast answer, it helps a lot... b0c1 -- Skype: boci13, Hangout: boci.b...@gmail.com On Thu, Jun 26, 2014 at 11:48 PM, Holden Karau

Re: ElasticSearch enrich

2014-06-26 Thread Holden Karau
Just your luck I happened to be working on that very talk today :) Let me know how your experiences with Elasticsearch Spark go :) On Thu, Jun 26, 2014 at 3:17 PM, boci boci.b...@gmail.com wrote: Wow, thanks for your fast answer, it helps a lot... b0c1

Re: SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread Michael Armbrust
Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1 release. On Thu, Jun 26, 2014 at 3:03 PM, anthonyjschu...@gmail.com anthonyjschu...@gmail.com wrote: Hello all: I am attempting to persist a parquet file comprised of a SchemaRDD of nested case classes... Creating
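
For anyone hitting this, the failing shape is simply one case class nested in another, e.g. this sketch (assumes import sqlContext.createSchemaRDD is in scope):

    case class Inner(value: String)
    case class Outer(id: Int, inner: Inner)   // nested case class

    val rdd = sc.parallelize(Seq(Outer(1, Inner("a"))))
    // throws in 1.0 when writing the nested schema; works from 1.0.1 on
    rdd.saveAsParquetFile("nested.parquet")   // placeholder path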

Task progress in ipython?

2014-06-26 Thread Xu (Simon) Chen
I am pretty happy with using pyspark with ipython notebook. The only issue is that I need to look at the console output or the Spark UI to track task progress. I wonder if anyone has thought of, or better, written something to display progress bars on the same page when I evaluate a cell in ipynb? I

Google Cloud Engine adds out of the box Spark/Shark support

2014-06-26 Thread Mayur Rustagi
https://groups.google.com/forum/#!topic/gcp-hadoop-announce/EfQms8tK5cE I suspect they are using their own builds.. has anybody had a chance to look at it? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi

Re: Spark job tracker.

2014-06-26 Thread Mayur Rustagi
You can use the SparkListener interface to track the tasks. Another option is to use the JSON patch (https://github.com/apache/spark/pull/882) and track tasks with the JSON API. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Fri, Jun 27, 2014
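
A minimal SparkListener sketch for task-level progress (the counter is illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    class ProgressListener extends SparkListener {
      @volatile private var finished = 0
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        finished += 1
        println(s"tasks finished so far: $finished")
      }
    }

    sc.addSparkListener(new ProgressListener)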

Re: SparkSQL- Nested CaseClass Parquet failure

2014-06-26 Thread anthonyjschu...@gmail.com
Thanks. That might be a good note to add to the official Programming Guide... On Thu, Jun 26, 2014 at 5:05 PM, Michael Armbrust [via Apache Spark User List] ml-node+s1001560n8382...@n3.nabble.com wrote: Nested parquet is not supported in 1.0, but is part of the upcoming 1.0.1 release. On

Re: Spark standalone network configuration problems

2014-06-26 Thread Shannon Quinn
In the interest of completeness, this is how I invoke spark: [on master] sbin/start-all.sh spark-submit --py-files extra.py main.py iPhone'd On Jun 26, 2014, at 17:29, Shannon Quinn squ...@gatech.edu wrote: My *best guess* (please correct me if I'm wrong) is that the master (machine1)

numpy + pyspark

2014-06-26 Thread Avishek Saha
Hi all, Instead of installing numpy in each worker node, is it possible to ship numpy (via --py-files option maybe) while invoking the spark-submit? Thanks, Avishek

Re: Improving Spark multithreaded performance?

2014-06-26 Thread Aaron Davidson
I don't have specific solutions for you, but the general things to try are: - Decrease task size by broadcasting any non-trivial objects. - Increase duration of tasks by making them less fine-grained. How many tasks are you sending? I've seen in the past something like 25 seconds for ~10k total
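
A sketch of the broadcast point (the lookup table stands in for any non-trivial object captured by tasks):

    // placeholder: a large driver-side object the tasks need
    val lookup: Map[String, Double] = Map("a" -> 1.0, "b" -> 2.0)
    val lookupBc = sc.broadcast(lookup)   // shipped once per executor, not per task

    val records = sc.parallelize(Seq("a", "b", "c"))
    val scored = records.map(r => lookupBc.value.getOrElse(r, 0.0))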

RE: About StorageLevel

2014-06-26 Thread Liu, Raymond
I think there is a shuffle stage involved. And the future count job will depend on the first job’s shuffle stage’s output data directly, as long as it is still available. Thus it will be much faster. Best Regards, Raymond Liu From: tomsheep...@gmail.com [mailto:tomsheep...@gmail.com] Sent:
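
Concretely, the reuse looks like this sketch (the input path is a placeholder):

    val pairs = sc.textFile("hdfs:///logs").map(line => (line.split(" ")(0), 1))
    val counts = pairs.reduceByKey(_ + _)   // introduces a shuffle stage
    counts.count()  // 1st job: runs the map stage, writes shuffle files to disk
    counts.count()  // 2nd job: skips the map stage, reads the shuffle files directly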

RE: About StorageLevel

2014-06-26 Thread tomsheep...@gmail.com
Thanks Raymond! I was just reading the source code of ShuffledRDD, and found that the ShuffleFetcher, which wraps BlockManager, does the magic. The shuffled partitions will be stored on disk(?), just as the cacheManager does in a persist operation. Is that to say, whenever there is a shuffle

Re: LiveListenerBus throws exception and weird web UI bug

2014-06-26 Thread Pei-Lun Lee
Hi Baoxu, thanks for sharing. 2014-06-26 22:51 GMT+08:00 Baoxu Shi (Dash) b...@nd.edu: Hi Pei-Lun, I have the same problem here. The issue is SPARK-2228; someone also posted a pull request on that, but it only eliminates this exception, not the side effects. I think the problem

Re: Spark standalone network configuration problems

2014-06-26 Thread Akhil Das
Hi Shannon, How about a setting like the following? (Just removed the quotes.) export SPARK_MASTER_IP=192.168.1.101 export SPARK_MASTER_PORT=5060 #export SPARK_LOCAL_IP=127.0.0.1 Not sure what's happening in your case; it could be that your system is not able to bind to the 192.168.1.101 address.

Re: Spark standalone network configuration problems

2014-06-26 Thread sujeetv
Try to explicitly set the spark.driver.host property to the master's IP. Sujeet