Re: Processing multiple requests in cluster

2014-09-25 Thread Akhil Das
You can try spark on Mesos or Yarn since they have a lot more support for scheduling. Thanks Best Regards On Thu, Sep 25, 2014 at 4:50 AM, Subacini B subac...@gmail.com wrote: hi All, How to run multiple requests concurrently on the same cluster? I have a program using *spark streaming

Re: What is a pre-built package of Apache Spark

2014-09-25 Thread Akhil Das
Looks like pyspark was not able to find the python binaries from the environment. You need to install python https://docs.python.org/2/faq/windows.html (if not installed already). Thanks Best Regards On Thu, Sep 25, 2014 at 9:00 AM, Denny Lee denny.g@gmail.com wrote: This seems similar to

Log hdfs blocks sending

2014-09-25 Thread Alexey Romanchuk
Hello again spark users and developers! I have a standalone spark cluster (1.1.0) with spark sql running on it. My cluster consists of 4 datanodes and the replication factor of files is 3. I use the thrift server to access spark sql and have 1 table with 30+ partitions. When I run a query on the whole table

Re: quick start guide: building a standalone scala program

2014-09-25 Thread christy
I have encountered the same issue when I went through the tutorial's first standalone application. I tried to reinstall sbt but it doesn't help. Then I followed this thread, created a workspace under spark directly and executed ./sbt/sbt package, and it says packaged successfully. But how this

Re: java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-25 Thread Xiangrui Meng
7000x7000 is not a tall-and-skinny matrix. Storing the dense matrix requires 784MB. The driver needs more storage for collecting the result from executors as well as for making a copy for LAPACK's dgesvd. So you need more memory. Do you need the full SVD? If not, try to use a small k, e.g., 50. -Xiangrui On
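As a rough check of those numbers: a dense 7000x7000 matrix of doubles is 7000 * 7000 * 8 bytes, roughly 392 MB per copy, and the driver holds at least the collected result plus the copy made for dgesvd. A minimal sketch of requesting only the top k singular values, assuming mat is an already-built RowMatrix:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // mat: RowMatrix is hypothetical here, built from an RDD[Vector] of the data
    val svd = mat.computeSVD(50, computeU = true) // k = 50 instead of the full SVD
    val topSingularValues = svd.s                 // a local vector of the 50 values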

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-25 Thread Xiangrui Meng
For the vectorizer, what's the output feature dimension and are you creating sparse vectors or dense vectors? The model on the driver consists of numClasses * numFeatures doubles. However, the driver needs more memory in order to receive the task result (of the same size) from executors. So you
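If the features are mostly zeros, sparse vectors cut both the executor and driver footprint. A minimal sketch, assuming MLlib's Vectors factory:

    import org.apache.spark.mllib.linalg.Vectors

    // dense: stores all n entries, 8 bytes each
    val dense = Vectors.dense(Array(0.0, 0.0, 3.0))
    // sparse: stores only the non-zero indices and values
    val sparse = Vectors.sparse(3, Array(2), Array(3.0))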

Re: YARN ResourceManager and Hadoop NameNode Web UI not visible in port 8088, port 50070

2014-09-25 Thread Sandy Ryza
Hi Raghuveer, This might be a better question for the cdh-user list or the Hadoop user list. The Hadoop web interfaces for both the NameNode and ResourceManager are enabled by default. Is it possible you have a firewall blocking those ports? -Sandy On Wed, Sep 24, 2014 at 9:00 PM, Raghuveer

Re: quick start guide: building a standalone scala program

2014-09-25 Thread christy
I encountered exactly the same problem. How did you solve this? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/quick-start-guide-building-a-standalone-scala-program-tp3116p15125.html Sent from the Apache Spark User List mailing list archive at

Issue with Spark-1.1.0 and the start-thriftserver.sh script

2014-09-25 Thread Hélène Delanoeye
Hi We've just experienced an issue with the new Spark-1.1.0 and the start-thriftserver.sh script. We tried to launch start-thriftserver.sh with the --master yarn option and got the following error message: Failed to load Hive Thrift server main class

Re: Processing multiple requests in cluster

2014-09-25 Thread Mayur Rustagi
There are two problems you may be facing: 1) your application is taking all resources, or 2) inside your application, task submission is not scheduling properly. For (1) you can either configure your app to take fewer resources or use a mesos/yarn type scheduler to dynamically change or juggle resources
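For case (2), one option within a single application is Spark's fair scheduler, a technique the mail does not name explicitly; a minimal sketch, assuming a Spark 1.x SparkConf (the pool name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)
    // jobs submitted from this thread share the named pool
    sc.setLocalProperty("spark.scheduler.pool", "requests")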

Re: Can not see any spark metrics on ganglia-web

2014-09-25 Thread tsingfu
Hi, I found the problem. By default, gmond is monitoring the multicast ip 239.2.11.71, while I had set *.sink.ganglia.host=localhost. The correct configuration in metrics.properties: # Enable GangliaSink for all instances *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
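For reference, a possible full sink section in metrics.properties, assuming gmond's default multicast group and port (adjust to your cluster; note GangliaSink ships only in the spark-ganglia-lgpl build):

    *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
    *.sink.ganglia.host=239.2.11.71
    *.sink.ganglia.port=8649
    *.sink.ganglia.period=10
    *.sink.ganglia.mode=multicast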

Memory used in Spark-0.9.0-incubating

2014-09-25 Thread 王晓雨
ENV: Spark:0.9.0-incubating Hadoop:2.3.0 I run a spark task on Yarn. I see this log in the Nodemanager: 2014-09-25 17:43:34,141 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 549 for container-id

Re: Memory used in Spark-0.9.0-incubating

2014-09-25 Thread 王晓雨
My yarn-site.xml config: <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>16384</value> </property> ENV: Spark:0.9.0-incubating Hadoop:2.3.0 I run a spark task on Yarn. I see this log in the Nodemanager: 2014-09-25 17:43:34,141 INFO

Re: Re:

2014-09-25 Thread pouryas
I had a similar problem writing to cassandra using the connector for cassandra. I am not sure whether this will work or not, but I reduced the number of cores to 1 per machine and my job was stable. More explanation of my issue...

java.io.FileNotFoundException in usercache

2014-09-25 Thread Egor Pahomov
I work with spark on unstable cluster with bad administration. I started get 14/09/25 15:29:56 ERROR storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file

SPARK 1.1.0 on yarn-cluster and external JARs

2014-09-25 Thread rzykov
We build some SPARK jobs with external jars. I compile jobs by including them in one assembly, but am looking for an approach to put all external jars into HDFS. We have already put the spark jar in an HDFS folder and set up the SPARK_JAR variable. What is the best way to do that for other external jars

Re: Memory used in Spark-0.9.0-incubating

2014-09-25 Thread Yi Tian
You should check the log of the resource manager when you submit this job to yarn. It records how many resources your spark application actually asked from the resource manager for each container. Did you use the fair scheduler? There is a config parameter of the fair scheduler

Update gcc version, still snappy error

2014-09-25 Thread buring
I updated the spark version from 1.0.2 to 1.1.0 and experienced a snappy version issue with the new Spark-1.1.0. After updating the glibc version, another issue occurred. I excerpt the log as follows: 14/09/25 11:29:18 WARN [org.apache.hadoop.util.NativeCodeLoader---main]: Unable to load

Pregel messages serialized in local machine?

2014-09-25 Thread Cheuk Lam
This is a question on using the Pregel function in GraphX. Does a message get serialized and then de-serialized in the scenario where both the source and the destination vertices are in the same compute node/machine? Thank you! -- View this message in context:

Systematic error when re-starting Spark stream unless I delete all checkpoints

2014-09-25 Thread Svend
I am experiencing spark streaming restart issues similar to what is discussed in the 2 threads below (in which I failed to find a solution). Could anybody let me know if anything is wrong in the way I start/stop, or if this could be a spark bug?

Re: Spark Hive max key length is 767 bytes

2014-09-25 Thread Denny Lee
Sorry for missing your original email - thanks for the catch, eh?! On Thu, Sep 25, 2014 at 7:14 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, Fixed the issue by downgrade hive from 13.1 to 12.0, it works well now. Regards On 31 Aug, 2014, at 7:28 am,

Re: SPARK 1.1.0 on yarn-cluster and external JARs

2014-09-25 Thread Egor Pahomov
SparkContext.addJar()? Why didn't you like the fat jar way? 2014-09-25 16:25 GMT+04:00 rzykov rzy...@gmail.com: We build some SPARK jobs with external jars. I compile jobs by including them in one assembly, but am looking for an approach to put all external jars into HDFS. We have already put
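A minimal sketch of the addJar route, assuming the jars were already uploaded to HDFS (paths hypothetical):

    // shipped to executors when tasks run; works with hdfs:// URLs
    sc.addJar("hdfs:///libs/external-dep.jar")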

how to run spark job on yarn with jni lib?

2014-09-25 Thread taqilabon
Hi all, I tried to run my spark job on yarn. In my application, I need to call third-party JNI libraries in a spark job. However, I can't find a way to make the spark job load my native libraries. Is there anyone who knows how to solve this problem? Thanks. Ziv Huang -- View this message in

Re: how to run spark job on yarn with jni lib?

2014-09-25 Thread Marcelo Vanzin
Hmmm, you might be suffering from SPARK-1719. Not sure what the proper workaround is, but it sounds like your native libs are not in any of the standard lib directories; one workaround might be to copy them there, or add their location to /etc/ld.so.conf (I'm assuming Linux). On Thu, Sep 25,

Re: Spark SQL use of alias in where clause

2014-09-25 Thread Du Li
Thanks, Yanbo and Nicholas. Now it makes more sense: query optimization is the answer. /Du From: Nicholas Chammas nicholas.cham...@gmail.com Date: Thursday, September 25, 2014 at 6:43 AM To: Yanbo Liang yanboha...@gmail.com Cc: Du Li

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
I suppose I have other problems as I can’t get the Scala example to work either. Puzzling, as I have literally coded like the examples (that are purported to work), but no luck. mn On Sep 24, 2014, at 11:27 AM, Tim Smith secs...@gmail.com wrote: Maybe differences between JavaPairDStream and

VertexRDD partition imbalance

2014-09-25 Thread Larry Xiao
Hi all VertexRDD is partitioned with HashPartitioner, and it exhibits some imbalance of tasks. For example, Connected Components with partition strategy Edge2D: Aggregated Metrics by Executor Executor ID Task Time Total Tasks Failed Tasks Succeeded Tasks Input Shuffle Read

Working on LZOP Files

2014-09-25 Thread Harsha HN
Hi, Anybody using LZOP files to process in Spark? We have a huge volume of LZOP files in HDFS to process through Spark. In MapReduce framework, it automatically detects the file format and sends the decompressed version to Mappers. Any such support in Spark? As of now I am manually downloading,

RE: MLUtils.loadLibSVMFile error

2014-09-25 Thread Sameer Tilak
Hi Liquan, Thanks. I was running this in spark-shell. I was able to resolve this issue by creating an app and then submitting it via spark-submit in yarn-client mode. I have seen this happening before as well -- submitting via spark-shell has memory issues. The same code then works fine when

Optimal Partition Strategy

2014-09-25 Thread Muttineni, Vinay
Hello, A bit of a background. I have a dataset with about 200 million records and around 10 columns. The size of this dataset is around 1.5Tb and is split into around 600 files. When I read this dataset, using sparkContext, by default it creates around 3000 partitions if I do not specify the

Re: Pregel messages serialized in local machine?

2014-09-25 Thread Ankur Dave
At 2014-09-25 06:52:46 -0700, Cheuk Lam chl...@hotmail.com wrote: This is a question on using the Pregel function in GraphX. Does a message get serialized and then de-serialized in the scenario where both the source and the destination vertices are in the same compute node/machine? Yes,
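For context, a minimal sketch of the Pregel call under discussion, using the classic single-source shortest paths and assuming graph: Graph[Int, Double] with a hypothetical sourceId:

    import org.apache.spark.graphx._

    val sourceId: VertexId = 1L
    val init = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)
    val sssp = init.pregel(Double.PositiveInfinity)(
      (id, dist, msg) => math.min(dist, msg),            // vertex program
      triplet =>                                         // send messages
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b))                          // merge messages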

Re: Question About Submit Application

2014-09-25 Thread Marcelo Vanzin
Then I think it's time for you to look at the Spark Master logs... On Thu, Sep 25, 2014 at 7:51 AM, danilopds danilob...@gmail.com wrote: Hi Marcelo, Yes, I can ping spark-01 and I also include the IP and host in my file /etc/hosts. My VM can ping the local machine too. -- View this

Spark Streaming + Actors

2014-09-25 Thread Madabhattula Rajesh Kumar
Hi Team, Can I use Actors in Spark Streaming based on event type? Could you please review the test program below and let me know if anything needs to change with respect to best practices? import akka.actor.Actor import akka.actor.{ActorRef, Props} import org.apache.spark.SparkConf import
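For reference, a minimal sketch of wiring an actor into a stream in Spark 1.1, assuming ssc is a StreamingContext (the actor and event type are hypothetical):

    import akka.actor.{Actor, Props}
    import org.apache.spark.streaming.receiver.ActorHelper

    class EventReceiver extends Actor with ActorHelper {
      def receive = {
        case event: String => store(event) // push each event into the stream
      }
    }
    val events = ssc.actorStream[String](Props[EventReceiver], "EventReceiver")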

Re: Yarn number of containers

2014-09-25 Thread Marcelo Vanzin
On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote: I am running spark with the default settings in yarn client mode. For some reason yarn always allocates three containers to the application (wondering where it is set?), and only uses two of them. The default number of

Re: MLUtils.loadLibSVMFile error

2014-09-25 Thread Liquan Pei
Hi Sameer, When starting spark-shell, by default the JVM for spark-shell only has 512M of memory. As a quick hack, you can use SPARK_MEM=4g bin/spark-shell to set the JVM memory to 4g. For more information, you can refer to http://spark.apache.org/docs/latest/cluster-overview.html Thanks, Liquan On

Re: SPARK 1.1.0 on yarn-cluster and external JARs

2014-09-25 Thread Marcelo Vanzin
You can pass the HDFS location of those extra jars in the spark-submit --jars argument. Spark will take care of using Yarn's distributed cache to make them available to the executors. Note that you may need to provide the full hdfs URL (not just the path, since that will be interpreted as a local
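A hypothetical invocation along those lines (host and paths are placeholders):

    spark-submit --master yarn-cluster \
      --jars hdfs://namenode:8020/libs/dep1.jar,hdfs://namenode:8020/libs/dep2.jar \
      --class com.example.MyJob myjob.jar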

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-25 Thread Brad Miller
Hi Davies, Thanks for your help. I ultimately re-wrote the code to use broadcast variables, and then received an error when trying to broadcast self.all_models that the size did not fit in an int (recall that broadcasts use 32 bit ints to store size), suggesting that it was in fact over 2G. I

Re:

2014-09-25 Thread Ted Yu
I followed linked JIRAs to HDFS-7005 which is in hadoop 2.6.0 Any chance of deploying 2.6.0-SNAPSHOT to see if the problem goes away ? On Wed, Sep 24, 2014 at 10:54 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Looks like it's a HDFS issue, pretty new.

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
Tim, I think I understand this now. I had a five node Spark cluster and a five partition topic, and I created five receivers. I found this: http://stackoverflow.com/questions/25785581/custom-receiver-stalls-worker-in-spark-streaming Indicating that if I use all my workers as receivers,

Add Meetup

2014-09-25 Thread Brian Husted
Please add the Apache Spark Maryland meetup to the Spark website. http://www.meetup.com/Apache-Spark-Maryland Thanks! Brian *Brian Husted* *Tetra Concepts, LLC* tetraconcepts.com *301.518.6994 (c)* *866.618.1343 (f)*

Re: RDD of Iterable[String]

2014-09-25 Thread Liquan Pei
Hi Deep, I believe that you are referring to map for Iterable[String]. Suppose you have iter: Iterable[String]; you can do newIter = iter.map(item => item + "a"), which will create a new Iterable[String] with an "a" appended to each string in iter. Does this answer your question?
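A tiny runnable version of that, for the record:

    val iter: Iterable[String] = Seq("x", "y")
    val newIter = iter.map(item => item + "a") // Iterable("xa", "ya")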

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Matt Narrell
Additionally, if I dial up/down the number of executor cores, this does what I want. Thanks for the extra eyes! mn On Sep 25, 2014, at 12:34 PM, Matt Narrell matt.narr...@gmail.com wrote: Tim, I think I understand this now. I had a five node Spark cluster and a five partition topic,
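Putting the resolution together, a minimal sketch of several Kafka receivers unioned into one stream, assuming the cluster has receiver cores plus processing cores to spare (zookeeper quorum, group and topic names are hypothetical):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // each createStream call occupies one core for the life of the job
    val streams = (1 to 4).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(streams)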

Re: K-means faster on Mahout then on Spark

2014-09-25 Thread bhusted
What is the size of your vector? Mine is set to 20. I am seeing slow results as well with iteration=5 and # of elements 200,000,000. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/K-means-faster-on-Mahout-then-on-Spark-tp3195p15168.html Sent from the

SPARK UI - Details post job processing

2014-09-25 Thread Harsha HN
Hi, The details laid out in the Spark UI for the job in progress are really interesting and very useful, but they vanish once the job is done. Is there a way to get job details post processing? I am looking for the Spark UI data, not the standard input, output and error info. Thanks, Harsha

Ungroup data

2014-09-25 Thread Luis Guerra
Hi everyone, I need some advice about how to do the following: having an RDD of vectors (each vector being Vector(Int, Int, Int, Int)), I need to group the data, then I need to apply a function to every group, comparing each consecutive item within a group and retaining a variable (that has to

Spark streaming - submit new job version

2014-09-25 Thread demian
Hi. We are testing Spark streaming. It looks awesome! We are trying to figure out how to submit a new version of a live-forever job. We have a job that streams metrics of a bunch of servers, applying transformations like .reduceByWindow, and then stores the results in hdfs. If we submit this new

Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
I posted yesterday about a related issue but resolved it shortly after. I'm using Spark Streaming to summarize event data from Kafka and save it to a MySQL table. Currently the bottleneck is in writing to MySQL and I'm puzzled as to how to speed it up. I've tried repartitioning with several

Kryo UnsupportedOperationException

2014-09-25 Thread Sandy Ryza
We're running into an error (below) when trying to read spilled shuffle data back in. Has anybody encountered this before / is anybody familiar with what causes these Kryo UnsupportedOperationExceptions? any guidance appreciated, Sandy --- com.esotericsoftware.kryo.KryoException

Re: Yarn number of containers

2014-09-25 Thread Tamas Jambor
Thank you. Where is the number of containers set? On Thu, Sep 25, 2014 at 7:17 PM, Marcelo Vanzin van...@cloudera.com wrote: On Thu, Sep 25, 2014 at 8:55 AM, jamborta jambo...@gmail.com wrote: I am running spark with the default settings in yarn client mode. For some reason yarn always

Re: Yarn number of containers

2014-09-25 Thread Marcelo Vanzin
From spark-submit --help:
 YARN-only:
  --executor-cores NUM    Number of cores per executor (Default: 1).
  --queue QUEUE_NAME      The YARN queue to submit to (Default: "default").
  --num-executors NUM     Number of executors to launch (Default: 2).
  --archives ARCHIVES

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-25 Thread Davies Liu
On Thu, Sep 25, 2014 at 11:25 AM, Brad Miller bmill...@eecs.berkeley.edu wrote: Hi Davies, Thanks for your help. I ultimately re-wrote the code to use broadcast variables, and then received an error when trying to broadcast self.all_models that the size did not fit in an int (recall that

Re: java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-25 Thread Shailesh Birari
Hi Xiangrui, After setting the SVD k to a smaller value (200) it is working. Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-while-running-SVD-MLLib-example-tp14972p15179.html Sent from the Apache Spark User List

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Update for posterity: once again I solved the problem shortly after posting to the mailing list. updateStateByKey uses the default partitioner, which in my case seemed like it was set to one. Changing my call from .updateStateByKey[Long](updateFn) to .updateStateByKey[Long](updateFn,
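A sketch of the fix described above, with a hypothetical partition count (events and updateFn stand in for the stream and update function from the thread):

    import org.apache.spark.HashPartitioner

    // pass an explicit partitioner so the stateful stage is not single-partition
    val state = events.updateStateByKey[Long](updateFn, new HashPartitioner(16))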

Re: Kryo UnsupportedOperationException

2014-09-25 Thread Ian O'Connell
I would guess the field serializer is having issues being able to reconstruct the class again, its pretty much best effort. Is this an intermediate type? On Thu, Sep 25, 2014 at 2:12 PM, Sandy Ryza sandy.r...@cloudera.com wrote: We're running into an error (below) when trying to read spilled

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread Buntu Dev
Thanks for the update. I'm interested in writing the results to MySQL as well; can you shed some light or share a code sample on how you set up the driver/connection pool/etc.? On Thu, Sep 25, 2014 at 4:00 PM, maddenpj madde...@gmail.com wrote: Update for posterity, so once again I solved the problem

Shuffle files

2014-09-25 Thread SK
Hi, I am using Spark 1.1.0 on a cluster. My job takes as input 30 files in a directory (I am using sc.textfile(dir/*) ) to read in the files. I am getting the following warning: WARN TaskSetManager: Lost task 99.0 in stage 1.0 (TID 99, mesos12-dev.sccps.net): java.io.FileNotFoundException:

spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-25 Thread Andy Davidson
Hi, I am running into trouble using iPython notebook on my cluster. I used the following command to set the cluster up: $ ./spark-ec2 --key-pair=$KEY_PAIR --identity-file=$KEY_FILE --region=$REGION --slaves=$NUM_SLAVES launch $CLUSTER_NAME On master I launch python as follows $

Is it possible to use Parquet with Dremel encoding

2014-09-25 Thread matthes
Hi again! At the moment I am trying to use parquet, and I want to keep the data in memory in an efficient way so as to make requests against the data as fast as possible. I read that parquet is able to encode nested columns. Parquet uses the Dremel encoding with definition and repetition levels. Is

Re: K-means faster on Mahout then on Spark

2014-09-25 Thread Xiangrui Meng
Please also check the load balance of the RDD on YARN. How many partitions are you using? Does it match the number of CPU cores? -Xiangrui On Thu, Sep 25, 2014 at 12:28 PM, bhusted brian.hus...@gmail.com wrote: What is the size of your vector mine is set to 20? I am seeing slow results as well

Job cancelled because SparkContext was shut down

2014-09-25 Thread jamborta
hi all, I am getting this strange error about half way through the job (running spark 1.1 on yarn client mode): 14/09/26 00:54:06 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@4d0155fb java.nio.channels.CancelledKeyException at

Re: Yarn number of containers

2014-09-25 Thread jamborta
thanks. On Thu, Sep 25, 2014 at 10:25 PM, Marcelo Vanzin [via Apache Spark User List] ml-node+s1001560n15177...@n3.nabble.com wrote: From spark-submit --help: YARN-only: --executor-cores NUMNumber of cores per executor (Default: 1). --queue QUEUE_NAME The YARN queue to

Re: Multiple Kafka Receivers and Union

2014-09-25 Thread Tim Smith
Good to know it worked out, and thanks for the update. I didn't realize you need to provision for receiver workers + processing workers. One would think a worker would process multiple stages of an app/job and receiving is just a stage of the job. On Thu, Sep 25, 2014 at 12:05 PM, Matt Narrell

Re: Memory used in Spark-0.9.0-incubating

2014-09-25 Thread 王晓雨
Thanks Yi Tian! Yes, I use the fair scheduler. In the resource manager log I see the container's start shell: /home/export/Data/hadoop/tmp/nm-local-dir/usercache/hpc/appcache/application_1411693809133_0002/container_1411693809133_0002_01_02/launch_container.sh At the end: exec /bin/bash -c

flume spark streaming receiver host random

2014-09-25 Thread centerqi hu
Hi all My code is as follows: /usr/local/webserver/sparkhive/bin/spark-submit --class org.apache.spark.examples.streaming.FlumeEventCount --master yarn --deploy-mode cluster --queue online --num-executors 5 --driver-memory 6g --executor-memory 20g --executor-cores 5

Re: how to run spark job on yarn with jni lib?

2014-09-25 Thread taqilabon
You're right, I'm suffering from SPARK-1719. I've tried to add their location to /etc/ld.so.conf and I've submitted my job as a yarn-client, but the problem is the same: my native libraries are not loaded. Does this method work in your case? -- View this message in context:

Re: Shuffle files

2014-09-25 Thread Andrew Ash
Hi SK, For the problem with lots of shuffle files and the too many open files exception there are a couple options: 1. The linux kernel has a limit on the number of open files at once. This is set with ulimit -n, and can be set permanently in /etc/sysctl.conf or /etc/sysctl.d/. Try increasing
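Alongside the ulimit change, one config-side mitigation available in Spark 1.1 is shuffle file consolidation, set in spark-defaults.conf:

    spark.shuffle.consolidateFiles   true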

Re: SPARK UI - Details post job processiong

2014-09-25 Thread Andrew Ash
Matt you should be able to set an HDFS path so you'll get logs written to a unified place instead of to local disk on a random box on the cluster. On Thu, Sep 25, 2014 at 1:38 PM, Matt Narrell matt.narr...@gmail.com wrote: How does this work with a cluster manager like YARN? mn On Sep 25,
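A possible setup, assuming an HDFS directory of your choosing for the event logs:

    spark.eventLog.enabled   true
    spark.eventLog.dir       hdfs:///user/spark/applicationHistory

    # then serve the finished-job UIs from that directory:
    ./sbin/start-history-server.sh hdfs:///user/spark/applicationHistory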

Re: Optimal Partition Strategy

2014-09-25 Thread Andrew Ash
Hi Vinay, What I'm guessing is happening is that Spark is taking the locality of files into account and you don't have node-local data on all your machines. This might be the case if you're reading out of HDFS and your 600 files are somehow skewed to only be on about 200 of your 400 machines. A
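If the skewed read is the cause, one blunt fix is to decouple the partition count from the file layout after loading (path and count are placeholders):

    val data = sc.textFile("hdfs:///data/input")
    val balanced = data.repartition(400) // spread work across all 400 machines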

Re: Working on LZOP Files

2014-09-25 Thread Andrew Ash
Hi Harsha, I use LZOP files extensively on my Spark cluster -- see my writeup for how to do this on this mailing list post: http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCAOoZ679ehwvT1g8=qHd2n11Z4EXOBJkP+q=Aj0qE_=shhyl...@mail.gmail.com%3E Maybe we should better document how
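The gist of that approach, as a sketch assuming the hadoop-lzo jar and its native codec are on the cluster classpath:

    import com.hadoop.mapreduce.LzoTextInputFormat
    import org.apache.hadoop.io.{LongWritable, Text}

    // reads indexed .lzo files with proper splits instead of decompressing by hand
    val lines = sc.newAPIHadoopFile("hdfs:///data/*.lzo",
        classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
      .map(_._2.toString)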

Re: quick start guide: building a standalone scala program

2014-09-25 Thread Andrew Ash
Hi Christy, I'm more of a Gradle fan but I know SBT fits better into the Scala ecosystem as a build tool. If you'd like to give Gradle a shot try this skeleton Gradle+Spark repo from my coworker Punya. https://github.com/punya/spark-gradle-test-example Good luck! Andrew On Thu, Sep 25, 2014

Re: Log hdfs blocks sending

2014-09-25 Thread Andrew Ash
Hi Alexey, You should see in the logs a locality measure like NODE_LOCAL, PROCESS_LOCAL, ANY, etc. If your Spark workers each have an HDFS data node on them and you're reading out of HDFS, then you should be seeing almost all NODE_LOCAL accesses. One cause I've seen for mismatches is if Spark

Re: Spark Streaming: No parallelism in writing to database (MySQL)

2014-09-25 Thread maddenpj
Yup, it's all in the gist: https://gist.github.com/maddenpj/5032c76aeb330371a6e6 Lines 6-9 deal with setting up the driver specifically. This sets the driver up on each partition, which keeps a connection around per partition instead of per record. -- View this message in context:
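The general shape of that pattern, as a sketch (JDBC URL and credentials are placeholders, dstream stands in for the summarized stream):

    import java.sql.DriverManager

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one connection per partition, not one per record
        val conn = DriverManager.getConnection(
          "jdbc:mysql://db:3306/metrics", "user", "pass")
        try records.foreach { r => /* INSERT r via a PreparedStatement */ }
        finally conn.close()
      }
    }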

Re: spark-ec2 ERROR: Line magic function `%matplotlib` not found

2014-09-25 Thread Davies Liu
Maybe you have Python 2.7 on the master but Python 2.6 in the cluster. You should upgrade python to 2.7 in the cluster, or use python 2.6 on the master by setting PYSPARK_PYTHON=python2.6 On Thu, Sep 25, 2014 at 5:11 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am running into trouble using iPython

Parallel spark jobs on standalone cluster

2014-09-25 Thread Sarath Chandra
Hi All, I have a java program which submits a spark job to a standalone spark cluster (2 nodes; 10 cores (6+4); 12GB (8+4)). It is called by another java program through an ExecutorService, which invokes it multiple times with different sets of arguments and parameters. I have set spark memory

Spark Streaming: foreachRDD network output

2014-09-25 Thread Jesper Lundgren
Hello all, I have some questions regarding the foreachRDD output function in Spark Streaming. The programming guide ( http://spark.apache.org/docs/1.1.0/streaming-programming-guide.html) describes how to output data using network connection on the worker nodes. Are there some working examples

Re: YARN ResourceManager and Hadoop NameNode Web UI not visible in port 8088, port 50070

2014-09-25 Thread Raghuveer Chanda
The problem is solved. The web interfaces were not opening on the local network when connecting to the server through a proxy; they open only on the servers without a proxy. On Thu, Sep 25, 2014 at 1:12 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Raghuveer, This might be a better question for the cdh-user

Re:

2014-09-25 Thread Jianshi Huang
I built a patched DFSClient jar and am now testing (takes 3 hours...) I'd like to know if I can patch spark builds. How about just replacing DFSClient.class in the spark-assembly jar? Jianshi On Fri, Sep 26, 2014 at 2:29 AM, Ted Yu yuzhih...@gmail.com wrote: I followed linked JIRAs to HDFS-7005 which