Re: cannot access port 4040

2015-06-10 Thread Himanshu Mehra
Hi Maria, have you tried port 8080 as well? Thanks, Himanshu

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, (Your reply does not appear in the mailing list but I received an email so I will reply here). I have an application running already in the shell using pyspark. I can see the application running on port 8080, but I cannot log into it through port 4040. It says connection timed out

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, thanks for your reply! I still cannot see port 4040 on my machine when I type master-ip-address:4040 in my browser. I have tried this command: netstat -nat | grep 4040 and it returns this: tcp 0 0 :::4040 :::* LISTEN. Logging into

cannot access port 4040

2015-06-10 Thread mrm
Hi, I am using Spark 1.3.1 standalone and I have a problem: my cluster is working fine, I can see port 8080 and check that my EC2 instances are fine, but I cannot access port 4040. I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark context and restarting it, to no

Fwd: Re: How to keep a SQLContext instance alive in a spark streaming application's life cycle?

2015-06-10 Thread Sergio Jiménez Barrio
Note: CCing user@spark.apache.org First, you must check if the RDD is empty: messages.foreachRDD { rdd => if (!rdd.isEmpty) { } } Now, you can obtain the instance of a SQLContext: val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
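
For reference, the lazily initialized singleton that getInstance refers to can be written roughly as below. This is a minimal Scala sketch along the lines of the Spark Streaming programming guide, not the poster's exact code:

  import org.apache.spark.SparkContext
  import org.apache.spark.sql.SQLContext

  // Lazily instantiated singleton so the same SQLContext is reused across every
  // micro-batch of the streaming application instead of being recreated per batch.
  object SQLContextSingleton {
    @transient private var instance: SQLContext = _

    def getInstance(sparkContext: SparkContext): SQLContext = {
      if (instance == null) {
        instance = new SQLContext(sparkContext)
      }
      instance
    }
  }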

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
4040 is your driver port; you need to run some application. Log in to your cluster, start a spark-shell and try accessing 4040. Thanks Best Regards On Wed, Jun 10, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote: Hi, I am using Spark 1.3.1 standalone and I have a problem where my cluster is

Re: DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Additionally, if I delete the parquet file and recreate it using the same generic save function with 1000 partitions and overwrite, the size is again correct.

Re: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Akhil Das
RDDs are immutable, so why not join two DStreams? Not sure, but you can try something like this also: kvDstream.foreachRDD(rdd => { val file = ssc.sparkContext.textFile("/sigmoid/") val kvFile = file.map(x => (x.split(",")(0), x)) rdd.join(kvFile) }) Thanks Best Regards On
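
Made self-contained, the same idea can be sketched with transform instead of foreachRDD, so the joined result is itself a DStream. The path "/sigmoid/" comes from the snippet above; the String types are assumptions:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream

  // Join each micro-batch of a (key, value) DStream against a file that is re-read
  // on every batch, so changes to the file are picked up over time.
  def joinWithFile(kvDstream: DStream[(String, String)],
                   ssc: StreamingContext): DStream[(String, (String, String))] = {
    kvDstream.transform { rdd: RDD[(String, String)] =>
      val file = ssc.sparkContext.textFile("/sigmoid/")
      val kvFile = file.map(x => (x.split(",")(0), x))
      rdd.join(kvFile)
    }
  }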

Re: append file on hdfs

2015-06-10 Thread Pa Rö
Hi, I have an idea to solve my problem: I want to write one file for each Spark partition, but I do not know how to get the actual partition suffix/ID in my call function. points.foreachPartition(new VoidFunction<Iterator<Tuple2<Integer, GeoTimeDataTupel>>>() { private static
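
One way to get the partition index, sketched in Scala for brevity (the Java API has an equivalent mapPartitionsWithIndex); the input and output paths are placeholders, not from the original message:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("partition-id-example"))
  val points = sc.textFile("hdfs:///input/points")   // stand-in for the real points RDD

  // mapPartitionsWithIndex hands each partition its index, which can then be used
  // as a per-partition suffix (e.g. "partition-<idx>").
  val withPartitionId = points.mapPartitionsWithIndex { (idx, iter) =>
    iter.map(record => s"partition-$idx\t$record")
  }
  withPartitionId.saveAsTextFile("hdfs:///output/points-by-partition")

  // Alternatively, inside foreachPartition the current partition ID should be
  // obtainable from the TaskContext (org.apache.spark.TaskContext.get().partitionId()
  // in recent 1.x releases).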

DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Hi, kudos on Spark 1.3.x, it's a great release - loving data frames! One thing I noticed after upgrading is that if I use the generic save DataFrame function with Overwrite mode and a parquet source, it produces a much larger output parquet file. Source JSON data: ~500GB. Originally saved parquet:

Split RDD based on criteria

2015-06-10 Thread dgoldenberg
Hi, I'm gathering that the typical approach for splitting an RDD is to apply several filters to it: rdd1 = rdd.filter(func1); rdd2 = rdd.filter(func2); ... Is there/should there be a way to create 'buckets' like these in one go? List<RDD> rddList = rdd.filter(func1, func2, ..., funcN) Another
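
A minimal, self-contained sketch of the multi-filter approach described in the question; the predicates and data are made up for illustration, and caching the parent RDD avoids recomputing the source once per filter:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext(new SparkConf().setAppName("split-example").setMaster("local[*]"))
  val parent = sc.parallelize(1 to 100).cache()   // cache so each filter pass reuses the source

  // One RDD per predicate -- the "several filters" approach; each bucket stays lazy.
  val predicates: Seq[Int => Boolean] = Seq(_ % 3 == 0, _ % 3 == 1, _ % 3 == 2)
  val buckets: Seq[RDD[Int]] = predicates.map(p => parent.filter(p))

  buckets.zipWithIndex.foreach { case (b, i) => println(s"bucket $i: ${b.count()} records") }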

Spark standalone mode and kerberized cluster

2015-06-10 Thread kazeborja
Hello all. I've been reading some old mails and noticed that the use of Kerberos in a standalone cluster was not supported. Is this still the case? Thanks. Borja.

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
Opening your 4040 manually or SSH tunneling (ssh -L 4040:127.0.0.1:4040 master-ip, and then open localhost:4040 in your browser) will work for you then. Thanks Best Regards On Wed, Jun 10, 2015 at 5:10 PM, mrm ma...@skimlinks.com wrote: Hi Akhil, Thanks for your reply! I still cannot see port

spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
Both the driver (ApplicationMaster running on Hadoop) and the container (CoarseGrainedExecutorBackend) end up exceeding my 25GB allocation. My code is something like sc.binaryFiles(... 1mil xml files).flatMap( ... extract some domain classes, not many though as each xml usually has zero

[Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-10 Thread Roberto Coluccio
Hi! I'm struggling with an issue with Spark 1.3.1 running on YARN, on an AWS EMR cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux 2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop 2.4, etc.). I make use of the AWS emr-bootstrap-action

learning rpc about spark core source code

2015-06-10 Thread huangzheng
Hi all, I have recently been studying the Spark 1.3 core source code but can't understand the RPC layer. How do the client driver, worker and master communicate? There are some Scala files such as RpcCallContext, RpcEndpointRef, RpcEndpoint and RpcEnv. Are there any blogs on the Spark core RPC module?

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Josh Mahonin
Hi Jeroen, rather than bundle the Phoenix client JAR with your app, are you able to include it in a static location, either on the SPARK_CLASSPATH or via the conf values below (I use SPARK_CLASSPATH myself, though it's deprecated)? spark.driver.extraClassPath spark.executor.extraClassPath

Re: BigDecimal problem in parquet file

2015-06-10 Thread Bipin Nag
Hi Cheng, I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an existing parquet file, then repartitioning and saving it. Doing this gives the error. The code for this doesn't look like it's causing the problem. I have a feeling the source - the existing parquet file - is the culprit. I

Re: learning rpc about spark core source code

2015-06-10 Thread Shixiong Zhu
The new RPC interface is an internal module added in 1.4. It should not exist in 1.3. Where did you find it? Communication between the driver, worker and master still uses Akka. There is a pending PR to update them: https://github.com/apache/spark/pull/5392 Do you mean the

Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Hm, I tried the following with 0.13.1 and 0.13.0 on my laptop (don't have access to a cluster for now) but couldn't reproduce this issue. Your program just executed smoothly... :-/ Command line used to start the Thrift server: ./sbin/start-thriftserver.sh --driver-memory 4g --master local

Re: PostgreSQL JDBC Classpath Issue

2015-06-10 Thread shahab
Hi George, I have the same issue, did you manage to find a solution? best, /Shahab On Wed, May 13, 2015 at 9:21 PM, George Adams g.w.adams...@gmail.com wrote: Hey all, I seem to be having an issue with the PostgreSQL JDBC jar on my classpath. I've outlined the issue on Stack Overflow (

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Jeroen Vlek
Hi Josh, thank you for your effort. Looking at your code, I feel that mine is semantically the same, except written in Java. The dependencies in the pom.xml all have scope "provided". The job is submitted as follows: $ rm spark.log MASTER=spark://maprdemo:7077

Re: Spark's Scala shell killing itself

2015-06-10 Thread Akhil Das
Maybe you should update your Spark version to the latest one. Thanks Best Regards On Wed, Jun 10, 2015 at 11:04 AM, Chandrashekhar Kotekar shekhar.kote...@gmail.com wrote: Hi, I have configured Spark to run on YARN. Whenever I start the Spark shell using the 'spark-shell' command, it

Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Also, if the data isn't confidential, would you mind sending me a compressed copy (don't cc user@spark.apache.org)? Cheng On 6/10/15 4:23 PM, 姜超才 wrote: Hi Lian, Thanks for your quick response. I forgot to mention that I have tuned driver memory from 2G to 4G, which seems to have given a minor improvement. The

Re: How to use Apache spark mllib Model output in C++ component

2015-06-10 Thread Akhil Das
Hope SWIG (http://www.swig.org/index.php) and JNA (https://github.com/twall/jna/) might help for accessing C++ libraries from Java. Thanks Best Regards On Wed, Jun 10, 2015 at 11:50 AM, mahesht mahesh.s.tup...@gmail.com wrote: There is a C++ component which uses some model which we want to replace

Re: Running SparkSql against Hive tables

2015-06-10 Thread Cheng Lian
On 6/10/15 1:55 AM, James Pirz wrote: I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on Hadoop 2.6. I looked at the ThriftServer2 logs and realized that the server was not starting properly, because of a failure in creating a server socket. In fact, I had passed the URI to

Re: Determining number of executors within RDD

2015-06-10 Thread Himanshu Mehra
Hi Akshat, I assume what you want is to control the number of partitions in your RDD, which is easily achievable by passing the numSlices or minPartitions argument at the time of RDD creation. Example: val someRDD = sc.parallelize(someCollection, numSlices) or val someRDD = sc.textFile(pathToFile,

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread Cheng Lian
Hm, this is a common confusion... Although the variable name is `sqlContext` in the Spark shell, it's actually a `HiveContext`, which extends `SQLContext` and has the ability to communicate with the Hive metastore. So your program needs to instantiate an `org.apache.spark.sql.hive.HiveContext` instead.
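
A minimal sketch of the fix being suggested, assuming Spark 1.3.x and that hive-site.xml is on the classpath so the metastore can be reached; the query is hypothetical:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("hive-context-example"))

  // HiveContext extends SQLContext and talks to the Hive metastore configured in
  // hive-site.xml; this is what the `sqlContext` variable in spark-shell really is.
  val hiveContext = new HiveContext(sc)
  hiveContext.sql("SHOW TABLES").show()   // just to exercise the metastore connection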

Re: Re: Re: Re: How to decrease the time of storing block in memory

2015-06-10 Thread luohui20001
Thanks Ak, thanks for your idea. I had tried using Spark to do what the shell did. However it is not as fast as I expected and not very easy. Thanks & Best regards! San.Luo ----- Original message ----- From: Akhil Das ak...@sigmoidanalytics.com To: 罗辉

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Cheng Lian
Would you mind providing the executor output so that we can check the reason why the executors died? And you may run EXPLAIN EXTENDED to find out the physical plan of your query, something like: 0: jdbc:hive2://localhost:1 explain extended select * from foo;

Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Hi Xiaohan, would you please try setting spark.sql.thriftServer.incrementalCollect to true and increasing the driver memory size? In this way, HiveThriftServer2 uses RDD.toLocalIterator rather than RDD.collect().iterator to return the result set. The key difference is that RDD.toLocalIterator
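
To illustrate the difference being described (this is a small sketch, not the Thrift server code itself): collect() materializes the entire result set on the driver at once, while toLocalIterator fetches roughly one partition at a time.

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("iterator-example").setMaster("local[*]"))
  val rdd = sc.parallelize(1 to 1000000, 100)

  val everything = rdd.collect()        // all rows held in driver memory simultaneously
  val oneAtATime = rdd.toLocalIterator  // driver holds about one partition's worth at a time
  println(oneAtATime.take(5).toList)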

Re: Monitoring Spark Jobs

2015-06-10 Thread Himanshu Mehra
Hi Sam, you might want to have a look at the Spark UI, which runs by default at localhost:8080. You can also configure Apache Ganglia to monitor your cluster resources. Thank you Regards Himanshu Mehra

Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Would you please also provide the executor stdout and stderr output? Thanks. Cheng On 6/10/15 4:23 PM, 姜超才 wrote: Hi Lian, Thanks for your quick response. I forgot to mention that I have tuned driver memory from 2G to 4G, which seems to have given a minor improvement. The dead way when fetching 1,400,000 rows

Re: PostgreSQL JDBC Classpath Issue

2015-06-10 Thread Cheng Lian
Michael answered this question in the SO thread http://stackoverflow.com/a/30226336 Cheng On 6/10/15 9:24 PM, shahab wrote: Hi George, I have the same issue, did you manage to find a solution? best, /Shahab On Wed, May 13, 2015 at 9:21 PM, George Adams g.w.adams...@gmail.com

Re: append file on hdfs

2015-06-10 Thread Richard Marscher
Hi, if you now want to write 1 file per partition, that's actually built into Spark as saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call
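
A short sketch of that built-in behaviour (paths are placeholders): saveAsTextFile writes one part-NNNNN file per partition, so the number of output files is controlled by repartitioning first.

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("save-example"))
  val points = sc.textFile("hdfs:///input/points")   // hypothetical input

  points
    .repartition(10)                           // 10 partitions -> part-00000 .. part-00009
    .saveAsTextFile("hdfs:///output/points")   // one text file per partition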

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Note that this property is only available for YARN.

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Cassandra Submit

2015-06-10 Thread Yana Kadiyska
Do you build via Maven or sbt? How do you submit your application -- do you use local, standalone or Mesos/YARN? Your jars as you originally listed them seem right to me. Try this, from your ${SPARK_HOME}:

Re: which database for gene alignment data ?

2015-06-10 Thread Frank Austin Nothaft
Hi Roni, these are exposed as public APIs. If you want, you can run them inside the adam-shell (which is just a wrapper for the Spark shell, but with the ADAM libraries on the classpath). Also, I need to save all my intermediate data. Seems like ADAM stores data in Parquet on HDFS. I

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Sourav Mazumder
Here is the physical plan. Also attaching the executor log from one of the executors. You can see that memory consumption rises slowly until it reaches around 10.5 GB. It stays there for around 5 minutes (06-50-36 to 06-55-00), then this executor gets killed. ExecutorMemory

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
I am profiling the driver. It currently has 564MB of strings, which might be the 1 million file names. But it also has 2.34 GB of long[]! That's so far; it is still running. What are those long[] used for?

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Cheng Lian
It seems that Spark SQL can't retrieve table size statistics and doesn't enable broadcast join in your case. Would you please try `ANALYZE TABLE table-name` for both tables to generate table statistics information? Cheng On 6/10/15 10:26 PM, Sourav Mazumder wrote: Here is the physical plan.
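
A sketch of the suggestion with hypothetical table and column names, assuming a HiveContext is available; the noscan form is what the Spark SQL docs describe for populating the table-size statistics used by the broadcast-join threshold (spark.sql.autoBroadcastJoinThreshold):

  // Generate table statistics so Spark SQL can tell which side is small enough to broadcast.
  hiveContext.sql("ANALYZE TABLE small_table COMPUTE STATISTICS noscan")
  hiveContext.sql("ANALYZE TABLE large_table COMPUTE STATISTICS noscan")

  // Re-run the query afterwards; if small_table now falls under the threshold,
  // the physical plan should switch to a broadcast join.
  hiveContext.sql("SELECT * FROM large_table l JOIN small_table s ON l.key = s.key").explain(true)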

Re: Linear Regression with SGD

2015-06-10 Thread Debasish Das
It's always better to use a quasi-Newton solver if the runtime and problem scale permit, as there are guarantees on optimization... OWLQN and BFGS are both quasi-Newton. Most single-node code bases will run quasi-Newton solves. If you are using SGD, it is better to use AdaDelta/AdaGrad or similar

Re: Saving compressed textFiles from a DStream in Scala

2015-06-10 Thread Bob Corsaro
Thanks Akhil. For posterity, I ended up with: https://gist.github.com/dokipen/aa07f351a970fe54fcff I couldn't get rddToFilename() to work, but its impl was pretty simple. I'm a poet but I don't know it. On Tue, Jun 9, 2015 at 3:10 AM Akhil Das ak...@sigmoidanalytics.com wrote: like

How to use Apache spark mllib Model output in C++ component

2015-06-10 Thread mahesht
There is a C++ component which uses some model that we want to replace with Spark model output, but there is no C++ API support for reading the model. What is the best way to solve this problem?

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-10 Thread karma243
Thank you for responding @nsalian. 1. I am trying to replicate the https://github.com/dibbhatt/kafka-spark-consumer project on my local system. 2. Yes, Kafka and the brokers are on the same host. 3. I am working with Kafka 0.7.3 and Spark 1.3.1. Kafka 0.7.3 does not have the --describe command. Though

Re: Kafka Spark Streaming: ERROR EndpointWriter: dropping message

2015-06-10 Thread Dibyendu Bhattacharya
Hi, can you please share a little more detail: the stack trace from your receiver logs and also the consumer settings you used? I have never tested the consumer with Kafka 0.7.3, so I'm not sure if the Kafka version is the issue. Have you tried building the consumer using Kafka 0.7.3? Regards, Dibyendu On Wed, Jun 10,

Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-10 Thread Akhil Das
Or you can do sc.addJar("/path/to/the/jar"); I haven't tested with an HDFS path, though it works fine with a local path. Thanks Best Regards On Wed, Jun 10, 2015 at 10:17 AM, Jörn Franke jornfra...@gmail.com wrote: I am not sure they work with HDFS paths. You may want to look at the source code.

RE: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Evo Eftimov
It depends on how big the Batch RDD requiring reloading is. Reloading it for EVERY single DStream RDD would slow down the stream processing in line with the total time required to reload the Batch RDD... But if the Batch RDD is not that big, then that might not be an issue, especially in

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Marcelo Vanzin
So, I don't have an explicit solution to your problem, but... On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios kostas.koug...@googlemail.com wrote: I am profiling the driver. It currently has 564MB of strings which might be the 1mil file names. But also it has 2.34 GB of long[] ! That's so

Re: Split RDD based on criteria

2015-06-10 Thread Chad Urso McDaniel
While it does feel like a filter is what you want to do, a common way to handle this is to map to different keys. Using your rddList example, it becomes like this (Scala style): val rddSplit: RDD[(Int, Any)] = rdd.map(x => (createKey(x), x)) val rddBuckets: RDD[(Int, Iterable[Any])] =
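
A self-contained version of that sketch, with a hypothetical createKey standing in for the bucketing logic (the data is made up for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext(new SparkConf().setAppName("bucket-example").setMaster("local[*]"))
  val rdd = sc.parallelize(1 to 100)

  def createKey(x: Int): Int = x % 3   // which bucket a record belongs to

  val rddSplit: RDD[(Int, Int)] = rdd.map(x => (createKey(x), x))
  val rddBuckets: RDD[(Int, Iterable[Int])] = rddSplit.groupByKey()

  rddBuckets.collect().foreach { case (k, vs) => println(s"bucket $k: ${vs.size} records") }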

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
After some time the driver accumulated 6.67GB of long[]. The executor mem usage so far is low.

spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Ashish Nigam
Hi, if checkpoint data is already present in HDFS, the driver fails to load as it performs a lookup on the previous application directory. As that folder already exists, it fails to start the context. The failed job's application id was application_1432284018452_0635 and the job was performing a lookup on

Re: How to build spark with Hive 1.x ?

2015-06-10 Thread Ted Yu
Hive version 1.x is currently not supported. Cheers On Wed, Jun 10, 2015 at 9:16 AM, Neal Yin neal@workday.com wrote: I am trying to build spark 1.3 branch with Hive 1.1.0. mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Phive-0.13.1 -Dhive.version=1.1.0

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Ashish Nigam
I did not change the driver program. I just shut down the context and started it again. BTW, I see this ticket already open in an unassigned state, SPARK-6892 (https://issues.apache.org/jira/browse/SPARK-6892), that talks about this issue. Is this a known issue? Also, any workarounds? On Wed, Jun 10,

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Akhil Das
Delete the checkpoint directory; you might have modified your driver program. Thanks Best Regards On Wed, Jun 10, 2015 at 9:44 PM, Ashish Nigam ashnigamt...@gmail.com wrote: Hi, If checkpoint data is already present in HDFS, driver fails to load as it is performing lookup on previous

How to build spark with Hive 1.x ?

2015-06-10 Thread Neal Yin
I am trying to build the Spark 1.3 branch with Hive 1.1.0: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Phive-0.13.1 -Dhive.version=1.1.0 -Dhive.version.short=1.1.0 -DskipTests clean package I got the following error: Failed to execute goal on project spark-hive_2.10:

Re: Split RDD based on criteria

2015-06-10 Thread Sean Owen
No, but you can write a couple of lines of code that do this. It's not optimized of course. This is actually a long and interesting side discussion, but I'm not sure how much it could be, given that the computation is pull rather than push; there is no concept of one pass over the data resulting in

Re: Issue running Spark 1.4 on Yarn

2015-06-10 Thread matvey14
Hi nsalian, for some reason the rest of this thread isn't showing up here. The NodeManager isn't busy. I'll copy/paste; the details are in there. I've tried running a Hadoop app pointing to the same queue. Same

Re: Spark Maven Test error

2015-06-10 Thread Rick Moritz
Dear list, I'm trying to reference a lonely message to this list from March 25th (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Maven-Test-error-td22216.html), but I'm unsure this will thread properly. Sorry if it didn't work out. Anyway, using Spark 1.4.0-RC4 I run into the same

Re: Determining number of executors within RDD

2015-06-10 Thread Sandy Ryza
On YARN, there is no concept of a Spark Worker. Multiple executors will be run per node without any effort required by the user, as long as all the executors fit within each node's resource limits. -Sandy On Wed, Jun 10, 2015 at 3:24 PM, Evo Eftimov evo.efti...@isecc.com wrote: Yes i think

Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
All, I was wondering if any of you have solved this problem: I have pyspark (IPython mode) running on Docker, talking to a YARN cluster (AM/executors are NOT running on Docker). When I start pyspark in the Docker container, it binds to port 49460. Once the app is submitted to YARN, the app (AM)

Re: Determining number of executors within RDD

2015-06-10 Thread Evo Eftimov
Yes, I think it is ONE worker, ONE executor, as an executor is nothing but a JVM instance spawned by the worker. To run more executors, i.e. JVM instances, on the same physical cluster node, you need to run more than one worker on that node and then allocate only part of the sys resources to that

Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-10 Thread erisa
Hi, I am a Spark newbie trying to solve the same problem, and have implemented the exact same solution that sowen is suggesting. I am using priority queues to keep track of the top 25 sub_categories for each category, using the combineByKey function to do that. However I run into the
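
For comparison, here is a minimal self-contained sketch of top-K-per-key with combineByKey. The message above uses priority queues; for brevity this sketch keeps a sorted, truncated list as the per-key accumulator instead (equivalent for a small K such as 25), and the input data is made up:

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("topk-example").setMaster("local[*]"))
  val K = 25

  // (category, (sub_category, count)) pairs -- hypothetical input
  val pairs = sc.parallelize(Seq(
    ("books", ("scifi", 10)), ("books", ("poetry", 3)), ("music", ("jazz", 7))))

  // Keep only the K largest values per key; fine for small K even though each merge re-sorts.
  def insert(acc: List[(String, Int)], v: (String, Int)) = (v :: acc).sortBy(-_._2).take(K)

  val topK = pairs.combineByKey(
    (v: (String, Int)) => List(v),                                   // create combiner
    (acc: List[(String, Int)], v: (String, Int)) => insert(acc, v),  // merge a value into a combiner
    (a: List[(String, Int)], b: List[(String, Int)]) => (a ++ b).sortBy(-_._2).take(K)  // merge combiners
  )

  topK.collect().foreach { case (cat, subs) => println(s"$cat -> $subs") }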

Re: Issue running Spark 1.4 on Yarn

2015-06-10 Thread nsalian
Hi, thanks for the added information. Helps add more context. Is that specific queue different from the others? FairScheduler.xml should have the information needed. Or if you have a separate allocations.xml, something of this format: <allocations> <queue name="sample_queue"> <minResources>1

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread James Pirz
Thanks for your help! Switching to HiveContext fixed the issue. Just one side comment: in the documentation regarding Hive Tables and HiveContext (https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables), we see: // sc is an existing JavaSparkContext. HiveContext sqlContext =

RE: [SPARK-6330] 1.4.0/1.5.0 Bug to access S3 -- AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyI

2015-06-10 Thread Shuai Zheng
I have tried both cases (s3 and s3n, setting all possible parameters), and trust me, the same code works with 1.3.1, but not with 1.3.0, 1.4.0 or 1.5.0. I even used a plain project to test this, and used Maven to include all referenced libraries, but it gives me the error. I think everyone can easily

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread Cheng Lian
Thanks for pointing out the documentation error :) Opened https://github.com/apache/spark/pull/6749 to fix this. On 6/11/15 1:18 AM, James Pirz wrote: Thanks for your help ! Switching to HiveContext fixed the issue. Just one side comment: In the documentation regarding Hive Tables and

Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Actually this is somewhat confusing for two reasons: - First, the option 'spark.executor.instances', which seems to be dealt with only in the case of YARN in the source code of SparkSubmit.scala, is also present in the conf/spark-env.sh file under the standalone section, which would indicate that

Hive Custom Transform Scripts (read from stdin and print to stdout) in Spark

2015-06-10 Thread nishanthps
What is the best way to reuse Hive custom transform scripts written in Python, awk or C++, which process data from stdin and print to stdout, in Spark? These scripts typically use the Transform syntax in Hive (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform).

Re: Determining number of executors within RDD

2015-06-10 Thread Nishkam Ravi
This PR adds support for multiple executors per worker: https://github.com/apache/spark/pull/731 and should be available in 1.4. Thanks, Nishkam On Wed, Jun 10, 2015 at 1:35 PM, Evo Eftimov evo.efti...@isecc.com wrote: We/I were discussing STANDALONE mode; besides, maxdml had already

Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread barmaley
Launching using the spark-ec2 script results in:
Setting up ganglia
RSYNC'ing /etc/ganglia to slaves...
...
Shutting down GANGLIA gmond: [FAILED]
Starting GANGLIA gmond: [ OK ]
Shutting down GANGLIA gmond:

Re: How to set KryoRegistrator class in spark-shell

2015-06-10 Thread bhomass
You need to register it via spark-defaults.conf, as explained here

Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote: So with this... to help my understanding of Spark under the

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
So with this... to help my understanding of Spark under the hood: is this statement correct? "When data needs to pass between multiple JVMs, a shuffle will *always* hit disk." On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com wrote: There's a discussion of this at

Re: RDD of RDDs

2015-06-10 Thread ping yan
Thanks much for the detailed explanations. I suspected it came down to architectural support for the notion of an RDD of RDDs, but my understanding of Spark, or distributed computing in general, is not deep enough to see it on my own, so this really helps! I ended up going with List[RDD]. The collection of

NullPointerException with functions.rand()

2015-06-10 Thread Justin Yip
Hello, I am using 1.4.0 and found the following weird behavior. This case works fine:
scala> sc.parallelize(Seq((1,2), (3, 100))).toDF.withColumn("index", rand(30)).show()
+--+---+---+
|_1| _2| index|
+--+---+---+
| 1| 2| 0.6662967911724369|
|

Re: Spark not working on windows 7 64 bit

2015-06-10 Thread Jörn Franke
You may compare the c:\windows\system32\drivers\etc\hosts files to see if they are configured similarly. On Wed, Jun 10, 2015 at 17:16, Eran Medan eran.me...@gmail.com wrote: I'm on a road block trying to understand why Spark doesn't work for a colleague of mine on his Windows 7 laptop. I have pretty

Re: PYTHONPATH on worker nodes

2015-06-10 Thread Marcelo Vanzin
I don't think it's propagated automatically. Try this: spark-submit --conf spark.executorEnv.PYTHONPATH=... ... On Wed, Jun 10, 2015 at 8:15 AM, Bob Corsaro rcors...@gmail.com wrote: I'm setting PYTHONPATH before calling pyspark, but the worker nodes aren't inheriting it. I've tried looking

Re: Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread Akhil Das
Looks like the libphp version is 5.6 now; which version of Spark are you using? Thanks Best Regards On Thu, Jun 11, 2015 at 3:46 AM, barmaley o...@solver.com wrote: Launching using spark-ec2 script results in: Setting up ganglia RSYNC'ing /etc/ganglia to slaves... ... Shutting down GANGLIA

Re: Spark standalone mode and kerberized cluster

2015-06-10 Thread Akhil Das
This might help http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_installing-kerb-spark-quickstart.html Thanks Best Regards On Wed, Jun 10, 2015 at 6:49 PM, kazeborja kazebo...@gmail.com wrote: Hello all. I've been reading some old mails and

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
OK, so it is the case that small shuffles can be done without hitting any disk. Is the same true for the aux shuffle service in YARN? Can that be done without hitting disk? On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote: In many cases the shuffle will actually hit

Spark not working on windows 7 64 bit

2015-06-10 Thread Eran Medan
I've hit a roadblock trying to understand why Spark doesn't work for a colleague of mine on his Windows 7 laptop. I have pretty much the same setup and everything works fine. I googled the error message and didn't get anything that resolved it. Here is the exception message (after running spark

PYTHONPATH on worker nodes

2015-06-10 Thread Bob Corsaro
I'm setting PYTHONPATH before calling pyspark, but the worker nodes aren't inheriting it. I've tried looking through the code and it appears that it should work; I can't find the bug. Here's an example; what am I doing wrong? https://gist.github.com/dokipen/84c4e4a89fddf702fdf1