Re: Launching Spark Cluster Application through IDE

2015-03-20 Thread Akhil Das
From IntelliJ, you can use the remote debugging feature. http://stackoverflow.com/questions/19128264/how-to-remote-debug-in-intellij-12-1-4 For remote debugging, you need to pass the following JVM options: -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n and configure your
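
A minimal sketch of where such options can go, assuming a Scala application and the port 4000 used above; spark.executor.extraJavaOptions applies them to the executors, while the driver JVM usually gets them via spark-submit's --driver-java-options instead:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: lets a remote debugger attach to each executor on port 4000.
    // (With several executors on one host the fixed port would clash; fine for a single executor.)
    val conf = new SparkConf()
      .setAppName("remote-debug-example")
      .set("spark.executor.extraJavaOptions",
           "-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n")
    val sc = new SparkContext(conf)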

Visualizing Spark Streaming data

2015-03-20 Thread Harut
I'm trying to build a dashboard to visualize a stream of events coming from mobile devices. For example, I have an event called add_photo, from which I want to calculate trending tags for added photos for the last x minutes. Then I'd like to aggregate that by country, etc. I've built the streaming part,
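
A hedged sketch of one way the trending-tags part could look, assuming the stream has already been parsed into an events DStream of (eventType, tag) pairs; the names events and add_photo handling, as well as the window lengths, are illustrative:

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.{Minutes, Seconds}

    // events: DStream[(String, String)] of (eventType, tag) pairs, assumed to exist.
    def trendingTags(events: DStream[(String, String)]): DStream[(String, Int)] =
      events
        .filter { case (eventType, _) => eventType == "add_photo" }
        .map { case (_, tag) => (tag, 1) }
        .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Minutes(10), Seconds(30))

    // Print the current top 10 tags each batch; a dashboard would push these to its backend.
    def printTop(tagCounts: DStream[(String, Int)]): Unit =
      tagCounts.foreachRDD { rdd =>
        rdd.top(10)(Ordering.by[(String, Int), Int](_._2)).foreach(println)
      }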

Re: Powered by Spark addition

2015-03-20 Thread Ricardo Almeida
Hello Matei, Could you please also add our company to the Powered By list? Details are as follows: Name: Act Now URL: www.actnowib.com Description: Spark powers NOW APPS, a big data, real-time, predictive analytics platform. Using Spark SQL, MLlib and GraphX components for both batch ETL and

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-20 Thread Akhil Das
Are you submitting your application from a local machine to a remote host? If you want to run the Spark application from a remote machine, then you have to at least set the following configurations properly. - *spark.driver.host* - points to the IP/host from which you are submitting the job (make sure
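
A minimal sketch of those settings in code, assuming the application is launched from a machine outside the cluster; the master URL, address and port below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://cluster-master:7077")
      .setAppName("remote-submit-example")
      // IP/host of the machine the job is submitted from; must be reachable from the workers.
      .set("spark.driver.host", "203.0.113.10")
      // Fixed driver port so it can be opened in a firewall (the default 0 picks a random port).
      .set("spark.driver.port", "51000")
    val sc = new SparkContext(conf)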

Re: Visualizing Spark Streaming data

2015-03-20 Thread Jeffrey Jedele
I'll stay with my recommendation - that's exactly what Kibana is made for ;) 2015-03-20 9:06 GMT+01:00 Harut Martirosyan harut.martiros...@gmail.com: Hey Jeffrey. Thanks for reply. I already have something similar, I use Grafana and Graphite, and for simple metric streaming we've got all

Re: Visualizing Spark Streaming data

2015-03-20 Thread Harut Martirosyan
Hey Jeffrey. Thanks for reply. I already have something similar, I use Grafana and Graphite, and for simple metric streaming we've got all set-up right. My question is about interactive patterns. For instance, dynamically choose an event to monitor, dynamically choose group-by field or any sort

Measure Bytes Read and Peak Memory Usage for Query

2015-03-20 Thread anu
Hi All, I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL query. Please clarify: does Bytes Read = the aggregate size of all RDDs? All my RDDs are in memory with 0 B spilled to disk. And I am clueless about how to measure Peak Memory Usage.

Re: Measure Bytes Read and Peak Memory Usage for Query

2015-03-20 Thread Akhil Das
You could do a cache and see the memory usage under the Storage tab in the driver UI (which runs on port 4040). Thanks Best Regards On Fri, Mar 20, 2015 at 12:02 PM, anu anamika.guo...@gmail.com wrote: Hi All I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL Query. Please
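
A minimal sketch of that suggestion, assuming a SQLContext named sqlContext and a registered table; the table name "events" and the query are hypothetical:

    // Cache the table, run the query, then look at the Storage tab of the driver UI
    // (http://<driver-host>:4040) to see the in-memory size of the cached data.
    sqlContext.cacheTable("events")
    val result = sqlContext.sql("SELECT country, COUNT(*) FROM events GROUP BY country")
    result.collect()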

Re: Load balancing

2015-03-20 Thread Akhil Das
1. If you are consuming data from Kafka or any other receiver-based source, then you can start 1-2 receivers per worker (assuming you'll have a minimum of 4 cores per worker) 2. If you have a single receiver or a fileStream, then what you can do to distribute the data across machines is to do a
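
A sketch of both points, assuming a receiver-based Kafka source in the Spark 1.2/1.3 API; the ZooKeeper address, consumer group, topic names and counts are placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))

    // 1. Several receivers, so receiving is spread over the workers, then union them into one stream.
    val numReceivers = 4
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))
    }
    val unified = ssc.union(kafkaStreams)

    // 2. With a single receiver or a fileStream, repartition() redistributes each batch
    //    across the cluster before the heavy processing starts.
    val balanced = unified.repartition(16)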

Re: Visualizing Spark Streaming data

2015-03-20 Thread Jeffrey Jedele
Hey Harut, I don't think there'll be any general practices as this part heavily depends on your environment, skills and what you want to achieve. If you don't have a general direction yet, I'd suggest you have a look at Elasticsearch+Kibana. It's very easy to set up, powerful and therefore

Re: Spark-submit and multiple files

2015-03-20 Thread Guillaume Charhon
Hi Davies, I am already using --py-files. The system does use the other file. The error I am getting is not trivial. Please check the error log. On Thu, Mar 19, 2015 at 8:03 PM, Davies Liu dav...@databricks.com wrote: You could submit additional Python source via --py-files , for example:

Spark 1.2 often loses all executors

2015-03-20 Thread mrm
Hi, I recently changed from Spark 1.1 to Spark 1.2, and I noticed that it loses all executors whenever I have any Python code bug (like looking up a key in a dictionary that does not exist). In earlier versions, it would raise an exception but it would not lose all executors. Anybody with a

Re: Load balancing

2015-03-20 Thread Jeffrey Jedele
Hi Mohit, it also depends on what the source for your streaming application is. If you use Kafka, you can easily partition topics and have multiple receivers on different machines. If you have something like an HTTP, socket, etc. stream, you probably can't do that. The Spark RDDs generated by your

Clean the shuffle data during iteration

2015-03-20 Thread James
Hello, Is it possible to delete the shuffle data of previous iterations, as it is no longer necessary? Alcaid
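
There is no direct "delete shuffle files" call, but here is a hedged sketch of the usual pattern: shuffle files are removed by Spark's ContextCleaner once the RDDs that produced them are garbage-collected on the driver, so iterative jobs typically drop references to old RDDs and checkpoint periodically to truncate the lineage. All names below are illustrative:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")    // hypothetical path

    var current = initialRdd                          // assumed starting RDD
    for (i <- 1 to numIterations) {
      val next = step(current)                        // step() stands for one iteration's transformations
      next.persist()
      if (i % 10 == 0) next.checkpoint()              // cut the lineage every few iterations
      next.count()                                    // materialize before dropping the old RDD
      current.unpersist(blocking = false)
      current = next                                  // old RDD becomes unreachable and can be cleaned
    }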

Re: MLlib Spam example gets stuck in Stage X

2015-03-20 Thread Su She
Hello Xiangrui, I use spark 1.2.0 on cdh 5.3. Thanks! -Su On Fri, Mar 20, 2015 at 2:27 PM Xiangrui Meng men...@gmail.com wrote: Su, which Spark version did you use? -Xiangrui On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: To get these metrics out, you need

Spark Streaming Not Reading Messages From Multiple Kafka Topics

2015-03-20 Thread EH
Hi all, I'm building a Spark Streaming application that will continuously read multiple Kafka topics at the same time. However, I found a weird issue: it reads only hundreds of messages and then stops reading any more. If I change the three topics to only one topic, then it is fine and it

Re: WebUI on yarn through ssh tunnel affected by AmIpfilter

2015-03-20 Thread Marcelo Vanzin
Instead of opening a tunnel to the Spark web ui port, could you open a tunnel to the YARN RM web ui instead? That should allow you to navigate to the Spark application's web ui through the RM proxy, and hopefully that will work better. On Fri, Feb 6, 2015 at 9:08 PM, yangqch

Re: IPython notebook command for Spark needs to be updated?

2015-03-20 Thread Matei Zaharia
Feel free to send a pull request to fix the doc (or say which versions it's needed in). Matei On Mar 20, 2015, at 6:49 PM, Krishna Sankar ksanka...@gmail.com wrote: Yep the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers k/

Re: IPython notebook command for Spark needs to be updated?

2015-03-20 Thread cong yue
Let me do it now. I appreciate the clear and easy-to-understand Spark documentation! The updated command will be like PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook ./bin/pyspark When the IPython notebook server is launched, you can create a new Python 2 notebook from Files

Re: WebUI on yarn through ssh tunnel affected by AmIpfilter

2015-03-20 Thread benbongalon
I ran into a similar issue. What's happening is that when Spark is running in YARN client mode, YARN automatically launches a Web Application Proxy http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html to reduce hacking attempts. In doing so, it

Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-20 Thread Manoj Samel
Hi, Running Spark 1.3 with secured Hadoop. Spark-shell with YARN client mode runs without issue when not using dynamic allocation. When dynamic allocation is turned on, the shell comes up, but the same SQL etc. causes it to loop. spark.dynamicAllocation.enabled=true
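
For reference, a hedged sketch of the settings that usually accompany dynamic allocation on YARN in Spark 1.3 (values are illustrative, and the external shuffle service also has to be enabled on each NodeManager):

    spark.dynamicAllocation.enabled=true
    spark.dynamicAllocation.minExecutors=1
    spark.dynamicAllocation.maxExecutors=20
    spark.shuffle.service.enabled=true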

Re: Spark 1.3 Dynamic Allocation - Requesting 0 new executor(s) because tasks are backlogged

2015-03-20 Thread Manoj Samel
Forgot to add - the cluster is idle otherwise so there should be no resource issues. Also the configuration works when not using Dynamic allocation. On Fri, Mar 20, 2015 at 4:15 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, Running Spark 1.3 with secured Hadoop. Spark-shell with Yarn

Registering UDF from a different package fails

2015-03-20 Thread Ravindra
Hi All, I have all my UDFs defined in classes residing in a different package than the one where I am instantiating my HiveContext. I have a register function in my UDF class. I pass the HiveContext to this function, and in this function I call hiveContext.registerFunction("myudf", myudf _). All goes well
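
A minimal sketch of the pattern described, mirroring the registerFunction call in the message; the package, object and UDF names are hypothetical:

    package com.example.udfs

    import org.apache.spark.sql.hive.HiveContext

    object MyUdfs {
      def myudf(s: String): Int = s.length            // illustrative UDF body

      // Called from the application that owns the HiveContext.
      def register(hiveContext: HiveContext): Unit = {
        hiveContext.registerFunction("myudf", myudf _)
      }
    }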

about Partition Index

2015-03-20 Thread Long Cheng
Dear all, About the index of each partition of an RDD, I am wondering whether we can keep their numbering on each physical machine in a hash partitioning process. For example, a cluster containing three physical machines A,B,C (all are workers), for an RDD with six partitions, assume that the two
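
For inspection, a sketch that hash-partitions a pair RDD into six partitions and reports which partition index each task processed on which host (pairRdd is an assumed RDD of key/value pairs). Note that the partition index is a property of the RDD itself, while which physical machine ends up holding a given index is decided by the scheduler rather than by the partitioner:

    import org.apache.spark.HashPartitioner

    val partitioned = pairRdd.partitionBy(new HashPartitioner(6))

    val placement = partitioned.mapPartitionsWithIndex { (index, iter) =>
      val host = java.net.InetAddress.getLocalHost.getHostName
      Iterator((index, host, iter.size))              // (partition index, host, record count)
    }
    placement.collect().foreach(println)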

Re: Accessing AWS S3 in Frankfurt (v4 only - AWS4-HMAC-SHA256)

2015-03-20 Thread Gourav Sengupta
-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, Host: frankfurt.ingestion.batch.s3.amazonaws.com, x-amz-date: 20150320T112531Z, Authorization: AWS4-HMAC-SHA256 Credential=XXX_MY_KEY_XXX/20150320/us-east-1/s3/aws4_request,SignedHeaders=date;host;x-amz-content

Accessing AWS S3 in Frankfurt (v4 only - AWS4-HMAC-SHA256)

2015-03-20 Thread Ralf Heyde
: Fri, 20 Mar 2015 11:25:31 GMT, x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, Host: frankfurt.ingestion.batch.s3.amazonaws.com, x-amz-date: 20150320T112531Z, Authorization: AWS4-HMAC-SHA256 Credential=XXX_MY_KEY_XXX/20150320/us-east-1/s3/aws4_request

How to handle under-performing nodes in the cluster

2015-03-20 Thread Yiannis Gkoufas
Hi all, I have 6 nodes in the cluster and one of the nodes is clearly under-performing. I was wondering what the impact of having such issues is. Also, what is the recommended way to work around it? Thanks a lot, Yiannis

What is the jvm size when start spark-submit through local mode

2015-03-20 Thread Shuai Zheng
Hi, I am curious: when I start a Spark program in local mode, which parameter is used to decide the JVM memory size for the executor? In theory it should be: --executor-memory 20G. But I remember local mode will only start the Spark executor in the same process as the driver, so it should be:
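
In local mode the executor lives inside the single driver JVM that spark-submit launches, so the heap is governed by the driver memory setting (--driver-memory / spark.driver.memory) rather than --executor-memory. A quick, hedged way to verify from the application or shell:

    // Prints the maximum heap actually available to the one local-mode JVM.
    println(s"max heap = ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")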

Buffering for Socket streams

2015-03-20 Thread jamborta
Hi all, We are designing a workflow where we try to stream local files to a Socket streamer, that would clean and process the files and write them to hdfs. We have an issue with bigger files when the streamer cannot keep up with the data, and runs out of memory. What would be the best way to

Re: Accessing AWS S3 in Frankfurt (v4 only - AWS4-HMAC-SHA256)

2015-03-20 Thread Ralf Heyde
: frankfurt.ingestion.batch.s3.amazonaws.com, x-amz-date: 20150320T112531Z, Authorization: AWS4-HMAC-SHA256 Credential=XXX_MY_KEY_XXX/20150320/us-east-1/s3/aws4_request,SignedHeaders=date;host;x-amz-content-sha256;x-amz-date,Signature=2098d3175c4304e44be912b770add7594d1d1b44f545c3025be1748672ec60e4

Re: version conflict common-net

2015-03-20 Thread Jacob Abraham
Anyone? Or is this question nonsensical... and am I doing something fundamentally wrong? On Mon, Mar 16, 2015 at 5:33 PM, Jacob Abraham abe.jac...@gmail.com wrote: Hi Folks, I have a situation where I am getting a version conflict between java libraries that are used by my application and

Re: Visualizing Spark Streaming data

2015-03-20 Thread Irfan Ahmad
Grafana allows pretty slick interactive use patterns, especially with graphite as the back-end. In a multi-user environment, why not have each user just build their own independent dashboards and name them under some simple naming convention? *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics*

Re: About the env of Spark1.2

2015-03-20 Thread Ted Yu
bq. Caused by: java.net.UnknownHostException: dhcp-10-35-14-100: Name or service not known Can you check your DNS ? Cheers On Fri, Mar 20, 2015 at 8:54 PM, tangzilu zilu.t...@hotmail.com wrote: Hi All: I recently started to deploy Spark1.2 in my VisualBox Linux. But when I run the command

Filesystem closed Exception

2015-03-20 Thread Sea
Hi, all: When I exit the console of spark-sql, the following exception is thrown. My Spark version is 1.3.0, Hadoop version is 2.2.0. Exception in thread Thread-3 java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629) at

About the env of Spark1.2

2015-03-20 Thread tangzilu
Hi All: I recently started to deploy Spark 1.2 in my VirtualBox Linux. But when I run the command ./spark-shell in the path /opt/spark-1.2.1/bin, I get the result like this: [root@dhcp-10-35-14-100 bin]# ./spark-shell Using Spark's default log4j profile:

Re: Spark 1.2 often loses all executors

2015-03-20 Thread Akhil Das
Isn't that a feature? Rather than keep running a buggy pipeline, it just kills all executors. You can always handle exceptions with a proper try/catch in your code, though. Thanks Best Regards On Fri, Mar 20, 2015 at 3:51 PM, mrm ma...@skimlinks.com wrote: Hi, I recently changed from Spark 1.1 to Spark

Re: Spark Job History Server

2015-03-20 Thread Sean Owen
Uh, does that mean HDP shipped Marcelo's uncommitted patch from SPARK-1537 anyway? Given the discussion there, that seems kinda aggressive. On Wed, Mar 18, 2015 at 8:49 AM, Marcelo Vanzin van...@cloudera.com wrote: Those classes are not part of standard Spark. You may want to contact

ShuffleBlockFetcherIterator: Failed to get block(s)

2015-03-20 Thread Eric Friedman
My job crashes with a bunch of these messages in the YARN logs. What are the appropriate steps in troubleshooting? 15/03/19 23:29:45 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 10 outstanding blocks (after 3 retries) 15/03/19 23:29:45 ERROR

Re: Spark 1.2 often loses all executors

2015-03-20 Thread Davies Liu
Maybe this is related to a bug in 1.2 [1]; it's fixed in 1.2.2 (not released yet). Could you check out the 1.2 branch and verify that? [1] https://issues.apache.org/jira/browse/SPARK-5788 On Fri, Mar 20, 2015 at 3:21 AM, mrm ma...@skimlinks.com wrote: Hi, I recently changed from Spark 1.1 to Spark

Re: Spark-submit and multiple files

2015-03-20 Thread Davies Liu
You MUST put --py-files BEFORE main.py, as mentioned in other threads. On Fri, Mar 20, 2015 at 1:47 AM, Guillaume Charhon guilla...@databerries.com wrote: Hi Davies, I am already using --py-files. The system does use the other file. The error I am getting is not trivial. Please check the

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yin Huai
spark.sql.shuffle.partitions only controls the number of tasks in the second stage (the number of reducers). For your case, I'd say that the number of tasks in the first stage (the number of mappers) will be the number of files you have. Actually, have you changed spark.executor.memory (it controls

RE: Why I didn't see the benefits of using KryoSerializer

2015-03-20 Thread java8964
Hi, Imran: Thanks for your information. I found a benchmark online about serialization which compares Java vs Kryo vs gridgain at here: http://gridgain.blogspot.com/2012/12/java-serialization-good-fast-and-faster.html From my test result, in the above benchmark case for the SimpleObject, Kryo is

RE: Spark will process _temporary folder on S3 is very slow and always cause failure

2015-03-20 Thread Shuai Zheng
Thanks! Let me update the status. I have copied the DirectOutputCommitter to my local. And set: conf.set("spark.hadoop.mapred.output.committer.class", "org..DirectOutputCommitter"). It works perfectly. Thanks everyone :) Regards, Shuai From: Aaron Davidson

Re: RDD Blocks skewing to just few executors

2015-03-20 Thread Sean Owen
Hm is data locality a factor here? I don't know. Just a side note: this doesn't cause OOM errors per se since the cache won't exceed the % of heap it's allowed. However that will hasten OOM problems due to tasks using too much memory, of course. The solution is to get more memory to the tasks or

can distinct transform applied on DStream?

2015-03-20 Thread Darren Hoo
val aDstream = ...
val distinctStream = aDstream.transform(_.distinct())
but the elements in distinctStream are not distinct. Did I use it wrong?
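
transform() is applied to each batch's RDD separately, so the distinct() above only removes duplicates within one batch interval; duplicates arriving in different batches survive. A hedged sketch of one way to deduplicate over a longer span (the window length is illustrative):

    import org.apache.spark.streaming.Minutes

    // Distinct over the last 5 minutes of data rather than a single batch.
    val windowedDistinct = aDstream.window(Minutes(5)).transform(_.distinct())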

Re: ShuffleBlockFetcherIterator: Failed to get block(s)

2015-03-20 Thread Imran Rashid
I think you should see some other errors before that, from NettyBlockTransferService, with a msg like "Exception while beginning fetchBlocks". There might be a bit more information there. There is an assortment of possible causes, but first let's just make sure you have all the details from the

Re: RDD Blocks skewing to just few executors

2015-03-20 Thread Alessandro Lulli
Hi All, I'm experiencing the same issue with Spark 1.2.0 (not verified with previous versions). Could you please help us on this? Thanks Alessandro On Tue, Nov 18, 2014 at 1:40 AM, mtimper mich...@timper.com wrote: Hi I'm running a standalone cluster with 8 worker servers. I'm developing a streaming

Re: Spark Job History Server

2015-03-20 Thread Zhan Zhang
Hi Patcharee, It is an alpha feature in the HDP distribution, integrating ATS with the Spark history server. If you are using upstream, you can configure Spark as usual without these configurations. But other related configurations are still mandatory, such as those related to hdp.version. Thanks. Zhan Zhang

Re: Visualizing Spark Streaming data

2015-03-20 Thread Harut Martirosyan
But it requires all possible combinations of your filters as separate metrics; moreover, it can only show time-based information, so you cannot group by, say, country. On 20 March 2015 at 19:09, Irfan Ahmad ir...@cloudphysics.com wrote: Grafana allows pretty slick interactive use patterns,

Re: version conflict common-net

2015-03-20 Thread Sean Owen
It's not a crazy question, no. I'm having a bit of trouble figuring out what's happening. Commons Net 2.2 is what's used by Spark. The error appears to come from Spark. But the error is not finding a method that did not exist in 2.2. I am not sure what ZipStream is, for example. This could be a

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yiannis Gkoufas
Hi Yin, the way I set the configuration is: val sqlContext = new org.apache.spark.sql.SQLContext(sc); sqlContext.setConf("spark.sql.shuffle.partitions", "1000"). Is that the correct way? In the mapPartitions task (the first task which is launched), I get again the same number of tasks and again

Re: Error communicating with MapOutputTracker

2015-03-20 Thread Imran Rashid
Hi Thomas, sorry for such a late reply. I don't have any super-useful advice, but this seems like something that is important to follow up on. to answer your immediate question, No, there should not be any hard limit to the number of tasks that MapOutputTracker can handle. Though of course as

Re: FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded

2015-03-20 Thread Imran Rashid
I think you are running into a combo of https://issues.apache.org/jira/browse/SPARK-5928 and https://issues.apache.org/jira/browse/SPARK-5945 The standard solution is to just increase the number of partitions you are creating. textFile(), reduceByKey(), and sortByKey() all take an optional
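
A sketch of the partition-count arguments being referred to (the counts and path are illustrative):

    // All three calls accept an explicit number of partitions, which keeps individual
    // shuffle blocks below the 2 GB frame limit.
    val lines  = sc.textFile("hdfs:///data/input", 2000)                        // minPartitions
    val counts = lines.map(w => (w, 1L)).reduceByKey(_ + _, 2000)               // numPartitions
    val sorted = counts.map(_.swap).sortByKey(ascending = false, numPartitions = 2000)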

Re: Visualizing Spark Streaming data

2015-03-20 Thread Roger Hoover
Hi Harut, Jeff's right that Kibana + Elasticsearch can take you quite far out of the box. Depending on your volume of data, you may only be able to keep recent data around though. Another option that is custom-built for handling many dimensions at query time (not as separate metrics) is Druid

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-20 Thread Yiannis Gkoufas
Actually I realized that the correct way is: sqlContext.sql("set spark.sql.shuffle.partitions=1000") but I am still experiencing the same behavior/error. On 20 March 2015 at 16:04, Yiannis Gkoufas johngou...@gmail.com wrote: Hi Yin, the way I set the configuration is: val sqlContext = new

Re: Why I didn't see the benefits of using KryoSerializer

2015-03-20 Thread Imran Rashid
Hi Yong, yes I think your analysis is correct. I'd imagine almost all serializers out there will just convert a string to its utf-8 representation. You might be interested in adding compression on top of a serializer, which would probably bring the string size down in almost all cases, but then
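
A hedged sketch of turning that on alongside Kryo (treat the exact values as illustrative; the codec choice is a trade-off between CPU cost and size):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")             // compress serialized cached partitions
      .set("spark.io.compression.codec", "snappy")   // codec used for RDD and shuffle data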

RE: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread Shuai Zheng
Below is the output:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1967947
max locked memory (kbytes, -l) 64
max memory

Re: Spark SQL UDT Kryo serialization, Unable to find class

2015-03-20 Thread Michael Armbrust
You probably don't cause a shuffle (which requires serialization) unless there is a join or group by. It's possible that we need to pass the Spark class loader to Kryo when creating a new instance (you can get it from Utils, I believe). We never ran into this problem since this API is not

Spark per app logging

2015-03-20 Thread Udit Mehta
Hi, We have spark setup such that there are various users running multiple jobs at the same time. Currently all the logs go to 1 file specified in the log4j.properties. Is it possible to configure log4j in spark for per app/user logging instead of sending all logs to 1 file mentioned in the

Matching Spark application metrics data to App Id

2015-03-20 Thread Judy Nash
Hi, I want to get telemetry metrics on Spark app activities, such as run time and JVM activities. Using Spark Metrics I am able to get the following sample data point on an app: type=GAUGE, name=application.SparkSQL::headnode0.1426626495312.runtime_ms, value=414873 How can I match this

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-20 Thread Christian Perez
Any other users interested in a feature DataFrame.saveAsExternalTable() for making _useful_ external tables in Hive, or am I the only one? Bueller? If I start a PR for this, will it be taken seriously? On Thu, Mar 19, 2015 at 9:34 AM, Christian Perez christ...@svds.com wrote: Hi Yin, Thanks

RE: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread java8964
Could it be the ulimit for the user running Spark on your nodes? Can you run ulimit -a as the user who is running Spark on the executor node? Does the result make sense for the data you are trying to process? Yong From: szheng.c...@gmail.com To: user@spark.apache.org Subject:

EC2 cluster created by spark using old HDFS 1.0

2015-03-20 Thread morfious902002
Hi, I created a cluster using the spark-ec2 script. But it installs HDFS version 1.0. I would like to use this cluster to connect to Hive installed on a Cloudera CDH 5.3 cluster. But I am getting the following error: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate

com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread Shuai Zheng
Hi All, I am trying to run a simple sort-by on 1.2.1, and it always gives me the two errors below: 1. 15/03/20 17:48:29 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 35, ip-10-169-217-47.ec2.internal): java.io.FileNotFoundException:

Mailing list schizophrenia?

2015-03-20 Thread Jim Kleckner
I notice that some people send messages directly to user@spark.apache.org and some via nabble, either using email or the web client. There are two index sites, one directly at apache.org and one at nabble. But messages sent directly to user@spark.apache.org only show up in the apache list.

Re: Mailing list schizophrenia?

2015-03-20 Thread Jim Kleckner
Yes, it did get delivered to the apache list shown here: http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3CCAGK53LnsD59wwQrP3-9yHc38C4eevAfMbV2so%2B_wi8k0%2Btq5HQ%40mail.gmail.com%3E But the web site for spark community directs people to nabble for viewing messages and it

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: File too large vs FileNotFoundException (Too many open files) on spark 1.2.1

2015-03-20 Thread Charles Feduke
Assuming you are on Linux, what is your /etc/security/limits.conf set for nofile/soft (number of open file handles)? On Fri, Mar 20, 2015 at 3:29 PM Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I try to run a simple sort by on 1.2.1. And it always give me below two errors: 1,

Create a Spark cluster with cloudera CDH 5.2 support

2015-03-20 Thread morfious902002
Hi, I am trying to create a Spark cluster using the spark-ec2 script which will support 2.5.0-cdh5.3.2 for HDFS as well as Hive. I created a cluster by adding --hadoop-major-version=2.5.0 which solved some of the errors I was getting. But now when I run select query on hive I get the following

Re: MLlib Spam example gets stuck in Stage X

2015-03-20 Thread Xiangrui Meng
Su, which Spark version did you use? -Xiangrui On Thu, Mar 19, 2015 at 3:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: To get these metrics out, you need to open the driver ui running on port 4040. And in there you will see Stages information and for each stage you can see how much time

Re: Create a Spark cluster with cloudera CDH 5.2 support

2015-03-20 Thread Sean Owen
I think you missed -Phadoop-2.4 On Fri, Mar 20, 2015 at 5:27 PM, morfious902002 anubha...@gmail.com wrote: Hi, I am trying to create a Spark cluster using the spark-ec2 script which will support 2.5.0-cdh5.3.2 for HDFS as well as Hive. I created a cluster by adding

Re: Spark per app logging

2015-03-20 Thread Ted Yu
Are these jobs the same jobs just run by different users, or different jobs? If the latter, can each application use its own log4j.properties? Cheers On Fri, Mar 20, 2015 at 1:43 PM, Udit Mehta ume...@groupon.com wrote: Hi, We have spark setup such that there are various users running

Re: Error when using multiple python files spark-submit

2015-03-20 Thread Guillaume Charhon
I see. I will try the other way around. On Thu, Mar 19, 2015 at 8:06 PM, Davies Liu dav...@databricks.com wrote: the options of spark-submit should come before main.py, or they will become the options of main.py, so it should be: ../hadoop/spark-install/bin/spark-submit --py-files

Re: Spark-submit and multiple files

2015-03-20 Thread Petar Zecevic
I tried your program in yarn-client mode and it worked with no exception. This is the command I used: spark-submit --master yarn-client --py-files work.py main.py (Spark 1.2.1) On 20.3.2015. 9:47, Guillaume Charhon wrote: Hi Davies, I am already using --py-files. The system does use the

How to check that a dataset is sorted after it has been written out?

2015-03-20 Thread Michael Albert
Greetings! I sorted a dataset in Spark and then wrote it out in Avro/Parquet. Then I wanted to check that it was sorted. It looks like each partition has been sorted, but when reading in, the first partition (i.e., as seen in the partition index of mapPartitionsWithIndex) is not the same as
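
A hedged sketch of one way to check the cross-partition ordering after reading the data back; keyedRdd is an assumed RDD of key/value pairs and the numeric key type is only illustrative. Note that when a sorted dataset is written out and read back, the read-side partition indices are not guaranteed to correspond to the order in which the partitions were written:

    // Collect (partition index, first key, last key) per partition, then verify that
    // the key ranges do not overlap when ordered by partition index.
    val bounds = keyedRdd.mapPartitionsWithIndex { (index, iter) =>
      val keys = iter.map(_._1)
      if (!keys.hasNext) Iterator.empty
      else {
        val first = keys.next()
        var last = first
        keys.foreach(k => last = k)
        Iterator((index, first, last))
      }
    }.collect().sortBy(_._1)

    val globallySorted = bounds.sliding(2).forall {
      case Array((_, _, prevLast), (_, nextFirst, _)) => prevLast <= nextFirst
      case _ => true
    }
    println(s"globally sorted across partitions: $globallySorted")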

IPython notebook command for Spark needs to be updated?

2015-03-20 Thread cong yue
Hello: I tried the IPython notebook with the following command in my environment. PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook --pylab inline ./bin/pyspark But it shows that --pylab inline support is removed in the newest IPython version. The log is as follows: --- $

Re: IPython notebook command for Spark needs to be updated?

2015-03-20 Thread Krishna Sankar
Yep, the command-option is gone. No big deal, just add the '%pylab inline' command as part of your notebook. Cheers k/ On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com wrote: Hello: I tried the IPython notebook with the following command in my environment.