Problem with changing the akka.framesize parameter

2015-02-04 Thread sahanbull
I am trying to run a spark application with -Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000 -Dspark.akka.frameSize=1 and the job fails because one or more of the akka frames are larger than 1mb (12000 ish). When I change the -Dspark.akka.frameSize=1 to

Re: Spark Job running on localhost on yarn cluster

2015-02-04 Thread Michael Albert
1) Parameters like --num-executors should come before the jar. That is, you want something like $SPARK_HOME/bin/spark-submit --num-executors 3 --driver-memory 6g --executor-memory 7g \ --master yarn-cluster --class EDDApp target/scala-2.10/eddjar \ outputPath That is, *your* parameters come after the jar,

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-04 Thread Imran Rashid
Hi Michael, judging from the logs, it seems that those tasks are just working a really long time. If you have long running tasks, then you wouldn't expect the driver to output anything while those tasks are working. What is unusual is that there is no activity during all that time the tasks are

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread bo yang
Hi Corey, I see. Thanks for making it clear. I may have been lucky not to hit the code path of such Guava classes. But I hit some other jar conflicts when using Spark to connect to AWS S3. Then I had to manually try each version of org.apache.httpcomponents until I found a proper old version. Another

Re: 2GB limit for partitions?

2015-02-04 Thread Mridul Muralidharan
That work is from more than a year back and is not maintained anymore since we do not use it in-house now. Also note that there have been quite a lot of changes in spark ... Including some which break assumptions made in the patch, so its value is very low - having said that, do feel free to work

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-04 Thread Sandy Ryza
Also, do you see any lines in the YARN NodeManager logs where it says that it's killing a container? -Sandy On Wed, Feb 4, 2015 at 8:56 AM, Imran Rashid iras...@cloudera.com wrote: Hi Michael, judging from the logs, it seems that those tasks are just working a really long time. If you have

Re: How many stages in my application?

2015-02-04 Thread Akhil Das
You can easily understand the flow by looking at the operations in your program (like map, groupBy, join etc.). First list out the operations happening in your application, and then from the web UI you will be able to see how many operations have happened so far.

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Kelvin Chu
Joe, I also use S3 and gzip. So far the I/O is not a problem. In my case, the operation is SQLContext.JsonFile() and I can see from Ganglia that the whole cluster is CPU bound (99% saturated). I have 160 cores and I can see the network can sustain about 150MBit/s. Kelvin On Wed, Feb 4, 2015 at

Re: How many stages in my application?

2015-02-04 Thread Mark Hamstra
But there isn't a 1-1 mapping from operations to stages since multiple operations will be pipelined into a single stage if no shuffle is required. To determine the number of stages in a job you really need to be looking for shuffle boundaries. On Wed, Feb 4, 2015 at 11:27 AM, Akhil Das
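A minimal sketch of the pipelining behaviour Mark describes (the input path and operations are illustrative):

    val words = sc.textFile("hdfs:///input")          // reading, splitting and mapping are pipelined into stage 0
      .flatMap(_.split(" "))
      .map(word => (word, 1))
    val counts = words.reduceByKey(_ + _)             // shuffle boundary: a second stage begins here
    counts.count()                                    // one job, two stages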

Re: Spark streaming app shutting down

2015-02-04 Thread Akhil Das
AFAIK, from Spark 1.2.0 you can have a WAL (Write Ahead Log) for fault tolerance, which means it can handle receiver/driver failures. You can also look at the low-level Kafka consumer https://github.com/dibbhatt/kafka-spark-consumer which has a better fault tolerance mechanism for receiver
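A minimal sketch of enabling the WAL mentioned above; the flag name is taken from the streaming docs of roughly that era and the checkpoint path is illustrative, so both should be verified against the exact release in use:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-with-wal")
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // assumed flag name; check your release's docs
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/streaming-with-wal")          // the WAL is written under the checkpoint directory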

Re: advice on diagnosing Spark stall for 1.5hr out of 3.5hr job?

2015-02-04 Thread Michael Albert
Greetings! Thanks to all who have taken the time to look at this. While the process is stalled, I see, in the yarn log on the head node, repeating messages of the form Trying to fulfill reservation for application XXX on node YYY, but that node is reserved by XXX_01.  Below is a chunk of

Re: Parquet compression codecs not applied

2015-02-04 Thread Ayoub
I was using a hive context and not a sql context, therefore (SET spark.sql.parquet.compression.codec=gzip) was ignored. Michael Armbrust pointed out that parquet.compression should be used instead, which solved the issue. I am still wondering if this behavior is normal, it would be better if
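A minimal sketch of the workaround Ayoub describes, assuming a HiveContext named hiveContext and gzip as the target codec:

    hiveContext.sql("SET parquet.compression=GZIP")              // the Parquet-level property, honored on the hive write path
    // equivalently: hiveContext.setConf("parquet.compression", "GZIP")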

Re: Parquet compression codecs not applied

2015-02-04 Thread sahanbull
Hi Ayoub, You could try using the sql format to set the compression type: sc = SparkContext() sqc = SQLContext(sc) sqc.sql(SET spark.sql.parquet.compression.codec=gzip) You get a notification on screen while running the spark job when you set the compression codec like this. I haven't compared

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Marcelo Vanzin
Hi Corey, When you run on Yarn, Yarn's libraries are placed in the classpath, and they have precedence over your app's. So, with Spark 1.2, you'll get Guava 11 in your classpath (with Spark 1.1 and earlier you'd get Guava 14 from Spark, so still a problem for you). Right now, the option Markus

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-04 Thread Aaron Davidson
The latter would be faster. With S3, you want to maximize number of concurrent readers until you hit your network throughput limits. On Wed, Feb 4, 2015 at 6:20 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi if i have a 10GB file on s3 and set 10 partitions, would it be download whole
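A minimal sketch of requesting more read partitions for a splittable S3 input (the bucket, path and partition count are illustrative; a single gzip file cannot be split this way):

    // more input partitions -> more concurrent S3 readers, until network throughput saturates
    val lines = sc.textFile("s3n://my-bucket/my-data/", 100)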

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
the whole spark.files.userClassPathFirst never really worked for me in standalone mode, since jars were added dynamically which means they had different classloaders, leading to a real classloader hell if you tried to add a newer version of a jar that spark already used. see:

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
My mistake Marcelo, I was looking at the wrong message. That reply was meant for bo yang. On Feb 4, 2015 4:02 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Corey, On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote: Another suggestion is to build Spark by yourself.

Re: How to get Hive table schema using Spark SQL or otherwise

2015-02-04 Thread Michael Armbrust
sqlContext.table(tableName).schema() On Wed, Feb 4, 2015 at 1:07 PM, Ayoub benali.ayoub.i...@gmail.com wrote: Given a hive context you could execute: hiveContext.sql(describe TABLE_NAME) you would get the name of the fields and their types 2015-02-04 21:47 GMT+01:00 nitinkak001 [hidden
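A small sketch combining the two suggestions in this thread to print column names and types (the table name is illustrative):

    val schema = hiveContext.table("my_table").schema                 // a StructType
    schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))
    // or via HiveQL:
    hiveContext.sql("DESCRIBE my_table").collect().foreach(println)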

Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Spark 1.2 Data is read from parquet with 2 partitions and is cached as a table with 2 partitions. Verified in the UI that it shows the RDD with 2 partitions and that it is fully cached in memory. Cached data contains columns a, b, c. Column a has ~150 distinct values. Next, run SQL on this table as select a, sum(b),

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
marcelo, i was not aware of those fixes. its a fulltime job to keep up with spark... i will take another look. it would be great if that works on spark standalone also and resolves the issues i experienced before. about putting stuff on classpath before spark or yarn... yeah you can shoot

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Marcelo Vanzin
Hi Koert, On Wed, Feb 4, 2015 at 11:35 AM, Koert Kuipers ko...@tresata.com wrote: do i understand it correctly that on yarn the customer jars are truly placed before the yarn and spark jars on classpath? meaning at container construction time, on the same classloader? that would be great

Why are task results large in this case?

2015-02-04 Thread ankits
I am running a job, part of which is to add some null values to the rows of a SchemaRDD. The job fails with Total size of serialized results of 2692 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) This is the code: val in = sqc.parquetFile(...) .. val presentColProj:
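The limit in that error is the spark.driver.maxResultSize setting; a minimal sketch of raising it while investigating why the per-task results are so large (the value is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("large-results")
      .set("spark.driver.maxResultSize", "2g")   // default is 1g; 0 disables the limit entirely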

Re: How to get Hive table schema using Spark SQL or otherwise

2015-02-04 Thread Ayoub
Given a hive context you could execute: hiveContext.sql(describe TABLE_NAME) you would get the name of the fields and their types 2015-02-04 21:47 GMT+01:00 nitinkak001 nitinkak...@gmail.com: I want to get a Hive table schema details into Spark. Specifically, I want to get column name and

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Marcelo Vanzin
On Wed, Feb 4, 2015 at 1:12 PM, Koert Kuipers ko...@tresata.com wrote: about putting stuff on classpath before spark or yarn... yeah you can shoot yourself in the foot with it, but since the container is isolated it should be ok, no? we have been using HADOOP_USER_CLASSPATH_FIRST forever with

Re: Spark streaming - tracking/deleting processed files

2015-02-04 Thread ganterm
Thank you very much for the detailed answer. I feel a little dumb asking but how would that work when using Scala (we use Spark 1.0.2)? I can not figure it out. E.g. I am having trouble using UnionPartition and NewHadoopPartition or even ds.values() is not an option for me (in the IDE). Do you

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
i have never been a big fan of shading, since it leads to the same library being packaged many times. for example, its not unusual to have ASM 10 times in a jar because of the shading policy they promote. and all that because they broke one signature and without a really good reason. i am a big

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Marcelo Vanzin
Hi Corey, On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote: Another suggestion is to build Spark by yourself. I'm having trouble seeing what you mean here, Marcelo. Guava is already shaded to a different package for the 1.2.0 release. It shouldn't be causing conflicts.

Re: spark with cdh 5.2.1

2015-02-04 Thread Mohit Jaggi
yep...it was unnecessary to create a 2.5 profile. i struggled a bit because it wasn't clear that i *need* to select a profile using -P option. i didn't have to do that for earlier hadoop versions. On Fri, Jan 30, 2015 at 12:11 AM, Sean Owen so...@cloudera.com wrote: There is no need for a 2.5

Re: Spark SQL taking long time to print records from a table

2015-02-04 Thread Imran Rashid
Many operations in spark are lazy -- most likely your collect() statement is actually forcing evaluation of several steps earlier in the pipeline. The logs and the UI might give you some info about all the stages that are being run when you get to collect(). I think collect() is just fine if you
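A tiny illustration of the laziness Imran describes (the input path is hypothetical):

    val parsed = sc.textFile("hdfs:///logs").map(_.split("\t"))   // nothing executes yet
    val filtered = parsed.filter(_.length > 1)                    // still nothing
    filtered.collect()                                            // the whole pipeline (read, split, filter) runs here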

Tableau beta connector

2015-02-04 Thread ashu
Hi, I am trying out the tableau beta connector to Spark SQL. I have a few basic questions: Will this connector be able to fetch the schemaRDDs into tableau? Will all the schemaRDDs be exposed to tableau? Basically I am not getting what tableau will fetch at data-source? Is it existing files in HDFS?

New combination-like RDD based on two RDDs

2015-02-04 Thread dash
Hey Spark gurus! Sorry for the confusing title. I do not know exactly how to describe my problem, if you know please tell me so I can change it :-) Say I have two RDDs right now, and they are val rdd1 = sc.parallelize(List((1,(3)), (2,(5)), (3,(6 val rdd2 = sc.parallelize(List((2,(1)),

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Follow up for closure on thread ... 1. spark.sql.shuffle.partitions is not on the config page but is mentioned on http://spark.apache.org/docs/1.2.0/sql-programming-guide.html. Would be better to have it on the config page as well for the sake of completeness. Should I file a doc bug? 2. Regarding my #2

synchronously submitting spark jobs

2015-02-04 Thread ey-chih chow
Hi, I would like to submit spark jobs one by one, such that the next job will not be submitted until the previous one succeeds. spark-submit can only submit jobs asynchronously. Is there any way I can submit jobs sequentially? Thanks. Ey-Chih Chow -- View this message in context:

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Ankur Srivastava
Hi Manoj, You can set the number of partitions you want your sql query to use. By default it is 200 and thus you see that number. You can update it using the spark.sql.shuffle.partitions property: spark.sql.shuffle.partitions (default: 200) - Configures the number of partitions to use when shuffling data for
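A minimal sketch of lowering that setting for a small cached table like the one in this thread (the table name is hypothetical; 2 matches the two input partitions mentioned earlier):

    sqlContext.setConf("spark.sql.shuffle.partitions", "2")
    // or from SQL: sqlContext.sql("SET spark.sql.shuffle.partitions=2")
    val result = sqlContext.sql("SELECT a, SUM(b) FROM cached_table GROUP BY a")   // now 2 shuffle tasks instead of 200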

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
anyhow i am ranting... sorry On Wed, Feb 4, 2015 at 5:54 PM, Koert Kuipers ko...@tresata.com wrote: yeah i think we have been lucky so far. but i dont really see how i have a choice. it would be fine if say hadoop exposes a very small set of libraries as part of the classpath. but if i look

Re: Large # of tasks in groupby on single table

2015-02-04 Thread Manoj Samel
Hi, Any thoughts on this? Most of the 200 tasks in collect phase take less than 20 ms and lot of time is spent on scheduling these. I suspect the overall time will reduce if # of tasks are dropped to a much smaller # (close to # of partitions?) Any help is appreciated Thanks On Wed, Feb 4,

Re: Problem with changing the akka.framesize parameter

2015-02-04 Thread Shixiong Zhu
The unit of spark.akka.frameSize is MB. The max value is 2047. Best Regards, Shixiong Zhu 2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com: I am trying to run a spark application with -Dspark.executor.memory=30g -Dspark.kryoserializer.buffer.max.mb=2000 -Dspark.akka.frameSize=1
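For reference, a minimal sketch of setting the frame size within that 2047 MB cap (the value shown is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("bigger-frames")
      .set("spark.akka.frameSize", "512")   // value is in MB and must be <= 2047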

Re: Problem with changing the akka.framesize parameter

2015-02-04 Thread Shixiong Zhu
Could you clarify why you need a 10G akka frame size? Best Regards, Shixiong Zhu 2015-02-05 9:20 GMT+08:00 Shixiong Zhu zsxw...@gmail.com: The unit of spark.akka.frameSize is MB. The max value is 2047. Best Regards, Shixiong Zhu 2015-02-05 1:16 GMT+08:00 sahanbull sa...@skimlinks.com: I

Is there a way to access Hive UDFs in a HiveContext?

2015-02-04 Thread vishpatel
I'm trying to access a permanent Hive UDF, scala val data = hc.sql(select func.md5(some_string) from some_table) data: org.apache.spark.sql.SchemaRDD = SchemaRDD[19] at RDD at SchemaRDD.scala:108 == Query Plan == == Physical Plan == java.lang.RuntimeException: Couldn't find function func.md5
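One hedged workaround, assuming the UDF's jar is already on the classpath (the class and function names below are hypothetical), is to register the UDF as a temporary function through the HiveContext first:

    hc.sql("CREATE TEMPORARY FUNCTION my_md5 AS 'com.example.hive.udf.Md5UDF'")   // hypothetical implementing class
    val data = hc.sql("SELECT my_md5(some_string) FROM some_table")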

Re: New combination-like RDD based on two RDDs

2015-02-04 Thread dash
Problem solved. A simple join will do the work: val prefix = new PairRDDFunctions[Int, Set[Int]](sc.parallelize(List((9, Set(4)), (1,Set(3)), (2,Set(5)), (2,Set(4) val suffix = sc.parallelize(List((1, Set(1)), (2, Set(6)), (2, Set(5)), (2, Set(7
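For the record, a runnable sketch of the join approach on simplified versions of the example RDDs (the second RDD's contents are partly assumed, since the original message is cut off):

    val rdd1 = sc.parallelize(List((1, Set(3)), (2, Set(5)), (3, Set(6))))
    val rdd2 = sc.parallelize(List((2, Set(1)), (3, Set(4))))        // entries beyond the first pair are assumed
    val joined = rdd1.join(rdd2)                                     // RDD[(Int, (Set[Int], Set[Int]))], keys present in both
    joined.collect().foreach(println)                                // e.g. (2,(Set(5),Set(1))), (3,(Set(6),Set(4)))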

pyspark - gzip output compression

2015-02-04 Thread Kane Kim
How to save RDD with gzip compression? Thanks.
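A minimal sketch of one common approach, shown here in Scala (PySpark's saveAsTextFile takes an equivalent compressionCodecClass argument; the output path is illustrative):

    import org.apache.hadoop.io.compress.GzipCodec

    val rdd = sc.parallelize(1 to 1000).map(_.toString)
    rdd.saveAsTextFile("hdfs:///out/gzipped", classOf[GzipCodec])   // each part file is written gzip-compressed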

Re: Tableau beta connector

2015-02-04 Thread Ashutosh Trivedi (MT2013030)
Thanks Denny and Ismail. Denny, I went through your blog, it was a great help. I guess the tableau beta connector is also following the same procedure you described in the blog. I am building Spark now. Basically what I don't get is where to put my data so that tableau can extract it. So Ismail, its

Re: Tableau beta connector

2015-02-04 Thread İsmail Keskin
Tableau connects to Spark Thrift Server via an ODBC driver. So, none of the RDD stuff applies, you just issue SQL queries from Tableau. The table metadata can come from Hive Metastore if you place your hive-site.xml to configuration directory of Spark. On Thu, Feb 5, 2015 at 8:11 AM, ashu

Re: Spark Job running on localhost on yarn cluster

2015-02-04 Thread Felix C
Is YARN_CONF_DIR set? --- Original Message --- From: Aniket Bhatnagar aniket.bhatna...@gmail.com Sent: February 4, 2015 6:16 AM To: kundan kumar iitr.kun...@gmail.com, spark users user@spark.apache.org Subject: Re: Spark Job running on localhost on yarn cluster Have you set master in

Re: Tableau beta connector

2015-02-04 Thread Denny Lee
The context is that you would create your RDDs and then persist them in Hive. Once in Hive, the data is accessible from the Tableau extract through Spark thrift server. On Wed, Feb 4, 2015 at 23:36 Ashutosh Trivedi (MT2013030) ashutosh.triv...@iiitb.org wrote: Thanks Denny and Ismail.
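A minimal sketch of that persist-to-Hive step, assuming a HiveContext and the Spark 1.2 SchemaRDD API (the source path and table name are illustrative):

    val events = hiveContext.jsonFile("hdfs:///raw/events")   // any SchemaRDD will do
    events.saveAsTable("events")                              // registered in the Hive metastore, visible to the Thrift server (and hence Tableau)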

Re: Whether standalone spark support kerberos?

2015-02-04 Thread Jander g
Hope someone helps me. Thanks. On Wed, Feb 4, 2015 at 6:14 PM, Jander g jande...@gmail.com wrote: We have a standalone spark cluster for kerberos test. But when reading from hdfs, i get the error output: Can't get Master Kerberos principal for use as renewer. So does standalone spark

How many stages in my application?

2015-02-04 Thread Joe Wass
I'm sitting here looking at my application crunching gigabytes of data on a cluster and I have no idea if it's an hour away from completion or a minute. The web UI shows progress through each stage, but not how many stages remaining. How can I work out how many stages my program will take

Joining piped RDDs

2015-02-04 Thread Pavel Velikhov
Hi! We have a common usecase with Spark - we go out to some database, e.g. Cassandra, crunch through all of its data, but along the RDD pipeline we use a pipe operator to some script. All the data before the pipe has some unique IDs, but inside the pipe everything is lost. The only current
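One common workaround (the thread's own conclusion is cut off above) is to serialise the ID into each line fed to the pipe and parse it back out afterwards; the script name and field layout below are illustrative:

    val withIds = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    val piped = withIds
      .map { case (id, value) => s"$id\t$value" }   // carry the ID through the external process
      .pipe("./transform.sh")                       // the script must echo the leading ID back unchanged
      .map { line =>
        val Array(id, rest) = line.split("\t", 2)
        (id.toLong, rest)
      }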

Re: Exception in thread main java.lang.SecurityException: class javax.servlet.ServletRegistration'

2015-02-04 Thread Sean Owen
It means you did not exclude the Servlet APIs from some dependency in your app, and one of them is bringing it in every time. Look at the dependency tree and exclude whatever brings in javax.servlet. It should already be available in Spark, and the particular javax.servlet JAR from Oracle has
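A hedged sbt sketch of the kind of exclusion Sean describes; the offending dependency and version are hypothetical, and the real culprit should be identified from the dependency tree:

    // build.sbt -- keep the conflicting dependency but drop the servlet classes it drags in
    libraryDependencies += "org.example" % "some-library" % "1.0" exclude("javax.servlet", "servlet-api")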

Spark Job running on localhost on yarn cluster

2015-02-04 Thread kundan kumar
Hi, I am trying to execute my code on a yarn cluster The command which I am using is $SPARK_HOME/bin/spark-submit --class EDDApp target/scala-2.10/edd-application_2.10-1.0.jar --master yarn-cluster --num-executors 3 --driver-memory 6g --executor-memory 7g outputPath But, I can see that this

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
Bo yang- I am using Spark 1.2.0 and undoubtedly there are older Guava classes which are being picked up and serialized with the closures when they are sent from the driver to the executors because the class serial version ids don't match from the driver to the executors. Have you tried doing

Re: Multiple running SparkContexts detected in the same JVM!

2015-02-04 Thread Sean Owen
It means what it says. You should not have multiple SparkContexts running in one JVM. It was always the wrong thing to do, but now is an explicit error. When you run spark-shell, you already have a SparkContext (sc) so there is no need to make another one. Just don't do that. On Wed, Feb 4, 2015

Problems with GC and time to execute with different number of executors.

2015-02-04 Thread Guillermo Ortiz
I execute a job in Spark where I'm processing a file of 80Gb in HDFS. I have 5 slaves: (32 cores / 256Gb / 7 physical disks) x 5 I have been trying many different configurations with YARN. yarn.nodemanager.resource.memory-mb 196Gb yarn.nodemanager.resource.cpu-vcores 24 I have tried to execute the

Whether standalone spark support kerberos?

2015-02-04 Thread Jander g
We have a standalone spark cluster for kerberos test. But when reading from hdfs, i get the error output: Can't get Master Kerberos principal for use as renewer. So does standalone spark support kerberos? Can anyone confirm it, or what have I missed? Thanks in advance. -- Thanks, Jander

Re: 2GB limit for partitions?

2015-02-04 Thread Imran Rashid
Hi Mridul, do you think you'll keep working on this, or should this get picked up by others? Looks like there was a lot of work put into LargeByteBuffer, seems promising. thanks, Imran On Tue, Feb 3, 2015 at 7:32 PM, Mridul Muralidharan mri...@gmail.com wrote: That is fairly out of date (we

Re: Sort based shuffle not working properly?

2015-02-04 Thread Imran Rashid
I think you are interested in secondary sort, which is still being worked on: https://issues.apache.org/jira/browse/SPARK-3655 On Tue, Feb 3, 2015 at 4:41 PM, Nitin kak nitinkak...@gmail.com wrote: I thought thats what sort based shuffled did, sort the keys going to the same partition. I
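Until SPARK-3655 lands, one hedged workaround available since 1.2 is repartitionAndSortWithinPartitions with a partitioner that looks only at the primary part of a composite key (key and value types below are illustrative):

    import org.apache.spark.Partitioner

    // partition by the primary key only; rows are sorted by (primary, secondary) within each partition
    class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int = {
        val (primary, _) = key.asInstanceOf[(String, Int)]
        (primary.hashCode % partitions + partitions) % partitions
      }
    }

    val data = sc.parallelize(Seq((("a", 2), "x"), (("a", 1), "y"), (("b", 3), "z")))
    val sorted = data.repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(4))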

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Koert Kuipers
yeah i think we have been lucky so far. but i dont really see how i have a choice. it would be fine if say hadoop exposes a very small set of libraries as part of the classpath. but if i look at the jars on hadoop classpath its a ton! and why? why does parquet need to be included with hadoop for

Re: Spark streaming app shutting down

2015-02-04 Thread Dibyendu Bhattacharya
Thanks Akhil for mentioning this Low Level Consumer ( https://github.com/dibbhatt/kafka-spark-consumer ). Yes, it has a better fault tolerant mechanism than any existing Kafka consumer available. This has no data loss on receiver failure and has the ability to replay or restart itself in case of