Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
FWIW, this is an essential feature to our use of Spark, and I'm surprised it's not advertised clearly as a limitation in the documentation. All I've found about running Spark 1.3 on 2.11 is here: http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211 Also, I'm experiencing

Re: Spark 1.3 saveAsTextFile with codec gives error - works with Spark 1.2

2015-04-17 Thread Akhil Das
Not sure if this will help, but try clearing your jar cache directories (for sbt, ~/.ivy2; for maven, ~/.m2). Thanks Best Regards On Wed, Apr 15, 2015 at 9:33 PM, Manoj Samel manojsamelt...@gmail.com wrote: Env - Spark 1.3, Hadoop 2.3, Kerberos xx.saveAsTextFile(path, codec) gives following

Re: Distinct is very slow

2015-04-17 Thread Akhil Das
How many tasks are you seeing in your mapToPair stage? Is it 7000? Then I suggest giving a number similar/close to 7000 in your .distinct call. What is happening in your case is that you are repartitioning your data to a smaller number (32), which would put a lot of load on processing i
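A minimal sketch of this suggestion, assuming a pair RDD named records (the 7000 figure comes from the thread):

  // Passing a partition count to distinct avoids repartitioning down to 32 first.
  val deduped = records.distinct(7000)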

Re: SparkR: Server IPC version 9 cannot communicate with client version 4

2015-04-17 Thread Akhil Das
There's a version incompatibility between your Hadoop jars. You need to make sure you build your Spark against Hadoop version 2.5.0-cdh5.3.1. Thanks Best Regards On Fri, Apr 17, 2015 at 5:17 AM, lalasriza . lala.s.r...@gmail.com wrote: Dear everyone, right now I am working with SparkR on

Re: Task result in Spark Worker Node

2015-04-17 Thread Raghav Shankar
Hey Imran, Thanks for the great explanation! This cleared up a lot of things for me. I am actually trying to utilize some of the features within Spark for a system I am developing. I am currently working on developing a subsystem that can be integrated within Spark and other Big Data

Re: Spark on Windows

2015-04-17 Thread Sree V
Spark 'master' branch (i.e. v1.4.0) builds successfully on Windows 8.1, Intel i7 64-bit, with Oracle JDK 8u45, with Maven opts without the flag -XX:ReservedCodeCacheSize=1g. Takes about 33 minutes. Thanking you. With Regards Sree On Thursday, April 16, 2015 9:07 PM, Arun Lists

Re: aliasing aggregate columns?

2015-04-17 Thread elliott cordo
FYI, the problem is that the column names Spark generates cannot be referenced within SQL or DataFrame operations (i.e. SUM(cool_cnt#725)). Any idea how to alias these final aggregate columns? The syntax below doesn't make sense, but this is what I'd ideally want to do:
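A minimal sketch of one way to alias the aggregate up front in the Spark 1.3 DataFrame API (df, user_id and cool_cnt are assumed names, not from the thread):

  import org.apache.spark.sql.functions._
  // Aliasing inside agg() replaces the generated SUM(cool_cnt#725)-style name.
  val result = df.groupBy("user_id").agg(sum("cool_cnt").as("cool_cnt"))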

RE: Spark Directed Acyclic Graph / Jobs

2015-04-17 Thread Shao, Saisai
I think this paper will be a good resource (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf); the Dryad paper is also a good one. Thanks Jerry From: James King [mailto:jakwebin...@gmail.com] Sent: Friday, April 17, 2015 3:26 PM To: user Subject: Spark Directed Acyclic

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Krist Rastislav
Hello again, steps to reproduce the same problem in JdbcRDD: - create a table containing a Date field in your favourite DBMS, I used PostgreSQL: CREATE TABLE spark_test ( pk_spark_test integer NOT NULL, text character varying(25), date1 date, CONSTRAINT pk PRIMARY KEY (pk_spark_test) )

Re: Actor not found

2015-04-17 Thread Shixiong Zhu
I just checked the code for creating OutputCommitCoordinator. Could you reproduce this issue? If so, could you provide details about how to reproduce it? Best Regards, Shixiong(Ryan) Zhu 2015-04-16 13:27 GMT+08:00 Canoe canoe...@gmail.com: 13119 Exception in thread main

RE: How to do dispatching in Streaming?

2015-04-17 Thread Evo Eftimov
Good use of analogies :) Yep, friction (or entropy in general) exists in everything – but hey, by adding and doing “more work” at the same time (aka more powerful rockets) some people have overcome the friction of the air and even got as far as the moon and beyond. It is all about the

Some questions on Multiple Streams

2015-04-17 Thread Laeeq Ahmed
Hi, I am working with multiple Kafka streams (23 streams) and currently I am processing them separately. I receive one stream from each topic. I have the following questions. 1. The Spark Streaming guide suggests unioning these streams. Is it possible to get statistics of each stream even after
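A minimal sketch of the union suggested by the guide, assuming the 23 receivers are already collected in a Seq named kafkaStreams:

  // One unified DStream; per-stream statistics would have to be computed
  // before the union, e.g. by tagging records with their topic first.
  val unified = ssc.union(kafkaStreams)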

Re: Task result in Spark Worker Node

2015-04-17 Thread Raghav Shankar
My apologies, I had pasted the wrong exception trace in the previous email. Here is the actual exception that I am receiving. Exception in thread main java.lang.NullPointerException at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:154) at

Re: Random pairs / RDD order

2015-04-17 Thread Aurélien Bellet
Hi Sean, Thanks a lot for your reply. The problem is that I need to sample random *independent* pairs. If I draw two samples and build all n*(n-1) pairs then there is a lot of dependency. My current solution is also not satisfying because some pairs (the closest ones in a partition) have a

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
Doesn't this reduce to Scala isn't compatible with itself across maintenance releases? Meaning, if this were fixed then Scala 2.11.{x < 6} would have similar failures. It's not not-ready; it's just not the Scala 2.11.6 REPL. Still, sure, I'd favor breaking the unofficial support to at least make the

RE: General configurations on CDH5 to achieve maximum Spark Performance

2015-04-17 Thread Evo Eftimov
And btw if you suspect this is a YARN issue you can always launch and use Spark in a Standalone Mode which uses its own embedded cluster resource manager - this is possible even when Spark has been deployed on CDH under YARN by the pre-canned install scripts of CDH To achieve that: 1.

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Wang, Daoyuan
Normally I use something like the following in Scala: case class datetest(x: Int, y: java.sql.Date); val dt = sc.parallelize(1 to 3).map(p => datetest(p, new java.sql.Date(p*1000*60*60*24))); sqlContext.createDataFrame(dt).registerTempTable("t1"); sql("select * from t1").collect.foreach(println) If you still

Path issue in running spark

2015-04-17 Thread mas
A very basic but strange problem: on running the master I am getting the following error. My Java path is proper; however, the spark-class file errors because the string bin/java is duplicated in the path. Can anybody explain why this happens? Error: /bin/spark-class: line 190: exec:

Addition of new Metrics for killed executors.

2015-04-17 Thread Archit Thakur
Hi, We are planning to add new metrics in Spark for the executors that got killed during execution. Was just curious why this info is not already present. Is there some reason for not adding it? Any ideas are welcome. Thanks and Regards, Archit Thakur.

Re: Joined RDD

2015-04-17 Thread Archit Thakur
map phase of join* On Fri, Apr 17, 2015 at 5:28 PM, Archit Thakur archit279tha...@gmail.com wrote: Ajay, This is true. When we call join again on two RDDs, rather than computing the whole pipeline again, it reads the map output of the map phase of an RDD (which it usually gets from shuffle

Re: Custom partitioner

2015-04-17 Thread Jeetendra Gangele
Hi Archit, Thanks for the reply. How can I do the custom compilation to reduce it to 4 bytes? I want to make it 4 bytes in any case; can you please guide? I am applying flatMapValues in each step after zipWithIndex; it should be on the same node, right? Why is it shuffling? Also I am running with very less

Re: Executor memory in web UI

2015-04-17 Thread Sean Owen
This is the fraction available for caching, which is 60% * 90% * total by default. On Fri, Apr 17, 2015 at 11:30 AM, podioss grega...@hotmail.com wrote: Hi, i am a bit confused with the executor-memory option. I am running applications with Standalone cluster manager with 8 workers with 4gb
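As a worked example with the 1.x defaults: an executor given --executor-memory 1g shows roughly 1024 MB * 0.6 (spark.storage.memoryFraction) * 0.9 (safety fraction) ≈ 550 MB as cache space in the web UI.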

Re: Spark on Windows

2015-04-17 Thread Arun Lists
Thanks, Sree! Are you able to run your applications using spark-submit? Even after we were able to build successfully, we ran into problems with running the spark-submit script. If everything worked correctly for you, we can hope that things will be smoother when 1.4.0 is made generally

RE: Streaming problems running 24x7

2015-04-17 Thread González Salgado , Miquel
Hi Akhil, Thank you for your response. I think it is not because of the processing time; in fact the delay is under 1 second, while the batch interval is 10 seconds… The data volume is low (10 lines / second). By the way, I have seen some results changing to this call of KafkaUtils:

Re: RDD collect hangs on large input data

2015-04-17 Thread Zsolt Tóth
Thanks for your answer Imran. I haven't tried your suggestions yet, but setting spark.shuffle.blockTransferService=nio solved my issue. There is a JIRA for this: https://issues.apache.org/jira/browse/SPARK-6962. Zsolt 2015-04-14 21:57 GMT+02:00 Imran Rashid iras...@cloudera.com: is it possible
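For reference, a sketch of setting that property programmatically; it can equally go in spark-defaults.conf or be passed with --conf:

  import org.apache.spark.SparkConf
  // Fall back to the older NIO transfer service (the default is netty in 1.2+).
  val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")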

Re: Custom partitioner

2015-04-17 Thread Archit Thakur
By custom installation, I meant changing the code and building it. I have not done the complete impact analysis, just had a look at the code. When you say the same key goes to the same node: it would need shuffling unless the raw data you are reading is already laid out that way. On Apr 17, 2015 6:30 PM, Jeetendra

Running into several problems with Data Frames

2015-04-17 Thread Darin McBeath
I decided to play around with DataFrames this morning but I'm running into quite a few issues. I'm assuming that I must be doing something wrong so would appreciate some advice. First, I create my Data Frame. import sqlContext.implicits._ case class Entity(InternalId: Long, EntityId: Long,

ClassCastException while caching a query

2015-04-17 Thread Tash Chainar
Hi all, Spark 1.2.1. I have a Cassandra column family and am doing the following: SchemaRDD s = cassandraSQLContext.sql("select user.id as user_id from user"); // user.id is UUID in table definition s.registerTempTable("my_user"); s.cache(); // throws following exception // tried the

SparkStreaming 1.3.0 fileNotFound Exception while using WAL Checkpoints

2015-04-17 Thread Akhil Das
Hi With SparkStreaming on 1.3.0 version when I'm using WAL and checkpoints, sometimes, I'm hitting fileNotFound exceptions. Here's the complete stacktrace: https://gist.github.com/akhld/126b945f7fef408a525e The application simply reads data from Kafka and does a simple wordcount over it. Batch

Executor memory in web UI

2015-04-17 Thread podioss
Hi, I am a bit confused with the executor-memory option. I am running applications with the Standalone cluster manager, with 8 workers with 4 GB memory and 2 cores each, and when I submit my application with spark-submit I use --executor-memory 1g. In the web UI, in the completed applications table, I see

Metrics Servlet on spark 1.2

2015-04-17 Thread Udit Mehta
Hi, I am unable to access the metrics servlet on spark 1.2. I tried to access it from the app master UI on port 4040 but I don't see any metrics there. Is it a known issue with spark 1.2 or am I doing something wrong? Also how do I publish my own metrics and view them on this servlet? Thanks,

When are TaskCompletionListeners called?

2015-04-17 Thread Akshat Aranya
Hi, I'm trying to figure out when TaskCompletionListeners are called -- are they called at the end of the RDD's compute() method, or after the iteration through the iterator returned by compute() is completed? To put it another way, is this OK: class DatabaseRDD[T] extends RDD[T] { def

Re: How to do dispatching in Streaming?

2015-04-17 Thread Jianshi Huang
Thanks everyone for the reply. Looks like foreachRDD + filtering is the way to go. I'll have 4 independent Spark streaming applications so the overhead seems acceptable. Jianshi On Fri, Apr 17, 2015 at 5:17 PM, Evo Eftimov evo.efti...@isecc.com wrote: Good use of analogies :) Yep friction
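A minimal sketch of the foreachRDD-plus-filtering dispatch being settled on here (the topic field and literal values are assumptions for illustration):

  stream.foreachRDD { rdd =>
    // Route each message class to its own processing path.
    val orders = rdd.filter(_.topic == "orders")
    val clicks = rdd.filter(_.topic == "clicks")
    // ...process the filtered RDDs independently...
  }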

Spark hanging after main method completes

2015-04-17 Thread apropion
I recently started using Spark version 1.3.0 in standalone mode (with Scala 2.10.3), and I'm running into an odd problem. I'm loading data from a file using sc.textFile, doing some conversion of the data, and then clustering it. When I do this with a small file (10 lines, 9 KB), it works fine, and

Re: Distinct is very slow

2015-04-17 Thread Jeetendra Gangele
I am saying to partition with something like partitionBy(new HashPartitioner(16)); will this not work? On 17 April 2015 at 21:28, Jeetendra Gangele gangele...@gmail.com wrote: I have given 3000 tasks to mapToPair and now it's taking so much memory and shuffling and wasting time there. Here are the stats
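For reference, the complete call on a pair RDD would look like this sketch (partitionBy itself incurs one shuffle; afterwards equal keys share a partition):

  import org.apache.spark.HashPartitioner
  val partitioned = pairRdd.partitionBy(new HashPartitioner(16)).cache()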

Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread ARose
So I'm trying to store the results of a query into a DataFrame, but I get the following exception thrown: Exception in thread main java.lang.RuntimeException: [1.71] failure: ``*'' expected but `select' found SELECT DISTINCT OutSwitchID FROM wtbECRTemp WHERE OutSwtichID NOT IN (SELECT SwitchID

Re: Which version of Hive QL is Spark 1.3.0 using?

2015-04-17 Thread Denny Lee
Support for subqueries in predicates hasn't been resolved yet - please refer to SPARK-4226. BTW, Spark 1.3 by default binds to Hive 0.13.1. On Fri, Apr 17, 2015 at 09:18 ARose ashley.r...@telarix.com wrote: So I'm trying to store the results of a query into a DataFrame, but I get the

Re: When are TaskCompletionListeners called?

2015-04-17 Thread Imran Rashid
It's the latter -- after Spark gets to the end of the iterator (or if it hits an exception). So your example is good; that is exactly what it is intended for. On Fri, Apr 17, 2015 at 12:23 PM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm trying to figure out when TaskCompletionListeners are
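A minimal sketch of the confirmed pattern, inside a hypothetical DatabaseRDD's compute() (openConnection and runQuery are placeholder helpers, not real API):

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val conn = openConnection() // hypothetical helper
    // Runs once Spark exhausts this iterator, or if the task fails.
    context.addTaskCompletionListener { _ => conn.close() }
    runQuery(conn) // hypothetical: returns an Iterator[T]
  }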

Re: Spark Code to read RCFiles

2015-04-17 Thread gle
Hi, I'm new to Spark and am working on a proof of concept. I'm using Spark 1.3.0 and running in local mode. I can read and parse an RCFile using Spark however the performance is not as good as I hoped. I'm testing using ~800k rows and it is taking about 30 mins to process. Is there a better

Re: How to persist RDD return from partitionBy() to disk?

2015-04-17 Thread Imran Rashid
https://issues.apache.org/jira/browse/SPARK-1061 note the proposed fix isn't to have spark automatically know about the partitioner when it reloads the data, but at least to make it *possible* for it to be done at the application level. On Fri, Apr 17, 2015 at 11:35 AM, Wang, Ningjun (LNG-NPV)

Need Custom RDD

2015-04-17 Thread Jeetendra Gangele
Hi All, I have an RDD[Object], then I convert it to RDD[(Object, Long)] with zipWithIndex; here the index is a Long and takes 8 bytes. Is there any way to make it an Integer? There is no API available which gives an Int index. How can I create a custom RDD so that it takes only 4 bytes for the index part? Also why API is
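Short of writing a custom RDD, a hedged workaround is narrowing the index afterwards; zipWithIndex still materializes a Long per record, so this only shrinks what is kept downstream, and it assumes fewer than Int.MaxValue elements:

  val indexed = rdd.zipWithIndex().map { case (v, i) => (v, i.toInt) }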

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Thanks. Would that distribution work for hdp 2.2? On Fri, Apr 17, 2015 at 2:19 PM, Zhan Zhang zzh...@hortonworks.com wrote: You don’t need to put any yarn assembly in hdfs. The spark assembly jar will include everything. It looks like your package does not include yarn module, although I

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
H... I don't follow. The 2.11.x series is supposed to be binary compatible against user code. Anyway, I was building Spark against 2.11.2 and still saw the problems with the REPL. I've created a bug report: https://issues.apache.org/jira/browse/SPARK-6989

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
Hi Udit, By the way, do you mind sharing the whole log trace? Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit Mehta ume...@groupon.com wrote: I am just trying to launch a spark shell and not do anything fancy. I got the binary distribution from apache and put

Re: Spark hanging after main method completes

2015-04-17 Thread apropion
I was using sbt, and I found that I actually had specified Spark 0.9.1 there. Once I upgraded my sbt config file to use 1.3.0, and Scala to 2.10.4, the problem went away. Michael

Re: Why does the HDFS parquet file generated by Spark SQL have different size with those on Tachyon?

2015-04-17 Thread Reynold Xin
It's because you did a repartition -- which rearranges all the data. Parquet uses all kinds of compression techniques such as dictionary encoding and run-length encoding, which would result in the size difference when the data is ordered differently. On Fri, Apr 17, 2015 at 4:51 AM, zhangxiongfei

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
You don’t need to put any yarn assembly in hdfs. The spark assembly jar will include everything. It looks like your package does not include the yarn module, although I didn’t find anything wrong in your mvn command. Can you check whether the ExecutorLauncher class is in your jar file or not? BTW:

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
You probably want to first try the basic configuration to see whether it works, instead of setting SPARK_JAR pointing to the hdfs location. This error is caused by not finding ExecutorLauncher in class path, and not HDP specific, I think. Thanks. Zhan Zhang On Apr 17, 2015, at 2:26 PM, Udit

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Hi, This is the log trace: https://gist.github.com/uditmehta27/511eac0b76e6d61f8b47 On the yarn RM UI, I see : Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher The command I run is: bin/spark-shell --master yarn-client The spark defaults I use is:

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
I followed the steps described above and I still get this error: Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher I am trying to build spark 1.3 on hdp 2.2. I built spark from source using: build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive

Announcing Spark 1.3.1 and 1.2.2

2015-04-17 Thread Patrick Wendell
Hi All, I'm happy to announce the Spark 1.3.1 and 1.2.2 maintenance releases. We recommend all users on the 1.3 and 1.2 Spark branches upgrade to these releases, which contain several important bug fixes. Download Spark 1.3.1 or 1.2.2: http://spark.apache.org/downloads.html Release notes:

How to avoid “Invalid checkpoint directory” error in apache Spark?

2015-04-17 Thread Peng Cheng
I'm using Amazon EMR + S3 as my spark cluster infrastructure. When I run a job with periodic checkpointing (it has a long dependency tree, so truncating by checkpointing is mandatory; each checkpoint has 320 partitions), the job stops halfway, resulting in an exception: (On driver)

Can't get SparkListener to work

2015-04-17 Thread Praveen Balaji
I'm trying to create a simple SparkListener to get notified of errors on executors. I do not get any callbacks on my SparkListener. Here's some simple code I'm executing in spark-shell, but I still don't get any callbacks on my listener. Am I doing something wrong? Thanks for any clue you can send

Re: Can't get SparkListener to work

2015-04-17 Thread Imran Rashid
When you start the spark-shell, it's already too late to get the ApplicationStart event. Try listening for StageCompleted or JobEnd instead. On Fri, Apr 17, 2015 at 5:54 PM, Praveen Balaji secondorderpolynom...@gmail.com wrote: I'm trying to create a simple SparkListener to get notified of
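A minimal sketch of listening for those events instead (callback and event names as in the 1.3 org.apache.spark.scheduler.SparkListener trait):

  import org.apache.spark.scheduler._
  sc.addSparkListener(new SparkListener {
    override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
      println(s"stage ${stage.stageInfo.stageId} failureReason=${stage.stageInfo.failureReason}")
    override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
      println(s"job ${jobEnd.jobId} result=${jobEnd.jobResult}")
  })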

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Zhan Zhang
Besides the hdp.version in spark-defaults.conf, I think you probably forgot to put the file java-opts under $SPARK_HOME/conf with the following contents. [root@c6402 conf]# pwd /usr/hdp/current/spark-client/conf [root@c6402 conf]# ls fairscheduler.xml.template java-opts
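For reference, java-opts is typically a single line carrying the cluster's HDP stack version (the version string below is only an example; substitute the one actually installed on your cluster):

  -Dhdp.version=2.2.0.0-2041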

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-04-17 Thread Udit Mehta
Thanks Zhang, that solved the error. This is probably not documented anywhere so I missed it. Thanks again, Udit On Fri, Apr 17, 2015 at 3:24 PM, Zhan Zhang zzh...@hortonworks.com wrote: Besides the hdp.version in spark-defaults.conf, I think you probably forgot to put the file java-opts

RE: ClassCastException processing date fields using spark SQL since 1.3.0

2015-04-17 Thread Wang, Daoyuan
Thank you for the explanation! I’ll check what can be done here. From: Krist Rastislav [mailto:rkr...@vub.sk] Sent: Friday, April 17, 2015 9:03 PM To: Wang, Daoyuan; Michael Armbrust Cc: user Subject: RE: ClassCastException processing date fields using spark SQL since 1.3.0 So finally,

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Sean Owen
You are running on 2.11.6, right? Of course, it seems like that should all work, but it doesn't work for you. My point is that the shell you are saying doesn't work is Scala's 2.11.2 shell -- with some light modification. It's possible that the delta is the problem. I can't entirely make out

Re: External JARs not loading Spark Shell Scala 2.11

2015-04-17 Thread Michael Allman
I actually just saw your comment on SPARK-6989 before this message. So I'll copy to the mailing list: I'm not sure I understand what you mean about running on 2.11.6. I'm just running the spark-shell command. It in turn is running java -cp

local directories for spark running on yarn

2015-04-17 Thread shenyanls
According to the documentation: The local directories used by Spark executors will be the local directories configured for YARN (Hadoop YARN config yarn.nodemanager.local-dirs). If the user specifies spark.local.dir, it will be ignored. (https://spark.apache.org/docs/1.2.1/running-on-yarn.html)

Re: Can't get SparkListener to work

2015-04-17 Thread Praveen Balaji
Thanks for the response, Imran. I probably chose the wrong methods for this email. I implemented all methods of SparkListener and the only callback I get is onExecutorMetricsUpdate. Here's the complete code: == import org.apache.spark.scheduler._ sc.addSparkListener(new SparkListener()