Spark on Mesos fine-grained - has one core less per executor

2015-06-11 Thread hbogert
I'm doing a performance analysis of Spark on Mesos and I can see that the coarse-grained backend simply launches tasks in waves sized to the number of cores available. But it seems that in fine-grained mode the Mesos executor takes 1 core for itself (so one core less per Mesos slave). Shouldn't fine- and

[no subject]

2015-06-11 Thread Wangfei (X)
Hi, all. We use Spark SQL to insert data from a text table into a partitioned table and found that if we give more cores to the executors, the insert performance gets worse. executors num | total-executor-cores | average time for insert task | 3

Re: cannot access port 4040

2015-06-11 Thread mrm
Hi, Akhil Das suggested using SSH tunnelling (ssh -L 4040:127.0.0.1:4040 master-ip, then opening localhost:4040 in a browser) and this solved my problem, so it made me think that the settings of my cluster were wrong. So I checked the inbound rules for the security group of my cluster and I

spark sql insert into table performance issue

2015-06-11 Thread Wangfei (X)
From: Wangfei (X) Sent: 11 June 2015 17:33 To: user@spark.apache.org Subject: Hi, all. We use Spark SQL to insert data from a text table into a partitioned table and found that if we give more cores to the executors, the insert performance gets worse.

Re: Is there a way to limit the sql query result size?

2015-06-11 Thread neeravsalaria
Hi Eric, We are also running into the same issue. Were you able to find a suitable solution to this problem? Best Regards -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-limit-the-sql-query-result-size-tp18316p23272.html Sent from

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Hi Marchelo, The collected data end up in, say, class C, and c.id is the id of each of those items. But an id might appear more than once across those 1 mil XML files, so I am doing a reduceByKey(). Even if I had multiple binaryFiles RDDs, wouldn't I have to ++ those in order to correctly

Re: Issue running Spark 1.4 on Yarn

2015-06-11 Thread matvey14
No, this is just a random queue name I picked when submitting the job; there's no specific configuration for it. I am not logged in, so I don't have the default fair scheduler configuration in front of me, but I don't think that's the problem. The cluster is completely idle; there aren't any jobs being

How to run scala script in Datastax Spark distribution?

2015-06-11 Thread amit tewari
Hi, I am struggling to find out how to run a Scala script on DataStax Spark. (SPARK_HOME/bin/spark-shell -i test.scala is deprecated.) I don't want to use the Scala prompt. Thanks AT

spark stream and spark sql with data warehouse

2015-06-11 Thread ??????
Hi all: We are trying to use Spark to do some real-time data processing. I need to do some SQL-like queries and analytical tasks on the real-time data against historical normalized data stored in databases. Has anyone done this kind of work or design? Any suggestion or material

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
If I want to restart my consumers on an updated cluster topology after the cluster has been expanded or contracted, would I need to call stop() on them and then start() on them again, or would I need to instantiate and start new context objects (new JavaStreamingContext(...))? I'm thinking of
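For what it's worth, a stopped streaming context cannot be restarted, so the usual pattern is to stop it and build a fresh one. A minimal Scala sketch (the Java API is analogous; createStreamingContext is a hypothetical factory that wires up your streams):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Hypothetical factory: builds a context and sets up the Kafka DStreams and outputs.
    def createStreamingContext(conf: SparkConf): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10))
      // ... create Kafka streams and output operations here ...
      ssc
    }

    var ssc = createStreamingContext(new SparkConf().setAppName("consumer"))
    ssc.start()

    // After the cluster topology changes: stop (finishing in-flight batches) and rebuild.
    ssc.stop(true, true) // stopSparkContext = true, stopGracefully = true
    ssc = createStreamingContext(new SparkConf().setAppName("consumer"))
    ssc.start()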

How to deal with an explosive flatmap

2015-06-11 Thread matthewrj
I have a flatmap that takes a line from an input file, splits it by tab into words then returns an array of tuples consisting of every combination of 2 words. I then go on to count the frequency of each combination across the whole file using a reduce by key (where the key is the tuple). I am

Re: Reopen Jira or New Jira

2015-06-11 Thread Yana Kadiyska
John, I took the liberty of reopening because I have sufficient JIRA permissions (not sure if you do). It would be good if you can add relevant comments/investigations there. On Thu, Jun 11, 2015 at 8:34 AM, John Omernik j...@omernik.com wrote: Hey all, from my other post on Spark 1.3.1 issues,

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
Now I am profiling the executor. There seems to be a memory leak. 20 mins into the run there were: 157k byte[] allocated for 75MB, 519k java.lang.ref.Finalizer for 31MB, 291k java.util.zip.Inflater for 17MB, and 487k java.util.zip.ZStreamRef for 11MB. An hour into the run I got: 186k byte[]

How to pass arguments dynamically, that needs to be used in executors

2015-06-11 Thread gaurav sharma
Hi, I am using a Kafka + Spark cluster for a real-time aggregation analytics use case in production. Cluster details: 6 nodes, each node running 1 Spark and 1 Kafka process. Node 1 - 1 Master, 1 Worker, 1 Driver, 1 Kafka process; Nodes 2,3,4,5,6 - 1 Worker process each

Re: How to set KryoRegistrator class in spark-shell

2015-06-11 Thread Igor Berman
Another option would be to close sc and open new context with your custom configuration On Jun 11, 2015 01:17, bhomass bhom...@gmail.com wrote: you need to register using spark-default.xml as explained here
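A minimal sketch of that option, with a hypothetical registrator class name:

    import org.apache.spark.{SparkConf, SparkContext}

    // Inside spark-shell: stop the existing context and build one with Kryo configured.
    sc.stop()
    val conf = new SparkConf()
      .setAppName("kryo-shell")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "foo.bar.MyRegistrator") // hypothetical registrator class
    val sc2 = new SparkContext(conf)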

Re: How to set KryoRegistrator class in spark-shell

2015-06-11 Thread Eugen Cepoi
Or launch the spark-shell with --conf spark.kryo.registrator=foo.bar.MyClass 2015-06-11 14:30 GMT+02:00 Igor Berman igor.ber...@gmail.com: Another option would be to close sc and open new context with your custom configuration On Jun 11, 2015 01:17, bhomass bhom...@gmail.com wrote: you need

Re: How to pass arguments dynamically, that needs to be used in executors

2015-06-11 Thread Todd Nist
Hi Gaurav, Seems like you could use a broadcast variable for this if I understand your use case. Create it in the driver based on the CommandLineArguments and then use it in the workers. https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables So something like:
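A minimal sketch of that pattern (the settings map and names here are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-args"))

    // Values parsed from the driver's command line (hypothetical example).
    val settings = Map("topic" -> args(0), "threshold" -> args(1))

    // Ship the settings to every executor once, instead of capturing them per task.
    val settingsBc = sc.broadcast(settings)

    val data = sc.textFile("hdfs:///path/to/input") // placeholder path
    val filtered = data.filter { line =>
      // Read the broadcast value inside tasks running on the executors.
      line.contains(settingsBc.value("topic"))
    }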

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-11 Thread Josh Mahonin
Hi Jeroen, No problem. I think there's some magic involved with how the Spark classloader(s) works, especially with regards to the HBase dependencies. I know there's probably a more light-weight solution that doesn't require customizing the Spark setup, but that's the most straight-forward way

Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
Hi Spark Users, I'm trying to load a literally big file (50GB when compressed as gzip, stored in HDFS) by receiving a DStream using `ssc.textFileStream`, as this file cannot fit in my memory. However, it looks like no RDD will be received until I copy this big file to a pre-specified

Could Spark batch processing live within Spark Streaming?

2015-06-11 Thread diplomatic Guru
Hello all, I was wondering if it is possible to have a high-latency batch processing job coexist with a Spark Streaming job? If it's possible, could we then share the state of the batch job with the Spark Streaming job? For example, when the long-running batch computation is complete, could we

ClassCastException: BlockManagerId cannot be cast to [B

2015-06-11 Thread davidkl
Hello, I am running a streaming app on Spark 1.2.1. When running locally everything works fine. When I try yarn-cluster it fails and I see a ClassCastException in the log (see below). I can run Spark (non-streaming) apps in the cluster with no problem. Any ideas here? Thanks in advance! WARN

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
...and it keeps on increasing. Maybe there is a bug in some code that zips/unzips data. 109k instances of byte[] followed by 1 mil instances of Finalizer, with ~500k Deflaters, ~500k Inflaters and 1 mil ZStreamRef. I assume that's due to either binaryFiles or saveAsObjectFile. On 11/06/15

Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-11 Thread shahab
Hi, I tried to read a CSV file from Amazon S3, but I get the following exception which I have no clue how to solve. I tried both Spark 1.3.1 and 1.2.1, but no success. Any idea how to solve this is appreciated. best, /Shahab the code: val hadoopConf=sc.hadoopConfiguration;

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-11 Thread Konstantinos Kougios
After 2h of running, I now have 10GB of long[], 1.3 mil instances of long[]. So probably information about the files again. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail:

Reopen Jira or New Jira

2015-06-11 Thread John Omernik
Hey all, from my other post on Spark 1.3.1 issues, I think we found an issue related to a previously closed JIRA ( https://issues.apache.org/jira/browse/SPARK-1403). Basically it looks like the thread context class loader is NULL, which is causing the NPE in MapR, and that's similar to the posted JIRA.

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I see. It makes a lot of sense now. It is not unique to Spark, but it would be great if it were mentioned in the Spark documentation. I have been using Hadoop for a while and I was not aware of it! Zheng zheng On Thu, Jun 11, 2015 at 7:21 PM, Will Briggs wrbri...@gmail.com wrote: To be fair, this is

Re: hiveContext.sql NullPointerException

2015-06-11 Thread patcharee
Hi, Does df.write.partitionBy(partitions).format(format).mode(overwrite).saveAsTable(tbl) support ORC files? I tried df.write.partitionBy("zone", "z", "year", "month").format("orc").mode("overwrite").saveAsTable("tbl"), but after the insert my table tbl schema has been changed to something I did not

Re: UDF accessing hive struct array fails with buffer underflow from kryo

2015-06-11 Thread Yutong Luo
This is a kryo issue. https://github.com/EsotericSoftware/kryo/issues/124. It has to do with the lengths of the fieldnames. This issue is fixed in Kryo 2.23. What's weird is this doesn't break on Hive itself, only when using SparkSQL. Attached is the full stacktrace. It might be how SparkSQL is

Deleting HDFS files from Pyspark

2015-06-11 Thread Siegfried Bilstein
I've seen plenty of examples for creating HDFS files from pyspark but haven't been able to figure out how to delete files from pyspark. Is there an API I am missing for filesystem management? Or should I be including the HDFS python modules? Thanks, Siegfried

Does MLLib has attribute importance?

2015-06-11 Thread Ruslan Dautkhanov
What would be closest equivalent in MLLib to Oracle Data Miner's Attribute Importance mining function? http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/feature_extr.htm#i1005920 Attribute importance is a supervised function that ranks attributes according to their significance in

Re: Fully in-memory shuffles

2015-06-11 Thread Patrick Wendell
Hey Corey, Yes, when shuffles are smaller than the memory available to the OS, most often the outputs never get stored to disk. I believe the same holds for the YARN shuffle service, because the write path is actually the same, i.e. we don't fsync the writes and force them to disk. I would guess in

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Amit Ramesh
Hi Jerry, Take a look at this example: https://spark.apache.org/docs/latest/streaming-kafka-integration.html#tab_scala_2 The offsets are needed because as RDDs get generated within Spark the offsets move further along. With direct Kafka mode the current offsets are no longer persisted in Zookeeper
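For reference, the Scala pattern from that linked page looks roughly like this (a sketch; the Python equivalent is what is being asked about):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // ssc is an existing StreamingContext; broker and topic names are placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.foreachRDD { rdd =>
      // Each KafkaRDD carries its offset ranges.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
    }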

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Amit Ramesh
Thanks, Jerry. That's what I suspected based on the code I looked at. Any pointers on what is needed to build in this support would be great. This is critical to the project we are currently working on. Thanks! On Thu, Jun 11, 2015 at 10:54 PM, Saisai Shao sai.sai.s...@gmail.com wrote: OK, I

RE: How to set spark master URL to contain domain name?

2015-06-11 Thread prajod.vettiyattil
Hi Ningjun, This is probably a configuration difference between WIN01 and WIN02. Execute ipconfig /all on the Windows command line on both machines and compare the output. Also, if you have a localhost entry in the hosts file, it should not be in the wrong sequence: see the first answer in this

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
OK, I get it. I think the Python-based Kafka direct API currently does not provide such an equivalent to the Scala one; maybe we should figure out how to add this to the Python API as well. 2015-06-12 13:48 GMT+08:00 Amit Ramesh a...@yelp.com: Hi Jerry, Take a look at this example:

Re: Reading Really Big File Stream from HDFS

2015-06-11 Thread SLiZn Liu
Hmm, you have a good point. So should I load the file by `sc.textFile()` and specify a high number of partitions, and the file is then split into partitions in memory across the cluster? On Thu, Jun 11, 2015 at 9:27 PM ayan guha guha.a...@gmail.com wrote: Why do you need to use stream in this
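A minimal sketch of that batch approach, with a placeholder path; note that a single gzip file is not splittable, so it may need to be decompressed (or stored with a splittable codec) before it can really be spread across partitions:

    // sc: the existing SparkContext. Ask for many partitions up front,
    // and repartition if the input still arrives in too few.
    val lines = sc.textFile("hdfs:///data/bigfile.txt", 1000) // second arg = minimum partitions
    val spread = lines.repartition(1000)
    println(spread.count())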

Error using json4s with Apache Spark in spark-shell

2015-06-11 Thread Daniel Mahler
This is a cross-post of a StackOverflow question that has not been answered. I have just started a 50-point bounty on it there: http://stackoverflow.com/questions/30744294/error-using-json4s-with-apache-spark-in-spark-shell You can answer it there if you prefer. I am trying to use the case class extraction

Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Amit Ramesh
Congratulations on the release of 1.4! I have been trying out the direct Kafka support in python but haven't been able to figure out how to get the offsets from the RDD. Looks like the documentation is yet to be updated to include Python examples (

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-11 Thread Heath Guo
Thanks for your reply! In my use case, it would stream from only one stdin. Also, I'm working with Scala. It would be great if you could talk about the multi-stdin case as well! Thanks. From: Tathagata Das t...@databricks.com Date: Thursday, June 11, 2015 at 8:11 PM To:

Re: Spark 1.4: Python API for getting Kafka offsets in direct mode?

2015-06-11 Thread Saisai Shao
Hi, What do you mean by getting the offsets from the RDD? From my understanding, the offsetRange is a parameter you provided to KafkaRDD, so why do you still want to get back the one you previously set? Thanks Jerry 2015-06-12 12:36 GMT+08:00 Amit Ramesh a...@yelp.com: Congratulations on the

How many APPs does spark support to running simultaneously in one cluster?

2015-06-11 Thread luohui20001
Hi guys: I'm trying to run 24 apps simultaneously in one Spark cluster. However, every time my cluster can only run 17 apps at the same time. The other apps disappeared, with no logs and no failures. Any ideas will be appreciated. Here is my code: object GeneCompare5 { def
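One thing worth checking (an assumption, since the cluster manager isn't stated): in standalone mode each application grabs all available cores by default, so applications submitted later simply wait for resources. Capping per-application resources lets more of them run side by side; a sketch with illustrative values:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("GeneCompare5")
      .set("spark.cores.max", "4")        // cap the cores this app may take (standalone mode)
      .set("spark.executor.memory", "2g") // keep per-app memory bounded as well
    val sc = new SparkContext(conf)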

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-11 Thread Jeroen Vlek
Hi Josh, That worked! Thank you so much! (I can't believe it was something so obvious ;) ) If you care about such a thing you could answer my question here for bounty: http://stackoverflow.com/questions/30639659/apache-phoenix-4-3-1-and-4-4-0-hbase-0-98-on-spark-1-3-1-classnotfoundexceptio

how to deal with continued records

2015-06-11 Thread Zhang Jiaqiang
Hello, I have a large CSV file in which consecutive records (with the same RecordID) are related and should be treated as ONE complete record. Also, the RecordID will be reset to 1 whenever the CSV dumper system thinks it's necessary. I'd like to get some

Re: Fully in-memory shuffles

2015-06-11 Thread Mark Hamstra
I would guess in such shuffles the bottleneck is serializing the data rather than raw IO, so I'm not sure explicitly buffering the data in the JVM process would yield a large improvement. Good guess! It is very hard to beat the performance of retrieving shuffle outputs from the OS buffer

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Tathagata Das
Let me try to add some clarity to the different directions of thought going on in this thread. 1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES? If there are no rate limits set up, the most reliable way to detect whether the current Spark cluster is insufficient to handle the data

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-11 Thread Tathagata Das
Are you going to receive data from one stdin from one machine, or many stdins on many machines? On Thu, Jun 11, 2015 at 7:25 PM, foobar heath...@fb.com wrote: Hi, I'm new to Spark Streaming, and I want to create a application where Spark Streaming could create DStream from stdin. Basically I

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Dmitry Goldenberg
Yes, Tathagata, thank you. For #1, the 'need detection', one idea we're entertaining is timestamping the messages coming into the Kafka topics. The consumers would check the interval between the time they get the message and that message origination timestamp. As Kafka topics start to fill up

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-11 Thread Tathagata Das
Do you have the event logging enabled? TD On Thu, Jun 11, 2015 at 11:24 AM, scar0909 scar0...@gmail.com wrote: I have the same problem. i realized that the master spark becomes unresponsive when we kill the leader zookeeper (of course i assigned the leader election task to the zookeeper).

Re: Deleting HDFS files from Pyspark

2015-06-11 Thread ayan guha
The simplest way would be issuing an os.system call with the HDFS rm command from the driver, assuming it has HDFS connectivity, like a gateway node. Executors will have nothing to do with it. On 12 Jun 2015 08:57, Siegfried Bilstein sbilst...@gmail.com wrote: I've seen plenty of examples for creating HDFS files
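If you would rather stay inside the Hadoop API than shell out, the JVM-side call looks roughly like this (shown in Scala as a sketch; PySpark can reach the same classes through its JVM gateway, though that is an internal, less documented route):

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Delete a path on HDFS from the driver using Hadoop's FileSystem API.
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val deleted = fs.delete(new Path("/tmp/old-output"), true) // true = recursive; placeholder path
    println(s"deleted: $deleted")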

Re: Join between DStream and Periodically-Changing-RDD

2015-06-11 Thread Tathagata Das
Another approach not mentioned is to use a function to get the RDD that is to be joined. Not sure, but you can try something like this: kvDstream.foreachRDD { rdd => val kvFile = getOrUpdateRDD(params...); rdd.join(kvFile) } The
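A slightly fuller sketch of the same idea using transform, so the result is itself a DStream (getOrUpdateRDD is the hypothetical helper that caches and periodically refreshes the lookup RDD):

    // Join each batch against the most recent version of a slowly changing lookup RDD.
    val joined = kvDstream.transform { rdd =>
      val lookup = getOrUpdateRDD(params) // hypothetical: returns the current RDD[(K, V)]
      rdd.join(lookup)
    }
    joined.print()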

Is it possible to see Spark jobs on MapReduce job history ? (running Spark on YARN cluster)

2015-06-11 Thread Elkhan Dadashov
Hi all, I wonder if anyone has used the MapReduce Job History server to show Spark jobs. I can see my Spark jobs (Spark running on a YARN cluster) in the Resource Manager (RM). I start the Spark History Server, and then through Spark's web-based user interface I can monitor the cluster (and track cluster and

Spark Streaming reads from stdin or output from command line utility

2015-06-11 Thread foobar
Hi, I'm new to Spark Streaming, and I want to create an application where Spark Streaming creates a DStream from stdin. Basically I have a command-line utility that generates stream data, and I'd like to pipe that data into a DStream. What's the best way to do that? I thought rdd.pipe() could help, but
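One common way to feed the output of a command-line utility into Spark Streaming is a custom receiver that runs the tool and stores each line it prints; a minimal sketch (names and wiring are illustrative, not from this thread):

    import java.io.{BufferedReader, InputStreamReader}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class CommandReceiver(cmd: Seq[String]) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      @volatile private var process: Process = _

      def onStart(): Unit = {
        new Thread("command-receiver") {
          override def run(): Unit = {
            process = new ProcessBuilder(cmd: _*).start()
            val reader = new BufferedReader(new InputStreamReader(process.getInputStream))
            var line = reader.readLine()
            while (!isStopped() && line != null) {
              store(line) // hand each line to Spark Streaming
              line = reader.readLine()
            }
          }
        }.start()
      }

      def onStop(): Unit = if (process != null) process.destroy()
    }

    // Usage: val lines = ssc.receiverStream(new CommandReceiver(Seq("/path/to/tool", "--stream")))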

Re: SparkSQL JDBC Datasources API when running on YARN - Spark 1.3.0

2015-06-11 Thread Sathish Kumaran Vairavelu
Hi Nathan, I am also facing the issue with Spark 1.3. Did you find any workaround for this issue? Please help Thanks Sathish On Thu, Apr 16, 2015 at 6:03 AM Nathan McCarthy nathan.mccar...@quantium.com.au wrote: Its JTDS 1.3.1; http://sourceforge.net/projects/jtds/files/jtds/1.3.1/ I

How many APPs does spark support to run simultaneously in one cluster?

2015-06-11 Thread luohui20001
Hi there, If I set spark.scheduler.mode to FAIR, how many apps does Spark support running simultaneously in one cluster? Is there an upper limit? Thanks. Thanks & Best regards! San.Luo

Re: Spark standalone mode and kerberized cluster

2015-06-11 Thread Steve Loughran
That's Spark on YARN with Kerberos. In Spark 1.3 you can submit work to a Kerberized Hadoop cluster; once the tokens you passed up with your app submission expire (~72 hours), your job can't access HDFS any more. That's been addressed in Spark 1.4, where you can now specify a Kerberos keytab for

spark-sql from CLI ---EXCEPTION: java.lang.OutOfMemoryError: Java heap space

2015-06-11 Thread Sanjay Subramanian
Hey guys, I use Hive and Impala daily and intensively and want to transition to spark-sql in CLI mode. Currently in my sandbox I am using Spark (standalone mode) from the CDH distribution (starving developer version 5.3.3) on a 3-datanode Hadoop cluster, 32GB RAM per node, 8 cores per node. | spark |

ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
I would like to work with RDD pairs of Tuple2<byte[], obj>, but byte[]s with the same contents are considered different keys because their reference values are different. I didn't see any way to pass in a custom comparator. I could convert the byte[] into a String with an explicit charset, but

Re: spark stream and spark sql with data warehouse

2015-06-11 Thread Yi Tian
Here is an example:

    val sc = new SparkContext(new SparkConf)
    // access hive tables
    val hqlc = new HiveContext(sc)
    import hqlc.implicits._
    // access files on hdfs
    val sqlc = new SQLContext(sc)
    import sqlc.implicits._
    sqlc.jsonFile(xxx).registerTempTable(xxx)
    // access other DB
    sqlc.jdbc(url,

Re: Running Spark in Local Mode

2015-06-11 Thread mrm
Hi, Did you resolve this? I have the same questions. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-tp22279p23278.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

[ANNOUNCE] Announcing Spark 1.4

2015-06-11 Thread Patrick Wendell
Hi All, I'm happy to announce the availability of Spark 1.4.0! Spark 1.4.0 is the fifth release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 210 developers and more than 1,000 commits! A huge thanks go to all of the individuals and organizations

Limit Spark Shuffle Disk Usage

2015-06-11 Thread Al M
I am using Spark on a machine with limited disk space. I am using it to analyze very large (100GB to 1TB per file) data sets stored in HDFS. When I analyze these datasets, I will run groups, joins and cogroups. All of these operations mean lots of shuffle files written to disk. Unfortunately

Re: How to deal with an explosive flatmap

2015-06-11 Thread Ted Yu
From http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ : Look at the number of partitions in the parent RDD and then keep multiplying that by 1.5 until performance stops improving. FYI On Thu, Jun 11, 2015 at 7:57 AM, matthewrj mle...@gmail.com wrote: I have a
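Concretely, a higher partition count can be passed straight to the wide transformation; a small sketch of the word-pair counting case, with illustrative names and numbers:

    val lines = sc.textFile("hdfs:///input.tsv") // placeholder path
    val pairs = lines.flatMap { line =>
      val words = line.split("\t")
      for (a <- words; b <- words if a < b) yield ((a, b), 1) // every combination of two words
    }
    // Give the shuffle many more partitions than the parent RDD had.
    val counts = pairs.reduceByKey(_ + _, 2000)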

takeSample() results in two stages

2015-06-11 Thread barmaley
I've observed interesting behavior in Spark 1.3.1, the reason for which is not clear. Doing something as simple as sc.textFile(...).takeSample(...) always results in two stages (screenshot: http://apache-spark-user-list.1001560.n3.nabble.com/file/n23280/Capture.jpg)

Re: ReduceByKey with a byte array as the key

2015-06-11 Thread Sonal Goyal
I think if you wrap the byte[] into an object and implement equals and hashCode methods, you may be able to do this. There will be the overhead of an extra object, but conceptually it should work unless I am missing something. Best Regards, Sonal Founder, Nube Technologies http://www.nubetech.co
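A minimal Scala sketch of that wrapper idea (the thread itself is Java, but the shape is the same): wrap the bytes, define equality by content, and key by the wrapper.

    import java.util.Arrays

    // Wrapper that gives byte[] content-based equals/hashCode.
    case class BytesKey(bytes: Array[Byte]) {
      override def equals(other: Any): Boolean = other match {
        case BytesKey(b) => Arrays.equals(bytes, b)
        case _           => false
      }
      override def hashCode(): Int = Arrays.hashCode(bytes)
    }

    // Usage sketch with toy data: identical contents now reduce to one key.
    val rdd = sc.parallelize(Seq(("k1".getBytes("UTF-8"), 1L), ("k1".getBytes("UTF-8"), 2L)))
    val reduced = rdd.map { case (k, v) => (BytesKey(k), v) }.reduceByKey(_ + _)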

RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Mark Tse
Makes sense – I suspect what you suggested should work. However, I think the overhead between this and using `String` would be similar enough to warrant just using `String`. Mark From: Sonal Goyal [mailto:sonalgoy...@gmail.com] Sent: June-11-15 12:58 PM To: Mark Tse Cc: user@spark.apache.org

DataFrames for non-SQL computation?

2015-06-11 Thread Tom Hubregtsen
I've looked a bit into what DataFrames are, and it seems that most posts on the subject are related to SQL, but it does seem to be very efficient. My main question is: are DataFrames also beneficial for non-SQL computations? For instance I want to: - sort k/v pairs (in particular, is the naive

RE: ReduceByKey with a byte array as the key

2015-06-11 Thread Aaron Davidson
Be careful shoving arbitrary binary data into a String; invalid UTF characters can cause significant computational overhead in my experience. On Jun 11, 2015 10:09 AM, Mark Tse mark@d2l.com wrote: Makes sense – I suspect what you suggested should work. However, I think the overhead

Re: spark-1.2.0--standalone-ha-zookeeper

2015-06-11 Thread scar0909
Hello, have you found a solution? I have the same issue -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-0-standalone-ha-zookeeper-tp21308p23282.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-11 Thread scar0909
I have the same problem. I realized that the Spark master becomes unresponsive when we kill the leader ZooKeeper (of course I assigned the leader election task to ZooKeeper). Please let me know if you have any developments. -- View this message in context:

Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
I load a list of ids from a text file as NLineInputFormat, and when I do distinct(), it returns an incorrect count. JavaRDD<Text> idListData = jvc.hadoopFile(idList, NLineInputFormat.class, LongWritable.class, Text.class).values().distinct() I should have

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Sean Owen
Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep around refs to them since they change. So you may have a bunch of copies of one object at the end that become just one in each partition. On Thu, Jun 11, 2015, 8:36 PM Crystal Xing

Re: Issue running Spark 1.4 on Yarn

2015-06-11 Thread nsalian
Hello, Since the other queues are fine, I reckon there may be a limit on the max apps or memory on this queue in particular. I don't suspect fair scheduler limits either, but on this queue we may be hitting a maximum. Could you try to get the configs for the queue? That should provide

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Crystal Xing
That is a little scary. So you mean that, in general, we shouldn't use Hadoop's Writables as keys in an RDD? Zheng zheng On Thu, Jun 11, 2015 at 6:44 PM, Sean Owen so...@cloudera.com wrote: Guess: it has something to do with the Text object being reused by Hadoop? You can't in general keep around refs

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Sean Owen
Yep you need to use a transformation of the raw value; use toString for example. On Thu, Jun 11, 2015, 8:54 PM Crystal Xing crystalxin...@gmail.com wrote: That is a little scary. So you mean in general, we shouldn't use hadoop's writable as Key in RDD? Zheng zheng On Thu, Jun 11, 2015 at
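In Scala terms, a minimal sketch of that fix: map the reused Writable to an immutable String before calling distinct().

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.lib.NLineInputFormat

    // Copy each Text into a String immediately, then deduplicate.
    val ids = sc
      .hadoopFile("hdfs:///path/to/idList", classOf[NLineInputFormat],
        classOf[LongWritable], classOf[Text])
      .map { case (_, text) => text.toString } // materialize before distinct()
      .distinct()
    println(ids.count())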

Re: Spark distinct() returns incorrect results for some types?

2015-06-11 Thread Will Briggs
To be fair, this is a long-standing issue due to optimizations for object reuse in the Hadoop API, and isn't necessarily a failing in Spark - see this blog post (https://cornercases.wordpress.com/2011/08/18/hadoop-object-reuse-pitfall-all-my-reducer-values-are-the-same/) from 2011 documenting

how to use a properties file from a url in spark-submit

2015-06-11 Thread Gary Ogden
I have a properties file that is hosted at a URL. I would like to be able to use that URL in the --properties-file parameter when submitting a job to Mesos using spark-submit via Chronos. I would rather do this than use a file on the local server. This doesn't seem to work, though, when submitting

Re: how to use a properties file from a url in spark-submit

2015-06-11 Thread Marcelo Vanzin
That's not supported. You could use wget / curl to download the file to a temp location before running spark-submit, though. On Thu, Jun 11, 2015 at 12:48 PM, Gary Ogden gog...@gmail.com wrote: I have a properties file that is hosted at a url. I would like to be able to use the url in the

Re: DataFrames for non-SQL computation?

2015-06-11 Thread Michael Armbrust
Yes, DataFrames are for much more than SQL and I would recommend using them where ever possible. It is much easier for us to do optimizations when we have more information about the schema of your data, and as such, most of our on going optimization effort will focus on making DataFrames faster.
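For instance, the sorting and aggregation cases from the question look like this with the DataFrame API, no SQL strings involved (a sketch against 1.3/1.4-era APIs):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // k/v pairs as a DataFrame; sort and aggregate without writing any SQL.
    val df = sc.parallelize(Seq(("a", 3), ("b", 1), ("a", 2))).toDF("k", "v")
    val sorted     = df.sort($"k", $"v")
    val aggregated = df.groupBy($"k").agg(sum($"v").as("total"))
    aggregated.show()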

How to set spark master URL to contain domain name?

2015-06-11 Thread Wang, Ningjun (LNG-NPV)
I start the Spark master on Windows using bin\spark-class.cmd org.apache.spark.deploy.master.Master. Then I go to http://localhost:8080/ to find the master URL; it is spark://WIN02:7077. Here WIN02 is my machine name. Why is it missing the domain name? If I start the Spark master on other

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-11 Thread Ashish Nigam
Any idea why this happens? On Wed, Jun 10, 2015 at 9:28 AM, Ashish Nigam ashnigamt...@gmail.com wrote: BTW, I am using spark streaming 1.2.0 version. On Wed, Jun 10, 2015 at 9:26 AM, Ashish Nigam ashnigamt...@gmail.com wrote: I did not change driver program. I just shutdown the context