Re: Parquet problems

2015-07-22 Thread Anders Arpteg
No, never really resolved the problem, except by increasing the permgen space, which only partially solved it. Still have to restart the job multiple times to make the whole job complete (it stores intermediate results). The parquet data sources have about 70 columns, and yes Cheng, it works fine

spark.deploy.spreadOut core allocation

2015-07-22 Thread Srikanth
Hello, I've set spark.deploy.spreadOut=false in spark-env.sh: export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 -Dspark.deploy.spreadOut=false". There are 3 workers, each with 4 cores. Spark-shell was started with no. of cores = 6. The Spark UI shows that one executor was used with 6 cores. Is

Re: user threads in executors

2015-07-22 Thread Shushant Arora
Thanks! I am using spark streaming 1.3. And if some post fails for any reason, I will store the offset of that message in another kafka topic. I want to read these offsets in another spark job and then read the original kafka topic's messages based on these offsets - So is it possible

problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-22 Thread rok
I am trying to run Spark applications with the driver running locally and interacting with a firewalled remote cluster via a SOCKS proxy. I have to modify the hadoop configuration on the *local machine* to try to make this work, adding property

Re: user threads in executors

2015-07-22 Thread Cody Koeninger
Yes, look at KafkaUtils.createRDD On Wed, Jul 22, 2015 at 11:17 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks ! I am using spark streaming 1.3 , And if some post fails because of any reason, I will store the offset of that message in another kafka topic. I want to read these
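A minimal Scala sketch of that createRDD approach, assuming the Spark 1.3 spark-streaming-kafka API; the topic name, broker list and offset values are illustrative placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Offsets previously saved by the streaming job: topic, partition, fromOffset, untilOffset
    val offsetRanges = Array(OffsetRange("failed-posts", 0, 100L, 250L))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Re-read exactly those messages from the original topic as a batch RDD
    val retryRDD = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)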

Re: Parquet problems

2015-07-22 Thread Cheng Lian
How many columns are there in these Parquet files? Could you load a small portion of the original large dataset successfully? Cheng On 6/25/15 5:52 PM, Anders Arpteg wrote: Yes, both the driver and the executors. Works a little bit better with more space, but still a leak that will cause

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
Yeah, the benefit of `saveAsTable` is that you don't need to deal with schema explicitly, while the benefit of ALTER TABLE is you still have a standard vanilla Hive table. Cheng On 7/22/15 11:00 PM, Dean Wampler wrote: While it's not recommended to overwrite files Hive thinks it understands,
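A small sketch of the `saveAsTable` path Cheng contrasts with ALTER TABLE, assuming the Spark 1.4 DataFrameWriter API; the table name is illustrative:

    // Let Spark track the (possibly evolving) schema in the metastore
    // instead of hand-editing it with ALTER TABLE.
    df.write.format("parquet").mode("append").saveAsTable("t1")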

Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
Hello all, We are having a major performance issue with Spark, which is holding us back from going live. We have a job that carries out computation on log files and writes the results into an Oracle DB. The reducer 'reduceByKey' has been set to parallelize by 4, as we don't want to establish too

Help accessing protected S3

2015-07-22 Thread Greg Anderson
I have a protected s3 bucket that requires a certain IAM role to access. When I start my cluster using the spark-ec2 script, everything works just fine until I try to read from that part of s3. Here is the command I am using: ./spark-ec2 -k KEY -i KEY_FILE.pem

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on the Spark 1.5 snapshot? This is what I tried at the end. It seems it is a 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com wrote: No, never really resolved the problem, except by

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Stahlman, Jonathan
Hi Burak, Looking at the source code, the intermediate RDDs used in ALS.train() are persisted during the computation using intermediateRDDStorageLevel (default value is StorageLevel.MEMORY_AND_DISK) - see
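The storage level Jonathan refers to can be set explicitly by constructing the ALS instance directly instead of calling ALS.train(); a sketch, assuming the Spark 1.3/1.4 MLlib API, with rank and iteration counts as placeholders and ratings_train being the RDD[Rating] from the thread:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.storage.StorageLevel

    val als = new ALS()
      .setRank(10)
      .setIterations(10)
      .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER)
      .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)

    val model = als.run(ratings_train) // ratings_train: RDD[Rating]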

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Burak Yavuz
Hi Jonathan, I believe calling persist with StorageLevel.NONE doesn't do anything. That's why the unpersist has an if statement before it. Could you give more information about your setup please? Number of cores, memory, number of partitions of ratings_train? Thanks, Burak On Wed, Jul 22, 2015

How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread swetha
Hi, We have a requirement wherein we need to keep RDDs in memory between Spark batch processing that happens every one hour. The idea here is to have RDDs with active user sessions in memory between two jobs, so that once one job's processing is done and another job runs an hour later, the RDDs

RE: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Ganelin, Ilya
To be unpersisted, the RDD must be persisted first. If it's set to None, then it's not persisted, and as such does not need to be freed. Does that make sense? Thank you, Ilya Ganelin -Original Message- From: Stahlman, Jonathan

Re: Spark spark.shuffle.memoryFraction has no effect

2015-07-22 Thread Walid Baroni
Hi Andrew, I tried many different combinations, but the UI still shows no change in the amount of shuffle bytes spilled to disk. I made sure the configuration has been applied by checking Spark UI/Environment. I only see changes in shuffle bytes spilled if I disable spark.shuffle.spill

Re: Parquet problems

2015-07-22 Thread Michael Misiewicz
For what it's worth, my data set has around 85 columns in Parquet format as well. I have tried bumping the permgen up to 512m but I'm still getting errors in the driver thread. On Wed, Jul 22, 2015 at 1:20 PM, Jerry Lam chiling...@gmail.com wrote: Hi guys, I noticed that too. Anders, can you

Re: How to share a Map among RDDS?

2015-07-22 Thread Dan Dong
Hi, Andrew, If I broadcast the Map: val map2 = sc.broadcast(map1) I will get compilation error: org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,String]] does not take parameters [error] val matchs = Vecs.map(term => term.map{ case (a,b) => (map2(a),b) }) Seems it's still an

Re: How to share a Map among RDDS?

2015-07-22 Thread Dan Dong
Thanks Andrew, exactly. 2015-07-22 14:26 GMT-05:00 Andrew Or and...@databricks.com: Hi Dan, `map2` is a broadcast variable, not your map. To access the map on the executors you need to do `map2.value(a)`. -Andrew 2015-07-22 12:20 GMT-07:00 Dan Dong dongda...@gmail.com: Hi, Andrew,

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan, `map2` is a broadcast variable, not your map. To access the map on the executors you need to do `map2.value(a)`. -Andrew 2015-07-22 12:20 GMT-07:00 Dan Dong dongda...@gmail.com: Hi, Andrew, If I broadcast the Map: val map2=sc.broadcast(map1) I will get compilation error:
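A minimal sketch of the fix Andrew describes, assuming a small immutable Map and the Vecs RDD from Dan's snippet (the placeholder data is illustrative):

    // Broadcast the small lookup map once...
    val map1 = Map(1 -> "a", 2 -> "b")
    val map2 = sc.broadcast(map1)

    // Vecs as in the snippet above: an RDD of (id, value) sequences (placeholder data)
    val Vecs = sc.parallelize(Seq(Seq((1, 0.5), (2, 1.5))))

    // ...and read it on the executors through .value, not by applying map2 directly
    val matchs = Vecs.map(term => term.map { case (a, b) => (map2.value(a), b) })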

Re: spark.executor.memory and spark.driver.memory have no effect in yarn-cluster mode (1.4.x)?

2015-07-22 Thread Michael Misiewicz
That makes a lot of sense, thanks for the concise answer! On Wed, Jul 22, 2015 at 4:10 PM, Andrew Or and...@databricks.com wrote: Hi Michael, In general, driver related properties should not be set through the SparkConf. This is because by the time the SparkConf is created, we have already
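A sketch of setting those memory values on the spark-submit command line instead of in the SparkConf, which is what matters in yarn-cluster mode; the class and jar names are placeholders:

    spark-submit --master yarn-cluster \
      --driver-memory 4g \
      --executor-memory 8g \
      --class com.example.MyApp myapp.jar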

Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth, It does look like a bug. Did you set `spark.executor.cores` in your application by any chance? -Andrew 2015-07-22 8:05 GMT-07:00 Srikanth srikanth...@gmail.com: Hello, I've set spark.deploy.spreadOut=false in spark-env.sh. export

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Tachyon is one way. Also check out the Spark Job Server https://github.com/spark-jobserver/spark-jobserver .

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread harirajaram
I was about to say what the previous post said, so +1 to the previous post. From my understanding (gut feeling) of your requirement, it is very easy to do this with spark-jobserver.

Spark DataFrame created from JavaRDD<Row> copies all columns' data into first column

2015-07-22 Thread unk1102
Hi, I have a DataFrame which I need to convert into a JavaRDD<Row> and back to a DataFrame. I have the following code: DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file"); // I do order by on the above sourceFrame and then I convert it into a JavaRDD JavaRDD<Row> modifiedRDD =

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread Robin East
The first question I would ask is have you determined whether you have a performance issue writing to Oracle? In particular how many commits are you making? If you are issuing a lot of commits that would be a performance problem. Robin On 22 Jul 2015, at 19:11, diplomatic Guru

spark.executor.memory and spark.driver.memory have no effect in yarn-cluster mode (1.4.x)?

2015-07-22 Thread Michael Misiewicz
Hi group, I seem to have encountered a weird problem with 'spark-submit' and manually setting sparkconf values in my applications. It seems like setting the configuration values spark.executor.memory and spark.driver.memory don't have any effect, when they are set from within my application

databricks spark sql csv FAILFAST not failing, Spark 1.3.1 Java 7

2015-07-22 Thread Adam Pritchard
Hi all, I am using the databricks csv library to load some data into a data frame. https://github.com/databricks/spark-csv I am trying to confirm that failfast mode works correctly and aborts execution upon receiving an invalid csv file. But I have not been able to see it fail yet after testing
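A sketch of wiring the mode option through the Spark 1.3-style load API; the path is a placeholder, and note that FAILFAST typically only fires once an action actually reads the data:

    val df = sqlContext.load(
      "com.databricks.spark.csv",
      Map("path" -> "/path/to/file.csv", "header" -> "true", "mode" -> "FAILFAST"))
    df.count() // the malformed-line exception should surface here, not at load()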

Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Srikanth
Cool. Thanks! Srikanth On Wed, Jul 22, 2015 at 3:12 PM, Andrew Or and...@databricks.com wrote: Hi Srikanth, I was able to reproduce the issue by setting `spark.cores.max` to a number greater than the number of cores on a worker. I've filed SPARK-9260 which I believe is already being fixed

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Stahlman, Jonathan
Hello again, In trying to understand the caching of intermediate RDDs by ALS, I looked into the source code and found what may be a bug. Looking here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L230 you see that ALS.train()

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Actually, I should clarify - Tachyon is a way to keep your data in RAM, but it's not exactly the same as keeping it cached in Spark. Spark Job Server is a way to keep it cached in Spark.
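A sketch of the Tachyon-as-shared-memory variant, assuming the Tachyon client is configured for the cluster; the URI, placeholder RDD and element types are illustrative. Within a single long-lived context (e.g. spark-jobserver) a plain .cache() on the shared RDD is the simpler route:

    // placeholder for the real session data
    val sessionsRDD = sc.parallelize(Seq(("user1", "session-a")))

    // Hourly job N writes its session RDD into Tachyon-backed storage...
    sessionsRDD.saveAsObjectFile("tachyon://tachyon-master:19998/sessions/latest")

    // ...and job N+1, an hour later, reads it back from Tachyon
    val sessions = sc.objectFile[(String, String)]("tachyon://tachyon-master:19998/sessions/latest")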

Re: spark streaming 1.3 issues

2015-07-22 Thread Shushant Arora
In spark streaming 1.3 - say I have 10 executors, each with 4 cores, so in total 40 tasks in parallel at once. If I repartition the kafka directstream to 40 partitions, vs say I have 300 partitions in the kafka topic - which one will be more efficient? Should I repartition the kafka stream equal to num of

Re: spark.deploy.spreadOut core allocation

2015-07-22 Thread Andrew Or
Hi Srikanth, I was able to reproduce the issue by setting `spark.cores.max` to a number greater than the number of cores on a worker. I've filed SPARK-9260 which I believe is already being fixed in https://github.com/apache/spark/pull/7274. Thanks for reporting the issue! -Andrew 2015-07-22

Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-07-22 Thread Eugene Morozov
Hi, I’m stuck with the same issue, but I see org.apache.hadoop.fs.s3native.NativeS3FileSystem in hadoop-core:1.0.4 (that’s the current hadoop-client I use), and so far it is a transitive dependency that comes from spark itself. I’m using a custom build of spark 1.3.1 with hadoop-client 1.0.4.

Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
Hi, It would be whatever's left in the JVM. This is not explicitly controlled by a fraction like storage or shuffle. However, the computation usually doesn't need to use that much space. In my experience it's almost always the caching or the aggregation during shuffles that's the most memory

Re: Does Spark streaming support is there with RabbitMQ

2015-07-22 Thread Abel Rincón
Hi, We tested this receiver internally in Stratio Sparkta, and it works fine. If you try the receiver, we're open to your collaboration; your issues will be welcome. Regards A.Rincón Stratio software architect 2015-07-22 8:15 GMT+02:00 Tathagata Das t...@databricks.com: You could

Re: many-to-many join

2015-07-22 Thread Sonal Goyal
If I understand this correctly, you could join area_code_user and area_code_state and then flat map to get user, areacode, state. Then groupby/reduce by user. You can also try some join optimizations like partitioning on area code or broadcasting the smaller table, depending on the size of

Re: Spark spark.shuffle.memoryFraction has no effect

2015-07-22 Thread Andrew Or
Hi, The setting of 0.2 / 0.6 looks reasonable to me. Since you are not using caching at all, have you tried something more extreme, like 0.1 / 0.9? Since disabling spark.shuffle.spill didn't cause an OOM, this setting should be fine. Also, one thing you could do is to verify the shuffle

Re: How to share a Map among RDDS?

2015-07-22 Thread Andrew Or
Hi Dan, If the map is small enough, you can just broadcast it, can't you? It doesn't have to be an RDD. Here's an example of broadcasting an array and using it on the executors:

Re: How to restart Twitter spark stream

2015-07-22 Thread Akhil Das
That was pseudo code; a working version would look like this: val stream = TwitterUtils.createStream(ssc, None) val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#"))).map(x => (x.toLowerCase,1)) val topCounts10 = hashTags.map((_,
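A fuller sketch along the lines of the standard TwitterPopularTags example, assuming `ssc` is an already-created StreamingContext; the window length and the final print are illustrative:

    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.twitter.TwitterUtils

    val stream = TwitterUtils.createStream(ssc, None)
    val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))
    val topCounts10 = hashTags.map(tag => (tag.toLowerCase, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(10))       // counts per tag over a 10s window
      .map { case (tag, count) => (count, tag) }
      .transform(_.sortByKey(false))                  // most popular tags first
    topCounts10.print()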

Mesos + Spark

2015-07-22 Thread boci
Hi guys! I'm new to mesos. I have two spark applications (one streaming and one batch). I want to run both apps in a mesos cluster. Now for testing I want to run in a docker container, so I started a simple redjack/mesos-master, but I think a lot of things are unclear for me (both mesos and spark-mesos).

Re: user threads in executors

2015-07-22 Thread Tathagata Das
Yes, you could unroll from the iterator in batches of 100-200 and then post them in multiple rounds. If you are using the Kafka receiver based approach (not Direct), then the raw Kafka data is stored in the executor memory. If you are using Direct Kafka, then it is read from Kafka directly at the

Re: spark streaming 1.3 issues

2015-07-22 Thread Tathagata Das
For Java, do OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges(); If you fix that error, you should be seeing data. You can call arbitrary RDD operations on a DStream, using DStream.transform. Take a look at the docs. For the direct kafka approach you are doing, - tasks

Re: spark streaming 1.3 coalesce on kafkadirectstream

2015-07-22 Thread Tathagata Das
With DirectKafkaStream there are two approaches. 1. you increase the number of Kafka partitions; Spark will automatically read in parallel 2. if that's not possible, then explicitly repartition only if there are more cores in the cluster than the number of Kafka partitions, AND the first map-like

Re: Does Spark streaming support is there with RabbitMQ

2015-07-22 Thread Tathagata Das
You could contact the authors of the spark-packages.. maybe that will help? On Mon, Jul 20, 2015 at 6:41 AM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks Todd, I m not sure whether somebody has used it or not. can somebody confirm if this integrate nicely with Spark streaming? On

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Cheng Lian
Since Hive doesn’t support schema evolution, you’ll have to update the schema stored in the metastore somehow. For example, you can create a new external table with the merged schema. Say you have a Hive table t1: CREATE TABLE t1 (c0 INT, c1 DOUBLE); By default, this table is stored in HDFS

Re: many-to-many join

2015-07-22 Thread ayan guha
Hi, RDD solution:

  u = [(615,1),(720,1),(615,2)]
  urdd = sc.parallelize(u,1)
  a1 = [(615,'T'),(720,'C')]
  ardd = sc.parallelize(a1,1)
  def addString(s1,s2):
      return s1+','+s2
  j = urdd.join(ardd).map(lambda t:t[1]).reduceByKey(addString)
  print j.collect()
  [(2, 'T'), (1, 'C,T')]

However, if

Scaling spark cluster for a running application

2015-07-22 Thread phagunbaya
I have a spark cluster running in client mode with the driver outside the spark cluster. I want to scale the cluster after an application is submitted. In order to do this, I'm creating new workers and they are getting registered with the master, but the issue I'm seeing is that the running application does not use

Re: Is spark suitable for real time query

2015-07-22 Thread Robin East
Real-time is, of course, relative but you’ve mentioned microsecond level. Spark is designed to process large amounts of data in a distributed fashion. No distributed system I know of could give any kind of guarantees at the microsecond level. Robin On 22 Jul 2015, at 11:14, Louis Hust

Proper saving/loading of MatrixFactorizationModel

2015-07-22 Thread PShestov
Hi all! I have a MatrixFactorizationModel object. If I try to recommend products to a single user right after constructing the model through ALS.train(...), then it takes 300ms (for my data and hardware). But if I save the model to disk and load it back, then recommendation takes almost 2000ms. Also Spark

Is spark suitable for real time query

2015-07-22 Thread Louis Hust
Hi all, I am using a spark jar in standalone mode, fetching data from different mysql instances and doing some actions, but I found the time is at the second level. So I want to know if a spark job is suitable for real time queries which need microseconds?

Re: Broadcast variables in R

2015-07-22 Thread FRANCHOIS Serge
Thank you very much Shivaram. I’ve got it working on Mac now by specifying the namespace, using SparkR:::parallelize() instead of just parallelize(). Wkr, Serge On 21 Jul 2015, at 17:20, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: There shouldn't be

Re: Is spark suitable for real time query

2015-07-22 Thread Louis Hust
I did a simple test using spark in standalone mode (not cluster), and found a simple action takes a few seconds; the data size is small, just a few rows. So each spark job will cost some time for init or prepare work no matter what the job is? I mean the basic framework of a spark job will cost

Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
Hi All, I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do complex query analysis on this data. Queries like AND queries involving multiple fields. So my question is in which format I should store the data in HDFS so that processing will be fast for such kinds of queries?

Re: How to build Spark with my own version of Hadoop?

2015-07-22 Thread jay vyas
As you know, the hadoop versions and so on are available in the spark build files; iirc the top level pom.xml has all the maven variables for versions. So I think if you just build hadoop locally (i.e. version it as 2.2.1234-SNAPSHOT and mvn install it), you should be able to change the

Re: Is spark suitable for real time query

2015-07-22 Thread Igor Berman
you can use the spark rest job server (or any other solution that provides a long running spark context) so that you won't pay this bootstrap time on each query. In addition: if you have some rdd that you want your queries to be executed on, you can cache this rdd in memory (depends on your cluster memory

java.lang.IllegalArgumentException: problem reading type: type = group, name = param, original type = null

2015-07-22 Thread SkyFox
Hello! I don't understand why, but I can't read data from my parquet file. I made a parquet file from a json file and read it into a data frame: df.printSchema() |-- param: struct (nullable = true) | |-- FORM: string (nullable = true) | |-- URL: string (nullable = true) When I try to read

No suitable driver found for jdbc:mysql://

2015-07-22 Thread roni
Hi All, I have a cluster with spark 1.4. I am trying to save data to mysql but getting the error: Exception in thread "main" java.sql.SQLException: No suitable driver found for jdbc:mysql://.rds.amazonaws.com:3306/DAE_kmer?user=password= I looked at https://issues.apache.org/jira/browse/SPARK-8463

Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
Can you provide an example of an AND query? If you just do look-ups you should try HBase/Phoenix; otherwise you can try ORC with storage index and/or compression, but this depends on what your queries look like. On Wed, 22 Jul 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com wrote: HI

Re: Performance issue with Spark's foreachPartition method

2015-07-22 Thread diplomatic Guru
Thanks Robin for your reply. I'm pretty sure that writing to Oracle is what's taking longer, as writing to HDFS only takes ~5 minutes. The job is writing about ~5 million records. I've set the job to call executeBatch() when the batch size reaches 200,000 records, so I assume that commit
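A sketch of the foreachPartition batching being described, with the flush/commit size made tunable; the JDBC URL, table, column types and placeholder RDD are all assumptions, not the original job's values:

    import java.sql.DriverManager

    val batchSize = 10000 // illustrative; the job described flushes at 200,000
    val results = sc.parallelize(Seq(("k1", 1L), ("k2", 2L))) // placeholder for the real output

    results.foreachPartition { rows =>
      // One connection and prepared statement per partition
      val conn = DriverManager.getConnection("jdbc:oracle:thin:@//dbhost:1521/SERVICE", "user", "pass")
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
      var n = 0
      rows.foreach { case (k, v) =>
        stmt.setString(1, k)
        stmt.setLong(2, v)
        stmt.addBatch()
        n += 1
        if (n % batchSize == 0) { stmt.executeBatch(); conn.commit() } // flush in chunks
      }
      stmt.executeBatch()
      conn.commit()
      stmt.close()
      conn.close()
    }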

ShuffledHashJoin instead of CartesianProduct

2015-07-22 Thread Srikanth
Hello, I'm trying to link records from two large data sources. Both datasets have almost the same number of rows. The goal is to match records based on multiple columns. val matchId = SFAccountDF.as("SF").join(ELAccountDF.as("EL")).where($"SF.Email" === $"EL.EmailAddress" || $"SF.Phone" === $"EL.Phone")

What if request cores are not satisfied

2015-07-22 Thread bit1...@163.com
Hi, Assume the following scenario: the spark standalone cluster has 10 cores in total, and I have an application that will request 12 cores. Will the application run with fewer cores than requested, or will it simply wait forever since there are only 10 cores available? I would guess it will be run

Hive Session gets overwritten in ClientWrapper

2015-07-22 Thread Vishak
I'm currently using Spark 1.4 in standalone mode. I've forked the Apache Hive branch from https://github.com/pwendell/hive and customised it in the following way: added a thread local variable in the SessionManager class. And I'm setting the session variable in my

Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
I do not think you can put all your queries into the row key without duplicating the data for each query. However, this would be more of a last resort. Have you checked out Phoenix for HBase? This might suit your needs. It makes things much simpler, because it provides SQL on top of HBase. Nevertheless,

Re: Need help in setting up spark cluster

2015-07-22 Thread Jeetendra Gangele
Can anybody help here? On 22 July 2015 at 10:38, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, I am trying to capture user activities for a real estate portal. I am using a RabbitMQ and Spark streaming combination where I push all the events to RabbitMQ and then 1 sec micro

Re: Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
Queries will be something like: 1. how many users visited 1 BHK flats in the last 1 hour in a given area 2. how many visitors for flats in a given area 3. list all users who bought a given property in the last 30 days. Further, it may get more complex, involving multiple parameters in my query. The

Re: Package Release Announcement: Spark SQL on HBase Astro

2015-07-22 Thread Debasish Das
Does it also support insert operations? On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote: We are happy to announce the availability of the Spark SQL on HBase 1.0.0 release. http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase The main features in this

Issue with column named count in a DataFrame

2015-07-22 Thread Young, Matthew T
I'm trying to do some simple counting and aggregation in an IPython notebook with Spark 1.4.0 and I have encountered behavior that looks like a bug. When I try to filter rows out of an RDD with a column name of count I get a large error message. I would just avoid naming things count, except

RE: Need help in SparkSQL

2015-07-22 Thread Mohammed Guller
Parquet Mohammed From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Wednesday, July 22, 2015 5:48 AM To: user Subject: Need help in SparkSQL HI All, I have data in MongoDb(few TBs) which I want to migrate to HDFS to do complex queries analysis on this data.Queries like AND queries
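For example, a sketch of landing the exported data as Parquet on HDFS and querying it with Spark SQL; the paths, table and column names are illustrative, and `df` stands for the DataFrame built from the MongoDB export:

    df.write.parquet("hdfs:///data/portal_events")

    val events = sqlContext.read.parquet("hdfs:///data/portal_events")
    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events WHERE area = 'X' AND bhk = 1").show()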

spark-submit and spark-shell behaviors mismatch.

2015-07-22 Thread Dan Dong
Hi, I have a simple test spark program as below. The strange thing is that it runs well under spark-shell, but gets a runtime error of java.lang.NoSuchMethodError: in spark-submit, which indicates that the line val maps2 = maps.collect.toMap has a problem. But why does the compilation have no

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread Haoyuan Li
Yes. Tachyon can handle this well: http://tachyon-project.org/ Best, Haoyuan On Wed, Jul 22, 2015 at 10:56 AM, swetha swethakasire...@gmail.com wrote: Hi, We have a requirement wherein we need to keep RDDs in memory between Spark batch processing that happens every one hour. The idea here

Using Wurfl in Spark

2015-07-22 Thread Zhongxiao Ma
Hi all, I am trying to do WURFL lookups in a spark cluster and getting exceptions. I am pretty sure that the same thing works at small scale, but it fails when I try to do it in spark. I already used spark-ec2/copy-dir to copy the wurfl library to the workers and launched the spark-shell with

Package Release Announcement: Spark SQL on HBase Astro

2015-07-22 Thread Bing Xiao (Bing)
We are happy to announce the availability of the Spark SQL on HBase 1.0.0 release. http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase The main features in this package, dubbed Astro, include: * Systematic and powerful handling of data pruning and intelligent scan, based

Re: No suitable driver found for jdbc:mysql://

2015-07-22 Thread Rishi Yadav
try setting --driver-class-path On Wed, Jul 22, 2015 at 3:45 PM, roni roni.epi...@gmail.com wrote: Hi All, I have a cluster with spark 1.4. I am trying to save data to mysql but getting error Exception in thread main java.sql.SQLException: No suitable driver found for
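A sketch of what that looks like on the command line; the connector jar path and the class/jar names are placeholders:

    spark-submit \
      --driver-class-path /path/to/mysql-connector-java.jar \
      --jars /path/to/mysql-connector-java.jar \
      --class com.example.SaveToMySql app.jar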

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-22 Thread Yana Kadiyska
Is it complaining about collect or toMap? In either case this error is usually indicative of an old version -- any chance you have an old installation of Spark somehow? Or scala? You can try running spark-submit with --verbose. Also, when you say it runs with spark-shell, do you run spark-shell in

Re: Spark SQL Table Caching

2015-07-22 Thread Pedro Rodriguez
I would be interested in the answer to this question, plus the relationship between those and registerTempTable() Pedro On Tue, Jul 21, 2015 at 1:59 PM, Brandon White bwwintheho...@gmail.com wrote: A few questions about caching a table in Spark SQL. 1) Is there any difference between caching
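A small sketch of how the two relate, assuming the Spark 1.4 SQLContext API and an illustrative table name: registerTempTable only gives the DataFrame a name, while cacheTable (or CACHE TABLE in SQL) is what puts it into the in-memory columnar cache:

    df.registerTempTable("events")
    sqlContext.cacheTable("events")            // or sqlContext.sql("CACHE TABLE events")
    sqlContext.sql("SELECT COUNT(*) FROM events").show()
    sqlContext.uncacheTable("events")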

Re: Issue with column named count in a DataFrame

2015-07-22 Thread Michael Armbrust
Additionally have you tried enclosing count in `backticks`? On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust mich...@databricks.com wrote: I believe this will be fixed in Spark 1.5 https://github.com/apache/spark/pull/7237 On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T
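A sketch of both workarounds, assuming a DataFrame df that has a column literally named count:

    // Escape the name inside expression strings...
    df.filter("`count` > 10").show()
    // ...or sidestep the parser entirely with the Column API
    df.filter(df("count") > 10).show()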

Re: Issue with column named count in a DataFrame

2015-07-22 Thread Michael Armbrust
I believe this will be fixed in Spark 1.5 https://github.com/apache/spark/pull/7237 On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T matthew.t.yo...@intel.com wrote: I'm trying to do some simple counting and aggregation in an IPython notebook with Spark 1.4.0 and I have encountered behavior

Re: assertion failed error with GraphX

2015-07-22 Thread Roman Sokolov
I am also having problems with triangle count - it seems like this algorithm is very memory consuming (I could not process even small graphs, ~5 million vertices and 70 million edges, with less than 32 GB RAM on EACH machine). What if I have graphs with a billion edges, what amount of RAM do I need then?

Re: Parquet problems

2015-07-22 Thread Michael Misiewicz
Hi Anders, Did you ever get to the bottom of this issue? I'm encountering it too, but only in yarn-cluster mode running on spark 1.4.0. I was thinking of trying 1.4.1 today. Michael On Thu, Jun 25, 2015 at 5:52 AM, Anders Arpteg arp...@spotify.com wrote: Yes, both the driver and the

Re: Mesos + Spark

2015-07-22 Thread Dean Wampler
This page, http://spark.apache.org/docs/latest/running-on-mesos.html, covers many of these questions. If you submit a job with the option --supervise, it will be restarted if it fails. You can use Chronos for scheduling. You can create a single streaming job with a 10 minute batch interval, if

Re: Spark-hive parquet schema evolution

2015-07-22 Thread Dean Wampler
While it's not recommended to overwrite files Hive thinks it understands, you can add the column to Hive's metastore using an ALTER TABLE command, using HiveQL in the Hive shell or using HiveContext.sql(): ALTER TABLE mytable ADD COLUMNS (col_name data_type) See
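A sketch of issuing that statement from Spark rather than the Hive shell; the table and column names are illustrative:

    // HiveQL expects the new columns in parentheses
    hiveContext.sql("ALTER TABLE mytable ADD COLUMNS (new_col STRING)")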

Re: Is spark suitable for real time query

2015-07-22 Thread Louis Hust
My code is like below: Map<String, String> t11opt = new HashMap<String, String>(); t11opt.put("url", DB_URL); t11opt.put("dbtable", "t11"); DataFrame t11 = sqlContext.load("jdbc", t11opt); t11.registerTempTable("t11"); ...the same for

Re: use S3-Compatible Storage with spark

2015-07-22 Thread Schmirr Wurst
I could get a little further: - installed spark-1.4.1-without-hadoop - unpacked hadoop 2.7.1 - added the following in spark-env.sh: HADOOP_HOME=/opt/hadoop-2.7.1/

Re: Scaling spark cluster for a running application

2015-07-22 Thread Romi Kuntsman
Are you running the Spark cluster in standalone or YARN? In standalone, the application gets the available resources when it starts. With YARN, you can try to turn on the setting *spark.dynamicAllocation.enabled* See https://spark.apache.org/docs/latest/configuration.html On Wed, Jul 22, 2015 at
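A sketch of the relevant settings in spark-defaults.conf style; the executor counts are placeholders, and dynamic allocation also requires the external shuffle service:

    spark.dynamicAllocation.enabled        true
    spark.dynamicAllocation.minExecutors   2
    spark.dynamicAllocation.maxExecutors   20
    spark.shuffle.service.enabled          true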

RE: Is spark suitable for real time query

2015-07-22 Thread Paolo Platter
Are you using the jdbc server? Paolo Sent from my Windows Phone From: Louis Hust mailto:louis.h...@gmail.com Sent: 22/07/2015 13:47 To: Robin East mailto:robin.e...@xense.co.uk Cc: user@spark.apache.org mailto:user@spark.apache.org Subject: Re: Is spark suitable

Application metrics inseparable from Master metrics?

2015-07-22 Thread Romi Kuntsman
Hi, I tried to enable the Master metrics source (to get the number of running/waiting applications etc) and connected it to Graphite. However, when these are enabled, application metrics are also sent. Is it possible to separate them and send only master metrics, without applications? I see that