Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread ayan guha
You can use a custom partitioner to redistribute the data, using partitionBy. On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote: I'm currently trying to join two large tables (order 1B rows each) using Spark SQL (1.3.0) and am running into long GC pauses which bring the job to a halt. I'm
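
A minimal sketch of the suggestion above, i.e. re-keying the data and calling partitionBy with a custom Partitioner (the class and names here are illustrative, not from the thread):

    import org.apache.spark.Partitioner

    // hypothetical partitioner: spread keys across a fixed number of buckets by hash
    class ModPartitioner(override val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = {
        val h = key.hashCode % numPartitions
        if (h < 0) h + numPartitions else h
      }
    }

    // partitionBy is only available on pair RDDs, so key the rows first
    val redistributed = keyedRdd.partitionBy(new ModPartitioner(1200))   // keyedRdd: RDD[(K, V)]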

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Nick Travers
Could you be more specific in how this is done? A DataFrame class doesn't have that method. On Sun, May 3, 2015 at 11:07 PM, ayan guha guha.a...@gmail.com wrote: You can use custom partitioner to redistribution using partitionby On 4 May 2015 15:37, Nick Travers n.e.trav...@gmail.com wrote:

spark log analyzer sample

2015-05-04 Thread anshu shukla
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4. I am not using any Hadoop facility (not even HDFS), so why is it giving this error? -- Thanks & Regards, Anshu Shukla

Spark job concurrency problem

2015-05-04 Thread Xi Shen
Hi, I have two small RDDs, each with about 600 records. In my code, I did val rdd1 = sc...cache() val rdd2 = sc...cache() val result = rdd1.cartesian(rdd2).repartition(num_cpu).map { case (a,b) => some_expensive_job(a,b) } I ran my job in a YARN cluster with --master yarn-cluster, I have 6

Spark Mongodb connection

2015-05-04 Thread Yasemin Kaya
Hi! I am new to Spark and I want to begin Spark with a simple wordCount example in Java. But I want to take my input from a MongoDB database. I want to learn how I can connect a MongoDB database to my project. Can anyone help with this issue? Have a nice day yasemin -- hiç ender hiç
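
One common approach at the time was the mongo-hadoop input format; a hedged Scala sketch (the URI, database and field names are placeholder assumptions, and the mongo-hadoop jar must be on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.bson.BSONObject
    import com.mongodb.hadoop.MongoInputFormat

    val mongoConf = new Configuration()
    mongoConf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection")

    // each record is (document id, BSONObject)
    val documents = sc.newAPIHadoopRDD(mongoConf, classOf[MongoInputFormat],
      classOf[Object], classOf[BSONObject])

    // word count over a hypothetical "text" field
    val counts = documents
      .flatMap { case (_, doc) => Option(doc.get("text")).map(_.toString).getOrElse("").split("\\s+") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)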

Re: Problem in Standalone Mode

2015-05-04 Thread Akhil Das
Can you paste the complete stacktrace? It looks like you are having a version incompatibility with Hadoop. Thanks Best Regards On Sat, May 2, 2015 at 4:36 PM, drarse drarse.a...@gmail.com wrote: When I run my program with Spark-Submit everything is ok. But when I try to run in standalone mode I

how to make sure data is partitioned across all workers?

2015-05-04 Thread shahab
Hi, Is there any way to force Spark to partition cached data across all worker nodes, so the data is not cached on only one of the worker nodes? best, /Shahab
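
A minimal sketch of the usual workaround: shuffle the data across the cluster before caching so the cached blocks land on all executors (rawRdd is a stand-in for the dataset in question):

    val spread = rawRdd.repartition(sc.defaultParallelism).cache()
    spread.count()   // force materialization so the blocks are actually cached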

Re: spark filestream problem

2015-05-04 Thread Akhil Das
With fileStream you can actually pass a filter parameter to avoid loading up .tmp files/directories. Also, when you move/rename a file, the file creation date doesn't change, and hence Spark won't detect them, I believe. Thanks Best Regards On Sat, May 2, 2015 at 9:37 PM, Evo Eftimov
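
A minimal sketch of the filter parameter mentioned above, assuming a text input directory:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // skip files that are still being written out as *.tmp
    def notTmp(path: Path): Boolean = !path.getName.endsWith(".tmp")

    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "/data/incoming", notTmp _, newFilesOnly = true)
      .map { case (_, text) => text.toString }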

Re: Remoting warning when submitting to cluster

2015-05-04 Thread Akhil Das
Looks like a version incompatibility, just make sure you have the proper version of spark. Also look further in the stacktrace what is causing Futures timed out (it could be a network issue also if the ports aren't opened properly) Thanks Best Regards On Sat, May 2, 2015 at 12:04 AM,

Re: Hardware requirements

2015-05-04 Thread Akhil Das
500GB of data will have nearly 3900 partitions, and if you can have nearly that many cores and around 500GB of memory then things will be lightning fast. :) Thanks Best Regards On Sun, May 3, 2015 at 12:49 PM, sherine ahmed sherine.sha...@hotmail.com wrote: I need to use spark to

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Hello Dean & others, Thanks for the response. I tried with 100, 200, 400, 600 and 1200 repartitions with 100, 200, 400 and 800 executors. Each time all the tasks of the join complete in less than a minute except one, and that one task runs forever. I have a huge cluster at my disposal. The data for each

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
IMHO if your data or your algorithm is prone to data skew, I think you have to fix this at the application level; Spark itself cannot overcome this problem (if one key has a large amount of values). You may change your algorithm to choose another shuffle key, something like this to avoid shuffle on
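
One common way to change the shuffle key for a skewed aggregation is to "salt" it; a hedged sketch (pairs stands in for a skewed RDD[(K, Long)]; skewed joins need a different fix, such as replicating the smaller side):

    import scala.util.Random

    // spread a hot key over up to 100 reducers, aggregate, then drop the salt and finish
    val salted  = pairs.map { case (k, v) => ((k, Random.nextInt(100)), v) }
    val partial = salted.reduceByKey(_ + _)
    val result  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)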

Re: Hardware requirements

2015-05-04 Thread ayan guha
Hi How do you figure out 500gig~3900 partitions? I am trying to do the math. If I assume 64mb block size then 1G~16 blocks and 500g~8000 blocks. If we assume split and block sizes are same, shouldn't we end up with 8k partitions? On 4 May 2015 17:49, Akhil Das ak...@sigmoidanalytics.com wrote:

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Hello Shao, Can you talk more about shuffle key or point me to APIs that allow me to change shuffle key. I will try with different keys and see the performance. What is the shuffle key by default ? On Mon, May 4, 2015 at 2:37 PM, Saisai Shao sai.sai.s...@gmail.com wrote: IMHO If your data or

Re: Hardware requirements

2015-05-04 Thread Akhil Das
Assume your block size is 128MB. Thanks Best Regards On Mon, May 4, 2015 at 2:38 PM, ayan guha guha.a...@gmail.com wrote: Hi How do you figure out 500gig~3900 partitions? I am trying to do the math. If I assume 64mb block size then 1G~16 blocks and 500g~8000 blocks. If we assume split and
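
The arithmetic behind the two figures, for reference:

    // 64 MB blocks:  500 * 1024 / 64  = 8000 partitions (ayan's estimate)
    // 128 MB blocks: 500 * 1024 / 128 = 4000 partitions (roughly the ~3900 quoted earlier)
    val partitions = (500L * 1024) / 128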

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
The shuffle key depends on your implementation. I'm not sure if you are familiar with MapReduce: the mapper output is a key-value pair, where the key is the shuffle key used for shuffling; Spark is the same. 2015-05-04 17:31 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com: Hello Shao, Can you

Re: Map-Side Join in Spark

2015-05-04 Thread ๏̯͡๏
This is how I implemented the map-side join using broadcast. val listings = DataUtil.getDwLstgItem(sc, DateUtil.addDaysToDate(startDate, -89)) val viEvents = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) } val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg)
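
A hedged, generic version of the same broadcast-based map-side join (smallPairs and largePairs are assumed RDD[(Long, ...)]s; the smaller side must fit in memory on the driver and each executor):

    val smallMap = sc.broadcast(smallPairs.collectAsMap())

    // inner join performed map-side: look each key up in the broadcast map
    val joined = largePairs.flatMap { case (id, left) =>
      smallMap.value.get(id).map(right => (id, (left, right)))
    }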

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
One dataset (RDD pair): val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) } Second dataset (RDD pair): val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) } As I want to join based on the item ID that is used as the first element in the tuple in both cases and I

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Four tasks are now failing. One row from the Spark UI task table (columns: Index, ID, Attempt, Status, Locality Level, Executor ID / Host, Launch Time, Duration, GC Time, Shuffle Read Size / Records, Shuffle Spill (Memory), Shuffle Spill (Disk), Errors): 0 3771 0 FAILED PROCESS_LOCAL 114 / host1 2015/05/04 01:27:44 / ExecutorLostFailure (executor 114 lost)

Re: Unusual filter behaviour on RDD

2015-05-04 Thread fawadalam
I got it working. When I was persisting, it only persisted 85% of the RDD to memory and the rest of the RDD was recomputed every time. Because my flagged RDD uses a random method to create the field, I was getting unpredictable results. When I persist using:
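
The message is cut off here, but the fix it refers to is presumably a disk-backed storage level, so the portion that does not fit in memory is spilled instead of recomputed; a sketch under that assumption (the flag logic is illustrative):

    import org.apache.spark.storage.StorageLevel

    val flagged = rows.map(r => (r, math.random < 0.5))
      .persist(StorageLevel.MEMORY_AND_DISK)   // spilled partitions keep their random flags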

mapping JavaRDD to jdbc DataFrame

2015-05-04 Thread Lior Chaga
Hi, I'd like to use a JavaRDD containing parameters for an SQL query, and use SparkSQL jdbc to load data from mySQL. Consider the following pseudo code: JavaRDDString namesRdd = ... ; ... options.put(url, jdbc:mysql://mysql?user=usr); options.put(password, pass); options.put(dbtable, (SELECT *

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
From the symptoms you mentioned (one task's shuffle write is much larger than all the other tasks'), it is quite similar to normal data skew behavior. I just gave some advice based on your descriptions; I think you need to detect whether the data is actually skewed or not. The shuffle will put data
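
A quick way to check for skew, using the viEvents pair RDD discussed in this thread: count records per key and look at the heaviest keys:

    val heaviest = viEvents
      .mapValues(_ => 1L)
      .reduceByKey(_ + _)
      .top(20)(Ordering.by(_._2))   // the 20 keys with the most records
    heaviest.foreach(println)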

Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Olivier Girardot
Hi, What is your Spark version? Regards, Olivier. On Mon, 4 May 2015 at 11:03, luohui20...@sina.com wrote: hi guys, when I am running a SQL like select a.name,a.startpoint,a.endpoint, a.piece from db a join sample b on (a.name = b.name) where (b.startpoint a.startpoint + 25); I

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
I ran it against one file instead of 10 files and I see one task is still running after 33 mins; its shuffle read size is 780MB / 50 million records. I did a count of records for each itemId from dataset-2 [One FILE] (Second dataset (RDD pair): val viEvents = viEventsRaw.map { vi => (vi.get(14

Re: Custom Partitioning Spark

2015-05-04 Thread ๏̯͡๏
Why do you use a custom partitioner? Are you doing a join? And, can you share some code that shows how you implemented the custom partitioner? On Tue, Apr 21, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote: Are you looking for mapPartitions(func)? Similar to map, but runs separately on

Re: Support for skewed joins in Spark

2015-05-04 Thread ๏̯͡๏
Hello Soila, Can you share the code that shows usage of RangePartitioner? I am facing an issue with .join() where one task runs forever. I tried repartition(100/200/300/1200) and it did not help. I cannot use a map-side join because both datasets are huge and beyond the driver memory size. Regards, Deepak

spark 1.3.1

2015-05-04 Thread Saurabh Gupta
HI, I am trying to build an example code given at https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds code is: // Import factory methods provided by DataType. import org.apache.spark.sql.types.DataType; // Import StructType and StructField import

Difference ?

2015-05-04 Thread ๏̯͡๏
Datasets: val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) } val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) } What is the difference between 1) lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions = 1200, rdd = viEvents)).map

MLLib SVM probability

2015-05-04 Thread Robert Musters
Hi all, I am trying to understand the output of the SVM classifier. Right now, my output looks like this: -18.841544889249917 0.0 168.32916035523283 1.0 420.67763915879794 1.0 -974.1942589201286 0.0 71.73602841256813 1.0 233.13636224524993 1.0 -1000.5902168199027 0.0 The

Re: spark 1.3.1

2015-05-04 Thread Driesprong, Fokko
Hi Saurabh, Did you check the log of maven? 2015-05-04 15:17 GMT+02:00 Saurabh Gupta saurabh.gu...@semusi.com: HI, I am trying to build a example code given at https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds code is: // Import factory methods

Re: Exiting driver main() method...

2015-05-04 Thread James Carman
I think I figured it out. I am playing around with the Cassandra connector and I had a method that inserted some data into a locally-running Cassandra instance, but I forgot to close the Cluster object. I guess that left some non-daemon thread running and kept the process from exiting. Nothing

Re: MLLib SVM probability

2015-05-04 Thread Driesprong, Fokko
Hi Robert, I would say that the sign of the number represents the class of the input vector. What kind of data are you using, and what kind of training set do you use? Fundamentally an SVM is able to separate only two classes; you can do one vs. the rest as you mentioned. I don't see how LVQ can
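
For the raw scores shown earlier in the thread: MLlib's SVMModel predicts 0.0/1.0 labels by default, and clearThreshold() switches it to returning the raw margin; it does not produce a calibrated probability. A minimal sketch (trainingData and testData are assumed RDD[LabeledPoint]s):

    import org.apache.spark.mllib.classification.SVMWithSGD

    val model = SVMWithSGD.train(trainingData, 100)   // 100 iterations
    model.clearThreshold()                            // predict() now returns the raw score
    val scoresAndLabels = testData.map(p => (model.predict(p.features), p.label))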

How to deal with code that runs before foreach block in Apache Spark?

2015-05-04 Thread Emre Sevinc
I'm trying to deal with some code that runs differently on Spark stand-alone mode and Spark running on a cluster. Basically, for each item in an RDD, I'm trying to add it to a list, and once this is done, I want to send this list to Solr. This works perfectly fine when I run the following code in

Re: spark 1.3.1

2015-05-04 Thread Saurabh Gupta
I am really new to this, but what should I look for in the maven logs? I have tried mvn package -X -e. Should I show the full trace? On Mon, May 4, 2015 at 6:54 PM, Driesprong, Fokko fo...@driesprong.frl wrote: Hi Saurabh, Did you check the log of maven? 2015-05-04 15:17 GMT+02:00 Saurabh Gupta

Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-04 Thread Gerard Maas
I'm not familiar with the Solr API but provided that ' SolrIndexerDriver' is a singleton, I guess that what's going on when running on a cluster is that the call to: SolrIndexerDriver.solrInputDocumentList.add(elem) is happening on different singleton instances of the SolrIndexerDriver on
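
A hedged sketch of the usual fix: build and send the documents per partition on the executors instead of accumulating into a driver-side singleton (newSolrClient and toDoc are hypothetical helpers standing in for the poster's SolrIndexerDriver code):

    import org.apache.solr.common.SolrInputDocument

    rdd.foreachPartition { elems =>
      val server = newSolrClient()                     // hypothetical: one client per partition
      val docs = new java.util.ArrayList[SolrInputDocument]()
      elems.foreach(e => docs.add(toDoc(e)))           // hypothetical converter to SolrInputDocument
      if (!docs.isEmpty) server.add(docs)              // send the whole batch for this partition
    }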

Re: mapping JavaRDD to jdbc DataFrame

2015-05-04 Thread ayan guha
You can use applySchema On Mon, May 4, 2015 at 10:16 PM, Lior Chaga lio...@taboola.com wrote: Hi, I'd like to use a JavaRDD containing parameters for an SQL query, and use SparkSQL jdbc to load data from mySQL. Consider the following pseudo code: JavaRDDString namesRdd = ... ; ...
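
A minimal Scala sketch of applySchema (Spark 1.3, since superseded by createDataFrame), assuming namesRdd is an RDD of strings and sqlContext is in scope:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val rowRdd = namesRdd.map(name => Row(name))
    val schema = StructType(Seq(StructField("name", StringType, nullable = true)))
    val namesDf = sqlContext.applySchema(rowRdd, schema)
    namesDf.registerTempTable("names")   // can now be referenced from SQL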

Troubling Logging w/Simple Example (spark-1.2.2-bin-hadoop2.4)...

2015-05-04 Thread James Carman
I have the following simple example program: public class SimpleCount { public static void main(String[] args) { final String master = System.getProperty("spark.master", "local[*]"); System.out.printf("Running job against spark master %s ...%n", master); final SparkConf

java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-04 Thread shahab
Hi, I am getting a "No space left on device" exception when repartitioning approx. 285 MB of data while there is still 2 GB of space left?? Does it mean that repartitioning needs more space (more than 2 GB) for repartitioning 285 MB of data?? best, /Shahab java.io.IOException: No

Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread ayan guha
Are you using only 1 executor? On Mon, May 4, 2015 at 11:07 PM, luohui20...@sina.com wrote: hi Olivier spark1.3.1, with java1.8.0.45 and add 2 pics . it seems like a GC issue. I also tried with different parameters like memory size of driverexecutor, memory fraction, java opts... but

Re: spark 1.3.1

2015-05-04 Thread Deng Ching-Mallete
Hi, I think you need to import org.apache.spark.sql.types.DataTypes instead of org.apache.spark.sql.types.DataType and use that instead to access the StringType.. HTH, Deng On Mon, May 4, 2015 at 9:37 PM, Saurabh Gupta saurabh.gu...@semusi.com wrote: I am really new to this but what should I

Re: Spark Mongodb connection

2015-05-04 Thread Gaspar Muñoz
Hi Yasemin, You can find here a MongoDB connector for Spark SQL: http://github.com/Stratio/spark-mongodb Best regards 2015-05-04 9:27 GMT+02:00 Yasemin Kaya godo...@gmail.com: Hi! I am new at Spark and I want to begin Spark with simple wordCount example in Java. But I want to give my input

Re: java.io.IOException: No space left on device while doing repartitioning in Spark

2015-05-04 Thread Ted Yu
See https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available What's the value for spark.local.dir property ? Cheers On Mon, May 4, 2015 at 6:57 AM, shahab shahab.mok...@gmail.com wrote: Hi, I am getting No space left on device exception
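
Shuffle and repartition spills go to spark.local.dir (default /tmp), so the usual fix is to point it at a volume with enough free space and inodes; a minimal sketch, with an example path:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("repartition-job")
      .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")   // example path, not from the thread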

Re: empty jdbc RDD in spark

2015-05-04 Thread Cody Koeninger
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD The arguments are sql string, lower bound, upper bound, number of partitions. Your call SELECT * FROM MEMBERS LIMIT ? OFFSET ?, 0, 100, 1 would thus be run as SELECT * FROM MEMBERS LIMIT 0 OFFSET 100
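
A minimal JdbcRDD sketch along those lines: the SQL needs two '?' placeholders that Spark fills with the per-partition lower/upper bounds of a numeric key (the connection string and columns are assumptions):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val members = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://dbhost/db?user=usr&password=pass"),
      "SELECT * FROM MEMBERS WHERE id >= ? AND id <= ?",
      1, 100000, 4,                                       // lowerBound, upperBound, numPartitions
      r => (r.getLong("id"), r.getString("name")))        // hypothetical columns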

Re: SparkStream saveAsTextFiles()

2015-05-04 Thread anavidad
Hi, What kind of "can't find symbol" are you receiving? On the other hand, I would try to change the guava dependency version to 14.0.1. In Spark 1.3.0, the guava version is 14.0.1 but it is not included inside the spark artifact because it's marked as provided.

SparkSQL Nested structure

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to parse log files generated by Spark using SparkSQL. In the JSON elements related to the StageCompleted event we have a nested structure containing an array of elements with RDD Info (see the log below as an example, omitting some parts). { Event: SparkListenerStageCompleted,

Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Yi Zhang
I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and improve query performance, but I found it took as long as querying without LIMIT. It left me confused. Anybody know why? Thanks. Regards, Yi

Re: SparkStream saveAsTextFiles()

2015-05-04 Thread anavidad
Structure seems fine. You only need to add at the end of your program: ssc.start(); ssc.awaitTermination(); Check the method arguments. I advise you to check the Spark Java streaming API: https://spark.apache.org/docs/1.3.0/api/java/ Regards.

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Richard Marscher
In regards to the large GC pauses, assuming you allocated all 100GB of memory per worker you may consider running with less memory on your Worker nodes, or splitting up the available memory on the Worker nodes amongst several worker instances. The JVM's garbage collection starts to become very

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
What query are you running? It may be the case that your query requires PostgreSQL to do a large amount of work before identifying the first n rows. On 4 May 2015, at 15:52, Yi Zhang zhangy...@yahoo.com.INVALID wrote: I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
And a further question - have you tried running this query in psql? What's the performance like there? On 4 May 2015, at 16:04, Robin East robin.e...@xense.co.uk wrote: What query are you running? It may be the case that your query requires PostgreSQL to do a large amount of work before

Python Custom Partitioner

2015-05-04 Thread ayan guha
Hi Can someone share some working code for a custom partitioner in Python? I am trying to understand it better. Here is the documentation: partitionBy(numPartitions, partitionFunc=portable_hash)

Re: Python Custom Partitioner

2015-05-04 Thread ๏̯͡๏
I have implemented map-side join with broadcast variables and the code is on mailing list (scala). On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote: Hi Can someone share some working code for custom partitioner in python? I am trying to understand it better. Here is

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
I tried this: val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] = lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions = 1200, rdd = viEvents)).map { It fired two jobs and still I have 1 task that never completes. IndexIDAttemptStatusLocality

Parallelize foreach in PySpark with Spark Standalone

2015-05-04 Thread kdunn
Full disclosure, I am *brand* new to Spark. I am trying to use [Py]SparkSQL standalone to pre-process a bunch of *local* (non HDFS) Parquet files. I have several thousand files and want to dispatch as many workers as my machine can handle to process the data in parallel; either at the per-file

AJAX with Apache Spark

2015-05-04 Thread Sergio Jiménez Barrio
Hi, I am trying to create a dashboard for an Apache Spark job. I need to run Spark Streaming 24/7 and, when it receives an AJAX request, answer with the actual state of the job. I have created the client, and the program in Spark. I tried to create the response service with Play, but this runs the

com.datastax.spark % spark-streaming_2.10 % 1.1.0 in my build.sbt ??

2015-05-04 Thread Eric Ho
Can I specify this in my build file? Thanks.
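
A hedged build.sbt sketch: spark-streaming is published under org.apache.spark, while the DataStax Cassandra connector is a separate artifact (the 1.1.0 versions here are assumptions matching the subject line):

    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-streaming"           % "1.1.0" % "provided",
      "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0"
    )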

No logs from my cluster / worker ... (running DSE 4.6.1)

2015-05-04 Thread Eric Ho
I'm submitting this via 'dse spark-submit' but somehow, I don't see any loggings in my cluster or worker machines... How can I find out ? My cluster is running DSE 4.6.1 with Spark enabled. My source is running Kafka 0.8.2.0 I'm launching my program on one of my DSE machines. Any insights much

OOM error with GMMs on 4GB dataset

2015-05-04 Thread Vinay Muttineni
Hi, I am training a GMM with 10 gaussians on a 4 GB dataset (720,000 * 760). The spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also has 6GB. Spark config params: .set("spark.hadoop.validateOutputSpecs", "false").set("spark.dynamicAllocation.enabled",

spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-04 Thread 鹰
hi all, when I submit a spark-sql program to select data from my hive database I get an error like this: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my spark configuration? thanks

RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Wang, Daoyuan
You can use Explain extended select …. From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Tuesday, May 05, 2015 9:52 AM To: Cheng, Hao; Olivier Girardot; user Subject: 回复:RE: 回复:Re: sparksql running slow while joining_2_tables. As I know broadcastjoin is automatically enabled by

Re: Nightly builds/releases?

2015-05-04 Thread Guru Medasani
I see a Jira for this one, but unresolved. https://issues.apache.org/jira/browse/SPARK-1517 On May 4, 2015, at 10:25 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, Does anyone know if spark has any nightly builds or equivalent

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Can you print out the physical plan? EXPLAIN SELECT xxx… From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Monday, May 4, 2015 9:08 PM To: Olivier Girardot; user Subject: 回复:Re: sparksql running slow while joining 2 tables. hi Olivier spark1.3.1, with java1.8.0.45 and add 2 pics

Re: Kryo serialization of classes in additional jars

2015-05-04 Thread Akshat Aranya
Actually, after some digging, I did find a JIRA for it: SPARK-5470. The fix for this has gone into master, but it isn't in 1.2. On Mon, May 4, 2015 at 2:47 PM, Imran Rashid iras...@cloudera.com wrote: Oh, this seems like a real pain. You should file a jira, I didn't see an open issue -- if

sparksql support hive view

2015-05-04 Thread luohui20001
guys, just to confirm, does sparksql support the hive view feature? Is that the LateralView in the hive language manual? thanks Thanks & Best regards! 罗辉 San.Luo

Re: Nightly builds/releases?

2015-05-04 Thread Ted Yu
See this related thread: http://search-hadoop.com/m/JW1q5bnnyT1 Cheers On Mon, May 4, 2015 at 7:58 PM, Guru Medasani gdm...@gmail.com wrote: I see a Jira for this one, but unresolved. https://issues.apache.org/jira/browse/SPARK-1517 On May 4, 2015, at 10:25 PM, Ankur Chauhan

Re: Nightly builds/releases?

2015-05-04 Thread Ankur Chauhan
Hi, There is also a make-distribution.sh file in the repository root. If someone with jenkins access can create a simple builder that would be awesome. But I am guessing besides the spark binary one would also probably want the maven artifacts (lower priority though) to work with it. -- Ankur

Help with Spark SQL Hash Distribution

2015-05-04 Thread Mani
I am trying to distribute a table using a particular column which is the key that I’ll be using to perform join operations on the table. Is it possible to do this with Spark SQL? I checked the method partitionBy() for rdds. But not sure how to specify which column is the key? Can anyone give an
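
At the RDD level the join key is simply the first element of each pair, so one option is to key the rows by the join column and then partitionBy; a minimal sketch with a hypothetical row type:

    import org.apache.spark.HashPartitioner

    case class Purchase(customerId: Long, amount: Double)          // hypothetical row type
    val byKey  = purchases.map(p => (p.customerId, p))             // key = join column
    val hashed = byKey.partitionBy(new HashPartitioner(200)).persist()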

Re: sparksql support hive view

2015-05-04 Thread Michael Armbrust
We support both LATERAL VIEWs (a query language feature that lets you turn a single row into many rows, for example with an explode) and virtual views (a table that is really just a query that is run on demand). On Mon, May 4, 2015 at 7:12 PM, luohui20...@sina.com wrote: guys, just to

Re: Help with Spark SQL Hash Distribution

2015-05-04 Thread Michael Armbrust
If you do a join with at least one equality relationship between the two tables, Spark SQL will automatically hash partition the data and perform the join. If you are looking to prepartition the data, that information is not yet propagated from the in memory cached representation so won't help

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Yi Zhang
Robin, my query statement is as below: select id, name, trans_date, gender, hobby, job, country from Employees LIMIT 100. In PostgreSQL, it works very well. For 10M records in the DB, it just took less than 20ms, but in SparkSQL, it took a long time. Michael, got it. For me, it is not good news. Anyway,

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Data Set 1 : viEvents : Is the event activity data of 1 day. I took 10 files out of it and 10 records (Item ID / Count): 201335783004 3419, 191568402102 1793, 111654479898 1362, 181503913062 1310, 261798565828 1028, 111654493548 994, 231516683056 862

Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Cheng, Hao
I assume you’re using the DataFrame API within your application. sql(“SELECT…”).explain(true) From: Wang, Daoyuan Sent: Tuesday, May 5, 2015 10:16 AM To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user Subject: RE: 回复:RE: 回复:Re: sparksql running slow while joining_2_tables. You can use
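
A minimal sketch of both suggestions, using the table names from the thread:

    val plan = sqlContext.sql(
      "SELECT a.name FROM db a JOIN sample b ON a.name = b.name")
    plan.explain(true)   // prints the parsed, analyzed, optimized and physical plans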

Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
Starting the master with /sbin/start-master.sh creates a JVM with only 512MB of memory. How to change this default amount of memory? Thanks, Vijay

Re: No logs from my cluster / worker ... (running DSE 4.6.1)

2015-05-04 Thread Ted Yu
bq. its Spark libs are all at 2.10 Clarification: 2.10 is version of Scala Your Spark version is 1.1.0 You can use earlier release of Kafka. Cheers On Mon, May 4, 2015 at 2:39 PM, Eric Ho eric...@intel.com wrote: I still prefer to use Spark core / streaming /... at 2.10 becuase my DSE is at

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Koert Kuipers
shoot me an email if you need any help with spark-sorted. it does not (yet?) have a java api, so you will have to work in scala On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz brk...@gmail.com wrote: I think this Spark Package may be what you're looking for!

Re: Python Custom Partitioner

2015-05-04 Thread ayan guha
Thanks, but is there a non-broadcast solution? On 5 May 2015 01:34, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I have implemented map-side join with broadcast variables and the code is on the mailing list (scala). On Mon, May 4, 2015 at 8:38 PM, ayan guha guha.a...@gmail.com wrote: Hi Can

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Michael Armbrust
The JDBC interface for Spark SQL does not support pushing down limits today. On Mon, May 4, 2015 at 8:06 AM, Robin East robin.e...@xense.co.uk wrote: and a further question - have you tried running this query in pqsl? what’s the performance like there? On 4 May 2015, at 16:04, Robin East

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Michael Armbrust
If your data is evenly distributed (i.e. no skewed datapoints in your join keys), it can also help to increase spark.sql.shuffle.partitions (default is 200). On Mon, May 4, 2015 at 8:03 AM, Richard Marscher rmarsc...@localytics.com wrote: In regards to the large GC pauses, assuming you allocated
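
A minimal sketch of raising that setting before running the join:

    sqlContext.setConf("spark.sql.shuffle.partitions", "800")   // 800 is an example value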

Re: MLLib SVM probability

2015-05-04 Thread Joseph Bradley
Currently, SVMs don't have built-in multiclass support. Logistic Regression supports multiclass, as do trees and random forests. It would be great to add multiclass support for SVMs as well. There is some ongoing work on generic multiclass-to-binary reductions:

RE: Spark JVM default memory

2015-05-04 Thread Mohammed Guller
Did you confirm through the Spark UI how much memory is getting allocated to your application on each worker? Mohammed From: Vijayasarathy Kannan [mailto:kvi...@vt.edu] Sent: Monday, May 4, 2015 3:36 PM To: Andrew Ash Cc: user@spark.apache.org Subject: Re: Spark JVM default memory I am trying

Re: SparkSQL Nested structure

2015-05-04 Thread Michael Armbrust
You are looking for LATERAL VIEW explode https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-explode in HiveQL. On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco gibb...@gmail.com wrote: Hi, I'm trying to parse log files generated by Spark using
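
A hedged HiveQL-through-Scala sketch, assuming the relevant part of the event log was registered as a table named "stages" with an array-of-struct column rddInfo (the schema names are assumptions, not the exact event-log field names):

    val perRdd = hiveContext.sql(
      """SELECT stageId, rdd.name
        |FROM stages
        |LATERAL VIEW explode(rddInfo) t AS rdd""".stripMargin)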

Re: Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
I am not able to access the web UI for some reason. But the logs (being written while running my application) show that only 385Mb are being allocated for each executor (or slave nodes) while the executor memory I set is 16Gb. This 385Mb is not the same for each run either. It looks random

Re: Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
I am trying to read in a file (4GB file). I tried setting both spark.driver.memory and spark.executor.memory to large values (say 16GB) but I still get a GC limit exceeded error. Any idea what I am missing? On Mon, May 4, 2015 at 5:30 PM, Andrew Ash and...@andrewash.com wrote: It's unlikely you

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Imran Rashid
Oh wow, that is a really interesting observation, Marco & Jerry. I wonder if this is worth exposing in combineByKey()? I think Jerry's proposed workaround is all you can do for now -- use reflection to side-step the fact that the methods you need are private. On Mon, Apr 27, 2015 at 8:07 AM,

Re: spark kryo serialization question

2015-05-04 Thread Imran Rashid
yes, you should register all three. The truth is, you only *need* to register classes that will get serialized -- either via RDD caching or in a shuffle. So if a type is only used as an intermediate inside a stage, you don't need to register it. But the overhead of registering extra classes is
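
A minimal registration sketch; the three case classes are stand-ins for the classes discussed in the thread:

    import org.apache.spark.SparkConf

    case class MyKey(id: Long)
    case class MyValue(payload: String)
    case class MyIntermediate(score: Double)

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyKey], classOf[MyValue], classOf[MyIntermediate]))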

Re: No logs from my cluster / worker ... (running DSE 4.6.1)

2015-05-04 Thread Ted Yu
Looks like you're using Spark 1.1.0 Support for Kafka 0.8.2 was added by: https://issues.apache.org/jira/browse/SPARK-2808 which would come in Spark 1.4.0 FYI On Mon, May 4, 2015 at 12:22 PM, Eric Ho eric...@intel.com wrote: I'm submitting this via 'dse spark-submit' but somehow, I don't see

Re: Spark JVM default memory

2015-05-04 Thread Andrew Ash
It's unlikely you need to increase the amount of memory on your master node since it does simple bookkeeping. The majority of the memory pressure across a cluster is on executor nodes. See the conf/spark-env.sh file for configuring heap sizes, and this section in the docs for more information on
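
For the master/worker daemons specifically, the relevant knob in conf/spark-env.sh is SPARK_DAEMON_MEMORY (the 512MB default seen above); a hedged snippet with example values:

    # conf/spark-env.sh
    SPARK_DAEMON_MEMORY=2g     # heap for the master and worker daemon JVMs
    SPARK_WORKER_MEMORY=16g    # total memory a worker can hand out to executors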

Re: Spark partitioning question

2015-05-04 Thread Imran Rashid
Hi Marius, I am also a little confused -- are you saying that myPartitions is basically something like: class MyPartitioner extends Partitioner { def numPartitions = 1; def getPartition(key: Any) = 0 } ?? If so, I don't understand how you'd ever end up with data in two partitions. Indeed, than

Re: Extra stage that executes before triggering computation with an action

2015-05-04 Thread Imran Rashid
sortByKey() runs one job to sample the data, to determine what range of keys to put in each partition. There is a jira to change it to defer launching the job until the subsequent action, but it will still execute another stage: https://issues.apache.org/jira/browse/SPARK-1021 On Wed, Apr 29,

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Burak Yavuz
I think this Spark Package may be what you're looking for! http://spark-packages.org/package/tresata/spark-sorted Best, Burak On Mon, May 4, 2015 at 12:56 PM, Imran Rashid iras...@cloudera.com wrote: oh wow, that is a really interesting observation, Marco & Jerry. I wonder if this is worth

Building DAG from log

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to build the DAG of an application from the logs. I've had a look at SparkReplayDebugger but it doesn't operate offline on logs. I looked also at the one in this pull request: https://github.com/apache/spark/pull/2077 that seems to operate only on logs but it doesn't clearly show the

Re: AJAX with Apache Spark

2015-05-04 Thread Olivier Girardot
Hi Sergio, you shouldn't architect it this way; rather, update a storage layer with Spark Streaming that your Play app will query. For example a Cassandra table, or Redis, or anything that will be able to answer you in milliseconds, rather than querying the Spark Streaming program. Regards, Olivier.

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Or, have you ever tried a broadcast join? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, May 5, 2015 8:33 AM To: luohui20...@sina.com; Olivier Girardot; user Subject: RE: Re: Re: sparksql running slow while joining 2 tables. Can you print out the physical plan? EXPLAIN SELECT xxx…

Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread luohui20001
As I know, broadcast join is automatically enabled by spark.sql.autoBroadcastJoinThreshold. Refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options. And how do I check my app's physical plan, and other things like the optimized plan, executable plan, etc.? thanks

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-04 Thread luohui20001
You may need to copy hive-site.xml to your spark conf directory and check your hive metastore warehouse setting, and also check if you are authenticated to access the hive metastore warehouse. Thanks & Best regards! 罗辉 San.Luo - Original message - From: 鹰