Re: Spark partitioning question

2015-05-04 Thread Marius Danciu
Hi Imran, Yes that's what MyPartitioner does. I do see (using traces from MyPartitioner) that the key is partitioned on partition 0 but then I see this record arriving in both Yarn containers (I see it in the logs). Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey seemed

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
Michael Are there plans to add LIMIT push down? It's quite a natural thing to do in interactive querying. Sent from my iPhone > On 4 May 2015, at 22:57, Michael Armbrust wrote: > > The JDBC interface for Spark SQL does not support pushing down limits today. > >> On Mon, May 4, 2015 at 8:06 A

Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Olivier Girardot
Can you activate your eventLogs and send them to us? Thank you! On Tue, May 5, 2015 at 04:56, luohui20001 wrote: > Yes, just the default 1 executor. Thanks > > > > Sent from my Xiaomi phone > On May 4, 2015 at 10:01 PM, ayan guha wrote: > > Are you using only 1 executor? > > On Mon, May 4, 2015 at 11:07 PM, wrote: > >> hi Ol

Re: "java.io.IOException: No space left on device" while doing repartitioning in Spark

2015-05-04 Thread Akhil Das
It could be filling up your /tmp directory. You need to point spark.local.dir (or, alternatively, SPARK_WORKER_DIR) to another location which has sufficient space. Thanks Best Regards On Mon, May 4, 2015 at 7:27 PM, shahab wrote: > Hi, > > I am getting "No space left on device" exception
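
A minimal sketch of that setting, assuming a hypothetical scratch path with enough free space (on standalone workers, SPARK_WORKER_DIR in conf/spark-env.sh plays a similar role):

    import org.apache.spark.{SparkConf, SparkContext}

    // Point shuffle/spill files at a volume with plenty of room instead of /tmp.
    val conf = new SparkConf()
      .setAppName("RepartitionJob")
      .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")  // hypothetical path
    val sc = new SparkContext(conf)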

Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Cheng, Hao
I assume you’re using the DataFrame API within your application. sql(“SELECT…”).explain(true) From: Wang, Daoyuan Sent: Tuesday, May 5, 2015 10:16 AM To: luohui20...@sina.com; Cheng, Hao; Olivier Girardot; user Subject: RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables. You can use
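
A minimal sketch of inspecting the plans from within an application, assuming an existing SQLContext or HiveContext named sqlContext (Spark 1.3-era API):

    // explain(true) prints the parsed, analyzed, optimized and physical plans.
    val df = sqlContext.sql("SELECT a.name FROM db a JOIN sample b ON a.name = b.name")
    df.explain(true)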

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Data Set 1 : viEvents : Is the event activity data of 1 day. I took 10 files out of it and 10 records:
    Item ID          Count
    201335783004     3419
    191568402102     1793
    111654479898     1362
    181503913062     1310
    261798565828     1028
    111654493548      994
    231516683056      862
    131497785968

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Yi Zhang
Robin, my query statement is as below: select id, name, trans_date, gender, hobby, job, country from Employees LIMIT 100. In PostgreSQL, it works very well. For 10M records in the DB, it took less than 20ms, but in SparkSQL, it took a long time. Michael, got it. For me, it is not good news. Anyway,

Re: Help with Spark SQL Hash Distribution

2015-05-04 Thread Michael Armbrust
If you do a join with at least one equality relationship between the two tables, Spark SQL will automatically hash partition the data and perform the join. If you are looking to prepartition the data, that information is not yet propagated from the in memory cached representation so won't help avo

Help with Spark SQL Hash Distribution

2015-05-04 Thread Mani
I am trying to distribute a table using a particular column which is the key that I’ll be using to perform join operations on the table. Is it possible to do this with Spark SQL? I checked the method partitionBy() for RDDs, but I'm not sure how to specify which column is the key. Can anyone give an
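
For plain RDDs, a minimal sketch of keying a dataset by the join column and hash-partitioning on it, assuming an existing SparkContext sc (names and record layout are illustrative; as the reply above notes, Spark SQL hash-partitions equi-join keys on its own):

    import org.apache.spark.HashPartitioner

    case class Record(id: Long, payload: String)   // illustrative record type
    val table = sc.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))
    val keyed = table.keyBy(_.id)                  // the join column becomes the key
    val partitioned = keyed.partitionBy(new HashPartitioner(200))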

Re: Nightly builds/releases?

2015-05-04 Thread Ankur Chauhan
Hi, There is also a make-distribution.sh file in the repository root. If someone with Jenkins access could create a simple builder, that would be awesome. But I am guessing that, besides the Spark binary, one would probably also want the Maven artifacts (lower priority though) to work with it. -- Ankur >

Re: Nightly builds/releases?

2015-05-04 Thread Ted Yu
See this related thread: http://search-hadoop.com/m/JW1q5bnnyT1 Cheers On Mon, May 4, 2015 at 7:58 PM, Guru Medasani wrote: > I see a Jira for this one, but unresolved. > > https://issues.apache.org/jira/browse/SPARK-1517 > > > > > On May 4, 2015, at 10:25 PM, Ankur Chauhan > wrote: > > Hi, >

Re: Nightly builds/releases?

2015-05-04 Thread Guru Medasani
I see a Jira for this one, but unresolved. https://issues.apache.org/jira/browse/SPARK-1517 > On May 4, 2015, at 10:25 PM, Ankur Chauhan wrote: > > Hi, > > Does anyone know if spark has any nightly builds or equivalent that provides > bi

Nightly builds/releases?

2015-05-04 Thread Ankur Chauhan
Hi, Does anyone know if spark has any nightly builds or equivalent that provides binaries that have passed a CI build so that one could try out the bleeding edge without having to compile. -- Ankur signature.asc Description: Message signed with OpenPGP using GPGMail

RE: Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread Wang, Daoyuan
You can use Explain extended select …. From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Tuesday, May 05, 2015 9:52 AM To: Cheng, Hao; Olivier Girardot; user Subject: Re: RE: Re: Re: sparksql running slow while joining_2_tables. As I know broadcastjoin is automatically enabled by spa

Re: sparksql support hive view

2015-05-04 Thread Michael Armbrust
We support both LATERAL VIEWs (a query language feature that lets you turn a single row into many rows, for example with an explode) and virtual views (a table that is really just a query that is run on demand). On Mon, May 4, 2015 at 7:12 PM, wrote: > guys, > > just to confirm, sparksql su

sparksql support hive view

2015-05-04 Thread luohui20001
guys, just to confirm, does sparksql support the hive view feature? Is that the LateralView one in the hive language manual? thanks Thanks&Best regards! 罗辉 San.Luo

Re: spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-04 Thread luohui20001
you may need to copy hive-site.xml to your spark conf directory and check your hive metastore warehouse setting, and also check if you are authenticated to access the hive metastore warehouse. Thanks&Best regards! 罗辉 San.Luo - Original Message - From: "鹰" <980548...@qq

Re: RE: Re: Re: sparksql running slow while joining_2_tables.

2015-05-04 Thread luohui20001
As far as I know, broadcast join is automatically enabled by spark.sql.autoBroadcastJoinThreshold; refer to http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options. And how can I check my app's physical plan, and other things like the optimized plan, executable plan, etc.? thanks -

spark Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-05-04 Thread 鹰
hi all, when I submit a spark-sql program to select data from my hive database I get an error like this: User class threw exception: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient. What's wrong with my spark configuration? thank an

Re: Kryo serialization of classes in additional jars

2015-05-04 Thread Akshat Aranya
Actually, after some digging, I did find a JIRA for it: SPARK-5470. The fix for this has gone into master, but it isn't in 1.2. On Mon, May 4, 2015 at 2:47 PM, Imran Rashid wrote: > Oh, this seems like a real pain. You should file a jira, I didn't see an > open issue -- if nothing else just to d

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Or, have you ever tried broadcast join? From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, May 5, 2015 8:33 AM To: luohui20...@sina.com; Olivier Girardot; user Subject: RE: Re: Re: sparksql running slow while joining 2 tables. Can you print out the physical plan? EXPLAIN SELECT xxx… From

RE: Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Cheng, Hao
Can you print out the physical plan? EXPLAIN SELECT xxx… From: luohui20...@sina.com [mailto:luohui20...@sina.com] Sent: Monday, May 4, 2015 9:08 PM To: Olivier Girardot; user Subject: Re: Re: sparksql running slow while joining 2 tables. hi Olivier spark1.3.1, with java1.8.0.45, and I attached 2 pics

OOM error with GMMs on 4GB dataset

2015-05-04 Thread Vinay Muttineni
Hi, I am training a GMM with 10 gaussians on a 4 GB dataset (720,000 x 760). The spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also has 6GB. Spark Config Params: .set("spark.hadoop.validateOutputSpecs", "false").set("spark.dynamicAllocation.enabled", "false").set("spark.

Re: Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
I am not able to access the web UI for some reason. But the logs (being written while running my application) show that only 385 MB is being allocated for each executor (or slave node) while the executor memory I set is 16 GB. This "385 MB" is not the same for each run either. It looks random (somet

Re: SparkSQL Nested structure

2015-05-04 Thread Michael Armbrust
You are looking for LATERAL VIEW explode in HiveQL. On Mon, May 4, 2015 at 7:49 AM, Giovanni Paolo Gibilisco wrote: > Hi, I'm trying to parse log files generated by Spark using SparkSQL. > > In the JS
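
A minimal sketch of that construct, assuming an existing SparkContext sc, a HiveContext, and a registered table whose nested array column is named rdd_info (table and column names are illustrative):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Each element of the rdd_info array becomes its own output row.
    val exploded = hiveContext.sql(
      "SELECT event, rdd FROM stage_events LATERAL VIEW explode(rdd_info) t AS rdd")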

RE: Spark JVM default memory

2015-05-04 Thread Mohammed Guller
Did you confirm through the Spark UI how much memory is getting allocated to your application on each worker? Mohammed From: Vijayasarathy Kannan [mailto:kvi...@vt.edu] Sent: Monday, May 4, 2015 3:36 PM To: Andrew Ash Cc: user@spark.apache.org Subject: Re: Spark JVM default memory I am trying t

Re: Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
I am trying to read in a file (4GB file). I tried setting both "spark.driver.memory" and "spark.executor.memory" to large values (say 16GB) but I still get a GC limit exceeded error. Any idea what I am missing? On Mon, May 4, 2015 at 5:30 PM, Andrew Ash wrote: > It's unlikely you need to increas
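
A minimal sketch of those two settings; note that in client mode the driver JVM is already running when SparkConf is read, so driver memory is usually passed on the command line instead (e.g. spark-submit --driver-memory 16g --executor-memory 16g ...):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("LargeFileJob")
      .set("spark.executor.memory", "16g")  // takes effect when executors launch
    val sc = new SparkContext(conf)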

Re: MLLib SVM probability

2015-05-04 Thread Joseph Bradley
Currently, SVMs don't have built-in multiclass support. Logistic Regression supports multiclass, as do trees and random forests. It would be great to add multiclass support for SVMs as well. There is some ongoing work on generic multiclass-to-binary reductions: https://issues.apache.org/jira/bro

Re: Python Custom Partitioner

2015-05-04 Thread ayan guha
Thanks, but is there a non-broadcast solution? On 5 May 2015 01:34, "ÐΞ€ρ@Ҝ (๏̯͡๏)" wrote: > I have implemented map-side join with broadcast variables and the code is > on mailing list (scala). > > > On Mon, May 4, 2015 at 8:38 PM, ayan guha wrote: > >> Hi >> >> Can someone share some working code

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Koert Kuipers
shoot me an email if you need any help with spark-sorted. it does not (yet?) have a java api, so you will have to work in scala On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz wrote: > I think this Spark Package may be what you're looking for! > http://spark-packages.org/package/tresata/spark-sorted

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Michael Armbrust
If your data is evenly distributed (i.e. no skewed datapoints in your join keys), it can also help to increase spark.sql.shuffle.partitions (default is 200). On Mon, May 4, 2015 at 8:03 AM, Richard Marscher wrote: > In regards to the large GC pauses, assuming you allocated all 100GB of > memory p
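
A minimal sketch of raising that setting, assuming an existing SQLContext named sqlContext (the value 800 is just an example):

    // More, smaller post-shuffle partitions can reduce per-task memory pressure and GC.
    sqlContext.setConf("spark.sql.shuffle.partitions", "800")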

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Michael Armbrust
The JDBC interface for Spark SQL does not support pushing down limits today. On Mon, May 4, 2015 at 8:06 AM, Robin East wrote: > and a further question - have you tried running this query in pqsl? what’s > the performance like there? > > On 4 May 2015, at 16:04, Robin East wrote: > > What query

Re: spark kryo serialization question

2015-05-04 Thread Imran Rashid
yes, you should register all three. The truth is, you only *need* to register classes that will get serialized -- either via RDD caching or in a shuffle. So if a type is only used as an intermediate inside a stage, you don't need to register it. But the overhead of registering extra classes is p
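
A minimal sketch with illustrative classes, assuming Spark 1.2+ where SparkConf.registerKryoClasses is available:

    import org.apache.spark.SparkConf

    case class MyKey(id: Long)            // illustrative shuffle key type
    case class MyValue(payload: String)   // illustrative cached value type

    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyKey], classOf[MyValue]))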

Re: Kryo serialization of classes in additional jars

2015-05-04 Thread Imran Rashid
Oh, this seems like a real pain. You should file a jira, I didn't see an open issue -- if nothing else just to document the issue. As you've noted, the problem is that the serializer is created immediately in the executors, right when the SparkEnv is created, but the other jars aren't downloaded

Re: No logs from my cluster / worker ... (running DSE 4.6.1)

2015-05-04 Thread Ted Yu
bq. its Spark libs are all at 2.10 Clarification: 2.10 is version of Scala Your Spark version is 1.1.0 You can use earlier release of Kafka. Cheers On Mon, May 4, 2015 at 2:39 PM, Eric Ho wrote: > I still prefer to use Spark core / streaming /... at 2.10 becuase my DSE > is at 4.6.1 and its S

Re: Spark JVM default memory

2015-05-04 Thread Andrew Ash
It's unlikely you need to increase the amount of memory on your master node since it does simple bookkeeping. The majority of the memory pressure across a cluster is on executor nodes. See the conf/spark-env.sh file for configuring heap sizes, and this section in the docs for more information on

Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
Starting the master with "/sbin/start-master.sh" creates a JVM with only 512MB of memory. How do I change this default amount of memory? Thanks, Vijay

Building DAG from log

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to build the DAG of an application from the logs. I've had a look at SparkReplayDebugger but it doesn't operate offline on logs. I also looked at the one in this pull: https://github.com/apache/spark/pull/2077 that seems to operate only on logs, but it doesn't clearly show the depende

Re: AJAX with Apache Spark

2015-05-04 Thread Olivier Girardot
Hi Sergio, you shouldn't architect it this way; rather, have Spark Streaming update a store that your Play app will query. For example a Cassandra table, or Redis, or anything that will be able to answer you in milliseconds, rather than "querying" the Spark Streaming program. Regards, Olivier

Re: No logs from my cluster / worker ... (running DSE 4.6.1)

2015-05-04 Thread Ted Yu
Looks like you're using Spark 1.1.0 Support for Kafka 0.8.2 was added by: https://issues.apache.org/jira/browse/SPARK-2808 which would come in Spark 1.4.0 FYI On Mon, May 4, 2015 at 12:22 PM, Eric Ho wrote: > I'm submitting this via 'dse spark-submit' but somehow, I don't see any > loggings i

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Burak Yavuz
I think this Spark Package may be what you're looking for! http://spark-packages.org/package/tresata/spark-sorted Best, Burak On Mon, May 4, 2015 at 12:56 PM, Imran Rashid wrote: > oh wow, that is a really interesting observation, Marco & Jerry. > I wonder if this is worth exposing in combineBy

Re: Extra stage that executes before triggering computation with an action

2015-05-04 Thread Imran Rashid
sortByKey() runs one job to sample the data, to determine what range of keys to put in each partition. There is a jira to change it to defer launching the job until the subsequent action, but it will still execute another stage: https://issues.apache.org/jira/browse/SPARK-1021 On Wed, Apr 29, 20

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Imran Rashid
oh wow, that is a really interesting observation, Marco & Jerry. I wonder if this is worth exposing in combineByKey()? I think Jerry's proposed workaround is all you can do for now -- use reflection to side-step the fact that the methods you need are private. On Mon, Apr 27, 2015 at 8:07 AM, Sais

Re: Spark partitioning question

2015-05-04 Thread Imran Rashid
Hi Marius, I am also a little confused -- are you saying that myPartitions is basically something like: class MyPartitioner extends Partitioner { def numPartitions = 1 def getPartition(key: Any) = 0 } ?? If so, I don't understand how you'd ever end up with data in two partitions. Indeed, than ev

No logs from my cluster / worker ... (running DSE 4.6.1)

2015-05-04 Thread Eric Ho
I'm submitting this via 'dse spark-submit' but somehow, I don't see any logging on my cluster or worker machines... How can I find out? My cluster is running DSE 4.6.1 with Spark enabled. My source is running Kafka 0.8.2.0 I'm launching my program on one of my DSE machines. Any insights much

"com.datastax.spark" % "spark-streaming_2.10" % "1.1.0" in my build.sbt ??

2015-05-04 Thread Eric Ho
Can I specify this in my build file ? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/com-datastax-spark-spark-streaming-2-10-1-1-0-in-my-build-sbt-tp22758.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

AJAX with Apache Spark

2015-05-04 Thread Sergio Jiménez Barrio
Hi, I am trying to create a dashboard for an Apache Spark job. I need to run Spark Streaming 24/7 and, when it receives an AJAX request, answer with the current state of the job. I have created the client, and the program in Spark. I tried to create the response service with Play, but this runs the program

Parallelize foreach in PySpark with Spark Standalone

2015-05-04 Thread kdunn
Full disclosure, I am *brand* new to Spark. I am trying to use [Py]SparkSQL standalone to pre-process a bunch of *local* (non HDFS) Parquet files. I have several thousand files and want to dispatch as many workers as my machine can handle to process the data in parallel; either at the per-file or

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
I tried this: val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] = lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions = 1200, rdd = viEvents)).map { It fired two jobs and still I have 1 task that never completes. Index | ID | Attempt | Status | Locality Level | Exe

Re: Python Custom Partitioner

2015-05-04 Thread ๏̯͡๏
I have implemented map-side join with broadcast variables and the code is on mailing list (scala). On Mon, May 4, 2015 at 8:38 PM, ayan guha wrote: > Hi > > Can someone share some working code for custom partitioner in python? > > I am trying to understand it better. > > Here is documentation >
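
A minimal sketch of that pattern in Scala, assuming an existing SparkContext sc and that the smaller side fits comfortably in memory (file names and record layout are illustrative):

    // Collect the small side to the driver and broadcast it to every executor.
    val small: Map[Long, String] = sc.textFile("small.csv")
      .map(_.split(","))
      .map(a => a(0).toLong -> a(1))
      .collectAsMap()
      .toMap
    val smallBc = sc.broadcast(small)

    // The big side is joined map-side: none of its records are shuffled.
    val big = sc.textFile("big.csv").map(_.split(",")).map(a => a(0).toLong -> a(1))
    val joined = big.flatMap { case (k, v) =>
      smallBc.value.get(k).map(other => (k, (v, other)))
    }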

Python Custom Partitioner

2015-05-04 Thread ayan guha
Hi Can someone share some working code for custom partitioner in python? I am trying to understand it better. Here is documentation partitionBy(*numPartitions*, *partitionFunc=*) Return a copy of the RDD part

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
and a further question - have you tried running this query in psql? what’s the performance like there? > On 4 May 2015, at 16:04, Robin East wrote: > > What query are you running? It may be the case that your query requires > PostgreSQL to do a large amount of work before identifying the first n

Re: Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Robin East
What query are you running? It may be the case that your query requires PostgreSQL to do a large amount of work before identifying the first n rows > On 4 May 2015, at 15:52, Yi Zhang wrote: > > I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and > improve query performance,

Re: Long GC pauses with Spark SQL 1.3.0 and billion row tables

2015-05-04 Thread Richard Marscher
In regards to the large GC pauses, assuming you allocated all 100GB of memory per worker you may consider running with less memory on your Worker nodes, or splitting up the available memory on the Worker nodes amongst several worker instances. The JVM's garbage collection starts to become very slow

Re: SparkStream saveAsTextFiles()

2015-05-04 Thread anavidad
The structure seems fine. You only need to add at the end of your program: ssc.start(); ssc.awaitTermination(); Check the method arguments. I advise you to check the Spark Java streaming API. https://spark.apache.org/docs/1.3.0/api/java/ Regards. -- View this message in context: http://apache-spark-u
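
A minimal skeleton showing where those two calls go, assuming a hypothetical socket source and output prefix:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("StreamSkeleton"), Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.saveAsTextFiles("hdfs:///tmp/out/batch", "txt")

    ssc.start()             // nothing runs until start() is called
    ssc.awaitTermination()  // keep the driver alive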

Is LIMIT n in Spark SQL useful?

2015-05-04 Thread Yi Zhang
I am trying to query PostgreSQL using LIMIT(n) to reduce memory size and improve query performance, but I found it took as long as querying without LIMIT. That left me confused. Does anybody know why? Thanks. Regards, Yi

SparkSQL Nested structure

2015-05-04 Thread Giovanni Paolo Gibilisco
Hi, I'm trying to parse log files generated by Spark using SparkSQL. In the JSON elements related to the StageCompleted event we have a nested structure containing an array of elements with RDD Info (see the log below as an example, omitting some parts). { "Event": "SparkListenerStageComplete

Re: SparkStream saveAsTextFiles()

2015-05-04 Thread anavidad
Hi, What kind of "can't find symbol" are you receiving? On the other hand, I would try changing the guava dependency version to 14.0.1. In Spark 1.3.0, the guava version is 14.0.1 but it is not included inside the spark artifact because it's marked as provided. http://repo1.maven.org/maven2/org/apache/spar

Re: empty jdbc RDD in spark

2015-05-04 Thread Cody Koeninger
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.JdbcRDD The arguments are sql string, lower bound, upper bound, number of partitions. Your call SELECT * FROM MEMBERS LIMIT ? OFFSET ?, 0, 100, 1 would thus be run as SELECT * FROM MEMBERS LIMIT 0 OFFSET 100 Natu
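
A minimal sketch of a query shaped for JdbcRDD, assuming an existing SparkContext sc; the two '?' placeholders are bound per partition from the lower/upper bounds (connection URL, table, and column are illustrative):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val members = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://host/db?user=usr&password=pass"),
      "SELECT * FROM MEMBERS WHERE ID >= ? AND ID <= ?",
      1L,          // lower bound
      1000000L,    // upper bound
      10,          // number of partitions
      (rs: ResultSet) => rs.getString("NAME"))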

Re: "java.io.IOException: No space left on device" while doing repartitioning in Spark

2015-05-04 Thread Ted Yu
See https://wiki.gentoo.org/wiki/Knowledge_Base:No_space_left_on_device_while_there_is_plenty_of_space_available What's the value for spark.local.dir property ? Cheers On Mon, May 4, 2015 at 6:57 AM, shahab wrote: > Hi, > > I am getting "No space left on device" exception when doing repartitio

Re: Spark Mongodb connection

2015-05-04 Thread Gaspar Muñoz
Hi Yasemin, You can find here a MongoDB connector for Spark SQL: http://github.com/Stratio/spark-mongodb Best regards 2015-05-04 9:27 GMT+02:00 Yasemin Kaya : > Hi! > > I am new at Spark and I want to begin Spark with simple wordCount example > in Java. But I want to give my input from Mongodb

Re: spark 1.3.1

2015-05-04 Thread Deng Ching-Mallete
Hi, I think you need to import "org.apache.spark.sql.types.DataTypes" instead of "org.apache.spark.sql.types.DataType" and use that to access the StringType. HTH, Deng On Mon, May 4, 2015 at 9:37 PM, Saurabh Gupta wrote: > I am really new to this but what should I look into maven logs

Re: Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread ayan guha
Are you using only 1 executor? On Mon, May 4, 2015 at 11:07 PM, wrote: > hi Olivier > > spark1.3.1, with java1.8.0.45 > > and add 2 pics . > > it seems like a GC issue. I also tried with different parameters like > memory size of driver&executor, memory fraction, java opts... > > but this issue

"java.io.IOException: No space left on device" while doing repartitioning in Spark

2015-05-04 Thread shahab
Hi, I am getting a "No space left on device" exception when repartitioning approx. 285 MB of data while there is still 2 GB of space left. Does it mean that repartitioning needs more space (more than 2 GB) to repartition 285 MB of data? best, /Shahab java.io.IOException: No spa

Troubling Logging w/Simple Example (spark-1.2.2-bin-hadoop2.4)...

2015-05-04 Thread James Carman
I have the following simple example program: public class SimpleCount { public static void main(String[] args) { final String master = System.getProperty("spark.master", "local[*]"); System.out.printf("Running job against spark master %s ...%n", master); final SparkCo

Re: mapping JavaRDD to jdbc DataFrame

2015-05-04 Thread ayan guha
You can use applySchema On Mon, May 4, 2015 at 10:16 PM, Lior Chaga wrote: > Hi, > > I'd like to use a JavaRDD containing parameters for an SQL query, and use > SparkSQL jdbc to load data from mySQL. > > Consider the following pseudo code: > > JavaRDD namesRdd = ... ; > ... > options.put("ur
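
A minimal sketch of that approach (Spark 1.3-era API; createDataFrame is the newer name for the same idea), assuming an existing SparkContext sc and with illustrative column names:

    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val sqlContext = new SQLContext(sc)
    val namesRdd = sc.parallelize(Seq("alice", "bob"))
    val schema = StructType(Seq(StructField("name", StringType, nullable = false)))

    val namesDf = sqlContext.applySchema(namesRdd.map(Row(_)), schema)
    namesDf.registerTempTable("names")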

Re: How to deal with code that runs before foreach block in Apache Spark?

2015-05-04 Thread Gerard Maas
I'm not familiar with the Solr API but provided that ' SolrIndexerDriver' is a singleton, I guess that what's going on when running on a cluster is that the call to: SolrIndexerDriver.solrInputDocumentList.add(elem) is happening on different singleton instances of the SolrIndexerDriver on diffe
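
One common way to restructure this kind of code (a sketch, not necessarily the fix this thread converged on): build and ship each batch on the executors with foreachPartition instead of appending to a driver-side singleton. It assumes an existing SparkContext sc; sendToSolr is a hypothetical stand-in for the real Solr client call.

    def sendToSolr(docs: List[String]): Unit =
      println(s"indexing ${docs.size} documents")  // placeholder for the Solr client

    sc.textFile("docs.txt").foreachPartition { iter =>
      val batch = iter.toList      // lives only on this executor, for this partition
      if (batch.nonEmpty) sendToSolr(batch)
    }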

Re: spark 1.3.1

2015-05-04 Thread Saurabh Gupta
I am really new to this, but what should I look for in the maven logs? I have tried mvn package -X -e. Should I show the full trace? On Mon, May 4, 2015 at 6:54 PM, Driesprong, Fokko wrote: > Hi Saurabh, > > Did you check the log of maven? > > 2015-05-04 15:17 GMT+02:00 Saurabh Gupta : > >> HI, >> >

How to deal with code that runs before foreach block in Apache Spark?

2015-05-04 Thread Emre Sevinc
I'm trying to deal with some code that runs differently on Spark stand-alone mode and Spark running on a cluster. Basically, for each item in an RDD, I'm trying to add it to a list, and once this is done, I want to send this list to Solr. This works perfectly fine when I run the following code in

Re: MLLib SVM probability

2015-05-04 Thread Driesprong, Fokko
Hi Robert, I would say that the sign of the number represents the class of the input vector. What kind of data are you using, and what kind of training set do you use? Fundamentally an SVM is able to separate only two classes; you can do one vs. the rest as you mentioned. I don't see how LVQ can

Re: Exiting "driver" main() method...

2015-05-04 Thread James Carman
I think I figured it out. I am playing around with the Cassandra connector and I had a method that inserted some data into a locally-running Cassandra instance, but I forgot to close the Cluster object. I guess that left some non-daemon thread running and kept the process from exiting. Nothing to

Re: spark 1.3.1

2015-05-04 Thread Driesprong, Fokko
Hi Saurabh, Did you check the log of maven? 2015-05-04 15:17 GMT+02:00 Saurabh Gupta : > HI, > > I am trying to build a example code given at > > https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds > > code is: > > // Import factory methods provided by DataTy

MLLib SVM probability

2015-05-04 Thread Robert Musters
Hi all, I am trying to understand the output of the SVM classifier. Right now, my output looks like this:
    -18.841544889249917   0.0
    168.32916035523283    1.0
    420.67763915879794    1.0
    -974.1942589201286    0.0
    71.73602841256813     1.0
    233.13636224524993    1.0
    -1000.5902168199027   0.0
The documenta

Difference ?

2015-05-04 Thread ๏̯͡๏
*Datasets* val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) } val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) } What is the difference between 1) lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions = 1200, rdd = viEvents)).ma

spark 1.3.1

2015-05-04 Thread Saurabh Gupta
HI, I am trying to build an example code given at https://spark.apache.org/docs/latest/sql-programming-guide.html#interoperating-with-rdds code is:
    // Import factory methods provided by DataType.
    import org.apache.spark.sql.types.DataType;
    // Import StructType and StructField
    import org.apache.spark

Re: Support for skewed joins in Spark

2015-05-04 Thread ๏̯͡๏
Hello Soila, Can you share the code that shows usage of RangePartitioner? I am facing an issue with .join() where one task runs forever. I tried repartition(100/200/300/1200) and it did not help. I cannot use a map-side join because both datasets are huge and beyond the driver memory size. Regards, Deepak
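
A minimal sketch of building and using a RangePartitioner on pair RDDs, assuming an existing SparkContext sc (data is illustrative); note that it samples the keys of the RDD it is given, so it balances key ranges but cannot split a single hot key across partitions:

    import org.apache.spark.RangePartitioner

    val left  = sc.parallelize(1L to 1000000L).map(k => (k, "l"))
    val right = sc.parallelize(1L to 1000000L).map(k => (k, "r"))

    val partitioner = new RangePartitioner(partitions = 1200, rdd = left)
    val joined = left.join(right, partitioner)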

Re: Custom Partitioning Spark

2015-05-04 Thread ๏̯͡๏
Why do you use a custom partitioner? Are you doing a join? And can you share some code that shows how you implemented the custom partitioner? On Tue, Apr 21, 2015 at 8:38 PM, ayan guha wrote: > Are you looking for? > > *mapPartitions*(*func*) Similar to map, but runs separately on each > partition (b

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
I ran it against one file instead of 10 files and I see one task is still running after 33 mins; its shuffle read size is 780 MB / 50 million records. I did a count of records for each itemId from dataset-2 [One FILE] (Second Dataset (RDDPair) val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstance

Re: sparksql running slow while joining 2 tables.

2015-05-04 Thread Olivier Girardot
Hi, What is your Spark version? Regards, Olivier. On Mon, May 4, 2015 at 11:03, wrote: > hi guys > > when I am running a sql like "select a.name,a.startpoint,a.endpoint, > a.piece from db a join sample b on (a.name = b.name) where (b.startpoint > > a.startpoint + 25);" I found sparks

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
From the symptoms you mentioned, that one task's shuffle write is much larger than all the other tasks', it is quite similar to normal data skew behavior. I just give some advice based on your descriptions; I think you need to detect whether the data is actually skewed or not. The shuffle will put data

mapping JavaRDD to jdbc DataFrame

2015-05-04 Thread Lior Chaga
Hi, I'd like to use a JavaRDD containing parameters for an SQL query, and use SparkSQL jdbc to load data from mySQL. Consider the following pseudo code: JavaRDD namesRdd = ... ; ... options.put("url", "jdbc:mysql://mysql?user=usr"); options.put("password", "pass"); options.put("dbtable", "(SELEC

Re: Unusual filter behaviour on RDD

2015-05-04 Thread fawadalam
I got it working. When I was persisting, it only persisted 85% of the RDD to memory and the rest of the RDD gets recomputed every time. Because my flagged RDD uses a random method to create the field, I was getting unpredictable results. When I persist using: flagged.persist(StorageLevel.MEMORY_AN

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Four tasks are now failing with:
    Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Shuffle Read Size / Records | Shuffle Spill (Memory) | Shuffle Spill (Disk) | Errors
    0 | 3771 | 0 | FAILED | PROCESS_LOCAL | 114 / host1 | 2015/05/04 01:27:44 | / | ExecutorLostFailure (executor 114 lost)

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
One dataset (RDD pair): val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) } Second dataset (RDD pair): val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) } As I want to join based on the item Id that is used as the first element in the tuple in both cases, and I

Re: Map-Side Join in Spark

2015-05-04 Thread ๏̯͡๏
This is how I implemented the map-side join using broadcast:
    val listings = DataUtil.getDwLstgItem(sc, DateUtil.addDaysToDate(startDate, -89))
    val viEvents = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) }
    val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
The shuffle key depends on your implementation. I'm not sure if you are familiar with MapReduce: the mapper output is a key-value pair, where the key is the shuffle key for shuffling; Spark is the same. 2015-05-04 17:31 GMT+08:00 ÐΞ€ρ@Ҝ (๏̯͡๏) : > Hello Shao, > Can you talk more about shuff

Re: Hardware requirements

2015-05-04 Thread Akhil Das
Assume your block size is 128MB. Thanks Best Regards On Mon, May 4, 2015 at 2:38 PM, ayan guha wrote: > Hi > > How do you figure out 500gig~3900 partitions? I am trying to do the math. > If I assume 64mb block size then 1G~16 blocks and 500g~8000 blocks. If we > assume split and block sizes are
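
The arithmetic behind the two estimates, as a quick sketch:

    val dataMb = 500L * 1024               // 500 GB expressed in MB
    val with128MbBlocks = dataMb / 128     // = 4000, i.e. roughly the ~3900 quoted
    val with64MbBlocks  = dataMb / 64      // = 8000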

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Hello Shao, Can you talk more about the shuffle key, or point me to APIs that allow me to change the shuffle key? I will try with different keys and see the performance. What is the shuffle key by default? On Mon, May 4, 2015 at 2:37 PM, Saisai Shao wrote: > IMHO If your data or your algorithm is prone

Re: Hardware requirements

2015-05-04 Thread ayan guha
Hi How do you figure out 500gig~3900 partitions? I am trying to do the math. If I assume 64mb block size then 1G~16 blocks and 500g~8000 blocks. If we assume split and block sizes are the same, shouldn't we end up with 8k partitions? On 4 May 2015 17:49, "Akhil Das" wrote: > 500GB of data will have

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread Saisai Shao
IMHO if your data or your algorithm is prone to data skew, I think you have to fix this at the application level; Spark itself cannot overcome this problem (if one key has a large amount of values). You may change your algorithm to choose another shuffle key, something like this to avoid shuffle on sk

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-04 Thread ๏̯͡๏
Hello Dean & others, Thanks for the response. I tried with 100, 200, 400, 600 and 1200 repartitions with 100, 200, 400 and 800 executors. Each time all the tasks of the join complete in less than a minute except one, and that one task runs forever. I have a huge cluster at my disposal. The data for each

Re: Hardware requirements

2015-05-04 Thread Akhil Das
500GB of data will have nearly 3900 partitions, and if you can have nearly that many cores and around 500GB of memory then things will be lightning fast. :) Thanks Best Regards On Sun, May 3, 2015 at 12:49 PM, sherine ahmed wrote: > I need to use spark to upload a 500 GB data from had

Re: Remoting warning when submitting to cluster

2015-05-04 Thread Akhil Das
Looks like a version incompatibility; just make sure you have the proper version of spark. Also look further in the stacktrace for what is causing the Futures timed out (it could also be a network issue if the ports aren't opened properly) Thanks Best Regards On Sat, May 2, 2015 at 12:04 AM, javidelgadil

Re: spark filestrea problem

2015-05-04 Thread Akhil Das
With fileStream you can actually pass a filter parameter to avoid loading .tmp files/directories. Also, when you move/rename a file, the file creation date doesn't change and hence Spark won't detect it, I believe. Thanks Best Regards On Sat, May 2, 2015 at 9:37 PM, Evo Eftimov wrote: > it
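
A minimal sketch of passing such a filter, assuming an existing StreamingContext ssc (paths are illustrative); this uses the fileStream overload that takes a Path => Boolean predicate:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val files = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///incoming",
      (path: Path) => !path.getName.endsWith(".tmp"),  // skip temp files
      newFilesOnly = true)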

how to make sure data is partitioned across all workers?

2015-05-04 Thread shahab
Hi, Is there any way to force Spark to partition cached data across all worker nodes, so that the data is not cached on only one of the worker nodes? best, /Shahab

Re: Problem in Standalone Mode

2015-05-04 Thread Akhil Das
Can you paste the complete stacktrace? It looks like you are having a version incompatibility with hadoop. Thanks Best Regards On Sat, May 2, 2015 at 4:36 PM, drarse wrote: > When I run my program with Spark-Submit everything is ok. But when I try to > run in standalone mode I obtain the next Excep

Spark Mongodb connection

2015-05-04 Thread Yasemin Kaya
Hi! I am new to Spark and I want to begin with a simple wordCount example in Java. But I want to take my input from a MongoDB database. I want to learn how I can connect a MongoDB database to my project. Can anyone help with this issue? Have a nice day yasemin -- hiç ender hiç

Spark job concurrency problem

2015-05-04 Thread Xi Shen
Hi, I have two small RDDs, each with about 600 records. In my code, I did val rdd1 = sc...cache() val rdd2 = sc...cache() val result = rdd1.cartesian(rdd2).*repartition*(num_cpu).map { case (a,b) => some_expensive_job(a,b) } I ran my job in a YARN cluster with "--master yarn-cluster"; I have 6 exe