saveAsTextFile hangs with hdfs

2014-08-19 Thread David
)); Thanks in advance, David

sortByKey trouble

2014-09-24 Thread david
Hi, Does anybody know how to use sortByKey in Scala on an RDD like: val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34)+"-"+r(3), r(4), r(10), r(12))) because I received an error: sortByKey is not a member of org.apache.spark.rdd.RDD[(String,String,String,String)]. What i try do
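
A minimal sketch of the usual fix, assuming the pipe-delimited file above: sortByKey is only defined (through OrderedRDDFunctions) on RDDs of two-element tuples, so the 4-tuple has to be reshaped into a (key, value) pair first.

    import org.apache.spark.SparkContext._  // pair-RDD implicits; needed in compiled code on older Spark 1.x

    // reshape (String, String, String, String) into (key, value) so sortByKey
    // resolves; the key is r(34)-r(3) as in the original post
    val rddToSave = file.map(l => l.split("\\|"))
      .map(r => (r(34) + "-" + r(3), (r(4), r(10), r(12))))
    val sorted = rddToSave.sortByKey()

The spark-shell imports these implicits automatically, which may be why the same code compiles there but not in an Eclipse project (see the reply below).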

Re: sortByKey trouble

2014-09-24 Thread david
Thanks, I've already tried this solution but it does not compile (in Eclipse). I'm surprised to see that in the Spark shell, sortByKey works fine on both shapes: (String,String,String,String) and (String,(String,String,String)).

Re: foreachPartition: write to multiple files

2014-10-08 Thread david
Hi, I finally found a solution after reading this post: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-split-RDD-by-key-and-save-to-different-path-td11887.html#a11983

Key-Value decomposition

2014-11-03 Thread david
Hi, I'm a newbie in Spark and face the following use case: val data = Array(("A", "1;2;3")) val rdd = sc.parallelize(data) // Something here to produce an RDD of (key, value) pairs: ("A", 1), ("A", 2), ("A", 3) Does anybody know how to do this? Thanks.

Re: Key-Value decomposition

2014-11-03 Thread david
Hi, but I have only one RDD. Here is a more complete example: my RDD is something like ("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1") and I expect the following result: (A,1), (A,2), (A,3), (B,2), (B,5), (B,6), (C,3), (C,2), (C,1). Any idea how I can achieve this? Thanks.
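
One way to do this, sketched under the assumption that the values are ";"-separated strings as above, is flatMap, which emits zero or more output records per input record:

    val rdd = sc.parallelize(Seq(("A", "1;2;3"), ("B", "2;5;6"), ("C", "3;2;1")))

    // one (key, element) pair per ';'-separated element of the value
    val decomposed = rdd.flatMap { case (k, v) => v.split(";").map(e => (k, e)) }

    // decomposed.collect() => (A,1), (A,2), (A,3), (B,2), (B,5), (B,6), (C,3), (C,2), (C,1)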

RE: Key-Value decomposition

2014-11-04 Thread david
Thanks!

Spark SQL (1.0)

2014-11-24 Thread david
Hi, I build 2 tables from files and join table F1 with table F2 on c5 = d4. F1 has 46730613 rows, F2 has 3386740 rows. All keys d4 exist in F1.c5, so I expect to retrieve 46730613 rows, but the join returns only 3437 rows. // --- begin code --- val sqlContext = new

Spark SQL Join returns less rows that expected

2014-11-25 Thread david
Hi, I have 2 files which come from CSV exports of 2 Oracle tables. F1 has 46730613 rows, F2 has 3386740 rows. I build 2 tables with Spark and join table F1 with table F2 on c1 = d1. All keys F2.d1 exist in F1.c1, so I expect to retrieve 46730613 rows, but the join returns only 3437 rows. // ---

spark streaming kafka best practices?

2014-12-05 Thread david
Hi, what is the best way to process a batch window in Spark Streaming: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { // process the event process(event) }) }) or kafkaStream.foreachRDD(rdd => { rdd.map(event => { //
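
A hedged sketch of the pattern usually recommended here: collect() pulls every event back to the driver, which defeats the point of the cluster, while a bare map() is lazy and does nothing until an action runs. Processing inside foreachPartition keeps the work on the executors; process is assumed to be a serializable function defined elsewhere.

    kafkaStream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        // runs on the executors; open any connection once per partition here
        events.foreach(event => process(event))
      }
    }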

Spark streaming: works with collect() but not without collect()

2014-12-11 Thread david
Hi, we use the following Spark Streaming code to collect and process Kafka events: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { process(event._1, event._2) }) }) This works fine, but without the collect() call the following exception is

ROSE: Spark + R on the JVM.

2016-01-12 Thread David
to [take a look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, questions etc very welcome. David "All that is gold does not glitter, Not all those who wander are lost."

Help with groupByKey

2014-03-02 Thread David Thomas
I have an RDD of (K, Array[V]) pairs. For example: ((key1, (1,2,3)), (key2, (3,2,4)), (key1, (4,3,2))) How can I do a groupByKey such that I get back an RDD of (K, Array[V]) pairs with the arrays concatenated per key? Ex: ((key1, (1,2,3,4,3,2)), (key2, (3,2,4)))
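
If the goal is one concatenated array per key, a reduceByKey sketch like the following (assuming Array[Int] values as in the example) avoids groupByKey's cost of buffering every value for a key:

    val rdd = sc.parallelize(Seq(
      ("key1", Array(1, 2, 3)), ("key2", Array(3, 2, 4)), ("key1", Array(4, 3, 2))))

    // ++ concatenates the two arrays for a key; combining happens map-side first
    val merged = rdd.reduceByKey(_ ++ _)
    // => (key1, Array(1, 2, 3, 4, 3, 2)), (key2, Array(3, 2, 4))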

Block

2014-03-11 Thread David Thomas
What is the concept of Block and BlockManager in Spark? How is a Block related to a Partition of an RDD?

Re: Are all transformations lazy?

2014-03-11 Thread David Thomas
: https://spark-project.atlassian.net/browse/SPARK-1021). David Thomas dt5434...@gmail.com March 11, 2014 at 9:49 PM For example, is the distinct() transformation lazy? When I see the Spark source code, distinct applies a map -> reduceByKey -> map sequence to the RDD elements. Why is this lazy

Re: Replicating RDD elements

2014-03-28 Thread David Thomas
That helps! Thank you. On Fri, Mar 28, 2014 at 12:36 AM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi David, I am sorry but your question is not clear to me. Are you talking about taking some value and sharing it across your cluster so that it is present on all the nodes? You can look

Spark webUI - application details page

2014-03-30 Thread David Thomas
Is there a way to see 'Application Detail UI' page (at master:4040) for completed applications? Currently, I can see that page only for running applications, I would like to see various numbers for the application after it has completed.

Resilient nature of RDD

2014-04-02 Thread David Thomas
Can someone explain how an RDD is resilient? If one of the partitions is lost, who is responsible for recreating that partition - is it the driver program?

Checkpoint Vs Cache

2014-04-13 Thread David Thomas
What is the difference between checkpointing and caching an RDD?
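
In one sketch, assuming an HDFS checkpoint directory is available: cache() keeps partitions in executor memory but retains the lineage, so lost partitions are recomputed from their parents; checkpoint() writes the RDD to reliable storage on the next action and truncates the lineage. Paths below are hypothetical.

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")             // hypothetical path

    val rdd = sc.textFile("hdfs:///data/input").map(_.length)  // hypothetical input
    rdd.cache()        // in-memory, lineage kept, gone when the application exits
    rdd.checkpoint()   // written to the checkpoint dir, lineage truncated
    rdd.count()        // the first action triggers both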

Task splitting among workers

2014-04-19 Thread David Thomas
During a Spark stage, how are tasks split among the workers? Specifically for a HadoopRDD, who determines which worker has to get which task?

RE:

2014-04-23 Thread Buttler, David
This sounds like a configuration issue. Either you have not set the MASTER correctly, or possibly another process is using up all of the cores. Dave From: ge ko [mailto:koenig@gmail.com] Sent: Sunday, April 13, 2014 12:51 PM To: user@spark.apache.org Subject: Hi, I'm still going to start

RE: K-means with large K

2014-04-28 Thread Buttler, David
@spark.apache.org Cc: user@spark.apache.org Subject: Re: K-means with large K David, Just curious to know what kind of use cases demand such large k clusters Chester Sent from my iPhone On Apr 28, 2014, at 9:19 AM, Buttler, David buttl...@llnl.gov wrote: Hi, I am trying

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
For some reason the patch did not make it. Trying via email: /D On May 23, 2014, at 9:52 AM, lemieud david.lemi...@radialpoint.com wrote: Hi, I think I found the problem. In SparkFlumeEvent the readExternal method uses in.read(bodyBuff), which reads the first 1020 bytes but no more. The
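
A hedged sketch of the kind of fix described (cf. SPARK-1916 in the follow-up): ObjectInput.read(byte[]) is free to return after fewer bytes than requested, so the read has to loop until the buffer is full. bodyLength and in are stand-ins for the locals in SparkFlumeEvent.readExternal.

    // loop until the whole body has been read; a single in.read(bodyBuff)
    // may stop early (e.g. at 1020 bytes) without the stream being exhausted
    val bodyBuff = new Array[Byte](bodyLength)
    var off = 0
    while (off < bodyLength) {
      val n = in.read(bodyBuff, off, bodyLength - off)
      if (n == -1) throw new java.io.EOFException("stream ended mid-body")
      off += n
    }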

Re: Spark Streaming using Flume body size limitation

2014-05-23 Thread David Lemieux
Created https://issues.apache.org/jira/browse/SPARK-1916 I'll submit a pull request soon. /D On May 23, 2014, at 9:56 AM, David Lemieux david.lemi...@radialpoint.com wrote: For some reason the patch did not make it. Trying via email: /D On May 23, 2014, at 9:52 AM, lemieud david.lemi

Spark misconfigured? Small input split sizes in shark query

2014-07-15 Thread David Rosenstrauch
Got a spark/shark cluster up and running recently, and have been kicking the tires on it. However, been wrestling with an issue on it that I'm not quite sure how to solve. (Or, at least, not quite sure about the correct way to solve it.) I ran a simple Hive query (select count ...) against

Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
at the content inside of the map function or should I be doing something else entirely? Thanks David

Re: Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
. It may be the case that you don't really need a bunch of RDDs at all, but can operate on an RDD of pairs of Strings (roots) and something-elses, all at once. On Mon, Aug 18, 2014 at 2:31 PM, David Tinker david.tin...@gmail.com wrote: Hi All. I need to create a lot of RDDs starting from

Small input split sizes

2014-08-20 Thread David Rosenstrauch
I'm still bumping up against this issue: spark (and shark) are breaking my inputs into 64MB-sized splits. Anyone know where/how to configure spark so that it either doesn't split the inputs, or at least uses a much larger split size? (E.g., 512MB.) Thanks, DR On 07/15/2014 05:58 PM, David

spark-ec2 [Errno 110] Connection time out

2014-08-30 Thread David Matheson
conn = ec2.connect_to_region(opts.region) Any suggestions on how to debug the cause of the timeout? Note: I replaced the name of my keypair with Blah. Thanks, David

Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
I generally call values.stats, e.g.: val stats = myPairRdd.values.stats On Fri, Sep 12, 2014 at 4:46 PM, rzykov rzy...@gmail.com wrote: Is it possible to use DoubleRDDFunctions https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/DoubleRDDFunctions.html for calculating mean

Re: Computing mean and standard deviation by key

2014-09-12 Thread David Rowe
Oh I see, I think you're trying to do something like (in SQL): SELECT order, mean(price) FROM orders GROUP BY order In this case, I'm not aware of a way to use the DoubleRDDFunctions, since you have a single RDD of pairs where each pair is of type (KeyType, Iterable[Double]). It seems to me
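
Where per-key mean and standard deviation are needed, one sketch (assuming an RDD[(String, Double)] of (order, price) pairs named prices) is to aggregate a StatCounter per key rather than grouping the doubles:

    import org.apache.spark.util.StatCounter

    // StatCounter accumulates count/mean/stdev in one pass and merges
    // cleanly across partitions
    val statsByKey = prices.aggregateByKey(new StatCounter())(
      (acc, price) => acc.merge(price),
      (a, b)       => a.merge(b))

    val meanAndStdev = statsByKey.mapValues(s => (s.mean, s.stdev))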

SQL shell for Spark SQL?

2014-09-17 Thread David Rosenstrauch
Is there a shell available for Spark SQL, similar to the way the Shark or Hive shells work? From my reading up on Spark SQL, it seems like one can execute SQL queries in the Spark shell, but only from within code in a programming language such as Scala. There does not seem to be any way to

Re: Issues with partitionBy: FetchFailed

2014-09-21 Thread David Rowe
Hi, I've seen this problem before, and I'm not convinced it's GC. When spark shuffles it writes a lot of small files to store the data to be sent to other executors (AFAICT). According to what I've read around the place the intention is that these files be stored in disk buffers, and since

Re: Where can I find the module diagram of SPARK?

2014-09-23 Thread David Rowe
Hi Andrew, I can't speak for Theodore, but I would find that incredibly useful. Dave On Wed, Sep 24, 2014 at 11:24 AM, Andrew Ash and...@andrewash.com wrote: Hi Theodore, What do you mean by module diagram? A high level architecture diagram of how the classes are organized into packages?

aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
Hi All, After some hair pulling, I've reached the realisation that an operation I am currently doing via: myRDD.groupByKey.mapValues(func) should be done more efficiently using aggregateByKey or combineByKey. Both of these methods would do, and they seem very similar to me in terms of their
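
For intuition, here is the same per-key sum-and-count written both ways, assuming an RDD[(String, Double)] named pairs; aggregateByKey is essentially combineByKey with the createCombiner function replaced by a shared zero value.

    // sum and count per key via aggregateByKey
    val viaAggregate = pairs.aggregateByKey((0.0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b)   => (a._1 + b._1, a._2 + b._2))

    // the same thing via combineByKey
    val viaCombine = pairs.combineByKey(
      (v: Double)                          => (v, 1),
      (acc: (Double, Int), v: Double)      => (acc._1 + v, acc._2 + 1),
      (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2))

    val avg = viaAggregate.mapValues { case (sum, n) => sum / n }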

Re: aggregateByKey vs combineByKey

2014-09-29 Thread David Rowe
, mergeCombiners. Hope this helps! Liquan On Sun, Sep 28, 2014 at 11:59 PM, David Rowe davidr...@gmail.com wrote: Hi All, After some hair pulling, I've reached the realisation that an operation I am currently doing via: myRDD.groupByKey.mapValues(func) should be done more efficiently using

pyspark cassandra examples

2014-09-30 Thread David Vincelli
the documentation and found nothing specifically relevant to cassandra, is there such a piece of documentation? Thank you, - David

Re: pyspark cassandra examples

2014-09-30 Thread David Vincelli
Thanks, that worked! I downloaded the version pre-built against hadoop1 and the examples worked. - David On Tue, Sep 30, 2014 at 5:08 PM, Kan Zhang kzh...@apache.org wrote: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

inconsistent edge counts in GraphX

2014-11-10 Thread Buttler, David
Hi, I am building a graph from a large CSV file. Each record contains a couple of nodes and about 10 edges. When I try to load a large portion of the graph, using multiple partitions, I get inconsistent results in the number of edges between different runs. However, if I use a single

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread David Blewett
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1]. 1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400 On Nov 26, 2014 12:24 PM, Aaron Davidson ilike...@gmail.com wrote: Spark has a known problem where it will do a pass of metadata on a large number

DAGScheduler StackOverflowError

2014-12-19 Thread David McWhorter
StackOverflowErrors in DAGScheduler such as the one below. I've attached a sample application that illustrates what I'm trying to do. Can anyone point out how I can keep the DAG from growing so large that Spark is not able to process it? Thank you, David java.lang.StackOverflowError
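
A hedged sketch of the usual remedy for an ever-growing DAG in an iterative job: checkpoint periodically so the lineage is truncated rather than accumulated. The checkpoint directory and loop body below are stand-ins.

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")        // hypothetical path

    var current: RDD[Int] = sc.parallelize(1 to 1000000)
    for (i <- 1 to 200) {
      current = current.map(_ + 1)    // stand-in for the real per-iteration work
      if (i % 10 == 0) {
        current.checkpoint()          // truncate the lineage every 10 iterations
        current.count()               // an action forces the checkpoint to materialise
      }
    }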

Re: SparkSQL Array type support - Unregonized Thrift TTypeId value: ARRAY_TYPE

2014-12-23 Thread David Allan
Doh... figured it out.

Re: Beginner in Spark

2015-02-06 Thread David Fallside
-Environment will have you quickly up and running on a single machine without having to manage the details of the system installations. There is a Docker version, https://github.com/ibm-et/spark-kernel/wiki/Using-the-Docker-Container-for-the-Spark-Kernel , if you prefer Docker. Regards, David King

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
You could also just push the data to Amazon S3, which would un-link the size of the cluster needed to process the data from the size of the data. DR On 02/03/2015 11:43 AM, Joe Wass wrote: I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need to store the input in HDFS

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
Why you cannot use S3 as a replacement for HDFS[0]. I'd love to be proved wrong, though, that would make things a lot easier. [0] http://wiki.apache.org/hadoop/AmazonS3 On 3 February 2015 at 16:45, David Rosenstrauch dar...@darose.net wrote: You could also just push the data to Amazon S3

Re: ephemeral-hdfs vs persistent-hdfs - performance

2015-02-03 Thread David Rosenstrauch
, I'll certainly give it a look. Can you give me a hint about how you unzip your input files on the fly? I thought that it wasn't possible to parallelize zipped inputs unless they were unzipped before passing to Spark? Joe On 3 February 2015 at 17:48, David Rosenstrauch dar...@darose.net wrote: We use

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
for reference: http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E On Wed, Jan 14, 2015 at 4:34 AM, David Jones letsnumsperi...@gmail.com wrote: Hi, I have a program that loads a single avro file using spark SQL

Re: Using Spark SQL with multiple (avro) files

2015-01-15 Thread David Jones
at 3:53 PM, David Jones letsnumsperi...@gmail.com wrote: Should I be able to pass multiple paths separated by commas? I haven't tried but didn't think it'd work. I'd expected a function that accepted a list of strings. On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote
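
Two hedged possibilities for the multi-file question, assuming the Spark 1.3-era data sources API and the Databricks spark-avro package (paths are hypothetical): globs are expanded by the underlying Hadoop input format, and separately loaded DataFrames can be unioned.

    // 1) a glob pattern covering several files at once
    val df1 = sqlContext.load("/data/avro/*.avro", "com.databricks.spark.avro")

    // 2) load each path and union the results
    val paths = Seq("/data/avro/part1.avro", "/data/avro/part2.avro")
    val df2 = paths.map(p => sqlContext.load(p, "com.databricks.spark.avro"))
                   .reduce(_ unionAll _)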

Re: configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
Hi Ganelin, sorry if it wasn't clear from my previous email, but that is how I am creating a spark context. I just didn't write out the lines where I create the new SparkConf and SparkContext. I am also upping the driver memory when running. Thanks, David On 01/12/2015 11:12 AM, Ganelin

Re: no snappyjava in java.library.path

2015-01-12 Thread David Rosenstrauch
I ran into this recently. Turned out we had an old org-xerial-snappy.properties file in one of our conf directories that had the setting: # Disables loading Snappy-Java native library bundled in the # snappy-java-*.jar file forcing to load the Snappy-Java native # library from the

Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
not possible, is there some way to load multiple avro files into the same table/RDD so the whole dataset can be processed (and in that case I'd supply paths to each file concretely, but I *really* don't want to have to do that). Thanks David

RE: GraphX vs GraphLab

2015-01-13 Thread Buttler, David
would be if the AMP Lab or Databricks maintained a set of benchmarks on the web that showed how much each successive version of Spark improved. Dave From: Madabhattula Rajesh Kumar [mailto:mrajaf...@gmail.com] Sent: Monday, January 12, 2015 9:24 PM To: Buttler, David Subject: Re: GraphX vs

configuring spark.yarn.driver.memoryOverhead on Spark 1.2.0

2015-01-12 Thread David McWhorter
spark configuration object but I still get "Will allocate AM container, with  MB memory including 384 MB overhead" when launching. I'm running in yarn-cluster mode. Any help or tips would be appreciated. Thanks, David -- David McWhorter Software Engineer Commonwealth Computer Research, Inc

Re: Spark Release 1.3.0 DataFrame API

2015-03-15 Thread David Mitchell
Thank you for your help. toDF() solved my first problem. And, the second issue was a non-issue, since the second example worked without any modification. David On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav ri...@infoobjects.com wrote: programmatically specifying Schema needs import

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread David Holiday
kk - I'll put something together and get back to you with more :-) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread David Holiday
hi all - thx for the alacritous replies! so regarding how to get things from notebook to spark and back, am I correct that spark-submit is the way to go? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com

Spark Release 1.3.0 DataFrame API

2015-03-14 Thread David Mitchell
[String], org.apache.spark.sql.ty pes.StructType) val df = sqlContext.createDataFrame(people, schema) Any help would be appreciated. David

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-24 Thread David Holiday
is: what do I need to do from here to get those first ten rows of table data into my RDD? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
the first element of data thusly: rddX.first I get the following error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.accumulo.core.data.Key any thoughts on where to go from here? DAVID HOLIDAY

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
w0t! that did it! t/y so much! I'm going to put together a pastebin or something that has all the code put together so if anyone else runs into this issue they will have some working code to help them figure out what's going on. DAVID HOLIDAY Software Engineer 760 607 3300

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
will do! I've got to clear with my boss what I can post and in what manner, but I'll definitely do what I can to put some working code out into the world so the next person who runs into this brick wall can benefit from all this :-D DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-25 Thread David Holiday
Hi Irfan, thanks for getting back to me - I'll try the Accumulo list to be sure. What is the normal use case for Spark though? I'm surprised that hooking it into something as common and popular as Accumulo isn't more of an everyday task. DAVID HOLIDAY Software Engineer 760 607 3300 | Office

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread David Holiday
, responses from notebook, etc. I'm going to try invoking the same techniques both from within a stand-alone scala problem and from the shell itself to see if I can get some traction. I'll report back when I have more data. cheers (and thx!) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312

ORCFiles

2015-04-24 Thread David Mitchell
Does anyone know in which version of Spark will there be support for ORCFiles via spark.sql.hive? Will it be in 1.4? David

sparkR equivalent to SparkContext.newAPIHadoopRDD?

2015-05-02 Thread David Holiday
the magic happen with sparkR. Anyone got any ideas? thanks! DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
to be inherent to the “commercial” vendors, but I can confirm as fact it is also in effect to the “open source movement” (because human nature remains the same) *From:* David Morales [mailto:dmora...@stratio.com] *Sent:* Thursday, May 14, 2015 4:30 PM *To:* Paolo Platter *Cc:* Evo Eftimov; Matei

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
something very similar… I will contact you to understand if we can contribute to you with some piece! Best Paolo *From:* Evo Eftimov evo.efti...@isecc.com *Sent:* Thursday, May 14, 2015 17:21 *To:* 'David Morales' dmora...@stratio.com, Matei Zaharia matei.zaha...@gmail.com *Cc

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
-- David Morales de Frías :: +34 607 010 411 :: @dmoralesdf https://twitter.com/dmoralesdf http://www.stratio.com/ Vía de las dos Castillas, 33, Ática 4, 3ª Planta

Re: spark sql - reading data from sql tables having space in column names

2015-06-02 Thread David Mitchell
I am having the same problem reading JSON. There does not seem to be a way of selecting a field that has a space, such as "Executor Info" from the Spark logs. I suggest that we open a JIRA ticket to address this issue. On Jun 2, 2015 10:08 AM, ayan guha guha.a...@gmail.com wrote: I would think the
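
One hedged workaround, assuming a Spark 1.4-era API and a SQL dialect that accepts backtick-quoted identifiers (the HiveContext dialect does); the input file below is hypothetical:

    val df = sqlContext.read.json("eventlog.json")   // hypothetical input
    df.registerTempTable("logs")

    // backticks protect the embedded space in the column name
    val executors = sqlContext.sql("SELECT `Executor Info` FROM logs")

    // the DataFrame API resolves the literal name directly, space included
    val sameThing = df.select(df("Executor Info"))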

RDD boundaries and triggering processing using tags in the data

2015-05-27 Thread David Webber
if you have seen something like this before. Thanks, David

Re: GraphX issue spark 1.3

2015-08-17 Thread David Zeelen
the code below is taken from the spark website and generates the error detailed Hi using spark 1.3 and trying some sample code: val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica", "prof" //
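
For reference, here is that example as it appears in the GraphX programming guide (the archive strips the quotes and closing parentheses above); one common cause of compile errors with it is a missing graphx or RDD import:

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                           (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                           Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    val defaultUser = ("John Doe", "Missing")
    val graph = Graph(users, relationships, defaultUser)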

Passing SPARK_CONF_DIR to slaves in standalone mode under Grid Engine job

2015-07-29 Thread David Chin
Hi, all, I am just setting up to run Spark in standalone mode, as a (Univa) Grid Engine job. I have been able to set up the appropriate environment variables such that the master launches correctly, etc. In my setup, I generate GE job-specific conf and log dirs. However, I am finding that the

Re: No. of Task vs No. of Executors

2015-07-18 Thread David Mitchell
This is likely due to data skew. If you are using key-value pairs, one key has a lot more records, than the other keys. Do you have any groupBy operations? David On Tue, Jul 14, 2015 at 9:43 AM, shahid sha...@trialx.com wrote: hi I have a 10 node cluster i loaded the data onto hdfs, so

Re: OLAP query using spark dataframe with cassandra

2015-11-10 Thread David Morales
ve columnar storage and query performance, but we had >> nothing more >> >> knowledge. >> >> Question is : Any guy had such use case for now, especially using in your >> production environment ? Would be interested in your architecture for >> designing this

RE: hdfs-ha on mesos - odd bug

2015-11-11 Thread Buttler, David
I have verified that this error exists on my system as well, and the suggested workaround also works. Spark version: 1.5.1; 1.5.2 Mesos version: 0.21.1 CDH version: 4.7 I have set up the spark-env.sh to contain HADOOP_CONF_DIR pointing to the correct place, and I have also linked in the

Re: How to avoid Spark shuffle spill memory?

2015-10-06 Thread David Mitchell
your code to make it use less memory. David On Tue, Oct 6, 2015 at 3:19 PM, unk1102 <umesh.ka...@gmail.com> wrote: > Hi I have a Spark job which runs for around 4 hours and it shared > SparkContext and runs many child jobs. When I see each job in UI I see > shuffle spill of aro

Re: Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-13 Thread David Bess
Got it working! Thank you for confirming my suspicion that this issue was related to Java. When I dug deeper I found multiple versions and some other issues. I worked on it a while before deciding it would be easier to just uninstall all Java and reinstall clean JDK, and now it works perfectly.

Install via directions in "Learning Spark". Exception when running bin/pyspark

2015-10-12 Thread David Bess
as java8u60 I double checked my python version and it appears to be 2.7.10 I am familiar with command line, and have background in hadoop, but this has me stumped. Thanks in advance, David Bess

Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return responses in milliseconds. Analytic databases tend to operate on large data sets, and

Re: submit_spark_job_to_YARN

2015-08-30 Thread David Mitchell
Hi Ajay, Are you trying to save to your local file system or to HDFS? // This would save to HDFS under /user/hadoop/counter counter.saveAsTextFile("/user/hadoop/counter"); David On Sun, Aug 30, 2015 at 11:21 AM, Ajay Chander itsche...@gmail.com wrote: Hi Everyone, Recently we have installed

Event logging not working when worker machine terminated

2015-09-08 Thread David Rosenstrauch
Our Spark cluster is configured to write application history event logging to a directory on HDFS. This all works fine. (I've tested it with Spark shell.) However, on a large, long-running job that we ran tonight, one of our machines at the cloud provider had issues and had to be terminated

Re: Event logging not working when worker machine terminated

2015-09-09 Thread David Rosenstrauch
Standalone. On 09/08/2015 11:18 PM, Jeff Zhang wrote: What cluster mode do you use ? Standalone/Yarn/Mesos ? On Wed, Sep 9, 2015 at 11:15 AM, David Rosenstrauch <dar...@darose.net> wrote: Our Spark cluster is configured to write application history event logging to a directory o

Re: Event logging not working when worker machine terminated

2015-09-09 Thread David Rosenstrauch
introduced in 1.3. Hopefully it's fixed in 1.4. Thanks, Charles On 9/9/15, 7:30 AM, "David Rosenstrauch" <dar...@darose.net> wrote: Standalone. On 09/08/2015 11:18 PM, Jeff Zhang wrote: What cluster mode do you use ? Standalone/Yarn/Mesos ? On Wed, Sep 9, 2015 at

Spark streaming to database exception handling

2015-09-17 Thread david w
I am using Spark Streaming to receive data from Kafka and then write the result RDD to an external database inside foreachPartition(). Everything works fine; my question is how we can ensure no data loss if there is a database connection failure, or another exception happens while writing data to the external

Re: Spark Streaming Suggestion

2015-09-15 Thread David Morales
inute M from cassandra and starts processing the data. >>> >>> 2. Storm writes the data to both cassandra and kafka, spark reads the >>> actual data from kafka , processes the data and writes to cassandra. >>> The second approach avoids additional hit of reading f

Spark on YARN multitenancy

2015-12-15 Thread David Fox
Hello Spark experts, We are currently evaluating Spark on our cluster that already supports MRv2 over YARN. We have noticed a problem with running jobs concurrently, in particular that a running Spark job will not release its resources until the job is finished. Ideally, if two people run any

Using Experimental Spark Features

2015-12-30 Thread David Newberger
this approach yet and if so what has your experience been with using it? If it helps we'd be looking to implement it using Scala. Secondly, in general what has people's experience been with using experimental features in Spark? Cheers, David Newberger

Problem About Worker System.out

2015-12-28 Thread David John
I have used Spark 1.4 for 6 months. Thanks to all the members of this community for your great work. I have a question about a logging issue. I hope this question can be solved. The program is running under this configuration: YARN cluster, YARN-client mode. In Scala, writing code

FW: Problem About Worker System.out

2015-12-28 Thread David John
015 at 5:33 PM, David John <david_john_2...@outlook.com> wrote: I have used Spark 1.4 for 6 months. Thanks to all the members of this community for your great work. I have a question about a logging issue. I hope this question can be solved. The program is running under this configurati

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
n manifest goes, I'm really not sure. I will research it though. Now I'm wondering if my mergeStrategy is to blame? I'm going to try there next. Thank you for the help! On Tue, Dec 22, 2015 at 1:18 AM, Igor Berman <igor.ber...@gmail.com> wrote: > David, can you verify that mysql conne

Re: Fat jar can't find jdbc

2015-12-22 Thread David Yerrington
.discard case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first case PathList("org", "apache", xs @ _*) => MergeStrategy.first case PathList("org", "jboss", xs @ _*) => MergeStrategy.first case "a
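
A hedged guess at the usual culprit with fat jars and JDBC: a blanket MergeStrategy.discard for META-INF also throws away META-INF/services/java.sql.Driver, the service file through which JDBC drivers register themselves. Keeping service files (sbt-assembly 0.12+ syntax) looks roughly like:

    assemblyMergeStrategy in assembly := {
      // concatenate service registrations instead of discarding them
      case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
      case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
      case _                                         => MergeStrategy.first
    }

Alternatively, forcing the driver class to load with Class.forName("com.mysql.jdbc.Driver") before opening the connection sidesteps the service file entirely.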

Fat jar can't find jdbc

2015-12-21 Thread David Yerrington
load("jdbc", myOptions)". I know this is a total newbie question but in my defense, I'm fairly new to Scala, and this is my first go at deploying a fat jar with sbt-assembly. Thanks for any advice! -- David Yerrington yerrington.net

RE: fishing for help!

2015-12-21 Thread David Newberger
Hi Eran, Based on the limited information the first things that come to my mind are Processor, RAM, and Disk speed. David Newberger QA Analyst WAND - The Future of Restaurant Technology (W) www.wandcorp.com (E) david.newber...@wandcorp.com

Re: WARN LoadSnappy: Snappy native library not loaded

2015-11-19 Thread David Rosenstrauch
I ran into this recently. Turned out we had an old org-xerial-snappy.properties file in one of our conf directories that had the setting: # Disables loading Snappy-Java native library bundled in the # snappy-java-*.jar file forcing to load the Snappy-Java native # library from the

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
t as ROSE and it is not designed to work in a clustered environment. ROSE on the other hand is designed for scale. David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: Re: ROSE: Spark + R on the JVM. Local Time: Janu

Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread David Russell
n Java, JavaScript and .NET that can easily support your use case. The outputs of your DeployR integration could then become inputs to your data processing system. David "All that is gold does not glitter, Not all those who wander are lost." Original Message Subject: Re:

ROSE: Spark + R on the JVM, now available.

2016-01-12 Thread David Russell
to [take a look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback, questions etc very welcome. David "All that is gold does not glitter, Not all those who wander are lost."

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Corey, > Would you mind providing a link to the github? Sure, here is the github link you're looking for: https://github.com/onetapbeyond/opencpu-spark-executor David "All that is gold does not glitter, Not all those who wander are lost." Original Message ---

RE: About a problem running a spark job in a cdh-5.7.0 vmware image.

2016-06-03 Thread David Newberger
Spark, it is cloned and can no longer be modified by the user. Spark does not support modifying the configuration at runtime." David Newberger From: Alonso Isidoro Roman [mailto:alons...@gmail.com] Sent: Friday, June 3, 2016 10:37 AM To: David Newberger Cc: user@spark.apache.org Subject: Re: Abo
