Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian
On 1/27/15 5:55 PM, Cheng Lian wrote: On 1/27/15 11:38 AM, Manoj Samel wrote: Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db. The use case is a Spark YARN app that will start and serve as a query server for multiple users, i.e. always up and running. At startup, there is an option t

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I wanted to update this thread for others who may be looking for a solution to this as well. I found [1] and I'm going to investigate if this is a viable solution. [1] http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Wed, Jan 28, 2015 at 12:51 AM,
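
A minimal Scala sketch of the approach from that StackOverflow answer, assuming the records can be keyed by a "group" field (the field name, paths and string formatting here are illustrative, not from the thread):

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext._

    // route each record to a subdirectory named after its key
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String] + "/" + name
    }

    val pairs = records.map(m => (m("group").toString, m.toString))   // records: RDD[Map[String, Any]]
    pairs.saveAsHadoopFile("hdfs:///out/base", classOf[String], classOf[String], classOf[KeyBasedOutput])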

Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running the RDD through a custom partitioner be the

Re: spark-submit conflicts with dependencies

2015-01-27 Thread Ted Yu
You can add exclusion to Spark pom.xml Here is an example (for hadoop-client which brings in hadoop-common) diff --git a/pom.xml b/pom.xml index 05cb379..53947d9 100644 --- a/pom.xml +++ b/pom.xml @@ -632,6 +632,10 @@ ${hadoop.deps.scope} +commons-configu

Re: Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, On Wed, Jan 28, 2015 at 1:45 PM, Soumitra Kumar wrote: > It is a Streaming application, so how/when do you plan to access the > accumulator on driver? > Well... maybe there would be some user command or web interface showing the errors that have happened during processing...? Thanks Tobias

Re: spark-submit conflicts with dependencies

2015-01-27 Thread soid
I have the same problem too. org.apache.hadoop:hadoop-common:jar:2.4.0 brings commons-configuration:commons-configuration:jar:1.6 but we're using commons-configuration:commons 1.8 Is there any workaround for this? Greg -- View this message in context: http://apache-spark-user-list.1001560.n3.
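
If the conflict is in your own application build rather than in Spark itself, one possible workaround (a hedged sketch for an sbt build.sbt, not taken from this thread) is to exclude the transitive artifact and pin the version you need:

    // drop the 1.6 version pulled in via hadoop-common, then declare 1.8 explicitly
    libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.4.0" exclude("commons-configuration", "commons-configuration")
    libraryDependencies += "commons-configuration" % "commons-configuration" % "1.8"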

Re: Error reporting/collecting for users

2015-01-27 Thread Soumitra Kumar
It is a Streaming application, so how/when do you plan to access the accumulator on driver? On Tue, Jan 27, 2015 at 6:48 PM, Tobias Pfeiffer wrote: > Hi, > > thanks for your mail! > > On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> That seems reasonab

LeftOuter Join issue

2015-01-27 Thread sprookie
I have about 15-20 joins to perform. Each of these tables is in the order of 6 million to 66 million rows. The number of columns ranges from 20 to 400. I read the parquet files and obtain SchemaRDDs. Then I use the join functionality on 2 SchemaRDDs. I join the previous join results with the next sche

joins way slow never go to completion

2015-01-27 Thread charlie Brown
I have about 15-20 joins to perform. Each of these tables is in the order of 6 million to 66 million rows. The number of columns ranges from 20 to 400. I read the parquet files and obtain SchemaRDDs. Then I use the join functionality on 2 SchemaRDDs. I join the previous join results with the next sche

Re: spark sqlContext udaf

2015-01-27 Thread Kuldeep Bora
UDAF is a WIP, at least from an API user's perspective, as there is no public API to my knowledge. https://issues.apache.org/jira/browse/SPARK-3947 Thanks On Tue, Jan 27, 2015 at 12:26 PM, sunwei wrote: > Hi, any one can show me some examples using UDAF for spark sqlcontext? >

Spark on Windows 2008 R2 server does not work

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
I download and install spark-1.2.0-bin-hadoop2.4.tgz pre-built version on Windows 2008 R2 server. When I submit a job using spark-submit, I got the following error WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform ... using builtin-java classe

RE: How to start spark master on windows

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
Never mind, the problem is that Java is not installed on Windows. I installed Java and the problem went away. Regards, Ningjun Wang Consulting Software Engineer LexisNexis 121 Chanlon Road New Providence, NJ 07974-1541 From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com] Sent: Tuesday

Re: performance of saveAsTextFile moving files from _temporary

2015-01-27 Thread Josh Walton
I'm not sure how to confirm how the moving is happening; however, one of the jobs I was talking about, with 9k files of 4mb each, just completed. Spark UI showed the job being complete after ~2 hours. The last four hours of the job were just moving the files from _temporary to their final destina

RE: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Shao, Saisai
Aha, you’re right, I did a wrong comparison, the reason might be only for checkpointing :). Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Wednesday, January 28, 2015 10:39 AM To: Shao, Saisai Cc: user Subject: Re: Why must the dstream.foreachRDD(...) parameter be serializa

Re: Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, thanks for your mail! On Wed, Jan 28, 2015 at 11:44 AM, Tathagata Das wrote: > That seems reasonable to me. Are you having any problems doing it this way? > Well, actually I haven't done that yet. The idea of using accumulators to collect errors just came while writing the email, but I tho
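
A minimal sketch of that accumulator idea (processRecord stands in for the user-defined logic; the rest is an assumption, not code from the thread):

    import scala.collection.mutable.ArrayBuffer

    // driver-side accumulator that gathers error messages produced on the executors
    val errors = ssc.sparkContext.accumulableCollection(ArrayBuffer[String]())

    dstream.foreachRDD { rdd =>
      rdd.foreach { record =>
        try {
          processRecord(record)                      // hypothetical user-defined processing
        } catch {
          case e: Exception => errors += e.getMessage
        }
      }
      // errors.value can now be read on the driver, e.g. for a web interface
    }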

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Tobias Pfeiffer
Hi, thanks for the answers! On Wed, Jan 28, 2015 at 11:31 AM, Shao, Saisai wrote: > > Also this `foreachFunc` is more like an action function of RDD, thinking > of rdd.foreach(func), in which `func` need to be serializable. So maybe I > think your way of use it is not a normal way :). > Yeah I

RE: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Shao, Saisai
Hey Tobias, I think one consideration is for checkpoint of DStream which guarantee driver fault tolerance. Also this `foreachFunc` is more like an action function of RDD, thinking of rdd.foreach(func), in which `func` need to be serializable. So maybe I think your way of use it is not a normal

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Matei Zaharia
I believe this is needed for driver recovery in Spark Streaming. If your Spark driver program crashes, Spark Streaming can recover the application by reading the set of DStreams and output operations from a checkpoint file (see https://spark.apache.org/docs/latest/streaming-programming-guide.htm

Re: Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
Thanks Sun. My understanding is, saveAsNewAPIHadoopFile is to save as HFile on HDFS. Is it doable to use saveAsNewAPIHadoopDataset to load directly into HBase? If so, is there any sample code for that? Thanks! On Tue, Jan 27, 2015 at 6:07 PM, fightf...@163.com wrote: > Hi, Jim > Your generated

Re: performance of saveAsTextFile moving files from _temporary

2015-01-27 Thread Aaron Davidson
This renaming from _temporary to the final location is actually done by executors, in parallel, for saveAsTextFile. It should be performed by each task individually before it returns. I have seen an issue similar to what you mention dealing with Hive code which did the renaming serially on the dri

Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Tobias Pfeiffer
Hi, I want to do something like dstream.foreachRDD(rdd => if (someCondition) ssc.stop()) so in particular the function does not touch any element in the RDD and runs completely within the driver. However, this fails with a NotSerializableException because $outer is not serializable etc. The DStr
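
For reference, a sketch of the failing pattern being described; the closure captures the enclosing object (and through it the StreamingContext), which is why Spark attempts to serialize it, e.g. for checkpointing as discussed in the replies:

    // throws NotSerializableException at setup time: $outer / ssc get pulled into the closure
    dstream.foreachRDD { rdd =>
      if (someCondition) ssc.stop()   // someCondition is whatever driver-side check applies
    }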

Re: Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread fightf...@163.com
Hi, Jim Your generated RDD should be of the type RDD[ImmutableBytesWritable, KeyValue], while your current type goes to RDD[ImmutableBytesWritable, Put]. You can go like this and the result should be of type RDD[ImmutableBytesWritable, KeyValue] that can be saved with saveAsNewAPIHadoopFile: val result = n
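
A hedged sketch of that conversion, assuming a single column family "cf1", a qualifier "c1", and an input RDD of (rowKey, value) strings (names and paths are placeholders):

    import org.apache.hadoop.hbase.KeyValue
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.SparkContext._

    val hfileRdd = rdd.map { case (rowKey, value) =>
      val rowBytes = Bytes.toBytes(rowKey)
      val kv = new KeyValue(rowBytes, Bytes.toBytes("cf1"), Bytes.toBytes("c1"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(rowBytes), kv)
    }

    // HFileOutputFormat expects rows in sorted order, so a sortByKey may be needed first
    hfileRdd.saveAsNewAPIHadoopFile("/tmp/hfiles", classOf[ImmutableBytesWritable],
      classOf[KeyValue], classOf[HFileOutputFormat], conf)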

Re: Large number of pyspark.daemon processes

2015-01-27 Thread Sven Krasser
After slimming down the job quite a bit, it looks like a call to coalesce() on a larger RDD can cause these Python worker spikes (additional details in Jira: https://issues.apache.org/jira/browse/SPARK-5395?focusedCommentId=14294570&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpa

Re: SparkSQL Performance Tuning Options

2015-01-27 Thread Cheng Lian
On 1/27/15 11:38 AM, Manoj Samel wrote: Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db. The use case is a Spark YARN app that will start and serve as a query server for multiple users, i.e. always up and running. At startup, there is an option to cache data and also pre-compute some r

Error reporting/collecting for users

2015-01-27 Thread Tobias Pfeiffer
Hi, in my Spark Streaming application, computations depend on users' input in terms of * user-defined functions * computation rules * etc. that can throw exceptions in various cases (think: exception in UDF, division by zero, invalid access by key etc.). Now I am wondering about what is a good

RE: Running a script on scala-shell on Spark Standalone Cluster

2015-01-27 Thread Mohammed Guller
Looks like the culprit is this error: FileNotFoundException: File file:/home/sparkuser/spark-1.2.0/spark-1.2.0-bin-hadoop2.4/data/cut/ratings.txt does not exist Mohammed -Original Message- From: riginos [mailto:samarasrigi...@gmail.com] Sent: Tuesday, January 27, 2015 4:24 PM To: user

performance of saveAsTextFile moving files from _temporary

2015-01-27 Thread jwalton
We are running spark in Google Compute Engine using their One-Click Deploy. By doing so, we get their Google Cloud Storage connector for hadoop for free meaning we can specify gs:// paths for input and output. We have jobs that take a couple of hours, end up with ~9k partitions which means 9k outp
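
A hedged aside that was not raised in this thread: if the sheer number of output files is itself the bottleneck, reducing partitions before saving cuts down the number of files that have to be committed out of _temporary:

    // 500 is an arbitrary example value; pick based on the desired output file size
    rdd.coalesce(500).saveAsTextFile("gs://my-bucket/output")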

Running a script on scala-shell on Spark Standalone Cluster

2015-01-27 Thread riginos
I configured 4 PCs with spark-1.2.0-bin-hadoop2.4.tgz with Spark Standalone on a cluster. On the master PC I executed: ./sbin/start-master.sh ./sbin/start-slaves.sh (4 instances) On the datanode1, datanode2, secondarymaster PCs I executed: ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://m

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Antony Mayi
I have YARN configured with yarn.nodemanager.vmem-check-enabled=false and yarn.nodemanager.pmem-check-enabled=false to avoid YARN killing the containers. The stack trace is below. Thanks, Antony. 15/01/27 17:02:53 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 15/01/27 17

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Guru Medasani
Can you attach the logs where this is failing? From: Sven Krasser Date: Tuesday, January 27, 2015 at 4:50 PM To: Guru Medasani Cc: Sandy Ryza , Antony Mayi , "user@spark.apache.org" Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Since it's an executor running OOM it

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Sven Krasser
Since it's an executor running OOM it doesn't look like a container being killed by YARN to me. As a starting point, can you repartition your job into smaller tasks? -Sven On Tue, Jan 27, 2015 at 2:34 PM, Guru Medasani wrote: > Hi Anthony, > > What is the setting of the total amount of memory in
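
A rough sketch combining the knobs mentioned in this thread (values are placeholders, and expensiveTransform stands in for the real job):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "8192")   // off-heap headroom per executor
      .set("spark.storage.memoryFraction", "0.3")          // leave more room for task execution
    val sc = new SparkContext(conf)

    sc.textFile("hdfs:///input")
      .repartition(2000)            // more, smaller tasks, per the suggestion above
      .map(expensiveTransform)
      .count()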

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Guru Medasani
Hi Anthony, What is the setting of the total amount of memory in MB that can be allocated to containers on your NodeManagers? yarn.nodemanager.resource.memory-mb Can you check this above configuration in yarn-site.xml used by the node manager process? -Guru Medasani From: Sandy Ryza Date:

Re: RDD.combineBy

2015-01-27 Thread francois . garillot
Have you looked at the `aggregate` function in the RDD API? If your way of extracting the “key” (identifier) and “value” (payload) parts of the RDD elements is uniform (a function), it’s unclear to me how this would be more efficient than extracting key and value and then using combine, howe

How to start spark master on windows

2015-01-27 Thread Wang, Ningjun (LNG-NPV)
I downloaded Spark 1.2.0 on my Windows Server 2008. How do I start the Spark master? I tried to run the following on the command prompt: C:\spark-1.2.0-bin-hadoop2.4> bin\spark-class.cmd org.apache.spark.deploy.master.Master I got the error: "else was unexpected at this time." Ningjun

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Sandy Ryza
Hi Antony, If you look in the YARN NodeManager logs, do you see that it's killing the executors? Or are they crashing for a different reason? -Sandy On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi wrote: > Hi, > > I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors > crashe

RDD.combineBy

2015-01-27 Thread Mohit Jaggi
Hi All, I have a use case where I have an RDD (not a k,v pair) where I want to do a combineByKey() operation. I can do that by creating an intermediate RDD of k,v pairs and using PairRDDFunctions.combineByKey(). However, I believe it will be more efficient if I can avoid this intermediate RDD. I
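
For reference, a minimal sketch of the pair-RDD route under discussion (keyFn and valueFn are stand-ins for the user's extraction functions, and the Int values are just for illustration):

    import org.apache.spark.SparkContext._   // pair RDD functions in Spark 1.x

    // build (k, v) pairs on the fly, then combine per key into a Seq of values
    val combined = rdd
      .map(x => (keyFn(x), valueFn(x)))
      .combineByKey(
        (v: Int) => Seq(v),                      // createCombiner
        (acc: Seq[Int], v: Int) => acc :+ v,     // mergeValue
        (a: Seq[Int], b: Seq[Int]) => a ++ b)    // mergeCombiners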

Re: Storing DecisionTreeModel

2015-01-27 Thread andresb...@gmail.com
Ok, thanks for your reply! On Tue, Jan 27, 2015 at 2:32 PM, Joseph Bradley wrote: > Hi Andres, > > Currently, serializing the object is probably the best way to do it. > However, there are efforts to support actual model import/export: > https://issues.apache.org/jira/browse/SPARK-4587 > https:/

Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
I used below code, and it still failed with the same error. Anyone has experience on bulk loading using scala? Thanks. import org.apache.spark._ import org.apache.spark.rdd.NewHadoopRDD import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor} import org.apache.hadoop.hbase.client.HBas

java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-01-27 Thread Antony Mayi
Hi, I am using spark.yarn.executor.memoryOverhead=8192 yet getting executors crashed with this error. Does that mean I genuinely don't have enough RAM, or is this a matter of config tuning? Other config options used: spark.storage.memoryFraction=0.3 SPARK_EXECUTOR_MEMORY=14G running spark 1.2.0 as yarn

Re: Storing DecisionTreeModel

2015-01-27 Thread Joseph Bradley
Hi Andres, Currently, serializing the object is probably the best way to do it. However, there are efforts to support actual model import/export: https://issues.apache.org/jira/browse/SPARK-4587 https://issues.apache.org/jira/browse/SPARK-1406 I'm hoping to have the PR for the first JIRA ready so
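
A minimal sketch of the plain-serialization route, assuming the model object (e.g. DecisionTreeModel) is java.io.Serializable:

    import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
    import org.apache.spark.mllib.tree.model.DecisionTreeModel

    def saveModel(model: DecisionTreeModel, path: String): Unit = {
      val out = new ObjectOutputStream(new FileOutputStream(path))
      try out.writeObject(model) finally out.close()
    }

    def loadModel(path: String): DecisionTreeModel = {
      val in = new ObjectInputStream(new FileInputStream(path))
      try in.readObject().asInstanceOf[DecisionTreeModel] finally in.close()
    }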

Re: Index wise most frequently occurring element

2015-01-27 Thread Sven Krasser
Use combineByKey. For top 10 as an example (bottom 10 works similarly): add the element to a list. If the list is larger than 10, delete the smallest elements until the size is back to 10. -Sven On Tue, Jan 27, 2015 at 3:35 AM, kundan kumar wrote: > I have an array of the form > > val array: Array[
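
A minimal sketch of that approach for the top 10, using the (idx, (word, count)) pairs from the question (bottom 10 would sort the other way):

    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(array)   // array: Array[(Int, (String, Int))]

    // keep only the 10 largest counts per index while combining
    val top10 = pairs.combineByKey(
      (v: (String, Int)) => List(v),
      (acc: List[(String, Int)], v: (String, Int)) => (v :: acc).sortBy(-_._2).take(10),
      (a: List[(String, Int)], b: List[(String, Int)]) => (a ++ b).sortBy(-_._2).take(10))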

Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
Thanks Ted. Could you give me a simple example to load one row data in hbase? How should I generate the KeyValue? I tried multiple times, and still can not figure it out. On Tue, Jan 27, 2015 at 12:10 PM, Ted Yu wrote: > Here is the method signature used by HFileOutputFormat : > public voi

Re: Spark (Streaming?) holding on to Mesos resources

2015-01-27 Thread Tim Chen
Hi Gerard, As others have mentioned, I believe you're hitting MESOS-1688; can you upgrade to the latest Mesos release (0.21.1) and let us know if it resolves your problem? Thanks, Tim On Tue, Jan 27, 2015 at 10:39 AM, Sam Bessalah wrote: > Hi Gerard, > isn't this the same issue as this? > https

Re: Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Ted Yu
Here is the method signature used by HFileOutputFormat : public void write(ImmutableBytesWritable row, KeyValue kv) Meaning, KeyValue is expected, not Put. On Tue, Jan 27, 2015 at 10:54 AM, Jim Green wrote: > Hi Team, > > I need some help on writing a scala to bulk load some data into hba

Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-01-27 Thread Corey Nolet
I've read that this is supposed to be a rather significant optimization to the shuffle system in 1.1.0 but I'm not seeing much documentation on enabling this in Yarn. I see github classes for it in 1.2.0 and a property "spark.shuffle.service.enabled" in the spark-defaults.conf. The code mentions t

SparkSQL Performance Tuning Options

2015-01-27 Thread Manoj Samel
Spark 1.2, no Hive, prefer not to use HiveContext to avoid metastore_db. The use case is a Spark YARN app that will start and serve as a query server for multiple users, i.e. always up and running. At startup, there is an option to cache data and also pre-compute some result sets, hash maps etc. that would be lik
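
A minimal sketch of the cache-at-startup idea with a plain SQLContext (paths and table names are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val events = sqlContext.parquetFile("hdfs:///data/events")   // placeholder path
    events.registerTempTable("events")
    sqlContext.cacheTable("events")      // keep the data in memory across user queries

    // pre-compute a result set once so later queries hit cached data
    sqlContext.sql("SELECT count(*) FROM events").collect()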

Error on block pushing thread (spark streaming)

2015-01-27 Thread Matt Silvey
Folks, We are running a streaming job on a cluster that is failing with: 'ERROR BlockGenerator: Error in block pushing thread' via a Futures timeout. The cluster is running on EC2. Data is fed in via a Kafka client. I have been looking for network communications issues but haven't found any smoki

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread Davies Liu
Maybe it's caused by integer overflow; is it possible that one object or batch is bigger than 2G (after pickling)? On Tue, Jan 27, 2015 at 7:59 AM, rok wrote: > I've got a dataset saved with saveAsPickleFile using pyspark -- it saves > without problems. When I try to read it back in, it fails with:

Bulk loading into hbase using saveAsNewAPIHadoopFile

2015-01-27 Thread Jim Green
Hi Team, I need some help on writing a Scala program to bulk load some data into HBase. *Env:* hbase 0.94 spark-1.0.2 I am trying the below code to just bulk load some data into hbase table “t1”. import org.apache.spark._ import org.apache.spark.rdd.NewHadoopRDD import org.apache.hadoop.hbase.{HBaseConfigur

Re: [documentation] Update the python example ALS of the site?

2015-01-27 Thread Davies Liu
will be fixed by https://github.com/apache/spark/pull/4226 On Tue, Jan 27, 2015 at 8:17 AM, gen tang wrote: > Hi, > > In the spark 1.2.0, it requires the ratings should be a RDD of Rating or > tuple or list. However, the current example in the site use still RDD[array] > as the ratings. Therefore

java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda

2015-01-27 Thread Milad khajavi
Hi all, I can run a Spark job programmatically in J2SE with the following code without any error: import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; public class TestSpark { public static void main() { String sourceP

Re: Spark (Streaming?) holding on to Mesos resources

2015-01-27 Thread Sam Bessalah
Hi Gerard, isn't this the same issue as this? https://issues.apache.org/jira/browse/MESOS-1688 On Mon, Jan 26, 2015 at 9:17 PM, Gerard Maas wrote: > Hi, > > We are observing with certain regularity that our Spark jobs, as Mesos > framework, are hoarding resources and not releasing them, resulti

Re: Mathematical functions in spark sql

2015-01-27 Thread Cheng Lian
Hey Alexey, You need to use HiveContext in order to access Hive UDFs. You may try it with bin/spark-sql (src is a Hive table): spark-sql> select key / 3 from src limit 10; 79.33 28.668 103.67 9.0 55.0 136.34 85.0 92.67 32.6
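
The same thing done programmatically looks roughly like this (a sketch; src is a Hive table as above):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("SELECT key / 3 FROM src LIMIT 10").collect().foreach(println)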

Re: spark 1.2 ec2 launch script hang

2015-01-27 Thread Nicholas Chammas
For those who found that absolute vs. relative path for the pem file mattered, what OS and shell are you using? What version of Spark are you using? ~/ vs. absolute path shouldn’t matter. Your shell will expand the ~/ to the absolute path before sending it to spark-ec2. (i.e. tilde expansion.) Ab

Re: saving rdd to multiple files named by the key

2015-01-27 Thread Nicholas Chammas
There is also SPARK-3533, which proposes to add a convenience method for this. On Mon Jan 26 2015 at 10:38:56 PM Aniket Bhatnagar <aniket.bhatna...@gmail.com> wrote: > This might be helpful: > http://stackoverflow.com/questions/23995040/write-

[documentation] Update the python example ALS of the site?

2015-01-27 Thread gen tang
Hi, In Spark 1.2.0, it requires that the ratings be an RDD of Rating or tuple or list. However, the current example on the site still uses RDD[array] as the ratings. Therefore, the example doesn't work under version 1.2.0. Maybe we should update the documentation on the site? Thanks a lo

NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread rok
I've got a dataset saved with saveAsPickleFile using pyspark -- it saves without problems. When I try to read it back in, it fails with: Job aborted due to stage failure: Task 401 in stage 0.0 failed 4 times, most recent failure: Lost task 401.3 in stage 0.0 (TID 449, e1326.hpc-lca.ethz.ch): jav

Storing DecisionTreeModel

2015-01-27 Thread andresbm84
Hi everyone, Is there a way to save the model on disk to reuse it later? I could serialize the object and save the bytes, but I'm guessing there might be a better way to do so. Has anyone tried that? Andres. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.co

Re: Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-01-27 Thread Ted Yu
Caused by: java.lang.IllegalArgumentException: Invalid rule: L RULE:[2:$1@$0](.*@XXXCOMPANY.COM )s/(.*)@ XXXCOMPANY.COM/$1/L DEFAULT Can you put the rule on a single line (not sure whether there is newline or space between L and DEFAULT) ? Look

Re: Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-01-27 Thread Niranjan Reddy
Thanks, Ted. Kerberos is enabled on the cluster. I'm new to the world of kerberos, so please excuse my ignorance here. Do you know if there are any additional steps I need to take in addition to setting HADOOP_CONF_DIR? For instance, does hadoop.security.auth_to_local require any specific setting (

Re: spark 1.2 ec2 launch script hang

2015-01-27 Thread Charles Feduke
Absolute path means no ~ and also verify that you have the path to the file correct. For some reason the Python code does not validate that the file exists and will hang (this is the same reason why ~ hangs). On Mon, Jan 26, 2015 at 10:08 PM Pete Zybrick wrote: > Try using an absolute path to the

Re: Spark on Yarn: java.lang.IllegalArgumentException: Invalid rule

2015-01-27 Thread maven
Thanks, Siddardha. I did but got the same error. Kerberos is enabled on my cluster and I may be missing a configuration step somewhere. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-java-lang-IllegalArgumentException-Invalid-rule-tp21382p213

Re: Mathematical functions in spark sql

2015-01-27 Thread Ted Yu
Created SPARK-5427 for the addition of floor function. On Mon, Jan 26, 2015 at 11:20 PM, Alexey Romanchuk < alexey.romanc...@gmail.com> wrote: > I have tried "select ceil(2/3)", but got "key not found: floor" > > On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu wrote: > >> Have you tried floor() or ceil

Re: Spark and S3 server side encryption

2015-01-27 Thread Ted Yu
Adding on what Thomas said. There have been a few bug fixes for s3a since Hadoop 2.6.0 was released. One example is HADOOP-11446. The fixes would be in Hadoop 2.7.0 Cheers > On Jan 27, 2015, at 1:41 AM, Thomas Demoor > wrote: > > Spark uses the Hadoop filesystems. > > I assume you are try

Re: how to split key from RDD for compute UV

2015-01-27 Thread Gerard Maas
Hi, Did you try asking this on StackOverflow? http://stackoverflow.com/questions/tagged/apache-spark I'd also suggest adding some sample data to help others understand your logic. -kr, Gerard. On Tue, Jan 27, 2015 at 1:14 PM, 老赵 wrote: > Hello All, > > I am writing a simple Spark applica

how to split key from RDD for compute UV

2015-01-27 Thread 老赵
Hello All, I am writing a simple Spark application to count UV (unique views) from a log file. Below is my code; it is not right on the red line. My idea here is that the same cookie on a host only counts once. So I want to split the host from the previous RDD. But now I don't know how to finish it. A
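
A hedged sketch of one way to finish it, assuming each log line can be parsed into (host, cookie); the field layout is an assumption:

    import org.apache.spark.SparkContext._

    val uvPerHost = logLines                 // logLines: RDD[String]
      .map { line =>
        val fields = line.split("\t")
        (fields(0), fields(1))               // (host, cookie)
      }
      .distinct()                            // the same cookie on a host counts only once
      .map { case (host, _) => (host, 1) }
      .reduceByKey(_ + _)                    // UV per host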

Best way to shut down a stream initiated from the receiver

2015-01-27 Thread jamborta
Hi all, we are building a custom JDBC receiver that would create a stream from sql tables. Not sure what is the best way to shut down the stream once all the data goes through, as the receiver knows it is completed but it cannot initiate the stream to shut down. Any suggestion how to structure th

Index wise most frequently occurring element

2015-01-27 Thread kundan kumar
I have an array of the form val array: Array[(Int, (String, Int))] = Array( (idx1,(word1,count1)), (idx2,(word2,count2)), (idx1,(word1,count1)), (idx3,(word3,count1)), (idx4,(word4,count4))) I want to get the top 10 and bottom 10 elements from this array for each index (idx1,idx2,

Re: Issues when combining Spark and a third party java library

2015-01-27 Thread Staffan
Okay, I finally tried to change the Hadoop-client version from 2.4.0 to 2.5.2 and that mysteriously fixed everything.. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issues-when-combining-Spark-and-a-third-party-java-library-tp21367p21387.html Sent from th

Re: SaveAsTextFile to S3 bucket

2015-01-27 Thread Thomas Demoor
S3 does not have the concept "directory". An S3 bucket only holds files (objects). The Hadoop filesystem is mapped onto a bucket and uses Hadoop-specific (or rather "s3tool"-specific: s3n uses the jets3t tool) conventions (hacks) to fake directories, such as a name ending with a slash ("filename/") and wit

Re: Spark and S3 server side encryption

2015-01-27 Thread Thomas Demoor
Spark uses the Hadoop filesystems. I assume you are trying to use s3n:// which, under the hood, uses the 3rd party jets3t library. It is configured through the jets3t.properties file (google "hadoop s3n jets3t") which you should put on Spark's classpath. The setting you are looking for is s3servic

RE: [SQL] Self join with ArrayType columns problems

2015-01-27 Thread Cheng, Hao
The root cause for this is probably that the identical “exprId” of the “AttributeReference” exists while doing a self-join with a “temp table” (temp table = resolved logical plan). I will do the bug fixing and JIRA creation. Cheng Hao From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Tuesd

Re: HDFS Namenode in safemode when I turn off my EC2 instance

2015-01-27 Thread Su She
Thanks Akhil! 1) I had to do sudo -u hdfs hdfs dfsadmin -safemode leave a) I had created a user called hdfs with superuser privileges in Hue, hence the double hdfs. 2) Lastly, I know this is getting a bit off topic, but this is my etc/hosts file: 127.0.0.1 localhost.localdomain loca

Re: Issues when combining Spark and a third party java library

2015-01-27 Thread Staffan
To clarify: I'm currently working on this locally, running on a laptop and I do not use Spark-submit (using Eclipse to run my applications currently). I've tried running both on Mac OS X and in a VM running Ubuntu. Furthermore, I've got the VM from a fellow worker which has no issues running his Sp

Re: Spark (Streaming?) holding on to Mesos Resources

2015-01-27 Thread Adam Bordelon
> Hopefully some very bad ugly bug that has been fixed already and that will urge us to upgrade our infra? > Mesos 0.20 + Marathon 0.7.4 + Spark 1.1.0 Could be https://issues.apache.org/jira/browse/MESOS-1688 (fixed in Mesos 0.21) On Mon, Jan 26, 2015 at 2:45 PM, Gerard Maas wrote: > Hi Jörn, >