Re: Segmented fold count

2014-08-18 Thread Davies Liu
import itertools
l = [1,1,1,2,2,3,4,4,5,1]
gs = itertools.groupby(l)
map(lambda (n, it): (n, sum(1 for _ in it)), gs)
[(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]

def groupCount(l):
    gs = itertools.groupby(l)
    return map(lambda (n, it): (n, sum(1 for _ in it)), gs)

If you have an

Re: Segmented fold count

2014-08-18 Thread Andrew Ash
What happens when a run of numbers is spread across a partition boundary? I think you might end up with two adjacent groups of the same value in that situation. On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu dav...@databricks.com wrote: import itertools l = [1,1,1,2,2,3,4,4,5,1] gs =

Re: Segmented fold count

2014-08-18 Thread Davies Liu
On Sun, Aug 17, 2014 at 11:07 PM, Andrew Ash and...@andrewash.com wrote: What happens when a run of numbers is spread across a partition boundary? I think you might end up with two adjacent groups of the same value in that situation. Yes, you need another scan to combine these contiguous groups
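A minimal Scala sketch of that two-phase approach (count runs inside each partition with mapPartitions, then merge runs that touch partition boundaries on the driver); sc and the partition count are assumptions:

    val data = sc.parallelize(Seq(1, 1, 1, 2, 2, 3, 4, 4, 5, 1), 3)

    // phase 1: (value, runLength) pairs within each partition, in order
    val perPartition = data.mapPartitions { it =>
      val runs = scala.collection.mutable.ArrayBuffer.empty[(Int, Int)]
      it.foreach { x =>
        if (runs.nonEmpty && runs.last._1 == x)
          runs(runs.size - 1) = (x, runs.last._2 + 1)
        else
          runs += ((x, 1))
      }
      Iterator(runs.toList)
    }.collect()

    // phase 2: the extra scan -- merge the last run of one partition with the
    // first run of the next partition when the values match
    val merged = perPartition.flatten.foldLeft(List.empty[(Int, Int)]) { (acc, run) =>
      acc match {
        case (v, n) :: rest if v == run._1 => (v, n + run._2) :: rest
        case _                             => run :: acc
      }
    }.reverse
    // merged == List((1,3), (2,2), (3,1), (4,2), (5,1), (1,1))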

Re: application as a service

2014-08-18 Thread Davies Liu
Another option is using Tachyon to cache the RDD, then the cache can be shared by different applications. See how to use Spark with Tachyon: http://tachyon-project.org/Running-Spark-on-Tachyon.html Davies On Sun, Aug 17, 2014 at 4:48 PM, ryaminal tacmot...@gmail.com wrote: You can also look

Re: OutOfMemory Error

2014-08-18 Thread Akhil Das
Hi Ghousia, You can try the following:
1. Increase the heap size: https://spark.apache.org/docs/0.9.0/configuration.html
2. Increase the number of partitions: http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
3. You could try

NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
Hello: I am trying to set up Spark to connect to a Hive table which is backed by HBase, but I am running into the following NullPointerException:

scala> val hiveCount = hiveContext.sql("select count(*) from dataset_records").collect().head.getLong(0)
14/08/18 06:34:29 INFO ParseDriver: Parsing

a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Henry Hung
Hi All, I'm new to Spark and Scala, just recently using this language and love it, but there is a small coding problem when I want to convert my existing map reduce code from Java to Spark... In Java, I create a class by extending org.apache.hadoop.mapreduce.Mapper and override the setup(),

Re: OutOfMemory Error

2014-08-18 Thread Ghousia
Thanks for the answer Akhil. We are right now getting rid of this issue by increasing the number of partitions. And we are persisting RDDs to DISK_ONLY. But the issue is with heavy computations within an RDD. It would be better if we have the option of spilling the intermediate transformation
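For reference, a minimal sketch of the DISK_ONLY persistence mentioned above (someRdd and heavyComputation are placeholders):

    import org.apache.spark.storage.StorageLevel
    // keep the expensive intermediate result on disk instead of the heap
    val intermediate = someRdd.map(heavyComputation).persist(StorageLevel.DISK_ONLY)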

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Akhil Das
Looks like your hiveContext is null. Have a look at this documentation. https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables Thanks Best Regards On Mon, Aug 18, 2014 at 12:09 PM, Cesar Arevalo ce...@zephyrhealthinc.com wrote: Hello: I am trying to setup Spark to
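A minimal sketch of the HiveContext setup from that guide, using the Spark 1.0-era hql() method; the table name is taken from the original question:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc is the existing SparkContext
    val hiveCount = hiveContext
      .hql("SELECT COUNT(*) FROM dataset_records")
      .collect().head.getLong(0)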

RE: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Henry Hung
Hi All, Please ignore my question, I found a way to implement it via old archive mails: http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAF_KkPzpU4qZWzDWUpS5r9bbh=-hwnze2qqg56e25p--1wv...@mail.gmail.com%3E Best regards, Henry From: MA33 YTHung1 Sent: Monday, August 18, 2014

Re: OutOfMemory Error

2014-08-18 Thread Akhil Das
I believe spark.shuffle.memoryFraction is the one you are looking for. spark.shuffle.memoryFraction : Fraction of Java heap to use for aggregation and cogroups during shuffles, if spark.shuffle.spill is true. At any given time, the collective size of all in-memory maps used for shuffles is
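A sketch of how the knobs from this thread could be set on a 1.x SparkConf; the specific values are illustrative, not recommendations:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("oom-tuning-example")
      .set("spark.executor.memory", "4g")            // 1. bigger heap
      .set("spark.shuffle.spill", "true")
      .set("spark.shuffle.memoryFraction", "0.3")    // fraction of heap for shuffle maps
    val sc = new org.apache.spark.SparkContext(conf)

    // 2. more partitions means smaller tasks
    val input = sc.textFile("hdfs:///some/path").repartition(200)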

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Akhil Das
You can create an RDD and then do a map or mapPartitions on it, where at the top you create the database connection, then do the operations, and at the end close the connection. Thanks Best Regards On Mon, Aug 18, 2014 at 12:34 PM, Henry Hung ythu...@winbond.com wrote:

Re: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Sean Owen
I think there was a more comprehensive answer to this recently. Tobias is right that it is not quite that simple: http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E On Mon, Aug 18, 2014 at 8:04 AM, Henry Hung

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
Nope, it is NOT null. Check this out:

scala> hiveContext == null
res2: Boolean = false

And thanks for sending that link, but I had already looked at it. Any other ideas? I looked through some of the relevant Spark Hive code and I'm starting to think this may be a bug. -Cesar On Mon, Aug 18,

RE: a noob question for how to implement setup and cleanup in Spark map

2014-08-18 Thread Henry Hung
I slightly modified the code to use while (partitions.hasNext) { } instead of partitions.map(func). I suppose this can eliminate the uncertainty from lazy execution. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Monday, August 18, 2014 3:10 PM To: MA33 YTHung1 Cc:
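A Scala sketch of the setup/operate/cleanup shape discussed in this thread, with the partition iterator materialized before the connection is closed (rdd and DummyConnection are hypothetical placeholders):

    rdd.mapPartitions { records =>
      val conn = new DummyConnection()               // "setup", once per partition
      // materialize the results before closing, so the lazy iterator
      // is not consumed after the connection is gone
      val results = records.map(r => conn.lookup(r)).toList
      conn.close()                                   // "cleanup"
      results.iterator
    }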

Re: Segmented fold count

2014-08-18 Thread fil
Thanks for the reply! def groupCount(l): gs = itertools.groupby(l) return map(lambda (n, it): (n, sum(1 for _ in it)), gs) If you have an RDD, you can use RDD.mapPartitions(groupCount).collect() Yes, I am interested in RDD - not pure Python :) I am new to Spark, can you explain: -

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Akhil Das
Then it's definitely a jar conflict. Can you try removing this jar from the class path: /opt/spark-poc/lib_managed/jars/org.spark-project.hive/hive-exec/hive-exec-0.12.0.jar Thanks Best Regards On Mon, Aug 18, 2014 at 12:45 PM, Cesar Arevalo ce...@zephyrhealthinc.com wrote: Nope, it is NOT

Re: Spark: why need a masterLock when sending heartbeat to master

2014-08-18 Thread Victor Sheng
Thanks, I got it ! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-why-need-a-masterLock-when-sending-heartbeat-to-master-tp12256p12297.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: s3:// sequence file startup time

2014-08-18 Thread Cheng Lian
Maybe irrelevant, but this closely resembles the S3 Parquet file issue we've met before. It takes a dozen minutes to read the metadata because the ParquetInputFormat tries to call getFileStatus for all part-files sequentially. Just checked SequenceFileInputFormat, and found that a MapFile may share

Re: Re: application as a service

2014-08-18 Thread Zhanfeng Huo
That helps a lot. Thanks. Zhanfeng Huo From: Davies Liu Date: 2014-08-18 14:31 To: ryaminal CC: u...@spark.incubator.apache.org Subject: Re: application as a service Another option is using Tachyon to cache the RDD, then the cache can be shared by different applications. See how to use

Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
Hi All. I need to create a lot of RDDs starting from a set of roots and count the rows in each. Something like this: final JavaSparkContext sc = new JavaSparkContext(conf); List<String> roots = ... Map<String, Object> res = sc.parallelize(roots).mapToPair(new PairFunction<String, String, Long>(){

Re: Working with many RDDs in parallel?

2014-08-18 Thread Sean Owen
You won't be able to use RDDs inside of an RDD operation. I imagine your immediate problem is that the code you've elided references 'sc', and that gets referenced by the PairFunction and serialized, but it can't be. If you want to play it this way, parallelize across roots in Java. That is, just use
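A sketch of that suggestion in Scala rather than Java: keep the loop over roots in the driver and let each count() be its own job (loadRowsFor is a hypothetical loader that builds the RDD for one root):

    val roots: Seq[String] = Seq("rootA", "rootB", "rootC")
    val counts: Map[String, Long] = roots.map { root =>
      // no RDD is created inside another RDD's closure
      root -> loadRowsFor(sc, root).count()
    }.toMap
    // roots.par.map { ... } would submit the counting jobs concurrently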

spark kryo serializable exception

2014-08-18 Thread adu
hi all, In an RDD map, I invoke an object that is *serialized* by standard Java serialization, and get this exception: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 13 at com.esotericsoftware.kryo.io.Output.require(Output.java:138) at

Spark Streaming Data Sharing

2014-08-18 Thread Levi Bowman
Based on my understanding something like this doesn't seem to be possible out of the box, but I thought I would write it up anyway in case someone has any ideas. We have conceptually one high volume input stream, each streaming job is either interested in a subset of the stream or the entire

Re: Question regarding spark data partition and coalesce. Need info on my use case.

2014-08-18 Thread abhiguruvayya
Hello Mayur, The 3 in new RangePartitioner(*3*, partitionedFile) is also a hard-coded value for the number of partitions. Do you know a way I can avoid that? And besides the cluster size, the number of partitions also depends on the input data size. Correct me if I am wrong. -- View this message

RE: spark kryo serializable exception

2014-08-18 Thread Sameer Tilak
Hi, I was able to set this parameter in my application to resolve this issue: set("spark.kryoserializer.buffer.mb", "256"). Please let me know if this helps. Date: Mon, 18 Aug 2014 21:50:02 +0800 From: dujinh...@hzduozhun.com To: user@spark.apache.org Subject: spark kryo serializable exception
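The same fix expressed as a SparkConf sketch (1.x property name; 256 MB is just the value used in this thread):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "256")   // raise the kryo output buffer limit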

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
I removed the JAR that you suggested but now I get another error when I try to create the HiveContext. Here is the error:

scala> val hiveContext = new HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to term ql in package org.apache.hadoop.hive which is not

Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
I'm curious to see that if you declare the broadcasted wrapper as a var and overwrite it in the driver program, the modification has a stable impact on all transformations/actions defined BEFORE the overwrite but executed lazily AFTER the overwrite: val a = sc.parallelize(1 to 10) var

Re: Merging complicated small matrices to one big matrix

2014-08-18 Thread Davies Liu
rdd.flatMap(lambda x: x) may solve your problem; it will convert an RDD from [[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]] into: [[1,2,3], [4,5,6], [7,8,9], [10,11,12]] On Mon, Aug 18, 2014 at 2:42 AM, Chengi Liu chengi.liu...@gmail.com wrote: I have an rdd in pyspark which looks like

RE: Does HiveContext support Parquet?

2014-08-18 Thread lyc
I followed your instructions to try to load data in Parquet format through hiveContext but failed. Do you happen to see what I did incorrectly in the following steps? The steps I am following are: 1. download parquet-hive-bundle-1.5.0.jar 2. revise hive-site.xml including this: property

java.nio.channels.CancelledKeyException in Graphx Connected Components

2014-08-18 Thread Jeffrey Picard
Hey all, I'm trying to run connected components in GraphX on about 400GB of data on 50 m3.xlarge nodes on EMR. I keep getting java.nio.channels.CancelledKeyException when it gets to "mapPartitions at VertexRDD.scala:347". I haven't been able to find much about this online, and nothing that

Extracting unique elements of an ArrayBuffer

2014-08-18 Thread SK
Hi, I have a piece of code in which the result of a groupByKey operation is as follows: (2013-04, ArrayBuffer(s1, s2, s3, s1, s2, s4)) The first element is a String value representing a date and the ArrayBuffer consists of (non-unique) strings. I want to extract the unique elements of the
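A sketch of two ways to get the unique strings per key, assuming pairs is the (date, string) pair RDD that feeds the groupByKey above:

    // dedupe after grouping
    val uniquePerKey = pairs.groupByKey().mapValues(_.toSet)
    // e.g. ("2013-04", Set(s1, s2, s3, s4))

    // or dedupe before grouping, which shuffles less data
    val uniquePerKey2 = pairs.distinct().groupByKey()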

Re: Spark Streaming Data Sharing

2014-08-18 Thread Ruchir Jha
The Spark Job that has the main DStream, could have another DStream that is listening for stream subscription requests. So when a subscription is received, you could do a filter/forEach on the main DStream and respond to that one request. So you're basically creating a stream server that is

Re: Writing to RabbitMQ

2014-08-18 Thread jschindler
Well, it looks like I can use the .repartition(1) method to stuff everything in one partition, which gets rid of the duplicate messages I send to RabbitMQ, but that seems like a bad idea. Wouldn't that hurt scalability? -- View this message in context:

spark - reading hdfs files every 5 minutes

2014-08-18 Thread salemi
Hi, My data source stores the incoming data to HDFS every 10 seconds. The naming convention is save-timestamp.csv (see below):

drwxr-xr-x ali supergroup 0 B 0 0 B save-1408396065000.csv
drwxr-xr-x ali supergroup 0 B 0 0 B save-140839607.csv
drwxr-xr-x
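One way to pick those files up, sketched with Spark Streaming's textFileStream; the directory and batch interval are assumptions:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))        // matches the 10-second drop rate
    // only files that appear in the directory after the stream starts are picked up
    val lines = ssc.textFileStream("hdfs:///user/ali/incoming/")
    lines.foreachRDD { rdd =>
      // parse and process the newly arrived CSV records here
    }
    ssc.start()
    ssc.awaitTermination()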

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Zhan Zhang
Looks like hbaseTableName is null, probably caused by incorrect configuration.

String hbaseTableName = jobConf.get(HBaseSerDe.HBASE_TABLE_NAME);
setHTable(new HTable(HBaseConfiguration.create(jobConf), Bytes.toBytes(hbaseTableName)));

Here is the definition. public static final

setCallSite for API backtraces not showing up in logs?

2014-08-18 Thread John Salvatier
What's the correct way to use setCallSite to get the change to show up in the Spark logs? I have something like:

class RichRDD (rdd : RDD[MyThing]) {
  def mySpecialOperation() {
    rdd.context.setCallSite("bubbles and candy!")
    rdd.map()
    val result = rdd.groupBy()
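For reference, a sketch of the pattern being asked about, with a clearCallSite() added so later operations are not mislabelled; MyThing and the groupBy key are placeholders:

    class RichRDD(rdd: org.apache.spark.rdd.RDD[MyThing]) {
      def mySpecialOperation() = {
        rdd.context.setCallSite("bubbles and candy!")
        val result = rdd.groupBy(_.hashCode % 10)   // should appear under the custom call site
        rdd.context.clearCallSite()                 // restore the default for later operations
        result
      }
    }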

Re: [GraphX] how to set memory configurations to avoid OutOfMemoryError GC overhead limit exceeded

2014-08-18 Thread Ankur Dave
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI iamyifa...@gmail.com wrote: I am testing our application (similar to personalised PageRank using Pregel; note that each vertex property will need considerably more space to store after each new iteration) [...] But when we ran it on a larger graph (e.g.

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Zhan Zhang
I think the behavior is by design. If b is not persisted, then on each b.collect() call the closure points to the new broadcast variable, which is serialized by the driver and fetched by the executors. If you do persist, you don't expect the RDD to change due to a new broadcast variable. Thanks.
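A small sketch of the behaviour under discussion; the comments describe what the thread reports, since the closure is serialized when the job actually runs:

    val a = sc.parallelize(1 to 10)
    var factor = sc.broadcast(2)
    val b = a.map(_ * factor.value)   // defined BEFORE the overwrite, not yet executed
    factor = sc.broadcast(10)         // overwrite the var with a new broadcast
    b.collect()                       // runs now and, as reported above, sees the new value (10)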

How to use Spark Streaming from an HTTP api?

2014-08-18 Thread bumble123
I want to send an HTTP request (specifically to OpenTSDB) to get data. I've been looking at the StreamingContext api and don't seem to see any methods that can connect to this. Has anyone tried connecting Spark Streaming to a server via HTTP requests before? How have you done it? -- View this

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread cesararevalo
Thanks, Zhan, for the follow-up. But do you know how I am supposed to set that table name on the jobConf? I don't have access to that object from my client driver. -- View this message in context:

spark-submit with HA YARN

2014-08-18 Thread Matt Narrell
Hello, I have an HA-enabled YARN cluster with two resource managers. When submitting jobs via "spark-submit --master yarn-cluster", it appears that the driver is looking explicitly for the yarn.resourcemanager.address property rather than round-robining through the resource managers via the

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
Hi John, It seems like the original problem you had was that you were initializing the RabbitMQ connection on the driver, but then calling the code to write to RabbitMQ on the workers (I'm guessing, but I don't know since I didn't see your code). That's definitely a problem because the connection
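A sketch of the worker-side connection pattern Vida describes, using the RabbitMQ Java client; the host, queue name, and string messages are assumptions:

    import com.rabbitmq.client.ConnectionFactory

    rdd.foreachPartition { messages =>
      val factory = new ConnectionFactory()
      factory.setHost("rabbitmq-host")
      val connection = factory.newConnection()   // created on the worker, not the driver
      val channel = connection.createChannel()
      messages.foreach { msg =>                  // msg assumed to be a String here
        channel.basicPublish("", "my-queue", null, msg.getBytes("UTF-8"))
      }
      channel.close()
      connection.close()
    }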

Re: Writing to RabbitMQ

2014-08-18 Thread Vida Ha
Oh sorry, just to be more clear - writing from the driver program is only safe if the amount of data you are trying to write is small enough to fit on memory in the driver program. I looked at your code, and since you are just writing a few things each time interval, this seems safe. -Vida On

Processing multiple files in parallel

2014-08-18 Thread SK
Hi, I have a piece of code that reads all the (csv) files in a folder. For each file, it parses each line, extracts the first 2 elements from each row of the file, groups the tuple by the key and finally outputs the number of unique values for each key. val conf = new
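A sketch of that per-file pipeline; the folder path and comma-separated layout are assumptions:

    val folder = new java.io.File("/path/to/folder")
    folder.listFiles.filter(_.getName.endsWith(".csv")).foreach { f =>
      val uniqueCounts = sc.textFile(f.getAbsolutePath)
        .map(_.split(","))
        .map(cols => (cols(0), cols(1)))   // first 2 elements of each row
        .distinct()
        .groupByKey()
        .mapValues(_.size)                 // number of unique values per key
      println(f.getName + ": " + uniqueCounts.collect().toSeq)
    }

To read the whole folder in a single job instead, sc.textFile("/path/to/folder/*.csv") or sc.wholeTextFiles could be alternatives, depending on whether the per-file grouping matters.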

Data loss - Spark streaming and network receiver

2014-08-18 Thread Wei Liu
We are prototyping an application with Spark Streaming and Kinesis. We use Kinesis to accept incoming txn data, and then process it using Spark Streaming. So far we have really liked both technologies, and we have seen both getting mature rapidly. We are almost settled on using these two

Re: How to use Spark Streaming from an HTTP api?

2014-08-18 Thread Silvio Fiorito
You need to create a custom receiver that submits the HTTP requests then deserializes the data and pushes it into the Streaming context. See here for an example: http://spark.apache.org/docs/latest/streaming-custom-receivers.html On 8/18/14, 6:20 PM, bumble123 tc1...@att.com wrote: I want to
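A minimal sketch of such a custom receiver, polling an HTTP endpoint and pushing the response body into the stream; the URL and polling interval are assumptions:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class HttpPollingReceiver(url: String, intervalMs: Long)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("http-polling-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              val body = scala.io.Source.fromURL(url).mkString  // simple blocking GET
              store(body)                                       // push into the stream
              Thread.sleep(intervalMs)
            }
          }
        }.start()
      }

      def onStop(): Unit = {}   // the polling thread exits once isStopped() is true
    }

    // usage (endpoint is hypothetical):
    // val tsdb = ssc.receiverStream(new HttpPollingReceiver("http://opentsdb-host:4242/api/query?...", 5000))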

Re: Does HiveContext support Parquet?

2014-08-18 Thread Silvio Fiorito
First the JAR needs to be deployed using the --jars argument. Then in your HQL code you need to use the DeprecatedParquetInputFormat and DeprecatedParquetOutputFormat as described here: https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12 This is because Spark SQL is based
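A sketch of what that could look like end to end, with the HQL issued from a HiveContext; the class names follow the linked wiki page for the Hive 0.10-0.12 bundle, so double-check them against the bundle you actually deploy:

    // spark-submit ... --jars parquet-hive-bundle-1.5.0.jar
    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.hql("""
      CREATE TABLE parquet_test (id INT, name STRING)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
        OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    """)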

Re: Segmented fold count

2014-08-18 Thread fil
fil wrote - Python functions like groupCount; these get reflected from their Python AST and converted into a Spark DAG? Presumably if I try and do something non-convertible this transformation process will throw an error? In other words this runs in the JVM. Further to this - it seems that

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Tobias Pfeiffer
Hi Wei, On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wei@stellarloyalty.com wrote: Since our application cannot tolerate losing customer data, I am wondering what is the best way for us to address this issue. 1) We are thinking writing application specific logic to address the data loss. To

sqlContext.parquetFile(path) fails if path is a file but succeeds if a directory

2014-08-18 Thread Fengyun RAO
I'm using CDH 5.1 with spark 1.0. When I try to run Spark SQL following the Programming Guide val parquetFile = sqlContext.parquetFile(path) If the path is a file, it throws an exception: Exception in thread main java.lang.IllegalArgumentException: Expected hdfs://*/file.parquet for be a
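A sketch of the workaround implied by the error: point parquetFile at the directory containing the part files and _metadata, rather than at a single file (Spark 1.0 API names):

    // works: a directory produced by saveAsParquetFile
    val parquetData = sqlContext.parquetFile("hdfs:///data/mytable.parquet")
    parquetData.registerAsTable("mytable")
    sqlContext.sql("SELECT COUNT(*) FROM mytable").collect()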

Re: Segmented fold count

2014-08-18 Thread Davies Liu
On Mon, Aug 18, 2014 at 7:41 PM, fil f...@pobox.com wrote: fil wrote - Python functions like groupCount; these get reflected from their Python AST and converted into a Spark DAG? Presumably if I try and do something non-convertible this transformation process will throw an error? In other

spark - Identifying and skipping processed data in hdfs

2014-08-18 Thread salemi
Hi, My data source stores the incoming data to HDFS every 10 seconds. The naming convention is save-timestamp.csv (see below):

drwxr-xr-x ali supergroup 0 B 0 0 B save-1408396065000.csv
drwxr-xr-x ali supergroup 0 B 0 0 B save-140839607.csv
drwxr-xr-x

Re: Bug or feature? Overwrite broadcasted variables.

2014-08-18 Thread Peng Cheng
Yeah, thanks a lot. I know that for people who understand lazy execution this seems straightforward, but for those who don't it may become a liability. I've only tested its stability on a small example (which seems stable); hopefully that isn't just luck. Can a committer confirm this? Yours Peng

RE: Data loss - Spark streaming and network receiver

2014-08-18 Thread Shao, Saisai
I think currently Spark Streaming lacks a data acknowledging mechanism for when data is stored and replicated in the BlockManager, so data can potentially be lost even after it has been pulled from Kafka, say if the data is stored only in the BlockGenerator and not in the BlockManager while, in the meantime, Kafka itself commits the consumer offset,

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Dibyendu Bhattacharya
Dear All, Recently I have written a Spark Kafka consumer to solve this problem. We have also seen issues with KafkaUtils, which uses the high-level Kafka consumer, where the consumer code has no handle on offset management. The code below solves this problem, and it is being tested in our Spark

Re: Does anyone have a stand alone spark instance running on Windows

2014-08-18 Thread Steve Lewis
OK I tried your build - first you need to put sbt in C:\sbt. Then you get:

Microsoft Windows [Version 6.2.9200]
(c) 2012 Microsoft Corporation. All rights reserved.

e:\>which java
/cygdrive/c/Program Files/Java/jdk1.6.0_25/bin/java

e:\>java -version
java version "1.6.0_25"
Java(TM) SE Runtime

Re: Data loss - Spark streaming and network receiver

2014-08-18 Thread Wei Liu
Thank you all for responding to my question. I am pleasantly surprised by how many prompt responses I got; it shows the strength of the Spark community. Kafka is still an option for us, and I will check out the link provided by Dibyendu. Meanwhile, if someone out there has already figured this out with

Re: Working with many RDDs in parallel?

2014-08-18 Thread David Tinker
Hmm, I thought as much. I am using Cassandra with the Spark connector. What I really need is an RDD created from a query against Cassandra of the form "where partition_key = :id", where :id is taken from a list. Some grouping of the ids would be a way to partition this. On Mon, Aug 18, 2014 at 3:42