Re: Rolling upgrade Spark cluster

2014-12-17 Thread Bhaskar Dutta
HDFS rolling upgrade in Hadoop 2.6 (available since 2.4): http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html Some parts of NM and RM work-preserving restart were released in Hadoop 2.6.0 (YARN-1367: after restart the NM should resync with the RM without killing

wordcount job slow while input from NFS mount

2014-12-17 Thread Larry Liu
A wordcount job on an ~1 GB text file takes 1 hour when the input comes from an NFS mount. The same job took 30 seconds with input from the local file system. Is there any tuning required for NFS-mounted input? Thanks Larry

Re: when will the spark 1.3.0 be released?

2014-12-17 Thread Madabhattula Rajesh Kumar
Hi All, When will Spark 1.2.0 be released, and what are the features in Spark 1.2.0? Regards, Rajesh On Wed, Dec 17, 2014 at 11:14 AM, Andrew Ash and...@andrewash.com wrote: Releases are roughly every 3 months, so you should expect around March if the pace stays steady. 2014-12-16 22:56

Re: pyspark sc.textFile uses only 4 out of 32 threads per node

2014-12-17 Thread Nicholas Chammas
Rui is correct. Check how many partitions your RDD has after loading the gzipped files. e.g. rdd.getNumPartitions(). If that number is way less than the number of cores in your cluster (in your case I suspect the number is 4), then explicitly repartition the RDD to match the number of cores in
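A quick check and fix, sketched here in Scala (rdd.getNumPartitions() and rdd.repartition() are the PySpark counterparts); the path and the target of 32 partitions are illustrative only:
  val raw = sc.textFile("hdfs:///logs/*.gz")   // gzip is not splittable, so each file becomes one partition
  println(raw.partitions.length)               // if this prints 4, at most 4 tasks run in parallel
  val spread = raw.repartition(32)             // spread the data across roughly the total core count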

Re: Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-17 Thread Nicholas Chammas
Spark 1.3 does not exist. Spark 1.2 hasn't been released just yet. Which version of Spark did you mean? Also, from what I can see in the docs http://spark.apache.org/docs/1.1.1/building-with-maven.html#specifying-the-hadoop-version, I believe the latest version of Hadoop that Spark supports is

When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao
Hi, In the Spark SQL help document, it says: "Some of these (such as indexes) are less important due to Spark SQL's in-memory computational model. Others are slotted for future releases of Spark SQL: block level bitmap indexes and virtual columns (used to build indexes)." For our

SchemaRDD.sample problem

2014-12-17 Thread Hao Ren
Hi, I am using Spark SQL on the 1.2.1 branch. The problem comes from the following 4-line code:
  val t1: SchemaRDD = hiveContext.hql("select * from product where is_new = 0")
  val tb1: SchemaRDD = t1.sample(withReplacement = false, fraction = 0.05)
  tb1.registerTempTable("t1_tmp")
  (hiveContext.sql("select

Re: Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-17 Thread Sean Owen
Spark works fine with 2.4 *and later*. The docs don't mean to imply 2.4 is the last supported version. On Wed, Dec 17, 2014 at 10:19 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Spark 1.3 does not exist. Spark 1.2 hasn't been released just yet. Which version of Spark did you mean?

Re: java.lang.IllegalStateException: unread block data

2014-12-17 Thread Morbious
I found a solution. I use HADOOP_MAPRED_HOME in my environment, which clashes with Spark. After I set HADOOP_MAPRED_HOME to empty, Spark started working.

Re: Intermittent test failures

2014-12-17 Thread Marius Soutier
Using TestSQLContext from multiple tests leads to:
SparkException: Task not serializable
ERROR ContextCleaner: Error cleaning broadcast 10
java.lang.NullPointerException
  at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:246)
  at

Hadoop and spark together

2014-12-17 Thread Morbious
Hi, I'm trying to use Hadoop and Spark together but they don't work. If I set HADOOP_MAPRED_HOME to use MRv1 or MRv2, Spark stops working. If I set HADOOP_MAPRED_HOME to empty to use Spark, Hadoop stops working. I use Cloudera 5.1.3. Best regards, Morbious

weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
Hi, I encountered a weird bytecode incompatibility issue between the spark-core jar from the mvn repo and the official Spark prebuilt binary. Steps to reproduce: 1. Download the official pre-built Spark binary 1.1.1 at http://d3kbcqa49mib13.cloudfront.net/spark-1.1.1-bin-hadoop1.tgz 2. Launch

Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sean Owen
You should use the same binaries everywhere. The problem here is that anonymous functions can get compiled to different names when you build differently, so you can end up with one function being called when another function is meant. On Wed, Dec 17, 2014 at 12:07 PM, Sun, Rui

Re: java.lang.IllegalStateException: unread block data

2014-12-17 Thread sivarani
same issue anyone help please -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-tp20668p20745.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: when will the spark 1.3.0 be released?

2014-12-17 Thread Daniel Darabos
Spark 1.2.0 is coming in the next 48 hours according to http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-tc9815.html On Wed, Dec 17, 2014 at 10:11 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi All, When will the Spark 1.2.0 be

Are lazy values created once per node or once per partition?

2014-12-17 Thread Ashic Mahtab
Hello, Say, I have the following code: let something = Something() someRdd.foreachRdd(something.someMethod) And in something, I have a lazy member variable that gets created in something.someMethod. Would that lazy be created once per node, or once per partition? Thanks, Ashic.

Apache Spark 1.1.1 with Hbase 0.98.8-hadoop2 and hadoop 2.3.0

2014-12-17 Thread Amit Singh Hora
Hi All, I have downloaded the pre-built Spark 1.1.1 for Hadoop 2.3.0, then did mvn install for the jar spark-assembly-1.1.1-hadoop2.3.0.jar available in the lib folder of the Spark download, and added its dependency as follows in my Java program: <dependency> <groupId>org.apache.spark</groupId>

Re: spark streaming kafka best practices?

2014-12-17 Thread Gerard Maas
Patrick, I was wondering why one would choose rdd.map vs rdd.foreach to execute a side-effecting function on an RDD. -kr, Gerard. On Sat, Dec 6, 2014 at 12:57 AM, Patrick Wendell pwend...@gmail.com wrote: The second choice is better. Once you call collect() you are pulling all of the

spark-ec2 starts hdfs1, tachyon but not spark

2014-12-17 Thread Al Thompson
Hi All: I am new to Spark. I recently checked out and built spark 1.2 RC2 as an assembly. I then ran spark-ec2 according to: http://spark.apache.org/docs/latest/ec2-scripts.html I got master and slave instances in EC2 after running ./src/spark/ec2/spark-ec2 -k mykey -i mykey.pem -s 1 launch

Re: Are lazy values created once per node or once per partition?

2014-12-17 Thread Aniket Bhatnagar
I would think that it has to be per worker. On Wed, Dec 17, 2014, 6:32 PM Ashic Mahtab as...@live.com wrote: Hello, Say, I have the following code: let something = Something() someRdd.foreachRdd(something.someMethod) And in something, I have a lazy member variable that gets created in
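One common way to pin the initialization to at most once per executor JVM, regardless of how many partitions it processes, is to hold the lazy value in a singleton object rather than in a serialized instance. A rough sketch, where Something and someMethod are the hypothetical names from the question and someMethod is assumed to take a single element:
  object SomethingHolder {
    lazy val something = new Something()    // initialized at most once per executor JVM
  }
  someRdd.foreachPartition { iter =>
    val s = SomethingHolder.something       // shared by all partitions run in this executor
    iter.foreach(x => s.someMethod(x))
  }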

RE: Control default partition when load a RDD from HDFS

2014-12-17 Thread Shuai Zheng
Nice, that is the answer I want. Thanks! From: Sun, Rui [mailto:rui@intel.com] Sent: Wednesday, December 17, 2014 1:30 AM To: Shuai Zheng; user@spark.apache.org Subject: RE: Control default partition when load a RDD from HDFS Hi, Shuai, How did you turn off the file split in

Re: Apache Spark 1.1.1 with Hbase 0.98.8-hadoop2 and hadoop 2.3.0

2014-12-17 Thread Ted Yu
Have you seen this thread? http://search-hadoop.com/m/JW1q5FS8Mr1 If the problem you encountered is different, please give the full stack trace. Cheers On Wed, Dec 17, 2014 at 5:43 AM, Amit Singh Hora hora.a...@gmail.com wrote: Hi All, I have downloaded pre built Spark 1.1.1 for Hadoop 2.3.0

RE: Control default partition when load a RDD from HDFS

2014-12-17 Thread Diego García Valverde
Why isn't it a good option to create an RDD per 200 MB file and then apply the pre-calculations before merging them? I think the partitions per RDD should be transparent to the pre-calculations, and not fixed to optimize the Spark map/reduce processes. From: Shuai Zheng

Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Guillermo Ortiz
I'm a newbie with Spark... a simple question:
  val errorLines = lines.filter(_.contains(h))
  val mapErrorLines = errorLines.map(line => (key, line))
  val grouping = errorLinesValue.groupByKeyAndWindow(Seconds(8), Seconds(4))
I get something like: 604: --- 605:

Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi spark users, Do you know how to read json files using Spark SQL that are LZO compressed? I'm looking into sqlContext.jsonFile but I don't know how to configure it to read lzo files. Best Regards, Jerry

Re: Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Gerard Maas
You can create a DStream that contains the count by transforming the grouped windowed RDD, like this: val errorCount = grouping.map{ case (k, v) => v.size } If you need to preserve the key: val errorCount = grouping.map{ case (k, v) => (k, v.size) } or, if you don't care about the content of the

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Ted, Thanks for your help. I'm able to read lzo files using sparkContext.newAPIHadoopFile but I couldn't do the same for sqlContext because sqlContext.jsonFile does not provide a way to configure the input file format. Do you know if there are APIs to do that? Best Regards, Jerry On

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Ted Yu
See this thread: http://search-hadoop.com/m/JW1q5HAuFv which references https://issues.apache.org/jira/browse/SPARK-2394 Cheers On Wed, Dec 17, 2014 at 8:21 AM, Jerry Lam chiling...@gmail.com wrote: Hi spark users, Do you know how to read json files using Spark SQL that are LZO compressed?

Who is using Spark and related technologies for bioinformatics applications?

2014-12-17 Thread Steve Lewis
I am aware of the ADAM project in Berkeley and I am working on proteomic searches - is anyone else working in this space?

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Ted Yu
In SQLContext:
  def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
    val json = sparkContext.textFile(path)
    jsonRDD(json, samplingRatio)
  }
Looks like jsonFile() can be enhanced with a call to sparkContext.newAPIHadoopFile() with the proper input file format. Cheers On Wed, Dec

Implementing a spark version of Haskell's partition

2014-12-17 Thread Juan Rodríguez Hortalá
Hi all, I would like to be able to split a RDD in two pieces according to a predicate. That would be equivalent to applying filter twice, with the predicate and its complement, which is also similar to Haskell's partition list function (

Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread andy petrella
yo, First, here is the Scala version: http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@partition(p:A=>Boolean):(Repr,Repr) Second: RDD is distributed, so what you'll have to do is partition each partition (:-D) or create two RDDs by filtering twice →
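A minimal sketch of the filter-twice idea (the helper name is made up); caching the input avoids recomputing it for the second pass:
  import org.apache.spark.rdd.RDD

  def partitionRDD[T](rdd: RDD[T], p: T => Boolean): (RDD[T], RDD[T]) = {
    rdd.cache()                               // avoid recomputing the input for the second filter
    (rdd.filter(p), rdd.filter(x => !p(x)))
  }
  // e.g. val (evens, odds) = partitionRDD(numbers, (n: Int) => n % 2 == 0)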

wordcount job slow while input from NFS mount

2014-12-17 Thread Larry Liu
Hi, A wordcount job on an ~1 GB text file takes 1 hour when the input comes from an NFS mount. The same job took 30 seconds with input from the local file system. Is there any tuning required for NFS-mounted input? Thanks Larry

Re: Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Guillermo Ortiz
What I would like to do is to count the number of elements and, if it's greater than a number, iterate over all of them and store them in MySQL or another system. So, I need to count them and preserve the values because I'm saving them in another system. I know about this map(line => (key, line)), it was

Re: Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-17 Thread Nicholas Chammas
Thanks for the correction, Sean. Do the docs need to be updated on this point, or is it safer for now just to note 2.4 specifically? On Wed Dec 17 2014 at 5:54:53 AM Sean Owen so...@cloudera.com wrote: Spark works fine with 2.4 *and later*. The docs don't mean to imply 2.4 is the last

Re: Get the value of DStream[(String, Iterable[String])]

2014-12-17 Thread Guillermo Ortiz
Basically what I want to do would be something like:
  val errorLines = lines.filter(_.contains(h))
  val mapErrorLines = errorLines.map(line => (key, line))
  val grouping = errorLinesValue.groupByKeyAndWindow(Seconds(8), Seconds(4))
  if (errorLinesValue.getValue().size() > X) { //iterate values and
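Since the size check has to happen on the data itself rather than on the DStream, one hedged way to express it is inside foreachRDD; X and saveToMysql are placeholders, not real names from the thread:
  grouping.foreachRDD { rdd =>
    rdd.foreach { case (key, values) =>
      if (values.size > X) {
        values.foreach(v => saveToMysql(key, v))   // stand-in for the real MySQL write
      }
    }
  }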

Re: Implementing a spark version of Haskell's partition

2014-12-17 Thread Juan Rodríguez Hortalá
Hi Andy, thanks for your response. I already thought about filtering twice; that was what I meant with "that would be equivalent to applying filter twice". But I was wondering if I could do it in a single pass, so that it could later be generalized to an arbitrary number of classes. I would also like

Re: spark streaming kafka best practices?

2014-12-17 Thread Patrick Wendell
Foreach is slightly more efficient because Spark doesn't bother to try and collect results from each task since it's understood there will be no return type. I think the difference is very marginal though - it's mostly stylistic... typically you use foreach for something that is intended to
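A small illustration of the stylistic difference (sideEffect is a placeholder function):
  rdd.foreach(x => sideEffect(x))               // an action, run purely for its side effect
  rdd.map { x => sideEffect(x); x }.count()     // also works, but map is lazy and implies a result you don't need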

Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Shivaram Venkataraman
Just to clarify, are you running the application using spark-submit after packaging with sbt package? One thing that might help is to mark the Spark dependency as 'provided', as then you shouldn't have the Spark classes in your jar. Thanks Shivaram On Wed, Dec 17, 2014 at 4:39 AM, Sean Owen
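For reference, marking Spark as provided in build.sbt looks roughly like this (the version is illustrative):
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1" % "provided"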

Re: Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-17 Thread Daniel Haviv
Thanks for your replies. I was building Spark from trunk. Daniel On Dec 17, 2014, at 19:49, Nicholas Chammas nicholas.cham...@gmail.com wrote: Thanks for the correction, Sean. Do the docs need to be updated on this point, or is it safer for now just to note 2.4 specifically? On Wed

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
You can create an RDD[String] using whatever method and pass that to jsonRDD. On Wed, Dec 17, 2014 at 8:33 AM, Jerry Lam chiling...@gmail.com wrote: Hi Ted, Thanks for your help. I'm able to read lzo files using sparkContext.newAPIHadoopFile but I couldn't do the same for sqlContext because
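A hedged sketch of that approach, assuming the hadoop-lzo input format (com.hadoop.mapreduce.LzoTextInputFormat) is on the classpath; the path is illustrative:
  import org.apache.hadoop.io.{LongWritable, Text}
  import com.hadoop.mapreduce.LzoTextInputFormat

  val lines = sc.newAPIHadoopFile(
      "hdfs:///data/events/*.lzo",
      classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
    .map(_._2.toString)                       // keep only the text value of each record
  val events = sqlContext.jsonRDD(lines)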

Re: When will Spark SQL support building DB index natively?

2014-12-17 Thread Michael Armbrust
- Dev list Have you looked at partitioned table support? That would only scan data where the predicate matches the partition. Depending on the cardinality of the customerId column that could be a good option for you. On Wed, Dec 17, 2014 at 2:25 AM, Xuelin Cao xuelin...@yahoo.com.invalid

SparkSQL 1.2.1-snapshot Left Join problem

2014-12-17 Thread Hao Ren
Hi, When running SparkSQL branch 1.2.1 on EC2 standalone cluster, the following query does not work: create table debug as select v1.* from t1 as v1 left join t2 as v2 on v1.sku = v2.sku where v2.sku is null Both t1 and t2 have 200 partitions. t1 has 10k rows, and t2 has 4k rows. this query

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Michael, This is what I did. I was thinking if there is a more efficient way to accomplish this. I was doing a very simple benchmark: Convert lzo compressed json files to parquet files using SparkSQL vs. Hadoop MR. Spark SQL seems to require 2 stages to accomplish this task: Stage 1: read

Re: S3 globbing

2014-12-17 Thread durga katakam
Hi Akhil, Thanks for your time, I appreciate it. I tried this approach, but I am getting either fewer or more files, not exactly the files for the hour. Is there any way I can tell the range (between this time and this time)? Thanks, D On Tue, Dec 16, 2014 at 11:04 PM, Akhil Das ak...@sigmoidanalytics.com

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
The first pass is inferring the schema of the JSON data. If you already know the schema you can skip this pass by specifying the schema as the second parameter to jsonRDD. On Wed, Dec 17, 2014 at 10:59 AM, Jerry Lam chiling...@gmail.com wrote: Hi Michael, This is what I did. I was thinking
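A sketch of passing an explicit schema (the field names and types here are hypothetical):
  import org.apache.spark.sql._

  val schema = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("payload", StringType, nullable = true)))
  val parsed = sqlContext.jsonRDD(jsonLines, schema)   // skips the schema-inference pass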

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Michael Armbrust
To be a little more clear, jsonRDD and jsonFile use the same implementation underneath. jsonFile is just a convenience method that does jsonRDD(sc.textFile(...)) On Wed, Dec 17, 2014 at 11:37 AM, Michael Armbrust mich...@databricks.com wrote: The first pass is inferring the schema of the JSON

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
The problem is very likely NFS, not Spark. What kind of network is it mounted over? You can also test the performance of your NFS by copying a file from it to a local disk or to /dev/null and seeing how many bytes per second it can copy. Matei On Dec 17, 2014, at 9:38 AM, Larryliu

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Larry Liu
Hi, Matei Thanks for your response. I tried copying the file (1 GB) from NFS and it took 10 seconds. The NFS mount is a LAN environment and the NFS server is running on the same server that Spark is running on, so basically I mount the NFS on the same bare metal machine. Larry On Wed, Dec 17, 2014

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
I see, you may have something else configured weirdly then. You should look at CPU and disk utilization while your Spark job is reading from NFS and, if you see high CPU use, run jstack to see where the process is spending time. Also make sure Spark's local work directories (spark.local.dir)

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-17 Thread Josh Rosen
Yeah, it looks like messages that are successfully posted via Nabble end up on the Apache mailing list, but messages posted directly to Apache aren't mirrored to Nabble anymore because it's based off the incubator mailing list. We should fix this so that Nabble posts to / archives the

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Larry Liu
Thanks, Matei. I will give it a try. Larry On Wed, Dec 17, 2014 at 1:01 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I see, you may have something else configured weirdly then. You should look at CPU and disk utilization while your Spark job is reading from NFS and, if you see high CPU

Help with updateStateByKey

2014-12-17 Thread Pierce Lamb
I am trying to run stateful Spark Streaming computations over (fake) apache web server logs read from Kafka. The goal is to sessionize the web traffic similar to this blog post: http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
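A rough sessionization sketch with updateStateByKey, assuming a DStream[(String, String)] of (sessionId, logLine) called keyedLines; the grow-only state and checkpoint path are purely illustrative:
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")   // stateful operations require a checkpoint dir
  val sessions = keyedLines.updateStateByKey[Seq[String]] {
    (newLines: Seq[String], state: Option[Seq[String]]) =>
      Some(state.getOrElse(Seq.empty) ++ newLines)     // append this batch's lines to the session
  }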

spark-sql with join terribly slow.

2014-12-17 Thread harirajaram
Guys, I'm trying to join 2-3 SchemaRDDs of approx 30,000 rows and it is terribly slow. No doubt I get the results, but it takes 8s to do the join and get the results. I'm running standalone Spark on my machine with 8 cores and 12 GB RAM, with 4 workers. Not sure why it is consuming time, any

building with Hadoop 1.0.4 / where is hadoop-yarn-common:1.0.4 ?

2014-12-17 Thread Tim Harsch
I tried sending this message to the users list several hours ago, but it did not get distributed. I was just trying to build Spark v1.1.1 with defaults. It sets hadoop.version to 1.0.4 and yarn.version to hadoop.version; the dependency entry for org.apache.hadoop:hadoop-yarn-common sets

Re: building with Hadoop 1.0.4 / where is hadoop-yarn-common:1.0.4 ?

2014-12-17 Thread Sean Owen
There's no such version of YARN. But you only build the YARN support when you set -Pyarn. Then, yes you have to set yarn.version separately for earlier versions that didn't match up with Hadoop versions. http://spark.apache.org/docs/latest/building-with-maven.html On Thu, Dec 18, 2014 at 12:35
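For example, a build against Hadoop/YARN 2.4 (versions chosen purely for illustration) would look something like: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 -DskipTests clean package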

Re: java.io.NotSerializableException: org.apache.avro.mapred.AvroKey using spark with avro

2014-12-17 Thread touchdown
Yeah, I have the same problem with 1.1.0, but not 1.0.0. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-io-NotSerializableException-org-apache-avro-mapred-AvroKey-using-spark-with-avro-tp15165p20752.html Sent from the Apache Spark User List mailing

SPARK-2243 Support multiple SparkContexts in the same JVM

2014-12-17 Thread Anton Brazhnyk
Greetings, The first comment on the issue says that the reason for not supporting multiple contexts is: "There are numerous assumptions in the code base that uses a shared cache or thread local variables or some global identifiers which prevent us from using multiple SparkContext's." May it be worked

Re: No disk single pass RDD aggregation

2014-12-17 Thread thanhtien522
Jim Carroll wrote: Okay, I have an rdd that I want to run an aggregate over but it insists on spilling to disk even though I structured the processing to only require a single pass. In other words, I can do all of my processing one entry in the rdd at a time without persisting anything.

Re: SPARK-2243 Support multiple SparkContexts in the same JVM

2014-12-17 Thread Marcelo Vanzin
Hi Anton, That could solve some of the issues (I've played with that a little bit). But there are still some areas where this would be sub-optimal, because Spark still uses system properties in some places and those are global, not per-class loader. (SparkSubmit is the biggest offender here, but

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-17 Thread buring
I am not sure this can help you. I have 57 million ratings, about 4 million users and 4k items. I used 7-14 total executor cores and 13g executor memory; the cluster has 4 nodes, each with 4 cores and 16g max memory. I found that the following settings may help avoid this problem:

RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
Sean, Yes, the problem is exactly the anonymous-function mismatching you described. So if a Spark app (driver) depends on a Spark module jar (for example spark-core) to programmatically communicate with a Spark cluster, the user should not use the pre-built Spark binary but build Spark from the source

RE: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-17 Thread Sun, Rui
Not using spark-submit. The app directly communicates with the Spark cluster in standalone mode. If I mark the Spark dependency as 'provided', then a spark-core jar elsewhere must be pointed to on the CLASSPATH. However, the pre-built Spark binary only has an assembly jar, not individual

Re: spark-sql with join terribly slow.

2014-12-17 Thread nitin
This might be because Spark SQL first does a shuffle on both tables involved in the join, with the join condition as key. I had a specific use case of join where I always join on a specific column id, and have an optimisation lined up for that in which I can cache the data partitioned on the JOIN key id

Re: When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao
Thanks, I didn't try the partitioned table support (it sounds like a Hive feature). Is there any guideline? Should I use hiveContext to create the table with partitions first? On Thursday, December 18, 2014 2:28 AM, Michael Armbrust mich...@databricks.com wrote: - Dev list Have you
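A hedged sketch of doing it through hiveContext (table and column names are made up for illustration; hiveContext.sql accepts HiveQL):
  hiveContext.sql("CREATE TABLE IF NOT EXISTS orders_part (amount DOUBLE) PARTITIONED BY (customer_id STRING)")
  hiveContext.sql("INSERT OVERWRITE TABLE orders_part PARTITION (customer_id = '42') SELECT amount FROM orders WHERE customer_id = '42'")
  // A predicate on the partition column then only scans the matching partition:
  hiveContext.sql("SELECT * FROM orders_part WHERE customer_id = '42'").collect()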

Re: Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-17 Thread Kyle Lin
Hi there The following are my steps, and I got the same exception as Daniel. Another question: how can I build a tgz file like the pre-built file I download from the official website? 1. download trunk from git. 2. add the following lines in pom.xml:
+ <profile>
+   <id>hadoop-2.6</id>
+

Re: spark streaming kafka best practices?

2014-12-17 Thread Tobias Pfeiffer
Hi, On Thu, Dec 18, 2014 at 3:08 AM, Patrick Wendell pwend...@gmail.com wrote: On Wed, Dec 17, 2014 at 5:43 AM, Gerard Maas gerard.m...@gmail.com wrote: I was wondering why one would choose for rdd.map vs rdd.foreach to execute a side-effecting function on an RDD. Personally, I like to

Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
All, I'm using the Spark shell to interact with a small test deployment of Spark, built from the current master branch. I'm processing a dataset comprising a few thousand objects on Google Cloud Storage, split into a half dozen directories. My code constructs an object--let me call it the Dataset

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
I'm curious if you're seeing the same thing when using bdutil against GCS? I'm wondering if this may be an issue concerning the transfer rate of Spark -> Hadoop -> GCS Connector -> GCS. On Wed Dec 17 2014 at 10:09:17 PM Alessandro Baretta alexbare...@gmail.com wrote: All, I'm using the Spark

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Denny, No, gsutil scans through the listing of the bucket quickly. See the following.
alex@hadoop-m:~/split$ time bash -c "gsutil ls gs://my-bucket/20141205/csv/*/*/* | wc -l"
6860
real    0m6.971s
user    0m1.052s
sys     0m0.096s
Alex On Wed, Dec 17, 2014 at 10:29 PM, Denny Lee

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
Oh, it makes sense that gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would result in just as fast scans? On Wed Dec 17 2014 at 10:44:45 PM Alessandro Baretta alexbare...@gmail.com wrote: Denny, No, gsutil scans through the listing of the bucket

Getting OutOfMemoryError and Worker.run caught exception

2014-12-17 Thread A.K.M. Ashrafuzzaman
Hi guys, Getting the following errors,
2014-12-17 09:05:02,391 [SocialInteractionDAL.scala:Executor task launch worker-110:20] - --- Inserting into mongo -
2014-12-17 09:05:06,768 [Logging.scala:Executor task launch worker-110:96] - Exception in task 1.0 in stage

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Well, what do you suggest I run to test this? But more importantly, what information would this give me? On Wed, Dec 17, 2014 at 10:46 PM, Denny Lee denny.g@gmail.com wrote: Oh, it makes sense of gsutil scans through this quickly, but I was wondering if running a Hadoop job / bdutil would

Re: Getting OutOfMemoryError and Worker.run caught exception

2014-12-17 Thread Akhil Das
You can go through this doc for tuning: http://spark.apache.org/docs/latest/tuning.html Looks like you are creating a lot of objects and the JVM is spending more time clearing them. If you can paste the code snippet, it will be easy to understand what's happening. Thanks Best Regards On

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Denny Lee
For Spark to connect to GCS, it utilizes the Hadoop and GCS connector jars for connectivity. I'm wondering if it's those connection points that are ultimately slowing down the connection between Spark and GCS. The reason I was asking if you could run bdutil is because it would be basically Hadoop

Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing data from several distinct google storage directories. So the question becomes the following: what computation happens when calling the union method on two RDDs? On
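For what it's worth, union itself is lazy and just records the parent RDDs; no data is read until an action. A sketch of two ways to build such an input (the paths are illustrative):
  val dirs = Seq("gs://my-bucket/a/*", "gs://my-bucket/b/*", "gs://my-bucket/c/*")
  val unioned = sc.union(dirs.map(d => sc.textFile(d)))  // one UnionRDD over the per-directory RDDs
  val single  = sc.textFile(dirs.mkString(","))          // textFile also accepts a comma-separated list of paths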