Can communication and computation be overlapped in Spark?
Hi all,

The definition of fetch wait time reads:

    Time the task spent waiting for remote shuffle blocks. This only includes the time
    blocking on shuffle input data. For instance if block B is being fetched while the task is
    still not finished processing block A, it is not considered to be blocking on block B.

According to this definition, Spark can process block A and fetch block B simultaneously, which means communication and computation can be overlapped. Does this apply in all situations? In particular, in some cases the worker needs both A and B before it can start its work. Thanks!
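As a rough way to observe this metric, here is a sketch of a listener that logs each task's fetch wait time next to its run time (assuming the Spark 1.x listener API, where shuffleReadMetrics is an Option; field names vary across Spark versions):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Sketch only: logs how long each task was actually blocked waiting for
    // remote shuffle blocks. A fetch wait well below the run time indicates
    // communication overlapping with computation.
    class FetchWaitListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
        for (m <- Option(taskEnd.taskMetrics); shuffle <- m.shuffleReadMetrics) {
          println("task " + taskEnd.taskInfo.taskId + ": fetch wait " +
            shuffle.fetchWaitTime + " ms of " + m.executorRunTime + " ms run time")
        }
      }
    }
    // register it: sc.addSparkListener(new FetchWaitListener)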
Issue with the parallelize method in SparkContext
Hi, dear user group: I recently tried to use the parallelize method of SparkContext to slice my original data into small pieces for further handling, something like this:

    val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize)

The size of my original testing data is 88 objects, and I know the default value of numSlices (if I don't specify the sparkPartitionSize value) is 10. When I specified the numSlices value to be 2 (as I use 2 slave nodes) and ran:

    println("partitionedSource.count: " + partitionedSource.count)

the output was partitionedSource.count: 44, though the subtasks were correctly created as 2. My intention was to get two slices where each slice has 44 objects, so shouldn't partitionedSource.count be 2? Does this result of 44 mean that I have 44 slices, or 44 objects in each slice? How can the second case be? What if I have 89 objects? Maybe I didn't use it correctly? Can somebody help me with this? Thanks, Xiao Bing
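For reference, here is a minimal sketch of the distinction (assuming a running SparkContext named sc): count returns the total number of elements across all slices, never the number of slices, while partitions.size gives the slice count and glom() exposes each slice's contents:

    val rdd = sc.parallelize(1 to 88, 2)
    println(rdd.count)            // 88 -- total number of objects
    println(rdd.partitions.size)  // 2  -- number of slices
    // glom() turns each slice into an array so you can inspect its size:
    rdd.glom().collect().foreach(slice => println(slice.length))  // 44, 44
    // With 89 objects the split would simply be uneven: 44 and 45.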
Re: Working with Avro Generic Records in the interactive scala shell
Jeremy, Just to be clear, are you assembling a jar with that class compiled (with its dependencies) and including the path to that jar on the command line in an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)? --j

On Saturday, May 24, 2014, Jeremy Lewi jer...@lewi.us wrote:

Hi Spark Users, I'm trying to read and process an Avro dataset using the interactive Spark Scala shell. When my pipeline executes, I get the ClassNotFoundException pasted at the end of this email. I'm trying to use the Generic Avro API (not the Specific API). Here's a gist of the commands I'm running in the spark console: https://gist.github.com/jlewi/2c853e0ceee5f00c Here's my registrator for Kryo: https://github.com/jlewi/cloud/blob/master/spark/src/main/scala/contrail/AvroGenericRegistrator.scala Any help or suggestions would be greatly appreciated. Thanks, Jeremy

Here's the log message that is spewed out:

    14/05/24 02:00:48 WARN TaskSetManager: Loss was due to java.lang.ClassNotFoundException
    java.lang.ClassNotFoundException: $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
        at
Re: Working with Avro Generic Records in the interactive scala shell
Hi Josh, Thanks for the help. The class should be on the path on all nodes. Here's what I did:

1) I built a jar from my Scala code.
2) I copied that jar to a location on all nodes in my cluster (/usr/local/spark).
3) I edited bin/compute-classpath.sh to add my jar to the classpath.
4) I repeated the process with the avro mapreduce jar to provide AvroKey.

I doubt this is the best way to set the classpath, but it seems to work. J
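For comparison, one way to avoid editing compute-classpath.sh on every node is to let Spark ship the jar itself. This is only a sketch under Spark 1.0-era configuration (the jar path is illustrative; the registrator class name is taken from the link above):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("avro-generic")
      // setJars distributes the application jar to the executors:
      .setJars(Seq("/usr/local/spark/myavro.jar"))
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "contrail.AvroGenericRegistrator")
    val sc = new SparkContext(conf)

Note this applies to a standalone driver program; for spark-shell, an environment variable such as SPARK_CLASSPATH (as Josh suggested) still needs to point at the jar.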
Re: Using Spark to analyze complex JSON
But going back to your presented pattern, I have a question. Say your data does have a fixed structure, but some of the JSON values are lists. How would you map that to a SchemaRDD? (I didn’t notice any list values in the CandyCrush example.) Take the likes field from my original example:

Spark SQL supports complex types, including sequences. You should be able to define a field of type Seq[String] in your case class. One note: the hql parser that is included in a HiveContext currently has better support for working with complex types, but we are working to improve that.
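To make that concrete, here is a minimal sketch against the Spark 1.0-era API (the case class, data, and table name are illustrative; sc is an existing SparkContext):

    import org.apache.spark.sql.hive.HiveContext

    // A list-valued JSON field becomes a Seq in the case class, which
    // Spark SQL treats as an ARRAY column in the inferred schema.
    case class Person(name: String, likes: Seq[String])

    val hiveContext = new HiveContext(sc)
    import hiveContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion

    val people = sc.parallelize(Seq(Person("alice", Seq("spark", "avro"))))
    people.registerAsTable("people")

    // The hql parser handles complex types better than the plain sql parser,
    // e.g. indexing into the array:
    hiveContext.hql("SELECT name, likes[0] FROM people").collect().foreach(println)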
Re: Using Spark to analyze complex JSON
Hi Michael, Is the in-memory columnar store planned as part of Spark SQL? Also, will both the HiveQL and SQL parsers be kept updated?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
Custom Accumulator: Type Mismatch Error
Hello, I have been trying to implement a custom accumulator as below:

    import org.apache.spark._

    class VectorNew1(val data: Array[Double]) {}

    implicit object VectorAP extends AccumulatorParam[VectorNew1] {
      def zero(v: VectorNew1) = new VectorNew1(new Array(v.data.size))
      def addInPlace(v1: VectorNew1, v2: VectorNew1) = {
        for (i <- 0 to v1.data.size - 1) v1.data(i) += v2.data(i)
        v1
      }
    }

    // Create an accumulator counter of length = number of columns,
    // which is in turn derived from the header.
    val actualCounters1 = sc.accumulator(new VectorNew1(Array.fill[Double](2)(0)))

    onlySplitFile.foreach(oneRow => {
      // println("Here")
      // println(oneRow(0))
      // for (eachColumnValue <- oneRow) {
      actualCounters1 += new VectorNew1(Array(1, 1))
      // }
    })

I am receiving the following error:

    error: type mismatch;
     found   : VectorNew1
     required: vectorNew1
           actualCounters1 += new VectorNew1(Array(1, 1))

Could someone help me with this? Thank you, Vinay
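For reference, here is a self-contained sketch of the same accumulator that works in a fresh session (assuming the Spark 1.x AccumulatorParam API and an existing SparkContext sc). An error where the found and required types differ only in capitalization often means the implicit AccumulatorParam was resolved against a different (e.g. re-declared) version of the class from an earlier shell line; passing the param explicitly sidesteps that:

    import org.apache.spark.AccumulatorParam

    class VectorNew1(val data: Array[Double]) extends Serializable

    object VectorAP extends AccumulatorParam[VectorNew1] {
      // Identity element: a zero vector of the same length.
      def zero(v: VectorNew1) = new VectorNew1(new Array[Double](v.data.length))
      // Element-wise sum; mutating v1 in place is allowed for accumulators.
      def addInPlace(v1: VectorNew1, v2: VectorNew1) = {
        for (i <- 0 until v1.data.length) v1.data(i) += v2.data(i)
        v1
      }
    }

    // Pass the AccumulatorParam explicitly instead of relying on the implicit:
    val counters = sc.accumulator(new VectorNew1(Array.fill(2)(0.0)))(VectorAP)
    sc.parallelize(1 to 10).foreach(_ => counters += new VectorNew1(Array(1.0, 1.0)))
    println(counters.value.data.mkString(", "))  // 10.0, 10.0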