can communication and computation be overlapped in spark?

2014-05-24 Thread wxhsdp
Hi, all
  fetch wait time:
   * Time the task spent waiting for remote shuffle blocks. This only includes the time
   * blocking on shuffle input data. For instance if block B is being fetched while the task is
   * still not finished processing block A, it is not considered to be blocking on block B.

  According to this definition of fetch wait time, Spark can process block A and fetch block B
  simultaneously, which means communication and computation can be overlapped. Does this apply
  to all situations? In particular, in some cases the worker needs both A and B before it can
  start its work.

  thanks!





Issue with the parallelize method in SparkContext

2014-05-24 Thread Wisc Forum
Hi, dear user group:

I recently tried to use the parallelize method of SparkContext to slice my original
data into small pieces for further handling. Something like the below:

val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize)

The size of my original testing data is 88 objects.

I know the default value of numSlices (if I don't specify the sparkPartitionSize value) is 10.

What happens is that when I specify the numSlices value to be 2 (as I use 2
slave nodes) and do something like this:

println("partitionedSource.count: " + partitionedSource.count)


The output is "partitionedSource.count: 44". The number of subtasks created, though, is
correctly 2.

My intention is to get two slices where each slice has 44 objects, and thus
partitionedSource.count should be 2, shouldn't it? So does this result of 44 mean that I
have 44 slices, or 44 objects in each slice? How can the second case be? What if I have
89 objects? Maybe I didn't use it correctly?
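
To make my confusion concrete, here is a small self-contained sketch of the two numbers
I may be mixing up (the dummy data, master, and app name are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("parallelize-sketch"))

// 88 made-up objects standing in for my test data.
val seq = 1 to 88
val partitionedSource = sc.parallelize(seq, 2)

// Total number of elements across all slices.
println("partitionedSource.count: " + partitionedSource.count)

// Number of slices the data was split into.
println("partitionedSource.partitions.size: " + partitionedSource.partitions.size)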

Can somebody help me on this?

Thanks,
Xiao Bing

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Josh Marcus
Jeremy,

Just to be clear, are you assembling a jar with that class compiled (with
its dependencies) and including the path to that jar on the command line in
an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)?

--j

On Saturday, May 24, 2014, Jeremy Lewi jer...@lewi.us wrote:

 Hi Spark Users,

 I'm trying to read and process an Avro dataset using the interactive spark
 scala shell. When my pipeline executes I get the ClassNotFoundException
 pasted at the end of this email.
 I'm trying to use the Generic Avro API (not the Specific API).

 Here's a gist of the commands I'm running in the spark console:
 https://gist.github.com/jlewi/2c853e0ceee5f00c

 Here's my registrator for kryo.

 https://github.com/jlewi/cloud/blob/master/spark/src/main/scala/contrail/AvroGenericRegistrator.scala

 Any help or suggestions would be greatly appreciated.

 Thanks
 Jeremy

 Here's the log message that is spewed out.

 14/05/24 02:00:48 WARN TaskSetManager: Loss was due to
 java.lang.ClassNotFoundException
 java.lang.ClassNotFoundException:
 $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
  at
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
 at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
 at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
  at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
  at
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at
 

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Jeremy Lewi
Hi Josh,

Thanks for the help.

The class should be on the path on all nodes. Here's what I did:
1) I built a jar from my scala code.
2) I copied that jar to a location on all nodes in my cluster
(/usr/local/spark)
3) I edited bin/compute-classpath.sh to add my jar to the class path.
4) I repeated the process with the avro mapreduce jar to provide AvroKey.

I doubt this is the best way to set the classpath but it seems to work.
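
Another option I have been looking at (purely illustrative; the jar name below is made up)
is to ship the assembled jar from the driver instead of editing the scripts:

// Distribute the assembled jar to executors for tasks launched on this SparkContext.
sc.addJar("/usr/local/spark/my-avro-assembly.jar")  // hypothetical jar name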

J


On Sat, May 24, 2014 at 9:26 AM, Josh Marcus jmar...@meetup.com wrote:

 Jeremy,

 Just to be clear, are you assembling a jar with that class compiled (with
 its dependencies) and including the path to that jar on the command line in
 an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)?

 --j


 On Saturday, May 24, 2014, Jeremy Lewi jer...@lewi.us wrote:

 Hi Spark Users,

 I'm trying to read and process an Avro dataset using the interactive
 spark scala shell. When my pipeline executes I get the
 ClassNotFoundException pasted at the end of this email.
 I'm trying to use the Generic Avro API (not the Specific API).

 Here's a gist of the commands I'm running in the spark console:
 https://gist.github.com/jlewi/2c853e0ceee5f00c

 Here's my registrator for kryo.

 https://github.com/jlewi/cloud/blob/master/spark/src/main/scala/contrail/AvroGenericRegistrator.scala

 Any help or suggestions would be greatly appreciated.

 Thanks
 Jeremy

 Here's the log message that is spewed out.

 [stack trace snipped; identical to the trace in the message above]

Re: Using Spark to analyze complex JSON

2014-05-24 Thread Michael Armbrust
 But going back to your presented pattern, I have a question. Say your data
 does have a fixed structure, but some of the JSON values are lists. How
 would you map that to a SchemaRDD? (I didn’t notice any list values in the
 CandyCrush example.) Take the likes field from my original example:

Spark SQL supports complex types, including sequences.  You should be able
to define a field of type Seq[String] in your case class.

One note: the hql parser that is included in a hive context currently has
better support for working with complex types, but we are working to
improve that.
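
For example, a minimal sketch along these lines (the Person class and the sample data are
made up for illustration; assumes the 1.0 SQL API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record mirroring the JSON: a scalar field plus a list-valued "likes" field.
case class Person(name: String, likes: Seq[String])

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("complex-types-sketch"))
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

val people = sc.parallelize(Seq(
  Person("alice", Seq("ice cream", "spark")),
  Person("bob", Seq("avro"))))
people.registerAsTable("people")

// Selecting the whole sequence works with the basic sql() parser; finer-grained access to
// the elements is currently better supported through hql() on a HiveContext.
sqlContext.sql("SELECT name, likes FROM people").collect().foreach(println)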


Re: Using Spark to analyze complex JSON

2014-05-24 Thread Mayur Rustagi
Hi Michael,

Is the in-memory columnar store planned as part of SparkSQL?

Also, will both HiveQL and SQLParser be kept updated?


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi



On Sun, May 25, 2014 at 2:44 AM, Michael Armbrust mich...@databricks.com wrote:


 But going back to your presented pattern, I have a question. Say your data
 does have a fixed structure, but some of the JSON values are lists. How
 would you map that to a SchemaRDD? (I didn’t notice any list values in the
 CandyCrush example.) Take the likes field from my original example:

 Spark SQL supports complex types, including sequences.  You should be able
 to define a field of type Seq[String] in your case class.

 One note: the hql parser that is included in a hive context currently has
 better support for working with complex types, but we are working to
 improve that.



Custom Accumulator: Type Mismatch Error

2014-05-24 Thread Muttineni, Vinay
Hello,

I have been trying to implement a custom accumulator as below



import org.apache.spark._

class VectorNew1(val data: Array[Double]) {}

implicit object VectorAP extends AccumulatorParam[VectorNew1] {

  def zero(v: VectorNew1) = new VectorNew1(new Array(v.data.size))

  def addInPlace(v1: VectorNew1, v2: VectorNew1) = {
    for (i <- 0 to v1.data.size - 1) v1.data(i) += v2.data(i)
    v1
  }
}

// Create an accumulator counter of length = number of columns, which is in
// turn derived from the header
val actualCounters1 = sc.accumulator(new VectorNew1(Array.fill[Double](2)(0)))

onlySplitFile.foreach(oneRow => {
  //println("Here")
  //println(oneRow(0))
  //for (eachColumnValue <- oneRow)
  //{
    actualCounters1 += new VectorNew1(Array(1,1))
  //}
})







I am receiving the following error:

Error: type mismatch;
Found    : VectorNew1
Required : vectorNew1

    actualCounters1 += new VectorNew1(Array(1,1))
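
For reference, here is a self-contained sketch of the same accumulator as a standalone
snippet (the local master, app name, and the 1-to-10 dummy data are only for illustration;
it passes the AccumulatorParam explicitly instead of relying on implicit resolution):

import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

// Serializable so the accumulated value can be shipped back with task results.
class VectorNew1(val data: Array[Double]) extends Serializable

object VectorAP extends AccumulatorParam[VectorNew1] {
  def zero(v: VectorNew1) = new VectorNew1(new Array[Double](v.data.size))
  def addInPlace(v1: VectorNew1, v2: VectorNew1) = {
    for (i <- 0 until v1.data.size) v1.data(i) += v2.data(i)
    v1
  }
}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("accumulator-sketch"))

// The explicit second argument names the AccumulatorParam directly.
val actualCounters1 = sc.accumulator(new VectorNew1(Array.fill[Double](2)(0)))(VectorAP)

// Dummy data standing in for onlySplitFile.
sc.parallelize(1 to 10).foreach(_ => actualCounters1 += new VectorNew1(Array(1, 1)))

println(actualCounters1.value.data.mkString(", "))  // expected: 10.0, 10.0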



Could someone help me with this?

Thank You

Vinay