Custom Accumulator: Type Mismatch Error

2014-05-24 Thread Muttineni, Vinay
Hello,

I have been trying to implement a custom accumulator, as shown below.



import org.apache.spark._

class VectorNew1(val data: Array[Double]) {}

implicit object VectorAP extends AccumulatorParam[VectorNew1] {
  def zero(v: VectorNew1) = new VectorNew1(new Array(v.data.size))

  def addInPlace(v1: VectorNew1, v2: VectorNew1) = {
    for (i <- 0 to v1.data.size - 1) v1.data(i) += v2.data(i)
    v1
  }
}

// Create an accumulator counter of length = number of columns, which is in
// turn derived from the header
val actualCounters1 = sc.accumulator(new VectorNew1(Array.fill[Double](2)(0)))

onlySplitFile.foreach(oneRow => {
  // println("Here")
  // println(oneRow(0))
  // for (eachColumnValue <- oneRow) {
  actualCounters1 += new VectorNew1(Array(1,1))
  // }
})







I am receiving the following error:

Error: type mismatch;
  Found:    VectorNew1
  Required: vectorNew1

    actualCounters1 += new VectorNew1(Array(1,1))



Could someone help me with this?

Thank You

Vinay
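
For reference, here is a minimal sketch of the same AccumulatorParam pattern
that compiles against the Spark 1.0 shell API; the class name, the sample RDD,
and the vector length are illustrative only and are not taken from the job above.

    import org.apache.spark.{AccumulatorParam, SparkContext}

    // Illustrative wrapper type; Serializable so updates can be shipped back to the driver
    class VectorAcc(val data: Array[Double]) extends Serializable

    implicit object VectorAccParam extends AccumulatorParam[VectorAcc] {
      // Zero value: a vector of the same length, filled with 0.0
      def zero(v: VectorAcc): VectorAcc = new VectorAcc(new Array[Double](v.data.length))

      // Element-wise addition, accumulating into v1
      def addInPlace(v1: VectorAcc, v2: VectorAcc): VectorAcc = {
        for (i <- v1.data.indices) v1.data(i) += v2.data(i)
        v1
      }
    }

    val sc = new SparkContext("local", "AccumulatorSketch")
    val counters = sc.accumulator(new VectorAcc(Array.fill[Double](2)(0)))

    sc.parallelize(1 to 10).foreach(_ => counters += new VectorAcc(Array(1.0, 1.0)))
    println(counters.value.data.mkString(", "))  // expected: 10.0, 10.0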


Re: Using Spark to analyze complex JSON

2014-05-24 Thread Mayur Rustagi
Hi Michael,

Is the in-memory columnar store planned as part of Spark SQL?

Also, will both HiveQL & SQLParser be kept updated?


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 



On Sun, May 25, 2014 at 2:44 AM, Michael Armbrust wrote:

>
> But going back to your presented pattern, I have a question. Say your data
>> does have a fixed structure, but some of the JSON values are lists. How
>> would you map that to a SchemaRDD? (I didn’t notice any list values in the
>> CandyCrush example.) Take the likes field from my original example:
>>
> Spark SQL supports complex types, including sequences.  You should be able
> to define a field of type Seq[String] in your case class.
>
> One note: the HiveQL parser that is included in a HiveContext currently has
> better support for working with complex types, but we are working to
> improve that.
>


Re: Using Spark to analyze complex JSON

2014-05-24 Thread Michael Armbrust
> But going back to your presented pattern, I have a question. Say your data
> does have a fixed structure, but some of the JSON values are lists. How
> would you map that to a SchemaRDD? (I didn’t notice any list values in the
> CandyCrush example.) Take the likes field from my original example:
>
Spark SQL supports complex types, including sequences.  You should be able
to define a field of type Seq[String] in your case class.

One note: the HiveQL parser that is included in a HiveContext currently has
better support for working with complex types, but we are working to
improve that.
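
To make the Seq[String] suggestion concrete, here is a minimal sketch assuming
the Spark 1.0 SQLContext API; the User case class, the sample rows, and the
table name are made up for illustration.

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Hypothetical record type: `likes` is a sequence-valued field
    case class User(name: String, likes: Seq[String])

    val sc = new SparkContext("local", "ComplexJsonSketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicitly turns an RDD of case classes into a SchemaRDD

    val users = sc.parallelize(Seq(
      User("alice", Seq("spark", "scala")),
      User("bob", Seq("json"))))
    users.registerAsTable("users")

    // Simple projection over the complex field; richer operations on the sequence
    // (e.g. LATERAL VIEW explode) are currently better supported via a HiveContext
    sqlContext.sql("SELECT name, likes FROM users").collect().foreach(println)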


Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Jeremy Lewi
Hi Josh,

Thanks for the help.

The class should be on the classpath on all nodes. Here's what I did:
1) I built a jar from my Scala code.
2) I copied that jar to a location on all nodes in my cluster (/usr/local/spark).
3) I edited bin/compute-classpath.sh to add my jar to the classpath.
4) I repeated the process with the Avro mapreduce jar to provide AvroKey.

I doubt this is the best way to set the classpath, but it seems to work.

J
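
As an alternative to editing compute-classpath.sh on every node, here is a
sketch of passing the jars directly to the SparkContext via the four-argument
constructor available in Spark 0.9/1.0; the jar paths and master URL below are
placeholders, not the actual cluster settings.

    import org.apache.spark.SparkContext

    // Placeholder paths: the assembled job jar and the avro-mapred jar
    val jars = Seq("/usr/local/spark/my-avro-job.jar",
                   "/usr/local/spark/avro-mapred.jar")

    // The listed jars are shipped to the executors automatically, so each node
    // does not need a hand-edited compute-classpath.sh
    val sc = new SparkContext("spark://master:7077", "AvroGenericShell",
                              "/usr/local/spark", jars)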


On Sat, May 24, 2014 at 9:26 AM, Josh Marcus  wrote:

> Jeremy,
>
> Just to be clear, are you assembling a jar with that class compiled (with
> its dependencies) and including the path to that jar on the command line in
> an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)?
>
> --j
>
>
> On Saturday, May 24, 2014, Jeremy Lewi  wrote:
>
>> Hi Spark Users,
>>
>> I'm trying to read and process an Avro dataset using the interactive
>> spark scala shell. When my pipeline executes I get the
>> ClassNotFoundException pasted at the end of this email.
>> I'm trying to use the Generic Avro API (not the Specific API).
>>
>> Here's a gist of the commands I'm running in the spark console:
>> https://gist.github.com/jlewi/2c853e0ceee5f00c
>>
>> Here's my registrator for kryo.
>>
>> https://github.com/jlewi/cloud/blob/master/spark/src/main/scala/contrail/AvroGenericRegistrator.scala
>>
>> Any help or suggestions would be greatly appreciated.
>>
>> Thanks
>> Jeremy
>>
>> Here's the log message that is spewed out.
>>
>> 14/05/24 02:00:48 WARN TaskSetManager: Loss was due to
>> java.lang.ClassNotFoundException
>> java.lang.ClassNotFoundException:
>> $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>  at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>  at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:270)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
>>  at
>> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
>> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>>  at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>>  at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:606)
>> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>>  at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>>  at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>  at
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Josh Marcus
Jeremy,

Just to be clear, are you assembling a jar with that class compiled (with
its dependencies) and including the path to that jar on the command line in
an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)?

--j

On Saturday, May 24, 2014, Jeremy Lewi  wrote:

> Hi Spark Users,
>
> I'm trying to read and process an Avro dataset using the interactive spark
> scala shell. When my pipeline executes I get the ClassNotFoundException
> pasted at the end of this email.
> I'm trying to use the Generic Avro API (not the Specific API).
>
> Here's a gist of the commands I'm running in the spark console:
> https://gist.github.com/jlewi/2c853e0ceee5f00c
>
> Here's my registrator for kryo.
>
> https://github.com/jlewi/cloud/blob/master/spark/src/main/scala/contrail/AvroGenericRegistrator.scala
>
> Any help or suggestions would be greatly appreciated.
>
> Thanks
> Jeremy
>
> Here's the log message that is spewed out.
>
> 14/05/24 02:00:48 WARN TaskSetManager: Loss was due to
> java.lang.ClassNotFoundException
> java.lang.ClassNotFoundException:
> $line16.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>  at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>  at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
>  at
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
>  at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:606)
> at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>  at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
>  at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
>  at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>  at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>  at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> at java.io.ObjectInputStream.readSerialData(ObjectI

Re: Issue with the parallelize method in SparkContext

2014-05-24 Thread Nicholas Chammas
partitionedSource is an RDD, right? If so, then partitionedSource.count should
return the number of elements in the RDD, regardless of how many partitions
it’s split into.

If you want to count the number of elements per partition, you’ll need to use
RDD.mapPartitions, I believe.
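
For instance, a minimal sketch (with a local master and made-up data mirroring
the 88-object case quoted below) that distinguishes the total element count,
the number of slices, and the per-partition counts:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[2]", "PartitionCountSketch")

    // 88 illustrative elements split into 2 slices
    val partitionedSource = sc.parallelize(1 to 88, 2)

    partitionedSource.count             // total number of elements: 88
    partitionedSource.partitions.size   // number of slices: 2

    // Number of elements in each partition, via mapPartitions
    val perPartition = partitionedSource
      .mapPartitions(iter => Iterator(iter.size), preservesPartitioning = true)
      .collect()                        // e.g. Array(44, 44)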


On Sat, May 24, 2014 at 10:18 AM, Wisc Forum  wrote:

> Hi, dear user group:
>
> I recently tried to use the parallelize method of SparkContext to slice the
> original data into small pieces for further handling, something like this:
>
> val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize)
>
> The size of my original testing data is 88 objects.
>
> I know the default value of numSlices (if I don't specify the
> sparkPartitionSize value) is 10.
>
> What happens is that when I specify the numSlices value to be 2 (as I
> use 2 slave nodes) and do something like this:
> println("partitionedSource.count: " + partitionedSource.count)
>
>
> The output is partitionedSource.count: 44. The number of subtasks,
> though, is correctly 2.
>
> My intention is to get two slices where each slice has 44 objects, and
> thus partitionedSource.count should be 2, shouldn't it? So does this
> result of 44 mean that I have 44 slices, or 44 objects in each slice?
> How can the second case be? What if I have 89 objects? Maybe I didn't use
> it correctly?
>
> Can somebody help me on this?
>
> Thanks,
> Xiao Bing
>


Issue with the parallelize method in SparkContext

2014-05-24 Thread Wisc Forum
Hi, dear user group:

I recently tried to use the parallelize method of SparkContext to slice the
original data into small pieces for further handling, something like this:

val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize)

The size of my original testing data is 88 objects.

I know the default value of numSlices (if I don't specify the
sparkPartitionSize value) is 10.

What happens is that when I specify the numSlices value to be 2 (as I use 2
slave nodes) and do something like this:
println("partitionedSource.count: " + partitionedSource.count)


The output is partitionedSource.count: 44. The number of subtasks, though, is
correctly 2.

My intention is to get two slices where each slice has 44 objects, and thus
partitionedSource.count should be 2, shouldn't it? So does this result of 44
mean that I have 44 slices, or 44 objects in each slice? How can the second
case be? What if I have 89 objects? Maybe I didn't use it correctly?

Can somebody help me on this?

Thanks,
Xiao Bing

can communication and computation be overlapped in spark?

2014-05-24 Thread wxhsdp
Hi, all
  The definition of fetch wait time reads:

    Time the task spent waiting for remote shuffle blocks. This only includes
    the time blocking on shuffle input data. For instance if block B is being
    fetched while the task is still not finished processing block A, it is not
    considered to be blocking on block B.

  According to this definition, Spark can process block A and fetch block B
  simultaneously, which means communication and computation can be overlapped.
  Does this apply in all situations, especially in cases where the worker needs
  both A and B to start the work?

  thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/can-communication-and-computation-be-overlapped-in-spark-tp6348.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.