Custom Accumulator: Type Mismatch Error

2014-05-24 Thread Muttineni, Vinay
Hello, I have been trying to implement a custom accumulator as below: import org.apache.spark._ class VectorNew1(val data: Array[Double]) {} implicit object VectorAP extends AccumulatorParam[VectorNew1] { def zero(v: VectorNew1) = new VectorNew1(new Array(v.dat
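The snippet above is cut off; as a reference, here is a minimal sketch of what a complete AccumulatorParam for such a vector wrapper might look like, assuming element-wise addition is the intended merge (the usage lines at the end are hypothetical, not from the poster's code):

    import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

    // Simple wrapper around an Array[Double], as in the truncated snippet above.
    class VectorNew1(val data: Array[Double]) extends Serializable

    // AccumulatorParam needs both zero() and addInPlace(); a type mismatch usually
    // means one of them returns something other than VectorNew1.
    implicit object VectorAP extends AccumulatorParam[VectorNew1] {
      def zero(v: VectorNew1): VectorNew1 =
        new VectorNew1(new Array[Double](v.data.length))

      def addInPlace(v1: VectorNew1, v2: VectorNew1): VectorNew1 = {
        for (i <- 0 until v1.data.length) v1.data(i) += v2.data(i)
        v1
      }
    }

    // Hypothetical usage:
    // val sc  = new SparkContext(new SparkConf().setAppName("acc-example"))
    // val acc = sc.accumulator(new VectorNew1(new Array[Double](10)))(VectorAP)
    // sc.parallelize(1 to 100).foreach(_ => acc += new VectorNew1(Array.fill(10)(1.0)))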

Re: Using Spark to analyze complex JSON

2014-05-24 Thread Mayur Rustagi
Hi Michael, Is the in-memory columnar store planned as part of SparkSQL? Also, will both HiveQL & SQLParser be kept updated? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Sun, May 25, 2014 at 2:44 AM, Michael Armbrus

Re: Using Spark to analyze complex JSON

2014-05-24 Thread Michael Armbrust
> But going back to your presented pattern, I have a question. Say your data > does have a fixed structure, but some of the JSON values are lists. How > would you map that to a SchemaRDD? (I didn’t notice any list values in the > CandyCrush example.) Take the likes field from my original example: >
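For Spark 1.0, a minimal sketch of the pattern Michael describes: a case class with a Seq field maps a JSON list value (such as the likes field) to an array-typed column in the SchemaRDD. The Person class and the hard-coded records below are placeholders, not taken from the thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Placeholder record type: the "likes" value is a JSON list, modeled as a Seq,
    // which becomes an array column in the resulting SchemaRDD.
    case class Person(name: String, likes: Seq[String])

    val sc = new SparkContext("local", "json-lists")
    val sqlContext = new SQLContext(sc)
    import sqlContext._  // brings in createSchemaRDD for RDD[Product]

    // In practice each JSON line would be parsed into a Person (e.g. with a JSON
    // library); hard-coded records keep the sketch self-contained.
    val people = sc.parallelize(Seq(
      Person("alice", Seq("candy", "crush")),
      Person("bob", Seq("spark"))
    ))

    people.registerAsTable("people")
    sql("SELECT name, likes FROM people").collect().foreach(println)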

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Jeremy Lewi
Hi Josh, Thanks for the help. The class should be on the path on all nodes. Here's what I did: 1) I built a jar from my scala code. 2) I copied that jar to a location on all nodes in my cluster (/usr/local/spark) 3) I edited bin/compute-classpath.sh to add my jar to the class path. 4) I repeated

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Josh Marcus
Jeremy, Just to be clear, are you assembling a jar with that class compiled (with its dependencies) and including the path to that jar on the command line in an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)? --j On Saturday, May 24, 2014, Jeremy Lewi wrote: > Hi Spark Users, >
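A quick sanity check (not from the thread) once the shell has been launched with the jar on SPARK_CLASSPATH: resolving the class by name on both the driver and the executors shows whether the jar is actually visible everywhere. com.example.avro.MyRecord is a placeholder for the Avro-generated class:

    // Run inside spark-shell.
    // Driver-side check: throws ClassNotFoundException if the jar is not on the driver classpath.
    Class.forName("com.example.avro.MyRecord")

    // Executor-side check: forces each worker to load the class as well.
    sc.parallelize(1 to 100, 4).foreach { _ =>
      Class.forName("com.example.avro.MyRecord")
    }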

Re: Issue with the parallelize method in SparkContext

2014-05-24 Thread Nicholas Chammas
partitionedSource is an RDD, right? If so, then partitionedSource.count should return the number of elements in the RDD, regardless of how many partitions it’s split into. If you want to count the number of elements per partition, you’ll need to use RDD.mapPartitions, I believe. On Sat, May 24,
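A small sketch of the two counts Nicholas describes, with a placeholder RDD standing in for partitionedSource:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "partition-counts")
    val partitionedSource = sc.parallelize(1 to 88, 8)  // stand-in for the real data

    // Total number of elements, regardless of how the RDD is partitioned.
    println(partitionedSource.count())  // 88

    // Number of elements in each partition.
    partitionedSource
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
      .foreach { case (idx, n) => println(s"partition $idx: $n elements") }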

Issue with the parallelize method in SparkContext

2014-05-24 Thread Wisc Forum
Hi, dear user group: I recently tried to use the parallelize method of SparkContext to slice original data into small pieces for further handling. Something like this: val partitionedSource = sparkContext.parallelize(seq, sparkPartitionSize) The size of my original testing data is 88 objects

can communication and computation be overlapped in spark?

2014-05-24 Thread wxhsdp
Hi all, fetch wait time: * Time the task spent waiting for remote shuffle blocks. This only includes the time * blocking on shuffle input data. For instance if block B is being fetched while the task is * still not finished processing block A, it is not considered to be blocking on bl
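The quoted text is the scaladoc for the shuffle-read fetchWaitTime metric. One way to see how much of a task was actually blocked on fetches, as opposed to overlapping them with computation, is to compare that metric with the task's run time in a listener. A rough sketch, assuming the Spark 1.x listener API:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs, per finished task, how long it was blocked waiting on remote shuffle
    // blocks versus its total run time; a small ratio suggests fetch and compute overlapped.
    class FetchWaitListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          metrics.shuffleReadMetrics.foreach { sr =>
            println(s"task ${taskEnd.taskInfo.taskId}: " +
              s"fetchWaitTime=${sr.fetchWaitTime} ms, runTime=${metrics.executorRunTime} ms")
          }
        }
      }
    }

    // Register on the driver before running the shuffle job:
    // sc.addSparkListener(new FetchWaitListener())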