Re: Unresolved Attributes

2014-11-09 Thread Srinivas Chamarthi
OK, I am answering my own question here. It looks like name is a reserved
keyword or gets some special treatment: unless you use an alias, it doesn't
work. So always use an alias with the name attribute.

select a.name from xxx a where a. = 'y' // RIGHT
select name from  where t ='yy' // doesn't work.
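
For reference, a minimal pyspark sketch of the workaround (assuming a Spark 1.1-era
API; the input file and the id value are illustrative):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext("local", "alias workaround")
    sqlContext = SQLContext(sc)
    # hypothetical JSON input with name and business_id fields
    sqlContext.jsonFile("business.json").registerTempTable("business")

    # unqualified column fails with "Unresolved attributes: 'name" (error quoted below)
    # sqlContext.sql("select name from business where business_id = 'x'").collect()

    # qualifying the column with a table alias works
    rows = sqlContext.sql("select b.name from business b where b.business_id = 'x'").collect()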

Not sure if there's an issue filed for this already, or whether it is already fixed in master.

I will raise an issue if someone else also confirms it.

thx
srinivas

On Sat, Nov 8, 2014 at 3:26 PM, Srinivas Chamarthi 
srinivas.chamar...@gmail.com wrote:

 I get an exception when I try to run a simple where-clause query. I can see
 that the name attribute is present in the schema, but somehow it still throws
 the exception.

 query = "select name from business where business_id='" + business_id + "'"

 what am I doing wrong ?

 thx
 srinivas


 Exception in thread "main"
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: 'name, tree:
 Project ['name]
  Filter (business_id#1 = 'Ba1hXOqb3Yhix8bhE0k_WQ)
   Subquery business
SparkLogicalPlan (ExistingRdd
 [attributes#0,business_id#1,categories#2,city#3,full_address#4,hours#5,latitude#6,longitude#7,name#8,neighborhoods#9,open#10,review_count#11,stars#12,state#13,type#14],
 MappedRDD[5] at map at JsonRDD.scala:38)



Why does this simple Spark program use only one core?

2014-11-09 Thread ReticulatedPython
So, I'm running this simple program on a 16-core system. I run it by issuing
the following:

spark-submit --master local[*] pi.py

The code of the program is below. When I use top to check CPU consumption,
only 1 core is being utilized. Why is that? Secondly, the Spark documentation
says that the default parallelism is contained in the property
spark.default.parallelism. How can I read this property from within my Python
program?

#pi.py
from pyspark import SparkContext
import random

NUM_SAMPLES = 1250

def sample(p):
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)






Re: Does Spark work on multicore systems?

2014-11-09 Thread Sonal Goyal
Also, the level of parallelism would be affected by how big your input is.
Could this be a problem in your case?

On Sunday, November 9, 2014, Aaron Davidson ilike...@gmail.com wrote:

 oops, meant to cc userlist too

 On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson ilike...@gmail.com wrote:

 The default local master is local[*], which should use all cores on
 your system. So you should be able to just do ./bin/pyspark and
 sc.parallelize(range(1000)).count() and see that all your cores were used.

 On Sat, Nov 8, 2014 at 2:20 PM, Blind Faith person.of.b...@gmail.com wrote:

  I am a Spark newbie and I use Python (pyspark). I am trying to run a program
  on a 64-core system, but no matter what I do, it always uses only one core.
  It doesn't matter whether I run it using spark-submit --master local[64]
  run.sh or call x.repartition(64) on an RDD in my code; the Spark program
  always uses one core. Has anyone had success running Spark programs on
  multicore processors? Can someone provide me with a very simple example that
  properly runs on all cores of a multicore system?





-- 
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal


Re: Does Spark work on multicore systems?

2014-11-09 Thread Akhil Das
Try adding the following entry inside your conf/spark-defaults.conf file

spark.cores.max 64
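
For a quick end-to-end check, a minimal script like the following (a sketch; the
file name is illustrative) should keep all local cores busy when launched with
spark-submit --master local[64] (or local[*]):

    # check_cores.py (hypothetical)
    from pyspark import SparkContext

    sc = SparkContext(appName="multicore-check")
    # enough partitions that every core gets work; watch `top` while it runs
    total = sc.parallelize(xrange(1000000), 64).map(lambda x: x * x).count()
    print "counted %d elements" % total
    sc.stop()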

Thanks
Best Regards

On Sun, Nov 9, 2014 at 3:50 AM, Blind Faith person.of.b...@gmail.com
wrote:

 I am a Spark newbie and I use Python (pyspark). I am trying to run a program
 on a 64-core system, but no matter what I do, it always uses only one core.
 It doesn't matter whether I run it using spark-submit --master local[64]
 run.sh or call x.repartition(64) on an RDD in my code; the Spark program
 always uses one core. Has anyone had success running Spark programs on
 multicore processors? Can someone provide me with a very simple example that
 properly runs on all cores of a multicore system?



Re: supported sql functions

2014-11-09 Thread Nicholas Chammas
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

On Sunday, November 9, 2014, Srinivas Chamarthi srinivas.chamar...@gmail.com
wrote:

 Can anyone point me to documentation on supported SQL functions? I am
 trying to do a contains operation on a SQL array type, but I don't know how
 to write the SQL.

 // like the hive function array_contains
 select * from business where array_contains(type, 'insurance')



 appreciate any help.
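
For reference, a minimal pyspark sketch of that query through HiveContext (assuming
a Spark 1.1-era build with Hive support; the input file is illustrative):

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext("local", "array-contains example")
    hc = HiveContext(sc)
    # hypothetical JSON input with an array-typed "type" column
    hc.jsonFile("business.json").registerTempTable("business")

    # array_contains is a Hive UDF, so it is available through HiveContext
    rows = hc.sql("select * from business where array_contains(type, 'insurance')").collect()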




Re: Why does this simple Spark program use only one core?

2014-11-09 Thread Akhil Das
You can set the following entry inside the conf/spark-defaults.conf file

spark.cores.max 16


If you want to read the default value, then you can use the following API call:

sc.defaultParallelism

where sc is your SparkContext object.
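
A short usage sketch (assuming an existing SparkContext named sc, and a pyspark
version that exposes SparkContext.getConf()):

    print sc.defaultParallelism                  # effective default parallelism
    conf = sc.getConf()                          # copy of the context's SparkConf
    print conf.get("spark.default.parallelism", "not set")   # raw property, if it was set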


Thanks
Best Regards

On Sun, Nov 9, 2014 at 6:48 PM, ReticulatedPython person.of.b...@gmail.com
wrote:

 So, I'm running this simple program on a 16-core system. I run it by issuing
 the following:

 spark-submit --master local[*] pi.py

 The code of the program is below. When I use top to check CPU consumption,
 only 1 core is being utilized. Why is that? Secondly, the Spark documentation
 says that the default parallelism is contained in the property
 spark.default.parallelism. How can I read this property from within my Python
 program?

 #pi.py
 from pyspark import SparkContext
 import random

 NUM_SAMPLES = 1250

 def sample(p):
     x, y = random.random(), random.random()
     return 1 if x*x + y*y < 1 else 0

 sc = SparkContext("local", "Test App")
 count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
 print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)







Re: spark context not defined

2014-11-09 Thread Akhil Das
If you are talking about a standalone program, have a look at this doc:
https://spark.apache.org/docs/0.9.1/python-programming-guide.html#standalone-programs

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
        .setMaster("local")
        .setAppName("My app")
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)

sqlContext = HiveContext(sc)


Thanks
Best Regards

On Sat, Nov 8, 2014 at 4:35 AM, Pagliari, Roberto rpagli...@appcomsci.com
wrote:

 I’m running the latest version of spark with Hadoop 1.x and scala 2.9.3
 and hive 0.9.0.



 When using python 2.7

 from pyspark.sql import HiveContext

 sqlContext = HiveContext(sc)



 I’m getting ‘sc not defined’



 On the other hand, I can see ‘sc’ from pyspark CLI.



 Is there a way to fix it?



Re: spark-submit inside script... need some bash help

2014-11-09 Thread Akhil Das
Not sure why that is failing, but I found a workaround like this:

#!/bin/bash -e

SPARK_SUBMIT=/home/akhld/mobi/localcluster/spark-1/bin/spark-submit

export _JAVA_OPTIONS="-Xmx1g"

OPTS+=" --class org.apache.spark.examples.SparkPi"

echo $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar

exec $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar


Thanks
Best Regards

On Sat, Nov 8, 2014 at 12:31 AM, Koert Kuipers ko...@tresata.com wrote:

 I need to run spark-submit inside a script with options that are built up
 programmatically. Oh, and I need to use exec to keep the same pid (so it can
 run as a service and be killed).

 This is what I tried:
 ==
 #!/bin/bash -e

 SPARK_SUBMIT=/usr/local/lib/spark/bin/spark-submit

 OPTS="--class org.apache.spark.examples.SparkPi"
 OPTS+=" --driver-java-options \"-Da=b -Dc=d\""

 echo $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar

 exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar
 ==

 No luck. It gets confused by the multiple java options, it seems. I get:
 Exception in thread main java.lang.NoClassDefFoundError: -Da=b
 Caused by: java.lang.ClassNotFoundException: -Da=b
 at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 Could not find the main class: -Da=b.  Program will exit.

 I also tried many other ways of escaping the quoted java options. None of
 them work.
 Strangely, it does work if I replace the last line with (there is no science
 to this for me, I don't know much about bash, just trying random and
 probably bad things):
 eval exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar

 I am lost as to why... and there must be a better solution? It looks kinda
 nasty with the eval + exec.

 best, koert



Re: Why does this simple Spark program use only one core?

2014-11-09 Thread Matei Zaharia
Call getNumPartitions() on your RDD to make sure it has the right number of 
partitions. You can also specify it when doing parallelize, e.g.

rdd = sc.parallelize(xrange(1000), 10)

This should run in parallel if you have multiple partitions and cores, but it 
might be that during part of the process only one node (e.g. the master 
process) is doing anything.
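
A quick check along those lines (a sketch):

    rdd = sc.parallelize(xrange(1000), 10)
    print rdd.getNumPartitions()   # should print 10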

Matei


 On Nov 9, 2014, at 9:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 You can set the following entry inside the conf/spark-defaults.conf file 
 
 spark.cores.max 16
 
 If you want to read the default value, then you can use the following api call
 
 sc.defaultParallelism
 
 where sc is your SparkContext object.
 
 Thanks
 Best Regards
 
 On Sun, Nov 9, 2014 at 6:48 PM, ReticulatedPython person.of.b...@gmail.com wrote:
 So, I'm running this simple program on a 16-core system. I run it by issuing
 the following:

 spark-submit --master local[*] pi.py

 The code of the program is below. When I use top to check CPU consumption,
 only 1 core is being utilized. Why is that? Secondly, the Spark documentation
 says that the default parallelism is contained in the property
 spark.default.parallelism. How can I read this property from within my Python
 program?
 
 #pi.py
 from pyspark import SparkContext
 import random

 NUM_SAMPLES = 1250

 def sample(p):
     x, y = random.random(), random.random()
     return 1 if x*x + y*y < 1 else 0

 sc = SparkContext("local", "Test App")
 count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
 print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
 
 
 
 
 



Re: netty on classpath when using spark-submit

2014-11-09 Thread Tobias Pfeiffer
Hi,

On Wed, Nov 5, 2014 at 10:23 AM, Tobias Pfeiffer wrote:

 On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote:

From http://spark.apache.org/docs/latest/configuration.html it seems
 that there is an experimental property:

 spark.files.userClassPathFirst


 Thank you very much, I didn't know about this.  Unfortunately, it doesn't
 change anything.  With this setting both true and false (as indicated by
 the Spark web interface) and no matter whether local[N] or yarn-client
 or yarn-cluster mode are used with spark-submit, the classpath looks the
 same and the netty class is loaded from the Spark jar. Can I use this
 setting with spark-submit at all?


Has anyone used this setting successfully, or can anyone advise me on how to
use it correctly?

Thanks
Tobias


Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei






Thanks for your reply! According to your hint, the code should be like this:

    // i want to save data in rdd to mongodb and hdfs
    rdd.saveAsNewAPIHadoopFile()
    rdd.saveAsTextFile()

But will the application read HDFS twice?



qinwei
From: Akhil Das
Date: 2014-11-07 18:32
To: qinwei
CC: user
Subject: Re: about write mongodb in mapPartitions

Why not saveAsNewAPIHadoopFile?

// Define your mongoDB confs
val config = new Configuration()
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

// Write everything to mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random",
  classOf[Any], classOf[Any],
  classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]], config)

Thanks
Best Regards

On Fri, Nov 7, 2014 at 2:53 PM, qinwei wei@dewmobile.net wrote:

Hi, everyone

    I came across a problem with writing data to mongodb in mapPartitions; my
code is as below:

        val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
        // some transformations
        val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
        val newRDD = rdd.mapPartitions(args => {
            val mongoClient = new MongoClient(host, port)
            val db = mongoClient.getDB("db")
            val coll = db.getCollection("collectionA")

            args.map(arg => {
                coll.insert(new BasicDBObject("pkg", arg))
                arg
            })

            mongoClient.close()
            args
        })
        newRDD.saveAsTextFile("hdfs://host:port/path")

    The application saved data to HDFS correctly, but not to mongodb. Is there
something wrong? I know that collecting the newRDD to the driver and then
saving it to mongodb will succeed, but will the following saveAsTextFile read
the file system once again?

    Thanks

qinwei





Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei






Thanks for your reply! As you mentioned, the insert is not executed because the
results of args.map are never used anywhere; after I modified the code, it
works.


qinwei
From: Tobias Pfeiffer
Date: 2014-11-07 18:04
To: qinwei
CC: user
Subject: Re: about write mongodb in mapPartitions

Hi,

On Fri, Nov 7, 2014 at 6:23 PM, qinwei wei@dewmobile.net wrote:

            args.map(arg => {
                coll.insert(new BasicDBObject("pkg", arg))
                arg
            })

            mongoClient.close()
            args

As the results of args.map are never used anywhere, I think the loop body is
not executed at all. Maybe try:

            val argsProcessed = args.map(arg => {
                coll.insert(new BasicDBObject("pkg", arg))
                arg
            })
            mongoClient.close()
            argsProcessed

Tobias





Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-09 Thread santon
Sorry for the delay. I'll try to add some more details on Monday.

Unfortunately, I don't have a script to reproduce the error. Actually, it
seemed to be more about the data set than the script. The same code on
different data sets led to different results; only larger data sets on the
order of 40 GB seemed to crash with the described error. Also, I believe
our cluster was recently updated to CDH 5.2, which uses Spark 1.1. I'll
check to see if the issue was resolved.

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] 
ml-node+s1001560n18393...@n3.nabble.com wrote:

 Could you tell us how large the data set is? It will help us debug this
 issue.

 On Thu, Nov 6, 2014 at 10:39 AM, skane wrote:

  I don't have any insight into this bug, but on Spark version 1.0.0 I ran
 into
  the same bug running the 'sort.py' example. On a smaller data set, it
 worked
  fine. On a larger data set I got this error:
 
  Traceback (most recent call last):
    File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in <module>
      .sortByKey(lambda x: x)
    File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
      bounds.append(samples[index])
  IndexError: list index out of range
 
 
 








Repartition to data-size per partition

2014-11-09 Thread Harry Brundage
I want to avoid the small files problem when using Spark, without having to
manually calibrate a `repartition` at the end of each Spark application I
am writing, since the amount of data passing through sadly isn't all that
predictable. We're picking up from and writing data to HDFS.

I know other tools like Pig can set the number of reducers and thus the
number of output partitions for you based on the size of the input data,
but I want to know if anyone else has a better way to do this with Spark's
primitives.

Right now we have an OK solution, but it is starting to break down. We cache
our output RDD at the end of the application's flow, then map over it once more
to guess what size it will be when pickled and gzipped (we're in pyspark), and
then compute a number to repartition to using a target partition size. The
problem is that we want to work with datasets bigger than what will comfortably
fit in the cache. Just spitballing here, but what would be amazing is the
ability to ask Spark how big it thinks each partition might be, or the ability
to give an accumulator as an argument to `repartition` whose value wouldn't be
used until the prior stage had finished, or the ability to just have Spark
repartition to a target partition size for us.
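
For concreteness, a minimal pyspark sketch of our current approach described
above (the output RDD name, the output path, and the target size are
illustrative):

    import cPickle, zlib

    TARGET_PARTITION_BYTES = 64 * 1024 * 1024   # assumed target size per output partition

    output = output.cache()
    # estimate the pickled + gzipped size of the whole data set
    estimated_bytes = output.map(lambda rec: len(zlib.compress(cPickle.dumps(rec, 2)))).sum()
    num_partitions = max(1, int(estimated_bytes / TARGET_PARTITION_BYTES))
    output.repartition(num_partitions).saveAsTextFile("hdfs:///some/output/path")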

Thanks for any help you can give me!


Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-11-09 Thread thanhtien522
Yeah, it works.

I turned off the firewall on my Windows machine and it works now.

Thanks so much.






Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-09 Thread lev
I set the path of commons-math3-3.1.1.jar in spark.executor.extraClassPath
and it worked. Thanks a lot!

It only worked for me when the jar was present locally on the machine.
Is there a way to make it work when the jar is on HDFS?

I tried pointing the property at the file on HDFS (with and without the
hdfs:// prefix) and it didn't work.

Thanks,
Lev.






embedded spark for unit testing..

2014-11-09 Thread Kevin Burton
What’s the best way to embed Spark to run in local mode in unit tests?

Some of our jobs are mildly complex and I want to keep verifying that they
work, including during schema changes / migration.

I think for some of this I would just run local mode, read from a few text
files via resources, and then write to /tmp …
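
For what it's worth, a minimal pyspark + unittest sketch of that setup (names
are illustrative):

    import unittest
    from pyspark import SparkContext

    class WordCountTest(unittest.TestCase):
        def setUp(self):
            # local[2] keeps the test fast while still exercising more than one task slot
            self.sc = SparkContext("local[2]", "unit-test")

        def tearDown(self):
            self.sc.stop()

        def test_word_count(self):
            rdd = self.sc.parallelize(["a b", "b c"])
            counts = dict(rdd.flatMap(lambda line: line.split())
                             .map(lambda word: (word, 1))
                             .reduceByKey(lambda a, b: a + b)
                             .collect())
            self.assertEqual(counts["b"], 2)

    if __name__ == "__main__":
        unittest.main()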

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com


Queues

2014-11-09 Thread Deep Pradhan
Has anyone implemented Queues using RDDs?


Thank You


Rdd replication

2014-11-09 Thread rapelly kartheek
Hi,

I am trying to understand the RDD replication code. In the process, I
frequently execute one Spark application whenever I make a change to the
code, to see the effect.

My problem is, after a set of repeated executions of the same application,
I find that my cluster behaves unusually.

Ideally, when I replicate an rdd twice, the webUI displays each partition
twice in the RDD storage info tab. But, sometimes I find that it displays
each partition only once. Also, when it is replicated only once, each
partition gets displayed twice. This happens frequently.

Can someone throw some light on this?


Re: Does Spark work on multicore systems?

2014-11-09 Thread lalit1303
While creating the SparkConf, set the property spark.cores.max to the
maximum number of cores to be used by the Spark job.
By default it is set to 1.
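
A minimal pyspark sketch of that (the values are illustrative):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("my app")
            .set("spark.cores.max", "64"))
    sc = SparkContext(conf=conf)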



-
Lalit Yadav
la...@sigmoidanalytics.com



Re: embedded spark for unit testing..

2014-11-09 Thread DB Tsai
You can write unit tests with a local Spark context by mixing in the
LocalSparkContext trait.

See
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/util/LocalSparkContext.scala

as an example.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, Nov 9, 2014 at 9:12 PM, Kevin Burton bur...@spinn3r.com wrote:
 What’s the best way to embed Spark to run in local mode in unit tests?

 Some of our jobs are mildly complex and I want to keep verifying that they
 work, including during schema changes / migration.

 I think for some of this I would just run local mode, read from a few text
 files via resources, and then write to /tmp …

 --

 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile




Efficient Key Structure in pairRDD

2014-11-09 Thread nsareen
Hi,

We are trying to adopt Spark for our application.

We have an analytical application which stores data in star schemas (SQL
Server). All the cubes are loaded into a key/value structure and saved in
Trove (an in-memory collection), where the key is a short array in which each
short number represents a dimension member.
E.g., the tuple (CampaignX, Product1, Region_south, 10.23232) gets converted to
the Trove key [[12322],[45232],[53421]] and value [10.23232].

This is done to avoid saving a collection of string objects as the key in Trove.

Now, can we save this data structure in Spark using a pair RDD? If yes, would
key/value be an ideal way of storing data in Spark and retrieving it for data
analysis, or is there a better data structure we could create that would help
us build and process the RDD?
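
For illustration, a minimal pyspark sketch of the same layout as a pair RDD,
using tuples of dimension-member ids as keys since keys must be hashable
(assumes an existing SparkContext sc; all values are made up):

    pairs = sc.parallelize([
        ((12322, 45232, 53421), 10.23232),   # (campaign, product, region) ids -> measure
        ((12322, 45233, 53421), 7.5),
    ])

    # e.g. total measure for region id 53421
    region_total = pairs.filter(lambda kv: kv[0][2] == 53421).values().sum()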

Nitin.






canopy clustering

2014-11-09 Thread aminn_524
I want to run the k-means of MLlib on a big dataset. It seems that for big
datasets we need to perform a pre-clustering method such as canopy clustering:
by starting with an initial clustering, the number of more expensive distance
measurements can be significantly reduced by ignoring points outside of the
initial canopies.

If I am not mistaken, the k-means of MLlib has three initialization modes:
k-means++, k-means|| and random.

So, can anyone explain whether we can use k-means|| instead of canopy
clustering, or do these two methods act completely differently?


