Re: Unresolved Attributes

2014-11-09 Thread Srinivas Chamarthi
OK, I am answering my own question here. It looks like "name" is a reserved
keyword or gets some special treatment: unless you use an alias, it doesn't
work. So always use an alias with the name attribute.

select a.name from xxx a where a. = 'y' // RIGHT
select name from  where t ='yy' // doesn't work.
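
For reference, a cleaned-up sketch of the workaround in PySpark (assuming an
existing SQLContext `sqlContext` with the business table from the message
below already registered; the id value is the one from the original exception):

# With an alias, the name column resolves.
rows = sqlContext.sql(
    "SELECT b.name FROM business b WHERE b.business_id = 'Ba1hXOqb3Yhix8bhE0k_WQ'")

# Without the alias -- SELECT name FROM business WHERE business_id = 'Ba1hXOqb3Yhix8bhE0k_WQ' --
# the same query reportedly fails with "Unresolved attributes: 'name".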

Not sure if there's an issue filed for this already, or whether it's already fixed in master.

I will raise an issue if someone else also confirms it.

thx
srinivas

On Sat, Nov 8, 2014 at 3:26 PM, Srinivas Chamarthi <
srinivas.chamar...@gmail.com> wrote:

> I have an exception when I am trying to run a simple where clause query. I
> can see the name attribute is present in the schema but somehow it still
> throws the exception.
>
> query = "select name from business where business_id=" + business_id
>
> what am I doing wrong ?
>
> thx
> srinivas
>
>
> Exception in thread "main"
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
> attributes: 'name, tree:
> Project ['name]
>  Filter (business_id#1 = 'Ba1hXOqb3Yhix8bhE0k_WQ)
>   Subquery business
>SparkLogicalPlan (ExistingRdd
> [attributes#0,business_id#1,categories#2,city#3,full_address#4,hours#5,latitude#6,longitude#7,name#8,neighborhoods#9,open#10,review_count#11,stars#12,state#13,type#14],
> MappedRDD[5] at map at JsonRDD.scala:38)
>


Why does this simple Spark program use only one core?

2014-11-09 Thread ReticulatedPython
So, I'm running this simple program on a 16-core multicore system. I run it
by issuing the following:

spark-submit --master local[*] pi.py

The code of the program is below. When I use top to check CPU consumption,
only one core is being utilized. Why is that? Secondly, the Spark
documentation says that the default parallelism is held in the property
spark.default.parallelism. How can I read this property from within my
Python program?

#"""pi.py"""
from pyspark import SparkContext
import random

NUM_SAMPLES = 1250

def sample(p):
x, y = random.random(), random.random()
return 1 if x*x + y*y < 1 else 0

sc = SparkContext("local", "Test App")
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a,
b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)






supported sql functions

2014-11-09 Thread Srinivas Chamarthi
Can anyone point me to documentation on the supported SQL functions? I am
trying to do a contains operation on a SQL array type, but I don't know how
to write the SQL.

// like hive function array_contains
select * from business where array_contains(type, "insurance")



appreciate any help.


Re: Does Spark work on multicore systems?

2014-11-09 Thread Sonal Goyal
Also, the level of parallelism would be affected by how big your input is.
Could this be a problem in your  case?

On Sunday, November 9, 2014, Aaron Davidson  wrote:

> oops, meant to cc userlist too
>
> On Sat, Nov 8, 2014 at 3:13 PM, Aaron Davidson wrote:
>
>> The default local master is "local[*]", which should use all cores on
>> your system. So you should be able to just do "./bin/pyspark" and
>> "sc.parallelize(range(1000)).count()" and see that all your cores were used.
>>
>> On Sat, Nov 8, 2014 at 2:20 PM, Blind Faith wrote:
>>
>>> I am a Spark newbie and I use python (pyspark). I am trying to run a
>>> program on a 64 core system, but no matter what I do, it always uses 1
>>> core. It doesn't matter if I run it using "spark-submit --master local[64]
>>> run.sh" or I call x.repartition(64) in my code with an RDD, the spark
>>> program always uses one core. Has anyone experience of running spark
>>> programs on multicore processors with success? Can someone provide me a
>>> very simple example that does properly run on all cores of a multicore
>>> system?
>>>
>>
>>
>

-- 
Best Regards,
Sonal
Nube Technologies 




Re: Does Spark work on multicore systems?

2014-11-09 Thread Akhil Das
Try adding the following entry inside your conf/spark-defaults.conf file

spark.cores.max 64

Thanks
Best Regards

On Sun, Nov 9, 2014 at 3:50 AM, Blind Faith 
wrote:

> I am a Spark newbie and I use python (pyspark). I am trying to run a
> program on a 64 core system, but no matter what I do, it always uses 1
> core. It doesn't matter if I run it using "spark-submit --master local[64]
> run.sh" or I call x.repartition(64) in my code with an RDD, the spark
> program always uses one core. Has anyone experience of running spark
> programs on multicore processors with success? Can someone provide me a
> very simple example that does properly run on all cores of a multicore
> system?
>


Re: supported sql functions

2014-11-09 Thread Akhil Das
This is the feature list available, I believe. You can create a custom UDF
for the contains operation. A similar example is explained in this
StackOverflow post.
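
A rough PySpark sketch of the custom-UDF route (the UDF name contains_str is
made up; this assumes a Spark release whose Python API exposes
sqlContext.registerFunction and pyspark.sql.types, and a registered business table):

from pyspark.sql.types import BooleanType

# Hypothetical UDF: true when the array column contains the given value.
sqlContext.registerFunction(
    "contains_str",
    lambda arr, value: value in (arr or []),
    BooleanType())

insurers = sqlContext.sql(
    "SELECT * FROM business WHERE contains_str(type, 'insurance')")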

Thanks
Best Regards

On Sun, Nov 9, 2014 at 7:27 PM, Srinivas Chamarthi <
srinivas.chamar...@gmail.com> wrote:

> can anyone point me to a documentation on supported sql functions ? I am
> trying to do a contians operation on sql array type. But I don't know how
> to type the  sql.
>
> // like hive function array_contains
> select * from business where array_contains(type, "insurance")
>
>
>
> appreciate any help.
>
>


Re: supported sql functions

2014-11-09 Thread Nicholas Chammas
http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features
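
Since array_contains is a Hive UDF, a minimal sketch of the Hive route
(assuming Spark was built with Hive support and the business table is
visible to the HiveContext):

from pyspark.sql import HiveContext

hc = HiveContext(sc)  # assumes an existing SparkContext `sc`
# Hive built-ins such as array_contains should then work directly in SQL.
insurers = hc.sql("SELECT * FROM business WHERE array_contains(type, 'insurance')")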

On Sunday, November 9, 2014, Srinivas Chamarthi wrote:

> can anyone point me to a documentation on supported sql functions ? I am
> trying to do a contians operation on sql array type. But I don't know how
> to type the  sql.
>
> // like hive function array_contains
> select * from business where array_contains(type, "insurance")
>
>
>
> appreciate any help.
>
>


Re: Why does this simple Spark program use only one core?

2014-11-09 Thread Akhil Das
You can set the following entry inside the conf/spark-defaults.conf file

spark.cores.max 16


If you want to read the default value, then you can use the following api
call

sc.defaultParallelism

where sc is your SparkContext object.


Thanks
Best Regards

On Sun, Nov 9, 2014 at 6:48 PM, ReticulatedPython 
wrote:

> So, I'm running this simple program on a 16 core multicore system. I run it
> by issuing the following.
>
> spark-submit --master local[*] pi.py
>
> And the code of that program is the following. When I use top to see CPU
> consumption, only 1 core is being utilized. Why is it so? Seconldy, spark
> documentation says that the default parallelism is contained in property
> spark.default.parallelism. How can I read this property from within my
> python program?
>
> #"""pi.py"""
> from pyspark import SparkContext
> import random
>
> NUM_SAMPLES = 1250
>
> def sample(p):
> x, y = random.random(), random.random()
> return 1 if x*x + y*y < 1 else 0
>
> sc = SparkContext("local", "Test App")
> count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a,
> b: a + b)
> print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
>
>
>


Re: Spark on YARN, ExecutorLostFailure for long running computations in map

2014-11-09 Thread Akhil Das
I used to hit this issue, but setting the following confs while creating the
SparkContext seems to work:

 .set("spark.rdd.compress","true")

  .set("spark.storage.memoryFraction","1")
  .set("spark.core.connection.ack.wait.timeout","6000")
  .set("spark.akka.frameSize","50")

Most likely one of the executors is stuck in a GC pause, and meanwhile the
master thinks it is dead and throws a timeout/cancelled-key exception.
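
For a PySpark job, a minimal sketch of applying the same settings while
building the context (the values are simply the ones quoted above, not tuned
recommendations):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.rdd.compress", "true")
        .set("spark.storage.memoryFraction", "1")
        .set("spark.core.connection.ack.wait.timeout", "6000")
        .set("spark.akka.frameSize", "50"))
sc = SparkContext(conf=conf)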

Thanks
Best Regards

On Sat, Nov 8, 2014 at 2:58 PM,  wrote:

> Hi,
>
> I am getting ExecutorLostFailure when I run spark on YARN and in map I
> perform very long tasks (couple of hours). Error Log is below.
>
> Do you know if it is possible to set something to make it possible for
> Spark to perform these very long running jobs in map?
>
> Thank you very much for any advice.
>
> Best regards,
> Jan
>
> Spark log:
> 4533,931: [GC 394578K->20882K(1472000K), 0,0226470 secs]
> Traceback (most recent call last):
>   File "/home/hadoop/spark_stuff/spark_lda.py", line 112, in 
> models.saveAsTextFile(sys.argv[1])
>   File "/home/hadoop/spark/python/pyspark/rdd.py", line 1324, in
> saveAsTextFile
> keyed._jrdd.map(self.ctx._jvm.BytesToString()).saveAsTextFile(path)
>   File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
> line 538, in __call__
>   File
> "/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
> 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> o36.saveAsTextFile.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 28 in stage 0.0 failed 4 times, most recent failure: Lost task 28.3 in
> stage 0.0 (TID 41, ip-172-16-1-90.us-west-2.compute.internal):
> ExecutorLostFailure (executor lost)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> Yarn log:
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-152.us-west-2.compute.internal:41091 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-152.us-west-2.compute.internal:39160 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-152.us-west-2.compute.internal:45058 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-241.us-west-2.compute.internal:54111 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-238.us-west-2.compute.internal:45772 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-241.us-west-2.compute.internal:59509 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:20:34 INFO storage.BlockManagerInfo: Added broadcast_0_piece0
> in memory on ip-172-16-1-238.us-west-2.compute.internal:35720 (size: 596.9
> KB, free: 775.7 MB)
> 14/11/08 08:21:11 INFO network.ConnectionManager: Removing
> SendingConnection to
> ConnectionMa

Re: spark context not defined

2014-11-09 Thread Akhil Das
If you are talking about a standalone program, have a look at this doc.


from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf()
 .setMaster("local")
 .setAppName("My app")
 .set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf)

sqlContext = HiveContext(sc)


Thanks
Best Regards

On Sat, Nov 8, 2014 at 4:35 AM, Pagliari, Roberto 
wrote:

> I’m running the latest version of spark with Hadoop 1.x and scala 2.9.3
> and hive 0.9.0.
>
>
>
> When using python 2.7
>
> from pyspark.sql import HiveContext
>
> sqlContext = HiveContext(sc)
>
>
>
> I’m getting ‘sc not defined’
>
>
>
> On the other hand, I can see ‘sc’ from pyspark CLI.
>
>
>
> Is there a way to fix it?
>


Re: spark-submit inside script... need some bash help

2014-11-09 Thread Akhil Das
Not sure why that is failing, but I found a workaround like this:

#!/bin/bash -e

SPARK_SUBMIT=/home/akhld/mobi/localcluster/spark-1/bin/spark-submit

export _JAVA_OPTIONS=-Xmx1g

OPTS+=" --class org.apache.spark.examples.SparkPi"

echo $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar

exec $SPARK_SUBMIT $OPTS lib/spark-examples-1.1.0-hadoop1.0.4.jar


Thanks
Best Regards

On Sat, Nov 8, 2014 at 12:31 AM, Koert Kuipers  wrote:

> i need to run spark-submit inside a script with options that are build up
> programmatically. oh and i need to use exec to keep the same pid (so it can
> run as a service and be killed).
>
> this is what i tried:
> ==
> #!/bin/bash -e
>
> SPARK_SUBMIT=/usr/local/lib/spark/bin/spark-submit
>
> OPTS="--class org.apache.spark.examples.SparkPi"
> OPTS+=" --driver-java-options \"-Da=b -Dc=d\""
>
> echo $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar
>
> exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar
> ==
>
> no luck. it gets confused on the multiple java options it seems. i get:
> Exception in thread "main" java.lang.NoClassDefFoundError: "-Da=b
> Caused by: java.lang.ClassNotFoundException: "-Da=b
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> Could not find the main class: "-Da=b.  Program will exit.
>
> i also tried many other ways of escaping the quoted java options. none of
> them work.
> strangely it does work if i replace the last line by (there is no science
> to this for me, i dont know much about bash, just trying random and
> probably bad things):
> eval exec $SPARK_SUBMIT $OPTS spark-examples_2.10-1.1.0.jar
>
> i am lost as to why... and there must be a better solution? it looks kinda
> nasty with the eval + exec
>
> best, koert
>


Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-11-09 Thread Shailesh Birari
Have you tried setting the host name/port to your Windows machine?
Also specify the following ports for Spark, and make sure the ports you
specify are not blocked (on the Windows machine).

spark.fileserver.port
spark.broadcast.port
spark.replClassServer.port
spark.blockManager.port
spark.executor.port
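
In code, these could be pinned when the context is created; a PySpark sketch
(the master URL, host name and port numbers are placeholders, and
spark.driver.host is one way to point Spark back at the Windows machine, as
suggested above):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://unix-cluster-master:7077")  # placeholder master URL
        .set("spark.driver.host", "my-windows-box")     # hostname of the dev machine
        .set("spark.fileserver.port", "40000")
        .set("spark.broadcast.port", "40001")
        .set("spark.replClassServer.port", "40002")
        .set("spark.blockManager.port", "40003")
        .set("spark.executor.port", "40004"))
sc = SparkContext(conf=conf)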

Hope this helps.






Re: Why does this simple Spark program use only one core?

2014-11-09 Thread Matei Zaharia
Call getNumPartitions() on your RDD to make sure it has the right number of
partitions. You can also specify it when doing parallelize, e.g.

rdd = sc.parallelize(xrange(1000), 10)

This should run in parallel if you have multiple partitions and cores, but it 
might be that during part of the process only one node (e.g. the master 
process) is doing anything.
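
A small PySpark sketch of both checks (assuming an existing SparkContext sc;
16 partitions is just an example for the 16-core box in question):

rdd = sc.parallelize(xrange(0, 1250), 16)  # explicitly ask for 16 partitions
print rdd.getNumPartitions()               # should print 16
print sc.defaultParallelism                # what Spark uses when no count is given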

Matei


> On Nov 9, 2014, at 9:27 AM, Akhil Das  wrote:
> 
> You can set the following entry inside the conf/spark-defaults.conf file 
> 
> spark.cores.max 16
> 
> If you want to read the default value, then you can use the following api call
> 
> sc.defaultParallelism
> 
> where sc is your SparkContext object.
> 
> Thanks
> Best Regards
> 
> On Sun, Nov 9, 2014 at 6:48 PM, ReticulatedPython wrote:
> So, I'm running this simple program on a 16 core multicore system. I run it
> by issuing the following.
> 
> spark-submit --master local[*] pi.py
> 
> And the code of that program is the following. When I use top to see CPU
> consumption, only 1 core is being utilized. Why is it so? Seconldy, spark
> documentation says that the default parallelism is contained in property
> spark.default.parallelism. How can I read this property from within my
> python program?
> 
> #"""pi.py"""
> from pyspark import SparkContext
> import random
> 
> NUM_SAMPLES = 1250
> 
> def sample(p):
> x, y = random.random(), random.random()
> return 1 if x*x + y*y < 1 else 0
> 
> sc = SparkContext("local", "Test App")
> count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a,
> b: a + b)
> print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
> 
> 
> 



Re: netty on classpath when using spark-submit

2014-11-09 Thread Tobias Pfeiffer
Hi,

On Wed, Nov 5, 2014 at 10:23 AM, Tobias Pfeiffer wrote:
>
> On Tue, Nov 4, 2014 at 8:33 PM, M. Dale wrote:
>
>>From http://spark.apache.org/docs/latest/configuration.html it seems
>> that there is an experimental property:
>>
>> spark.files.userClassPathFirst
>>
>
> Thank you very much, I didn't know about this.  Unfortunately, it doesn't
> change anything.  With this setting both true and false (as indicated by
> the Spark web interface) and no matter whether "local[N]" or "yarn-client"
> or "yarn-cluster" mode are used with spark-submit, the classpath looks the
> same and the netty class is loaded from the Spark jar. Can I use this
> setting with spark-submit at all?
>

Has anyone used this setting successfully, or can anyone advise me on how to
use it correctly?

Thanks
Tobias


Re: How to add elements into map?

2014-11-09 Thread Tobias Pfeiffer
Hi,

On Sat, Nov 8, 2014 at 1:39 PM, Tim Chou  wrote:
>
> val table = sc.textFile(args(1))
> val histMap = collection.mutable.Map[Int,Int]()
> for (x <- table) {
>   val tuple = x.split('|')
>   histMap.put(tuple(0).toInt, 1)
> }
>

What will happen here is that histMap (an empty Map) will be serialized and
sent to all Spark workers. Each worker will fill it locally with the data
that was processed locally, but it won't ever be sent back to the Spark
driver, that's why you don't see anything there.
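
In other words, the aggregation has to come back to the driver through an
action. A rough PySpark sketch of that pattern (the pipe-delimited parsing
mirrors the snippet above; path is a placeholder):

# Count keys with an action instead of mutating a local map inside the closure.
table = sc.textFile(path)
hist = (table.map(lambda line: (int(line.split("|")[0]), 1))
             .countByKey())  # returns a dict to the driver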

Tobias


Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei






Thanks for your reply! According to your hint, the code should be like this:

    // i want to save data in rdd to mongodb and hdfs
    rdd.saveAsNewAPIHadoopFile()
    rdd.saveAsTextFile()

But will the application read HDFS twice?

qinwei

From: Akhil Das
Date: 2014-11-07 18:32
To: qinwei
CC: user
Subject: Re: about write mongodb in mapPartitions

Why not saveAsNewAPIHadoopFile?

// Define your mongoDB confs
val config = new Configuration()
config.set("mongo.output.uri", "mongodb://127.0.0.1:27017/sigmoid.output")

// Write everything to mongo
rdd.saveAsNewAPIHadoopFile("file:///some/random", classOf[Any], classOf[Any],
  classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]], config)

Thanks
Best Regards

On Fri, Nov 7, 2014 at 2:53 PM, qinwei  wrote:

Hi, everyone,

I have come across a problem with writing data to MongoDB in mapPartitions; my code is as below:

    val sourceRDD = sc.textFile("hdfs://host:port/sourcePath")
    // some transformations
    val rdd = sourceRDD.map(mapFunc).filter(filterFunc)
    val newRDD = rdd.mapPartitions(args => {
        val mongoClient = new MongoClient("host", port)
        val db = mongoClient.getDB("db")
        val coll = db.getCollection("collectionA")

        args.map(arg => {
            coll.insert(new BasicDBObject("pkg", arg))
            arg
        })

        mongoClient.close()
        args
    })
    newRDD.saveAsTextFile("hdfs://host:port/path")

The application saved data to HDFS correctly, but not to MongoDB. Is there something wrong?
I know that collecting the newRDD to the driver and then saving it to MongoDB will succeed, but will the following saveAsTextFile read the filesystem once again?

Thanks

qinwei





Re: Re: about write mongodb in mapPartitions

2014-11-09 Thread qinwei






Thanks for your reply! As you mentioned, the insert clause is not executed
because the results of args.map are never used anywhere; after I modified the
code, it works.


qinwei
From: Tobias Pfeiffer
Date: 2014-11-07 18:04
To: qinwei
CC: user
Subject: Re: about write mongodb in mapPartitions

Hi,

On Fri, Nov 7, 2014 at 6:23 PM, qinwei wrote:

    args.map(arg => {
        coll.insert(new BasicDBObject("pkg", arg))
        arg
    })

    mongoClient.close()
    args

As the results of args.map are never used anywhere, I think the loop body is
not executed at all. Maybe try:

    val argsProcessed = args.map(arg => {
        coll.insert(new BasicDBObject("pkg", arg))
        arg
    })
    mongoClient.close()
    argsProcessed

Tobias





Re: PySpark issue with sortByKey: "IndexError: list index out of range"

2014-11-09 Thread santon
Sorry for the delay. I'll try to add some more details on Monday.

Unfortunately, I don't have a script to reproduce the error. Actually, it
seemed to be more about the data set than the script. The same code on
different data sets led to different results; only larger data sets on the
order of 40 GB seemed to crash with the described error. Also, I believe
our cluster was recently updated to CDH 5.2, which uses Spark 1.1. I'll
check to see if the issue was resolved.

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] <
ml-node+s1001560n18393...@n3.nabble.com> wrote:

> Could you tell how large is the data set? It will help us to debug this
> issue.
>
> On Thu, Nov 6, 2014 at 10:39 AM, skane <[hidden email]> wrote:
>
> > I don't have any insight into this bug, but on Spark version 1.0.0 I ran
> into
> > the same bug running the 'sort.py' example. On a smaller data set, it
> worked
> > fine. On a larger data set I got this error:
> >
> > Traceback (most recent call last):
> >   File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in
> > 
> > .sortByKey(lambda x: x)
> >   File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
> > bounds.append(samples[index])
> > IndexError: list index out of range
> >
> >

Repartition to data-size per partition

2014-11-09 Thread Harry Brundage
I want to avoid the small files problem when using Spark, without having to
manually calibrate a `repartition` at the end of each Spark application I
am writing, since the amount of data passing through sadly isn't all that
predictable. We're picking up from and writing data to HDFS.

I know other tools like Pig can set the number of reducers and thus the
number of output partitions for you based on the size of the input data,
but I want to know if anyone else has a better way to do this with Spark's
primitives.

Right now we have an OK solution, but it is starting to break down. We cache
our output RDD at the end of the application's flow, and then map over it once
more to guess what size it will be when pickled and gzipped (we're in
pyspark), and then compute a number to repartition to using a target
partition size. The problem is that we want to work with datasets bigger
than what will comfortably fit in the cache. Just spitballing here, but
what would be amazing is the ability to ask Spark how big it thinks each
partition might be, or the ability to give an accumulator as an argument to
`repartition` whose value wouldn't be used until the prior stage had
finished, or the ability to just have Spark repartition to a target
partition size for us.
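
For reference, a bare-bones PySpark sketch of the cache-and-measure workaround
described above (TARGET_PARTITION_BYTES, compute_output_rdd and out_path are
placeholders; this is the existing approach, not a fix for the cache-size
limitation):

import cPickle, zlib

TARGET_PARTITION_BYTES = 128 * 1024 * 1024  # hypothetical target size per partition

output = compute_output_rdd()  # the application's final RDD
output.cache()
# Estimate the pickled+gzipped size of the whole data set.
total_bytes = output.map(
    lambda rec: len(zlib.compress(cPickle.dumps(rec, -1)))).sum()
num_parts = max(1, int(total_bytes / TARGET_PARTITION_BYTES))
output.repartition(num_parts).saveAsTextFile(out_path)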

Thanks for any help you can give me!


Re: Submitting Spark job on Unix cluster from dev environment (Windows)

2014-11-09 Thread thanhtien522
Yeah, it works.

I turned off the firewall on my Windows machine and it works now.

Thanks so much.





Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-09 Thread lev
I set the path of commons-math3-3.1.1.jar to spark.executor.extraClassPath
and it worked. 
Thanks a lot!

It only worked for me when the jar was locally on the machine.
Is there a way to make it work when the jar is on HDFS?

I tried pointing it at the file on HDFS (with and without
"hdfs://") and it didn't work.

Thanks,
Lev.






embedded spark for unit testing..

2014-11-09 Thread Kevin Burton
What’s the best way to embed spark to run local mode in unit tests?

Some of our jobs are mildly complex and I want to keep verifying that they
work including during schema changes / migration.

I think for some of this I would just run local mode, read from a few text
files via resources, and then write to /tmp …

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile




Re: org/apache/commons/math3/random/RandomGenerator issue

2014-11-09 Thread Shivaram Venkataraman
AFAIK, Spark only supports adding local files to the class path and not
from HDFS. If you are using spark-ec2, you could rsync the jar across
machines using something like '/root/spark-ec2/copy-dir '

Shivaram

On Sun, Nov 9, 2014 at 9:08 PM, lev  wrote:

> I set the path of commons-math3-3.1.1.jar to spark.executor.extraClassPath
> and it worked.
> Thanks a lot!
>
> It only worked for me when the jar was locally on the machine.
> Is there a way to make it work when the jar is on hdfs?
>
> I tried putting there a link to the file on the hdfs (with or without
> "hdfs://" ) and it didn't work..
>
> Thanks,
> Lev.
>
>
>
>
>


MLlib Naive Bayes classifier confidence

2014-11-09 Thread jatinpreet
Hi,

Is there a way to get the confidence value of a prediction with MLlib's
implementation of Naive Bayes classification? I wish to eliminate the
samples that were classified with low confidence.

Thanks,
Jatin



-
Novice Big Data Programmer



Queues

2014-11-09 Thread Deep Pradhan
Has anyone implemented Queues using RDDs?


Thank You


Rdd replication

2014-11-09 Thread rapelly kartheek
Hi,

I am trying to understand the RDD replication code. In the process, I
frequently re-run one Spark application whenever I make a change to the
code, to see the effect.

My problem is, after a set of repeated executions of the same application,
I find that my cluster behaves unusually.

Ideally, when I replicate an rdd twice, the webUI displays each partition
twice in the RDD storage info tab. But, sometimes I find that it displays
each partition only once. Also, when it is replicated only once, each
partition gets displayed twice. This happens frequently.

Can someone throw some light on this?


Re: Does Spark work on multicore systems?

2014-11-09 Thread lalit1303
While creating the SparkConf, set the variable "spark.cores.max" to the
maximum number of cores to be used by the Spark job.
By default it is set to 1.
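
A minimal PySpark sketch (64 is just the core count from the original
question; as noted earlier in the thread, the local master string also has to
allow multiple cores):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[64]")           # or "local[*]" to use all available cores
        .set("spark.cores.max", "64"))
sc = SparkContext(conf=conf)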



-
Lalit Yadav
la...@sigmoidanalytics.com



Re: embedded spark for unit testing..

2014-11-09 Thread DB Tsai
You can write unit tests with a local Spark context by mixing in the
LocalSparkContext trait.

See
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/util/LocalSparkContext.scala

as an example.
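
Those examples are Scala; for a Python test suite, a minimal sketch along the
same lines (assuming plain unittest with pyspark importable from the test
environment):

import unittest
from pyspark import SparkContext

class LocalSparkTest(unittest.TestCase):
    def setUp(self):
        # local[2] keeps the test on one machine but still exercises parallelism.
        self.sc = SparkContext("local[2]", self.__class__.__name__)

    def tearDown(self):
        self.sc.stop()

    def test_word_count(self):
        rdd = self.sc.parallelize(["a b", "b c"])
        counts = dict(rdd.flatMap(lambda line: line.split())
                         .map(lambda w: (w, 1))
                         .reduceByKey(lambda a, b: a + b)
                         .collect())
        self.assertEqual(counts["b"], 2)

if __name__ == "__main__":
    unittest.main()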

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Sun, Nov 9, 2014 at 9:12 PM, Kevin Burton  wrote:
> What’s the best way to embed spark to run local mode in unit tests?
>
> Some or our jobs are mildly complex and I want to keep verifying that they
> work including during schema changes / migration.
>
> I think for some of this I would just run local mode, read from a few text
> files via resources, and then write to /tmp …
>
> --
>
> Founder/CEO Spinn3r.com
> Location: San Francisco, CA
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Efficient Key Structure in pairRDD

2014-11-09 Thread nsareen
Hi,

We are trying to adopt Spark for our application.

We have an analytical application which stores data in star schemas (SQL
Server). All the cubes are loaded into a key/value structure and saved in
Trove (an in-memory collection), where the key is a short array in which each
short number represents a dimension member.
e.g. the tuple (CampaignX, Product1, Region_south, 10.23232) gets converted to
Trove Key [[12322],[45232],[53421]] & Value [10.23232].

This is done to avoid storing a collection of string objects as the key in Trove.

Now, can we save this data structure in Spark using a pair RDD? And if yes,
would key/value be an ideal way of storing the data in Spark and retrieving it
for analysis, or is there a better data structure we could create that would
help us build and process the RDD?
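
For what it's worth, a minimal PySpark sketch of how that structure could look
as a pair RDD (tuples instead of short arrays, since pair-RDD keys need to be
hashable; the numbers are just the illustrative values above, and `sc` is
assumed to be an existing SparkContext):

# (dimension-member ids) -> measure, mirroring the Trove layout described above
cells = [((12322, 45232, 53421), 10.23232)]
pair_rdd = sc.parallelize(cells)
region_south = pair_rdd.filter(lambda kv: kv[0][2] == 53421)  # e.g. slice on Region_south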

Nitin.






canopy clustering

2014-11-09 Thread aminn_524
I want to run the k-means of MLlib on a big dataset. It seems that for big
datasets we need to perform pre-clustering methods such as canopy clustering.
By starting with an initial clustering, the number of more expensive distance
measurements can be significantly reduced by ignoring points outside of the
initial canopies.

If I am not mistaken, in the k-means of MLlib there are three initialization
methods: k-means++, k-means|| and random.

So, can anyone explain whether we can use k-means|| instead of canopy
clustering, or do these two methods act completely differently?


