Re: Book: Data Analysis with SparkR

2014-11-21 Thread Zongheng Yang
Hi Daniel,

Thanks for your email! We don't have a book (yet?) specifically on SparkR,
but here's a list of helpful tutorials / links you can check out (I am
listing them in roughly basic -> advanced order):

- AMPCamp5 SparkR exercises. This covers the basics of SparkR's API,
performs basic analytics, and visualizes the results.
- SparkR examples. We have K-means, logistic regression, MNIST solver,
pi estimation, word count and other examples available.
- Running SparkR on EC2. This entry details the steps to run a SparkR
program on an EC2 cluster.

Finally, we have a talk at AMPCamp today on SparkR, whose video & slides
will be available soon on the website -- it covers the basics of the
interface & what you can do with it. Additionally, you could direct any
SparkR questions to our sparkr-dev mailing list.

Let us know if you have further questions.

Zongheng

On Fri Nov 21 2014 at 3:48:53 PM Emaasit  wrote:

> Is there a book on SparkR for the absolute & terrified beginner?
> I use R for my daily analysis and I am interested in a detailed guide to
> using SparkR for data analytics: like a book or online tutorials. If
> there's
> any please direct me to the address.
>
> Thanks,
> Daniel
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Book-Data-Analysis-with-SparkR-tp19529.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: SparkR : lapplyPartition transforms the data in vertical format

2014-08-07 Thread Zongheng Yang
Hi Pranay,

If this data format is to be assumed, then I believe the issue starts at

lines <- textFile(sc,"/sparkdev/datafiles/covariance.txt")
totals <- lapply(lines, function(lines)

After the first line, `lines` becomes an RDD of strings, each of which
is a line of the form "1,1". Therefore, the lapply() should be used to
map over each line, like this:

totals <- lapply(lines, function(line) ...)  # modified logic here, treating
# each `line` as a single string of the form "x,x"

I'm only taking a quick glance here, so let me know if this approach still doesn't work!

On Wed, Aug 6, 2014 at 11:29 PM, Pranay Dave  wrote:
> Hello Shivram
> Thanks for your reply.
>
> Here is a simple data set input. This data is in file called
> "/sparkdev/datafiles/covariance.txt"
> 1,1
> 2,2
> 3,3
> 4,4
> 5,5
> 6,6
> 7,7
> 8,8
> 9,9
> 10,10
>
> Output I would like to see is a total of columns. It can be done with
> reduce, but I wanted to test lapply.
>
> Output I want to see is sum of columns in same row
> 55,55
>
> But output what I get is in two rows
> 55, NA
> 55, NA
>
> Thanks
> Pranay
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-lapplyPartition-transforms-the-data-in-vertical-format-tp11540p11617.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Visualizing stage & task dependency graph

2014-08-04 Thread Zongheng Yang
I agree that this is definitely useful.

One related project I know of is Sparkling [1] (also see talk at Spark
Summit 2014 [2]), but it'd be great (and I imagine somewhat
challenging) to visualize the *physical execution* graph of a Spark
job.

[1] http://pr01.uml.edu/
[2] 
http://spark-summit.org/2014/talk/sparkling-identification-of-task-skew-and-speculative-partition-of-data-for-spark-applications

On Mon, Aug 4, 2014 at 8:55 PM, rpandya  wrote:
> Is there a way to visualize the task dependency graph of an application,
> during or after its execution? The list of stages on port 4040 is useful,
> but still quite limited. For example, I've found that if I don't cache() the
> result of one expensive computation, it will get repeated 4 times, but it is
> not easy to trace through exactly why. Ideally, what I would like for each
> stage is:
> - the individual tasks and their dependencies
> - the various RDD operators that have been applied
> - the full stack trace, both for the stage barrier, the task, and for the
> lambdas used (often the RDDs are manipulated inside layers of code, so the
> immediate file/line# is not enough)
>
> Any suggestions?
>
> Thanks,
>
> Ravi
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Visualizing-stage-task-dependency-graph-tp11404.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
Looking at what this patch [1] has to do to achieve it, I am not sure
if you can do the same thing in 1.0.0 using DSL only. Just curious,
why don't you use the hql() / sql() methods and pass a query string
in?

[1] https://github.com/apache/spark/pull/1211/files
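
For example, a rough sketch of the query-string route in 1.0.0 (untested;
the table name and parquet path below are made up, and if the plain sql()
parser in your version doesn't accept COUNT(DISTINCT ...), the hql()
method on a HiveContext should):

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val r = sqlContext.parquetFile("hdfs://.../keywords.parquet")  // hypothetical path
r.registerAsTable("events")
sqlContext.sql("SELECT keyword, COUNT(DISTINCT userId) FROM events GROUP BY keyword")
  .collect().foreach(println)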

On Thu, Jul 31, 2014 at 2:20 PM, Buntu Dev  wrote:
> Thanks Zongheng for the pointer. Is there a way to achieve the same in 1.0.0
> ?
>
>
> On Thu, Jul 31, 2014 at 1:43 PM, Zongheng Yang  wrote:
>>
>> countDistinct is recently added and is in 1.0.2. If you are using that
>> or the master branch, you could try something like:
>>
>> r.select('keyword, countDistinct('userId)).groupBy('keyword)
>>
>> On Thu, Jul 31, 2014 at 12:27 PM, buntu  wrote:
>> > I'm looking to write a select statement to get a distinct count on
>> > userId
>> > grouped by keyword column on a parquet file SchemaRDD equivalent of:
>> >   SELECT keyword, count(distinct(userId)) from table group by keyword
>> >
>> > How to write it using the chained select().groupBy() operations?
>> >
>> > Thanks!
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-select-expression-tp11069.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: SchemaRDD select expression

2014-07-31 Thread Zongheng Yang
countDistinct was recently added and is in 1.0.2. If you are using that
or the master branch, you could try something like:

r.select('keyword, countDistinct('userId)).groupBy('keyword)

On Thu, Jul 31, 2014 at 12:27 PM, buntu  wrote:
> I'm looking to write a select statement to get a distinct count on userId
> grouped by keyword column on a parquet file SchemaRDD equivalent of:
>   SELECT keyword, count(distinct(userId)) from table group by keyword
>
> How to write it using the chained select().groupBy() operations?
>
> Thanks!
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-select-expression-tp11069.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: the EC2 setup script often will not allow me to SSH into my machines. Ideas?

2014-07-30 Thread Zongheng Yang
To add to this: for this many (>= 20) machines I usually use at least
--wait 600.
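
For example, taking the launch command quoted further down in this
thread, a retry with a longer wait (and --resume after a failed attempt)
would look roughly like:

./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem \
  -s 32 --instance-type=m1.medium --wait 600 --resume launch cluster_name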

On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
 wrote:
> William,
>
> The error you are seeing is misleading. There is no need to terminate the
> cluster and start over.
>
> Just re-run your launch command, but with the additional --resume option
> tacked on the end.
>
> As Akhil explained, this happens because AWS is not starting up the
> instances as quickly as the script is expecting. You can increase the wait
> time to mitigate this problem.
>
> Nick
>
>
>
> On Wed, Jul 30, 2014 at 11:51 AM, Akhil Das 
> wrote:
>>
>> You need to increase the wait time, (-w) the default is 120 seconds, you
>> may set it to a higher number like 300-400. The problem is that EC2 takes
>> some time to initiate the machine (which is > 120 seconds sometimes.)
>>
>> Thanks
>> Best Regards
>>
>>
>> On Wed, Jul 30, 2014 at 8:52 PM, William Cox
>>  wrote:
>>>
>>> TL;DR >50% of the time I can't SSH into either my master or slave nodes
>>> and have to terminate all the machines and restart the EC2 cluster setup
>>> process.
>>>
>>> Hello,
>>>
>>> I'm trying to setup a Spark cluster on Amazon EC2. I am finding the setup
>>> script to be delicate and unpredictable in terms of reliably allowing SSH
>>> logins to all of the slaves and the master. For instance (I'm running Spark
>>> 0.9.1-hadoop1, since I intend to use Shark.) I call this command to provision
>>> a 32 slave cluster using spot instances:
>>>
 $./spark-ec2 --spot-price=0.1 --zone=us-east-1e -k key -i ~/key.pem -s
 32 --instance-type=m1.medium launch cluster_name
>>>
>>>
>>>  After waiting for the instances to provision I get the following output:
>>>
 All 32 slaves granted
 Launched master in us-east-1e, regid = r-f8444a89
 Waiting for instances to start up...
 Waiting 120 more seconds...
 Generating cluster's SSH key on master...
 ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
 Connection refused
 Error executing remote command, retrying after 30 seconds: Command
 '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/user/key.pem',
 '-t', '-t', u'r...@ecmaster.compute-1.amazonaws.com', "\n  [ -f
 ~/.ssh/id_rsa ] ||\n(ssh-keygen -q -t rsa -N '' -f ~/.ssh/id_rsa
 &&\n cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys)\n"]'
 returned non-zero exit status 255
>>>
>>>
>>> I have removed the key and machine names with 'MASTER' and 'key'. I wait
>>> a few more cycles of the error message and finally, after 3 attempts, the
>>> script quits with this message:
>>>
>>>
 ssh: connect to host ecMASTER.compute-1.amazonaws.com port 22:
 Connection refused
 Error:
 Failed to SSH to remote host ecMASTER.compute-1.amazonaws.com.
 Please check that you have provided the correct --identity-file and
 --key-pair parameters and try again.
>>>
>>> So, YES, the .pem file is correct - I am currently running a smaller
>>> cluster and can provision other machines on EC2 and use that file. Secondly,
>>> the node it can't seem to connect to is the MASTER node. I have also gone
>>> into the EC2 console and verified that all the machines are using the "key"
>>> that corresponds to "key.pem".
>>>
>>> I have tried this command 2x and on a friends machine with no success.
>>> However, I was able to provision a 15 machine cluster using m1.larges.
>>>
>>> Now I PAUSE for some period of time - 2-3 minutes (to write this email) -
>>> and I call the same command with the "--resume" flag. This time it logs into
>>> the master node just fine and begins to give the slaves SSH keys, and it
>>> fails on a certain slave.

 ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
 Connection refused
 Error 255 while executing remote command, retrying after 30 seconds
 ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
 Connection refused
 Error 255 while executing remote command, retrying after 30 seconds
 ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
 Connection refused
 Error 255 while executing remote command, retrying after 30 seconds
 ssh: connect to host ec2-54-237-6-95.compute-1.amazonaws.com port 22:
 Connection refused
 Traceback (most recent call last):
   File "./spark_ec2.py", line 806, in 
 main()
   File "./spark_ec2.py", line 799, in main
 real_main()
   File "./spark_ec2.py", line 684, in real_main
 setup_cluster(conn, master_nodes, slave_nodes, opts, True)
   File "./spark_ec2.py", line 423, in setup_cluster
 ssh_write(slave.public_dns_name, opts, ['tar', 'x'], dot_ssh_tar)
   File "./spark_ec2.py", line 640, in ssh_write
 raise RuntimeError("ssh_write failed with error %s" %
 proc.returncode)
 RuntimeError: ssh_write failed with error 255
>>>
>>> So I log into the EC2 console and TERMINATE that specific machine, and
>>> 

Re: SparkSQL can not use SchemaRDD from Hive

2014-07-28 Thread Zongheng Yang
As Hao already mentioned, using 'hive' (the HiveContext) throughout would
work.
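
A minimal sketch of what that looks like for the snippet quoted below
(untested against 1.0.0, but the point is to register and query the table
on the same HiveContext rather than on a separate SQLContext):

val hive = new org.apache.spark.sql.hive.HiveContext(sc)
val sample = hive.hql("select * from sample10")  // SchemaRDD bound to `hive`
hive.registerRDDAsTable(sample, "temp")
hive.hql("select * from temp").count()  // resolved by the same context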

On Monday, July 28, 2014, Cheng, Hao  wrote:

> In your code snippet, "sample" is actually a SchemaRDD, and SchemaRDD
> actually binds a certain SQLContext in runtime, I don't think we can
> manipulate/share the SchemaRDD across SQLContext Instances.
>
> -Original Message-
> From: Kevin Jung [mailto:itsjb.j...@samsung.com ]
> Sent: Tuesday, July 29, 2014 1:47 PM
> To: u...@spark.incubator.apache.org 
> Subject: SparkSQL can not use SchemaRDD from Hive
>
> Hi
> I got a error message while using Hive and SparkSQL.
> This is code snippet I used.
>
> (in spark-shell , 1.0.0)
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
> val hive = new org.apache.spark.sql.hive.HiveContext(sc)
> var sample = hive.hql("select * from sample10") // This creates SchemaRDD.
> I have table 'sample10' in hive.
> var countHive = sample.count() // It works
> sqlContext.registerRDDAsTable(sample,"temp")
> sqlContext.sql("select * from temp").count() // It gives me a error message
> "java.lang.RuntimeException: Table Not Found: sample10"
>
> I don't know why this happen. Does SparkSQL conflict with Hive?
>
> Thanks,
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-can-not-use-SchemaRDD-from-Hive-tp10841.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Getting the number of slaves

2014-07-28 Thread Zongheng Yang
Nicholas,

The (somewhat common) situation you ran into probably meant the
executors were still connecting. A typical solution is to sleep a
couple seconds before querying that field.
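
If you'd rather not hard-code a sleep, a rough sketch of polling instead
(the expected executor count and retry limit here are made up):

// Wait until the expected number of executors (plus the driver, per the
// reply below) has registered, or give up after ~30 seconds.
def waitForExecutors(sc: org.apache.spark.SparkContext, expected: Int): Unit = {
  var attempts = 0
  while (sc.getExecutorStorageStatus.length < expected + 1 && attempts < 30) {
    Thread.sleep(1000)
    attempts += 1
  }
}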

On Mon, Jul 28, 2014 at 3:57 PM, Andrew Or  wrote:
> Yes, both of these are derived from the same source, and this source
> includes the driver. In other words, if you submit a job with 10 executors
> you will get back 11 for both statuses.
>
>
> 2014-07-28 15:40 GMT-07:00 Sung Hwan Chung :
>
>> Do getExecutorStorageStatus and getExecutorMemoryStatus both return the
>> number of executors + the driver?
>> E.g., if I submit a job with 10 executors, I get 11 for
>> getExeuctorStorageStatus.length and getExecutorMemoryStatus.size
>>
>>
>> On Thu, Jul 24, 2014 at 4:53 PM, Nicolas Mai 
>> wrote:
>>>
>>> Thanks, this is what I needed :) I should have searched more...
>>>
>>> Something I noticed though: after the SparkContext is initialized, I had
>>> to
>>> wait for a few seconds until sc.getExecutorStorageStatus.length returns
>>> the
>>> correct number of workers in my cluster (otherwise it returns 1, for the
>>> driver)...
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Getting-the-number-of-slaves-tp10604p10619.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>


Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow。

2014-07-28 Thread Zongheng Yang
The optimal config depends on lots of things, but did you try a
smaller number of shuffle partitions? Just guessing -- 160 / 320 may be
reasonable.
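
For example, with the snippet quoted below that would just mean something
like (320 being a guess of ~2 partitions per core for 80 executors x 2
cores):

sqlsc.set("spark.sql.shuffle.partitions", "320")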

On Mon, Jul 28, 2014 at 1:52 AM, Earthson  wrote:
> I'm using SparkSQL with Hive 0.13, here is the SQL for inserting a partition
> with 2048 buckets.
> 
>   sqlsc.set("spark.sql.shuffle.partitions", "2048")
>   hql("""|insert %s table mz_log
>|PARTITION (date='%s')
>|select * from tmp_mzlog
>|CLUSTER BY mzid
> """.stripMargin.format(overwrite, log_date))
> 
>
> env:
>
> yarn-client mode with 80 executor, 2 cores/per executor.
>
> Data:
>
> original text log is about 1.1T.
>
> - - -
>
> the reduce stage is too slow.
>
> 
>
> here is the network usage, it's not the bottle neck.
>
> 
>
> and the CPU load is very high, why?
>
> 
> here is the configuration(conf/spark-defaults.conf)
>
> 
> spark.ui.port   
> spark.akka.frameSize128
> spark.akka.timeout  600
> spark.akka.threads  8
> spark.files.overwrite   true
> spark.executor.memory   2G
> spark.default.parallelism   32
> spark.shuffle.consolidateFiles  true
> spark.kryoserializer.buffer.mb  128
> spark.storage.blockManagerSlaveTimeoutMs20
> spark.serializerorg.apache.spark.serializer.KryoSerializer
> 
>
> 2 failed with MapTracker Error.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-SparkSQL-reduce-stage-of-shuffle-is-slow-tp10765.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
Can you paste a small code example to illustrate your questions?

On Tue, Jul 22, 2014 at 5:05 PM, hsy...@gmail.com  wrote:
> Sorry, typo. What I mean is sharing. If the sql is changing at runtime, how
> do I broadcast the sql to all workers that is doing sql analysis.
>
> Best,
> Siyuan
>
>
> On Tue, Jul 22, 2014 at 4:15 PM, Zongheng Yang  wrote:
>>
>> Do you mean that the texts of the SQL queries are hardcoded in the
>> code? What do you mean by "cannot share the sql to all workers"?
>>
>> On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com 
>> wrote:
>> > Hi guys,
>> >
>> > I'm able to run some Spark SQL example but the sql is static in the
>> > code. I
>> > would like to know is there a way to read sql from somewhere else (shell
>> > for
>> > example)
>> >
>> > I could read sql statement from kafka/zookeeper, but I cannot share the
>> > sql
>> > to all workers. broadcast seems not working for updating values.
>> >
>> > Moreover if I use some non-serializable class(DataInputStream etc) to
>> > read
>> > sql from other source, I always get "Task not serializable:
>> > java.io.NotSerializableException"
>> >
>> >
>> > Best,
>> > Siyuan
>
>


Re: How to do an interactive Spark SQL

2014-07-22 Thread Zongheng Yang
Do you mean that the texts of the SQL queries are hardcoded in the
code? What do you mean by "cannot share the sql to all workers"?

On Tue, Jul 22, 2014 at 4:03 PM, hsy...@gmail.com  wrote:
> Hi guys,
>
> I'm able to run some Spark SQL example but the sql is static in the code. I
> would like to know is there a way to read sql from somewhere else (shell for
> example)
>
> I could read sql statement from kafka/zookeeper, but I cannot share the sql
> to all workers. broadcast seems not working for updating values.
>
> Moreover if I use some non-serializable class(DataInputStream etc) to read
> sql from other source, I always get "Task not serializable:
> java.io.NotSerializableException"
>
>
> Best,
> Siyuan


Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Zongheng Yang
One way is to set this in your conf/spark-defaults.conf:

spark.executor.extraLibraryPath /path/to/native/lib

The key is documented here:
http://spark.apache.org/docs/latest/configuration.html
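
If you'd rather set it in code, the same key should also work on the
SparkConf before the SparkContext is created (a sketch; I haven't
verified this path for executor-side native libs):

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.extraLibraryPath", "/path/to/native/lib")
val sc = new org.apache.spark.SparkContext(conf)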

On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman
 wrote:
> I used to use SPARK_LIBRARY_PATH to specify the location of native libs
> for lzo compression when using spark 0.9.0.
>
> The references to that environment variable have disappeared from the docs
> for
> spark 1.0.1 and it's not clear how to specify the location for lzo.
>
> Any guidance?


Re: Equivalent functions for NVL() and CASE expressions in Spark SQL

2014-07-17 Thread Zongheng Yang
Hi Pandees,

Spark SQL introduced support for CASE expressions just recently and it
is available in 1.0.1. As for NVL(), I don't think we support it yet,
and if you are interested, a pull request would be much appreciated!
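
In the meantime, NVL(x, y) can usually be emulated with a CASE
expression. A rough sketch (the context, table, and column names below
are placeholders, and whether sql() or hql() accepts it depends on your
version):

hiveContext.hql("SELECT CASE WHEN some_col IS NULL THEN 'fallback' ELSE some_col END FROM some_table")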

Thanks,
Zongheng

On Thu, Jul 17, 2014 at 7:26 AM, pandees waran  wrote:
> Do we have any equivalent scala functions available for NVL() and CASE
> expressions to use in spark sql?
>


Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
Hi Keith & gorenuru,

This patch (https://github.com/apache/spark/pull/1423) solves the
errors for me in my local tests. If possible, can you guys test this
out to see if it solves your test programs?

Thanks,
Zongheng

On Tue, Jul 15, 2014 at 3:08 PM, Zongheng Yang  wrote:
> - user@incubator
>
> Hi Keith,
>
> I did reproduce this using local-cluster[2,2,1024], and the errors
> look almost the same.  Just wondering, despite the errors did your
> program output any result for the join? On my machine, I could see the
> correct output.
>
> Zongheng
>
> On Tue, Jul 15, 2014 at 1:46 PM, Michael Armbrust
>  wrote:
>> Thanks for the extra info.  At a quick glance the query plan looks fine to
>> me.  The class IntegerType does build a type tag. I wonder if you are
>> seeing the Scala issue manifest in some new way.  We will attempt to
>> reproduce locally.
>>
>>
>> On Tue, Jul 15, 2014 at 1:41 PM, gorenuru  wrote:
>>>
>>> Just my "few cents" on this.
>>>
>>> I'm having the same problems with v 1.0.1 but this bug is sporadic and looks
>>> like it is related to object initialization.
>>>
>>> Even more, i'm not using any SQL or something. I just have utility class
>>> like this:
>>>
>>> object DataTypeDescriptor {
>>>   type DataType = String
>>>
>>>   val BOOLEAN = "BOOLEAN"
>>>   val STRING = "STRING"
>>>   val TIMESTAMP = "TIMESTAMP"
>>>   val LONG = "LONG"
>>>   val INT = "INT"
>>>   val SHORT = "SHORT"
>>>   val BYTE = "BYTE"
>>>   val DECIMAL = "DECIMAL"
>>>   val DOUBLE = "DOUBLE"
>>>   val FLOAT = "FLOAT"
>>>
>>>   def $$(name: String, format: Option[String] = None) =
>>> DataTypeDescriptor(name, format)
>>>
>>>   private lazy val nativeTypes: Map[String, NativeType] = Map(
>>> BOOLEAN -> BooleanType, STRING -> StringType, TIMESTAMP ->
>>> TimestampType, LONG -> LongType, INT -> IntegerType,
>>> SHORT -> ShortType, BYTE -> ByteType, DECIMAL -> DecimalType, DOUBLE
>>> ->
>>> DoubleType, FLOAT -> FloatType
>>>   )
>>>
>>>   lazy val defaultValues: Map[String, Any] = Map(
>>> BOOLEAN -> false, STRING -> "", TIMESTAMP -> null, LONG -> 0L, INT ->
>>> 0,
>>> SHORT -> 0.toShort, BYTE -> 0.toByte,
>>> DECIMAL -> BigDecimal(0d), DOUBLE -> 0d, FLOAT -> 0f
>>>   )
>>>
>>>   def apply(dataType: String): DataTypeDescriptor = {
>>> DataTypeDescriptor(dataType.toUpperCase, None)
>>>   }
>>>
>>>   def apply(dataType: SparkDataType): DataTypeDescriptor = {
>>> nativeTypes
>>>   .find { case (_, descriptor) => descriptor == dataType }
>>>   .map { case (name, descriptor) => DataTypeDescriptor(name, None) }
>>>   .get
>>>   }
>>>
>>> .
>>>
>>> and some test that check each of this methods.
>>>
>>> The problem is that this test fails randomly with this error.
>>>
>>> P.S.: I did not have this problem in Spark 1.0.0
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-Spark-SQL-join-when-using-Spark-1-0-1-tp9776p9817.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>


Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
- user@incubator

Hi Keith,

I did reproduce this using local-cluster[2,2,1024], and the errors
look almost the same.  Just wondering, despite the errors did your
program output any result for the join? On my machine, I could see the
correct output.

Zongheng

On Tue, Jul 15, 2014 at 1:46 PM, Michael Armbrust
 wrote:
> Thanks for the extra info.  At a quick glance the query plan looks fine to
> me.  The class IntegerType does build a type tag. I wonder if you are
> seeing the Scala issue manifest in some new way.  We will attempt to
> reproduce locally.
>
>
> On Tue, Jul 15, 2014 at 1:41 PM, gorenuru  wrote:
>>
>> Just my "few cents" on this.
>>
>> I'm having the same problems with v 1.0.1 but this bug is sporadic and looks
>> like it is related to object initialization.
>>
>> Even more, i'm not using any SQL or something. I just have utility class
>> like this:
>>
>> object DataTypeDescriptor {
>>   type DataType = String
>>
>>   val BOOLEAN = "BOOLEAN"
>>   val STRING = "STRING"
>>   val TIMESTAMP = "TIMESTAMP"
>>   val LONG = "LONG"
>>   val INT = "INT"
>>   val SHORT = "SHORT"
>>   val BYTE = "BYTE"
>>   val DECIMAL = "DECIMAL"
>>   val DOUBLE = "DOUBLE"
>>   val FLOAT = "FLOAT"
>>
>>   def $$(name: String, format: Option[String] = None) =
>> DataTypeDescriptor(name, format)
>>
>>   private lazy val nativeTypes: Map[String, NativeType] = Map(
>> BOOLEAN -> BooleanType, STRING -> StringType, TIMESTAMP ->
>> TimestampType, LONG -> LongType, INT -> IntegerType,
>> SHORT -> ShortType, BYTE -> ByteType, DECIMAL -> DecimalType, DOUBLE
>> ->
>> DoubleType, FLOAT -> FloatType
>>   )
>>
>>   lazy val defaultValues: Map[String, Any] = Map(
>> BOOLEAN -> false, STRING -> "", TIMESTAMP -> null, LONG -> 0L, INT ->
>> 0,
>> SHORT -> 0.toShort, BYTE -> 0.toByte,
>> DECIMAL -> BigDecimal(0d), DOUBLE -> 0d, FLOAT -> 0f
>>   )
>>
>>   def apply(dataType: String): DataTypeDescriptor = {
>> DataTypeDescriptor(dataType.toUpperCase, None)
>>   }
>>
>>   def apply(dataType: SparkDataType): DataTypeDescriptor = {
>> nativeTypes
>>   .find { case (_, descriptor) => descriptor == dataType }
>>   .map { case (name, descriptor) => DataTypeDescriptor(name, None) }
>>   .get
>>   }
>>
>> .
>>
>> and some test that check each of this methods.
>>
>> The problem is that this test fails randomly with this error.
>>
>> P.S.: I did not have this problem in Spark 1.0.0
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-Spark-SQL-join-when-using-Spark-1-0-1-tp9776p9817.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Zongheng Yang
FWIW, I am unable to reproduce this using the example program locally.

On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons  wrote:
> Nope.  All of them are registered from the driver program.
>
> However, I think we've found the culprit.  If the join column between two
> tables is not in the same column position in both tables, it triggers what
> appears to be a bug.  For example, this program fails:
>
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.SchemaRDD
> import org.apache.spark.sql.catalyst.types._
>
> case class Record(value: String, key: Int)
> case class Record2(key: Int, value: String)
>
> object TestJob {
>
>   def main(args: Array[String]) {
> run()
>   }
>
>   private def run() {
> val sparkConf = new SparkConf()
> sparkConf.setAppName("TestJob")
> sparkConf.set("spark.cores.max", "8")
> sparkConf.set("spark.storage.memoryFraction", "0.1")
> sparkConf.set("spark.shuffle.memoryFracton", "0.2")
> sparkConf.set("spark.executor.memory", "2g")
> sparkConf.setJars(List("target/scala-2.10/spark-test-assembly-1.0.jar"))
> sparkConf.setMaster(s"spark://dev1.dev.pulse.io:7077")
> sparkConf.setSparkHome("/home/pulseio/spark/current")
> val sc = new SparkContext(sparkConf)
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
>
> val rdd1 = sc.parallelize((1 to 100).map(i => Record(s"val_$i", i)))
> val rdd2 = sc.parallelize((1 to 100).map(i => Record2(i, s"rdd_$i")))
> rdd1.registerAsTable("rdd1")
> rdd2.registerAsTable("rdd2")
>
> sql("SELECT * FROM rdd1").collect.foreach { row => println(row) }
>
> sql("SELECT rdd1.key, rdd1.value, rdd2.value FROM rdd1 join rdd2 on
> rdd1.key = rdd2.key order by rdd1.key").collect.foreach { row =>
> println(row) }
>
> sc.stop()
>   }
>
> }
>
> If you change the definition of Record and Record2 to the following, it
> succeeds:
>
> case class Record(key: Int, value: String)
> case class Record2(key: Int, value: String)
>
> as does:
>
> case class Record(value: String, key: Int)
> case class Record2(value: String, key: Int)
>
> Let me know if you need anymore details.
>
>
> On Tue, Jul 15, 2014 at 11:14 AM, Michael Armbrust 
> wrote:
>>
>> Are you registering multiple RDDs of case classes as tables concurrently?
>> You are possibly hitting SPARK-2178 which is caused by SI-6240.
>>
>>
>> On Tue, Jul 15, 2014 at 10:49 AM, Keith Simmons 
>> wrote:
>>>
>>> HI folks,
>>>
>>> I'm running into the following error when trying to perform a join in my
>>> code:
>>>
>>> java.lang.NoClassDefFoundError: Could not initialize class
>>> org.apache.spark.sql.catalyst.types.LongType$
>>>
>>> I see similar errors for StringType$ and also:
>>>
>>>  scala.reflect.runtime.ReflectError: value apache is not a package.
>>>
>>> Strangely, if I just work with a single table, everything is fine. I can
>>> iterate through the records in both tables and print them out without a
>>> problem.
>>>
>>> Furthermore, this code worked without an exception in Spark 1.0.0
>>> (thought the join caused some field corruption, possibly related to
>>> https://issues.apache.org/jira/browse/SPARK-1994).  The data is coming from
>>> a custom protocol buffer based format on hdfs that is being mapped into the
>>> individual record types without a problem.
>>>
>>> The immediate cause seems to be a task trying to deserialize one or more
>>> SQL case classes before loading the spark uber jar, but I have no idea why
>>> this is happening, or why it only happens when I do a join.  Ideas?
>>>
>>> Keith
>>>
>>> P.S. If it's relevant, we're using the Kryo serializer.
>>>
>>>
>>
>


Re: Count distinct with groupBy usage

2014-07-15 Thread Zongheng Yang
Sounds like a job for Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html !
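
A rough sketch of what that could look like for the data below (the case
class, table, and column names are my own, and COUNT(DISTINCT ...)
support depends on your Spark SQL version -- a HiveContext's hql()
accepts it in any case):

case class Hit(timestamp: Long, page: String, userId: String)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
val hits = sc.textFile("data.csv").map(_.split(",")).map(p => Hit(p(0).toLong, p(1), p(2)))
hits.registerAsTable("hits")
// unique users per page; a derived date column could be added for the per-date grouping
sql("SELECT page, COUNT(DISTINCT userId) FROM hits GROUP BY page").collect().foreach(println)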

On Tue, Jul 15, 2014 at 11:25 AM, Nick Pentreath
 wrote:
> You can use .distinct.count on your user RDD.
>
> What are you trying to achieve with the time group by?
> —
> Sent from Mailbox
>
>
> On Tue, Jul 15, 2014 at 8:14 PM, buntu  wrote:
>>
>> Hi --
>>
>> New to Spark and trying to figure out how to do a generate unique counts
>> per
>> page by date given this raw data:
>>
>> timestamp,page,userId
>> 1405377264,google,user1
>> 1405378589,google,user2
>> 1405380012,yahoo,user1
>> ..
>>
>> I can do a groupBy a field and get the count:
>>
>> val lines=sc.textFile("data.csv")
>> val csv=lines.map(_.split(","))
>> // group by page
>> csv.groupBy(_(1)).count
>>
>> But not able to see how to do count distinct on userId and also apply
>> another groupBy on timestamp field. Please let me know how to handle such
>> cases.
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Count-distinct-with-groupBy-usage-tp9781.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-11 Thread Zongheng Yang
Hey Jerry,

When you ran these queries using different methods, did you see any
discrepancy in the returned results (i.e. the counts)?

On Thu, Jul 10, 2014 at 5:55 PM, Michael Armbrust
 wrote:
> Yeah, sorry.  I think you are seeing some weirdness with partitioned tables
> that I have also seen elsewhere. I've created a JIRA and assigned someone at
> databricks to investigate.
>
> https://issues.apache.org/jira/browse/SPARK-2443
>
>
> On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam  wrote:
>>
>> Hi Michael,
>>
>> Yes the table is partitioned on 1 column. There are 11 columns in the
>> table and they are all String type.
>>
>> I understand that SerDes contributes to some overheads but using pure
>> Hive, we could run the query about 5 times faster than SparkSQL. Given that
>> Hive also has the same SerDes overhead, then there must be something
>> additional that SparkSQL adds to the overall overheads that Hive doesn't
>> have.
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>>
>> On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust 
>> wrote:
>>>
>>> On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam  wrote:

 For the curious mind, the dataset is about 200-300GB and we are using 10
 machines for this benchmark. Given the env is equal between the two
 experiments, why pure spark is faster than SparkSQL?
>>>
>>>
>>> There is going to be some overhead to parsing data using the Hive SerDes
>>> instead of the native Spark code, however, the slow down you are seeing here
>>> is much larger than I would expect. Can you tell me more about the table?
>>> What does the schema look like?  Is it partitioned?
>>>
 By the way, I also try hql("select * from m").count. It is terribly slow
 too.
>>>
>>>
>>> FYI, this query is actually identical to the one where you write out
>>> COUNT(*).
>>
>>
>


Re: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Zongheng Yang
Hi Haoming,

For your spark-submit question: can you try using an assembly jar
("sbt/sbt assembly" will build it for you)? Another thing to check is
if there is any package structure that contains your SimpleApp; if so
you should include the fully qualified (hierarchical) name.
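
For example, if SimpleApp lives in a package (the package name and jar
path below are made up), the submit command would need the fully
qualified class name and, ideally, the assembly jar:

./bin/spark-submit --class "com.example.SimpleApp" --master local /path/to/your-app-assembly.jar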

Zongheng

On Thu, Jul 10, 2014 at 11:33 AM, Haoming Zhang
 wrote:
> Hi Yadid,
>
> I have the same problem with you so I implemented the product interface as
> well, even the codes are similar with your codes. But now I face another
> problem that is I don't know how to run the codes...My whole program is like
> this:
>
> object SimpleApp {
>
>   class Record(val x1: String, val x2: String, val x3: String, ... val x24:
> String) extends Product with Serializable {
> def canEqual(that: Any) = that.isInstanceOf[Record]
>
> def productArity = 24
>
>
> def productElement(n: Int) = n match {
>   case 0 => x1
>   case 1 => x2
>   case 2 => x3
>   ...
>   case 23 => x24
> }
>   }
>
>   def main(args: Array[String]) {
>
> val conf = new SparkConf().setAppName("Product Test")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc);
>
> val record = new Record("a", "b", "c", "d", "e", "f", "g", "h", "i",
> "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x")
>
> import sqlContext._
> sc.parallelize(record :: Nil).registerAsTable("records")
>
> sql("SELECT x1 FROM records").collect()
>   }
> }
>
> I tried to run the above program with spark-submit:
> ./spark-submit --class "SimpleApp" --master local
> /playground/ProductInterface/target/scala-2.10/classes/product-interface-test_2.10-1.0.jar
>
> But I always get the exception that is "Exception in thread "main"
> java.lang.ClassNotFoundException: SimpleApp".
>
> So can you please share me the way to run the test program? Actually I can
> see there is a SimpleApp.class in classes folder, but I don't understand why
> spark-submit cannot find it.
>
> Best,
> Haoming
>
>> Date: Thu, 10 Jul 2014 09:02:18 -0700
>> From: ya...@media.mit.edu
>> To: u...@spark.incubator.apache.org
>> Subject: SPARKSQL problem with implementing Scala's Product interface
>
>>
>> Hi All,
>>
>> I have a class with too many variables to be implemented as a case class,
>> therefor I am using regular class that implements Scala's product
>> interface.
>> Like so:
>>
>> class Info () extends Product with Serializable {
>> var param1 : String = ""
>> var param2 : String = ""
>> ...
>> var param38: String = ""
>>
>> def canEqual(that: Any) = that.isInstanceOf[Info]
>> def productArity = 38
>> def productElement(n: Int) = n match {
>> case 0 => param1
>> case 1 => param2
>> ...
>> case 37 => param38
>> }
>> }
>>
>> after registering the table as info when I execute "SELECT * from info" I
>> get the expected result.
>> However, when I execute "SELECT param1, param2 from info" I get the
>> following exception:
>> Loss was due to
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: No
>> function
>> to evaluate expression. type: UnresolvedAttribute, tree: 'param1
>>
>> I guess I must be missing a method in the implementation. Any pointers
>> appreciated.
>>
>> Yadid
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/SPARKSQL-problem-with-implementing-Scala-s-Product-interface-tp9311.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: jsonFile function in SQLContext does not work

2014-06-25 Thread Zongheng Yang
Hi durin,

I just tried this example (nice data, by the way!), *with each JSON
object on one line*, and it worked fine:

scala> rdd.printSchema()
root
 |-- entities: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
 ||-- friends:
ArrayType[org.apache.spark.sql.catalyst.types.StructType$@13b6cdef]
 |||-- id: IntegerType
 |||-- indices: ArrayType[IntegerType]
 |||-- name: StringType
 ||-- weapons: ArrayType[StringType]
 |-- field1: StringType
 |-- id: IntegerType
 |-- lang: StringType
 |-- place: StringType
 |-- read: BooleanType
 |-- user: org.apache.spark.sql.catalyst.types.StructType$@13b6cdef
 ||-- id: IntegerType
 ||-- name: StringType
 ||-- num_heads: IntegerType

On Wed, Jun 25, 2014 at 10:57 AM, durin  wrote:
> I'm using Spark 1.0.0-SNAPSHOT (downloaded and compiled on 2014/06/23).
> I'm trying to execute the following code:
>
> import org.apache.spark.SparkContext._
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val table =
> sqlContext.jsonFile("hdfs://host:9100/user/myuser/data.json")
> table.printSchema()
>
> data.json looks like this (3 shortened lines shown here):
>
> {"field1":"content","id":12312213,"read":false,"user":{"id":121212,"name":"E.
> Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R.
> Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}
> {"field1":"content","id":56756765,"read":false,"user":{"id":121212,"name":"E.
> Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R.
> Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}
> {"field1":"content","id":56765765,"read":false,"user":{"id":121212,"name":"E.
> Stark","num_heads":0},"place":"Winterfell","entities":{"weapons":[],"friends":[{"name":"R.
> Baratheon","id":23234,"indices":[0,16]}]},"lang":"en"}
>
> The JSON-Object in each line is valid according to the JSON-Validator I use,
> and as jsonFile is defined as
>
> def jsonFile(path: String): SchemaRDD
> Loads a JSON file (one object per line), returning the result as a
> SchemaRDD.
>
> I would assume this should work. However, executing this code return this
> error:
>
> 14/06/25 10:05:09 WARN scheduler.TaskSetManager: Lost TID 11 (task 0.0:11)
> 14/06/25 10:05:09 WARN scheduler.TaskSetManager: Loss was due to
> com.fasterxml.jackson.databind.JsonMappingException
> com.fasterxml.jackson.databind.JsonMappingException: No content to map due
> to end-of-input
>  at [Source: java.io.StringReader@238df2e4; line: 1, column: 1]
> at
> com.fasterxml.jackson.databind.JsonMappingException.from(JsonMappingException.java:164)
> ...
>
>
> Does anyone know where the problem lies?
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/jsonFile-function-in-SQLContext-does-not-work-tp8273.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: SparkR Installation

2014-06-19 Thread Zongheng Yang
Hi Stuti,

Yes, you do need to install R on all nodes. Furthermore the rJava
library is also required, which can be installed simply using
'install.packages("rJava")' in the R shell. Some more installation
instructions after that step can be found in the README here:
https://github.com/amplab-extras/SparkR-pkg

Zongheng

On Tue, Jun 17, 2014 at 10:59 PM, Stuti Awasthi  wrote:
> Hi All,
>
>
>
> I wanted to try SparkR. Do we need preinstalled R on all the nodes of the
> cluster before installing SparkR package ? Please guide me how to proceed
> with this. As of now, I work with R only on single node.
>
> Please suggest
>
>
>
> Thanks
>
> Stuti Awasthi
>
>
>
> ::DISCLAIMER::
> 
>
> The contents of this e-mail and any attachment(s) are confidential and
> intended for the named recipient(s) only.
> E-mail transmission is not guaranteed to be secure or error-free as
> information could be intercepted, corrupted,
> lost, destroyed, arrive late or incomplete, or may contain viruses in
> transmission. The e mail and its contents
> (with or without referred errors) shall therefore not attach any liability
> on the originator or HCL or its affiliates.
> Views or opinions, if any, presented in this email are solely those of the
> author and may not necessarily reflect the
> views or opinions of HCL or its affiliates. Any form of reproduction,
> dissemination, copying, disclosure, modification,
> distribution and / or publication of this message without the prior written
> consent of authorized representative of
> HCL is strictly prohibited. If you have received this email in error please
> delete it and notify the sender immediately.
> Before opening any email and/or attachments, please check them for viruses
> and other defects.
>
> 


Re: Patterns for making multiple aggregations in one pass

2014-06-18 Thread Zongheng Yang
If your input data is JSON, you can also try out the recently merged-in
initial JSON support:
https://github.com/apache/spark/commit/d2f4f30b12f99358953e2781957468e2cfe3c916
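
For the aggregateByKey() idea discussed further down in this thread, a
rough sketch of the one-pass version (requires Spark 1.0+; it reuses the
sample records from the question, and this is just one way to structure
the accumulator):

import org.apache.spark.SparkContext._
case class MyRecord(country: String, name: String, age: Int, hits: Long)
val data = sc.parallelize(Seq(
  MyRecord("USA", "Franklin", 24, 224),
  MyRecord("USA", "Bob", 55, 108),
  MyRecord("France", "Remi", 33, 72)))
// accumulator = (sum of ages, sum of hits, record count)
val perCountry = data.map(r => (r.country, r)).aggregateByKey((0L, 0L, 0L))(
  (acc, r) => (acc._1 + r.age, acc._2 + r.hits, acc._3 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
val result = perCountry.mapValues { case (ageSum, hitSum, n) => (ageSum.toDouble / n, hitSum) }
result.collect()  // e.g. Array((France,(33.0,72)), (USA,(39.5,332)))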

On Wed, Jun 18, 2014 at 5:27 PM, Nicholas Chammas
 wrote:
> That’s pretty neat! So I guess if you start with an RDD of objects, you’d
> first do something like RDD.map(lambda x: Record(x['field_1'], x['field_2'],
> ...)) in order to register it as a table, and from there run your
> aggregates. Very nice.
>
>
>
> On Wed, Jun 18, 2014 at 7:56 PM, Evan R. Sparks 
> wrote:
>>
>> This looks like a job for SparkSQL!
>>
>>
>> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>> import sqlContext._
>> case class MyRecord(country: String, name: String, age: Int, hits: Long)
>> val data = sc.parallelize(Array(MyRecord("USA", "Franklin", 24, 234),
>> MyRecord("USA", "Bob", 55, 108), MyRecord("France", "Remi", 33, 72)))
>> data.registerAsTable("MyRecords")
>> val results = sql("""SELECT t.country, AVG(t.age), SUM(t.hits) FROM
>> MyRecords t GROUP BY t.country""").collect
>>
>> Now "results" contains:
>>
>> Array[org.apache.spark.sql.Row] = Array([France,33.0,72], [USA,39.5,342])
>>
>>
>>
>> On Wed, Jun 18, 2014 at 4:42 PM, Doris Xin  wrote:
>>>
>>> Hi Nick,
>>>
>>> Instead of using reduceByKey(), you might want to look into using
>>> aggregateByKey(), which allows you to return a different value type U
>>> instead of the input value type V for each input tuple (K, V). You can
>>> define U to be a datatype that holds both the average and total and have
>>> seqOp update both fields of U in a single pass.
>>>
>>> Hope this makes sense,
>>> Doris
>>>
>>>
>>> On Wed, Jun 18, 2014 at 4:28 PM, Nick Chammas
>>>  wrote:

 The following is a simplified example of what I am trying to accomplish.

 Say I have an RDD of objects like this:

 {
 "country": "USA",
 "name": "Franklin",
 "age": 24,
 "hits": 224
 }
 {

 "country": "USA",
 "name": "Bob",
 "age": 55,
 "hits": 108
 }
 {

 "country": "France",
 "name": "Remi",
 "age": 33,
 "hits": 72
 }

 I want to find the average age and total number of hits per country.
 Ideally, I would like to scan the data once and perform both aggregations
 simultaneously.

 What is a good approach to doing this?

 I’m thinking that we’d want to keyBy(country), and then somehow
 reduceByKey(). The problem is, I don’t know how to approach writing a
 function that can be passed to reduceByKey() and that will track a running
 average and total simultaneously.

 Nick


 
 View this message in context: Patterns for making multiple aggregations
 in one pass
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>


Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
Sorry I wasn't being clear. The idea off the top of my head was that
you could append an original position index to each element (using the
line above), and modify whatever processing functions you have in
mind to make them aware of these indices. And I think you are right
that RDD collections are unordered by default.
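
To spell that out a little (a small sketch; the list contents here are
made up, but `xs` stands for the List[(String, Int, Int)] from the
question):

import org.apache.spark.SparkContext._
val xs = List(("a", 1, 1), ("b", 2, 2), ("c", 3, 3))
val indexed = sc.parallelize((1 to xs.size).zip(xs))  // (original position, element)
// carry the index through whatever processing you do, then recover the
// original order at the end:
val inOrder = indexed.sortByKey().values.collect()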

On Fri, Jun 13, 2014 at 6:33 PM, SK  wrote:
> Thanks. But that did not work.
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/convert-List-to-RDD-tp7606p7609.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: convert List to RDD

2014-06-13 Thread Zongheng Yang
I may be wrong, but I think RDDs must be created via a
SparkContext. To somehow preserve the order of the list, perhaps you
could try something like:

sc.parallelize((1 to xs.size).zip(xs))

On Fri, Jun 13, 2014 at 6:08 PM, SK  wrote:
> Hi,
>
> I have a List[ (String, Int, Int) ] that I would liek to convert to an RDD.
> I tried to use sc.parallelize and sc.makeRDD, but in each case the original
> order of items in the List gets modified. Is there a simple way to convert a
> List to RDD without using SparkContext?
>
> thanks
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/convert-List-to-RDD-tp7606.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: spark EC2 bring-up problems

2014-06-12 Thread Zongheng Yang
Hi Toby,

It is usually the case that even if the EC2 console says the nodes are
up, they are not really fully initialized. For 16 nodes I have found
`--wait 800` to be the norm that makes things work.

In my previous experience I have found this to be the culprit, so if
you immediately do 'launch --resume' when you see the first SSH error
it's still very likely to fail. But if you wait a little bit longer
and do 'launch --resume', it could work.

Zongheng

On Thu, Jun 12, 2014 at 1:03 PM, Toby Douglass  wrote:
> On Thu, Jun 12, 2014 at 8:50 PM, Nicholas Chammas
>  wrote:
>>
>> Yes, you need Python 2.7 to run spark-ec2 and most AMIs come with 2.6
>
> Ah, yes - I mean to say, Amazon Linux.
>>
>> .Have you tried either:
>>
>> Retrying launch with the --resume option?
>> Increasing the value of the --wait option?
>
> No.  I will try the first, now.  I think the latter is not the issue - the
> instances are up; something else is amiss.
>
> Thankyou.
>


Re: SQLContext and HiveContext Query Performance

2014-06-04 Thread Zongheng Yang
Hi,

Just wondering if you can try this:

val obj = sql("select manufacturer, count(*) as examcount from pft
group by manufacturer order by examcount desc")
obj.collect()
obj.queryExecution.executedPlan.executeCollect()

and time the third line alone. It could be that Spark SQL is taking some
time to run the optimizer & generate physical plans, which slows down
the query.

Thanks,
Zongheng

On Wed, Jun 4, 2014 at 2:16 PM, ssb61  wrote:
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-and-HiveContext-Query-Performance-tp6948p6976.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.