Spark response times for queries seem slow

2015-01-05 Thread Sam Flint
I am running a pyspark job over 4GB of data that is split into 17 parquet
files on an HDFS cluster.  This is all managed through Cloudera Manager.

Here is the query the job is running:

parquetFile.registerTempTable("parquetFileone")

results = sqlContext.sql("SELECT sum(total_impressions), sum(total_clicks) FROM parquetFileone GROUP BY hour")


I also ran it this way:

mapped = parquetFile.map(lambda row: (str(row.hour), (row.total_impressions, row.total_clicks)))
counts = mapped.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))


My run times were anywhere from 8-10 minutes.

I am wondering if there is a configuration that needs to be tweaked or if
this is the expected response time.

The machines have 30GB of RAM and 4 cores each. It seems the CPUs are just
getting pegged, and that is what is taking so long.
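
(For reference, a minimal sketch of the settings that are usually checked first in this situation, assuming Spark 1.x on YARN; the app name and the numbers below are placeholder assumptions, not tested recommendations for this cluster:)

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Placeholder sizing: roughly one executor per node, using most of its cores and RAM.
conf = (SparkConf()
        .setAppName("hourly-aggregation")
        .set("spark.executor.instances", "4")
        .set("spark.executor.cores", "4")
        .set("spark.executor.memory", "8g"))
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Spark SQL defaults to 200 shuffle partitions, which is often far more than
# needed when aggregating 17 input files; lowering it is a common first tweak.
sqlContext.sql("SET spark.sql.shuffle.partitions=34")

Checking the stage timings in the Spark UI is the more reliable way to see where the 8-10 minutes actually go.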

 Any help on this would be amazing.

Thanks,


-- 

MAGNE+IC

Sam Flint | Lead Developer, Data Analytics


org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: pyspark on yarn

2015-01-05 Thread Sam Flint
.parq,/user/hive/warehouse/impala_new_4/key=20141001/f1448ca083a5e224-159572f61b50d7a3_854675293_data.2.parq,/user/hive/warehouse/impala_new_4/key=20141001/f1448ca083a5e224-159572f61b50d7a3_854675293_data.3.parq,/user/hive/warehouse/impala_new_4/key=20141001/f1448ca083a5e224-159572f61b50d7a3_854675293_data.4.parq,
Some(Configuration: core-default.xml, core-site.xml, yarn-default.xml,
yarn-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml,
hdfs-site.xml), org.apache.spark.sql.SQLContext@2c76fd82, []

at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:72)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$apply$1.applyOrElse(Analyzer.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:70)
at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:68)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:397)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:397)
at org.apache.spark.sql.SchemaRDD.javaToPython(SchemaRDD.scala:412)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
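
(The body of this message is truncated above, so the failing query itself is not shown. As a general note, an "Unresolved attributes" TreeNodeException from the Catalyst analyzer usually means the SQL referenced a column or table name that does not exist in the SchemaRDD's schema. A hypothetical PySpark illustration, not the original job:)

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the existing SparkContext

# Hypothetical path and names, for illustration only.
parquetFile = sqlContext.parquetFile("/user/hive/warehouse/impala_new_4/key=20141001")
parquetFile.printSchema()                    # confirm the exact column names
parquetFile.registerTempTable("impressions")

# Raises the "Unresolved attributes" error if no column named total_clicks exists:
sqlContext.sql("SELECT sum(total_clicks) FROM impressions").collect()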

-- 

MAGNE+IC

Sam Flint | Lead Developer, Data Analytics


Re: NEW to spark and sparksql

2014-11-20 Thread Sam Flint
So you are saying that to query an entire day of data, I would need to
create one RDD for every hour and then union them into one RDD.  After I
have the one RDD, I would be able to query for a=2 across the entire day.
Please correct me if I am wrong.
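
(A rough PySpark sketch of that approach; the paths and the parse function are placeholders:)

# One RDD per hour, then a single union covering the whole day.
hourly = [sc.textFile("/data/2014/10/01/%02d/filename" % h) for h in range(24)]
day = sc.union(hourly)

# Placeholder: replace with whatever parses field a out of a record line.
def parse_a(line):
    return int(line.split(",")[0])  # assumes a is the first comma-separated field

count = day.filter(lambda line: parse_a(line) == 2).count()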

Thanks

On Wed, Nov 19, 2014 at 5:53 PM, Michael Armbrust mich...@databricks.com
wrote:

 I would use just textFile unless you actually need a guarantee that you
 will be seeing a whole file at a time (textFile splits on new lines).

 RDDs are immutable, so you cannot add data to them.  You can, however,
 union two RDDs, returning a new RDD that contains all the data.
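
 (In PySpark, with placeholder paths, that looks like:)

 hour_00 = sc.textFile("/data/2014/10/01/00/filename")
 hour_01 = sc.textFile("/data/2014/10/01/01/filename")
 both = hour_00.union(hour_01)  # a new RDD; hour_00 and hour_01 are unchanged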

 On Wed, Nov 19, 2014 at 2:46 PM, Sam Flint sam.fl...@magnetic.com wrote:

 Michael,
 Thanks for your help.  I found wholeTextFiles(), which I can use to import
 all the files in a directory.  I believe that would work if all the files
 existed in the same directory.  Currently the files come in by the hour,
 in a layout somewhat like ../2014/10/01/00/filename, with one file per
 hour.

 Do I create an RDD and add to it? Is that possible?  My example query
 would be select count(*) from (entire day RDD) where a=2.  The field a
 would exist in all the files multiple times with multiple values.

 I don't see anywhere in the documentation how to import a file, create an
 RDD, and then import another file into that RDD, kind of like in MySQL
 where you create a table, import data, then import more data.  This may
 be my ignorance because I am not that familiar with Spark, but I would
 expect to import the data into a single RDD before performing analytics
 on it.

 Thank you for your time and help on this.


 P.S. I am using python if that makes a difference.

 On Wed, Nov 19, 2014 at 4:45 PM, Michael Armbrust mich...@databricks.com
  wrote:

 In general you should be able to read full directories of files as a
 single RDD/SchemaRDD.  For documentation I'd suggest the programming
 guides:

 http://spark.apache.org/docs/latest/quick-start.html
 http://spark.apache.org/docs/latest/sql-programming-guide.html
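
 (A minimal PySpark sketch of reading a whole directory, or a whole day of hourly subdirectories, as a single RDD; the paths below are placeholders:)

 # A directory path reads every file in it as one RDD; a glob pattern
 # covers the hourly layout described above.
 one_hour = sc.textFile("/data/2014/10/01/00/")
 whole_day = sc.textFile("/data/2014/10/01/*/")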

 For Avro in particular, I have been working on a library for Spark SQL.
 It's very early code, but you can find it here:
 https://github.com/databricks/spark-avro

 Bug reports welcome!

 Michael

 On Wed, Nov 19, 2014 at 1:02 PM, Sam Flint sam.fl...@magnetic.com
 wrote:

 Hi,

 I am new to Spark.  I have begun reading up on Spark's RDDs as well as
 Spark SQL.  My question is more about how to build out the RDDs and best
 practices.  I have data that is broken down by hour into files on HDFS in
 Avro format.  Do I need to create a separate RDD for each file, or, using
 Spark SQL, a separate SchemaRDD?

 I want to be able to pull, let's say, an entire day of data into Spark and
 run some analytics on it.  Then possibly a week, a month, etc.


 If there is documentation on this procedure or best practices for
 building RDDs, please point me to it.

 Thanks for your time,
Sam







 --

 MAGNE+IC

 Sam Flint | Lead Developer, Data Analytics






-- 

MAGNE+IC

Sam Flint | Lead Developer, Data Analytics