Re: Unable to stop Worker in standalone mode by sbin/stop-all.sh

2015-03-12 Thread Ted Yu
Does the machine have a cron job that periodically cleans up the /tmp dir?

Cheers

On Thu, Mar 12, 2015 at 6:18 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:

 Checking the script, it seems spark-daemon.sh is unable to stop the worker:
 $ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
 no org.apache.spark.deploy.worker.Worker to stop
 $ ps -elf | grep spark
 0 S taoewang 24922 1  0  80   0 - 733878 futex_ Mar12 ?   00:08:54
 java -cp
 /data/sequoiadb-driver-1.10.jar,/data/spark-sequoiadb-0.0.1-SNAPSHOT.jar::/data/spark/conf:/data/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar
 -XX:MaxPermSize=128m -Dspark.deploy.recoveryMode=ZOOKEEPER
 -Dspark.deploy.zookeeper.url=centos-151:2181,centos-152:2181,centos-153:2181
 -Dspark.deploy.zookeeper.dir=/data/zookeeper
 -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m
 org.apache.spark.deploy.worker.Worker
 spark://centos-151:7077,centos-152:7077,centos-153:7077

 In the spark-daemon.sh script it tries to find $pid in /tmp/:
 pid="$SPARK_PID_DIR/spark-$SPARK_IDENT_STRING-$command-$instance.pid"

 In my case the pid file is supposed to be:
 /tmp/spark-taoewang-org.apache.spark.deploy.worker.Worker-1.pid

 However, when I go through the files in the /tmp directory, no such file
 exists.
 /tmp has 777 permissions, and I also tried to touch a file with my current
 account and succeeded, so it shouldn’t be a permission issue.
 $ ls -la / | grep tmp
 drwxrwxrwx.   6 root root  4096 Mar 13 08:19 tmp

 Does anyone have any idea why the pid file didn’t show up?

 Thanks
 TW

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: OutOfMemoryError when using DataFrame created by Spark SQL

2015-03-25 Thread Ted Yu
Can you try giving the Spark driver more heap?

Cheers
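
A minimal sketch of the kind of change being suggested (not from the original thread): the driver heap is usually raised on the command line, e.g. spark-submit --driver-memory 2g, or via spark.driver.memory in spark-defaults.conf, and keeping the result set small before anything reaches the driver also helps. The table name and filter below come from the quoted question; everything else is an assumption.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object LimitBeforeCollect {
  def main(args: Array[String]): Unit = {
    // Submit with e.g. --driver-memory 2g so the driver JVM gets more than 512m.
    val sc = new SparkContext(new SparkConf().setAppName("limit-before-collect"))
    val sqlContext = new HiveContext(sc)

    // The LIMIT keeps the query result small; show() only pulls the first 20 rows
    // to the driver instead of collecting everything.
    val df = sqlContext.sql(
      "SELECT * FROM some_table WHERE logdate = '2015-03-24' LIMIT 1000")
    df.show()

    sc.stop()
  }
}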



 On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote:
 
 Hi,
 
 I am using Spark SQL to query my Hive cluster, following the Spark SQL and
 DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails
 and a java.lang.OutOfMemoryError is raised. The expected result of the query
 is small (I added a limit 1000 clause). My code is shown below:
 
 scala> import sqlContext.implicits._

 scala> val df = sqlContext.sql("select * from some_table where
 logdate=2015-03-24 limit 1000")
 and the error msg:
 
 [ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] 
 [ActorSystem(sparkDriver)] Uncaught fatal error from thread 
 [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 The master heap memory is set by -Xms512m -Xmx512m, while the workers are set
 by -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query.
 
 Additionally, after restarting the spark-shell and re-running the limit 5
 query, the df object is returned and can be printed by df.show(), but other
 APIs fail with OutOfMemoryError, namely df.count(), df.select("some_field").show(),
 and so forth.
 
 I understand that the RDD can be collected to the master so further
 transformations can be applied. As DataFrame has “richer optimizations under
 the hood” and is a familiar convention for an R/Julia user, I really hope this
 error can be tackled, and that DataFrame is robust enough to depend on.
 
 Thanks in advance!
 
 REGARDS,
 Todd
 
 
 View this message in context: OutOfMemoryError when using DataFrame created 
 by Spark SQL
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: Untangling dependency issues in spark streaming

2015-03-29 Thread Ted Yu
For Gradle, there are:
https://github.com/musketyr/gradle-fatjar-plugin
https://github.com/johnrengelman/shadow

FYI
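
For sbt, newer versions of the sbt-assembly plugin also support class relocation through shade rules. A minimal build.sbt sketch follows; the package names mirror the Maven example quoted below, and the availability of ShadeRule depends on the sbt-assembly version, so check its documentation before relying on this.

// build.sbt -- relocate httpclient classes inside the assembled uber jar
// (requires the sbt-assembly plugin declared in project/plugins.sbt).
assemblyShadeRules in assembly := Seq(
  // Rewrite org.apache.http.* to org.shaded.apache.http.* in our code and all dependencies.
  ShadeRule.rename("org.apache.http.**" -> "org.shaded.apache.http.@1").inAll
)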

On Sun, Mar 29, 2015 at 4:29 PM, jay vyas jayunit100.apa...@gmail.com
wrote:

 Thanks for posting this! I've run into similar issues before, and generally
 it's a bad idea to swap the libraries out and pray for the best, so the
 shade functionality is probably the best feature.

 Unfortunately, I'm not sure how well SBT and Gradle support shading... how
 do folks using next-gen build tools solve this problem?



 On Sun, Mar 29, 2015 at 3:10 AM, Neelesh neele...@gmail.com wrote:

 Hi,
   My streaming app uses org.apache.httpcomponent:httpclient:4.3.6, but
 Spark uses 4.2.6, and I believe that's what's causing the following error.
 I've tried setting
 spark.executor.userClassPathFirst and spark.driver.userClassPathFirst to
 true in the config, but that does not solve it either. Finally I had to
 resort to relocating classes using the Maven shade plugin while building my
 app's uber jar, using

 <relocations>
   <relocation>
     <pattern>org.apache.http</pattern>
     <shadedPattern>org.shaded.apache.http</shadedPattern>
   </relocation>
 </relocations>


 Hope this is useful to others in the same situation. It would be really 
 great to deal with this the right way (like tomcat or any other servlet 
 container - classloader hierarchy etc).


 Caused by: java.lang.NoSuchFieldError: INSTANCE
 at
 org.apache.http.impl.io.DefaultHttpRequestWriterFactory.init(DefaultHttpRequestWriterFactory.java:52)
 at
 org.apache.http.impl.io.DefaultHttpRequestWriterFactory.init(DefaultHttpRequestWriterFactory.java:56)
 at
 org.apache.http.impl.io.DefaultHttpRequestWriterFactory.clinit(DefaultHttpRequestWriterFactory.java:46)
 at
 org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.init(ManagedHttpClientConnectionFactory.java:72)
 at
 org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.init(ManagedHttpClientConnectionFactory.java:84)
 at
 org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.clinit(ManagedHttpClientConnectionFactory.java:59)
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.init(PoolingHttpClientConnectionManager.java:494)
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager.init(PoolingHttpClientConnectionManager.java:149)
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager.init(PoolingHttpClientConnectionManager.java:138)
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager.init(PoolingHttpClientConnectionManager.java:114)

 and ...
 Caused by: java.lang.NoClassDefFoundError: Could not initialize class
 org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.init(PoolingHttpClientConnectionManager.java:494)
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager.init(PoolingHttpClientConnectionManager.java:149)
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager.init(PoolingHttpClientConnectionManager.java:138)
 at
 org.apache.http.impl.conn.PoolingHttpClientConnectionManager.init(PoolingHttpClientConnectionManager.java:114)




 --
 jay vyas



Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed

2015-03-29 Thread Ted Yu
Nathan:
Please look in log files for any of the following:
doCleanupRDD():
  case e: Exception => logError("Error cleaning RDD " + rddId, e)
doCleanupShuffle():
  case e: Exception => logError("Error cleaning shuffle " + shuffleId, e)
doCleanupBroadcast():
  case e: Exception => logError("Error cleaning broadcast " + broadcastId, e)

Cheers

On Sun, Mar 29, 2015 at 7:55 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Try these:

 - Disable shuffle : spark.shuffle.spill=false (It might end up in OOM)
 - Enable log rotation:

 sparkConf.set("spark.executor.logs.rolling.strategy", "size")
   .set("spark.executor.logs.rolling.size.maxBytes", "1024")
   .set("spark.executor.logs.rolling.maxRetainedFiles", "3")


 Also, see what's really getting filled up on disk.
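
For reference, a self-contained sketch combining those rolling-log settings with an explicit per-batch unpersist (the socket source, host/port, and three-minute batch interval are placeholders, not taken from the original application):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnpersistEachBatch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("unpersist-each-batch")
      .set("spark.executor.logs.rolling.strategy", "size")
      .set("spark.executor.logs.rolling.size.maxBytes", "1048576")
      .set("spark.executor.logs.rolling.maxRetainedFiles", "3")
    val ssc = new StreamingContext(conf, Seconds(180))

    // Placeholder source; the original application reads from Kafka instead.
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER_2)

    lines.foreachRDD { rdd =>
      rdd.cache()
      println(s"batch size: ${rdd.count()}")
      rdd.unpersist(blocking = true) // release the cached blocks once the batch is done
    }

    ssc.start()
    ssc.awaitTermination()
  }
}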

 Thanks
 Best Regards

 On Sat, Mar 28, 2015 at 8:18 PM, Nathan Marin nathan.ma...@teads.tv
 wrote:

 Hi,

 I’ve been trying to use Spark Streaming for my real-time analysis
 application using the Kafka Stream API on a cluster (using the yarn
 version) of 6 executors with 4 dedicated cores and 8192mb of dedicated
 RAM.

 The thing is, my application should run 24/7 but the disk usage is
 leaking. This leads to some exceptions occurring when Spark tries to
 write on a file system where no space is left.

 Here are some graphs showing the disk space remaining on a node where
 my application is deployed:
 http://i.imgur.com/vdPXCP0.png
 The drops occurred on a 3 minute interval.

 The Disk Usage goes back to normal once I kill my application:
 http://i.imgur.com/ERZs2Cj.png

  The persistence level of my RDD is MEMORY_AND_DISK_SER_2, but even
  when I tried MEMORY_ONLY_SER_2 the same thing happened (this mode
  shouldn't even allow Spark to write to disk, right?).

  My question is: How can I force Spark (Streaming?) to remove whatever
  it stores immediately after it has been processed? Obviously it doesn’t
  look like the disk is being cleaned up (even though the memory is), even
  though I call the rdd.unpersist() method for each RDD processed.

 Here’s a sample of my application code:
 http://pastebin.com/K86LE1J6

 Maybe something is wrong in my app too?

 Thanks for your help,
 NM

 --
 View this message in context: [Spark Streaming] Disk not being cleaned
 up during runtime after RDD being processed
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Disk-not-being-cleaned-up-during-runtime-after-RDD-being-processed-tp22271.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.





Re: Build fails on 1.3 Branch

2015-03-29 Thread Ted Yu
Jenkins build failed too:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull

For the moment, you can apply the following change:

diff --git
a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpression
index a53ae97..7ae4b38 100644
---
a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
+++
b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@@ -18,7 +18,7 @@
 package org.apache.spark.sql

 import org.apache.spark.sql.catalyst.expressions.NamedExpression
-import org.apache.spark.sql.catalyst.plans.logical.{Project, NoRelation}
+import org.apache.spark.sql.catalyst.plans.logical.{Project}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.test.TestSQLContext
 import org.apache.spark.sql.test.TestSQLContext.implicits._

Cheers

On Sun, Mar 29, 2015 at 8:48 AM, mjhb sp...@mjhb.com wrote:

 I tried pulling the source and building for the first time, but cannot get
 past the "object NoRelation is not a member of package
 org.apache.spark.sql.catalyst.plans.logical" error below on the 1.3 branch.
 I can build the 1.2 branch.

 I have tried with both -Dscala-2.11 and 2.10 (after running the appropriate
 change-version-to-2.1#.sh), and with different combinations of hadoop
 flags.

 Relevant excerpts below - full build output available at
 http://mjhb.com/tmp/build-spark-1.3.out or
 http://mjhb.com/tmp/build-spark-1.3.out.gz

 $ git branch
 * branch-1.3
 $ build/mvn -e -X -DskipTests clean package
 Apache Maven 3.0.5
 Maven home: /usr/share/maven
 Java version: 1.7.0_76, vendor: Oracle Corporation
 Java home: /usr/lib/jvm/java-7-oracle/jre
 Default locale: en_US, platform encoding: UTF-8
 OS name: linux, version: 3.16.0-33-generic, arch: amd64, family:
 unix

 snipped...

 [error]

 /home/marty/work/spark-1.3-maint/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:21:
 object NoRelation is not a member of package
 org.apache.spark.sql.catalyst.plans.logical
 [error] import org.apache.spark.sql.catalyst.plans.logical.{Project,
 NoRelation}
 [error]^
 [error] one error found
 [debug] Compilation failed (CompilerInterface)
 [error] Compile failed at Mar 29, 2015 7:52:54 AM [5.793s]




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Build-fails-on-1-3-Branch-tp22275.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Too many open files

2015-03-30 Thread Ted Yu
bq. In /etc/security/limits.conf set the following values:

Have you done the above modification on all the machines in your Spark
cluster ?

If you use Ubuntu, be sure that the /etc/pam.d/common-session file contains
the following line:

session required  pam_limits.so
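
A small diagnostic sketch (not from this thread) that can confirm the raised limit is actually visible to the executor JVMs on every node: the limit has to be applied cluster-wide, and a stale session on any one machine will keep the old value.

import org.apache.spark.{SparkConf, SparkContext}
import scala.sys.process._

object OpenFileLimitCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("open-file-limit-check"))
    // Run `ulimit -n` in a shell on many partitions and collect the distinct answers.
    val limits = sc.parallelize(1 to 100, 100)
      .map(_ => Seq("sh", "-c", "ulimit -n").!!.trim)
      .collect()
      .distinct
    println(s"open-file limits reported by executors: ${limits.mkString(", ")}")
    sc.stop()
  }
}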


On Mon, Mar 30, 2015 at 5:08 AM, Masf masfwo...@gmail.com wrote:

 Hi.

  I've re-logged in; in fact, running 'ulimit -n' returns 100, but it still
  crashes.
 I'm doing reduceByKey and SparkSQL mixed over 17 files (250MB-500MB/file)


 Regards.
 Miguel Angel.

 On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

  Mostly, you will have to restart the machines for the ulimit change to take
  effect (or re-login). What operation are you doing? Are you doing too many
  repartitions?

 Thanks
 Best Regards

 On Mon, Mar 30, 2015 at 4:52 PM, Masf masfwo...@gmail.com wrote:

 Hi

  I have a problem with temp data in Spark. I have set
  spark.shuffle.manager to SORT. In /etc/security/limits.conf I set the
  following values:
  *   soft    nofile  100
  *   hard    nofile  100
  In spark-env.sh I set ulimit -n 100
 I've restarted the spark service and it continues crashing (Too many
 open files)

  How can I resolve this? I'm running Spark 1.2.0 on Cloudera 5.3.2.

 java.io.FileNotFoundException:
 /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
 (Too many open files)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
 at scala.Array$.fill(Array.scala:267)
 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/03/30 11:54:18 WARN TaskSetManager: Lost task 22.0 in stage 3.0 (TID
 27, localhost): java.io.FileNotFoundException:
 /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
 (Too many open files)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
 at scala.Array$.fill(Array.scala:267)
 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)



 Thanks!
 --


 Regards.
 Miguel Ángel





 --


 Saludos.
 Miguel Ángel



Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-30 Thread Ted Yu
Nicolas:
See if there was an occurrence of the following exception in the log:
  errs => throw new SparkException(
    s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " +
      errs.mkString("\n")),

Cheers

On Mon, Mar 30, 2015 at 9:40 AM, Cody Koeninger c...@koeninger.org wrote:

 This line

 at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(
 KafkaRDD.scala:158)

 is the attempt to close the underlying kafka simple consumer.

 We can add a null pointer check, but the underlying issue of the consumer
 being null probably indicates a problem earlier.  Do you see any previous
 error messages?

 Also, can you clarify for the successful and failed cases which topics you
 are attempting this on, how many partitions there are, and whether there
 are messages in the partitions?  There's an existing jira regarding empty
 partitions.




 On Mon, Mar 30, 2015 at 11:05 AM, Nicolas Phung nicolas.ph...@gmail.com
 wrote:

 Hello,

 I'm using spark-streaming-kafka 1.3.0 with the new consumer Approach 2:
 Direct Approach (No Receivers) (
 http://spark.apache.org/docs/latest/streaming-kafka-integration.html).
 I'm using the following code snippets :

 // Create direct kafka stream with brokers and topics
 val messages = KafkaUtils.createDirectStream[String, Array[Byte],
 StringDecoder, DefaultDecoder](
 ssc, kafkaParams, topicsSet)

 // Get the stuff from Kafka and print them
 val raw = messages.map(_._2)
 val dStream: DStream[RawScala] = raw.map(
  byte => {
    // Avro Decoder
    println("Byte length: " + byte.length)
    val rawDecoder = new AvroDecoder[Raw](schema = Raw.getClassSchema)
    RawScala.toScala(rawDecoder.fromBytes(byte))
  }
 )
 // Useful for debug
 dStream.print()

  I launch my Spark Streaming job and everything is fine if there are no
  incoming logs from Kafka. When I send a log, I get the following error:

 15/03/30 17:34:40 ERROR TaskContextImpl: Error in TaskCompletionListener
 java.lang.NullPointerException
 at
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(KafkaRDD.scala:158)
 at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
 at
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:101)
 at
 org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:101)
 at
 org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
 at
 org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
 at
 org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
 at org.apache.spark.scheduler.Task.run(Task.scala:58)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 15/03/30 17:34:40 INFO TaskSetManager: Finished task 3.0 in stage 28.0
 (TID 94) in 12 ms on localhost (2/4)
 15/03/30 17:34:40 INFO TaskSetManager: Finished task 2.0 in stage 28.0
 (TID 93) in 13 ms on localhost (3/4)
 15/03/30 17:34:40 ERROR Executor: Exception in task 0.0 in stage 28.0
 (TID 91)
 org.apache.spark.util.TaskCompletionListenerException
 at
 org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:76)
 at org.apache.spark.scheduler.Task.run(Task.scala:58)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 15/03/30 17:34:40 WARN TaskSetManager: Lost task 0.0 in stage 28.0 (TID
 91, localhost): org.apache.spark.util.TaskCompletionListenerException
 at
 org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:76)
 at org.apache.spark.scheduler.Task.run(Task.scala:58)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

 15/03/30 17:34:40 ERROR TaskSetManager: Task 0 in stage 28.0 failed 1
 times; aborting job
 15/03/30 17:34:40 INFO TaskSchedulerImpl: Removed TaskSet 28.0, whose
 tasks have all completed, from pool
 15/03/30 17:34:40 INFO TaskSchedulerImpl: Cancelling stage 28
 15/03/30 17:34:40 INFO DAGScheduler: Job 28 failed: print at
 HotFCANextGen.scala:63, took 0,041068 s
 15/03/30 17:34:40 INFO JobScheduler: Starting job streaming 

Re: Error in Delete Table

2015-03-31 Thread Ted Yu
Which Spark and Hive releases are you using?

Thanks
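
As a side note, one workaround sketch (assuming a Spark 1.3-style HiveContext; the table name comes from the quoted message) is to check the catalog first, so the metastore never has to look up a table that is not there:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DropTableIfPresent {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("drop-table-if-present"))
    val hive = new HiveContext(sc)

    // Hive stores table names lower-cased, so compare case-insensitively.
    if (hive.tableNames().exists(_.equalsIgnoreCase("TestTable"))) {
      hive.sql("DROP TABLE TestTable")
    }
    sc.stop()
  }
}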



 On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote:
 
 Hi.
 
  In HiveContext, when I execute the statement DROP TABLE IF EXISTS TestTable
  and TestTable doesn't exist, Spark returns an error:
 
 
 
 ERROR Hive: NoSuchObjectException(message:default.TestTable table not found)
   at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338)
   at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306)
   at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237)
   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
   at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036)
   at 
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022)
   at 
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1008)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90)
   at com.sun.proxy.$Proxy22.getTable(Unknown Source)
   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1000)
   at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:942)
   at 
 org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3887)
   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:310)
   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
   at 
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:952)
   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
   at 
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
   at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
   at 
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
   at 
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
   at 
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
   at 
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
   at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
   at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
   at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at GeoMain$.HiveExecution(GeoMain.scala:96)
   at GeoMain$.main(GeoMain.scala:17)
   at GeoMain.main(GeoMain.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 
 
 Thanks!!
 -- 
 
 
 Regards.
 Miguel Ángel

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: refer to dictionary

2015-03-31 Thread Ted Yu
You can use a broadcast variable.

See also this thread:
http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable+scale+
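
A minimal sketch of the idea (shown in Scala; in the quoted PySpark code the equivalent is bc = sc.broadcast(dict1) and bc.value[item] inside the lambda). The sample data below is made up:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDictLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-dict-lookup"))

    // The dictionary lives on the driver; broadcasting ships it once per executor.
    val dict1 = Map("a" -> 1, "b" -> 2, "c" -> 3)
    val bc = sc.broadcast(dict1)

    val rdd1 = sc.parallelize(Seq(Array("a", "b", "c"), Array("c", "a")))
    // Look each key up in the broadcast copy inside the distributed map.
    val replaced = rdd1.map(line => line.map(item => bc.value(item)))

    replaced.collect().foreach(row => println(row.mkString(",")))
    sc.stop()
  }
}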



 On Mar 31, 2015, at 4:43 AM, Peng Xia sparkpeng...@gmail.com wrote:
 
 Hi,
 
  I have an RDD (rdd1) where each line is split into an array [a, b, c], etc.
  I also have a local dictionary (dict1) that stores key-value pairs {a: 1,
  b: 2, c: 3}.
  I want to replace the keys in the RDD with their corresponding values from
  the dict:
  rdd1.map(lambda line: [dict1[item] for item in line])

  But this task is not distributed; I believe the reason is that dict1 is a
  local instance.
  Can anyone provide suggestions on how to parallelize this?
 
 
 Thanks,
 Best,
 Peng
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark streaming with Kafka, multiple partitions fail, single partition ok

2015-03-31 Thread Ted Yu
Can you show us the output of DStream#print() if you have it ?

Thanks

On Tue, Mar 31, 2015 at 2:55 AM, Nicolas Phung nicolas.ph...@gmail.com
wrote:

 Hello,

 @Akhil Das I'm trying to use the experimental API
 https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
 I'm reusing the same code snippet to initialize my topicSet.

 @Cody Koeninger I don't see any previous error messages (see the full log
 at the end). To create the topic, I'm doing the following :

 kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 
 --partitions 10 --topic toto

 kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 
 --partitions 1 --topic toto-single

 I'm launching my Spark Streaming in local mode.

  @Ted Yu There's no "Couldn't connect to leader for topic" log; here's the
  full version:

 spark-submit --conf config.resource=application-integration.conf --class
 nextgen.Main assembly-0.1-SNAPSHOT.jar

 15/03/31 10:47:12 INFO SecurityManager: Changing view acls to: nphung
 15/03/31 10:47:12 INFO SecurityManager: Changing modify acls to: nphung
 15/03/31 10:47:12 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(nphung); users 
 with modify permissions: Set(nphung)
 15/03/31 10:47:13 INFO Slf4jLogger: Slf4jLogger started
 15/03/31 10:47:13 INFO Remoting: Starting remoting
 15/03/31 10:47:13 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://sparkDriver@int.local:44180]
 15/03/31 10:47:13 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://sparkDriver@int.local:44180]
 15/03/31 10:47:13 INFO Utils: Successfully started service 'sparkDriver' on 
 port 44180.
 15/03/31 10:47:13 INFO SparkEnv: Registering MapOutputTracker
 15/03/31 10:47:13 INFO SparkEnv: Registering BlockManagerMaster
 15/03/31 10:47:13 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20150331104713-2238
 15/03/31 10:47:13 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
 15/03/31 10:47:15 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-2c8e34a0-bec3-4f1e-9fe7-83e08efc4f53
 15/03/31 10:47:15 INFO HttpServer: Starting HTTP Server
 15/03/31 10:47:15 INFO Utils: Successfully started service 'HTTP file server' 
 on port 50204.
 15/03/31 10:47:15 INFO Utils: Successfully started service 'SparkUI' on port 
 4040.
 15/03/31 10:47:15 INFO SparkUI: Started SparkUI at http://int.local:4040
 15/03/31 10:47:16 INFO SparkContext: Added JAR 
 file:/home/nphung/assembly-0.1-SNAPSHOT.jar at 
 http://10.153.165.98:50204/jars/assembly-0.1-SNAPSHOT.jar with timestamp 
 1427791636151
 15/03/31 10:47:16 INFO AkkaUtils: Connecting to HeartbeatReceiver: 
 akka.tcp://sparkDriver@int.local:44180/user/HeartbeatReceiver
 15/03/31 10:47:16 INFO NettyBlockTransferService: Server created on 40630
 15/03/31 10:47:16 INFO BlockManagerMaster: Trying to register BlockManager
 15/03/31 10:47:16 INFO BlockManagerMasterActor: Registering block manager 
 localhost:40630 with 265.1 MB RAM, BlockManagerId(driver, localhost, 40630)
 15/03/31 10:47:16 INFO BlockManagerMaster: Registered BlockManager
 15/03/31 10:47:17 INFO EventLoggingListener: Logging events to 
 hdfs://int.local:8020/user/spark/applicationHistory/local-1427791636195
 15/03/31 10:47:17 INFO VerifiableProperties: Verifying properties
 15/03/31 10:47:17 INFO VerifiableProperties: Property group.id is overridden 
 to
 15/03/31 10:47:17 INFO VerifiableProperties: Property zookeeper.connect is 
 overridden to
 15/03/31 10:47:17 INFO ForEachDStream: metadataCleanupDelay = -1
 15/03/31 10:47:17 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/31 10:47:17 INFO MappedDStream: metadataCleanupDelay = -1
 15/03/31 10:47:17 INFO DirectKafkaInputDStream: metadataCleanupDelay = -1
 15/03/31 10:47:17 INFO DirectKafkaInputDStream: Slide time = 2000 ms
 15/03/31 10:47:17 INFO DirectKafkaInputDStream: Storage level = 
 StorageLevel(false, false, false, false, 1)
 15/03/31 10:47:17 INFO DirectKafkaInputDStream: Checkpoint interval = null
 15/03/31 10:47:17 INFO DirectKafkaInputDStream: Remember duration = 2000 ms
 15/03/31 10:47:17 INFO DirectKafkaInputDStream: Initialized and validated 
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream@1daf3b44
 15/03/31 10:47:17 INFO MappedDStream: Slide time = 2000 ms
 15/03/31 10:47:17 INFO MappedDStream: Storage level = StorageLevel(false, 
 false, false, false, 1)
 15/03/31 10:47:17 INFO MappedDStream: Checkpoint interval = null
 15/03/31 10:47:17 INFO MappedDStream: Remember duration = 2000 ms
 15/03/31 10:47:17 INFO MappedDStream: Initialized and validated

Re: Spark 1.3.0 DataFrame and Postgres

2015-04-01 Thread Ted Yu
+1 on escaping column names. 



 On Apr 1, 2015, at 5:50 AM, fergjo00 johngfergu...@gmail.com wrote:
 
 Question:  
 ---
 Is there a way to have JDBC DataFrames use quoted/escaped column names? 
 Right now, it looks like it sees the names correctly in the schema created
 but does not escape them in the SQL it creates when they are not compliant:
 
 org.apache.spark.sql.jdbc.JDBCRDD
 
  private val columnList: String = {
    val sb = new StringBuilder()
    columns.foreach(x => sb.append(",").append(x))
    if (sb.length == 0) "1" else sb.substring(1)
  }
 
 
  If you see value in this, I would take a shot at adding the quoting
  (escaping) of column names here. If the names are not quoted, some drivers,
  like PostgreSQL's, will simply lower-case all names when parsing the query.
  As you can see in the TL;DR below, that means they won't match the schema I
  am given.
 
 Thanks.
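
A standalone sketch of the kind of quoting being proposed (this is not the actual Spark patch; PostgreSQL uses double quotes, while other databases use different identifier quote characters, so a real fix would ask the JDBC dialect or driver for its quote string):

object QuotedColumnList {
  // Double-quote each identifier, escaping embedded quotes PostgreSQL-style.
  def columnList(columns: Seq[String]): String =
    if (columns.isEmpty) "1"
    else columns.map(c => "\"" + c.replace("\"", "\"\"") + "\"").mkString(",")

  def main(args: Array[String]): Unit =
    // Prints: "Symbol","Dividend Yield","Earnings/Share"
    println(columnList(Seq("Symbol", "Dividend Yield", "Earnings/Share")))
}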
 
 TL;DR:
 
 I am able to connect to a Postgres database in the shell (with driver
 referenced):
 
    val jdbcDf =
      sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser",
      "sp500")
 
 In fact when I run:
 
    jdbcDf.registerTempTable("sp500")
    val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI
  FROM sp500")
 
 and 
 
    val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share")))
 
 The values come back as expected.  However, if I try:
 
   jdbcDf.show
 
 Or if I try
 
    val all = sqlContext.sql("SELECT * FROM sp500")
    all.show
 
  I get errors about column names not being found. In fact the error includes
  a mention of the column names all lower-cased. For now I will change my
  schema to be more restrictive. Right now it is, per a Stack Overflow poster,
  not ANSI compliant, doing things that are only allowed by quoted identifiers
  in pgsql, MySQL and SQL Server. BTW, our users are giving us tables like
  this... because various tools they already use support non-compliant names.
  In fact, this is mild compared to what we've had to support.
 
 Currently the schema in question uses mixed case, quoted names with special
 characters and spaces:
 
  CREATE TABLE sp500
  (
    "Symbol" text,
    "Name" text,
    "Sector" text,
    "Price" double precision,
    "Dividend Yield" double precision,
    "Price/Earnings" double precision,
    "Earnings/Share" double precision,
    "Book Value" double precision,
    "52 week low" double precision,
    "52 week high" double precision,
    "Market Cap" double precision,
    "EBITDA" double precision,
    "Price/Sales" double precision,
    "Price/Book" double precision,
    "SEC Filings" text
  )
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-0-DataFrame-and-Postgres-tp22338.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3 build with hive support fails on JLine

2015-04-01 Thread Ted Yu
Please invoke dev/change-version-to-2.11.sh before running mvn.

Cheers

On Mon, Mar 30, 2015 at 1:02 AM, Night Wolf nightwolf...@gmail.com wrote:

 Hey,

  Trying to build Spark 1.3 with Scala 2.11, supporting YARN and Hive (with
  the thrift server).

 Running;

 *mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pmapr4 -Phive
 -Phive-thriftserver clean install*

 The build fails with;

 INFO] Compiling 9 Scala sources to 
 /var/lib/jenkins/workspace/cse-Apache-Spark-YARN-2.11-1.3-build/sql/hive-thriftserver/target/scala-2.11/classes...[ERROR]
  
 /var/lib/jenkins/workspace/cse-Apache-Spark-YARN-2.11-1.3-build/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25:
  object ConsoleReader is not a member of package jline[ERROR] import 
 jline.{ConsoleReader, History}[ERROR]^[WARNING] Class jline.Completor 
 not found - continuing with a stub.[WARNING] Class jline.ConsoleReader not 
 found - continuing with a stub.[ERROR] 
 /var/lib/jenkins/workspace/cse-Apache-Spark-YARN-2.11-1.3-build/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165:
  not found: type ConsoleReader[ERROR] val reader = new 
 ConsoleReader()[ERROR]  ^[ERROR] Class jline.Completor 
 not found - continuing with a stub.[WARNING] Class com.google.protobuf.Parser 
 not found - continuing with a stub.[WARNING] Class com.google.protobuf.Parser 
 not found - continuing with a stub.[WARNING] Class com.google.protobuf.Parser 
 not found - continuing with a stub.[WARNING] Class com.google.protobuf.Parser 
 not found - continuing with a stub.[WARNING] 6 warnings found[ERROR] three 
 errors found[INFO] 
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Parent POM ... SUCCESS [ 15.731 
 s]
 [INFO] Spark Project Networking ... SUCCESS [ 46.667 
 s]
 [INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 28.508 
 s]
 [INFO] Spark Project Core . SUCCESS [07:45 
 min]
 [INFO] Spark Project Bagel  SUCCESS [01:10 
 min]
 [INFO] Spark Project GraphX ... SUCCESS [02:42 
 min]
 [INFO] Spark Project Streaming  SUCCESS [03:22 
 min]
 [INFO] Spark Project Catalyst . SUCCESS [04:42 
 min]
 [INFO] Spark Project SQL .. SUCCESS [05:17 
 min]
 [INFO] Spark Project ML Library ... SUCCESS [05:36 
 min]
 [INFO] Spark Project Tools  SUCCESS [ 46.976 
 s]
 [INFO] Spark Project Hive . SUCCESS [04:08 
 min]
 [INFO] Spark Project REPL . SUCCESS [01:58 
 min]
 [INFO] Spark Project YARN . SUCCESS [01:47 
 min]
 [INFO] Spark Project Hive Thrift Server ... FAILURE [ 20.731 
 s]
 [INFO] Spark Project Assembly . SKIPPED
 [INFO] Spark Project External Twitter . SKIPPED
 [INFO] Spark Project External Flume Sink .. SKIPPED
 [INFO] Spark Project External Flume ... SKIPPED
 [INFO] Spark Project External MQTT  SKIPPED
 [INFO] Spark Project External ZeroMQ .. SKIPPED
 [INFO] Spark Project Examples . SKIPPED
 [INFO] Spark Project YARN Shuffle Service . SKIPPED
 [INFO] 
 
 [INFO] BUILD FAILURE
 [INFO] 
 
 [INFO] Total time: 41:59 min
 [INFO] Finished at: 2015-03-30T07:06:49+00:00
 [INFO] Final Memory: 94M/868M
 [INFO] 
 [ERROR]
  Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile 
 (scala-compile-first) on project spark-hive-thriftserver_2.11: Execution 
 scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed - 
 [Help 1]org.apache.maven.lifecycle.LifecycleExecutionException: Failed to 
 execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile 
 (scala-compile-first) on project spark-hive-thriftserver_2.11: Execution 
 scala-compile-first of goal 
 net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed.


 Any ideas?

 Cheers,
 N



Re: Spark streaming

2015-03-27 Thread Ted Yu
jamborta :
Please also describe the format of your csv files.

Cheers
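
For context, the two variants described in the message quoted further down (parse each CSV line into an array up front and store it, versus store the raw lines and parse later) might look roughly like this; the input path, delimiter, and storage level are assumptions, not taken from the original workflow:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CsvParseVariants {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-parse-variants"))
    val raw = sc.textFile("hdfs:///data/incoming/*.csv")

    // Variant 1: parse each line into an array first, then persist the arrays.
    val parsedFirst = raw.map(_.split(',')).persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(parsedFirst.count())

    // Variant 2: persist the raw lines and parse only where a downstream step needs it.
    val storedFirst = raw.persist(StorageLevel.MEMORY_AND_DISK_SER)
    println(storedFirst.map(_.split(',')).count())

    sc.stop()
  }
}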

On Fri, Mar 27, 2015 at 6:42 AM, DW @ Gmail deanwamp...@gmail.com wrote:

 Show us the code. This shouldn't happen for the simple process you
 described

 Sent from my rotary phone.


  On Mar 27, 2015, at 5:47 AM, jamborta jambo...@gmail.com wrote:
 
  Hi all,
 
   We have a workflow that pulls in data from csv files. The original setup
  of the workflow was to parse the data as it comes in (turn it into an
  array), then store it. This resulted in out-of-memory errors with larger
  files (as a result of increased GC?).
 
   It turns out that if the data gets stored as a string first and then
  parsed, the issue does not occur.
 
  Why is that?
 
  Thanks,
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tp22255.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-03-31 Thread Ted Yu
Jeetendra:
Please extract the information you need from Result and return the
extracted portion - instead of returning Result itself.

Cheers
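
A sketch of that suggestion in Scala (the original code is Java, and the table name plus the column family/qualifier below are hypothetical): map each Result down to plain values inside the distributed operation, so nothing non-serializable is returned.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseExtract {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-extract"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "company") // hypothetical table name

    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    val extracted = rdd.map { case (key, result) =>
      // Pull out only what is needed; the returned tuple is plain and serializable.
      val cknid = Bytes.toString(key.get()).toInt
      val name  = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
      (cknid, name)
    }
    extracted.take(5).foreach(println)
    sc.stop()
  }
}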

On Tue, Mar 31, 2015 at 1:14 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

 The example in
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
  might
 help

 Best,

 --
 Nan Zhu
 http://codingcat.me

 On Tuesday, March 31, 2015 at 3:56 PM, Sean Owen wrote:

 Yep, it's not serializable:
 https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html

 You can't return this from a distributed operation since that would
 mean it has to travel over the network and you haven't supplied any
 way to convert the thing into bytes.

 On Tue, Mar 31, 2015 at 8:51 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

  When I am trying to get the result from HBase and running the mapToPair
  function of the RDD, it gives the error
 java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

 Here is the code

  // private static JavaPairRDD<Integer, Result>
  // getCompanyDataRDD(JavaSparkContext sc) throws IOException {
  //   return sc.newAPIHadoopRDD(companyDAO.getCompnayDataConfiguration(),
  //       TableInputFormat.class, ImmutableBytesWritable.class, Result.class)
  //     .mapToPair(new PairFunction<Tuple2<ImmutableBytesWritable, Result>, Integer, Result>() {
  //
  //       public Tuple2<Integer, Result> call(Tuple2<ImmutableBytesWritable, Result> t) throws Exception {
  //         System.out.println("In getCompanyDataRDD" + t._2);
  //
  //         String cknid = Bytes.toString(t._1.get());
  //         System.out.println("processing cknids is:" + cknid);
  //         Integer cknidInt = Integer.parseInt(cknid);
  //         Tuple2<Integer, Result> returnTuple = new Tuple2<Integer, Result>(cknidInt, t._2);
  //         return returnTuple;
  //       }
  //     });
  // }


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Ted Yu
Please take a look at 
https://spark.apache.org/docs/latest/sql-programming-guide.html

Cheers



 On Mar 28, 2015, at 5:08 AM, Vincent He vincent.he.andr...@gmail.com wrote:
 
 
  I am learning Spark SQL and trying the spark-sql examples. I ran the following
  code, but I got the exception ERROR CliDriver: org.apache.spark.sql.AnalysisException:
  cannot recognize input near 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement;
  line 1 pos 17. I have two questions:
  1. Do we have a list of the statements supported in spark-sql?
  2. Does the spark-sql shell support HiveQL? If yes, how do I enable it?
 
 The example I tried:
 CREATE TEMPORARY TABLE jsonTable
 USING org.apache.spark.sql.json
 OPTIONS (
   path examples/src/main/resources/people.json
 )
 SELECT * FROM jsonTable
 The exception I got,
  CREATE TEMPORARY TABLE jsonTable
   USING org.apache.spark.sql.json
   OPTIONS (
 path examples/src/main/resources/people.json
   )
   SELECT * FROM jsonTable
   ;
 15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY TABLE 
 jsonTable
 USING org.apache.spark.sql.json
 OPTIONS (
   path examples/src/main/resources/people.json
 )
 SELECT * FROM jsonTable
 NoViableAltException(241@[654:1: ddlStatement : ( createDatabaseStatement | 
 switchDatabaseStatement | dropDatabaseStatement | createTableStatement | 
 dropTableStatement | truncateTableStatement | alterStatement | descStatement 
 | showStatement | metastoreCheck | createViewStatement | dropViewStatement | 
 createFunctionStatement | createMacroStatement | createIndexStatement | 
 dropIndexStatement | dropFunctionStatement | dropMacroStatement | 
 analyzeStatement | lockStatement | unlockStatement | lockDatabase | 
 unlockDatabase | createRoleStatement | dropRoleStatement | grantPrivileges | 
 revokePrivileges | showGrants | showRoleGrants | showRolePrincipals | 
 showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
 at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
 at org.antlr.runtime.DFA.predict(DFA.java:144)
 at 
 org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
 at 
 org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
 at 
 org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
 at 
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
 at 
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
 at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
 at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
 at 
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
 at 
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
 at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
 at 
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
 at 
 scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
 at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at 
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
 at 
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
 at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at 
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at 
 scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
 at 
 scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
 at 
 org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
 at 
 org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
 at 
 org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
 at 
 org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96)
 at 
 

Re: Anyone has some simple example with spark-sql with spark 1.3

2015-03-28 Thread Ted Yu
See
https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

I haven't tried the SQL statements in above blog myself.

Cheers
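
For what it's worth, a sketch of the same data-source DDL issued from Scala rather than the spark-sql shell (the JSON path is the one shipped with the Spark distribution; I have not verified this exact snippet, so treat it as an illustration of the separate-statement, quoted-path syntax rather than a confirmed fix):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object JsonTempTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-temp-table"))
    val sqlContext = new HiveContext(sc)

    // Issue the CREATE and the SELECT as two separate statements,
    // and quote the path inside OPTIONS.
    sqlContext.sql(
      """CREATE TEMPORARY TABLE jsonTable
        |USING org.apache.spark.sql.json
        |OPTIONS (path "examples/src/main/resources/people.json")""".stripMargin)
    sqlContext.sql("SELECT * FROM jsonTable").show()

    sc.stop()
  }
}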

On Sat, Mar 28, 2015 at 5:39 AM, Vincent He vincent.he.andr...@gmail.com
wrote:

 Thanks for the information. I have read it; I can run the samples with Scala
 or Python, but for the spark-sql shell I cannot get an example running
 successfully. Can you give me an example I can run with ./bin/spark-sql
 without writing any code? Thanks.

 On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at
 https://spark.apache.org/docs/latest/sql-programming-guide.html

 Cheers



  On Mar 28, 2015, at 5:08 AM, Vincent He vincent.he.andr...@gmail.com
 wrote:
 
 
  I am learning spark sql and try spark-sql example,  I running following
 code, but I got exception ERROR CliDriver:
 org.apache.spark.sql.AnalysisException: cannot recognize input near
 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17, I have two
 questions,
  1. Do we have a list of the statement supported in spark-sql ?
  2. Does spark-sql shell support hiveql ? If yes, how to set?
 
  The example I tried:
  CREATE TEMPORARY TABLE jsonTable
  USING org.apache.spark.sql.json
  OPTIONS (
path examples/src/main/resources/people.json
  )
  SELECT * FROM jsonTable
  The exception I got,
   CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path examples/src/main/resources/people.json
)
SELECT * FROM jsonTable
;
  15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY
 TABLE jsonTable
  USING org.apache.spark.sql.json
  OPTIONS (
path examples/src/main/resources/people.json
  )
  SELECT * FROM jsonTable
  NoViableAltException(241@[654:1: ddlStatement : (
 createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement |
 createTableStatement | dropTableStatement | truncateTableStatement |
 alterStatement | descStatement | showStatement | metastoreCheck |
 createViewStatement | dropViewStatement | createFunctionStatement |
 createMacroStatement | createIndexStatement | dropIndexStatement |
 dropFunctionStatement | dropMacroStatement | analyzeStatement |
 lockStatement | unlockStatement | lockDatabase | unlockDatabase |
 createRoleStatement | dropRoleStatement | grantPrivileges |
 revokePrivileges | showGrants | showRoleGrants | showRolePrincipals |
 showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
  at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
  at org.antlr.runtime.DFA.predict(DFA.java:144)
  at
 org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
  at
 org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
  at
 org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
  at
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
  at
 org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
  at
 org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
  at
 org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
  at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
  at
 org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
  at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
  at
 scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
 at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
  at
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
  at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
  at
 scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
  at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at
 scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at
 scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
  at
 scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
  at
 scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
  at
 scala.util.parsing.combinator.Parsers$$anon$2.apply

Re: Can't access file in spark, but can in hadoop

2015-03-28 Thread Ted Yu
Thanks for the follow-up, Dale.

bq. hdp 2.3.1

Minor correction: should be hdp 2.1.3

Cheers

On Sat, Mar 28, 2015 at 2:28 AM, Johnson, Dale daljohn...@ebay.com wrote:

  Actually I did figure this out eventually.

  I’m running on a Hortonworks cluster hdp 2.3.1 (hadoop 2.4.1).  Spark
 bundles the org/apache/hadoop/hdfs/… classes along with the spark-assembly
 jar.  This turns out to introduce a small incompatibility with hdp 2.3.1.
 I carved these classes out of the jar, and put a distro-provided jar into
 the class path for the hdfs classes, and this fixed the problem.

  Ideally there would be an exclusion in the pom to deal with this.

  Dale.

   From: Zhan Zhang zzh...@hortonworks.com
 Date: Friday, March 27, 2015 at 4:28 PM
 To: Johnson, Dale daljohn...@ebay.com
 Cc: Ted Yu yuzhih...@gmail.com, user user@spark.apache.org

 Subject: Re: Can't access file in spark, but can in hadoop

    Probably a Guava version conflict issue. What Spark version did you use,
  and which Hadoop version was it compiled against?

  Thanks.

  Zhan Zhang

  On Mar 27, 2015, at 12:13 PM, Johnson, Dale daljohn...@ebay.com wrote:

  Yes, I could recompile the hdfs client with more logging, but I don’t
 have the day or two to spare right this week.

  One more thing about this, the cluster is Horton Works 2.1.3 [.0]

  They seem to have a claim of supporting spark on Horton Works 2.2

  Dale.

   From: Ted Yu yuzhih...@gmail.com
 Date: Thursday, March 26, 2015 at 4:54 PM
 To: Johnson, Dale daljohn...@ebay.com
 Cc: user user@spark.apache.org
 Subject: Re: Can't access file in spark, but can in hadoop

   Looks like the following assertion failed:
   Preconditions.checkState(storageIDsCount == locs.size());

  locs is List<DatanodeInfoProto>
 Can you enhance the assertion to log more information ?

  Cheers

 On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson daljohn...@ebay.com wrote:

 There seems to be a special kind of "corrupted according to Spark" state of
 file in HDFS.  I have isolated a set of files (maybe 1% of all files I
 need
 to work with) which are producing the following stack dump when I try to
 sc.textFile() open them.  When I try to open directories, most large
 directories contain at least one file of this type.  Curiously, the
 following two lines fail inside of a Spark job, but not inside of a Scoobi
 job:

 val conf = new org.apache.hadoop.conf.Configuration
 val fs = org.apache.hadoop.fs.FileSystem.get(conf)

 The stack trace follows:

 15/03/26 14:22:43 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: null)
 Exception in thread Driver java.lang.IllegalStateException
 at

 org.spark-project.guava.common.base.Preconditions.checkState(Preconditions.java:133)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:673)
 at

 org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1100)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1118)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1251)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1354)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1363)
 at

 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:518)
 at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
 at

 org.apache.hadoop.hdfs.DistributedFileSystem$15.init(DistributedFileSystem.java:738)
 at

 org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:727)
 at
 org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1662)
 at org.apache.hadoop.fs.FileSystem$5.init(FileSystem.java:1724)
 at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:1721)
 at

 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1125)
 at

 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1123)
 at

 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at

 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$.main(SpellQuery.scala:1123)
 at
 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch.main(SpellQuery.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
 15/03/26 14:22:43 INFO yarn.ApplicationMaster

Re: Spark-submit not working when application jar is in hdfs

2015-03-28 Thread Ted Yu
Looking at SparkSubmit#addJarToClasspath():

uri.getScheme match {
  case "file" | "local" =>
...
  case _ =>
    printWarning(s"Skip remote jar $uri.")

It seems hdfs scheme is not recognized.

FYI

On Thu, Feb 26, 2015 at 6:09 PM, dilm dmend...@exist.com wrote:

 I'm trying to run a spark application using bin/spark-submit. When I
 reference my application jar inside my local filesystem, it works. However,
 when I copied my application jar to a directory in hdfs, i get the
 following
 exception:

 Warning: Skip remote jar
 hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
 java.lang.ClassNotFoundException: com.example.SimpleApp

 Here's the comand:

 $ ./bin/spark-submit --class com.example.SimpleApp --master local
 hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar

 I'm using hadoop version 2.6.0, spark version 1.2.1

 In the official documentation, it states that: application-jar:
 Path to a bundled jar including your application and all dependencies. The
 URL must be globally visible inside of your cluster, for instance, an
 *hdfs:// path* or a file:// path that is present on all nodes. I'm
 thinking
 maybe this is a valid bug?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-submit-not-working-when-application-jar-is-in-hdfs-tp21840.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: registerTempTable is not a member of RDD on spark 1.2?

2015-03-23 Thread Ted Yu
Have you tried adding the following ?

import org.apache.spark.sql.SQLContext

Cheers

On Mon, Mar 23, 2015 at 6:45 AM, IT CTO goi@gmail.com wrote:

 Thanks.
 I am new to the environment and running cloudera CDH5.3 with spark in it.

 Apparently, when running this command in spark-shell:  val sqlContext = new
 SQLContext(sc)
 it fails with "not found: type SQLContext"

 Any idea why?

 On Mon, Mar 23, 2015 at 3:05 PM, Dean Wampler deanwamp...@gmail.com
 wrote:

 In 1.2 it's a member of SchemaRDD and it becomes available on RDD
 (through the type class mechanism) when you add a SQLContext, like so.

 val sqlContext = new SQLContext(sc)
 import sqlContext._


 In 1.3, the method has moved to the new DataFrame type.
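
 For example, a minimal 1.2 sketch in spark-shell (the case class and file
 path here are made up for illustration):

 import org.apache.spark.sql.SQLContext

 case class Person(name: String, age: Int)

 val sqlContext = new SQLContext(sc)
 import sqlContext._   // brings the implicit RDD -> SchemaRDD conversion into scope

 val people = sc.textFile("people.txt")
   .map(_.split(","))
   .map(p => Person(p(0), p(1).trim.toInt))

 people.registerTempTable("people")   // resolves via the implicit SchemaRDD conversion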

 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition
 http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
 Typesafe http://typesafe.com
 @deanwampler http://twitter.com/deanwampler
 http://polyglotprogramming.com

 On Mon, Mar 23, 2015 at 5:25 AM, IT CTO goi@gmail.com wrote:

 Hi,

  I am running Spark; when I use sc.version I get 1.2, but when I call
  registerTempTable("MyTable") I get an error saying registerTempTable is not a
  member of RDD.

 Why?

 --
 Eran | CTO





 --
 Eran | CTO



Re: Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any

2015-03-22 Thread Ted Yu
I thought of formation #1.
But it looks like when there are many fields, formation #2 is cleaner.

Cheers


On Sun, Mar 22, 2015 at 8:14 PM, Cheng Lian lian.cs@gmail.com wrote:

  You need either

 .map { row =>
   (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...)
 }

 or

 .map { case Row(f0: Float, f1: Float, ...) =>
   (f0, f1)
 }
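
 Applied to the twelve-column "products" table from the question below, the
 second form would look like this (a sketch; the Float column types are an
 assumption about the loaded schema):

 import org.apache.spark.sql.Row

 val productsRdd = sqlContext.table("products").map {
   case Row(c0: Float, c1: Float, c2: Float, c3: Float, c4: Float, c5: Float,
            c6: Float, c7: Float, c8: Float, c9: Float, c10: Float, c11: Float) =>
     (c0, c1, c2, c3, c4, c5, c6, c7, c8, c9, c10, c11)
 }

 // If the columns actually come in as strings (e.g. from a CSV load), a more
 // forgiving alternative is:
 //   sqlContext.table("products").map(_.toSeq.map(_.toString.toFloat))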

 On 3/23/15 9:08 AM, Minnow Noir wrote:

 I'm following some online tutorial written in Python and trying to
 convert a Spark SQL table object to an RDD in Scala.

  The Spark SQL just loads a simple table from a CSV file.  The tutorial
 says to convert the table to an RDD.

  The Python is

 products_rdd = sqlContext.table("products").map(lambda row:
 (float(row[0]),float(row[1]),float(row[2]),float(row[3]),
 float(row[4]),float(row[5]),float(row[6]),float(row[7]),float(row[8]),float(row[9]),float(row[10]),float(row[11])))

  The Scala is *not*

 val productsRdd = sqlContext.table("products").map( row => (
   row(0).toFloat,row(1).toFloat,row(2).toFloat,row(3).toFloat,
 row(4).toFloat,row(5).toFloat,row(6).toFloat,row(7).toFloat,row(8).toFloat,
 row(9).toFloat,row(10).toFloat,row(11).toFloat
 ))

  I know this, because Spark says that for each of the row(x).toFloat
 calls,
 error: value toFloat is not a member of Any

  Does anyone know the proper syntax for this?

  Thank you


   ​



Re: Spark 1.2. loses often all executors

2015-03-23 Thread Ted Yu
In this thread:
http://search-hadoop.com/m/JW1q5DM69G

I only saw two replies. Maybe some people forgot to use 'Reply to All' ?

Cheers

On Mon, Mar 23, 2015 at 8:19 AM, mrm ma...@skimlinks.com wrote:

 Hi,

 I have received three replies to my question on my personal e-mail, why
 don't they also show up on the mailing list? I would like to reply to the 3
 users through a thread.

 Thanks,
 Maria



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-loses-often-all-executors-tp22162p22187.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Ted Yu
bq. Cause was: akka.remote.InvalidAssociation: Invalid address:
akka.tcp://sparkMaster@localhost:7077

There should be some more output following the above line.

Can you post them ?

Cheers

On Mon, Mar 2, 2015 at 2:06 PM, Krishnanand Khambadkone 
kkhambadk...@yahoo.com.invalid wrote:

 Hi,  I am running spark on my mac.   It is reading from a kafka topic and
 then writes the data to a hbase table.  When I do a spark submit,  I get
 this error,

 Error connecting to master spark://localhost:7077
 (akka.tcp://sparkMaster@localhost:7077), exiting.
 Cause was: akka.remote.InvalidAssociation: Invalid address:
 akka.tcp://sparkMaster@localhost:7077

 My submit statement looks like this,

 ./spark-submit --jars
 /Users/hadoop/kafka-0.8.1.1-src/core/build/libs/kafka_2.8.0-0.8.1.1.jar
 --class KafkaMain --master spark://localhost:7077 --deploy-mode cluster
 --num-executors 100 --driver-memory 1g --executor-memory 1g
 /Users/hadoop/dev/kafkahbase/kafka/src/sparkhbase.jar




Re: Executing hive query from Spark code

2015-03-02 Thread Ted Yu
Here is snippet of dependency tree for spark-hive module:

[INFO] org.apache.spark:spark-hive_2.10:jar:1.3.0-SNAPSHOT
...
[INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile
[INFO] |  +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-common:jar:0.13.1a:compile
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-0.20:jar:0.13.1a:runtime
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-common-secure:jar:0.13.1a:compile
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-0.20S:jar:0.13.1a:runtime
[INFO] |  |  \-
org.spark-project.hive.shims:hive-shims-0.23:jar:0.13.1a:runtime
...
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  +- org.spark-project.hive:hive-ant:jar:0.13.1a:compile
[INFO] |  |  \- org.apache.velocity:velocity:jar:1.5:compile
[INFO] |  | \- oro:oro:jar:2.0.8:compile
[INFO] |  +- org.spark-project.hive:hive-common:jar:0.13.1a:compile
...
[INFO] +- org.spark-project.hive:hive-serde:jar:0.13.1a:compile

bq. is there a way to have the hive support without updating the assembly

I don't think so.

On Mon, Mar 2, 2015 at 12:37 PM, nitinkak001 nitinkak...@gmail.com wrote:

 I want to run Hive query inside Spark and use the RDDs generated from that
 inside Spark. I read in the documentation

 /Hive support is enabled by adding the -Phive and -Phive-thriftserver
 flags
 to Spark’s build. This command builds a new assembly jar that includes
 Hive.
 Note that this Hive assembly jar must also be present on all of the worker
 nodes, as they will need access to the Hive serialization and
 deserialization libraries (SerDes) in order to access data stored in
 Hive./

 I just wanted to know what -Phive and -Phive-thriftserver flags really do
 and is there a way to have the hive support without updating the assembly.
 Does that flag add a hive support jar or something?

 The reason I am asking is that I will be using Cloudera version of Spark in
 future and I am not sure how to add the Hive support to that Spark
 distribution.






 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Executing-hive-query-from-Spark-code-tp21880.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark UI and running spark-submit with --master yarn

2015-03-02 Thread Ted Yu
Default RM Web UI port is 8088 (configurable
through yarn.resourcemanager.webapp.address)

Cheers

On Mon, Mar 2, 2015 at 4:14 PM, Anupama Joshi anupama.jo...@gmail.com
wrote:

 Hi Marcelo,
 Thanks for the quick reply.
 I have a EMR cluster and I am running the spark-submit on the master node
 in the cluster.
 When I start the spark-submit , I see
 15/03/02 23:48:33 INFO client.RMProxy: Connecting to ResourceManager at /
 172.31.43.254:9022
 But If I try that URL or the use the external DNS
 ec2-52-10-234-111.us-west-2.compute.amazonaws.com:9022
 it does not work
 What am I missing here ?
 Thanks a lot for the help
 -AJ


 On Mon, Mar 2, 2015 at 3:50 PM, Marcelo Vanzin van...@cloudera.com
 wrote:

 What are you calling masternode? In yarn-cluster mode, the driver
 is running somewhere in your cluster, not on the machine where you run
 spark-submit.

 The easiest way to get to the Spark UI when using Yarn is to use the
 Yarn RM's web UI. That will give you a link to the application's UI
 regardless of whether it's running on client or cluster mode.

 On Mon, Mar 2, 2015 at 3:39 PM, Anupama Joshi anupama.jo...@gmail.com
 wrote:
  Hi ,
 
   When I run my application with --master yarn-cluster or --master yarn
  --deploy-mode cluster, I can not see the Spark UI at the location
  masternode:4040.  Even if I am running the job, I can not see the Spark UI.
  When I run with --master yarn --deploy-mode client -- I see the Spark UI,
  but I cannot see my job running.
 
  When I run spark-submit with --master local[*] , I see the spark UI ,
 my job
  everything (Thats great)
 
  Do I need to do some settings to see the UI?
 
  Thanks
 
  -AJ
 
 
 
 
 
 



 --
 Marcelo





Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077

2015-03-02 Thread Ted Yu
In AkkaUtils.scala:

    val akkaLogLifecycleEvents =
      conf.getBoolean("spark.akka.logLifecycleEvents", false)

Can you turn on life cycle event logging to see if you would get some more
clue ?
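
For example, it can be passed on the command line (a sketch; the rest of the
options mirror the submit command quoted below):

./spark-submit --conf spark.akka.logLifecycleEvents=true \
  --class KafkaMain --master spark://localhost:7077 --deploy-mode cluster \
  /Users/hadoop/dev/kafkahbase/kafka/src/sparkhbase.jar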

Cheers

On Mon, Mar 2, 2015 at 3:56 PM, Krishnanand Khambadkone 
kkhambadk...@yahoo.com wrote:

 I see these messages now,

 spark.master - spark://krishs-mbp:7077
 Classpath elements:



 Sending launch command to spark://krishs-mbp:7077
 Driver successfully submitted as driver-20150302155433-
 ... waiting before polling master for driver state
 ... polling master for driver state
 State of driver-20150302155433- is FAILED


   On Monday, March 2, 2015 2:42 PM, Ted Yu yuzhih...@gmail.com wrote:


 bq. Cause was: akka.remote.InvalidAssociation: Invalid address:
 akka.tcp://sparkMaster@localhost:7077

 There should be some more output following the above line.

 Can you post them ?

 Cheers

 On Mon, Mar 2, 2015 at 2:06 PM, Krishnanand Khambadkone 
 kkhambadk...@yahoo.com.invalid wrote:

 Hi,  I am running spark on my mac.   It is reading from a kafka topic and
 then writes the data to a hbase table.  When I do a spark submit,  I get
 this error,

 Error connecting to master spark://localhost:7077
 (akka.tcp://sparkMaster@localhost:7077), exiting.
 Cause was: akka.remote.InvalidAssociation: Invalid address:
 akka.tcp://sparkMaster@localhost:7077

 My submit statement looks like this,

 ./spark-submit --jars
 /Users/hadoop/kafka-0.8.1.1-src/core/build/libs/kafka_2.8.0-0.8.1.1.jar
 --class KafkaMain --master spark://localhost:7077 --deploy-mode cluster
 --num-executors 100 --driver-memory 1g --executor-memory 1g
 /Users/hadoop/dev/kafkahbase/kafka/src/sparkhbase.jar







Re: Resource manager UI for Spark applications

2015-03-03 Thread Ted Yu
bq. spark UI does not work for Yarn-cluster.

Can you be a bit more specific on the error(s) you saw ?

What Spark release are you using ?

Cheers

On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com wrote:

 Sorry , for half email - here it is again in full
 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application using
 yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?

 2. when I click on Application Monitoring or history , i get re-directed
 to some linked with internal Ip address. Even if I replace that address
 with the public IP , it still does not work.  What kind of setup changes
 are needed for that?

 Thanks
 -roni

 On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?








Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Ted Yu
I have created SPARK-6085 with pull request:
https://github.com/apache/spark/pull/4836

Cheers

On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:

 +1 to a better default as well.

 We were working find until we ran against a real dataset which was much
 larger than the test dataset we were using locally. It took me a couple
 days and digging through many logs to figure out this value was what was
 causing the problem.

 On Sat, Feb 28, 2015 at 11:38 AM, Ted Yu yuzhih...@gmail.com wrote:

 Having good out-of-box experience is desirable.

 +1 on increasing the default.


 On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote:

 There was a recent discussion about whether to increase or indeed make
 configurable this kind of default fraction. I believe the suggestion
 there too was that 9-10% is a safer default.

 Advanced users can lower the resulting overhead value; it may still
 have to be increased in some cases, but a fatter default may make this
 kind of surprise less frequent.

 I'd support increasing the default; any other thoughts?

 On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers ko...@tresata.com
 wrote:
  hey,
  running my first map-red like (meaning disk-to-disk, avoiding in memory
  RDDs) computation in spark on yarn i immediately got bitten by a too
 low
  spark.yarn.executor.memoryOverhead. however it took me about an hour
 to find
  out this was the cause. at first i observed failing shuffles leading to
  restarting of tasks, then i realized this was because executors could
 not be
  reached, then i noticed in containers got shut down and reallocated in
  resourcemanager logs (no mention of errors, it seemed the containers
  finished their business and shut down successfully), and finally i
 found the
  reason in nodemanager logs.
 
  i dont think this is a pleasent first experience. i realize
  spark.yarn.executor.memoryOverhead needs to be set differently from
  situation to situation. but shouldnt the default be a somewhat higher
 value
  so that these errors are unlikely, and then the experts that are
 willing to
  deal with these errors can tune it lower? so why not make the default
 10%
  instead of 7%? that gives something that works in most situations out
 of the
  box (at the cost of being a little wasteful). it worked for me.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: bitten by spark.yarn.executor.memoryOverhead

2015-03-02 Thread Ted Yu
bq. that 0.1 is always enough?

The answer is: it depends (on use cases).
The value of 0.1 has been validated by several users. I think it is a
reasonable default.

Cheers

On Mon, Mar 2, 2015 at 8:36 AM, Ryan Williams ryan.blake.willi...@gmail.com
 wrote:

 For reference, the initial version of #3525
 https://github.com/apache/spark/pull/3525 (still open) made this
 fraction a configurable value, but consensus went against that being
 desirable so I removed it and marked SPARK-4665
 https://issues.apache.org/jira/browse/SPARK-4665 as won't fix.

 My team wasted a lot of time on this failure mode as well and has settled
 in to passing --conf spark.yarn.executor.memoryOverhead=1024 to most
 jobs (that works out to 10-20% of --executor-memory, depending on the job).

 I agree that learning about this the hard way is a weak part of the
 Spark-on-YARN onboarding experience.

 The fact that our instinct here is to increase the 0.07 minimum instead of
 the alternate 384MB
 https://github.com/apache/spark/blob/3efd8bb6cf139ce094ff631c7a9c1eb93fdcd566/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L93
 minimum seems like evidence that the fraction is the thing we should allow
 people to configure, instead of absolute amount that is currently
 configurable.

 Finally, do we feel confident that 0.1 is always enough?


 On Sat, Feb 28, 2015 at 4:51 PM Corey Nolet cjno...@gmail.com wrote:

 Thanks for taking this on Ted!

 On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote:

 I have created SPARK-6085 with pull request:
 https://github.com/apache/spark/pull/4836

 Cheers

 On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote:

 +1 to a better default as well.

 We were working find until we ran against a real dataset which was much
 larger than the test dataset we were using locally. It took me a couple
 days and digging through many logs to figure out this value was what was
 causing the problem.

 On Sat, Feb 28, 2015 at 11:38 AM, Ted Yu yuzhih...@gmail.com wrote:

 Having good out-of-box experience is desirable.

 +1 on increasing the default.


 On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote:

 There was a recent discussion about whether to increase or indeed make
 configurable this kind of default fraction. I believe the suggestion
 there too was that 9-10% is a safer default.

 Advanced users can lower the resulting overhead value; it may still
 have to be increased in some cases, but a fatter default may make this
 kind of surprise less frequent.

 I'd support increasing the default; any other thoughts?

 On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers ko...@tresata.com
 wrote:
  hey,
  running my first map-red like (meaning disk-to-disk, avoiding in
 memory
  RDDs) computation in spark on yarn i immediately got bitten by a
 too low
  spark.yarn.executor.memoryOverhead. however it took me about an
 hour to find
  out this was the cause. at first i observed failing shuffles
 leading to
  restarting of tasks, then i realized this was because executors
 could not be
  reached, then i noticed in containers got shut down and reallocated
 in
  resourcemanager logs (no mention of errors, it seemed the containers
  finished their business and shut down successfully), and finally i
 found the
  reason in nodemanager logs.
 
  i dont think this is a pleasent first experience. i realize
  spark.yarn.executor.memoryOverhead needs to be set differently from
  situation to situation. but shouldnt the default be a somewhat
 higher value
  so that these errors are unlikely, and then the experts that are
 willing to
  deal with these errors can tune it lower? so why not make the
 default 10%
  instead of 7%? that gives something that works in most situations
 out of the
  box (at the cost of being a little wasteful). it worked for me.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org








Re: Resource manager UI for Spark applications

2015-03-03 Thread Ted Yu
bq. changing the address with internal to the external one , but still does
not work.
Not sure what happened.
For the time being, you can use yarn command line to pull container log
(put in your appId and container Id):
yarn logs -applicationId application_1386639398517_0007 -containerId
container_1386639398517_0007_01_19

Cheers

On Tue, Mar 3, 2015 at 9:50 AM, roni roni.epi...@gmail.com wrote:

 Hi Ted,
  I  used s3://support.elasticmapreduce/spark/install-spark to install
 spark on my EMR cluster. It is 1.2.0.
  When I click on the link for history or logs it takes me to

 http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop
  and I get -

 The server at *ip-172-31-43-116.us-west-2.compute.internal* can't be
 found, because the DNS lookup failed. DNS is the network service that
 translates a website's name to its Internet address. This error is most
 often caused by having no connection to the Internet or a misconfigured
 network. It can also be caused by an unresponsive DNS server or a firewall
 preventing Google Chrome from accessing the network.
 I tried  changing the address with internal to the external one , but
 still does not work.
 Thanks
 _roni


 On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote:

 bq. spark UI does not work for Yarn-cluster.

 Can you be a bit more specific on the error(s) you saw ?

 What Spark release are you using ?

 Cheers

 On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

 Sorry , for half email - here it is again in full
 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?

 2. when I click on Application Monitoring or history , i get re-directed
 to some linked with internal Ip address. Even if I replace that address
 with the public IP , it still does not work.  What kind of setup changes
 are needed for that?

 Thanks
 -roni

 On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi roni.epi...@gmail.com
 wrote:

 Hi ,
 I have 2 questions -

  1. I was trying to use Resource Manager UI for my SPARK application
 using yarn cluster mode as I observed that spark UI does not work for
 Yarn-cluster.
 IS that correct or am I missing some setup?










Re: Issue using S3 bucket from Spark 1.2.1 with hadoop 2.4

2015-03-03 Thread Ted Yu
If you can use hadoop 2.6.0 binary, you can use s3a

s3a is being polished in the upcoming 2.7.0 release:
https://issues.apache.org/jira/browse/HADOOP-11571
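
For example, the s3n settings from the question below translate roughly as
follows (a sketch; it assumes the hadoop-aws and AWS SDK jars from the 2.6.0
build are on the classpath):

sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key", "someKey")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "someSecret")
val lines = sc.textFile("s3a://some-bucket/some/path")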

Cheers

On Tue, Mar 3, 2015 at 9:44 AM, Ankur Srivastava ankur.srivast...@gmail.com
 wrote:

 Hi,

 We recently upgraded to Spark 1.2.1 - Hadoop 2.4 binary. We are not having
 any other dependency on hadoop jars, except for reading our source files
 from S3.

 Since we have upgraded to the latest version our reads from S3 have
 considerably slowed down. For some jobs we see the read from S3 is stalled
 for a long time and then it starts.

 Is there a known issue with S3 or do we need to upgrade any settings? The
 only settings that we are using are:
 sc.hadoopConfiguration().set(fs.s3n.impl,
 org.apache.hadoop.fs.s3native.NativeS3FileSystem);

 sc.hadoopConfiguration().set(fs.s3n.awsAccessKeyId, someKey);

  sc.hadoopConfiguration().set(fs.s3n.awsSecretAccessKey, someSecret);


 Thanks for help!!

 - Ankur



Re: spark sql median and standard deviation

2015-03-04 Thread Ted Yu
Please take a look at DoubleRDDFunctions.scala :

  /** Compute the mean of this RDD's elements. */
  def mean(): Double = stats().mean

  /** Compute the variance of this RDD's elements. */
  def variance(): Double = stats().variance

  /** Compute the standard deviation of this RDD's elements. */
  def stdev(): Double = stats().stdev

Cheers

On Wed, Mar 4, 2015 at 10:51 AM, tridib tridib.sama...@live.com wrote:

 Hello,
 Is there in built function for getting median and standard deviation in
 spark sql? Currently I am converting the schemaRdd to DoubleRdd and calling
 doubleRDD.stats(). But still it does not have median.

 What is the most efficient way to get the median?

 Thanks  Regards
 Tridib



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-median-and-standard-deviation-tp21914.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Driver disassociated

2015-03-04 Thread Ted Yu
What release are you using ?

SPARK-3923 went into 1.2.0 release.

Cheers

On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber thomas.ger...@radius.com
wrote:

 Hello,

 sometimes, in the *middle* of a job, the job stops (status is then seen
 as FINISHED in the master).

 There isn't anything wrong in the shell/submit output.

 When looking at the executor logs, I see logs like this:

 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker
 actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal
 :40019/user/MapOutputTracker#893807065]
 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs for
 shuffle 38, fetching them
 15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
 [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766] -
 [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] disassociated!
 Shutting down.
 15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with remote
 system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] has
 failed, address is now gated for [5000] ms. Reason is: [Disassociated].

 How can I investigate further?
 Thanks



Re: Where can I find more information about the R interface forSpark?

2015-03-04 Thread Ted Yu
Please follow SPARK-5654

On Wed, Mar 4, 2015 at 7:22 PM, Haopu Wang hw...@qilinsoft.com wrote:

  Thanks, it's an active project.



 Will it be released with Spark 1.3.0?


  --

 *From:* 鹰 [mailto:980548...@qq.com]
 *Sent:* Thursday, March 05, 2015 11:19 AM
 *To:* Haopu Wang; user
 *Subject:* Re: Where can I find more information about the R interface
 forSpark?



 you can search SparkR on google or search it on github



Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2015-03-05 Thread Ted Yu
Please add the following to build command:
-Djackson.version=1.9.3
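
i.e. (a sketch, reusing the make-distribution invocation from the message below):

./make-distribution.sh --name hadoop2.6 --tgz \
  -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Djackson.version=1.9.3 \
  -Phive -Phive-thriftserver -DskipTests clean package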

Cheers

On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote:

 I am running Spark on a HortonWorks HDP Cluster. I have deployed there
 prebuilt version but it is only for Spark 1.2.0 not 1.2.1 and there are a
 few fixes and features in there that I would like to leverage.

 I just downloaded the spark-1.2.1 source and built it to support Hadoop
 2.6 by doing the following:

 radtech:spark-1.2.1 tnist$ ./make-distribution.sh --name hadoop2.6 --tgz 
 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
 -DskipTests clean package

 When I deploy this to my hadoop cluster and kick of a spark-shell,

 $ spark-1.2.1-bin-hadoop2.6]# ./bin/spark-shell --master yarn-client 
 --driver-memory 512m --executor-memory 512m

 Results in  java.lang.NoClassDefFoundError:
 org/codehaus/jackson/map/deser/std/StdDeserializer

 The full stack trace is below. I have validate that the
 $SPARK_HOME/lib/spark-assembly-1.2.1-hadoop2.6.0.jar does infact contain
 the class in question:

 jar -tvf spark-assembly-1.2.1-hadoop2.6.0.jar | grep 
 'org/codehaus/jackson/map/deser/std'
 ...
  18002 Thu Mar 05 11:23:04 EST 2015  
 parquet/org/codehaus/jackson/map/deser/std/StdDeserializer.class
   1584 Thu Mar 05 11:23:04 EST 2015 
 parquet/org/codehaus/jackson/map/deser/std/StdKeyDeserializer$BoolKD.class...

 Any guidance on what I missed ? If i start the spark-shell in standalone
 it comes up fine, $SPARK_HOME/bin/spark-shell so it looks to be related
 to starting it under yarn from what I can tell.

 TIA for the assistance.

 -Todd
 Stack Trace

 15/03/05 12:12:38 INFO spark.SecurityManager: Changing view acls to: 
 root15/03/05 12:12:38 INFO spark.SecurityManager: Changing modify acls to: 
 root15/03/05 12:12:38 INFO spark.SecurityManager: SecurityManager: 
 authentication disabled; ui acls disabled; users with view permissions: 
 Set(root); users with modify permissions: Set(root)15/03/05 12:12:38 INFO 
 spark.HttpServer: Starting HTTP Server15/03/05 12:12:39 INFO server.Server: 
 jetty-8.y.z-SNAPSHOT15/03/05 12:12:39 INFO server.AbstractConnector: Started 
 SocketConnector@0.0.0.0:3617615/03/05 12:12:39 INFO util.Utils: Successfully 
 started service 'HTTP class server' on port 36176.
 Welcome to
     __
  / __/__  ___ _/ /__
 _\ \/ _ \/ _ `/ __/  '_/
/___/ .__/\_,_/_/ /_/\_\   version 1.2.1
   /_/

 Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_75)
 Type in expressions to have them evaluated.
 Type :help for more information.15/03/05 12:12:43 INFO spark.SecurityManager: 
 Changing view acls to: root15/03/05 12:12:43 INFO spark.SecurityManager: 
 Changing modify acls to: root15/03/05 12:12:43 INFO spark.SecurityManager: 
 SecurityManager: authentication disabled; ui acls disabled; users with view 
 permissions: Set(root); users with modify permissions: Set(root)15/03/05 
 12:12:44 INFO slf4j.Slf4jLogger: Slf4jLogger started15/03/05 12:12:44 INFO 
 Remoting: Starting remoting15/03/05 12:12:44 INFO Remoting: Remoting started; 
 listening on addresses 
 :[akka.tcp://sparkdri...@hadoopdev01.opsdatastore.com:50544]15/03/05 12:12:44 
 INFO util.Utils: Successfully started service 'sparkDriver' on port 
 50544.15/03/05 12:12:44 INFO spark.SparkEnv: Registering 
 MapOutputTracker15/03/05 12:12:44 INFO spark.SparkEnv: Registering 
 BlockManagerMaster15/03/05 12:12:44 INFO storage.DiskBlockManager: Created 
 local directory at 
 /tmp/spark-16402794-cc1e-42d0-9f9c-99f15eaa1861/spark-118bc6af-4008-45d7-a22f-491bcd1856c015/03/05
  12:12:44 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 
 MB15/03/05 12:12:45 WARN util.NativeCodeLoader: Unable to load native-hadoop 
 library for your platform... using builtin-java classes where 
 applicable15/03/05 12:12:45 INFO spark.HttpFileServer: HTTP File server 
 directory is 
 /tmp/spark-5d7da34c-58d4-4d60-9b6a-3dce43cab39e/spark-4d65aacb-78bd-40fd-b6c0-53b47e28819915/03/05
  12:12:45 INFO spark.HttpServer: Starting HTTP Server15/03/05 12:12:45 INFO 
 server.Server: jetty-8.y.z-SNAPSHOT15/03/05 12:12:45 INFO 
 server.AbstractConnector: Started SocketConnector@0.0.0.0:5645215/03/05 
 12:12:45 INFO util.Utils: Successfully started service 'HTTP file server' on 
 port 56452.15/03/05 12:12:45 INFO server.Server: jetty-8.y.z-SNAPSHOT15/03/05 
 12:12:45 INFO server.AbstractConnector: Started 
 SelectChannelConnector@0.0.0.0:404015/03/05 12:12:45 INFO util.Utils: 
 Successfully started service 'SparkUI' on port 4040.15/03/05 12:12:45 INFO 
 ui.SparkUI: Started SparkUI at 
 http://hadoopdev01.opsdatastore.com:404015/03/05 12:12:46 INFO 
 impl.TimelineClientImpl: Timeline service address: 
 http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
 java.lang.NoClassDefFoundError: 
 org/codehaus/jackson/map/deser/std/StdDeserializer
 at java.lang.ClassLoader.defineClass1(Native Method)
 at 

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
 [tail of the FileInputFormat#listStatus snippet quoted in full further down
 this thread - the closing braces at FileInputFormat.java lines 225-226]

 However, when invoking sc.textFile there are "not a file" errors on directory
 entries.  This behavior is confusing, given that the proper support appears to
 be in place for handling directories.


 2015-03-03 15:04 GMT-08:00 Sean Owen so...@cloudera.com:

 This API reads a directory of files, not one file. A file here
 really means a directory full of part-* files. You do not need to read
 those separately.

 Any syntax that works with Hadoop's FileInputFormat should work. I
 thought you could specify a comma-separated list of paths? maybe I am
 imagining that.

 On Tue, Mar 3, 2015 at 10:57 PM, S. Zhou myx...@yahoo.com.invalid
 wrote:
  Thanks Ted. Actually a follow up question. I need to read multiple HDFS
  files into RDD. What I am doing now is: for each file I read them into a
  RDD. Then later on I union all these RDDs into one RDD. I am not sure
 if it
  is the best way to do it.
 
  Thanks
  Senqiang
 
 
  On Tuesday, March 3, 2015 2:40 PM, Ted Yu yuzhih...@gmail.com wrote:
 
 
  Looking at scaladoc:
 
    /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
    def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]
 
  Your conclusion is confirmed.

 
  On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou myx...@yahoo.com.invalid
 wrote:
 
  I did some experiments and it seems not. But I like to get confirmation
 (or
  perhaps I missed something). If it does support, could u let me know
 how to
  specify multiple folders? Thanks.
 
  Senqiang
 
 
 
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Thanks for the confirmation, Stephen.

On Tue, Mar 3, 2015 at 3:53 PM, Stephen Boesch java...@gmail.com wrote:

 Thanks, I was looking at an old version of FileInputFormat..

 BEFORE setting the recursive config
 (mapreduce.input.fileinputformat.input.dir.recursive):

 scala> sc.textFile("dev/*").count
 java.io.IOException: Not a file:
 file:/shared/sparkup/dev/audit-release/blank_maven_build

 The default is null/not set, which is evaluated as false:

 scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
 res1: String = null


 AFTER:


 Now set the value:

 scala> sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

 scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
 res4: String = true


 scala> sc.textFile("dev/*").count
 ...
 res5: Long = 3481


 So it works.

 2015-03-03 15:26 GMT-08:00 Ted Yu yuzhih...@gmail.com:

 Looking at FileInputFormat#listStatus():

 // Whether we need to recursive look into the directory structure

 boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);

 where:

   public static final String INPUT_DIR_RECURSIVE =
     "mapreduce.input.fileinputformat.input.dir.recursive";

 FYI

 On Tue, Mar 3, 2015 at 3:14 PM, Stephen Boesch java...@gmail.com wrote:


 The sc.textFile() call invokes the Hadoop FileInputFormat via the (subclass)
 TextInputFormat.  Inside, the logic to do the recursive directory reading does
 exist - i.e. first detecting whether an entry is a directory and, if so,
 descending:

  for (FileStatus globStat: matches) {
    if (globStat.isDir()) {
      for (FileStatus stat: fs.listStatus(globStat.getPath(), inputFilter)) {
        result.add(stat);
      }
    }

Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?

2015-03-03 Thread Ted Yu
Looking at scaladoc:

 /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
  def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]

Your conclusion is confirmed.
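
That said, a few patterns for reading several inputs without a manual
per-file union (a sketch; paths are illustrative, and as Sean notes elsewhere
in this thread, the comma-separated form depends on the underlying
FileInputFormat accepting it):

val a = sc.textFile("/data/dir1,/data/dir2,/data/dir3")                 // comma-separated list
val b = sc.textFile("/data/2015-03-*/part-*")                           // glob pattern
val c = sc.union(Seq("/data/dir1", "/data/dir2").map(sc.textFile(_)))   // explicit union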

On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou myx...@yahoo.com.invalid wrote:

 I did some experiments and it seems not. But I like to get confirmation
 (or perhaps I missed something). If it does support, could u let me know
 how to specify multiple folders? Thanks.

 Senqiang




Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Ted Yu
Having good out-of-box experience is desirable.

+1 on increasing the default.


On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote:

 There was a recent discussion about whether to increase or indeed make
 configurable this kind of default fraction. I believe the suggestion
 there too was that 9-10% is a safer default.

 Advanced users can lower the resulting overhead value; it may still
 have to be increased in some cases, but a fatter default may make this
 kind of surprise less frequent.

 I'd support increasing the default; any other thoughts?

 On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers ko...@tresata.com wrote:
  hey,
  running my first map-red like (meaning disk-to-disk, avoiding in memory
  RDDs) computation in spark on yarn i immediately got bitten by a too low
  spark.yarn.executor.memoryOverhead. however it took me about an hour to
 find
  out this was the cause. at first i observed failing shuffles leading to
  restarting of tasks, then i realized this was because executors could
 not be
  reached, then i noticed in containers got shut down and reallocated in
  resourcemanager logs (no mention of errors, it seemed the containers
  finished their business and shut down successfully), and finally i found
 the
  reason in nodemanager logs.
 
  i dont think this is a pleasent first experience. i realize
  spark.yarn.executor.memoryOverhead needs to be set differently from
  situation to situation. but shouldnt the default be a somewhat higher
 value
  so that these errors are unlikely, and then the experts that are willing
 to
  deal with these errors can tune it lower? so why not make the default 10%
  instead of 7%? that gives something that works in most situations out of
 the
  box (at the cost of being a little wasteful). it worked for me.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-02-28 Thread Ted Yu
Have you verified that spark-catalyst_2.10 jar was in the classpath ?

Cheers

On Sat, Feb 28, 2015 at 9:18 AM, Ashish Nigam ashnigamt...@gmail.com
wrote:

 Hi,
 I wrote a very simple program in scala to convert an existing RDD to
 SchemaRDD.
 But createSchemaRDD function is throwing exception

 Exception in thread main scala.ScalaReflectionException: class
 org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial
 classloader with boot classpath [.] not found


 Here's more info on the versions I am using -

  <scala.binary.version>2.11</scala.binary.version>
  <spark.version>1.2.1</spark.version>
  <scala.version>2.11.5</scala.version>

 Please let me know how can I resolve this problem.

 Thanks
 Ashish



Re: Tools to manage workflows on Spark

2015-02-28 Thread Ted Yu
Here was latest modification in spork repo:
Mon Dec 1 10:08:19 2014

Not sure if it is being actively maintained.

On Sat, Feb 28, 2015 at 6:26 PM, Qiang Cao caoqiang...@gmail.com wrote:

 Thanks for the pointer, Ashish! I was also looking at Spork
 https://github.com/sigmoidanalytics/spork Pig-on-Spark), but wasn't sure
 if that's the right direction.

 On Sat, Feb 28, 2015 at 6:36 PM, Ashish Nigam ashnigamt...@gmail.com
 wrote:

 You have to call spark-submit from oozie.
 I used this link to get the idea for my implementation -


 http://mail-archives.apache.org/mod_mbox/oozie-user/201404.mbox/%3CCAHCsPn-0Grq1rSXrAZu35yy_i4T=fvovdox2ugpcuhkwmjp...@mail.gmail.com%3E



 On Feb 28, 2015, at 3:25 PM, Qiang Cao caoqiang...@gmail.com wrote:

 Thanks, Ashish! Is Oozie integrated with Spark? I knew it can accommodate
 some Hadoop jobs.


 On Sat, Feb 28, 2015 at 6:07 PM, Ashish Nigam ashnigamt...@gmail.com
 wrote:

 Qiang,
 Did you look at Oozie?
 We use oozie to run spark jobs in production.


 On Feb 28, 2015, at 2:45 PM, Qiang Cao caoqiang...@gmail.com wrote:

 Hi Everyone,

 We need to deal with workflows on Spark. In our scenario, each workflow
 consists of multiple processing steps. Among different steps, there could
 be dependencies.  I'm wondering if there are tools available that can
 help us schedule and manage workflows on Spark. I'm looking for something
 like pig on Hadoop, but it should fully function on Spark.

 Any suggestion?

 Thanks in advance!

 Qiang








Re: unsafe memory access in spark 1.2.1

2015-03-01 Thread Ted Yu
Google led me to:
https://bugs.openjdk.java.net/browse/JDK-8040802

Not sure if the last comment there applies to your deployment.

On Sun, Mar 1, 2015 at 8:32 AM, Zalzberg, Idan (Agoda) 
idan.zalzb...@agoda.com wrote:

  My run time version is:



 java version 1.7.0_75

 OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)

 OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)



 Thanks



 *From:* Ted Yu [mailto:yuzhih...@gmail.com]
 *Sent:* Sunday, March 01, 2015 10:18 PM
 *To:* Zalzberg, Idan (Agoda)
 *Cc:* user@spark.apache.org
 *Subject:* Re: unsafe memory access in spark 1.2.1



 What Java version are you using ?



 Thanks



 On Sun, Mar 1, 2015 at 7:03 AM, Zalzberg, Idan (Agoda) 
 idan.zalzb...@agoda.com wrote:

  Hi,

 I am using spark 1.2.1, sometimes I get these errors sporadically:

 Any thought on what could be the cause?

 Thanks



 2015-02-27 15:08:47 ERROR SparkUncaughtExceptionHandler:96 - Uncaught
 exception in thread Thread[Executor task launch worker-25,5,main]

 java.lang.InternalError: a fault occurred in a recent unsafe memory access
 operation in compiled Java code

 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)

 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

 at
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)

 at
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

 at
 scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)

 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)

 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)

 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

 at org.apache.spark.scheduler.Task.run(Task.scala:56)

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)




  --
  This message is confidential and is for the sole use of the intended
 recipient(s). It may also be privileged or otherwise protected by copyright
 or other legal rules. If you have received it by mistake please let us know
 by reply email and delete it from your system. It is prohibited to copy
 this message or disclose its content to anyone. Any confidentiality or
 privilege is not waived or lost by any mistaken delivery or unauthorized
 disclosure of the message. All messages sent to and from Agoda may be
 monitored to ensure compliance with company policies, to protect the
 company's interests and to remove potential malware. Electronic messages
 may be intercepted, amended, lost or deleted, or contain viruses.





Re: unsafe memory access in spark 1.2.1

2015-03-01 Thread Ted Yu
What Java version are you using ?

Thanks

On Sun, Mar 1, 2015 at 7:03 AM, Zalzberg, Idan (Agoda) 
idan.zalzb...@agoda.com wrote:

  Hi,

 I am using spark 1.2.1, sometimes I get these errors sporadically:

 Any thought on what could be the cause?

 Thanks



 2015-02-27 15:08:47 ERROR SparkUncaughtExceptionHandler:96 - Uncaught
 exception in thread Thread[Executor task launch worker-25,5,main]

 java.lang.InternalError: a fault occurred in a recent unsafe memory access
 operation in compiled Java code

 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)

 at
 java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)

 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)

 at
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)

 at
 org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)

 at
 scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)

 at
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)

 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)

 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)

 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)

 at org.apache.spark.scheduler.Task.run(Task.scala:56)

 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)

 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)



 --
 This message is confidential and is for the sole use of the intended
 recipient(s). It may also be privileged or otherwise protected by copyright
 or other legal rules. If you have received it by mistake please let us know
 by reply email and delete it from your system. It is prohibited to copy
 this message or disclose its content to anyone. Any confidentiality or
 privilege is not waived or lost by any mistaken delivery or unauthorized
 disclosure of the message. All messages sent to and from Agoda may be
 monitored to ensure compliance with company policies, to protect the
 company's interests and to remove potential malware. Electronic messages
 may be intercepted, amended, lost or deleted, or contain viruses.



Re: How to parse Json formatted Kafka message in spark streaming

2015-03-05 Thread Ted Yu
Cui:
You can check messages.partitions.size to determine whether messages is an
empty RDD.
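
For example, a sketch of guarding the foreachRDD body from the question below
(it also prints the inferred column names, for the second question):

messages.foreachRDD { rdd =>
  val message = rdd.map { case (_, v) => v }
  if (rdd.partitions.size > 0 && message.take(1).nonEmpty) {
    val table = sqlContext.jsonRDD(message)
    println(table.schema.fieldNames.mkString(", "))   // all column names inferred from the JSON
    table.registerTempTable("tempTable")
    sqlContext.sql("SELECT time, To FROM tempTable")
      .saveToCassandra("cassandra_keyspace", "cassandra_table", SomeColumns("key", "msg"))
  }
}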

Cheers

On Thu, Mar 5, 2015 at 12:52 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 When you use KafkaUtils.createStream with StringDecoders, it will return
 String objects inside your messages stream. To access the elements from the
 json, you could do something like the following:


val mapStream = messages.map(x => {
   val mapper = new ObjectMapper() with ScalaObjectMapper
   mapper.registerModule(DefaultScalaModule)

   mapper.readValue[Map[String, Any]](x).get("time")
 })



 Thanks
 Best Regards

 On Thu, Mar 5, 2015 at 11:13 AM, Cui Lin cui@hds.com wrote:

   Friends,

   I'm trying to parse json formatted Kafka messages and then send back
 to cassandra.I have two problems:

1. I got the exception below. How to check an empty RDD?

  Exception in thread main java.lang.UnsupportedOperationException:
 empty collection
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:869)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:869)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:869)
  at org.apache.spark.sql.json.JsonRDD$.inferSchema(JsonRDD.scala:57)
  at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:232)
  at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:204)

  val messages = KafkaUtils.createStream[String, String, StringDecoder,
    StringDecoder](…)

  messages.foreachRDD { rdd =>
    val message: RDD[String] = rdd.map { y => y._2 }
    sqlContext.jsonRDD(message).registerTempTable("tempTable")
    sqlContext.sql("SELECT time, To FROM tempTable")
      .saveToCassandra("cassandra_keyspace", "cassandra_table", SomeColumns("key", "msg"))
  }


  2. how to get all column names from json messages? I have hundreds of
 columns in the json formatted message.

  Thanks for your help!




  Best regards,

  Cui Lin





Re: How to integrate HBASE on Spark

2015-02-23 Thread Ted Yu
Installing hbase on hadoop cluster would allow hbase to utilize features
provided by hdfs, such as short circuit read (See '90.2. Leveraging local
data' under http://hbase.apache.org/book.html#perf.hdfs).

Cheers

On Sun, Feb 22, 2015 at 11:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 If you are having both the clusters on the same network, then i'd suggest
 you installing it on the hadoop cluster. If you install it on the spark
 cluster itself, then hbase might take up a few cpu cycles and there's a
 chance for the job to lag.

 Thanks
 Best Regards

 On Mon, Feb 23, 2015 at 12:48 PM, sandeep vura sandeepv...@gmail.com
 wrote:

 Hi

 I had installed spark on 3 node cluster. Spark services are up and
 running.But i want to integrate hbase on spark

 Do i need to install HBASE on hadoop cluster or spark cluster.

 Please let me know asap.

 Regards,
 Sandeep.v





Re: Spark excludes fastutil dependencies we need

2015-02-25 Thread Ted Yu
Interesting. Looking at SparkConf.scala :

val configs = Seq(
  DeprecatedConfig("spark.files.userClassPathFirst",
    "spark.executor.userClassPathFirst",
    "1.3"),
  DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3",
    "Use spark.{driver,executor}.userClassPathFirst instead."))

It seems spark.files.userClassPathFirst and spark.yarn.user.classpath.first
are deprecated.

On Wed, Feb 25, 2015 at 12:39 AM, Sean Owen so...@cloudera.com wrote:

 No, we should not add fastutil back. It's up to the app to bring
 dependencies it needs, and that's how I understand this issue. The
 question is really, how to get the classloader visibility right. It
 depends on where you need these classes. Have you looked into
 spark.files.userClassPathFirst and spark.yarn.user.classpath.first ?

 On Wed, Feb 25, 2015 at 5:34 AM, Ted Yu yuzhih...@gmail.com wrote:
  bq. depend on missing fastutil classes like Long2LongOpenHashMap
 
  Looks like Long2LongOpenHashMap should be added to the shaded jar.
 
  Cheers
 
  On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com
 wrote:
 
  Spark includes the clearspring analytics package but intentionally
  excludes
  the dependencies of the fastutil package (see below).
 
  Spark includes parquet-column which includes fastutil and relocates it
  under
  parquet/
  but creates a shaded jar file which is incomplete because it shades out
  some
  of
  the fastutil classes, notably Long2LongOpenHashMap, which is present in
  the
  fastutil jar file that parquet-column is referencing.
 
  We are using more of the clearspring classes (e.g. QDigest) and those do
  depend on
  missing fastutil classes like Long2LongOpenHashMap.
 
  Even though I add them to our assembly jar file, the class loader finds
  the
  spark assembly
  and we get runtime class loader errors when we try to use it.
 
  It is possible to put our jar file first, as described here:
https://issues.apache.org/jira/browse/SPARK-939
 
 
 http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment
 
  which I tried with args to spark-submit:
--conf spark.driver.userClassPathFirst=true  --conf
  spark.executor.userClassPathFirst=true
  but we still get the class not found error.
 
  We have tried copying the source code for clearspring into our package
 and
  renaming the
  package and that makes it appear to work...  Is this risky?  It
 certainly
  is
  ugly.
 
  Can anyone recommend a way to deal with this dependency **ll ?
 
 
  === The spark/pom.xml file contains the following lines:
 
    <dependency>
      <groupId>com.clearspring.analytics</groupId>
      <artifactId>stream</artifactId>
      <version>2.7.0</version>
      <exclusions>
        <exclusion>
          <groupId>it.unimi.dsi</groupId>
          <artifactId>fastutil</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
 
  === The parquet-column/pom.xml file contains:
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <minimizeJar>true</minimizeJar>
        <artifactSet>
          <includes>
            <include>it.unimi.dsi:fastutil</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi</pattern>
            <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-excludes-fastutil-dependencies-we-need-tp21794.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 



Re: [Spark SQL]: Convert SchemaRDD back to RDD

2015-02-22 Thread Ted Yu
Haven't found the method in
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD

The new DataFrame has this method:

  /**
   * Returns the content of the [[DataFrame]] as an [[RDD]] of [[Row]]s.
   * @group rdd
   */
  def rdd: RDD[Row] = {

FYI
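
For example (a sketch; the table, case class, and column types are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

case class Person(name: String, age: Int)

val df = sqlContext.table("people")            // DataFrame in 1.3
val people: RDD[Person] = df.rdd.map {
  case Row(name: String, age: Int) => Person(name, age)
}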

On Sun, Feb 22, 2015 at 11:51 AM, stephane.collot stephane.col...@gmail.com
 wrote:

 Hi Michael,

 I think that the feature (convert a SchemaRDD to a structured class RDD) is
 now available. But I didn't understand in the PR how exactly to do this.
 Can
 you give an example or doc links?

 Best regards



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-tp9071p21753.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Posting to the list

2015-02-22 Thread Ted Yu
bq. i didnt get any new subscription mail in my inbox.

Have you checked your Spam folder ?

Cheers

On Sun, Feb 22, 2015 at 2:36 PM, hnahak harihar1...@gmail.com wrote:

 I'm also facing the same issue, this is third time whenever I post anything
 it never accept by the community and at the same time got a failure mail in
 my register mail id.

 and when click to subscribe to this mailing list link, i didnt get any
 new
 subscription mail in my inbox.

 Please anyone suggest a best way to subscribed the email ID



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Launching Spark cluster on EC2 with Ubuntu AMI

2015-02-22 Thread Ted Yu
bq. bash: git: command not found

Looks like the AMI doesn't have git pre-installed.

Cheers
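
spark-ec2 shells into every node and runs git clone, so git has to be present
on the image already. A minimal way to prepare a custom Ubuntu AMI, assuming
apt is available (a hypothetical sketch, not part of the original reply):

  # on the instance used to bake the AMI
  sudo apt-get update && sudo apt-get install -y git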

On Sun, Feb 22, 2015 at 4:29 PM, olegshirokikh o...@solver.com wrote:

 I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu)
 using
 the following:

 ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem'
 --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2
 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch
 spark-ubuntu-cluster

 Everything starts OK and instances are launched:

 Found 1 master(s), 2 slaves
 Waiting for all instances in cluster to enter 'ssh-ready' state.
 Generating cluster's SSH key on master.

 But then I'm getting the following SSH errors until it stops trying and
 quits:

 bash: git: command not found
 Connection to ***.us-west-2.compute.amazonaws.com closed.
 Error executing remote command, retrying after 30 seconds: Command '['ssh',
 '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o',
 'UserKnownHostsFile=/dev/null', '-t', '-t',
 u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 && git
 clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero exit
 status 127

 I know that Spark EC2 scripts are not guaranteed to work with custom AMIs
 but still, it should work... Any advice would be greatly appreciated!




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Getting to proto buff classes in Spark Context

2015-02-23 Thread Ted Yu
bq. Caused by: java.lang.ClassNotFoundException: com.rick.reports.Reports$
SensorReports

Is Reports$SensorReports class in rick-processors-assembly-1.0.jar ?

Thanks

On Mon, Feb 23, 2015 at 8:43 PM, necro351 . necro...@gmail.com wrote:

 Hello,

 I am trying to deserialize some data encoded using proto buff from within
 Spark and am getting class-not-found exceptions. I have narrowed the
 program down to something very simple that shows the problem exactly (see
 'The Program' below) and hopefully someone can tell me the easy fix :)

 So the situation is I have some proto buff reports in /tmp/reports. I also
 have a Spark project with the below Scala code (under The Program) as well
 as a Java file defining SensorReports all in the same src sub-tree in my
 Spark project. Its built using sbt in the standard way. The Spark job reads
 in the reports from /tmp/reports and then prints them to the console. When
 I build and run my spark job with spark-submit everything works as expected
 and the reports are printed out. When I uncomment the 'XXX' variant in the
 Scala spark program and try to print the reports from within a Spark
 Context I get the class-not-found exceptions. I don't understand why. If I
 get this working then I will want to do more than just print the reports
 from within the Spark Context.

 My read of the documentation tells me that my spark job should have access
 to everything in the submitted jar and that jar includes the Java code
 generated by the proto buff library which defines SensorReports. This is
 the spark-submit invocation I use after building my job as an assembly with
 the sbt-assembly plugin:

 spark-submit --class com.rick.processors.NewReportProcessor --master
 local[*]
 ../../../analyzer/spark/target/scala-2.10/rick-processors-assembly-1.0.jar

 I have also tried adding the jar programmatically using sc.addJar but that
 does not help. I found a bug from July (
 https://github.com/apache/spark/pull/181) that seems related but it went
 into Spark 1.2.0 (which is what I am currently using) so I don't think
 that's it.

 Any ideas? Thanks!

 The Program:
 ==
  package com.rick.processors

  import java.io.File
  import java.nio.file.{Path, Files, FileSystems}
  import org.apache.spark.{SparkContext, SparkConf}
  import com.rick.reports.Reports.SensorReports

  object NewReportProcessor {
    private val sparkConf = new SparkConf().setAppName("ReportProcessor")
    private val sc = new SparkContext(sparkConf)

    def main(args: Array[String]) = {
      val protoBuffsBinary = localFileReports()
      val sensorReportsBundles = protoBuffsBinary.map(bundle =>
        SensorReports.parseFrom(bundle))
      // XXX: Printing from within the SparkContext throws class-not-found
      // exceptions, why?
      // sc.makeRDD(sensorReportsBundles).foreach((x: SensorReports) =>
      //   println(x.toString))
      sensorReportsBundles.foreach((x: SensorReports) =>
        println(x.toString))
    }

    private def localFileReports() = {
      val reportDir = new File("/tmp/reports")
      val reportFiles =
        reportDir.listFiles.filter(_.getName.endsWith(".report"))
      reportFiles.map(file => {
        val path = FileSystems.getDefault().getPath("/tmp/reports",
          file.getName())
        Files.readAllBytes(path)
      })
    }
  }

 The Class-not-found exceptions:
 =
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/02/23 17:35:03 WARN Utils: Your hostname, ubuntu resolves to a loopback
 address: 127.0.1.1; using 192.168.241.128 instead (on interface eth0)
 15/02/23 17:35:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
 another address
 15/02/23 17:35:04 INFO SecurityManager: Changing view acls to: rick
 15/02/23 17:35:04 INFO SecurityManager: Changing modify acls to: rick
 15/02/23 17:35:04 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(rick); users
 with modify permissions: Set(rick)
 15/02/23 17:35:04 INFO Slf4jLogger: Slf4jLogger started
 15/02/23 17:35:04 INFO Remoting: Starting remoting
 15/02/23 17:35:04 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@192.168.241.128:38110]
 15/02/23 17:35:04 INFO Utils: Successfully started service 'sparkDriver'
 on port 38110.
 15/02/23 17:35:04 INFO SparkEnv: Registering MapOutputTracker
 15/02/23 17:35:04 INFO SparkEnv: Registering BlockManagerMaster
 15/02/23 17:35:04 INFO DiskBlockManager: Created local directory at
 /tmp/spark-local-20150223173504-b26c
 15/02/23 17:35:04 INFO MemoryStore: MemoryStore started with capacity
 267.3 MB
 15/02/23 17:35:05 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable
 15/02/23 17:35:05 INFO HttpFileServer: HTTP File server directory is
 /tmp/spark-c77dbc9a-d626-4991-a9b7-f593acafbe64
 15/02/23 17:35:05 INFO 

Re: InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag

2015-02-23 Thread Ted Yu
bq. have installed hadoop on a local virtual machine

Can you tell us the release of hadoop you installed ?

What Spark release are you using ? Or be more specific, what hadoop release
was the Spark built against ?

Cheers

On Mon, Feb 23, 2015 at 9:37 PM, fanooos dev.fano...@gmail.com wrote:

 Hi

 I have installed hadoop on a local virtual machine using the steps from
 this
 URL


 https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10

 In the local machine I write a little Spark application in java to read a
 file from the hadoop instance installed in the virtual machine.

 The code is below

  public static void main(String[] args) {
      JavaSparkContext sc = new JavaSparkContext(new
          SparkConf().setAppName("Spark Count").setMaster("local"));

      JavaRDD<String> lines =
          sc.textFile("hdfs://10.62.57.141:50070/tmp/lines.txt");
      JavaRDD<Integer> lengths = lines.flatMap(new FlatMapFunction<String,
          Integer>() {
          @Override
          public Iterable<Integer> call(String t) throws Exception {
              return Arrays.asList(t.length());
          }
      });
      List<Integer> collect = lengths.collect();
      int totalLength = lengths.reduce(new Function2<Integer, Integer,
          Integer>() {
          @Override
          public Integer call(Integer v1, Integer v2) throws Exception {
              return v1 + v2;
          }
      });
      System.out.println(totalLength);
  }


 The application throws this exception

 Exception in thread main java.io.IOException: Failed on local
 exception: com.google.protobuf.InvalidProtocolBufferException: Protocol
 message end-group tag did not match expected tag.; Host Details : local
 host
 is: TOSHIBA-PC/192.168.56.1; destination host is: 10.62.57.141:50070;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
 at org.apache.hadoop.ipc.Client.call(Client.java:1351)
 at org.apache.hadoop.ipc.Client.call(Client.java:1300)
 at

 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
 at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
 at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
 at

 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
 at
 org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
 at

 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
 at

 org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
 at

 org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
 at

 org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
 at
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
 at

 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
 at

 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
 at
 org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at
 org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:797)
 at
 

Re: Does Spark Streaming depend on Hadoop?

2015-02-23 Thread Ted Yu
Can you pastebin the whole stack trace ?

Thanks



 On Feb 23, 2015, at 6:14 PM, bit1...@163.com bit1...@163.com wrote:
 
 Hi,
 
 When I submit a spark streaming application with following script,
 
 ./spark-submit --name MyKafkaWordCount --master local[20] --executor-memory 
 512M --total-executor-cores 2 --class 
 spark.examples.streaming.MyKafkaWordCount  my.kakfa.wordcountjar
 
 An exception occurs:
 Exception in thread main java.net.ConnectException: Call From 
 hadoop.master/192.168.26.137 to hadoop.master:9000 failed on connection 
 exception.
 
 From the exception, it tries to connect to 9000 which is for Hadoop/HDFS. and 
 I don't use Hadoop at all in my code(such as save to HDFS).
 
 
 
 bit1...@163.com


Re: Getting to proto buff classes in Spark Context

2015-02-23 Thread Ted Yu
The classname given in stack trace was com.rick.reports.Reports

In the output from jar command the class is com.defend7.reports.Reports.

FYI

On Mon, Feb 23, 2015 at 9:33 PM, necro351 . necro...@gmail.com wrote:

 Hi Ted,

 Yes it appears to be:
 rick@ubuntu:~/go/src/rick/sparksprint/containers/tests/StreamingReports$
 jar tvf
 ../../../analyzer/spark/target/scala-2.10/rick-processors-assembly-1.0.jar|grep
 SensorReports
   1128 Mon Feb 23 17:34:46 PST 2015
 com/defend7/reports/Reports$SensorReports$1.class
  13507 Mon Feb 23 17:34:46 PST 2015
 com/defend7/reports/Reports$SensorReports$Builder.class
  10640 Mon Feb 23 17:34:46 PST 2015
 com/defend7/reports/Reports$SensorReports.class
815 Mon Feb 23 17:34:46 PST 2015
 com/defend7/reports/Reports$SensorReportsOrBuilder.class


 On Mon Feb 23 2015 at 8:57:18 PM Ted Yu yuzhih...@gmail.com wrote:

 bq. Caused by: java.lang.ClassNotFoundException: com.rick.reports.
 Reports$SensorReports

 Is Reports$SensorReports class in rick-processors-assembly-1.0.jar ?

 Thanks

 On Mon, Feb 23, 2015 at 8:43 PM, necro351 . necro...@gmail.com wrote:

 Hello,

 I am trying to deserialize some data encoded using proto buff from
 within Spark and am getting class-not-found exceptions. I have narrowed the
 program down to something very simple that shows the problem exactly (see
 'The Program' below) and hopefully someone can tell me the easy fix :)

 So the situation is I have some proto buff reports in /tmp/reports. I
 also have a Spark project with the below Scala code (under The Program) as
 well as a Java file defining SensorReports all in the same src sub-tree in
 my Spark project. Its built using sbt in the standard way. The Spark job
 reads in the reports from /tmp/reports and then prints them to the console.
 When I build and run my spark job with spark-submit everything works as
 expected and the reports are printed out. When I uncomment the 'XXX'
 variant in the Scala spark program and try to print the reports from within
 a Spark Context I get the class-not-found exceptions. I don't understand
 why. If I get this working then I will want to do more than just print the
 reports from within the Spark Context.

 My read of the documentation tells me that my spark job should have
 access to everything in the submitted jar and that jar includes the Java
 code generated by the proto buff library which defines SensorReports. This
 is the spark-submit invocation I use after building my job as an assembly
 with the sbt-assembly plugin:

 spark-submit --class com.rick.processors.NewReportProcessor --master
 local[*]
 ../../../analyzer/spark/target/scala-2.10/rick-processors-assembly-1.0.jar

 I have also tried adding the jar programmatically using sc.addJar but
 that does not help. I found a bug from July (
 https://github.com/apache/spark/pull/181) that seems related but it
 went into Spark 1.2.0 (which is what I am currently using) so I don't think
 that's it.

 Any ideas? Thanks!

 The Program:
 ==
  package com.rick.processors

  import java.io.File
  import java.nio.file.{Path, Files, FileSystems}
  import org.apache.spark.{SparkContext, SparkConf}
  import com.rick.reports.Reports.SensorReports

  object NewReportProcessor {
    private val sparkConf = new SparkConf().setAppName("ReportProcessor")
    private val sc = new SparkContext(sparkConf)

    def main(args: Array[String]) = {
      val protoBuffsBinary = localFileReports()
      val sensorReportsBundles = protoBuffsBinary.map(bundle =>
        SensorReports.parseFrom(bundle))
      // XXX: Printing from within the SparkContext throws class-not-found
      // exceptions, why?
      // sc.makeRDD(sensorReportsBundles).foreach((x: SensorReports) =>
      //   println(x.toString))
      sensorReportsBundles.foreach((x: SensorReports) =>
        println(x.toString))
    }

    private def localFileReports() = {
      val reportDir = new File("/tmp/reports")
      val reportFiles =
        reportDir.listFiles.filter(_.getName.endsWith(".report"))
      reportFiles.map(file => {
        val path = FileSystems.getDefault().getPath("/tmp/reports",
          file.getName())
        Files.readAllBytes(path)
      })
    }
  }

 The Class-not-found exceptions:
 =
 Spark assembly has been built with Hive, including Datanucleus jars on
 classpath
 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties
 15/02/23 17:35:03 WARN Utils: Your hostname, ubuntu resolves to a
 loopback address: 127.0.1.1; using 192.168.241.128 instead (on interface
 eth0)
 15/02/23 17:35:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
 another address
 15/02/23 17:35:04 INFO SecurityManager: Changing view acls to: rick
 15/02/23 17:35:04 INFO SecurityManager: Changing modify acls to: rick
 15/02/23 17:35:04 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(rick); users
 with modify permissions: Set(rick)
 15/02/23 17:35:04 INFO Slf4jLogger: Slf4jLogger started
 15/02

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Ted Yu
Here is a tool which may give you some clue:
http://file-leak-detector.kohsuke.org/

Cheers
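
A complementary quick check (my suggestion, not a feature of the tool above):
lsof on a worker node can list files that are deleted but still held open,
which is exactly the leak pattern Vladimir describes below:

  # +L1 lists open files whose link count is below 1, i.e. deleted but still open
  lsof +L1 | grep /vol0/spark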

On Tue, Feb 24, 2015 at 11:04 AM, Vladimir Rodionov 
vrodio...@splicemachine.com wrote:

 Usually it happens in Linux when application deletes file w/o double
 checking that there are no open FDs (resource leak). In this case, Linux
 holds all space allocated and does not release it until application exits
 (crashes in your case). You check file system and everything is normal, you
 have enough space and you have no idea why does application report no
 space left on device.

 Just a guess.

 -Vladimir Rodionov

 On Tue, Feb 24, 2015 at 8:34 AM, Joe Wass jw...@crossref.org wrote:

 I'm running a cluster of 3 Amazon EC2 machines (small number because it's
 expensive when experiments keep crashing after a day!).

 Today's crash looks like this (stacktrace at end of message).
 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle 0

 On my three nodes, I have plenty of space and inodes:

  A $ df -i
  Filesystem      Inodes  IUsed      IFree IUse% Mounted on
  /dev/xvda1      524288  97937     426351   19% /
  tmpfs          1909200      1    1909199    1% /dev/shm
  /dev/xvdb      2457600     54    2457546    1% /mnt
  /dev/xvdc      2457600     24    2457576    1% /mnt2
  /dev/xvds    831869296  23844  831845452    1% /vol0

  A $ df -h
  Filesystem    Size  Used Avail Use% Mounted on
  /dev/xvda1    7.9G  3.4G  4.5G  44% /
  tmpfs         7.3G     0  7.3G   0% /dev/shm
  /dev/xvdb      37G  1.2G   34G   4% /mnt
  /dev/xvdc      37G  177M   35G   1% /mnt2
  /dev/xvds    1000G  802G  199G  81% /vol0

  B $ df -i
  Filesystem      Inodes  IUsed      IFree IUse% Mounted on
  /dev/xvda1      524288  97947     426341   19% /
  tmpfs          1906639      1    1906638    1% /dev/shm
  /dev/xvdb      2457600     54    2457546    1% /mnt
  /dev/xvdc      2457600     24    2457576    1% /mnt2
  /dev/xvds    816200704  24223  816176481    1% /vol0

  B $ df -h
  Filesystem    Size  Used Avail Use% Mounted on
  /dev/xvda1    7.9G  3.6G  4.3G  46% /
  tmpfs         7.3G     0  7.3G   0% /dev/shm
  /dev/xvdb      37G  1.2G   34G   4% /mnt
  /dev/xvdc      37G  177M   35G   1% /mnt2
  /dev/xvds    1000G  805G  195G  81% /vol0

  C $ df -i
  Filesystem      Inodes  IUsed      IFree IUse% Mounted on
  /dev/xvda1      524288  97938     426350   19% /
  tmpfs          1906897      1    1906896    1% /dev/shm
  /dev/xvdb      2457600     54    2457546    1% /mnt
  /dev/xvdc      2457600     24    2457576    1% /mnt2
  /dev/xvds    755218352  24024  755194328    1% /vol0
  [root@ip-10-204-136-223 ~]$

  C $ df -h
  Filesystem    Size  Used Avail Use% Mounted on
  /dev/xvda1    7.9G  3.4G  4.5G  44% /
  tmpfs         7.3G     0  7.3G   0% /dev/shm
  /dev/xvdb      37G  1.2G   34G   4% /mnt
  /dev/xvdc      37G  177M   35G   1% /mnt2
  /dev/xvds    1000G  820G  181G  82% /vol0

 The devices may be ~80% full but that still leaves ~200G free on each. My
 spark-env.sh has

 export SPARK_LOCAL_DIRS=/vol0/spark

 I have manually verified that on each slave the only temporary files are
 stored on /vol0, all looking something like this


 /vol0/spark/spark-f05d407c/spark-fca3e573/spark-78c06215/spark-4f0c4236/20/rdd_8_884

 So it looks like all the files are being stored on the large drives
 (incidentally they're AWS EBS volumes, but that's the only way to get
 enough storage). My process crashed before with a slightly different
 exception under the same circumstances: kryo.KryoException:
 java.io.IOException: No space left on device

 These both happen after several hours and several GB of temporary files.

 Why does Spark think it's run out of space?

 TIA

 Joe

 Stack trace 1:

 org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
 location for shuffle 0
 at
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
 at
 org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 at
 org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
 at
 org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
 at
 

Re: Perf Prediction

2015-02-21 Thread Ted Yu
Can you be a bit more specific ?

Are you asking about performance across Spark releases ?

Cheers

On Sat, Feb 21, 2015 at 6:38 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 Has some performance prediction work been done on Spark?

 Thank You




Re: Query data in Spark RRD

2015-02-21 Thread Ted Yu
Have you looked at
http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD
?

Cheers
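
As a rough sketch of that route (Spark 1.2 SchemaRDD API), each micro-batch can
be registered as a temp table and queried ad hoc from the dashboard; the Agg
case class, table name and dimension columns are assumptions for illustration:

  case class Agg(dim1: String, dim2: String, cnt: Long)

  aggStream.foreachRDD { rdd =>           // rdd: RDD[Agg]
    import sqlContext.createSchemaRDD     // implicit RDD[Agg] -> SchemaRDD
    rdd.registerTempTable("counts")
    val byDim1 = sqlContext.sql(
      "SELECT dim1, SUM(cnt) AS total FROM counts GROUP BY dim1")
    byDim1.collect().foreach(println)
  }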

On Sat, Feb 21, 2015 at 4:24 AM, Nikhil Bafna nikhil.ba...@flipkart.com
wrote:


 Hi.

 My use case is building a realtime monitoring system over
 multi-dimensional data.

 The way I'm planning to go about it is to use Spark Streaming to store
 aggregated count over all dimensions in 10 sec interval.

 Then, from a dashboard, I would be able to specify a query over some
 dimensions, which will need re-aggregation from the already computed job.

 My query is, how can I run dynamic queries over data in schema RDDs?

 --
 Nikhil Bafna



Re: upgrade to Spark 1.2.1

2015-02-25 Thread Ted Yu
Could this be caused by Spark using shaded Guava jar ?

Cheers

On Wed, Feb 25, 2015 at 3:26 PM, Pat Ferrel p...@occamsmachete.com wrote:

 Getting an error that confuses me. Running a largish app on a standalone
 cluster on my laptop. The app uses a guava HashBiMap as a broadcast value.
 With Spark 1.1.0 I simply registered the class and its serializer with kryo
 like this:

kryo.register(classOf[com.google.common.collect.HashBiMap[String,
 Int]], new JavaSerializer())

 And all was well. I’ve also tried addSerializer instead of register. Now I
 get a class not found during deserialization. I checked the jar list used
 to create the context and found the jar that contains HashBiMap but get
 this error. Any ideas:

 15/02/25 14:46:33 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
 4.0 (TID 8, 192.168.0.2): java.io.IOException:
 com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1093)
 at
 org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
 at
 org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
 at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
 at
 org.apache.mahout.drivers.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:95)
 at
 org.apache.mahout.drivers.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:94)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: com.esotericsoftware.kryo.KryoException: Error during Java
 deserialization.
 at
 com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
 at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 at
 org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
 at
 org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
 at
 org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
 ... 19 more

 == root error 
 Caused by: java.lang.ClassNotFoundException:
 com.google.common.collect.HashBiMap


 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:274)
 at
 java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
 at
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at
 java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at
 java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
 com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:40)
 ... 24 more



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Spark excludes fastutil dependencies we need

2015-02-25 Thread Ted Yu
Maybe drop the exclusion for parquet-provided profile ?

Cheers

On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner j...@cloudphysics.com wrote:

 Inline

 On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote:

 Interesting. Looking at SparkConf.scala :

 val configs = Seq(
   DeprecatedConfig(spark.files.userClassPathFirst,
 spark.executor.userClassPathFirst,
 1.3),
   DeprecatedConfig(spark.yarn.user.classpath.first, null, 1.3,
 Use spark.{driver,executor}.userClassPathFirst instead.))

 It seems spark.files.userClassPathFirst and spark.yarn.user.classpath.first
 are deprecated.


 Note that I did use the non-deprecated version, spark.executor.
 userClassPathFirst=true.



 On Wed, Feb 25, 2015 at 12:39 AM, Sean Owen so...@cloudera.com wrote:

 No, we should not add fastutil back. It's up to the app to bring
 dependencies it needs, and that's how I understand this issue. The
 question is really, how to get the classloader visibility right. It
 depends on where you need these classes. Have you looked into
 spark.files.userClassPathFirst and spark.yarn.user.classpath.first ?



 I noted that I tried this in my original email.

 The issue appears related to the fact that parquet is also creating a
 shaded
 jar and that one leaves out the Long2LongOpenHashMap class.

 FYI, I have subsequently tried removing the exclusion from the spark build
 and
 that does cause the fastutil classes to be included and the example
 works...

 So, should the userClassPathFirst flag work and there is a bug?

 Or is it reasonable to put in a pull request for the elimination of the
 exclusion?




 On Wed, Feb 25, 2015 at 5:34 AM, Ted Yu yuzhih...@gmail.com wrote:
  bq. depend on missing fastutil classes like Long2LongOpenHashMap
 
  Looks like Long2LongOpenHashMap should be added to the shaded jar.
 
  Cheers

 
  On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com
 wrote:
 
  Spark includes the clearspring analytics package but intentionally
  excludes
  the dependencies of the fastutil package (see below).
 
  Spark includes parquet-column which includes fastutil and relocates it
  under
  parquet/
  but creates a shaded jar file which is incomplete because it shades
 out
  some
  of
  the fastutil classes, notably Long2LongOpenHashMap, which is present
 in
  the
  fastutil jar file that parquet-column is referencing.
 
  We are using more of the clearspring classes (e.g. QDigest) and those
 do
  depend on
  missing fastutil classes like Long2LongOpenHashMap.
 
  Even though I add them to our assembly jar file, the class loader
 finds
  the
  spark assembly
  and we get runtime class loader errors when we try to use it.
 
  It is possible to put our jar file first, as described here:
https://issues.apache.org/jira/browse/SPARK-939
 
 
 http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment
 
  which I tried with args to spark-submit:
--conf spark.driver.userClassPathFirst=true  --conf
  spark.executor.userClassPathFirst=true
  but we still get the class not found error.
 
  We have tried copying the source code for clearspring into our
 package and
  renaming the
  package and that makes it appear to work...  Is this risky?  It
 certainly
  is
  ugly.
 
   Can anyone recommend a way to deal with this dependency hell?
 
 
  === The spark/pom.xml file contains the following lines:
 
     <dependency>
       <groupId>com.clearspring.analytics</groupId>
       <artifactId>stream</artifactId>
       <version>2.7.0</version>
       <exclusions>
         <exclusion>
           <groupId>it.unimi.dsi</groupId>
           <artifactId>fastutil</artifactId>
         </exclusion>
       </exclusions>
     </dependency>
 
  === The parquet-column/pom.xml file contains:
   <artifactId>maven-shade-plugin</artifactId>
   <executions>
     <execution>
       <phase>package</phase>
       <goals>
         <goal>shade</goal>
       </goals>
       <configuration>
         <minimizeJar>true</minimizeJar>
         <artifactSet>
           <includes>
             <include>it.unimi.dsi:fastutil</include>
           </includes>
         </artifactSet>
         <relocations>
           <relocation>
             <pattern>it.unimi.dsi</pattern>
             <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
           </relocation>
         </relocations>
       </configuration>
     </execution>
   </executions>
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-excludes-fastutil-dependencies-we-need-tp21794.html
  Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 






Re: Spark excludes fastutil dependencies we need

2015-02-24 Thread Ted Yu
bq. depend on missing fastutil classes like Long2LongOpenHashMap

Looks like Long2LongOpenHashMap should be added to the shaded jar.

Cheers

On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com wrote:

 Spark includes the clearspring analytics package but intentionally excludes
 the dependencies of the fastutil package (see below).

 Spark includes parquet-column which includes fastutil and relocates it
 under
 parquet/
 but creates a shaded jar file which is incomplete because it shades out
 some
 of
 the fastutil classes, notably Long2LongOpenHashMap, which is present in the
 fastutil jar file that parquet-column is referencing.

 We are using more of the clearspring classes (e.g. QDigest) and those do
 depend on
 missing fastutil classes like Long2LongOpenHashMap.

 Even though I add them to our assembly jar file, the class loader finds the
 spark assembly
 and we get runtime class loader errors when we try to use it.

 It is possible to put our jar file first, as described here:
   https://issues.apache.org/jira/browse/SPARK-939

 http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment

 which I tried with args to spark-submit:
   --conf spark.driver.userClassPathFirst=true  --conf
 spark.executor.userClassPathFirst=true
 but we still get the class not found error.

 We have tried copying the source code for clearspring into our package and
 renaming the
 package and that makes it appear to work...  Is this risky?  It certainly
 is
 ugly.

  Can anyone recommend a way to deal with this dependency hell?


 === The spark/pom.xml file contains the following lines:

    <dependency>
      <groupId>com.clearspring.analytics</groupId>
      <artifactId>stream</artifactId>
      <version>2.7.0</version>
      <exclusions>
        <exclusion>
          <groupId>it.unimi.dsi</groupId>
          <artifactId>fastutil</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

 === The parquet-column/pom.xml file contains:
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <minimizeJar>true</minimizeJar>
        <artifactSet>
          <includes>
            <include>it.unimi.dsi:fastutil</include>
          </includes>
        </artifactSet>
        <relocations>
          <relocation>
            <pattern>it.unimi.dsi</pattern>
            <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-excludes-fastutil-dependencies-we-need-tp21794.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to tell if one RDD depends on another

2015-02-26 Thread Ted Yu
bq. whether or not rdd1 is a cached rdd

RDD has getStorageLevel method which would return the RDD's current storage
level.

SparkContext has this method:
   * Return information about what RDDs are cached, if they are in mem or
on disk, how much space
   * they take, etc.
   */
  @DeveloperApi
  def getRDDStorageInfo: Array[RDDInfo] = {

Cheers
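
Putting those together, a small sketch of the kind of check being discussed;
the recursive dependsOn walk is my own illustration, not an official API:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  val rdd1IsPersisted = rdd1.getStorageLevel != StorageLevel.NONE

  // walk rdd2's lineage to see whether rdd1 shows up among its ancestors
  def dependsOn(rdd: RDD[_], target: RDD[_]): Boolean =
    (rdd eq target) || rdd.dependencies.exists(dep => dependsOn(dep.rdd, target))

  val rdd2DependsOnRdd1 = dependsOn(rdd2, rdd1)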

On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet cjno...@gmail.com wrote:

 Zhan,

 This is exactly what I'm trying to do except, as I metnioned in my first
 message, I am being given rdd1 and rdd2 only and I don't necessarily know
 at that point whether or not rdd1 is a cached rdd. Further, I don't know at
 that point whether or not rdd2 depends on rdd1.

 On Thu, Feb 26, 2015 at 6:54 PM, Zhan Zhang zzh...@hortonworks.com
 wrote:

  In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to
 finish, probably due to writing to hdfs. A workaround for this particular
 case may be as follows.

  val rdd1 = ….cache()

  val rdd2 = rdd1.map(…).…(…)
  rdd1.count
  future { rdd1.saveAsHadoopFile(…) }
  future { rdd2.saveAsHadoopFile(…) }

  In this way, rdd1 will be calculated once, and two saveAsHadoopFile
 will happen concurrently.

  Thanks.

  Zhan Zhang



  On Feb 26, 2015, at 3:28 PM, Corey Nolet cjno...@gmail.com wrote:

   What confused me is the statement "The final result is that rdd1
 is calculated twice." Is it the expected behavior?

  To be perfectly honest, performing an action on a cached RDD in two
 different threads and having them (at the partition level) block until the
 parent are cached would be the behavior and myself and all my coworkers
 expected.

 On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet cjno...@gmail.com wrote:

  I should probably mention that my example case is much over
 simplified- Let's say I've got a tree, a fairly complex one where I begin a
 series of jobs at the root which calculates a bunch of really really
 complex joins and as I move down the tree, I'm creating reports from the
 data that's already been joined (i've implemented logic to determine when
 cached items can be cleaned up, e.g. the last report has been done in a
 subtree).

  My issue is that the 'actions' on the rdds are currently being
 implemented in a single thread- even if I'm waiting on a cache to complete
 fully before I run the children jobs, I'm still in a better placed than I
 was because I'm able to run those jobs concurrently- right now this is not
 the case.

  What you want is for a request for partition X to wait if partition X
 is already being calculated in a persisted RDD.

 I totally agree and if I could get it so that it's waiting at the
 granularity of the partition, I'd be in a much much better place. I feel
 like I'm going down a rabbit hole and working against the Spark API.


 On Thu, Feb 26, 2015 at 6:03 PM, Sean Owen so...@cloudera.com wrote:

 To distill this a bit further, I don't think you actually want rdd2 to
 wait on rdd1 in this case. What you want is for a request for
 partition X to wait if partition X is already being calculated in a
 persisted RDD. Otherwise the first partition of rdd2 waits on the
 final partition of rdd1 even when the rest is ready.

 That is probably usually a good idea in almost all cases. That much, I
 don't know how hard it is to implement. But I speculate that it's
 easier to deal with it at that level than as a function of the
 dependency graph.

 On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet cjno...@gmail.com
 wrote:
  I'm trying to do the scheduling myself now- to determine that rdd2
 depends
  on rdd1 and rdd1 is a persistent RDD (storage level != None) so that
 I can
  do the no-op on rdd1 before I run rdd2. I would much rather the DAG
 figure
  this out so I don't need to think about all this.








Re: JettyUtils.createServletHandler Method not Found?

2015-03-27 Thread Ted Yu
JettyUtils is marked with:
private[spark] object JettyUtils extends Logging {

FYI

On Fri, Mar 27, 2015 at 9:50 AM, kmader kevin.ma...@gmail.com wrote:

 I have a very strange error in Spark 1.3 where at runtime in the
 org.apache.spark.ui.JettyUtils object the method createServletHandler is
 not
 found

 Exception in thread main java.lang.NoSuchMethodError:

 org.apache.spark.ui.JettyUtils$.createServletHandler(Ljava/lang/String;Ljavax/servlet/http/HttpServlet;Ljava/lang/String;)Lorg/eclipse/jetty/servlet/ServletContextHandler;


 The code compiles without issue, but at runtime it fails. I know the Jetty
 dependencies have been changed, but this should not affect the JettyUtils
 inside Spark Core, or is there another change I am not aware of?

 Thanks,
 Kevin




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/JettyUtils-createServletHandler-Method-not-Found-tp22262.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: RDD equivalent of HBase Scan

2015-03-26 Thread Ted Yu
In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala,
TableInputFormat is used.
TableInputFormat accepts parameter

  public static final String SCAN = hbase.mapreduce.scan;

where if specified, Scan object would be created from String form:

if (conf.get(SCAN) != null) {

  try {

scan = TableMapReduceUtil.convertStringToScan(conf.get(SCAN));

You can use TableMapReduceUtil#convertScanToString() to convert a Scan
which has filter(s) and pass to TableInputFormat

Cheers
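
A sketch of that wiring (HBase 0.98-era API; the table name, prefix filter and
row-key prefix are placeholders, not from the original question):

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.{Result, Scan}
  import org.apache.hadoop.hbase.filter.PrefixFilter
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
  import org.apache.hadoop.hbase.util.Bytes

  val scan = new Scan()
  scan.setFilter(new PrefixFilter(Bytes.toBytes("rowkey-prefix")))

  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "my_table")
  conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

  // each element is (row key, Result); only rows passing the server-side filter come back
  val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])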

On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton stuart.lay...@gmail.com
wrote:

 HBase scans come with the ability to specify filters that make scans very
 fast and efficient (as they let you seek for the keys that pass the filter).

 Do RDD's or Spark DataFrames offer anything similar or would I be required
 to use a NoSQL db like HBase to do something like this?

 --
 Stuart Layton



Re: Which RDD operations preserve ordering?

2015-03-26 Thread Ted Yu
This is related:
https://issues.apache.org/jira/browse/SPARK-6340

On Thu, Mar 26, 2015 at 5:58 AM, sergunok ser...@gmail.com wrote:

 Hi guys,

 I don't have exact picture about preserving of ordering of elements of RDD
 after executing of operations.

 Which operations preserve it?
 1) Map (Yes?)
 2) ZipWithIndex (Yes or sometimes yes?)

 Serg.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Which-RDD-operations-preserve-ordering-tp22239.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Can't access file in spark, but can in hadoop

2015-03-26 Thread Ted Yu
Looks like the following assertion failed:
  Preconditions.checkState(storageIDsCount == locs.size());

locs is ListDatanodeInfoProto
Can you enhance the assertion to log more information ?

Cheers
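
If rebuilding the HDFS client is an option, one way to surface more detail is
the checkState overload that takes a message template; a sketch of the kind of
change meant above, not the exact upstream code:

  Preconditions.checkState(storageIDsCount == locs.size(),
      "storageIDsCount=%s does not match locs.size()=%s", storageIDsCount, locs.size());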

On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson daljohn...@ebay.com wrote:

 There seems to be a special kind of "corrupted according to Spark" state of
 file in HDFS.  I have isolated a set of files (maybe 1% of all files I need
 to work with) which are producing the following stack dump when I try to
 sc.textFile() open them.  When I try to open directories, most large
 directories contain at least one file of this type.  Curiously, the
 following two lines fail inside of a Spark job, but not inside of a Scoobi
 job:

 val conf = new org.apache.hadoop.conf.Configuration
 val fs = org.apache.hadoop.fs.FileSystem.get(conf)

 The stack trace follows:

 15/03/26 14:22:43 INFO yarn.ApplicationMaster: Final app status: FAILED,
 exitCode: 15, (reason: User class threw exception: null)
 Exception in thread Driver java.lang.IllegalStateException
 at

 org.spark-project.guava.common.base.Preconditions.checkState(Preconditions.java:133)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:673)
 at

 org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1100)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1118)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1251)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1354)
 at
 org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1363)
 at

 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:518)
 at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
 at

 org.apache.hadoop.hdfs.DistributedFileSystem$15.init(DistributedFileSystem.java:738)
 at

 org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:727)
 at
 org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1662)
 at org.apache.hadoop.fs.FileSystem$5.init(FileSystem.java:1724)
 at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:1721)
 at

 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1125)
 at

 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1123)
 at

 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at

 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$.main(SpellQuery.scala:1123)
 at
 com.ebay.ss.niffler.miner.speller.SpellQueryLaunch.main(SpellQuery.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
 15/03/26 14:22:43 INFO yarn.ApplicationMaster: Invoking sc stop from
 shutdown hook

 It appears to have found the three copies of the given HDFS block, but is
 performing some sort of validation with them before giving them back to
 spark to schedule the job.  But there is an assert failing.

 I've tried this with 1.2.0, 1.2.1 and 1.3.0, and I get the exact same
 error,
 but I've seen the line numbers change on the HDFS libraries, but not the
 function names.  I've tried recompiling myself with different hadoop
 versions, and it's the same.  We're running hadoop 2.4.1 on our cluster.

 A google search turns up absolutely nothing on this.

 Any insight at all would be appreciated.

 Dale Johnson
 Applied Researcher
 eBay.com




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-access-file-in-spark-but-can-in-hadoop-tp22251.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Building spark 1.2 from source requires more dependencies

2015-03-26 Thread Ted Yu
Looking at output from dependency:tree, servlet-api is brought in by the
following:

[INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
[INFO] |  +- org.antlr:antlr:jar:3.2:compile
[INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
[INFO] |  +- org.yaml:snakeyaml:jar:1.6:compile
[INFO] |  +- edu.stanford.ppl:snaptree:jar:0.1:compile
[INFO] |  +- org.mindrot:jbcrypt:jar:0.3m:compile
[INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
[INFO] |  |  \- javax.servlet:servlet-api:jar:2.5:compile

FYI
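
To reproduce that view for a particular module and narrow it to the offending
artifact, something like the following should work (the profile flags mirror
the build command quoted below):

  mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver \
      dependency:tree -pl core -Dincludes=javax.servlet:servlet-api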

On Thu, Mar 26, 2015 at 3:36 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote:

 Hi,

 We are trying to build spark 1.2 from source (tip of the branch-1.2 at the
 moment). I tried to build spark using the following command:

 mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
 -Phive-thriftserver -DskipTests clean package

 I encountered various missing class definition exceptions (e.g: class
 javax.servlet.ServletException not found).

 I eventually got the build to succeed after adding the following set of
 dependencies to the spark-core's pom.xml:

  <dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>servlet-api</artifactId>
    <version>3.0</version>
  </dependency>

  <dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-io</artifactId>
  </dependency>

  <dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-http</artifactId>
  </dependency>

  <dependency>
    <groupId>org.eclipse.jetty</groupId>
    <artifactId>jetty-servlet</artifactId>
  </dependency>

 Pretty much all of the missing class definition errors came up while
 building HttpServer.scala, and went away after the above dependencies were
 included.

 My guess is official build for spark 1.2 is working already. My question
 is what is wrong with my environment or setup, that requires me to add
 dependencies to pom.xml in this manner, to get this build to succeed.

 Also, i am not sure if this build would work at runtime for us, i am still
 testing this out.


 Thanks,
 pala



Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Ted Yu
JAVA_HOME, an environment variable, should be defined on the node where
appattempt_1420225286501_4699_02 ran.

Cheers
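
If fixing the node-level environment is not convenient, the YARN container
environment can also be set per job; a hedged sketch (the JDK path is an
assumption and must exist on every NodeManager host):

  spark-submit \
    --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/java/jdk1.7.0_67 \
    --conf spark.executorEnv.JAVA_HOME=/usr/java/jdk1.7.0_67 \
    ...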

On Thu, Mar 19, 2015 at 8:59 AM, Williams, Ken ken.willi...@windlogics.com
wrote:

  I’m trying to upgrade a Spark project, written in Scala, from Spark
 1.2.1 to 1.3.0, so I changed my `build.sbt` like so:

 -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
 +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

  then make an `assembly` jar, and submit it:

 HADOOP_CONF_DIR=/etc/hadoop/conf \
 spark-submit \
 --driver-class-path=/etc/hbase/conf \
 --conf spark.hadoop.validateOutputSpecs=false \
 --conf
 spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
 \
 --deploy-mode=cluster \
 --master=yarn \
 --class=TestObject \
 --num-executors=54 \
 target/scala-2.11/myapp-assembly-1.2.jar

  The job fails to submit, with the following exception in the terminal:

  15/03/19 10:30:07 INFO yarn.Client:
 client token: N/A
 diagnostics: Application application_1420225286501_4699 failed 2 times due
 to AM Container for appattempt_1420225286501_4699_02 exited with
  exitCode: 127 due to: Exception from container-launch:
 org.apache.hadoop.util.Shell$ExitCodeException:
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
 at org.apache.hadoop.util.Shell.run(Shell.java:379)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
 at
 org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
 at
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

  Finally, I go and check the YARN app master’s web interface (since the
 job is shown, I know it at least made it that far), and the only logs it
 shows are these:

  Log Type: stderr
 Log Length: 61
 /bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

 Log Type: stdout
 Log Length: 0

  I’m not sure how to interpret that – is '{{JAVA_HOME}}' a literal
 (including the brackets) that’s somehow making it into a script?  Is this
 coming from the worker nodes or the driver?  Anything I can do to
 experiment  troubleshoot?

-Ken



 --

 CONFIDENTIALITY NOTICE: This e-mail message is for the sole use of the
 intended recipient(s) and may contain confidential and privileged
 information. Any unauthorized review, use, disclosure or distribution of
 any kind is strictly prohibited. If you are not the intended recipient,
 please contact the sender via reply e-mail and destroy all copies of the
 original message. Thank you.



Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ted Yu
Ranga:
Please apply the patch from:
https://github.com/apache/spark/pull/4867

And rebuild Spark - the build would use Tachyon-0.6.1

Cheers
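
Once rebuilt, a minimal sketch of wiring the off-heap store to the Tachyon
master (the host/port and base dir are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel

  val conf = new SparkConf()
    .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
    .set("spark.tachyonStore.baseDir", "/tmp_spark_tachyon")

  // later, on the RDD to be cached off-heap:
  rdd.persist(StorageLevel.OFF_HEAP)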

On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com wrote:

 Hi Haoyuan

 No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If
 not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0?
 Thanks for your help.


 - Ranga

 On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com wrote:

 Did you recompile it with Tachyon 0.6.0?

 Also, Tachyon 0.6.1 has been released this morning:
 http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases

 Best regards,

 Haoyuan

 On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote:

 I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
 issue. Here are the logs:
 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts
 to create tachyon dir in
 /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver

 Thanks for any other pointers.


 - Ranga



 On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote:

 Thanks for the information. Will rebuild with 0.6.0 till the patch is
 merged.

 On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote:

 Ranga:
 Take a look at https://github.com/apache/spark/pull/4867

 Cheers

 On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com fightf...@163.com
 wrote:

 Hi, Ranga

 That's true. Typically a version mis-match issue. Note that spark
 1.2.1 has tachyon built in with version 0.5.0 , I think you may need to
 rebuild spark
 with your current tachyon release.
 We had used tachyon for several of our spark projects in a production
 environment.

 Thanks,
 Sun.

 --
 fightf...@163.com


 *From:* Ranga sra...@gmail.com
 *Date:* 2015-03-18 06:45
 *To:* user@spark.apache.org
 *Subject:* StorageLevel: OFF_HEAP
 Hi

 I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
 cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
 However, when I try to persist the RDD, I get the following error:

 ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
 TachyonFS.java[connect]:364)  - Invalid method name:
 'getUserUnderfsTempFolder'
 ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
 TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

 Is this because of a version mis-match?

 On a different note, I was wondering if Tachyon has been used in a
 production environment by anybody in this group?

 Appreciate your help with this.


 - Ranga







 --
 Haoyuan Li
 AMPLab, EECS, UC Berkeley
 http://www.cs.berkeley.edu/~haoyuan/





Re: Bulk insert strategy

2015-03-08 Thread Ted Yu
What's the expected number of partitions in your use case ?

Have you thought of doing batching in the workers ?

Cheers
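
For reference, a sketch of batching inside each partition instead of sending
records one at a time; the batch size and the ConnectionPool/bulkInsert API are
assumptions that mirror the guide's pseudo-code below:

  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partitionOfRecords =>
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.grouped(500).foreach { batch =>
        connection.bulkInsert(batch.toSeq)   // one round trip per 500 records
      }
      ConnectionPool.returnConnection(connection)
    }
  }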

On Sat, Mar 7, 2015 at 10:54 PM, A.K.M. Ashrafuzzaman 
ashrafuzzaman...@gmail.com wrote:

 While processing DStream in the Spark Programming Guide, the suggested
 usage of connection is the following,

  dstream.foreachRDD(rdd => {
    rdd.foreachPartition(partitionOfRecords => {
      // ConnectionPool is a static, lazily initialized pool of connections
      val connection = ConnectionPool.getConnection()
      partitionOfRecords.foreach(record => connection.send(record))
      ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
    })
  })


 In this case processing and the insertion is done in the workers. There,
 we don’t use batch insert in db. How about this use case, where we can
 process(parse string JSON to obj) and send back those objects to master and
 then send a bulk insert request. Is there any benefit for sending
 individually using connection pool vs use of bulk operation in the master?

 A.K.M. Ashrafuzzaman
 Lead Software Engineer
 NewsCred http://www.newscred.com/

 (M) 880-175-5592433
 Twitter https://twitter.com/ashrafuzzaman | Blog
 http://jitu-blog.blogspot.com/ | Facebook
 https://www.facebook.com/ashrafuzzaman.jitu

 Check out The Academy http://newscred.com/theacademy, your #1 source
 for free content marketing resources




Re: Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit

2015-03-23 Thread Ted Yu
InputSplit is in hadoop-mapreduce-client-core jar

Please check that the jar is in your classpath.

Cheers
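
If the jar is missing from the executor classpath, one option is to ship it
with the job; a hedged sketch (the jar location below is an assumption, use
wherever your CDH packages place hadoop-mapreduce-client-core):

  spark-submit --master yarn \
    --jars /usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-core.jar \
    ... UPDB3analytics.py -date 2015-03-01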

On Mon, Mar 23, 2015 at 8:10 AM, , Roy rp...@njit.edu wrote:

 Hi,


   I am using CDH 5.3.2 packages installation through Cloudera Manager 5.3.2

 I am trying to run one spark job with following command

 PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G
 --num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics
 /home/abc/code/updb/spark/UPDB3analytics.py -date 2015-03-01

 but I am getting following error

 15/03/23 11:06:49 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 7,
 hdp003.dev.xyz.com): java.lang.NoClassDefFoundError:
 org/apache/hadoop/mapred/InputSplit
 at java.lang.Class.getDeclaredConstructors0(Native Method)
 at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532)
 at java.lang.Class.getDeclaredConstructors(Class.java:1901)
 at
 java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1749)
 at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:72)
 at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:250)
 at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:248)
 at java.security.AccessController.doPrivileged(Native Method)
 at
 java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:247)
 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:613)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
 at
 org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.mapred.InputSplit
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 25 more

 here is the full trace

 https://gist.github.com/anonymous/3492f0ec63d7a23c47cf
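
One way to rule this out is to ship the jar explicitly with the job; a sketch of
the same submit command with --jars added (the parcel path and jar file name are
illustrative and depend on the CDH layout):

  PYTHONPATH=~/code/utils/ spark-submit --master yarn \
    --jars /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-core-2.5.0-cdh5.3.2.jar \
    --executor-memory 3G --num-executors 30 --driver-memory 2G --executor-cores 2 \
    --name=analytics /home/abc/code/updb/spark/UPDB3analytics.py -date 2015-03-01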




Re: JDBC DF using DB2

2015-03-23 Thread Ted Yu
bq. is to modify compute_classpath.sh on all worker nodes to include your
driver JARs.

Please follow the above advice.

Cheers

On Mon, Mar 23, 2015 at 12:34 PM, Jack Arenas j...@ckarenas.com wrote:

 Hi Team,



 I’m trying to create a DF using jdbc as detailed here
 https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#jdbc-to-other-databases
  –
 I’m currently using DB2 v9.7.0.6 and I’ve tried to use the db2jcc.jar and
 db2jcc_license_cu.jar combo, and while it works in --master local using the
 command below, I get some strange behavior in --master yarn-client. Here is
 the command:



  *val df = sql.load("jdbc", Map("url" ->
  "jdbc:db2://host:port/db:currentSchema=schema;user=user;password=password;",
  "driver" -> "com.ibm.db2.jcc.DB2Driver", "dbtable" -> "table"))*



 It seems to also be working on yarn-client because once executed I get the
 following log:

 *df: org.apache.spark.sql.DataFrame = [DATE_FIELD: date, INT_FIELD: int,
 DOUBLE_FIELD: double]*



  This indicates to me that Spark was able to connect to the DB. But once I
  run *df.count()* or *df.take(5).foreach(println)* in order to operate on
  the data and get a result, I get back a ‘*No suitable driver found*’
  exception, which makes me think the driver wasn’t shipped with the Spark
  job.



  I’ve tried using *--driver-class-path, --jars, SPARK_CLASSPATH* to add
  the jars to the Spark job. I also have the jars in my *$CLASSPATH* and
  *$HADOOP_CLASSPATH*.



  I also saw this in the troubleshooting section, but quite frankly I’m not
  sure what primordial class loader it’s talking about:



 The JDBC driver class must be visible to the primordial class loader on
 the client session and on all executors. This is because Java’s
 DriverManager class does a security check that results in it ignoring all
 drivers not visible to the primordial class loader when one goes to open a
 connection. One convenient way to do this is to modify compute_classpath.sh
 on all worker nodes to include your driver JARs.


 Any advice is welcome!



 Thanks,

 Jack
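
For reference, a submit-time alternative to editing compute_classpath.sh is to
place the driver jars on both the driver and executor extra classpaths; this is
only a sketch (paths are illustrative, and the jars must already exist at those
paths on every worker node):

  spark-submit --master yarn-client \
    --driver-class-path /path/to/db2jcc.jar:/path/to/db2jcc_license_cu.jar \
    --conf spark.executor.extraClassPath=/path/to/db2jcc.jar:/path/to/db2jcc_license_cu.jar \
    --class com.example.MyApp my-app.jar

The class name and application jar above are placeholders.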









Re: Spark log shows only this line repeated: RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time X

2015-03-26 Thread Ted Yu
It is logged from RecurringTimer#loop():

  private def loop() {
    try {
      while (!stopped) {
        clock.waitTillTime(nextTime)
        callback(nextTime)
        prevTime = nextTime
        nextTime += period
        logDebug("Callback for " + name + " called at time " + prevTime)
      }
For the callback, JobGenerator has this:

  private val timer = new RecurringTimer(clock,
    ssc.graph.batchDuration.milliseconds,
    longTime => eventActor ! GenerateJobs(new Time(longTime)),
    "JobGenerator")
...
  /** Generate jobs and perform checkpoint for the given `time`.  */
  private def generateJobs(time: Time) {
  private def generateJobs(time: Time) {

Cheers

On Thu, Mar 26, 2015 at 6:55 AM, Adrian Mocanu amoc...@verticalscope.com
wrote:

  Here’s my log output from a streaming job.

 What is this?





 09:54:27.504 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067504

 09:54:27.505 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067505

 09:54:27.506 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067506

 09:54:27.508 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067507

 09:54:27.508 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067508

 09:54:27.509 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067509

 09:54:27.510 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067510

 09:54:27.511 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067511

 09:54:27.512 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067512

 09:54:27.513 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067513

 09:54:27.514 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067514

 09:54:27.515 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067515

 09:54:27.516 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067516

 09:54:27.517 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067517

 09:54:27.518 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067518

 09:54:27.519 [RecurringTimer - JobGenerator] DEBUG
 o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at
 time 1427378067519

 09:54:27.520 [Recurri …



Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.

2015-01-13 Thread Ted Yu
Looking at the source code for AbstractGenericUDAFResolver, the following
(non-deprecated) method should be called:

  public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info)

It is called by hiveUdfs.scala (master branch):

val parameterInfo = new
SimpleGenericUDAFParameterInfo(inspectors.toArray, false, false)
resolver.getEvaluator(parameterInfo)

FYI

On Tue, Jan 13, 2015 at 1:51 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi,

 The following SQL query

 select percentile_approx(variables.var1, 0.95) p95
 from model

 will throw

 ERROR SparkSqlInterpreter: Error
 org.apache.hadoop.hive.ql.parse.SemanticException: This UDAF does not
 support the deprecated getEvaluator() method.
 at
 org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:53)
 at
 org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector$lzycompute(hiveUdfs.scala:196)
 at
 org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector(hiveUdfs.scala:195)
 at
 org.apache.spark.sql.hive.HiveGenericUdaf.dataType(hiveUdfs.scala:203)
 at
 org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:105)
 at
 org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143)
 at
 org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143)

 I'm using latest branch-1.2

 I found in PR that percentile and percentile_approx are supported. A bug?

 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?

2015-03-23 Thread Ted Yu
Looking at core/pom.xml :
    <dependency>
      <groupId>org.json4s</groupId>
      <artifactId>json4s-jackson_${scala.binary.version}</artifactId>
      <version>3.2.10</version>
    </dependency>

The version is hard coded.

You can rebuild Spark 1.3.0 with json4s 3.2.11

Cheers

On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev alexey.zinov...@gmail.com
wrote:

 Spark has a dependency on json4s 3.2.10, but this version has several bugs
 and I need to use 3.2.11. I added json4s-native 3.2.11 dependency to
 build.sbt and everything compiled fine. But when I spark-submit my JAR it
 provides me with 3.2.10.


 build.sbt

 import sbt.Keys._

 name := "sparkapp"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" %
 "provided"

 libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"


 plugins.sbt

 logLevel := Level.Warn

 resolvers += Resolver.url("artifactory", url(
 "http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"
 ))(Resolver.ivyStylePatterns)

 addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")


 App1.scala

 import org.apache.spark.SparkConf
 import org.apache.spark.rdd.RDD
 import org.apache.spark.{Logging, SparkConf, SparkContext}
 import org.apache.spark.SparkContext._

 object App1 extends Logging {
   def main(args: Array[String]) = {
     val conf = new SparkConf().setAppName("App1")
     val sc = new SparkContext(conf)
     println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
   }
 }



 sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4

 Is it possible to force 3.2.11 version usage?

 Thanks,
 Alexey
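
A sketch of the rebuild route, under the assumption that bumping the hard-coded
version is all that is needed (the build command is the same one used elsewhere
on this list):

  # in core/pom.xml, change the json4s-jackson <version> from 3.2.10 to 3.2.11, then:
  mvn -DskipTests clean package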



Re: pyspark hbase range scan

2015-04-01 Thread Ted Yu
Have you looked at http://happybase.readthedocs.org/en/latest/ ?

Cheers



 On Apr 1, 2015, at 4:50 PM, Eric Kimbrel eric.kimb...@soteradefense.com 
 wrote:
 
 I am attempting to read an hbase table in pyspark with a range scan.  
 
  conf = {
      "hbase.zookeeper.quorum": host,
      "hbase.mapreduce.inputtable": table,
      "hbase.mapreduce.scan": scan
  }
  hbase_rdd = sc.newAPIHadoopRDD(
      "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
      "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
      "org.apache.hadoop.hbase.client.Result",
      keyConverter=keyConv,
      valueConverter=valueConv,
      conf=conf)
 
  If I jump over to Scala or Java and generate a base64-encoded protobuf scan
  object and convert it to a string, I can use that value for
  hbase.mapreduce.scan and everything works: the RDD correctly performs
  the range scan and I am happy.  The problem is that I cannot find any
  reasonable way to generate that range-scan string in Python.   The Scala
  code required is:
 
 import org.apache.hadoop.hbase.util.Base64;
 import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
 import org.apache.hadoop.hbase.client.{Delete, HBaseAdmin, HTable, Put,
 Result = HBaseResult, Scan}
 
 val scan = new Scan()
  scan.setStartRow("test_domain\0email".getBytes)
  scan.setStopRow("test_domain\0email~".getBytes)
  def scanToString(scan: Scan): String = { Base64.encodeBytes(
  ProtobufUtil.toScan(scan).toByteArray()) }
 scanToString(scan)
 
 
 Is there another way to perform an hbase range scan from pyspark or is that
 functionality something that might be supported in the future?
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-hbase-range-scan-tp22348.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Connection pooling in spark jobs

2015-04-02 Thread Ted Yu
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm

The question doesn't seem to be Spark specific, btw




 On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote:
 
 Hi,
 
 We have a case that we will have to run concurrent jobs (for the same 
 algorithm) on different data sets. And these jobs can run in parallel and 
 each one of them would be fetching the data from the database.
 We would like to optimize the database connections by making use of 
 connection pooling. Any suggestions / best known ways on how to achieve this. 
 The database in question is Oracle
 
 Thanks,
 Sateesh

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
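
In a Spark job the usual shape is a pool that is initialized lazily once per
executor JVM and borrowed from inside foreachPartition. A rough sketch; the pool
construction and the insert helper are placeholders for whatever Oracle-capable
pool (UCP, HikariCP, ...) and DAO code are actually used:

  import java.sql.Connection

  // Initialized lazily, once per executor JVM.
  object PooledOracle {
    lazy val dataSource = createPool("jdbc:oracle:thin:@//dbhost:1521/service")  // hypothetical helper
    def getConnection(): Connection = dataSource.getConnection
  }

  rdd.foreachPartition { rows =>
    val conn = PooledOracle.getConnection()
    try {
      rows.foreach(row => insertRow(conn, row))  // hypothetical insert helper
    } finally {
      conn.close()  // with a pooled DataSource, close() returns the connection to the pool
    }
  }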



Re: Reply: maven compile error

2015-04-03 Thread Ted Yu
Can you include -X in your maven command and pastebin the output ?

Cheers



 On Apr 3, 2015, at 3:58 AM, myelinji myeli...@aliyun.com wrote:
 
 Thank you for your reply. When I use Maven to compile the whole project,
 the errors are as follows:
 
 [INFO] Spark Project Parent POM .. SUCCESS [4.136s]
 [INFO] Spark Project Networking .. SUCCESS [7.405s]
 [INFO] Spark Project Shuffle Streaming Service ... SUCCESS [5.071s]
 [INFO] Spark Project Core  SUCCESS [3:08.445s]
 [INFO] Spark Project Bagel ... SUCCESS [21.613s]
 [INFO] Spark Project GraphX .. SUCCESS [58.915s]
 [INFO] Spark Project Streaming ... SUCCESS [1:26.202s]
 [INFO] Spark Project Catalyst  FAILURE [1.537s]
 [INFO] Spark Project SQL . SKIPPED
 [INFO] Spark Project ML Library .. SKIPPED
 [INFO] Spark Project Tools ... SKIPPED
 [INFO] Spark Project Hive  SKIPPED
 [INFO] Spark Project REPL  SKIPPED
 [INFO] Spark Project Assembly  SKIPPED
 [INFO] Spark Project External Twitter  SKIPPED
 [INFO] Spark Project External Flume Sink . SKIPPED
 [INFO] Spark Project External Flume .. SKIPPED
 [INFO] Spark Project External MQTT ... SKIPPED
 [INFO] Spark Project External ZeroMQ . SKIPPED
 [INFO] Spark Project External Kafka .. SKIPPED
 [INFO] Spark Project Examples  SKIPPED
 
 It seems like there is something wrong with the catalyst project. Why can't I
 compile this project?
 
 
 --
 From: Sean Owen so...@cloudera.com
 Sent: Friday, April 3, 2015 17:48
 To: myelinji myeli...@aliyun.com
 Cc: Spark user group user@spark.apache.org
 Subject: Re: maven compile error
 
 If you're asking about a compile error, you should include the command
 you used to compile.
 
 I am able to compile branch 1.2 successfully with mvn -DskipTests
 clean package.
 
 This error is actually an error from scalac, not a compile error from
 the code. It sort of sounds like it has not been able to download
 scala dependencies. Check or maybe recreate your environment.
 
 On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote:
  Hi, all:
  Just now I checked out spark-1.2 from GitHub and wanted to build it with Maven;
  however, I encountered an error during compilation:
 
  [INFO]
  
  [ERROR] Failed to execute goal
  net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on
  project spark-catalyst_2.10: wrap:
  scala.reflect.internal.MissingRequirementError: object scala.runtime in
  compiler mirror not found. - [Help 1]
  org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
  goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile
  (scala-compile-first) on project spark-catalyst_2.10: wrap:
  scala.reflect.internal.MissingRequirementError: object scala.runtime in
  compiler mirror not found.
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
  at
  org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
  at
  org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
  at
  org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
  at
  org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
  at
  org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
  at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
  at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
  at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
  at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
  at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
  sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at
  sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
  at
  org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
  at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
  Caused by: 

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Ted Yu
Maybe add another stat for batches waiting in the job queue ?

Cheers

On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com wrote:

 Very good question! This is because the current code is written such that
 the ui considers a batch as waiting only when it has actually started being
 processed. That is, batches waiting in the job queue are not considered in the
 calculation. It is arguable that it may be more intuitive to count those in
 the waiting figure as well.
 On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote:


 I copied the following from the Spark Streaming UI. I don't know why
 Waiting batches is 1; my understanding is that it should be 72.
 Here is my reasoning:
 1. Total time is 1 minute 35 seconds = 95 seconds.
 2. Batch interval is 1 second, so 95 batches are generated in 95 seconds.
 3. Processed batches are 23 (correct, because my processing code
 does nothing but sleep 4 seconds).
 4. Then the waiting batches should be 95 - 23 = 72.



- *Started at: * Fri Apr 03 15:17:47 CST 2015
- *Time since start: *1 minute 35 seconds
- *Network receivers: *1
- *Batch interval: *1 second
- *Processed batches: *23
- *Waiting batches: *1
- *Received records: *0
- *Processed records: *0


 --
 bit1...@163.com




Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Ted Yu
Please take a look at https://issues.apache.org/jira/browse/PHOENIX-1815

On Mon, Apr 20, 2015 at 10:11 AM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Thanks for the reply.

 Would using Phoenix inside Spark be useful?

 What is the best way to bring data from HBase into Spark in terms of
 application performance?

 Regards
 Jeetendra

 On 20 April 2015 at 20:49, Ted Yu yuzhih...@gmail.com wrote:

 To my knowledge, Spark SQL currently doesn't provide range scan
 capability against hbase.

 Cheers



  On Apr 20, 2015, at 7:54 AM, Jeetendra Gangele gangele...@gmail.com
 wrote:
 
  HI All,
 
  I am querying HBase, combining the results, and using them in my Spark job.
  I am querying HBase using the HBase client API inside my Spark job.
  Can anybody tell me whether Spark SQL will be fast enough and support
 range queries?
 
  Regards
  Jeetendra
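
For context, without Phoenix the common way to bring an HBase table into Spark
is newAPIHadoopRDD with TableInputFormat, optionally passing a serialized Scan
to restrict the range. A rough sketch, with the table name as a placeholder and
an existing SparkContext sc assumed:

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat

  val hbaseConf = HBaseConfiguration.create()
  hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")  // placeholder table name

  val hbaseRdd = sc.newAPIHadoopRDD(
    hbaseConf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])

  println(hbaseRdd.count())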
 








Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-23 Thread Ted Yu
NativeS3FileSystem class is in hadoop-aws jar.
Looks like it was not on classpath.

Cheers

On Thu, Apr 23, 2015 at 7:30 AM, Sujee Maniyam su...@sujee.net wrote:

 Thanks all...

  btw, s3n load works without any issues with  spark-1.3.1-built-for-hadoop
  2.4

 I tried this on 1.3.1-hadoop26
    sc.hadoopConfiguration.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
   val f = sc.textFile("s3n://bucket/file")
  f.count

  Now it can't find the implementation class.  Looks like some jar is missing?

 java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
 org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
 at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

 On Wednesday, April 22, 2015, Shuai Zheng szheng.c...@gmail.com wrote:

 Below is my code to access s3n without problem (only for 1.3.1. there is
 a bug in 1.3.0).



   Configuration hadoopConf = ctx.hadoopConfiguration();

    hadoopConf.set("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem");

    hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId);

    hadoopConf.set("fs.s3n.awsSecretAccessKey",
  awsSecretAccessKey);



 Regards,



 Shuai



 *From:* Sujee Maniyam [mailto:su...@sujee.net]
 *Sent:* Wednesday, April 22, 2015 12:45 PM
 *To:* Spark User List
 *Subject:* spark 1.3.1 : unable to access s3n:// urls (no file system
 for scheme s3n:)



 Hi all

 I am unable to access s3n://  urls using   sc.textFile().. getting 'no
 file system for scheme s3n://'  error.



 a bug or some conf settings missing?



 See below for details:



 env variables :

 AWS_SECRET_ACCESS_KEY=set

 AWS_ACCESS_KEY_ID=set



 spark/RELAESE :

 Spark 1.3.1 (git revision 908a0bf) built for Hadoop 2.6.0

 Build flags: -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
 -Phive-thriftserver -Pyarn -DzincPort=3034





 ./bin/spark-shell

  val f = sc.textFile("s3n://bucket/file")

  f.count



 error==

 java.io.IOException: No FileSystem for scheme: s3n

 at
 org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)

 at
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)

 at
 org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

 at
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

 at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)

 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)

 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)

 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)

 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)

 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)

 at
 org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)

 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)

 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)

 at
 org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)

 at scala.Option.getOrElse(Option.scala:120)

 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)

 at
 org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)

 at org.apache.spark.rdd.RDD.count(RDD.scala:1006)

 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)

 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29)

 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)

 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)

 at $iwC$$iwC$$iwC$$iwC.init(console:35)

 at $iwC$$iwC$$iwC.init(console:37)

 at $iwC$$iwC.init(console:39)

 at $iwC.init(console:41)

 at init(console:43)

 at .init(console:47)

 at .clinit(console)

 at .init(console:7)

 at .clinit(console)

 at $print(console)

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at 
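
Since the Hadoop 2.6 build no longer bundles the s3n filesystem, one workaround
is to pull in the hadoop-aws module (and its AWS SDK dependency) at launch time,
together with the fs.s3n.impl setting shown earlier in the thread. Version
numbers below are illustrative:

  ./bin/spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0,com.amazonaws:aws-java-sdk:1.7.4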

Re: Slower performance when bigger memory?

2015-04-23 Thread Ted Yu
Shuai:
Please take a look at:

http://blog.takipi.com/garbage-collectors-serial-vs-parallel-vs-cms-vs-the-g1-and-whats-new-in-java-8/



 On Apr 23, 2015, at 10:18 AM, Dean Wampler deanwamp...@gmail.com wrote:
 
 JVM's often have significant GC overhead with heaps bigger than 64GB. You 
 might try your experiments with configurations below this threshold.
 
 dean
 
 Dean Wampler, Ph.D.
 Author: Programming Scala, 2nd Edition (O'Reilly)
 Typesafe
 @deanwampler
 http://polyglotprogramming.com
 
 On Thu, Apr 23, 2015 at 12:14 PM, Shuai Zheng szheng.c...@gmail.com wrote:
 Hi All,
 
  
 
 I am running some benchmark on r3*8xlarge instance. I have a cluster with 
 one master (no executor on it) and one slave (r3*8xlarge).
 
  
 
 My job has 1000 tasks in stage 0.
 
  
 
 R3*8xlarge has 244G memory and 32 cores.
 
  
 
  If I create 4 executors, each with 8 cores + 50G memory, each task takes
  around 320s-380s. And if I only use one big executor with 32 cores and 200G
  memory, each task takes 760s-900s.
 
  
 
 And I check the log, looks like the minor GC takes much longer when using 
 200G memory:
 
  
 
 285.242: [GC [PSYoungGen: 29027310K-8646087K(31119872K)] 
 38810417K-19703013K(135977472K), 11.2509770 secs] [Times: user=38.95 
 sys=120.65, real=11.25 secs]
 
  
 
 And when it uses 50G memory, the minor GC takes only less than 1s.
 
  
 
  I am trying to work out the best way to configure Spark. For a particular
  reason, I am tempted to use bigger memory on a single executor if there is no significant
  performance penalty. But now it looks like there is one?
 
  
 
 Anyone has any idea?
 
  
 
 Regards,
 
  
 
 Shuai
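
If a single large heap has to be kept, one GC-side knob worth experimenting with
(purely a sketch, in line with the collector comparison linked above) is
switching the executors to G1 and logging GC details:

  spark-submit ... \
    --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
    ...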
 
 


Re: implicits is not a member of org.apache.spark.sql.SQLContext

2015-04-21 Thread Ted Yu
Have you tried the following ?
  import sqlContext._
  import sqlContext.implicits._

Cheers

On Tue, Apr 21, 2015 at 7:54 AM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:

  I tried to convert an RDD to a data frame using the example codes on
 spark website





 case class Person(name: String, age: Int)



 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.implicits._

 val people =
 sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p
 => *Person*(p(0), p(1).trim.toInt)).toDF()



 It compile fine using sbt. But in IntelliJ IDEA 14.0.3, it fail to compile
 and generate the following error



 Error:(289, 23) value *implicits is not a member of
 org.apache.spark.sql.SQLContext*

 import sqlContext.implicits._

   ^



 What is the problem here? Is there any work around (e.g. do explicit
 conversion)?



 Here is my build.sbt



 name := "my-project"

 version := "0.2"

 scalaVersion := "2.10.4"

 val sparkVersion = "1.3.1"

 val luceneVersion = "4.10.2"

 libraryDependencies <++= scalaVersion {
   scala_version => Seq(
     "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
     "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
     "spark.jobserver" % "job-server-api" % "0.4.1" % "provided",
     "org.scalatest" %% "scalatest" % "2.2.1" % "test"
   )
 }

 resolvers += "Spark Packages Repo" at
 "http://dl.bintray.com/spark-packages/maven"

 resolvers += Resolver.mavenLocal





 Ningjun
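
If the IDE keeps flagging the implicit even though sbt compiles, one workaround
is to skip the implicit conversion and call createDataFrame directly (a sketch
against the 1.3 API, reusing the Person case class and contexts above):

  val peopleRdd = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
  val people = sqlContext.createDataFrame(peopleRdd)  // no implicits needed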





Re: Meet Exception when learning Broadcast Variables

2015-04-21 Thread Ted Yu
Does line 27 correspond to brdcst.value ?

Cheers



 On Apr 21, 2015, at 3:19 AM, donhoff_h 165612...@qq.com wrote:
 
 Hi, experts.
 
 I wrote a very small program to learn how to use broadcast variables, but
 hit an exception. The program and the exception are listed below.
 Could anyone help me solve this problem? Thanks!
 
 **My Program is as following**
 object TestBroadcast02 {
  var brdcst : Broadcast[Array[Int]] = null
 
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    brdcst = sc.broadcast(Array(1,2,3,4,5,6))
    val rdd =
      sc.textFile("hdfs://bgdt-dev-hrb/user/spark/tst/charset/A_utf8.txt")
    rdd.foreachPartition(fun1)
    sc.stop()
  }

  def fun1(it : Iterator[String]) : Unit = {
    val v = brdcst.value
    for (i <- v) println("BroadCast Variable:" + i)
    for (j <- it) println("Text File Content:" + j)
  }
 } 
 **The Exception is as following**
 15/04/21 17:39:53 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
 (TID 0, bgdt01.dev.hrb): java.lang.NullPointerException
 at 
 dhao.test.BroadCast.TestBroadcast02$.fun1(TestBroadcast02.scala:27)
 at 
 dhao.test.BroadCast.TestBroadcast02$$anonfun$main$1.apply(TestBroadcast02.scala:22)
 at 
 dhao.test.BroadCast.TestBroadcast02$$anonfun$main$1.apply(TestBroadcast02.scala:22)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
 at 
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
 at 
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 
 By the way, if I use an anonymous function instead of 'fun1' in my program, it
 works. But since I think anonymous functions hurt readability,
 I still prefer to use 'fun1'.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
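
For what it's worth, the var lives on the driver-side object and is left as null
in each executor JVM, while a Broadcast handle passed through the closure is
serialized with the task. A sketch of that variant, changing only the signatures
of the program above:

  def fun1(b: Broadcast[Array[Int]])(it: Iterator[String]): Unit = {
    for (i <- b.value) println("BroadCast Variable:" + i)
    for (j <- it) println("Text File Content:" + j)
  }

  // at the call site, partially apply with the broadcast created on the driver
  rdd.foreachPartition(fun1(brdcst) _)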



Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

2015-04-22 Thread Ted Yu
This thread from hadoop mailing list should give you some clue:
http://search-hadoop.com/m/LgpTk2df7822

On Wed, Apr 22, 2015 at 9:45 AM, Sujee Maniyam su...@sujee.net wrote:

 Hi all
 I am unable to access s3n://  urls using   sc.textFile().. getting 'no
 file system for scheme s3n://'  error.

 a bug or some conf settings missing?

 See below for details:

 env variables :
 AWS_SECRET_ACCESS_KEY=set
 AWS_ACCESS_KEY_ID=set

 spark/RELAESE :
 Spark 1.3.1 (git revision 908a0bf) built for Hadoop 2.6.0
 Build flags: -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive
 -Phive-thriftserver -Pyarn -DzincPort=3034


 ./bin/spark-shell
  val f = sc.textFile("s3n://bucket/file")
  f.count

 error==
 java.io.IOException: No FileSystem for scheme: s3n
 at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
 at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
 at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
 at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
 at
 org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
 at
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
 at
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)
 at org.apache.spark.rdd.RDD.count(RDD.scala:1006)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:24)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:29)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:31)
 at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:33)
 at $iwC$$iwC$$iwC$$iwC.init(console:35)
 at $iwC$$iwC$$iwC.init(console:37)
 at $iwC$$iwC.init(console:39)
 at $iwC.init(console:41)
 at init(console:43)
 at .init(console:47)
 at .clinit(console)
 at .init(console:7)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
 at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
 at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
 at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
 at org.apache.spark.repl.SparkILoop.org
 $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.org
 $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at 

Re: Auto Starting a Spark Job on Cluster Starup

2015-04-22 Thread Ted Yu
This thread seems related:
http://search-hadoop.com/m/JW1q51W02V

Cheers

On Wed, Apr 22, 2015 at 6:09 AM, James King jakwebin...@gmail.com wrote:

 What's the best way to start up a Spark job as part of starting up the
 Spark cluster?

 I have a single uber jar for my job and want to make the start-up as easy
 as possible.

 Thanks

 jk





Re: Spark Performance on Yarn

2015-04-22 Thread Ted Yu
In master branch, overhead is now 10%. 
That would be 500 MB 

FYI



 On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote:
 
 +1 to executor-memory to 5g.
 Do check the overhead space for both the driver and the executor as per
 Wilfred's suggestion.
 
 Typically, 384 MB should suffice.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
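
For reference, the overhead can also be set explicitly instead of relying on the
default fraction; values are in MB and the numbers below are only an example:

  spark-submit --master yarn ... \
    --conf spark.yarn.executor.memoryOverhead=1024 \
    --conf spark.yarn.driver.memoryOverhead=512 \
    ...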



Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Ted Yu
Yin:
Fix Version of SPARK-4520 is not set.
I assume it was fixed in 1.3.0

Cheers

On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai yh...@databricks.com wrote:

 The exception looks like the one mentioned in
 https://issues.apache.org/jira/browse/SPARK-4520. What is the version of
 Spark?

 On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Hi,

 My data looks like this:

  +-----------+----------------------------+----------+
  | col_name  | data_type                  | comment  |
  +-----------+----------------------------+----------+
  | cust_id   | string                     |          |
  | part_num  | int                        |          |
  | ip_list   | array<struct<ip:string>>   |          |
  | vid_list  | array<struct<vid:string>>  |          |
  | fso_list  | array<struct<fso:string>>  |          |
  | src       | string                     |          |
  | date      | int                        |          |
  +-----------+----------------------------+----------+

 And I did select *, it reports ParquetDecodingException.

 Is this type not supported in SparkSQL?

 Detailed error message here:


 Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
 Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in 
 stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com): 
 parquet.io.ParquetDecodingException:
 Can not read value at 0 in block -1 in file 
 hdfs://xxx/part-m-0.gz.parquet
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
 at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
 at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
 at 
 scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
 at scala.collection.AbstractIterator.to(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
 at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
 at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
 at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
 at 
 org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
 at 
 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:64)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.elementData(ArrayList.java:400)
 at java.util.ArrayList.get(ArrayList.java:413)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
 at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
 at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
 at 
 parquet.io.RecordReaderImplementation.init(RecordReaderImplementation.java:290)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
 at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
 at 
 parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
 at 
 parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
 at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
 at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)




 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





Re: ORCFiles

2015-04-24 Thread Ted Yu
Please see SPARK-2883
There is no Fix Version yet.

On Fri, Apr 24, 2015 at 5:45 PM, David Mitchell jdavidmitch...@gmail.com
wrote:

 Does anyone know in which version of Spark will there be support for
 ORCFiles via spark.sql.hive?  Will it be in 1.4?

 David






Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown

2015-04-25 Thread Ted Yu
Looks like this is related:
https://issues.apache.org/jira/browse/SPARK-5456

On Sat, Apr 25, 2015 at 6:59 AM, doovs...@sina.com wrote:

 Hi all,
 When I query Postgresql based on Spark SQL like this:
    dataFrame.registerTempTable("Employees")
    val emps = sqlContext.sql("select name, sum(salary) from Employees
  group by name, salary")
    monitor {
      emps.take(10)
        .map(row => (row.getString(0), row.getDecimal(1)))
        .foreach(println)
    }

 The type of salary column in data table is numeric(10, 2).

 It throws the following exception:
 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent
 failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.lang.ClassCastException: java.math.BigDecimal cannot be cast to
 org.apache.spark.sql.types.Decimal

  Does anyone know about this issue and how to solve it? Thanks.

 Regards,
 Yi
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: How to debug Spark on Yarn?

2015-04-23 Thread Ted Yu
For step 2, you can pipe application log to a file instead of copy-pasting. 

Cheers



 On Apr 22, 2015, at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
 
 I submit a spark app to YARN and i get these messages
 
 
 
 15/04/22 22:45:04 INFO yarn.Client: Application report for 
 application_1429087638744_101363 (state: RUNNING)
 
 
 15/04/22 22:45:04 INFO yarn.Client: Application report for 
 application_1429087638744_101363 (state: RUNNING).
 
 ...
 
 
 
 1) I can go to the Spark UI and see the status of the app, but cannot see the logs
 as the job progresses. How can I see executor logs while the job is running?
 
 2) If the app fails or completes, the Spark UI vanishes and I get a YARN
 job page that says the job failed, with no link back to the Spark UI. I then
 take the application ID and run yarn application logs appid, and my console fills up
 (with huge scrolling) with the logs of all executors. Then I copy and paste them into
 a text editor and search for keywords like Exception and Job aborted due to. Is
 this the right way to view logs?
 
 
 -- 
 Deepak
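
A concrete form of that suggestion, using the application id from this thread
(the grep patterns are just examples):

  yarn logs -applicationId application_1429087638744_101363 > app_logs.txt
  grep -n -E "Exception|Job aborted due to" app_logs.txt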
 


Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Ted Yu
To my knowledge, Spark SQL currently doesn't provide range scan capability 
against hbase. 

Cheers



 On Apr 20, 2015, at 7:54 AM, Jeetendra Gangele gangele...@gmail.com wrote:
 
 HI All,
 
 I am querying HBase, combining the results, and using them in my Spark job.
 I am querying HBase using the HBase client API inside my Spark job.
 Can anybody tell me whether Spark SQL will be fast enough and support range
 queries?
 
 Regards
 Jeetendra
 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: compliation error

2015-04-19 Thread Ted Yu
What JDK release are you using ?

Can you give the complete command you used ?

Which Spark branch are you working with ?

Cheers

On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula 
brahmareddy.batt...@huawei.com wrote:

  Hi All

  Getting the following error when I am compiling Spark. What did I miss?
  I even googled and did not find an exact solution for this...


 [ERROR] Failed to execute goal
 org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project
 spark-assembly_2.10: Error creating shaded jar: Error in ASM processing
 class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 - [Help
 1]
 org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
 goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on
 project spark-assembly_2.10: Error creating shaded jar: Error in ASM
 processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
 at
 org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
 at
 org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
 at
 org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
 at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)



  Thanks  Regards

 Brahma Reddy Battula






Re: HBase HTable constructor hangs

2015-04-28 Thread Ted Yu
Can you give us more information ?
Such as hbase release, Spark release.

If you can pastebin jstack of the hanging HTable process, that would help.

BTW I used http://search-hadoop.com/?q=spark+HBase+HTable+constructor+hangs
and saw a very old thread with this subject.

Cheers

On Tue, Apr 28, 2015 at 7:12 PM, tridib tridib.sama...@live.com wrote:

 I am having exactly the same issue. I am running HBase and Spark in Docker
 containers.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/HBase-HTable-constructor-hangs-tp4926p22696.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: hive-thriftserver maven artifact

2015-04-28 Thread Ted Yu
Credit goes to Misha Chernetsov (see SPARK-4925)

FYI

On Tue, Apr 28, 2015 at 8:25 AM, Marco marco@gmail.com wrote:

 Thx Ted for the info !

 2015-04-27 23:51 GMT+02:00 Ted Yu yuzhih...@gmail.com:

 This is available for 1.3.1:

 http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10

 FYI

 On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote:

 Ok, so will it be only available for the next version (1.30)?

 2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com:

 I searched for 'spark-hive-thriftserver_2.10' on this page:
 http://mvnrepository.com/artifact/org.apache.spark

 Looks like it is not published.

 On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote:

 Hi,

 I am referring to https://issues.apache.org/jira/browse/SPARK-4925
 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the
 artifact in a public repository ? I have not found it @Maven Central.

 Thanks,
 Marco





 --
  Best regards,
 Marco





 --
  Best regards,
 Marco
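
For completeness, the 1.3.1 coordinates from the link above, expressed as a POM
dependency:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive-thriftserver_2.10</artifactId>
    <version>1.3.1</version>
  </dependency>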



Re: HBase HTable constructor hangs

2015-04-28 Thread Ted Yu
How did you distribute hbase-site.xml to the nodes ?

Looks like HConnectionManager couldn't find the hbase:meta server.

Cheers

On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta tridib.sama...@live.com
wrote:

 I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0.

 Here is the jstack trace. Complete stack trace attached.

 Executor task launch worker-1 #58 daemon prio=5 os_prio=0
 tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
  at java.lang.Thread.sleep(Native Method)
  at
 org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:152)
  - locked 0xf8cb7258 (a
 org.apache.hadoop.hbase.client.RpcRetryingCaller)
  at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
  at
 org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
  at
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
  at
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
  - locked 0xf84ac0b0 (a java.lang.Object)
  at
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
  at
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
  at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
  at org.apache.hadoop.hbase.client.HTable.init(HTable.java:192)
  at org.apache.hadoop.hbase.client.HTable.init(HTable.java:150)
  at com.mypackage.storeTuples(CubeStoreService.java:59)
  at
 com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:23)
  at
 com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:13)
  at
 org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
  at
 org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
  at
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
  at
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
  at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
  at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
 Executor task launch worker-0 #57 daemon prio=5 os_prio=0
 tid=0x7fd3d0443800 nid=0x487 waiting for monitor entry
 [0x7fd4506d8000]
java.lang.Thread.State: BLOCKED (on object monitor)
  at
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1156)
  - waiting to lock 0xf84ac0b0 (a java.lang.Object)
  at
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
  at
 org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
  at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
  at org.apache.hadoop.hbase.client.HTable.init(HTable.java:192)
  at org.apache.hadoop.hbase.client.HTable.init(HTable.java:150)
  at com.mypackage.storeTuples(CubeStoreService.java:59)
  at
 com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:23)
  at
 com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:13)
  at
 org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
  at
 org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
  at
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
  at
 org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
  at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
  at
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)

 --
 Date: Tue, 28 Apr 2015 19:35:26 -0700
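
One common way to make hbase-site.xml visible is to put its directory on both
the driver and executor classpaths at submit time; a sketch, assuming
/etc/hbase/conf holds hbase-site.xml inside every container:

  spark-submit ... \
    --conf spark.driver.extraClassPath=/etc/hbase/conf \
    --conf spark.executor.extraClassPath=/etc/hbase/conf \
    ...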
 

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Can you run the command 'ulimit -n' to see the current limit ?

To configure ulimit settings on Ubuntu, edit */etc/security/limits.conf*
Cheers

On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Hi all,

 I am using the direct approach to receive real-time data from Kafka in the
 following link:

 https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html


 My code follows the word count direct example:


 https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala



 After around 12 hours, I got the following error messages in Spark log:

 15/04/29 20:18:10 ERROR JobScheduler: Error generating jobs for time
 143033869 ms
 org.apache.spark.SparkException: ArrayBuffer(java.io.IOException: Too many
 open files, java.io.IOException: Too many open files, java.io.IOException:
 Too many open files, java.io.IOException: Too many open files,
 java.io.IOException: Too many open files)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:94)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:116)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Ted Yu
Maybe add statement.close() in finally block ?

Streaming / Kafka experts may have better insight.
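
A minimal sketch of that pattern, based on the ingestToMysql code quoted
below: the Statement is tracked outside the try block and closed before the
Connection in the finally block. The table name and connection URL are taken
from the quoted code and are illustrative only; it assumes
java.sql.DriverManager is imported and a logger is in scope, as in the
original job:

  def ingestToMysql(data: Array[String]) {
    val url = "jdbc:mysql://localhost:3306/realtime?user=root&password=123"
    var sql = "insert into loggingserver1 values "
    data.foreach(line => sql += line)
    sql = sql.dropRight(1)
    sql += ";"
    logger.info(sql)
    var conn: java.sql.Connection = null
    var statement: java.sql.Statement = null
    try {
      conn = DriverManager.getConnection(url)
      statement = conn.createStatement()
      statement.executeUpdate(sql)
    } catch {
      case e: Exception => logger.error(e.getMessage())
    } finally {
      // close the Statement first, then the Connection, so neither leaks
      if (statement != null) {
        try { statement.close() } catch { case e: Exception => logger.error(e.getMessage()) }
      }
      if (conn != null) {
        conn.close()
      }
    }
  }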

On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay bill.jaypeter...@gmail.com
wrote:

 Thanks for the suggestion. I ran the command and the limit is 1024.

 Based on my understanding, the Kafka connector should not open so many
 files. Do you think there could be a socket leak? BTW, in every batch
 (the batch interval is 5 seconds), I output some results to MySQL:

   def ingestToMysql(data: Array[String]) {
     val url = "jdbc:mysql://localhost:3306/realtime?user=root&password=123"
     var sql = "insert into loggingserver1 values "
     data.foreach(line => sql += line)
     sql = sql.dropRight(1)
     sql += ";"
     logger.info(sql)
     var conn: java.sql.Connection = null
     try {
       conn = DriverManager.getConnection(url)
       val statement = conn.createStatement()
       statement.executeUpdate(sql)
     } catch {
       case e: Exception => logger.error(e.getMessage())
     } finally {
       if (conn != null) {
         conn.close
       }
     }
   }

 I am not sure whether the leak originates from the Kafka connector or the
 SQL connections.

 Bill

 On Wed, Apr 29, 2015 at 2:12 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you run the command 'ulimit -n' to see the current limit ?

 To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf
 Cheers

 On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay bill.jaypeter...@gmail.com
 wrote:

 Hi all,

 I am using the direct approach to receive real-time data from Kafka in
 the following link:

 https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html


 My code follows the word count direct example:


 https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala



 After around 12 hours, I got the following error messages in Spark log:

 15/04/29 20:18:10 ERROR JobScheduler: Error generating jobs for time
 143033869 ms
 org.apache.spark.SparkException: ArrayBuffer(java.io.IOException: Too
 many open files, java.io.IOException: Too many open files,
 java.io.IOException: Too many open files, java.io.IOException: Too many
 open files, java.io.IOException: Too many open files)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:94)
 at
 org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:116)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
 at scala.Option.orElse(Option.scala:257)
 at
 org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
 at
 org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Ted Yu
bq. a single query on one filter criteria

Can you tell us more about your filter ? How selective is it ?

Which hbase release are you using ?

Cheers

On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale 
siddharth.ub...@syncoms.com wrote:

  Hi,



 I want to use Spark as a query engine on HBase with sub-second latency.



 I am using Spark version 1.3, and followed the steps below on an HBase
 table with around 3.5 lakh (~350,000) rows:



 1.   Mapped the DataFrame to the HBase table. RDDCustomers maps to the
 HBase table which is used to create the DataFrame:

        DataFrame schemaCustomers = sqlInstance
            .createDataFrame(SparkContextImpl.getRddCustomers(),
                Customers.class);

 2.   Registered a temp table, i.e.
        schemaCustomers.registerTempTable("customers");

 3.   Ran the query on the DataFrame using the SQLContext instance.



 What I am observing is that for a single query on one filter criteria, the
 query takes 7-8 seconds, and the time increases as I increase the number of
 rows in the HBase table. Also, there was one time when I got the query
 response in under 1-2 seconds, which seems like strange behavior.

 Is this expected behavior from Spark, or am I missing something here?

 Can somebody help me understand this scenario? Please assist.



 Thanks,

 Siddharth Ubale,





Re: spark with standalone HBase

2015-04-30 Thread Ted Yu
$partitions$2.apply(RDD.scala:219)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1632)
 at org.apache.spark.rdd.RDD.count(RDD.scala:1012)
 at org.apache.spark.examples.HBaseTest$.main(HBaseTest.scala:58)
 at org.apache.spark.examples.HBaseTest.main(HBaseTest.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:607)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:190)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 On Thu, Apr 30, 2015 at 1:25 PM, Saurabh Gupta saurabh.gu...@semusi.com
 wrote:

 I am using hbase -0.94.8.

 On Wed, Apr 29, 2015 at 11:56 PM, Ted Yu yuzhih...@gmail.com wrote:

 Can you enable HBase DEBUG logging in log4j.properties so that we can
 have more clue ?

 What hbase release are you using ?

 Cheers
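
 For example, a line like the following in the client-side log4j.properties
 turns on DEBUG logging for the HBase client classes (the logger name is just
 the top-level HBase package; narrow it if the output becomes too verbose):

   log4j.logger.org.apache.hadoop.hbase=DEBUG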

 On Wed, Apr 29, 2015 at 4:27 AM, Saurabh Gupta saurabh.gu...@semusi.com
  wrote:

 Hi,

 I am working with standalone HBase, and I want to execute
 HBaseTest.scala (in the Scala examples).

 I have created a test table with three rows and I just want to get the
 count using HBaseTest.scala

 I am getting this issue:

 15/04/29 11:17:10 INFO BlockManagerMaster: Registered BlockManager
 15/04/29 11:17:11 INFO ZooKeeper: Client
 environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:host.name
 =ip-10-144-185-113
 15/04/29 11:17:11 INFO ZooKeeper: Client
 environment:java.version=1.7.0_79
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.vendor=Oracle
 Corporation
 15/04/29 11:17:11 INFO ZooKeeper: Client
 environment:java.home=/usr/lib/jvm/java-7-openjdk-amd64/jre
 15/04/29 11:17:11 INFO ZooKeeper: Client
 environment:java.class.path=/home/ubuntu/sparkfolder/conf/:/home/ubuntu/sparkfolder/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.2.0.jar:/home/ubuntu/sparkfolder/lib_managed/jars/datanucleus-core-3.2.10.jar:/home/ubuntu/sparkfolder/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/home/ubuntu/sparkfolder/lib_managed/jars/datanucleus-rdbms-3.2.9.jar
 15/04/29 11:17:11 INFO ZooKeeper: Client
 environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.compiler=NA
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:os.name=Linux
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:os.arch=amd64
 15/04/29 11:17:11 INFO ZooKeeper: Client
 environment:os.version=3.13.0-49-generic
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:user.name=root
 15/04/29 11:17:11 INFO ZooKeeper: Client environment:user.home=/root
 15/04/29 11:17:11 INFO ZooKeeper: Client
 environment:user.dir=/home/ubuntu/sparkfolder
 15/04/29 11:17:11 INFO ZooKeeper: Initiating client connection,
 connectString=localhost:2181 sessionTimeout=9
 watcher=hconnection-0x2711025f, quorum=localhost:2181, baseZNode=/hbase
 15/04/29 11:17:11 INFO RecoverableZooKeeper: Process
 identifier=hconnection-0x2711025f connecting to ZooKeeper
 ensemble=localhost:2181
 15/04/29 11:17:11 INFO ClientCnxn: Opening socket connection to server
 ip-10-144-185-113/10.144.185.113:2181. Will not attempt to
 authenticate using SASL (unknown error)
 15/04/29 11:17:11 INFO ClientCnxn: Socket connection established to
 ip-10-144-185-113/10.144.185.113:2181, initiating session
 15/04/29 11:17:11 INFO ClientCnxn: Session establishment complete on
 server ip-10-144-185-113/10.144.185.113:2181, sessionid =
 0x14d04d506da0005, negotiated timeout = 4
 15/04/29 11:17:11 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper
 is null

 It's just stuck, not showing any error. There is no Hadoop on my machine.
 What could be the issue?

 here is hbase-site.xml:

 <configuration>
   <property>
     <name>hbase.zookeeper.quorum</name>
     <value>localhost</value>
   </property>

   <property>
     <name>hbase.zookeeper.property.clientPort</name>
     <value>2181</value>
   </property>

   <property>
     <name>zookeeper.znode.parent</name>
     <value>/hbase</value>
   </property>
 </configuration>







Re: hive-thriftserver maven artifact

2015-04-27 Thread Ted Yu
This is available for 1.3.1:
http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10

FYI
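
For instance, with sbt the artifact from the link above can be declared as a
dependency like this (version and Scala suffix taken from that page; adjust
them to match the build in use):

  libraryDependencies += "org.apache.spark" % "spark-hive-thriftserver_2.10" % "1.3.1"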

On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote:

 OK, so will it only be available for the next version (1.3.0)?

 2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com:

 I searched for 'spark-hive-thriftserver_2.10' on this page:
 http://mvnrepository.com/artifact/org.apache.spark

 Looks like it is not published.

 On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote:

 Hi,

 I am referring to https://issues.apache.org/jira/browse/SPARK-4925
 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the
 artifact in a public repository ? I have not found it @Maven Central.

 Thanks,
 Marco





 --
 Viele Grüße,
 Marco



Re: Exception in using updateStateByKey

2015-04-27 Thread Ted Yu
Which hadoop release are you using ?

Can you check the HDFS audit log to see who deleted spark/ck/hdfsaudit/
receivedData/0/log-1430139541443-1430139601443, and when ?

Cheers

On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote:

 Hi, all:
 I use the function updateStateByKey in Spark Streaming. I need to store the
 states for one minute, so I set spark.cleaner.ttl to 120; the batch duration
 is 2 seconds, but it throws an Exception:


 Caused by:
 org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
 does not exist:
 spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
 at
 org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
 at
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)

 at org.apache.hadoop.ipc.Client.call(Client.java:1347)
 at org.apache.hadoop.ipc.Client.call(Client.java:1300)
 at
 org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
 at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
 at
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
 at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)

 Why?

 my code is

 ssc = StreamingContext(sc,2)
 kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
 kvs.window(60,2).map(lambda x: analyzeMessage(x[1]))\
 .filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \
 .filter(lambda x: x[1]['isExisted'] != 1) \
 .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))





Re: Spark - Hive Metastore MySQL driver

2015-05-02 Thread Ted Yu
Can you try the patch from:
[SPARK-6913][SQL] Fixed java.sql.SQLException: No suitable driver found

Cheers

On Sat, Mar 28, 2015 at 12:41 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 This is from my Hive installation

 -sh-4.1$ ls /apache/hive/lib  | grep derby

 derby-10.10.1.1.jar

 derbyclient-10.10.1.1.jar

 derbynet-10.10.1.1.jar


 -sh-4.1$ ls /apache/hive/lib  | grep datanucleus

 datanucleus-api-jdo-3.2.6.jar

 datanucleus-core-3.2.10.jar

 datanucleus-rdbms-3.2.9.jar


 -sh-4.1$ ls /apache/hive/lib  | grep mysql

 mysql-connector-java-5.0.8-bin.jar

 -sh-4.1$


 $ hive --version

 Hive 0.13.0.2.1.3.6-2

 Subversion
 git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6
 -r 87da9430050fb9cc429d79d95626d26ea382b96c


 $



 On Sat, Mar 28, 2015 at 1:05 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 I tried with a different version of driver but same error

 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
 --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,
 /home/dvasthimal/spark1.3/mysql-connector-java-5.0.8-bin.jar --files
 $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory 4g
 --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g
 --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2

 On Sat, Mar 28, 2015 at 12:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 This is what am seeing



 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
 --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar
 --files $SPARK_HOME/conf/hive-site.xml  --num-executors 1 --driver-memory
 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g
 --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


 Caused by:
 org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException:
 The specified datastore driver (com.mysql.jdbc.Driver) was not found in
 the CLASSPATH. Please check your CLASSPATH specification, and the name of
 the driver.




 ./bin/spark-submit -v --master yarn-cluster --driver-class-path
 /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar
 --jars
 /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar
 --files $SPARK_HOME/conf/hive-site.xml  --num-executors 1
 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G
 --executor-memory 2g --executor-cores 1 --queue hdmi-express --class
 com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar
 startDate=2015-02-16 endDate=2015-02-16
 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro
 subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2


 Caused by: java.sql.SQLException: No suitable driver found for
 jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB
 at java.sql.DriverManager.getConnection(DriverManager.java:596)


 Looks like the driver jar that i got in is not correct,

 On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 Could someone please share the spark-submit command that shows their
 mysql jar containing driver class used to connect to Hive MySQL meta store.

 Even after including it through

 
