Re: Unable to stop Worker in standalone mode by sbin/stop-all.sh
Does the machine have a cron job that periodically cleans up the /tmp dir?

Cheers

On Thu, Mar 12, 2015 at 6:18 PM, sequoiadb mailing-list-r...@sequoiadb.com wrote:

Checking the script, it seems spark-daemon.sh is unable to stop the worker:

$ ./spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1
no org.apache.spark.deploy.worker.Worker to stop
$ ps -elf | grep spark
0 S taoewang 24922 1 0 80 0 - 733878 futex_ Mar12 ? 00:08:54 java -cp /data/sequoiadb-driver-1.10.jar,/data/spark-sequoiadb-0.0.1-SNAPSHOT.jar::/data/spark/conf:/data/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-SNAPSHOT-hadoop2.4.0.jar -XX:MaxPermSize=128m -Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=centos-151:2181,centos-152:2181,centos-153:2181 -Dspark.deploy.zookeeper.dir=/data/zookeeper -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.worker.Worker spark://centos-151:7077,centos-152:7077,centos-153:7077

In the spark-daemon script it tries to find $pid in /tmp/:

pid=$SPARK_PID_DIR/spark-$SPARK_IDENT_STRING-$command-$instance.pid

In my case the pid file is supposed to be /tmp/spark-taoewang-org.apache.spark.deploy.worker.Worker-1.pid. However, when I go through the files in the /tmp directory I don't find such a file. I have 777 on /tmp and also tried to touch a file with my current account, which succeeded, so it shouldn't be a permission issue:

$ ls -la / | grep tmp
drwxrwxrwx. 6 root root 4096 Mar 13 08:19 tmp

Does anyone have any idea why the pid file didn't show up?

Thanks
TW
Re: OutOfMemoryError when using DataFrame created by Spark SQL
Can you try giving the Spark driver more heap?

Cheers

On Mar 25, 2015, at 2:14 AM, Todd Leo sliznmail...@gmail.com wrote:

Hi,

I am using Spark SQL to query my Hive cluster, following the Spark SQL and DataFrame Guide step by step. However, my HiveQL via sqlContext.sql() fails and java.lang.OutOfMemoryError is raised. The expected result of the query is small (I added a limit 1000 clause). My code is shown below:

scala> import sqlContext.implicits._
scala> val df = sqlContext.sql("select * from some_table where logdate=2015-03-24 limit 1000")

and the error msg:

[ERROR] [03/25/2015 16:08:22.379] [sparkDriver-scheduler-27] [ActorSystem(sparkDriver)] Uncaught fatal error from thread [sparkDriver-scheduler-27] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: GC overhead limit exceeded

The master heap memory is set by -Xms512m -Xmx512m, while the workers are set by -Xms4096M -Xmx4096M, which I presume is sufficient for this trivial query. Additionally, after restarting the spark-shell and re-running the limit 5 query, the df object is returned and can be printed by df.show(), but other APIs fail with OutOfMemoryError, namely df.count(), df.select("some_field").show() and so forth. I understand that the RDD can be collected to the master so further transformations can be applied; since DataFrame has "richer optimizations under the hood" and coming from an R/Julia background, I really hope this error can be tackled and that DataFrame is robust enough to depend on. Thanks in advance!

REGARDS,
Todd
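The OutOfMemoryError above is raised in the driver's ActorSystem, so the first knob to try is the driver heap rather than executor memory. A minimal sketch, assuming Spark 1.3; the 4g figure is illustrative, not a recommendation:

// launched from the shell: give the driver a larger heap up front
// ./bin/spark-shell --driver-memory 4g

// for a packaged app, the equivalent config key; note it must be set
// before the driver JVM starts (e.g. via spark-submit or spark-defaults.conf);
// setting it on a SparkConf after the shell is already running has no effect
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.driver.memory", "4g")

Keeping the result distributed also sidesteps driver memory: df.limit(1000).show() only ships a small sample to the driver, whereas collect() pulls the whole result back.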
Re: Untangling dependency issues in spark streaming
For Gradle, there are:
https://github.com/musketyr/gradle-fatjar-plugin
https://github.com/johnrengelman/shadow

FYI

On Sun, Mar 29, 2015 at 4:29 PM, jay vyas jayunit100.apa...@gmail.com wrote:

Thanks for posting this! I've run into similar issues before, and generally it's a bad idea to swap the libraries out and pray for the best, so the shade functionality is probably the best feature. Unfortunately, I'm not sure how well SBT and Gradle support shading... how do folks using next-gen build tools solve this problem?

On Sun, Mar 29, 2015 at 3:10 AM, Neelesh neele...@gmail.com wrote:

Hi,

My streaming app uses org.apache.httpcomponent:httpclient:4.3.6, but Spark uses 4.2.6, and I believe that's what's causing the following error. I've tried setting spark.executor.userClassPathFirst and spark.driver.userClassPathFirst to true in the config, but that does not solve it either. Finally I had to resort to relocating classes using the Maven shade plugin while building my app's uber jar:

<relocations>
  <relocation>
    <pattern>org.apache.http</pattern>
    <shadedPattern>org.shaded.apache.http</shadedPattern>
  </relocation>
</relocations>

Hope this is useful to others in the same situation. It would be really great to deal with this the right way (like Tomcat or any other servlet container - classloader hierarchy etc.).

Caused by: java.lang.NoSuchFieldError: INSTANCE
at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:52)
at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<init>(DefaultHttpRequestWriterFactory.java:56)
at org.apache.http.impl.io.DefaultHttpRequestWriterFactory.<clinit>(DefaultHttpRequestWriterFactory.java:46)
at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:72)
at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<init>(ManagedHttpClientConnectionFactory.java:84)
at org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.<clinit>(ManagedHttpClientConnectionFactory.java:59)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:494)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)

and

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.<init>(PoolingHttpClientConnectionManager.java:494)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)

-- jay vyas
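Regarding the SBT question: newer versions of the sbt-assembly plugin (0.14.0 and later) do support shading. A minimal build.sbt sketch mirroring the Maven relocation above; the syntax is as of sbt-assembly 0.14.x:

assemblyShadeRules in assembly := Seq(
  // rename the httpclient packages inside the uber jar so they cannot
  // collide with the 4.2.x classes that ship with Spark
  ShadeRule.rename("org.apache.http.**" -> "org.shaded.apache.http.@1").inAll
)

The @1 placeholder substitutes whatever the ** wildcard matched, so the whole package tree moves under the shaded prefix.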
Re: [Spark Streaming] Disk not being cleaned up during runtime after RDD being processed
Nathan:
Please look in the log files for any of the following:

doCleanupRDD(): case e: Exception => logError("Error cleaning RDD " + rddId, e)
doCleanupShuffle(): case e: Exception => logError("Error cleaning shuffle " + shuffleId, e)
doCleanupBroadcast(): case e: Exception => logError("Error cleaning broadcast " + broadcastId, e)

Cheers

On Sun, Mar 29, 2015 at 7:55 AM, Akhil Das ak...@sigmoidanalytics.com wrote:

Try these:
- Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM)
- Enable log rotation:

sparkConf.set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.size.maxBytes", "1024")
  .set("spark.executor.logs.rolling.maxRetainedFiles", "3")

Also see what's really getting filled on disk.

Thanks
Best Regards

On Sat, Mar 28, 2015 at 8:18 PM, Nathan Marin nathan.ma...@teads.tv wrote:

Hi,

I've been trying to use Spark Streaming for my real-time analysis application using the Kafka Stream API on a cluster (using the yarn version) of 6 executors with 4 dedicated cores and 8192mb of dedicated RAM. The thing is, my application should run 24/7 but the disk usage is leaking. This leads to some exceptions occurring when Spark tries to write on a file system where no space is left.

Here are some graphs showing the disk space remaining on a node where my application is deployed: http://i.imgur.com/vdPXCP0.png
The drops occurred on a 3 minute interval. The disk usage goes back to normal once I kill my application: http://i.imgur.com/ERZs2Cj.png

The persistence level of my RDD is MEMORY_AND_DISK_SER_2, but even when I tried MEMORY_ONLY_SER_2 the same thing happened (this mode shouldn't even allow Spark to write on disk, right?).

My question is: how can I force Spark (Streaming?) to remove whatever it stores immediately after it has been processed? Obviously it doesn't look like the disk is being cleaned up (even though the memory does), even though I call the rdd.unpersist() method for each RDD processed.

Here's a sample of my application code: http://pastebin.com/K86LE1J6
Maybe something is wrong in my app too?

Thanks for your help,
NM
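Two configuration knobs from this era are also worth knowing about when disk keeps filling on a 24/7 streaming job. A sketch only; spark.cleaner.ttl is a blunt instrument (it unconditionally drops metadata and shuffle data older than the TTL, and later Spark versions removed it), and the values below are illustrative:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // time-based cleanup of old RDD/shuffle metadata and files, in seconds;
  // must be longer than any window or checkpoint interval the job uses,
  // or Spark will clean data it still needs
  .set("spark.cleaner.ttl", "3600")
  // ask Spark Streaming to unpersist generated RDDs once they fall out of scope
  .set("spark.streaming.unpersist", "true")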
Re: Build fails on 1.3 Branch
Jenkins build failed too: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-1.3-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/326/consoleFull

For the moment, you can apply the following change:

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
index a53ae97..7ae4b38 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala
@@ -18,7 +18,7 @@ package org.apache.spark.sql

 import org.apache.spark.sql.catalyst.expressions.NamedExpression
-import org.apache.spark.sql.catalyst.plans.logical.{Project, NoRelation}
+import org.apache.spark.sql.catalyst.plans.logical.{Project}
 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.test.TestSQLContext
 import org.apache.spark.sql.test.TestSQLContext.implicits._

Cheers

On Sun, Mar 29, 2015 at 8:48 AM, mjhb sp...@mjhb.com wrote:

I tried pulling the source and building for the first time, but cannot get past the "object NoRelation is not a member of package org.apache.spark.sql.catalyst.plans.logical" error below on the 1.3 branch. I can build the 1.2 branch.

I have tried with both -Dscala-2.11 and 2.10 (after running the appropriate change-version-to-2.1#.sh), and with different combinations of hadoop flags. Relevant excerpts below - full build output available at http://mjhb.com/tmp/build-spark-1.3.out or http://mjhb.com/tmp/build-spark-1.3.out.gz

$ git branch
* branch-1.3

$ build/mvn -e -X -DskipTests clean package
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.7.0_76, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: linux, version: 3.16.0-33-generic, arch: amd64, family: unix

snipped...

[error] /home/marty/work/spark-1.3-maint/sql/core/src/test/scala/org/apache/spark/sql/ColumnExpressionSuite.scala:21: object NoRelation is not a member of package org.apache.spark.sql.catalyst.plans.logical
[error] import org.apache.spark.sql.catalyst.plans.logical.{Project, NoRelation}
[error]        ^
[error] one error found
[debug] Compilation failed (CompilerInterface)
[error] Compile failed at Mar 29, 2015 7:52:54 AM [5.793s]
Re: Too many open files
bq. In /etc/security/limits.conf set the next values:

Have you done the above modification on all the machines in your Spark cluster?

If you use Ubuntu, be sure that the /etc/pam.d/common-session file contains the following line:

session required pam_limits.so

On Mon, Mar 30, 2015 at 5:08 AM, Masf masfwo...@gmail.com wrote:

Hi.

I've done a relogin; in fact, I ran 'ulimit -n' and it returns 100, but it still crashes. I'm doing reduceByKey and SparkSQL mixed over 17 files (250MB-500MB/file).

Regards.
Miguel Angel.

On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das ak...@sigmoidanalytics.com wrote:

Mostly, you will have to restart the machines to get the ulimit effect (or relogin). What operation are you doing? Are you doing too many repartitions?

Thanks
Best Regards

On Mon, Mar 30, 2015 at 4:52 PM, Masf masfwo...@gmail.com wrote:

Hi

I have a problem with temp data in Spark. I have fixed spark.shuffle.manager to SORT. In /etc/security/limits.conf I set the next values:

* soft nofile 100
* hard nofile 100

In spark-env.sh I set ulimit -n 100. I've restarted the spark service and it continues crashing (Too many open files). How can I resolve this? I'm executing Spark 1.2.0 in Cloudera 5.3.2.

java.io.FileNotFoundException: /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/03/30 11:54:18 WARN TaskSetManager: Lost task 22.0 in stage 3.0 (TID 27, localhost): java.io.FileNotFoundException: /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Thanks!

-- Regards. Miguel Ángel
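Besides raising the OS limit, the number of files the sort shuffle holds open can be reduced directly by using fewer partitions: the spillToPartitionFiles frames in the trace open one temp file per output partition. A minimal sketch, with illustrative partition counts rather than values tuned for the 17-file workload above:

// fewer map-side tasks and fewer reduce-side partitions mean fewer
// temp_shuffle_* files open at the same time
val data = sc.textFile("/data/input").coalesce(64)
val counts = data.map(line => (line.split(",")(0), 1))
                 .reduceByKey(_ + _, 32) // the second argument bounds reduce partitions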
Re: Spark streaming with Kafka, multiple partitions fail, single partition ok
Nicolas:
See if there was an occurrence of the following exception in the log:

errs => throw new SparkException(
  s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " + errs.mkString("\n")),

Cheers

On Mon, Mar 30, 2015 at 9:40 AM, Cody Koeninger c...@koeninger.org wrote:

This line

at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(KafkaRDD.scala:158)

is the attempt to close the underlying Kafka simple consumer. We can add a null pointer check, but the underlying issue of the consumer being null probably indicates a problem earlier. Do you see any previous error messages? Also, can you clarify, for the successful and failed cases, which topics you are attempting this on, how many partitions there are, and whether there are messages in the partitions? There's an existing jira regarding empty partitions.

On Mon, Mar 30, 2015 at 11:05 AM, Nicolas Phung nicolas.ph...@gmail.com wrote:

Hello,

I'm using spark-streaming-kafka 1.3.0 with the new consumer "Approach 2: Direct Approach (No Receivers)" (http://spark.apache.org/docs/latest/streaming-kafka-integration.html). I'm using the following code snippets:

// Create direct kafka stream with brokers and topics
val messages = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topicsSet)

// Get the stuff from Kafka and print them
val raw = messages.map(_._2)
val dStream: DStream[RawScala] = raw.map( byte => {
  // Avro Decoder
  println("Byte length: " + byte.length)
  val rawDecoder = new AvroDecoder[Raw](schema = Raw.getClassSchema)
  RawScala.toScala(rawDecoder.fromBytes(byte))
})

// Useful for debug
dStream.print()

I launch my Spark Streaming job and everything is fine if there are no incoming logs from Kafka. When I send a log, I get the following error:

15/03/30 17:34:40 ERROR TaskContextImpl: Error in TaskCompletionListener
java.lang.NullPointerException
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.close(KafkaRDD.scala:158)
at org.apache.spark.util.NextIterator.closeIfNeeded(NextIterator.scala:63)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:101)
at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator$$anonfun$1.apply(KafkaRDD.scala:101)
at org.apache.spark.TaskContextImpl$$anon$1.onTaskCompletion(TaskContextImpl.scala:49)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:68)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:66)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:58)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/30 17:34:40 INFO TaskSetManager: Finished task 3.0 in stage 28.0 (TID 94) in 12 ms on localhost (2/4)
15/03/30 17:34:40 INFO TaskSetManager: Finished task 2.0 in stage 28.0 (TID 93) in 13 ms on localhost (3/4)
15/03/30 17:34:40 ERROR Executor: Exception in task 0.0 in stage 28.0 (TID 91)
org.apache.spark.util.TaskCompletionListenerException
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:76)
at org.apache.spark.scheduler.Task.run(Task.scala:58)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/30 17:34:40 WARN TaskSetManager: Lost task 0.0 in stage 28.0 (TID 91, localhost): org.apache.spark.util.TaskCompletionListenerException
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:76)
at org.apache.spark.scheduler.Task.run(Task.scala:58)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/30 17:34:40 ERROR TaskSetManager: Task 0 in stage 28.0 failed 1 times; aborting job
15/03/30 17:34:40 INFO TaskSchedulerImpl: Removed TaskSet 28.0, whose tasks have all completed, from pool
15/03/30 17:34:40 INFO TaskSchedulerImpl: Cancelling stage 28
15/03/30 17:34:40 INFO DAGScheduler: Job 28 failed: print at HotFCANextGen.scala:63, took 0,041068 s
15/03/30 17:34:40 INFO JobScheduler: Starting job streaming
Re: Error in Delete Table
Which Spark and Hive release are you using?

Thanks

On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote:

Hi.

In HiveContext, when I put this statement:

DROP TABLE IF EXISTS TestTable

If TestTable doesn't exist, Spark returns an error:

ERROR Hive: NoSuchObjectException(message:default.TestTable table not found)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036)
at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1008)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90)
at com.sun.proxy.$Proxy22.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1000)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:942)
at org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3887)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:310)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:952)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.<init>(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at GeoMain$.HiveExecution(GeoMain.scala:96)
at GeoMain$.main(GeoMain.scala:17)
at GeoMain.main(GeoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Thanks!!

-- Regards. Miguel Ángel
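The NoSuchObjectException here is raised inside the Hive metastore client even when DROP TABLE IF EXISTS itself proceeds. If the log noise matters, one workaround is to consult the catalog first; a minimal sketch assuming Spark 1.3, where SQLContext gained tableNames():

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// only issue the DROP when the table is actually registered
if (hc.tableNames().contains("TestTable")) {
  hc.sql("DROP TABLE TestTable")
}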
Re: refer to dictionary
You can use a broadcast variable. See also this thread: http://search-hadoop.com/m/JW1q5GX7U22/Spark+broadcast+variablesubj=How+Broadcast+variable+scale+

On Mar 31, 2015, at 4:43 AM, Peng Xia sparkpeng...@gmail.com wrote:

Hi,

I have an RDD (rdd1) where each line is split into an array [a, b, c], etc. I also have a local dictionary (dict1) that stores the key-value pairs {a: 1, b: 2, c: 3}. I want to replace the keys in the RDD with their corresponding values from the dict:

rdd1.map(lambda line: [dict1[item] for item in line])

But this task is not distributed; I believe the reason is that dict1 is a local instance. Can anyone provide suggestions on how to parallelize this?

Thanks,
Best,
Peng
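A broadcast variable ships the dictionary to each executor once instead of serializing it into every task closure. A minimal sketch of the same replacement in Scala (the names mirror the post; in PySpark the equivalent is sc.broadcast(dict1) plus bcDict.value inside the lambda):

// dict1 lives on the driver
val dict1 = Map("a" -> 1, "b" -> 2, "c" -> 3)

// register it as a broadcast variable; executors fetch it once and cache it
val bcDict = sc.broadcast(dict1)

// rdd1: RDD[Array[String]]; look keys up through bcDict.value on the executors
val replaced = rdd1.map(line => line.map(item => bcDict.value(item)))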
Re: Spark streaming with Kafka, multiple partitions fail, single partition ok
Can you show us the output of DStream#print() if you have it?

Thanks

On Tue, Mar 31, 2015 at 2:55 AM, Nicolas Phung nicolas.ph...@gmail.com wrote:

Hello,

@Akhil Das I'm trying to use the experimental API https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala. I'm reusing the same code snippet to initialize my topicSet.

@Cody Koeninger I don't see any previous error messages (see the full log at the end). To create the topic, I'm doing the following:

kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 10 --topic toto
kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic toto-single

I'm launching my Spark Streaming job in local mode.

@Ted Yu There's no "Couldn't connect to leader for topic" log; here's the full version:

spark-submit --conf config.resource=application-integration.conf --class nextgen.Main assembly-0.1-SNAPSHOT.jar

15/03/31 10:47:12 INFO SecurityManager: Changing view acls to: nphung
15/03/31 10:47:12 INFO SecurityManager: Changing modify acls to: nphung
15/03/31 10:47:12 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nphung); users with modify permissions: Set(nphung)
15/03/31 10:47:13 INFO Slf4jLogger: Slf4jLogger started
15/03/31 10:47:13 INFO Remoting: Starting remoting
15/03/31 10:47:13 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@int.local:44180]
15/03/31 10:47:13 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@int.local:44180]
15/03/31 10:47:13 INFO Utils: Successfully started service 'sparkDriver' on port 44180.
15/03/31 10:47:13 INFO SparkEnv: Registering MapOutputTracker
15/03/31 10:47:13 INFO SparkEnv: Registering BlockManagerMaster
15/03/31 10:47:13 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150331104713-2238
15/03/31 10:47:13 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/03/31 10:47:15 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2c8e34a0-bec3-4f1e-9fe7-83e08efc4f53
15/03/31 10:47:15 INFO HttpServer: Starting HTTP Server
15/03/31 10:47:15 INFO Utils: Successfully started service 'HTTP file server' on port 50204.
15/03/31 10:47:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/03/31 10:47:15 INFO SparkUI: Started SparkUI at http://int.local:4040
15/03/31 10:47:16 INFO SparkContext: Added JAR file:/home/nphung/assembly-0.1-SNAPSHOT.jar at http://10.153.165.98:50204/jars/assembly-0.1-SNAPSHOT.jar with timestamp 1427791636151
15/03/31 10:47:16 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@int.local:44180/user/HeartbeatReceiver
15/03/31 10:47:16 INFO NettyBlockTransferService: Server created on 40630
15/03/31 10:47:16 INFO BlockManagerMaster: Trying to register BlockManager
15/03/31 10:47:16 INFO BlockManagerMasterActor: Registering block manager localhost:40630 with 265.1 MB RAM, BlockManagerId(driver, localhost, 40630)
15/03/31 10:47:16 INFO BlockManagerMaster: Registered BlockManager
15/03/31 10:47:17 INFO EventLoggingListener: Logging events to hdfs://int.local:8020/user/spark/applicationHistory/local-1427791636195
15/03/31 10:47:17 INFO VerifiableProperties: Verifying properties
15/03/31 10:47:17 INFO VerifiableProperties: Property group.id is overridden to
15/03/31 10:47:17 INFO VerifiableProperties: Property zookeeper.connect is overridden to
15/03/31 10:47:17 INFO ForEachDStream: metadataCleanupDelay = -1
15/03/31 10:47:17 INFO MappedDStream: metadataCleanupDelay = -1
15/03/31 10:47:17 INFO MappedDStream: metadataCleanupDelay = -1
15/03/31 10:47:17 INFO DirectKafkaInputDStream: metadataCleanupDelay = -1
15/03/31 10:47:17 INFO DirectKafkaInputDStream: Slide time = 2000 ms
15/03/31 10:47:17 INFO DirectKafkaInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/03/31 10:47:17 INFO DirectKafkaInputDStream: Checkpoint interval = null
15/03/31 10:47:17 INFO DirectKafkaInputDStream: Remember duration = 2000 ms
15/03/31 10:47:17 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka.DirectKafkaInputDStream@1daf3b44
15/03/31 10:47:17 INFO MappedDStream: Slide time = 2000 ms
15/03/31 10:47:17 INFO MappedDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/03/31 10:47:17 INFO MappedDStream: Checkpoint interval = null
15/03/31 10:47:17 INFO MappedDStream: Remember duration = 2000 ms
15/03/31 10:47:17 INFO MappedDStream: Initialized and validated
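When a direct-stream task dies like this right after messages arrive, it can help to log exactly which offsets each partition of the batch covers. A minimal debugging sketch using the HasOffsetRanges handle that the direct API exposes in spark-streaming-kafka 1.3:

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  // each RDD produced by createDirectStream carries its Kafka offset ranges
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"topic=${r.topic} partition=${r.partition} from=${r.fromOffset} until=${r.untilOffset}")
  }
}

Partitions whose fromOffset equals untilOffset received no messages in that batch, which narrows down whether the failure correlates with empty partitions.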
Re: Spark 1.3.0 DataFrame and Postgres
+1 on escaping column names.

On Apr 1, 2015, at 5:50 AM, fergjo00 johngfergu...@gmail.com wrote:

Question:
---
Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it looks like it sees the names correctly in the schema created, but does not escape them in the SQL it creates when they are not compliant:

org.apache.spark.sql.jdbc.JDBCRDD

private val columnList: String = {
  val sb = new StringBuilder()
  columns.foreach(x => sb.append(",").append(x))
  if (sb.length == 0) "1" else sb.substring(1)
}

If you see value in this, I would take a shot at adding the quoting (escaping) of column names here. If you don't do it, some drivers... like postgresql's will simply fold all names to lower case when parsing the query. As you can see in the TL;DR below, that means they won't match the schema I am given. Thanks.

TL;DR:

I am able to connect to a Postgres database in the shell (with the driver referenced):

val jdbcDf = sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500")

In fact when I run:

jdbcDf.registerTempTable("sp500")
val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI FROM sp500")

and

val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share")))

the values come back as expected. However, if I try:

jdbcDf.show

or if I try

val all = sqlContext.sql("SELECT * FROM sp500")
all.show

I get errors about column names not being found. In fact the error includes a mention of column names all lower-cased. For now I will change my schema to be more restrictive. Right now it is, per a Stack Overflow poster, not ANSI compliant, doing things that are allowed via quoted identifiers in pgsql, MySQL and SQLServer. BTW, our users are giving us tables like this... because various tools they already use support non-compliant names. In fact, this is mild compared to what we've had to support. Currently the schema in question uses mixed-case, quoted names with special characters and spaces:

CREATE TABLE sp500
(
  "Symbol" text,
  "Name" text,
  "Sector" text,
  "Price" double precision,
  "Dividend Yield" double precision,
  "Price/Earnings" double precision,
  "Earnings/Share" double precision,
  "Book Value" double precision,
  "52 week low" double precision,
  "52 week high" double precision,
  "Market Cap" double precision,
  "EBITDA" double precision,
  "Price/Sales" double precision,
  "Price/Book" double precision,
  "SEC Filings" text
)
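A sketch of the quoting change proposed above, applied to the columnList builder. This is dialect-specific - double quotes suit PostgreSQL, while MySQL would want backticks - and upstream Spark later addressed this with per-dialect identifier quoting:

private val columnList: String = {
  val sb = new StringBuilder()
  // wrap every column name in double quotes so Postgres preserves case,
  // spaces and special characters instead of folding to lower case
  columns.foreach(x => sb.append(",").append("\"").append(x).append("\""))
  if (sb.length == 0) "1" else sb.substring(1)
}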
Re: Spark 1.3 build with hive support fails on JLine
Please invoke dev/change-version-to-2.11.sh before running mvn.

Cheers

On Mon, Mar 30, 2015 at 1:02 AM, Night Wolf nightwolf...@gmail.com wrote:

Hey,

Trying to build Spark 1.3 with Scala 2.11, supporting yarn and hive (with the thrift server). Running:

mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pmapr4 -Phive -Phive-thriftserver clean install

The build fails with:

[INFO] Compiling 9 Scala sources to /var/lib/jenkins/workspace/cse-Apache-Spark-YARN-2.11-1.3-build/sql/hive-thriftserver/target/scala-2.11/classes...
[ERROR] /var/lib/jenkins/workspace/cse-Apache-Spark-YARN-2.11-1.3-build/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:25: object ConsoleReader is not a member of package jline
[ERROR] import jline.{ConsoleReader, History}
[ERROR] ^
[WARNING] Class jline.Completor not found - continuing with a stub.
[WARNING] Class jline.ConsoleReader not found - continuing with a stub.
[ERROR] /var/lib/jenkins/workspace/cse-Apache-Spark-YARN-2.11-1.3-build/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:165: not found: type ConsoleReader
[ERROR] val reader = new ConsoleReader()
[ERROR] ^
[ERROR] Class jline.Completor not found - continuing with a stub.
[WARNING] Class com.google.protobuf.Parser not found - continuing with a stub.
[WARNING] Class com.google.protobuf.Parser not found - continuing with a stub.
[WARNING] Class com.google.protobuf.Parser not found - continuing with a stub.
[WARNING] Class com.google.protobuf.Parser not found - continuing with a stub.
[WARNING] 6 warnings found
[ERROR] three errors found
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 15.731 s]
[INFO] Spark Project Networking ... SUCCESS [ 46.667 s]
[INFO] Spark Project Shuffle Streaming Service SUCCESS [ 28.508 s]
[INFO] Spark Project Core . SUCCESS [07:45 min]
[INFO] Spark Project Bagel SUCCESS [01:10 min]
[INFO] Spark Project GraphX ... SUCCESS [02:42 min]
[INFO] Spark Project Streaming SUCCESS [03:22 min]
[INFO] Spark Project Catalyst . SUCCESS [04:42 min]
[INFO] Spark Project SQL .. SUCCESS [05:17 min]
[INFO] Spark Project ML Library ... SUCCESS [05:36 min]
[INFO] Spark Project Tools SUCCESS [ 46.976 s]
[INFO] Spark Project Hive . SUCCESS [04:08 min]
[INFO] Spark Project REPL . SUCCESS [01:58 min]
[INFO] Spark Project YARN . SUCCESS [01:47 min]
[INFO] Spark Project Hive Thrift Server ... FAILURE [ 20.731 s]
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Project External Twitter . SKIPPED
[INFO] Spark Project External Flume Sink .. SKIPPED
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External MQTT SKIPPED
[INFO] Spark Project External ZeroMQ .. SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED
[INFO]
[INFO] BUILD FAILURE
[INFO]
[INFO] Total time: 41:59 min
[INFO] Finished at: 2015-03-30T07:06:49+00:00
[INFO] Final Memory: 94M/868M
[INFO]
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-hive-thriftserver_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-hive-thriftserver_2.11: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed.

Any ideas?

Cheers,
N
Re: Spark streaming
jamborta: Please also describe the format of your csv files.

Cheers

On Fri, Mar 27, 2015 at 6:42 AM, DW @ Gmail deanwamp...@gmail.com wrote:

Show us the code. This shouldn't happen for the simple process you described.

Sent from my rotary phone.

On Mar 27, 2015, at 5:47 AM, jamborta jambo...@gmail.com wrote:

Hi all,

We have a workflow that pulls in data from csv files. The original setup of the workflow was to parse the data as it comes in (turning it into an array), then store it. This resulted in out-of-memory errors with larger files (as a result of increased GC?). It turns out that if the data gets stored as a string first and then parsed, the issue does not occur. Why is that?

Thanks,
Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result
Jeetendra:
Please extract the information you need from Result and return the extracted portion - instead of returning Result itself.

Cheers

On Tue, Mar 31, 2015 at 1:14 PM, Nan Zhu zhunanmcg...@gmail.com wrote:

The example in https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala might help

Best,
-- Nan Zhu http://codingcat.me

On Tuesday, March 31, 2015 at 3:56 PM, Sean Owen wrote:

Yep, it's not serializable: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html

You can't return this from a distributed operation since that would mean it has to travel over the network, and you haven't supplied any way to convert the thing into bytes.

On Tue, Mar 31, 2015 at 8:51 PM, Jeetendra Gangele gangele...@gmail.com wrote:

When I am trying to get the result from HBase and running the mapToPair function of the RDD, it's giving the error java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

Here is the code:

// private static JavaPairRDD<Integer, Result> getCompanyDataRDD(JavaSparkContext sc) throws IOException {
//   return sc.newAPIHadoopRDD(companyDAO.getCompnayDataConfiguration(), TableInputFormat.class, ImmutableBytesWritable.class,
//       Result.class).mapToPair(new PairFunction<Tuple2<ImmutableBytesWritable, Result>, Integer, Result>() {
//
//     public Tuple2<Integer, Result> call(Tuple2<ImmutableBytesWritable, Result> t) throws Exception {
//       System.out.println("In getCompanyDataRDD" + t._2);
//
//       String cknid = Bytes.toString(t._1.get());
//       System.out.println("processing cknids is:" + cknid);
//       Integer cknidInt = Integer.parseInt(cknid);
//       Tuple2<Integer, Result> returnTuple = new Tuple2<Integer, Result>(cknidInt, t._2);
//       return returnTuple;
//     }
//   });
// }
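A sketch of the suggestion above - pull the needed bytes out of Result inside the map so only serializable values cross the network. Written in Scala for brevity, with a hypothetical column family "cf" and qualifier "name" standing in for whatever the real table holds:

import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val companyData = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  .map { case (key, result) =>
    val cknid = Bytes.toString(key.get()).toInt
    // extract plain Strings here; the Result object itself never leaves the executor
    val name = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
    (cknid, name)
  }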
Re: Anyone has some simple example with spark-sql with spark 1.3
Please take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html

Cheers

On Mar 28, 2015, at 5:08 AM, Vincent He vincent.he.andr...@gmail.com wrote:

I am learning Spark SQL and trying the spark-sql examples. When I run the following code, I get the exception "ERROR CliDriver: org.apache.spark.sql.AnalysisException: cannot recognize input near 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17". I have two questions:
1. Do we have a list of the statements supported in spark-sql?
2. Does the spark-sql shell support HiveQL? If yes, how do I set it up?

The example I tried:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)
SELECT * FROM jsonTable

The exception I got:

15/03/28 17:38:34 INFO ParseDriver: Parsing command: CREATE TEMPORARY TABLE jsonTable USING org.apache.spark.sql.json OPTIONS ( path "examples/src/main/resources/people.json" ) SELECT * FROM jsonTable
NoViableAltException(241@[654:1: ddlStatement : ( createDatabaseStatement | switchDatabaseStatement | dropDatabaseStatement | createTableStatement | dropTableStatement | truncateTableStatement | alterStatement | descStatement | showStatement | metastoreCheck | createViewStatement | dropViewStatement | createFunctionStatement | createMacroStatement | createIndexStatement | dropIndexStatement | dropFunctionStatement | dropMacroStatement | analyzeStatement | lockStatement | unlockStatement | lockDatabase | unlockDatabase | createRoleStatement | dropRoleStatement | grantPrivileges | revokePrivileges | showGrants | showRoleGrants | showRolePrincipals | showRoles | grantRole | revokeRole | setRole | showCurrentRole );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:144)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2090)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:227)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:241)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.apply(AbstractSparkSQLParser.scala:38)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at org.apache.spark.sql.hive.HiveQl$$anonfun$3.apply(HiveQl.scala:138)
at org.apache.spark.sql.SparkSQLParser$$anonfun$org$apache$spark$sql$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:96) at
Re: Anyone has some simple example with spark-sql with spark 1.3
See https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html

I haven't tried the SQL statements in the above blog myself.

Cheers

On Sat, Mar 28, 2015 at 5:39 AM, Vincent He vincent.he.andr...@gmail.com wrote:

Thanks for your information. I have read it; I can run the samples with Scala or Python, but for the spark-sql shell I cannot get an example running successfully. Can you give me an example I can run with ./bin/spark-sql without writing any code? Thanks.

On Sat, Mar 28, 2015 at 7:35 AM, Ted Yu yuzhih...@gmail.com wrote:

Please take a look at https://spark.apache.org/docs/latest/sql-programming-guide.html

Cheers

On Mar 28, 2015, at 5:08 AM, Vincent He vincent.he.andr...@gmail.com wrote:

I am learning Spark SQL and trying the spark-sql examples. When I run the following code, I get the exception "ERROR CliDriver: org.apache.spark.sql.AnalysisException: cannot recognize input near 'CREATE' 'TEMPORARY' 'TABLE' in ddl statement; line 1 pos 17". I have two questions:
1. Do we have a list of the statements supported in spark-sql?
2. Does the spark-sql shell support HiveQL? If yes, how do I set it up?

The example I tried:

CREATE TEMPORARY TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)
SELECT * FROM jsonTable
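For a runnable reference, the same people.json example can be exercised from the Scala shell instead of the spark-sql CLI; a minimal sketch assuming Spark 1.3 and the example file shipped with the distribution:

// jsonFile infers the schema from the JSON records
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")
df.registerTempTable("jsonTable")
sqlContext.sql("SELECT * FROM jsonTable").show()

The CREATE TEMPORARY TABLE ... USING form in the post goes through the same data source API underneath.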
Re: Can't access file in spark, but can in hadoop
Thanks for the follow-up, Dale.

bq. hdp 2.3.1

Minor correction: that should be hdp 2.1.3.

Cheers

On Sat, Mar 28, 2015 at 2:28 AM, Johnson, Dale daljohn...@ebay.com wrote:

Actually, I did figure this out eventually. I'm running on a Hortonworks cluster, hdp 2.3.1 (hadoop 2.4.1). Spark bundles the org/apache/hadoop/hdfs/… classes along with the spark-assembly jar. This turns out to introduce a small incompatibility with hdp 2.3.1. I carved these classes out of the jar, and put a distro-provided jar into the classpath for the hdfs classes, and this fixed the problem. Ideally there would be an exclusion in the pom to deal with this.

Dale.

From: Zhan Zhang zzh...@hortonworks.com
Date: Friday, March 27, 2015 at 4:28 PM
To: Johnson, Dale daljohn...@ebay.com
Cc: Ted Yu yuzhih...@gmail.com, user user@spark.apache.org
Subject: Re: Can't access file in spark, but can in hadoop

Probably a guava version conflict issue. What Spark version did you use, and which Hadoop version was it compiled against?

Thanks.

Zhan Zhang

On Mar 27, 2015, at 12:13 PM, Johnson, Dale daljohn...@ebay.com wrote:

Yes, I could recompile the hdfs client with more logging, but I don't have the day or two to spare right this week.

One more thing about this: the cluster is Horton Works 2.1.3 [.0]. They seem to have a claim of supporting Spark on Horton Works 2.2.

Dale.

From: Ted Yu yuzhih...@gmail.com
Date: Thursday, March 26, 2015 at 4:54 PM
To: Johnson, Dale daljohn...@ebay.com
Cc: user user@spark.apache.org
Subject: Re: Can't access file in spark, but can in hadoop

Looks like the following assertion failed:

Preconditions.checkState(storageIDsCount == locs.size());

locs is List<DatanodeInfoProto>. Can you enhance the assertion to log more information?

Cheers

On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson daljohn...@ebay.com wrote:

There seems to be a special kind of "corrupted according to Spark" state of file in HDFS. I have isolated a set of files (maybe 1% of all files I need to work with) which are producing the following stack dump when I try to sc.textFile() open them. When I try to open directories, most large directories contain at least one file of this type.

Curiously, the following two lines fail inside of a Spark job, but not inside of a Scoobi job:

val conf = new org.apache.hadoop.conf.Configuration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)

The stack trace follows:

15/03/26 14:22:43 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: null)
Exception in thread "Driver" java.lang.IllegalStateException
at org.spark-project.guava.common.base.Preconditions.checkState(Preconditions.java:133)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:673)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1100)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1118)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1251)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1354)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1363)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:518)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
at org.apache.hadoop.hdfs.DistributedFileSystem$15.<init>(DistributedFileSystem.java:738)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:727)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1662)
at org.apache.hadoop.fs.FileSystem$5.<init>(FileSystem.java:1724)
at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:1721)
at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1125)
at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1123)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$.main(SpellQuery.scala:1123)
at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch.main(SpellQuery.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)

15/03/26 14:22:43 INFO yarn.ApplicationMaster
Re: Spark-submit not working when application jar is in hdfs
Looking at SparkSubmit#addJarToClasspath():

uri.getScheme match {
  case "file" | "local" => ...
  case _ => printWarning(s"Skip remote jar $uri.")
}

It seems the hdfs scheme is not recognized. FYI

On Thu, Feb 26, 2015 at 6:09 PM, dilm dmend...@exist.com wrote:

I'm trying to run a spark application using bin/spark-submit. When I reference my application jar inside my local filesystem, it works. However, when I copied my application jar to a directory in hdfs, I get the following exception:

Warning: Skip remote jar hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar.
java.lang.ClassNotFoundException: com.example.SimpleApp

Here's the command:

$ ./bin/spark-submit --class com.example.SimpleApp --master local hdfs://localhost:9000/user/hdfs/jars/simple-project-1.0-SNAPSHOT.jar

I'm using hadoop version 2.6.0, spark version 1.2.1. The official documentation states: "application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes."

I'm thinking maybe this is a valid bug?
Re: registerTempTable is not a member of RDD on spark 1.2?
Have you tried adding the following?

import org.apache.spark.sql.SQLContext

Cheers

On Mon, Mar 23, 2015 at 6:45 AM, IT CTO goi@gmail.com wrote:

Thanks. I am new to the environment and running Cloudera CDH 5.3 with Spark in it. Apparently, when running the command val sqlContext = new SQLContext(sc) in spark-shell, I am failing with "not found: type SQLContext". Any idea why?

On Mon, Mar 23, 2015 at 3:05 PM, Dean Wampler deanwamp...@gmail.com wrote:

In 1.2 it's a member of SchemaRDD and it becomes available on RDD (through the type class mechanism) when you add a SQLContext, like so:

val sqlContext = new SQLContext(sc)
import sqlContext._

In 1.3, the method has moved to the new DataFrame type.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
http://polyglotprogramming.com

On Mon, Mar 23, 2015 at 5:25 AM, IT CTO goi@gmail.com wrote:

Hi,

I am running Spark; when I use sc.version I get 1.2, but when I call registerTempTable("MyTable") I get an error saying registerTempTable is not a member of RDD. Why?

-- Eran | CTO

-- Eran | CTO
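Putting the two replies together, a minimal end-to-end sketch for Spark 1.2 (the Person case class and the people.txt path are illustrative; the implicit createSchemaRDD conversion imported from the SQLContext instance is what makes registerTempTable appear on an RDD of case classes):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._  // brings createSchemaRDD into scope

case class Person(name: String, age: Int)

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")  // implicit RDD -> SchemaRDD conversion
sqlContext.sql("SELECT name FROM people WHERE age >= 13").collect().foreach(println)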
Re: Convert Spark SQL table to RDD in Scala / error: value toFloat is a not a member of Any
I thought of formulation #1, but it looks like when there are many fields, formulation #2 is cleaner.

Cheers

On Sun, Mar 22, 2015 at 8:14 PM, Cheng Lian lian.cs@gmail.com wrote:

You need either

.map { row => (row(0).asInstanceOf[Float], row(1).asInstanceOf[Float], ...) }

or

.map { case Row(f0: Float, f1: Float, ...) => (f0, f1, ...) }

On 3/23/15 9:08 AM, Minnow Noir wrote:

I'm following some online tutorial written in Python and trying to convert a Spark SQL table object to an RDD in Scala. The Spark SQL just loads a simple table from a CSV file. The tutorial says to convert the table to an RDD. The Python is:

products_rdd = sqlContext.table("products").map(lambda row: (float(row[0]),float(row[1]),float(row[2]),float(row[3]),float(row[4]),float(row[5]),float(row[6]),float(row[7]),float(row[8]),float(row[9]),float(row[10]),float(row[11])))

The Scala is *not*:

val productsRdd = sqlContext.table("products").map( row => (
  row(0).toFloat, row(1).toFloat, row(2).toFloat, row(3).toFloat,
  row(4).toFloat, row(5).toFloat, row(6).toFloat, row(7).toFloat,
  row(8).toFloat, row(9).toFloat, row(10).toFloat, row(11).toFloat
))

I know this because Spark says that for each of the row(x).toFloat calls: error: value toFloat is not a member of Any

Does anyone know the proper syntax for this? Thank you
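Row's apply method returns Any, which is why .toFloat fails to compile; the cast (or a pattern match, or the typed getters such as row.getFloat(i)) has to be explicit. A minimal sketch of the pattern-match form, trimmed to three columns for readability and assuming the columns really carry Float values (the full version would list all twelve fields):

import org.apache.spark.sql.Row

val productsRdd = sqlContext.table("products").map {
  case Row(f0: Float, f1: Float, f2: Float) => (f0, f1, f2)
}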
Re: Spark 1.2. loses often all executors
In this thread: http://search-hadoop.com/m/JW1q5DM69G I only saw two replies. Maybe some people forgot to use 'Reply to All' ? Cheers On Mon, Mar 23, 2015 at 8:19 AM, mrm ma...@skimlinks.com wrote: Hi, I have received three replies to my question on my personal e-mail, why don't they also show up on the mailing list? I would like to reply to the 3 users through a thread. Thanks, Maria -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-2-loses-often-all-executors-tp22162p22187.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077
bq. Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077 There should be some more output following the above line. Can you post them ? Cheers On Mon, Mar 2, 2015 at 2:06 PM, Krishnanand Khambadkone kkhambadk...@yahoo.com.invalid wrote: Hi, I am running spark on my mac. It is reading from a kafka topic and then writes the data to a hbase table. When I do a spark submit, I get this error, Error connecting to master spark://localhost:7077 (akka.tcp://sparkMaster@localhost:7077), exiting. Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077 My submit statement looks like this, ./spark-submit --jars /Users/hadoop/kafka-0.8.1.1-src/core/build/libs/kafka_2.8.0-0.8.1.1.jar --class KafkaMain --master spark://localhost:7077 --deploy-mode cluster --num-executors 100 --driver-memory 1g --executor-memory 1g /Users/hadoop/dev/kafkahbase/kafka/src/sparkhbase.jar
Re: Executing hive query from Spark code
Here is a snippet of the dependency tree for the spark-hive module: [INFO] org.apache.spark:spark-hive_2.10:jar:1.3.0-SNAPSHOT ... [INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile [INFO] | +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile [INFO] | | +- org.spark-project.hive.shims:hive-shims-common:jar:0.13.1a:compile [INFO] | | +- org.spark-project.hive.shims:hive-shims-0.20:jar:0.13.1a:runtime [INFO] | | +- org.spark-project.hive.shims:hive-shims-common-secure:jar:0.13.1a:compile [INFO] | | +- org.spark-project.hive.shims:hive-shims-0.20S:jar:0.13.1a:runtime [INFO] | | \- org.spark-project.hive.shims:hive-shims-0.23:jar:0.13.1a:runtime ... [INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile [INFO] | +- org.spark-project.hive:hive-ant:jar:0.13.1a:compile [INFO] | | \- org.apache.velocity:velocity:jar:1.5:compile [INFO] | | \- oro:oro:jar:2.0.8:compile [INFO] | +- org.spark-project.hive:hive-common:jar:0.13.1a:compile ... [INFO] +- org.spark-project.hive:hive-serde:jar:0.13.1a:compile bq. is there a way to have the hive support without updating the assembly I don't think so. On Mon, Mar 2, 2015 at 12:37 PM, nitinkak001 nitinkak...@gmail.com wrote: I want to run a Hive query inside Spark and use the RDDs generated from it inside Spark. I read in the documentation: "Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive." I just wanted to know what the -Phive and -Phive-thriftserver flags really do, and whether there is a way to have the hive support without updating the assembly. Does that flag add a hive support jar or something? The reason I am asking is that I will be using the Cloudera version of Spark in future and I am not sure how to add Hive support to that Spark distribution. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Executing-hive-query-from-Spark-code-tp21880.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
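For reference, a sketch of the build invocation the quoted docs describe (the standard Maven build with the Hive profiles enabled; exact flags may vary by release):

mvn -Phive -Phive-thriftserver -DskipTests clean package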
Re: Spark UI and running spark-submit with --master yarn
Default RM Web UI port is 8088 (configurable through yarn.resourcemanager.webapp.address) Cheers On Mon, Mar 2, 2015 at 4:14 PM, Anupama Joshi anupama.jo...@gmail.com wrote: Hi Marcelo, Thanks for the quick reply. I have an EMR cluster and I am running the spark-submit on the master node in the cluster. When I start the spark-submit, I see 15/03/02 23:48:33 INFO client.RMProxy: Connecting to ResourceManager at /172.31.43.254:9022 But if I try that URL, or use the external DNS ec2-52-10-234-111.us-west-2.compute.amazonaws.com:9022, it does not work. What am I missing here? Thanks a lot for the help -AJ On Mon, Mar 2, 2015 at 3:50 PM, Marcelo Vanzin van...@cloudera.com wrote: What are you calling masternode? In yarn-cluster mode, the driver is running somewhere in your cluster, not on the machine where you run spark-submit. The easiest way to get to the Spark UI when using Yarn is to use the Yarn RM's web UI. That will give you a link to the application's UI regardless of whether it's running in client or cluster mode. On Mon, Mar 2, 2015 at 3:39 PM, Anupama Joshi anupama.jo...@gmail.com wrote: Hi, When I run my application with --master yarn-cluster or --master yarn --deploy-mode cluster, I cannot see the Spark UI at masternode:4040. Even if the job is running, I cannot see the Spark UI. When I run with --master yarn --deploy-mode client, I see the Spark UI but I cannot see my job running. When I run spark-submit with --master local[*], I see the Spark UI, my job, everything (that's great). Do I need to do some settings to see the UI? Thanks -AJ -- Marcelo
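A sketch of the yarn-site.xml entry behind that default, hedged because EMR remaps many YARN ports away from stock values (the hostname below is the external DNS from the message above):

<property>
  <name>yarn.resourcemanager.webapp.address</name>
  <value>ec2-52-10-234-111.us-west-2.compute.amazonaws.com:8088</value>
</property>

With stock YARN, browsing to that host on port 8088 (with the port open in the security group) is usually all that is needed.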
Re: Spark Error: Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077
In AkkaUtils.scala:

val akkaLogLifecycleEvents = conf.getBoolean("spark.akka.logLifecycleEvents", false)

Can you turn on life cycle event logging to see if you would get some more clue ? Cheers On Mon, Mar 2, 2015 at 3:56 PM, Krishnanand Khambadkone kkhambadk...@yahoo.com wrote: I see these messages now, spark.master - spark://krishs-mbp:7077 Classpath elements: Sending launch command to spark://krishs-mbp:7077 Driver successfully submitted as driver-20150302155433- ... waiting before polling master for driver state ... polling master for driver state State of driver-20150302155433- is FAILED On Monday, March 2, 2015 2:42 PM, Ted Yu yuzhih...@gmail.com wrote: bq. Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077 There should be some more output following the above line. Can you post them ? Cheers On Mon, Mar 2, 2015 at 2:06 PM, Krishnanand Khambadkone kkhambadk...@yahoo.com.invalid wrote: Hi, I am running spark on my mac. It is reading from a kafka topic and then writes the data to a hbase table. When I do a spark submit, I get this error, Error connecting to master spark://localhost:7077 (akka.tcp://sparkMaster@localhost:7077), exiting. Cause was: akka.remote.InvalidAssociation: Invalid address: akka.tcp://sparkMaster@localhost:7077 My submit statement looks like this, ./spark-submit --jars /Users/hadoop/kafka-0.8.1.1-src/core/build/libs/kafka_2.8.0-0.8.1.1.jar --class KafkaMain --master spark://localhost:7077 --deploy-mode cluster --num-executors 100 --driver-memory 1g --executor-memory 1g /Users/hadoop/dev/kafkahbase/kafka/src/sparkhbase.jar
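A sketch of flipping that flag on from application code (equivalently, pass --conf spark.akka.logLifecycleEvents=true to spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("KafkaMain")  // the class name from the submit command above
  .set("spark.akka.logLifecycleEvents", "true")
val sc = new SparkContext(conf)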
Re: Resource manager UI for Spark applications
bq. spark UI does not work for Yarn-cluster. Can you be a bit more specific on the error(s) you saw ? What Spark release are you using ? Cheers On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com wrote: Sorry for the half email - here it is again in full. Hi, I have 2 questions - 1. I was trying to use the Resource Manager UI for my Spark application using yarn cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct or am I missing some setup? 2. When I click on Application Monitoring or history, I get re-directed to some link with an internal IP address. Even if I replace that address with the public IP, it still does not work. What kind of setup changes are needed for that? Thanks -roni On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi roni.epi...@gmail.com wrote: Hi, I have 2 questions - 1. I was trying to use the Resource Manager UI for my Spark application using yarn cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct or am I missing some setup?
Re: bitten by spark.yarn.executor.memoryOverhead
I have created SPARK-6085 with pull request: https://github.com/apache/spark/pull/4836 Cheers On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: +1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test dataset we were using locally. It took me a couple days and digging through many logs to figure out this value was what was causing the problem. On Sat, Feb 28, 2015 at 11:38 AM, Ted Yu yuzhih...@gmail.com wrote: Having good out-of-box experience is desirable. +1 on increasing the default. On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote: There was a recent discussion about whether to increase or indeed make configurable this kind of default fraction. I believe the suggestion there too was that 9-10% is a safer default. Advanced users can lower the resulting overhead value; it may still have to be increased in some cases, but a fatter default may make this kind of surprise less frequent. I'd support increasing the default; any other thoughts? On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers ko...@tresata.com wrote: hey, running my first map-red like (meaning disk-to-disk, avoiding in memory RDDs) computation in spark on yarn i immediately got bitten by a too low spark.yarn.executor.memoryOverhead. however it took me about an hour to find out this was the cause. at first i observed failing shuffles leading to restarting of tasks, then i realized this was because executors could not be reached, then i noticed containers got shut down and reallocated in resourcemanager logs (no mention of errors, it seemed the containers finished their business and shut down successfully), and finally i found the reason in nodemanager logs. i don't think this is a pleasant first experience. i realize spark.yarn.executor.memoryOverhead needs to be set differently from situation to situation. but shouldn't the default be a somewhat higher value so that these errors are unlikely, and then the experts that are willing to deal with these errors can tune it lower? so why not make the default 10% instead of 7%? that gives something that works in most situations out of the box (at the cost of being a little wasteful). it worked for me. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: bitten by spark.yarn.executor.memoryOverhead
bq. that 0.1 is always enough? The answer is: it depends (on use cases). The value of 0.1 has been validated by several users. I think it is a reasonable default. Cheers On Mon, Mar 2, 2015 at 8:36 AM, Ryan Williams ryan.blake.willi...@gmail.com wrote: For reference, the initial version of #3525 https://github.com/apache/spark/pull/3525 (still open) made this fraction a configurable value, but consensus went against that being desirable so I removed it and marked SPARK-4665 https://issues.apache.org/jira/browse/SPARK-4665 as won't fix. My team wasted a lot of time on this failure mode as well and has settled into passing --conf spark.yarn.executor.memoryOverhead=1024 to most jobs (that works out to 10-20% of --executor-memory, depending on the job). I agree that learning about this the hard way is a weak part of the Spark-on-YARN onboarding experience. The fact that our instinct here is to increase the 0.07 minimum instead of the alternate 384MB https://github.com/apache/spark/blob/3efd8bb6cf139ce094ff631c7a9c1eb93fdcd566/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L93 minimum seems like evidence that the fraction is the thing we should allow people to configure, instead of the absolute amount that is currently configurable. Finally, do we feel confident that 0.1 is always enough? On Sat, Feb 28, 2015 at 4:51 PM Corey Nolet cjno...@gmail.com wrote: Thanks for taking this on Ted! On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote: I have created SPARK-6085 with pull request: https://github.com/apache/spark/pull/4836 Cheers On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: +1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test dataset we were using locally. It took me a couple days and digging through many logs to figure out this value was what was causing the problem. On Sat, Feb 28, 2015 at 11:38 AM, Ted Yu yuzhih...@gmail.com wrote: Having good out-of-box experience is desirable. +1 on increasing the default. On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote: There was a recent discussion about whether to increase or indeed make configurable this kind of default fraction. I believe the suggestion there too was that 9-10% is a safer default. Advanced users can lower the resulting overhead value; it may still have to be increased in some cases, but a fatter default may make this kind of surprise less frequent. I'd support increasing the default; any other thoughts? On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers ko...@tresata.com wrote: hey, running my first map-red like (meaning disk-to-disk, avoiding in memory RDDs) computation in spark on yarn i immediately got bitten by a too low spark.yarn.executor.memoryOverhead. however it took me about an hour to find out this was the cause. at first i observed failing shuffles leading to restarting of tasks, then i realized this was because executors could not be reached, then i noticed containers got shut down and reallocated in resourcemanager logs (no mention of errors, it seemed the containers finished their business and shut down successfully), and finally i found the reason in nodemanager logs. i don't think this is a pleasant first experience. i realize spark.yarn.executor.memoryOverhead needs to be set differently from situation to situation.
but shouldn't the default be a somewhat higher value so that these errors are unlikely, and then the experts that are willing to deal with these errors can tune it lower? so why not make the default 10% instead of 7%? that gives something that works in most situations out of the box (at the cost of being a little wasteful). it worked for me. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
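A hedged back-of-envelope for the numbers in this thread, using the max(fraction * executor memory, 384 MB floor) form referenced above:

at 7%:  --executor-memory 8g -> max(0.07 * 8192 MB, 384 MB) ≈ 573 MB overhead
at 10%: --executor-memory 8g -> max(0.10 * 8192 MB, 384 MB) ≈ 819 MB overhead

So the proposed bump adds roughly a quarter gigabyte of headroom per 8 GB executor, which is the "little wasteful" cost Koert mentions.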
Re: Resource manager UI for Spark applications
bq. I tried changing the internal address to the external one, but it still does not work. Not sure what happened. For the time being, you can use the yarn command line to pull the container log (put in your appId and containerId): yarn logs -applicationId application_1386639398517_0007 -containerId container_1386639398517_0007_01_19 Cheers On Tue, Mar 3, 2015 at 9:50 AM, roni roni.epi...@gmail.com wrote: Hi Ted, I used s3://support.elasticmapreduce/spark/install-spark to install spark on my EMR cluster. It is 1.2.0. When I click on the link for history or logs it takes me to http://ip-172-31-43-116.us-west-2.compute.internal:9035/node/containerlogs/container_1424105590052_0070_01_01/hadoop and I get - The server at ip-172-31-43-116.us-west-2.compute.internal can't be found, because the DNS lookup failed. DNS is the network service that translates a website's name to its Internet address. This error is most often caused by having no connection to the Internet or a misconfigured network. It can also be caused by an unresponsive DNS server or a firewall preventing Google Chrome from accessing the network. I tried changing the internal address to the external one, but it still does not work. Thanks -roni On Tue, Mar 3, 2015 at 9:05 AM, Ted Yu yuzhih...@gmail.com wrote: bq. spark UI does not work for Yarn-cluster. Can you be a bit more specific on the error(s) you saw ? What Spark release are you using ? Cheers On Tue, Mar 3, 2015 at 8:53 AM, Rohini joshi roni.epi...@gmail.com wrote: Sorry for the half email - here it is again in full. Hi, I have 2 questions - 1. I was trying to use the Resource Manager UI for my Spark application using yarn cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct or am I missing some setup? 2. When I click on Application Monitoring or history, I get re-directed to some link with an internal IP address. Even if I replace that address with the public IP, it still does not work. What kind of setup changes are needed for that? Thanks -roni On Tue, Mar 3, 2015 at 8:45 AM, Rohini joshi roni.epi...@gmail.com wrote: Hi, I have 2 questions - 1. I was trying to use the Resource Manager UI for my Spark application using yarn cluster mode, as I observed that the Spark UI does not work for yarn-cluster. Is that correct or am I missing some setup?
Re: Issue using S3 bucket from Spark 1.2.1 with hadoop 2.4
If you can use the hadoop 2.6.0 binary, you can use s3a. s3a is being polished in the upcoming 2.7.0 release: https://issues.apache.org/jira/browse/HADOOP-11571 Cheers On Tue, Mar 3, 2015 at 9:44 AM, Ankur Srivastava ankur.srivast...@gmail.com wrote: Hi, We recently upgraded to the Spark 1.2.1 - Hadoop 2.4 binary. We do not have any other dependency on hadoop jars, except for reading our source files from S3. Since we have upgraded to the latest version our reads from S3 have considerably slowed down. For some jobs we see the read from S3 is stalled for a long time and then it starts. Is there a known issue with S3 or do we need to upgrade any settings? The only settings that we are using are:

sc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "someKey");
sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "someSecret");

Thanks for help!! - Ankur
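A sketch of the s3a equivalents of the settings above, assuming the Hadoop 2.6.0 binary Ted mentions (the bucket path is illustrative; the property names are s3a's standard ones):

sc.hadoopConfiguration.set("fs.s3a.access.key", "someKey")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "someSecret")
val data = sc.textFile("s3a://some-bucket/some/path")  // note the s3a:// scheme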
Re: spark sql median and standard deviation
Please take a look at DoubleRDDFunctions.scala :

/** Compute the mean of this RDD's elements. */
def mean(): Double = stats().mean

/** Compute the variance of this RDD's elements. */
def variance(): Double = stats().variance

/** Compute the standard deviation of this RDD's elements. */
def stdev(): Double = stats().stdev

Cheers On Wed, Mar 4, 2015 at 10:51 AM, tridib tridib.sama...@live.com wrote: Hello, Is there a built-in function for getting the median and standard deviation in Spark SQL? Currently I am converting the SchemaRDD to a DoubleRDD and calling doubleRDD.stats(), but stats() does not include the median. What is the most efficient way to get the median? Thanks Regards Tridib -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-median-and-standard-deviation-tp21914.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
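There is no built-in median in this era of the API; a minimal exact-median sketch, not from the thread (it sorts, so it shuffles the whole dataset and is only sensible when that cost is acceptable; assumes a non-empty RDD):

import org.apache.spark.SparkContext._  // pair-RDD ops (lookup) on Spark < 1.3
import org.apache.spark.rdd.RDD

def median(rdd: RDD[Double]): Double = {
  // Sort, attach ranks, and key by rank so individual elements can be looked up.
  val sorted = rdd.sortBy(identity).zipWithIndex().map(_.swap)
  val n = sorted.count()
  if (n % 2 == 1) sorted.lookup(n / 2).head
  else (sorted.lookup(n / 2 - 1).head + sorted.lookup(n / 2).head) / 2.0
}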
Re: Driver disassociated
What release are you using ? SPARK-3923 went into 1.2.0 release. Cheers On Wed, Mar 4, 2015 at 1:39 PM, Thomas Gerber thomas.ger...@radius.com wrote: Hello, sometimes, in the *middle* of a job, the job stops (status is then seen as FINISHED in the master). There isn't anything wrong in the shell/submit output. When looking at the executor logs, I see logs like this: 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Doing the fetch; tracker actor = Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal :40019/user/MapOutputTracker#893807065] 15/03/04 21:24:51 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 38, fetching them 15/03/04 21:24:55 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@ip-10-0-11-9.ec2.internal:54766] - [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] disassociated! Shutting down. 15/03/04 21:24:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:40019] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. How can I investigate further? Thanks
Re: Where can I find more information about the R interface for Spark?
Please follow SPARK-5654 On Wed, Mar 4, 2015 at 7:22 PM, Haopu Wang hw...@qilinsoft.com wrote: Thanks, it's an active project. Will it be released with Spark 1.3.0? -- From: 鹰 [mailto:980548...@qq.com] Sent: Thursday, March 05, 2015 11:19 AM To: Haopu Wang; user Subject: Re: Where can I find more information about the R interface for Spark? you can search SparkR on google or search it on github
Re: Spark Build with Hadoop 2.6, yarn - encounter java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
Please add the following to the build command: -Djackson.version=1.9.3 Cheers On Thu, Mar 5, 2015 at 10:04 AM, Todd Nist tsind...@gmail.com wrote: I am running Spark on a HortonWorks HDP Cluster. I have deployed the prebuilt version but it is only for Spark 1.2.0, not 1.2.1, and there are a few fixes and features in there that I would like to leverage. I just downloaded the spark-1.2.1 source and built it to support Hadoop 2.6 by doing the following:

radtech:spark-1.2.1 tnist$ ./make-distribution.sh --name hadoop2.6 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package

When I deploy this to my hadoop cluster and kick off a spark-shell,

$ spark-1.2.1-bin-hadoop2.6]# ./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m

it results in java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer The full stack trace is below. I have validated that $SPARK_HOME/lib/spark-assembly-1.2.1-hadoop2.6.0.jar does in fact contain the class in question:

jar -tvf spark-assembly-1.2.1-hadoop2.6.0.jar | grep 'org/codehaus/jackson/map/deser/std'
...
18002 Thu Mar 05 11:23:04 EST 2015 parquet/org/codehaus/jackson/map/deser/std/StdDeserializer.class
1584 Thu Mar 05 11:23:04 EST 2015 parquet/org/codehaus/jackson/map/deser/std/StdKeyDeserializer$BoolKD.class...

Any guidance on what I missed ? If I start the spark-shell in standalone mode ($SPARK_HOME/bin/spark-shell) it comes up fine, so it looks to be related to starting it under yarn from what I can tell. TIA for the assistance. -Todd

Stack Trace

15/03/05 12:12:38 INFO spark.SecurityManager: Changing view acls to: root
15/03/05 12:12:38 INFO spark.SecurityManager: Changing modify acls to: root
15/03/05 12:12:38 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/05 12:12:38 INFO spark.HttpServer: Starting HTTP Server
15/03/05 12:12:39 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/05 12:12:39 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:36176
15/03/05 12:12:39 INFO util.Utils: Successfully started service 'HTTP class server' on port 36176.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_75)
Type in expressions to have them evaluated.
Type :help for more information.

15/03/05 12:12:43 INFO spark.SecurityManager: Changing view acls to: root
15/03/05 12:12:43 INFO spark.SecurityManager: Changing modify acls to: root
15/03/05 12:12:43 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/03/05 12:12:44 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/05 12:12:44 INFO Remoting: Starting remoting
15/03/05 12:12:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkdri...@hadoopdev01.opsdatastore.com:50544]
15/03/05 12:12:44 INFO util.Utils: Successfully started service 'sparkDriver' on port 50544.
15/03/05 12:12:44 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/05 12:12:44 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/05 12:12:44 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-16402794-cc1e-42d0-9f9c-99f15eaa1861/spark-118bc6af-4008-45d7-a22f-491bcd1856c0
15/03/05 12:12:44 INFO storage.MemoryStore: MemoryStore started with capacity 265.4 MB
15/03/05 12:12:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/05 12:12:45 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-5d7da34c-58d4-4d60-9b6a-3dce43cab39e/spark-4d65aacb-78bd-40fd-b6c0-53b47e288199
15/03/05 12:12:45 INFO spark.HttpServer: Starting HTTP Server
15/03/05 12:12:45 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/05 12:12:45 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:56452
15/03/05 12:12:45 INFO util.Utils: Successfully started service 'HTTP file server' on port 56452.
15/03/05 12:12:45 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/05 12:12:45 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/03/05 12:12:45 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/03/05 12:12:45 INFO ui.SparkUI: Started SparkUI at http://hadoopdev01.opsdatastore.com:4040
15/03/05 12:12:46 INFO impl.TimelineClientImpl: Timeline service address: http://hadoopdev02.opsdatastore.com:8188/ws/v1/timeline/
java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer
    at java.lang.ClassLoader.defineClass1(Native Method)
    at
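For reference, folding Ted's flag into Todd's original invocation gives something like:

./make-distribution.sh --name hadoop2.6 --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Djackson.version=1.9.3 -Phive -Phive-thriftserver -DskipTests clean package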
Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?
However, when invoking sc.textFile there are errors on directory entries: not a file. This behavior is confusing - given that the proper support appears to be in place for handling directories. 2015-03-03 15:04 GMT-08:00 Sean Owen so...@cloudera.com: This API reads a directory of files, not one file. A file here really means a directory full of part-* files. You do not need to read those separately. Any syntax that works with Hadoop's FileInputFormat should work. I thought you could specify a comma-separated list of paths? Maybe I am imagining that. On Tue, Mar 3, 2015 at 10:57 PM, S. Zhou myx...@yahoo.com.invalid wrote: Thanks Ted. Actually a follow-up question: I need to read multiple HDFS files into an RDD. What I am doing now is: for each file I read it into an RDD, then later on I union all these RDDs into one RDD. I am not sure if it is the best way to do it. Thanks Senqiang On Tuesday, March 3, 2015 2:40 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at scaladoc: /** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */ def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]] Your conclusion is confirmed. On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou myx...@yahoo.com.invalid wrote: I did some experiments and it seems not. But I would like to get confirmation (or perhaps I missed something). If it does support it, could you let me know how to specify multiple folders? Thanks. Senqiang - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
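A sketch of the two approaches discussed above - Hadoop's comma-separated/glob path syntax and the union-of-RDDs workaround (paths are illustrative):

// Comma-separated paths and globs go through FileInputFormat's path parsing:
val combined = sc.textFile("hdfs:///data/dir1,hdfs:///data/dir2/part-*")

// The union alternative Senqiang describes:
val unioned = sc.union(Seq("hdfs:///data/a", "hdfs:///data/b").map(sc.textFile(_)))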
Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?
Thanks for the confirmation, Stephen. On Tue, Mar 3, 2015 at 3:53 PM, Stephen Boesch java...@gmail.com wrote: Thanks, I was looking at an old version of FileInputFormat. BEFORE setting the recursive config (mapreduce.input.fileinputformat.input.dir.recursive):

scala> sc.textFile("dev/*").count
java.io.IOException: Not a file: file:/shared/sparkup/dev/audit-release/blank_maven_build

The default is null/not set, which is evaluated as false:

scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
res1: String = null

AFTER: Now set the value:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
res4: String = true

scala> sc.textFile("dev/*").count
..
res5: Long = 3481

So it works. 2015-03-03 15:26 GMT-08:00 Ted Yu yuzhih...@gmail.com: Looking at FileInputFormat#listStatus():

// Whether we need to recursive look into the directory structure
boolean recursive = job.getBoolean(INPUT_DIR_RECURSIVE, false);

where:

public static final String INPUT_DIR_RECURSIVE = "mapreduce.input.fileinputformat.input.dir.recursive";

FYI On Tue, Mar 3, 2015 at 3:14 PM, Stephen Boesch java...@gmail.com wrote: The sc.textFile() invokes the Hadoop FileInputFormat via the (subclass) TextInputFormat. Inside, the logic does exist to do the recursive directory reading - i.e. first detecting whether an entry is a directory and, if so, descending (from FileInputFormat#listStatus):

for (FileStatus globStat: matches) {
  if (globStat.isDir()) {
    for (FileStatus stat: fs.listStatus(globStat.getPath(), inputFilter)) {
      result.add(stat);
    }
  }
}
Re: Does sc.newAPIHadoopFile support multiple directories (or nested directories)?
Looking at scaladoc:

/** Get an RDD for a Hadoop file with an arbitrary new API InputFormat. */
def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]]

Your conclusion is confirmed. On Tue, Mar 3, 2015 at 1:59 PM, S. Zhou myx...@yahoo.com.invalid wrote: I did some experiments and it seems not. But I would like to get confirmation (or perhaps I missed something). If it does support it, could you let me know how to specify multiple folders? Thanks. Senqiang
Re: bitten by spark.yarn.executor.memoryOverhead
Having good out-of-box experience is desirable. +1 on increasing the default. On Sat, Feb 28, 2015 at 8:27 AM, Sean Owen so...@cloudera.com wrote: There was a recent discussion about whether to increase or indeed make configurable this kind of default fraction. I believe the suggestion there too was that 9-10% is a safer default. Advanced users can lower the resulting overhead value; it may still have to be increased in some cases, but a fatter default may make this kind of surprise less frequent. I'd support increasing the default; any other thoughts? On Sat, Feb 28, 2015 at 3:34 PM, Koert Kuipers ko...@tresata.com wrote: hey, running my first map-red like (meaning disk-to-disk, avoiding in memory RDDs) computation in spark on yarn i immediately got bitten by a too low spark.yarn.executor.memoryOverhead. however it took me about an hour to find out this was the cause. at first i observed failing shuffles leading to restarting of tasks, then i realized this was because executors could not be reached, then i noticed containers got shut down and reallocated in resourcemanager logs (no mention of errors, it seemed the containers finished their business and shut down successfully), and finally i found the reason in nodemanager logs. i don't think this is a pleasant first experience. i realize spark.yarn.executor.memoryOverhead needs to be set differently from situation to situation. but shouldn't the default be a somewhat higher value so that these errors are unlikely, and then the experts that are willing to deal with these errors can tune it lower? so why not make the default 10% instead of 7%? that gives something that works in most situations out of the box (at the cost of being a little wasteful). it worked for me. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class
Have you verified that the spark-catalyst_2.10 jar was in the classpath ? Cheers On Sat, Feb 28, 2015 at 9:18 AM, Ashish Nigam ashnigamt...@gmail.com wrote: Hi, I wrote a very simple program in scala to convert an existing RDD to SchemaRDD. But the createSchemaRDD function is throwing an exception: Exception in thread "main" scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [.] not found Here's more info on the versions I am using -

<scala.binary.version>2.11</scala.binary.version>
<spark.version>1.2.1</spark.version>
<scala.version>2.11.5</scala.version>

Please let me know how I can resolve this problem. Thanks Ashish
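A quick sketch for testing Ted's suggestion from the driver or the REPL (plain JVM reflection, nothing Spark-specific):

// Throws ClassNotFoundException when spark-catalyst is missing from the classpath:
Class.forName("org.apache.spark.sql.catalyst.ScalaReflection")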
Re: Tools to manage workflows on Spark
The latest modification in the Spork repo was on Mon Dec 1 10:08:19 2014. Not sure if it is being actively maintained. On Sat, Feb 28, 2015 at 6:26 PM, Qiang Cao caoqiang...@gmail.com wrote: Thanks for the pointer, Ashish! I was also looking at Spork (https://github.com/sigmoidanalytics/spork, Pig-on-Spark), but wasn't sure if that's the right direction. On Sat, Feb 28, 2015 at 6:36 PM, Ashish Nigam ashnigamt...@gmail.com wrote: You have to call spark-submit from oozie. I used this link to get the idea for my implementation - http://mail-archives.apache.org/mod_mbox/oozie-user/201404.mbox/%3CCAHCsPn-0Grq1rSXrAZu35yy_i4T=fvovdox2ugpcuhkwmjp...@mail.gmail.com%3E On Feb 28, 2015, at 3:25 PM, Qiang Cao caoqiang...@gmail.com wrote: Thanks, Ashish! Is Oozie integrated with Spark? I know it can accommodate some Hadoop jobs. On Sat, Feb 28, 2015 at 6:07 PM, Ashish Nigam ashnigamt...@gmail.com wrote: Qiang, Did you look at Oozie? We use Oozie to run spark jobs in production. On Feb 28, 2015, at 2:45 PM, Qiang Cao caoqiang...@gmail.com wrote: Hi Everyone, We need to deal with workflows on Spark. In our scenario, each workflow consists of multiple processing steps. Among different steps, there could be dependencies. I'm wondering if there are tools available that can help us schedule and manage workflows on Spark. I'm looking for something like Pig on Hadoop, but it should fully function on Spark. Any suggestion? Thanks in advance! Qiang
Re: unsafe memory access in spark 1.2.1
Google led me to: https://bugs.openjdk.java.net/browse/JDK-8040802 Not sure if the last comment there applies to your deployment. On Sun, Mar 1, 2015 at 8:32 AM, Zalzberg, Idan (Agoda) idan.zalzb...@agoda.com wrote: My runtime version is:

java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

Thanks From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Sunday, March 01, 2015 10:18 PM To: Zalzberg, Idan (Agoda) Cc: user@spark.apache.org Subject: Re: unsafe memory access in spark 1.2.1 What Java version are you using ? Thanks On Sun, Mar 1, 2015 at 7:03 AM, Zalzberg, Idan (Agoda) idan.zalzb...@agoda.com wrote: Hi, I am using spark 1.2.1, sometimes I get these errors sporadically: Any thought on what could be the cause? Thanks

2015-02-27 15:08:47 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread Thread[Executor task launch worker-25,5,main]
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Re: unsafe memory access in spark 1.2.1
What Java version are you using ? Thanks On Sun, Mar 1, 2015 at 7:03 AM, Zalzberg, Idan (Agoda) idan.zalzb...@agoda.com wrote: Hi, I am using spark 1.2.1, sometimes I get these errors sporadically: Any thought on what could be the cause? Thanks

2015-02-27 15:08:47 ERROR SparkUncaughtExceptionHandler:96 - Uncaught exception in thread Thread[Executor task launch worker-25,5,main]
java.lang.InternalError: a fault occurred in a recent unsafe memory access operation in compiled Java code
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:365)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Re: How to parse Json formatted Kafka message in spark streaming
Cui: You can check messages.partitions.size to determine whether messages is an empty RDD. Cheers On Thu, Mar 5, 2015 at 12:52 AM, Akhil Das ak...@sigmoidanalytics.com wrote: When you use KafkaUtils.createStream with StringDecoders, it will return String objects inside your messages stream. To access the elements from the json, you could do something like the following:

val mapStream = messages.map(x => {
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  mapper.readValue[Map[String, Any]](x).get("time")
})

Thanks Best Regards On Thu, Mar 5, 2015 at 11:13 AM, Cui Lin cui@hds.com wrote: Friends, I'm trying to parse json formatted Kafka messages and then send back to cassandra. I have two problems: 1. I got the exception below. How to check an empty RDD?

Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:869)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:869)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:869)
    at org.apache.spark.sql.json.JsonRDD$.inferSchema(JsonRDD.scala:57)
    at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:232)
    at org.apache.spark.sql.SQLContext.jsonRDD(SQLContext.scala:204)

val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](…)
messages.foreachRDD { rdd =>
  val message: RDD[String] = rdd.map { y => y._2 }
  sqlContext.jsonRDD(message).registerTempTable("tempTable")
  sqlContext.sql("SELECT time, To FROM tempTable")
    .saveToCassandra("cassandra_keyspace", "cassandra_table", SomeColumns("key", "msg"))
}

2. How do I get all column names from the json messages? I have hundreds of columns in the json formatted message. Thanks for your help! Best regards, Cui Lin
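A sketch wiring Ted's emptiness check into the snippet above (take(1) is added as a belt-and-braces guard, since a batch RDD can report partitions yet hold no records):

messages.foreachRDD { rdd =>
  val message = rdd.map(_._2)
  if (message.partitions.size > 0 && message.take(1).nonEmpty) {
    val schemaRdd = sqlContext.jsonRDD(message)
    schemaRdd.printSchema()  // also answers question 2: prints every inferred column
    schemaRdd.registerTempTable("tempTable")
  }
}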
Re: How to integrate HBASE on Spark
Installing HBase on the hadoop cluster would allow HBase to utilize features provided by HDFS, such as short circuit read (see '90.2. Leveraging local data' under http://hbase.apache.org/book.html#perf.hdfs). Cheers On Sun, Feb 22, 2015 at 11:38 PM, Akhil Das ak...@sigmoidanalytics.com wrote: If you have both the clusters on the same network, then I'd suggest installing it on the hadoop cluster. If you install it on the spark cluster itself, then hbase might take up a few cpu cycles and there's a chance for the job to lag. Thanks Best Regards On Mon, Feb 23, 2015 at 12:48 PM, sandeep vura sandeepv...@gmail.com wrote: Hi, I have installed spark on a 3 node cluster. Spark services are up and running, but I want to integrate HBase with Spark. Do I need to install HBase on the hadoop cluster or the spark cluster? Please let me know asap. Regards, Sandeep.v
Re: Spark excludes fastutil dependencies we need
Interesting. Looking at SparkConf.scala :

val configs = Seq(
  DeprecatedConfig("spark.files.userClassPathFirst", "spark.executor.userClassPathFirst", "1.3"),
  DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3",
    "Use spark.{driver,executor}.userClassPathFirst instead."))

It seems spark.files.userClassPathFirst and spark.yarn.user.classpath.first are deprecated. On Wed, Feb 25, 2015 at 12:39 AM, Sean Owen so...@cloudera.com wrote: No, we should not add fastutil back. It's up to the app to bring dependencies it needs, and that's how I understand this issue. The question is really, how to get the classloader visibility right. It depends on where you need these classes. Have you looked into spark.files.userClassPathFirst and spark.yarn.user.classpath.first ? On Wed, Feb 25, 2015 at 5:34 AM, Ted Yu yuzhih...@gmail.com wrote: bq. depend on missing fastutil classes like Long2LongOpenHashMap Looks like Long2LongOpenHashMap should be added to the shaded jar. Cheers On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com wrote: Spark includes the clearspring analytics package but intentionally excludes the dependencies of the fastutil package (see below). Spark includes parquet-column, which includes fastutil and relocates it under parquet/, but creates a shaded jar file which is incomplete because it shades out some of the fastutil classes, notably Long2LongOpenHashMap, which is present in the fastutil jar file that parquet-column is referencing. We are using more of the clearspring classes (e.g. QDigest) and those do depend on missing fastutil classes like Long2LongOpenHashMap. Even though I add them to our assembly jar file, the class loader finds the spark assembly and we get runtime class loader errors when we try to use it. It is possible to put our jar file first, as described here: https://issues.apache.org/jira/browse/SPARK-939 http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment which I tried with args to spark-submit: --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true but we still get the class not found error. We have tried copying the source code for clearspring into our package and renaming the package, and that makes it appear to work... Is this risky? It certainly is ugly. Can anyone recommend a way to deal with this dependency hell?

===

The spark/pom.xml file contains the following lines:

<dependency>
  <groupId>com.clearspring.analytics</groupId>
  <artifactId>stream</artifactId>
  <version>2.7.0</version>
  <exclusions>
    <exclusion>
      <groupId>it.unimi.dsi</groupId>
      <artifactId>fastutil</artifactId>
    </exclusion>
  </exclusions>
</dependency>

===

The parquet-column/pom.xml file contains:

<artifactId>maven-shade-plugin</artifactId>
<executions>
  <execution>
    <phase>package</phase>
    <goals>
      <goal>shade</goal>
    </goals>
    <configuration>
      <minimizeJar>true</minimizeJar>
      <artifactSet>
        <includes>
          <include>it.unimi.dsi:fastutil</include>
        </includes>
      </artifactSet>
      <relocations>
        <relocation>
          <pattern>it.unimi.dsi</pattern>
          <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
        </relocation>
      </relocations>
    </configuration>
  </execution>
</executions>

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-excludes-fastutil-dependencies-we-need-tp21794.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: [Spark SQL]: Convert SchemaRDD back to RDD
Haven't found the method in http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD The new DataFrame has this method:

/**
 * Returns the content of the [[DataFrame]] as an [[RDD]] of [[Row]]s.
 * @group rdd
 */
def rdd: RDD[Row] = {

FYI On Sun, Feb 22, 2015 at 11:51 AM, stephane.collot stephane.col...@gmail.com wrote: Hi Michael, I think that the feature (convert a SchemaRDD to a structured class RDD) is now available. But I didn't understand in the PR how exactly to do this. Can you give an example or doc links? Best regards -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-tp9071p21753.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
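A sketch of using that 1.3 method to get back to a typed RDD (the Person case class and column positions are illustrative, not from the thread):

case class Person(name: String, age: Int)

val df = sqlContext.table("people")
val people = df.rdd.map { row => Person(row.getString(0), row.getInt(1)) }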
Re: Posting to the list
bq. I didn't get any new subscription mail in my inbox. Have you checked your Spam folder ? Cheers On Sun, Feb 22, 2015 at 2:36 PM, hnahak harihar1...@gmail.com wrote: I'm also facing the same issue. This is the third time: whenever I post anything it is never accepted by the community, and at the same time I get a failure mail in my registered mail id. And when I click the subscribe to this mailing list link, I didn't get any new subscription mail in my inbox. Can anyone please suggest the best way to subscribe this email ID? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Posting-to-the-list-tp21750p21756.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Launching Spark cluster on EC2 with Ubuntu AMI
bq. bash: git: command not found Looks like the AMI doesn't have git pre-installed. Cheers On Sun, Feb 22, 2015 at 4:29 PM, olegshirokikh o...@solver.com wrote: I'm trying to launch Spark cluster on AWS EC2 with custom AMI (Ubuntu) using the following: ./ec2/spark-ec2 --key-pair=*** --identity-file='/home/***.pem' --region=us-west-2 --zone=us-west-2b --spark-version=1.2.1 --slaves=2 --instance-type=t2.micro --ami=ami-29ebb519 --user=ubuntu launch spark-ubuntu-cluster Everything starts OK and instances are launched: Found 1 master(s), 2 slaves Waiting for all instances in cluster to enter 'ssh-ready' state. Generating cluster's SSH key on master. But then I'm getting the following SSH errors until it stops trying and quits: bash: git: command not found Connection to ***.us-west-2.compute.amazonaws.com closed. Error executing remote command, retrying after 30 seconds: Command '['ssh', '-o', 'StrictHostKeyChecking=no', '-i', '/home/***t.pem', '-o', 'UserKnownHostsFile=/dev/null', '-t', '-t', u'ubuntu@***.us-west-2.compute.amazonaws.com', 'rm -rf spark-ec2 git clone https://github.com/mesos/spark-ec2.git -b v4']' returned non-zero exit status 127 I know that Spark EC2 scripts are not guaranteed to work with custom AMIs but still, it should work... Any advice would be greatly appreciated! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-Spark-cluster-on-EC2-with-Ubuntu-AMI-tp21757.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Getting to proto buff classes in Spark Context
bq. Caused by: java.lang.ClassNotFoundException: com.rick.reports.Reports$SensorReports Is the Reports$SensorReports class in rick-processors-assembly-1.0.jar ? Thanks On Mon, Feb 23, 2015 at 8:43 PM, necro351 . necro...@gmail.com wrote: Hello, I am trying to deserialize some data encoded using protobuf from within Spark and am getting class-not-found exceptions. I have narrowed the program down to something very simple that shows the problem exactly (see 'The Program' below) and hopefully someone can tell me the easy fix :) So the situation is I have some protobuf reports in /tmp/reports. I also have a Spark project with the below Scala code (under The Program) as well as a Java file defining SensorReports, all in the same src sub-tree in my Spark project. It's built using sbt in the standard way. The Spark job reads in the reports from /tmp/reports and then prints them to the console. When I build and run my spark job with spark-submit everything works as expected and the reports are printed out. When I uncomment the 'XXX' variant in the Scala spark program and try to print the reports from within a Spark Context I get the class-not-found exceptions. I don't understand why. If I get this working then I will want to do more than just print the reports from within the Spark Context. My read of the documentation tells me that my spark job should have access to everything in the submitted jar, and that jar includes the Java code generated by the protobuf library which defines SensorReports. This is the spark-submit invocation I use after building my job as an assembly with the sbt-assembly plugin:

spark-submit --class com.rick.processors.NewReportProcessor --master local[*] ../../../analyzer/spark/target/scala-2.10/rick-processors-assembly-1.0.jar

I have also tried adding the jar programmatically using sc.addJar but that does not help. I found a bug from July (https://github.com/apache/spark/pull/181) that seems related, but it went into Spark 1.2.0 (which is what I am currently using) so I don't think that's it. Any ideas? Thanks!

The Program:
==

package com.rick.processors

import java.io.File
import java.nio.file.{Path, Files, FileSystems}
import org.apache.spark.{SparkContext, SparkConf}
import com.rick.reports.Reports.SensorReports

object NewReportProcessor {
  private val sparkConf = new SparkConf().setAppName("ReportProcessor")
  private val sc = new SparkContext(sparkConf)

  def main(args: Array[String]) = {
    val protoBuffsBinary = localFileReports()
    val sensorReportsBundles = protoBuffsBinary.map(bundle => SensorReports.parseFrom(bundle))
    // XXX: Printing from within the SparkContext throws class-not-found
    // exceptions, why?
    // sc.makeRDD(sensorReportsBundles).foreach((x: SensorReports) => println(x.toString))
    sensorReportsBundles.foreach((x: SensorReports) => println(x.toString))
  }

  private def localFileReports() = {
    val reportDir = new File("/tmp/reports")
    val reportFiles = reportDir.listFiles.filter(_.getName.endsWith(".report"))
    reportFiles.map(file => {
      val path = FileSystems.getDefault().getPath("/tmp/reports", file.getName())
      Files.readAllBytes(path)
    })
  }
}

The Class-not-found exceptions:
=

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/23 17:35:03 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.241.128 instead (on interface eth0)
15/02/23 17:35:03 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/02/23 17:35:04 INFO SecurityManager: Changing view acls to: rick
15/02/23 17:35:04 INFO SecurityManager: Changing modify acls to: rick
15/02/23 17:35:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(rick); users with modify permissions: Set(rick)
15/02/23 17:35:04 INFO Slf4jLogger: Slf4jLogger started
15/02/23 17:35:04 INFO Remoting: Starting remoting
15/02/23 17:35:04 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.241.128:38110]
15/02/23 17:35:04 INFO Utils: Successfully started service 'sparkDriver' on port 38110.
15/02/23 17:35:04 INFO SparkEnv: Registering MapOutputTracker
15/02/23 17:35:04 INFO SparkEnv: Registering BlockManagerMaster
15/02/23 17:35:04 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150223173504-b26c
15/02/23 17:35:04 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
15/02/23 17:35:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/23 17:35:05 INFO HttpFileServer: HTTP File server directory is /tmp/spark-c77dbc9a-d626-4991-a9b7-f593acafbe64
15/02/23 17:35:05 INFO
Re: InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag
bq. have installed hadoop on a local virtual machine

Can you tell us the release of hadoop you installed ? What Spark release are you using ? Or, to be more specific, what hadoop release was the Spark built against ?

Cheers

On Mon, Feb 23, 2015 at 9:37 PM, fanooos dev.fano...@gmail.com wrote: Hi, I have installed hadoop on a local virtual machine using the steps from this URL: https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10 On the local machine I wrote a little Spark application in Java to read a file from the hadoop instance installed in the virtual machine. The code is below:

public static void main(String[] args) {
  JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("Spark Count").setMaster("local"));

  JavaRDD<String> lines = sc.textFile("hdfs://10.62.57.141:50070/tmp/lines.txt");
  JavaRDD<Integer> lengths = lines.flatMap(new FlatMapFunction<String, Integer>() {
    @Override
    public Iterable<Integer> call(String t) throws Exception {
      return Arrays.asList(t.length());
    }
  });

  List<Integer> collect = lengths.collect();
  int totalLength = lengths.reduce(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
      return v1 + v2;
    }
  });
  System.out.println(totalLength);
}

The application throws this exception:

Exception in thread "main" java.io.IOException: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.; Host Details : local host is: TOSHIBA-PC/192.168.56.1; destination host is: 10.62.57.141:50070;
  at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
  at org.apache.hadoop.ipc.Client.call(Client.java:1351)
  at org.apache.hadoop.ipc.Client.call(Client.java:1300)
  at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
  at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
  at java.lang.reflect.Method.invoke(Unknown Source)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
  at com.sun.proxy.$Proxy12.getFileInfo(Unknown Source)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:651)
  at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1679)
  at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
  at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
  at org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:1701)
  at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1647)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:222)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:222)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:220)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:220)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1367)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:797)
  at
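A note on the exception above: the "end-group tag did not match expected tag" message is what Hadoop's protobuf-based RPC typically raises when the client is pointed at an HTTP port rather than the NameNode RPC port, and 50070 is usually the NameNode web UI. A minimal sketch of the corrected line from the snippet above, assuming the RPC port from fs.defaultFS in the VM's core-site.xml is 9000 (the port is an assumption; check the VM's configuration):

// point sc.textFile at the NameNode RPC port, not the web UI port 50070
JavaRDD<String> lines = sc.textFile("hdfs://10.62.57.141:9000/tmp/lines.txt");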
Re: Does Spark Streaming depend on Hadoop?
Can you pastebin the whole stack trace ? Thanks

On Feb 23, 2015, at 6:14 PM, bit1...@163.com wrote: Hi, When I submit a spark streaming application with the following script,

./spark-submit --name MyKafkaWordCount --master local[20] --executor-memory 512M --total-executor-cores 2 --class spark.examples.streaming.MyKafkaWordCount my.kakfa.wordcountjar

an exception occurs:

Exception in thread "main" java.net.ConnectException: Call From hadoop.master/192.168.26.137 to hadoop.master:9000 failed on connection exception.

From the exception, it tries to connect to port 9000, which is for Hadoop/HDFS, yet I don't use Hadoop at all in my code (such as saving to HDFS). bit1...@163.com
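Spark Streaming itself does not require Hadoop, but any bare path in the app is resolved against fs.defaultFS from whatever Hadoop configuration is on the classpath, which on this machine presumably points at hdfs://hadoop.master:9000. A minimal sketch of keeping a streaming app off HDFS entirely, assuming ssc and wordCounts are the app's streaming context and output DStream (names are assumptions) and that local paths are acceptable:

// explicit file:// URIs bypass fs.defaultFS, so nothing touches port 9000
ssc.checkpoint("file:///tmp/spark-checkpoints")
wordCounts.saveAsTextFiles("file:///tmp/wordcounts/part")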
Re: Getting to proto buff classes in Spark Context
The class name given in the stack trace was com.rick.reports.Reports; in the output from the jar command, the class is com.defend7.reports.Reports. FYI

On Mon, Feb 23, 2015 at 9:33 PM, necro351 . necro...@gmail.com wrote: Hi Ted, Yes it appears to be:

rick@ubuntu:~/go/src/rick/sparksprint/containers/tests/StreamingReports$ jar tvf ../../../analyzer/spark/target/scala-2.10/rick-processors-assembly-1.0.jar | grep SensorReports
 1128 Mon Feb 23 17:34:46 PST 2015 com/defend7/reports/Reports$SensorReports$1.class
13507 Mon Feb 23 17:34:46 PST 2015 com/defend7/reports/Reports$SensorReports$Builder.class
10640 Mon Feb 23 17:34:46 PST 2015 com/defend7/reports/Reports$SensorReports.class
  815 Mon Feb 23 17:34:46 PST 2015 com/defend7/reports/Reports$SensorReportsOrBuilder.class

On Mon Feb 23 2015 at 8:57:18 PM Ted Yu yuzhih...@gmail.com wrote: bq. Caused by: java.lang.ClassNotFoundException: com.rick.reports.Reports$SensorReports Is the Reports$SensorReports class in rick-processors-assembly-1.0.jar ?
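One way to see the mismatch Ted points out: the protobuf compiler derives the generated package from the java_package option in the .proto file, so the assembly contains the class under com.defend7.reports while the job's import and the serialized closure reference com.rick.reports. A quick check, assuming a Scala REPL started with the assembly on its classpath:

// succeeds, per the jar listing above
Class.forName("com.defend7.reports.Reports$SensorReports")
// throws ClassNotFoundException, matching the Spark stack trace
Class.forName("com.rick.reports.Reports$SensorReports")

If that holds, changing the job's import to com.defend7.reports.Reports.SensorReports (or regenerating the Java source with the intended java_package) should line the two up.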
Re: Running out of space (when there's no shortage)
Here is a tool which may give you some clue: http://file-leak-detector.kohsuke.org/ Cheers

On Tue, Feb 24, 2015 at 11:04 AM, Vladimir Rodionov vrodio...@splicemachine.com wrote: Usually this happens on Linux when an application deletes a file without double-checking that there are no open FDs on it (a resource leak). In this case, Linux keeps the space allocated and does not release it until the application exits (crashes, in your case). You check the file system and everything looks normal, you have enough space, and you have no idea why the application reports no space left on device. Just a guess. -Vladimir Rodionov

On Tue, Feb 24, 2015 at 8:34 AM, Joe Wass jw...@crossref.org wrote: I'm running a cluster of 3 Amazon EC2 machines (small number because it's expensive when experiments keep crashing after a day!). Today's crash looks like this (stacktrace at end of message):

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

On my three nodes, I have plenty of space and inodes:

A $ df -i
Filesystem      Inodes  IUsed      IFree IUse% Mounted on
/dev/xvda1      524288  97937     426351   19% /
tmpfs          1909200      1    1909199    1% /dev/shm
/dev/xvdb      2457600     54    2457546    1% /mnt
/dev/xvdc      2457600     24    2457576    1% /mnt2
/dev/xvds    831869296  23844  831845452    1% /vol0

A $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.4G  4.5G  44% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  802G  199G  81% /vol0

B $ df -i
Filesystem      Inodes  IUsed      IFree IUse% Mounted on
/dev/xvda1      524288  97947     426341   19% /
tmpfs          1906639      1    1906638    1% /dev/shm
/dev/xvdb      2457600     54    2457546    1% /mnt
/dev/xvdc      2457600     24    2457576    1% /mnt2
/dev/xvds    816200704  24223  816176481    1% /vol0

B $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.6G  4.3G  46% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  805G  195G  81% /vol0

C $ df -i
Filesystem      Inodes  IUsed      IFree IUse% Mounted on
/dev/xvda1      524288  97938     426350   19% /
tmpfs          1906897      1    1906896    1% /dev/shm
/dev/xvdb      2457600     54    2457546    1% /mnt
/dev/xvdc      2457600     24    2457576    1% /mnt2
/dev/xvds    755218352  24024  755194328    1% /vol0

C $ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  3.4G  4.5G  44% /
tmpfs           7.3G     0  7.3G   0% /dev/shm
/dev/xvdb        37G  1.2G   34G   4% /mnt
/dev/xvdc        37G  177M   35G   1% /mnt2
/dev/xvds      1000G  820G  181G  82% /vol0

The devices may be ~80% full but that still leaves ~200G free on each. My spark-env.sh has

export SPARK_LOCAL_DIRS=/vol0/spark

I have manually verified that on each slave the only temporary files are stored on /vol0, all looking something like this:

/vol0/spark/spark-f05d407c/spark-fca3e573/spark-78c06215/spark-4f0c4236/20/rdd_8_884

So it looks like all the files are being stored on the large drives (incidentally they're AWS EBS volumes, but that's the only way to get enough storage). My process crashed before with a slightly different exception under the same circumstances:

kryo.KryoException: java.io.IOException: No space left on device

These both happen after several hours and several GB of temporary files. Why does Spark think it's run out of space?
TIA, Joe

Stack trace 1:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
  at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
  at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
  at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
  at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
  at
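Vladimir's deleted-but-still-open theory is easy to confirm on a worker while the job is running. A quick check, assuming lsof is installed (+L1 lists open files whose link count is below 1, i.e. files that have been deleted but are still held open by a process):

$ lsof +L1 | grep /vol0

If that prints large shuffle or spill files, the "missing" space is being held by open file descriptors and will only return when the holding JVM exits.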
Re: Perf Prediction
Can you be a bit more specific ? Are you asking about performance across Spark releases ? Cheers On Sat, Feb 21, 2015 at 6:38 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Has some performance prediction work been done on Spark? Thank You
Re: Query data in Spark RRD
Have you looked at http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD ? Cheers On Sat, Feb 21, 2015 at 4:24 AM, Nikhil Bafna nikhil.ba...@flipkart.com wrote: Hi. My use case is building a realtime monitoring system over multi-dimensional data. The way I'm planning to go about it is to use Spark Streaming to store aggregated count over all dimensions in 10 sec interval. Then, from a dashboard, I would be able to specify a query over some dimensions, which will need re-aggregation from the already computed job. My query is, how can I run dynamic queries over data in schema RDDs? -- Nikhil Bafna
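Following the SchemaRDD pointer, a minimal sketch of running a dynamic query over already-aggregated data in Spark 1.2; the table name, field names, and the idea that the aggregates arrive as a SchemaRDD are assumptions for illustration:

// register the pre-aggregated SchemaRDD once, then build query strings at dashboard time
aggregatedCounts.registerTempTable("counts")                  // "counts" is an assumed name
val dims = "region, device"                                   // dimensions chosen in the dashboard
val result = sqlContext.sql(
  s"SELECT $dims, SUM(cnt) AS total FROM counts GROUP BY $dims")
result.collect().foreach(println)

Since the query is an ordinary string, the dashboard can re-aggregate over any subset of dimensions without recomputing the 10-second windows.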
Re: upgrade to Spark 1.2.1
Could this be caused by Spark using a shaded Guava jar ? Cheers

On Wed, Feb 25, 2015 at 3:26 PM, Pat Ferrel p...@occamsmachete.com wrote: Getting an error that confuses me. Running a largish app on a standalone cluster on my laptop. The app uses a guava HashBiMap as a broadcast value. With Spark 1.1.0 I simply registered the class and its serializer with kryo like this:

kryo.register(classOf[com.google.common.collect.HashBiMap[String, Int]], new JavaSerializer())

And all was well. I’ve also tried addSerializer instead of register. Now I get a class not found during deserialization. I checked the jar list used to create the context and found the jar that contains HashBiMap, but get this error. Any ideas:

15/02/25 14:46:33 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 8, 192.168.0.2): java.io.IOException: com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1093)
  at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
  at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
  at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
  at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
  at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
  at org.apache.mahout.drivers.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:95)
  at org.apache.mahout.drivers.TDIndexedDatasetReader$$anonfun$5.apply(TextDelimitedReaderWriter.scala:94)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:366)
  at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
  at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
  at org.apache.spark.scheduler.Task.run(Task.scala:56)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: com.esotericsoftware.kryo.KryoException: Error during Java deserialization.
  at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:42)
  at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
  at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
  at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
  at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:177)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1090)
  ... 19 more

== root error
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.HashBiMap
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:274)
  at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at com.esotericsoftware.kryo.serializers.JavaSerializer.read(JavaSerializer.java:40)
  ... 24 more
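If the shaded-Guava theory holds (Spark 1.2+ shades most of Guava into its assembly, so classes like HashBiMap are no longer exposed to user code at their original names), one thing worth trying, as a sketch with the jar path and version as assumptions, is shipping the full Guava jar yourself and putting it on both driver and executor classpaths:

spark-submit \
  --conf spark.driver.extraClassPath=/path/to/guava-14.0.1.jar \
  --conf spark.executor.extraClassPath=/path/to/guava-14.0.1.jar \
  ...

The jar must exist at the same path on every worker for the executor setting to take effect.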
Re: Spark excludes fastutil dependencies we need
Maybe drop the exclusion for the parquet-provided profile ? Cheers

On Wed, Feb 25, 2015 at 8:42 PM, Jim Kleckner j...@cloudphysics.com wrote: Inline

On Wed, Feb 25, 2015 at 1:53 PM, Ted Yu yuzhih...@gmail.com wrote: Interesting. Looking at SparkConf.scala:

val configs = Seq(
  DeprecatedConfig("spark.files.userClassPathFirst", "spark.executor.userClassPathFirst", "1.3"),
  DeprecatedConfig("spark.yarn.user.classpath.first", null, "1.3",
    "Use spark.{driver,executor}.userClassPathFirst instead."))

It seems spark.files.userClassPathFirst and spark.yarn.user.classpath.first are deprecated.

Note that I did use the non-deprecated version, spark.executor.userClassPathFirst=true.

On Wed, Feb 25, 2015 at 12:39 AM, Sean Owen so...@cloudera.com wrote: No, we should not add fastutil back. It's up to the app to bring dependencies it needs, and that's how I understand this issue. The question is really, how to get the classloader visibility right. It depends on where you need these classes. Have you looked into spark.files.userClassPathFirst and spark.yarn.user.classpath.first ?

I noted that I tried this in my original email. The issue appears related to the fact that parquet is also creating a shaded jar, and that one leaves out the Long2LongOpenHashMap class. FYI, I have subsequently tried removing the exclusion from the spark build and that does cause the fastutil classes to be included, and the example works... So, should the userClassPathFirst flag work and there is a bug? Or is it reasonable to put in a pull request for the elimination of the exclusion?

On Wed, Feb 25, 2015 at 5:34 AM, Ted Yu yuzhih...@gmail.com wrote: bq. depend on missing fastutil classes like Long2LongOpenHashMap Looks like Long2LongOpenHashMap should be added to the shaded jar. Cheers

On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner j...@cloudphysics.com wrote: Spark includes the clearspring analytics package but intentionally excludes the dependencies of the fastutil package (see below). Spark includes parquet-column, which includes fastutil and relocates it under parquet/, but creates a shaded jar file which is incomplete because it shades out some of the fastutil classes, notably Long2LongOpenHashMap, which is present in the fastutil jar file that parquet-column is referencing. We are using more of the clearspring classes (e.g. QDigest) and those do depend on missing fastutil classes like Long2LongOpenHashMap. Even though I add them to our assembly jar file, the class loader finds the spark assembly and we get runtime class loader errors when we try to use it. It is possible to put our jar file first, as described here: https://issues.apache.org/jira/browse/SPARK-939 http://spark.apache.org/docs/1.2.0/configuration.html#runtime-environment which I tried with args to spark-submit:

--conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true

but we still get the class not found error. We have tried copying the source code for clearspring into our package and renaming the package, and that makes it appear to work... Is this risky? It certainly is ugly. Can anyone recommend a way to deal with this dependency **ll ?

===
The spark/pom.xml file contains the following lines:

<dependency>
  <groupId>com.clearspring.analytics</groupId>
  <artifactId>stream</artifactId>
  <version>2.7.0</version>
  <exclusions>
    <exclusion>
      <groupId>it.unimi.dsi</groupId>
      <artifactId>fastutil</artifactId>
    </exclusion>
  </exclusions>
</dependency>

===
The parquet-column/pom.xml file contains:

<artifactId>maven-shade-plugin</artifactId>
<executions>
  <execution>
    <phase>package</phase>
    <goals>
      <goal>shade</goal>
    </goals>
    <configuration>
      <minimizeJar>true</minimizeJar>
      <artifactSet>
        <includes>
          <include>it.unimi.dsi:fastutil</include>
        </includes>
      </artifactSet>
      <relocations>
        <relocation>
          <pattern>it.unimi.dsi</pattern>
          <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
        </relocation>
      </relocations>
    </configuration>
  </execution>
</executions>
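A middle ground between copying the clearspring source and patching Spark's pom, sketched here rather than verified on this thread: relocate clearspring and fastutil inside the application's own maven-shade-plugin configuration, so QDigest and the fastutil classes it needs live under a private prefix that neither Spark's assembly nor parquet's shading can shadow. The myapp.shaded prefix is a placeholder:

<relocations>
  <relocation>
    <pattern>com.clearspring.analytics</pattern>
    <shadedPattern>myapp.shaded.clearspring</shadedPattern>
  </relocation>
  <relocation>
    <pattern>it.unimi.dsi</pattern>
    <shadedPattern>myapp.shaded.it.unimi.dsi</shadedPattern>
  </relocation>
</relocations>

This achieves what the manual package-renaming did, but mechanically and repeatably at build time.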
Re: How to tell if one RDD depends on another
bq. whether or not rdd1 is a cached rdd

RDD has a getStorageLevel method which would return the RDD's current storage level. SparkContext has this method:

/**
 * Return information about what RDDs are cached, if they are in mem or on disk, how much space
 * they take, etc.
 */
@DeveloperApi
def getRDDStorageInfo: Array[RDDInfo] = {

Cheers

On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet cjno...@gmail.com wrote: Zhan, This is exactly what I'm trying to do except, as I mentioned in my first message, I am being given rdd1 and rdd2 only and I don't necessarily know at that point whether or not rdd1 is a cached rdd. Further, I don't know at that point whether or not rdd2 depends on rdd1.

On Thu, Feb 26, 2015 at 6:54 PM, Zhan Zhang zzh...@hortonworks.com wrote: In this case, it is slow to wait for rdd1.saveAsHadoopFile(...) to finish, probably due to writing to hdfs. A workaround for this particular case may be as follows:

val rdd1 = ...cache()
val rdd2 = rdd1.map().()
rdd1.count
future { rdd1.saveAsHadoopFile(...) }
future { rdd2.saveAsHadoopFile(...) }

In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls will happen concurrently. Thanks. Zhan Zhang

On Feb 26, 2015, at 3:28 PM, Corey Nolet cjno...@gmail.com wrote: What confused me is the statement of *The final result is that rdd1 is calculated twice.* Is it the expected behavior? To be perfectly honest, performing an action on a cached RDD in two different threads and having them (at the partition level) block until the parent is cached would be the behavior I and all my coworkers expected.

On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet cjno...@gmail.com wrote: I should probably mention that my example case is much over simplified- Let's say I've got a tree, a fairly complex one where I begin a series of jobs at the root which calculates a bunch of really really complex joins, and as I move down the tree, I'm creating reports from the data that's already been joined (I've implemented logic to determine when cached items can be cleaned up, e.g. the last report has been done in a subtree). My issue is that the 'actions' on the rdds are currently being implemented in a single thread- even if I'm waiting on a cache to complete fully before I run the children jobs, I'm still in a better place than I was because I'm able to run those jobs concurrently- right now this is not the case. "What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD." I totally agree, and if I could get it so that it's waiting at the granularity of the partition, I'd be in a much much better place. I feel like I'm going down a rabbit hole and working against the Spark API.

On Thu, Feb 26, 2015 at 6:03 PM, Sean Owen so...@cloudera.com wrote: To distill this a bit further, I don't think you actually want rdd2 to wait on rdd1 in this case. What you want is for a request for partition X to wait if partition X is already being calculated in a persisted RDD. Otherwise the first partition of rdd2 waits on the final partition of rdd1 even when the rest is ready. That is probably usually a good idea in almost all cases. That much, I don't know how hard it is to implement. But I speculate that it's easier to deal with it at that level than as a function of the dependency graph.

On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to do the scheduling myself now- to determine that rdd2 depends on rdd1 and rdd1 is a persistent RDD (storage level != None) so that I can do the no-op on rdd1 before I run rdd2. I would much rather the DAG figure this out so I don't need to think about all this.
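For the programmatic check Corey describes, a minimal sketch that walks the lineage exposed by RDD.dependencies and inspects the storage level (assuming rdd1 and rdd2 are the two handles in scope; note that dependencies only shows lineage that hasn't been truncated by checkpointing):

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// true if `parent` appears anywhere in `child`'s dependency graph
def dependsOn(child: RDD[_], parent: RDD[_]): Boolean =
  child == parent || child.dependencies.exists(d => dependsOn(d.rdd, parent))

val rdd1IsPersisted = rdd1.getStorageLevel != StorageLevel.NONE
val rdd2NeedsRdd1   = dependsOn(rdd2, rdd1)

With those two booleans, the scheduler code can decide whether to force rdd1 (e.g. with a count) before launching the rdd2 job in another thread.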
Re: JettyUtils.createServletHandler Method not Found?
JettyUtils is marked with:

private[spark] object JettyUtils extends Logging {

FYI

On Fri, Mar 27, 2015 at 9:50 AM, kmader kevin.ma...@gmail.com wrote: I have a very strange error in Spark 1.3 where at runtime, in the org.apache.spark.ui.JettyUtils object, the method createServletHandler is not found:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.ui.JettyUtils$.createServletHandler(Ljava/lang/String;Ljavax/servlet/http/HttpServlet;Ljava/lang/String;)Lorg/eclipse/jetty/servlet/ServletContextHandler;

The code compiles without issue, but at runtime it fails. I know the Jetty dependencies have been changed, but this should not affect the JettyUtils inside Spark Core, or is there another change I am not aware of? Thanks, Kevin
Re: RDD equivalent of HBase Scan
In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala, TableInputFormat is used. TableInputFormat accepts the parameter:

public static final String SCAN = "hbase.mapreduce.scan";

where, if specified, a Scan object is created from its String form:

if (conf.get(SCAN) != null) {
  try {
    scan = TableMapReduceUtil.convertStringToScan(conf.get(SCAN));

You can use TableMapReduceUtil#convertScanToString() to convert a Scan which has filter(s), and pass the result to TableInputFormat. Cheers

On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton stuart.lay...@gmail.com wrote: HBase scans come with the ability to specify filters that make scans very fast and efficient (as they let you seek to the keys that pass the filter). Do RDDs or Spark DataFrames offer anything similar, or would I be required to use a NoSQL db like HBase to do something like this? -- Stuart Layton
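Putting Ted's pointers together, a sketch of a filtered range scan feeding an RDD (table name, keys, and filter are placeholders for illustration):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}

val scan = new Scan()
scan.setStartRow("rowkey-prefix".getBytes)      // placeholder start/stop keys
scan.setStopRow("rowkey-prefix~".getBytes)
scan.setFilter(new PrefixFilter("rowkey-prefix".getBytes))

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "mytable")   // placeholder table name
// serialize the Scan (filters included) into the string form TableInputFormat expects
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

Because the seek happens inside HBase's region servers, only rows passing the filter ever reach Spark.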
Re: Which RDD operations preserve ordering?
This is related: https://issues.apache.org/jira/browse/SPARK-6340

On Thu, Mar 26, 2015 at 5:58 AM, sergunok ser...@gmail.com wrote: Hi guys, I don't have an exact picture of how the ordering of RDD elements is preserved after operations execute. Which operations preserve it? 1) map (Yes?) 2) zipWithIndex (Yes, or sometimes yes?) Serg.
Re: Can't access file in spark, but can in hadoop
Looks like the following assertion failed:

Preconditions.checkState(storageIDsCount == locs.size());

where locs is a List<DatanodeInfoProto>. Can you enhance the assertion to log more information ? Cheers

On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson daljohn...@ebay.com wrote: There seems to be a special kind of "corrupted according to Spark" state of file in HDFS. I have isolated a set of files (maybe 1% of all files I need to work with) which produce the following stack dump when I try to sc.textFile() open them. When I try to open directories, most large directories contain at least one file of this type. Curiously, the following two lines fail inside of a Spark job, but not inside of a Scoobi job:

val conf = new org.apache.hadoop.conf.Configuration
val fs = org.apache.hadoop.fs.FileSystem.get(conf)

The stack trace follows:

15/03/26 14:22:43 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: null)
Exception in thread "Driver" java.lang.IllegalStateException
  at org.spark-project.guava.common.base.Preconditions.checkState(Preconditions.java:133)
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:673)
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convertLocatedBlock(PBHelper.java:1100)
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1118)
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1251)
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1354)
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1363)
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:518)
  at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1743)
  at org.apache.hadoop.hdfs.DistributedFileSystem$15.<init>(DistributedFileSystem.java:738)
  at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:727)
  at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:1662)
  at org.apache.hadoop.fs.FileSystem$5.<init>(FileSystem.java:1724)
  at org.apache.hadoop.fs.FileSystem.listFiles(FileSystem.java:1721)
  at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1125)
  at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$$anonfun$main$2.apply(SpellQuery.scala:1123)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
  at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch$.main(SpellQuery.scala:1123)
  at com.ebay.ss.niffler.miner.speller.SpellQueryLaunch.main(SpellQuery.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:427)
15/03/26 14:22:43 INFO yarn.ApplicationMaster: Invoking sc stop from shutdown hook

It appears to have found the three copies of the given HDFS block, but is performing some sort of validation on them before giving them back to Spark to schedule the job, and an assertion is failing. I've tried this with 1.2.0, 1.2.1 and 1.3.0, and I get the exact same error; the line numbers in the HDFS libraries change, but not the function names.

I've tried recompiling myself with different hadoop versions, and it's the same. We're running hadoop 2.4.1 on our cluster. A google search turns up absolutely nothing on this. Any insight at all would be appreciated.

Dale Johnson
Applied Researcher
eBay.com
Re: Building spark 1.2 from source requires more dependencies
Looking at output from dependency:tree, servlet-api is brought in by the following:

[INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
[INFO] |  +- org.antlr:antlr:jar:3.2:compile
[INFO] |  +- com.googlecode.json-simple:json-simple:jar:1.1:compile
[INFO] |  +- org.yaml:snakeyaml:jar:1.6:compile
[INFO] |  +- edu.stanford.ppl:snaptree:jar:0.1:compile
[INFO] |  +- org.mindrot:jbcrypt:jar:0.3m:compile
[INFO] |  +- org.apache.thrift:libthrift:jar:0.7.0:compile
[INFO] |  |  \- javax.servlet:servlet-api:jar:2.5:compile

FYI

On Thu, Mar 26, 2015 at 3:36 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi, We are trying to build spark 1.2 from source (tip of the branch-1.2 at the moment). I tried to build spark using the following command:

mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package

I encountered various missing class definition exceptions (e.g. class javax.servlet.ServletException not found). I eventually got the build to succeed after adding the following set of dependencies to spark-core's pom.xml:

<dependency>
  <groupId>javax.servlet</groupId>
  <artifactId>servlet-api</artifactId>
  <version>3.0</version>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-io</artifactId>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-servlet</artifactId>
</dependency>

Pretty much all of the missing class definition errors came up while building HttpServer.scala, and went away after the above dependencies were included. My guess is the official build for spark 1.2 is working already. My question is what is wrong with my environment or setup that requires me to add dependencies to pom.xml in this manner to get the build to succeed. Also, I am not sure if this build would work at runtime for us; I am still testing this out. Thanks, pala
Re: JAVA_HOME problem with upgrade to 1.3.0
JAVA_HOME, an environment variable, should be defined on the node where appattempt_1420225286501_4699_02 ran. Cheers

On Thu, Mar 19, 2015 at 8:59 AM, Williams, Ken ken.willi...@windlogics.com wrote: I’m trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my `build.sbt` like so:

-libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"
+libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"

then make an `assembly` jar, and submit it:

HADOOP_CONF_DIR=/etc/hadoop/conf \
spark-submit \
--driver-class-path=/etc/hbase/conf \
--conf spark.hadoop.validateOutputSpecs=false \
--conf spark.yarn.jar=hdfs:/apps/local/spark-assembly-1.3.0-hadoop2.4.0.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--deploy-mode=cluster \
--master=yarn \
--class=TestObject \
--num-executors=54 \
target/scala-2.11/myapp-assembly-1.2.jar

The job fails to submit, with the following exception in the terminal:

15/03/19 10:30:07 INFO yarn.Client: client token: N/A
diagnostics: Application application_1420225286501_4699 failed 2 times due to AM Container for appattempt_1420225286501_4699_02 exited with exitCode: 127 due to: Exception from container-launch:
org.apache.hadoop.util.Shell$ExitCodeException:
  at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
  at org.apache.hadoop.util.Shell.run(Shell.java:379)
  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
  at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
  at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)

Finally, I go and check the YARN app master’s web interface (since the job is shown, I know it at least made it that far), and the only logs it shows are these:

Log Type: stderr
Log Length: 61
/bin/bash: {{JAVA_HOME}}/bin/java: No such file or directory

Log Type: stdout
Log Length: 0

I’m not sure how to interpret that – is '{{JAVA_HOME}}' a literal (including the brackets) that’s somehow making it into a script? Is this coming from the worker nodes or the driver? Anything I can do to experiment or troubleshoot? -Ken
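On the {{JAVA_HOME}} question: that is YARN's unexpanded placeholder syntax, which the NodeManager normally substitutes when it materializes the container launch script, so a literal {{JAVA_HOME}} in the error usually means the variable is not set on the node that ran the attempt. If fixing the node environment isn't an option, Spark on YARN (1.3) can inject it per application; a sketch, with the JDK path as an assumption:

spark-submit \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/java/jdk1.7.0_67 \
  --conf spark.executorEnv.JAVA_HOME=/usr/java/jdk1.7.0_67 \
  ...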
Re: StorageLevel: OFF_HEAP
Ranga: Please apply the patch from: https://github.com/apache/spark/pull/4867 and rebuild Spark - the build would use Tachyon-0.6.1. Cheers

On Wed, Mar 18, 2015 at 2:23 PM, Ranga sra...@gmail.com wrote: Hi Haoyuan, No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0? Thanks for your help. - Ranga

On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li haoyuan...@gmail.com wrote: Did you recompile it with Tachyon 0.6.0? Also, Tachyon 0.6.1 has been released this morning: http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases Best regards, Haoyuan

On Wed, Mar 18, 2015 at 11:48 AM, Ranga sra...@gmail.com wrote: I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same issue. Here are the logs:

15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to create tachyon dir in /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/driver

Thanks for any other pointers. - Ranga

On Wed, Mar 18, 2015 at 9:53 AM, Ranga sra...@gmail.com wrote: Thanks for the information. Will rebuild with 0.6.0 till the patch is merged.

On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu yuzhih...@gmail.com wrote: Ranga: Take a look at https://github.com/apache/spark/pull/4867 Cheers

On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com wrote: Hi, Ranga, That's true. Typically a version mis-match issue. Note that spark 1.2.1 has tachyon built in with version 0.5.0; I think you may need to rebuild spark with your current tachyon release. We have used tachyon for several of our spark projects in a production environment. Thanks, Sun. -- fightf...@163.com

From: Ranga sra...@gmail.com
Date: 2015-03-18 06:45
To: user@spark.apache.org
Subject: StorageLevel: OFF_HEAP

Hi, I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when I try to persist the RDD, I get the following error:

ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0} TachyonFS.java[connect]:364) - Invalid method name: 'getUserUnderfsTempFolder'
ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0} TachyonFS.java[getFileId]:1020) - Invalid method name: 'user_getFileId'

Is this because of a version mis-match? On a different note, I was wondering if Tachyon has been used in a production environment by anybody in this group? Appreciate your help with this. - Ranga

-- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
Re: Bulk insert strategy
What's the expected number of partitions in your use case ? Have you thought of doing the batching in the workers ? Cheers

On Sat, Mar 7, 2015 at 10:54 PM, A.K.M. Ashrafuzzaman ashrafuzzaman...@gmail.com wrote: While processing a DStream in the Spark Programming Guide, the suggested usage of a connection is the following:

dstream.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  })
})

In this case processing and insertion are done in the workers, and we don’t use batch inserts in the db. How about this use case: we process (parse JSON strings to objects) in the workers, send those objects back to the master, and then issue one bulk insert request. Is there any benefit to sending records individually via a connection pool vs. using a bulk operation from the master?

A.K.M. Ashrafuzzaman
Lead Software Engineer, NewsCred
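Following Ted's suggestion, the bulk insert does not have to happen on the master: each worker can batch within its own partition, which keeps the parallelism and avoids shipping parsed objects back to the driver. A sketch, where batchInsert is an assumed bulk API on the pooled connection and 500 is an arbitrary batch size:

dstream.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    val connection = ConnectionPool.getConnection()
    // group the partition's iterator into chunks and send each chunk as one bulk request
    partitionOfRecords.grouped(500).foreach(batch => connection.batchInsert(batch.toSeq))
    ConnectionPool.returnConnection(connection)
  })
})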
Re: Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit
InputSplit is in the hadoop-mapreduce-client-core jar. Please check that the jar is in your classpath. Cheers

On Mon, Mar 23, 2015 at 8:10 AM, Roy rp...@njit.edu wrote: Hi, I am using CDH 5.3.2 packages installed through Cloudera Manager 5.3.2. I am trying to run one spark job with the following command:

PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G --num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics /home/abc/code/updb/spark/UPDB3analytics.py -date 2015-03-01

but I am getting the following error:

15/03/23 11:06:49 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 7, hdp003.dev.xyz.com): java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit
  at java.lang.Class.getDeclaredConstructors0(Native Method)
  at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532)
  at java.lang.Class.getDeclaredConstructors(Class.java:1901)
  at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1749)
  at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:72)
  at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:250)
  at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:248)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:247)
  at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:613)
  at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
  at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplit
  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
  ... 25 more

Here is the full trace: https://gist.github.com/anonymous/3492f0ec63d7a23c47cf
Re: JDBC DF using DB2
bq. is to modify compute_classpath.sh on all worker nodes to include your driver JARs.

Please follow the above advice. Cheers

On Mon, Mar 23, 2015 at 12:34 PM, Jack Arenas j...@ckarenas.com wrote: Hi Team, I’m trying to create a DF using jdbc as detailed here: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#jdbc-to-other-databases – I’m currently using DB2 v9.7.0.6 and I’ve tried to use the db2jcc.jar and db2jcc_license_cu.jar combo, and while it works in --master local using the command below, I get some strange behavior in --master yarn-client. Here is the command:

val df = sql.load("jdbc", Map(
  "url" -> "jdbc:db2://host:port/db:currentSchema=schema;user=user;password=password;",
  "driver" -> "com.ibm.db2.jcc.DB2Driver",
  "dbtable" -> "table"))

It also seems to be working on yarn-client, because once executed I get the following log:

df: org.apache.spark.sql.DataFrame = [DATE_FIELD: date, INT_FIELD: int, DOUBLE_FIELD: double]

which indicates to me that Spark was able to connect to the DB. But once I run df.count() or df.take(5).foreach(println) in order to operate on the data and get a result, I get back a 'No suitable driver found' exception, which makes me think the driver wasn’t shipped with the spark job. I’ve tried using --driver-class-path, --jars, and SPARK_CLASSPATH to add the jars to the spark job. I also have the jars in my $CLASSPATH and $HADOOP_CLASSPATH. I also saw this in the troubleshooting section, but quite frankly I’m not sure what primordial class loader it’s talking about: "The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs." Any advice is welcome! Thanks, Jack
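For concreteness, a sketch of the two usual ways to make the DB2 driver visible everywhere, with the jar paths as assumptions (the jars must exist at the same path on every worker node):

# either append the jars in compute_classpath.sh on all workers (per the quoted doc), or:
spark-submit \
  --driver-class-path /opt/ibm/db2jcc.jar:/opt/ibm/db2jcc_license_cu.jar \
  --conf spark.executor.extraClassPath=/opt/ibm/db2jcc.jar:/opt/ibm/db2jcc_license_cu.jar \
  ...

The point of both is that the driver jar ends up on the JVM's base classpath rather than in the application jar, which is what DriverManager's primordial-classloader check requires.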
Re: Spark log shows only this line repeated: RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time X
It is logged from RecurringTimer#loop():

private def loop() {
  try {
    while (!stopped) {
      clock.waitTillTime(nextTime)
      callback(nextTime)
      prevTime = nextTime
      nextTime += period
      logDebug("Callback for " + name + " called at time " + prevTime)
    }

For the callback, JobGenerator has this:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventActor ! GenerateJobs(new Time(longTime)), "JobGenerator")
...
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {

Cheers

On Thu, Mar 26, 2015 at 6:55 AM, Adrian Mocanu amoc...@verticalscope.com wrote: Here’s my log output from a streaming job. What is this?

09:54:27.504 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067504
09:54:27.505 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067505
09:54:27.506 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067506
09:54:27.508 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067507
09:54:27.508 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067508
09:54:27.509 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067509
09:54:27.510 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067510
09:54:27.511 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067511
09:54:27.512 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067512
09:54:27.513 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067513
09:54:27.514 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067514
09:54:27.515 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067515
09:54:27.516 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067516
09:54:27.517 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067517
09:54:27.518 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067518
09:54:27.519 [RecurringTimer - JobGenerator] DEBUG o.a.s.streaming.util.RecurringTimer - Callback for JobGenerator called at time 1427378067519
09:54:27.520 [Recurri …
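These lines are just the recurring timer's heartbeat for the job generator; they are harmless and only show up because the application's logging is at DEBUG. A one-line way to silence them, assuming a log4j.properties is in use (the abbreviated logger names in the output suggest logback, which has an equivalent <logger name=".." level="INFO"/> element):

log4j.logger.org.apache.spark.streaming.util.RecurringTimer=INFO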
Re: Hive UDAF percentile_approx says This UDAF does not support the deprecated getEvaluator() method.
Looking at the source code for AbstractGenericUDAFResolver, the following (non-deprecated) method should be called: public GenericUDAFEvaluator getEvaluator(GenericUDAFParameterInfo info) It is called by hiveUdfs.scala (master branch): val parameterInfo = new SimpleGenericUDAFParameterInfo(inspectors.toArray, false, false) resolver.getEvaluator(parameterInfo) FYI On Tue, Jan 13, 2015 at 1:51 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, The following SQL query select percentile_approx(variables.var1, 0.95) p95 from model will throw ERROR SparkSqlInterpreter: Error org.apache.hadoop.hive.ql.parse.SemanticException: This UDAF does not support the deprecated getEvaluator() method. at org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:53) at org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector$lzycompute(hiveUdfs.scala:196) at org.apache.spark.sql.hive.HiveGenericUdaf.objectInspector(hiveUdfs.scala:195) at org.apache.spark.sql.hive.HiveGenericUdaf.dataType(hiveUdfs.scala:203) at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:105) at org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143) at org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143) I'm using latest branch-1.2 I found in PR that percentile and percentile_approx are supported. A bug? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: Is it possible to use json4s 3.2.11 with Spark 1.3.0?
Looking at core/pom.xml:

<dependency>
  <groupId>org.json4s</groupId>
  <artifactId>json4s-jackson_${scala.binary.version}</artifactId>
  <version>3.2.10</version>
</dependency>

The version is hard coded. You can rebuild Spark 1.3.0 with json4s 3.2.11. Cheers

On Mon, Mar 23, 2015 at 2:12 PM, Alexey Zinoviev alexey.zinov...@gmail.com wrote: Spark has a dependency on json4s 3.2.10, but this version has several bugs and I need to use 3.2.11. I added the json4s-native 3.2.11 dependency to build.sbt and everything compiled fine. But when I spark-submit my JAR it provides me with 3.2.10.

build.sbt:

import sbt.Keys._
name := "sparkapp"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided"
libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"

plugins.sbt:

logLevel := Level.Warn
resolvers += Resolver.url("artifactory", url("http://scalasbt.artifactoryonline.com/scalasbt/sbt-plugin-releases"))(Resolver.ivyStylePatterns)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

App1.scala:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{Logging, SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object App1 extends Logging {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("App1")
    val sc = new SparkContext(conf)
    println(s"json4s version: ${org.json4s.BuildInfo.version.toString}")
  }
}

sbt 0.13.7, sbt-assembly 0.13.0, Scala 2.10.4. Is it possible to force 3.2.11 version usage? Thanks, Alexey
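Besides rebuilding Spark, shading json4s inside the application assembly may work; a sketch, assuming an upgrade to sbt-assembly 0.14.x, where shade rules were introduced (the 0.13.0 used in the thread predates them):

// in build.sbt: relocate json4s so the app's 3.2.11 copy can't be shadowed by Spark's 3.2.10
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
)

After assembly, the application code resolves its json4s calls against the relocated classes regardless of what version sits on Spark's classpath.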
Re: pyspark hbase range scan
Have you looked at http://happybase.readthedocs.org/en/latest/ ?

Cheers

On Apr 1, 2015, at 4:50 PM, Eric Kimbrel eric.kimb...@soteradefense.com wrote: I am attempting to read an hbase table in pyspark with a range scan.

    conf = {
        "hbase.zookeeper.quorum": host,
        "hbase.mapreduce.inputtable": table,
        "hbase.mapreduce.scan": scan
    }
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)

If I jump over to scala or java and generate a base64 encoded protobuf scan object and convert it to a string, I can use that value for hbase.mapreduce.scan and everything works: the rdd will correctly perform the range scan and I am happy. The problem is that I can not find any reasonable way to generate that range scan string in python. The scala code required is:

    import org.apache.hadoop.hbase.util.Base64
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil
    import org.apache.hadoop.hbase.client.{Delete, HBaseAdmin, HTable, Put, Result => HBaseResult, Scan}

    val scan = new Scan()
    scan.setStartRow("test_domain\0email".getBytes)
    scan.setStopRow("test_domain\0email~".getBytes)
    def scanToString(scan: Scan): String = {
      Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray())
    }
    scanToString(scan)

Is there another way to perform an hbase range scan from pyspark, or is that functionality something that might be supported in the future?

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-hbase-range-scan-tp22348.html
Re: Connection pooling in spark jobs
http://docs.oracle.com/cd/B10500_01/java.920/a96654/connpoca.htm

The question doesn't seem to be Spark specific, btw.

On Apr 2, 2015, at 4:45 AM, Sateesh Kavuri sateesh.kav...@gmail.com wrote: Hi, We have a case where we will have to run concurrent jobs (for the same algorithm) on different data sets. These jobs can run in parallel, and each one of them would be fetching its data from the database. We would like to optimize the database connections by making use of connection pooling. Any suggestions / best known ways on how to achieve this? The database in question is Oracle. Thanks, Sateesh
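A common pattern here is to open connections per executor rather than per record: use foreachPartition together with a lazily initialized singleton, so each executor JVM creates its connection once and reuses it. A minimal sketch, assuming the Oracle JDBC driver is on the executor classpath and given some rdd: RDD[String] (URL, credentials and table are illustrative; a real pool such as UCP or HikariCP can sit behind the same object):

    import java.sql.{Connection, DriverManager}

    // Initialized lazily on each executor; a Connection is not serializable,
    // so it must be created on the executor side, not shipped from the driver.
    object ConnectionHolder {
      lazy val conn: Connection =
        DriverManager.getConnection("jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "pass")
    }

    rdd.foreachPartition { rows =>
      val stmt = ConnectionHolder.conn.prepareStatement("insert into results values (?)")
      try {
        rows.foreach { r =>
          stmt.setString(1, r)
          stmt.addBatch()
        }
        stmt.executeBatch()   // batch the writes instead of one round trip per row
      } finally {
        stmt.close()          // close the statement, keep the connection for reuse
      }
    }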
Re: Re: maven compile error
Can you include -X in your maven command and pastebin the output?

Cheers

On Apr 3, 2015, at 3:58 AM, myelinji myeli...@aliyun.com wrote: Thank you for your reply. When I'm using maven to compile the whole project, the errors are as follows:

    [INFO] Spark Project Parent POM .......................... SUCCESS [4.136s]
    [INFO] Spark Project Networking .......................... SUCCESS [7.405s]
    [INFO] Spark Project Shuffle Streaming Service ........... SUCCESS [5.071s]
    [INFO] Spark Project Core ................................ SUCCESS [3:08.445s]
    [INFO] Spark Project Bagel ............................... SUCCESS [21.613s]
    [INFO] Spark Project GraphX .............................. SUCCESS [58.915s]
    [INFO] Spark Project Streaming ........................... SUCCESS [1:26.202s]
    [INFO] Spark Project Catalyst ............................ FAILURE [1.537s]
    [INFO] Spark Project SQL ................................. SKIPPED
    [INFO] Spark Project ML Library .......................... SKIPPED
    [INFO] Spark Project Tools ............................... SKIPPED
    [INFO] Spark Project Hive ................................ SKIPPED
    [INFO] Spark Project REPL ................................ SKIPPED
    [INFO] Spark Project Assembly ............................ SKIPPED
    [INFO] Spark Project External Twitter .................... SKIPPED
    [INFO] Spark Project External Flume Sink ................. SKIPPED
    [INFO] Spark Project External Flume ...................... SKIPPED
    [INFO] Spark Project External MQTT ....................... SKIPPED
    [INFO] Spark Project External ZeroMQ ..................... SKIPPED
    [INFO] Spark Project External Kafka ...................... SKIPPED
    [INFO] Spark Project Examples ............................ SKIPPED

It seems like there is something wrong with the catalyst project. Why can't I compile this project?

From: Sean Owen so...@cloudera.com
Date: Friday, April 3, 2015 17:48
To: myelinji myeli...@aliyun.com
Cc: Spark user group user@spark.apache.org
Subject: Re: maven compile error

If you're asking about a compile error, you should include the command you used to compile. I am able to compile branch 1.2 successfully with "mvn -DskipTests clean package". This error is actually an error from scalac, not a compile error from the code. It sort of sounds like it has not been able to download scala dependencies. Check or maybe recreate your environment.

On Fri, Apr 3, 2015 at 3:19 AM, myelinji myeli...@aliyun.com wrote: Hi, all: Just now I checked out spark-1.2 on github and wanted to build it with maven; however I encountered an error during compiling:

    [INFO] [ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-catalyst_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. -> [Help 1]
    org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-catalyst_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found.
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:537)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:290)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:230)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:409)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:352)
    Caused by:
Re: About Waiting batches on the spark streaming UI
Maybe add another stat for batches waiting in the job queue?

Cheers

On Fri, Apr 3, 2015 at 10:01 AM, Tathagata Das t...@databricks.com wrote: Very good question! This is because the current code is written such that the UI considers a batch as waiting only when it has actually started being processed. That is, batches waiting in the job queue are not considered in the calculation. It is arguable that it may be more intuitive to count those in the waiting as well.

On Apr 3, 2015 12:59 AM, bit1...@163.com bit1...@163.com wrote: I copied the following from the spark streaming UI, and I don't know why the Waiting batches is 1; my understanding is that it should be 72. Following is my reasoning:
1. Total time is 1 minute 35 seconds = 95 seconds.
2. Batch interval is 1 second, so 95 batches are generated in 95 seconds.
3. Processed batches are 23 (correct, because my processing code does nothing but sleep 4 seconds).
4. Then the waiting batches should be 95 - 23 = 72.

- Started at: Fri Apr 03 15:17:47 CST 2015
- Time since start: 1 minute 35 seconds
- Network receivers: 1
- Batch interval: 1 second
- Processed batches: 23
- Waiting batches: 1
- Received records: 0
- Processed records: 0

-- bit1...@163.com
Re: Spark SQL vs map reduce tableInputOutput
Please take a look at https://issues.apache.org/jira/browse/PHOENIX-1815

On Mon, Apr 20, 2015 at 10:11 AM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks for the reply. Would using Phoenix inside Spark be useful? What is the best way to bring data from HBase into Spark, in terms of application performance? Regards Jeetendra

On 20 April 2015 at 20:49, Ted Yu yuzhih...@gmail.com wrote: To my knowledge, Spark SQL currently doesn't provide range scan capability against hbase. Cheers

On Apr 20, 2015, at 7:54 AM, Jeetendra Gangele gangele...@gmail.com wrote: HI All, I am querying Hbase and combining the result for use in my spark job. I am querying hbase using the Hbase client api inside my spark job. Can anybody suggest whether Spark SQL will be fast enough and provide a range of queries? Regards Jeetendra
Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)
The NativeS3FileSystem class is in the hadoop-aws jar. Looks like it was not on the classpath.

Cheers

On Thu, Apr 23, 2015 at 7:30 AM, Sujee Maniyam su...@sujee.net wrote: Thanks all... btw, s3n load works without any issues with spark-1.3.1-built-for-hadoop 2.4. I tried this on 1.3.1-hadoop26:

    sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    val f = sc.textFile("s3n://bucket/file")
    f.count

Now it can't find the implementation class. Looks like some jar is missing?

    java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2578)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

On Wednesday, April 22, 2015, Shuai Zheng szheng.c...@gmail.com wrote: Below is my code to access s3n without problem (only for 1.3.1; there is a bug in 1.3.0).

    Configuration hadoopConf = ctx.hadoopConfiguration();
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
    hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId);
    hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey);

Regards, Shuai

From: Sujee Maniyam [mailto:su...@sujee.net]
Sent: Wednesday, April 22, 2015 12:45 PM
To: Spark User List
Subject: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)

Hi all, I am unable to access s3n:// urls using sc.textFile(); I am getting a 'no file system for scheme s3n://' error. A bug, or some conf setting missing? See below for details:

env variables: AWS_SECRET_ACCESS_KEY=set AWS_ACCESS_KEY_ID=set

spark/RELEASE: Spark 1.3.1 (git revision 908a0bf) built for Hadoop 2.6.0 Build flags: -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Pyarn -DzincPort=3034

    ./bin/spark-shell
    val f = sc.textFile("s3n://bucket/file")
    f.count

error:

    java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1006)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
    at $iwC$$iwC$$iwC.<init>(<console>:37)
    at $iwC$$iwC.<init>(<console>:39)
    at $iwC.<init>(<console>:41)
    at <init>(<console>:43)
    at .<init>(<console>:47)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at
Re: Slower performance when bigger memory?
Shuai: Please take a look at http://blog.takipi.com/garbage-collectors-serial-vs-parallel-vs-cms-vs-the-g1-and-whats-new-in-java-8/

On Apr 23, 2015, at 10:18 AM, Dean Wampler deanwamp...@gmail.com wrote: JVMs often have significant GC overhead with heaps bigger than 64GB. You might try your experiments with configurations below this threshold. dean

Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @deanwampler http://polyglotprogramming.com

On Thu, Apr 23, 2015 at 12:14 PM, Shuai Zheng szheng.c...@gmail.com wrote: Hi All, I am running some benchmarks on an r3.8xlarge instance. I have a cluster with one master (no executor on it) and one slave (r3.8xlarge). My job has 1000 tasks in stage 0. An r3.8xlarge has 244G memory and 32 cores. If I create 4 executors, each with 8 cores + 50G memory, each task takes around 320s-380s. If I instead use one big executor with 32 cores and 200G memory, each task takes 760s-900s. Checking the logs, it looks like the minor GC takes much longer when using 200G memory:

    285.242: [GC [PSYoungGen: 29027310K->8646087K(31119872K)] 38810417K->19703013K(135977472K), 11.2509770 secs] [Times: user=38.95 sys=120.65, real=11.25 secs]

When it uses 50G memory, the minor GC takes less than 1s. I am trying to find the best way to configure Spark. For some special reasons, I am tempted to use a bigger memory on a single executor if there is no significant penalty on performance, but it now looks like there is. Anyone have any ideas? Regards, Shuai
Re: implicits is not a member of org.apache.spark.sql.SQLContext
Have you tried the following?

    import sqlContext._
    import sqlContext.implicits._

Cheers

On Tue, Apr 21, 2015 at 7:54 AM, Wang, Ningjun (LNG-NPV) ningjun.w...@lexisnexis.com wrote: I tried to convert an RDD to a data frame using the example code on the spark website:

    case class Person(name: String, age: Int)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()

It compiles fine using sbt. But in IntelliJ IDEA 14.0.3 it fails to compile and generates the following error:

    Error:(289, 23) value implicits is not a member of org.apache.spark.sql.SQLContext
    import sqlContext.implicits._
                      ^

What is the problem here? Is there any work around (e.g. doing an explicit conversion)? Here is my build.sbt:

    name := "my-project"
    version := "0.2"
    scalaVersion := "2.10.4"
    val sparkVersion = "1.3.1"
    val luceneVersion = "4.10.2"
    libraryDependencies <<= scalaVersion { scala_version => Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
      "spark.jobserver" % "job-server-api" % "0.4.1" % "provided",
      "org.scalatest" %% "scalatest" % "2.2.1" % "test"
    ) }
    resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
    resolvers += Resolver.mavenLocal

Ningjun
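As a workaround for the IDE, a minimal sketch of the explicit conversion route, which avoids the implicits import entirely (same Person case class, sc and sqlContext as in the snippet above):

    val peopleRDD = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    // explicit alternative to .toDF(); works in Spark 1.3 without sqlContext.implicits._
    val peopleDF = sqlContext.createDataFrame(peopleRDD)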
Re: Meet Exception when learning Broadcast Variables
Does line 27 correspond to brdcst.value?

Cheers

On Apr 21, 2015, at 3:19 AM, donhoff_h 165612...@qq.com wrote: Hi, experts. I wrote a very little program to learn how to use Broadcast Variables, but met an exception. The program and the exception are listed as follows. Could anyone help me solve this problem? Thanks!

My program is as follows:

    object TestBroadcast02 {
      var brdcst: Broadcast[Array[Int]] = null

      def main(args: Array[String]) {
        val conf = new SparkConf()
        val sc = new SparkContext(conf)
        brdcst = sc.broadcast(Array(1, 2, 3, 4, 5, 6))
        val rdd = sc.textFile("hdfs://bgdt-dev-hrb/user/spark/tst/charset/A_utf8.txt")
        rdd.foreachPartition(fun1)
        sc.stop()
      }

      def fun1(it: Iterator[String]): Unit = {
        val v = brdcst.value
        for (i <- v) println("BroadCast Variable:" + i)
        for (j <- it) println("Text File Content:" + j)
      }
    }

The exception is as follows:

    15/04/21 17:39:53 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, bgdt01.dev.hrb): java.lang.NullPointerException
    at dhao.test.BroadCast.TestBroadcast02$.fun1(TestBroadcast02.scala:27)
    at dhao.test.BroadCast.TestBroadcast02$$anonfun$main$1.apply(TestBroadcast02.scala:22)
    at dhao.test.BroadCast.TestBroadcast02$$anonfun$main$1.apply(TestBroadcast02.scala:22)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1497)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

By the way, if I use an anonymous function instead of 'fun1' in my program, it works. But since I think the readability of anonymous functions is not as good, I still prefer to use 'fun1'.
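The anonymous-function version works because the closure captures the Broadcast handle at the call site on the driver; 'fun1' instead reads the object-level var, which is only assigned inside main() on the driver and is still null when the object is re-initialized on executors (hence the NPE at brdcst.value). A minimal sketch of one fix, passing the handle in explicitly:

    import org.apache.spark.broadcast.Broadcast

    // Curried so that fun1(brdcst) eta-expands to the Iterator[String] => Unit
    // that foreachPartition expects; the handle now travels with the closure.
    def fun1(brd: Broadcast[Array[Int]])(it: Iterator[String]): Unit = {
      for (i <- brd.value) println("BroadCast Variable:" + i)
      for (j <- it) println("Text File Content:" + j)
    }

    rdd.foreachPartition(fun1(brdcst))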
Re: spark 1.3.1 : unable to access s3n:// urls (no file system for scheme s3n:)
This thread from the hadoop mailing list should give you some clue: http://search-hadoop.com/m/LgpTk2df7822

On Wed, Apr 22, 2015 at 9:45 AM, Sujee Maniyam su...@sujee.net wrote: Hi all, I am unable to access s3n:// urls using sc.textFile(); I am getting a 'no file system for scheme s3n://' error. A bug, or some conf setting missing? See below for details:

env variables: AWS_SECRET_ACCESS_KEY=set AWS_ACCESS_KEY_ID=set

spark/RELEASE: Spark 1.3.1 (git revision 908a0bf) built for Hadoop 2.6.0 Build flags: -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Pyarn -DzincPort=3034

    ./bin/spark-shell
    val f = sc.textFile("s3n://bucket/file")
    f.count

error:

    java.io.IOException: No FileSystem for scheme: s3n
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1512)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1006)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:24)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
    at $iwC$$iwC$$iwC.<init>(<console>:37)
    at $iwC$$iwC.<init>(<console>:39)
    at $iwC.<init>(<console>:41)
    at <init>(<console>:43)
    at .<init>(<console>:47)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:856)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:901)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:813)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:656)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:664)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:669)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:996)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:944)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:944)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1058)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
    at
Re: Auto Starting a Spark Job on Cluster Starup
This thread seems related: http://search-hadoop.com/m/JW1q51W02V

Cheers

On Wed, Apr 22, 2015 at 6:09 AM, James King jakwebin...@gmail.com wrote: What's the best way to start up a spark job as part of starting up the Spark cluster? I have a single uber jar for my job and want to make the start-up as easy as possible. Thanks jk
Re: Spark Performance on Yarn
In the master branch, overhead is now 10%. That would be 500 MB.

FYI

On Apr 22, 2015, at 8:26 AM, nsalian neeleshssal...@gmail.com wrote: +1 to executor-memory to 5g. Do check the overhead space for both the driver and the executor as per Wilfred's suggestion. Typically, 384 MB should suffice.

-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Performance-on-Yarn-tp21729p22610.html
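For reference, a minimal sketch of setting the overhead explicitly on YARN (property names as in Spark 1.x; the values are illustrative, and the overhead is given in MB):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "5g")
      // per the note above, the default in master is now 10% of executor memory
      .set("spark.yarn.executor.memoryOverhead", "512")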
Re: Parquet error reading data that contains array of structs
Yin: The Fix Version of SPARK-4520 is not set. I assume it was fixed in 1.3.0.

Cheers

On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai yh...@databricks.com wrote: The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark?

On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, My data looks like this:

    +-----------+----------------------------+---------+
    | col_name  | data_type                  | comment |
    +-----------+----------------------------+---------+
    | cust_id   | string                     |         |
    | part_num  | int                        |         |
    | ip_list   | array<struct<ip:string>>   |         |
    | vid_list  | array<struct<vid:string>>  |         |
    | fso_list  | array<struct<fso:string>>  |         |
    | src       | string                     |         |
    | date      | int                        |         |
    +-----------+----------------------------+---------+

And when I did select *, it reported ParquetDecodingException. Is this type not supported in SparkSQL? Detailed error message here:

    Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 (TID 510, lvshdc5dn0542.lvs.paypal.com): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://xxx/part-m-0.gz.parquet
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
    at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:122)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1498)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:64)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
    Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at java.util.ArrayList.elementData(ArrayList.java:400)
    at java.util.ArrayList.get(ArrayList.java:413)
    at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
    at parquet.io.GroupColumnIO.getLast(GroupColumnIO.java:95)
    at parquet.io.PrimitiveColumnIO.getLast(PrimitiveColumnIO.java:80)
    at parquet.io.PrimitiveColumnIO.isLast(PrimitiveColumnIO.java:74)
    at parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:290)
    at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:131)
    at parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:96)
    at parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:136)
    at parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:96)
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)

-- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: ORCFiles
Please see SPARK-2883. There is no Fix Version yet.

On Fri, Apr 24, 2015 at 5:45 PM, David Mitchell jdavidmitch...@gmail.com wrote: Does anyone know in which version of Spark there will be support for ORCFiles via spark.sql.hive? Will it be in 1.4? David
Re: Spark SQL 1.3.1: java.lang.ClassCastException is thrown
Looks like this is related: https://issues.apache.org/jira/browse/SPARK-5456

On Sat, Apr 25, 2015 at 6:59 AM, doovs...@sina.com wrote: Hi all, When I query Postgresql based on Spark SQL like this:

    dataFrame.registerTempTable("Employees")
    val emps = sqlContext.sql("select name, sum(salary) from Employees group by name, salary")
    monitor {
      emps.take(10)
        .map(row => (row.getString(0), row.getDecimal(1)))
        .foreach(println)
    }

The type of the salary column in the data table is numeric(10, 2). It throws the following exception:

    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal

Who knows about this issue and how to solve it? Thanks. Regards, Yi
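Until the fix is available, a possible workaround (a sketch; column and table names are from the post, and the cast trades the decimal for a plain double, so precision may be lost on money values) is to cast the aggregate in SQL and read it with getDouble:

    val emps = sqlContext.sql(
      "select name, cast(sum(salary) as double) as total from Employees group by name, salary")
    emps.take(10)
      .map(row => (row.getString(0), row.getDouble(1)))
      .foreach(println)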
Re: How to debug Spark on Yarn?
For step 2, you can pipe the application log to a file instead of copy-pasting.

Cheers

On Apr 22, 2015, at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I submit a spark app to YARN and I get these messages:

    15/04/22 22:45:04 INFO yarn.Client: Application report for application_1429087638744_101363 (state: RUNNING)
    15/04/22 22:45:04 INFO yarn.Client: Application report for application_1429087638744_101363 (state: RUNNING)
    ...

1) I can go to the Spark UI and see the status of the app, but I cannot see the logs as the job progresses. How can I see logs of executors as they progress?

2) In case the app fails/completes, the Spark UI vanishes and I get a YARN job page that says the job failed; there is no link to the Spark UI anymore. Then I take the job ID and run "yarn application logs appid", and my console fills up (with huge scrolling) with the logs of all executors. Then I copy and paste into a text editor and search for the keywords "Exception" and "Job aborted due to". Is this the right way to view logs?

-- Deepak
Re: Spark SQL vs map reduce tableInputOutput
To my knowledge, Spark SQL currently doesn't provide range scan capability against hbase.

Cheers

On Apr 20, 2015, at 7:54 AM, Jeetendra Gangele gangele...@gmail.com wrote: HI All, I am querying Hbase and combining the result for use in my spark job. I am querying hbase using the Hbase client api inside my spark job. Can anybody suggest whether Spark SQL will be fast enough and provide a range of queries? Regards Jeetendra
Re: compliation error
What JDK release are you using? Can you give the complete command you used? Which Spark branch are you working with?

Cheers

On Sun, Apr 19, 2015 at 7:25 PM, Brahma Reddy Battula brahmareddy.batt...@huawei.com wrote: Hi All, I am getting the following error when compiling Spark. What did I miss? I even googled and did not find the exact solution for this...

    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class: 48188 -> [Help 1]
    org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:2.2:shade (default) on project spark-assembly_2.10: Error creating shaded jar: Error in ASM processing class com/ibm/icu/impl/data/LocaleElements_zh__PINYIN.class
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)

Thanks & Regards, Brahma Reddy Battula
Re: HBase HTable constructor hangs
Can you give us more information ? Such as hbase release, Spark release. If you can pastebin jstack of the hanging HTable process, that would help. BTW I used http://search-hadoop.com/?q=spark+HBase+HTable+constructor+hangs and saw a very old thread with this subject. Cheers On Tue, Apr 28, 2015 at 7:12 PM, tridib tridib.sama...@live.com wrote: I am exactly having same issue. I am running hbase and spark in docker container. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HBase-HTable-constructor-hangs-tp4926p22696.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: hive-thriftserver maven artifact
Credit goes to Misha Chernetsov (see SPARK-4925) FYI On Tue, Apr 28, 2015 at 8:25 AM, Marco marco@gmail.com wrote: Thx Ted for the info ! 2015-04-27 23:51 GMT+02:00 Ted Yu yuzhih...@gmail.com: This is available for 1.3.1: http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10 FYI On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote: Ok, so will it be only available for the next version (1.30)? 2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com: I searched for 'spark-hive-thriftserver_2.10' on this page: http://mvnrepository.com/artifact/org.apache.spark Looks like it is not published. On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote: Hi, I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the artifact in a public repository ? I have not found it @Maven Central. Thanks, Marco -- Viele Grüße, Marco -- Viele Grüße, Marco
Re: HBase HTable constructor hangs
How did you distribute hbase-site.xml to the nodes? Looks like HConnectionManager couldn't find the hbase:meta server.

Cheers

On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta tridib.sama...@live.com wrote: I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0. Here is the jstack trace. Complete stack trace attached.

    "Executor task launch worker-1" #58 daemon prio=5 os_prio=0 tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:152)
    - locked 0xf8cb7258 (a org.apache.hadoop.hbase.client.RpcRetryingCaller)
    at org.apache.hadoop.hbase.client.HTable.getRowOrBefore(HTable.java:705)
    at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:144)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:1102)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1162)
    - locked 0xf84ac0b0 (a java.lang.Object)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
    at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
    at com.mypackage.storeTuples(CubeStoreService.java:59)
    at com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:23)
    at com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:13)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

    "Executor task launch worker-0" #57 daemon prio=5 os_prio=0 tid=0x7fd3d0443800 nid=0x487 waiting for monitor entry [0x7fd4506d8000]
       java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1156)
    - waiting to lock 0xf84ac0b0 (a java.lang.Object)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1054)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:1011)
    at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:326)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:192)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:150)
    at com.mypackage.storeTuples(CubeStoreService.java:59)
    at com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:23)
    at com.mypackage.StorePartitionToHBaseStoreFunction.call(StorePartitionToHBaseStoreFunction.java:13)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$foreachPartition$1.apply(JavaRDDLike.scala:195)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:773)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

-- Date: Tue, 28 Apr 2015 19:35:26 -0700
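One way to take hbase-site.xml out of the equation while debugging is to set the ZooKeeper quorum explicitly on the Configuration used to build the HTable. A minimal sketch (host and table names are illustrative; HBase 0.98-era client API):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable

    val hbaseConf = HBaseConfiguration.create()
    // without these, the client falls back to whatever hbase-site.xml (if any)
    // happens to be on the executor classpath
    hbaseConf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com")
    hbaseConf.set("hbase.zookeeper.property.clientPort", "2181")
    val table = new HTable(hbaseConf, "my_table")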
Re: Too many open files when using Spark to consume messages from Kafka
Can you run the command 'ulimit -n' to see the current limit ? To configure ulimit settings on Ubuntu, edit */etc/security/limits.conf* Cheers On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala After around 12 hours, I got the following error messages in Spark log: 15/04/29 20:18:10 ERROR JobScheduler: Error generating jobs for time 143033869 ms org.apache.spark.SparkException: ArrayBuffer(java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files) at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:94) at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:116) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at
Re: Too many open files when using Spark to consume messages from Kafka
Maybe add statement.close() in a finally block? Streaming / Kafka experts may have better insight.

On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Thanks for the suggestion. I ran the command and the limit is 1024. Based on my understanding, the connector to Kafka should not open so many files. Do you think there is possible socket leakage? BTW, in every batch, which is 5 seconds, I output some results to mysql:

    def ingestToMysql(data: Array[String]) {
      val url = "jdbc:mysql://localhost:3306/realtime?user=root&password=123"
      var sql = "insert into loggingserver1 values "
      data.foreach(line => sql += line)
      sql = sql.dropRight(1)
      sql += ";"
      logger.info(sql)
      var conn: java.sql.Connection = null
      try {
        conn = DriverManager.getConnection(url)
        val statement = conn.createStatement()
        statement.executeUpdate(sql)
      } catch {
        case e: Exception => logger.error(e.getMessage())
      } finally {
        if (conn != null) {
          conn.close
        }
      }
    }

I am not sure whether the leakage originates from the Kafka connector or the sql connections. Bill

On Wed, Apr 29, 2015 at 2:12 PM, Ted Yu yuzhih...@gmail.com wrote: Can you run the command 'ulimit -n' to see the current limit? To configure ulimit settings on Ubuntu, edit /etc/security/limits.conf Cheers

On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala After around 12 hours, I got the following error messages in the Spark log: 15/04/29 20:18:10 ERROR JobScheduler: Error generating jobs for time 143033869 ms org.apache.spark.SparkException: ArrayBuffer(java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files, java.io.IOException: Too many open files) at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:94) at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:116) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287) at scala.Option.orElse(Option.scala:257) at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284) at org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300
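A minimal sketch of the change Ted suggests, closing the Statement as well as the Connection in a finally block so a failed batch cannot leak descriptors (url and table are from the original post; System.err stands in for the poster's logger, and the string building is kept as-is):

    def ingestToMysql(data: Array[String]): Unit = {
      val url = "jdbc:mysql://localhost:3306/realtime?user=root&password=123"
      var sql = "insert into loggingserver1 values "
      data.foreach(line => sql += line)
      sql = sql.dropRight(1) + ";"
      var conn: java.sql.Connection = null
      var stmt: java.sql.Statement = null
      try {
        conn = java.sql.DriverManager.getConnection(url)
        stmt = conn.createStatement()
        stmt.executeUpdate(sql)
      } catch {
        case e: Exception => System.err.println(e.getMessage) // stand-in for logger.error
      } finally {
        if (stmt != null) stmt.close() // was missing: open statements hold resources too
        if (conn != null) conn.close()
      }
    }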
Re: real time Query engine Spark-SQL on Hbase
bq. a single query on one filter criteria

Can you tell us more about your filter? How selective is it? Which hbase release are you using?

Cheers

On Thu, Apr 30, 2015 at 7:23 AM, Siddharth Ubale siddharth.ub...@syncoms.com wrote: Hi, I want to use Spark as a query engine on HBase with sub-second latency. I am using Spark version 1.3, and followed the steps below on an Hbase table with around 3.5 lakh (350,000) rows:

1. Mapped the DataFrame to the Hbase table. RDDCustomers maps to the hbase table which is used to create the DataFrame:

    DataFrame schemaCustomers = sqlInstance.createDataFrame(
        SparkContextImpl.getRddCustomers(), Customers.class);

2. Used registerTempTable, i.e.:

    schemaCustomers.registerTempTable("customers");

3. Running the query on the DataFrame using the Sqlcontext instance.

What I am observing is that for a single query on one filter criterion the query takes 7-8 seconds, and the time increases as I increase the number of rows in the Hbase table. Also, there was one time when I was getting query responses under 1-2 seconds, which seems like strange behavior. Is this expected behavior from Spark, or am I missing something here? Can somebody help me understand this scenario? Please assist. Thanks, Siddharth Ubale,
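If the queries repeat over the same registered table, one thing worth trying is caching it, so that only the first query pays for the HBase scan and subsequent filters run against Spark's in-memory columnar store. A minimal sketch in Scala ("customers" is the temp table registered above; the filter column is illustrative):

    sqlContext.cacheTable("customers")
    // the first query pays the HBase scan and populates the cache
    sqlContext.sql("select * from customers where name = 'foo'").count()
    // subsequent filters are served from memory
    sqlContext.sql("select * from customers where name = 'bar'").count()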
Re: spark with standalone HBase
$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1632) at org.apache.spark.rdd.RDD.count(RDD.scala:1012) at org.apache.spark.examples.HBaseTest$.main(HBaseTest.scala:58) at org.apache.spark.examples.HBaseTest.main(HBaseTest.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:607) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:190) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Thu, Apr 30, 2015 at 1:25 PM, Saurabh Gupta saurabh.gu...@semusi.com wrote: I am using hbase -0.94.8. On Wed, Apr 29, 2015 at 11:56 PM, Ted Yu yuzhih...@gmail.com wrote: Can you enable HBase DEBUG logging in log4j.properties so that we can have more clue ? What hbase release are you using ? Cheers On Wed, Apr 29, 2015 at 4:27 AM, Saurabh Gupta saurabh.gu...@semusi.com wrote: Hi, I am working with standalone HBase. And I want to execute HBaseTest.scala (in scala examples) . I have created a test table with three rows and I just want to get the count using HBaseTest.scala I am getting this issue: 15/04/29 11:17:10 INFO BlockManagerMaster: Registered BlockManager 15/04/29 11:17:11 INFO ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT 15/04/29 11:17:11 INFO ZooKeeper: Client environment:host.name =ip-10-144-185-113 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.version=1.7.0_79 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.vendor=Oracle Corporation 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-7-openjdk-amd64/jre 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.class.path=/home/ubuntu/sparkfolder/conf/:/home/ubuntu/sparkfolder/assembly/target/scala-2.10/spark-assembly-1.4.0-SNAPSHOT-hadoop2.2.0.jar:/home/ubuntu/sparkfolder/lib_managed/jars/datanucleus-core-3.2.10.jar:/home/ubuntu/sparkfolder/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/home/ubuntu/sparkfolder/lib_managed/jars/datanucleus-rdbms-3.2.9.jar 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.io.tmpdir=/tmp 15/04/29 11:17:11 INFO ZooKeeper: Client environment:java.compiler=NA 15/04/29 11:17:11 INFO ZooKeeper: Client environment:os.name=Linux 15/04/29 11:17:11 INFO ZooKeeper: Client environment:os.arch=amd64 15/04/29 11:17:11 INFO ZooKeeper: Client environment:os.version=3.13.0-49-generic 15/04/29 11:17:11 INFO ZooKeeper: Client environment:user.name=root 15/04/29 11:17:11 INFO ZooKeeper: Client environment:user.home=/root 15/04/29 11:17:11 INFO ZooKeeper: Client environment:user.dir=/home/ubuntu/sparkfolder 15/04/29 11:17:11 INFO ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=9 
watcher=hconnection-0x2711025f, quorum=localhost:2181, baseZNode=/hbase
15/04/29 11:17:11 INFO RecoverableZooKeeper: Process identifier=hconnection-0x2711025f connecting to ZooKeeper ensemble=localhost:2181
15/04/29 11:17:11 INFO ClientCnxn: Opening socket connection to server ip-10-144-185-113/10.144.185.113:2181. Will not attempt to authenticate using SASL (unknown error)
15/04/29 11:17:11 INFO ClientCnxn: Socket connection established to ip-10-144-185-113/10.144.185.113:2181, initiating session
15/04/29 11:17:11 INFO ClientCnxn: Session establishment complete on server ip-10-144-185-113/10.144.185.113:2181, sessionid = 0x14d04d506da0005, negotiated timeout = 4
15/04/29 11:17:11 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null

It's just stuck, not showing any error. There is no Hadoop on my machine. What could be the issue? Here is hbase-site.xml:

    <configuration>
      <property>
        <name>hbase.zookeeper.quorum</name>
        <value>localhost</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
      </property>
      <property>
        <name>zookeeper.znode.parent</name>
        <value>/hbase</value>
      </property>
    </configuration>
Re: hive-thriftserver maven artifact
This is available for 1.3.1: http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10

FYI

On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote: Ok, so will it only be available for the next version (1.3.0)?

2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com: I searched for 'spark-hive-thriftserver_2.10' on this page: http://mvnrepository.com/artifact/org.apache.spark Looks like it is not published.

On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote: Hi, I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the artifact in a public repository? I have not found it @Maven Central. Thanks, Marco

-- Best regards, Marco
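For anyone pulling it via sbt rather than Maven, a one-line sketch of the equivalent coordinates (the %% picks up the Scala 2.10 suffix shown in the URL above):

    libraryDependencies += "org.apache.spark" %% "spark-hive-thriftserver" % "1.3.1"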
Re: Exception in using updateStateByKey
Which hadoop release are you using ? Can you check hdfs audit log to see who / when deleted spark/ck/hdfsaudit/ receivedData/0/log-1430139541443-1430139601443 ? Cheers On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote: Hi, all: I use function updateStateByKey in Spark Streaming, I need to store the states for one minite, I set spark.cleaner.ttl to 120, the duration is 2 seconds, but it throws Exception Caused by: org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042) at org.apache.hadoop.ipc.Client.call(Client.java:1347) at org.apache.hadoop.ipc.Client.call(Client.java:1300) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188) at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source) Why? my code is ssc = StreamingContext(sc,2) kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1}) kvs.window(60,2).map(lambda x: analyzeMessage(x[1]))\ .filter(lambda x: x[1] != None).updateStateByKey(updateStateFunc) \ .filter(lambda x: x[1]['isExisted'] != 1) \ .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))
Re: Spark - Hive Metastore MySQL driver
Can you try the patch from: [SPARK-6913][SQL] Fixed java.sql.SQLException: No suitable driver found Cheers On Sat, Mar 28, 2015 at 12:41 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: This is from my Hive installation:
-sh-4.1$ ls /apache/hive/lib | grep derby
derby-10.10.1.1.jar
derbyclient-10.10.1.1.jar
derbynet-10.10.1.1.jar
-sh-4.1$ ls /apache/hive/lib | grep datanucleus
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
-sh-4.1$ ls /apache/hive/lib | grep mysql
mysql-connector-java-5.0.8-bin.jar
-sh-4.1$ hive --version
Hive 0.13.0.2.1.3.6-2
Subversion git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 -r 87da9430050fb9cc429d79d95626d26ea382b96c
On Sat, Mar 28, 2015 at 1:05 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: I tried a different version of the driver but got the same error:
./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.0.8-bin.jar --files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
On Sat, Mar 28, 2015 at 12:47 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: This is what I am seeing:
./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar --files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver (com.mysql.jdbc.Driver) was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
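The DataNucleus error above says the metastore client running in the driver cannot load com.mysql.jdbc.Driver. As a quick, hypothetical sanity check, the following can be pasted into a spark-shell started with the same classpath options to confirm whether the class is visible in the driver JVM:

// Prints whether the MySQL driver class resolves from the driver's classloader.
try {
  Class.forName("com.mysql.jdbc.Driver")
  println("com.mysql.jdbc.Driver is visible on the driver classpath")
} catch {
  case _: ClassNotFoundException =>
    println("com.mysql.jdbc.Driver is NOT visible on the driver classpath")
}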
./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.34.jar --files $SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options -XX:MaxPermSize=2G --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
Caused by: java.sql.SQLException: No suitable driver found for jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB
at java.sql.DriverManager.getConnection(DriverManager.java:596)
Looks like the driver jar I pulled in is not correct. On Sat, Mar 28, 2015 at 12:34 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Could someone please share the spark-submit command that shows their MySQL jar containing the driver class used to connect to the Hive MySQL metastore? Even after including it through --jars, it is not being picked up.
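For what it's worth, java.sql.SQLException: No suitable driver found usually means the driver class was never registered with DriverManager in the relevant classloader, even when the jar is present. A hedged Scala sketch of forcing registration before opening a connection (the credentials are placeholders, and the URL is copied from the error above, not a known-good endpoint):

import java.sql.DriverManager

// Loading the class runs its static initializer, which registers the driver
// with DriverManager in the current classloader.
Class.forName("com.mysql.jdbc.Driver")

val conn = DriverManager.getConnection(
  "jdbc:mysql://db_host_name.vip.ebay.com:3306/HDB", "user", "password")
try {
  // If this prints a version string, the driver and connectivity are fine.
  println(conn.getMetaData.getDatabaseProductVersion)
} finally {
  conn.close()
}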