Support for group aggregate pandas UDF in streaming aggregation for Spark 3.0 (Python)
Hi, is there any plan to remove the limitation mentioned below? *Streaming aggregation doesn't support group aggregate pandas UDF* We want to run our data modelling jobs in real time using Spark 3.0 and Kafka 2.4, and we need support for custom aggregate pandas UDFs on stream windows. Is there any plan for this in the upcoming releases? Regards, Aesha
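A minimal sketch of the pattern that runs into this limitation, for reference; everything below (the events DataFrame, its columns, the UDF) is hypothetical and not from the original message:

from pyspark.sql.functions import pandas_udf, PandasUDFType, window

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def weighted_mean(v, w):
    # group aggregate pandas UDF: pandas Series in, one scalar out per group
    return float((v * w).sum() / w.sum())

# events is assumed to be a streaming DataFrame already parsed out of Kafka,
# with columns (ts timestamp, key string, v double, w double)
agg = (events
       .withWatermark("ts", "10 minutes")
       .groupBy(window("ts", "5 minutes"), "key")
       .agg(weighted_mean("v", "w").alias("wmean")))

# As of Spark 3.0 this is rejected for streaming DataFrames (the quoted
# limitation); the same aggregation works on a static DataFrame.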
External hive metastore (remote) managed tables
Hi, does anyone know the behavior of dropping managed tables in the case of an external (remote) Hive metastore: does the deletion of the data (e.g. from the object store) happen from Spark SQL, or from the external Hive metastore? I'm confused by the local-mode and remote-mode code paths.
Re: Why Apache Spark doesn't use Calcite?
Thanks Xiao, a more up-to-date publication in a conference like VLDB will certainly turn the tide for many of us trying to defend Spark's Optimizer. On Wed, Jan 15, 2020 at 9:39 AM Xiao Li wrote: > In the upcoming Spark 3.0, we introduced a new framework for Adaptive > Query Execution in Catalyst. This can adjust the plans based on the runtime > statistics. This is missing in Calcite based on my understanding. > > Catalyst is also very easy to enhance. We also use the dynamic programming > approach in our cost-based join reordering. If needed, in the future, we > also can improve the existing CBO and make it more general. The paper of > Spark SQL was published 5 years ago. A lot of great contributions were made > in the past 5 years. > > Cheers, > > Xiao > > Debajyoti Roy 于2020年1月15日周三 上午9:23写道: > >> Thanks all, and Matei. >> >> TL;DR of the conclusion for my particular case: >> Qualitatively, while Catalyst[1] tries to mitigate learning curve and >> maintenance burden, it lacks the dynamic programming approach used by >> Calcite[2] and risks falling into local minima. >> Quantitatively, there is no reproducible benchmark that fairly compares >> optimizer frameworks, apples to apples (excluding execution). >> >> References: >> [1] - >> https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf >> [2] - https://arxiv.org/pdf/1802.10233.pdf >> >> On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia >> wrote: >> >>> I’m pretty sure that Catalyst was built before Calcite, or at least in >>> parallel. Calcite 1.0 was only released in 2015. From a technical >>> standpoint, building Catalyst in Scala also made it more concise and easier >>> to extend than an optimizer written in Java (you can find various >>> presentations about how Catalyst works). >>> >>> Matei >>> >>> > On Jan 13, 2020, at 8:41 AM, Michael Mior wrote: >>> > >>> > It's fairly common for adapters (Calcite's abstraction of a data >>> > source) to push down predicates. However, the API certainly looks a >>> > lot different than Catalyst's. >>> > -- >>> > Michael Mior >>> > mm...@apache.org >>> > >>> > Le lun. 13 janv. 2020 à 09:45, Jason Nerothin >>> > a écrit : >>> >> >>> >> The implementation they chose supports push down predicates, Datasets >>> and other features that are not available in Calcite: >>> >> >>> >> https://databricks.com/glossary/catalyst-optimizer >>> >> >>> >> On Mon, Jan 13, 2020 at 8:24 AM newroyker >>> wrote: >>> >>> >>> >>> Was there a qualitative or quantitative benchmark done before a >>> design >>> >>> decision was made not to use Calcite? >>> >>> >>> >>> Are there limitations (for heuristic based, cost based, * aware >>> optimizer) >>> >>> in Calcite, and frameworks built on top of Calcite? In the context >>> of big >>> >>> data / TPC-H benchmarks. >>> >>> >>> >>> I was unable to dig up anything concrete from user group / Jira. >>> Appreciate >>> >>> if any Catalyst veteran here can give me pointers. Trying to defend >>> >>> Spark/Catalyst. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> -- >>> >>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ >>> >>> >>> >>> - >>> >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>> >>> >>> >> >>> >> >>> >> -- >>> >> Thanks, >>> >> Jason >>> > >>> > - >>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org >>> > >>> >>>
Re: Why Apache Spark doesn't use Calcite?
Thanks all, and Matei. TL;DR of the conclusion for my particular case: Qualitatively, while Catalyst[1] tries to mitigate learning curve and maintenance burden, it lacks the dynamic programming approach used by Calcite[2] and risks falling into local minima. Quantitatively, there is no reproducible benchmark that fairly compares optimizer frameworks, apples to apples (excluding execution). References: [1] - https://amplab.cs.berkeley.edu/wp-content/uploads/2015/03/SparkSQLSigmod2015.pdf [2] - https://arxiv.org/pdf/1802.10233.pdf On Mon, Jan 13, 2020 at 5:37 PM Matei Zaharia wrote: > I’m pretty sure that Catalyst was built before Calcite, or at least in > parallel. Calcite 1.0 was only released in 2015. From a technical > standpoint, building Catalyst in Scala also made it more concise and easier > to extend than an optimizer written in Java (you can find various > presentations about how Catalyst works). > > Matei > > > On Jan 13, 2020, at 8:41 AM, Michael Mior wrote: > > > > It's fairly common for adapters (Calcite's abstraction of a data > > source) to push down predicates. However, the API certainly looks a > > lot different than Catalyst's. > > -- > > Michael Mior > > mm...@apache.org > > > > Le lun. 13 janv. 2020 à 09:45, Jason Nerothin > > a écrit : > >> > >> The implementation they chose supports push down predicates, Datasets > and other features that are not available in Calcite: > >> > >> https://databricks.com/glossary/catalyst-optimizer > >> > >> On Mon, Jan 13, 2020 at 8:24 AM newroyker wrote: > >>> > >>> Was there a qualitative or quantitative benchmark done before a design > >>> decision was made not to use Calcite? > >>> > >>> Are there limitations (for heuristic based, cost based, * aware > optimizer) > >>> in Calcite, and frameworks built on top of Calcite? In the context of > big > >>> data / TPC-H benchmarks. > >>> > >>> I was unable to dig up anything concrete from user group / Jira. > Appreciate > >>> if any Catalyst veteran here can give me pointers. Trying to defend > >>> Spark/Catalyst. > >>> > >>> > >>> > >>> > >>> > >>> -- > >>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > >>> > >>> - > >>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >>> > >> > >> > >> -- > >> Thanks, > >> Jason > > > > - > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > > >
Spark Dataset transformations for time-based events
Hope everyone is enjoying their holidays. If anyone here has run into these time-based event transformation patterns or has a strong opinion about the approach, please let me know / reply on SO:
1. Enrich using as-of time (a sketch follows after this message): https://stackoverflow.com/questions/53928880/how-to-do-a-time-based-as-of-join-of-two-datasets-in-apache-spark
2. Snapshot of state at time t to state with effective start and end times: https://stackoverflow.com/questions/53928372/given-dataset-of-state-snapshots-at-time-t-how-to-transform-it-into-dataset-with/53928400#53928400
Thanks in advance! Roy
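For reference, a hedged sketch of a window-function approach to question 1 (the as-of join); the data and column names below are made up, see the SO link for the real problem statement:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("as-of-join-sketch").getOrCreate()

# events to enrich, and slowly changing dimension rows (toy data)
events = spark.createDataFrame([(1, 100), (1, 205), (2, 150)], ["id", "event_ts"])
dims = spark.createDataFrame([(1, 90, "a"), (1, 200, "b"), (2, 50, "c")],
                             ["id", "effective_ts", "attr"])

# join each event to all dimension rows that are not newer than it,
# then keep only the latest such row per event (the "as-of" match)
joined = (events.alias("e")
          .join(dims.alias("d"),
                (F.col("e.id") == F.col("d.id")) &
                (F.col("d.effective_ts") <= F.col("e.event_ts")),
                "left"))
w = (Window.partitionBy(F.col("e.id"), F.col("e.event_ts"))
     .orderBy(F.col("d.effective_ts").desc()))
as_of = (joined.withColumn("rn", F.row_number().over(w))
               .where(F.col("rn") == 1)
               .drop("rn"))
as_of.show()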
Given events with start and end times, how to count the number of simultaneous events using Spark?
The problem statement and an approach to solve it using windows are described here: https://stackoverflow.com/questions/52509498/given-events-with-start-and-end-times-how-to-count-the-number-of-simultaneous-e Looking for more elegant/performant solutions, if they exist. TIA!
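For reference, a hedged sketch of one common approach: turn each event into a +1 marker at its start and a -1 marker at its end, then take a running sum over the time-ordered boundaries. The schema below is made up:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("concurrent-events-sketch").getOrCreate()

# toy events with numeric start/end times
events = spark.createDataFrame([(1, 10, 20), (2, 15, 25), (3, 18, 30)],
                               ["id", "start", "end"])

deltas = (events.select(F.col("start").alias("t"), F.lit(1).alias("delta"))
          .union(events.select(F.col("end").alias("t"), F.lit(-1).alias("delta"))))

# running sum of the +1/-1 markers = number of events open at each boundary
w = Window.orderBy("t").rowsBetween(Window.unboundedPreceding, Window.currentRow)
concurrency = deltas.withColumn("concurrent", F.sum("delta").over(w))
concurrency.orderBy("t").show()

Caveat: a global Window.orderBy without partitionBy pulls everything into a single partition, so this sketch is not the performant answer for very large inputs.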
spark-submit config via file
Hi, I am trying to deploy a spark job using spark-submit, which takes a bunch of parameters, like:

spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode cluster --executor-memory 3072m --executor-cores 4 --files streaming.conf spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf "streaming.conf"

I was looking for a way to put all these flags in a file to pass to spark-submit, to make my spark-submit command simple, like this:

spark-submit --class StreamingEventWriterDriver --master yarn --deploy-mode cluster --properties-file properties.conf --files streaming.conf spark_streaming_2.11-assembly-1.0-SNAPSHOT.jar -conf "streaming.conf"

properties.conf has the following contents:

spark.executor.memory 3072m
spark.executor.cores 4

But I am getting the following error:

17/03/24 11:36:26 INFO Client: Use hdfs cache file as spark.yarn.archive for HDP, hdfsCacheFile:hdfs:///hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz 17/03/24 11:36:26 WARN AzureFileSystemThreadPoolExecutor: Disabling threads for Delete operation as thread count 0 is <= 1 17/03/24 11:36:26 INFO AzureFileSystemThreadPoolExecutor: Time taken for Delete operation is: 1 ms with threads: 0 17/03/24 11:36:27 INFO Client: Deleted staging directory wasb://a...@abc.blob.core.windows.net/user/sshuser/.sparkStaging/application_1488402758319_0492 Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:///hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:154) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2791) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2825) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2807) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:386) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295) at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:364) at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:480) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:552) at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:881) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:170) at org.apache.spark.deploy.yarn.Client.run(Client.scala:1218) at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1277) at org.apache.spark.deploy.yarn.Client.main(Client.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:745) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 17/03/24 11:36:27 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...

Anyone know if this is even possible? Thanks... Roy
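For what it's worth, --properties-file is the supported way to do this; a hedged sketch of a properties.conf matching the flags above is below. The remaining "Incomplete HDFS URI, no host" error does not look related to the properties file itself -- it appears to come from the HDP-provided spark.yarn.archive pointing at hdfs:/// with no host, so a fully qualified URI (or a resolvable fs.defaultFS on the client) is one thing worth trying; that part is an assumption, not a verified fix.

# properties.conf (standard Spark property names; values are only examples)
spark.executor.memory   3072m
spark.executor.cores    4
# assumption -- only if the Incomplete HDFS URI error persists:
# spark.yarn.archive    hdfs://<namenode-host>:8020/hdp/apps/2.6.0.0-403/spark2/spark2-hdp-yarn-archive.tar.gz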
spark-itemsimilarity No FileSystem for scheme error
Hi, we are using CDH 5.4.0 with Spark 1.5.2 (doesn't come with CDH 5.4.0). I am following this link https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html to try to test/create a new algorithm with mahout item-similarity. I am running the following command:

./bin/mahout spark-itemsimilarity \
  --input $INPUT \
  --output $OUTPUT \
  --filter1 o --filter2 v \
  --inDelim "\t" \
  --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 \
  --master yarn-client \
  -D:fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem \
  -D:fs.file.impl=org.apache.hadoop.fs.LocalFileSystem

I am getting the following error:

java.io.IOException: No FileSystem for scheme: hdfs at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167) at org.apache.spark.deploy.yarn.Client.cleanupStagingDir(Client.scala:143) at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:129) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) at org.apache.spark.SparkContext.<init>(SparkContext.scala:523) at org.apache.mahout.sparkbindings.package$.mahoutSparkContext(package.scala:91) at org.apache.mahout.drivers.MahoutSparkDriver.start(MahoutSparkDriver.scala:83) at org.apache.mahout.drivers.ItemSimilarityDriver$.start(ItemSimilarityDriver.scala:118) at org.apache.mahout.drivers.ItemSimilarityDriver$.process(ItemSimilarityDriver.scala:199) at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:112) at org.apache.mahout.drivers.ItemSimilarityDriver$$anonfun$main$1.apply(ItemSimilarityDriver.scala:110) at scala.Option.map(Option.scala:145) at org.apache.mahout.drivers.ItemSimilarityDriver$.main(ItemSimilarityDriver.scala:110) at org.apache.mahout.drivers.ItemSimilarityDriver.main(ItemSimilarityDriver.scala)

I found a solution by adding the following properties into /etc/hadoop/conf/core-site.xml on the client/gateway machine (more info):

fs.file.impl = org.apache.hadoop.fs.LocalFileSystem (The FileSystem for file: uris.)
fs.hdfs.impl = org.apache.hadoop.hdfs.DistributedFileSystem (The FileSystem for hdfs: uris.)

But is there any better way to solve this error? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-itemsimilarity-No-FileSystem-for-scheme-error-tp25887.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Error loading HBase table into Spark
I want to load an HBase table into Spark:

JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

*When I call hBaseRDD.count(), I get this error:*

Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389) at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158) ... 11 more

*But when the job starts, I can see these logs:*

2015-10-09 09:17:00[main] WARN TableInputFormatBase:447 - initializeTable called multiple times. Overwriting connection and table reference; TableInputFormatBase will not close these old references when done.

Does anyone know how this happens? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-load-hbase-on-spark-tp24986.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
python version in spark-submit
Hi, We have python2.6 (default) on the cluster and we have also installed python2.7. I was looking for a way to set the python version in spark-submit. Anyone know how to do this? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/python-version-in-spark-submit-tp24902.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
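A hedged pointer: the interpreter PySpark uses is normally controlled by the PYSPARK_PYTHON environment variable (and PYSPARK_DRIVER_PYTHON for the driver); on YARN it can also be passed per job via spark-submit. The paths and script name below are only examples:

export PYSPARK_PYTHON=/usr/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7

# or per job:
spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python2.7 \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python2.7 \
  my_job.py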
how to control timeout in node failure for spark task ?
Hi, We are running Spark 1.3 on CDH 5.4.1 on top of YARN. We want to know how to control the task timeout when a node fails, so that the tasks running on it are restarted on another node. At present the job waits approximately 10 minutes before restarting the tasks that were running on the failed node. http://spark.apache.org/docs/latest/configuration.html Here I see many timeout configs, but I don't know which one to override. Any help here? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-control-timeout-in-node-failure-for-spark-task-tp24825.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
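One hedged observation (an assumption to verify, not a confirmed fix): the ~10 minute delay matches YARN's default NodeManager liveness expiry, so the wait before the tasks are rescheduled may be governed by the ResourceManager declaring the node lost rather than by any Spark timeout:

# yarn-site.xml on the ResourceManager; default is 600000 ms = 10 minutes
yarn.nm.liveness-monitor.expiry-interval-ms = 600000

Lowering it makes lost NodeManagers (and the tasks on them) be given up on sooner, at the cost of more false positives on a busy cluster.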
pyspark driver in cluster rather than gateway/client
Hi, Is there any way to make the spark driver run inside YARN containers rather than on the gateway/client machine? At present, even with the config parameters --master yarn & --deploy-mode cluster, the driver runs on the gateway/client machine. We are on CDH 5.4.1 with YARN and Spark 1.3. Any help on this? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-driver-in-cluster-rather-than-gateway-client-tp24641.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: The auxService:spark_shuffle does not exist
We tried --master yarn-client, with no difference in the result. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662p23689.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
The auxService:spark_shuffle does not exist
I am getting the following error for a simple spark job. I am running the following command:

spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar

but the job doesn't show any progress and just keeps printing the following on the console: 15/07/06 22:18:41 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:42 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:43 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:44 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:45 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:46 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:47 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:48 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:49 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING) 15/07/06 22:18:50 INFO Client: Application report for application_1436234808473_0017 (state: RUNNING)

Then I had to kill this job, and looking into the logs I found the following:

Exception in thread ContainerLauncher #4 java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106) at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206) at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110) at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ... 
2 more Exception in thread ContainerLauncher #5 java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:526) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:168) at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106) at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206) at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:110) at org.apache.spark.deploy.yarn.ExecutorRunnable.run(ExecutorRunnable.scala:65) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ... 2 more

I don't know why this is happening. Anyone know what's wrong here? This started happening after the Cloudera Manager upgrade from CM 5.3.1 to CM 5.4.3. We are on CDH 5.3.1. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/The-auxService-spark-shuffle-does-not-exist-tp23662.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To
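For anyone who hits this after an upgrade: the error usually means the external shuffle service is enabled on the job side (spark.shuffle.service.enabled=true and/or spark.dynamicAllocation.enabled=true) while the NodeManagers do not have the spark_shuffle auxiliary service registered. A hedged sketch of the NodeManager-side settings (standard property names; how they are applied under Cloudera Manager may differ):

# yarn-site.xml on every NodeManager, with the spark-<version>-yarn-shuffle.jar on the NM classpath
yarn.nodemanager.aux-services                     = mapreduce_shuffle,spark_shuffle
yarn.nodemanager.aux-services.spark_shuffle.class = org.apache.spark.network.yarn.YarnShuffleService

Alternatively, setting spark.shuffle.service.enabled=false for the job avoids the error, at the cost of not using the external shuffle service.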
Yarn application ID for Spark job on Yarn
Hi, Is there a way to get the YARN application ID inside a Spark application when running the job on YARN? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Yarn-application-ID-for-Spark-job-on-Yarn-tp23429.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
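A hedged pointer: on newer Spark releases the context exposes this directly (it may not be available on older 1.x releases). Assuming an existing SparkContext named sc:

# PySpark: returns something like "application_1436234808473_0017" when running on YARN
print(sc.applicationId)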
Spark job fails silently
Hi, Our spark job on yarn suddenly started failing silently without showing any error; following is the trace. Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default property: spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties Adding default property: spark.eventLog.enabled=true Adding default property: spark.shuffle.service.enabled=true Adding default property: spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native Adding default property: spark.yarn.historyServer.address=http://ds-hnn002.dev.abc.com:18088 Adding default property: spark.yarn.am.extraLibraryPath=/usr/lib/hadoop/lib/native Adding default property: spark.ui.showConsoleProgress=true Adding default property: spark.shuffle.service.port=7337 Adding default property: spark.master=yarn-client Adding default property: spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native Adding default property: spark.eventLog.dir=hdfs://magnetic-hadoop-dev/user/spark/applicationHistory Adding default property: spark.yarn.jar=local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar Parsed arguments: master yarn deployMode null executorMemory 3G executorCores null totalExecutorCores null propertiesFile /usr/lib/spark/conf/spark-defaults.conf driverMemory 4G driverCores null driverExtraClassPath null driverExtraLibraryPath /usr/lib/hadoop/lib/native driverExtraJavaOptions null supervise false queue null numExecutors 30 files null pyFiles null archives null mainClass null primaryResource file:/home/jonathanarfa/code/updb/spark/updb2vw_testing.py name updb2vw_testing.py childArgs [--date 2015-05-20] jars null packages null repositories null verbose true Spark properties used, including those specified through --conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf: spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native spark.yarn.jar - local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088 spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native spark.eventLog.enabled - true spark.ui.showConsoleProgress - true spark.serializer - org.apache.spark.serializer.KryoSerializer spark.executor.extraJavaOptions - -Dlog4j.configuration=file:///etc/spark/log4j.properties spark.shuffle.service.enabled - true spark.shuffle.service.port - 7337 spark.eventLog.dir - hdfs://magnetic-hadoop-dev/user/spark/applicationHistory spark.master - yarn-client Main class: org.apache.spark.deploy.PythonRunner Arguments: file:/home/jonathanarfa/code/updb/spark/updb2vw_testing.py null --date 2015-05-20 System properties: spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native spark.driver.memory - 4G spark.executor.memory - 3G spark.yarn.jar - local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native spark.executor.instances - 30 spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088 spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native spark.ui.showConsoleProgress - true spark.eventLog.enabled - true spark.yarn.dist.files - file:/home/jonathanarfa/code/updb/spark/updb2vw_testing.py SPARK_SUBMIT - true spark.serializer - org.apache.spark.serializer.KryoSerializer spark.executor.extraJavaOptions - 
-Dlog4j.configuration=file:///etc/spark/log4j.properties spark.shuffle.service.enabled - true spark.app.name - updb2vw_testing.py spark.shuffle.service.port - 7337 spark.eventLog.dir - hdfs://magnetic-hadoop-dev/user/spark/applicationHistory spark.master - yarn-client Classpath elements: spark.akka.frameSize=60 spark.app.name=updb2vw_2015-05-20 spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native spark.driver.maxResultSize=2G spark.driver.memory=4G spark.eventLog.dir=hdfs://magnetic-hadoop-dev/user/spark/applicationHistory spark.eventLog.enabled=true spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native spark.executor.instances=30 spark.executor.memory=3G spark.master=yarn-client spark.serializer=org.apache.spark.serializer.KryoSerializer spark.shuffle.manager=hash spark.shuffle.service.enabled=true spark.shuffle.service.port=7337 spark.task.maxFailures=6 spark.ui.showConsoleProgress=true
spark on yarn failing silently
Hi, suddenly our spark job on yarn started failing silently without showing any error; following is the trace in verbose mode. Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.serializer=org.apache.spark.serializer.KryoSerializer Adding default property: spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties Adding default property: spark.eventLog.enabled=true Adding default property: spark.shuffle.service.enabled=true Adding default property: spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native Adding default property: spark.yarn.historyServer.address=http://ds-hnn002.dev.abc.com:18088 Adding default property: spark.yarn.am.extraLibraryPath=/usr/lib/hadoop/lib/native Adding default property: spark.ui.showConsoleProgress=true Adding default property: spark.shuffle.service.port=7337 Adding default property: spark.master=yarn-client Adding default property: spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native Adding default property: spark.eventLog.dir=hdfs://my-hadoop-dev/user/spark/applicationHistory Adding default property: spark.yarn.jar=local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar Parsed arguments: master yarn deployMode null executorMemory 3G executorCores null totalExecutorCores null propertiesFile /usr/lib/spark/conf/spark-defaults.conf driverMemory 4G driverCores null driverExtraClassPath null driverExtraLibraryPath /usr/lib/hadoop/lib/native driverExtraJavaOptions null supervise false queue null numExecutors 30 files null pyFiles null archives null mainClass null primaryResource file:/home/xyz/code/updb/spark/updb2vw_testing.py name updb2vw_testing.py childArgs [--date 2015-05-20] jars null packages null repositories null verbose true Spark properties used, including those specified through --conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf: spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native spark.yarn.jar - local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088 spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native spark.eventLog.enabled - true spark.ui.showConsoleProgress - true spark.serializer - org.apache.spark.serializer.KryoSerializer spark.executor.extraJavaOptions - -Dlog4j.configuration=file:///etc/spark/log4j.properties spark.shuffle.service.enabled - true spark.shuffle.service.port - 7337 spark.eventLog.dir - hdfs://my-hadoop-dev/user/spark/applicationHistory spark.master - yarn-client Main class: org.apache.spark.deploy.PythonRunner Arguments: file:/home/xyz/code/updb/spark/updb2vw_testing.py null --date 2015-05-20 System properties: spark.executor.extraLibraryPath - /usr/lib/hadoop/lib/native spark.driver.memory - 4G spark.executor.memory - 3G spark.yarn.jar - local:/usr/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.0-hadoop2.6.0-cdh5.4.0.jar spark.driver.extraLibraryPath - /usr/lib/hadoop/lib/native spark.executor.instances - 30 spark.yarn.historyServer.address - http://ds-hnn002.dev.abc.com:18088 spark.yarn.am.extraLibraryPath - /usr/lib/hadoop/lib/native spark.ui.showConsoleProgress - true spark.eventLog.enabled - true spark.yarn.dist.files - file:/home/xyz/code/updb/spark/updb2vw_testing.py SPARK_SUBMIT - true spark.serializer - org.apache.spark.serializer.KryoSerializer spark.executor.extraJavaOptions - 
-Dlog4j.configuration=file:///etc/spark/log4j.properties spark.shuffle.service.enabled - true spark.app.name - updb2vw_testing.py spark.shuffle.service.port - 7337 spark.eventLog.dir - hdfs://my-hadoop-dev/user/spark/applicationHistory spark.master - yarn-client Classpath elements: spark.akka.frameSize=60 spark.app.name=updb2vw_2015-05-20 spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native spark.driver.maxResultSize=2G spark.driver.memory=4G spark.eventLog.dir=hdfs://my-hadoop-dev/user/spark/applicationHistory spark.eventLog.enabled=true spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/log4j.properties spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native spark.executor.instances=30 spark.executor.memory=3G spark.master=yarn-client spark.serializer=org.apache.spark.serializer.KryoSerializer spark.shuffle.manager=hash spark.shuffle.service.enabled=true spark.shuffle.service.port=7337 spark.task.maxFailures=6 spark.ui.showConsoleProgress=true spark.yarn.am.extraLibraryPath=/usr/lib/hadoop/lib/native
spark java.io.FileNotFoundException: /user/spark/applicationHistory/application
hi, Suddenly spark jobs started failing with following error Exception in thread main java.io.FileNotFoundException: /user/spark/applicationHistory/application_1432824195832_1275.inprogress (No such file or directory) full trace here [21:50:04 x...@hadoop-client01.dev:~]$ spark-submit --class org.apache.spark.examples.SparkPi --master yarn /usr/lib/spark/lib/spark-examples.jar 10 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.4.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 15/05/28 21:55:21 INFO SparkContext: Running Spark version 1.3.0 15/05/28 21:55:21 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). 15/05/28 21:55:21 WARN SparkConf: SPARK_JAVA_OPTS was detected (set to ' -Dspark.local.dir=/srv/tmp/xyz '). This is deprecated in Spark 1.0+. Please instead use: - ./spark-submit with conf/spark-defaults.conf to set defaults for an application - ./spark-submit with --driver-java-options to set -X options for a driver - spark.executor.extraJavaOptions to set -X options for executors - SPARK_DAEMON_JAVA_OPTS to set java options for standalone daemons (master or worker) 15/05/28 21:55:21 WARN SparkConf: Setting 'spark.executor.extraJavaOptions' to ' -Dspark.local.dir=/srv/tmp/xyz ' as a work-around. 15/05/28 21:55:21 WARN SparkConf: Setting 'spark.driver.extraJavaOptions' to ' -Dspark.local.dir=/srv/tmp/xyz ' as a work-around. 15/05/28 21:55:22 INFO SecurityManager: Changing view acls to: xyz 15/05/28 21:55:22 INFO SecurityManager: Changing modify acls to: xyz 15/05/28 21:55:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(xyz); users with modify permissions: Set(xyz) 15/05/28 21:55:22 INFO Slf4jLogger: Slf4jLogger started 15/05/28 21:55:22 INFO Remoting: Starting remoting 15/05/28 21:55:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkdri...@hadoop-client01.abc.com:51876] 15/05/28 21:55:22 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkdri...@hadoop-client01.abc.com:51876] 15/05/28 21:55:22 INFO Utils: Successfully started service 'sparkDriver' on port 51876. 
15/05/28 21:55:22 INFO SparkEnv: Registering MapOutputTracker 15/05/28 21:55:22 INFO SparkEnv: Registering BlockManagerMaster 15/05/28 21:55:22 INFO DiskBlockManager: Created local directory at /srv/tmp/xyz/spark-1e66e6eb-7ad6-4f62-87fc-f0cfaa631e36/blockmgr-61f866b8-6475-4a11-88b2-792d2ba22662 15/05/28 21:55:22 INFO MemoryStore: MemoryStore started with capacity 265.4 MB 15/05/28 21:55:23 INFO HttpFileServer: HTTP File server directory is /srv/tmp/xyz/spark-2b676170-3f88-44bf-87a3-600de1b7ee24/httpd-b84f76d5-26c7-4c63-9223-f6c5aa3899f0 15/05/28 21:55:23 INFO HttpServer: Starting HTTP Server 15/05/28 21:55:23 INFO Server: jetty-8.y.z-SNAPSHOT 15/05/28 21:55:23 INFO AbstractConnector: Started SocketConnector@0.0.0.0:41538 15/05/28 21:55:23 INFO Utils: Successfully started service 'HTTP file server' on port 41538. 15/05/28 21:55:23 INFO SparkEnv: Registering OutputCommitCoordinator 15/05/28 21:55:23 INFO Server: jetty-8.y.z-SNAPSHOT 15/05/28 21:55:23 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 15/05/28 21:55:23 INFO Utils: Successfully started service 'SparkUI' on port 4040. 15/05/28 21:55:23 INFO SparkUI: Started SparkUI at http://hadoop-client01.abc.com:4040 15/05/28 21:55:23 INFO SparkContext: Added JAR file:/usr/lib/spark/lib/spark-examples.jar at http://10.0.3.62:41538/jars/spark-examples.jar with timestamp 1432850123523 15/05/28 21:55:24 INFO Client: Requesting a new application from cluster with 16 NodeManagers 15/05/28 21:55:24 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (61440 MB per container) 15/05/28 21:55:24 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead 15/05/28 21:55:24 INFO Client: Setting up container launch context for our AM 15/05/28 21:55:24 INFO Client: Preparing resources for our AM container 15/05/28 21:55:24 INFO Client: Setting up the launch environment for our AM container 15/05/28 21:55:24 INFO SecurityManager: Changing view acls to: xyz 15/05/28 21:55:24 INFO SecurityManager: Changing modify acls to: xyz 15/05/28 21:55:24 INFO SecurityManager: SecurityManager:
Re: Spark HistoryServer not coming up
This got resolved after cleaning /user/spark/applicationHistory/* -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-HistoryServer-not-coming-up-tp22975p22981.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark HistoryServer not coming up
Hi, After restarting Spark HistoryServer, it failed to come up, I checked logs for Spark HistoryServer found following messages :' 2015-05-21 11:38:03,790 WARN org.apache.spark.scheduler.ReplayListenerBus: Log path provided contains no log files. 2015-05-21 11:38:52,319 INFO org.apache.spark.deploy.history.HistoryServer: Registered signal handlers for [TERM, HUP, INT] 2015-05-21 11:38:52,328 WARN org.apache.spark.deploy.history.HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead. 2015-05-21 11:38:52,461 INFO org.apache.spark.SecurityManager: Changing view acls to: spark 2015-05-21 11:38:52,462 INFO org.apache.spark.SecurityManager: Changing modify acls to: spark 2015-05-21 11:38:52,463 INFO org.apache.spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark) 2015-05-21 11:41:24,893 ERROR org.apache.spark.deploy.history.HistoryServer: RECEIVED SIGNAL 15: SIGTERM 2015-05-21 11:41:33,439 INFO org.apache.spark.deploy.history.HistoryServer: Registered signal handlers for [TERM, HUP, INT] 2015-05-21 11:41:33,447 WARN org.apache.spark.deploy.history.HistoryServerArguments: Setting log directory through the command line is deprecated as of Spark 1.1.0. Please set this through spark.history.fs.logDirectory instead. 2015-05-21 11:41:33,578 INFO org.apache.spark.SecurityManager: Changing view acls to: spark 2015-05-21 11:41:33,579 INFO org.apache.spark.SecurityManager: Changing modify acls to: spark 2015-05-21 11:41:33,579 INFO org.apache.spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark) 2015-05-21 11:44:07,147 WARN org.apache.hadoop.hdfs.BlockReaderFactory: I/O error constructing remote block reader. 
java.io.EOFException: Premature EOF: no length prefix available at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2109) at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:408) at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:785) at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:663) at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:327) at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:574) at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797) at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844) at java.io.DataInputStream.read(DataInputStream.java:149) at java.io.BufferedInputStream.read1(BufferedInputStream.java:273) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283) at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177) at java.io.InputStreamReader.read(InputStreamReader.java:184) at java.io.BufferedReader.fill(BufferedReader.java:154) at java.io.BufferedReader.readLine(BufferedReader.java:317) at java.io.BufferedReader.readLine(BufferedReader.java:382) at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$5.apply(FsHistoryProvider.scala:175) at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$5.apply(FsHistoryProvider.scala:172) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at
How to process data in chronological order
I have a key-value RDD where the key is a timestamp (femto-second resolution, so grouping buys me nothing) and I want to reduce it in chronological order. How do I do that in spark? I am fine with reducing contiguous sections of the set separately and then aggregating the resulting objects locally. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-process-data-in-chronological-order-tp22966.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
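A hedged sketch of exactly the plan described above: sortByKey range-partitions the data, so after the sort each partition is a contiguous, time-ordered slice that can be reduced independently before a final fold on the driver. combine() below is just string concatenation standing in for the real (order-sensitive) merge, and sc is an existing SparkContext:

def combine(a, b):
    return a + b  # placeholder merge; order matters, a precedes b

rdd = sc.parallelize([(3, "c"), (1, "a"), (2, "b")])  # (timestamp, value)
sorted_rdd = rdd.sortByKey()  # range-partitioned: partition i precedes partition i+1 in time

def reduce_section(iterator):
    acc = None
    for _, value in iterator:  # values arrive in chronological order within the partition
        acc = value if acc is None else combine(acc, value)
    return [acc] if acc is not None else []

# one partial result per contiguous time slice, collected in partition (= time) order
sections = sorted_rdd.mapPartitions(reduce_section).collect()

result = None
for part in sections:  # final chronological fold on the driver
    result = part if result is None else combine(result, part)
print(result)  # "abc"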
Possible to disable Spark HTTP server ?
Hi, When we start a spark job it starts a new HTTP server for each new job. Is it possible to disable the HTTP server for each job? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Possible-to-disable-Spark-HTTP-server-tp22772.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
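A hedged note: if the server in question is the per-application web UI (the one on port 4040), it can be disabled with spark.ui.enabled; the embedded HTTP file server that ships jars/files to executors is needed for the job to run and cannot simply be turned off.

spark-submit --conf spark.ui.enabled=false ...
# or in spark-defaults.conf:
# spark.ui.enabled false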
spark.logConf with log4j.rootCategory=WARN
Hi, I have recently enabled log4j.rootCategory=WARN, console in the spark configuration, but after that spark.logConf=true has become ineffective. So I just want to confirm whether this is because of log4j.rootCategory=WARN? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-logConf-with-log4j-rootCategory-WARN-tp22731.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
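A hedged note: that would explain it, since the spark.logConf dump is written at INFO level (by SparkContext), so a WARN root logger suppresses it. If everything else should stay at WARN but the config dump is still wanted, one option in log4j.properties is to raise just that logger back to INFO:

log4j.rootCategory=WARN, console
log4j.logger.org.apache.spark.SparkContext=INFO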
shuffle.FetchFailedException in spark on YARN job
Hi, My spark job is failing with following error message org.apache.spark.shuffle.FetchFailedException: /mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index (No such file or directory) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.graphx.impl.EdgePartition.updateVertices(EdgePartition.scala:89) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:75) at org.apache.spark.graphx.impl.ReplicatedVertexView$$anonfun$2$$anonfun$apply$1.apply(ReplicatedVertexView.scala:73) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.FileNotFoundException: /mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:146) at org.apache.spark.shuffle.IndexShuffleBlockManager.getBlockData(IndexShuffleBlockManager.scala:109) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:305) at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchLocalBlocks(ShuffleBlockFetcherIterator.scala:235) at org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:268) at org.apache.spark.storage.ShuffleBlockFetcherIterator.init(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:76) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) ... 7 more ) My job conf was: --num-executors 30 --driver-memory 3G --executor-memory 5G --conf spark.yarn.executor.memoryOverhead=1000 --conf spark.yarn.driver.memoryOverhead=1000 --conf spark.akka.frameSize=1 --executor-cores 3 Any idea why it's failing on shuffle fetch? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/shuffle-FetchFailedException-in-spark-on-YARN-job-tp22557.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
spark job progress-style report on console ?
Hi, How do I get a spark job progress-style report on the console? I tried to set --conf spark.ui.showConsoleProgress=true but it thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-job-progress-style-report-on-console-tp22440.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
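A hedged note: besides that flag, the console progress bar is only drawn when console logging is above INFO, because INFO log lines would overwrite it, and it only appears for stages that run longer than about half a second. So the combination to try is:

# spark-defaults.conf
spark.ui.showConsoleProgress true

# conf/log4j.properties
log4j.rootCategory=WARN, console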
Spark SQL Avro Library for 1.2
How do I build the Spark SQL Avro library for Spark 1.2? I was following https://github.com/databricks/spark-avro and was able to build spark-avro_2.10-1.0.0.jar by simply running sbt/sbt package from the project root, but we are on Spark 1.2 and need a compatible spark-avro jar. Any idea how I do that? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Avro-Library-for-1-2-tp22421.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark 1.3 on CDH 5.3.1 YARN
Hi, We have a cluster running CDH 5.3.2 and Spark 1.2 (which is the current version in CDH 5.3.2), but we want to try Spark 1.3 without breaking the existing setup. Is it possible to have Spark 1.3 on the existing setup? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-on-CDH-5-3-1-YARN-tp22422.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: can't union two rdds
use zip -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-t-union-two-rdds-tp22320p22321.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
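A small illustration of that suggestion (assuming an existing SparkContext named sc); note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition:

rdd1 = sc.parallelize([1, 2, 3], 2)
rdd2 = sc.parallelize(["a", "b", "c"], 2)
pairs = rdd1.zip(rdd2)        # [(1, 'a'), (2, 'b'), (3, 'c')]
print(pairs.collect())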
Re: Spark History Server : jobs link doesn't open
In the log I found this: 2015-03-26 19:42:09,531 WARN org.eclipse.jetty.servlet.ServletHandler: Error for /history/application_1425934191900_87572 org.spark-project.guava.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: GC overhead limit exceeded at org.spark-project.guava.common.cache.LocalCache$Segment.get(LocalCache.java:2261) at org.spark-project.guava.common.cache.LocalCache.get(LocalCache.java:4000) thanks On Thu, Mar 26, 2015 at 7:27 PM, Roy rp...@njit.edu wrote: We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2 Jobs link on spark History server doesn't open and shows following message : HTTP ERROR: 500 Problem accessing /history/application_1425934191900_87572. Reason: Server Error -- *Powered by Jetty://*
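Given the GC overhead limit error, one hedged thing to try besides cleaning out old entries under /user/spark/applicationHistory is to give the History Server JVM more heap; its size is taken from SPARK_DAEMON_MEMORY, e.g. in spark-env.sh on the History Server host (the value below is only an example):

export SPARK_DAEMON_MEMORY=2g
# spark.history.retainedApplications can also be lowered so fewer applications are kept in memory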
Spark History Server : jobs link doesn't open
We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2. The Jobs link on the Spark History Server doesn't open and shows the following message: HTTP ERROR: 500 Problem accessing /history/application_1425934191900_87572. Reason: Server Error -- *Powered by Jetty://*
Re: FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
Yes, I do have other applications already running. Thanks for your explanation. On Wed, Mar 25, 2015 at 2:49 AM, Akhil Das ak...@sigmoidanalytics.com wrote: It means you are already having 4 applications running on 4040, 4041, 4042, 4043. And that's why it was able to run on 4044. You can do a netstat -pnat | grep 404 and see what all processes are running. Thanks Best Regards On Wed, Mar 25, 2015 at 1:13 AM, Roy rp...@njit.edu wrote: I get the following message each time I run a spark job: 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use full trace is here http://pastebin.com/xSvRN01f how do I fix this ? I am on CDH 5.3.1 thanks roy
FAILED SelectChannelConnector@0.0.0.0:4040 java.net.BindException: Address already in use
I get the following message each time I run a spark job: 15/03/24 15:35:56 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use full trace is here http://pastebin.com/xSvRN01f how do I fix this? I am on CDH 5.3.1 thanks roy
Spark error NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit
Hi, I am using CDH 5.3.2 packages installation through Cloudera Manager 5.3.2 I am trying to run one spark job with following command PYTHONPATH=~/code/utils/ spark-submit --master yarn --executor-memory 3G --num-executors 30 --driver-memory 2G --executor-cores 2 --name=analytics /home/abc/code/updb/spark/UPDB3analytics.py -date 2015-03-01 but I am getting following error 15/03/23 11:06:49 WARN TaskSetManager: Lost task 9.0 in stage 0.0 (TID 7, hdp003.dev.xyz.com): java.lang.NoClassDefFoundError: org/apache/hadoop/mapred/InputSplit at java.lang.Class.getDeclaredConstructors0(Native Method) at java.lang.Class.privateGetDeclaredConstructors(Class.java:2532) at java.lang.Class.getDeclaredConstructors(Class.java:1901) at java.io.ObjectStreamClass.computeDefaultSUID(ObjectStreamClass.java:1749) at java.io.ObjectStreamClass.access$100(ObjectStreamClass.java:72) at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:250) at java.io.ObjectStreamClass$1.run(ObjectStreamClass.java:248) at java.security.AccessController.doPrivileged(Native Method) at java.io.ObjectStreamClass.getSerialVersionUID(ObjectStreamClass.java:247) at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:613) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapred.InputSplit at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) ... 25 more here is the full trace https://gist.github.com/anonymous/3492f0ec63d7a23c47cf
What do you think about the level of resource manager and file system?
[two inline architecture-diagram images, not reproduced here] Hi guys~ Comparing these two architectures, why does BDAS put YARN and Mesos under HDFS? Is there any special consideration, or is it just an easy way to express the AMPLab stack? Best regards!