Re: AVRO vs Parquet
Thanks Michael for clarifying this. My response is inline.

Guru Medasani
gdm...@gmail.com

> On Mar 10, 2016, at 12:38 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> A few clarifications:
>
> 1) High memory and cpu usage. This is because Parquet files can't be
> streamed into as records arrive. I have seen a lot of OOMs in reasonably
> sized MR/Spark containers that write out Parquet. When doing dynamic
> partitioning, where many writers are open at once, we’ve seen customers
> having trouble making it work. This has made for some very confused ETL
> developers.
>
> In Spark 1.6.1 we avoid having more than 2 files open per task, so this
> should be less of a problem even for dynamic partitioning.

Thanks for fixing this. It looks like this is the JIRA going into Spark 1.6.1 that fixes the memory issues during dynamic partitioning. I copied it here so the rest of the folks on the email thread can take a look.

SPARK-12546 <https://issues.apache.org/jira/browse/SPARK-12546> Writing to partitioned parquet table can fail with OOM

> 2) Parquet lags well behind Avro in schema evolution semantics. Can only
> add columns at the end? Deleting columns at the end is not recommended if
> you plan to add any columns in the future. Reordering is not supported in
> the current release.
>
> This may be true for Impala, but Spark SQL does schema merging by name, so
> you can add / reorder columns with the constraint that you cannot reuse a
> name with an incompatible type.

As I mentioned in my previous email, it is still something the user needs to be aware of, since the user said the following in the initial question: "I also have to consider any file on HDFS may be accessed from other tools like Hive, Impala, HAWQ." Regarding reordering, I also mentioned in my previous email that it will be supported in Impala in the next release: in the next release, there might be support for optionally matching Parquet file columns by name instead of order, like Hive does. Under this scheme, you cannot rename columns (since the files will retain the old name and will no longer be matched), but you can reorder them. (This is regarding Impala.)
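The merge-by-name behavior described above can be sketched with a toy model (this is an illustration, not Spark SQL's implementation; the column and type names are made up): columns may be added or reordered freely, but a name cannot be reused with an incompatible type.

```python
# Toy sketch of merge-by-name schema evolution (illustration only, not
# Spark SQL code). Schemas are dicts of column name -> type name.

def merge_schemas(old, new):
    """Merge two schemas by column name; reject incompatible type reuse."""
    merged = dict(old)
    for name, typ in new.items():
        if name in merged and merged[name] != typ:
            raise ValueError(f"column {name!r} reused with incompatible type: "
                             f"{merged[name]} vs {typ}")
        merged[name] = typ
    return merged

# Reordering and adding columns is fine under name matching:
merged = merge_schemas({"id": "int", "name": "string"},
                       {"name": "string", "id": "int", "age": "int"})
assert merged == {"id": "int", "name": "string", "age": "int"}

# Reusing a name with a different type is rejected:
try:
    merge_schemas({"id": "int"}, {"id": "string"})
    raise AssertionError("should have failed")
except ValueError:
    pass
```

Under positional matching (current Impala), the same reordering would instead silently misread columns, which is why the caveat about other tools still matters.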
Re: AVRO vs Parquet
+1 Paul. Both have some pros and cons. Hope this helps.

Avro:

Pros:
1) Plays nice with other tools, 3rd party or otherwise, or when you specifically need some data type in Avro like binary, but gladly that list is shrinking all the time (yay nested types in Impala).
2) Good for event data that changes over time. Can be used with the Kafka Schema Registry.
3) Schema evolution: supports adding, removing, renaming, and reordering columns.

Cons:
1) Cannot write data from Impala.
2) Timestamp data type is not supported (Hive & Impala); we can still store timestamps as longs and convert them in the end-user tables.

Parquet:

Pros:
1) High performance during both reads and writes, for both wide and narrow tables (though mileage may vary based on the dataset's cardinality).
2) Great for analytics.

Cons:
1) High memory and CPU usage. This is because Parquet files can't be streamed into as records arrive. I have seen a lot of OOMs in reasonably sized MR/Spark containers that write out Parquet. When doing dynamic partitioning, where many writers are open at once, we've seen customers having trouble making it work. This has made for some very confused ETL developers.
2) Parquet lags well behind Avro in schema evolution semantics. Can only add columns at the end? Deleting columns at the end is not recommended if you plan to add any columns in the future. Reordering is not supported in the current release.

Here is schema evolution and best practices in Parquet and Avro in more detail. In Parquet you can add columns to the end of the table definition and they'll be populated with NULL if missing in the file (you can technically delete columns from the end too, but then it will try to read the old deleted column if you add a new column to the end afterwards). Here is an example of how this works.
t0: we have a Parquet table with the columns c1, c2, c3, c4.
t1: we delete the columns c3 and c4; the table now has the columns c1, c2.
t2: we add a new column c5; the table now has the columns c1, c2, c5.

So now when we query the table for columns c1, c2 and c5 against the old files, it will try to read the old column c3 in place of c5. Impala currently matches the file schema to the table schema by column order, so at t2, if you select c5, Impala will try to read c3 from the old files, thinking it's c5 since it's the third column in the file. This will work if the types of c3 and c5 are the same (but probably won't be the right results) or fail if the types don't match. It will ignore the old c4. Files with the new schema (c1, c2, c5) will be read correctly.

You can also rename columns, since we only look at the column ordering and not their names. So if you have columns c1 and c2 and reorder them as c2 and c1, Impala will read c1 when you query c2 and vice versa. In the next release, there might be support for optionally matching Parquet file columns by name instead of order, like Hive does. Under this scheme, you cannot rename columns (since the files will retain the old name and will no longer be matched), but you can reorder them. (This is regarding Impala.)

In Avro we follow the rules described here: http://avro.apache.org/docs/1.8.0/spec.html#Schema+Resolution

This includes rearranging columns, renaming columns via aliasing, and certain data type differences (e.g. you can "promote" an int to a bigint in the table schema, but not vice versa). AFAIK you should always modify the Avro schema directly in the table metadata; I think weird stuff happens if you try to use the usual ALTER TABLE commands. The preferred method is to modify the JSON schema file on HDFS and then use ALTER TABLE to add/remove/reorder/rename the columns.
I think in some cases using ALTER TABLE without changing the schema could work by chance, but in general it seems dangerous to change the column metadata in the metastore and not have that be reflected in the Avro schema as well.

Guru Medasani
gdm...@gmail.com

> On Mar 4, 2016, at 7:36 AM, Paul Leclercq <paul.lecle...@tabmo.io> wrote:
>
> Nice articles about Parquet with Avro:
> https://dzone.com/articles/understanding-how-parquet
> http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
>
> Nice video from the good folks of Cloudera on the differences between
> "Avrow" and Parquet:
> https://www.youtube.com/watch?v=AY1dEfyFeHc
>
> 2016-03-04 7:12 GMT+01:00 Koert Kuipers <ko...@tresata.com>:
> well can you use orc without bringing in the kitchen sink of dependencies
> also known as hive?
>
> On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim <ilike...@gmail.com> wrote:
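The t0/t1/t2 walkthrough above can be sketched as a toy model (illustration only, not Impala code): old files were written with columns c1..c4, the table schema is now (c1, c2, c5), and the reader matches file columns to table columns either purely by position (Impala's current behavior) or by name (Hive-style).

```python
# Toy sketch of positional vs. name-based Parquet column matching
# (illustration only; the column names and values are made up).

OLD_FILE_SCHEMA = ["c1", "c2", "c3", "c4"]          # written at t0
OLD_FILE_ROW = {"c1": 1, "c2": 2, "c3": 3, "c4": 4}
TABLE_SCHEMA = ["c1", "c2", "c5"]                   # after dropping c3, c4 and adding c5

def read_positional(column):
    """Resolve `column` by its index in the table schema, then read that
    position from the old file, ignoring names (Impala's current behavior)."""
    idx = TABLE_SCHEMA.index(column)
    return OLD_FILE_ROW[OLD_FILE_SCHEMA[idx]]

def read_by_name(column):
    """Resolve `column` by name in the file schema (Hive-style matching)."""
    return OLD_FILE_ROW.get(column)  # None plays the role of NULL

assert read_positional("c5") == 3   # silently returns the old c3's value
assert read_by_name("c5") is None   # name matching yields NULL instead
```

This is exactly why, under positional matching, selecting c5 against old files "works" with wrong results when c3 and c5 share a type, and fails when they do not.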
Re: Building a REST Service with Spark back-end
Hi Yanlin, This is a fairly new effort and is not officially released/supported by Cloudera yet. I believe those numbers will be out once it is released. Guru Medasani gdm...@gmail.com > On Mar 2, 2016, at 10:40 AM, yanlin wang <yanl...@me.com> wrote: > > Did any one use Livy in real world high concurrency web app? I think it uses > spark submit command line to create job... How about job server or notebook > comparing with Livy? > > Thx, > Yanlin > > Sent from my iPhone > > On Mar 2, 2016, at 6:24 AM, Guru Medasani <gdm...@gmail.com > <mailto:gdm...@gmail.com>> wrote: > >> Hi Don, >> >> Here is another REST interface for interacting with Spark from anywhere. >> >> https://github.com/cloudera/livy <https://github.com/cloudera/livy> >> >> Here is an example to estimate PI using Spark from Python using requests >> library. >> >> >>> data = { >> ... 'code': textwrap.dedent("""\ >> ... val NUM_SAMPLES = 10; >> ... val count = sc.parallelize(1 to NUM_SAMPLES).map { i => >> ...val x = Math.random(); >> ...val y = Math.random(); >> ...if (x*x + y*y < 1) 1 else 0 >> ... }.reduce(_ + _); >> ... println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES) >> ... """) >> ... } >> >>> r = requests.post(statements_url, data=json.dumps(data), headers=headers) >> >>> pprint.pprint(r.json()) >> {u'id': 1, >> u'output': {u'data': {u'text/plain': u'Pi is roughly 3.14004\nNUM_SAMPLES: >> Int = 10\ncount: Int = 78501'}, >> u'execution_count': 1, >> u'status': u'ok'}, >> u'state': u'available'} >> >> >> Guru Medasani >> gdm...@gmail.com <mailto:gdm...@gmail.com> >> >> >> >>> On Mar 2, 2016, at 7:47 AM, Todd Nist <tsind...@gmail.com >>> <mailto:tsind...@gmail.com>> wrote: >>> >>> Have you looked at Apache Toree, http://toree.apache.org/ >>> <http://toree.apache.org/>. This was formerly the Spark-Kernel from IBM >>> but contributed to apache. 
>>> >>> https://github.com/apache/incubator-toree >>> <https://github.com/apache/incubator-toree> >>> >>> You can find a good overview on the spark-kernel here: >>> http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/ >>> >>> <http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/> >>> >>> Not sure if that is of value to you or not. >>> >>> HTH. >>> >>> -Todd >>> >>> On Tue, Mar 1, 2016 at 7:30 PM, Don Drake <dondr...@gmail.com >>> <mailto:dondr...@gmail.com>> wrote: >>> I'm interested in building a REST service that utilizes a Spark SQL Context >>> to return records from a DataFrame (or IndexedRDD?) and even add/update >>> records. >>> >>> This will be a simple REST API, with only a few end-points. I found this >>> example: >>> >>> https://github.com/alexmasselot/spark-play-activator >>> <https://github.com/alexmasselot/spark-play-activator> >>> >>> which looks close to what I am interested in doing. >>> >>> Are there any other ideas or options if I want to run this in a YARN >>> cluster? >>> >>> Thanks. >>> >>> -Don >>> >>> -- >>> Donald Drake >>> Drake Consulting >>> http://www.drakeconsulting.com/ <http://www.drakeconsulting.com/> >>> https://twitter.com/dondrake <http://www.maillaunder.com/> >>> 800-733-2143 >>
Re: Building a REST Service with Spark back-end
Hi Don,

Here is another REST interface for interacting with Spark from anywhere.

https://github.com/cloudera/livy

Here is an example that estimates Pi using Spark, driven from Python with the requests library.

>>> data = {
...   'code': textwrap.dedent("""\
...     val NUM_SAMPLES = 10;
...     val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
...       val x = Math.random();
...       val y = Math.random();
...       if (x*x + y*y < 1) 1 else 0
...     }.reduce(_ + _);
...     println(\"Pi is roughly \" + 4.0 * count / NUM_SAMPLES)
...     """)
... }
>>> r = requests.post(statements_url, data=json.dumps(data), headers=headers)
>>> pprint.pprint(r.json())
{u'id': 1,
 u'output': {u'data': {u'text/plain': u'Pi is roughly 3.14004\nNUM_SAMPLES: Int = 10\ncount: Int = 78501'},
             u'execution_count': 1,
             u'status': u'ok'},
 u'state': u'available'}

Guru Medasani
gdm...@gmail.com

> On Mar 2, 2016, at 7:47 AM, Todd Nist <tsind...@gmail.com> wrote:
>
> Have you looked at Apache Toree, http://toree.apache.org/. This was
> formerly the Spark-Kernel from IBM but contributed to Apache.
>
> https://github.com/apache/incubator-toree
>
> You can find a good overview on the spark-kernel here:
> http://www.spark.tc/how-to-enable-interactive-applications-against-apache-spark/
>
> Not sure if that is of value to you or not.
>
> HTH.
>
> -Todd
>
> On Tue, Mar 1, 2016 at 7:30 PM, Don Drake <dondr...@gmail.com> wrote:
> I'm interested in building a REST service that utilizes a Spark SQL Context
> to return records from a DataFrame (or IndexedRDD?) and even add/update
> records.
>
> This will be a simple REST API, with only a few end-points. I found this
> example:
>
> https://github.com/alexmasselot/spark-play-activator
>
> which looks close to what I am interested in doing.
>
> Are there any other ideas or options if I want to run this in a YARN cluster?
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143
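The payload-building half of the Livy example above is plain Python and can be run on its own (no HTTP call here; `statements_url` and `headers` are defined elsewhere in the original session, and the payload shape is taken verbatim from the email):

```python
import json
import textwrap

# Build the body for Livy's statements endpoint, as in the example above.
# textwrap.dedent strips the common leading indentation from the Scala code.
code = textwrap.dedent("""\
    val NUM_SAMPLES = 10;
    val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
      val x = Math.random();
      val y = Math.random();
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _);
    println("Pi is roughly " + 4.0 * count / NUM_SAMPLES)
    """)
payload = json.dumps({"code": code})

# The POST itself would then be:
#   requests.post(statements_url, data=payload, headers=headers)
```

The statement is Scala executed inside Livy's remote Spark session; the JSON wrapper is just transport.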
Re: Error in load hbase on spark
Hi Roy,

Here is a cloudera-labs project, SparkOnHBase, that makes it really simple to read HBase data into Spark.

https://github.com/cloudera-labs/SparkOnHBase

Link to a blog that explains how to use the package:

http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/

It has also been committed to the HBase project now.

http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/

HBase JIRA link: https://issues.apache.org/jira/browse/HBASE-13992

Guru Medasani
gdm...@gmail.com

> On Oct 8, 2015, at 9:29 PM, Roy Wang <roywang1...@163.com> wrote:
>
> I want to load an hbase table into spark.
> JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
>     sc.newAPIHadoopRDD(conf, TableInputFormat.class,
>         ImmutableBytesWritable.class, Result.class);
>
> *When I call hBaseRDD.count(), I get this error.*
>
> Caused by: java.lang.IllegalStateException: The input format instance has
> not been properly initialized. Ensure you call initializeTable either in
> your constructor or initialize method
> at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:389)
> at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:158)
> ... 11 more
>
> *But when the job starts, I can see these logs:*
> 2015-10-09 09:17:00 [main] WARN TableInputFormatBase:447 - initializeTable
> called multiple times. Overwriting connection and table reference;
> TableInputFormatBase will not close these old references when done.
>
> Does anyone know how this happens?
>
> Thanks!
> > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Error-in-load-hbase-on-spark-tp24986.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org >
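One common cause of "The input format instance has not been properly initialized" is a Hadoop Configuration that never names the input table, so TableInputFormat cannot set itself up on the executors. This is an assumption about Roy's setup, not a confirmed diagnosis; a sketch of the minimum conf entries (hostnames and the table name "my_table" are placeholders), written here as a plain dict for illustration:

```python
# Illustrative HBase input configuration (placeholder values). In the Java
# code from the email this corresponds to calling
#   conf.set(TableInputFormat.INPUT_TABLE, "my_table");
# before sc.newAPIHadoopRDD(conf, TableInputFormat.class, ...).
conf = {
    "hbase.zookeeper.quorum": "zk-host",       # where HBase's ZooKeeper runs
    "hbase.mapreduce.inputtable": "my_table",  # the TableInputFormat.INPUT_TABLE key
}
```

The SparkOnHBase / hbase-spark module linked above hides this wiring behind its own API, which is a large part of its appeal.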
Re: Change protobuf version or any other third party library version in Spark application
Hi Lan,

Reading the pull request below, it looks like you should be able to apply the config to both drivers and executors. I would give it a try with the spark-shell in YARN client mode.

https://github.com/apache/spark/pull/3233

"Yarn's config option spark.yarn.user.classpath.first does not work the same way as spark.files.userClassPathFirst; Yarn's version is a lot more dangerous, in that it modifies the system classpath, instead of restricting the changes to the user's class loader. So this change implements the behavior of the latter for Yarn, and deprecates the more dangerous choice.

To be able to achieve feature parity, I also implemented the option for drivers (the existing option only applies to executors). So now there are two options, each controlling whether to apply userClassPathFirst to the driver or executors. The old option was deprecated, and aliased to the new one (spark.executor.userClassPathFirst).

The existing 'child-first' class loader also had to be fixed. It didn't handle resources, and it was also doing some things that ended up causing JVM errors depending on how things were being called."

Guru Medasani
gdm...@gmail.com

> On Sep 15, 2015, at 9:33 AM, Lan Jiang <ljia...@gmail.com> wrote:
>
> Steve,
>
> Thanks for the input. You are absolutely right. When I use protobuf 2.6.1, I
> also ran into method-not-defined errors. You suggest using the Maven shading
> strategy, but I have already built the uber jar to package all my custom
> classes and their dependencies, including protobuf 3. The problem is how to
> configure spark-shell to use my uber jar first.
>
> java8964 -- appreciate the link and I will try the configuration. Looks
> promising. However, the "user classpath first" attribute does not apply to
> spark-shell, am I correct?
> > Lan > > On Tue, Sep 15, 2015 at 8:24 AM, java8964 <java8...@hotmail.com > <mailto:java8...@hotmail.com>> wrote: > It is a bad idea to use the major version change of protobuf, as it most > likely won't work. > > But you really want to give it a try, set the "user classpath first", so the > protobuf 3 coming with your jar will be used. > > The setting depends on your deployment mode, check this for the parameter: > > https://issues.apache.org/jira/browse/SPARK-2996 > <https://issues.apache.org/jira/browse/SPARK-2996> > > Yong > > Subject: Re: Change protobuf version or any other third party library version > in Spark application > From: ste...@hortonworks.com <mailto:ste...@hortonworks.com> > To: ljia...@gmail.com <mailto:ljia...@gmail.com> > CC: user@spark.apache.org <mailto:user@spark.apache.org> > Date: Tue, 15 Sep 2015 09:19:28 + > > > > > On 15 Sep 2015, at 05:47, Lan Jiang <ljia...@gmail.com > <mailto:ljia...@gmail.com>> wrote: > > Hi, there, > > I am using Spark 1.4.1. The protobuf 2.5 is included by Spark 1.4.1 by > default. However, I would like to use Protobuf 3 in my spark application so > that I can use some new features such as Map support. Is there anyway to do > that? > > Right now if I build a uber.jar with dependencies including protobuf 3 > classes and pass to spark-shell through --jars option, during the execution, > I got the error java.lang.NoSuchFieldError: unknownFields. > > > protobuf is an absolute nightmare version-wise, as protoc generates > incompatible java classes even across point versions. Hadoop 2.2+ is and will > always be protobuf 2.5 only; that applies transitively to downstream projects > (the great protobuf upgrade of 2013 was actually pushed by the HBase team, > and required a co-ordinated change across multiple projects) > > > Is there anyway to use a different version of Protobuf other than the default > one included in the Spark distribution? 
I guess I can generalize and extend > the question to any third party libraries. How to deal with version conflict > for any third party libraries included in the Spark distribution? > > maven shading is the strategy. Generally it is less needed, though the > troublesome binaries are, across the entire apache big data stack: > > google protobuf > google guava > kryo > jackson > > you can generally bump up the other versions, at least by point releases. >
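Concretely, the two options described in the pull request above can be passed straight on the command line; a sketch (the jar name is a placeholder, and whether this fully resolves the protobuf conflict still depends on the shading caveats Steve describes):

```shell
# Prefer the user's jars over Spark's own classpath on both driver and
# executors (options from SPARK-2996 / apache/spark#3233).
spark-shell \
  --jars uber-with-protobuf3.jar \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true
```

In YARN client mode the driver-side option covers the spark-shell JVM itself, which addresses Lan's question about whether "user classpath first" applies to the shell.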
Re: Problem while loading saved data
Hi Amila,

The error says that no Parquet data file or summary file was found under 'people.parquet2' (note the trailing '2'): your save used "people.parquet", but the traceback shows the read was actually called with 'people.parquet2'. Can you manually check which of those paths exists?

> Py4JJavaError: An error occurred while calling o53840.parquet.
> : java.lang.AssertionError: assertion failed: No schema defined, and no
> Parquet data file or summary file found under
> file:/home/ubuntu/ipython/people.parquet2.

Guru Medasani
gdm...@gmail.com

> On Sep 2, 2015, at 8:25 PM, Amila De Silva <jaa...@gmail.com> wrote:
>
> Hi All,
>
> I have a two node spark cluster, to which I'm connecting using an IPython
> notebook. To see how data saving/loading works, I simply created a dataframe
> from people.json using the code below:
>
> df = sqlContext.read.json("examples/src/main/resources/people.json")
>
> Then called the following to save the dataframe as a parquet:
> df.write.save("people.parquet")
>
> Tried loading the saved dataframe using:
> df2 = sqlContext.read.parquet('people.parquet');
>
> But this simply fails giving the following exception
>
> ---
> Py4JJavaError                             Traceback (most recent call last)
> in ()
> > 1 df2 = sqlContext.read.parquet('people.parquet2');
>
> /srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path)
>     154         [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
>     155         """
> --> 156         return self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path)))
>     157
>     158     @since(1.4)
>
> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538             self.target_id, self.name)
>     539
>     540         for temp_arg in temp_args:
>
> /srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > > Py4JJavaError: An error occurred while calling o53840.parquet. > : java.lang.AssertionError: assertion failed: No schema defined, and no > Parquet data file or summary file found under > file:/home/ubuntu/ipython/people.parquet2. > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429) > at > org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369) > at > org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369) > at scala.Option.orElse(Option.scala:257) > at > org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369) > at org.apache.spark.sql.parquet.ParquetRelation2.org > <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126) > at org.apache.spark.sql.parquet.ParquetRelation2.org > <http://org.apache.spark.sql.parquet.parquetrelation2.org/>$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165) > at > org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506) > at > org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505) > at > org.apache.spark.sql.sources.LogicalRelation.(LogicalRelation.scala:30) > at > org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438) > at > 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorI
Re: Spark + Jupyter (IPython Notebook)
For Python it is really great. There is some work in progress on bringing Scala support to Jupyter as well.

https://github.com/hohonuuli/sparknotebook
https://github.com/alexarchambault/jupyter-scala

Guru Medasani
gdm...@gmail.com

On Aug 18, 2015, at 12:29 PM, Jerry Lam <chiling...@gmail.com> wrote:

Hi Guru,

Thanks! Great to hear that someone tried it in production. How do you like it so far?

Best Regards,

Jerry

On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani <gdm...@gmail.com> wrote:

Hi Jerry,

Yes. I’ve seen customers using this in production for data science work. I’m currently using this for one of my projects on a cluster as well. Also, here is a blog that describes how to configure this.

http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/

Guru Medasani
gdm...@gmail.com

On Aug 18, 2015, at 8:35 AM, Jerry Lam <chiling...@gmail.com> wrote:

Hi spark users and developers,

Did anyone have IPython Notebook (Jupyter) deployed in production that uses Spark as the computational engine? I know Databricks Cloud provides similar features with deeper integration with Spark. However, Databricks Cloud has to be hosted by Databricks, so we cannot do this. Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython has already offered years ago. It would be great if someone can educate me on the reason behind this.

Best Regards,

Jerry
Re: Spark + Jupyter (IPython Notebook)
Hi Jerry,

Yes. I’ve seen customers using this in production for data science work. I’m currently using this for one of my projects on a cluster as well. Also, here is a blog that describes how to configure this.

http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/

Guru Medasani
gdm...@gmail.com

On Aug 18, 2015, at 8:35 AM, Jerry Lam <chiling...@gmail.com> wrote:

Hi spark users and developers,

Did anyone have IPython Notebook (Jupyter) deployed in production that uses Spark as the computational engine? I know Databricks Cloud provides similar features with deeper integration with Spark. However, Databricks Cloud has to be hosted by Databricks, so we cannot do this. Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython has already offered years ago. It would be great if someone can educate me on the reason behind this.

Best Regards,

Jerry
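For reference, the configuration the blog above describes boils down to pointing PySpark's driver at IPython; a sketch (exact variable names vary by Spark version, so treat these as an assumption and check the blog and your Spark docs):

```shell
# Launch PySpark with the IPython/Jupyter notebook as the driver frontend.
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark
```

Older Spark releases used IPYTHON=1 and IPYTHON_OPTS="notebook" for the same purpose.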
Re: Topology.py -- Cannot run on Spark Gateway on Cloudera 5.4.4.
Hi Upen,

Did you deploy the client configs after assigning the gateway roles? You should be able to do this from Cloudera Manager. Can you try this and let us know what you see when you run spark-shell?

Guru Medasani
gdm...@gmail.com

On Aug 3, 2015, at 9:10 PM, Upen N <ukn...@gmail.com> wrote:

Hi,

I recently installed Cloudera CDH 5.4.4. Spark comes shipped with this version. I created Spark gateways. But I get the following error when running spark-shell from the gateway. Does anyone have any similar experience? If so, please share the solution. Google shows to copy the conf files from data nodes to gateway nodes, but I highly doubt that is the right fix.

Thanks,
Upender

etc/hadoop/conf.cloudera.yarn/topology.py
java.io.IOException: Cannot run program /etc/hadoop/conf.cloudera.yarn/topology.py
Re: Spark-Submit error
Hi Satish,

Can you add more error or log info to the email?

Guru Medasani
gdm...@gmail.com

On Jul 31, 2015, at 1:06 AM, satish chandra j <jsatishchan...@gmail.com> wrote:

HI,

I have submitted a Spark job with the options jars, class, and master as local, but I am getting an error as below:

dse spark-submit spark error exception in thread main java.io.ioexception: Invalid Request Exception(Why you have not logged in)

Note: submitting on a DataStax Spark node.

Please let me know if anybody has a solution for this issue.

Regards,
Satish Chandra
Re: Spark-Submit error
Thanks Satish. I only see INFO messages and don’t see any error messages in the output you pasted. Can you paste the log with the error messages?

Guru Medasani
gdm...@gmail.com

On Aug 3, 2015, at 11:12 PM, satish chandra j <jsatishchan...@gmail.com> wrote:

Hi Guru,

I am executing this on a DataStax Enterprise Spark node and the ~/.dserc file exists, which contains the Cassandra credentials, but I am still getting the error. Below is the given command:

dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars ///home/missingmerch/postgresql-9.4-1201.jdbc41.jar ///home/missingmerch/etl-0.0.1-SNAPSHOT.jar

Please find the error log details below:

INFO 2015-07-31 05:15:17 org.apache.spark.executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
INFO 2015-07-31 05:15:18 org.apache.spark.SecurityManager: Changing view acls to: cassandra,missingmerch
INFO 2015-07-31 05:15:18 org.apache.spark.SecurityManager: Changing modify acls to: cassandra,missingmerch
INFO 2015-07-31 05:15:18 org.apache.spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cassandra, missingmerch); users with modify permissions: Set(cassandra, missingmerch)
INFO 2015-07-31 05:15:22 akka.event.slf4j.Slf4jLogger: Slf4jLogger started
INFO 2015-07-31 05:15:22 Remoting: Starting remoting
INFO 2015-07-31 05:15:22 Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.246.43.14:48952]
INFO 2015-07-31 05:15:22 org.apache.spark.util.Utils: Successfully started service 'driverPropsFetcher' on port 48952.
INFO 2015-07-31 05:15:24 akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
INFO 2015-07-31 05:15:24 akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
INFO 2015-07-31 05:15:24 org.apache.spark.SecurityManager: Changing view acls to: cassandra,missingmerch
INFO 2015-07-31 05:15:24 org.apache.spark.SecurityManager: Changing modify acls to: cassandra,missingmerch
INFO 2015-07-31 05:15:24 org.apache.spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cassandra, missingmerch); users with modify permissions: Set(cassandra, missingmerch)
INFO 2015-07-31 05:15:24 akka.event.slf4j.Slf4jLogger: Slf4jLogger started
INFO 2015-07-31 05:15:24 akka.remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
INFO 2015-07-31 05:15:24 Remoting: Starting remoting
INFO 2015-07-31 05:15:24 Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@10.246.43.14:56358]
INFO 2015-07-31 05:15:24 org.apache.spark.util.Utils: Successfully started service 'sparkExecutor' on port 56358.
INFO 2015-07-31 05:15:24 org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://sparkdri...@tstl400029.wal-mart.com:60525/user/CoarseGrainedScheduler
INFO 2015-07-31 05:15:24 org.apache.spark.deploy.worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@10.246.43.14:51552/user/Worker
INFO 2015-07-31 05:15:24 org.apache.spark.deploy.worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@10.246.43.14:51552/user/Worker
INFO 2015-07-31 05:15:24 org.apache.spark.executor.CoarseGrainedExecutorBackend: Successfully registered with driver
INFO 2015-07-31 05:15:24 org.apache.spark.executor.Executor: Starting executor ID 0 on host 10.246.43.14
INFO 2015-07-31 05:15:24 org.apache.spark.SecurityManager: Changing view acls to: cassandra,missingmerch
INFO 2015-07-31 05:15:24 org.apache.spark.SecurityManager: Changing modify acls to: cassandra,missingmerch
INFO 2015-07-31 05:15:24 org.apache.spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cassandra, missingmerch); users with modify permissions: Set(cassandra, missingmerch)
INFO 2015-07-31 05:15:24 org.apache.spark.util.AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkdri...@tstl400029.wal-mart.com:60525/user/MapOutputTracker
INFO
Re: How to verify that the worker is connected to master in CDH5.4
Hi Ashish, Are you running Spark on YARN on the cluster with an instance of the Spark History Server? Also, if you are using Cloudera Manager with Spark on YARN, the Spark on YARN service has a link to the History Server web UI. Can you paste the command you ran and the output you are seeing into the thread? Guru Medasani gdm...@gmail.com On Jul 7, 2015, at 10:42 PM, Ashish Dutt ashish.du...@gmail.com wrote: Hi, I have CDH 5.4 installed on a Linux server. It has one cluster in which Spark is deployed with a history server. I am trying to connect my laptop to the Spark history server. When I run spark-shell --master ip:port I get the following output. How can I verify that the worker is connected to the master? Thanks, Ashish - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to verify that the worker is connected to master in CDH5.4
Hi Ashish, If you are not using Spark on YARN and are instead using Spark Standalone, you don't need the Spark History Server. More on the web interfaces is provided in the following link. Since you are using standalone mode, you should be able to access the web UIs for the master and workers at the ports Ayan provided in an earlier email. Master: http://masterip:8080 Worker: http://workerIp:8081 https://spark.apache.org/docs/latest/monitoring.html If you are using Spark on YARN, the Spark History Server is configured to run on port 18080 by default on the server where it is running. Guru Medasani gdm...@gmail.com On Jul 8, 2015, at 12:01 AM, Ashish Dutt ashish.du...@gmail.com wrote: Hello Guru, Thank you for your quick response. This is what I get when I try executing spark-shell --master ip:port:
C:\spark-1.4.0\bin\spark-shell --master IP:18088
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/07/08 11:28:35 INFO SecurityManager: Changing view acls to: Ashish Dutt
15/07/08 11:28:35 INFO SecurityManager: Changing modify acls to: Ashish Dutt
15/07/08 11:28:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Ashish Dutt); users with modify permissions: Set(Ashish Dutt)
15/07/08 11:28:35 INFO HttpServer: Starting HTTP Server
15/07/08 11:28:35 INFO Utils: Successfully started service 'HTTP class server' on port 52767.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79)
Type in expressions to have them evaluated.
Type :help for more information.
15/07/08 11:28:39 INFO SparkContext: Running Spark version 1.4.0
15/07/08 11:28:39 INFO SecurityManager: Changing view acls to: Ashish Dutt
15/07/08 11:28:39 INFO SecurityManager: Changing modify acls to: Ashish Dutt
15/07/08 11:28:39 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Ashish Dutt); users with modify permissions: Set(Ashish Dutt)
15/07/08 11:28:40 INFO Slf4jLogger: Slf4jLogger started
15/07/08 11:28:40 INFO Remoting: Starting remoting
15/07/08 11:28:40 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.228.208.74:52780]
15/07/08 11:28:40 INFO Utils: Successfully started service 'sparkDriver' on port 52780.
15/07/08 11:28:40 INFO SparkEnv: Registering MapOutputTracker
15/07/08 11:28:40 INFO SparkEnv: Registering BlockManagerMaster
15/07/08 11:28:40 INFO DiskBlockManager: Created local directory at C:\Users\Ashish Dutt\AppData\Local\Temp\spark-80c4f1fe-37de-4aef-9063-cae29c488382\blockmgr-a967422b-05e8-4fc1-b60b-facc7dbd4414
15/07/08 11:28:40 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/07/08 11:28:40 INFO HttpFileServer: HTTP File server directory is C:\Users\Ashish Dutt\AppData\Local\Temp\spark-80c4f1fe-37de-4aef-9063-cae29c488382\httpd-928f4485-ea08-4749-a478-59708db0fefa
15/07/08 11:28:40 INFO HttpServer: Starting HTTP Server
15/07/08 11:28:40 INFO Utils: Successfully started service 'HTTP file server' on port 52781.
15/07/08 11:28:40 INFO SparkEnv: Registering OutputCommitCoordinator
15/07/08 11:28:40 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/07/08 11:28:40 INFO SparkUI: Started SparkUI at http://10.228.208.74:4040
15/07/08 11:28:40 INFO Executor: Starting executor ID driver on host localhost
15/07/08 11:28:41 INFO Executor: Using REPL class URI: http://10.228.208.74:52767
15/07/08 11:28:41 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52800.
15/07/08 11:28:41 INFO NettyBlockTransferService: Server created on 52800
15/07/08 11:28:41 INFO BlockManagerMaster: Trying to register BlockManager
15/07/08 11:28:41 INFO BlockManagerMasterEndpoint: Registering block manager localhost:52800 with 265.4 MB RAM, BlockManagerId(driver, localhost, 52800)
15/07/08 11:28:41 INFO BlockManagerMaster: Registered BlockManager
15/07/08 11:28:41 INFO SparkILoop: Created spark context.. Spark context available as sc.
15/07/08 11:28:41 INFO HiveContext: Initializing execution hive, version 0.13.1
15/07/08 11:28:42 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/08 11:28:42
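Besides eyeballing the web UI, the standalone master also serves the same status as machine-readable JSON, which makes the "is my worker connected?" check scriptable. A minimal Python sketch; the /json endpoint and its field names are assumptions based on the standalone master web UI, and the payload below is a hypothetical, abridged example rather than real output:

```python
import json

# Hypothetical, abridged payload as served by the standalone master's
# JSON endpoint (http://<masterip>:8080/json); field names are assumptions.
payload = json.loads("""
{
  "url": "spark://masterip:7077",
  "workers": [
    {"id": "worker-20150708-example", "host": "10.246.43.14",
     "port": 51552, "state": "ALIVE"}
  ],
  "status": "ALIVE"
}
""")

# A worker is connected to the master exactly when it appears here as ALIVE.
alive = [w for w in payload["workers"] if w["state"] == "ALIVE"]
print("alive workers:", len(alive))
```

In practice you would fetch the payload with urllib from the master's web UI port (8080 by default) and apply the same check.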
Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?
Hi Elkhan, There are a couple of ways to do this. 1) Spark Jobserver is a popular web server used to submit Spark jobs: https://github.com/spark-jobserver/spark-jobserver 2) The spark-submit script sets up the classpath for the job. Bypassing spark-submit means you have to manage some of this work in your program itself. Here is a link with some discussion of how to handle this scenario: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/What-dependencies-to-submit-Spark-jobs-programmatically-not-via/td-p/24721 Guru Medasani gdm...@gmail.com On Jun 17, 2015, at 6:01 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: This is not an independent, programmatic way of running a Spark job on a YARN cluster. That example demonstrates running in yarn-client mode, and it also depends on Jetty. Users writing Spark programs do not want to depend on that. I found the SparkLauncher class introduced in Spark 1.4 (https://github.com/apache/spark/tree/master/launcher), which allows running Spark jobs programmatically. SparkLauncher exists in the Java and Scala APIs, but I could not find it in the Python API. I have not tried it yet, but it seems promising.
Example:

import org.apache.spark.launcher.SparkLauncher;

public class MyLauncher {
  public static void main(String[] args) throws Exception {
    Process spark = new SparkLauncher()
        .setAppResource("/my/app.jar")
        .setMainClass("my.spark.app.Main")
        .setMaster("local")
        .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
        .launch();
    spark.waitFor();
  }
}

On Wed, Jun 17, 2015 at 5:51 PM, Corey Nolet cjno...@gmail.com wrote: An example of being able to do this is provided in the Spark Jetty Server project [1] [1] https://github.com/calrissian/spark-jetty-server On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi all, Is there any way of running a Spark job programmatically on a YARN cluster without using the spark-submit script? I cannot include Spark jars in my Java application (due to dependency conflicts and other reasons), so I'll be shipping the Spark assembly uber jar (spark-assembly-1.3.1-hadoop2.3.0.jar) to the YARN cluster and then executing the job (Python or Java) on the YARN cluster. So is there any way of running a Spark job implemented in a Python file/Java class without calling it through the spark-submit script? Thanks. -- Best regards, Elkhan Dadashov
Re: SparkR 1.4.0: read.df() function fails
Hi Esten, It looks like your sqlContext is connected to a Hadoop/Spark cluster, but the file path you specified is local: mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", ...). The error below shows that the input path you specified does not exist on the cluster. Pointing to the right HDFS path should help here. Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json Guru Medasani gdm...@gmail.com On Jun 16, 2015, at 10:39 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: The error you are running into is that the input file does not exist -- you can see it from the following line: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json Thanks Shivaram On Tue, Jun 16, 2015 at 1:55 AM, esten erik.stens...@dnvgl.com wrote: Hi, In the SparkR shell, I invoke: mydf <- read.df(sqlContext, "/home/esten/ami/usaf.json", source="json", header="false") I have tried various filetypes (csv, txt); all fail. RESPONSE: ERROR RBackendHandler: load on 1 failed BELOW THE WHOLE RESPONSE:
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(177600) called with curMem=0, maxMem=278302556
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 173.4 KB, free 265.2 MB)
15/06/16 08:09:13 INFO MemoryStore: ensureFreeSpace(16545) called with curMem=177600, maxMem=278302556
15/06/16 08:09:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 16.2 KB, free 265.2 MB)
15/06/16 08:09:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:37142 (size: 16.2 KB, free: 265.4 MB)
15/06/16 08:09:13 INFO SparkContext: Created broadcast 0 from load at NativeMethodAccessorImpl.java:-2
15/06/16 08:09:16 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/06/16 08:09:17 ERROR RBackendHandler: load on 1 failed
java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:127)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
    at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228
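The root cause above is path qualification: an input path with no scheme is resolved against fs.defaultFS (HDFS on this cluster), not the local filesystem. A rough Python sketch of that resolution rule; the qualify helper is illustrative, not Hadoop's actual code:

```python
from urllib.parse import urlparse

# fs.defaultFS as implied by the error message in this thread
DEFAULT_FS = "hdfs://smalldata13.hdp:8020"

def qualify(path: str) -> str:
    """Roughly mimic how Hadoop qualifies an input path: a path with no
    scheme is resolved against fs.defaultFS, not the local filesystem."""
    if urlparse(path).scheme:
        return path
    return DEFAULT_FS + path

print(qualify("/home/esten/ami/usaf.json"))
# -> hdfs://smalldata13.hdp:8020/home/esten/ami/usaf.json (the path in the error)
print(qualify("file:///home/esten/ami/usaf.json"))
# -> file:///home/esten/ami/usaf.json (explicitly local)
```

So either upload the file to that HDFS location, or pass an explicit file:// URI; with a distributed cluster a local path must also exist on every node that reads it.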
Re: Spark 1.4 release date
Here is the Spark 1.4 release blog by Databricks: https://databricks.com/blog/2015/06/11/announcing-apache-spark-1-4.html Guru Medasani gdm...@gmail.com On Jun 12, 2015, at 7:08 AM, ayan guha guha.a...@gmail.com wrote: Thanks a lot. On 12 Jun 2015 19:46, Todd Nist tsind...@gmail.com wrote: It was released yesterday. On Friday, June 12, 2015, ayan guha guha.a...@gmail.com wrote: Hi, When is the official Spark 1.4 release date? Best Ayan
Re: Nightly builds/releases?
I see a JIRA for this one, but it's unresolved: https://issues.apache.org/jira/browse/SPARK-1517 On May 4, 2015, at 10:25 PM, Ankur Chauhan achau...@brightcove.com wrote: Hi, Does anyone know if Spark has any nightly builds, or an equivalent that provides binaries that have passed a CI build, so that one could try out the bleeding edge without having to compile? -- Ankur
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi Antony, Did you get past this error by repartitioning your job into smaller tasks, as Sven Krasser pointed out? From: Antony Mayi antonym...@yahoo.com Reply-To: Antony Mayi antonym...@yahoo.com Date: Tuesday, January 27, 2015 at 5:24 PM To: Guru Medasani gdm...@outlook.com, Sven Krasser kras...@gmail.com Cc: Sandy Ryza sandy.r...@cloudera.com, user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded I have YARN configured with yarn.nodemanager.vmem-check-enabled=false and yarn.nodemanager.pmem-check-enabled=false to avoid YARN killing the containers. The stack trace is below. thanks, Antony.
15/01/27 17:02:53 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
15/01/27 17:02:53 ERROR executor.Executor: Exception in task 21.0 in stage 12.0 (TID 1312)
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Integer.valueOf(Integer.java:642)
    at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
    at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
    at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
    at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
    at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:130)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:127)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:127)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
15/01/27 17:02:53 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.lang.Integer.valueOf(Integer.java:642)
    at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70)
    at scala.collection.mutable.ArrayOps$ofInt.apply(ArrayOps.scala:156)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:156)
    at scala.collection.SeqLike$class.distinct(SeqLike.scala:493)
    at scala.collection.mutable.ArrayOps$ofInt.distinct(ArrayOps.scala:156)
    at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$makeOutLinkBlock(ALS.scala:404)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:459)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$15.apply(ALS.scala:456)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:614)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Hi Antony, What is the setting of the total amount of memory in MB that can be allocated to containers on your NodeManagers (yarn.nodemanager.resource.memory-mb)? Can you check this configuration in the yarn-site.xml used by the NodeManager process? -Guru Medasani From: Sandy Ryza sandy.r...@cloudera.com Date: Tuesday, January 27, 2015 at 3:33 PM To: Antony Mayi antonym...@yahoo.com Cc: user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Hi Antony, If you look in the YARN NodeManager logs, do you see that it's killing the executors? Or are they crashing for a different reason? -Sandy On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, I am using spark.yarn.executor.memoryOverhead=8192, yet executors still crash with this error. Does that mean I genuinely don't have enough RAM, or is this a matter of config tuning? Other config options used: spark.storage.memoryFraction=0.3 SPARK_EXECUTOR_MEMORY=14G Running Spark 1.2.0 as yarn-client on a cluster of 10 nodes (the workload is ALS trainImplicit on a ~15GB dataset). thanks for any ideas, Antony.
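The container YARN must schedule for each executor is the executor heap plus the memory overhead, so it is worth checking that sum against yarn.nodemanager.resource.memory-mb. A quick sanity check using the numbers quoted in this thread (the node capacity below is a hypothetical value, since the actual setting is what was asked for):

```python
# Settings quoted in the thread
executor_memory_mb = 14 * 1024   # SPARK_EXECUTOR_MEMORY=14G
overhead_mb = 8192               # spark.yarn.executor.memoryOverhead=8192

# YARN schedules one container of heap + overhead per executor
container_request_mb = executor_memory_mb + overhead_mb
print(container_request_mb)  # 22528

# Hypothetical NodeManager capacity; if yarn.nodemanager.resource.memory-mb
# is below the request, YARN cannot schedule the executor container at all.
node_capacity_mb = 24 * 1024
assert container_request_mb <= node_capacity_mb
```

Note this only covers whether the container fits; a GC-overhead-limit error inside a container that does fit points at heap pressure in the tasks themselves, which is why repartitioning into smaller tasks also helps.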
Re: java.lang.OutOfMemoryError: GC overhead limit exceeded
Can you attach the logs where this is failing? From: Sven Krasser kras...@gmail.com Date: Tuesday, January 27, 2015 at 4:50 PM To: Guru Medasani gdm...@outlook.com Cc: Sandy Ryza sandy.r...@cloudera.com, Antony Mayi antonym...@yahoo.com, user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Since it's an executor running OOM, it doesn't look like a container being killed by YARN to me. As a starting point, can you repartition your job into smaller tasks? -Sven On Tue, Jan 27, 2015 at 2:34 PM, Guru Medasani gdm...@outlook.com wrote: Hi Antony, What is the setting of the total amount of memory in MB that can be allocated to containers on your NodeManagers (yarn.nodemanager.resource.memory-mb)? Can you check this configuration in the yarn-site.xml used by the NodeManager process? -Guru Medasani From: Sandy Ryza sandy.r...@cloudera.com Date: Tuesday, January 27, 2015 at 3:33 PM To: Antony Mayi antonym...@yahoo.com Cc: user@spark.apache.org Subject: Re: java.lang.OutOfMemoryError: GC overhead limit exceeded Hi Antony, If you look in the YARN NodeManager logs, do you see that it's killing the executors? Or are they crashing for a different reason? -Sandy On Tue, Jan 27, 2015 at 12:43 PM, Antony Mayi antonym...@yahoo.com.invalid wrote: Hi, I am using spark.yarn.executor.memoryOverhead=8192, yet executors still crash with this error. Does that mean I genuinely don't have enough RAM, or is this a matter of config tuning? Other config options used: spark.storage.memoryFraction=0.3 SPARK_EXECUTOR_MEMORY=14G Running Spark 1.2.0 as yarn-client on a cluster of 10 nodes (the workload is ALS trainImplicit on a ~15GB dataset). thanks for any ideas, Antony. -- http://sites.google.com/site/krasser/?utm_source=sig
RE: Spark Installation Maven PermGen OutOfMemoryException
Hi Vladimir, From the link Sean posted, if you use Java 8 there is the following note: "Note: For Java 8 and above this step is not required." So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:04:42 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: protsenk...@gmail.com CC: user@spark.apache.org You might try a little more. The official guidance suggests 2GB: https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage On Tue, Dec 23, 2014 at 2:57 PM, Vladimir Protsenko protsenk...@gmail.com wrote: I am installing Spark 1.2.0 on CentOS 6.6. I just downloaded the code from GitHub, installed Maven, and am trying to compile the system: git clone https://github.com/apache/spark.git git checkout v1.2.0 mvn -DskipTests clean package This leads to an OutOfMemoryException. What amount of memory does it require? export MAVEN_OPTS="-Xmx=1500m -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m" doesn't help. What is a straightforward way to start using Spark? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Installation-Maven-PermGen-OutOfMemoryException-tp20831.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: Spark Installation Maven PermGen OutOfMemoryException
Thanks for the clarification Sean. Best Regards, Guru Medasani From: so...@cloudera.com Date: Tue, 23 Dec 2014 15:39:59 + Subject: Re: Spark Installation Maven PermGen OutOfMemoryException To: gdm...@outlook.com CC: protsenk...@gmail.com; user@spark.apache.org The text there is actually unclear. In Java 8, you still need to set the max heap size (-Xmx2g). The optional bit is actually the -XX:MaxPermSize=512M; Java 8 no longer has a separate permanent generation. On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote: Hi Vladimir, From the link Sean posted, if you use Java 8 there is the following note: "Note: For Java 8 and above this step is not required." So if you have no problems using Java 8, give it a shot. Best Regards, Guru Medasani
Re: Programatically running of the Spark Jobs.
I am able to run Spark jobs and Spark Streaming jobs successfully via YARN on a CDH cluster. When you say YARN isn't quite there yet, do you mean for submitting the jobs programmatically, or just in general? On Sep 4, 2014, at 1:45 AM, Matt Chu m...@kabam.com wrote: https://github.com/spark-jobserver/spark-jobserver Ooyala's Spark Jobserver is the current de facto standard, IIUC. I just added it to our prototype stack and will begin trying it out soon. Note that you can only do standalone or Mesos; YARN isn't quite there yet. (The repo just moved from https://github.com/ooyala/spark-jobserver, so don't trust Google on this one (yet); development is happening in the first repo.) On Wed, Sep 3, 2014 at 11:39 PM, Vicky Kak vicky@gmail.com wrote: I have been able to submit Spark jobs using the submit script, but I would like to do it via code. I am unable to find anything matching my need. I am thinking of using org.apache.spark.deploy.SparkSubmit to do so; I may have to write some utility that passes the parameters required for this class. I would be interested to know how the community is doing this. Thanks, Vicky
Re: Spark-submit not running
Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing Hadoop. Here is an example of how to run a job locally:

# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

On Aug 28, 2014, at 7:54 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: How can I set HADOOP_HOME if I am running Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7, and Spark's official site says that it is available on Windows. I added the Java path to the PATH variable. Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 13:49 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly, but when I try to run the jar file using spark-submit, it gives some errors. I am running Spark locally and have downloaded a pre-built version of Spark named "For Hadoop 2 (HDP2, CDH5)". I don't know if it is a dependency problem, but I don't want to have Hadoop on my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark-submit not running
Thanks Sean. Looks like there is a workaround as per the JIRA (https://issues.apache.org/jira/browse/SPARK-2356): http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. Maybe that's worth a shot? On Aug 28, 2014, at 8:15 AM, Sean Owen so...@cloudera.com wrote: Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote: Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing Hadoop. Here is an example of how to run a job locally: # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar \ 100