From that, I'd guess that HDFS isn't set up between the nodes, or for some reason writes are defaulting to file:///path/ rather than hdfs:///path/
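A rough illustration of that second possibility, not Hadoop's actual code: a path with no scheme is resolved against the configured default filesystem (file:/// unless configured otherwise) and a working directory, which would be consistent with the file:/home/ubuntu/ipython/people.parquet2 shown in the error. The function name qualify, the host namenode:8020 and the directory /user/ubuntu are placeholders made up for this sketch.

```python
import posixpath
from urllib.parse import urlparse


def qualify(path, default_fs="file:///", working_dir="/home/ubuntu/ipython"):
    """Simplified model of resolving a path against a default filesystem.

    Hypothetical helper for illustration only; the real resolution happens
    inside Hadoop's filesystem layer and has more rules than this.
    """
    if urlparse(path).scheme:
        # An explicit scheme (hdfs://, file://, ...) is left untouched.
        return path
    # Relative paths are joined onto the working directory.
    absolute = posixpath.normpath(posixpath.join(working_dir, path))
    base = urlparse(default_fs)
    return "{}://{}{}".format(base.scheme, base.netloc, absolute)


# With the local default, the bare name lands on each node's own disk.
print(qualify("people.parquet"))
# With an HDFS default (placeholder host), the same name lands on shared storage.
print(qualify("people.parquet", "hdfs://namenode:8020", "/user/ubuntu"))
```

If that is what is happening, passing a full hdfs:// URI to both df.write.save and sqlContext.read.parquet would be one way to rule it out, assuming HDFS is actually running on the cluster.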
------ Original message------
From: Amila De Silva
Date: Thu, 3 Sep 2015 17:12
To: Ewan Leith;
Cc: user@spark.apache.org;
Subject: Re: Problem while loading saved data

Hi Ewan,

Yes, 'people.parquet' is from the first attempt and in that attempt it tried to save the same people.json.

It seems that the same folder is created on both the nodes and contents of the files are distributed between the two servers.

On the master node (this is the same node which runs IPython Notebook) this is what I have:

people.parquet
└── _SUCCESS

On the slave I get,

people.parquet
└── _temporary
    └── 0
        ├── task_201509030057_4699_m_000000
        │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        ├── task_201509030057_4699_m_000001
        │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        └── _temporary

I have zipped and attached both the folders.

On Thu, Sep 3, 2015 at 5:58 PM, Ewan Leith <ewan.le...@realitymine.com> wrote:

Your error log shows you attempting to read from 'people.parquet2' not ‘people.parquet’ as you’ve put below, is that just from a different attempt?

Otherwise, it’s an odd one! There aren’t _SUCCESS, _common_metadata and _metadata files under people.parquet that you’ve listed below, which would normally be created when the write completes, can you show us your write output?

Thanks,
Ewan

From: Amila De Silva [mailto:jaa...@gmail.com]
Sent: 03 September 2015 05:44
To: Guru Medasani <gdm...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Problem while loading saved data

Hi Guru,

Thanks for the reply. Yes, I checked if the file exists. But instead of a single file what I found was a directory having the following structure.
people.parquet
└── _temporary
    └── 0
        ├── task_201509030057_4699_m_000000
        │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        ├── task_201509030057_4699_m_000001
        │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        └── _temporary

On Thu, Sep 3, 2015 at 7:13 AM, Guru Medasani <gdm...@gmail.com> wrote:

Hi Amila,

Error says that the ‘people.parquet’ file does not exist. Can you manually check to see if that file exists?

Py4JJavaError: An error occurred while calling o53840.parquet.
: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.

Guru Medasani
gdm...@gmail.com

On Sep 2, 2015, at 8:25 PM, Amila De Silva <jaa...@gmail.com> wrote:

Hi All,

I have a two node spark cluster, to which I'm connecting using IPython notebook.
To see how data saving/loading works, I simply created a dataframe using people.json using the Code below;

    df = sqlContext.read.json("examples/src/main/resources/people.json")

Then called the following to save the dataframe as a parquet.
    df.write.save("people.parquet")

Tried loading the saved dataframe using;

    df2 = sqlContext.read.parquet('people.parquet');

But this simply fails giving the following exception

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-97-35f91873c48f> in <module>()
----> 1 df2 = sqlContext.read.parquet('people.parquet2');

/srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path)
    154         [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
    155         """
--> 156         return self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path)))
    157
    158     @since(1.4)

/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o53840.parquet.
: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
    at scala.Option.orElse(Option.scala:257)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165)
    at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506)
    at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505)
    at org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30)
    at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)
    at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:601)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:722)

I'm using spark-1.4.1-bin-hadoop2.6 with java 1.7.

Thanks
Amila
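For anyone comparing their own output against the two listings in this thread (only _SUCCESS on the master, part files left under _temporary on the slave), here is a small sketch for inspecting a saved directory on a node's local disk. It uses only the Python standard library, and the function name summarize_output_dir is made up for this example. It assumes the layout shown in the thread, with data files named part-*.

```python
import os


def summarize_output_dir(path):
    """Report which part files are at the top level and which are still under _temporary.

    Hypothetical helper for illustration; it only inspects the local
    filesystem of the node it runs on.
    """
    top_level = os.listdir(path)
    committed = sorted(name for name in top_level if name.startswith("part-"))

    pending = []
    # os.walk yields nothing if _temporary does not exist.
    for _root, _dirs, files in os.walk(os.path.join(path, "_temporary")):
        pending.extend(name for name in files if name.startswith("part-"))

    return {
        "has_success": "_SUCCESS" in top_level,
        "committed": committed,
        "pending": sorted(pending),
    }
```

Run on each node against /home/ubuntu/ipython/people.parquet, a result with has_success set but no committed files on one node, and only pending files on the other, would match what is described above and would fit the file:/// versus hdfs:/// guess.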