Your error log shows you attempting to read from 'people.parquet2', not 
'people.parquet' as you've put below. Is that just from a different attempt?

Otherwise, it's an odd one! There aren't _SUCCESS, _common_metadata and 
_metadata files under the people.parquet directory you've listed below, which 
would normally be created when the write completes. Can you show us your write 
output?
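
For reference, a quick way to dump what actually got written (just a sketch, 
assuming the output went to the local filesystem rather than HDFS):

import os

# Walk the saved output directory; after a completed write you'd expect
# _SUCCESS, _metadata, _common_metadata and part-r-*.parquet files at the
# top level, not just a _temporary subdirectory.
for root, dirs, files in os.walk("people.parquet"):
    for name in files:
        print(os.path.join(root, name))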


Thanks,
Ewan



From: Amila De Silva [mailto:jaa...@gmail.com]
Sent: 03 September 2015 05:44
To: Guru Medasani <gdm...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Problem while loading saved data

Hi Guru,

Thanks for the reply.

Yes, I checked whether the file exists. Instead of a single file, what I found 
was a directory with the following structure:

people.parquet
└── _temporary
    └── 0
        ├── task_201509030057_4699_m_000000
        │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        ├── task_201509030057_4699_m_000001
        │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        └── _temporary


On Thu, Sep 3, 2015 at 7:13 AM, Guru Medasani <gdm...@gmail.com> wrote:
Hi Amila,

Error says that the ‘people.parquet’ file does not exist. Can you manually 
check to see if that file exists?
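
For example, something like this (just a sketch; the path is lifted from the 
error message below, so adjust it if your working directory differs):

import os

# Check the path that appears in the stack trace.
path = "/home/ubuntu/ipython/people.parquet2"
print(os.path.exists(path))
print(os.listdir(path) if os.path.isdir(path) else "missing or not a directory")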


Py4JJavaError: An error occurred while calling o53840.parquet.

: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet 
data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.



Guru Medasani
gdm...@gmail.com



On Sep 2, 2015, at 8:25 PM, Amila De Silva <jaa...@gmail.com> wrote:

Hi All,

I have a two-node Spark cluster, to which I'm connecting through an IPython 
notebook. To see how saving/loading data works, I simply created a DataFrame 
from people.json using the code below:

df = sqlContext.read.json("examples/src/main/resources/people.json")

Then I called the following to save the DataFrame as parquet:
df.write.save("people.parquet")

I tried loading the saved DataFrame using:
df2 = sqlContext.read.parquet('people.parquet');
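
As far as I understand, write.save() defaults to the parquet data source, so 
the explicit equivalent should be something like this (a sketch, not something 
I've verified separately):

# Explicit equivalents of the calls above, assuming parquet is the
# default data source; mode("overwrite") replaces any earlier partial output.
df.write.mode("overwrite").parquet("people.parquet")
df2 = sqlContext.read.parquet("people.parquet")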

But the read simply fails with the following exception:


---------------------------------------------------------------------------

Py4JJavaError                             Traceback (most recent call last)

<ipython-input-97-35f91873c48f> in <module>()

----> 1 df2 = sqlContext.read.parquet('people.parquet2');



/srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path)

    154         [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]

    155         """

--> 156         return self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path)))

    157

    158     @since(1.4)



/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)

    536         answer = self.gateway_client.send_command(command)

    537         return_value = get_return_value(answer, self.gateway_client,

--> 538                 self.target_id, self.name)

    539

    540         for temp_arg in temp_args:



/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)

    298                 raise Py4JJavaError(

    299                     'An error occurred while calling {0}{1}{2}.\n'.

--> 300                     format(target_id, '.', name), value)

    301             else:

    302                 raise Py4JError(



Py4JJavaError: An error occurred while calling o53840.parquet.

: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet 
data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.

       at scala.Predef$.assert(Predef.scala:179)

       at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429)

       at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)

       at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)

       at scala.Option.orElse(Option.scala:257)

       at 
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)

       at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126)

       at 
org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124)

       at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)

       at 
org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)

       at scala.Option.getOrElse(Option.scala:120)

       at 
org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165)

       at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506)

       at 
org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505)

       at 
org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30)

       at 
org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)

       at 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)

       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

       at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

       at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

       at java.lang.reflect.Method.invoke(Method.java:601)

       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)

       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)

       at py4j.Gateway.invoke(Gateway.java:259)

       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

       at py4j.commands.CallCommand.execute(CallCommand.java:79)

       at py4j.GatewayConnection.run(GatewayConnection.java:207)

       at java.lang.Thread.run(Thread.java:722)


I'm using spark-1.4.1-bin-hadoop2.6 with Java 1.7.

Thanks,
Amila

