Re: Problem while loading saved data

Ewan Leith Thu, 03 Sep 2015 12:14:20 -0700

From that, I'd guess that HDFS isn't set up between the nodes, or for some 
reason writes are defaulting to file:///path/ rather than hdfs:///path/.
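
A minimal sketch of that distinction, using only the Python standard library. A path with no scheme is resolved against whatever the default filesystem is, so spelling the scheme out removes the ambiguity. The helper name `qualify`, the namenode address `hdfs://namenode:8020` and the base directory `/user/ubuntu` are all hypothetical and not taken from this thread.

```python
import posixpath
from urllib.parse import urlparse

def qualify(path, default_fs="hdfs://namenode:8020", base_dir="/user/ubuntu"):
    """Return `path` with an explicit filesystem scheme.

    default_fs and base_dir are hypothetical placeholders; substitute the
    values for your own cluster.
    """
    if urlparse(path).scheme:
        # Already explicit (for example file:/... or hdfs://...), keep as-is.
        return path
    # Relative paths are resolved against base_dir; absolute paths are kept.
    absolute = posixpath.join(base_dir, path)
    return default_fs.rstrip("/") + absolute

print(qualify("people.parquet"))
print(qualify("file:/home/ubuntu/ipython/people.parquet2"))
```

Under these assumptions, the same qualified string would be passed to both df.write.save(...) and sqlContext.read.parquet(...), so that the write and the read target the same shared filesystem.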





------ Original message------

From: Amila De Silva

Date: Thu, 3 Sep 2015 17:12

To: Ewan Leith;

Cc: [email protected];

Subject: Re: Problem while loading saved data


Hi Ewan,

Yes, 'people.parquet' is from the first attempt and in that attempt it tried to 
save the same people.json.

It seems that the same folder is created on both nodes and the contents of the 
files are distributed between the two servers.

On the master node (this is the same node which runs IPython Notebook) this is 
what I have:

people.parquet
└── _SUCCESS

On the slave I get:

people.parquet
└── _temporary
    └── 0
        ├── task_201509030057_4699_m_000000
        │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        ├── task_201509030057_4699_m_000001
        │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        └── _temporary

I have zipped and attached both the folders.
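
The two listings above can be told apart programmatically. This is a rough sketch, assuming the output directory is on the local disk of the node where it runs; the function name `write_status` and its return labels are hypothetical. It relies only on the file names that appear in this thread (`_SUCCESS`, `_temporary`, and data files starting with `part-`).

```python
import os

def write_status(output_dir):
    """Classify a local output directory by the marker files it contains."""
    if not os.path.isdir(output_dir):
        return "missing"
    entries = set(os.listdir(output_dir))
    has_parts = any(name.startswith("part-") for name in entries)
    if "_SUCCESS" in entries and has_parts:
        return "committed"            # marker and data files side by side
    if "_temporary" in entries:
        return "uncommitted"          # task output never moved out of _temporary
    if "_SUCCESS" in entries:
        return "success marker only"  # the data files are somewhere else
    return "unknown"

print(write_status("people.parquet"))
```

Run on each node, a check like this would be expected to report "success marker only" for the master listing and "uncommitted" for the slave listing, which is consistent with each node writing to its own local disk.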

On Thu, Sep 3, 2015 at 5:58 PM, Ewan Leith 
<[email protected]> wrote:
Your error log shows you attempting to read from 'people.parquet2', not 
'people.parquet' as you've put below. Is that just from a different attempt?

Otherwise, it's an odd one! The people.parquet listing below has no _SUCCESS, 
_common_metadata or _metadata files, which would normally be created when the 
write completes. Can you show us your write output?


Thanks,
Ewan



From: Amila De Silva [mailto:[email protected]]
Sent: 03 September 2015 05:44
To: Guru Medasani <[email protected]>
Cc: [email protected]
Subject: Re: Problem while loading saved data

Hi Guru,

Thanks for the reply.

Yes, I checked if the file exists. But instead of a single file what I found 
was a directory having the following structure.

people.parquet
└── _temporary
    └── 0
        ├── task_201509030057_4699_m_000000
        │   └── part-r-00000-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        ├── task_201509030057_4699_m_000001
        │   └── part-r-00001-b921ed54-53fa-459b-881c-cccde7f79320.gz.parquet
        └── _temporary


On Thu, Sep 3, 2015 at 7:13 AM, Guru Medasani 
<[email protected]> wrote:
Hi Amila,

The error says that the ‘people.parquet’ file does not exist. Can you manually 
check to see if that file exists?


Py4JJavaError: An error occurred while calling o53840.parquet.
: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.



Guru Medasani
[email protected]



On Sep 2, 2015, at 8:25 PM, Amila De Silva 
<[email protected]> wrote:

Hi All,

I have a two-node Spark cluster, to which I'm connecting using IPython Notebook.
To see how data saving/loading works, I simply created a dataframe from 
people.json using the code below:

df = sqlContext.read.json("examples/src/main/resources/people.json")

Then I called the following to save the dataframe as Parquet:
df.write.save("people.parquet")

I tried loading the saved dataframe using:
df2 = sqlContext.read.parquet('people.parquet');

But this simply fails, giving the following exception:


---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-97-35f91873c48f> in <module>()
----> 1 df2 = sqlContext.read.parquet('people.parquet2');

/srv/spark/python/pyspark/sql/readwriter.pyc in parquet(self, *path)
    154         [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')]
    155         """
--> 156         return self._df(self._jreader.parquet(_to_seq(self._sqlContext._sc, path)))
    157
    158     @since(1.4)

/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539
    540         for temp_arg in temp_args:

/srv/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o53840.parquet.
: java.lang.AssertionError: assertion failed: No schema defined, and no Parquet data file or summary file found under file:/home/ubuntu/ipython/people.parquet2.
       at scala.Predef$.assert(Predef.scala:179)
       at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:429)
       at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
       at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
       at scala.Option.orElse(Option.scala:257)
       at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
       at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126)
       at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124)
       at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
       at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
       at scala.Option.getOrElse(Option.scala:120)
       at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165)
       at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506)
       at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505)
       at org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30)
       at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)
       at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:601)
       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
       at py4j.Gateway.invoke(Gateway.java:259)
       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
       at py4j.commands.CallCommand.execute(CallCommand.java:79)
       at py4j.GatewayConnection.run(GatewayConnection.java:207)
       at java.lang.Thread.run(Thread.java:722)


I'm using spark-1.4.1-bin-hadoop2.6 with Java 1.7.

Thanks,
Amila
