I think we are hitting an old bug. I tried it with:

Hadoop 3.1.1, Hive 3.1.1, Spark 3.1.1

Create an ORC transactional table in Hive (PySpark):

CREATE TABLE IF NOT EXISTS test.randomDataDelta (
    ID INT,
    CLUSTERED INT,
    SCATTERED INT,
    RANDOMISED INT,
    RANDOM_STRING VARCHAR(50),
    SMALL_VC VARCHAR(50),
    PADDING VARCHAR(40)
)
STORED AS ORC
TBLPROPERTIES (
    "transactional" = "true",
    "orc.create.index" = "true",
    "orc.bloom.filter.columns" = "ID",
    "orc.bloom.filter.fpp" = "0.05",
    "orc.compress" = "SNAPPY",
    "orc.stripe.size" = "16777216",
    "orc.row.index.stride" = "10000"
)

Then populate it through Spark with random data. That works, and I can read
the table back through Spark (starting at ID = 218, ending at ID = 236):

Schema of delta table
root
 |-- ID: long (nullable = true)
 |-- CLUSTERED: double (nullable = true)
 |-- SCATTERED: double (nullable = true)
 |-- RANDOMISED: double (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)

+-----+-----+
|minID|maxID|
+-----+-----+
|    1|  236|
+-----+-----+

Finished at 14/06/2021 19:02:43.43

Now I am trying to read it in Hive:

0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta;
+----------------+--------------+----------+
|    col_name    |  data_type   | comment  |
+----------------+--------------+----------+
| id             | int          |          |
| clustered      | int          |          |
| scattered      | int          |          |
| randomised     | int          |          |
| random_string  | varchar(50)  |          |
| small_vc       | varchar(50)  |          |
| padding        | varchar(40)  |          |
+----------------+--------------+----------+
7 rows selected (0.169 seconds)

0: jdbc:hive2://rhes75:10099/default> select count(1) from test.randomDataDelta;
Error: Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask.
ORC split generation failed with exception: java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=08S01,code=1)

A Google search turned up the same error I raised three years ago:

https://user.hive.apache.narkive.com/Td3He6Vj/failed-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-mr-mapredtask-orc-split

So it has not been fixed yet!

HTH

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Mon, 14 Jun 2021 at 16:29, Suryansh Agnihotri <sagnihotri2...@gmail.com> wrote:

> No, this also does not work.
> Steps I followed:
>
> spark-sql:
> CREATE TABLE students (id int, name string, marks int) STORED AS ORC
> TBLPROPERTIES ('transactional' = 'true');
>
> hive-cli:
> Created a students_copy table, inserted some values into it, and ran
> "INSERT OVERWRITE TABLE students SELECT * FROM default.students_copy;"
> I am able to query both tables from hive-cli, but not from Spark (the
> table students is created using Spark).
>
> Thanks
>
> On Mon, 14 Jun 2021 at 20:07, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Ok, there were issues in the past with reading ORC tables through Spark.
>>
>> If the ORC table is created through Spark, I believe it will work.
>>
>> Do a test. Create the ORC table through Spark first.
>>
>> Then do an insert overwrite into that table through the Hive CLI from
>> your Hive-created ORC table, and see if you can access the data in the
>> new table through Spark.
>>
>> HTH
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> *Disclaimer:* Use it at your own risk.
>> Any and all responsibility for any loss, damage or destruction of data
>> or any other property which may arise from relying on this email's
>> technical content is explicitly disclaimed. The author will in no case
>> be liable for any monetary damages arising from such loss, damage or
>> destruction.
>>
>> On Mon, 14 Jun 2021 at 15:19, Suryansh Agnihotri <sagnihotri2...@gmail.com> wrote:
>>
>>> The table was created by Hive (hive-cli); the format is ORC. I am able
>>> to get data from hive-cli (Hive returns rows).
>>> But spark-sql/spark-shell does not return any rows.
>>>
>>> On Mon, 14 Jun 2021 at 19:26, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> How was the table created in the first place, Spark or Hive?
>>>>
>>>> Is this table an ORC table, and does Spark or Hive return rows?
>>>>
>>>> HTH
>>>>
>>>> On Mon, 14 Jun 2021 at 14:33, Suryansh Agnihotri <sagnihotri2...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>> Does Spark support querying Hive tables which are transactional?
>>>>> I am using Spark 3.0.2 / Hive metastore 3.1.2 and trying to query the
>>>>> table, but I am not able to see the data from the table, although *show
>>>>> tables* does list the table from the Hive metastore and desc table works
>>>>> fine, but *select * from table* gives an *empty result*.
>>>>> Does a later version of Spark have the fix, or is there another way
>>>>> to query?
>>>>> Thanks
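For what it's worth, the NoSuchMethodError quoted earlier names the
FileStatus.compareTo(FileStatus) signature, which only exists in newer
Hadoop releases (older Hadoop 2.x shipped compareTo(Object)), so a stale
hadoop-common jar somewhere on the Hive or Spark classpath is a plausible
culprit. A minimal diagnostic sketch (pure Python; the example directory
paths in the comment are assumptions about a typical install, not paths
from this thread):

```python
import glob
import os

def hadoop_common_jars(*lib_dirs):
    """Return every hadoop-common jar found under the given directories.

    Seeing more than one version in the combined list (e.g. a 2.x jar
    alongside a 3.x jar) is a classic source of NoSuchMethodError at
    runtime, because whichever jar loads first wins.
    """
    jars = []
    for d in lib_dirs:
        jars.extend(sorted(glob.glob(os.path.join(d, "hadoop-common-*.jar"))))
    return jars

# Example usage (paths are assumptions -- point these at your own
# $HIVE_HOME/lib and $SPARK_HOME/jars):
# for jar in hadoop_common_jars("/opt/hive/lib", "/opt/spark/jars"):
#     print(jar)
```

If the scan turns up a single, consistent hadoop-common version everywhere,
the mismatch is likely elsewhere on the classpath (e.g. an uber jar bundling
its own Hadoop classes).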
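The thread never shows the code that populated randomDataDelta, so as a
hedged illustration only, here is a pure-Python sketch of rows shaped like
that schema (the per-column generation logic is entirely an assumption);
rows like these would then be turned into a Spark DataFrame and written to
the table:

```python
import random
import string

def random_rows(start_id, end_id):
    """Generate dicts matching the randomDataDelta column layout.

    Column logic is illustrative guesswork, not the original generator.
    """
    rows = []
    for i in range(start_id, end_id + 1):
        rows.append({
            "ID": i,
            "CLUSTERED": (i - 1) / 10.0,            # assumed: slowly increasing
            "SCATTERED": float((i - 1) % 10),        # assumed: cycling values
            "RANDOMISED": float(random.randint(1, 100)),
            "RANDOM_STRING": "".join(random.choices(string.ascii_letters, k=50)),
            "SMALL_VC": str(i).rjust(50),            # ID right-justified to width 50
            "PADDING": "x" * 40,
        })
    return rows

rows = random_rows(218, 236)  # the ID range the thread reports reading back
```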