Re: Spark ACID compatibility

2021-06-14 Thread Mich Talebzadeh
I think we are hitting an old bug.

I tried it with:

Hadoop 3.1.1
Hive 3.1.1
Spark 3.1.1

First, create an ORC transactional table in Hive (through PySpark):

  CREATE TABLE IF NOT EXISTS test.randomDataDelta(
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(50)
   , PADDING VARCHAR(40)
  )
  STORED AS ORC
  TBLPROPERTIES (
    "transactional" = "true",
    "orc.create.index" = "true",
    "orc.bloom.filter.columns" = "ID",
    "orc.bloom.filter.fpp" = "0.05",
    "orc.compress" = "SNAPPY",
    "orc.stripe.size" = "16777216",
    "orc.row.index.stride" = "1"
  )


Then populate it through Spark with random data. That works, and I can read
the table back through Spark:

starting at ID =  218 ,ending on =  236
Schema of delta table
root
 |-- ID: long (nullable = true)
 |-- CLUSTERED: double (nullable = true)
 |-- SCATTERED: double (nullable = true)
 |-- RANDOMISED: double (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)

+-----+-----+
|minID|maxID|
+-----+-----+
|    1|  236|
+-----+-----+

Finished at
14/06/2021 19:02:43.43
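The population step can be sketched as below. This is a minimal sketch, not the original script: the per-column value choices and the `random_rows` helper name are my assumptions, matching only the schema printed above.

```python
import random
import string

def random_rows(start_id, end_id):
    """Generate rows matching the test.randomDataDelta schema (hypothetical helper)."""
    rows = []
    for i in range(start_id, end_id + 1):
        rows.append((
            i,                                                    # ID
            float(random.randint(1, 100)),                        # CLUSTERED
            float(random.randint(1, 100)),                        # SCATTERED
            float(random.randint(1, 100)),                        # RANDOMISED
            "".join(random.choices(string.ascii_letters, k=50)),  # RANDOM_STRING
            str(i).rjust(50),                                     # SMALL_VC
            "x" * 40,                                             # PADDING
        ))
    return rows

# With an active SparkSession and a matching schema object, the batch
# would be appended to the Hive table along these lines:
# spark.createDataFrame(random_rows(218, 236), schema) \
#      .write.mode("append").saveAsTable("test.randomDataDelta")
```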


Now I try to read the same table in Hive:

0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta;
+----------------+--------------+----------+
|    col_name    |  data_type   | comment  |
+----------------+--------------+----------+
| id             | int          |          |
| clustered      | int          |          |
| scattered      | int          |          |
| randomised     | int          |          |
| random_string  | varchar(50)  |          |
| small_vc       | varchar(50)  |          |
| padding        | varchar(40)  |          |
+----------------+--------------+----------+
7 rows selected (0.169 seconds)
0: jdbc:hive2://rhes75:10099/default>

select count(1) from test.randomDataDelta;

Error: Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split
generation failed with exception: java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=08S01,code=1)

A Google search turned up the same error, which I raised three years ago:

https://user.hive.apache.narkive.com/Td3He6Vj/failed-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-mr-mapredtask-orc-split

So it has not been fixed yet!

HTH







*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Spark ACID compatibility

2021-06-14 Thread Suryansh Agnihotri
No, this also does not work. Steps I followed:

spark-sql:
CREATE TABLE students (id int, name string, marks int) STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

hive-cli:
I created a students_copy table, inserted some values into it, and ran
"INSERT OVERWRITE TABLE students SELECT * FROM default.students_copy;".
I am able to query both tables from hive-cli but not from Spark (table
students was created using Spark).

Thanks



Re: Spark ACID compatibility

2021-06-14 Thread Mich Talebzadeh
Ok, there were issues in the past with ORC table reads through Spark.

If the ORC table is created through Spark, I believe it will work.

Do a test: create the ORC table through Spark first.

Then do an insert overwrite into that table through the Hive CLI from your
Hive-created ORC table, and see if you can access the data in the new table
through Spark.

HTH











Re: Spark ACID compatibility

2021-06-14 Thread Suryansh Agnihotri
The table was created by Hive (hive-cli); the format is ORC. I am able to
get data from hive-cli (Hive returns rows), but spark-sql/spark-shell does
not return any rows.



Re: Spark ACID compatibility

2021-06-14 Thread Mich Talebzadeh
How was the table created in the first place, through Spark or Hive?

Is this table an ORC table and does Spark or Hive return rows?

HTH









Spark ACID compatibility

2021-06-14 Thread Suryansh Agnihotri
Hi,
Does Spark support querying Hive tables which are transactional?
I am using Spark 3.0.2 / Hive metastore 3.1.2 and trying to query the
table, but I am not able to see the data in it: *show tables* does list
the table from the Hive metastore and *desc table* works fine, but
*select * from table* gives an *empty result*.
Does a later version of Spark have a fix, or is there another way to
query it?
Thanks
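One quick diagnostic, a sketch following the table name used later in the thread: as far as I know, Spark's built-in ORC reader does not understand Hive ACID delta/base directories, so a transactional table typically shows up in the catalog but returns no rows. Checking the table properties from hive-cli or beeline confirms whether ACID is the cause:

```sql
-- In hive-cli or beeline: a full ACID table lists transactional=true.
SHOW TBLPROPERTIES students;

-- ACID writes land in delta_* / base_* subdirectories under the table
-- location, which plain ORC readers skip (warehouse path is illustrative):
-- dfs -ls /user/hive/warehouse/students;
```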


Fwd: CRAN package SparkR

2021-06-14 Thread Felix Cheung
It looks like they would not allow caching the Spark distribution.

I’m not sure what can be done about this.

If I recall, the package should remove this during tests. Or maybe
spark.install() could be made optional (hence getting user confirmation?)


-- Forwarded message -
Date: Sun, Jun 13, 2021 at 10:19 PM
Subject: CRAN package SparkR
To: Felix Cheung 
CC: 


Dear maintainer,

Checking this apparently creates the default directory as per

#' @param localDir a local directory where Spark is installed. The
#'                 directory contains version-specific folders of Spark
#'                 packages. Default is path to the cache directory:
#' \itemize{
#'   \item Mac OS X: \file{~/Library/Caches/spark}
#'   \item Unix: \env{$XDG_CACHE_HOME} if defined,
#'         otherwise \file{~/.cache/spark}
#'   \item Windows: \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
#' }

However, the CRAN Policy says

  - Packages should not write in the user’s home filespace (including
clipboards), nor anywhere else on the file system apart from the R
session’s temporary directory (or during installation in the
location pointed to by TMPDIR: and such usage should be cleaned
up). Installing into the system’s R installation (e.g., scripts to
its bin directory) is not allowed.

Limited exceptions may be allowed in interactive sessions if the
package obtains confirmation from the user.

For R version 4.0 or later (hence a version dependency is required
or only conditional use is possible), packages may store
user-specific data, configuration and cache files in their
respective user directories obtained from tools::R_user_dir(),
provided that by default sizes are kept as small as possible and the
contents are actively managed (including removing outdated
material).

Can you please fix as necessary?

Please fix before 2021-06-28 to safely retain your package on CRAN.

Best
-k