Re: Reading Hive RCFiles?

2018-01-29 Thread Michael Segel
Just to follow up…

I was able to create an RDD from the file; however, diving into the RDD is a 
bit weird, and I’m working through it.  My test file seems to be one block … 3K 
rows. When I tried to get the first column of the first row, I ended up 
getting all of the rows for the first column, comma-delimited.   The other 
issue is converting numeric fields back from their byte encoding; I have the 
schema, so I can do that.  (This is also an issue with RCFileCat (sorry if I 
got that name wrong…): things work great only if you’re using strings. )
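
A rough sketch of that kind of per-column decoding, assuming the RDD values are
BytesRefArrayWritable (as returned by RCFileInputFormat), that the columns are
text-encoded, and with made-up column indices and types:

import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable

// Decode one row: column 0 as a String, column 1 as an Int (schema known up front).
// Each BytesRefWritable delimits its raw bytes via getData/getStart/getLength.
def decodeRow(row: BytesRefArrayWritable): (String, Int) = {
  def col(i: Int): String = {
    val ref = row.get(i)
    new String(ref.getData, ref.getStart, ref.getLength, "UTF-8")
  }
  (col(0), col(1).toInt)
}

// First column of the first row only, given rdd: RDD[(LongWritable, BytesRefArrayWritable)]:
// decodeRow(rdd.values.first())._1

A binary serde would obviously need different handling per type, which is where
having the schema comes in.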

I guess this could be the start of a project (time permitting) to make reading 
older file formats as easy as reading Parquet and ORC files.

I’ll have to follow up on the dev list.

Thanks everyone for the pointers.


On Jan 20, 2018, at 5:55 PM, Jörn Franke wrote:

Forgot to add the mailing list.

On 18. Jan 2018, at 18:55, Jörn Franke wrote:

Well, you can use:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopRDD-org.apache.hadoop.mapred.JobConf-java.lang.Class-java.lang.Class-java.lang.Class-int-

with the following InputFormat:
https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/RCFileInputFormat.html

(Note: the version of the Javadoc does not matter; this has been possible for a 
long time.)
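
A minimal sketch of wiring that up, assuming a SparkContext named sc and a
made-up input path:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.RCFileInputFormat
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

val conf = new JobConf(sc.hadoopConfiguration)
FileInputFormat.addInputPath(conf, new Path("/data/some_rcfile_dir"))  // placeholder path

// Each record is (key, one row's columns as raw byte buffers).
val rdd = sc.hadoopRDD(
  conf,
  classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
  classOf[LongWritable],
  classOf[BytesRefArrayWritable])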

Writing works similarly, with a PairRDD and RCFileOutputFormat.
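
And a similarly rough sketch of the write side, with a made-up two-column,
text-encoded layout and output path:

import org.apache.hadoop.hive.ql.io.RCFileOutputFormat
import org.apache.hadoop.hive.serde2.columnar.{BytesRefArrayWritable, BytesRefWritable}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.JobConf

val outConf = new JobConf(sc.hadoopConfiguration)
RCFileOutputFormat.setColumnNumber(outConf, 2)  // must match the columns per row

// Build (key, columns) pairs; RCFileOutputFormat ignores the key.
val rows = sc.parallelize(Seq(("a", 1), ("b", 2))).map { case (s, n) =>
  val row = new BytesRefArrayWritable(2)
  row.set(0, new BytesRefWritable(s.getBytes("UTF-8")))
  row.set(1, new BytesRefWritable(n.toString.getBytes("UTF-8")))
  (NullWritable.get(), row)
}

rows.saveAsHadoopFile(
  "/tmp/rcfile-out",  // placeholder output path
  classOf[NullWritable],
  classOf[BytesRefArrayWritable],
  classOf[RCFileOutputFormat],
  outConf)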

On Thu, Jan 18, 2018 at 5:02 PM, Michael Segel wrote:
No idea on how that last line of garbage got in the message.


> On Jan 18, 2018, at 9:32 AM, Michael Segel wrote:
>
> Hi,
>
> I’m trying to find out if there’s a simple way for Spark to be able to read 
> an RCFile.
>
> I know I can create a table in Hive, then drop the files into that directory 
> and use a SQL context to read the file from Hive; however, I wanted to read 
> the file directly.
>
> Not a lot of details to go on… even the Apache site’s links are broken.
> See :
> https://cwiki.apache.org/confluence/display/Hive/RCFile
>
> Then try to follow the Javadoc link.
>
>
> Any suggestions?
>
> Thx
>
> -Mike
>
>




Re: Reading Hive RCFiles?

2018-01-20 Thread Jörn Franke
Forgot to add the mailing list.

> On 18. Jan 2018, at 18:55, Jörn Franke wrote:
> 
> Well, you can use:
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopRDD-org.apache.hadoop.mapred.JobConf-java.lang.Class-java.lang.Class-java.lang.Class-int-
> 
> with the following InputFormat:
> https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/RCFileInputFormat.html
> 
> (Note: the version of the Javadoc does not matter; this has been possible for 
> a long time.)
> 
> Writing works similarly, with a PairRDD and RCFileOutputFormat.
> 
>> On Thu, Jan 18, 2018 at 5:02 PM, Michael Segel  
>> wrote:
>> No idea on how that last line of garbage got in the message.
>> 
>> 
>> > On Jan 18, 2018, at 9:32 AM, Michael Segel wrote:
>> >
>> > Hi,
>> >
>> > I’m trying to find out if there’s a simple way for Spark to be able to 
>> > read an RCFile.
>> >
>> > I know I can create a table in Hive, then drop the files into that 
>> > directory and use a SQL context to read the file from Hive; however, I 
>> > wanted to read the file directly.
>> >
>> > Not a lot of details to go on… even the Apache site’s links are broken.
>> > See :
>> > https://cwiki.apache.org/confluence/display/Hive/RCFile
>> >
>> > Then try to follow the Javadoc link.
>> >
>> >
>> > Any suggestions?
>> >
>> > Thx
>> >
>> > -Mike
>> >
>> >
> 


Re: Reading Hive RCFiles?

2018-01-20 Thread Prakash Joshi
If it's simply reading the files from a source in HDFS, then there is the
option of sc.hadoopFile in the Spark API.
Not sure if Spark SQL provides a direct method to read them.
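
For example, something along these lines (the path is a placeholder;
RCFileInputFormat's key/value classes are LongWritable and
BytesRefArrayWritable):

import org.apache.hadoop.hive.ql.io.RCFileInputFormat
import org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable
import org.apache.hadoop.io.LongWritable

// sc.hadoopFile builds the JobConf from the path for you.
val rdd = sc.hadoopFile(
  "hdfs:///data/some_rcfile_dir",  // placeholder path
  classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]],
  classOf[LongWritable],
  classOf[BytesRefArrayWritable])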



On Jan 18, 2018 9:32 PM, "Michael Segel"  wrote:

> No idea on how that last line of garbage got in the message.
>
>
> > On Jan 18, 2018, at 9:32 AM, Michael Segel wrote:
> >
> > Hi,
> >
> > I’m trying to find out if there’s a simple way for Spark to be able to
> > read an RCFile.
> >
> > I know I can create a table in Hive, then drop the files into that
> > directory and use a SQL context to read the file from Hive; however, I
> > wanted to read the file directly.
> >
> > Not a lot of details to go on… even the Apache site’s links are broken.
> > See :
> > https://cwiki.apache.org/confluence/display/Hive/RCFile
> >
> > Then try to follow the Javadoc link.
> >
> >
> > Any suggestions?
> >
> > Thx
> >
> > -Mike
> >
> >
>


Re: Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
No idea on how that last line of garbage got in the message. 


> On Jan 18, 2018, at 9:32 AM, Michael Segel  wrote:
> 
> Hi, 
> 
> I’m trying to find out if there’s a simple way for Spark to be able to read 
> an RCFile. 
> 
> I know I can create a table in Hive, then drop the files into that directory 
> and use a SQL context to read the file from Hive; however, I wanted to read 
> the file directly. 
> 
> Not a lot of details to go on… even the Apache site’s links are broken. 
> See :
> https://cwiki.apache.org/confluence/display/Hive/RCFile
> 
> Then try to follow the Javadoc link. 
> 
> 
> Any suggestions? 
> 
> Thx
> 
> -Mike
> 
> 


Reading Hive RCFiles?

2018-01-18 Thread Michael Segel
Hi, 

I’m trying to find out if there’s a simple way for Spark to be able to read an 
RCFile. 

I know I can create a table in Hive, then drop the files into that directory 
and use a SQL context to read the file from Hive; however, I wanted to read the 
file directly. 

Not a lot of details to go on… even the Apache site’s links are broken. 
See :
https://cwiki.apache.org/confluence/display/Hive/RCFile

Then try to follow the Javadoc link. 


Any suggestions? 

Thx

-Mike