Re: How to create DataFrame from a binary file?

2015-08-10 Thread Ted Yu
Umesh:
Please take a look at the classes under:
sql/core/src/main/scala/org/apache/spark/sql/parquet

FYI
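For reference, in Spark 1.4+ Parquet can also be read straight into a DataFrame without touching those internal classes. A minimal local-mode sketch (the class name and temp directory are illustrative; it writes a tiny DataFrame to Parquet and reads it back so the example is self-contained):

```java
import java.nio.file.Files;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ParquetReadSketch {
    // Writes a tiny DataFrame to Parquet, reads it back, and returns the row count.
    public static long roundTrip() throws Exception {
        // Local-mode context for illustration; in production this comes from spark-submit.
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("parquet-sketch").setMaster("local[1]"));
        try {
            SQLContext sqlContext = new SQLContext(sc);
            String dir = Files.createTempDirectory("parquet-demo").toString() + "/data";

            // Any DataFrame can be persisted as Parquet (range() is a 3-row [0,1,2] frame)...
            sqlContext.range(0, 3).write().parquet(dir);

            // ...and read back with the schema taken from the Parquet footer,
            // so no StructType has to be written by hand.
            DataFrame df = sqlContext.read().parquet(dir);
            df.printSchema();
            return df.count();
        } finally {
            sc.stop();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip()); // prints 3
    }
}
```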

On Mon, Aug 10, 2015 at 10:35 AM, Umesh Kacha umesh.ka...@gmail.com wrote:

 Hi Bo, thanks very much. Let me explain; please see the following code:

 JavaPairRDD<String, PortableDataStream> pairRdd =
     javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
 JavaRDD<PortableDataStream> javardd = pairRdd.values();

 DataFrame binDataFrame = sqlContext.createDataFrame(javardd,
     PortableDataStream.class);
 binDataFrame.show(); // shows just one row containing the file path
                      // "/hdfs/path/to/binfile"

 I want the binary data from the above file in a DataFrame so that I can do
 analytics on it directly. My data is binary, so I can't use a StructType
 with primitive data types, right, since everything is binary/bytes? My
 custom binary data format is similar to Parquet, but I did not find any good
 example of where/how Parquet is read into a DataFrame. Please guide.
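A note on the StructType concern above: Spark SQL's column types are not limited to primitives; there is a `BinaryType` that carries raw `byte[]`. Below is a minimal local-mode sketch (assuming Spark 1.4's Java API; the class name, field names, and temp-file setup are invented for illustration) that maps each `PortableDataStream` to a Row holding the file path and its bytes:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class BinaryFilesToDataFrame {
    // Reads every file under dir into a (path, bytes) DataFrame.
    public static DataFrame binaryDf(SQLContext sqlContext, JavaSparkContext sc, String dir) {
        // binaryFiles yields one (path, PortableDataStream) pair per file;
        // toArray() loads the whole file, so this suits files that fit in memory.
        JavaRDD<Row> rows = sc.binaryFiles(dir)
                .map(t -> RowFactory.create(t._1(), t._2().toArray()));

        // BinaryType is the raw-bytes column type; no primitive coercion needed.
        StructType schema = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("path", DataTypes.StringType, false),
                DataTypes.createStructField("content", DataTypes.BinaryType, false)));

        return sqlContext.createDataFrame(rows, schema);
    }

    public static void main(String[] args) throws Exception {
        // Local-mode context for illustration; in production this comes from spark-submit.
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("bin-df").setMaster("local[1]"));
        try {
            Path dir = Files.createTempDirectory("bin-demo");
            Files.write(dir.resolve("a.bin"), new byte[]{1, 2, 3});

            DataFrame df = binaryDf(new SQLContext(sc), sc, dir.toString());
            df.printSchema(); // path: string, content: binary
        } finally {
            sc.stop();
        }
    }
}
```

From there, analytics can run over the `content` column with UDFs that decode the custom format.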





 On Sun, Aug 9, 2015 at 11:52 PM, bo yang bobyan...@gmail.com wrote:

 Well, my post uses a raw-text JSON file to show how to create a data frame
 with a custom data schema. The key idea is to show the flexibility to deal
 with any data format by using your own schema. Sorry if I did not make
 that fully clear.

 Anyway, let us know once you figure out your problem.




 On Sun, Aug 9, 2015 at 11:10 AM, Umesh Kacha umesh.ka...@gmail.com
 wrote:

 Hi Bo, I know how to create a DataFrame; my question is how to create a
 DataFrame from binary files, and in your blog it is raw-text JSON files.
 Please read my question properly. Thanks.

 On Sun, Aug 9, 2015 at 11:21 PM, bo yang bobyan...@gmail.com wrote:

 You can create your own data schema (StructType in Spark) and use the
 following method to create a data frame with your own schema:

 sqlContext.createDataFrame(yourRDD, structType);
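A minimal sketch of that pattern with the Spark 1.4-era Java API (the class name, field names, and sample values are invented for illustration; real code would build the Row RDD by parsing its own format):

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CustomSchemaSketch {
    // Builds a DataFrame from hand-made rows and an explicit schema.
    public static DataFrame build(SQLContext sqlContext, JavaSparkContext sc) {
        // Hand-built rows stand in for whatever parsing your own format needs.
        JavaRDD<Row> rows = sc.parallelize(Arrays.asList(
                RowFactory.create("a.bin", 1024L),
                RowFactory.create("b.bin", 2048L)));

        // The schema is declared explicitly rather than inferred from a bean class.
        StructType structType = DataTypes.createStructType(Arrays.asList(
                DataTypes.createStructField("name", DataTypes.StringType, false),
                DataTypes.createStructField("size", DataTypes.LongType, false)));

        return sqlContext.createDataFrame(rows, structType);
    }

    public static void main(String[] args) {
        // Local-mode context for illustration only.
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("custom-schema").setMaster("local[1]"));
        try {
            build(new SQLContext(sc), sc).show();
        } finally {
            sc.stop();
        }
    }
}
```

Because the schema is explicit, each row just has to supply values in the declared column order and types.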

 I wrote a post on how to do it. You can also get the sample code there:

 Light-Weight Self-Service Data Query through Spark SQL:

 https://www.linkedin.com/pulse/light-weight-self-service-data-query-through-spark-sql-bo-yang

 Take a look and feel free to let me know if you have any questions.

 Best,
 Bo



 On Sat, Aug 8, 2015 at 1:42 PM, unk1102 umesh.ka...@gmail.com wrote:

 Hi, how do we create a DataFrame from a binary file stored in HDFS? I was
 thinking of using:

 JavaPairRDD<String, PortableDataStream> pairRdd =
     javaSparkContext.binaryFiles("/hdfs/path/to/binfile");
 JavaRDD<PortableDataStream> javardd = pairRdd.values();

 I can see that PortableDataStream has a method called toArray which can
 convert it into a byte array. I was thinking, if I have a JavaRDD<byte[]>,
 can I call the following and get a DataFrame?

 DataFrame binDataFrame =
     sqlContext.createDataFrame(javaBinRdd, Byte.class);

 Please guide me; I am new to Spark. I have my own custom binary format,
 and I was thinking that if I can convert my custom format into a DataFrame
 using binary operations, then I don't need to create my own custom Hadoop
 format. Am I on the right track? Will reading binary data into a DataFrame
 scale?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-create-DataFrame-from-a-binary-file-tp24179.html
 Sent from the Apache Spark User List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org