Re: How to apply schema to queried data from Hive before saving it as parquet file?
Thanks for replying. I was unable to figure out how, after using jsonFile/jsonRDD, to load the data into a Hive table. I was able to save the SchemaRDD I got via hiveContext.sql(...).saveAsParquetFile(path), i.e. save the SchemaRDD as a Parquet file, but when I tried to fetch the data back from the Parquet file (below) and save it to a text file, the generated text files were filled with weird values like org.apache.spark.sql.api.java.Row@e26c01c7:

    JavaSchemaRDD parquetfilerdd = sqlCtx.parquetFile("path/to/parquet/File");
    parquetfilerdd.registerTempTable("pq");
    JavaSchemaRDD writetxt = sqlCtx.sql("SELECT * FROM pq");
    // This step created text files filled with values like org.apache.spark.sql.api.java.Row@e26c01c7
    writetxt.saveAsTextFile("path/to/text/files");

I know there must be a way to do this right; I just haven't been able to figure it out. Could you please help?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-apply-schema-to-queried-data-from-Hive-before-saving-it-as-parquet-file-tp19259p19338.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: How to apply schema to queried data from Hive before saving it as parquet file?
Sorry about the confusion I created; I just started learning this week. Silly me, I was actually writing the Row objects' default toString() to a text file and expecting records. This is what I was supposed to do instead. Also, if you could let me know about adding the data from the jsonFile/jsonRDD methods of hiveContext to Hive tables, it would be appreciated.

    JavaRDD<String> result = writetxt.map(new Function<Row, String>() {
        public String call(Row row) {
            String temp = "";
            temp += row.getInt(0) + " ";
            temp += row.getString(1) + " ";
            temp += row.getInt(2);
            return temp;
        }
    });
    result.saveAsTextFile("pqtotxt");

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-apply-schema-to-queried-data-from-Hive-before-saving-it-as-parquet-file-tp19259p19343.html
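As an aside, the mapper above hard-codes three columns and their positions. A small helper that joins any number of already-extracted field values keeps such mappers short when the table layout changes. This is only a sketch of the idea; RowFormat and joinFields are made-up names, not part of the Spark API:

```java
import java.util.StringJoiner;

public class RowFormat {
    // Joins an arbitrary number of column values into one delimited line;
    // nulls become empty strings so a missing field doesn't print "null".
    public static String joinFields(Object[] values, String sep) {
        StringJoiner out = new StringJoiner(sep);
        for (Object v : values) {
            out.add(v == null ? "" : String.valueOf(v));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinFields(new Object[]{1, "alice", 2.5}, " ")); // 1 alice 2.5
    }
}
```

Inside call(Row) this would be invoked as joinFields(new Object[]{row.getInt(0), row.getString(1), row.getInt(2)}, " ").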
Building Spark for Hive: "The requested profile hadoop-1.2 could not be activated because it does not exist."
I am using Apache Hadoop 1.2.1 and wanted to use Spark SQL with Hive, so I tried to build Spark like so:

    mvn -Phive,hadoop-1.2 -Dhadoop.version=1.2.1 clean -DskipTests package

But I get the following error:

    The requested profile "hadoop-1.2" could not be activated because it does not exist.

Is there some way to handle this, or do I have to downgrade Hadoop? Any help is appreciated.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-for-Hive-The-requested-profile-hadoop-1-2-could-not-be-activated-because-it-does-not--tp19063.html
Re: Building Spark for Hive: "The requested profile hadoop-1.2 could not be activated because it does not exist."
Oops, I guess this is the right way to do it (there is no hadoop-1.2 profile; the Hadoop version goes in -Dhadoop.version):

    mvn -Phive -Dhadoop.version=1.2.1 clean -DskipTests package

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-for-Hive-The-requested-profile-hadoop-1-2-could-not-be-activated-because-it-does-not--tp19063p19068.html
Query from two or more tables in Spark SQL. I have done this; is there any simpler solution?
As of now, my approach is to fetch all the data from tables located in different databases into separate RDDs, then take a union of them and query the union. I want to know whether I can perform the query directly while creating the RDD, i.e. instead of creating two RDDs, fire a query over both tables through a single JdbcRDD. Other than the above, any alternate methods are also welcome. The way of doing it below is complex and involves fetching all the data; I want to reduce the time taken. Any help is appreciated.

    import java.io.Serializable;
    import java.io.*;
    import java.sql.*;
    import java.util.*;

    import scala.*;
    import scala.runtime.AbstractFunction0;
    import scala.runtime.AbstractFunction1;
    import scala.runtime.*;
    import scala.collection.mutable.LinkedHashMap;

    import org.apache.spark.api.java.*;
    import org.apache.spark.api.java.function.*;
    import org.apache.spark.rdd.*;
    import org.apache.spark.sql.api.java.JavaSQLContext;
    import org.apache.spark.sql.api.java.JavaSchemaRDD;
    import org.apache.spark.sql.api.java.Row;
    import org.apache.spark.sql.api.java.DataType;
    import org.apache.spark.sql.api.java.StructType;
    import org.apache.spark.sql.api.java.StructField;
    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;

    import org.postgresql.Driver;

    public class Spark_Postgresql {

        public static class tableSchema implements Serializable {
            private String ID;
            private String INTENSITY;

            public String getID() { return ID; }
            public void setID(String iD) { ID = iD; }
            public String getINTENSITY() { return INTENSITY; }
            public void setINTENSITY(String iNTENSITY) { INTENSITY = iNTENSITY; }
        }

        // Connection factory for the first database
        static class Z extends AbstractFunction0<java.sql.Connection> implements Serializable {
            java.sql.Connection con;
            static String URL = "jdbc:postgresql://localhost:5432/postgres?user=postgres&password=postgres";

            public java.sql.Connection apply() {
                try {
                    con = DriverManager.getConnection(URL);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return con;
            }

            // public void change(String DB)
            // {
            //     URL = "jdbc:postgresql://" + DB + "?user=postgres&password=postgres";
            // }
        }

        // Connection factory for the second database
        static class Z2 extends AbstractFunction0<java.sql.Connection> implements Serializable {
            java.sql.Connection con;
            static String URL = "jdbc:postgresql://localhost:5432/postgres1?user=postgres&password=postgres";

            public java.sql.Connection apply() {
                try {
                    con = DriverManager.getConnection(URL);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return con;
            }
        }

        // Maps each ResultSet row to a space-separated string, using the column
        // metadata to pick the right getter for each column's SQL type
        static public class Z1 extends AbstractFunction1<ResultSet, String> implements Serializable {
            public String apply(ResultSet i) {
                String ret = "";
                Integer colcnt = 0;
                try {
                    ResultSetMetaData meta = i.getMetaData();
                    colcnt = meta.getColumnCount();
                    Integer k = 1, type;
                    while (k <= colcnt) {
                        type = meta.getColumnType(k);
                        if (k == 1) {
                            if ((type == Types.VARCHAR) || (type == Types.CHAR)) ret += i.getString(k);
                            else if (type == Types.FLOAT) ret += String.valueOf(i.getFloat(k));
                            else if (type == Types.INTEGER) ret += String.valueOf(i.getInt(k));
                            else if (type == Types.DOUBLE) ret += String.valueOf(i.getDouble(k));
                        } else {
                            if ((type == Types.VARCHAR) || (type == Types.CHAR)) ret += " " + i.getString(k);
                            else if (type == Types.FLOAT) ret += " " + String.valueOf(i.getFloat(k));
                            else if ((type == Types.INTEGER) || (type == Types.BIGINT)) ret += " " + String.valueOf(i.getInt(k));
                            else if (type == Types.DOUBLE) ret += " " + String.valueOf(i.getDouble(k));
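The per-type if/else ladder above can usually be collapsed: JDBC's getObject returns each column in its natural Java type, and String.valueOf handles all of them. Below is a self-contained sketch of that idea; rowToString is a made-up helper, and the fakeRow dynamic-proxy scaffolding is invented here purely so the example runs without a live database:

```java
import java.lang.reflect.Proxy;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;

public class RowToString {
    // Converts the current row of a ResultSet to a space-separated string.
    // getObject + String.valueOf covers VARCHAR, CHAR, INTEGER, BIGINT,
    // FLOAT and DOUBLE alike, replacing the per-type if/else ladder.
    public static String rowToString(ResultSet rs) {
        try {
            ResultSetMetaData meta = rs.getMetaData();
            StringBuilder sb = new StringBuilder();
            for (int k = 1; k <= meta.getColumnCount(); k++) {
                if (k > 1) sb.append(' ');
                sb.append(String.valueOf(rs.getObject(k)));
            }
            return sb.toString();
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    // A minimal in-memory stand-in for one ResultSet row, built with a dynamic
    // proxy so the example runs without a database (test scaffolding only).
    public static ResultSet fakeRow(Object... cols) {
        ResultSetMetaData meta = (ResultSetMetaData) Proxy.newProxyInstance(
            RowToString.class.getClassLoader(),
            new Class<?>[]{ResultSetMetaData.class},
            (proxy, method, args) -> {
                if (method.getName().equals("getColumnCount")) return cols.length;
                throw new UnsupportedOperationException(method.getName());
            });
        return (ResultSet) Proxy.newProxyInstance(
            RowToString.class.getClassLoader(),
            new Class<?>[]{ResultSet.class},
            (proxy, method, args) -> {
                if (method.getName().equals("getMetaData")) return meta;
                if (method.getName().equals("getObject")) return cols[(Integer) args[0] - 1];
                throw new UnsupportedOperationException(method.getName());
            });
    }

    public static void main(String[] args) {
        System.out.println(rowToString(fakeRow(7, "sensor", 3.5))); // 7 sensor 3.5
    }
}
```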
Combining data from two tables in two databases (PostgreSQL, JdbcRDD).
I want to be able to perform a query over two tables that are in different databases, and I want to know whether it can be done. I've heard about taking the union of two RDDs, but here I want to connect to something like different partitions of a table. Any help is appreciated.

    import java.io.Serializable;
    import java.io.*;
    import java.sql.*;
    import java.util.*;

    //import org.junit.*;
    //import static org.junit.Assert.*;

    import scala.*;
    import scala.runtime.AbstractFunction0;
    import scala.runtime.AbstractFunction1;
    import scala.runtime.*;
    import scala.collection.mutable.LinkedHashMap;
    //import static scala.collection.Map.Projection;

    import org.apache.spark.api.java.*;
    import org.apache.spark.api.java.function.*;
    import org.apache.spark.rdd.*;
    import org.apache.spark.sql.api.java.JavaSQLContext;
    import org.apache.spark.sql.api.java.JavaSchemaRDD;
    import org.apache.spark.sql.api.java.Row;
    import org.apache.spark.sql.api.java.DataType;
    import org.apache.spark.sql.api.java.StructType;
    import org.apache.spark.sql.api.java.StructField;
    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;

    import com.mysql.jdbc.Driver;
    import com.mysql.jdbc.*;

    public class Spark_Mysql {

        // Connection factory passed to JdbcRDD
        static class Z extends AbstractFunction0<java.sql.Connection> implements Serializable {
            java.sql.Connection con;

            public java.sql.Connection apply() {
                try {
                    con = DriverManager.getConnection("jdbc:mysql://localhost:3306/azkaban?user=azkaban&password=password");
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return con;
            }
        }

        // Maps each ResultSet row to the integer in its first column
        static public class Z1 extends AbstractFunction1<ResultSet, Integer> implements Serializable {
            int ret;

            public Integer apply(ResultSet i) {
                try {
                    ret = i.getInt(1);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return ret;
            }
        }

        public static void main(String[] args) throws Exception {
            String arr[] = new String[1];
            arr[0] = "/home/hduser/Documents/Credentials/Newest_Credentials_AX/spark-1.1.0-bin-hadoop1/lib/mysql-connector-java-5.1.33-bin.jar";

            JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("JavaSparkSQL").setJars(arr));
            SparkContext sctx = new SparkContext(new SparkConf().setAppName("JavaSparkSQL").setJars(arr));
            JavaSQLContext sqlCtx = new JavaSQLContext(ctx);

            try {
                Class.forName("com.mysql.jdbc.Driver");
            } catch (Exception ex) {
                ex.printStackTrace();
                System.exit(1);
            }

            JdbcRDD rdd = new JdbcRDD(sctx, new Z(), "SELECT * FROM spark WHERE ? <= id AND id <= ?",
                    0L, 1000L, 10, new Z1(), scala.reflect.ClassTag$.MODULE$.AnyRef());
            rdd.saveAsTextFile("hdfs://127.0.0.1:9000/user/hduser/mysqlrdd");
            rdd.saveAsTextFile("/home/hduser/mysqlrdd");
        }
    }

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Combining-data-from-two-tables-in-two-databases-postgresql-JdbcRDD-tp18597.html
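For reference, the two ? placeholders in the JdbcRDD query string are filled in per partition: JdbcRDD splits the [lowerBound, upperBound] key range into numPartitions sub-ranges and binds each pair of bounds into the query. The arithmetic below is my own re-derivation of that split, not Spark's actual source, so treat it as a sketch and check org.apache.spark.rdd.JdbcRDD for the authoritative behavior:

```java
public class JdbcBounds {
    // Sketch of how JdbcRDD derives the (?, ?) bound pair for each partition
    // from (lowerBound, upperBound, numPartitions). The range is inclusive on
    // both ends, matching a "WHERE ? <= id AND id <= ?" query.
    public static long[][] partitionBounds(long lower, long upper, int numPartitions) {
        long length = 1 + upper - lower;            // number of keys in the inclusive range
        long[][] bounds = new long[numPartitions][2];
        for (int i = 0; i < numPartitions; i++) {
            bounds[i][0] = lower + (i * length) / numPartitions;           // bound for the first ?
            bounds[i][1] = lower + ((i + 1) * length) / numPartitions - 1; // bound for the second ?
        }
        return bounds;
    }

    public static void main(String[] args) {
        // With 0L, 1000L, 10 as in the post, partition 0 covers 0..99 and
        // partition 9 covers 900..1000.
        for (long[] b : partitionBounds(0L, 1000L, 10)) {
            System.out.println(b[0] + " .. " + b[1]);
        }
    }
}
```

This is also why the query must contain exactly two ? placeholders, with the lower bound first.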
Mysql retrieval and storage using JdbcRDD
So far I have tried this, and I am able to compile it successfully. There isn't enough documentation on Spark's usage with databases. I am using AbstractFunction0 and AbstractFunction1 here, but I am unable to access the database: the jar just runs without doing anything when submitted. I want to know how it is supposed to be done and what I have done wrong here. Any help is appreciated.

    import java.io.Serializable;
    import java.io.*;
    import java.sql.*;
    import java.util.*;

    import scala.*;
    import scala.runtime.AbstractFunction0;
    import scala.runtime.AbstractFunction1;
    import scala.runtime.*;
    import scala.collection.mutable.LinkedHashMap;

    import org.apache.spark.api.java.*;
    import org.apache.spark.api.java.function.*;
    import org.apache.spark.rdd.*;
    import org.apache.spark.sql.api.java.JavaSQLContext;
    import org.apache.spark.sql.api.java.JavaSchemaRDD;
    import org.apache.spark.sql.api.java.Row;
    import org.apache.spark.sql.api.java.DataType;
    import org.apache.spark.sql.api.java.StructType;
    import org.apache.spark.sql.api.java.StructField;
    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;

    public class Spark_Mysql {

        @SuppressWarnings("serial")
        static class Z extends AbstractFunction0<Connection> {
            Connection con;

            public Connection apply() {
                try {
                    Class.forName("com.mysql.jdbc.Driver");
                    con = DriverManager.getConnection("jdbc:mysql://localhost:3306/?user=azkaban&password=password");
                } catch (Exception e) {
                    e.printStackTrace(); // surface errors instead of swallowing them
                }
                return con;
            }
        }

        static public class Z1 extends AbstractFunction1<ResultSet, Integer> {
            int ret;

            public Integer apply(ResultSet i) {
                try {
                    ret = i.getInt(1);
                } catch (Exception e) {
                    e.printStackTrace(); // surface errors instead of swallowing them
                }
                return ret;
            }
        }

        @SuppressWarnings("serial")
        public static void main(String[] args) throws Exception {
            String arr[] = new String[1];
            arr[0] = "/home/hduser/Documents/Credentials/Newest_Credentials_AX/spark-1.1.0-bin-hadoop1/lib/mysql-connector-java-5.1.33-bin.jar";

            JavaSparkContext ctx = new JavaSparkContext(new SparkConf().setAppName("JavaSparkSQL").setJars(arr));
            SparkContext sctx = new SparkContext(new SparkConf().setAppName("JavaSparkSQL").setJars(arr));
            JavaSQLContext sqlCtx = new JavaSQLContext(ctx);

            try {
                Class.forName("com.mysql.jdbc.Driver");
            } catch (Exception ex) {
                ex.printStackTrace();
                System.exit(1);
            }

            Connection zconn = DriverManager.getConnection("jdbc:mysql://localhost:3306/?user=azkaban&password=password");

            JdbcRDD rdd = new JdbcRDD(sctx, new Z(), "SELECT * FROM spark WHERE ? <= id AND id <= ?",
                    0L, 1000L, 10, new Z1(), scala.reflect.ClassTag$.MODULE$.AnyRef());
            rdd.saveAsTextFile("hdfs://127.0.0.1:9000/user/hduser/mysqlrdd");
            rdd.saveAsTextFile("/home/hduser/mysqlrdd");
        }
    }

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mysql-retrieval-and-storage-using-JdbcRDD-tp18479.html