Re: run reduceByKey on huge data in spark

2015-06-30 Thread barge.nilesh
"I 'm using 50 servers , 35 executors per server, 140GB memory per server"

35 executors *per server* sounds kind of odd to me.

With 35 executors per server and 140 GB per server, each executor is going to get
only about 4 GB. That 4 GB is further divided into the shuffle and storage memory
fractions; assuming the default storage memory fraction of 0.6, that leaves roughly
2.4 GB of working space per executor, so if any partition (key group) exceeds
about 2.4 GB there will be an OOM...

Maybe you can try fewer executors per server/node...
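
For illustration, a minimal sketch of that direction, assuming Spark 1.x's legacy
memory settings (the app name, executor memory value, and executor count are made
up for the example, not taken from the original post):

// Hypothetical sketch: a few large executors per 140 GB node instead of 35 small ones.
// With, say, 5 executors per node, each executor can be given ~26 GB, so a single
// partition (key group) has far more room before the storage fraction is exhausted.
SparkConf sparkConf = new SparkConf()
        .setAppName("ReduceByKeyJob")                  // hypothetical app name
        .set("spark.executor.memory", "26g")           // larger heap per executor
        .set("spark.storage.memoryFraction", "0.6");   // legacy default discussed above
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

How the executors are actually spread across the nodes depends on the cluster
manager (e.g. the number of workers per node in standalone mode), so treat this
only as the per-executor memory side of the change.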






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/run-reduceByKey-on-huge-data-in-spark-tp23546p23555.html



Re: Spark SQL and Hive interoperability

2015-05-09 Thread barge.nilesh
Hi, try your first method, but create an EXTERNAL table in Hive, like:

hive -e "CREATE EXTERNAL TABLE people (name STRING, age INT) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '\t';"
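
If it helps, a minimal sketch of reading that external table back from Spark SQL
(assuming the pre-1.3 JavaHiveContext API used in the other threads in this
archive; sparkContext here is an existing JavaSparkContext):

// Query the Hive external table through Spark SQL.
JavaHiveContext hiveContext = new JavaHiveContext(sparkContext);
JavaSchemaRDD peopleRows = hiveContext.sql("SELECT name, age FROM people");
System.out.println(peopleRows.collect());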



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-Hive-interoperability-tp22690p22828.html



Re: Schema change on Spark Hive (Parquet file format) table not working

2014-10-07 Thread barge.nilesh
To find the root cause, I installed Hive 0.12 separately and tried the exact same
test through the Hive CLI, and it *passed*. So it looks like it is a problem with
Spark SQL.

Has anybody else faced this issue (Hive Parquet table schema change)?
Should I create a JIRA ticket for this?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Schema-change-on-Spark-Hive-Parquet-file-format-table-not-working-tp15360p15851.html



Re: timestamp not implemented yet

2014-10-01 Thread barge.nilesh
The Parquet format is comparatively better for analytic loads; it has performance
and compression benefits for large analytic workloads.
A workaround could be to use a long datatype to store the epoch timestamp value.
If you already have existing Parquet files (Impala tables), then you may need
to consider doing some migration.
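
A minimal sketch of that workaround (the table and column names here are made up;
the SerDe and input/output format classes are the same ones used in the Parquet
threads elsewhere in this archive):

// Store the timestamp as a BIGINT (epoch millis) instead of TIMESTAMP,
// since Hive 0.12's Parquet support has no timestamp type.
hiveContext.sql("CREATE TABLE events (id INT, event_time BIGINT) "
        + "ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' "
        + "STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' "
        + "OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'");
// Write epoch values (e.g. System.currentTimeMillis()) into event_time,
// and convert back on read with new java.sql.Timestamp(eventTimeMillis).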




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-tp15414p15571.html



Re: timestamp not implemented yet

2014-09-30 Thread barge.nilesh
Spark 1.1 comes with Hive 0.12, and Hive 0.12 does not support the timestamp
datatype for the Parquet format.

https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Limitations
  





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/timestamp-not-implemented-yet-tp15414p15417.html



Re: Schema change on Spark Hive (Parquet file format) table not working

2014-09-30 Thread barge.nilesh
code snippet in short:

hiveContext.sql("*CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name
String, age INT) ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'*"); 
hiveContext.sql("*INSERT INTO TABLE people_table SELECT name, age FROM
temp_table_people1*"); 
hiveContext.sql("*SELECT * FROM people_table*"); ///Here, data read was
successful./ 
hiveContext.sql("*ALTER TABLE people_table ADD COLUMNS (gender STRING)*");
hiveContext.sql("*SELECT * FROM people_table*"); ///Not able to read
existing data and ArrayIndexOutOfBoundsException is thrown./



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Schema-change-on-Spark-Hive-Parquet-file-format-table-not-working-tp15360p15415.html



Schema change on Spark Hive (Parquet file format) table not working

2014-09-29 Thread barge.nilesh
I am using the following releases:
Spark 1.1 (built using sbt/sbt -Dhadoop.version=2.2.0 -Phive assembly),
Apache HDFS 2.2

My job is able to create, add to, and read data in Hive Parquet-formatted tables
using HiveContext.
But after changing the schema, the job is not able to read existing data and
throws the following exception:
java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.hadoop.hive.ql.io.parquet.serde.ArrayWritableObjectInspector.getStructFieldData(ArrayWritableObjectInspector.java:127)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:284)
    at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:278)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
    at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:774)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
    at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)


Please find the code snippet below:

public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf()
            .setAppName("SchemaChangeTest")
            .set("spark.cores.max", "16")
            .set("spark.executor.memory", "8g");
    JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    JavaHiveContext hiveContext = new JavaHiveContext(sparkContext);

    List<String> people1List = new ArrayList<String>();
    people1List.add("Michael,30");
    people1List.add("William,31");
    JavaRDD<String> people1RDD = sparkContext.parallelize(people1List);

    // String-encoded schema #1
    String schema1String = "name STRING,age INT";
    // Generate the schema based on the schema string
    StructType people1Schema = getSchema(schema1String);
    // Convert records of the RDD (people) to Rows
    JavaRDD<Row> people1RowRDD = people1RDD.map(new Function<String, Row>() {
        public Row call(String record) throws Exception {
            String[] fields = record.split(",");
            return Row.create(fields[0], Integer.parseInt(fields[1].trim()));
        }
    });
    // Apply schema & register as a temporary table
    hiveContext.applySchema(people1RowRDD, people1Schema)
            .registerTempTable("temp_table_people1");
    // Create people table
    hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS people_table (name String, age INT) "
            + "ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' "
            + "STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat' "
            + "OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'");
    // Add new data
    hiveContext.sql("INSERT INTO TABLE people_table SELECT name, age FROM temp_table_people1");
    // Fetch rows and print
    JavaSchemaRDD people1TableRows = hiveContext.sql("SELECT * FROM people_table");
    logger.info(people1TableRows.collect());

    // Up to this point everything is fine: the job creates the new table,
    // adds data into the table, and is able to read from the table.

    //
    // Change schema
    //
    hiveContext.sql("ALTER TABLE people_table ADD COLUMNS (gender STRING)");

    List<String> people2List = new ArrayList<String>();
    people2List.add("David,32,M");
    people2Li
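
The getSchema helper called in the snippet is not shown in the message. A minimal
sketch of such a helper, assuming the Spark 1.1 org.apache.spark.sql.api.java API
and only STRING/INT field types (an illustration, not the original author's code):

private static StructType getSchema(String schemaString) {
    // Build a StructType from a comma-separated "name TYPE" list, e.g. "name STRING,age INT".
    List<StructField> fields = new ArrayList<StructField>();
    for (String fieldSpec : schemaString.split(",")) {
        String[] parts = fieldSpec.trim().split("\\s+");
        DataType fieldType = parts[1].equalsIgnoreCase("INT")
                ? DataType.IntegerType
                : DataType.StringType;
        fields.add(DataType.createStructField(parts[0], fieldType, true));
    }
    return DataType.createStructType(fields);
}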